Hi all,

I recall our initial discussions regarding OpenLineage integration where we
agreed on an event-first approach, as this aligns best with Polaris's
positioning. Other aspects of OpenLineage seem more suited to the query
engine level.

The proposal to treat Polaris as a proxy is an interesting approach,
particularly for the potential to augment events with additional
information.
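
To make the augmentation idea concrete, here is a minimal sketch of what
catalog-side enrichment could look like. It is only an illustration, not a
proposed design: the facet name `polaris_catalog`, the `lookup_table`
callback, and the sample catalog values are all hypothetical, and the event
is a plain dict shaped like an OpenLineage run event rather than anything
Polaris defines today.

```python
import copy


def enrich_event(event, lookup_table):
    """Return a copy of an OpenLineage-style run event in which every
    input/output dataset known to the catalog carries an extra facet."""
    enriched = copy.deepcopy(event)  # never mutate the engine's event
    for side in ("inputs", "outputs"):
        for dataset in enriched.get(side, []):
            info = lookup_table(dataset["namespace"], dataset["name"])
            if info is None:
                continue  # unknown to the catalog: pass through untouched
            facets = dataset.setdefault("facets", {})
            # Hypothetical facet name; a real OpenLineage facet would also
            # carry _producer and _schemaURL fields.
            facets["polaris_catalog"] = dict(info)
    return enriched


# Toy catalog standing in for Polaris metadata (hypothetical values).
CATALOG = {("prod", "db.orders"): {"tableUuid": "example-uuid", "snapshotId": 42}}

event = {
    "eventType": "COMPLETE",
    "job": {"namespace": "spark", "name": "daily_orders"},
    "run": {"runId": "example-run"},
    "inputs": [{"namespace": "prod", "name": "db.orders"}],
    "outputs": [{"namespace": "prod", "name": "db.orders_agg"}],
}

enriched = enrich_event(event, lambda ns, name: CATALOG.get((ns, name)))
```

In this sketch the input dataset gains the catalog facet, while the output
dataset, which the toy catalog does not know, is forwarded unchanged.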

Regards,
JB

On Tue, Apr 21, 2026 at 7:13 PM Adnan Hemani via dev <[email protected]>
wrote:

> Hi all,
>
> Great thoughts!
>
> I-Ting, you're exactly right. I will share the proposal with you via
> Google Drive (it still needs a bit of work, but it's about 90% complete
> right now). I'd love to get your feedback and work on the doc together!
>
> Yufei, I 100% agree with your feedback as well. The proposal's current
> structure highlights a few advantages:
> 1. Centralized auth for the OL server: Marquez, the reference OL
> implementation, does not have an auth system. Polaris can provide
> significant value here immediately.
> 2. OL, in general, has many concepts, like Jobs/Runs/etc. I think it is
> likely that some customers do not need that level of lineage tracking and
> would rather not deal with the additional maintenance of yet another web
> server. Polaris can provide a "lite" version of data lineage OOTB so that
> users don't need to install a full-functionality OL server if they don't
> require it. But Polaris should be able to support a pass-through model to a
> full-functionality OL server in case users still want to use Polaris as the
> "single pane of glass".
> 3. Although the proposal does not explicitly state catalog-enhanced data
> enrichment for the OL data that Polaris could send downstream (to focus the
> scope of the initial changes), this proposal would help us unlock this
> functionality. This is something I discussed with Michael Collado earlier
> :)
>
> As next steps, I will work with I-Ting to introduce this proposal - perhaps
> later this week!
>
> Best,
> Adnan Hemani
>
> On Tue, Apr 21, 2026 at 9:45 AM Yufei Gu <[email protected]> wrote:
>
> > Agreed: engines provide more accurate lineage information.
> >
> > I think there are two viable architectures here:
> >
> >    - Engines send OpenLineage events directly to a lineage backend (e.g.
> >    Marquez)
> >    - Engines send events to Polaris, which acts as an ingestion layer
> >    (and optionally forwards downstream)
> >
> > The first option is the standard and simplest model. It avoids extra hops
> > and works well when lineage collection is the only goal. The second
> > option becomes interesting only if Polaris adds clear control-plane value,
> > for example:
> >
> >    - central auth and policy for lineage ingestion
> >    - catalog-aware enrichment (stable table identity, namespace,
> >    snapshots, federation context)
> >    - decoupling engines from the downstream lineage backend
> >
> > If Polaris is only acting as a pass-through, the extra layer likely does
> > not justify the added complexity. However, if we position Polaris as a
> > catalog-aware lineage gateway, the second model aligns well with the
> > “single pane of glass” direction. I think this is a direction the
> > Polaris community can pursue. My inclination is to focus Polaris on
> > ingestion, normalization, and forwarding, and avoid expanding into a full
> > lineage storage or analytics system.
> >
> > Curious to hear others’ thoughts.
> >
> > Yufei
> >
> >
> > On Tue, Apr 21, 2026 at 7:52 AM ITing Lee <[email protected]> wrote:
> >
> > > Hi Adnan Hemani,
> > >
> > > Yes, I agree with your point of view. After thinking it through, I see
> > > that my POC may not capture complete data lineage.
> > >
> > > Additionally, with your direction, we can decouple Spark from
> > > OpenLineage and make Polaris the only interface that Spark interacts with.
> > >
> > > It is my pleasure to collaborate on this project. I would like to make
> > > sure we’re aligned before proceeding, so please correct me if I’m wrong.
> > >
> > > Does this mean that we will implement the OL REST API in Polaris?
> > >
> > > For example, suppose there is an OpenLineage server running on port 5000
> > > and a Polaris server running on port 8000. The user needs to do two
> > > things: first, install `openlineage-spark-*.jar` [1], which stays as-is
> > > for the user; second, set the host to the Polaris server on port 8000.
> > >
> > > If my understanding is correct, could you share the proposal you have
> > > drafted so that we do not step on each other's work?
> > >
> > > Thanks!
> > >
> > > [1]
> > > https://openlineage.io/docs/integrations/spark/developing/built_in_lineage/
> > >
> > > Best regards,
> > >
> > > ITing
> > >
> > >
> > > On Tue, Apr 21, 2026 at 10:49 AM Adnan Hemani via dev
> > > <[email protected]> wrote:
> > >
> > > > Hi ITing,
> > > >
> > > > I'm glad you are looking into this. I was also informally floating a
> > > > proposal regarding OpenLineage support with some community members
> > > > during Iceberg Summit earlier this month. Specifically, the proposal
> > > > was to expose Polaris as an OpenLineage (OL) server implementation
> > > > (with exposed OL APIs) that could either retain OL information
> > > > internally (as a "lite" offering through the Polaris Persistence
> > > > layer) or forward it downstream to an OL server like Marquez for a
> > > > richer feature set (up to admin preference). All compute engines would
> > > > then call Polaris' OpenLineage APIs (through their respective OL
> > > > connectors) to keep Polaris as a single pane of glass for metadata. I
> > > > was working on that proposal (albeit slowly :') ), but I'm happy to
> > > > merge proposals as co-authors if you are open to it!
> > > >
> > > > I read through the proposal and skimmed through the POC code; I don't
> > > > think this is the correct approach to solve this problem. The base
> > > > assumption your code makes is that all previously loaded tables within
> > > > a Spark application contributed to the creation of a new table. This
> > > > heuristic breaks in simple cases, such as when many tables are read
> > > > within a single Spark application but only a handful of them are used
> > > > to generate a new table. You've correctly pointed out the flaw in
> > > > Question 10.2: Polaris cannot determine what populated a table based on
> > > > IRC calls alone. This is why I aimed to propose the idea above: have
> > > > the engines send this information directly to Polaris. This mirrors
> > > > how other OL servers rely on the engine's OL connectors to create this
> > > > information, as the engine is the sole source of truth regarding the
> > > > user's actual intent. Let me know your thoughts.
> > > >
> > > > Best,
> > > > Adnan Hemani
> > > >
> > > > On Sun, Apr 19, 2026 at 9:56 PM Jean-Baptiste Onofré <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi ITing,
> > > > >
> > > > > Thanks for your message!
> > > > >
> > > > > For context, we discussed OpenLineage support a while ago,
> > > > > specifically regarding the initial support of OpenLineage events.
> > > > >
> > > > > I will take a look at your proposal and design document, and I am
> > > > > sure Mike will as well.
> > > > >
> > > > > Regards,
> > > > > JB
> > > > >
> > > > > On Mon, Apr 20, 2026 at 3:08 AM ITing Lee <[email protected]> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I’ve been working on OpenLineage support in Polaris and wanted to
> > > > > > ask for feedback on the current implementation approach and where
> > > > > > it should go next.
> > > > > >
> > > > > > The current direction is to emit OpenLineage events from
> > > > > > Polaris-managed operations and publish them through an
> > > > > > event-listener-based design. Before moving further, I’d like to
> > > > > > get suggestions on whether this implementation model makes sense
> > > > > > and what should be adjusted early.
> > > > > >
> > > > > > Below is the link for the proposal and PoC branch.
> > > > > >
> > > > > > - Proposal:
> > > > > > https://docs.google.com/document/d/1NEPzcMIcbxKEvBIWd6SJGllRIu4brjV-23xO5Qp3DlM/edit?tab=t.0
> > > > > > - PoC branch:
> > > > > > https://github.com/iting0321/polaris/tree/data-lineage
> > > > > >
> > > > > > I’d appreciate feedback on the implementation direction and any
> > > > > > suggestions before continuing further.
> > > > > >
> > > > > > Best regards,
> > > > > > ITing
> > > > > >
> > > > >
> > > >
> > >
> >
>
