Re: [DISCUSS] Add OpenLineage to Apache Polaris

ITing Lee Tue, 21 Apr 2026 07:52:57 -0700

Hi Adnan Hemani,

Yes, I agree with your point. After thinking it through, I see that the
approach in my POC could fail to capture complete data lineage.


Additionally, following your direction, we can decouple Spark from the
OpenLineage server and make Polaris the only interface that Spark
interacts with.

I would be glad to collaborate on this project. I’d like to make sure
we’re aligned before proceeding, so please correct me if I’m wrong.

Does this mean that we will implement the OL REST API in Polaris?

For example, suppose there is an OpenLineage server running on port 5000
and a Polaris server running on port 8000. The user would need to do two
things: first, install `openlineage-spark-*.jar` [1], which stays as-is for
the user; second, set the OpenLineage host to the Polaris server on port
8000.
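
To make the example concrete, here is a minimal sketch of the Spark-side
settings it would imply. It assumes Polaris exposes an OpenLineage-compatible
HTTP endpoint on port 8000, which is the idea under discussion and not an
existing feature. The property keys are the ones used by the OpenLineage
Spark integration; the helper names, the URL, and the namespace value are
hypothetical.

```python
# Assumption: Polaris would accept OpenLineage events over HTTP on port 8000.
# That is the proposal being discussed here, not something Polaris does today.
POLARIS_OL_URL = "http://localhost:8000"


def spark_openlineage_conf(url: str, namespace: str) -> dict[str, str]:
    """Spark properties that point the stock OpenLineage listener at `url`."""
    return {
        # Listener class shipped in openlineage-spark-*.jar (unchanged for the user).
        "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
        # Send events over HTTP to the given server.
        "spark.openlineage.transport.type": "http",
        "spark.openlineage.transport.url": url,
        # Namespace under which the Spark jobs are reported.
        "spark.openlineage.namespace": namespace,
    }


def to_spark_submit_args(conf: dict[str, str]) -> list[str]:
    """Render the properties as `--conf key=value` pairs for spark-submit."""
    args: list[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # "spark_jobs" is a placeholder namespace chosen for this sketch.
    conf = spark_openlineage_conf(POLARIS_OL_URL, namespace="spark_jobs")
    print(" ".join(to_spark_submit_args(conf)))
```

With this setup, the only change from today would be the transport URL: it
would point at Polaris (8000) instead of the OpenLineage server (5000).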

If my understanding is correct, could you share the proposal you have
drafted so far, so that we do not step on each other's toes?

Thanks!

[1]
https://openlineage.io/docs/integrations/spark/developing/built_in_lineage/

Best regards,

ITing


Adnan Hemani via dev <[email protected]> wrote on Tue, Apr 21, 2026 at 10:49 AM:

> Hi ITing,
>
> I'm glad you are looking into this - I was also informally/verbally
> floating a proposal with some community members during Iceberg Summit
> earlier this month regarding OpenLineage support. Specifically, the
> proposal was to open Polaris as an OpenLineage (OL) server implementation
> (with exposed OL APIs) that could either retain OL information internally
> (as a "lite" version offering through the Polaris Persistence layer) or
> reflect it downstream to an OL server like Marquez for a richer feature set
> (up to admin preference). All compute engines would then call Polaris'
> OpenLineage APIs (through their respective OL connectors) to keep Polaris
> as a single pane of glass for metadata. I was working on that proposal
> (albeit slowly :') ), but I'm happy to merge proposals as co-authors if you
> are open to it!
>
> I read through the proposal and skimmed through the POC code; I don't think
> this is the correct approach to solve this problem. The base assumption
> your code makes is that all previously loaded tables within a Spark
> application contributed to the creation of a new table. This is a very
> simplistic heuristic that easily breaks in simple cases, such as when many
> tables are read within a single Spark application but only a handful of
> them are used to generate a new table. You've correctly pointed out the
> flaw in Question 10.2: Polaris cannot determine what populated a table
> based on IRC calls alone. This is why I aimed to propose the idea above -
> have the engines send this information directly to Polaris. This mirrors
> how any other OL servers rely on the engine's OL connectors to create this
> information, as the engine is the sole source of truth regarding the user's
> actual intent. Let me know your thoughts.
>
> Best,
> Adnan Hemani
>
> On Sun, Apr 19, 2026 at 9:56 PM Jean-Baptiste Onofré <[email protected]>
> wrote:
>
> > Hi ITing,
> >
> > Thanks for your message!
> >
> > For context, we discussed OpenLineage support a while ago, specifically
> > regarding the initial support of OpenLineage events.
> >
> > I will take a look at your proposal and design document, and I am sure Mike
> > will as well.
> >
> > Regards,
> > JB
> >
> > On Mon, Apr 20, 2026 at 3:08 AM ITing Lee <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > I’ve been working on OpenLineage support in Polaris and wanted to ask for
> > > feedback on the current implementation approach and where it should go
> > > next.
> > >
> > > The current direction is to emit OpenLineage events from Polaris-managed
> > > operations and publish them through an event-listener-based design. Before
> > > moving further, I’d like to get suggestions on whether this implementation
> > > model makes sense and what should be adjusted early.
> > >
> > > Below are the links to the proposal and the PoC branch.
> > >
> > > - Proposal:
> > > https://docs.google.com/document/d/1NEPzcMIcbxKEvBIWd6SJGllRIu4brjV-23xO5Qp3DlM/edit?tab=t.0
> > > - PoC branch: https://github.com/iting0321/polaris/tree/data-lineage
> > >
> > > I’d appreciate feedback on the implementation direction and any suggestions
> > > before continuing further.
> > >
> > > Best regards,
> > > ITing
> > >
> >
>
