Hi ITing,

I'm glad you are looking into this - I was also informally floating a proposal for OpenLineage support with some community members during Iceberg Summit earlier this month. Specifically, the proposal was to expose Polaris as an OpenLineage (OL) server implementation (with exposed OL APIs) that could either retain OL information internally (as a "lite" offering through the Polaris Persistence layer) or forward it downstream to an OL server like Marquez for a richer feature set (up to admin preference). All compute engines would then call Polaris' OpenLineage APIs (through their respective OL connectors), keeping Polaris as a single pane of glass for metadata. I was working on that proposal (albeit slowly :') ), but I'm happy to merge proposals as co-authors if you are open to it!
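
To make the idea concrete, here is a minimal Python sketch. It is not Polaris code: `LineageEndpoint`, `make_run_event`, and the "retain"/"forward" modes are hypothetical names for illustration only. The event is a trimmed-down dict using the core OpenLineage run event fields (eventType, eventTime, run, job, inputs, outputs), and the forwarder is a plain callable standing in for a downstream OL server such as Marquez.

```python
def make_run_event(job_name, inputs, outputs):
    """Build a trimmed-down OpenLineage-style run event (illustrative values)."""
    return {
        "eventType": "COMPLETE",
        "eventTime": "2026-04-20T00:00:00Z",
        "run": {"runId": "00000000-0000-0000-0000-000000000001"},
        "job": {"namespace": "spark", "name": job_name},
        # The engine declares exactly which datasets it read and wrote.
        "inputs": [{"namespace": "polaris", "name": n} for n in inputs],
        "outputs": [{"namespace": "polaris", "name": n} for n in outputs],
    }


class LineageEndpoint:
    """Hypothetical receiver: keeps events locally or hands them downstream."""

    def __init__(self, mode, forward=None):
        if mode not in ("retain", "forward"):
            raise ValueError("mode must be 'retain' or 'forward'")
        if mode == "forward" and forward is None:
            raise ValueError("forward mode needs a forwarder callable")
        self.mode = mode
        self.forward = forward
        self.stored = []  # stand-in for a persistence layer

    def receive(self, event):
        if self.mode == "retain":
            self.stored.append(event)
        else:
            self.forward(event)

    def upstream_of(self, table):
        """Inputs reported by engines for runs that wrote `table`."""
        return sorted({
            i["name"]
            for e in self.stored
            for o in e["outputs"] if o["name"] == table
            for i in e["inputs"]
        })


endpoint = LineageEndpoint("retain")
endpoint.receive(make_run_event("build_report", ["db.orders"], ["db.report"]))
print(endpoint.upstream_of("db.report"))
```

The point of the sketch is only that the inputs come from the engine's own event rather than being inferred from catalog traffic; how Polaris would actually store or forward events is left open.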

I read through the proposal and skimmed the PoC code, and I don't think this is the right approach to the problem. The base assumption your code makes is that all previously loaded tables within a Spark application contributed to the creation of a new table. This heuristic is very simplistic and breaks in simple cases, such as when many tables are read within a single Spark application but only a handful of them are used to generate a new table. You've correctly pointed out the flaw in Question 10.2: Polaris cannot determine what populated a table based on Iceberg REST Catalog (IRC) calls alone. This is why I proposed the idea above: have the engines send this information directly to Polaris. This mirrors how other OL servers rely on the engines' OL connectors to produce this information, since the engine is the sole source of truth regarding the user's actual intent.

Let me know your thoughts.

Best,
Adnan Hemani

On Sun, Apr 19, 2026 at 9:56 PM Jean-Baptiste Onofré <[email protected]> wrote:

> Hi ITing,
>
> Thanks for your message!
>
> For context, we discussed OpenLineage support a while ago, specifically
> regarding the initial support of OpenLineage events.
>
> I will take a look at your proposal and design document, and I am sure Mike
> will as well.
>
> Regards,
> JB
>
> On Mon, Apr 20, 2026 at 3:08 AM ITing Lee <[email protected]> wrote:
>
> > Hi all,
> >
> > I’ve been working on OpenLineage support in Polaris and wanted to ask for
> > feedback on the current implementation approach and where it should go
> > next.
> >
> > The current direction is to emit OpenLineage events from Polaris-managed
> > operations and publish them through an event-listener-based design. Before
> > moving further, I’d like to get suggestions on whether this implementation
> > model makes sense and what should be adjusted early.
> >
> > Below is the link for the proposal and PoC branch.
> >
> > - Proposal:
> >   https://docs.google.com/document/d/1NEPzcMIcbxKEvBIWd6SJGllRIu4brjV-23xO5Qp3DlM/edit?tab=t.0
> > - PoC branch: https://github.com/iting0321/polaris/tree/data-lineage
> >
> > I’d appreciate feedback on the implementation direction and any suggestions
> > before continuing further.
> >
> > Best regards,
> > ITing
