Re: [DISCUSS] Add OpenLineage to Apache Polaris

Adnan Hemani Mon, 27 Apr 2026 03:15:58 -0700

Hi all,

As promised, here's the proposal document that some of us have been working
on throughout this last week:
https://docs.google.com/document/d/1iOzIuFW66SFL2wZOADD9knMTG21OwY7VmaWVSvMUqQk/edit?usp=sharing


Michael, please do add your thoughts on the doc regarding what information
would be helpful for Polaris to pass through to downstream OL servers. I
believe that only strengthens the use case presented here, and we should
tackle it sooner rather than later if we do start implementing this proposal.

Please do feel free to add comments directly to the document. Looking
forward to everyone's thoughts :)

Best,
Adnan Hemani

On Thu, Apr 23, 2026 at 10:49 PM Michael Collado <[email protected]>
wrote:

> FYI, I had some discussions with the OpenLineage folks last summer. The
> option I had proposed was to treat Polaris as an OpenLineage proxy so that
> Polaris had the opportunity to augment the DatasetFacet with more specific
> information about the dataset (e.g., real FQN, UUID, snapshot id, etc.).
> The feedback I got from them was that they were actually interested in
> treating Polaris as a potential source for Dataset information without
> necessarily treating Polaris as a proxy for all OL events. I thought this
> was a fair way of looking at it: not only would Polaris not need to handle
> every single OL payload (the payload can be quite large, and even a single
> Spark application can publish many events), but we could also help drive a
> standard for other catalogs to follow.
>
> Just perusing the current design doc, I think there are some complexities
> that aren't addressed. For example, I see that table modifications publish
> events, but what about loadTable events? To publish lineage information, we
> need to know what tables are read and what tables are written, but
> loadTable is ambiguous - we don't know from the catalog whether a loadTable
> call is being made because that table is about to be written to or if it is
> being read. And if we're emitting dataset information, which snapshot do we
> emit in the dataset facet? If the table is an output, we should emit the
> snapshot id being added, but if it is being read, we should emit the
> snapshot id at the loadTable call. But, again, since loadTable is
> ambiguous, we don't know if we should emit or not.
>
> On the other hand, something like a
> /{catalog}/lineage/datasets/{namespace}/{table} API that the OL client can
> invoke, and that simply returns DatasetFacets to augment the dataset
> information in the OL payload, allows the engine, which actually knows
> what's happening, to decide where to fit that information in the OL payload
> and where to place the dataset in the OL run information.
>
> Mike
>
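A minimal sketch of how an OL client might use such an endpoint, assuming it existed. The path template is the one proposed above; the helper names, the `polaris_table` facet name, and the rule that engine-supplied facets win over catalog-supplied ones are all assumptions made for illustration, not part of any existing Polaris or OpenLineage API.

```python
from urllib.parse import quote


def facets_path(catalog: str, namespace: str, table: str) -> str:
    """Build the proposed (hypothetical) lineage path for one table."""
    return "/{}/lineage/datasets/{}/{}".format(
        quote(catalog, safe=""),
        quote(namespace, safe=""),
        quote(table, safe=""),
    )


def augment_dataset(dataset: dict, catalog_facets: dict) -> dict:
    """Return a copy of an OL dataset with catalog facets merged in.

    Assumption: facets already set by the engine take precedence, since
    the engine is the party that knows how the dataset is being used.
    """
    merged = dict(catalog_facets)
    merged.update(dataset.get("facets", {}))
    return {**dataset, "facets": merged}
```

The engine stays in control here: it decides whether the dataset is an input or an output and only asks the catalog for extra identity information.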
> On Thu, Apr 23, 2026 at 2:12 PM ITing Lee <[email protected]> wrote:
>
> > Hi Adnan Hemani,
> >
> > I totally agree with you on these 2 questions. If I have further
> > questions, I will comment directly on the doc.
> >
> > Thanks for your feedback!
> >
> >
> > Best regards,
> >
> > ITing
> >
> >
> > On Thu, Apr 23, 2026 at 9:52 AM Adnan Hemani via dev <[email protected]> wrote:
> >
> > > Hi ITing,
> > >
> > > Thanks for your thoughts - please feel free to comment directly on the
> > > doc as well, as others on the email thread cannot yet see the document
> > > and may get confused by this exchange :) I am working to release it ASAP,
> > > but I need a few days (lots on my plate). If you would be willing to make
> > > suggestions and any required edits to the doc, that would be greatly
> > > appreciated!
> > >
> > > Responding to your comments:
> > >
> > > 1) This is a great starting point for this discussion, but I don't agree
> > > with the suggestion. Here's why:
> > > * For the general OL event, we typically see no more than 1-5 inputs
> > > and/or outputs, so our N x M fanout is typically not very high to begin
> > > with. More importantly, these edges will be upserted: if the 1000th event
> > > states the same A -> B relationship between tables, the system only needs
> > > to check whether that relationship already exists, so the database rows
> > > do not grow without bound. However, if we store events directly (which is
> > > what I understood from your email; please correct me if I'm wrong), this
> > > database table will grow constantly and may require regular
> > > compaction/de-duplication, especially if there are many frequent jobs
> > > running.
> > > * Given that OL requests are not in the processing phase of a query,
> > > taking a little extra time here is not a major concern.
> > > * We lose column-level lineage information with this approach, e.g.
> > > answering which datasets were used to compute column X within table A.
> > > * Array datatypes aren't ubiquitous across RDBMSs, which could
> > > potentially break our database-level interfacing via JDBC. There might be
> > > a way around this, but the question is: at what cost are we willing to
> > > make this happen?
> > >
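A minimal sketch of the upsert behaviour described in the first bullet above, using SQLite only because it ships with Python. The `lineage_edge` table, its columns, and `record_edges` are hypothetical names, not taken from the proposal; `INSERT OR IGNORE` is SQLite's idiom, and other RDBMSs would need their own equivalent.

```python
import sqlite3


def record_edges(conn, inputs, outputs):
    """Upsert one edge per (input, output) pair; repeated edges are no-ops."""
    conn.executemany(
        "INSERT OR IGNORE INTO lineage_edge (src, dst) VALUES (?, ?)",
        [(i, o) for i in inputs for o in outputs],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lineage_edge ("
    " src TEXT NOT NULL,"
    " dst TEXT NOT NULL,"
    " PRIMARY KEY (src, dst))"  # the key is what makes the insert idempotent
)

# The same A -> B relationship reported by 1000 events.
for _ in range(1000):
    record_edges(conn, ["A"], ["B"])

edge_count = conn.execute("SELECT COUNT(*) FROM lineage_edge").fetchone()[0]
```

Under these assumptions the table holds one row per distinct edge, however many events repeat it, whereas a raw-event table would hold one row per event.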
> > > 2) This is a fair suggestion, but I may have one that makes this even
> > > simpler. First, pass the data through to the external OL server, and then
> > > save it to the Polaris dataset. This way, Polaris never has "phantom"
> > > relations that aren't reflected in the external OL server. Alternatively,
> > > we could turn off Polaris' lineage database if an external OL server is
> > > currently configured, and punt on this issue unless someone requires it.
> > > WDYT?
> > >
> > > -Adnan
> > >
> > > On Wed, Apr 22, 2026 at 9:23 AM ITing Lee <[email protected]> wrote:
> > >
> > > > Hi Adnan Hemani,
> > > >
> > > > I have two questions about this proposal after careful thought.
> > > >
> > > >
> > > > 1. For the Persistence Layer Data Model design, would it be better to
> > > > have a single table with raw OL events, plus the necessary columns as
> > > > indexes, as the landing table, to avoid the N x M write amplification
> > > > at ingestion time? More specifically, would having the following 4
> > > > columns in an “ol_event” table be sufficient?
> > > >
> > > >    - `input_ids: ARRAY[str]`
> > > >    - `output_ids: ARRAY[str]`
> > > >    - `column_fields: ARRAY[str]`
> > > >    - `raw_event: JSON`
> > > >
> > > > From my perspective, scaling writes is more difficult than scaling
> > > > reads. The N x M edge permutation on the write path seems too heavy
> > > > for ingestion; it is more like a transformation for read-time
> > > > optimization.
> > > >
> > > > The single-table approach, with the ARRAY-type columns as indexes,
> > > > makes the raw_event column queryable for our public interface, and
> > > > writing each row should be fast and much simpler during ingestion.
> > > > Then we could compute the edges at read time. WDYT?
> > > >
> > > >
> > > > 2. In the "Processing Pipeline" section, I would like to double-check
> > > > the independence of the downstream backend, as we’re using a dual
> > > > write between two services in our design. Perhaps we could have a
> > > > `state` field, for audit-log purposes, that is set to `pending` in the
> > > > same transaction that writes the “ol_event” table and then set to
> > > > `done` after success. That way, if the server is force-killed between
> > > > the first transaction and passing the event to the OL backend, the
> > > > user (or the client) could retry and make the state consistent between
> > > > Polaris and the OL backend. Thanks!
> > > >
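A minimal sketch of the `pending`/`done` idea from question 2 above, again using SQLite for illustration. The `ol_event` table layout, `ingest`, `retry_pending`, and the `forward` callback (standing in for the call to the downstream OL backend) are hypothetical. Note that a retry can re-send an event that was delivered just before a crash, so this sketch assumes the downstream backend tolerates duplicates.

```python
import json
import sqlite3


def new_store():
    """Create an in-memory store with a hypothetical ol_event table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ol_event ("
        " id INTEGER PRIMARY KEY,"
        " raw_event TEXT NOT NULL,"
        " state TEXT NOT NULL)"
    )
    return conn


def mark_done(conn, event_id):
    conn.execute("UPDATE ol_event SET state = 'done' WHERE id = ?", (event_id,))
    conn.commit()


def ingest(conn, event, forward):
    """Persist the event as 'pending', forward it, then mark it 'done'."""
    cur = conn.execute(
        "INSERT INTO ol_event (raw_event, state) VALUES (?, 'pending')",
        (json.dumps(event),),
    )
    conn.commit()
    # If forward() raises, or the process dies here, the row stays 'pending'.
    forward(event)
    mark_done(conn, cur.lastrowid)
    return cur.lastrowid


def retry_pending(conn, forward):
    """Re-send every event still marked 'pending'; return how many were sent."""
    rows = conn.execute(
        "SELECT id, raw_event FROM ol_event WHERE state = 'pending'"
    ).fetchall()
    for event_id, raw in rows:
        forward(json.loads(raw))
        mark_done(conn, event_id)
    return len(rows)
```

A caller would invoke `retry_pending` on startup or on a timer, so that rows left `pending` by a crash are eventually delivered.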
> > > > Best regards, ITing
> > > >
> > > > On Wed, Apr 22, 2026 at 1:13 AM Adnan Hemani via dev <[email protected]> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > Great thoughts!
> > > > >
> > > > > I-Ting, you're exactly right. I will share the proposal to your email
> > > > > on Google Drive (it still needs a bit of work but it's about 90%
> > > > > complete right now). I'd love to get your feedback and work on the
> > > > > doc!
> > > > >
> > > > > Yufei, I 100% agree with your feedback as well. The proposal's
> > > > > current structure highlights a few advantages:
> > > > > 1. Centralized auth for the OL server: Marquez, the reference OL
> > > > > implementation, does not have an auth system. Polaris can provide
> > > > > significant value here immediately.
> > > > > 2. OL, in general, has many concepts, like Jobs/Runs/etc. I think it
> > > > > is likely that some customers do not need that level of lineage
> > > > > tracking and would rather not deal with the additional maintenance of
> > > > > yet another web server. Polaris can provide a "lite" version of data
> > > > > lineage OOTB so that users don't need to install a full-functionality
> > > > > OL server if they don't require it. But Polaris should be able to
> > > > > support a pass-through model to a full-functionality OL server in
> > > > > case users still want to use Polaris as the "single pane of glass".
> > > > > 3. Although the proposal does not explicitly state catalog-enhanced
> > > > > data enrichment for the OL data that Polaris could send downstream
> > > > > (to focus the scope of the initial changes), this proposal would help
> > > > > us unlock this functionality. This is something I discussed with
> > > > > Michael Collado earlier :)
> > > > >
> > > > > As next steps, I will work with I-Ting to introduce this proposal -
> > > > > perhaps later this week!
> > > > >
> > > > > Best,
> > > > > Adnan Hemani
> > > > >
> > > > > On Tue, Apr 21, 2026 at 9:45 AM Yufei Gu <[email protected]> wrote:
> > > > >
> > > > > > Agreed: engines provide more accurate lineage information.
> > > > > >
> > > > > > I think there are two viable architectures here:
> > > > > >
> > > > > >    - Engines send OpenLineage events directly to a lineage backend
> > > > > >    (e.g. Marquez)
> > > > > >    - Engines send events to Polaris, which acts as an ingestion
> > > > > >    layer (and optionally forwards downstream)
> > > > > >
> > > > > > The first option is the standard and simplest model. It avoids
> > > > > > extra hops and works well when lineage collection is the only goal.
> > > > > > The second option becomes interesting only if Polaris adds clear
> > > > > > control-plane value, for example:
> > > > > >
> > > > > >    - central auth and policy for lineage ingestion
> > > > > >    - catalog-aware enrichment (stable table identity, namespace,
> > > > > >    snapshots, federation context)
> > > > > >    - decoupling engines from the downstream lineage backend
> > > > > >
> > > > > > If Polaris is only acting as a pass-through, the extra layer likely
> > > > > > does not justify the added complexity. However, if we position
> > > > > > Polaris as a catalog-aware lineage gateway, the second model aligns
> > > > > > well with the “single pane of glass” direction. I think this seems
> > > > > > like a direction the Polaris community can pursue. My inclination
> > > > > > is to focus Polaris on ingestion, normalization, and forwarding,
> > > > > > and avoid expanding into a full lineage storage or analytics
> > > > > > system.
> > > > > >
> > > > > > Curious to hear others’ thoughts.
> > > > > >
> > > > > > Yufei
> > > > > >
> > > > > >
> > > > > > On Tue, Apr 21, 2026 at 7:52 AM ITing Lee <[email protected]> wrote:
> > > > > >
> > > > > > > Hi Adnan Hemani,
> > > > > > >
> > > > > > > Yes, I agree with your point of view. After thinking it through,
> > > > > > > I see that my POC might not capture complete data lineage.
> > > > > > >
> > > > > > > Additionally, following your direction, we can decouple Spark
> > > > > > > from OpenLineage and make Polaris the only interface that Spark
> > > > > > > interacts with.
> > > > > > >
> > > > > > > It would be my pleasure to collaborate on this project. I would
> > > > > > > like to make sure we’re aligned before proceeding, so please
> > > > > > > correct me if I’m wrong.
> > > > > > >
> > > > > > > Does this mean that we will implement the OL REST API in Polaris?
> > > > > > >
> > > > > > > For example, suppose there is one OpenLineage server running on
> > > > > > > port 5000 and a Polaris server running on port 8000. The user
> > > > > > > would first install `openlineage-spark-*.jar` [1], which stays
> > > > > > > as-is for the user, and then set the host to the one on port
> > > > > > > 8000.
> > > > > > >
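A rough sketch of the client-side setup described above, assuming Polaris exposed an OL-compatible HTTP endpoint on port 8000 (hypothetical; this thread only proposes it). The `spark.extraListeners` and `spark.openlineage.transport.*` keys are the standard OpenLineage Spark integration settings; the helper names and URLs are illustrative only.

```python
def openlineage_spark_conf(ol_url: str) -> dict:
    """Spark settings that send OpenLineage events to the given HTTP endpoint."""
    return {
        "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
        "spark.openlineage.transport.type": "http",
        "spark.openlineage.transport.url": ol_url,
    }


def to_submit_args(conf: dict) -> list:
    """Render the settings as spark-submit `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", "{}={}".format(key, value)]
    return args


# Today: events go straight to the OL server (e.g. Marquez) on port 5000.
direct = openlineage_spark_conf("http://localhost:5000")
# Under the proposal: same jar, same listener, only the URL changes.
via_polaris = openlineage_spark_conf("http://localhost:8000")
```

Only the transport URL differs between the two configurations, which is the sense in which the jar "stays as-is for the user".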
> > > > > > > If my understanding is correct, would you be willing to share the
> > > > > > > proposal you have written, so that we do not step on each other's
> > > > > > > toes?
> > > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > > > [1]
> > > > > > > https://openlineage.io/docs/integrations/spark/developing/built_in_lineage/
> > > > > > >
> > > > > > > Best regards,
> > > > > > >
> > > > > > > ITing
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Apr 21, 2026 at 10:49 AM Adnan Hemani via dev <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hi ITing,
> > > > > > > >
> > > > > > > > I'm glad you are looking into this - I was also
> > > > > > > > informally/verbally floating a proposal with some community
> > > > > > > > members during Iceberg Summit earlier this month regarding
> > > > > > > > OpenLineage support. Specifically, the proposal was to open
> > > > > > > > Polaris as an OpenLineage (OL) server implementation (with
> > > > > > > > exposed OL APIs) that could either retain OL information
> > > > > > > > internally (as a "lite" version offering through the Polaris
> > > > > > > > Persistence layer) or reflect it downstream to an OL server
> > > > > > > > like Marquez for a richer feature set (up to admin preference).
> > > > > > > > All compute engines would then call Polaris' OpenLineage APIs
> > > > > > > > (through their respective OL connectors) to keep Polaris as a
> > > > > > > > single-pane of glass for metadata. I was working on that
> > > > > > > > proposal (albeit slowly :') ), but I'm happy to merge proposals
> > > > > > > > as co-authors if you are open to it!
> > > > > > > >
> > > > > > > > I read through the proposal and skimmed through the POC code; I
> > > > > > > > don't think this is the correct approach to solving this
> > > > > > > > problem. The base assumption your code makes is that all
> > > > > > > > previously loaded tables within a Spark application contributed
> > > > > > > > to the creation of a new table. This is a very simplistic
> > > > > > > > heuristic that breaks in simple cases, such as when many tables
> > > > > > > > are read within a single Spark application but only a handful
> > > > > > > > of them are used to generate a new table. You've correctly
> > > > > > > > pointed out the flaw in Question 10.2: Polaris cannot determine
> > > > > > > > what populated a table based on IRC calls alone. This is why I
> > > > > > > > aimed to propose the idea above: have the engines send this
> > > > > > > > information directly to Polaris. This mirrors how other OL
> > > > > > > > servers rely on the engines' OL connectors to create this
> > > > > > > > information, as the engine is the sole source of truth
> > > > > > > > regarding the user's actual intent. Let me know your thoughts.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Adnan Hemani
> > > > > > > >
> > > > > > > > On Sun, Apr 19, 2026 at 9:56 PM Jean-Baptiste Onofré <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi ITing,
> > > > > > > > >
> > > > > > > > > Thanks for your message!
> > > > > > > > >
> > > > > > > > > For context, we discussed OpenLineage support a while ago,
> > > > > > > > > specifically regarding the initial support of OpenLineage
> > > > > > > > > events.
> > > > > > > > >
> > > > > > > > > I will take a look at your proposal and design document, and
> > > > > > > > > I am sure Mike will as well.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > JB
> > > > > > > > >
> > > > > > > > > On Mon, Apr 20, 2026 at 3:08 AM ITing Lee <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > I’ve been working on OpenLineage support in Polaris and
> > > > > > > > > > wanted to ask for feedback on the current implementation
> > > > > > > > > > approach and where it should go next.
> > > > > > > > > >
> > > > > > > > > > The current direction is to emit OpenLineage events from
> > > > > > > > > > Polaris-managed operations and publish them through an
> > > > > > > > > > event-listener-based design. Before moving further, I’d
> > > > > > > > > > like to get suggestions on whether this implementation
> > > > > > > > > > model makes sense and what should be adjusted early.
> > > > > > > > > >
> > > > > > > > > > Below are the links to the proposal and PoC branch.
> > > > > > > > > >
> > > > > > > > > > - Proposal:
> > > > > > > > > >   https://docs.google.com/document/d/1NEPzcMIcbxKEvBIWd6SJGllRIu4brjV-23xO5Qp3DlM/edit?tab=t.0
> > > > > > > > > > - PoC branch:
> > > > > > > > > >   https://github.com/iting0321/polaris/tree/data-lineage
> > > > > > > > > >
> > > > > > > > > > I’d appreciate feedback on the implementation direction and
> > > > > > > > > > any suggestions before continuing further.
> > > > > > > > > >
> > > > > > > > > > Best regards,
> > > > > > > > > > ITing
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
