FYI, I had some discussions with the OpenLineage folks last summer. The
option I had proposed was to treat Polaris as an OpenLineage proxy so that
Polaris had the opportunity to augment the DatasetFacet with more specific
information about the dataset (e.g., real FQN, UUID, snapshot id, etc.).
The feedback I got from them was that they were actually interested in
treating Polaris as a potential source for Dataset information without
necessarily treating Polaris as a proxy for all OL events. I thought this
was a fair way of looking at it: not only does Polaris avoid handling every
single OL payload (the payload can be quite large, and even a single Spark
application can publish many events), but we could also help drive a
standard for other catalogs to follow.
Just perusing the current design doc, I think there are some complexities
that aren't addressed. For example, I see that table modifications publish
events, but what about loadTable events? To publish lineage information, we
need to know what tables are read and what tables are written, but
loadTable is ambiguous - we don't know from the catalog whether a loadTable
call is being made because that table is about to be written to or because
it is being read. And if we're emitting dataset information, which snapshot
do we emit in the dataset facet? If the table is an output, we should emit
the snapshot id being added, but if it is being read, we should emit the
snapshot id as of the loadTable call. But, again, since loadTable is
ambiguous, we don't know which snapshot to emit, or whether to emit at all.
On the other hand, something like a
/{catalog}/lineage/datasets/{namespace}/{table} API that simply returns
DatasetFacets, which the OL client can invoke and use to augment the
dataset information in the OL payload, allows the engine, which actually
knows what's happening, to decide where to fit that information in the OL
payload and where to place the dataset in the OL run information.
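
To make that concrete, here is a minimal client-side sketch, assuming the
endpoint above returns the DatasetFacet JSON for one table; the class and
method names are illustrative, not an agreed API:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public final class PolarisFacetFetcher {
      private final HttpClient http = HttpClient.newHttpClient();

      /** Returns the raw DatasetFacet JSON for one table, or null if unknown. */
      public String fetchDatasetFacets(String polarisBase, String catalog,
                                       String namespace, String table) throws Exception {
        URI uri = URI.create(String.format("%s/%s/lineage/datasets/%s/%s",
            polarisBase, catalog, namespace, table));
        HttpRequest req = HttpRequest.newBuilder(uri)
            .header("Accept", "application/json")
            .GET().build();
        HttpResponse<String> resp =
            http.send(req, HttpResponse.BodyHandlers.ofString());
        return resp.statusCode() == 200 ? resp.body() : null;
      }
    }

The engine-side OL integration could call this while assembling a run event
and merge the result into the input or output dataset, since only the
engine knows which role the dataset plays.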
Mike
On Thu, Apr 23, 2026 at 2:12 PM ITing Lee <[email protected]> wrote:
> Hi Adnan Hemani,
>
> I totally agree with you on these two questions. If I have further
> questions, I will comment directly on the doc.
>
> Thanks for your feedback!
>
>
> Best regards,
>
> ITing
>
>
> Adnan Hemani via dev <[email protected]> wrote on Thursday, April 23, 2026 at 9:52 AM:
>
> > Hi ITing,
> >
> > Thanks for your thoughts! Please feel free to comment directly on the
> > doc as well, since others on the email thread cannot yet see the document
> > and may get confused by this exchange :) I am working to release it ASAP,
> > but I need a few days (lots on my plate). If you would be willing to make
> > suggestions and any required edits to the doc, that would be greatly
> > appreciated!
> >
> > Responding to your comments:
> >
> > 1) This is a great starting point for this discussion, but I don't agree
> > with the suggestion. Here's why:
> > * For the general OL event, we typically see no more than 1-5 inputs
> > and/or outputs, so our N x M fanout is typically not very high to begin
> > with. But more importantly, these edges will be upserted. If the 1000th
> > event states the same A -> B relationship between tables, the system only
> > needs to check whether that relationship already exists. The database
> > rows do not grow infinitely. However, if we store events directly (which
> > I understood from your email, please correct me if I'm wrong), this
> > database table will grow constantly and may require regular
> > compaction/de-duplication, especially if there are many frequent jobs
> > running.
> > * Given that OL requests are not in the processing phase of a query,
> > taking a little extra time here is not an extreme concern.
> > * We lose column-level lineage information with this approach, e.g.
> > answering which datasets computed column X within table A.
> > * Array datatypes aren't ubiquitous across RDBMSs, which could
> > potentially break our database-level interfacing via JDBC. There might be
> > a way around this, but the question is: at what cost are we willing to
> > make this happen?
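> >
> > As a minimal sketch of the upsert behavior described in point 1, assuming
> > a PostgreSQL-style backing store (the lineage_edge table and the helper
> > are illustrative, not part of the design doc):
> >
> >     import java.sql.Connection;
> >     import java.sql.PreparedStatement;
> >     import java.sql.SQLException;
> >
> >     public final class LineageEdgeUpserter {
> >       /** Records one input -> output edge; replaying the same event is a
> >        *  no-op. Assumes a unique constraint on (input_id, output_id);
> >        *  ON CONFLICT is PostgreSQL syntax. */
> >       static void upsertEdge(Connection conn, String inputId, String outputId)
> >           throws SQLException {
> >         try (PreparedStatement ps = conn.prepareStatement(
> >             "INSERT INTO lineage_edge (input_id, output_id) VALUES (?, ?) "
> >                 + "ON CONFLICT (input_id, output_id) DO NOTHING")) {
> >           ps.setString(1, inputId);
> >           ps.setString(2, outputId);
> >           ps.executeUpdate();
> >         }
> >       }
> >     }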
> >
> > 2) This is a fair suggestion - but I may have one that makes this even
> > simpler. First, pass the data through to the external OL server, and then
> > save it to the Polaris database. This way, Polaris never has "phantom"
> > relations that aren't reflected in the external OL server. Alternatively,
> > we could turn off Polaris' lineage database if an external OL server is
> > currently configured, and punt on this issue unless someone requires it.
> > WDYT?
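> >
> > A minimal sketch of that ordering, with hypothetical helpers (neither
> > forwardToOlServer nor persistEdges exists in Polaris today):
> >
> >     public final class OlPassThrough {
> >       /** Forward first; persist locally only if the external OL server
> >        *  accepted the event, so Polaris never stores "phantom" relations. */
> >       void handleEvent(String eventJson) throws Exception {
> >         forwardToOlServer(eventJson);  // throws on failure; nothing persisted yet
> >         persistEdges(eventJson);       // local lineage rows written only on success
> >       }
> >
> >       void forwardToOlServer(String eventJson) throws Exception { /* HTTP POST */ }
> >       void persistEdges(String eventJson) { /* upsert into lineage tables */ }
> >     }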
> >
> > -Adnan
> >
> > On Wed, Apr 22, 2026 at 9:23 AM ITing Lee <[email protected]> wrote:
> >
> > > Hi Adnan Hemani,
> > >
> > > I have two questions about this proposal after careful thought.
> > >
> > >
> > > 1. For the Persistence Layer Data Model design, would it be better to
> > > have a single table of raw OL events, with the necessary columns as
> > > indexes, for landing purposes, to avoid the N x M write amplification
> > > at ingestion time? More specifically, would having the following 4
> > > columns in an “ol_event” table be sufficient?
> > >
> > > - `input_ids: ARRAY[str]`
> > > - `output_ids: ARRAY[str]`
> > > - `column_fields: ARRAY[str]`
> > > - `raw_event: JSON`
> > >
> > > From my perspective, scaling writes is more difficult than scaling
> > > reads. The N x M edge permutation on the write path seems too heavy for
> > > ingestion; it is more like a transformation for read-time optimization.
> > >
> > > The single-table approach, with the ARRAY-type columns as indexes,
> > > makes the raw_event column queryable for our public interface, and
> > > writing each row should be fast and much simpler at ingestion time.
> > > Then we could compute the edges at read time. WDYT?
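> > >
> > > A minimal sketch of that landing table, assuming a PostgreSQL backend
> > > (DDL issued over JDBC; the GIN indexes are one illustrative way to keep
> > > the ARRAY columns searchable):
> > >
> > >     import java.sql.Connection;
> > >     import java.sql.SQLException;
> > >     import java.sql.Statement;
> > >
> > >     public final class OlEventSchema {
> > >       /** Creates the single landing table; PostgreSQL-specific types. */
> > >       static void createOlEventTable(Connection conn) throws SQLException {
> > >         try (Statement stmt = conn.createStatement()) {
> > >           stmt.execute(
> > >               "CREATE TABLE IF NOT EXISTS ol_event ("
> > >                   + " id BIGSERIAL PRIMARY KEY,"
> > >                   + " input_ids TEXT[],"     // ARRAY types are not portable across RDBMSs
> > >                   + " output_ids TEXT[],"
> > >                   + " column_fields TEXT[],"
> > >                   + " raw_event JSONB)");    // full payload stays queryable
> > >           // GIN indexes keep containment reads fast, e.g.
> > >           // WHERE input_ids @> ARRAY['db.tbl'].
> > >           stmt.execute("CREATE INDEX IF NOT EXISTS ol_event_inputs"
> > >               + " ON ol_event USING GIN (input_ids)");
> > >           stmt.execute("CREATE INDEX IF NOT EXISTS ol_event_outputs"
> > >               + " ON ol_event USING GIN (output_ids)");
> > >         }
> > >       }
> > >     }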
> > >
> > >
> > > 2. In the "Processing Pipeline" section, I would like to double-check
> > > the failure independence of the downstream backend, since we're
> > > dual-writing between two services in this design. Perhaps we could have
> > > a `state` field, for audit-log purposes, that is set to `pending` in
> > > the same transaction that writes the “ol_event” table, and then set to
> > > `done` after the event is successfully passed to the OL backend. That
> > > way, if the server is force-killed between the first transaction and
> > > the hand-off to the OL backend, the user (or the client) could retry
> > > and make the state consistent between Polaris and the OL backend.
> > > Thanks!
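> > >
> > > A minimal sketch of that pending/done flow (a transactional-outbox
> > > style pattern; the forwarding call and the PostgreSQL-style jsonb cast
> > > are illustrative):
> > >
> > >     import java.sql.Connection;
> > >     import java.sql.PreparedStatement;
> > >
> > >     public final class OlEventOutbox {
> > >       /** Lands the event with state = 'pending' in one transaction, then
> > >        *  marks it 'done' only after the OL backend accepted it; a crash
> > >        *  in between leaves a 'pending' row that a retry can pick up. */
> > >       static void ingest(Connection conn, long id, String rawEvent)
> > >           throws Exception {
> > >         conn.setAutoCommit(false);
> > >         try (PreparedStatement ps = conn.prepareStatement(
> > >             "INSERT INTO ol_event (id, raw_event, state)"
> > >                 + " VALUES (?, ?::jsonb, 'pending')")) {
> > >           ps.setLong(1, id);
> > >           ps.setString(2, rawEvent);
> > >           ps.executeUpdate();
> > >           conn.commit();
> > >         }
> > >         forwardToOlBackend(rawEvent);  // hypothetical; throws on rejection
> > >         try (PreparedStatement ps = conn.prepareStatement(
> > >             "UPDATE ol_event SET state = 'done' WHERE id = ?")) {
> > >           ps.setLong(1, id);
> > >           ps.executeUpdate();
> > >           conn.commit();
> > >         }
> > >       }
> > >
> > >       static void forwardToOlBackend(String e) throws Exception {
> > >         /* HTTP POST to the external OL server */
> > >       }
> > >     }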
> > >
> > > Best regards, ITing
> > >
> > > Adnan Hemani via dev <[email protected]> wrote on Wednesday, April 22, 2026 at 1:13 AM:
> > >
> > > > Hi all,
> > > >
> > > > Great thoughts!
> > > >
> > > > I-Ting, you're exactly right. I will share the proposal with your
> > > > email on Google Drive (it still needs a bit of work, but it's about
> > > > 90% complete right now). I'd love to get your feedback and work on
> > > > the doc!
> > > >
> > > > Yufei, I 100% agree with your feedback as well. The proposal's
> > > > current structure highlights a few advantages:
> > > > 1. Centralized auth for the OL server: Marquez, the reference OL
> > > > implementation, does not have an auth system. Polaris can provide
> > > > significant value here immediately.
> > > > 2. OL, in general, has many concepts, like Jobs/Runs/etc. I think it
> > > > is likely that some customers do not need that level of lineage
> > > > tracking and would rather not deal with the additional maintenance of
> > > > yet another web server. Polaris can provide a "lite" version of data
> > > > lineage OOTB so that users don't need to install a full-functionality
> > > > OL server if they don't require it. But Polaris should be able to
> > > > support a pass-through model to a full-functionality OL server in
> > > > case users still want to use Polaris as the "single pane of glass".
> > > > 3. Although the proposal does not explicitly cover catalog-enhanced
> > > > enrichment of the OL data that Polaris could send downstream (to keep
> > > > the scope of the initial changes focused), it would help us unlock
> > > > that functionality. This is something I discussed with Michael
> > > > Collado earlier :)
> > > >
> > > > As next steps, I will work with I-Ting to introduce this proposal -
> > > > perhaps later this week!
> > > >
> > > > Best,
> > > > Adnan Hemani
> > > >
> > > > On Tue, Apr 21, 2026 at 9:45 AM Yufei Gu <[email protected]> wrote:
> > > >
> > > > > Agreed: engines provide more accurate lineage information.
> > > > >
> > > > > I think there are two viable architectures here:
> > > > >
> > > > > - Engines send OpenLineage events directly to a lineage backend
> > > > > (e.g. Marquez)
> > > > > - Engines send events to Polaris, which acts as an ingestion layer
> > > > > (and optionally forwards downstream)
> > > > >
> > > > > The first option is the standard and simplest model. It avoids
> > > > > extra hops and works well when lineage collection is the only goal.
> > > > > The second option becomes interesting only if Polaris adds clear
> > > > > control-plane value, for example:
> > > > >
> > > > > - central auth and policy for lineage ingestion
> > > > > - catalog-aware enrichment (stable table identity, namespace,
> > > > > snapshots, federation context)
> > > > > - decoupling engines from the downstream lineage backend
> > > > >
> > > > > If Polaris is only acting as a pass-through, the extra layer likely
> > > > > does not justify the added complexity. However, if we position
> > > > > Polaris as a catalog-aware lineage gateway, the second model aligns
> > > > > well with the “single pane of glass” direction. This seems like a
> > > > > direction the Polaris community can pursue. My inclination is to
> > > > > focus Polaris on ingestion, normalization, and forwarding, and to
> > > > > avoid expanding into a full lineage storage or analytics system.
> > > > >
> > > > > Curious to hear others’ thoughts.
> > > > >
> > > > > Yufei
> > > > >
> > > > >
> > > > > > On Tue, Apr 21, 2026 at 7:52 AM ITing Lee <[email protected]> wrote:
> > > > >
> > > > > > Hi Adnan Hemani,
> > > > > >
> > > > > > Yes, I agree with your point of view. After thinking it through,
> > > > > > my POC might fail to capture complete data lineage.
> > > > > >
> > > > > > Additionally, with your direction, we can decouple Spark from the
> > > > > > OpenLineage server and make Polaris the only interface that Spark
> > > > > > interacts with.
> > > > > >
> > > > > > It is my pleasure to collaborate on this project. I would like to
> > > > > > make sure we’re aligned before proceeding, so please correct me if
> > > > > > I’m wrong.
> > > > > >
> > > > > > Does it mean that we will implement the OL REST API in Polaris?
> > > > > >
> > > > > > For example, say there's an OpenLineage server running on port
> > > > > > 5000 and a Polaris server running on port 8000. What the user
> > > > > > needs to do is: first, install `openlineage-spark-*.jar` [1],
> > > > > > which stays as-is for the user; then second, point the host at
> > > > > > port 8000.
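> > > > > >
> > > > > > A minimal sketch of that wiring, assuming the standard OpenLineage
> > > > > > Spark HTTP transport settings (the Polaris URL here is illustrative):
> > > > > >
> > > > > >     import org.apache.spark.sql.SparkSession;
> > > > > >
> > > > > >     public final class LineageDemo {
> > > > > >       public static void main(String[] args) {
> > > > > >         // The OL listener jar stays as-is; only the transport URL
> > > > > >         // changes, so events go to Polaris (port 8000) instead of
> > > > > >         // the standalone OL server (port 5000).
> > > > > >         SparkSession spark = SparkSession.builder()
> > > > > >             .appName("lineage-demo")
> > > > > >             .config("spark.extraListeners",
> > > > > >                 "io.openlineage.spark.agent.OpenLineageSparkListener")
> > > > > >             .config("spark.openlineage.transport.type", "http")
> > > > > >             .config("spark.openlineage.transport.url",
> > > > > >                 "http://localhost:8000")
> > > > > >             .getOrCreate();
> > > > > >         spark.stop();
> > > > > >       }
> > > > > >     }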
> > > > > >
> > > > > > If my understanding is correct, would you share the proposal you
> > > > > > have written, so that we don't step on each other's work?
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > [1]
> > > > > > https://openlineage.io/docs/integrations/spark/developing/built_in_lineage/
> > > > > >
> > > > > > Best regards,
> > > > > >
> > > > > > ITing
> > > > > >
> > > > > >
> > > > > > Adnan Hemani via dev <[email protected]> wrote on Tuesday, April 21, 2026 at 10:49 AM:
> > > > > >
> > > > > > > Hi ITing,
> > > > > > >
> > > > > > > I'm glad you are looking into this - I was also informally/verbally
> > > > > > > floating a proposal with some community members during Iceberg
> > > > > > > Summit earlier this month regarding OpenLineage support.
> > > > > > > Specifically, the proposal was to open Polaris as an OpenLineage
> > > > > > > (OL) server implementation (with exposed OL APIs) that could either
> > > > > > > retain OL information internally (as a "lite" version offering
> > > > > > > through the Polaris Persistence layer) or reflect it downstream to
> > > > > > > an OL server like Marquez for a richer feature set (up to admin
> > > > > > > preference). All compute engines would then call Polaris'
> > > > > > > OpenLineage APIs (through their respective OL connectors) to keep
> > > > > > > Polaris as a single pane of glass for metadata. I was working on
> > > > > > > that proposal (albeit slowly :') ), but I'm happy to merge
> > > > > > > proposals as co-authors if you are open to it!
> > > > > > >
> > > > > > > I read through the proposal and skimmed through the POC code; I
> > > > > > > don't think this is a correct approach to solving this problem.
> > > > > > > The base assumption your code makes is that all previously loaded
> > > > > > > tables within a Spark application contributed to the creation of a
> > > > > > > new table. This is a very simplistic heuristic that easily breaks
> > > > > > > in simple cases, such as when many tables are read within a single
> > > > > > > Spark application but only a handful of them are used to generate
> > > > > > > a new table. You've correctly pointed out the flaw in Question
> > > > > > > 10.2: Polaris cannot determine what populated a table based on IRC
> > > > > > > calls alone. This is why I aimed to propose the idea above - have
> > > > > > > the engines send this information directly to Polaris. This
> > > > > > > mirrors how other OL servers rely on the engine's OL connectors to
> > > > > > > create this information, as the engine is the sole source of truth
> > > > > > > regarding the user's actual intent. Let me know your thoughts.
> > > > > > >
> > > > > > > Best,
> > > > > > > Adnan Hemani
> > > > > > >
> > > > > > > On Sun, Apr 19, 2026 at 9:56 PM Jean-Baptiste Onofré <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hi ITing,
> > > > > > > >
> > > > > > > > Thanks for your message!
> > > > > > > >
> > > > > > > > For context, we discussed OpenLineage support a while ago,
> > > > > > > > specifically regarding the initial support of OpenLineage
> > > > > > > > events.
> > > > > > > >
> > > > > > > > I will take a look at your proposal and design document, and I
> > > > > > > > am sure Mike will as well.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > JB
> > > > > > > >
> > > > > > > > On Mon, Apr 20, 2026 at 3:08 AM ITing Lee <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > I’ve been working on OpenLineage support in Polaris and
> > > > > > > > > wanted to ask for feedback on the current implementation
> > > > > > > > > approach and where it should go next.
> > > > > > > > >
> > > > > > > > > The current direction is to emit OpenLineage events from
> > > > > > > > > Polaris-managed operations and publish them through an
> > > > > > > > > event-listener-based design. Before moving further, I’d like
> > > > > > > > > to get suggestions on whether this implementation model makes
> > > > > > > > > sense and what should be adjusted early.
> > > > > > > > >
> > > > > > > > > Below are the links for the proposal and PoC branch.
> > > > > > > > >
> > > > > > > > > - Proposal:
> > > > > > > > > https://docs.google.com/document/d/1NEPzcMIcbxKEvBIWd6SJGllRIu4brjV-23xO5Qp3DlM/edit?tab=t.0
> > > > > > > > > - PoC branch:
> > > > > > > > > https://github.com/iting0321/polaris/tree/data-lineage
> > > > > > > > >
> > > > > > > > > I’d appreciate feedback on the implementation direction and
> > > > > > > > > any suggestions before continuing further.
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > ITing
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>