Re: [DISCUSS] Add OpenLineage to Apache Polaris

Adnan Hemani via dev Wed, 22 Apr 2026 18:52:52 -0700

Hi ITing,

Thanks for your thoughts - please feel free to comment directly on the
doc as well, since others on the email thread cannot yet see the document and
may get confused by this exchange :) I am working to release it ASAP, but I
need a few days (lots on my plate). If you would be willing to make
suggestions and any required edits to the doc, that would be greatly
appreciated!


Responding to your comments:

1) This is a great starting point for this discussion but I don't agree
with the suggestion. Here's why:
* For the general OL event, we typically see no more than 1-5 inputs and/or
outputs, so our N x M fanout is typically not very high to begin with. But
more importantly, these edges will be upserted. If the 1000th event states
the same table A -> B relationship, the system only needs to
check if that relationship already exists. The database rows do not grow
infinitely. However, if we store events directly (which I understood from
your email, please correct me if I'm wrong), this database table will grow
constantly and may require regular compaction/de-duplication, especially
if there are many frequent jobs running.
* Given that OL requests are not in the processing phase of a query, taking
a little extra time here is not an extreme concern.
* We lose column-level lineage information with this approach, e.g. the
ability to answer which datasets were used to compute column X within table A.
* Array datatypes aren't ubiquitous across RDBMSs, which could potentially
break our database-level interfacing via JDBC. There might be a way around
this, but the question is: at what cost are we willing to make this happen?
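
As a minimal sketch of the upsert behavior described in the first bullet
(assuming a hypothetical `lineage_edge` table keyed on the input/output pair,
and using SQLite's `INSERT OR IGNORE` purely for illustration; other databases
offer equivalents such as PostgreSQL's `ON CONFLICT DO NOTHING`), repeated
events for the same relationship leave the row count unchanged:

```python
import sqlite3

# Hypothetical schema: one row per distinct input -> output edge.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE lineage_edge (
        input_id  TEXT NOT NULL,
        output_id TEXT NOT NULL,
        PRIMARY KEY (input_id, output_id)
    )
    """
)


def upsert_edges(conn, inputs, outputs):
    """Write the N x M edges of one event, skipping edges that already exist."""
    rows = [(i, o) for i in inputs for o in outputs]
    conn.executemany(
        "INSERT OR IGNORE INTO lineage_edge (input_id, output_id) VALUES (?, ?)",
        rows,
    )
    conn.commit()


# The same job reporting A -> B a thousand times still yields a single row.
for _ in range(1000):
    upsert_edges(conn, ["A"], ["B"])

# One event with 2 inputs and 2 outputs contributes at most 2 x 2 edges.
upsert_edges(conn, ["A", "C"], ["B", "D"])

edge_count = conn.execute("SELECT COUNT(*) FROM lineage_edge").fetchone()[0]
print(edge_count)
```

The table and column names here are assumptions made for the example, not
the schema from the proposal doc.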

2) This is a fair suggestion - but I may have one that makes this even
simpler. First, pass the data through to the external OL server, and then
save it to the Polaris database. This way, Polaris never has "phantom"
relations that aren't reflected in the external OL server. Alternatively,
we could turn off Polaris' lineage database if an external OL server is
currently configured, and punt on this issue unless someone requires it.
WDYT?
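
A rough sketch of the forward-then-persist ordering, where `forward` and
`persist` are hypothetical callables standing in for the external OL client
and the Polaris persistence call (neither name comes from the proposal):

```python
class ForwardError(Exception):
    """Raised when the external OL server does not accept the event."""


def ingest(event, forward, persist):
    """Forward the event first; persist locally only if forwarding succeeded."""
    try:
        forward(event)
    except Exception as exc:
        # Nothing has been written locally yet, so no "phantom" relation exists.
        raise ForwardError("external OL server did not accept the event") from exc
    persist(event)


# Hypothetical stand-ins for the two backends.
local_store = []


def failing_forward(event):
    raise ConnectionError("external OL server unreachable")


def ok_forward(event):
    pass


event = {"inputs": ["A"], "outputs": ["B"]}

try:
    ingest(event, failing_forward, local_store.append)
except ForwardError:
    pass
stored_after_failure = len(local_store)

ingest(event, ok_forward, local_store.append)
stored_after_success = len(local_store)
```

With this ordering, a failure in between can only leave the external server
ahead of Polaris, and a client retry would close that gap if the local write
is an idempotent upsert.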

-Adnan

On Wed, Apr 22, 2026 at 9:23 AM ITing Lee <[email protected]> wrote:

> Hi Adnan Hemani,
>
> I have two questions about this proposal after careful thought.
>
>
> 1. For the Persistence Layer Data Model design, would it be better to have
> a single table with raw OL events and necessary columns as indexes for the
> landing purpose to avoid the N x M write amplification at the ingestion
> time? More specifically, would having the following 4 columns as an
> “ol_event” table be sufficient?
>
>    - `input_ids: ARRAY[str]`
>    - `output_ids: ARRAY[str]`
>    - `column_fields: ARRAY[str]`
>    - `raw_event: JSON`
>
> From my perspective, scaling writes is more difficult than scaling reads. The
> N x M edge permutation on the write path seems too heavy for
> ingestion; it is more like a transformation for read-time
> optimization.
>
> The single-table approach with the ARRAY-type columns as indexes makes the
> raw_event column queryable for our public interface, and writing
> each row should be fast and much simpler at ingestion. Then we
> could compute the edges at read time. WDYT?
>
>
> 2. In the "Processing Pipeline" section, I would like to double-check the
> independence of the downstream backend, as we’re using dual-write between
> two services in our design. Perhaps we could have a `state` field for
> audit-log purposes that is set to `pending` in the same transaction that
> writes the “ol_event” table, and then set to `done` after success.
> That way, if the server is force-killed between the first transaction and
> passing the event to the OL backend, the user (or the client) could retry
> and make the state consistent between Polaris and the OL backend. Thanks!
>
> Best regards, ITing
>
> Adnan Hemani via dev <[email protected]> wrote on Wed, Apr 22, 2026 at 1:13 AM:
>
> > Hi all,
> >
> > Great thoughts!
> >
> > I-Ting, you're exactly right. I will share the proposal to your email on
> > Google Drive (it still needs a bit of work but it's about 90% complete
> > right now). I'd love to get your feedback and work on the doc!
> >
> > Yufei, I 100% agree with your feedback as well. The proposal's current
> > structure highlights a few advantages:
> > 1. Centralized auth for the OL server: Marquez, the reference OL
> > implementation, does not have an auth system. Polaris can provide
> > significant value here immediately.
> > 2. OL, in general, has many concepts, like Jobs/Runs/etc. I think it is
> > likely that some customers do not need that level of lineage tracking and
> > would rather not deal with the additional maintenance of yet another web
> > server. Polaris can provide a "lite" version of data lineage OOTB so that
> > users don't need to install a full-functionality OL server if they don't
> > require it. But Polaris should be able to support a pass-through model to a
> > full-functionality OL server in case users still want to use Polaris as the
> > "single pane of glass".
> > 3. Although the proposal does not explicitly state catalog-enhanced data
> > enrichment for the OL data that Polaris could send downstream (to focus the
> > scope of the initial changes), this proposal would help us unlock this
> > functionality. This is something I discussed with Michael Collado earlier
> > :)
> >
> > As next steps, I will work with I-Ting to introduce this proposal - perhaps
> > later this week!
> >
> > Best,
> > Adnan Hemani
> >
> > On Tue, Apr 21, 2026 at 9:45 AM Yufei Gu <[email protected]> wrote:
> >
> > > Agreed: engines provide more accurate lineage information.
> > >
> > > I think there are two viable architectures here:
> > >
> > >    - Engines send OpenLineage events directly to a lineage backend (e.g.
> > >    Marquez)
> > >    - Engines send events to Polaris, which acts as an ingestion layer (and
> > >    optionally forwards downstream)
> > >
> > > The first option is the standard and simplest model. It avoids extra hops
> > > and works well when lineage collection is the only goal. The second option
> > > becomes interesting only if Polaris adds clear control-plane value, for
> > > example:
> > >
> > >    - central auth and policy for lineage ingestion
> > >    - catalog-aware enrichment (stable table identity, namespace, snapshots,
> > >    federation context)
> > >    - decoupling engines from the downstream lineage backend
> > >
> > > If Polaris is only acting as a pass-through, the extra layer likely does
> > > not justify the added complexity. However, if we position Polaris as a
> > > catalog-aware lineage gateway, the second model aligns well with the
> > > “single pane of glass” direction. I think this seems a direction the
> > > Polaris community can pursue. My inclination is to focus Polaris on
> > > ingestion, normalization, and forwarding, and avoid expanding into a full
> > > lineage storage or analytics system.
> > >
> > > Curious to hear others’ thoughts.
> > >
> > > Yufei
> > >
> > >
> > > On Tue, Apr 21, 2026 at 7:52 AM ITing Lee <[email protected]> wrote:
> > >
> > > > Hi Adnan Hemani,
> > > >
> > > > Yes, I agree with your point of view. After thinking it through, my POC
> > > > may indeed fail to capture complete data lineage.
> > > >
> > > > Additionally, following your direction, we can decouple Spark from
> > > > OpenLineage and make Polaris the only interface that Spark interacts with.
> > > >
> > > > It is my pleasure to collaborate on this project, and I would like to make
> > > > sure we’re aligned before proceeding; please correct me if I’m wrong.
> > > >
> > > > Does it mean that we will implement the OL REST API in Polaris?
> > > >
> > > > For example, say there's an OpenLineage server running on port 5000 and a
> > > > Polaris server running on port 8000. The user first installs
> > > > `openlineage-spark-*.jar` [1], which stays as-is for the user, and then
> > > > sets the host to the Polaris server on port 8000.
> > > >
> > > > If my understanding is correct, would you like to share the proposal you
> > > > have written, so that we do not step on each other's toes?
> > > >
> > > > Thanks!
> > > >
> > > > [1]
> > > > https://openlineage.io/docs/integrations/spark/developing/built_in_lineage/
> > > >
> > > > Best regards,
> > > >
> > > > ITing
> > > >
> > > >
> > > > Adnan Hemani via dev <[email protected]> wrote on Tue, Apr 21, 2026 at 10:49 AM:
> > > >
> > > > > Hi ITing,
> > > > >
> > > > > I'm glad you are looking into this - I was also informally/verbally
> > > > > floating a proposal with some community members during Iceberg Summit
> > > > > earlier this month regarding OpenLineage support. Specifically, the
> > > > > proposal was to open Polaris as an OpenLineage (OL) server implementation
> > > > > (with exposed OL APIs) that could either retain OL information internally
> > > > > (as a "lite" version offering through the Polaris Persistence layer) or
> > > > > reflect it downstream to an OL server like Marquez for a richer feature
> > > > > set (up to admin preference). All compute engines would then call Polaris'
> > > > > OpenLineage APIs (through their respective OL connectors) to keep Polaris
> > > > > as a single pane of glass for metadata. I was working on that proposal
> > > > > (albeit slowly :') ), but I'm happy to merge proposals as co-authors if
> > > > > you are open to it!
> > > > >
> > > > > I read through the proposal and skimmed through the POC code; I don't
> > > > > think this is a correct approach to solve this problem. The base assumption
> > > > > your code makes is that all previously loaded tables within a Spark
> > > > > Application contributed to the creation of a new table. This is a very
> > > > > simplistic heuristic that easily breaks in simple cases, such as when
> > > > > many tables are read within a single Spark application but only a handful
> > > > > of them are used to generate a new table. You've correctly pointed out
> > > > > the flaw in Question 10.2: Polaris cannot determine what populated a table
> > > > > based on IRC calls alone. This is why I aimed to propose the idea above -
> > > > > have the engines send this information directly to Polaris. This mirrors
> > > > > how any other OL servers rely on the engine's OL connectors to create
> > > > > this information, as the engine is the sole source of truth regarding the
> > > > > user's actual intent. Let me know your thoughts.
> > > > >
> > > > > Best,
> > > > > Adnan Hemani
> > > > >
> > > > > On Sun, Apr 19, 2026 at 9:56 PM Jean-Baptiste Onofré <[email protected]> wrote:
> > > > >
> > > > > > Hi ITing,
> > > > > >
> > > > > > Thanks for your message!
> > > > > >
> > > > > > For context, we discussed OpenLineage support a while ago, specifically
> > > > > > regarding the initial support of OpenLineage events.
> > > > > >
> > > > > > I will take a look at your proposal and design document, and I am sure
> > > > > > Mike will as well.
> > > > > >
> > > > > > Regards,
> > > > > > JB
> > > > > >
> > > > > > On Mon, Apr 20, 2026 at 3:08 AM ITing Lee <[email protected]> wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > I’ve been working on OpenLineage support in Polaris and wanted to ask for
> > > > > > > feedback on the current implementation approach and where it should go
> > > > > > > next.
> > > > > > >
> > > > > > > The current direction is to emit OpenLineage events from Polaris-managed
> > > > > > > operations and publish them through an event-listener-based design. Before
> > > > > > > moving further, I’d like to get suggestions on whether this implementation
> > > > > > > model makes sense and what should be adjusted early.
> > > > > > >
> > > > > > > Below are the links for the proposal and PoC branch.
> > > > > > >
> > > > > > > - Proposal:
> > > > > > > https://docs.google.com/document/d/1NEPzcMIcbxKEvBIWd6SJGllRIu4brjV-23xO5Qp3DlM/edit?tab=t.0
> > > > > > > - PoC branch: https://github.com/iting0321/polaris/tree/data-lineage
> > > > > > >
> > > > > > > I’d appreciate feedback on the implementation direction and any suggestions
> > > > > > > before continuing further.
> > > > > > >
> > > > > > > Best regards,
> > > > > > > ITing
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
