Hi Adnan Hemani, I totally agree with you on these two questions. If I have further questions, I will comment directly on the doc.
Thanks for your feedback!

Best regards,
ITing

On Thu, Apr 23, 2026 at 9:52 AM, Adnan Hemani via dev <[email protected]> wrote:

> Hi ITing,
>
> Thanks for your thoughts - please feel free to comment directly on the doc as well, as others on the email thread cannot yet see the document and may get confused by this exchange :) I am working to release it ASAP, but I need a few days (lots on my plate). If you would be willing to make suggestions and any required edits to the doc, that would be greatly appreciated!
>
> Responding to your comments:
>
> 1) This is a great starting point for this discussion, but I don't agree with the suggestion. Here's why:
> * For the general OL event, we typically see no more than 1-5 inputs and/or outputs, so our N x M fanout is typically not super high to begin with. But more importantly, these edges will be upserted. If the 1000th event states the same A -> B relationship between tables, the system only needs to check whether that relationship already exists. The database rows do not grow infinitely. However, if we store events directly (which is what I understood from your email - please correct me if I'm wrong), this database table will grow constantly and may require regular compaction/de-duplication, especially if there are many frequent jobs running.
> * Given that OL requests are not in the processing phase of a query, taking a little extra time here is not an extreme concern.
> * We lose column-level lineage information with this approach, e.g. answering which datasets computed column X within table A.
> * Array datatypes aren't ubiquitous across RDBMSs, which could potentially break our database-level interfacing via JDBC. There might be a way around this, but the question is: at what cost are we willing to make this happen?
>
> 2) This is a fair suggestion - but I may have a suggestion to make this even simpler. First, pass the data through to the external OL server, and then save it to the Polaris dataset.
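The upsert behavior described in (1) can be illustrated with a small sketch (illustrative Python, not Polaris code; names are made up): repeated events asserting the same table-to-table relationship leave the edge store bounded by the number of distinct relationships, while an append-only raw-event log grows with every event.

```python
from itertools import product

def ingest_edges(edge_store: dict, inputs: list, outputs: list, run_id: str) -> None:
    """Upsert the N x M input/output edges from one OL event.

    The key is the (input, output) pair, so re-stating an existing
    relationship only refreshes its metadata instead of adding a row.
    """
    for src, dst in product(inputs, outputs):
        edge_store[(src, dst)] = {"last_seen_run": run_id}

edges: dict = {}
event_log: list = []  # the store-raw-events alternative, for comparison

# 1000 runs of the same job, each declaring tables A -> B.
for i in range(1000):
    ingest_edges(edges, ["A"], ["B"], run_id=f"run-{i}")
    event_log.append({"inputs": ["A"], "outputs": ["B"], "run": f"run-{i}"})

print(len(edges))      # 1: bounded by distinct relationships
print(len(event_log))  # 1000: grows with every event
```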
> This way, Polaris never has "phantom" relations that aren't reflected in the external OL server. Alternatively, we could turn off Polaris' lineage database whenever an external OL server is configured, and punt on this issue unless someone requires it. WDYT?
>
> -Adnan
>
> On Wed, Apr 22, 2026 at 9:23 AM ITing Lee <[email protected]> wrote:
>
> > Hi Adnan Hemani,
> >
> > I have two questions about this proposal after careful thought.
> >
> > 1. For the Persistence Layer Data Model design, would it be better to have a single table with raw OL events and the necessary columns as indexes for landing purposes, to avoid the N x M write amplification at ingestion time? More specifically, would having the following 4 columns in an "ol_event" table be sufficient?
> >
> > - `input_ids: ARRAY[str]`
> > - `output_ids: ARRAY[str]`
> > - `column_fields: ARRAY[str]`
> > - `raw_event: JSON`
> >
> > From my perspective, scaling writes is more difficult than scaling reads. The N x M edge permutation on the write path seems too heavy for ingestion; it is more like a transformation for read-time optimization.
> >
> > The single-table approach with the ARRAY-type columns as indexes makes the raw_event column queryable for our public interface, and writing each row should be fast and much simpler at ingestion time. We could then compute the edges at read time. WDYT?
> >
> > 2. In the "Processing Pipeline" section, I would like to double-check the independence of the downstream backend, as we're using a dual-write between two services in our design. Perhaps we could have a `state` field for audit-log purposes that is set to `pending` in the same transaction that writes the "ol_event" table, then set to `done` after success.
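The `state`-field handshake described in question 2 could be sketched as follows (illustrative Python with an in-memory table and a pluggable forwarder; a real implementation would use the Polaris persistence layer and an HTTP client for the OL backend):

```python
import uuid

PENDING, DONE = "pending", "done"

def ingest_event(ol_table: dict, forward, raw_event: dict) -> str:
    """Dual-write with an audit `state` field.

    1. Persist the event with state=pending (one transaction).
    2. Forward it to the external OL backend.
    3. Mark it done only after the forward succeeds.
    """
    event_id = str(uuid.uuid4())
    ol_table[event_id] = {"raw_event": raw_event, "state": PENDING}
    forward(raw_event)                 # may raise; the row stays pending
    ol_table[event_id]["state"] = DONE
    return event_id

def retry_pending(ol_table: dict, forward) -> int:
    """Recovery path: re-forward anything left pending after a crash."""
    retried = 0
    for row in ol_table.values():
        if row["state"] == PENDING:
            forward(row["raw_event"])
            row["state"] = DONE
            retried += 1
    return retried

# Demo: the first forward fails before reaching the OL backend.
table: dict = {}
delivered: list = []
calls = {"n": 0}

def flaky_forward(event: dict) -> None:
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("OL backend unreachable")
    delivered.append(event)

try:
    ingest_event(table, flaky_forward, {"job": "daily_etl"})
except ConnectionError:
    pass  # the row is persisted with state=pending

retry_pending(table, flaky_forward)  # reconciles Polaris and the OL backend
```

Note that this makes the forward at-least-once, not exactly-once, so the OL backend should tolerate duplicate events (e.g. via an event ID).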
> > So if the server is force-killed between the first transaction and the hand-off to the OL backend, the user (or the client) could retry and make the state consistent between Polaris and the OL backend. Thanks!
> >
> > Best regards, ITing
> >
> > On Wed, Apr 22, 2026 at 1:13 AM, Adnan Hemani via dev <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > Great thoughts!
> > >
> > > I-Ting, you're exactly right. I will share the proposal to your email on Google Drive (it still needs a bit of work, but it's about 90% complete right now). I'd love to get your feedback and work on the doc!
> > >
> > > Yufei, I 100% agree with your feedback as well. The proposal's current structure highlights a few advantages:
> > > 1. Centralized auth for the OL server: Marquez, the reference OL implementation, does not have an auth system. Polaris can provide significant value here immediately.
> > > 2. OL, in general, has many concepts, like Jobs/Runs/etc. I think it is likely that some customers do not need that level of lineage tracking and would rather not deal with the additional maintenance of yet another web server. Polaris can provide a "lite" version of data lineage OOTB so that users don't need to install a full-functionality OL server if they don't require it. But Polaris should be able to support a pass-through model to a full-functionality OL server in case users still want to use Polaris as the "single pane of glass".
> > > 3. Although the proposal does not explicitly state catalog-enhanced data enrichment for the OL data that Polaris could send downstream (to keep the scope of the initial changes focused), this proposal would help us unlock this functionality. This is something I discussed with Michael Collado earlier :)
> > >
> > > As next steps, I will work with I-Ting to introduce this proposal - perhaps later this week!
> > >
> > > Best,
> > > Adnan Hemani
> > >
> > > On Tue, Apr 21, 2026 at 9:45 AM Yufei Gu <[email protected]> wrote:
> > >
> > > > Agreed: engines provide more accurate lineage information.
> > > >
> > > > I think there are two viable architectures here:
> > > >
> > > > - Engines send OpenLineage events directly to a lineage backend (e.g. Marquez)
> > > > - Engines send events to Polaris, which acts as an ingestion layer (and optionally forwards downstream)
> > > >
> > > > The first option is the standard and simplest model. It avoids extra hops and works well when lineage collection is the only goal. The second option becomes interesting only if Polaris adds clear control-plane value, for example:
> > > >
> > > > - central auth and policy for lineage ingestion
> > > > - catalog-aware enrichment (stable table identity, namespace, snapshots, federation context)
> > > > - decoupling engines from the downstream lineage backend
> > > >
> > > > If Polaris is only acting as a pass-through, the extra layer likely does not justify the added complexity. However, if we position Polaris as a catalog-aware lineage gateway, the second model aligns well with the "single pane of glass" direction. I think this is a direction the Polaris community can pursue. My inclination is to focus Polaris on ingestion, normalization, and forwarding, and to avoid expanding into a full lineage storage or analytics system.
> > > >
> > > > Curious to hear others' thoughts.
> > > >
> > > > Yufei
> > > >
> > > > On Tue, Apr 21, 2026 at 7:52 AM ITing Lee <[email protected]> wrote:
> > > >
> > > > > Hi Adnan Hemani,
> > > > >
> > > > > Yes, I agree with your point of view. After thinking it through, my POC might fail to capture complete data lineage.
> > > > >
> > > > > Additionally, following your direction, we can decouple Spark from OpenLineage and make Polaris the only interface that Spark interacts with.
> > > > >
> > > > > It is my pleasure to collaborate on this project, and I would like to make sure we're aligned before proceeding - please correct me if I'm wrong.
> > > > >
> > > > > Does this mean that we will implement the OL REST API in Polaris?
> > > > >
> > > > > For example, say there's an OpenLineage server running on port 5000 and a Polaris server running on port 8000. What the user needs to do is: first, install `openlineage-spark-*.jar` [1], which stays as-is for the user; second, set the host to port 8000.
> > > > >
> > > > > If my understanding is correct, would you share the proposal you have drafted, so that we don't step on each other's toes?
> > > > >
> > > > > Thanks!
> > > > >
> > > > > [1] https://openlineage.io/docs/integrations/spark/developing/built_in_lineage/
> > > > >
> > > > > Best regards,
> > > > >
> > > > > ITing
> > > > >
> > > > > On Tue, Apr 21, 2026 at 10:49 AM, Adnan Hemani via dev <[email protected]> wrote:
> > > > >
> > > > > > Hi ITing,
> > > > > >
> > > > > > I'm glad you are looking into this - I was also informally/verbally floating a proposal with some community members during Iceberg Summit earlier this month regarding OpenLineage support. Specifically, the proposal was to open Polaris as an OpenLineage (OL) server implementation (with exposed OL APIs) that could either retain OL information internally (as a "lite" version offering through the Polaris Persistence layer) or reflect it downstream to an OL server like Marquez for a richer feature set (up to admin preference).
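The client-side setup sketched in this exchange (keep the stock OpenLineage Spark connector, repoint its endpoint at Polaris) might look roughly like the following. The listener class and `spark.openlineage.transport.*` keys come from the standard OpenLineage Spark integration; the Polaris host, port, namespace, and package version are placeholders, and the exact endpoint path Polaris would expose is an open design question:

```shell
# Hypothetical: point the OpenLineage Spark connector at a Polaris server
# on port 8000 instead of a standalone OL server on port 5000.
spark-submit \
  --packages io.openlineage:openlineage-spark_2.12:1.x.x \
  --conf spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener \
  --conf spark.openlineage.transport.type=http \
  --conf spark.openlineage.transport.url=http://polaris-host:8000 \
  --conf spark.openlineage.namespace=my_namespace \
  my_job.py
```

Nothing here changes on the engine side relative to a normal OL deployment, which is the appeal of the pass-through model.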
> > > > > > All compute engines would then call Polaris' OpenLineage APIs (through their respective OL connectors) to keep Polaris as a single pane of glass for metadata. I was working on that proposal (albeit slowly :') ), but I'm happy to merge proposals as co-authors if you are open to it!
> > > > > >
> > > > > > I read through the proposal and skimmed through the POC code; I don't think this is a correct approach to solve this problem. The base assumption your code makes is that all previously loaded tables within a Spark application contributed to the creation of a new table. This is a very simplistic heuristic that easily breaks in simple cases, such as when many tables are read within a single Spark application but only a handful of them are used to generate a new table. You've correctly pointed out the flaw in Question 10.2: Polaris cannot determine what populated a table based on IRC calls alone. This is why I aimed to propose the idea above - have the engines send this information directly to Polaris. This mirrors how other OL servers rely on the engines' OL connectors to create this information, as the engine is the sole source of truth regarding the user's actual intent. Let me know your thoughts.
> > > > > >
> > > > > > Best,
> > > > > > Adnan Hemani
> > > > > >
> > > > > > On Sun, Apr 19, 2026 at 9:56 PM Jean-Baptiste Onofré <[email protected]> wrote:
> > > > > >
> > > > > > > Hi ITing,
> > > > > > >
> > > > > > > Thanks for your message!
> > > > > > >
> > > > > > > For context, we discussed OpenLineage support a while ago, specifically regarding the initial support of OpenLineage events.
> > > > > > >
> > > > > > > I will take a look at your proposal and design document, and I am sure Mike will as well.
> > > > > > >
> > > > > > > Regards,
> > > > > > > JB
> > > > > > >
> > > > > > > On Mon, Apr 20, 2026 at 3:08 AM ITing Lee <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > I've been working on OpenLineage support in Polaris and wanted to ask for feedback on the current implementation approach and where it should go next.
> > > > > > > >
> > > > > > > > The current direction is to emit OpenLineage events from Polaris-managed operations and publish them through an event-listener-based design. Before moving further, I'd like to get suggestions on whether this implementation model makes sense and what should be adjusted early.
> > > > > > > >
> > > > > > > > Below are the links to the proposal and PoC branch.
> > > > > > > >
> > > > > > > > - Proposal: https://docs.google.com/document/d/1NEPzcMIcbxKEvBIWd6SJGllRIu4brjV-23xO5Qp3DlM/edit?tab=t.0
> > > > > > > > - PoC branch: https://github.com/iting0321/polaris/tree/data-lineage
> > > > > > > >
> > > > > > > > I'd appreciate feedback on the implementation direction and any suggestions before continuing further.
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > ITing
