Re: OpenLineage Proposal - Follow-Up

Adnan Hemani via dev Mon, 01 Jun 2026 17:28:44 -0700

Hi Robert,

The proposal clearly explains why Polaris must be on the event path, and
the reasoning is a simple inference: if Polaris is to be the
authentication/authorization layer for an OL server, how can it not be on
the event path? Every alternative requires changing the OL routing path,
the OL server, or both so that they respect Polaris's authN/authZ
decisions, and neither change is recommended. Other applications that have
integrated with OpenLineage servers in a similar fashion have also chosen
to sit on the event path.
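
To make the argument concrete, here is a minimal sketch of what "on the
event path" could look like. It is only an illustration, not the proposal's
design: `Catalog`, `Dataset`, `Event`, `authorize_ingest`, and
`handle_event` are hypothetical names, and the only term taken from this
thread is the LINEAGE_INGEST privilege. The sketch assumes every
Polaris-managed input and output must carry that privilege, skips datasets
Polaris does not manage, and leaves out parent-namespace checks for
CTAS-style outputs.

```python
from __future__ import annotations

from dataclasses import dataclass, field

LINEAGE_INGEST = "LINEAGE_INGEST"  # privilege name used in this thread


@dataclass(frozen=True)
class Dataset:
    namespace: str
    name: str


@dataclass
class Event:
    # Hypothetical, heavily reduced stand-in for an OL event.
    inputs: list[Dataset] = field(default_factory=list)
    outputs: list[Dataset] = field(default_factory=list)


class Catalog:
    """Hypothetical stand-in for Polaris entity resolution and grants."""

    def __init__(self, securables, grants):
        self._securables = set(securables)  # datasets Polaris manages
        self._grants = grants  # principal -> set of (dataset, privilege)

    def is_securable(self, ds):
        return ds in self._securables

    def has_privilege(self, principal, ds, privilege):
        return (ds, privilege) in self._grants.get(principal, set())


def authorize_ingest(catalog, principal, event):
    """Return the datasets the principal may not report lineage for."""
    denied = []
    for ds in [*event.inputs, *event.outputs]:
        if not catalog.is_securable(ds):
            continue  # external dataset: nothing for Polaris to check
        if not catalog.has_privilege(principal, ds, LINEAGE_INGEST):
            denied.append(ds)
    return denied


def handle_event(catalog, principal, event, forward):
    """Gate the event on authorization, then hand it to the forwarder."""
    denied = authorize_ingest(catalog, principal, event)
    if denied:
        return 403, denied
    forward(event)  # the forwarder would use its own downstream credentials
    return 200, []
```

In this shape an event reaches the OL server only after every
Polaris-managed dataset passes the check. A client emitting directly to the
OL server would skip that check unless the server itself were changed to
consult Polaris.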


Local persistence is still intended for immediate implementation. Maybe we
can ship M0 without it, but we should not block work on it unless there
are blocking concerns. As of this reply, I do not see any: I see the
LineagePersistence contract as an implementation negotiation rather than a
blocking concern for the overall architecture. I'm working through the
LineagePersistence contract and will update this thread about it.
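
As a starting point for that negotiation, a backend-agnostic contract might
look something like the sketch below. Everything in it is an assumption for
discussion: the method names (`record_edges`, `upstream`, `downstream`) and
the in-memory implementation are hypothetical and do not come from the
proposal document. The point is only that the contract talks about datasets
and edges rather than tables and rows.

```python
from __future__ import annotations

from typing import Iterable, Protocol


class LineagePersistence(Protocol):
    """Hypothetical contract expressed in dataset ids and edges only."""

    def record_edges(self, edges: Iterable[tuple[str, str]]) -> None: ...

    def upstream(self, dataset: str) -> set[str]: ...

    def downstream(self, dataset: str) -> set[str]: ...


class InMemoryLineagePersistence:
    """Toy implementation; a JDBC or NoSQL backend would satisfy the same contract."""

    def __init__(self) -> None:
        self._down: dict[str, set[str]] = {}
        self._up: dict[str, set[str]] = {}

    def record_edges(self, edges: Iterable[tuple[str, str]]) -> None:
        for src, dst in edges:
            # Sets store a repeated (src, dst) pair only once.
            self._down.setdefault(src, set()).add(dst)
            self._up.setdefault(dst, set()).add(src)

    def upstream(self, dataset: str) -> set[str]:
        return set(self._up.get(dataset, ()))

    def downstream(self, dataset: str) -> set[str]:
        return set(self._down.get(dataset, ()))
```

Callers would depend only on the three methods, so nothing in the calling
code assumes a relational layout.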

Best,
Adnan Hemani

On Mon, Jun 1, 2026 at 1:43 AM Robert Stupp <[email protected]> wrote:

> Hi,
>
> One clarification: I would not describe my preferred M0 as
> passthrough/proxy, but to give clients enough information ("config
> vending") to talk to an OL service.
>
> Passthrough/proxy was the fallback if Polaris really needs to be on the
> event path. I might have missed it, but I do not think the proposal or this
> thread establish that requirement.
>
> Before we start with lineage-persistence, could we first clarify two
> things?
>
> 1. Which requirements make Polaris need to be on the event path?
> 2. Is local lineage persistence still intended for the first
> implementation?
>
> Robert
>
>
> On Thu, May 28, 2026 at 10:24 PM Adnan Hemani via dev <
> [email protected]> wrote:
>
> > Hi all,
> >
> > Thanks for your review, Dmitri. I think we're discussing most of your
> > comments on the document itself, so I'll keep the majority of this
> > response focused on Robert's feedback.
> >
> > *Security Model:*
> >
> > > What exactly does LINEAGE_INGEST authorize: all inputs, all outputs,
> > > only outputs, or parent namespaces for CTAS-style outputs?
> >
> > We should authorize all the above :)
> >
> > > What about external datasets that are not Polaris securables?
> >
> > We cannot authorize these. If the user has the LINEAGE_INGEST
> > authorization for all other datasets in the request, we should allow it.
> >
> > > Are OL events trusted metadata from privileged engine principals, or
> > > should Polaris validate submitted read/write claims?
> >
> > No, they are not "trusted" in any way. The user making the request should
> > be authenticated and authorized for all actions.
> >
> > > For forwarding, Polaris should use configured downstream credentials by
> > > default, not forward inbound Polaris bearer tokens or arbitrary request
> > > headers, and explicitly define what realm/principal/resolved-entity
> > > context is propagated.
> >
> > Agreed. This is the model advocated by the document.
> >
> > > For reads, opaque nodes with preserved edges may still leak that hidden
> > > datasets exist and are connected to visible ones; that should be an
> > > explicit choice.
> >
> > We should never show lineage information for tables the user isn't
> > authorized to see. We can either create a LINEAGE_READ permission or
> > bundle this into a permission that already exists in Polaris.
> >
> > *Runtime Limits and Failure Behavior:*
> >
> > I agree that we need to define this - but are there any concerns from the
> > proposal itself that you'd like answered at this stage specifically? I
> > believe these are more implementation details rather than items that
> > belong in a proposal document.
> >
> > *User Value of Local Store:*
> >
> > > Which queries is Polaris-local storage supposed to answer, and which
> > > are left to Marquez/DataHub/OpenMetadata/etc.?
> >
> > My vision is that Polaris should mainly handle smaller deployments or use
> > cases out-of-the-box with the local store. This is mainly for users who
> > lack a use case substantial enough to require an external service or
> > vendor for data lineage, but are interested in introducing data lineage
> > to their organization. If their use case grows, the users should
> > transition to an external OL server using "mixed-mode" (Polaris local
> > store + external OL server). This approach allows for a graceful
> > migration, with the eventual goal of stopping the use of the Polaris
> > local store. This is why the "mixed mode" is not P0, but the
> > "pass-through" mode and the "local store" modes are - there should be
> > users who can benefit from these use cases today.
> >
> > Based on my understanding of the remainder of the email, I think you
> > align with the proxy/"passthrough-only" mode. I'm ok to start
> > implementation there for now, while we iron out the remaining topics you
> > may have concerns about for the "local store" mode. Can you please ask
> > those questions point-by-point (either in this mail thread or on the
> > document) so I can answer them? In the meantime, I will work on defining
> > the LineagePersistence contract.
> >
> > Best,
> > Adnan Hemani
> >
> >
> > On Tue, May 26, 2026 at 5:10 AM Robert Stupp <[email protected]> wrote:
> >
> > > Hi Adnan,
> > >
> > > I am generally supportive of Polaris integrating with OpenLineage,
> > > especially around configuration, identity, and dataset resolution.
> > > If Polaris must be inline for auth/enrichment, the gateway / resolver /
> > > forwarder shape seems like the most natural fit to me.
> > >
> > > That said, I wonder whether the simpler M0 is to keep Polaris out of
> > > the OL event data path unless inline authorization/enrichment is a hard
> > > requirement.
> > > OL clients can already emit directly to
> > > Marquez/DataHub/OpenMetadata/etc.
> > > Polaris could still add value by exposing the OL transport
> > > configuration, auth/credential references, realm/tenant context, TLS
> > > settings, and naming conventions.
> > > That would avoid proxying large OL payloads through Polaris and would
> > > keep gateway, local storage, custom query APIs, and downstream query
> > > normalization from being bundled into one milestone.
> > >
> > > For the Polaris-local storage and Polaris-specific query API parts, I
> > > think a few things still need clarification before PRs assume that
> > > scope.
> > >
> > > First, could we make the security model more explicit?
> > > What exactly does LINEAGE_INGEST authorize: all inputs, all outputs,
> > > only outputs, or parent namespaces for CTAS-style outputs?
> > > What about external datasets that are not Polaris securables?
> > > Are OL events trusted metadata from privileged engine principals, or
> > > should Polaris validate submitted read/write claims?
> > > For forwarding, Polaris should use configured downstream credentials by
> > > default, not forward inbound Polaris bearer tokens or arbitrary request
> > > headers, and explicitly define what realm/principal/resolved-entity
> > > context is propagated.
> > > For reads, opaque nodes with preserved edges may still leak that hidden
> > > datasets exist and are connected to visible ones; that should be an
> > > explicit choice.
> > >
> > > Second, could we define runtime limits and failure behavior?
> > > OL payloads are effectively unbounded by the spec and can get large
> > > through schemas, SQL, column lineage, data quality facets, debug
> > > facets, or custom facets.
> > > Even if Polaris stores only a projection, it still has to receive,
> > > parse, and maybe forward the full event.
> > > The proposal should define request/event/batch limits, facet and
> > > column-lineage limits, timeouts, backpressure/concurrency behavior,
> > > oversized-payload responses, and logging rules.
> > >
> > > Third, could we clarify the user value of the reduced local store?
> > > The proposed local representation drops job/run history, run state,
> > > most facets, and process attribution.
> > > That may still be useful as a small dependency index for impact
> > > analysis or future catalog policy checks, but it is not what users
> > > usually get from querying an OL backend.
> > > Which queries is Polaris-local storage supposed to answer, and which
> > > are left to Marquez/DataHub/OpenMetadata/etc.?
> > >
> > > Given those questions, my preferred scoping would be:
> > > if inline Polaris auth/enrichment is not required, start with
> > > configuration/discovery and let OL clients emit directly to the OL
> > > backend.
> > > If Polaris does need to be in the event path, scope M0 to proxy/gateway
> > > mode only.
> > > I would keep local storage and Polaris-specific lineage query APIs out
> > > of M0 until the product semantics, security model, runtime limits, and
> > > persistence model are clearer.
> > >
> > > If local storage stays in scope, I think the proposal should define a
> > > backend-agnostic LineagePersistence contract first, instead of making
> > > relational tables the logical model.
> > > JDBC-only as the first implementation is fine, but the SPI should not
> > > bake in assumptions that make the NoSQL backend awkward later.
> > >
> > > There are also a few core semantics that I think still need to be
> > > spelled out: collapsed edges, staleness/removal, dataset identity and
> > > Polaris entity resolution, query node identity, supported OL event
> > > types, batch behavior, and forwarding failure modes.
> > > In particular, a 200 from Polaris should have a clear meaning in each
> > > mode: local-only, fail-closed forwarding, fail-open forwarding, and
> > > passthrough-only.
> > >
> > > So I am not against the OpenLineage integration direction.
> > > I mainly want to avoid treating "Polaris as an OL gateway" and "Polaris
> > > as a new lineage storage/query system" as the same milestone.
> > > The gateway/configuration pieces feel like a natural fit; the
> > > storage/query pieces still need more design work before I would be
> > > comfortable calling that part consensus.
> > >
> > > Robert
> > >
> > > On Mon, May 25, 2026 at 6:57 PM Dmitri Bourlatchkov <[email protected]>
> > > wrote:
> > >
> > > > Hi Adnan,
> > > >
> > > > Thanks for the update and apologies again for the late review.
> > > >
> > > > I posted some comments in the docs; all of them are non-blocking. We
> > > > can address them in the doc or PRs, if you prefer.
> > > >
> > > > However, for the sake of clarity, I'd like to better understand the
> > > > idea behind collapsing OL edges.
> > > >
> > > > What is the use case for the resulting data?
> > > >
> > > > What kind of problems can it help to solve?
> > > >
> > > > How is "staleness pruning" supposed to work (mentioned in section
> > > > "Edge Semantics")?
> > > >
> > > > Thanks,
> > > > Dmitri.
> > > >
> > > > On Fri, May 22, 2026 at 2:50 PM Adnan Hemani via dev <
> > > > [email protected]>
> > > > wrote:
> > > >
> > > > > Hi Dmitri,
> > > > >
> > > > > My understanding is that we discuss threads in the community before
> > > > > implementation to ensure alignment on the proposal's direction
> > > > > before community members put time into the implementation. I'm
> > > > > fully aligned that the implementation details are always up for
> > > > > discussion (both before or after a proposal or even before/after a
> > > > > PR :) I just don't want us to proceed with putting further
> > > > > time/effort in if the community is not aligned on introducing these
> > > > > endpoints in general. I am claiming "approv[al] by lazy consensus"
> > > > > for this direction because the doc has been circulating in the ML
> > > > > for quite long without any objections.
> > > > >
> > > > > The old proposal has a different direction altogether and does not
> > > > > align with where we now want to go with this proposal. I wanted to
> > > > > ensure there was no confusion between the old (now-abandoned)
> > > > > proposal and the new one. I-Ting is joining on as a co-author on the
> > > > > new proposal.
> > > > >
> > > > > We should definitely discuss this at the next Community Sync!
> > > > >
> > > > > Best,
> > > > > Adnan Hemani
> > > > >
> > > > > On Fri, May 22, 2026 at 8:53 AM Dmitri Bourlatchkov <
> > > > > [email protected]> wrote:
> > > > >
> > > > > > Hi Adnan,
> > > > > >
> > > > > > Apologies for the lack of feedback (too many concurrent
> > > > > > activities).
> > > > > >
> > > > > > However, this is a discussion thread, not a vote on a concrete
> > > > > > implementation with a timeline. Lack of comment does not mean
> > > > > > that people approve the design.
> > > > > >
> > > > > > If you prefer to proceed to concrete PRs, that would be fine from
> > > > > > my POV, but please do not assume that PRs will not be challenged
> > > > > > on aspects that were not already discussed.
> > > > > >
> > > > > > I believe it would be preferable to add this to the next
> > > > > > Community Sync agenda to invigorate the discussion.
> > > > > >
> > > > > > Also, I'm not sure what the status is of this document / email
> > > > > > thread with respect to I-Ting's old proposal [1]. Why not continue
> > > > > > the discussion on that thread? WDYT?
> > > > > >
> > > > > > [1] https://lists.apache.org/thread/qqpq5hl1xrq8mwnd7kn4vgt8x9mqtvmg
> > > > > >
> > > > > > Cheers,
> > > > > > Dmitri.
> > > > > >
> > > > > > On Fri, May 22, 2026 at 1:07 AM Adnan Hemani via dev <
> > > > > > [email protected]> wrote:
> > > > > >
> > > > > >> Hi folks,
> > > > > >>
> > > > > > >> Since there hasn't been much traffic on this document or email
> > > > > > >> threads, I will consider the document approved by lazy
> > > > > > >> consensus if there are no further blocking comments by Friday,
> > > > > > >> May 28th.
> > > > > >>
> > > > > >> Thanks!
> > > > > >>
> > > > > >> Best,
> > > > > >> Adnan Hemani
> > > > > >>
> > > > > > >> On Thu, May 14, 2026 at 1:49 PM Yufei Gu <[email protected]> wrote:
> > > > > >>
> > > > > >> > Hi Adnan,
> > > > > >> >
> > > > > > >> > Thanks for putting this proposal together and resurfacing it
> > > > > > >> > in a dedicated thread.
> > > > > >> >
> > > > > > >> > I did one round of review on the document already, and I think
> > > > > > >> > it would be great if more folks from the community could take
> > > > > > >> > a look and provide feedback as well. It would be especially
> > > > > > >> > helpful to get perspectives from people working on related
> > > > > > >> > areas before implementation starts.
> > > > > >> >
> > > > > >> > Thanks,
> > > > > >> > Yufei
> > > > > >> >
> > > > > >> >
> > > > > >> > On Wed, May 13, 2026 at 10:01 PM Adnan Hemani via dev <
> > > > > >> > [email protected]> wrote:
> > > > > >> >
> > > > > >> >> Hi all,
> > > > > >> >>
> > > > > > >> >> I wanted to ensure that the OpenLineage proposal I previously
> > > > > > >> >> posted in a different thread [1] was actually being found,
> > > > > > >> >> given that it was deep into the thread. I request the
> > > > > > >> >> community to review this proposal so we can potentially start
> > > > > > >> >> implementation.
> > > > > >> >>
> > > > > > >> >> Proposal:
> > > > > > >> >>
> > > > > > >> >> https://docs.google.com/document/d/1iOzIuFW66SFL2wZOADD9knMTG21OwY7VmaWVSvMUqQk/edit?tab=t.0#heading=h.59bmbnsf0gp1
> > > > > >> >>
> > > > > >> >> Best,
> > > > > >> >> Adnan Hemani
> > > > > >> >>
> > > > > >> >> [1]
> > > > https://lists.apache.org/thread/1fd6hrvx0v0s5wm6gh74cdo3yn4w1zhx
> > > > > >> >>
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>
