Hi,

One clarification: I would not describe my preferred M0 as
passthrough/proxy, but rather as giving clients enough information
("config vending") to talk to an OL service directly.
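
To make "config vending" concrete, here is a minimal sketch, not a proposal for the actual API. The class name, function name, and every field name (including `credential_ref` and `apiKeyRef`) are hypothetical and only loosely modeled on the settings listed earlier in this thread (transport configuration, credential references, realm/tenant context). Polaris hands out connection details and the client emits to the OL backend itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VendedLineageConfig:
    """Hypothetical payload Polaris could vend to an OL client."""
    url: str                               # base URL of the OL backend
    endpoint: str = "api/v1/lineage"       # assumed default path
    credential_ref: Optional[str] = None   # reference to a secret, not the secret
    namespace: Optional[str] = None        # e.g. a realm/tenant naming prefix


def to_client_transport(cfg: VendedLineageConfig) -> dict:
    """Turn the vended config into a transport-style mapping for a client."""
    if not cfg.url.startswith("https://"):
        raise ValueError("expected an https URL for the OL backend")
    transport = {"type": "http", "url": cfg.url, "endpoint": cfg.endpoint}
    if cfg.credential_ref is not None:
        # Hypothetical key: the client resolves the reference on its side.
        transport["auth"] = {"type": "api_key", "apiKeyRef": cfg.credential_ref}
    return {"transport": transport, "namespace": cfg.namespace}
```

In this shape Polaris never sees an event payload; it only answers "where and how should I send lineage?".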

Passthrough/proxy was the fallback if Polaris really needs to be on the
event path. I might have missed it, but I do not think the proposal or this
thread establishes that requirement.

Before we start with lineage-persistence, could we first clarify two things?

1. Which requirements make Polaris need to be on the event path?
2. Is local lineage persistence still intended for the first implementation?

Robert


On Thu, May 28, 2026 at 10:24 PM Adnan Hemani via dev <
[email protected]> wrote:

> Hi all,
>
> Thanks for your review, Dmitri. I think we're discussing most of your
> comments on the document itself, so I'll keep the majority of this response
> focused on Robert's feedback.
>
> *Security Model:*
>
> > What exactly does LINEAGE_INGEST authorize: all inputs, all outputs, only
> > outputs, or parent namespaces for CTAS-style outputs?
>
> We should authorize all the above :)
>
> > What about external datasets that are not Polaris securables?
>
> We cannot authorize these. If the user has the LINEAGE_INGEST authorization
> for all other datasets in the request, we should allow it.
>
> > Are OL events trusted metadata from privileged engine principals, or
> > should Polaris validate submitted read/write claims?
>
> No, they are not "trusted" in any way. The user making the request should
> be authenticated and authorized for all actions.
>
> > For forwarding, Polaris should use configured downstream credentials by
> > default, not forward inbound Polaris bearer tokens or arbitrary request
> > headers, and explicitly define what realm/principal/resolved-entity context
> > is propagated.
>
> Agreed. This is the model advocated by the document.
>
> > For reads, opaque nodes with preserved edges may still leak that hidden
> > datasets exist and are connected to visible ones; that should be an
> > explicit choice.
>
> We should never show lineage information for tables the user isn't
> authorized to see. We can either create a LINEAGE_READ permission or bundle
> this into a permission that already exists in Polaris.
>
> *Runtime Limits and Failure Behavior:*
>
> I agree that we need to define this - but are there any concerns from the
> proposal itself that you'd like answered at this stage specifically? I
> believe these are more implementation details rather than items that belong
> in a proposal document.
>
> *User Value of Local Store:*
>
> > Which queries is Polaris-local storage supposed to answer, and which are
> > left to Marquez/DataHub/OpenMetadata/etc.?
>
> My vision is that Polaris should mainly handle smaller deployments or use
> cases out-of-the-box with the local store. This is mainly for users who
> lack a use case substantial enough to require an external service or vendor
> for data lineage, but are interested in introducing data lineage to their
> organization. If their use case grows, the users should transition to an
> external OL server using "mixed-mode" (Polaris local store + external OL
> server). This approach allows for a graceful migration, with the eventual
> goal of stopping the use of the Polaris local store. This is why the "mixed
> mode" is not P0, but the "pass-through" mode and the "local store" modes
> are - there should be users who can benefit from these use cases today.
>
> Based on my understanding of the remainder of the email, I think you align
> with the proxy/"passthrough-only" mode. I'm ok to start implementation
> there for now, while we iron out the remaining topics you may have concerns
> about for the "local store" mode. Could you please ask those questions
> point-by-point (either in this mail thread or on the document) so that I
> can answer them? In the meantime, I will work on defining the
> LineagePersistence contract.
>
> Best,
> Adnan Hemani
>
>
> On Tue, May 26, 2026 at 5:10 AM Robert Stupp <[email protected]> wrote:
>
> > Hi Adnan,
> >
> > I am generally supportive of Polaris integrating with OpenLineage,
> > especially around configuration, identity, and dataset resolution.
> > If Polaris must be inline for auth/enrichment, the gateway / resolver /
> > forwarder shape seems like the most natural fit to me.
> >
> > That said, I wonder whether the simpler M0 is to keep Polaris out of the
> > OL event data path unless inline authorization/enrichment is a hard
> > requirement.
> > OL clients can already emit directly to Marquez/DataHub/OpenMetadata/etc.
> > Polaris could still add value by exposing the OL transport configuration,
> > auth/credential references, realm/tenant context, TLS settings, and
> > naming conventions.
> > That would avoid proxying large OL payloads through Polaris and would
> > keep gateway, local storage, custom query APIs, and downstream query
> > normalization from being bundled into one milestone.
> >
> > For the Polaris-local storage and Polaris-specific query API parts, I
> > think a few things still need clarification before PRs assume that scope.
> >
> > First, could we make the security model more explicit?
> > What exactly does LINEAGE_INGEST authorize: all inputs, all outputs, only
> > outputs, or parent namespaces for CTAS-style outputs?
> > What about external datasets that are not Polaris securables?
> > Are OL events trusted metadata from privileged engine principals, or
> > should Polaris validate submitted read/write claims?
> > For forwarding, Polaris should use configured downstream credentials by
> > default, not forward inbound Polaris bearer tokens or arbitrary request
> > headers, and explicitly define what realm/principal/resolved-entity
> > context is propagated.
> > For reads, opaque nodes with preserved edges may still leak that hidden
> > datasets exist and are connected to visible ones; that should be an
> > explicit choice.
> >
> > Second, could we define runtime limits and failure behavior?
> > OL payloads are effectively unbounded by the spec and can get large
> > through schemas, SQL, column lineage, data quality facets, debug facets,
> > or custom facets.
> > Even if Polaris stores only a projection, it still has to receive, parse,
> > and maybe forward the full event.
> > The proposal should define request/event/batch limits, facet and
> > column-lineage limits, timeouts, backpressure/concurrency behavior,
> > oversized-payload responses, and logging rules.
> >
> > Third, could we clarify the user value of the reduced local store?
> > The proposed local representation drops job/run history, run state, most
> > facets, and process attribution.
> > That may still be useful as a small dependency index for impact analysis
> > or future catalog policy checks, but it is not what users usually get
> > from querying an OL backend.
> > Which queries is Polaris-local storage supposed to answer, and which are
> > left to Marquez/DataHub/OpenMetadata/etc.?
> >
> > Given those questions, my preferred scoping would be:
> > if inline Polaris auth/enrichment is not required, start with
> > configuration/discovery and let OL clients emit directly to the OL
> > backend.
> > If Polaris does need to be in the event path, scope M0 to proxy/gateway
> > mode only.
> > I would keep local storage and Polaris-specific lineage query APIs out of
> > M0 until the product semantics, security model, runtime limits, and
> > persistence model are clearer.
> >
> > If local storage stays in scope, I think the proposal should define a
> > backend-agnostic LineagePersistence contract first, instead of making
> > relational tables the logical model.
> > JDBC-only as the first implementation is fine, but the SPI should not
> > bake in assumptions that make the NoSQL backend awkward later.
> >
> > There are also a few core semantics that I think still need to be spelled
> > out: collapsed edges, staleness/removal, dataset identity and Polaris
> > entity resolution, query node identity, supported OL event types, batch
> > behavior, and forwarding failure modes.
> > In particular, a 200 from Polaris should have a clear meaning in each
> > mode: local-only, fail-closed forwarding, fail-open forwarding, and
> > passthrough-only.
> >
> > So I am not against the OpenLineage integration direction.
> > I mainly want to avoid treating "Polaris as an OL gateway" and "Polaris
> > as a new lineage storage/query system" as the same milestone.
> > The gateway/configuration pieces feel like a natural fit; the
> > storage/query pieces still need more design work before I would be
> > comfortable calling that part consensus.
> >
> > Robert
> >
> > On Mon, May 25, 2026 at 6:57 PM Dmitri Bourlatchkov <[email protected]>
> > wrote:
> >
> > > Hi Adnan,
> > >
> > > Thanks for the update and apologies again for the late review.
> > >
> > > I posted some comments in the docs; all of them are non-blocking. We
> > > can address them in the doc or PRs, if you prefer.
> > >
> > > However, for the sake of clarity, I'd like to better understand the
> > > idea behind collapsing OL edges.
> > >
> > > What is the use case for the resulting data?
> > >
> > > What kind of problems can it help to solve?
> > >
> > > How is "staleness pruning" supposed to work (mentioned in section "Edge
> > > Semantics")?
> > >
> > > Thanks,
> > > Dmitri.
> > >
> > > On Fri, May 22, 2026 at 2:50 PM Adnan Hemani via dev <
> > > [email protected]>
> > > wrote:
> > >
> > > > Hi Dmitri,
> > > >
> > > > My understanding is that we discuss threads in the community before
> > > > implementation to ensure alignment on the proposal's direction before
> > > > community members put time into the implementation. I'm fully aligned
> > > > that the implementation details are always up for discussion (both
> > > > before or after a proposal or even before/after a PR :) I just don't
> > > > want us to proceed with putting further time/effort in if the
> > > > community is not aligned on introducing these endpoints in general. I
> > > > am claiming "approv[al] by lazy consensus" for this direction because
> > > > the doc has been circulating in the ML for quite long without any
> > > > objections.
> > > >
> > > > The old proposal has a different direction altogether and does not
> > > > align with where we now want to go with this proposal. I wanted to
> > > > ensure there was no confusion between the old (now-abandoned) proposal
> > > > and the new one.
> > > > I-Ting is joining on as a co-author on the new proposal.
> > > >
> > > > We should definitely discuss this at the next Community Sync!
> > > >
> > > > Best,
> > > > Adnan Hemani
> > > >
> > > > On Fri, May 22, 2026 at 8:53 AM Dmitri Bourlatchkov <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi Adnan,
> > > > >
> > > > > Apologies for the lack of feedback (too many concurrent
> > > > > activities).
> > > > >
> > > > > However, this is a discussion thread, not a vote on a concrete
> > > > > implementation with a timeline. Lack of comment does not mean that
> > > > > people approve the design.
> > > > >
> > > > > If you prefer to proceed to concrete PRs, that would be fine from
> > > > > my POV, but please do not assume that PRs will not be challenged on
> > > > > aspects that were not already discussed.
> > > > >
> > > > > I believe it would be preferable to add this to the next Community
> > > > > Sync agenda to invigorate the discussion.
> > > > >
> > > > > Also, I'm not sure what the status is of this document / email
> > > > > thread with respect to I-Ting's old proposal [1]. Why not continue
> > > > > the discussion on that thread? WDYT?
> > > > >
> > > > > [1] https://lists.apache.org/thread/qqpq5hl1xrq8mwnd7kn4vgt8x9mqtvmg
> > > > >
> > > > > Cheers,
> > > > > Dmitri.
> > > > >
> > > > > On Fri, May 22, 2026 at 1:07 AM Adnan Hemani via dev <
> > > > > [email protected]> wrote:
> > > > >
> > > > >> Hi folks,
> > > > >>
> > > > > >> Since there hasn't been much traffic on this document or email
> > > > > >> threads, I will consider the document approved by lazy consensus
> > > > > >> if there are no further blocking comments by Friday, May 28th.
> > > > >>
> > > > >> Thanks!
> > > > >>
> > > > >> Best,
> > > > >> Adnan Hemani
> > > > >>
> > > > > >> On Thu, May 14, 2026 at 1:49 PM Yufei Gu <[email protected]>
> > > > > >> wrote:
> > > > >>
> > > > >> > Hi Adnan,
> > > > >> >
> > > > > >> > Thanks for putting this proposal together and resurfacing it in
> > > > > >> > a dedicated thread.
> > > > >> >
> > > > > >> > I did one round of review on the document already, and I think
> > > > > >> > it would be great if more folks from the community could take a
> > > > > >> > look and provide feedback as well. It would be especially
> > > > > >> > helpful to get perspectives from people working on related
> > > > > >> > areas before implementation starts.
> > > > >> >
> > > > >> > Thanks,
> > > > >> > Yufei
> > > > >> >
> > > > >> >
> > > > >> > On Wed, May 13, 2026 at 10:01 PM Adnan Hemani via dev <
> > > > >> > [email protected]> wrote:
> > > > >> >
> > > > >> >> Hi all,
> > > > >> >>
> > > > > >> >> I wanted to ensure that the OpenLineage proposal I previously
> > > > > >> >> posted in a different thread [1] was actually being found,
> > > > > >> >> given that it was deep into the thread. I request the
> > > > > >> >> community to review this proposal so we can potentially start
> > > > > >> >> implementation.
> > > > >> >>
> > > > >> >> Proposal:
> > > > >> >>
> > > > >> >>
> > > > >>
> > > >
> > >
> >
> https://docs.google.com/document/d/1iOzIuFW66SFL2wZOADD9knMTG21OwY7VmaWVSvMUqQk/edit?tab=t.0#heading=h.59bmbnsf0gp1
> > > > >> >>
> > > > >> >> Best,
> > > > >> >> Adnan Hemani
> > > > >> >>
> > > > >> >> [1]
> > > https://lists.apache.org/thread/1fd6hrvx0v0s5wm6gh74cdo3yn4w1zhx
> > > > >> >>
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> >
>
