Thanks Adnan. That makes sense. I agree we should not change the OL endpoint shape in a way that requires client-side changes. My point was not to replace the OpenLineage-defined endpoint, but to make the OpenLineage-specific API surface clear in the surrounding route grouping/docs while preserving the endpoint shape OL clients expect.
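To make the layering concrete, here is a rough sketch in Python rather than Polaris's Java, purely for illustration. None of these names come from PR #4667 or from any OpenLineage client library: `LineageProvider`, `InMemoryProvider`, `handle_request`, the `"openlineage"` format tag, and the status codes are all hypothetical. The only thing taken from this thread is that OL clients post to a `/lineage` endpoint. It shows an OL-aware ingress passing the raw event, unchanged, to a Polaris-owned boundary with a small default implementation behind it.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol

# Assumption: per this thread, OL clients target a path ending in "/lineage".
OL_INGEST_SUFFIX = "/lineage"


class LineageProvider(Protocol):
    """Hypothetical Polaris-owned capability boundary behind the ingress."""

    def ingest(self, source_format: str, event: dict[str, Any]) -> None: ...


@dataclass
class InMemoryProvider:
    """Hypothetical stand-in for a minimal default/OOTB path: it only records events."""

    events: list[tuple[str, dict[str, Any]]] = field(default_factory=list)

    def ingest(self, source_format: str, event: dict[str, Any]) -> None:
        self.events.append((source_format, event))


def handle_request(path: str, body: dict[str, Any], provider: LineageProvider) -> int:
    """OL-aware ingress: forward the raw event as-is and return an illustrative status."""
    if not path.endswith(OL_INGEST_SUFFIX):
        return 404  # not the lineage ingest route
    # The payload is not reshaped, so a client only has to retarget its URL.
    provider.ingest("openlineage", body)
    return 200


provider = InMemoryProvider()
event = {"eventType": "COMPLETE", "job": {"namespace": "ns", "name": "nightly_load"}}
status = handle_request("/api/catalog/v1/lineage", event, provider)
```

The point of the sketch is only the split of responsibilities: the route stays in the shape OL clients expect, while what happens behind `LineageProvider` can be swapped without touching the client.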
I also agree the pieces need to be designed as one usable vertical slice, not as independently mergeable unusable parts. The distinction I was trying to make is responsibility/review scope, not shipment scope: REST/API compatibility, the provider/capability boundary behind it, and the default implementation need to work together. I'll take a look at PR #4667 with that lens.

On Fri, Jun 12, 2026 at 1:33 PM Adnan Hemani <[email protected]> wrote:

> Hi EJ,
>
> Unfortunately, the "/lineage" API is defined in the OpenLineage spec. Changing this out for Polaris would require client-side changes - leading us to the same situation that you confirmed after your investigation.
>
> I agree with tackling the implementation similarly to what you've outlined. However, breaking this design into those topics may create more chaos than good because all these topics must work hand-in-hand design-wise and no other non-OpenLineage proposals for Data Lineage are expected in the near future. I request everyone to please review the initial PR that sets the Ingest API in Polaris: https://github.com/apache/polaris/pull/4667.
>
> Best,
> Adnan Hemani
>
> On Fri, Jun 12, 2026 at 12:02 PM EJ Wang <[email protected]> wrote:
>
>> Hi Adnan,
>>
>> I think your point about adoption is right, and I'd revise part of my earlier framing after looking more closely at how existing OpenLineage integrations work.
>>
>> I was previously thinking too much about whether clients could emit a more Polaris-native or framework-agnostic payload. But that is probably not the right first-slice adoption model. Existing OL producers generally already emit OpenLineage events, and the common low-friction knob is the transport target, URL/endpoint, not a uniform way to wrap or reshape the event body.
>>
>> So I agree that the first slice should optimize for endpoint retargeting and raw OL event ingestion.
>> Clients should not need to know that the backend is Polaris or learn a Polaris-specific payload shape.
>>
>> *The design question I'd still like us to make explicit is where the OpenLineage specificity lives*. My preference would be to make it explicit at the ingress/API layer, for example with an OpenLineage-specific route under the lineage namespace such as: /.../lineage/openlineage
>>
>> That still preserves endpoint-retargeting for existing OL producers, while avoiding ambiguity about whether the generic `/lineage` namespace is an OpenLineage contract or a broader Polaris lineage namespace. It also leaves room for future `/lineage/<format>` ingress adapters if Polaris later supports other lineage formats or frameworks.
>>
>> Behind that ingress route, I'd like to keep the platform boundary Polaris-owned. I would separate:
>>
>> 1. *OpenLineage REST ingress/API*: an OL-aware endpoint that accepts raw OL events.
>> 2. *Polaris lineage capability boundary*: a Polaris-owned contract behind ingress.
>> 3. *Default/OOTB implementation*: a small bundled implementation that proves the SPI capability (encapsulate correctly and expose sufficiently for extension impls) works end-to-end.
>> 4. *Extension implementations*: richer provider/proxy/forwarder/custom behavior for deployments that need it.
>>
>> This is not meant to reduce OpenLineage support. Quite the opposite: OpenLineage can be the first explicit supported ingress format. The point is to make the specificity explicit where it belongs, so Polaris can support OpenLineage well now while preserving room for future contributions in the right layer.
>>
>> *With that framing, I'd suggest*:
>> - Initial PR: OpenLineage-specific ingress + Polaris lineage capability boundary + minimal default/OOTB path.
>> - Follow-up PRs: proxy/forwarder/custom provider implementations and richer behavior.
>> - Query/persistence semantics: separate unless this proposal is explicitly adding a read/query API.
>>
>> I think that would support the adoption goal you described, while keeping Polaris extensible in an organized way.
>>
>> -ej
>>
>> On Thu, Jun 11, 2026 at 8:19 PM Adnan Hemani <[email protected]> wrote:
>>
>>> Hi EJ,
>>>
>>> Thanks for looking at the proposal. I've responded to most of your comments on the document itself, but I'll summarize the stances here to close the loop.
>>>
>>> I am consciously making an effort to let the OpenLineage standard drive the requirements here; this is a feature, not a bug. IMO, OpenLineage is by far the most well-used standard for data lineage; I don't even know of any other significant competitors. Big Data engines like Spark and Trino, which represent a significant use case for Polaris, have OpenLineage integrations and nothing else. Going the extra mile for further flexibility to de-couple our lineage implementations from OpenLineage will likely not produce any ROI in terms of work IMO. Happy to hear any other thoughts on this topic.
>>>
>>> I also don't agree that Polaris should morph into a full-fledged OpenLineage server. I don't think the Polaris community is attempting to make a "Swiss-Army Knife" tool out of Polaris. For major lineage use cases, users absolutely should be redirected to other servers like Marquez where they can get full graph history, multi-hop traversal, jobs/runs info, etc. I disagree with the "extensions" piece of your email based on this reasoning.
>>>
>>> Regarding the "out-of-the-box" experience, I have no doubt: Polaris cannot have lineage information. An admin must take a small step to configure how they want to enable Lineage data persistence: either for Polaris-local persistence or for the passthrough/proxy/AuthZ layer modes.
>>> I think you've missed some of the points in the mailing thread replies above; the Query API is really only helpful when using the Polaris local persistence mode. The current plan is to build toward "passthrough" mode first, with plans to support the Polaris local implementation soon afterward. A Query API won't be introduced until the Polaris local implementation work begins. This means there's no implication that a Query API will exist without returning data to the user. You can see this in my first PR, where only the Ingest API is implemented: https://github.com/apache/polaris/pull/4667.
>>>
>>> One last note/suggestion for you: the term "default battery" on its own generally doesn't make much sense. I'm only able to piece together your comments because you used the phrase "batteries included" in this morning's community sync. I would usually use "out-of-the-box (OOTB)" or "default implementation". Using similar terms in the future would improve readability in general.
>>>
>>> Best,
>>> Adnan Hemani
>>>
>>> On Thu, Jun 11, 2026 at 4:12 PM EJ Wang <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I read through the proposal and the comments. One framing that may help us converge is to split the proposal into a few separate decisions instead of reviewing it as one bundled “OpenLineage support in Polaris” feature.
>>>>
>>>> This seems related to a broader direction I understand for Polaris as a platform: it should be flexible enough to support different deployment and integration use cases, but still battery-included enough to be useful out of the box. For lineage, I think that means we should explicitly separate: what Polaris promises as native lineage semantics, what the default battery implementation does, and what should remain pluggable for richer or deployment-specific implementations.
>>>>
>>>> I have been using a similar exercise in a recent SPI proposal draft: first separate external contracts, default/battery implementation, extension implementations, and provider-facing replacement points; then decide implementation. I think that exercise applies well here because this proposal touches several different boundary types at once: ingest protocol, Polaris-native lineage model, persistence, query API, downstream forwarding, auth, and dataset resolution.
>>>>
>>>> The questions I think we should separate are:
>>>>
>>>> 1. *OpenLineage compatibility:* Do we require existing OpenLineage clients to emit to Polaris by changing only the endpoint/config?
>>>>    - If yes, then a server-side OpenLineage-compatible adapter endpoint makes sense.
>>>>    - If not, another option is a Polaris-provided OpenLineage transport/client shim that reshapes OpenLineage events into a Polaris-native lineage API.
>>>>    - Those are different adoption tradeoffs, and I think we should choose intentionally rather than letting OpenLineage compatibility implicitly define the Polaris-native API.
>>>> 2. *Polaris-native lineage model:* Should the long-term Polaris lineage model/query API be OpenLineage-specific, or framework-agnostic with OpenLineage as one adapter?
>>>>    - My preference is the latter. OpenLineage compatibility is useful, but I would avoid making the OpenLineage payload shape the Polaris-native lineage model by accident.
>>>> 3. *Default battery behavior:* What should work out of the box?
>>>>    - If query is part of the initial release, I think the battery needs enough local state to answer a minimal query. A narrow default could be: latest observed direct table-level upstreams for a Polaris-managed target table, with observed timestamp, producer/engine identifier, and upstream dataset refs.
>>>> 4.
>>>>    *Extension implementations:* What should be pluggable or future work?
>>>>    - I would put raw OpenLineage forwarding/proxying, external backend query, full graph history, multi-hop traversal, column-level query, job/run graph, pruning/staleness, and richer governance-aware behavior into extension/future implementation areas rather than the default battery.
>>>>
>>>> *One subtle point*: I do not think the default battery and the REST/API envelope need to have exactly the same scope.
>>>>
>>>> The default battery can be intentionally small. For example, latest direct table-level lineage summary for Polaris-managed target tables. *But the REST/API envelope can still be designed so that richer implementations are possible later or through extensions*. For example, the API can carry metadata such as *granularity (table/col/job etc.), format/source protocol (OpenLineage or other lineage framework)*, or requested mode to help Polaris route handling to the configured provider, without requiring every default implementation to support every mode.
>>>>
>>>> Said differently, I would separate:
>>>>
>>>> - what the API envelope can represent;
>>>> - what the default battery actually guarantees;
>>>> - what extension implementations can support.
>>>>
>>>> *My concrete recommendation would be*:
>>>>
>>>> If Polaris exposes a lineage Query API in the initial release, the default battery should provide a minimal latest table-level summary implementation so the query works out of the box. If we do not want any local persistence in the initial release, then I think the Query API should be out of scope for the initial release or clearly extension-provided. I would avoid exposing a core query API whose default implementation cannot answer anything.
>>>>
>>>> *My preferred shape would be*:
>>>>
>>>> - Polaris-native lineage semantics stay *framework-agnostic*.
>>>> - OpenLineage is supported as an adapter/adoption path, *not as the only Polaris lineage model*.
>>>> - The default battery, if query is in scope, is latest direct table-level lineage summary only.
>>>> - *The API envelope leaves room for richer provider implementations*.
>>>> - Full OpenLineage backend behavior, downstream forwarding/proxying, historical graph, column lineage, job/run lineage, multi-hop query, pruning/staleness, and external backend query *are extension or future work*.
>>>>
>>>> This would still give Polaris a useful out-of-the-box lineage experience, while avoiding turning Polaris into a full lineage backend in the first step.
>>>>
>>>> -ej
>>>>
>>>> On Mon, Jun 8, 2026 at 2:31 PM Adnan Hemani via dev <[email protected]> wrote:
>>>>
>>>>> Hi Robert,
>>>>>
>>>>> > Is my understanding correct that option 1 is out of scope from your perspective, and option 2 is not sufficient for the M0 you have in mind? In other words, you are proposing option 3 as the baseline, with active planning toward option 4?
>>>>>
>>>>> Yes, that's correct. Happy to hear others' opinions, but Option 4 has been detailed in the proposal document since the very start. I'm happy to wait a few more days for others' opinions, but as of now I don't see any active opposition to the plans as-is and the "lazy consensus" suggested deadline was over 2 weeks ago. I-Ting and I will start implementation in the meantime.
>>>>>
>>>>> Best,
>>>>> Adnan Hemani
>>>>>
>>>>> On Mon, Jun 8, 2026 at 3:19 AM Robert Stupp <[email protected]> wrote:
>>>>>
>>>>> > Hi all,
>>>>> >
>>>>> > Thanks Adnan, that helps clarify the shape.
>>>>> >
>>>>> > I think this is the point where broader community input would be useful, because options 3/4 are a materially different commitment from options 1/2.
>>>>> >
>>>>> > Is my understanding correct that option 1 is out of scope from your perspective, and option 2 is not sufficient for the M0 you have in mind? In other words, you are proposing option 3 as the baseline, with active planning toward option 4?
>>>>> >
>>>>> > Option 3 does not just put a proxy endpoint in Polaris. It makes Polaris responsible for the OL ingest path: dataset-name resolution, per-entity authZ over OL assertions, policy for non-Polaris datasets, trusted-service credentials to downstream systems, request-size and payload limits, forwarding failure semantics, audit behavior, and tenant isolation.
>>>>> >
>>>>> > Option 4 then adds a Polaris-local lineage storage/query subsystem. Even if the first version stores only a reduced projection, Polaris would take on many responsibilities of an OL backend: persistence semantics, query semantics, staleness/pruning, auth-filtered reads, backend compatibility, migrations, limits, and long-term compatibility with OL event shapes. At that point, even if intentionally limited, Polaris effectively operates as an OL backend for the supported subset.
>>>>> >
>>>>> > So before we treat option 3 plus active planning toward option 4 as the M0 baseline, I think it would be good to hear whether others agree that Polaris should take on that implementation and maintenance surface for the first milestone.
>>>>> >
>>>>> > Or whether we should start with a smaller integration point first.
>>>>> >
>>>>> > Robert
