Re: OpenLineage Proposal - Follow-Up

EJ Wang Fri, 12 Jun 2026 14:57:51 -0700

Thanks Adnan. That makes sense. I agree we should not change the OL
endpoint shape in a way that requires client-side changes. My point was not
to replace the OpenLineage-defined endpoint, but to make the
OpenLineage-specific API surface clear in the surrounding route
grouping/docs while preserving the endpoint shape OL clients expect.


I also agree the pieces need to be designed as one usable vertical slice,
not as independently mergeable unusable parts. The distinction I was trying
to make is responsibility/review scope, not shipment scope: REST/API
compatibility, the provider/capability boundary behind it, and the default
implementation need to work together.
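
To make that split concrete, here is a rough sketch of how the three pieces could sit together. Every name in it (`LineageProvider`, `InMemoryProvider`, `handle_post`, `OL_INGEST_PATH`) is hypothetical and is not taken from PR #4667; the path prefix and the status codes are placeholders too. The only thing it takes as given is that OL clients post raw events to a route ending in `/lineage`.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


class LineageProvider(Protocol):
    """Hypothetical Polaris-owned capability boundary behind the ingress."""

    def ingest(self, event: dict[str, Any]) -> None: ...


@dataclass
class InMemoryProvider:
    """Hypothetical minimal default implementation: keeps raw events in a list."""

    events: list[dict[str, Any]] = field(default_factory=list)

    def ingest(self, event: dict[str, Any]) -> None:
        self.events.append(event)


# Route suffix that OL clients already post to; any prefix is a placeholder.
OL_INGEST_PATH = "/lineage"


def handle_post(path: str, body: dict[str, Any], provider: LineageProvider) -> int:
    """OL-aware ingress: check the route, validate minimally, then delegate.

    Returns an HTTP-like status code (placeholder values).
    """
    if not path.endswith(OL_INGEST_PATH):
        return 404
    # Two fields the OpenLineage event model requires; checked only as an illustration.
    if "eventTime" not in body or "producer" not in body:
        return 400
    # The ingress does not know which provider it is talking to.
    provider.ingest(body)
    return 200
```

The ingress only depends on the `LineageProvider` contract, so a passthrough or forwarder provider could replace `InMemoryProvider` without touching the route that clients see.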

I'll take a look at PR #4667 with that lens.

On Fri, Jun 12, 2026 at 1:33 PM Adnan Hemani <[email protected]>
wrote:

> Hi EJ,
>
> Unfortunately, the "/lineage" API is defined in the OpenLineage spec.
> Changing this out for Polaris would require client-side changes - leading
> us to the same situation that you confirmed after your investigation.
>
> I agree with tackling the implementation similarly to what you've
> outlined. However, breaking this design into those topics may create more
> chaos than good because all these topics must work hand-in-hand design-wise
> and no other non-OpenLineage proposals for Data Lineage are expected in the
> near future. I request everyone to please review the initial PR that sets
> the Ingest API in Polaris: https://github.com/apache/polaris/pull/4667.
>
> Best,
> Adnan Hemani
>
>
>
> On Fri, Jun 12, 2026 at 12:02 PM EJ Wang <[email protected]>
> wrote:
>
>> Hi Adnan,
>>
>> I think your point about adoption is right, and I'd revise part of my
>> earlier framing after looking more closely at how existing OpenLineage
>> integrations work.
>>
>> I was previously thinking too much about whether clients could emit a
>> more Polaris-native or framework-agnostic payload. But that is probably not
>> the right first-slice adoption model. Existing OL producers generally
>> already emit OpenLineage events, and the common low-friction knob is the
>> transport target (URL/endpoint), not a uniform way to wrap or reshape the
>> event body.
>>
>> So I agree that the first slice should optimize for endpoint retargeting
>> and raw OL event ingestion. Clients should not need to know that the
>> backend is Polaris or learn a Polaris-specific payload shape.
>>
>> *The design question I'd still like us to make explicit is where the
>> OpenLineage specificity lives*. My preference would be to make it
>> explicit at the ingress/API layer, for example with an OpenLineage-specific
>> route under the lineage namespace such as: /.../lineage/openlineage
>>
>> That still preserves endpoint-retargeting for existing OL producers,
>> while avoiding ambiguity about whether the generic `/lineage` namespace is
>> an OpenLineage contract or a broader Polaris lineage namespace. It also
>> leaves room for future `/lineage/<format>` ingress adapters if Polaris
>> later supports other lineage formats or frameworks.
>>
>> Behind that ingress route, I'd like to keep the platform boundary
>> Polaris-owned. I would separate:
>>
>> 1. *OpenLineage REST ingress/API*: an OL-aware endpoint that accepts
>> raw OL events.
>> 2. *Polaris lineage capability boundary*: a Polaris-owned contract
>> behind ingress.
>> 3. *Default/OOTB implementation*: a small bundled implementation that
>> proves the SPI capability (encapsulated correctly and exposed sufficiently
>> for extension impls) works end-to-end.
>> 4. *Extension implementations*: richer provider/proxy/forwarder/custom
>> behavior for deployments that need it.
>>
>> This is not meant to reduce OpenLineage support. Quite the opposite:
>> OpenLineage can be the first explicit supported ingress format. The point
>> is to make the specificity explicit where it belongs, so Polaris can
>> support OpenLineage well now while preserving room for future contributions
>> in the right layer.
>>
>> *With that framing, I'd suggest*:
>> - Initial PR: OpenLineage-specific ingress + Polaris lineage capability
>> boundary + minimal default/OOTB path.
>> - Follow-up PRs: proxy/forwarder/custom provider implementations and
>> richer behavior.
>> - Query/persistence semantics: separate unless this proposal is
>> explicitly adding a read/query API.
>>
>> I think that would support the adoption goal you described, while keeping
>> Polaris extensible in an organized way.
>>
>> -ej
>>
>> On Thu, Jun 11, 2026 at 8:19 PM Adnan Hemani <[email protected]>
>> wrote:
>>
>>> Hi EJ,
>>>
>>> Thanks for looking at the proposal. I've responded to most of your
>>> comments on the document itself, but I'll summarize the stances here to
>>> close the loop.
>>>
>>> I am consciously making an effort to let the OpenLineage standard drive
>>> the requirements here; this is a feature, not a bug. IMO, OpenLineage is
>>> by far the most widely used standard for data lineage; I don't even know of
>>> any other significant competitors. Big Data engines like Spark and Trino,
>>> which represent a significant use case for Polaris, have OpenLineage
>>> integrations and nothing else. Going the extra mile for further flexibility
>>> to de-couple our lineage implementations from OpenLineage will likely not
>>> produce any ROI in terms of work IMO. Happy to hear any other thoughts on
>>> this topic.
>>>
>>> I also don't agree that Polaris should morph into a full-fledged
>>> OpenLineage server. I don't think the Polaris community is attempting to
>>> make a "Swiss-Army Knife" tool out of Polaris. For major lineage use cases,
>>> users absolutely should be redirected to other servers like Marquez where
>>> they can get full graph history, multi-hop traversal, jobs/runs info, etc.
>>> I disagree with the "extensions" piece of your email based on this
>>> reasoning.
>>>
>>> Regarding the "out-of-the-box" experience, I have no doubt: without
>>> configuration, Polaris cannot have lineage information. An admin must
>>> take a small step to configure how they want to enable lineage data
>>> persistence: either Polaris-local persistence or the
>>> passthrough/proxy/AuthZ layer modes. I
>>> think you've missed some of the points in the mailing thread replies above;
>>> the Query API is really only helpful when using the Polaris local
>>> persistence mode. The current plan is to build toward "passthrough" mode
>>> first, with plans to support the Polaris local implementation soon
>>> afterward. A Query API won't be introduced until the Polaris local
>>> implementation work begins. This means there's no implication that a Query
>>> API will exist without returning data to the user. You can see this in my
>>> first PR, where only the Ingest API is implemented:
>>> https://github.com/apache/polaris/pull/4667.
>>>
>>> One last note/suggestion for you: the term "default battery" on its own
>>> generally doesn't make much sense. I'm only able to piece together your
>>> comments because you used the phrase "batteries included" in this morning's
>>> community sync. I would usually use "out-of-the-box (OOTB)" or "default
>>> implementation". Using similar terms in the future would improve
>>> readability in general.
>>>
>>> Best,
>>> Adnan Hemani
>>>
>>> On Thu, Jun 11, 2026 at 4:12 PM EJ Wang <[email protected]>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I read through the proposal and the comments. One framing that may help
>>>> us converge is to split the proposal into a few separate decisions instead
>>>> of reviewing it as one bundled “OpenLineage support in Polaris” feature.
>>>>
>>>> This seems related to a broader direction I understand for Polaris as a
>>>> platform: it should be flexible enough to support different deployment and
>>>> integration use cases, but still battery-included enough to be useful out
>>>> of the box. For lineage, I think that means we should explicitly separate:
>>>> what Polaris promises as native lineage semantics, what the default battery
>>>> implementation does, and what should remain pluggable for richer or
>>>> deployment-specific implementations.
>>>>
>>>> I have been using a similar exercise in a recent SPI proposal draft:
>>>> first separate external contracts, default/battery implementation,
>>>> extension implementations, and provider-facing replacement points; then
>>>> decide implementation. I think that exercise applies well here because this
>>>> proposal touches several different boundary types at once: ingest protocol,
>>>> Polaris-native lineage model, persistence, query API, downstream
>>>> forwarding, auth, and dataset resolution.
>>>>
>>>> The questions I think we should separate are:
>>>>
>>>>    1. *OpenLineage compatibility:* Do we require existing OpenLineage
>>>>    clients to emit to Polaris by changing only the endpoint/config?
>>>>       - If yes, then a server-side OpenLineage-compatible adapter
>>>>       endpoint makes sense.
>>>>       - If not, another option is a Polaris-provided OpenLineage
>>>>       transport/client shim that reshapes OpenLineage events into a
>>>>       Polaris-native lineage API.
>>>>       - Those are different adoption tradeoffs, and I think we should
>>>>       choose intentionally rather than letting OpenLineage compatibility
>>>>       implicitly define the Polaris-native API.
>>>>    2. *Polaris-native lineage model:* Should the long-term Polaris
>>>>    lineage model/query API be OpenLineage-specific, or
>>>>    framework-agnostic with OpenLineage as one adapter?
>>>>       - My preference is the latter. OpenLineage compatibility is
>>>>       useful, but I would avoid making the OpenLineage payload shape the
>>>>       Polaris-native lineage model by accident.
>>>>    3. *Default battery behavior:* What should work out of the box?
>>>>       - If query is part of the initial release, I think the battery
>>>>       needs enough local state to answer a minimal query. A narrow
>>>>       default could be: latest observed direct table-level upstreams
>>>>       for a Polaris-managed target table, with observed timestamp,
>>>>       producer/engine identifier, and upstream dataset refs.
>>>>    4. *Extension implementations:* What should be pluggable or future
>>>>    work?
>>>>       - I would put raw OpenLineage forwarding/proxying, external
>>>>       backend query, full graph history, multi-hop traversal,
>>>>       column-level query, job/run graph, pruning/staleness, and richer
>>>>       governance-aware behavior into extension/future implementation
>>>>       areas rather than the default battery.
>>>>
>>>> *One subtle point*: I do not think the default battery and the
>>>> REST/API envelope need to have exactly the same scope.
>>>>
>>>> The default battery can be intentionally small. For example, latest
>>>> direct table-level lineage summary for Polaris-managed target tables. *But
>>>> the REST/API envelope can still be designed so that richer implementations
>>>> are possible later or through extensions*. For example, the API can
>>>> carry metadata such as *granularity (table/col/job etc.),
>>>> format/source protocol (OpenLineage or other lineage framework)*, or
>>>> requested mode to help Polaris route handling to the configured provider,
>>>> without requiring every default implementation to support every mode.
>>>>
>>>> Said differently, I would separate:
>>>>
>>>>    - what the API envelope can represent;
>>>>    - what the default battery actually guarantees;
>>>>    - what extension implementations can support.
>>>>
>>>> *My concrete recommendation would be*:
>>>>
>>>> If Polaris exposes a lineage Query API in the initial release, the
>>>> default battery should provide a minimal latest table-level summary
>>>> implementation so the query works out of the box. If we do not want any
>>>> local persistence in the initial release, then I think the Query API should
>>>> be out of scope for the initial release or clearly extension-provided. I
>>>> would avoid exposing a core query API whose default implementation cannot
>>>> answer anything.
>>>>
>>>> *My preferred shape would be*:
>>>>
>>>>    - Polaris-native lineage semantics stay *framework-agnostic*.
>>>>    - OpenLineage is supported as an adapter/adoption path, *not as the
>>>>    only Polaris lineage model*.
>>>>    - The default battery, if query is in scope, is latest direct
>>>>    table-level lineage summary only.
>>>>    - *The API envelope leaves room for richer provider implementations*.
>>>>    - Full OpenLineage backend behavior, downstream
>>>>    forwarding/proxying, historical graph, column lineage, job/run lineage,
>>>>    multi-hop query, pruning/staleness, and external backend query *are
>>>>    extension or future work*.
>>>>
>>>> This would still give Polaris a useful out-of-the-box lineage
>>>> experience, while avoiding turning Polaris into a full lineage backend in
>>>> the first step.
>>>>
>>>> -ej
>>>>
>>>> On Mon, Jun 8, 2026 at 2:31 PM Adnan Hemani via dev <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Robert,
>>>>>
>>>>> > Is my understanding correct that option 1 is out of scope from your
>>>>> > perspective, and option 2 is not sufficient for the M0 you have in
>>>>> > mind? In other words, you are proposing option 3 as the baseline,
>>>>> > with active planning toward option 4?
>>>>>
>>>>> Yes, that's correct. Happy to hear others' opinions, but Option 4 has
>>>>> been detailed in the proposal document since the very start. I'm happy
>>>>> to wait a few more days for others' opinions, but as of now I don't
>>>>> see any active opposition to the plans as-is and the "lazy consensus"
>>>>> suggested deadline was over 2 weeks ago. I-Ting and I will start
>>>>> implementation in the meantime.
>>>>>
>>>>> Best,
>>>>> Adnan Hemani
>>>>>
>>>>> On Mon, Jun 8, 2026 at 3:19 AM Robert Stupp <[email protected]> wrote:
>>>>>
>>>>> > Hi all,
>>>>> >
>>>>> > Thanks Adnan, that helps clarify the shape.
>>>>> >
>>>>> > I think this is the point where broader community input would be
>>>>> > useful, because options 3/4 are a materially different commitment
>>>>> > from options 1/2.
>>>>> >
>>>>> > Is my understanding correct that option 1 is out of scope from your
>>>>> > perspective, and option 2 is not sufficient for the M0 you have in
>>>>> > mind? In other words, you are proposing option 3 as the baseline,
>>>>> > with active planning toward option 4?
>>>>> >
>>>>> > Option 3 does not just put a proxy endpoint in Polaris.
>>>>> > It makes Polaris responsible for the OL ingest path: dataset-name
>>>>> > resolution, per-entity authZ over OL assertions, policy for
>>>>> > non-Polaris datasets, trusted-service credentials to downstream
>>>>> > systems, request-size and payload limits, forwarding failure
>>>>> > semantics, audit behavior, and tenant isolation.
>>>>> >
>>>>> > Option 4 then adds a Polaris-local lineage storage/query subsystem.
>>>>> > Even if the first version stores only a reduced projection, Polaris
>>>>> > would take on many responsibilities of an OL backend: persistence
>>>>> > semantics, query semantics, staleness/pruning, auth-filtered reads,
>>>>> > backend compatibility, migrations, limits, and long-term
>>>>> > compatibility with OL event shapes.
>>>>> > At that point, even if intentionally limited, Polaris effectively
>>>>> > operates as an OL backend for the supported subset.
>>>>> >
>>>>> > So before we treat option 3 plus active planning toward option 4 as
>>>>> > the M0 baseline, I think it would be good to hear whether others
>>>>> > agree that Polaris should take on that implementation and
>>>>> > maintenance surface for the first milestone.
>>>>> >
>>>>> > Or whether we should start with a smaller integration point first.
>>>>> >
>>>>> > Robert
>>>>> >
>>>>>
>>>>
