Re: OpenLineage Proposal - Follow-Up

EJ Wang Fri, 12 Jun 2026 12:02:32 -0700

Hi Adnan,

I think your point about adoption is right, and I'd revise part of my
earlier framing after looking more closely at how existing OpenLineage
integrations work.


I was previously thinking too much about whether clients could emit a more
Polaris-native or framework-agnostic payload. But that is probably not the
right first-slice adoption model. Existing OL producers generally already
emit OpenLineage events, and the common low-friction knob is the transport
target (URL/endpoint), not a uniform way to wrap or reshape the event body.

So I agree that the first slice should optimize for endpoint retargeting
and raw OL event ingestion. Clients should not need to know that the
backend is Polaris or learn a Polaris-specific payload shape.

*The design question I'd still like us to make explicit is where the
OpenLineage specificity lives*. My preference would be to make it explicit
at the ingress/API layer, for example with an OpenLineage-specific route
under the lineage namespace such as: /.../lineage/openlineage

That still preserves endpoint-retargeting for existing OL producers, while
avoiding ambiguity about whether the generic `/lineage` namespace is an
OpenLineage contract or a broader Polaris lineage namespace. It also leaves
room for future `/lineage/<format>` ingress adapters if Polaris later
supports other lineage formats or frameworks.
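
To make the routing idea concrete, here is a minimal sketch of format-keyed
ingress dispatch. Every name in it (`ADAPTERS`, `openlineage_adapter`,
`route`) is hypothetical and does not exist in Polaris; the only thing taken
from the proposal is the idea that the last path segment after `lineage`
names the format.

```python
from typing import Callable, Dict

# Hypothetical adapter type: takes a raw payload, returns a tagged record.
Adapter = Callable[[dict], dict]


def openlineage_adapter(event: dict) -> dict:
    # Accept the raw OL event unchanged; only record which format it came in as.
    return {"format": "openlineage", "payload": event}


# Hypothetical registry: one entry per supported ingress format.
ADAPTERS: Dict[str, Adapter] = {"openlineage": openlineage_adapter}


def route(path: str, body: dict) -> dict:
    """Dispatch a path ending in lineage/<format> to the adapter for <format>."""
    parts = path.rstrip("/").split("/")
    if len(parts) < 2 or parts[-2] != "lineage":
        raise ValueError(f"not a format-specific lineage route: {path}")
    fmt = parts[-1]
    if fmt not in ADAPTERS:
        raise LookupError(f"no ingress adapter for format: {fmt}")
    return ADAPTERS[fmt](body)
```

With this shape, a bare `lineage` path is rejected instead of silently being
treated as OpenLineage, and supporting another format later would mean adding
one registry entry and leaving existing routes untouched.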

Behind that ingress route, I'd like to keep the platform boundary
Polaris-owned. I would separate:

1. *OpenLineage REST ingress/API*: an OL-aware endpoint that accepts raw
OL events.
2. *Polaris lineage capability boundary*: a Polaris-owned contract behind
ingress.
3. *Default/OOTB implementation*: a small bundled implementation that
proves the SPI works end-to-end, i.e. that it encapsulates correctly and
exposes enough for extension implementations.
4. *Extension implementations*: richer provider/proxy/forwarder/custom
behavior for deployments that need it.
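
As a rough illustration of that four-way separation, the layers could look
something like the sketch below. All names (`LineageSink`, `InMemorySink`,
`ForwardingSink`, `ingest_openlineage`) are hypothetical and not taken from
the proposal or from Polaris, and the `eventTime` check is only a stand-in
for whatever validation the ingress layer would really do.

```python
from typing import Callable, List, Protocol


class LineageSink(Protocol):
    """Layer 2: hypothetical Polaris-owned contract behind ingress."""

    def accept(self, event: dict) -> None: ...


class InMemorySink:
    """Layer 3: small default implementation that just retains events."""

    def __init__(self) -> None:
        self.events: List[dict] = []

    def accept(self, event: dict) -> None:
        self.events.append(event)


class ForwardingSink:
    """Layer 4: extension that hands each event to a caller-supplied sender."""

    def __init__(self, send: Callable[[dict], None]) -> None:
        self._send = send

    def accept(self, event: dict) -> None:
        self._send(event)


def ingest_openlineage(raw_event: dict, sink: LineageSink) -> None:
    """Layer 1: OL-aware ingress; passes the raw event through unchanged."""
    if "eventTime" not in raw_event:  # stand-in validation, an assumption
        raise ValueError("OpenLineage event is missing eventTime")
    sink.accept(raw_event)
```

The point of the sketch is only that the ingress function knows about
OpenLineage while the sink contract does not, so swapping `InMemorySink` for
`ForwardingSink` requires no change on the client or ingress side.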

This is not meant to reduce OpenLineage support. Quite the opposite:
OpenLineage can be the first explicit supported ingress format. The point
is to make the specificity explicit where it belongs, so Polaris can
support OpenLineage well now while preserving room for future contributions
in the right layer.

*With that framing, I'd suggest*:
- Initial PR: OpenLineage-specific ingress + Polaris lineage capability
boundary + minimal default/OOTB path.
- Follow-up PRs: proxy/forwarder/custom provider implementations and richer
behavior.
- Query/persistence semantics: separate unless this proposal is explicitly
adding a read/query API.

I think that would support the adoption goal you described, while keeping
Polaris extensible in an organized way.

-ej

On Thu, Jun 11, 2026 at 8:19 PM Adnan Hemani <[email protected]>
wrote:

> Hi EJ,
>
> Thanks for looking at the proposal. I've responded to most of your
> comments on the document itself, but I'll summarize the stances here to
> close the loop.
>
> I am consciously making an effort to let the OpenLineage standard drive
> the requirements here; this is a feature, not a bug. IMO, OpenLineage is
> by far the most widely used standard for data lineage; I don't even know of
> any other significant competitors. Big Data engines like Spark and Trino,
> which represent a significant use case for Polaris, have OpenLineage
> integrations and nothing else. Going the extra mile for further flexibility
> to decouple our lineage implementations from OpenLineage will likely not
> produce any ROI for the work involved, IMO. Happy to hear any other thoughts
> on this topic.
>
> I also don't agree that Polaris should morph into a full-fledged
> OpenLineage server. I don't think the Polaris community is attempting to
> make a "Swiss-Army Knife" tool out of Polaris. For major lineage use cases,
> users absolutely should be redirected to other servers like Marquez where
> they can get full graph history, multi-hop traversal, jobs/runs info, etc.
> I disagree with the "extensions" piece of your email based on this
> reasoning.
>
> Regarding the "out-of-the-box" experience, I have no doubt: Polaris cannot
> have lineage information by default. An admin must take a small step to
> configure how they want to enable lineage data persistence: either
> Polaris-local persistence or the passthrough/proxy/AuthZ layer modes. I
> think you've missed some of the points in the mailing thread replies above;
> the Query API is really only helpful when using the Polaris local
> persistence mode.
> The current plan is to build toward "passthrough" mode first, with plans to
> support the Polaris local implementation soon afterward. A Query API won't
> be introduced until the Polaris local implementation work begins. This
> means there's no implication that a Query API will exist without returning
> data to the user. You can see this in my first PR, where only the Ingest
> API is implemented: https://github.com/apache/polaris/pull/4667.
>
> One last note/suggestion for you: the term "default battery" on its own
> generally doesn't make much sense. I'm only able to piece together your
> comments because you used the phrase "batteries included" in this morning's
> community sync. I would usually use "out-of-the-box (OOTB)" or "default
> implementation". Using similar terms in the future would improve
> readability in general.
>
> Best,
> Adnan Hemani
>
> On Thu, Jun 11, 2026 at 4:12 PM EJ Wang <[email protected]>
> wrote:
>
>> Hi all,
>>
>> I read through the proposal and the comments. One framing that may help
>> us converge is to split the proposal into a few separate decisions instead
>> of reviewing it as one bundled “OpenLineage support in Polaris” feature.
>>
>> This seems related to a broader direction I understand for Polaris as a
>> platform: it should be flexible enough to support different deployment and
>> integration use cases, but still battery-included enough to be useful out
>> of the box. For lineage, I think that means we should explicitly separate:
>> what Polaris promises as native lineage semantics, what the default battery
>> implementation does, and what should remain pluggable for richer or
>> deployment-specific implementations.
>>
>> I have been using a similar exercise in a recent SPI proposal draft:
>> first separate external contracts, default/battery implementation,
>> extension implementations, and provider-facing replacement points; then
>> decide implementation. I think that exercise applies well here because this
>> proposal touches several different boundary types at once: ingest protocol,
>> Polaris-native lineage model, persistence, query API, downstream
>> forwarding, auth, and dataset resolution.
>>
>> The questions I think we should separate are:
>>
>>    1. *OpenLineage compatibility:* Do we require existing OpenLineage
>>    clients to emit to Polaris by changing only the endpoint/config?
>>       - If yes, then a server-side OpenLineage-compatible adapter
>>       endpoint makes sense.
>>       - If not, another option is a Polaris-provided OpenLineage
>>       transport/client shim that reshapes OpenLineage events into a
>>       Polaris-native lineage API.
>>       - Those are different adoption tradeoffs, and I think we should
>>       choose intentionally rather than letting OpenLineage compatibility
>>       implicitly define the Polaris-native API.
>>    2. *Polaris-native lineage model:* Should the long-term Polaris
>>    lineage model/query API be OpenLineage-specific, or framework-agnostic
>>    with OpenLineage as one adapter?
>>       - My preference is the latter. OpenLineage compatibility is
>>       useful, but I would avoid making the OpenLineage payload shape the
>>       Polaris-native lineage model by accident.
>>    3. *Default battery behavior:* What should work out of the box?
>>       - If query is part of the initial release, I think the battery
>>       needs enough local state to answer a minimal query. A narrow default
>>       could be: latest observed direct table-level upstreams for a
>>       Polaris-managed target table, with observed timestamp,
>>       producer/engine identifier, and upstream dataset refs.
>>    4. *Extension implementations:* What should be pluggable or future
>>    work?
>>       - I would put raw OpenLineage forwarding/proxying, external
>>       backend query, full graph history, multi-hop traversal, column-level
>>       query, job/run graph, pruning/staleness, and richer governance-aware
>>       behavior into extension/future implementation areas rather than the
>>       default battery.
>>
>> *One subtle point*: I do not think the default battery and the REST/API
>> envelope need to have exactly the same scope.
>>
>> The default battery can be intentionally small. For example, latest
>> direct table-level lineage summary for Polaris-managed target tables. *But
>> the REST/API envelope can still be designed so that richer implementations
>> are possible later or through extensions*. For example, the API can
>> carry metadata such as *granularity (table/col/job etc.), format/source
>> protocol (OpenLineage or other lineage framework)*, or requested mode to
>> help Polaris route handling to the configured provider, without requiring
>> every default implementation to support every mode.
>>
>> Said differently, I would separate:
>>
>>    - what the API envelope can represent;
>>    - what the default battery actually guarantees;
>>    - what extension implementations can support.
>>
>> *My concrete recommendation would be*:
>>
>> If Polaris exposes a lineage Query API in the initial release, the
>> default battery should provide a minimal latest table-level summary
>> implementation so the query works out of the box. If we do not want any
>> local persistence in the initial release, then I think the Query API should
>> be out of scope for the initial release or clearly extension-provided. I
>> would avoid exposing a core query API whose default implementation cannot
>> answer anything.
>>
>> *My preferred shape would be*:
>>
>>    - Polaris-native lineage semantics stay *framework-agnostic*.
>>    - OpenLineage is supported as an adapter/adoption path, *not as the
>>    only Polaris lineage model*.
>>    - The default battery, if query is in scope, is latest direct
>>    table-level lineage summary only.
>>    - *The API envelope leaves room for richer provider implementations*.
>>    - Full OpenLineage backend behavior, downstream forwarding/proxying,
>>    historical graph, column lineage, job/run lineage, multi-hop query,
>>    pruning/staleness, and external backend query *are extension or
>>    future work*.
>>
>> This would still give Polaris a useful out-of-the-box lineage experience,
>> while avoiding turning Polaris into a full lineage backend in the first
>> step.
>>
>> -ej
>>
>> On Mon, Jun 8, 2026 at 2:31 PM Adnan Hemani via dev <
>> [email protected]> wrote:
>>
>>> Hi Robert,
>>>
>>> > Is my understanding correct that option 1 is out of scope from your
>>> > perspective, and option 2 is not sufficient for the M0 you have in mind?
>>> > In other words, you are proposing option 3 as the baseline, with active
>>> > planning toward option 4?
>>>
>>> Yes, that's correct. Happy to hear others' opinions, but Option 4 has
>>> been detailed in the proposal document since the very start. I'm happy to
>>> wait a few more days for others' opinions, but as of now I don't see any
>>> active opposition to the plans as-is and the "lazy consensus" suggested
>>> deadline was over 2 weeks ago. I-Ting and I will start implementation in
>>> the meantime.
>>>
>>> Best,
>>> Adnan Hemani
>>>
>>> On Mon, Jun 8, 2026 at 3:19 AM Robert Stupp <[email protected]> wrote:
>>>
>>> > Hi all,
>>> >
>>> > Thanks Adnan, that helps clarify the shape.
>>> >
>>> > I think this is the point where broader community input would be
>>> > useful, because options 3/4 are a materially different commitment from
>>> > options 1/2.
>>> >
>>> > Is my understanding correct that option 1 is out of scope from your
>>> > perspective, and option 2 is not sufficient for the M0 you have in
>>> > mind? In other words, you are proposing option 3 as the baseline, with
>>> > active planning toward option 4?
>>> >
>>> > Option 3 does not just put a proxy endpoint in Polaris.
>>> > It makes Polaris responsible for the OL ingest path: dataset-name
>>> > resolution, per-entity authZ over OL assertions, policy for non-Polaris
>>> > datasets, trusted-service credentials to downstream systems,
>>> > request-size and payload limits, forwarding failure semantics, audit
>>> > behavior, and tenant isolation.
>>> >
>>> > Option 4 then adds a Polaris-local lineage storage/query subsystem.
>>> > Even if the first version stores only a reduced projection, Polaris
>>> > would take on many responsibilities of an OL backend: persistence
>>> > semantics, query semantics, staleness/pruning, auth-filtered reads,
>>> > backend compatibility, migrations, limits, and long-term compatibility
>>> > with OL event shapes.
>>> > At that point, even if intentionally limited, Polaris effectively
>>> > operates as an OL backend for the supported subset.
>>> >
>>> > So before we treat option 3 plus active planning toward option 4 as
>>> > the M0 baseline, I think it would be good to hear whether others agree
>>> > that Polaris should take on that implementation and maintenance surface
>>> > for the first milestone.
>>> >
>>> > Or whether we should start with a smaller integration point first.
>>> >
>>> > Robert
>>> >
>>>
>>
