Hi all, I recall our initial discussions regarding OpenLineage integration where we agreed on an event-first approach, as this aligns best with Polaris's positioning. Other aspects of OpenLineage seem more suited to the query engine level.
The proposal to treat Polaris as a proxy is an interesting approach, particularly for the potential to augment events with additional information.

Regards,
JB

On Tue, Apr 21, 2026 at 7:13 PM Adnan Hemani via dev <[email protected]> wrote:

> Hi all,
>
> Great thoughts!
>
> I-Ting, you're exactly right. I will share the proposal to your email on
> Google Drive (it still needs a bit of work but it's about 90% complete
> right now). I'd love to get your feedback and work on the doc!
>
> Yufei, I 100% agree with your feedback as well. The proposal's current
> structure highlights a few advantages:
> 1. Centralized auth for the OL server: Marquez, the reference OL
> implementation, does not have an auth system. Polaris can provide
> significant value here immediately.
> 2. OL, in general, has many concepts, like Jobs/Runs/etc. I think it is
> likely that some customers do not need that level of lineage tracking and
> would rather not deal with the additional maintenance of yet another web
> server. Polaris can provide a "lite" version of data lineage OOTB so that
> users don't need to install a full-functionality OL server if they don't
> require it. But Polaris should be able to support a pass-through model to a
> full-functionality OL server in case users still want to use Polaris as the
> "single pane of glass".
> 3. Although the proposal does not explicitly state catalog-enhanced data
> enrichment for the OL data that Polaris could send downstream (to focus the
> scope of the initial changes), this proposal would help us unlock this
> functionality. This is something I discussed with Michael Collado earlier
> :)
>
> As next steps, I will work with I-Ting to introduce this proposal - perhaps
> later this week!
>
> Best,
> Adnan Hemani
>
> On Tue, Apr 21, 2026 at 9:45 AM Yufei Gu <[email protected]> wrote:
>
> > Agreed: engines provide more accurate lineage information.
> >
> > I think there are two viable architectures here:
> >
> > - Engines send OpenLineage events directly to a lineage backend (e.g.
> >   Marquez)
> > - Engines send events to Polaris, which acts as an ingestion layer (and
> >   optionally forwards downstream)
> >
> > The first option is the standard and simplest model. It avoids extra hops
> > and works well when lineage collection is the only goal. The second option
> > becomes interesting only if Polaris adds clear control-plane value, for
> > example:
> >
> > - central auth and policy for lineage ingestion
> > - catalog-aware enrichment (stable table identity, namespace, snapshots,
> >   federation context)
> > - decoupling engines from the downstream lineage backend
> >
> > If Polaris is only acting as a pass-through, the extra layer likely does
> > not justify the added complexity. However, if we position Polaris as a
> > catalog-aware lineage gateway, the second model aligns well with the
> > “single pane of glass” direction. I think this is a direction the
> > Polaris community can pursue. My inclination is to focus Polaris on
> > ingestion, normalization, and forwarding, and avoid expanding into a full
> > lineage storage or analytics system.
> >
> > Curious to hear others’ thoughts.
> >
> > Yufei
> >
> >
> > On Tue, Apr 21, 2026 at 7:52 AM ITing Lee <[email protected]> wrote:
> >
> > > Hi Adnan Hemani,
> > >
> > > Yes, I agree with your point of view. After thinking it through, I see
> > > that my POC may not capture complete data lineage.
> > >
> > > Additionally, following your direction, we can decouple Spark from
> > > OpenLineage and make Polaris the only interface that Spark interacts
> > > with.
> > >
> > > It is my pleasure to collaborate on this project. I would like to make
> > > sure we’re aligned before proceeding, so please correct me if I’m wrong.
> > >
> > > Does this mean that we will implement the OL REST API in Polaris?
> > >
> > > For example, suppose there is an OpenLineage server running on port 5000
> > > and a Polaris server running on port 8000. The user then needs to do two
> > > things: first, install `openlineage-spark-*.jar` [1], which stays as-is
> > > for the user; second, set the host to port 8000 (the Polaris server).
> > >
> > > If my understanding is correct, would you be willing to share the
> > > proposal you have written, so that we do not step on each other’s work?
> > >
> > > Thanks!
> > >
> > > [1]
> > > https://openlineage.io/docs/integrations/spark/developing/built_in_lineage/
> > >
> > > Best regards,
> > >
> > > ITing
> > >
> > >
> > > Adnan Hemani via dev <[email protected]> wrote on Tue, Apr 21, 2026 at
> > > 10:49 AM:
> > >
> > > > Hi ITing,
> > > >
> > > > I'm glad you are looking into this - I was also informally/verbally
> > > > floating a proposal with some community members during Iceberg Summit
> > > > earlier this month regarding OpenLineage support. Specifically, the
> > > > proposal was to open Polaris as an OpenLineage (OL) server
> > > > implementation (with exposed OL APIs) that could either retain OL
> > > > information internally (as a "lite" version offering through the
> > > > Polaris Persistence layer) or reflect it downstream to an OL server
> > > > like Marquez for a richer feature set (up to admin preference). All
> > > > compute engines would then call Polaris' OpenLineage APIs (through
> > > > their respective OL connectors) to keep Polaris as a single pane of
> > > > glass for metadata. I was working on that proposal (albeit slowly
> > > > :') ), but I'm happy to merge proposals as co-authors if you are open
> > > > to it!
> > > >
> > > > I read through the proposal and skimmed through the POC code; I don't
> > > > think this is the correct approach to solving this problem, because
> > > > the base assumption your code makes is that all previously loaded
> > > > tables within a Spark application contributed to the creation of a new
> > > > table.
> > > > This is a very simplistic heuristic that easily breaks in simple
> > > > cases, such as when many tables are read within a single Spark
> > > > application but only a handful of them are used to generate a new
> > > > table. You've correctly pointed out the flaw in Question 10.2: Polaris
> > > > cannot determine what populated a table based on IRC calls alone. This
> > > > is why I aimed to propose the idea above - have the engines send this
> > > > information directly to Polaris. This mirrors how other OL servers
> > > > rely on the engine's OL connectors to create this information, as the
> > > > engine is the sole source of truth regarding the user's actual intent.
> > > > Let me know your thoughts.
> > > >
> > > > Best,
> > > > Adnan Hemani
> > > >
> > > > On Sun, Apr 19, 2026 at 9:56 PM Jean-Baptiste Onofré <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi ITing,
> > > > >
> > > > > Thanks for your message!
> > > > >
> > > > > For context, we discussed OpenLineage support a while ago,
> > > > > specifically regarding the initial support of OpenLineage events.
> > > > >
> > > > > I will take a look at your proposal and design document, and I am
> > > > > sure Mike will as well.
> > > > >
> > > > > Regards,
> > > > > JB
> > > > >
> > > > > On Mon, Apr 20, 2026 at 3:08 AM ITing Lee <[email protected]> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I’ve been working on OpenLineage support in Polaris and wanted to
> > > > > > ask for feedback on the current implementation approach and where
> > > > > > it should go next.
> > > > > >
> > > > > > The current direction is to emit OpenLineage events from
> > > > > > Polaris-managed operations and publish them through an
> > > > > > event-listener-based design.
> > > > > > Before moving further, I’d like to get suggestions on whether this
> > > > > > implementation model makes sense and what should be adjusted early.
> > > > > >
> > > > > > Below are the links to the proposal and PoC branch.
> > > > > >
> > > > > > - Proposal:
> > > > > >   https://docs.google.com/document/d/1NEPzcMIcbxKEvBIWd6SJGllRIu4brjV-23xO5Qp3DlM/edit?tab=t.0
> > > > > > - PoC branch: https://github.com/iting0321/polaris/tree/data-lineage
> > > > > >
> > > > > > I’d appreciate feedback on the implementation direction and any
> > > > > > suggestions before continuing further.
> > > > > >
> > > > > > Best regards,
> > > > > > ITing
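
For readers following the thread, the gateway model discussed above (authenticate the caller, enrich datasets with catalog metadata, then either keep the event or forward it downstream) could be sketched roughly as follows. This is only an illustration under stated assumptions: `handle_event`, `enrich`, `VALID_TOKENS`, `CATALOG`, and the `polaris_catalog` facet name are hypothetical and are not Polaris or OpenLineage APIs. Only the `inputs`/`outputs` dataset lists with `namespace`, `name`, and `facets` follow the OpenLineage event shape.

```python
from typing import Callable, Optional

# Hypothetical in-memory stand-ins for auth and catalog state.
# None of these names come from Polaris; they exist only for this sketch.
VALID_TOKENS = {"engine-token"}
CATALOG = {
    ("prod", "db.orders"): {"tableUuid": "hypothetical-uuid-1", "snapshotId": 42},
}


def enrich(dataset: dict) -> dict:
    """Return a copy of an OpenLineage dataset with catalog metadata attached."""
    entry = CATALOG.get((dataset["namespace"], dataset["name"]))
    if entry is None:
        return dict(dataset)  # unknown to the catalog: pass through unchanged
    facets = dict(dataset.get("facets", {}))
    facets["polaris_catalog"] = dict(entry)  # hypothetical custom facet name
    return {**dataset, "facets": facets}


def handle_event(
    event: dict,
    token: str,
    store: list,
    forward: Optional[Callable[[dict], None]] = None,
) -> dict:
    """Authenticate, enrich, then forward downstream or keep locally ("lite" mode)."""
    if token not in VALID_TOKENS:
        raise PermissionError("lineage ingestion not authorized")
    enriched = {
        **event,
        "inputs": [enrich(d) for d in event.get("inputs", [])],
        "outputs": [enrich(d) for d in event.get("outputs", [])],
    }
    if forward is not None:
        forward(enriched)       # pass-through to a full OL backend such as Marquez
    else:
        store.append(enriched)  # retained by the gateway itself
    return enriched
```

In this sketch the engine remains the source of the lineage facts; the gateway only adds what the catalog knows and decides where the event goes, which matches the ingestion, normalization, and forwarding scope suggested in the thread.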
