Hi Adnan Hemani, I totally agree with you on these two questions. If I have further questions, I will comment directly on the doc.
Thanks for your feedback!

Best regards,
ITing

On Thu, Apr 23, 2026 at 9:52 AM, Adnan Hemani via dev <[email protected]> wrote:

> Hi ITing,
>
> Thanks for your thoughts - please feel free to comment directly on the doc as well, as others on the email thread cannot yet see the document and may get confused by this exchange :) I am working to release it ASAP, but I need a few days (lots on my plate). If you would be willing to make suggestions and any required edits to the doc, that would be greatly appreciated!
>
> Responding to your comments:
>
> 1) This is a great starting point for this discussion, but I don't agree with the suggestion. Here's why:
> * For the general OL event, we typically see no more than 1-5 inputs and/or outputs, so our N x M fanout is typically not super high to begin with. But more importantly, these edges will be upserted. If the 1000th event states the same A -> B relationship between tables, the system only needs to check whether that relationship already exists. The database rows do not grow infinitely. However, if we store events directly (which is what I understood from your email - please correct me if I'm wrong), this database table will grow constantly and may require regular compaction/de-duplication, especially if there are many frequent jobs running.
> * Given that OL requests are not in the processing phase of a query, taking a little extra time here is not an extreme concern.
> * We lose column-level lineage information with this approach, e.g. answering which datasets computed column X within table A.
> * Array datatypes aren't ubiquitous across RDBMSs, which could potentially break our database-level interfacing via JDBC. There might be a way around this, but the question is: at what cost are we willing to make this happen?
>
> 2) This is a fair suggestion - but I may have a suggestion to make this even simpler. First, pass the data through to the external OL server, and then save it to the Polaris dataset.
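The upsert behavior described in (1) can be illustrated with a small sketch (illustrative Python, not Polaris code; names are made up): repeated events asserting the same table-to-table relationship leave the edge store bounded by the number of distinct relationships, while an append-only raw-event log grows with every event.

```python
from itertools import product

def ingest_edges(edge_store: dict, inputs: list, outputs: list, run_id: str) -> None:
    """Upsert the N x M input/output edges from one OL event.

    The key is the (input, output) pair, so re-stating an existing
    relationship only refreshes its metadata instead of adding a row.
    """
    for src, dst in product(inputs, outputs):
        edge_store[(src, dst)] = {"last_seen_run": run_id}

edges: dict = {}
event_log: list = []  # the store-raw-events alternative, for comparison

# 1000 runs of the same job, each declaring tables A -> B.
for i in range(1000):
    ingest_edges(edges, ["A"], ["B"], run_id=f"run-{i}")
    event_log.append({"inputs": ["A"], "outputs": ["B"], "run": f"run-{i}"})

print(len(edges))      # 1: bounded by distinct relationships
print(len(event_log))  # 1000: grows with every event
```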
> This way, Polaris never has "phantom" relations that aren't reflected in the external OL server. Alternatively, we could turn off Polaris' lineage database whenever an external OL server is configured, and punt on this issue unless someone requires it. WDYT?
>
> -Adnan
>
> On Wed, Apr 22, 2026 at 9:23 AM ITing Lee <[email protected]> wrote:
>
> > Hi Adnan Hemani,
> >
> > I have two questions about this proposal after careful thought.
> >
> > 1. For the Persistence Layer Data Model design, would it be better to have a single table with raw OL events and the necessary columns as indexes for landing purposes, to avoid the N x M write amplification at ingestion time? More specifically, would having the following 4 columns in an "ol_event" table be sufficient?
> >
> > - `input_ids: ARRAY[str]`
> > - `output_ids: ARRAY[str]`
> > - `column_fields: ARRAY[str]`
> > - `raw_event: JSON`
> >
> > From my perspective, scaling writes is more difficult than scaling reads. The N x M edge permutation on the write path seems too heavy for ingestion; it is more like a transformation for read-time optimization.
> >
> > The single-table approach with the ARRAY-type columns as indexes makes the raw_event column queryable for our public interface, and writing each row should be fast and much simpler at ingestion time. We could then compute the edges at read time. WDYT?
> >
> > 2. In the "Processing Pipeline" section, I would like to double-check the independence of the downstream backend, as we're using a dual-write between two services in our design. Perhaps we could have a `state` field for audit-log purposes that is set to `pending` in the same transaction that writes the "ol_event" table, then set to `done` after success.
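The `state`-field handshake described in question 2 could be sketched as follows (illustrative Python with an in-memory table and a pluggable forwarder; a real implementation would use the Polaris persistence layer and an HTTP client for the OL backend):

```python
import uuid

PENDING, DONE = "pending", "done"

def ingest_event(ol_table: dict, forward, raw_event: dict) -> str:
    """Dual-write with an audit `state` field.

    1. Persist the event with state=pending (one transaction).
    2. Forward it to the external OL backend.
    3. Mark it done only after the forward succeeds.
    """
    event_id = str(uuid.uuid4())
    ol_table[event_id] = {"raw_event": raw_event, "state": PENDING}
    forward(raw_event)                 # may raise; the row stays pending
    ol_table[event_id]["state"] = DONE
    return event_id

def retry_pending(ol_table: dict, forward) -> int:
    """Recovery path: re-forward anything left pending after a crash."""
    retried = 0
    for row in ol_table.values():
        if row["state"] == PENDING:
            forward(row["raw_event"])
            row["state"] = DONE
            retried += 1
    return retried

# Demo: the first forward fails before reaching the OL backend.
table: dict = {}
delivered: list = []
calls = {"n": 0}

def flaky_forward(event: dict) -> None:
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("OL backend unreachable")
    delivered.append(event)

try:
    ingest_event(table, flaky_forward, {"job": "daily_etl"})
except ConnectionError:
    pass  # the row is persisted with state=pending

retry_pending(table, flaky_forward)  # reconciles Polaris and the OL backend
```

Note that this makes the forward at-least-once, not exactly-once, so the OL backend should tolerate duplicate events (e.g. via an event ID).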
> > So if the server is force-killed between the first transaction and the hand-off to the OL backend, the user (or the client) could retry and make the state consistent between Polaris and the OL backend. Thanks!
> >
> > Best regards, ITing
> >
> > On Wed, Apr 22, 2026 at 1:13 AM, Adnan Hemani via dev <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > Great thoughts!
> > >
> > > I-Ting, you're exactly right. I will share the proposal to your email on Google Drive (it still needs a bit of work, but it's about 90% complete right now). I'd love to get your feedback and work on the doc!
> > >
> > > Yufei, I 100% agree with your feedback as well. The proposal's current structure highlights a few advantages:
> > > 1. Centralized auth for the OL server: Marquez, the reference OL implementation, does not have an auth system. Polaris can provide significant value here immediately.
> > > 2. OL, in general, has many concepts, like Jobs/Runs/etc. I think it is likely that some customers do not need that level of lineage tracking and would rather not deal with the additional maintenance of yet another web server. Polaris can provide a "lite" version of data lineage OOTB so that users don't need to install a full-functionality OL server if they don't require it. But Polaris should be able to support a pass-through model to a full-functionality OL server in case users still want to use Polaris as the "single pane of glass".
> > > 3. Although the proposal does not explicitly state catalog-enhanced data enrichment for the OL data that Polaris could send downstream (to keep the scope of the initial changes focused), this proposal would help us unlock this functionality. This is something I discussed with Michael Collado earlier :)
> > >
> > > As next steps, I will work with I-Ting to introduce this proposal - perhaps later this week!
> > >
> > > Best,
> > > Adnan Hemani
> > >
> > > On Tue, Apr 21, 2026 at 9:45 AM Yufei Gu <[email protected]> wrote:
> > >
> > > > Agreed: engines provide more accurate lineage information.
> > > >
> > > > I think there are two viable architectures here:
> > > >
> > > > - Engines send OpenLineage events directly to a lineage backend (e.g. Marquez)
> > > > - Engines send events to Polaris, which acts as an ingestion layer (and optionally forwards downstream)
> > > >
> > > > The first option is the standard and simplest model. It avoids extra hops and works well when lineage collection is the only goal. The second option becomes interesting only if Polaris adds clear control-plane value, for example:
> > > >
> > > > - central auth and policy for lineage ingestion
> > > > - catalog-aware enrichment (stable table identity, namespace, snapshots, federation context)
> > > > - decoupling engines from the downstream lineage backend
> > > >
> > > > If Polaris is only acting as a pass-through, the extra layer likely does not justify the added complexity. However, if we position Polaris as a catalog-aware lineage gateway, the second model aligns well with the "single pane of glass" direction. I think this is a direction the Polaris community can pursue. My inclination is to focus Polaris on ingestion, normalization, and forwarding, and to avoid expanding into a full lineage storage or analytics system.
> > > >
> > > > Curious to hear others' thoughts.
> > > >
> > > > Yufei
> > > >
> > > > On Tue, Apr 21, 2026 at 7:52 AM ITing Lee <[email protected]> wrote:
> > > >
> > > > > Hi Adnan Hemani,
> > > > >
> > > > > Yes, I agree with your point of view. After thinking it through, my POC might fail to capture complete data lineage.
> > > > >
> > > > > Additionally, following your direction, we can decouple Spark from OpenLineage and make Polaris the only interface that Spark interacts with.
> > > > >
> > > > > It is my pleasure to collaborate on this project, and I would like to make sure we're aligned before proceeding - please correct me if I'm wrong.
> > > > >
> > > > > Does this mean that we will implement the OL REST API in Polaris?
> > > > >
> > > > > For example, say there's an OpenLineage server running on port 5000 and a Polaris server running on port 8000. What the user needs to do is: first, install `openlineage-spark-*.jar` [1], which stays as-is for the user; second, set the host to port 8000.
> > > > >
> > > > > If my understanding is correct, would you share the proposal you have drafted, so that we don't step on each other's toes?
> > > > >
> > > > > Thanks!
> > > > >
> > > > > [1] https://openlineage.io/docs/integrations/spark/developing/built_in_lineage/
> > > > >
> > > > > Best regards,
> > > > >
> > > > > ITing
> > > > >
> > > > > On Tue, Apr 21, 2026 at 10:49 AM, Adnan Hemani via dev <[email protected]> wrote:
> > > > >
> > > > > > Hi ITing,
> > > > > >
> > > > > > I'm glad you are looking into this - I was also informally/verbally floating a proposal with some community members during Iceberg Summit earlier this month regarding OpenLineage support. Specifically, the proposal was to open Polaris as an OpenLineage (OL) server implementation (with exposed OL APIs) that could either retain OL information internally (as a "lite" version offering through the Polaris Persistence layer) or reflect it downstream to an OL server like Marquez for a richer feature set (up to admin preference).
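The client-side setup sketched in this exchange (keep the stock OpenLineage Spark connector, repoint its endpoint at Polaris) might look roughly like the following. The listener class and `spark.openlineage.transport.*` keys come from the standard OpenLineage Spark integration; the Polaris host, port, namespace, and package version are placeholders, and the exact endpoint path Polaris would expose is an open design question:

```shell
# Hypothetical: point the OpenLineage Spark connector at a Polaris server
# on port 8000 instead of a standalone OL server on port 5000.
spark-submit \
  --packages io.openlineage:openlineage-spark_2.12:1.x.x \
  --conf spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener \
  --conf spark.openlineage.transport.type=http \
  --conf spark.openlineage.transport.url=http://polaris-host:8000 \
  --conf spark.openlineage.namespace=my_namespace \
  my_job.py
```

Nothing here changes on the engine side relative to a normal OL deployment, which is the appeal of the pass-through model.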
> > > > > > All compute engines would then call Polaris' OpenLineage APIs (through their respective OL connectors) to keep Polaris as a single pane of glass for metadata. I was working on that proposal (albeit slowly :') ), but I'm happy to merge proposals as co-authors if you are open to it!
> > > > > >
> > > > > > I read through the proposal and skimmed through the POC code; I don't think this is a correct approach to solve this problem. The base assumption your code makes is that all previously loaded tables within a Spark application contributed to the creation of a new table. This is a very simplistic heuristic that easily breaks in simple cases, such as when many tables are read within a single Spark application but only a handful of them are used to generate a new table. You've correctly pointed out the flaw in Question 10.2: Polaris cannot determine what populated a table based on IRC calls alone. This is why I aimed to propose the idea above - have the engines send this information directly to Polaris. This mirrors how other OL servers rely on the engines' OL connectors to create this information, as the engine is the sole source of truth regarding the user's actual intent. Let me know your thoughts.
> > > > > >
> > > > > > Best,
> > > > > > Adnan Hemani
> > > > > >
> > > > > > On Sun, Apr 19, 2026 at 9:56 PM Jean-Baptiste Onofré <[email protected]> wrote:
> > > > > >
> > > > > > > Hi ITing,
> > > > > > >
> > > > > > > Thanks for your message!
> > > > > > >
> > > > > > > For context, we discussed OpenLineage support a while ago, specifically regarding the initial support of OpenLineage events.
> > > > > > >
> > > > > > > I will take a look at your proposal and design document, and I am sure Mike will as well.
> > > > > > >
> > > > > > > Regards,
> > > > > > > JB
> > > > > > >
> > > > > > > On Mon, Apr 20, 2026 at 3:08 AM ITing Lee <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > I've been working on OpenLineage support in Polaris and wanted to ask for feedback on the current implementation approach and where it should go next.
> > > > > > > >
> > > > > > > > The current direction is to emit OpenLineage events from Polaris-managed operations and publish them through an event-listener-based design. Before moving further, I'd like to get suggestions on whether this implementation model makes sense and what should be adjusted early.
> > > > > > > >
> > > > > > > > Below are the links to the proposal and PoC branch.
> > > > > > > >
> > > > > > > > - Proposal: https://docs.google.com/document/d/1NEPzcMIcbxKEvBIWd6SJGllRIu4brjV-23xO5Qp3DlM/edit?tab=t.0
> > > > > > > > - PoC branch: https://github.com/iting0321/polaris/tree/data-lineage
> > > > > > > >
> > > > > > > > I'd appreciate feedback on the implementation direction and any suggestions before continuing further.
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > ITing
