HI . Thanks for the SPIP. I fully support the goal-abstracting CDC merge logic is a huge win for the community. However, looking at the current Spark versions, there are significant architectural gaps between Databricks Lakeflow's proprietary implementation and OSS Spark.
A few technical blockers need clarification before we move forward: - OSS Compatibility: Databricks documentation explicitly states that the AUTO CDC APIs are not supported by Apache Spark Declarative Pipelines <https://docs.databricks.com/gcp/en/ldp/cdc>. - Streaming MERGE: The proposed flow requires continuous upsert/delete semantics, but Dataset.mergeInto() currently does not support streaming queries. Does this SPIP introduce an entirely new execution path to bypass this restriction? - Tombstone Garbage Collection: Handling stream deletes safely requires state store tombstone retention (e.g., configuring pipelines.cdc.tombstoneGCThresholdInSeconds) to prevent late-arriving data from resurrecting deleted keys. How will this be implemented natively in OSS Spark state stores? - Sequencing Constraints: SEQUENCE BY enforces strict ordering where NULL sequencing values are explicitly not supported. How will the engine handle malformed or non-monotonic upstream sequences compared to our existing time-based watermarks? - Given the massive surface area (new SQL DDL, streaming MERGE paths, SCD Type 1/2 state logic, tombstone GC), a phased delivery plan would be very helpful. It would also clarify exactly which Lakeflow components are being contributed to open-source versus what needs to be rebuilt from scratch. Best regards, Viquar Khan On Sat, 28 Mar 2026 at 08:35, 陈 小健 <[email protected]> wrote: > unsubscribe > > 获取Outlook for Android <https://aka.ms/AAb9ysg> > ------------------------------ > *From:* Andreas Neumann <[email protected]> > *Sent:* Saturday, March 28, 2026 2:43:54 AM > *To:* [email protected] <[email protected]> > *Subject:* Re: SPIP: Auto CDC support for Apache Spark > > Hi Vaibhav, > > The goal of this proposal is not to replace MERGE but to provide a simple > abstraction for the common use case of CDC. > MERGE itself is a very powerful operator and there will always be use > cases outside of CDC that will require MERGE. > > And thanks for spotting the typo in the SPIP. It is fixed now! > > Cheers -Andreas > > > On Fri, Mar 27, 2026 at 10:53 AM Vaibhav Kumar <[email protected]> > wrote: > > Hi Andrew, > > Thanks for sharing the SPIP, Does that mean the MERGE statement would be > deprecated? Also I think there was a small typo I have suggested in the > doc. > > Regards, > Vaibhav > > On Fri, Mar 27, 2026 at 10:15 AM DB Tsai <[email protected]> wrote: > > +1 > > DB Tsai | https://www.dbtsai.com/ | PGP 42E5B25A8F7A82C1 > > On Mar 26, 2026, at 6:08 PM, Andreas Neumann <[email protected]> wrote: > > Hi all, > > I’d like to start a discussion on a new SPIP to introduce Auto CDC > support to Apache Spark. > > - SPIP Document: > > https://docs.google.com/document/d/1Hp5BGEYJRHbk6J7XUph3bAPZKRQXKOuV1PEaqZMMRoQ/ > - > > JIRA: <https://issues.apache.org/jira/browse/SPARK-55668> > https://issues.apache.org/jira/browse/SPARK-5566 > > Motivation > > With the upcoming introduction of standardized CDC support > <https://issues.apache.org/jira/browse/SPARK-55668>, Spark will soon have > a unified way to produce change data feeds. However, consuming these > feeds and applying them to a target table remains a significant challenge. > > Common patterns like SCD Type 1 (maintaining a 1:1 replica) and SCD Type 2 > (tracking full change history) often require hand-crafted, complex MERGE > logic. In distributed systems, these implementations are frequently > error-prone when handling deletions or out-of-order data. > Proposal > > This SPIP proposes a new "Auto CDC" flow type for Spark. It encapsulates > the complex logic for SCD types and out-of-order data, allowing data > engineers to configure a declarative flow instead of writing manual MERGE > statements. This feature will be available in both Python and SQL. > Example SQL: > -- Produce a change feed > CREATE STREAMING TABLE cdc.users AS > SELECT * FROM STREAM my_table CHANGES FROM VERSION 10; > > -- Consume the change feed > CREATE FLOW flow > AS AUTO CDC INTO > target > FROM stream(cdc_data.users) > KEYS (userId) > APPLY AS DELETE WHEN operation = "DELETE" > SEQUENCE BY sequenceNum > COLUMNS * EXCEPT (operation, sequenceNum) > STORED AS SCD TYPE 2 > TRACK HISTORY ON * EXCEPT (city); > > > Please review the full SPIP for the technical details. Looking forward to > your feedback and discussion! > > Best regards, > > Andreas > > >
