Re: SPIP: Auto CDC support for Apache Spark

vaquar khan Sat, 28 Mar 2026 16:42:46 -0700

Hi,
Thanks for the SPIP. I fully support the goal: abstracting CDC merge logic
is a huge win for the community. However, looking at the current Spark
versions, there are significant architectural gaps between Databricks
Lakeflow's proprietary implementation and OSS Spark.


A few technical blockers need clarification before we move forward:

- OSS Compatibility: Databricks documentation explicitly states that the
AUTO CDC APIs are not supported by Apache Spark Declarative Pipelines
<https://docs.databricks.com/gcp/en/ldp/cdc>.

- Streaming MERGE: The proposed flow requires continuous upsert/delete
semantics, but Dataset.mergeInto() currently does not support streaming
queries. Does this SPIP introduce an entirely new execution path to bypass
this restriction?

- Tombstone Garbage Collection: Handling stream deletes safely requires
state store tombstone retention (e.g., configuring
pipelines.cdc.tombstoneGCThresholdInSeconds) to prevent late-arriving data
from resurrecting deleted keys. How will this be implemented natively in
OSS Spark state stores?

- Sequencing Constraints: SEQUENCE BY enforces strict ordering where NULL
sequencing values are explicitly not supported. How will the engine handle
malformed or non-monotonic upstream sequences compared to our existing
time-based watermarks?

- Phased Delivery: Given the massive surface area (new SQL DDL, streaming MERGE paths, SCD
Type 1/2 state logic, tombstone GC), a phased delivery plan would be very
helpful. It would also clarify exactly which Lakeflow components are being
contributed to open-source versus what needs to be rebuilt from scratch.
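
To make the sequencing and tombstone questions concrete, here is a minimal
pure-Python sketch of the SCD Type 1 semantics I have in mind. It is not Spark
code and not the SPIP's design. The names (apply_change, gc_tombstones,
live_rows, retention) are hypothetical, sequence values are assumed to be
integers, and retention is measured in sequence units rather than seconds to
keep the example small.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entry:
    seq: int             # sequence value of the last applied change
    row: Optional[dict]  # None marks a tombstone (a deleted key)


def apply_change(state: Dict[str, Entry], key: str, seq: Optional[int],
                 row: Optional[dict], is_delete: bool = False) -> bool:
    """Apply one change event; return True if the state was updated."""
    if seq is None:
        # Mirrors the rule that NULL sequencing values are not supported.
        raise ValueError(f"NULL sequence value for key {key!r}")
    current = state.get(key)
    if current is not None and seq <= current.seq:
        # Late or duplicate event: an equal or newer change already won.
        return False
    state[key] = Entry(seq, None if is_delete else row)
    return True


def gc_tombstones(state: Dict[str, Entry], max_seq_seen: int,
                  retention: int) -> int:
    """Drop tombstones older than `retention`; return how many were dropped."""
    expired = [k for k, e in state.items()
               if e.row is None and max_seq_seen - e.seq > retention]
    for k in expired:
        del state[k]
    return len(expired)


def live_rows(state: Dict[str, Entry]) -> Dict[str, dict]:
    """Return the target table as seen by readers (tombstones hidden)."""
    return {k: e.row for k, e in state.items() if e.row is not None}
```

In this sketch, a delete at sequence 3 followed by a late upsert at sequence 2
is ignored while the tombstone is retained. Once gc_tombstones has dropped the
tombstone, the same late upsert is accepted and the deleted key reappears,
which is the resurrection case the retention threshold has to guard against.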


Best regards,
Viquar Khan

On Sat, 28 Mar 2026 at 08:35, 陈 小健 <[email protected]> wrote:

> unsubscribe
>
> Get Outlook for Android <https://aka.ms/AAb9ysg>
> ------------------------------
> *From:* Andreas Neumann <[email protected]>
> *Sent:* Saturday, March 28, 2026 2:43:54 AM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: SPIP: Auto CDC support for Apache Spark
>
> Hi Vaibhav,
>
> The goal of this proposal is not to replace MERGE but to provide a simple
> abstraction for the common use case of CDC.
> MERGE itself is a very powerful operator and there will always be use
> cases outside of CDC that will require MERGE.
>
> And thanks for spotting the typo in the SPIP. It is fixed now!
>
> Cheers -Andreas
>
>
> On Fri, Mar 27, 2026 at 10:53 AM Vaibhav Kumar <[email protected]>
> wrote:
>
> Hi Andrew,
>
> Thanks for sharing the SPIP. Does that mean the MERGE statement would be
> deprecated? Also, I think there was a small typo; I have suggested a fix
> in the doc.
>
> Regards,
> Vaibhav
>
> On Fri, Mar 27, 2026 at 10:15 AM DB Tsai <[email protected]> wrote:
>
> +1
>
> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>
> On Mar 26, 2026, at 6:08 PM, Andreas Neumann <[email protected]> wrote:
>
> Hi all,
>
> I’d like to start a discussion on a new SPIP to introduce Auto CDC
> support to Apache Spark.
>
>    - SPIP Document:
>      https://docs.google.com/document/d/1Hp5BGEYJRHbk6J7XUph3bAPZKRQXKOuV1PEaqZMMRoQ/
>    - JIRA: <https://issues.apache.org/jira/browse/SPARK-55668>
>      https://issues.apache.org/jira/browse/SPARK-5566
>
> Motivation
>
> With the upcoming introduction of standardized CDC support
> <https://issues.apache.org/jira/browse/SPARK-55668>, Spark will soon have
> a unified way to produce change data feeds. However, consuming these
> feeds and applying them to a target table remains a significant challenge.
>
> Common patterns like SCD Type 1 (maintaining a 1:1 replica) and SCD Type 2
> (tracking full change history) often require hand-crafted, complex MERGE
> logic. In distributed systems, these implementations are frequently
> error-prone when handling deletions or out-of-order data.
>
> Proposal
>
> This SPIP proposes a new "Auto CDC" flow type for Spark. It encapsulates
> the complex logic for SCD types and out-of-order data, allowing data
> engineers to configure a declarative flow instead of writing manual MERGE
> statements. This feature will be available in both Python and SQL.
>
> Example SQL:
> -- Produce a change feed
> CREATE STREAMING TABLE cdc.users AS
> SELECT * FROM STREAM my_table CHANGES FROM VERSION 10;
>
> -- Consume the change feed
> CREATE FLOW flow
> AS AUTO CDC INTO
>   target
> FROM stream(cdc.users)
>   KEYS (userId)
>   APPLY AS DELETE WHEN operation = "DELETE"
>   SEQUENCE BY sequenceNum
>   COLUMNS * EXCEPT (operation, sequenceNum)
>   STORED AS SCD TYPE 2
>   TRACK HISTORY ON * EXCEPT (city);
>
>
> Please review the full SPIP for the technical details. Looking forward to
> your feedback and discussion!
>
> Best regards,
>
> Andreas
>
>
>
