SanJSp opened a new pull request, #55583:
URL: https://github.com/apache/spark/pull/55583

   ### What changes were proposed in this pull request?
   
   This PR adds the `netChanges` deduplication mode to the 
`ResolveChangelogTable` analyzer rule (SPARK-55668 / #55668). When a CDC read 
sets `deduplicationMode = 'netChanges'`, intermediate changes per row identity 
are collapsed into a single net effect, per the SPIP `Deduplication Semantics`.
   
   Implementation pipeline: `Window` (per-`rowId` aggregates: row number, row 
count, first/last `_change_type`) → `Filter` (keep first and/or last row per 
partition) → `Project` (relabel `_change_type`, drop helper columns).
   
   The 2x2 matrix on `existedBefore` (first event is 
`delete`/`update_preimage`) × `existsAfter` (last event is 
`insert`/`update_postimage`) determines the output:
   
   | existedBefore | existsAfter | output                              |
   |---------------|-------------|-------------------------------------|
   | false         | false       | (cancel)                            |
   | false         | true        | insert                              |
   | true          | false       | delete                              |
   | true          | true        | update_preimage + update_postimage  |
   
   If `computeUpdates = false`, the `update_preimage` + `update_postimage` pair 
is emitted as `delete` + `insert` instead.
   
   ### Why are the changes needed?
   
   This completes the net-change post-processing capability of the DSv2 CDC API 
per the SPIP. Without it, connectors that surface intermediate changes cannot 
expose a deduplicated change feed to users via the standard CDC API.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. Requesting `deduplicationMode = 'netChanges'` on a CDC read now 
produces a deduplicated change stream. Previously the same request was rejected 
up-front.
   
   ### How was this patch tested?
   
   Added `ResolveChangelogTableNetChangesSuite` — a trait + 2 concrete suite 
classes (`...WithComputeUpdatesSuite`, `...WithoutComputeUpdatesSuite`) running 
the same 16-test body under both modes (32 invocations total). Coverage:
   
   - 4 single-event tests (lone insert/delete across various range shapes).
   - 9 matrix tests covering all (first_change_type, last_change_type) cells.
   - 1 range-narrowing test (events outside the requested version range are not 
seen).
   - 2 multi-rowId tests (independent partitions, mixed mode-dependent cells).
   
   Removed 2 obsolete tests in `ResolveChangelogTablePostProcessingSuite` that 
asserted the previous "not supported" rejection.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (claude-opus-4-7)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to