SanJSp opened a new pull request, #55583: URL: https://github.com/apache/spark/pull/55583
### What changes were proposed in this pull request?

This PR adds the `netChanges` deduplication mode to the `ResolveChangelogTable` analyzer rule (SPARK-55668 / #55668). When a CDC read sets `deduplicationMode = 'netChanges'`, intermediate changes per row identity are collapsed into a single net effect, per the SPIP `Deduplication Semantics`.

Implementation pipeline: `Window` (per-`rowId` aggregates: row number, row count, first/last `_change_type`) → `Filter` (keep first and/or last row per partition) → `Project` (relabel `_change_type`, drop helper columns).

The 2x2 matrix on `existedBefore` (first event is `delete`/`update_preimage`) × `existsAfter` (last event is `insert`/`update_postimage`) determines the output:

| existedBefore | existsAfter | output                              |
|---------------|-------------|-------------------------------------|
| false         | false       | (cancel)                            |
| false         | true        | insert                              |
| true          | false       | delete                              |
| true          | true        | update_preimage + update_postimage  |

If `computeUpdates = false`, the `update_preimage` + `update_postimage` pair is emitted as `delete` + `insert` instead.

### Why are the changes needed?

This completes the net-change post-processing capability of the DSv2 CDC API per the SPIP. Without it, connectors that surface intermediate changes cannot expose a deduplicated change feed to users via the standard CDC API.

### Does this PR introduce _any_ user-facing change?

Yes. Requesting `deduplicationMode = 'netChanges'` on a CDC read now produces a deduplicated change stream. Previously the same request was rejected up-front.

### How was this patch tested?

Added `ResolveChangelogTableNetChangesSuite`: a trait + 2 concrete suite classes (`...WithComputeUpdatesSuite`, `...WithoutComputeUpdatesSuite`) running the same 16-test body under both modes (32 invocations total). Coverage:

- 4 single-event tests (lone insert/delete across various range shapes).
- 9 matrix tests covering all (first_change_type, last_change_type) cells.
- 1 range-narrowing test (events outside the requested version range are not seen).
- 2 multi-rowId tests (independent partitions, mixed mode-dependent cells).

Removed 2 obsolete tests in `ResolveChangelogTablePostProcessingSuite` that asserted the previous "not supported" rejection.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-opus-4-7)
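
For readers who want to check the 2x2 matrix by hand, the following is a minimal, Spark-free Python sketch of the net-change semantics described above. It is not the PR's implementation, which is a Catalyst `Window` → `Filter` → `Project` plan. The function name `net_changes`, the `(row_id, change_type, payload)` tuple shape, and the `compute_updates` parameter are hypothetical and chosen only for illustration. The sketch also assumes that events arrive in commit order, that a net `delete` carries the first row of the partition, and that a net `insert` carries the last row; the description does not spell out those details.

```python
# Illustrative model of the netChanges matrix; all names here are hypothetical.
EXISTED_BEFORE = {"delete", "update_preimage"}   # first event implies the row pre-existed
EXISTS_AFTER = {"insert", "update_postimage"}    # last event implies the row survives


def net_changes(events, compute_updates=True):
    """Collapse per-row events into their net effect.

    events: iterable of (row_id, change_type, payload), assumed to be in commit order.
    Returns a list of (row_id, change_type, payload) tuples.
    """
    # Track the first and last event per row_id (the role of the Window step).
    first_last = {}
    for row_id, change_type, payload in events:
        event = (change_type, payload)
        if row_id not in first_last:
            first_last[row_id] = [event, event]
        else:
            first_last[row_id][1] = event

    out = []
    for row_id, (first, last) in first_last.items():
        existed_before = first[0] in EXISTED_BEFORE
        exists_after = last[0] in EXISTS_AFTER
        if existed_before and exists_after:
            # Net update; relabel as delete + insert when updates are not computed.
            if compute_updates:
                pre, post = "update_preimage", "update_postimage"
            else:
                pre, post = "delete", "insert"
            out.append((row_id, pre, first[1]))
            out.append((row_id, post, last[1]))
        elif existed_before:
            out.append((row_id, "delete", first[1]))   # assumption: keep the first row
        elif exists_after:
            out.append((row_id, "insert", last[1]))    # assumption: keep the last row
        # Neither flag set: the changes cancel and nothing is emitted.
    return out


sample_events = [
    (1, "insert", "a"), (1, "update_preimage", "a"), (1, "update_postimage", "b"),
    (2, "update_preimage", "x"), (2, "update_postimage", "y"), (2, "delete", "y"),
    (3, "insert", "t"), (3, "delete", "t"),
    (4, "delete", "p"), (4, "insert", "q"),
]
```

With `sample_events`, row 1 nets to an `insert`, row 2 to a `delete`, row 3 cancels, and row 4 nets to an update pair, or to `delete` + `insert` when `compute_updates=False`.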
