gengliangwang opened a new pull request, #55588:
URL: https://github.com/apache/spark/pull/55588

   ### What changes were proposed in this pull request?
   
   Re-apply of #55508 (commit 881957a, which was reverted in fe6051a) plus a 
fix for the `ProtoToParsedPlanTestSuite` failures that triggered the revert.
   
   The original commit introduces the `ResolveChangelogTable` analyzer rule 
that post-processes a resolved `DataSourceV2Relation(ChangelogTable)` to inject 
carry-over removal and/or update detection, fused into a single pass over a 
`(rowId, _commit_version)`-partitioned Window. To prevent silent wrong results, 
it also includes an explicit rejection path for streaming CDC reads that would 
require post-processing.
   
   Included changes (re-applied from #55508):
   - `ResolveChangelogTable` analyzer rule:
     - **Batch**: applies the requested post-processing transformations. 
Carry-over removal is a `Filter` on the Window (drops CoW pairs where 
`min(rowVersion) == max(rowVersion)`). Update detection is a `CASE WHEN` over 
delete/insert counts (relabels pairs as `update_preimage` / 
`update_postimage`). The two passes are fused into a single Window.
     - **Streaming**: throws 
`INVALID_CDC_OPTION.STREAMING_POST_PROCESSING_NOT_SUPPORTED` when the requested 
options would need post-processing. Streams that don't need post-processing 
pass through unchanged.
     - **Net changes**: throws 
`INVALID_CDC_OPTION.NET_CHANGES_NOT_YET_SUPPORTED` for both batch and streaming.
   - Option validation: throws 
`INVALID_CDC_OPTION.UPDATE_DETECTION_REQUIRES_CARRY_OVER_REMOVAL` when 
`computeUpdates = true` is combined with a carry-over-surfacing connector and 
`deduplicationMode = none`.
   - Runtime guard: 
`INVALID_CDC_OPTION.UNEXPECTED_MULTIPLE_CHANGES_PER_ROW_VERSION` when the 
connector emits more than one delete or insert for the same `(rowId, 
_commit_version)` partition.
   - `Analyzer`: register the rule after `ResolveRelations`.
   - `InMemoryChangelogCatalog`: `ChangelogProperties` extension so tests can 
configure post-processing scenarios without a real connector.
   
   Additional changes in this PR that were missing from the original commit 
(the cause of the revert):
   - `PlanGenerationTestSuite`: switch the `read changes with options` test 
from `deduplicationMode = netChanges` to `dropCarryovers` since `netChanges` is 
now rejected up-front by the new rule (`NET_CHANGES_NOT_YET_SUPPORTED`).
   - Regenerate the corresponding `read_changes_with_options.{json,proto.bin}` 
query inputs.
   - Regenerate `streaming_changes_API_with_options.explain` golden file to 
include the new `resolved: Boolean` field added to `ChangelogTable`.
   
   ### Why are the changes needed?
   
   Currently, `CHANGES FROM VERSION ... WITH (deduplicationMode = ..., 
computeUpdates = ...)` parses the options, but they are silently ignored — 
connector output is returned raw. This PR wires the options to their actual 
semantics for batch reads, and prevents silent wrong results for streaming 
reads.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, for CDC queries against a `Changelog` connector. See the original PR 
#55508 description for a before/after example.
   
   ### How was this patch tested?
   
   - `ResolveChangelogTablePostProcessingSuite` exercises the batch rule 
end-to-end via SQL against `InMemoryChangelogCatalog` (carry-over removal, 
update detection, their interaction across the option and connector-flag 
matrix, data-column handling with mixed types, and plan-shape invariants).
   - `ChangelogResolutionSuite` adds streaming-rejection cases for the two 
capability flags that would require post-processing.
   - `ProtoToParsedPlanTestSuite` (724/724) — the suite that previously failed 
and led to the revert.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Opus 4.7


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to