SanJSp opened a new pull request, #55507:
URL: https://github.com/apache/spark/pull/55507

   ### What changes were proposed in this pull request? 
   This is **PR 1 of a split** of #55426 (see the [split 
suggestion](https://github.com/apache/spark/pull/55426#issuecomment-4292375876) 
for the full plan). The split PRs can merge in any order. 
   
   Validates the CDC metadata columns, row identity, and row versioning 
returned by a `Changelog` connector at relation construction time, and 
introduces a dedicated error class so that failures are reported at analysis 
time instead of surfacing later at execution time with a less helpful error. 
   
   - `ChangelogTable.validateSchema`: fail-fast checks that the connector 
schema contains the required metadata columns (`_change_type`, 
`_commit_version`, `_commit_timestamp`) and that, when the connector 
advertises a capability that requires row identity or row versioning, 
`rowId()` and `rowVersion()` are declared and the row version column is a 
non-nullable top-level column. Invoked from the `ChangelogTable` constructor. 
   - New error class `INVALID_CHANGELOG_SCHEMA` with sub-classes: 
       - `MISSING_COLUMN`, `INVALID_COLUMN_TYPE` 
       - `MISSING_ROW_ID`, `MISSING_ROW_VERSION`, `NESTED_ROW_VERSION`, 
`NULLABLE_ROW_VERSION` 
   - Matching `QueryCompilationErrors` helpers for each sub-class. 
   - Tests: `ChangelogResolutionSuite` schema-validation cases using a 
`TestChangelog` fixture. 
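
To illustrate the order of checks described above, here is a minimal, Spark-independent sketch in Python. It is not the code in this PR. The `Column` and `ChangelogTable` classes, the `needs_row_identity` flag, the dotted-name convention for nested references, and the expected types for `_change_type` and `_commit_timestamp` are all assumptions made for the example; only the column names and error sub-class names come from the description.

```python
from dataclasses import dataclass
from typing import Optional


class InvalidChangelogSchema(Exception):
    """Stand-in for the INVALID_CHANGELOG_SCHEMA error class."""

    def __init__(self, sub_class: str, detail: str):
        super().__init__(f"INVALID_CHANGELOG_SCHEMA.{sub_class}: {detail}")
        self.sub_class = sub_class


@dataclass(frozen=True)
class Column:
    name: str
    data_type: str
    nullable: bool = True


# Expected types are assumptions for this sketch. None means the connector
# picks the type (a connector-defined `_commit_version` type is accepted).
REQUIRED_METADATA = {
    "_change_type": "string",          # assumed type
    "_commit_version": None,
    "_commit_timestamp": "timestamp",  # assumed type
}


@dataclass
class ChangelogTable:
    columns: list
    needs_row_identity: bool = False   # hypothetical capability flag
    row_id: tuple = ()                 # referenced column names
    row_version: Optional[str] = None  # dotted name means nested (assumption)

    def __post_init__(self):
        # Validate at construction so a bad schema fails before execution.
        validate_schema(self)


def validate_schema(table: ChangelogTable) -> None:
    by_name = {c.name: c for c in table.columns}

    # 1. Required metadata columns must exist with an acceptable type.
    for name, expected in REQUIRED_METADATA.items():
        col = by_name.get(name)
        if col is None:
            raise InvalidChangelogSchema("MISSING_COLUMN", name)
        if expected is not None and col.data_type != expected:
            raise InvalidChangelogSchema(
                "INVALID_COLUMN_TYPE", f"{name}: {col.data_type}")

    # 2. Row identity and versioning are checked only when a capability needs them.
    if not table.needs_row_identity:
        return
    if not table.row_id:
        raise InvalidChangelogSchema("MISSING_ROW_ID", "rowId() is empty")
    if table.row_version is None:
        raise InvalidChangelogSchema("MISSING_ROW_VERSION", "rowVersion() is unset")
    if "." in table.row_version:
        raise InvalidChangelogSchema("NESTED_ROW_VERSION", table.row_version)
    version_col = by_name.get(table.row_version)
    if version_col is None:
        # Assumption: an unresolved reference is reported as a missing column.
        raise InvalidChangelogSchema("MISSING_COLUMN", table.row_version)
    if version_col.nullable:
        raise InvalidChangelogSchema("NULLABLE_ROW_VERSION", table.row_version)
```

In this sketch, constructing `ChangelogTable(columns, needs_row_identity=True)` without a `row_id` raises immediately with sub-class `MISSING_ROW_ID`, which mirrors the fail-fast behavior of validating in the constructor.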
   
   ### Why are the changes needed? 
   
   Gives connector implementors a clear analysis-time error message for 
malformed CDC schemas instead of an opaque execution-time failure. For 
background, see the original PR (#55426) and its [dev list 
thread](https://lists.apache.org/thread/dhxx6pohs7fvqc3knzhtoj4tbcgrwxts). 
   
   ### Does this PR introduce _any_ user-facing change? 
   
   Yes, for connector implementors. A connector that returns an invalid 
changelog schema (or advertises a capability that requires row identity/row 
versioning without declaring them) now fails at analysis time with 
`INVALID_CHANGELOG_SCHEMA.*` instead of at execution time. 
   
   ### How was this patch tested? 
   
   Added schema-validation cases to `ChangelogResolutionSuite` covering: a 
missing metadata column (each of `_change_type`, `_commit_version`, 
`_commit_timestamp`), a wrong data type, acceptance of a connector-defined 
`_commit_version` type, row-identity-required capabilities without 
rowId/rowVersion, a nested rowVersion, a nullable rowVersion, and valid 
schemas with data columns passing validation. 
   
   ### Was this patch authored or co-authored using generative AI tooling? 
   
   Generated-by: Claude Opus 4.7


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

