SanJSp opened a new pull request, #55507: URL: https://github.com/apache/spark/pull/55507
### What changes were proposed in this pull request?

This is **PR 1 of a split** of #55426 (see the [split suggestion](https://github.com/apache/spark/pull/55426#issuecomment-4292375876) for the full plan). It can be merged in any order.

This PR validates the CDC metadata columns, row identity, and row versioning returned by a `Changelog` connector at relation construction time. It also introduces a dedicated error class so that the failure is reported at analysis time rather than later at execution time with a less helpful error.

- `ChangelogTable.validateSchema`: fail-fast checks that the connector schema contains the required metadata columns (`_change_type`, `_commit_version`, `_commit_timestamp`) and that, when the connector advertises a capability requiring it, `rowId()` and `rowVersion()` are declared and the row version column is a non-nullable top-level column. Invoked from the `ChangelogTable` constructor.
- New error class `INVALID_CHANGELOG_SCHEMA` with sub-classes:
  - `MISSING_COLUMN`, `INVALID_COLUMN_TYPE`
  - `MISSING_ROW_ID`, `MISSING_ROW_VERSION`, `NESTED_ROW_VERSION`, `NULLABLE_ROW_VERSION`
- Matching `QueryCompilationErrors` helpers for each sub-class.
- Tests: `ChangelogResolutionSuite` schema-validation cases using a `TestChangelog` fixture.

### Why are the changes needed?

Gives connector implementors a clear analysis-time error message for misshapen CDC schemas instead of an opaque execution-time failure. For background, see the original PR and its [dev list thread](https://lists.apache.org/thread/dhxx6pohs7fvqc3knzhtoj4tbcgrwxts).

### Does this PR introduce _any_ user-facing change?

Yes, for connector implementors. A connector that returns an invalid changelog schema (or advertises a capability that requires row identity/row versioning without declaring them) now fails at analysis time with `INVALID_CHANGELOG_SCHEMA.*` instead of at execution time.

### How was this patch tested?

Added schema-validation cases to `ChangelogResolutionSuite` covering:

- missing metadata column (each of `_change_type`, `_commit_version`, `_commit_timestamp`)
- wrong data type
- connector-defined `_commit_version` type accepted
- row-identity-required capabilities without `rowId`/`rowVersion`
- nested `rowVersion`
- nullable `rowVersion`
- valid schemas with data columns pass

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7
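
For readers who want a feel for the fail-fast checks described for `ChangelogTable.validateSchema`, here is a rough, language-neutral model in Python. It is not Spark's implementation. The `Column` and `ChangelogSchemaError` types, the `validate_schema` and `failure` functions, the `requires_row_identity` flag, the expected type strings, the dotted-path convention for nested columns, and the order of the checks are all assumptions made for illustration. Only the metadata column names and the error sub-class names are taken from the PR description.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


class ChangelogSchemaError(Exception):
    """Hypothetical stand-in for INVALID_CHANGELOG_SCHEMA.<SUB_CLASS>."""

    def __init__(self, sub_class: str, detail: str):
        super().__init__(f"INVALID_CHANGELOG_SCHEMA.{sub_class}: {detail}")
        self.sub_class = sub_class


@dataclass(frozen=True)
class Column:
    name: str
    data_type: str
    nullable: bool = True


# Assumed expected types. None means any connector-defined type is accepted,
# mirroring the "connector-defined `_commit_version` type accepted" test case.
REQUIRED_METADATA = {
    "_change_type": "string",
    "_commit_version": None,
    "_commit_timestamp": "timestamp",
}


def validate_schema(
    columns: Sequence[Column],
    row_id: Sequence[str] = (),
    row_version: Optional[str] = None,
    requires_row_identity: bool = False,
) -> None:
    """Raise ChangelogSchemaError on the first violated rule (assumed order)."""
    by_name = {c.name: c for c in columns}

    # 1. Required metadata columns must exist, with the assumed types.
    for name, expected in REQUIRED_METADATA.items():
        col = by_name.get(name)
        if col is None:
            raise ChangelogSchemaError("MISSING_COLUMN", name)
        if expected is not None and col.data_type != expected:
            raise ChangelogSchemaError(
                "INVALID_COLUMN_TYPE",
                f"{name}: expected {expected}, got {col.data_type}",
            )

    # 2. Row identity and versioning matter only when a capability needs them.
    if not requires_row_identity:
        return
    if not row_id:
        raise ChangelogSchemaError("MISSING_ROW_ID", "rowId() is empty")
    if row_version is None:
        raise ChangelogSchemaError("MISSING_ROW_VERSION", "rowVersion() is unset")
    if "." in row_version:  # assumption: a dotted path denotes a nested field
        raise ChangelogSchemaError("NESTED_ROW_VERSION", row_version)
    version_col = by_name.get(row_version)
    if version_col is None:  # assumption: reported as a missing column
        raise ChangelogSchemaError("MISSING_COLUMN", row_version)
    if version_col.nullable:
        raise ChangelogSchemaError("NULLABLE_ROW_VERSION", row_version)


def failure(*args, **kwargs) -> Optional[str]:
    """Return the sub-class name of the first failure, or None if valid."""
    try:
        validate_schema(*args, **kwargs)
    except ChangelogSchemaError as err:
        return err.sub_class
    return None


if __name__ == "__main__":
    schema = [
        Column("id", "bigint", nullable=False),
        Column("_change_type", "string"),
        Column("_commit_version", "bigint", nullable=False),
        Column("_commit_timestamp", "timestamp"),
    ]
    print(failure(schema))
    print(failure(schema[:-1]))
    print(failure(schema, row_id=["id"], row_version="_commit_version",
                  requires_row_identity=True))
```

Raising on the first violation keeps the model close to the fail-fast behaviour the PR describes. A real implementation might instead collect every violation before reporting; the description does not say which approach is used.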
