nsivabalan opened a new pull request, #18885:
URL: https://github.com/apache/hudi/pull/18885
### Describe the issue this Pull Request addresses
The schema persisted under `HoodieCommitMetadata.SCHEMA_KEY` is expected to
be the user/write schema (without Hudi meta fields). The write path relied on
`config.getSchema()` being clean by convention, with no enforcement. Any
upstream mutation that sets a schema with meta fields on the write config
(e.g. compaction reader-schema setup, conflict resolution rewriting
`SCHEMA_KEY`, or reading a previously polluted `SCHEMA_KEY` back into the
config) therefore propagates the polluted schema into every subsequent
commit's `extraMetadata`, affecting both ingestion and clustering replace
commits.
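
To illustrate the propagation described above, here is a minimal, self-contained sketch. It is not Hudi code: the schema is modeled as a plain list of field names, and `SchemaPollutionSketch`, `commit`, and `withMetaFields` are hypothetical names invented for this example. The `"schema"` value used for `SCHEMA_KEY` is likewise assumed for illustration. Only the five meta column names are taken from Hudi itself.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SchemaPollutionSketch {
    // Stand-in for HoodieCommitMetadata.SCHEMA_KEY (value assumed for illustration).
    static final String SCHEMA_KEY = "schema";

    // The five standard Hudi meta columns.
    static final List<String> META_FIELDS = List.of(
        "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
        "_hoodie_partition_path", "_hoodie_file_name");

    // Hypothetical upstream mutation: prepends the meta fields to a schema.
    static List<String> withMetaFields(List<String> userSchema) {
        List<String> result = new ArrayList<>(META_FIELDS);
        result.addAll(userSchema);
        return result;
    }

    // Unsanitized write path: whatever the config holds is persisted verbatim.
    static Map<String, String> commit(List<String> configSchema) {
        Map<String, String> extraMetadata = new HashMap<>();
        extraMetadata.put(SCHEMA_KEY, String.join(",", configSchema));
        return extraMetadata;
    }

    public static void main(String[] args) {
        List<String> configSchema = List.of("id", "name", "ts");
        System.out.println("commit 1: " + commit(configSchema).get(SCHEMA_KEY));

        // An upstream step replaces the config schema with a polluted one.
        configSchema = withMetaFields(configSchema);
        Map<String, String> second = commit(configSchema);
        System.out.println("commit 2: " + second.get(SCHEMA_KEY));

        // Reading the persisted schema back into the config keeps the pollution,
        // so every later commit inherits it.
        configSchema = List.of(second.get(SCHEMA_KEY).split(","));
        System.out.println("commit 3: " + commit(configSchema).get(SCHEMA_KEY));
    }
}
```

Because nothing on this path checks the schema, a single polluted write is enough to affect all later commits.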
### Summary and Changelog
- Centralize sanitization in `CommitUtils.sanitizeSchemaForCommitMetadata`
and route the existing `buildMetadata` `SCHEMA_KEY` write through it, so all
commit paths (`BaseHoodieWriteClient.commitStats` for ingestion, Spark / Java /
Flink commit-action executors, the `executeClustering` fallback) are clean by
construction.
- Added two functional tests in `TestSparkSortAndSizeClustering`:
  - `testReplaceCommitSchemaHasNoMetaFields`: asserts the schema in a
    clustering replace commit has no Hudi meta fields.
  - `testCommitSchemaCleanedEvenWhenConfigSchemaHasMetaFields`:
    pre-pollutes `config.getSchema()` with meta fields and asserts both the
    ingestion and the clustering replace commit persist a clean schema.
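
The idea of sanitizing at a single choke point can be sketched as follows. This is a simplified illustration, not the implementation in this PR: the real `CommitUtils.sanitizeSchemaForCommitMetadata` is not shown in this description, so the sketch models a schema as a list of field names, and `SchemaSanitizerSketch`, `stripMetaFields`, and the `buildMetadata` body are hypothetical stand-ins. The `"schema"` key value is also assumed for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SchemaSanitizerSketch {
    // Stand-in for HoodieCommitMetadata.SCHEMA_KEY (value assumed for illustration).
    static final String SCHEMA_KEY = "schema";

    // The five standard Hudi meta columns.
    static final Set<String> META_FIELDS = Set.of(
        "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
        "_hoodie_partition_path", "_hoodie_file_name");

    // Drops meta fields and preserves the order of the remaining fields.
    // Idempotent: applying it to an already clean schema returns an equal list.
    static List<String> stripMetaFields(List<String> fieldNames) {
        List<String> cleaned = new ArrayList<>();
        for (String name : fieldNames) {
            if (!META_FIELDS.contains(name)) {
                cleaned.add(name);
            }
        }
        return cleaned;
    }

    // Single choke point: every commit path builds its metadata here, so the
    // persisted schema is clean regardless of what the config holds.
    static Map<String, String> buildMetadata(List<String> configSchema) {
        Map<String, String> extraMetadata = new HashMap<>();
        extraMetadata.put(SCHEMA_KEY, String.join(",", stripMetaFields(configSchema)));
        return extraMetadata;
    }

    public static void main(String[] args) {
        List<String> polluted = List.of(
            "_hoodie_commit_time", "_hoodie_record_key", "id", "name", "ts");
        System.out.println(buildMetadata(polluted).get(SCHEMA_KEY));
    }
}
```

Placing the filter where the value is written, rather than at each place the config schema is set, means new upstream mutations cannot reintroduce the problem without also bypassing the shared metadata builder.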
### Impact
No public API or user-facing config change. The behavior change is limited to
what gets persisted under `SCHEMA_KEY` in commit metadata: it is now always
stripped of Hudi meta fields, as the documented contract already implied.
### Risk Level
low — the change only normalizes the value already being written under
`SCHEMA_KEY`; readers of commit metadata that expected a clean schema continue
to see one (and now reliably so).
### Documentation Update
none
### Contributor's checklist
- [x] Read through [contributor's
guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [x] Adequate tests were added if applicable