shangxinli opened a new pull request, #18764: URL: https://github.com/apache/hudi/pull/18764
### Describe the issue this Pull Request addresses

Phase 1 of #18750: migrating `HoodieStreamerWriteStatusValidator` (HSWSV) into the pre-commit validator framework (#18068, #18362, #18405). This is the first of 4 phases. It is purely additive: it adds a new opt-in validator and changes nothing about the existing HSWSV path.

### Summary and Changelog

**Summary:**

Adds `SparkWriteErrorValidator`, a pure pre-commit validator that fails the commit when any records failed to write. This is a behavior-equivalent extract of HSWSV's boolean error check (`hasErrorRecords = totalErroredRecords > 0`) without any of HSWSV's side effects: no error-table commit, no top-100 error logging, no instant rollback. Those concerns will be extracted in subsequent phases.

Behavior mapping from HSWSV:
- `cfg.commitOnErrors = false` (default) ↔ `hoodie.precommit.validators.failure.policy = FAIL`
- `cfg.commitOnErrors = true` ↔ `hoodie.precommit.validators.failure.policy = WARN_LOG`

**Stack context:** This PR builds on Phase 3 of the pre-commit validator framework (#18405), which introduced the Spark/HoodieStreamer pre-commit validator wiring in `StreamSync.writeToSinkAndDoMetaSync()`. No changes to `StreamSync` are needed in Phase 1 of this migration; the validator plugs into the existing `SparkStreamerValidatorUtils.runValidators()` call when the user opts in via `hoodie.precommit.validators`.

**Changelog:**
- Added `SparkWriteErrorValidator extends BasePreCommitValidator` in `org.apache.hudi.utilities.streamer.validator`
- Reads `getTotalWriteErrors()` and `getTotalRecordsWritten()` from `ValidationContext` (no new context methods needed)
- Reuses existing `hoodie.precommit.validators.failure.policy` (FAIL / WARN_LOG); no new config introduced
- Added `TestSparkWriteErrorValidator` with 8 unit tests
- All code is new, no existing code was copied

**Phased rollout (tracked in #18750):**
- Phase 1 (this PR): Additive `SparkWriteErrorValidator`. HSWSV unchanged.
- Phase 2: Carve out HSWSV's side effects into `ErrorTableCommitter`, `WriteErrorReporter`, `SuccessfulRecordCounter` helpers. HSWSV becomes a thin coordinator.
- Phase 3: Flip the call site in `StreamSync` from the `WriteStatusValidator` callback to the pre-commit framework.
- Phase 4: Delete HSWSV. Remove the `WriteStatusValidator` hook from the write client if no other caller exists.

### Impact

**Public API Changes:**
- New public class `org.apache.hudi.utilities.streamer.validator.SparkWriteErrorValidator`

**User-Facing Changes:**

Users can now enable the framework-based write-error check in HoodieStreamer pipelines by configuring:

```
hoodie.precommit.validators=org.apache.hudi.utilities.streamer.validator.SparkWriteErrorValidator
# FAIL blocks the commit; use WARN_LOG to allow commits with errors
hoodie.precommit.validators.failure.policy=FAIL
```

This runs *alongside* HSWSV in Phase 1, not in place of it. HSWSV remains the canonical path that commits the error table and rolls back on failure. Enabling this validator adds a pre-commit guard but does not remove or replace any existing behavior.

**Performance Impact:** None for users who do not configure `hoodie.precommit.validators`. When configured, the validator runs once per commit and only inspects already-collected `HoodieWriteStat` aggregates, with no additional Spark actions or DAG evaluations.
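For reviewers who want the intended semantics at a glance, here is a minimal sketch of the check and the policy mapping described in the Summary. It is not the code in this PR: `WriteErrorCheckSketch`, `PartitionStat`, `policyFor`, and `commitAllowed` are hypothetical names, the stats class is a stand-in for the aggregates exposed by `ValidationContext`, and returning a boolean instead of signalling failure through the framework is an assumption made to keep the example self-contained.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/**
 * Illustrative sketch only. None of these types are Hudi classes; they are
 * hypothetical stand-ins for the aggregates the validator is described as reading.
 */
public class WriteErrorCheckSketch {

    /** The two policy values named in the PR description. */
    enum FailurePolicy { FAIL, WARN_LOG }

    /** Hypothetical per-partition write statistics (not HoodieWriteStat). */
    static final class PartitionStat {
        final long recordsWritten;
        final long writeErrors;

        PartitionStat(long recordsWritten, long writeErrors) {
            this.recordsWritten = recordsWritten;
            this.writeErrors = writeErrors;
        }
    }

    /** Sums error counts across partitions; an empty list yields 0. */
    static long totalWriteErrors(List<PartitionStat> stats) {
        long total = 0;
        for (PartitionStat s : stats) {
            total += s.writeErrors;
        }
        return total;
    }

    /** Sums written-record counts across partitions; used only for the log message. */
    static long totalRecordsWritten(List<PartitionStat> stats) {
        long total = 0;
        for (PartitionStat s : stats) {
            total += s.recordsWritten;
        }
        return total;
    }

    /** Mapping from the PR text: commitOnErrors=false -> FAIL, true -> WARN_LOG. */
    static FailurePolicy policyFor(boolean commitOnErrors) {
        return commitOnErrors ? FailurePolicy.WARN_LOG : FailurePolicy.FAIL;
    }

    /**
     * Returns true when the commit may proceed. Assumption: the real validator
     * reports failure through the framework; a boolean keeps this sketch testable.
     */
    static boolean commitAllowed(List<PartitionStat> stats, FailurePolicy policy) {
        long errors = totalWriteErrors(stats);
        boolean hasErrorRecords = errors > 0;
        if (!hasErrorRecords) {
            return true;
        }
        if (policy == FailurePolicy.WARN_LOG) {
            System.err.println("WARN: " + errors + " write errors, "
                + totalRecordsWritten(stats) + " records written; commit allowed");
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<PartitionStat> stats = Arrays.asList(
            new PartitionStat(100, 0),
            new PartitionStat(50, 3));
        System.out.println(commitAllowed(stats, FailurePolicy.FAIL));
        System.out.println(commitAllowed(stats, FailurePolicy.WARN_LOG));
        System.out.println(commitAllowed(
            Collections.<PartitionStat>emptyList(), FailurePolicy.FAIL));
    }
}
```

The sketch only sums counts that are already in memory, which mirrors the claim that the check triggers no extra Spark actions. An empty commit has zero errors and therefore passes under either policy.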
### Risk Level

**Risk Level: low**

**Justification:**
- Purely additive: no existing code path is modified
- Validator is opt-in; default behavior is unchanged
- Reuses battle-tested framework components (`BasePreCommitValidator`, `ValidationContext`, `SparkStreamerValidatorUtils` from Phase 3)
- 8 unit tests covering all branches (no errors, FAIL/WARN_LOG, empty commit, missing write stats, multi-partition summing, update records, default policy)

**Verification:**
- `mvn -pl hudi-utilities -am test-compile`: BUILD SUCCESS
- `mvn -pl hudi-utilities test -Dtest=TestSparkWriteErrorValidator`: 8/8 pass
- `mvn -pl hudi-utilities test -Dtest=TestSparkKafkaOffsetValidator,TestSparkValidationContext,TestSparkStreamerValidatorUtils,TestSparkWriteErrorValidator`: 44/44 pass (no regression in other validator tests)
- `mvn -pl hudi-utilities checkstyle:check`: 0 violations
- `mvn -pl hudi-utilities apache-rat:check`: 0 unapproved licenses

### Documentation Update

No user-facing documentation update is needed in Phase 1: the validator is opt-in and the existing configuration documentation in `HoodiePreCommitValidatorConfig` already covers the failure-policy property. The `VALIDATOR_CLASS_NAMES` documentation will be updated in Phase 4 (cleanup) to reference `SparkWriteErrorValidator` once HSWSV is removed and this becomes the canonical write-error path.

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [x] Adequate tests were added if applicable

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
