shangxinli opened a new issue, #18750:
URL: https://github.com/apache/hudi/issues/18750

   ## Background
   
   Follow-up to #18405 ([Danny's review 
comment](https://github.com/apache/hudi/pull/18405#discussion_r3214248837)).
   
   #18405 introduced a Spark/HoodieStreamer pre-commit validator 
(`SparkKafkaOffsetValidator`) built on the framework from #18068. We now have 
two parallel write-error / write-status validation paths in HoodieStreamer:
   
   1. The new `BasePreCommitValidator` framework (#18068, #18405) — runs 
*before* `writeClient.commit()`, parameterized by `TypedProperties`, state 
exposed via `ValidationContext`.
   2. The existing `HoodieStreamerWriteStatusValidator` 
(`StreamSync.java:1403`) — runs *inside* `writeClient.commit()` via the 
`WriteStatusValidator` callback hook.
   
   Danny suggested consolidating these by migrating (2) into the (1) framework. 
This issue tracks that work.
   
   ## Why it isn't a trivial migration
   
   `HoodieStreamerWriteStatusValidator` does substantially more than pure 
validation:
   
   - Counts records/errors and writes to a shared `AtomicLong 
totalSuccessfulRecords` consumed *after* commit (drives the `runMetaSync()` 
decision).
   - **Commits the error table** as a side effect, with `ROLLBACK_COMMIT` / 
`LOG_ERROR` strategies.
   - Rolls back the instant on validation failure; logs the top-100 errors.
   - Runs *inside* `writeClient.commit()`, not before it.
   
   The current `BasePreCommitValidator` / `ValidationContext` interface has no 
path to:
   - Surface the raw `JavaRDD<WriteStatus>` (needed for error-table commit).
   - Inject the error-table writer.
   - Expose a post-commit counter back to `StreamSync`.
   
   ## Proposed approach (to discuss)
   
   Option A — **Extend `ValidationContext`**: add accessors for 
`JavaRDD<WriteStatus>` and an error-table writer handle; keep side effects 
inside the validator.
   
   Option B — **Split side-effects out**: keep validation pure (just 
succeed/fail with reasons) and move error-table commit + counter wiring into a 
separate post-validation step in `StreamSync`.
   
   Leaning toward Option B because it keeps the validator framework clean and 
matches the "validate, don't mutate" contract of `BasePreCommitValidator`. Open 
to feedback.
   
   ## Acceptance criteria
   
   - [ ] Decision on Option A vs. B (or alternative).
   - [ ] `HoodieStreamerWriteStatusValidator` removed; equivalent behavior 
implemented as a pre-commit validator.
   - [ ] Error-table commit + `totalSuccessfulRecords` wiring preserved (no 
regression in `runMetaSync` decision).
   - [ ] Tests covering `ROLLBACK_COMMIT` and `LOG_ERROR` error policies.
   - [ ] No behavior change for users who don't configure 
`hoodie.precommit.validators`.
   
   cc @danny0405 @codope


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to