shangxinli opened a new pull request, #18795:
URL: https://github.com/apache/hudi/pull/18795
### Change Logs
Introduces `RollbackOrphanDetector` and a new feature-flag config,
`hoodie.archive.rollback.orphan.guard.mode`, that will later block archival of
a rollback instant while its orphan files are still on storage. **No behavior
change yet**: this PR only lands the building block; wiring into the archive
planner follows in a separate cascade PR.
#### Motivation
See issue #18783 for the full problem statement. In summary: when a rollback
partially fails (crash mid-rollback, marker loss, or a blocked storage
`close()` that lands data after rollback completed) and the rollback instant is
later archived, the system loses the metadata anchor that lets readers filter
out the orphan files. Readers then return corrupt-parquet errors or duplicate
records — a hard violation of the reader/writer isolation guarantee.
This PR is the foundation for an archive-time precondition check that
prevents that loss-of-anchor scenario.
#### What's in this PR
1. New `RollbackOrphanDetector` in `hudi-client/hudi-client-common` under
`org.apache.hudi.table.action.rollback`. Two detection modes:
- `LIGHT`: reads `HoodieRollbackMetadata.failedDeleteFiles`. Cost is
O(metadata size). Catches files the rollback explicitly tried and failed to
delete, but misses files that land after the rollback has completed.
- `THOROUGH`: additionally lists the partitions named in the rollback
metadata and matches filenames against the rollback's target instant time(s)
(both base parquet and MoR log file naming). Cost is bounded by the partition
count in the rollback metadata, not by whole-table size.
2. Safety floor: every candidate is cross-checked against completed instants
in the active timeline. A file whose embedded instant is a `COMPLETED` commit
is never flagged as an orphan, even if its filename matches the `THOROUGH`
filename pattern.
3. New config `hoodie.archive.rollback.orphan.guard.mode` (values `OFF` /
`LIGHT` / `THOROUGH`, **default `OFF`**) and a getter on `HoodieWriteConfig`.
4. Two overloads of the detection entry point:
`(HoodieTable, HoodieInstant, Mode)` for the archive planner context and
`(HoodieTableMetaClient, HoodieInstant, Mode)` for the `hudi-cli` context that
follows in PR3 (the companion CLI PR).
5. `TestRollbackOrphanDetector` with 5 tests covering `OFF`, `LIGHT` with
empty/non-empty `failedDeleteFiles`, `THOROUGH` with a real partition listing,
and the safety-floor case.
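
For reviewers who want the detection flow at a glance, here is a minimal, self-contained sketch of how the two modes and the safety floor compose. It is not the code in this PR: the class name `OrphanDetectionSketch`, the `detect` signature, the plain-collection inputs, and the two simplified filename patterns are all assumptions made for illustration, and the real detector works against `HoodieRollbackMetadata` and the timeline rather than raw strings.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OrphanDetectionSketch {
    enum Mode { OFF, LIGHT, THOROUGH }

    // Assumed, simplified name shapes (hypothetical, not the PR's actual regex):
    //   base file: <fileId>_<writeToken>_<instantTime>.parquet
    //   log file:  .<fileId>_<instantTime>.log.<version>_<writeToken>
    private static final Pattern BASE = Pattern.compile("^[^_]+_[^_]+_(\\d+)\\.parquet$");
    private static final Pattern LOG = Pattern.compile("^\\.[^_]+_(\\d+)\\.log\\.\\d+_.+$");

    /** Extracts the instant time embedded in a file name, if the name matches either shape. */
    static Optional<String> embeddedInstant(String fileName) {
        for (Pattern p : List.of(BASE, LOG)) {
            Matcher m = p.matcher(fileName);
            if (m.matches()) {
                return Optional.of(m.group(1));
            }
        }
        return Optional.empty();
    }

    /**
     * @param failedDeleteFiles  "partition/fileName" paths recorded by the rollback
     * @param listingByPartition file names currently on storage, for the partitions
     *                           named in the rollback metadata only
     */
    static Set<String> detect(Mode mode,
                              List<String> failedDeleteFiles,
                              Map<String, List<String>> listingByPartition,
                              Set<String> rolledBackInstants,
                              Set<String> completedInstants) {
        Set<String> candidates = new TreeSet<>();
        if (mode == Mode.OFF) {
            return candidates; // short-circuit: no work at all
        }
        // LIGHT: trust what the rollback metadata says it failed to delete.
        candidates.addAll(failedDeleteFiles);
        // THOROUGH: also match listed files against the rolled-back instant time(s).
        if (mode == Mode.THOROUGH) {
            listingByPartition.forEach((partition, names) -> {
                for (String name : names) {
                    if (embeddedInstant(name).filter(rolledBackInstants::contains).isPresent()) {
                        candidates.add(partition + "/" + name);
                    }
                }
            });
        }
        // Safety floor: never flag a file whose embedded instant is a completed commit.
        candidates.removeIf(path ->
            embeddedInstant(path.substring(path.lastIndexOf('/') + 1))
                .filter(completedInstants::contains)
                .isPresent());
        return candidates;
    }
}
```

In this sketch the safety floor runs last and applies to both modes, so a stale `failedDeleteFiles` entry that points at a since-completed commit is dropped as well.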
### Impact
None until the new config is set to `LIGHT` or `THOROUGH`. The default `OFF`
short-circuits before any work happens.
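
As an illustration of the "default `OFF` short-circuits" claim, the sketch below resolves the mode from a plain `java.util.Properties` and skips the detection callback entirely when the mode is `OFF`. Only the config key and its three values come from this PR; the class `GuardModeConfigSketch`, the `guardMode` and `countOrphans` helpers, and the case-insensitive parsing are hypothetical stand-ins for the real getter on `HoodieWriteConfig`.

```java
import java.util.Locale;
import java.util.Properties;
import java.util.function.Supplier;

final class GuardModeConfigSketch {
    enum Mode { OFF, LIGHT, THOROUGH }

    static final String KEY = "hoodie.archive.rollback.orphan.guard.mode";

    /**
     * Resolves the guard mode, defaulting to OFF when the key is unset.
     * Case-insensitive parsing is an assumption of this sketch; an unknown
     * value makes Mode.valueOf throw IllegalArgumentException.
     */
    static Mode guardMode(Properties props) {
        String raw = props.getProperty(KEY, Mode.OFF.name());
        return Mode.valueOf(raw.trim().toUpperCase(Locale.ROOT));
    }

    /** Runs the (potentially expensive) detection only when the guard is enabled. */
    static int countOrphans(Properties props, Supplier<Integer> detection) {
        if (guardMode(props) == Mode.OFF) {
            return 0; // detection is never invoked
        }
        return detection.get();
    }
}
```

With an empty `Properties`, `countOrphans` returns `0` without calling the supplier, which mirrors the no-impact default described above.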
### Risk level
low
### Documentation Update
A user-facing config doc update is appropriate once PR1b (the wiring PR)
lands — that's the change users would actually flip. This PR is plumbing only.
### Related
- Issue: #18783
- Cascade PR (wires the detector in):
`feat/rollback-orphan-archive-precondition` on the fork; will be opened after
this merges
- Companion CLI PR: `feat/rollback-orphan-repair-cli` on the fork; will be
opened after this merges
### Contributor's checklist
- [ ] HUDI JIRA ticket (placeholder `HUDI-XXXX` in title — happy to file
once approach is sanity-checked)
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests added
- [ ] CI passed