nsivabalan opened a new pull request, #18380:
URL: https://github.com/apache/hudi/pull/18380
### Describe the issue this Pull Request addresses
This PR adds support to block archival based on the Earliest Commit To
Retain (ECTR) from the last completed clean operation, preventing potential
data leaks when cleaning configurations change between clean and archival runs.
Problem: Currently, archival recomputes ECTR independently based on cleaning
configs at archival time, rather than reading it from the last clean plan. When
cleaning configs change between clean and archival operations, archival may
archive commits whose data files haven't been cleaned yet, leading to timeline
metadata loss for existing data files.
Example scenario:
1. Clean runs with retainCommits=5, computes ECTR=commit_100, cleans files
older than commit_100
2. Config changes to retainCommits=2 before next clean
3. Archival runs with new config, recomputes ECTR=commit_103
4. Archival archives commits 100-102, but their data files still exist
(weren't cleaned yet)
5. Result: Timeline metadata is lost for existing data files → data leak
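The scenario can be reproduced with a toy model. The following is a minimal sketch, not Hudi code: it assumes the ECTR under a commit-count policy is simply the oldest of the last N commits, and it compares instants as zero-padded strings. The class `EctrScenario`, its methods, and the extra commits `commit_098` and `commit_099` are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the scenario above. Not Hudi code; all names are hypothetical.
final class EctrScenario {

    // Assumption: the ECTR is the oldest of the last `retainCommits` commits
    // on a timeline that is already sorted in ascending order.
    static String computeEctr(List<String> commits, int retainCommits) {
        int idx = Math.max(0, commits.size() - retainCommits);
        return commits.get(idx);
    }

    // Commits strictly older than the given ECTR are eligible for archival.
    static List<String> archivable(List<String> commits, String ectr) {
        List<String> result = new ArrayList<>();
        for (String commit : commits) {
            if (commit.compareTo(ectr) < 0) {
                result.add(commit);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> timeline = Arrays.asList(
            "commit_098", "commit_099", "commit_100", "commit_101",
            "commit_102", "commit_103", "commit_104");

        // Step 1: clean ran with retainCommits=5.
        String ectrAtClean = computeEctr(timeline, 5);
        // Steps 2-3: config changed to retainCommits=2 and archival recomputes.
        String ectrAtArchival = computeEctr(timeline, 2);

        System.out.println("ECTR at clean time:    " + ectrAtClean);
        System.out.println("ECTR at archival time: " + ectrAtArchival);
        System.out.println("Archived with recomputed ECTR: "
            + archivable(timeline, ectrAtArchival));
        System.out.println("Archived with ECTR from last clean: "
            + archivable(timeline, ectrAtClean));
    }
}
```

In this model, the commits that appear in the first archived list but not in the second are exactly the ones whose data files have not been cleaned yet.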
### Summary and Changelog
User-facing summary: Users can now optionally enable archival blocking
based on ECTR from the last clean to prevent archiving commits whose data files
haven't been cleaned. This is useful when cleaning configurations may change
over time or when strict data retention guarantees are needed.
Detailed changelog:
Configuration Changes:
- Added new advanced config hoodie.archive.block.on.latest.clean.ectr (default: false)
  - When enabled, archival reads ECTR from last completed clean metadata
  - Blocks archival of commits with timestamp >= ECTR
  - Marked as advanced config for power users
  - Available since version 1.2.0
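As a rough illustration of opting in, the sketch below sets the new key in a plain `java.util.Properties` object, which is one common way to pass Hudi write options. Only the key name and its default of false come from this PR; the class `EctrBlockingProps` and its helpers are hypothetical and do not touch any Hudi class.

```java
import java.util.Properties;

// Hypothetical helper; only the config key and its default come from the PR.
final class EctrBlockingProps {

    static final String KEY = "hoodie.archive.block.on.latest.clean.ectr";

    // Builds a properties object with the ECTR blocking flag set explicitly.
    static Properties withEctrBlocking(boolean enabled) {
        Properties props = new Properties();
        props.setProperty(KEY, Boolean.toString(enabled));
        return props;
    }

    // Reads the flag, falling back to the documented default of false.
    static boolean isEnabled(Properties props) {
        return Boolean.parseBoolean(props.getProperty(KEY, "false"));
    }

    public static void main(String[] args) {
        Properties props = withEctrBlocking(true);
        System.out.println(KEY + "=" + props.getProperty(KEY));
        System.out.println("enabled when unset: " + isEnabled(new Properties()));
    }
}
```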
Implementation Changes:
- TimelineArchiverV1.java: Added ECTR blocking logic in getCommitInstantsToArchive() method
  - Reads ECTR from last completed clean's metadata (lines 274-294)
  - Filters commit timeline to exclude commits >= ECTR (lines 322-326)
  - Follows same pattern as existing compaction/clustering retention checks
  - Includes error handling with graceful degradation (logs warning if metadata read fails)
- HoodieArchivalConfig.java: Added config property BLOCK_ARCHIVAL_ON_LATEST_CLEAN_ECTR
  - Builder method: withBlockArchivalOnCleanECTR(boolean)
- HoodieWriteConfig.java: Added access method shouldBlockArchivalOnCleanECTR()
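The filtering step described for TimelineArchiverV1 can be approximated as follows. This is a simplified sketch under stated assumptions, not the code in this PR: instants are plain timestamp strings, the clean metadata read is modeled as a `Supplier<Optional<String>>`, and a failed, missing, or empty ECTR is assumed to fall back to the unfiltered candidate list after a warning. The class `EctrArchivalFilter` and the method `filterCandidates` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Simplified sketch of ECTR-based filtering. Not the PR's implementation.
final class EctrArchivalFilter {

    static List<String> filterCandidates(List<String> candidates,
                                         boolean blockOnEctr,
                                         Supplier<Optional<String>> ectrReader) {
        if (!blockOnEctr) {
            return candidates; // config disabled: behavior unchanged
        }
        Optional<String> ectr;
        try {
            ectr = ectrReader.get();
        } catch (RuntimeException e) {
            // Assumed degradation path: warn and archive without the ECTR bound.
            System.err.println("WARN: could not read clean metadata: " + e.getMessage());
            return candidates;
        }
        if (!ectr.isPresent() || ectr.get().isEmpty()) {
            return candidates; // no clean yet, or a clean that recorded no ECTR
        }
        String bound = ectr.get();
        // Keep only instants strictly older than the ECTR (exclude >= ECTR).
        return candidates.stream()
            .filter(ts -> ts.compareTo(bound) < 0)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("098", "099", "100", "101", "102");
        System.out.println(filterCandidates(candidates, true, () -> Optional.of("100")));
        System.out.println(filterCandidates(candidates, true, Optional::empty));
        System.out.println(filterCandidates(candidates, false, () -> Optional.of("100")));
    }
}
```

Passing the reader as a supplier keeps the metadata read lazy, so the disabled path pays no extra cost, which matches the performance note in the Impact section.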
Test Changes:
- Added 7 comprehensive tests in TestHoodieTimelineArchiver.java (633 lines):
  a. testArchivalBlocksOnCleanECTRWhenEnabled - Core blocking functionality
  b. testArchivalProceedsNormallyWhenECTRBlockingDisabled - Backward compatibility
  c. testArchivalMakesProgressWhenECTRIsLaterThanArchivalWindow - Progress validation
  d. testArchivalContinuesWhenCleanMetadataIsMissing - Missing metadata handling
  e. testArchivalHandlesEmptyECTRInCleanMetadata - Empty ECTR handling
  f. testArchivalProceedsWhenCleanHasFileVersionsPolicyWithNullECTR - FILE_VERSIONS policy compatibility
  g. testArchivalBlocksOnCleanECTRWithTimelineArchiverV2AndVersion9 - Version 9 / LSM timeline compatibility
### Impact
Public API Changes:
- New config property: hoodie.archive.block.on.latest.clean.ectr (opt-in,
default: false)
- New builder method:
HoodieArchivalConfig.Builder.withBlockArchivalOnCleanECTR(boolean)
- New accessor: HoodieWriteConfig.shouldBlockArchivalOnCleanECTR()
User-facing changes:
- When enabled, archival may retain more commits in the active timeline if
their data files haven't been cleaned yet
- Timeline growth bounded by ECTR from last clean operation
- No behavior change when config is disabled (default)
Performance impact:
- Minimal: One additional metadata read per archival operation when enabled
- Read operation is fast (single clean metadata file)
- No impact when config is disabled (default)
Breaking changes: None - opt-in feature with no default behavior changes
### Risk Level
low
### Documentation Update
Config documentation:
The new config hoodie.archive.block.on.latest.clean.ectr is documented
inline:
.withDocumentation("If enabled, archival will block on latest ECTR from last known clean")
Website documentation needed:
- Add entry to config reference page for
HoodieArchivalConfig.BLOCK_ARCHIVAL_ON_LATEST_CLEAN_ECTR
- Update archival section in Hudi docs to explain ECTR blocking feature
- Add usage example showing when to enable this config
- Document interaction with different cleaning policies
(KEEP_LATEST_COMMITS vs KEEP_LATEST_FILE_VERSIONS)
### Contributor's checklist
- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Enough context is provided in the sections above
- [ ] Adequate tests were added if applicable