nsivabalan opened a new pull request, #18380:
URL: https://github.com/apache/hudi/pull/18380

   ### Describe the issue this Pull Request addresses
   
   This PR adds support to block archival based on the Earliest Commit To 
Retain (ECTR) from the last completed clean operation, preventing potential 
data leaks when cleaning configurations change between clean and archival runs.
    
   Problem: Archival currently recomputes the ECTR from the cleaning configs in effect at archival time instead of reading it from the last completed clean. If those configs change between a clean and the next archival run, archival may archive commits whose data files have not been cleaned yet, so data files that still exist lose their timeline metadata.
   
     Example scenario:
     1. Clean runs with retainCommits=5, computes ECTR=commit_100, cleans files 
older than commit_100
     2. Config changes to retainCommits=2 before next clean
     3. Archival runs with new config, recomputes ECTR=commit_103
     4. Archival archives commits 100-102, but their data files still exist 
(weren't cleaned yet)
     5. Result: Timeline metadata is lost for existing data files → data leak
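
The scenario above can be reproduced with a small, self-contained sketch. This is not Hudi code: instant times are simplified to three-digit strings, and `recomputeEctr` and `archivableBefore` are hypothetical helpers that assume the ECTR is simply the N-th most recent commit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class EctrScenarioSketch {

  // Hypothetical simplification: with "retain N commits", treat the N-th most
  // recent commit as the earliest commit to retain (ECTR).
  static String recomputeEctr(List<String> commits, int retainCommits) {
    int idx = Math.max(0, commits.size() - retainCommits);
    return commits.get(idx);
  }

  // Commits strictly older than the ECTR are candidates for archival.
  static List<String> archivableBefore(List<String> commits, String ectr) {
    return commits.stream()
        .filter(c -> c.compareTo(ectr) < 0)
        .collect(Collectors.toList());
  }

  // Commits that archival would remove even though the last clean never
  // cleaned their data files.
  static List<String> archivedButNotCleaned(List<String> commits,
                                            String ectrFromLastClean,
                                            String recomputedEctr) {
    List<String> result = new ArrayList<>(archivableBefore(commits, recomputedEctr));
    result.removeAll(archivableBefore(commits, ectrFromLastClean));
    return result;
  }

  public static void main(String[] args) {
    List<String> commits = List.of("098", "099", "100", "101", "102", "103", "104");

    String ectrFromLastClean = recomputeEctr(commits, 5); // clean ran with retainCommits=5
    String recomputedEctr = recomputeEctr(commits, 2);    // archival sees retainCommits=2

    // Prints [100, 101, 102]: the commits from step 4 of the scenario.
    System.out.println(archivedButNotCleaned(commits, ectrFromLastClean, recomputedEctr));
  }
}
```

Bounding archival by the ECTR recorded by the last clean (here `100`) instead of the recomputed value (`103`) leaves that list empty.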
   
   ### Summary and Changelog
    User-facing summary: Users can now optionally enable archival blocking 
based on ECTR from the last clean to prevent archiving commits whose data files 
haven't been cleaned. This is useful when cleaning configurations may change 
over time or when strict data retention guarantees are needed.
   
     Detailed changelog:
   
     Configuration Changes:
     - Added new advanced config hoodie.archive.block.on.latest.clean.ectr 
(default: false)
       - When enabled, archival reads the ECTR from the last completed clean's metadata
       - Blocks archival of commits with timestamp >= ECTR
       - Marked as advanced config for power users
       - Available since version 1.2.0
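
A minimal sketch of turning the flag on through plain properties. Only the key name and its default of `false` come from this PR; the `EctrConfigSketch` class and its helpers are hypothetical and use `java.util.Properties` rather than Hudi's own config classes.

```java
import java.util.Properties;

public class EctrConfigSketch {

  // Config key and default value as described in this PR.
  static final String BLOCK_ON_ECTR_KEY = "hoodie.archive.block.on.latest.clean.ectr";
  static final String BLOCK_ON_ECTR_DEFAULT = "false";

  // Hypothetical helper: build writer properties with the flag set explicitly.
  static Properties withEctrBlocking(boolean enabled) {
    Properties props = new Properties();
    props.setProperty(BLOCK_ON_ECTR_KEY, Boolean.toString(enabled));
    return props;
  }

  // Hypothetical helper: read the flag, falling back to the documented default.
  static boolean isEctrBlockingEnabled(Properties props) {
    return Boolean.parseBoolean(props.getProperty(BLOCK_ON_ECTR_KEY, BLOCK_ON_ECTR_DEFAULT));
  }

  public static void main(String[] args) {
    System.out.println(isEctrBlockingEnabled(new Properties()));       // flag unset, default applies
    System.out.println(isEctrBlockingEnabled(withEctrBlocking(true))); // flag opted in
  }
}
```

Because the feature is opt-in, a job that never sets the key keeps the existing archival behavior.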
   
     Implementation Changes:
     - TimelineArchiverV1.java: Added ECTR blocking logic in 
getCommitInstantsToArchive() method
       - Reads ECTR from last completed clean's metadata (lines 274-294)
       - Filters commit timeline to exclude commits >= ECTR (lines 322-326)
       - Follows same pattern as existing compaction/clustering retention checks
       - Includes error handling with graceful degradation (logs warning if 
metadata read fails)
     - HoodieArchivalConfig.java: Added config property 
BLOCK_ARCHIVAL_ON_LATEST_CLEAN_ECTR
       - Builder method: withBlockArchivalOnCleanECTR(boolean)
     - HoodieWriteConfig.java: Added access method 
shouldBlockArchivalOnCleanECTR()
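
The filtering and graceful-degradation behavior listed above might look roughly like the following. This is an illustrative sketch, not the code in `TimelineArchiverV1`: the class, both methods, and the use of `Callable` as a stand-in for the clean-metadata read are assumptions, and instant times are simplified to equal-length strings.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.stream.Collectors;

public class EctrBlockingSketch {

  // Reads the ECTR via the supplied reader. A failed read, a null value, or an
  // empty value all mean "no ECTR available", so archival is not blocked.
  static Optional<String> readEctrSafely(Callable<String> cleanMetadataReader) {
    try {
      String ectr = cleanMetadataReader.call();
      return (ectr == null || ectr.isEmpty()) ? Optional.empty() : Optional.of(ectr);
    } catch (Exception e) {
      // Graceful degradation: warn and continue without ECTR blocking.
      System.err.println("WARN: could not read ECTR from last clean: " + e.getMessage());
      return Optional.empty();
    }
  }

  // Drops candidates with timestamp >= ECTR when blocking is enabled.
  static List<String> filterArchivable(List<String> candidates,
                                       boolean blockOnCleanEctr,
                                       Callable<String> cleanMetadataReader) {
    if (!blockOnCleanEctr) {
      return candidates; // config disabled: behavior unchanged
    }
    Optional<String> ectr = readEctrSafely(cleanMetadataReader);
    if (!ectr.isPresent()) {
      return candidates; // nothing to block on
    }
    String bound = ectr.get();
    return candidates.stream()
        .filter(instant -> instant.compareTo(bound) < 0)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> candidates = List.of("098", "099", "100", "101", "102");
    System.out.println(filterArchivable(candidates, true, () -> "100"));
    System.out.println(filterArchivable(candidates, false, () -> "100"));
  }
}
```

Treating an empty or null ECTR as "do not block" mirrors the test cases for missing metadata, empty ECTR, and the FILE_VERSIONS policy listed under Test Changes.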
   
     Test Changes:
     - Added 7 comprehensive tests in TestHoodieTimelineArchiver.java (633 
lines):
       a. testArchivalBlocksOnCleanECTRWhenEnabled - Core blocking functionality
       b. testArchivalProceedsNormallyWhenECTRBlockingDisabled - Backward 
compatibility
       c. testArchivalMakesProgressWhenECTRIsLaterThanArchivalWindow - Progress 
validation
       d. testArchivalContinuesWhenCleanMetadataIsMissing - Missing metadata 
handling
       e. testArchivalHandlesEmptyECTRInCleanMetadata - Empty ECTR handling
       f. testArchivalProceedsWhenCleanHasFileVersionsPolicyWithNullECTR - 
FILE_VERSIONS policy compatibility
       g. testArchivalBlocksOnCleanECTRWithTimelineArchiverV2AndVersion9 - 
Version 9 / LSM timeline compatibility
   
   ### Impact
     Public API Changes:
     - New config property: hoodie.archive.block.on.latest.clean.ectr (opt-in, 
default: false)
     - New builder method: 
HoodieArchivalConfig.Builder.withBlockArchivalOnCleanECTR(boolean)
     - New accessor: HoodieWriteConfig.shouldBlockArchivalOnCleanECTR()
   
     User-facing changes:
     - When enabled, archival may retain more commits in the active timeline if their data files haven't been cleaned yet
     - Timeline growth is bounded by the ECTR from the last clean operation
     - No behavior change when config is disabled (default)
   
     Performance impact:
     - Minimal: One additional metadata read per archival operation when enabled
     - Read operation is fast (single clean metadata file)
     - No impact when config is disabled (default)
   
     Breaking changes: None - opt-in feature with no default behavior changes
   
   ### Risk Level
   
   low
   
   ### Documentation Update
   
   Config documentation:
     The new config hoodie.archive.block.on.latest.clean.ectr is documented 
inline:
     .withDocumentation("If enabled, archival will block on latest ECTR from 
last known clean")
   
     Website documentation needed:
     - Add entry to config reference page for 
HoodieArchivalConfig.BLOCK_ARCHIVAL_ON_LATEST_CLEAN_ECTR
     - Update archival section in Hudi docs to explain ECTR blocking feature
     - Add usage example showing when to enable this config
     - Document interaction with different cleaning policies 
(KEEP_LATEST_COMMITS vs KEEP_LATEST_FILE_VERSIONS)
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Enough context is provided in the sections above
   - [ ] Adequate tests were added if applicable
   

