fhan688 opened a new pull request, #19041:
URL: https://github.com/apache/hudi/pull/19041

   ### Describe the issue this Pull Request addresses
   
     `KEEP_LATEST_BY_HOURS` currently picks the first completed instant after 
the time cutoff as the earliest commit to retain. This can make the cleaner 
skip files that should already be eligible for cleaning, especially when 
commit density is sparse around the cutoff.
   
     This PR corrects the clean-by-time retention boundary and makes timeline 
archival respect the latest completed clean's `earliestCommitToRetain` (ECTR) 
consistently across timeline archiver versions. It also exposes the existing 
cleaner/archive configs through the Flink write path.
   
   ### Summary and Changelog
   
     This PR improves clean-by-time support without introducing any 
LSM-table-specific behavior.
   
     Changes:
     - Update `KEEP_LATEST_BY_HOURS` ECTR calculation to choose the latest 
completed instant at or before the retention cutoff.
     - Ensure the by-hours ECTR does not move past the earliest pending instant 
by retaining the completed instant before the pending instant.
     - Add a shared archival utility to derive the earliest instant to retain 
from the latest completed clean metadata.
     - Apply clean ECTR archive blocking to both `TimelineArchiverV1` and 
`TimelineArchiverV2` when `hoodie.archive.block.on.clean.ectr` is enabled.
     - Expose Flink options for:
       - `hoodie.clean.max.commits`
       - `hoodie.clean.empty.commit.interval.hours`
       - `hoodie.archive.block.on.clean.ectr`
     - Add tests for by-hours ECTR selection, pending instant protection, V2 
archival behavior, and Flink config propagation.
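
     The by-hours selection rule described in the first two changes can be 
sketched as follows. This is a minimal illustration, not the code in this PR: 
the class and method names are hypothetical, instants are modeled as 
fixed-width timestamp strings so that string order matches time order, and 
returning an empty result when no completed instant qualifies is an assumption.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; names do not come from the Hudi codebase.
final class ByHoursEctrSketch {

  // Picks the latest completed instant that is at or before the cutoff
  // and strictly before the earliest pending instant, if there is one.
  static Optional<String> earliestCommitToRetain(
      List<String> completedInstants, Optional<String> earliestPending, String cutoff) {
    return completedInstants.stream()
        // At or before the retention cutoff.
        .filter(t -> t.compareTo(cutoff) <= 0)
        // Never move past a pending instant (assumed strict comparison).
        .filter(t -> !earliestPending.isPresent() || t.compareTo(earliestPending.get()) < 0)
        // The latest instant that satisfies both conditions.
        .max(Comparator.naturalOrder());
  }

  public static void main(String[] args) {
    List<String> completed =
        Arrays.asList("20240101000000", "20240101060000", "20240101180000");
    String cutoff = "20240101120000";

    // No pending instant: latest completed instant at or before the cutoff.
    System.out.println(earliestCommitToRetain(completed, Optional.empty(), cutoff));
    // prints Optional[20240101060000]

    // A pending instant at 03:00 pulls the boundary back to the instant before it.
    System.out.println(
        earliestCommitToRetain(completed, Optional.of("20240101030000"), cutoff));
    // prints Optional[20240101000000]
  }
}
```

     With the previous behavior (first completed instant after the cutoff), 
the first call above would have selected `20240101180000`, leaving the files 
between 06:00 and 18:00 ineligible for cleaning.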
   
     No code was copied from another project.
   
   ### Impact
   
     User-facing behavior changes:
     - Tables using `KEEP_LATEST_BY_HOURS` compute the clean boundary more 
accurately against the configured time window.
     - When archive blocking on clean ECTR is enabled, timeline archiving 
avoids archiving commits that may still be needed because their data files have 
not been cleaned yet.
     - Flink writers can configure the existing clean/archive controls through 
Flink options.
   
     This may retain more active timeline instants in some cases when clean 
ECTR archive blocking is enabled, which is expected for correctness.
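
     To illustrate why more instants may stay on the active timeline, here is 
a rough sketch of the blocking rule. It is not the archiver implementation: 
the class and method names are hypothetical, and both the strict comparison 
against the ECTR and the fallback when no completed clean exists are 
assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical sketch; names do not come from the Hudi codebase.
final class ArchiveBlockSketch {

  // Returns the subset of candidate instants that may be archived.
  static List<String> instantsToArchive(
      List<String> candidates, Optional<String> cleanEctr, boolean blockOnCleanEctr) {
    // Assumption: with the flag off, or with no completed clean, nothing is blocked.
    if (!blockOnCleanEctr || !cleanEctr.isPresent()) {
      return candidates;
    }
    String boundary = cleanEctr.get();
    // Keep instants at or after the clean ECTR on the active timeline.
    return candidates.stream()
        .filter(t -> t.compareTo(boundary) < 0)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> candidates =
        Arrays.asList("20240101000000", "20240101060000", "20240101180000");
    // Only the instant before the ECTR is archived when blocking is enabled.
    System.out.println(
        instantsToArchive(candidates, Optional.of("20240101060000"), true));
    // prints [20240101000000]
  }
}
```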
   
   ### Risk Level
   
     medium
   
     The change affects cleaner retention boundary calculation and timeline 
archival retention decisions. The implementation is conservative: it avoids 
cleaning past pending instants and only blocks archival on clean ECTR when the 
existing config is enabled.
   
     Verification:
     - `mvn -pl hudi-common -am -DskipTests -DskipITs -Dcheckstyle.skip=true test-compile`
     - `mvn -pl hudi-flink-datasource/hudi-flink -am -DskipITs -Dcheckstyle.skip=true -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false -Dtest=TestFlinkWriteClients#testCleanByTimeConfigsPropagateToWriteConfig test`
     - `git diff --check`
   
   ### Documentation Update
   
     The Hudi website/config documentation should be updated for the newly 
exposed Flink options:
     - `hoodie.clean.max.commits`
     - `hoodie.clean.empty.commit.interval.hours`
     - `hoodie.archive.block.on.clean.ectr`
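
     As a starting point for that documentation, the three keys could be 
collected in an options map like the one below. This is only a sketch: the 
class name is hypothetical, the values are illustrative placeholders rather 
than recommended or default settings, and how the map is handed to a Flink 
job is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; only the option keys come from this PR.
final class FlinkCleanOptionsSketch {

  // Builds an options map using the keys listed above.
  static Map<String, String> cleanByTimeOptions() {
    Map<String, String> options = new LinkedHashMap<>();
    // Values below are placeholders chosen for illustration only.
    options.put("hoodie.clean.max.commits", "1");
    options.put("hoodie.clean.empty.commit.interval.hours", "24");
    options.put("hoodie.archive.block.on.clean.ectr", "true");
    return options;
  }

  public static void main(String[] args) {
    // Print each key/value pair in insertion order.
    cleanByTimeOptions().forEach((k, v) -> System.out.println(k + " = " + v));
  }
}
```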
   
   ### Contributor's checklist
   
     - [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
     - [x] Enough context is provided in the sections above
     - [x] Adequate tests were added if applicable

