fhan688 opened a new pull request, #19041:
URL: https://github.com/apache/hudi/pull/19041
### Describe the issue this Pull Request addresses
`KEEP_LATEST_BY_HOURS` currently picks the first completed instant after
the time cutoff as the earliest commit to retain (ECTR). This can make the
cleaner skip files that should already be eligible for cleaning, especially
when commits are sparse around the cutoff.
This PR corrects the clean-by-time retention boundary and makes timeline
archival respect the latest completed clean's `earliestCommitToRetain`
consistently across timeline archiver versions.
It also exposes the existing cleaner/archive configs through the Flink
write path.
### Summary and Changelog
This PR improves clean-by-time support without introducing any
LSM-table-specific behavior.
Changes:
- Update `KEEP_LATEST_BY_HOURS` ECTR calculation to choose the latest
completed instant at or before the retention cutoff.
- Ensure the by-hours ECTR does not move past the earliest pending instant
by retaining the completed instant before the pending instant.
- Add a shared archival utility to derive the earliest instant to retain
from the latest completed clean metadata.
- Apply clean ECTR archive blocking to both `TimelineArchiverV1` and
`TimelineArchiverV2` when `hoodie.archive.block.on.clean.ectr` is enabled.
- Expose Flink options for:
  - `hoodie.clean.max.commits`
  - `hoodie.clean.empty.commit.interval.hours`
  - `hoodie.archive.block.on.clean.ectr`
- Add tests for by-hours ECTR selection, pending instant protection, V2
archival behavior, and Flink config propagation.
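
The by-hours ECTR selection described in the first two bullets can be sketched as follows. This is a minimal illustration, not the code in this PR: the class and method names are hypothetical, and it assumes instant times are fixed-width timestamp strings that sort lexicographically and that the completed instants are passed in ascending order.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; names and signature are not taken from the Hudi code base.
final class ByHoursEctrSketch {

  /**
   * Picks the latest completed instant at or before the cutoff, without
   * moving to or past the earliest pending instant.
   *
   * @param completedInstants completed instant times, sorted ascending (assumption)
   * @param earliestPending   earliest pending instant time, if any
   * @param cutoff            retention cutoff expressed as an instant time
   */
  static Optional<String> earliestCommitToRetain(
      List<String> completedInstants, Optional<String> earliestPending, String cutoff) {
    String candidate = null;
    for (String instant : completedInstants) {
      // Completed instants later than the cutoff are never chosen.
      if (instant.compareTo(cutoff) > 0) {
        break;
      }
      // Stop before reaching the earliest pending instant.
      if (earliestPending.isPresent() && instant.compareTo(earliestPending.get()) >= 0) {
        break;
      }
      candidate = instant;
    }
    return Optional.ofNullable(candidate);
  }

  public static void main(String[] args) {
    List<String> completed = Arrays.asList("001", "003", "007");
    // Sparse commits around cutoff "005": "003" is chosen rather than "007".
    System.out.println(earliestCommitToRetain(completed, Optional.empty(), "005"));
    // A pending instant at "002" holds the boundary back to "001".
    System.out.println(earliestCommitToRetain(completed, Optional.of("002"), "005"));
  }
}
```

With completed instants `001`, `003`, `007` and a cutoff of `005`, the previous behavior described above would pick `007`, while this sketch picks `003`.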
No code was copied from another project.
### Impact
User-facing behavior changes:
- Tables using `KEEP_LATEST_BY_HOURS` compute the clean boundary more
accurately against the configured time window.
- When archive blocking on clean ECTR is enabled, timeline archiving
avoids archiving commits that may still be needed because their data files have
not been cleaned yet.
- Flink writers can configure the existing clean/archive controls through
Flink options.
This may retain more active timeline instants in some cases when clean
ECTR archive blocking is enabled, which is expected for correctness.
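
As a rough illustration of the archive-blocking behavior, the sketch below filters archive candidates against the clean ECTR. It is not the shared utility added in this PR: the class and method names are hypothetical, the strict "earlier than ECTR" comparison is an assumption, and instant times are again assumed to be lexicographically sortable strings.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical sketch; not the archival utility introduced by this PR.
final class CleanEctrArchiveBlockSketch {

  /**
   * Returns the candidates that may still be archived.
   *
   * @param candidates       instants the archiver would otherwise archive
   * @param blockOnCleanEctr stands in for hoodie.archive.block.on.clean.ectr
   * @param cleanEctr        earliestCommitToRetain of the latest completed clean, if any
   */
  static List<String> instantsToArchive(
      List<String> candidates, boolean blockOnCleanEctr, Optional<String> cleanEctr) {
    // Without the config, or without a completed clean, nothing is blocked.
    if (!blockOnCleanEctr || !cleanEctr.isPresent()) {
      return candidates;
    }
    String boundary = cleanEctr.get();
    // Assumption: only instants strictly earlier than the clean ECTR are archived.
    return candidates.stream()
        .filter(instant -> instant.compareTo(boundary) < 0)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> candidates = Arrays.asList("001", "002", "003", "004");
    System.out.println(instantsToArchive(candidates, true, Optional.of("003")));
    System.out.println(instantsToArchive(candidates, false, Optional.of("003")));
  }
}
```

In this sketch, instants at or after the clean ECTR stay on the active timeline, which is where the extra retained instants mentioned above come from.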
### Risk Level
medium
The change affects cleaner retention boundary calculation and timeline
archival retention decisions. The implementation is conservative: it avoids
cleaning past pending instants and blocks archival on clean ECTR only when
the existing config is enabled.
Verification:
- `mvn -pl hudi-common -am -DskipTests -DskipITs -Dcheckstyle.skip=true test-compile`
- `mvn -pl hudi-flink-datasource/hudi-flink -am -DskipITs -Dcheckstyle.skip=true -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false -Dtest=TestFlinkWriteClients#testCleanByTimeConfigsPropagateToWriteConfig test`
- `git diff --check`
### Documentation Update
The Hudi website/config documentation should be updated for the newly
exposed Flink options:
- `hoodie.clean.max.commits`
- `hoodie.clean.empty.commit.interval.hours`
- `hoodie.archive.block.on.clean.ectr`
### Contributor's checklist
- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [x] Adequate tests were added if applicable