gfcoder opened a new issue, #11405:
URL: https://github.com/apache/hudi/issues/11405
The clean service does not clean historical file versions after the savepoint
instant when I set `hoodie.archive.beyond.savepoint=true`.
**To Reproduce**
1. Set `hoodie.archive.beyond.savepoint=true`.
2. Use the default clean policy (`KEEP_LATEST_COMMITS`).
3. Use the default archive policy.
4. Start a Flink job.
5. After several commits, create a savepoint.
6. After several clean periods, check the partition data.
**Expected behavior**
Old commit data should be cleaned up according to the clean policy.
**Environment Description**
* Hudi version: 0.13.1
* Flink version: 1.14.4
* Hadoop version: 3.1.0
* Storage: HDFS
**Additional context**
I found that in the `HoodieDefaultTimeline.getFirstNonSavepointCommit`
method, the `savepointTimestamps` set is always empty, even though the
savepoint instant already exists.
This happens because `CleanPlanner.getFilesToCleanKeepingLatestCommits` calls
`fileSystemView.getAllFileGroups`, which retrieves all file groups in the
partition path. However, the `HoodieTimeline` held by each `HoodieFileGroup`
contains only instants with the following actions: `COMMIT_ACTION`,
`DELTA_COMMIT_ACTION`, `COMPACTION_ACTION`, `LOG_COMPACTION_ACTION`,
`REPLACE_COMMIT_ACTION`. Savepoint instants are therefore not visible to it.
Consequently, when `getFirstNonSavepointCommit` is called, it never returns
the first instant beyond the savepoint instant. As a result, historical
version files are never cleaned.
Call chain (note the `getWriteTimeline` step):
`CleanPlanner.getFilesToCleanKeepingLatestCommits ->
fileSystemView.getAllFileGroups -> AbstractTableFileSystemView.addFilesToView
-> this.visibleCommitsAndCompactionTimeline =
visibleActiveTimeline.getWriteTimeline -> fileGroup.getAllFileSlices ->
HoodieDefaultTimeline.getFirstNonSavepointCommit`
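
To make the suspected mechanism concrete, here is a minimal, self-contained sketch. It is a simplified model and not Hudi's actual code: `SketchInstant`, `TimelineSketch`, and the action strings are hypothetical stand-ins, and `firstNonSavepointCommit` only mirrors the behavior described above (collect the savepoint timestamps, then return the first instant whose timestamp is not in that set). It shows that once a timeline is filtered down to write actions, the savepoint set is empty and the lookup falls back to the very first instant.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical stand-in for a timeline instant (timestamp + action). */
final class SketchInstant {
    final String timestamp;
    final String action;

    SketchInstant(String timestamp, String action) {
        this.timestamp = timestamp;
        this.action = action;
    }
}

/** Simplified model of the behavior described in this issue; not Hudi code. */
final class TimelineSketch {
    // Illustrative action names standing in for COMMIT_ACTION, DELTA_COMMIT_ACTION,
    // COMPACTION_ACTION, LOG_COMPACTION_ACTION and REPLACE_COMMIT_ACTION.
    static final Set<String> WRITE_ACTIONS = new HashSet<>(Arrays.asList(
            "commit", "deltacommit", "compaction", "logcompaction", "replacecommit"));

    // Assumed name for the savepoint action in this sketch.
    static final String SAVEPOINT_ACTION = "savepoint";

    /** Keeps only write actions, as a write-only timeline would. */
    static List<SketchInstant> writeTimeline(List<SketchInstant> instants) {
        return instants.stream()
                .filter(i -> WRITE_ACTIONS.contains(i.action))
                .collect(Collectors.toList());
    }

    /** Timestamps of all savepoint instants present in the given timeline. */
    static Set<String> savepointTimestamps(List<SketchInstant> instants) {
        return instants.stream()
                .filter(i -> SAVEPOINT_ACTION.equals(i.action))
                .map(i -> i.timestamp)
                .collect(Collectors.toSet());
    }

    /**
     * First instant whose timestamp is not savepointed. With an empty
     * savepoint set this is simply the first instant in the timeline.
     */
    static Optional<SketchInstant> firstNonSavepointCommit(List<SketchInstant> instants) {
        Set<String> savepoints = savepointTimestamps(instants);
        return instants.stream()
                .filter(i -> !savepoints.contains(i.timestamp))
                .findFirst();
    }
}
```

For a timeline with a commit at `001`, a savepoint on `001`, and commits at `002` and `003`, `firstNonSavepointCommit` on the full list yields the instant at `002`. On `writeTimeline(...)` of the same list, `savepointTimestamps` is empty and the result is the instant at `001`, which matches the symptom reported here.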