Re: [I] [SUPPORT] The clean service can't clean historical version files after the savepoint instant when i set `hoodie.archive.beyond.savepoint=true` [hudi]

2024-06-08 Thread via GitHub


danny0405 commented on issue #11405:
URL: https://github.com/apache/hudi/issues/11405#issuecomment-2156256748

   @nsivabalan can you give some insights here?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [SUPPORT] The clean service can't clean historical version files after the savepoint instant when i set `hoodie.archive.beyond.savepoint=true` [hudi]

2024-06-06 Thread via GitHub


gfcoder opened a new issue, #11405:
URL: https://github.com/apache/hudi/issues/11405

   
   The clean service can't clean historical version files after the savepoint 
instant when I set `hoodie.archive.beyond.savepoint=true`.
   
   **To Reproduce**
   1. Set `hoodie.archive.beyond.savepoint=true`.
   2. Use the default clean policy (KEEP_LATEST_COMMITS).
   3. Use the default archive policy.
   4. Start a Flink job.
   5. After several commits, create a savepoint.
   6. After several clean periods, check the partition data.
   
   **Expected behavior**
   Data from old commits should be cleaned up according to the clean policy.
   
   **Environment Description**
   * Hudi version: 0.13.1
   * Flink version: 1.14.4
   * Hadoop version: 3.1.0
   * Storage: HDFS
   
   **Additional context**
   I found that in the `HoodieDefaultTimeline.getFirstNonSavepointCommit` 
method, the `savepointTimestamps` set is always empty, even though the 
savepoint instant already exists. 
This happens because `CleanPlanner.getFilesToCleanKeepingLatestCommits` calls 
`fileSystemView.getAllFileGroups`, which retrieves all file groups in the 
partition path. However, the `HoodieTimeline` held by each `HoodieFileGroup` 
only matches the following actions: `COMMIT_ACTION, DELTA_COMMIT_ACTION, 
COMPACTION_ACTION, LOG_COMPACTION_ACTION, REPLACE_COMMIT_ACTION`. Savepoint 
instants are therefore not visible in that timeline. Consequently, when 
`getFirstNonSavepointCommit` is called, it never returns the first instant 
beyond the savepoint instant. As a result, historical version files are never 
cleaned.
   
   Call chain (note the `getWriteTimeline` step):

   `CleanPlanner.getFilesToCleanKeepingLatestCommits -> 
fileSystemView.getAllFileGroups -> AbstractTableFileSystemView.addFilesToView 
-> this.visibleCommitsAndCompactionTimeline = 
visibleActiveTimeline.getWriteTimeline -> fileGroup.getAllFileSlices -> 
HoodieDefaultTimeline.getFirstNonSavepointCommit`
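
The mechanism described above can be illustrated with a small toy model. This is only a sketch and not Hudi code: the `SavepointTimelineSketch` class, its `Instant` type, and the helper methods are hypothetical, and the action names are used as plain string labels (`SAVEPOINT_ACTION` is an assumed label for a savepoint instant). It shows that once a timeline is filtered down to the write actions listed in the report, a lookup of savepoint timestamps on that filtered timeline comes back empty.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical, simplified model of the reported behavior; none of these types are Hudi classes.
class SavepointTimelineSketch {

    // One timeline entry: a timestamp plus a label for the action that produced it.
    static final class Instant {
        final String timestamp;
        final String action;

        Instant(String timestamp, String action) {
            this.timestamp = timestamp;
            this.action = action;
        }
    }

    // The actions the report says the file group's timeline matches (used here as labels only).
    static final Set<String> WRITE_ACTIONS = Set.of(
        "COMMIT_ACTION", "DELTA_COMMIT_ACTION", "COMPACTION_ACTION",
        "LOG_COMPACTION_ACTION", "REPLACE_COMMIT_ACTION");

    // Models a write-only view: every other action, including savepoints, is dropped.
    static List<Instant> writeTimeline(List<Instant> timeline) {
        return timeline.stream()
            .filter(i -> WRITE_ACTIONS.contains(i.action))
            .collect(Collectors.toList());
    }

    // Collects the timestamps of savepoint instants found in the given timeline.
    static Set<String> savepointTimestamps(List<Instant> timeline) {
        return timeline.stream()
            .filter(i -> i.action.equals("SAVEPOINT_ACTION"))
            .map(i -> i.timestamp)
            .collect(Collectors.toSet());
    }

    // Made-up sample data: three commits and a savepoint taken on the second one.
    static List<Instant> sampleTimeline() {
        return List.of(
            new Instant("001", "COMMIT_ACTION"),
            new Instant("002", "COMMIT_ACTION"),
            new Instant("002", "SAVEPOINT_ACTION"),
            new Instant("003", "COMMIT_ACTION"));
    }

    public static void main(String[] args) {
        List<Instant> active = sampleTimeline();
        // Savepoints are visible on the full timeline but not on the write-only view.
        System.out.println("full timeline:  " + savepointTimestamps(active));
        System.out.println("write timeline: " + savepointTimestamps(writeTimeline(active)));
    }
}
```

In this model the savepoint at `002` is found on the full timeline, while the same lookup on the write-only view returns an empty set, which matches the empty `savepointTimestamps` observed in the report.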
   
   

