[ https://issues.apache.org/jira/browse/HUDI-4878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-4878:
----------------------------------
    Sprint: 2022/09/05, 2022/10/04, 2022/10/18, 2022/11/01, 2022/11/29, 2022/12/12  (was: 2022/09/05, 2022/10/04, 2022/10/18, 2022/11/01, 2022/11/29, 2022/12/12, 0.13.0 Final Sprint)

> Fix incremental cleaning for clean based on LATEST_FILE_VERSIONS
> ----------------------------------------------------------------
>
>                 Key: HUDI-4878
>                 URL: https://issues.apache.org/jira/browse/HUDI-4878
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: cleaning
>            Reporter: sivabalan narayanan
>            Assignee: nicolas paris
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.13.0
>
>
> Cleaning based on LATEST_FILE_VERSIONS can be improved further, since
> incremental clean is not enabled for it. Let's see if we can improve it.
>
> Context from author:
>
> Currently incremental cleaning is run for both the KEEP_LATEST_COMMITS and
> KEEP_LATEST_BY_HOURS policies. It is not run with KEEP_LATEST_FILE_VERSIONS.
> This can lead to files not being cleaned. This PR fixes this problem by
> enabling incremental cleaning for KEEP_LATEST_FILE_VERSIONS only.
> Here is the scenario of the problem:
> Say we have 3 committed files in partition-A, we add a new commit in
> partition-B, and we trigger cleaning for the first time (full partition scan):
> {{partition-A/
>   commit-0.parquet
>   commit-1.parquet
>   commit-2.parquet
> partition-B/
>   commit-3.parquet}}
> Say we use KEEP_LATEST_COMMITS with CLEANER_COMMITS_RETAINED=3: the cleaner
> will remove commit-0.parquet to keep 3 commits.
> For the next cleaning, incremental cleaning will trigger and won't consider
> partition-A/ until a new commit changes it. If no later commit changes
> partition-A, then commit-1.parquet will stay forever, although it should be
> removed by the cleaner.
> Now, in the case of KEEP_LATEST_FILE_VERSIONS, the cleaner will only keep
> commit-2.parquet.
> It then makes sense that incremental cleaning won't consider partition-A
> until it is changed, because only one commit remains there.
> This is why incremental cleaning should only be enabled with
> KEEP_LATEST_FILE_VERSIONS.
> Hope this is clear enough.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
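
The scenario in the quoted description can be sketched as a toy Python model. This is only an illustration under the description's simplified semantics: one file per commit, and each partition treated as a single file group. The function names and the dict-based table are hypothetical and are not Hudi APIs; Hudi's actual cleaner works on file groups and the commit timeline.

```python
from functools import partial
from typing import Callable, Dict, List, Set

# Hypothetical toy model, not Hudi code: partition -> commit numbers that
# each wrote one file (commit-N.parquet) into that partition.
Table = Dict[str, List[int]]
Policy = Callable[[Table, Set[str]], None]


def keep_latest_commits(table: Table, scanned: Set[str], retained: int) -> None:
    """Within the scanned partitions, drop files older than the last `retained` commits."""
    # Assumption: the timeline is derived from the files still present.
    timeline = sorted({c for commits in table.values() for c in commits})
    kept = set(timeline[-retained:])
    for partition in scanned:
        table[partition] = [c for c in table[partition] if c in kept]


def keep_latest_file_versions(table: Table, scanned: Set[str], versions: int) -> None:
    """Within the scanned partitions, keep only the newest `versions` files."""
    for partition in scanned:
        table[partition] = sorted(table[partition])[-versions:]


def run(policy: Policy, incremental: bool) -> Table:
    table: Table = {"partition-A": [0, 1, 2], "partition-B": [3]}
    policy(table, set(table))            # first clean: full partition scan
    table["partition-B"].append(4)       # a later commit touches only partition-B
    # Incremental mode scans only partitions changed since the last clean.
    scanned = {"partition-B"} if incremental else set(table)
    policy(table, scanned)               # second clean
    return table


by_commits = partial(keep_latest_commits, retained=3)
by_versions = partial(keep_latest_file_versions, versions=1)

print(run(by_commits, incremental=True))
print(run(by_commits, incremental=False))
print(run(by_versions, incremental=True))
print(run(by_versions, incremental=False))
```

In this model, the incremental run under the commit-based policy leaves commit-1.parquet in partition-A while a full scan removes it. Under the version-based policy, the incremental and full runs end in the same state.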