[ 
https://issues.apache.org/jira/browse/HUDI-4515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-4515.
-------------------------------------
    Resolution: Fixed

> Savepointed files are cleaned under the keep-latest-versions policy
> -------------------------------------------------------------------
>
>                 Key: HUDI-4515
>                 URL: https://issues.apache.org/jira/browse/HUDI-4515
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: cleaning
>    Affects Versions: 0.11.1
>            Reporter: zouxxyy
>            Assignee: zouxxyy
>            Priority: Blocker
>              Labels: bug, clean, pull-request-available, savepoints
>             Fix For: 0.12.1
>
>
> While testing how clean interacts with savepoints, I found that when the 
> cleaner uses the keep-latest-versions policy, files belonging to a savepoint 
> are deleted. From reading the code, this appears to be a bug.
>  
> For example, with "HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS" and 
> "hoodie.cleaner.fileversions.retained" set to 2, I do the following:
> 1. insert, get xxxx_001.parquet
> 2. savepoint
> 3. insert, get xxxx_002.parquet
> 4. insert, get xxxx_003.parquet
> After the fourth step, xxxx_001.parquet is deleted even though it belongs 
> to the savepoint!
>  
> The relevant code is in 
> hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java, 
> in getFilesToCleanKeepingLatestVersions:
>  * In the code below, savepointed files that fall within the retained 
> versions are skipped and not counted toward keepVersions, which seems 
> unreasonable to me.
>  * On the other hand, a savepointed file among the remaining (older) 
> versions is deleted, which I don't think matches the design intent of 
> savepoints.
> {code:java}
> while (fileSliceIterator.hasNext() && keepVersions > 0) {
>   // Skip this most recent version
>   FileSlice nextSlice = fileSliceIterator.next();
>   Option<HoodieBaseFile> dataFile = nextSlice.getBaseFile();
>   if (dataFile.isPresent() && savepointedFiles.contains(dataFile.get().getFileName())) {
>     // do not clean up a savepoint data file
>     continue;
>   }
>   keepVersions--;
> }
> // Delete the remaining files
> while (fileSliceIterator.hasNext()) {
>   FileSlice nextSlice = fileSliceIterator.next();
>   deletePaths.addAll(getCleanFileInfoForSlice(nextSlice));
> }{code}
>  
> So I think the savepoint check should be moved down into the deletion 
> loop. It can be fixed like this:
> {code:java}
> while (fileSliceIterator.hasNext() && keepVersions > 0) {
>   // Skip this most recent version
>   fileSliceIterator.next();
>   keepVersions--;
> }
> // Delete the remaining files
> while (fileSliceIterator.hasNext()) {
>   FileSlice nextSlice = fileSliceIterator.next();
>   Option<HoodieBaseFile> dataFile = nextSlice.getBaseFile();
>   if (dataFile.isPresent() && savepointedFiles.contains(dataFile.get().getFileName())) {
>     // do not clean up a savepoint data file
>     continue;
>   }
>   deletePaths.addAll(getCleanFileInfoForSlice(nextSlice));
> }{code}
>  
> Thanks.
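
The difference between the two loops quoted above can be reproduced without Hudi. The following is a minimal sketch, not the actual CleanPlanner code: file slices are modeled as plain file-name strings ordered newest first, and the class name SavepointCleanSketch and the methods planOriginal and planProposed are hypothetical names introduced only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the cleaning logic; file slices are modeled as file names.
class SavepointCleanSketch {

  // Mirrors the reported loop: the savepoint check happens only while
  // counting retained versions, so older savepointed files are not protected.
  static List<String> planOriginal(List<String> newestFirst, Set<String> savepointed, int keepVersions) {
    List<String> deletePaths = new ArrayList<>();
    Iterator<String> it = newestFirst.iterator();
    while (it.hasNext() && keepVersions > 0) {
      String file = it.next();
      if (savepointed.contains(file)) {
        continue; // skipped and not counted toward keepVersions
      }
      keepVersions--;
    }
    while (it.hasNext()) {
      deletePaths.add(it.next()); // everything left is deleted, savepointed or not
    }
    return deletePaths;
  }

  // Mirrors the proposed loop: retain the newest versions unconditionally,
  // then exclude savepointed files while collecting deletions.
  static List<String> planProposed(List<String> newestFirst, Set<String> savepointed, int keepVersions) {
    List<String> deletePaths = new ArrayList<>();
    Iterator<String> it = newestFirst.iterator();
    while (it.hasNext() && keepVersions > 0) {
      it.next();
      keepVersions--;
    }
    while (it.hasNext()) {
      String file = it.next();
      if (savepointed.contains(file)) {
        continue; // do not clean up a savepointed file
      }
      deletePaths.add(file);
    }
    return deletePaths;
  }

  public static void main(String[] args) {
    // Scenario from the report: savepoint taken after the first insert, two versions retained.
    List<String> slices = Arrays.asList("xxxx_003.parquet", "xxxx_002.parquet", "xxxx_001.parquet");
    Set<String> savepointed = Collections.singleton("xxxx_001.parquet");
    System.out.println("original: " + planOriginal(slices, savepointed, 2));
    System.out.println("proposed: " + planProposed(slices, savepointed, 2));
  }
}
```

With the scenario from the report, the original loop schedules xxxx_001.parquet for deletion, while the proposed loop schedules nothing.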



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
