[ https://issues.apache.org/jira/browse/HUDI-4515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
sivabalan narayanan closed HUDI-4515.
-------------------------------------
    Resolution: Fixed

> savepoints will be cleaned under the keep-latest-versions policy
> ----------------------------------------------------------------
>
>                 Key: HUDI-4515
>                 URL: https://issues.apache.org/jira/browse/HUDI-4515
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: cleaning
>    Affects Versions: 0.11.1
>            Reporter: zouxxyy
>            Assignee: zouxxyy
>            Priority: Blocker
>              Labels: bug, clean, pull-request-available, savepoints
>             Fix For: 0.12.1
>
>
> While testing the interaction between clean and savepoint, I found that when the
> cleaning policy is keep-latest-versions, files belonging to a savepoint are
> deleted. After reading the code, I believe this is a bug.
>
> For example, with "HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS" and
> "hoodie.cleaner.fileversions.retained" set to 2, do the following:
> 1. insert, producing xxxx_001.parquet
> 2. savepoint
> 3. insert, producing xxxx_002.parquet
> 4. insert, producing xxxx_003.parquet
> After the fourth step, xxxx_001.parquet is deleted even though it
> belongs to the savepoint!
>
> The relevant code is in
> hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java,
> method getFilesToCleanKeepingLatestVersions.
> * On the one hand, savepointed files are skipped and therefore not counted
> toward keepVersions, which seems unreasonable.
> * On the other hand, if a savepointed file falls among the remaining file
> versions, it is deleted, which is not in line with the design
> philosophy of savepoints.
> {code:java}
> while (fileSliceIterator.hasNext() && keepVersions > 0) {
>   // Skip this most recent version
>   FileSlice nextSlice = fileSliceIterator.next();
>   Option<HoodieBaseFile> dataFile = nextSlice.getBaseFile();
>   if (dataFile.isPresent() && savepointedFiles.contains(dataFile.get().getFileName())) {
>     // do not clean up a savepoint data file
>     continue;
>   }
>   keepVersions--;
> }
> // Delete the remaining files
> while (fileSliceIterator.hasNext()) {
>   FileSlice nextSlice = fileSliceIterator.next();
>   deletePaths.addAll(getCleanFileInfoForSlice(nextSlice));
> }
> {code}
>
> So I think the savepoint check should be moved down into the deletion loop;
> the issue can be fixed like this:
>
> {code:java}
> while (fileSliceIterator.hasNext() && keepVersions > 0) {
>   // Skip this most recent version
>   fileSliceIterator.next();
>   keepVersions--;
> }
> // Delete the remaining files
> while (fileSliceIterator.hasNext()) {
>   FileSlice nextSlice = fileSliceIterator.next();
>   Option<HoodieBaseFile> dataFile = nextSlice.getBaseFile();
>   if (dataFile.isPresent() && savepointedFiles.contains(dataFile.get().getFileName())) {
>     // do not clean up a savepoint data file
>     continue;
>   }
>   deletePaths.addAll(getCleanFileInfoForSlice(nextSlice));
> }
> {code}
>
> Thanks.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
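The difference between the two loop orderings can be reproduced with a small standalone simulation. This is a hypothetical sketch, not Hudi's actual code: plain strings stand in for FileSlice/HoodieBaseFile, and buggyPlan/fixedPlan are illustrative names mirroring the before/after logic quoted above.

{code:java}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class CleanPlanSim {

    // Buggy ordering: savepointed files are skipped in the counting loop
    // (so they do not consume a keepVersions slot), but the remaining tail
    // is deleted unconditionally, savepointed or not.
    static List<String> buggyPlan(List<String> newestFirst, Set<String> savepointed, int keepVersions) {
        List<String> deletePaths = new ArrayList<>();
        Iterator<String> it = newestFirst.iterator();
        while (it.hasNext() && keepVersions > 0) {
            String file = it.next();
            if (savepointed.contains(file)) {
                continue; // skipped, not counted toward keepVersions
            }
            keepVersions--;
        }
        while (it.hasNext()) {
            deletePaths.add(it.next()); // no savepoint check here
        }
        return deletePaths;
    }

    // Proposed fix: count the newest keepVersions versions first, then
    // exclude savepointed files when deleting the remainder.
    static List<String> fixedPlan(List<String> newestFirst, Set<String> savepointed, int keepVersions) {
        List<String> deletePaths = new ArrayList<>();
        Iterator<String> it = newestFirst.iterator();
        while (it.hasNext() && keepVersions > 0) {
            it.next();
            keepVersions--;
        }
        while (it.hasNext()) {
            String file = it.next();
            if (savepointed.contains(file)) {
                continue; // do not clean up a savepoint data file
            }
            deletePaths.add(file);
        }
        return deletePaths;
    }

    public static void main(String[] args) {
        // Scenario from the issue: three file versions newest-first,
        // xxxx_001.parquet is savepointed, retain 2 versions.
        List<String> files = List.of("xxxx_003.parquet", "xxxx_002.parquet", "xxxx_001.parquet");
        Set<String> savepointed = Set.of("xxxx_001.parquet");

        System.out.println(buggyPlan(files, savepointed, 2)); // prints [xxxx_001.parquet]
        System.out.println(fixedPlan(files, savepointed, 2)); // prints []
    }
}
{code}

With the buggy ordering, 003 and 002 consume the two retained slots and the savepointed 001 lands in the unconditional deletion loop; with the fix, 001 still falls past the retained slots but is excluded by the savepoint check.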