bvaradar commented on a change in pull request #1870:
URL: https://github.com/apache/hudi/pull/1870#discussion_r467524509



##########
File path: hudi-client/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java
##########
@@ -116,6 +119,19 @@ HoodieCleanerPlan requestClean(JavaSparkContext jsc) {
         PartitionCleanStat partitionCleanStat = partitionCleanStatMap.get(partitionPath);
         partitionCleanStat.addDeleteFilePatterns(deletePath.getName());
         partitionCleanStat.addDeletedFileResult(deletePath.getName(), deletedFileResult);
+
+        // If CleanBootstrapSourceFileEnabled and it is a metadata bootstrap commit, also delete the corresponding source file
+        if (cleanBootstrapSourceFileEnabled && !FSUtils.isLogFile(deletePath)
+            && FSUtils.getCommitTime(delFileName).equals(HoodieTimeline.METADATA_BOOTSTRAP_INSTANT_TS)) {
+          Option<HoodieBaseFile> baseFile = fileSystemView.getBaseFileOn(partitionPath,

Review comment:
       @zhedoubushishi : I tried reaching you on Slack :) to discuss an approach. 
I went ahead and implemented it in the interest of time. 
   
   The basic idea is to ensure that the cleaner plan stores the necessary 
information about the files to be deleted, including bootstrap base files. The 
cleaner executor simply reads the cleaner plan, distinguishes normal from 
bootstrap base files, and deletes them. For bootstrap base files, it records 
the complete path of each file it deleted in a separate (new) Avro field. This 
is very important for incremental timeline syncing (which reads this metadata) 
to work properly. 
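   
   A rough sketch of the flow described above, to make the plan/executor split 
concrete. Every name here (`CleanPlanSketch`, `CleanFileInfo`, `CleanResult`, 
`deletedBootstrapBaseFilePaths`, the `deleter` callback) is hypothetical and 
chosen for illustration only; it does not mirror the actual Hudi classes or the 
Avro schema in the PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class CleanPlanSketch {

  // Hypothetical plan entry: the plan records, per file, whether it is a
  // bootstrap base (source) file, so the executor does not have to work it out.
  static final class CleanFileInfo {
    final String filePath;
    final boolean isBootstrapBaseFile;

    CleanFileInfo(String filePath, boolean isBootstrapBaseFile) {
      this.filePath = filePath;
      this.isBootstrapBaseFile = isBootstrapBaseFile;
    }
  }

  // Hypothetical clean result: bootstrap base files are tracked by complete
  // path in their own list (standing in for the separate, new Avro field).
  static final class CleanResult {
    final List<String> deletedFileNames = new ArrayList<>();
    final List<String> deletedBootstrapBaseFilePaths = new ArrayList<>();
  }

  // The executor only reads the plan. `deleter` stands in for the file system
  // delete call and returns true when the file was actually removed.
  static CleanResult execute(List<CleanFileInfo> plan, Predicate<String> deleter) {
    CleanResult result = new CleanResult();
    for (CleanFileInfo info : plan) {
      if (!deleter.test(info.filePath)) {
        continue; // not deleted, so nothing to record
      }
      if (info.isBootstrapBaseFile) {
        // Complete path, because the source file lives outside the table.
        result.deletedBootstrapBaseFilePaths.add(info.filePath);
      } else {
        result.deletedFileNames.add(fileName(info.filePath));
      }
    }
    return result;
  }

  // Returns the last path component (the whole string if there is no '/').
  static String fileName(String path) {
    return path.substring(path.lastIndexOf('/') + 1);
  }

  public static void main(String[] args) {
    // Made-up paths, purely for demonstration.
    List<CleanFileInfo> plan = Arrays.asList(
        new CleanFileInfo("/table/2020/08/f1_001.parquet", false),
        new CleanFileInfo("/source/2020/08/original.parquet", true));
    CleanResult result = execute(plan, path -> true);
    System.out.println("normal files: " + result.deletedFileNames);
    System.out.println("bootstrap base files: " + result.deletedBootstrapBaseFilePaths);
  }
}
```

   Keeping the bootstrap base file paths in their own list is what lets a 
reader of the clean metadata (such as incremental timeline syncing) tell the 
two kinds of deletions apart without re-deriving anything from file names.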
   
   Please take a look at these code changes if possible. 
   ( @vinothchandar  @umehrot2  : FYI )





