zhangyue19921010 opened a new pull request #4994: URL: https://github.com/apache/hudi/pull/4994
https://issues.apache.org/jira/browse/HUDI-3376

## What is the purpose of the pull request

Add an option named `--skip-under-deletion-data-files` that makes the validator skip comparing data files which are under deletion by the cleaner.

## Brief change log

When the option is enabled, HoodieMetadataTableValidator reads the clean plans of pending cleaning actions to collect the data files that are under deletion, and then skips comparing those files.

This patch was tested in our local environment, with pending cleaning actions present while HoodieMetadataTableValidator was running.

Before this patch, the validator failed with:

```
4257 [Executor task launch worker for task 1] INFO org.apache.hudi.utilities.HoodieMetadataTableValidator - Validation of getLatestBaseFiles succeeded for partition 20210623
4258 [Executor task launch worker for task 1] ERROR org.apache.hudi.utilities.HoodieMetadataTableValidator - Validation of file slices for partition 20210623 failed. file slices from metadata: [FileSlice {fileGroupId=HoodieFileGroupId{partitionPath='20210623', fileId='347bff6c-26bc-4577-ad3d-482ba73453ea-0'}, baseCommitTime=20220309114505047, baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/347bff6c-26bc-4577-ad3d-482ba73453ea-0_0-11-0_20220309114505047.parquet, fileLen=14203377, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice {fileGroupId=HoodieFileGroupId{partitionPath='20210623', fileId='347bff6c-26bc-4577-ad3d-482ba73453ea-0'}, baseCommitTime=20220309124131821, baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/347bff6c-26bc-4577-ad3d-482ba73453ea-0_1-25-1611_20220309124131821.parquet, fileLen=13973255, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice {fileGroupId=HoodieFileGroupId{partitionPath='20210623', fileId='347bff6c-26bc-4577-ad3d-482ba73453ea-0'}, baseCommitTime=20220309125005608, baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/
hudiAfTable/forecast_agg/20210623/347bff6c-26bc-4577-ad3d-482ba73453ea-0_1-25-1611_20220309125005608.parquet, fileLen=13975017, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice {fileGroupId=HoodieFileGroupId{partitionPath='20210623', fileId='a149d7a0-1377-4694-9f95-c98486d6947c-0'}, baseCommitTime=20220309114505047, baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/a149d7a0-1377-4694-9f95-c98486d6947c-0_1-12-0_20220309114505047.parquet, fileLen=14121559, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice {fileGroupId=HoodieFileGroupId{partitionPath='20210623', fileId='a149d7a0-1377-4694-9f95-c98486d6947c-0'}, baseCommitTime=20220309124131821, baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/a149d7a0-1377-4694-9f95-c98486d6947c-0_0-20-1610_20220309124131821.parquet, fileLen=13895052, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice {fileGroupId=HoodieFileGroupId{partitionPath='20210623', fil eId='a149d7a0-1377-4694-9f95-c98486d6947c-0'}, baseCommitTime=20220309125005608, baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/a149d7a0-1377-4694-9f95-c98486d6947c-0_0-20-1610_20220309125005608.parquet, fileLen=13896710, BootstrapBaseFile=null}', logFiles='[]'}] file slices from file system and base files: [FileSlice {fileGroupId=HoodieFileGroupId{partitionPath='20210623', fileId='347bff6c-26bc-4577-ad3d-482ba73453ea-0'}, baseCommitTime=20220309114505047, baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/347bff6c-26bc-4577-ad3d-482ba73453ea-0_0-11-0_20220309114505047.parquet, fileLen=14203377, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice {fileGroupId=HoodieFileGroupId{partitionPath='20210623', fileId='347bff6c-26bc-4577-ad3d-482ba73453ea-0'}, baseCommitTime=20220309124131821, 
baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/347bff6c-26bc-4577-ad3d-482ba73453ea-0_1-25-1611_20220309124131821.parquet, fileLen=13973255, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice {fileGroupId=HoodieFileGroupId{partitionPath='20210623', fileId='347bff6c-26bc-4577-ad3d-482ba73453ea-0'}, baseCommitTime=20220309125005608, baseFile='HoodieBaseFile{fullPath=file:/U sers/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/347bff6c-26bc-4577-ad3d-482ba73453ea-0_1-25-1611_20220309125005608.parquet, fileLen=13975017, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice {fileGroupId=HoodieFileGroupId{partitionPath='20210623', fileId='a149d7a0-1377-4694-9f95-c98486d6947c-0'}, baseCommitTime=20220309124131821, baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/a149d7a0-1377-4694-9f95-c98486d6947c-0_0-20-1610_20220309124131821.parquet, fileLen=13895052, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice {fileGroupId=HoodieFileGroupId{partitionPath='20210623', fileId='a149d7a0-1377-4694-9f95-c98486d6947c-0'}, baseCommitTime=20220309125005608, baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/a149d7a0-1377-4694-9f95-c98486d6947c-0_0-20-1610_20220309125005608.parquet, fileLen=13896710, BootstrapBaseFile=null}', logFiles='[]'}] 4258 [Executor task launch worker for task 1] ERROR org.apache.hudi.utilities.HoodieMetadataTableValidator - Metadata table validation failed for 20210623 due to HoodieValidationException org.apache.hudi.exception.HoodieValidationException: Validation of file slices for partition 20210623 failed. 
file slices from metadata: [FileSlice {fileGroupId=HoodieFileGroupId{partitionPath='20210623', fileId='347bff6c-26bc-4577-ad3d-482ba73453ea-0'}, baseCommitTime=20220309114505047, baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/347bff6c-26bc-4577-ad3d-482ba73453ea-0_0-11-0_20220309114505047.parquet, fileLen=14203377, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice {fileGroupId=HoodieFileGroupId{partitionPath='20210623', fileId='347bff6c-26bc-4577-ad3d-482ba73453ea-0'}, baseCommitTime=20220309124131821, baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/347bff6c-26bc-4577-ad3d-482ba73453ea-0_1-25-1611_20220309124131821.parquet, fileLen=13973255, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice {fileGroupId=HoodieFileGroupId{partitionPath='20210623', fileId='347bff6c-26bc-4577-ad3d-482ba73453ea-0'}, baseCommitTime=20220309125005608, baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/ hudiAfTable/forecast_agg/20210623/347bff6c-26bc-4577-ad3d-482ba73453ea-0_1-25-1611_20220309125005608.parquet, fileLen=13975017, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice {fileGroupId=HoodieFileGroupId{partitionPath='20210623', fileId='a149d7a0-1377-4694-9f95-c98486d6947c-0'}, baseCommitTime=20220309114505047, baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/a149d7a0-1377-4694-9f95-c98486d6947c-0_1-12-0_20220309114505047.parquet, fileLen=14121559, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice {fileGroupId=HoodieFileGroupId{partitionPath='20210623', fileId='a149d7a0-1377-4694-9f95-c98486d6947c-0'}, baseCommitTime=20220309124131821, baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/a149d7a0-1377-4694-9f95-c98486d6947c-0_0-20-1610_20220309124131821.parquet, fileLen=13895052, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice {fileGroupId=HoodieFileGroupId{partitionPath='20210623', fil 
eId='a149d7a0-1377-4694-9f95-c98486d6947c-0'}, baseCommitTime=20220309125005608, baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/a149d7a0-1377-4694-9f95-c98486d6947c-0_0-20-1610_20220309125005608.parquet, fileLen=13896710, BootstrapBaseFile=null}', logFiles='[]'}] file slices from file system and base files: [FileSlice {fileGroupId=HoodieFileGroupId{partitionPath='20210623', fileId='347bff6c-26bc-4577-ad3d-482ba73453ea-0'}, baseCommitTime=20220309114505047, baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/347bff6c-26bc-4577-ad3d-482ba73453ea-0_0-11-0_20220309114505047.parquet, fileLen=14203377, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice {fileGroupId=HoodieFileGroupId{partitionPath='20210623', fileId='347bff6c-26bc-4577-ad3d-482ba73453ea-0'}, baseCommitTime=20220309124131821, baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/347bff6c-26bc-4577-ad3d-482ba73453ea-0_1-25-1611_20220309124131821.parquet, fileLen=13973255, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice {fileGroupId=HoodieFileGroupId{partitionPath='20210623', fileId='347bff6c-26bc-4577-ad3d-482ba73453ea-0'}, baseCommitTime=20220309125005608, baseFile='HoodieBaseFile{fullPath=file:/U sers/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/347bff6c-26bc-4577-ad3d-482ba73453ea-0_1-25-1611_20220309125005608.parquet, fileLen=13975017, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice {fileGroupId=HoodieFileGroupId{partitionPath='20210623', fileId='a149d7a0-1377-4694-9f95-c98486d6947c-0'}, baseCommitTime=20220309124131821, baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/a149d7a0-1377-4694-9f95-c98486d6947c-0_0-20-1610_20220309124131821.parquet, fileLen=13895052, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice {fileGroupId=HoodieFileGroupId{partitionPath='20210623', fileId='a149d7a0-1377-4694-9f95-c98486d6947c-0'}, 
baseCommitTime=20220309125005608, baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/a149d7a0-1377-4694-9f95-c98486d6947c-0_0-20-1610_20220309125005608.parquet, fileLen=13896710, BootstrapBaseFile=null}', logFiles='[]'}] at org.apache.hudi.utilities.HoodieMetadataTableValidator.validate(HoodieMetadataTableValidator.java:660) at org.apache.hudi.utilities.HoodieMetadataTableValidator.validateAllFileGroups(HoodieMetadataTableValidator.java:522) at org.apache.hudi.utilities.HoodieMetadataTableValidator.validateFilesInPartition(HoodieMetadataTableValidator.java:476) at org.apache.hudi.utilities.HoodieMetadataTableValidator.lambda$doMetadataTableValidation$6f568e7e$1(HoodieMetadataTableValidator.java:405) at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1040) at scala.collection.Iterator$$anon$11.next(Iterator.scala:370) at scala.collection.Iterator$class.foreach(Iterator.scala:742) at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308) at scala.collection.AbstractIterator.to(Iterator.scala:1194) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287) at scala.collection.AbstractIterator.toArray(Iterator.scala:1194) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101) at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:123) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 4265 [Executor task launch worker for task 1] INFO org.apache.spark.executor.Executor - Finished task 0.0 in stage 1.0 (TID 1). 712 bytes result sent to driver 4271 [task-result-getter-1] INFO org.apache.spark.scheduler.TaskSetManager - Finished task 0.0 in stage 1.0 (TID 1) in 132 ms on localhost (executor driver) (1/1) 4271 [task-result-getter-1] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Removed TaskSet 1.0, whose tasks have all completed, from pool 4272 [dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - ResultStage 1 (collect at HoodieJavaRDD.java:140) finished in 0.216 s 4272 [main] INFO org.apache.spark.scheduler.DAGScheduler - Job 1 finished: collect at HoodieJavaRDD.java:140, took 0.218763 s 4272 [main] WARN org.apache.hudi.utilities.HoodieMetadataTableValidator - Metadata table validation failed. 4274 [Thread-1] INFO org.apache.spark.SparkContext - Invoking stop() from shutdown hook 4281 [Thread-1] INFO org.spark_project.jetty.server.AbstractConnector - Stopped Spark@43ed0ff3{HTTP/1.1,[http/1.1]}{0.0.0.0:4040} 4282 [Thread-1] INFO org.apache.spark.ui.SparkUI - Stopped Spark web UI at http://localhost:4040 4288 [dispatcher-event-loop-1] INFO org.apache.spark.MapOutputTrackerMasterEndpoint - MapOutputTrackerMasterEndpoint stopped! 
4298 [Thread-1] INFO org.apache.spark.storage.memory.MemoryStore - MemoryStore cleared
4298 [Thread-1] INFO org.apache.spark.storage.BlockManager - BlockManager stopped
4299 [Thread-1] INFO org.apache.spark.storage.BlockManagerMaster - BlockManagerMaster stopped
4300 [dispatcher-event-loop-6] INFO org.apache.spark.scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint - OutputCommitCoordinator stopped!
4307 [Thread-1] INFO org.apache.spark.SparkContext - Successfully stopped SparkContext
4307 [Thread-1] INFO org.apache.spark.util.ShutdownHookManager - Shutdown hook called
4308 [Thread-1] INFO org.apache.spark.util.ShutdownHookManager - Deleting directory /private/var/folders/61/77xdhf3x0x9g3t_vdd1c9_nwr4wznp/T/spark-97af2b76-9039-45e2-98b4-52400a3a1ecc
```

After this patch

```
12543 [Executor task launch worker for task 1] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor [.gz]
12548 [Executor task launch worker for task 1] INFO org.apache.hudi.metadata.BaseTableMetadata - Listed file in partition from metadata: partition=20210623, #files=6
12549 [Executor task launch worker for task 1] INFO org.apache.hudi.common.table.view.AbstractTableFileSystemView - addFilesToView: NumFiles=6, NumFileGroups=2, FileGroupsCreationTime=1, StoreTimeTaken=0
18533 [Executor task launch worker for task 1] INFO org.apache.hudi.common.table.view.AbstractTableFileSystemView - Building file system view for partition (20210623)
18536 [Executor task launch worker for task 1] INFO org.apache.hudi.common.table.view.AbstractTableFileSystemView - addFilesToView: NumFiles=6, NumFileGroups=2, FileGroupsCreationTime=1, StoreTimeTaken=0
18536 [Executor task launch worker for task 1] INFO org.apache.hudi.utilities.HoodieMetadataTableValidator - Validation of file slices succeeded for partition 20210623
18536 [Executor task launch worker for task 1] INFO org.apache.hudi.utilities.HoodieMetadataTableValidator - Validation of getLatestFileSlices succeeded for partition 20210623
18542 [Executor task launch worker for task 1] INFO org.apache.hudi.utilities.HoodieMetadataTableValidator - Validation of getLatestBaseFiles succeeded for partition 20210623
21672 [Executor task launch worker for task 1] INFO org.apache.hudi.utilities.HoodieMetadataTableValidator - Validation of file slices succeeded for partition 20210623
21672 [Executor task launch worker for task 1] INFO org.apache.hudi.utilities.HoodieMetadataTableValidator - Validation of all file groups succeeded for partition 20210623
21672 [Executor task launch worker for task 1] INFO org.apache.hudi.utilities.HoodieMetadataTableValidator - Metadata table validation succeeded for 20210623
21684 [Executor task launch worker for task 1] INFO org.apache.spark.executor.Executor - Finished task 0.0 in stage 1.0 (TID 1). 669 bytes result sent to driver
21696 [task-result-getter-1] INFO org.apache.spark.scheduler.TaskSetManager - Finished task 0.0 in stage 1.0 (TID 1) in 16418 ms on localhost (executor driver) (1/1)
21696 [task-result-getter-1] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Removed TaskSet 1.0, whose tasks have all completed, from pool
21697 [dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - ResultStage 1 (collect at HoodieJavaRDD.java:140) finished in 16.501 s
21697 [main] INFO org.apache.spark.scheduler.DAGScheduler - Job 1 finished: collect at HoodieJavaRDD.java:140, took 16.503945 s
21698 [main] INFO org.apache.hudi.utilities.HoodieMetadataTableValidator - Metadata table validation succeeded.
```

## Verify this pull request

*(Please pick either of the following options)*

This pull request is a trivial rework / code cleanup without any test coverage.

*(or)*

This pull request is already covered by existing tests, such as *(please describe tests)*.

*(or)*

This change added tests and can be verified as follows:

*(example:)*

- *Added integration tests for end-to-end.*
- *Added HoodieClientWriteTest to verify the change.*
- *Manually verified the change by running a job locally.*

## Committer checklist

- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
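
For readers who want to picture the skip behavior described in the change log, here is a minimal sketch. It is not the code in this pull request and does not use any Hudi API: the class `SkipUnderDeletionSketch`, the method `dropUnderDeletion`, and the short file names are all hypothetical. It only mirrors the situation in the failing log above, where the metadata side lists six base files and the file system side lists five, and assumes the extra file is one that a pending clean plan marks for deletion.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration only; none of these names come from Hudi.
final class SkipUnderDeletionSketch {

    // Keeps only the files that are NOT listed for deletion by a pending clean plan.
    public static List<String> dropUnderDeletion(List<String> files, Set<String> underDeletion) {
        return files.stream()
                .filter(file -> !underDeletion.contains(file))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-ins for the two listings the validator compares (made-up names).
        List<String> fromMetadata =
                Arrays.asList("fg1_t1", "fg1_t2", "fg1_t3", "fg2_t1", "fg2_t2", "fg2_t3");
        List<String> fromFileSystem =
                Arrays.asList("fg1_t1", "fg1_t2", "fg1_t3", "fg2_t2", "fg2_t3");

        // Assumption: the pending clean plan lists fg2_t1 for deletion.
        Set<String> underDeletion = new HashSet<>(Arrays.asList("fg2_t1"));

        // Without filtering, the two listings differ and a comparison would fail.
        System.out.println("raw listings equal: " + fromMetadata.equals(fromFileSystem));

        // With the same filter applied to both sides, the listings agree.
        boolean filteredEqual = dropUnderDeletion(fromMetadata, underDeletion)
                .equals(dropUnderDeletion(fromFileSystem, underDeletion));
        System.out.println("filtered listings equal: " + filteredEqual);
    }
}
```

The filter is applied to both listings in this sketch so that the result does not depend on which side still sees the file while the cleaner is partway through.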