zhangyue19921010 opened a new pull request #4994:
URL: https://github.com/apache/hudi/pull/4994


   https://issues.apache.org/jira/browse/HUDI-3376
   ## What is the purpose of the pull request
   Add an option named `--skip-under-deletion-data-files` that skips comparing data files which are pending deletion by the cleaner.
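   
   For reference, here is a minimal sketch of how such a flag can be declared in the validator's JCommander-style `Config` class; the option string comes from this PR, while the class shape, field name, and default shown here are illustrative assumptions:
   
   ```java
   import java.io.Serializable;
   
   import com.beust.jcommander.Parameter;
   
   // Illustrative sketch of the validator's config class; only the option
   // string `--skip-under-deletion-data-files` is taken from this PR.
   public class Config implements Serializable {
     @Parameter(names = {"--skip-under-deletion-data-files"},
         description = "Skip validating data files that a pending clean action is about to delete")
     public boolean skipUnderDeletionDataFiles = false;
   }
   ```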
   
   ## Brief change log
   When enabled, HoodieMetadataTableValidator reads the clean plans of pending cleaning actions to determine which data files are pending deletion, and then skips comparing those files (sketched below).
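   
   A rough sketch of that mechanism, assuming Hudi's `CleanerUtils.getCleanerPlan` helper and the Avro `HoodieCleanerPlan` model; the wrapper class and method names here are hypothetical, not the PR's actual code:
   
   ```java
   import java.io.IOException;
   import java.util.HashSet;
   import java.util.Set;
   
   import org.apache.hudi.avro.model.HoodieCleanerPlan;
   import org.apache.hudi.common.table.HoodieTableMetaClient;
   import org.apache.hudi.common.util.CleanerUtils;
   
   // Hypothetical helper: collects the paths of data files that pending
   // clean actions plan to delete, so the validator can skip them.
   public class UnderDeletionFiles {
     public static Set<String> underDeletionFilePaths(HoodieTableMetaClient metaClient) {
       Set<String> paths = new HashSet<>();
       // Pending (requested/inflight) clean instants on the active timeline.
       metaClient.getActiveTimeline().getCleanerTimeline()
           .filterInflightsAndRequested().getInstants()
           .forEach(instant -> {
             try {
               // Deserialize the clean plan to see which files it will delete.
               HoodieCleanerPlan plan = CleanerUtils.getCleanerPlan(metaClient, instant);
               plan.getFilePathsToBeDeletedPerPartition().values()
                   .forEach(files -> files.forEach(f -> paths.add(f.getFilePath())));
             } catch (IOException e) {
               throw new RuntimeException("Failed to read clean plan for instant " + instant, e);
             }
           });
       return paths;
     }
   }
   ```
   
   The file-slice comparison can then drop any entry whose base file path appears in this set before asserting that the metadata-table listing matches the file-system listing.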
   
   This patch was tested in our local environment, with pending cleaning actions present while HoodieMetadataTableValidator was running.
   Before this patch, the validator failed: the metadata-table listing below still contains the base file `a149d7a0-1377-4694-9f95-c98486d6947c-0_1-12-0_20220309114505047.parquet`, which is absent from the file-system listing because the pending clean had already removed it, so the two listings no longer match:
   ```
   4257 [Executor task launch worker for task 1] INFO  
org.apache.hudi.utilities.HoodieMetadataTableValidator  - Validation of 
getLatestBaseFiles succeeded for partition 20210623
   4258 [Executor task launch worker for task 1] ERROR 
org.apache.hudi.utilities.HoodieMetadataTableValidator  - Validation of file 
slices for partition 20210623 failed.
   file slices from metadata: [FileSlice 
{fileGroupId=HoodieFileGroupId{partitionPath='20210623', 
fileId='347bff6c-26bc-4577-ad3d-482ba73453ea-0'}, 
baseCommitTime=20220309114505047, 
baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/347bff6c-26bc-4577-ad3d-482ba73453ea-0_0-11-0_20220309114505047.parquet,
 fileLen=14203377, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice 
{fileGroupId=HoodieFileGroupId{partitionPath='20210623', 
fileId='347bff6c-26bc-4577-ad3d-482ba73453ea-0'}, 
baseCommitTime=20220309124131821, 
baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/347bff6c-26bc-4577-ad3d-482ba73453ea-0_1-25-1611_20220309124131821.parquet,
 fileLen=13973255, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice 
{fileGroupId=HoodieFileGroupId{partitionPath='20210623', 
fileId='347bff6c-26bc-4577-ad3d-482ba73453ea-0'}, 
baseCommitTime=20220309125005608, 
baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/347bff6c-26bc-4577-ad3d-482ba73453ea-0_1-25-1611_20220309125005608.parquet,
 fileLen=13975017, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice 
{fileGroupId=HoodieFileGroupId{partitionPath='20210623', 
fileId='a149d7a0-1377-4694-9f95-c98486d6947c-0'}, 
baseCommitTime=20220309114505047, 
baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/a149d7a0-1377-4694-9f95-c98486d6947c-0_1-12-0_20220309114505047.parquet,
 fileLen=14121559, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice 
{fileGroupId=HoodieFileGroupId{partitionPath='20210623', 
fileId='a149d7a0-1377-4694-9f95-c98486d6947c-0'}, 
baseCommitTime=20220309124131821, 
baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/a149d7a0-1377-4694-9f95-c98486d6947c-0_0-20-1610_20220309124131821.parquet,
 fileLen=13895052, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice 
{fileGroupId=HoodieFileGroupId{partitionPath='20210623', fileId='a149d7a0-1377-4694-9f95-c98486d6947c-0'}, 
baseCommitTime=20220309125005608, 
baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/a149d7a0-1377-4694-9f95-c98486d6947c-0_0-20-1610_20220309125005608.parquet,
 fileLen=13896710, BootstrapBaseFile=null}', logFiles='[]'}]
   file slices from file system and base files: [FileSlice 
{fileGroupId=HoodieFileGroupId{partitionPath='20210623', 
fileId='347bff6c-26bc-4577-ad3d-482ba73453ea-0'}, 
baseCommitTime=20220309114505047, 
baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/347bff6c-26bc-4577-ad3d-482ba73453ea-0_0-11-0_20220309114505047.parquet,
 fileLen=14203377, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice 
{fileGroupId=HoodieFileGroupId{partitionPath='20210623', 
fileId='347bff6c-26bc-4577-ad3d-482ba73453ea-0'}, 
baseCommitTime=20220309124131821, 
baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/347bff6c-26bc-4577-ad3d-482ba73453ea-0_1-25-1611_20220309124131821.parquet,
 fileLen=13973255, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice 
{fileGroupId=HoodieFileGroupId{partitionPath='20210623', 
fileId='347bff6c-26bc-4577-ad3d-482ba73453ea-0'}, 
baseCommitTime=20220309125005608, baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/347bff6c-26bc-4577-ad3d-482ba73453ea-0_1-25-1611_20220309125005608.parquet,
 fileLen=13975017, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice 
{fileGroupId=HoodieFileGroupId{partitionPath='20210623', 
fileId='a149d7a0-1377-4694-9f95-c98486d6947c-0'}, 
baseCommitTime=20220309124131821, 
baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/a149d7a0-1377-4694-9f95-c98486d6947c-0_0-20-1610_20220309124131821.parquet,
 fileLen=13895052, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice 
{fileGroupId=HoodieFileGroupId{partitionPath='20210623', 
fileId='a149d7a0-1377-4694-9f95-c98486d6947c-0'}, 
baseCommitTime=20220309125005608, 
baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/a149d7a0-1377-4694-9f95-c98486d6947c-0_0-20-1610_20220309125005608.parquet,
 fileLen=13896710, BootstrapBaseFile=null}', logFiles='[]'}]
   4258 [Executor task launch worker for task 1] ERROR 
org.apache.hudi.utilities.HoodieMetadataTableValidator  - Metadata table 
validation failed for 20210623 due to HoodieValidationException
   org.apache.hudi.exception.HoodieValidationException: Validation of file 
slices for partition 20210623 failed.
   file slices from metadata: [FileSlice 
{fileGroupId=HoodieFileGroupId{partitionPath='20210623', 
fileId='347bff6c-26bc-4577-ad3d-482ba73453ea-0'}, 
baseCommitTime=20220309114505047, 
baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/347bff6c-26bc-4577-ad3d-482ba73453ea-0_0-11-0_20220309114505047.parquet,
 fileLen=14203377, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice 
{fileGroupId=HoodieFileGroupId{partitionPath='20210623', 
fileId='347bff6c-26bc-4577-ad3d-482ba73453ea-0'}, 
baseCommitTime=20220309124131821, 
baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/347bff6c-26bc-4577-ad3d-482ba73453ea-0_1-25-1611_20220309124131821.parquet,
 fileLen=13973255, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice 
{fileGroupId=HoodieFileGroupId{partitionPath='20210623', 
fileId='347bff6c-26bc-4577-ad3d-482ba73453ea-0'}, 
baseCommitTime=20220309125005608, 
baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/347bff6c-26bc-4577-ad3d-482ba73453ea-0_1-25-1611_20220309125005608.parquet,
 fileLen=13975017, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice 
{fileGroupId=HoodieFileGroupId{partitionPath='20210623', 
fileId='a149d7a0-1377-4694-9f95-c98486d6947c-0'}, 
baseCommitTime=20220309114505047, 
baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/a149d7a0-1377-4694-9f95-c98486d6947c-0_1-12-0_20220309114505047.parquet,
 fileLen=14121559, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice 
{fileGroupId=HoodieFileGroupId{partitionPath='20210623', 
fileId='a149d7a0-1377-4694-9f95-c98486d6947c-0'}, 
baseCommitTime=20220309124131821, 
baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/a149d7a0-1377-4694-9f95-c98486d6947c-0_0-20-1610_20220309124131821.parquet,
 fileLen=13895052, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice 
{fileGroupId=HoodieFileGroupId{partitionPath='20210623', fileId='a149d7a0-1377-4694-9f95-c98486d6947c-0'}, 
baseCommitTime=20220309125005608, 
baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/a149d7a0-1377-4694-9f95-c98486d6947c-0_0-20-1610_20220309125005608.parquet,
 fileLen=13896710, BootstrapBaseFile=null}', logFiles='[]'}]
   file slices from file system and base files: [FileSlice 
{fileGroupId=HoodieFileGroupId{partitionPath='20210623', 
fileId='347bff6c-26bc-4577-ad3d-482ba73453ea-0'}, 
baseCommitTime=20220309114505047, 
baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/347bff6c-26bc-4577-ad3d-482ba73453ea-0_0-11-0_20220309114505047.parquet,
 fileLen=14203377, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice 
{fileGroupId=HoodieFileGroupId{partitionPath='20210623', 
fileId='347bff6c-26bc-4577-ad3d-482ba73453ea-0'}, 
baseCommitTime=20220309124131821, 
baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/347bff6c-26bc-4577-ad3d-482ba73453ea-0_1-25-1611_20220309124131821.parquet,
 fileLen=13973255, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice 
{fileGroupId=HoodieFileGroupId{partitionPath='20210623', 
fileId='347bff6c-26bc-4577-ad3d-482ba73453ea-0'}, 
baseCommitTime=20220309125005608, baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/347bff6c-26bc-4577-ad3d-482ba73453ea-0_1-25-1611_20220309125005608.parquet,
 fileLen=13975017, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice 
{fileGroupId=HoodieFileGroupId{partitionPath='20210623', 
fileId='a149d7a0-1377-4694-9f95-c98486d6947c-0'}, 
baseCommitTime=20220309124131821, 
baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/a149d7a0-1377-4694-9f95-c98486d6947c-0_0-20-1610_20220309124131821.parquet,
 fileLen=13895052, BootstrapBaseFile=null}', logFiles='[]'}, FileSlice 
{fileGroupId=HoodieFileGroupId{partitionPath='20210623', 
fileId='a149d7a0-1377-4694-9f95-c98486d6947c-0'}, 
baseCommitTime=20220309125005608, 
baseFile='HoodieBaseFile{fullPath=file:/Users/yuezhang/tmp/hudiAfTable/forecast_agg/20210623/a149d7a0-1377-4694-9f95-c98486d6947c-0_0-20-1610_20220309125005608.parquet,
 fileLen=13896710, BootstrapBaseFile=null}', logFiles='[]'}]
        at 
org.apache.hudi.utilities.HoodieMetadataTableValidator.validate(HoodieMetadataTableValidator.java:660)
        at 
org.apache.hudi.utilities.HoodieMetadataTableValidator.validateAllFileGroups(HoodieMetadataTableValidator.java:522)
        at 
org.apache.hudi.utilities.HoodieMetadataTableValidator.validateFilesInPartition(HoodieMetadataTableValidator.java:476)
        at 
org.apache.hudi.utilities.HoodieMetadataTableValidator.lambda$doMetadataTableValidation$6f568e7e$1(HoodieMetadataTableValidator.java:405)
        at 
org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1040)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
        at scala.collection.Iterator$class.foreach(Iterator.scala:742)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
        at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
        at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
        at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
        at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
        at scala.collection.AbstractIterator.to(Iterator.scala:1194)
        at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
        at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
        at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
        at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
        at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
        at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
        at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
        at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
   4265 [Executor task launch worker for task 1] INFO  
org.apache.spark.executor.Executor  - Finished task 0.0 in stage 1.0 (TID 1). 
712 bytes result sent to driver
   4271 [task-result-getter-1] INFO  org.apache.spark.scheduler.TaskSetManager  
- Finished task 0.0 in stage 1.0 (TID 1) in 132 ms on localhost (executor 
driver) (1/1)
   4271 [task-result-getter-1] INFO  
org.apache.spark.scheduler.TaskSchedulerImpl  - Removed TaskSet 1.0, whose 
tasks have all completed, from pool 
   4272 [dag-scheduler-event-loop] INFO  
org.apache.spark.scheduler.DAGScheduler  - ResultStage 1 (collect at 
HoodieJavaRDD.java:140) finished in 0.216 s
   4272 [main] INFO  org.apache.spark.scheduler.DAGScheduler  - Job 1 finished: 
collect at HoodieJavaRDD.java:140, took 0.218763 s
   4272 [main] WARN  org.apache.hudi.utilities.HoodieMetadataTableValidator  - 
Metadata table validation failed.
   4274 [Thread-1] INFO  org.apache.spark.SparkContext  - Invoking stop() from 
shutdown hook
   4281 [Thread-1] INFO  org.spark_project.jetty.server.AbstractConnector  - 
Stopped Spark@43ed0ff3{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
   4282 [Thread-1] INFO  org.apache.spark.ui.SparkUI  - Stopped Spark web UI at 
http://localhost:4040
   4288 [dispatcher-event-loop-1] INFO  
org.apache.spark.MapOutputTrackerMasterEndpoint  - 
MapOutputTrackerMasterEndpoint stopped!
   4298 [Thread-1] INFO  org.apache.spark.storage.memory.MemoryStore  - 
MemoryStore cleared
   4298 [Thread-1] INFO  org.apache.spark.storage.BlockManager  - BlockManager 
stopped
   4299 [Thread-1] INFO  org.apache.spark.storage.BlockManagerMaster  - 
BlockManagerMaster stopped
   4300 [dispatcher-event-loop-6] INFO  
org.apache.spark.scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint
  - OutputCommitCoordinator stopped!
   4307 [Thread-1] INFO  org.apache.spark.SparkContext  - Successfully stopped 
SparkContext
   4307 [Thread-1] INFO  org.apache.spark.util.ShutdownHookManager  - Shutdown 
hook called
   4308 [Thread-1] INFO  org.apache.spark.util.ShutdownHookManager  - Deleting 
directory 
/private/var/folders/61/77xdhf3x0x9g3t_vdd1c9_nwr4wznp/T/spark-97af2b76-9039-45e2-98b4-52400a3a1ecc
   ```
   
   After this patch, the validation succeeds:
   ```
   12543 [Executor task launch worker for task 1] INFO  
org.apache.hadoop.io.compress.CodecPool  - Got brand-new decompressor [.gz]
   12548 [Executor task launch worker for task 1] INFO  
org.apache.hudi.metadata.BaseTableMetadata  - Listed file in partition from 
metadata: partition=20210623, #files=6
   12549 [Executor task launch worker for task 1] INFO  
org.apache.hudi.common.table.view.AbstractTableFileSystemView  - 
addFilesToView: NumFiles=6, NumFileGroups=2, FileGroupsCreationTime=1, 
StoreTimeTaken=0
   18533 [Executor task launch worker for task 1] INFO  
org.apache.hudi.common.table.view.AbstractTableFileSystemView  - Building file 
system view for partition (20210623)
   18536 [Executor task launch worker for task 1] INFO  
org.apache.hudi.common.table.view.AbstractTableFileSystemView  - 
addFilesToView: NumFiles=6, NumFileGroups=2, FileGroupsCreationTime=1, 
StoreTimeTaken=0
   18536 [Executor task launch worker for task 1] INFO  
org.apache.hudi.utilities.HoodieMetadataTableValidator  - Validation of file 
slices succeeded for partition 20210623
   18536 [Executor task launch worker for task 1] INFO  
org.apache.hudi.utilities.HoodieMetadataTableValidator  - Validation of 
getLatestFileSlices succeeded for partition 20210623
   18542 [Executor task launch worker for task 1] INFO  
org.apache.hudi.utilities.HoodieMetadataTableValidator  - Validation of 
getLatestBaseFiles succeeded for partition 20210623
   21672 [Executor task launch worker for task 1] INFO  
org.apache.hudi.utilities.HoodieMetadataTableValidator  - Validation of file 
slices succeeded for partition 20210623
   21672 [Executor task launch worker for task 1] INFO  
org.apache.hudi.utilities.HoodieMetadataTableValidator  - Validation of all 
file groups succeeded for partition 20210623
   21672 [Executor task launch worker for task 1] INFO  
org.apache.hudi.utilities.HoodieMetadataTableValidator  - Metadata table 
validation succeeded for 20210623
   21684 [Executor task launch worker for task 1] INFO  
org.apache.spark.executor.Executor  - Finished task 0.0 in stage 1.0 (TID 1). 
669 bytes result sent to driver
   21696 [task-result-getter-1] INFO  org.apache.spark.scheduler.TaskSetManager 
 - Finished task 0.0 in stage 1.0 (TID 1) in 16418 ms on localhost (executor 
driver) (1/1)
   21696 [task-result-getter-1] INFO  
org.apache.spark.scheduler.TaskSchedulerImpl  - Removed TaskSet 1.0, whose 
tasks have all completed, from pool 
   21697 [dag-scheduler-event-loop] INFO  
org.apache.spark.scheduler.DAGScheduler  - ResultStage 1 (collect at 
HoodieJavaRDD.java:140) finished in 16.501 s
   21697 [main] INFO  org.apache.spark.scheduler.DAGScheduler  - Job 1 
finished: collect at HoodieJavaRDD.java:140, took 16.503945 s
   21698 [main] INFO  org.apache.hudi.utilities.HoodieMetadataTableValidator  - 
Metadata table validation succeeded.
   ```
   
   
   
   ## Verify this pull request
   
   This change was verified manually by running HoodieMetadataTableValidator locally against a table with pending cleaning actions; see the before/after logs above.
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   

