[GitHub] [hudi] dm-tran edited a comment on issue #2020: [SUPPORT] Compaction fails with "java.io.FileNotFoundException"
dm-tran edited a comment on issue #2020: URL: https://github.com/apache/hudi/issues/2020#issuecomment-687993540

@bvaradar FYI, I have reproduced this error with Hudi 0.6.0, after running a structured streaming job for several days. Please find the logs attached. (I haven't identified any concurrent process that runs compactions.)

[withHudi060_stderr_01.log](https://github.com/apache/hudi/files/5180850/withHudi060_stderr_01.log) [withHudi060_stderr_02.log](https://github.com/apache/hudi/files/5180851/withHudi060_stderr_02.log)

Is there a workaround for this error? Would it be possible to roll back some commits and resume ingestion/compaction?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
dm-tran edited a comment on issue #2020: URL: https://github.com/apache/hudi/issues/2020#issuecomment-682314989

The file that isn't found is `'s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4957-299294_20200827155539.parquet'`.

The files available in S3 that start with "9dee1248-c972-4ed3-80f5-15545ac4c534-0_2" are:

```
2020-08-27 10:26 33525767 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-3850-231917_20200827102526.parquet
2020-08-27 10:33 33526574 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-3891-234401_20200827103318.parquet
2020-08-27 16:17 33545224 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-39-2458_20200827155539.parquet
2020-08-27 11:13 33530132 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4096-246791_20200827111254.parquet
2020-08-27 11:22 33530880 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4137-249295_20200827112139.parquet
2020-08-27 12:00 3353     s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4301-259277_20200827115949.parquet
2020-08-27 12:20 33534377 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4383-264271_20200827121947.parquet
2020-08-27 12:42 33535631 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4465-269277_20200827124204.parquet
2020-08-27 12:54 33536084 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4506-271786_20200827125338.parquet
2020-08-27 13:07 33536635 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4547-274289_20200827130640.parquet
2020-08-27 13:20 33537444 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4588-276783_20200827131919.parquet
2020-08-27 13:32 33538151 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4629-279284_20200827133143.parquet
2020-08-27 13:46 33539531 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4670-281782_20200827134536.parquet
2020-08-27 14:14 33541130 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4752-286756_20200827141258.parquet
2020-08-27 14:30 33541913 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4793-289269_20200827142922.parquet
2020-08-27 14:49 33542820 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4834-291776_20200827144807.parquet
2020-08-27 15:08 33543459 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4875-294286_20200827150653.parquet
2020-08-27 15:30 33544369 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4916-296786_20200827152840.parquet
```

Contents of `s3://my-bucket/my-table/.hoodie/20200827155539.commit`:

```
"9dee1248-c972-4ed3-80f5-15545ac4c534-0" : "daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-39-2458_20200827155539.parquet",
```

Contents of `s3://my-bucket/my-table/.hoodie/20200827155539.compaction.requested`:

```
[20200827152840, [.9dee1248-c972-4ed3-80f5-15545ac4c534-0_20200827152840.log.1_32-4949-299212], 9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4916-296786_20200827152840.parquet, 9dee1248-c972-4ed3-80f5-15545ac4c534-0, daas_date=2020, [TOTAL_LOG_FILES -> 1.0, TOTAL_IO_READ_MB -> 32.0, TOTAL_LOG_FILES_SIZE -> 121966.0, TOTAL_IO_WRITE_MB -> 31.0, TOTAL_IO_MB -> 63.0, TOTAL_LOG_FILE_SIZE -> 121966.0]],
```
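For anyone comparing the listings above, Hudi base-file names encode `fileId_writeToken_instantTime.parquet`, which makes the mismatch easier to see: the missing file and the file the commit actually references share the same file id and instant (`20200827155539`) but carry different write tokens. A minimal parsing sketch (my own helper, not a Hudi API):

```python
# Sketch: split a Hudi base-file name into its components, assuming the
# standard fileId_writeToken_instantTime.parquet naming scheme.
def parse_base_file(name: str):
    """Return (file_id, write_token, instant_time) for a Hudi base file."""
    stem = name.rsplit(".", 1)[0]  # drop the ".parquet" extension
    file_id, write_token, instant_time = stem.split("_")
    return file_id, write_token, instant_time

missing = "9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4957-299294_20200827155539.parquet"
present = "9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-39-2458_20200827155539.parquet"

m = parse_base_file(missing)
p = parse_base_file(present)
# Same file id and same commit instant, different write tokens: the missing
# name looks like the output of a different task attempt for the same commit.
print(m[0] == p[0], m[2] == p[2], m[1], p[1])
```

This comparison is only an observation about the names in the listing, not a diagnosis of the root cause.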
dm-tran edited a comment on issue #2020: URL: https://github.com/apache/hudi/issues/2020#issuecomment-682311268

@bvaradar The exception was raised after running the structured streaming job for a while. Please find attached the driver logs with INFO-level logging.

- [stderr_01.log](https://github.com/apache/hudi/files/5139921/stderr_01.log): the structured streaming job fails with the error `org.apache.hudi.exception.HoodieIOException: Consistency check failed to ensure all files APPEAR`
- [stderr_02.log](https://github.com/apache/hudi/files/5139922/stderr_02.log): the structured streaming job is retried by YARN, and compaction fails with a `java.io.FileNotFoundException`
dm-tran edited a comment on issue #2020: URL: https://github.com/apache/hudi/issues/2020#issuecomment-679875396

> can you re-bootstrap and then start ingesting the data but this time enable consistency guard right from the begining.

@bvaradar Actually, this is what I did: I deleted the Hudi table in S3, added the consistency check property, and started ingesting the data from the beginning.
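For readers following along, the "consistency check property" referred to here is `hoodie.consistency.check.enabled` (shown later in this thread). A minimal sketch of how it might sit among the write options passed to a Spark DataFrame writer; the table name and base path below are illustrative placeholders, not values from this issue:

```python
# Hudi write options as they might be passed to a Spark writer. Only
# "hoodie.consistency.check.enabled" comes from this thread; the table
# name is a placeholder.
hudi_options = {
    "hoodie.table.name": "my_table",             # placeholder
    "hoodie.consistency.check.enabled": "true",  # guard against S3 eventual consistency
}

# Typical usage (requires a Spark session; shown as a comment only):
# df.write.format("hudi").options(**hudi_options).mode("append").save("s3://my-bucket/my-table")

print(hudi_options["hoodie.consistency.check.enabled"])
```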
dm-tran edited a comment on issue #2020: URL: https://github.com/apache/hudi/issues/2020#issuecomment-679866696

@bvaradar I ran the structured streaming job with `hoodie.consistency.check.enabled = true`, starting from the earliest offsets in Kafka, and got the same error: a `java.io.FileNotFoundException` when the compaction is retried.

**Summary**

The structured streaming job ran for 3 hours:
- at some point, some executors were lost because of an OutOfMemory error.
- then the Spark driver failed because the consistency check failed.

The Spark application was then retried by YARN, and the 2nd attempt failed with `Caused by: java.io.FileNotFoundException: No such file or directory` when the compaction was retried.

**Stacktraces**

Stacktrace of the first attempt:

```
20/08/25 06:51:39 WARN HiveConf: HiveConf of name hive.server2.thrift.url does not exist
20/08/25 06:51:40 WARN ProcessingTimeExecutor: Current batch is falling behind. The trigger interval is 30 milliseconds, but spent 800229 milliseconds
20/08/25 06:56:24 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_1775_40 !
20/08/25 06:56:24 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_1785_53 !
[...]
20/08/25 06:56:24 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_1785_35 !
20/08/25 06:56:24 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_1785_50 !
20/08/25 06:56:24 WARN YarnAllocator: Container from a bad node: container_1594796531644_1833_01_02 on host: ip-xxx-xxx-xxx-xxx.ap-northeast-1.compute.internal. Exit status: 143. Diagnostics: [2020-08-25 06:56:24.636]Container killed on request. Exit code is 143 [2020-08-25 06:56:24.636]Container exited with a non-zero exit code 143. [2020-08-25 06:56:24.637]Killed by external signal .
20/08/25 06:56:24 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 1 for reason Container from a bad node: container_1594796531644_1833_01_02 on host: ip-xxx-xxx-xxx-xxx.ap-northeast-1.compute.internal. Exit status: 143. Diagnostics: [2020-08-25 06:56:24.636]Container killed on request. Exit code is 143 [2020-08-25 06:56:24.636]Container exited with a non-zero exit code 143. [2020-08-25 06:56:24.637]Killed by external signal .
20/08/25 06:56:24 ERROR YarnClusterScheduler: Lost executor 1 on ip-xxx-xxx-xxx-xxx.ap-northeast-1.compute.internal: Container from a bad node: container_1594796531644_1833_01_02 on host: ip-xxx-xxx-xxx-xxx.ap-northeast-1.compute.internal. Exit status: 143. Diagnostics: [2020-08-25 06:56:24.636]Container killed on request. Exit code is 143 [2020-08-25 06:56:24.636]Container exited with a non-zero exit code 143. [2020-08-25 06:56:24.637]Killed by external signal .
20/08/25 06:56:24 WARN TaskSetManager: Lost task 1.0 in stage 816.0 (TID 50626, ip-xxx-xxx-xxx-xxx.ap-northeast-1.compute.internal, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container from a bad node: container_1594796531644_1833_01_02 on host: ip-xxx-xxx-xxx-xxx.ap-northeast-1.compute.internal. Exit status: 143. Diagnostics: [2020-08-25 06:56:24.636]Container killed on request. Exit code is 143 [2020-08-25 06:56:24.636]Container exited with a non-zero exit code 143. [2020-08-25 06:56:24.637]Killed by external signal .
20/08/25 06:56:24 WARN TaskSetManager: Lost task 0.0 in stage 816.0 (TID 50625, ip-xxx-xxx-xxx-xxx.ap-northeast-1.compute.internal, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container from a bad node: container_1594796531644_1833_01_02 on host: ip-xxx-xxx-xxx-xxx.ap-northeast-1.compute.internal. Exit status: 143. Diagnostics: [2020-08-25 06:56:24.636]Container killed on request. Exit code is 143 [2020-08-25 06:56:24.636]Container exited with a non-zero exit code 143. [2020-08-25 06:56:24.637]Killed by external signal .
20/08/25 06:56:24 WARN ExecutorAllocationManager: Attempted to mark unknown executor 1 idle
20/08/25 07:07:51 ERROR MicroBatchExecution: Query [id = 6ea738ee-0886-4014-a2b2-f51efd693c45, runId = 97c16ef4-d610-4d44-a0e9-a9d24ed5e0cf] terminated with error
org.apache.hudi.exception.HoodieCommitException: Failed to complete commit 20200825065331 due to finalize errors.
	at org.apache.hudi.client.AbstractHoodieWriteClient.finalizeWrite(AbstractHoodieWriteClient.java:204)
	at org.apache.hudi.client.HoodieWriteClient.doCompactionCommit(HoodieWriteClient.java:1142)
	at org.apache.hudi.client.HoodieWriteClient.commitCompaction(HoodieWriteClient.java:1102)
	at org.apache.hudi.client.HoodieWriteClient.runCompaction(HoodieWriteClient.java:1085)
	at org.apache.hudi.client.HoodieWriteClient.compact(HoodieWriteClient.java:1056)
	at
```