[GitHub] [hudi] dm-tran edited a comment on issue #2020: [SUPPORT] Compaction fails with "java.io.FileNotFoundException"

2020-09-06 Thread GitBox


dm-tran edited a comment on issue #2020:
URL: https://github.com/apache/hudi/issues/2020#issuecomment-687993540


   @bvaradar FYI, I have reproduced this error using Hudi 0.6.0, after running a structured streaming job for several days. Please find the logs attached. (I haven't identified any concurrent process that runs compactions.)
   
   
[withHudi060_stderr_01.log](https://github.com/apache/hudi/files/5180850/withHudi060_stderr_01.log)
   
[withHudi060_stderr_02.log](https://github.com/apache/hudi/files/5180851/withHudi060_stderr_02.log)
   
   Is there a workaround for this error? Would it be possible to roll back some commits and resume ingestion/compaction?
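   For reference, a rollback can at least be expressed through the Hudi write client (the Hudi CLI also exposes a `commit rollback` command). The sketch below is an untested illustration against the 0.6.0 client API, not a confirmed remedy; the base path, table name and instant are placeholders taken from the paths in this thread.
   
   ```scala
   import org.apache.hudi.client.HoodieWriteClient
   import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
   import org.apache.hudi.config.HoodieWriteConfig
   import org.apache.spark.api.java.JavaSparkContext
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder().appName("hudi-rollback").getOrCreate()
   val jsc = new JavaSparkContext(spark.sparkContext)

   // Point a write client at the existing table (base path is a placeholder).
   val config = HoodieWriteConfig.newBuilder()
     .withPath("s3://my-bucket/my-table")
     .forTable("my_table")                    // hypothetical table name
     .build()
   val client = new HoodieWriteClient[OverwriteWithLatestAvroPayload](jsc, config)

   // Roll back the instant whose base file can no longer be found, then close the client.
   client.rollback("20200827155539")
   client.close()
   ```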
   




[GitHub] [hudi] dm-tran edited a comment on issue #2020: [SUPPORT] Compaction fails with "java.io.FileNotFoundException"

2020-08-27 Thread GitBox


dm-tran edited a comment on issue #2020:
URL: https://github.com/apache/hudi/issues/2020#issuecomment-682314989


   The file that isn't found is `'s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4957-299294_20200827155539.parquet'`.
   
   The files available in S3 whose names start with `9dee1248-c972-4ed3-80f5-15545ac4c534-0_2` are:
   ```
   2020-08-27 10:26 33525767 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-3850-231917_20200827102526.parquet
   2020-08-27 10:33 33526574 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-3891-234401_20200827103318.parquet
   2020-08-27 16:17 33545224 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-39-2458_20200827155539.parquet
   2020-08-27 11:13 33530132 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4096-246791_20200827111254.parquet
   2020-08-27 11:22 33530880 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4137-249295_20200827112139.parquet
   2020-08-27 12:00 3353 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4301-259277_20200827115949.parquet
   2020-08-27 12:20 33534377 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4383-264271_20200827121947.parquet
   2020-08-27 12:42 33535631 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4465-269277_20200827124204.parquet
   2020-08-27 12:54 33536084 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4506-271786_20200827125338.parquet
   2020-08-27 13:07 33536635 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4547-274289_20200827130640.parquet
   2020-08-27 13:20 33537444 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4588-276783_20200827131919.parquet
   2020-08-27 13:32 33538151 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4629-279284_20200827133143.parquet
   2020-08-27 13:46 33539531 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4670-281782_20200827134536.parquet
   2020-08-27 14:14 33541130 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4752-286756_20200827141258.parquet
   2020-08-27 14:30 33541913 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4793-289269_20200827142922.parquet
   2020-08-27 14:49 33542820 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4834-291776_20200827144807.parquet
   2020-08-27 15:08 33543459 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4875-294286_20200827150653.parquet
   2020-08-27 15:30 33544369 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4916-296786_20200827152840.parquet
   ```
   
   Contents of `s3://my-bucket/my-table/.hoodie/20200827155539.commit`:
   
   ```
   "9dee1248-c972-4ed3-80f5-15545ac4c534-0" : "daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-39-2458_20200827155539.parquet",
   ```
   
   Contents of `s3://my-bucket/my-table/.hoodie/20200827155539.compaction.requested`:
   
   ```
   [20200827152840, [.9dee1248-c972-4ed3-80f5-15545ac4c534-0_20200827152840.log.1_32-4949-299212], 9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4916-296786_20200827152840.parquet, 9dee1248-c972-4ed3-80f5-15545ac4c534-0, daas_date=2020, [TOTAL_LOG_FILES -> 1.0, TOTAL_IO_READ_MB -> 32.0, TOTAL_LOG_FILES_SIZE -> 121966.0, TOTAL_IO_WRITE_MB -> 31.0, TOTAL_IO_MB -> 63.0, TOTAL_LOG_FILE_SIZE -> 121966.0]],
   ```
   




[GitHub] [hudi] dm-tran edited a comment on issue #2020: [SUPPORT] Compaction fails with "java.io.FileNotFoundException"

2020-08-27 Thread GitBox


dm-tran edited a comment on issue #2020:
URL: https://github.com/apache/hudi/issues/2020#issuecomment-682311268


   @bvaradar The exception was raised after running the structured streaming job for a while.
   
   Please find attached the driver logs with INFO level logging.
   
   - [stderr_01.log](https://github.com/apache/hudi/files/5139921/stderr_01.log): the structured streaming job fails with `org.apache.hudi.exception.HoodieIOException: Consistency check failed to ensure all files APPEAR`
   - [stderr_02.log](https://github.com/apache/hudi/files/5139922/stderr_02.log): the structured streaming job is retried by YARN and compaction fails with a `java.io.FileNotFoundException`
   




[GitHub] [hudi] dm-tran edited a comment on issue #2020: [SUPPORT] Compaction fails with "java.io.FileNotFoundException"

2020-08-25 Thread GitBox


dm-tran edited a comment on issue #2020:
URL: https://github.com/apache/hudi/issues/2020#issuecomment-679875396


   > can you re-bootstrap and then start ingesting the data but this time enable consistency guard right from the beginning.
   
   @bvaradar Actually, this is what I did: I deleted the Hudi table in S3, added the consistency check property, and started ingesting the data from the beginning.
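   For reference, "adding the consistency check property" on the writer looks roughly like the sketch below (Spark datasource write, Hudi 0.6.0). The table name, record key and precombine columns are hypothetical; only `hoodie.consistency.check.enabled` and the merge-on-read/partitioning details come from this thread.
   
   ```scala
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.config.HoodieWriteConfig
   import org.apache.spark.sql.{DataFrame, SaveMode}

   // Upsert a (micro-)batch into the MOR table with the consistency guard enabled.
   def writeToHudi(batch: DataFrame, basePath: String): Unit = {
     batch.write
       .format("org.apache.hudi")
       .option(HoodieWriteConfig.TABLE_NAME, "my_table")       // hypothetical table name
       .option(TABLE_TYPE_OPT_KEY, MOR_TABLE_TYPE_OPT_VAL)     // merge-on-read, as in this issue
       .option(RECORDKEY_FIELD_OPT_KEY, "uuid")                // hypothetical record key column
       .option(PRECOMBINE_FIELD_OPT_KEY, "ts")                 // hypothetical precombine column
       .option(PARTITIONPATH_FIELD_OPT_KEY, "daas_date")       // partition column seen in the S3 paths
       .option("hoodie.consistency.check.enabled", "true")     // the consistency check property
       .mode(SaveMode.Append)
       .save(basePath)
   }
   ```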




[GitHub] [hudi] dm-tran edited a comment on issue #2020: [SUPPORT] Compaction fails with "java.io.FileNotFoundException"

2020-08-25 Thread GitBox


dm-tran edited a comment on issue #2020:
URL: https://github.com/apache/hudi/issues/2020#issuecomment-679866696


   @bvaradar I ran the structured streaming job with `hoodie.consistency.check.enabled = true`, starting from the earliest offsets in Kafka, and got the same error: a `java.io.FileNotFoundException` when the compaction was retried.
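   For context, "starting from the earliest offsets in Kafka" corresponds to a source configured roughly as below (broker and topic names are placeholders); the write side reuses the Hudi options shown in the earlier sketch, including `hoodie.consistency.check.enabled`.
   
   ```scala
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder().appName("hudi-ingestion").getOrCreate()

   // Replay the topic from the beginning; startingOffsets only takes effect
   // when the query has no existing checkpoint.
   val kafkaSource = spark.readStream
     .format("kafka")
     .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder brokers
     .option("subscribe", "my-topic")                      // placeholder topic
     .option("startingOffsets", "earliest")
     .load()
     .selectExpr("CAST(value AS STRING) AS json")          // downstream parsing omitted
   ```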
   
   **Summary**
   
   The structured streaming job ran for 3 hours:
   - at some point, some executors were lost because of an OutOfMemory error.
   - then the Spark driver failed because the consistency check failed.
   
   The Spark application was then retried by YARN, and the second attempt failed with `Caused by: java.io.FileNotFoundException: No such file or directory` when the compaction was retried.
   
   **Stacktraces**
   
   Stacktrace of the first attempt:
   ```
   20/08/25 06:51:39 WARN HiveConf: HiveConf of name hive.server2.thrift.url 
does not exist
   20/08/25 06:51:40 WARN ProcessingTimeExecutor: Current batch is falling 
behind. The trigger interval is 30 milliseconds, but spent 800229 
milliseconds
   20/08/25 06:56:24 WARN BlockManagerMasterEndpoint: No more replicas 
available for rdd_1775_40 !
   20/08/25 06:56:24 WARN BlockManagerMasterEndpoint: No more replicas 
available for rdd_1785_53 !
   [...]
   20/08/25 06:56:24 WARN BlockManagerMasterEndpoint: No more replicas 
available for rdd_1785_35 !
   20/08/25 06:56:24 WARN BlockManagerMasterEndpoint: No more replicas 
available for rdd_1785_50 !
   20/08/25 06:56:24 WARN YarnAllocator: Container from a bad node: 
container_1594796531644_1833_01_02 on host: 
ip-xxx-xxx-xxx-xxx.ap-northeast-1.compute.internal. Exit status: 143. 
Diagnostics: [2020-08-25 06:56:24.636]Container killed on request. Exit code is 
143
   [2020-08-25 06:56:24.636]Container exited with a non-zero exit code 143. 
   [2020-08-25 06:56:24.637]Killed by external signal
   .
   20/08/25 06:56:24 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: 
Requesting driver to remove executor 1 for reason Container from a bad node: 
container_1594796531644_1833_01_02 on host: 
ip-xxx-xxx-xxx-xxx.ap-northeast-1.compute.internal. Exit status: 143. 
Diagnostics: [2020-08-25 06:56:24.636]Container killed on request. Exit code is 
143
   [2020-08-25 06:56:24.636]Container exited with a non-zero exit code 143. 
   [2020-08-25 06:56:24.637]Killed by external signal
   .
   20/08/25 06:56:24 ERROR YarnClusterScheduler: Lost executor 1 on 
ip-xxx-xxx-xxx-xxx.ap-northeast-1.compute.internal: Container from a bad node: 
container_1594796531644_1833_01_02 on host: 
ip-xxx-xxx-xxx-xxx.ap-northeast-1.compute.internal. Exit status: 143. 
Diagnostics: [2020-08-25 06:56:24.636]Container killed on request. Exit code is 
143
   [2020-08-25 06:56:24.636]Container exited with a non-zero exit code 143. 
   [2020-08-25 06:56:24.637]Killed by external signal
   .
   20/08/25 06:56:24 WARN TaskSetManager: Lost task 1.0 in stage 816.0 (TID 
50626, ip-xxx-xxx-xxx-xxx.ap-northeast-1.compute.internal, executor 1): 
ExecutorLostFailure (executor 1 exited caused by one of the running tasks) 
Reason: Container from a bad node: container_1594796531644_1833_01_02 on 
host: ip-xxx-xxx-xxx-xxx.ap-northeast-1.compute.internal. Exit status: 143. 
Diagnostics: [2020-08-25 06:56:24.636]Container killed on request. Exit code is 
143
   [2020-08-25 06:56:24.636]Container exited with a non-zero exit code 143. 
   [2020-08-25 06:56:24.637]Killed by external signal
   .
   20/08/25 06:56:24 WARN TaskSetManager: Lost task 0.0 in stage 816.0 (TID 
50625, ip-xxx-xxx-xxx-xxx.ap-northeast-1.compute.internal, executor 1): 
ExecutorLostFailure (executor 1 exited caused by one of the running tasks) 
Reason: Container from a bad node: container_1594796531644_1833_01_02 on 
host: ip-xxx-xxx-xxx-xxx.ap-northeast-1.compute.internal. Exit status: 143. 
Diagnostics: [2020-08-25 06:56:24.636]Container killed on request. Exit code is 
143
   [2020-08-25 06:56:24.636]Container exited with a non-zero exit code 143. 
   [2020-08-25 06:56:24.637]Killed by external signal
   .
   20/08/25 06:56:24 WARN ExecutorAllocationManager: Attempted to mark unknown 
executor 1 idle
   20/08/25 07:07:51 ERROR MicroBatchExecution: Query [id = 6ea738ee-0886-4014-a2b2-f51efd693c45, runId = 97c16ef4-d610-4d44-a0e9-a9d24ed5e0cf] terminated with error
   org.apache.hudi.exception.HoodieCommitException: Failed to complete commit 20200825065331 due to finalize errors.
   at org.apache.hudi.client.AbstractHoodieWriteClient.finalizeWrite(AbstractHoodieWriteClient.java:204)
   at org.apache.hudi.client.HoodieWriteClient.doCompactionCommit(HoodieWriteClient.java:1142)
   at org.apache.hudi.client.HoodieWriteClient.commitCompaction(HoodieWriteClient.java:1102)
   at org.apache.hudi.client.HoodieWriteClient.runCompaction(HoodieWriteClient.java:1085)
   at org.apache.hudi.client.HoodieWriteClient.compact(HoodieWriteClient.java:1056)
   at