Re: [I] [SUPPORT] Hudi 0.13.1 on EMR, MOR table writer hangs intermittently with S3 read timeout error for column stats index [hudi]

2024-03-04 Thread via GitHub


CTTY commented on issue #10415:
URL: https://github.com/apache/hudi/issues/10415#issuecomment-1977660648

   This looks similar to https://github.com/apache/hudi/issues/7487, where a user ran into S3 throttling caused by too many S3 calls.

   Could you check whether your S3 bucket is returning a large number of 503 error responses?
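   One way to check this (a minimal sketch, not from the thread) is to query the bucket's CloudWatch request metrics. This assumes S3 request metrics are enabled on the bucket; the bucket name and the "EntireBucket" filter id below are placeholders.

```python
# Minimal sketch: count S3 5xx (throttling) responses over the last day
# via CloudWatch. Assumes S3 request metrics are enabled on the bucket;
# the bucket name and filter id are placeholders.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/S3",
    MetricName="5xxErrors",
    Dimensions=[
        {"Name": "BucketName", "Value": "my-hudi-bucket"},  # placeholder
        {"Name": "FilterId", "Value": "EntireBucket"},      # placeholder
    ],
    StartTime=now - timedelta(days=1),
    EndTime=now,
    Period=3600,            # one datapoint per hour
    Statistics=["Sum"],
)

# Print hourly 5xx counts in chronological order.
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], int(point["Sum"]))
```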
   





Re: [I] [SUPPORT] Hudi 0.13.1 on EMR, MOR table writer hangs intermittently with S3 read timeout error for column stats index [hudi]

2024-01-31 Thread via GitHub


ad1happy2go commented on issue #10415:
URL: https://github.com/apache/hudi/issues/10415#issuecomment-1919021038

   Thanks for trying, @ergophobiac. @CTTY, any insights here?





Re: [I] [SUPPORT] Hudi 0.13.1 on EMR, MOR table writer hangs intermittently with S3 read timeout error for column stats index [hudi]

2024-01-10 Thread via GitHub


ergophobiac commented on issue #10415:
URL: https://github.com/apache/hudi/issues/10415#issuecomment-1886257556

   Hello @ad1happy2go,
   We ran a test with the same configurations plus one addition: spark.hadoop.fs.s3a.connection.maximum=2000. (We found a resource saying the default on EMR is 50.)

   We ran into the same error; the application failed 3 days after starting.
   We have disabled multi-modal indexing for now.
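   For reference, a minimal sketch of how such an S3A override can be applied to a streaming session (the app name is a placeholder; this is illustrative, not the reporter's exact job):

```python
# Minimal sketch (PySpark): raising the S3A connection pool limit for a
# Structured Streaming writer. The value 2000 mirrors the test described
# above; the app name is a placeholder.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-mor-streaming-writer")  # placeholder
    # S3A keeps a pool of HTTP connections; the EMR default is reported
    # above as 50, which a metadata-heavy Hudi writer can exhaust.
    .config("spark.hadoop.fs.s3a.connection.maximum", "2000")
    .getOrCreate()
)
```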





Re: [I] [SUPPORT] Hudi 0.13.1 on EMR, MOR table writer hangs intermittently with S3 read timeout error for column stats index [hudi]

2024-01-02 Thread via GitHub


ergophobiac commented on issue #10415:
URL: https://github.com/apache/hudi/issues/10415#issuecomment-1874167209

   Hey @ad1happy2go, we have a test case running; we'll observe until we're sure it's stable and let you know how it turns out.





Re: [I] [SUPPORT] Hudi 0.13.1 on EMR, MOR table writer hangs intermittently with S3 read timeout error for column stats index [hudi]

2024-01-02 Thread via GitHub


ad1happy2go commented on issue #10415:
URL: https://github.com/apache/hudi/issues/10415#issuecomment-1874092759

   @ergophobiac Did you get a chance to try this out?





Re: [I] [SUPPORT] Hudi 0.13.1 on EMR, MOR table writer hangs intermittently with S3 read timeout error for column stats index [hudi]

2023-12-27 Thread via GitHub


ad1happy2go commented on issue #10415:
URL: https://github.com/apache/hudi/issues/10415#issuecomment-1870277355

   @ergophobiac Are you setting fs.s3a.connection.maximum to a higher value? If not, can you try increasing it and running again?





[I] [SUPPORT] Hudi 0.13.1 on EMR, MOR table writer hangs intermittently with S3 read timeout error for column stats index [hudi]

2023-12-26 Thread via GitHub


ergophobiac opened a new issue, #10415:
URL: https://github.com/apache/hudi/issues/10415

   **Describe the problem you faced**
   
   Stack: Hudi 0.13.1, EMR 6.13.0, Spark 3.4.1
   
   We are writing to an MOR table in S3 using a Spark Structured Streaming job on EMR. Once this job has run for a while (12+ hours), we notice that, at random times, there are long periods with no operations/commits on the timeline. Usually, the offending executor is removed and the task retried, but eventually the job stalls and the application exits with fatal errors on the driver.
   
   We are using deployment model B: single writer with async table services + OCC for the metadata table. We also tried a DynamoDB-based lock provider (in case OCC was the culprit), and the same thing happened.
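   For context, a minimal sketch of the OCC settings for this deployment model, including the DynamoDB lock provider we also tried (the class path is per the Hudi 0.13 docs as we understand them; table, key, and region values are placeholders):

```python
# Minimal sketch: OCC writer options for deployment model B, with the
# DynamoDB-based lock provider alternative. All DynamoDB values are
# placeholders; the provider class path is an assumption from the docs.
occ_options = {
    "hoodie.write.concurrency.mode": "OPTIMISTIC_CONCURRENCY_CONTROL",
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    "hoodie.write.lock.provider":
        "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
    "hoodie.write.lock.dynamodb.table": "hudi-locks",       # placeholder
    "hoodie.write.lock.dynamodb.partition_key": "table_0",  # placeholder
    "hoodie.write.lock.dynamodb.region": "us-east-1",       # placeholder
}
```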
   
   Sometimes, a deltacommit, clean, or compaction is inflight but never completes, and our logs report the same error over and over:
   
   (on Spark driver stderr)
   
![image](https://github.com/apache/hudi/assets/122294902/886ac590-c989-4006-9523-233e9ecb5866)
   
   The timeout always occurs while trying to read column stats.
   
   Meanwhile, the timeline shows no progress:
   
![image](https://github.com/apache/hudi/assets/122294902/483b8af8-5cf5-4b11-be79-7969ea4387eb)
   
   Notice that the deltacommit inflight at 10:20:01 AM is stuck (all prior deltacommits completed within 45 seconds). The micro-batch interval is 5 minutes.
   
   We can see no active tasks on the Spark Web UI:
   
![image](https://github.com/apache/hudi/assets/122294902/5159aeda-189d-4032-9719-8bb7826e9e8f)
   
   
   In some instances, the tasks eventually succeed and the job progresses, until it gets stuck again with the same errors.
   
   Here are all our Hudi configs:
   
   hoodie.table.version -> 5
   hoodie.datasource.write.hive_style_partitioning -> True
   hoodie.datasource.hive_sync.enable -> True
   hoodie.datasource.hive_sync.auto_create_database -> True
   hoodie.datasource.hive_sync.skip_ro_suffix -> True
   hoodie.parquet.small.file.limit -> 104857600
   hoodie.parquet.max.file.size -> 125829120
   hoodie.compact.inline.trigger.strategy -> NUM_OR_TIME
   hoodie.compact.inline.max.delta.commits -> 3
   hoodie.compact.inline.max.delta.seconds -> 600
   hoodie.parquet.compression.codec -> snappy
   hoodie.clean.automatic -> True
   hoodie.index.type -> BLOOM
   hoodie.bloom.index.use.metadata -> True
   hoodie.metadata.enable -> True
   hoodie.metadata.index.bloom.filter.enable -> True
   hoodie.metadata.index.column.stats.enable -> True
   hoodie.keep.max.commits -> 50
   hoodie.archive.automatic -> True
   hoodie.archive.beyond.savepoint -> True
   hoodie.metrics.on -> True
   hoodie.metadata.metrics.enable -> True
   hoodie.metrics.executor.enable -> True
   hoodie.metrics.reporter.type -> GRAPHITE
   hoodie.metrics.graphite.host -> 
   hoodie.metrics.graphite.port -> 2003
   hoodie.metrics.graphite.report.period.seconds -> 30
   hoodie.metrics.graphite.metric.prefix -> test_prefix_demo_mor
   hoodie.cleaner.policy.failed.writes -> LAZY
   hoodie.write.concurrency.mode -> OPTIMISTIC_CONCURRENCY_CONTROL
   hoodie.write.lock.provider -> org.apache.hudi.client.transaction.lock.InProcessLockProvider
   hoodie.metrics.lock.enable -> True
   hoodie.clean.async -> True
   hoodie.archive.async -> True
   hoodie.metadata.index.async -> False
   hoodie.metadata.clean.async -> False
   hoodie.cleaner.policy -> KEEP_LATEST_BY_HOURS
   hoodie.cleaner.hours.retained -> 1
   hoodie.datasource.write.table.name -> table_0
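
   For completeness, a minimal sketch of the workaround mentioned earlier in the thread (disabling the multi-modal index partitions that are enabled in the config list above); the DataFrame `df`, the target path, and everything not taken from the list above are placeholders:

```python
# Minimal sketch: same Hudi writer with the multi-modal index components
# from the config list above turned off (the workaround mentioned earlier
# in the thread). `df` and the S3 path are placeholders; the usual record
# key / precombine / partition options are omitted for brevity.
(
    df.write.format("hudi")
    .option("hoodie.table.name", "table_0")
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    # Keep the metadata table itself, but disable the column stats and
    # bloom filter index partitions that were enabled in the original run.
    .option("hoodie.metadata.enable", "true")
    .option("hoodie.metadata.index.column.stats.enable", "false")
    .option("hoodie.metadata.index.bloom.filter.enable", "false")
    .option("hoodie.bloom.index.use.metadata", "false")
    .mode("append")
    .save("s3://my-bucket/path/to/table_0")  # placeholder path
)
```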