sam-wmt opened a new issue #2042:
URL: https://github.com/apache/hudi/issues/2042


   I have loaded a Merge On Read table via the Bulk Insert API.  This is an entity-based table with 315M entities (2TB of Parquet), partitioned evenly into 1000 distinct partitions (0 -> 999).  The bulk insert works as expected and I am able to query the table via Presto/Hive/Spark.  When I start live ingestion into this table, I am able to ingest about 8 batches before I receive the corrupt log file exception shown below.
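   
   For reference, the bulk load itself is just a standard Spark datasource write along these lines (a minimal sketch only; the table name, record key, precombine field, partition column, and ADLS paths are placeholders, not the actual job):
   
   ```scala
   import org.apache.spark.sql.{SaveMode, SparkSession}
   
   // Minimal sketch of the MOR bulk insert; all names and paths below are placeholders.
   val spark = SparkSession.builder.appName("entity-bulk-load").getOrCreate()
   val entityDf = spark.read.parquet("abfss://.../staging/entities") // placeholder source data
   
   entityDf.write
     .format("hudi")
     .option("hoodie.table.name", "entity_table")                         // hypothetical table name
     .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
     .option("hoodie.datasource.write.operation", "bulk_insert")
     .option("hoodie.datasource.write.recordkey.field", "entity_id")      // assumed record key
     .option("hoodie.datasource.write.partitionpath.field", "partition")  // assumed partition column (0 -> 999)
     .option("hoodie.datasource.write.precombine.field", "updated_at")    // assumed precombine field
     .mode(SaveMode.Overwrite)
     .save("abfss://container@account.dfs.core.windows.net/tables/entity_table") // placeholder ADLS Gen2 path
   ```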
   
   **Exception killing jobs**
   20/08/26 01:11:26 ERROR SplunkStreamListener: 
|exception=org.apache.spark.SparkException
   org.apache.spark.SparkException: Job aborted due to stage failure: Task 496 
in stage 45.0 failed 4 times, most recent failure: Lost task 496.3 in stage 
45.0 (TID 46608, 192.168.150.31, executor 10): 
org.apache.hudi.exception.HoodieIOException: IOException when reading log file 
   
   I have reproduced this twice with fresh table loads against the live stream from Kafka.  Once the job has died, any restart continues to hit this exception.
   
   While googling, I was able to find this: 
   
   I enabled hoodie.consistency.check.enabled, which did not resolve the issue.
   
   Image from Hudi CLI for the table
   <img width="1339" alt="Screen Shot 2020-08-26 at 2 55 55 AM" 
src="https://user-images.githubusercontent.com/67726885/91271504-4769d080-e748-11ea-950d-d135076dc160.png";>
   
   **Hudi Configuration:**
   "hoodie.compaction.strategy" = 
"org.apache.hudi.table.action.compact.strategy.UnBoundedCompactionStrategy",
   "hoodie.consistency.check.enabled" -> "true"
   DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL
   HoodieCompactionConfig.INLINE_COMPACT_PROP = true,
   DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY = true,
   HoodieWriteConfig.UPSERT_PARALLELISM = 512,
   HoodieWriteConfig.INSERT_PARALLELISM = 512,
   HoodieCompactionConfig.CLEANER_COMMITS_RETAINED_PROP = 1,
   HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP = 8,
   HoodieStorageConfig.PARQUET_FILE_MAX_BYTES = 256 * 1024 * 1024
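   
   To make the mix of constants and raw keys above concrete, this is roughly the option map those settings correspond to using the plain Hudi 0.6.0 config keys (a sketch only; the table name, record key, and precombine field are assumptions):
   
   ```scala
   // Sketch of the streaming upsert options as plain config keys (Hudi 0.6.0 names);
   // table name, record key, and precombine field are placeholders.
   val hudiOptions = Map(
     "hoodie.table.name"                           -> "entity_table",
     "hoodie.datasource.write.table.type"          -> "MERGE_ON_READ",
     "hoodie.datasource.write.operation"           -> "upsert",
     "hoodie.datasource.write.recordkey.field"     -> "entity_id",
     "hoodie.datasource.write.partitionpath.field" -> "partition",
     "hoodie.datasource.write.precombine.field"    -> "updated_at",
     "hoodie.datasource.hive_sync.enable"          -> "true",
     "hoodie.compaction.strategy"                  -> "org.apache.hudi.table.action.compact.strategy.UnBoundedCompactionStrategy",
     "hoodie.compact.inline"                       -> "true",
     "hoodie.compact.inline.max.delta.commits"     -> "8",
     "hoodie.cleaner.commits.retained"             -> "1",
     "hoodie.consistency.check.enabled"            -> "true",
     "hoodie.upsert.shuffle.parallelism"           -> "512",
     "hoodie.insert.shuffle.parallelism"           -> "512",
     "hoodie.parquet.max.file.size"                -> (256L * 1024 * 1024).toString
   )
   ```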
   
   **Versions**
   Hudi: Release 0.6.0
   Spark: 2.4.5 on Databricks
   Storage: Azure ADLS_V2
   
   **Configuration**
   Spark Cluster for Streaming ingestion: 32-Core [128GB Ram] x 16 Executors 
   
   **Most Importantly:**
   How would I recover from this failure mode?
   What could be causing this failure mode?
   
   **Other thoughts around performance tuning:**
   The table is about 2TB and I am attempting to ingest and compact about 100GB worth of data per hour into it.  Any other hints or tips for managing a table of this size with N-minute batches (currently 15 minutes) and inline compaction would be appreciated.
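   
   For context, the ingestion is essentially a 15-minute micro-batch upsert loop; a sketch is below (structured streaming from Kafka with foreachBatch; the brokers, topic, checkpoint path, and the hudiOptions map from the config sketch above are assumptions, not the actual pipeline):
   
   ```scala
   import org.apache.spark.sql.{DataFrame, SparkSession}
   import org.apache.spark.sql.streaming.Trigger
   
   // Sketch of the 15-minute micro-batch upsert loop; brokers, topic, paths, and the
   // hudiOptions map (from the config sketch above) are placeholders.
   val spark = SparkSession.builder.appName("entity-ingest").getOrCreate()
   
   val kafkaStream = spark.readStream
     .format("kafka")
     .option("kafka.bootstrap.servers", "broker:9092") // placeholder brokers
     .option("subscribe", "entity-updates")            // placeholder topic
     .load()
   
   // Typed function sidesteps the foreachBatch overload ambiguity on Scala 2.12.
   val upsertBatch: (DataFrame, Long) => Unit = (batchDf, batchId) => {
     // Deserialize/transform the Kafka payload into the entity schema here, then upsert.
     batchDf.write
       .format("hudi")
       .options(hudiOptions)
       .mode("append")
       .save("abfss://container@account.dfs.core.windows.net/tables/entity_table") // placeholder path
   }
   
   kafkaStream.writeStream
     .trigger(Trigger.ProcessingTime("15 minutes"))
     .foreachBatch(upsertBatch)
     .option("checkpointLocation", "/checkpoints/entity-ingest") // placeholder
     .start()
     .awaitTermination()
   ```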
   

