sam-wmt opened a new issue #2042: URL: https://github.com/apache/hudi/issues/2042
I have loaded a Merge On Read table via the Bulk Insert API. It is an entity-based table of 315M entities (2TB Parquet), partitioned evenly into 1000 distinct partitions (0 -> 999). The bulk insert works as expected and I am able to query the table via Presto/Hive/Spark. When I start live ingestion into this table, I can ingest about 8 batches before I receive the corrupt log file exception shown below.

**Exception killing jobs**

```
20/08/26 01:11:26 ERROR SplunkStreamListener: |exception=org.apache.spark.SparkException
org.apache.spark.SparkException: Job aborted due to stage failure: Task 496 in stage 45.0 failed 4 times, most recent failure: Lost task 496.3 in stage 45.0 (TID 46608, 192.168.150.31, executor 10): org.apache.hudi.exception.HoodieIOException: IOException when reading log file
```

I have reproduced this twice with fresh table loads against the live stream from Kafka. Once the job dies, every subsequent run continues to hit this exception. Googling turned up a suggestion to enable `hoodie.consistency.check.enabled`, which did not resolve the issue.
Image from Hudi CLI for the table:
<img width="1339" alt="Screen Shot 2020-08-26 at 2 55 55 AM" src="https://user-images.githubusercontent.com/67726885/91271504-4769d080-e748-11ea-950d-d135076dc160.png">

**Hudi Configuration:**

```
"hoodie.compaction.strategy" = "org.apache.hudi.table.action.compact.strategy.UnBoundedCompactionStrategy"
"hoodie.consistency.check.enabled" = "true"
DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL
HoodieCompactionConfig.INLINE_COMPACT_PROP = true
DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY = true
HoodieWriteConfig.UPSERT_PARALLELISM = 512
HoodieWriteConfig.INSERT_PARALLELISM = 512
HoodieCompactionConfig.CLEANER_COMMITS_RETAINED_PROP = 1
HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP = 8
HoodieStorageConfig.PARQUET_FILE_MAX_BYTES = 256 * 1024 * 1024
```

**Versions**
- Hudi: Release 0.6.0
- Spark: 2.4.5 on Databricks
- Storage: Azure ADLS_V2

**Configuration**
Spark cluster for streaming ingestion: 32-core [128GB RAM] x 16 executors

**Most importantly:**
- How would I recover from this failure mode?
- What could be causing this failure mode?

**Other thoughts around performance tuning:**
The table is about 2TB and I am attempting to ingest and compact about 100GB of data per hour into it. Any other hints/tips for managing a table of this size with N-minute batches (currently 15) and inline compaction?
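For context, the options above are applied roughly as follows in each micro-batch write (a sketch using the string forms of the Hudi 0.6 option keys; the table name, base path, and the `upsertBatch` helper are placeholders, not the exact production code):

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch: one upsert micro-batch into the MOR table with the settings above.
def upsertBatch(batch: DataFrame, basePath: String): Unit = {
  batch.write
    .format("hudi")
    .option("hoodie.table.name", "entity_table") // placeholder name
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.compaction.strategy",
      "org.apache.hudi.table.action.compact.strategy.UnBoundedCompactionStrategy")
    .option("hoodie.consistency.check.enabled", "true")
    .option("hoodie.compact.inline", "true")                 // INLINE_COMPACT_PROP
    .option("hoodie.compact.inline.max.delta.commits", "8")  // NUM_DELTA_COMMITS
    .option("hoodie.cleaner.commits.retained", "1")
    .option("hoodie.upsert.shuffle.parallelism", "512")
    .option("hoodie.insert.shuffle.parallelism", "512")
    .option("hoodie.parquet.max.file.size", (256L * 1024 * 1024).toString)
    .option("hoodie.datasource.hive_sync.enable", "true")
    .mode(SaveMode.Append)
    .save(basePath)
}
```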