stackfun opened a new issue #2692: URL: https://github.com/apache/hudi/issues/2692
**Describe the problem you faced**

During a delta commit on MOR tables, if Hudi generates avro log files larger than 16 MB on Google Cloud Storage, Hudi always fails to read them back due to a corrupted block error.

Might be related to https://github.com/apache/hudi/pull/2500

**To Reproduce**

```
def corrupt_block_defect(spark, database_name, table_name, destination):
    import secrets

    def gen_data(start, stop):
        return [
            {
                "uuid": str(i),
                "partitionpath": "partition",
                "data": secrets.token_hex(1000),
                "start": start,
                "stop": stop,
                "ts": str(i),
            }
            for i in range(start, stop)
        ]

    hudi_write_options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.name": table_name,
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.compact.inline": True,
        "hoodie.compact.inline.max.delta.commits": 1,
    }

    df = spark.read.json(spark.sparkContext.parallelize(gen_data(0, 100000)))
    df.write.format("hudi").options(**hudi_write_options).mode("overwrite").save(destination)

    # generates log file < 16 MB, reads fine
    df = spark.read.json(spark.sparkContext.parallelize(gen_data(0, 1000)))
    df.write.format("hudi").options(**hudi_write_options).mode("append").save(destination)

    # generates avro log file > 16 MB, get corrupted block
    df = spark.read.json(spark.sparkContext.parallelize(gen_data(0, 10000)))
    df.write.format("hudi").options(**hudi_write_options).mode("append").save(destination)
```

In hudi-cli:

```
hudi:test_hudi_table->show logfile metadata --logFilePathPattern gs://my_bucket/test_hudi_table/partition/.2d471d69-5385-42ed-adf1-2078cf308ccb-0_20210317164945.log.1_0-93-39067
21/03/17 16:56:54 INFO log.HoodieLogFileReader: Found corrupted block in file HoodieLogFile{pathStr='gs://my_bucket/test_hudi_table/partition/.2d471d69-5385-42ed-adf1-2078cf308ccb-0_20210317164945.log.1_0-93-39067', fileLen=0}. Header block size(7035295) did not match the footer block size(27475)
```

We also see corrupted blocks in the compaction commit.

**Environment Description**

* Hudi version : 0.7.0 (with https://github.com/apache/hudi/pull/2500 merged)
* Spark version : 2.4.7
* Hive version : 2.3.7
* Hadoop version : 2.9.2
* Storage (HDFS/S3/GCS..) : GCS
* Running on Docker? (yes/no) : no
* GCP Dataproc : 1.4

**Workaround**

Setting the following options prevents corrupted blocks, but write performance is noticeably worse:

```
"hoodie.logfile.data.block.max.size": 1024 * 1024 * 8,
"hoodie.logfile.max.size": 1024 * 1024 * 8,
```
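For reference, a minimal sketch (assuming the `hudi_write_options`, `df`, and `destination` names from the reproduction script above) of how the workaround settings can be merged into the write options; only the two size limits come from the workaround, everything else is illustrative:

```
# Sketch only: reuses hudi_write_options / df / destination from the repro above.
workaround_options = {
    **hudi_write_options,
    # Cap both the log block size and the log file size at 8 MB so no single
    # avro log block crosses the ~16 MB point where reads fail on GCS.
    "hoodie.logfile.data.block.max.size": 1024 * 1024 * 8,
    "hoodie.logfile.max.size": 1024 * 1024 * 8,
}

df.write.format("hudi").options(**workaround_options).mode("append").save(destination)
```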