stackfun opened a new issue #2692:
URL: https://github.com/apache/hudi/issues/2692


   **Describe the problem you faced**
   
   During a delta commit on MOR tables, if Hudi writes Avro log files larger than 16 MB to Google Cloud Storage, it consistently fails to read them back due to a corrupted block error. This might be related to https://github.com/apache/hudi/pull/2500.
   
   **To Reproduce**
   
   ```
   def corrupt_block_defect(spark, database_name, table_name, destination):
       import secrets

       def gen_data(start, stop):
           return [
               {
                   "uuid": str(i),
                   "partitionpath": "partition",
                   "data": secrets.token_hex(1000),
                   "start": start,
                   "stop": stop,
                   "ts": str(i),
               }
               for i in range(start, stop)
           ]

       hudi_write_options = {
           "hoodie.table.name": table_name,
           "hoodie.datasource.write.operation": "upsert",
           "hoodie.datasource.write.table.name": table_name,
           "hoodie.datasource.write.table.type": "MERGE_ON_READ",
           "hoodie.compact.inline": True,
           "hoodie.compact.inline.max.delta.commits": 1,
       }

       df = spark.read.json(spark.sparkContext.parallelize(gen_data(0, 100000)))
       df.write.format("hudi").options(**hudi_write_options).mode("overwrite").save(destination)

       # generates log file < 16 MB
       df = spark.read.json(spark.sparkContext.parallelize(gen_data(0, 1000)))
       df.write.format("hudi").options(**hudi_write_options).mode("append").save(destination)

       # generates avro log file > 16 MB, get corrupted block
       df = spark.read.json(spark.sparkContext.parallelize(gen_data(0, 10000)))
       df.write.format("hudi").options(**hudi_write_options).mode("append").save(destination)
   ```
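   
   For reference, a minimal sketch of how the reproduction above can be invoked; the database name is a placeholder, and the bucket/table path mirrors the one shown in the hudi-cli output below:
   ```
   corrupt_block_defect(
       spark,
       database_name="test_db",  # placeholder; not used inside the function
       table_name="test_hudi_table",
       destination="gs://my_bucket/test_hudi_table",
   )
   ```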
   
   in hudi-cli:
   ```
   hudi:test_hudi_table->show logfile metadata --logFilePathPattern gs://my_bucket/test_hudi_table/partition/.2d471d69-5385-42ed-adf1-2078cf308ccb-0_20210317164945.log.1_0-93-39067
   21/03/17 16:56:54 INFO log.HoodieLogFileReader: Found corrupted block in file HoodieLogFile{pathStr='gs://my_bucket/test_hudi_table/partition/.2d471d69-5385-42ed-adf1-2078cf308ccb-0_20210317164945.log.1_0-93-39067', fileLen=0}. Header block size(7035295) did not match the footer block size(27475)
   ```
   
   In the compaction commit, we also see corrupted blocks.
   
   **Environment Description**
   
   * Hudi version : 0.7.0 (with https://github.com/apache/hudi/pull/2500 merged)
   
   * Spark version : 2.4.7
   
   * Hive version : 2.3.7
   
   * Hadoop version : 2.9.2
   
   * Storage (HDFS/S3/GCS..) : GCS
   
   * Running on Docker? (yes/no) : no
   
   * GCP Dataproc: 1.4
   
   
   **Workaround**
   
   Setting the following options avoids the corrupted blocks, but we see worse performance.
   ```
   "hoodie.logfile.data.block.max.size": 1024 * 1024 * 8,
   "hoodie.logfile.max.size": 1024 * 1024 * 8,
   ```
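   
   For example (a sketch reusing `hudi_write_options`, `df`, and `destination` from the reproduction snippet above), the workaround can be applied by merging these options into the write:
   ```
   # Cap both the log data block size and the log file size at 8 MB (workaround)
   workaround_options = {
       **hudi_write_options,
       "hoodie.logfile.data.block.max.size": 1024 * 1024 * 8,
       "hoodie.logfile.max.size": 1024 * 1024 * 8,
   }
   df.write.format("hudi").options(**workaround_options).mode("append").save(destination)
   ```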
   

