Nitish-sati opened a new issue, #9727: URL: https://github.com/apache/hudi/issues/9727
**Describe the problem you faced**

I'm running a Spark Streaming job on an EMR cluster. It reads data from an S3 source, performs upserts, and writes the result out in Hudi format, using a separate EMR cluster dedicated to HBase for indexing. Recently, one of the micro-batches failed with an error indicating that a Parquet file in the target Hudi table is corrupted. We reviewed the YARN container logs for clues about the cause of the corruption, but could not find any relevant information.

**To Reproduce**

Not able to reproduce the issue.

**Expected behavior**

The job should upsert the latest record for each primary key into the target location.

**Environment Description**

* Hudi version : 0.9
* Spark version : 2.4.8
* Hive version : 2.3.9
* Hadoop version : 2.10.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no

**Hudi configs**

> "hoodie.clean.async": "false",
> "hoodie.clean.automatic": "true",
> "hoodie.cleaner.commits.retained": "10",
> "hoodie.cleaner.parallelism": "700",
> "hoodie.consistency.check.enabled": "true",
> "hoodie.datasource.hive_sync.assume_date_partitioning": "false",
> "hoodie.datasource.hive_sync.database": "db_name",
> "hoodie.datasource.hive_sync.enable": "true",
> "hoodie.datasource.hive_sync.jdbcurl": "jdbc_url",
> "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
> "hoodie.datasource.hive_sync.partition_fields": "col_1",
> "hoodie.datasource.hive_sync.table": "hudi_table_name",
> "hoodie.datasource.write.hive_style_partitioning": "true",
> "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
> "hoodie.payload.ordering.field": "col_2",
> "hoodie.datasource.write.operation": "upsert",
> "hoodie.datasource.write.partitionpath.field": "col_1",
> "hoodie.datasource.write.precombine.field": "col_2",
> "hoodie.datasource.write.recordkey.field": "col_3",
> "hoodie.datasource.write.streaming.ignore.failed.batch": "false",
> "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
> "hoodie.avro.schema.validate": "true",
> "hoodie.hbase.index.update.partition.path": "true",
> "hoodie.index.hbase.get.batch.size": "1000",
> "hoodie.index.hbase.max.qps.per.region.server": "1000",
> "hoodie.index.hbase.put.batch.size": "1000",
> "hoodie.index.hbase.qps.allocator.class": "org.apache.hudi.index.hbase.DefaultHBaseQPSResourceAllocator",
> "hoodie.index.hbase.qps.fraction": "0.5f",
> "hoodie.index.hbase.max.qps.fraction": "10000",
> "hoodie.index.hbase.min.qps.fraction": "1000",
> "hoodie.index.hbase.put.batch.size.autocompute": "true",
> "hoodie.index.hbase.rollback.sync": "true",
> "hoodie.index.hbase.table": "hbase_table_name",
> "hoodie.index.hbase.zknode.path": "/hbase",
> "hoodie.index.hbase.zkport": "2181",
> "hoodie.index.hbase.zkquorum":
> "hbase_cluster_dns",
> "hoodie.index.type": "HBASE",
> "hoodie.memory.compaction.fraction": "0.8",
> "hoodie.parquet.block.size": "152043520",
> "hoodie.parquet.compression.codec": "snappy",
> "hoodie.parquet.max.file.size": "152043520",
> "hoodie.parquet.small.file.limit": "104857600",
> "hoodie.table.name": "hudi_table_name",
> "hoodie.upsert.shuffle.parallelism": "50"

**Stacktrace**

```
23/09/06 05:39:32 INFO S3NativeFileSystem: Opening 's3://**/**/**/**/**.parquet' for reading
23/09/06 05:39:32 ERROR BoundedInMemoryExecutor: error producing records
org.apache.hudi.exception.HoodieIOException: unable to read next record from parquet file
	at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:53)
	at org.apache.hudi.common.util.queue.IteratorBasedQueueProducer.produce(IteratorBasedQueueProducer.java:45)
	at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$0(BoundedInMemoryExecutor.java:92)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: can not read class org.apache.parquet.format.FileMetaData: Required field 'name' was not present! Struct: SchemaElement(name:null, field_id:6)
	at org.apache.parquet.format.Util.read(Util.java:216)
	at org.apache.parquet.format.Util.readFileMetaData(Util.java:73)
	at org.apache.parquet.format.converter.ParquetMetadataConverter$2.visit(ParquetMetadataConverter.java:873)
	at org.apache.parquet.format.converter.ParquetMetadataConverter$2.visit(ParquetMetadataConverter.java:870)
	at org.apache.parquet.format.converter.ParquetMetadataConverter$NoFilter.accept(ParquetMetadataConverter.java:747)
	at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:870)
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:540)
	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:715)
	at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:603)
	at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:152)
	at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
	at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:49)
	... 8 more
Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'name' was not present! Struct: SchemaElement(name:null, field_id:6)
	at org.apache.parquet.format.SchemaElement.validate(SchemaElement.java:1201)
	at org.apache.parquet.format.SchemaElement$SchemaElementStandardScheme.read(SchemaElement.java:1331)
	at org.apache.parquet.format.SchemaElement$SchemaElementStandardScheme.read(SchemaElement.java:1230)
	at org.apache.parquet.format.SchemaElement.read(SchemaElement.java:1105)
	at org.apache.parquet.format.FileMetaData$FileMetaDataStandardScheme.read(FileMetaData.java:1080)
	at org.apache.parquet.format.FileMetaData$FileMetaDataStandardScheme.read(FileMetaData.java:1051)
	at org.apache.parquet.format.FileMetaData.read(FileMetaData.java:945)
	at org.apache.parquet.format.Util.read(Util.java:213)
	... 19 more
23/09/06 05:39:33 ERROR BoundedInMemoryExecutor: error consuming records
org.apache.hudi.exception.HoodieException: operation has failed
	at org.apache.hudi.common.util.queue.BoundedInMemoryQueue.throwExceptionIfFailed(BoundedInMemoryQueue.java:247)
	at org.apache.hudi.common.util.queue.BoundedInMemoryQueue.readNextRecord(BoundedInMemoryQueue.java:226)
	at org.apache.hudi.common.util.queue.BoundedInMemoryQueue.access$100(BoundedInMemoryQueue.java:52)
	at org.apache.hudi.common.util.queue.BoundedInMemoryQueue$QueueIterator.hasNext(BoundedInMemoryQueue.java:277)
	at org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:36)
	at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.hudi.exception.HoodieIOException: unable to read next record from parquet file
	at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:53)
	at org.apache.hudi.common.util.queue.IteratorBasedQueueProducer.produce(IteratorBasedQueueProducer.java:45)
	at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$0(BoundedInMemoryExecutor.java:92)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	... 4 more
Caused by: java.io.IOException: can not read class org.apache.parquet.format.FileMetaData: Required field 'name' was not present! Struct: SchemaElement(name:null, field_id:6)
	at org.apache.parquet.format.Util.read(Util.java:216)
	at org.apache.parquet.format.Util.readFileMetaData(Util.java:73)
	at org.apache.parquet.format.converter.ParquetMetadataConverter$2.visit(ParquetMetadataConverter.java:873)
	at org.apache.parquet.format.converter.ParquetMetadataConverter$2.visit(ParquetMetadataConverter.java:870)
	at org.apache.parquet.format.converter.ParquetMetadataConverter$NoFilter.accept(ParquetMetadataConverter.java:747)
	at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:870)
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:540)
	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:715)
	at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:603)
	at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:152)
	at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
	at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:49)
	... 8 more
Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'name' was not present! Struct: SchemaElement(name:null, field_id:6)
	at org.apache.parquet.format.SchemaElement.validate(SchemaElement.java:1201)
	at org.apache.parquet.format.SchemaElement$SchemaElementStandardScheme.read(SchemaElement.java:1331)
	at org.apache.parquet.format.SchemaElement$SchemaElementStandardScheme.read(SchemaElement.java:1230)
	at org.apache.parquet.format.SchemaElement.read(SchemaElement.java:1105)
	at org.apache.parquet.format.FileMetaData$FileMetaDataStandardScheme.read(FileMetaData.java:1080)
	at org.apache.parquet.format.FileMetaData$FileMetaDataStandardScheme.read(FileMetaData.java:1051)
	at org.apache.parquet.format.FileMetaData.read(FileMetaData.java:945)
	at org.apache.parquet.format.Util.read(Util.java:213)
	... 19 more
```
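As background for triage (not part of the original report): a well-formed Parquet file starts and ends with the 4-byte magic `PAR1`, and the 8 bytes before the trailing magic are a 4-byte little-endian length of the Thrift-encoded footer (`FileMetaData`, the struct the stacktrace fails to deserialize). A truncated or partially overwritten file breaks this framing. A stdlib-only sketch for checking that framing on a downloaded copy of the suspect object (file names here are illustrative):

```python
import os
import struct

MAGIC = b"PAR1"

def check_parquet_footer(path):
    """Sanity-check a Parquet file's framing without parsing the Thrift metadata.

    Returns (ok, reason). Catches truncated files and footers whose recorded
    length does not fit in the file, which surface in readers as Thrift
    errors like "Required field 'name' was not present!".
    """
    size = os.path.getsize(path)
    if size < 12:  # leading magic + footer length + trailing magic at minimum
        return False, "file too small to be Parquet"
    with open(path, "rb") as f:
        if f.read(4) != MAGIC:
            return False, "missing leading PAR1 magic"
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
        if tail[4:] != MAGIC:
            return False, "missing trailing PAR1 magic"
        (footer_len,) = struct.unpack("<I", tail[:4])
        # Footer plus its length word and trailing magic must fit after the
        # leading magic.
        if footer_len + 8 > size - 4:
            return False, "footer length %d exceeds file size" % footer_len
    return True, "framing looks valid"

if __name__ == "__main__":
    # Demo on synthetic files: a minimally well-framed file and a truncated one.
    with open("good.parquet", "wb") as f:
        footer = b"\x00" * 16  # placeholder footer bytes (not real Thrift)
        f.write(MAGIC + footer + struct.pack("<I", len(footer)) + MAGIC)
    with open("bad.parquet", "wb") as f:
        f.write(MAGIC + b"\x00" * 10)  # truncated: no trailing magic
    print(check_parquet_footer("good.parquet"))   # framing valid
    print(check_parquet_footer("bad.parquet"))    # missing trailing magic
```

Note this only catches truncation and framing damage. A file whose framing is intact but whose footer bytes are garbled (as the `SchemaElement(name:null, field_id:6)` error suggests here) still needs a real footer parse to confirm, e.g. `parquet-tools meta <file>`.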