phillycoder opened a new issue, #5985: URL: https://github.com/apache/hudi/issues/5985
**Describe the problem you faced**

Getting a `java.lang.ClassCastException: optional binary xx (STRING)` exception when a record gets updated. The issue occurs specifically when a field is an array of structs with a single field; it does not occur when the structs in the array have more than one field.

**To Reproduce**

Steps to reproduce the behavior:

Launch spark-shell:

```
./spark-shell --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf "spark.sql.hive.convertMetastoreParquet=false" \
  --jars $HOME/.m2/repository/org/apache/hudi/hudi-spark3-bundle_2.12/0.10.0/hudi-spark3-bundle_2.12-0.10.0.jar,$HOME/.m2/repository/org/apache/spark/spark-avro_2.12/3.2.0/spark-avro_2.12-3.2.0.jar
```

```
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val tableName = "hudi_cow"
val basePath = "/tmp/hudi_cow"

val schema = StructType(Array(
  StructField("rowId", StringType, true),
  StructField("preComb", LongType, true),
  StructField("name", StringType, true),
  StructField("valObjs", ArrayType(StructType(Array(
    StructField("id", StringType)
  ))))
))

val data1 = Seq(
  Row("row_1", 0L, "test", Array()),
  Row("row_2", 0L, "test", Array()),
  Row("row_3", 0L, "test", Array()))

var dfFromData1 = spark.createDataFrame(data1, schema)
dfFromData1.printSchema
dfFromData1.show

dfFromData1.write.format("hudi").
  option(PRECOMBINE_FIELD_OPT_KEY, "preComb").
  option(RECORDKEY_FIELD_OPT_KEY, "rowId").
  option(TABLE_NAME, tableName).
  mode(Overwrite).
  save(basePath)

var snapshotDF1 = spark.read.format("hudi").load(basePath + "/*")
snapshotDF1.createOrReplaceTempView("hudi_snapshot")
spark.sql("select rowId, preComb, name from hudi_snapshot").show()

dfFromData1.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "preComb").
  option(RECORDKEY_FIELD_OPT_KEY, "rowId").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)
```

When updating the records (the second `save`), Hudi throws:

```
22/06/27 14:56:41 ERROR BoundedInMemoryExecutor: error producing records
org.apache.hudi.exception.HoodieException: unable to read next record from parquet file
	at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:54)
	at org.apache.hudi.common.util.queue.IteratorBasedQueueProducer.produce(IteratorBasedQueueProducer.java:45)
	at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$0(BoundedInMemoryExecutor.java:92)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassCastException: optional binary id (STRING) is not a group
	at org.apache.parquet.schema.Type.asGroupType(Type.java:248)
	at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:279)
	at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:232)
	at org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:78)
	at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.<init>(AvroRecordConverter.java:536)
	at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.<init>(AvroRecordConverter.java:486)
	at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:289)
	at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:141)
	at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:95)
	at org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:33)
	at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:138)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:185)
	at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156)
	at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
	at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:49)
	... 8 more
```

The weird thing is that if the array-of-structs field (`valObjs`) has more than one field, the update works. For example, the following schema works with the example above; note the added `secondid` field in `valObjs`:

```
val schema = StructType(Array(
  StructField("rowId", StringType, true),
  StructField("preComb", LongType, true),
  StructField("name", StringType, true),
  StructField("valObjs", ArrayType(StructType(Array(
    StructField("id", StringType),
    StructField("secondid", StringType)
  ))))
))
```

**Expected behavior**

The update should work.

**Environment Description**

* Hudi version : 0.10.1 & 0.11.1
* Spark version : 3.2.0
* Hadoop version : 2.7
* Storage (HDFS/S3/GCS..) : Tested using local spark-shell and in EMR
* Running on Docker? (yes/no) : Ran on a Mac; same error in EMR 6.6.0
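For background on why the element field count might matter: Parquet has two ways to encode a list of structs, and a reader such as parquet-avro's `AvroRecordConverter` has to decide which one a file uses. The fragments below are illustrative sketches of the two layouts for the single-field `valObjs` schema above (hand-written for illustration, not dumped from the actual files; the `//` lines are annotations, and the group names `list`/`array`/`element` are the conventional ones, not verified against Hudi's output):

```
// Modern 3-level LIST encoding: the repeated group wraps an explicit element group.
optional group valObjs (LIST) {
  repeated group list {
    optional group element {
      optional binary id (STRING);
    }
  }
}

// Legacy 2-level (Avro-style) encoding: the repeated group itself is the element struct.
optional group valObjs (LIST) {
  repeated group array {
    optional binary id (STRING);
  }
}
```

When each element struct has more than one field, the repeated group can only be the element itself, so the two layouts are distinguishable. With a single-field struct they look alike, and a reader that picks the 3-level interpretation for a 2-level file would descend into `id` expecting a group, which matches the `optional binary id (STRING) is not a group` failure in the stack trace. This is a hedged reading of the trace, not a confirmed root-cause analysis.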