phillycoder opened a new issue, #5985:
URL: https://github.com/apache/hudi/issues/5985

   **Describe the problem you faced**
   
   Getting a `java.lang.ClassCastException: optional binary xx (STRING)` exception when a record gets updated.
   The issue only occurs when a field is an array of structs with exactly one field; it does not occur when the array's structs have more than one field.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   Launch spark-shell:
   ```
   ./spark-shell --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
                 --conf "spark.sql.hive.convertMetastoreParquet=false" \
                 --jars $HOME/.m2/repository/org/apache/hudi/hudi-spark3-bundle_2.12/0.10.0/hudi-spark3-bundle_2.12-0.10.0.jar,$HOME/.m2/repository/org/apache/spark/spark-avro_2.12/3.2.0/spark-avro_2.12-3.2.0.jar
   ```
   
   ```
   import org.apache.hudi.QuickstartUtils._
   import scala.collection.JavaConversions._
   import org.apache.spark.sql.SaveMode._
   import org.apache.hudi.DataSourceReadOptions._
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.config.HoodieWriteConfig._
   import org.apache.spark.sql.types._
   import org.apache.spark.sql.Row

   val tableName = "hudi_cow"
   val basePath = "/tmp/hudi_cow"

   // valObjs is an array of structs with a single field -- this is the shape
   // that triggers the failure on update.
   val schema = StructType(Array(
     StructField("rowId", StringType, true),
     StructField("preComb", LongType, true),
     StructField("name", StringType, true),
     StructField("valObjs", ArrayType(StructType(Array(
       StructField("id", StringType)
     ))))
   ))

   val data1 = Seq(
     Row("row_1", 0L, "test", Array()),
     Row("row_2", 0L, "test", Array()),
     Row("row_3", 0L, "test", Array()))

   var dfFromData1 = spark.createDataFrame(data1, schema)
   dfFromData1.printSchema
   dfFromData1.show

   // First write: plain insert, succeeds.
   dfFromData1.write.format("hudi").
     option(PRECOMBINE_FIELD_OPT_KEY, "preComb").
     option(RECORDKEY_FIELD_OPT_KEY, "rowId").
     option(TABLE_NAME, tableName).
     mode(Overwrite).
     save(basePath)

   var snapshotDF1 = spark.read.format("hudi").load(basePath + "/*")
   snapshotDF1.createOrReplaceTempView("hudi_snapshot")
   spark.sql("select rowId, preComb, name from hudi_snapshot").show()

   // Second write: updates the same record keys and throws the exception below.
   dfFromData1.write.format("hudi").
     options(getQuickstartWriteConfigs).
     option(PRECOMBINE_FIELD_OPT_KEY, "preComb").
     option(RECORDKEY_FIELD_OPT_KEY, "rowId").
     option(TABLE_NAME, tableName).
     mode(Append).
     save(basePath)
   ```
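   
   To confirm what's actually on disk, here is a quick triage sketch (my addition, assuming `ParquetFileReader` from parquet-hadoop is on the spark-shell classpath, which Spark 3.2 ships). It prints the raw Parquet schema of one data file Hudi wrote, so you can see how `valObjs` was encoded:
   
   ```
   // Triage sketch: dump the raw Parquet schema of one data file, to see
   // whether valObjs uses the legacy two-level or the standard three-level
   // list layout. inputFiles resolves concrete file paths, so this works
   // regardless of the partition layout.
   import org.apache.hadoop.fs.Path
   import org.apache.parquet.hadoop.ParquetFileReader
   import org.apache.parquet.hadoop.util.HadoopInputFile

   val conf = spark.sparkContext.hadoopConfiguration
   val dataFile = new Path(spark.read.format("hudi").load(basePath + "/*").inputFiles.head)
   val reader = ParquetFileReader.open(HadoopInputFile.fromPath(dataFile, conf))
   println(reader.getFooter.getFileMetaData.getSchema)
   reader.close()
   ```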
   
   When updating the records (the second save), Hudi throws:
   
   ```
   22/06/27 14:56:41 ERROR BoundedInMemoryExecutor: error producing records
   org.apache.hudi.exception.HoodieException: unable to read next record from parquet file
        at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:54)
        at org.apache.hudi.common.util.queue.IteratorBasedQueueProducer.produce(IteratorBasedQueueProducer.java:45)
        at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$0(BoundedInMemoryExecutor.java:92)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
   Caused by: java.lang.ClassCastException: optional binary id (STRING) is not a group
        at org.apache.parquet.schema.Type.asGroupType(Type.java:248)
        at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:279)
        at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:232)
        at org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:78)
        at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.<init>(AvroRecordConverter.java:536)
        at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.<init>(AvroRecordConverter.java:486)
        at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:289)
        at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:141)
        at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:95)
        at org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:33)
        at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:138)
        at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:185)
        at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156)
        at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
        at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:49)
        ... 8 more
   ```
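   
   Reading the trace: the cast happens in parquet-avro's `AvroRecordConverter` while it builds the element converter for a list. A plausible cause (my interpretation, not confirmed here) is that Hudi's merge rewrites the existing file through parquet-avro, which by default writes the legacy two-level list layout; in that layout a repeated group with exactly one field is ambiguous (the group could be the element itself or a wrapper around it), and the reader resolves the ambiguity the wrong way for this schema, treating `optional binary id (STRING)` as a group. If that is the cause, forcing the standard three-level layout before the table is first written might sidestep it:
   
   ```
   // Untested mitigation sketch (assumption: Hudi's Avro-based Parquet writer
   // picks this property up from the Spark Hadoop configuration). It tells
   // parquet-avro to write the standard three-level list layout, which is
   // unambiguous even for single-field elements. Must be set before the
   // table's files are first written; existing files keep the old layout.
   spark.sparkContext.hadoopConfiguration.set("parquet.avro.write-old-list-structure", "false")
   ```
   
   `parquet.avro.write-old-list-structure` is a standard parquet-avro property; whether the Hudi 0.10/0.11 write path honors it here is the untested assumption.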
   
   The odd part: if the array of structs field (valObjs) has more than one field, the update works, which is consistent with the single-field ambiguity described above. For example, the following schema works with the example above (note the added secondid field in valObjs):
   ```
   val schema = StructType(Array(
     StructField("rowId", StringType, true),
     StructField("preComb", LongType, true),
     StructField("name", StringType, true),
     StructField("valObjs", ArrayType(StructType(Array(
       StructField("id", StringType),
       StructField("secondid", StringType)
     ))))
   ))
   ```
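   
   That pattern suggests a stopgap, offered purely as a sketch (not something Hudi documents): pad single-field structs with a dummy column before writing, so the repeated group on disk never has exactly one field. Using Spark 3's `transform` higher-order function:
   
   ```
   // Hypothetical stopgap: append an unused "pad" field to every element of
   // valObjs, so each struct has two fields and avoids the single-field case.
   import org.apache.spark.sql.functions._

   val paddedDF = dfFromData1.withColumn(
     "valObjs",
     transform(col("valObjs"),
       x => struct(x.getField("id").as("id"), lit(null).cast("string").as("pad"))))
   ```
   
   Write `paddedDF` in place of `dfFromData1` in both saves; readers then need to ignore the extra `pad` field.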
   
   
   **Expected behavior**
   
   Update should work. 
   
   **Environment Description**
   
   * Hudi version : 0.10.1 & 0.11.1
   
   * Spark version : 3.2.0 
   
   * Hadoop version : 2.7
   
   * Storage (HDFS/S3/GCS..) : Local filesystem via spark-shell; also reproduced on EMR
   
   * Running on Docker? (yes/no) : No. Reproduced locally on a Mac and on EMR 6.6.0
   
   

