shardulm94 opened a new issue #446: KryoException when writing Iceberg tables in Spark
URL: https://github.com/apache/incubator-iceberg/issues/446
 
 
   Say we have a table `test (col1: string)`.
   ```
   case class Input(col1: String)
   val df = spark.createDataFrame(List(Input("data1"), Input("data2")))
   df.write.format("iceberg").mode("overwrite").option("write-format", "parquet").save("default.test")
   ```
   ```
   2019-09-04 05:35:13 ERROR TaskResultGetter:91 - Exception while getting task result
   com.esotericsoftware.kryo.KryoException: java.lang.UnsupportedOperationException
   Serialization trace:
   splitOffsets (org.apache.iceberg.GenericDataFile)
   files (org.apache.iceberg.spark.source.Writer$TaskCommit)
        at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
        at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543)
        at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:731)
        at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:391)
        at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:302)
        at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:731)
        at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
        at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543)
        at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813)
        at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:362)
        at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:88)
        at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:72)
        at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
        at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
        at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:62)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
   Caused by: java.lang.UnsupportedOperationException
        at org.apache.iceberg.shaded.com.google.common.collect.ImmutableCollection.add(ImmutableCollection.java:244)
        at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:134)
        at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:40)
        at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:731)
        at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
        ... 18 more
   2019-09-04 05:35:13 ERROR WriteToDataSourceV2Exec:70 - Data source writer IcebergWrite(table=default.test, format=PARQUET) is aborting.
   2019-09-04 05:35:13 ERROR WriteToDataSourceV2Exec:70 - Data source writer IcebergWrite(table=default.test, format=PARQUET) aborted.
   ```
   
   Hitting this when writing Parquet data in Spark 2.4 using the latest master. I believe this is because, as of 549da809490976b53a13b14596dd240ed74bce5e, an immutable copy of the split offsets is stored in `GenericDataFile`. Serialization succeeds, but on deserialization Kryo first instantiates an `ImmutableList` and then tries to add elements to it.
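   The failure mode is easy to reproduce without Kryo itself. Kryo's default `CollectionSerializer.read()` instantiates the recorded collection class and then calls `add()` once per element; on any immutable list, that `add()` throws `UnsupportedOperationException`. A minimal sketch of the same behavior using a JDK unmodifiable list as a stand-in for the shaded Guava `ImmutableList`:

   ```scala
   import java.util.{Arrays, Collections}

   // Stand-in for GenericDataFile.splitOffsets after the immutable copy:
   val offsets = Collections.unmodifiableList(Arrays.asList(4L, 1048576L))

   // What Kryo's CollectionSerializer effectively does during read():
   val addFailed =
     try { offsets.add(2097152L); false }
     catch { case _: UnsupportedOperationException => true }

   println(s"add() on the immutable list threw: $addFailed")
   ```

   This matches the `Caused by` frame in the trace: `ImmutableCollection.add` is reached from `CollectionSerializer.read`.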
   I think this can be fixed in two ways:
   1. Don't use an immutable copy of the split offsets list. I am not sure what the intention of making it immutable was in the first place; other metadata, such as metrics, is still provided in mutable collections.
   2. Include custom Kryo serializers that support Guava immutable collections, e.g. https://github.com/magro/kryo-serializers/
   
   Option 2 seems like overkill for this specific case, IMO, given that immutable collections don't seem to be necessary here, but I'd like to hear what others think.
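   For option 1, the change would amount to keeping a defensive *mutable* copy, which Kryo's instantiate-then-add deserialization handles fine. A hypothetical sketch (the helper name is made up; `GenericDataFile`'s actual field handling differs):

   ```scala
   import java.util.{ArrayList => JArrayList, Arrays, List => JList}

   // Hypothetical: copy split offsets into an ArrayList instead of an
   // ImmutableList, so Kryo can rebuild it element by element via add().
   def copySplitOffsets(offsets: JList[Long]): JList[Long] =
     if (offsets == null) null else new JArrayList[Long](offsets)

   val copied = copySplitOffsets(Arrays.asList(4L, 1048576L))
   copied.add(2097152L)  // succeeds: ArrayList supports add()
   println(copied)
   ```

   Option 2 would instead register serializers from kryo-serializers via a Spark `KryoRegistrator`, but note that Iceberg shades Guava (the trace shows `org.apache.iceberg.shaded.com.google.common.collect.ImmutableCollection`), so off-the-shelf serializers targeting unshaded Guava classes may not match.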
   
   
