shardulm94 opened a new issue #446: KryoException when writing Iceberg tables in Spark
URL: https://github.com/apache/incubator-iceberg/issues/446

Say we have a table `test (col1: string)`:

```scala
case class Input(col1: String)

val df = spark.createDataFrame(List(Input("data1"), Input("data2")))
df.write.format("iceberg").mode("overwrite").option("write-format", "parquet").save("default.test")
```

The write fails with:

```
2019-09-04 05:35:13 ERROR TaskResultGetter:91 - Exception while getting task result
com.esotericsoftware.kryo.KryoException: java.lang.UnsupportedOperationException
Serialization trace:
splitOffsets (org.apache.iceberg.GenericDataFile)
files (org.apache.iceberg.spark.source.Writer$TaskCommit)
	at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
	at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543)
	at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:731)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:391)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:302)
	at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:731)
	at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
	at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543)
	at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813)
	at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:362)
	at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:88)
	at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:72)
	at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
	at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
	at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:62)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.UnsupportedOperationException
	at org.apache.iceberg.shaded.com.google.common.collect.ImmutableCollection.add(ImmutableCollection.java:244)
	at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:134)
	at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:40)
	at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:731)
	at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
	... 18 more
2019-09-04 05:35:13 ERROR WriteToDataSourceV2Exec:70 - Data source writer IcebergWrite(table=default.test, format=PARQUET) is aborting.
2019-09-04 05:35:13 ERROR WriteToDataSourceV2Exec:70 - Data source writer IcebergWrite(table=default.test, format=PARQUET) aborted.
```

I am hitting this when writing Parquet data in Spark 2.4 using the latest master. I believe this is because, as of 549da809490976b53a13b14596dd240ed74bce5e, an immutable copy of the split offsets is stored in `GenericDataFile`. Serialization succeeds, but when deserializing, Kryo first instantiates an `ImmutableList` and then tries to add elements to it.

I think this can be fixed in two ways:

1. Don't use an immutable copy of the split offsets list. I am not sure what the intention of making it immutable was in the first place; other metadata, such as the metrics, is still provided in mutable collections.
2. Include custom Kryo serdes which support Guava immutable collections, e.g.
https://github.com/magro/kryo-serializers/

Option 2 seems a little overkill IMO for this specific case, given that immutable collections don't seem to be necessary here, but I would like to hear what others think.
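To make the failure mode concrete: Kryo's `CollectionSerializer.read()` (visible in the `Caused by` frames above) creates the target collection and then `add()`s each deserialized element into it, and an immutable collection rejects `add()`. A minimal stdlib-only sketch of the same behavior, using `Collections.unmodifiableList` as a stand-in for the shaded Guava `ImmutableList` that `GenericDataFile` actually holds:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ImmutableAddDemo {
    public static void main(String[] args) {
        // Stand-in for the immutable splitOffsets list stored in
        // GenericDataFile (the real one is a shaded Guava ImmutableList).
        List<Long> splitOffsets = Collections.unmodifiableList(
                new ArrayList<>(Arrays.asList(4L, 1048576L)));

        try {
            // Kryo's CollectionSerializer.read() deserializes by calling
            // add() element by element; an immutable collection throws here,
            // which is exactly the root cause shown in the stack trace.
            splitOffsets.add(2097152L);
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getClass().getSimpleName());
        }
    }
}
```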
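For reference, option 2 would look roughly like the sketch below. This is untested: `GuavaKryoRegistrator` is a hypothetical name, and note that Iceberg relocates Guava under `org.apache.iceberg.shaded`, so the stock registration helper from kryo-serializers targets the unshaded classes and may not match the shaded `ImmutableList` in the trace above.

```java
import com.esotericsoftware.kryo.Kryo;
import de.javakaffee.kryoserializers.guava.ImmutableListSerializer;
import org.apache.spark.serializer.KryoRegistrator;

// Untested sketch of option 2: a Spark KryoRegistrator that installs the
// Guava ImmutableList serializer from kryo-serializers. A real fix would
// also need serializers registered for Iceberg's shaded Guava classes.
public class GuavaKryoRegistrator implements KryoRegistrator {
  @Override
  public void registerClasses(Kryo kryo) {
    ImmutableListSerializer.registerSerializers(kryo);
  }
}
```

It would then be enabled with `--conf spark.kryo.registrator=GuavaKryoRegistrator`, assuming the class and the kryo-serializers jar are on the executor classpath.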