HelloJowet opened a new issue, #1724:
URL: https://github.com/apache/sedona/issues/1724
## Expected behavior
Data should be inserted into the Iceberg table without serialisation errors
when Sedona and Iceberg are used together.
## Actual behavior
The `INSERT INTO` operation fails with a Kryo serialisation exception. The
error trace indicates an `IndexOutOfBoundsException` in the Kryo serializer
while handling Iceberg's `GenericDataFile` and `SparkWrite.TaskCommit` objects.
Error message:
```
py4j.protocol.Py4JJavaError: An error occurred while calling o55.sql.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Exception while getting task result:
com.esotericsoftware.kryo.KryoException:
java.lang.IndexOutOfBoundsException: Index 44 out of bounds for length 14
Serialization trace:
partitionType (org.apache.iceberg.GenericDataFile)
taskFiles (org.apache.iceberg.spark.source.SparkWrite$TaskCommit)
writerCommitMessage (org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult)
```
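
Before the failing statement, the active serializer settings can be confirmed
at runtime. A minimal check, assuming the `sedona` session created in the
reproduction steps below:
```py
# Assumes the `sedona` session from the reproduction steps below. Confirms that
# the Kryo serializer and the Sedona registrator are actually in effect.
print(sedona.conf.get('spark.serializer'))
print(sedona.conf.get('spark.kryo.registrator'))
```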
## Steps to reproduce the problem
1. Configure Sedona and Iceberg with the following settings:
```py
from sedona.spark import SedonaContext
config = (
    SedonaContext.builder()
    .master('spark://localhost:5581')
    .config(
        'spark.jars.packages',
        'org.apache.sedona:sedona-spark-3.5_2.12:1.7.0,'
        'org.datasyslab:geotools-wrapper:1.7.0-28.5,'
        'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1,'
        'org.apache.iceberg:iceberg-aws-bundle:1.7.1,'
        'org.postgresql:postgresql:42.7.4',
    )
    .config('spark.sql.extensions',
            'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
    .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer')
    .config('spark.kryo.registrator',
            'org.apache.sedona.core.serde.SedonaKryoRegistrator')
    .config('spark.sql.catalog.my_catalog', 'org.apache.iceberg.spark.SparkCatalog')
    .config('spark.sql.catalog.my_catalog.type', 'jdbc')
    .config('spark.sql.catalog.my_catalog.uri',
            'jdbc:postgresql://localhost:5500/data_catalog_apache_iceberg')
    .config('spark.sql.catalog.my_catalog.jdbc.user', 'postgres')
    .config('spark.sql.catalog.my_catalog.jdbc.password', 'postgres')
    .config('spark.sql.catalog.my_catalog.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO')
    .config('spark.sql.catalog.my_catalog.warehouse', 's3a://data-lakehouse')
    .config('spark.sql.catalog.my_catalog.s3.endpoint', 'http://localhost:5561')
    .config('spark.sql.catalog.my_catalog.s3.access-key-id', 'admin')
    .config('spark.sql.catalog.my_catalog.s3.secret-access-key', 'password')
    .getOrCreate()
)
sedona = SedonaContext.create(config)
```
2. Execute the following queries:
```py
sedona.sql('CREATE TABLE my_catalog.table2 (name string) USING iceberg;')
sedona.sql("INSERT INTO my_catalog.table2 VALUES ('Alex'), ('Dipankar'),
('Jason')")
```
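
Not yet verified, but since the trace only implicates the Kryo path, one
candidate mitigation is to leave the two Kryo settings unset so Spark falls
back to its default JavaSerializer for task results. A sketch of that variant
(the `my_catalog` settings are the same as in step 1 and elided here):
```py
from sedona.spark import SedonaContext

# Hypothetical, untested variant of the step 1 builder: identical configuration
# except that spark.serializer and spark.kryo.registrator are left unset, so
# task results (including Iceberg's TaskCommit) use the default JavaSerializer.
config = (
    SedonaContext.builder()
    .master('spark://localhost:5581')
    .config(
        'spark.jars.packages',
        'org.apache.sedona:sedona-spark-3.5_2.12:1.7.0,'
        'org.datasyslab:geotools-wrapper:1.7.0-28.5,'
        'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1,'
        'org.apache.iceberg:iceberg-aws-bundle:1.7.1,'
        'org.postgresql:postgresql:42.7.4',
    )
    .config('spark.sql.extensions',
            'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
    # ... same spark.sql.catalog.my_catalog.* settings as in step 1 ...
    .getOrCreate()
)
sedona = SedonaContext.create(config)
sedona.sql("INSERT INTO my_catalog.table2 VALUES ('Alex'), ('Dipankar'), ('Jason')")
```
Even if this avoids the error, it trades away Sedona's tuned Kryo geometry
serialization, so it would be a workaround rather than a fix.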
## Additional information
If I perform the same operations using Spark without Sedona, everything
works seamlessly:
```py
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .master('spark://localhost:5581')
    .config(
        'spark.jars.packages',
        'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1,'
        'org.apache.iceberg:iceberg-aws-bundle:1.7.1,'
        'org.postgresql:postgresql:42.7.4',
    )
    .config('spark.sql.extensions',
            'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
    .config('spark.sql.catalog.my_catalog', 'org.apache.iceberg.spark.SparkCatalog')
    .config('spark.sql.catalog.my_catalog.type', 'jdbc')
    .config('spark.sql.catalog.my_catalog.uri',
            'jdbc:postgresql://localhost:5500/data_catalog_apache_iceberg')
    .config('spark.sql.catalog.my_catalog.jdbc.user', 'postgres')
    .config('spark.sql.catalog.my_catalog.jdbc.password', 'postgres')
    .config('spark.sql.catalog.my_catalog.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO')
    .config('spark.sql.catalog.my_catalog.warehouse', 's3a://data-lakehouse')
    .config('spark.sql.catalog.my_catalog.s3.endpoint', 'http://localhost:5561')
    .config('spark.sql.catalog.my_catalog.s3.access-key-id', 'admin')
    .config('spark.sql.catalog.my_catalog.s3.secret-access-key', 'password')
    .getOrCreate()
)
spark.sql('CREATE TABLE my_catalog.table8 (name string) USING iceberg;')
spark.sql("INSERT INTO my_catalog.table8 VALUES ('Alex'), ('Dipankar'), ('Jason')")
```
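
To narrow the trigger further, an isolation experiment (an assumption on my
side, not yet run) would be to take this working Spark-only setup and add back
only `spark.serializer`, with no Sedona jars or registrator. If the same
`IndexOutOfBoundsException` appears, the incompatibility is between Iceberg's
`TaskCommit` serialization and Kryo itself rather than anything specific to
`SedonaKryoRegistrator` (hypothetical table name `table9`):
```py
from pyspark.sql import SparkSession

# Isolation sketch (not yet run): the working Spark-only builder above with
# only the Kryo serializer added back, and no Sedona packages or registrator.
spark = (
    SparkSession.builder.master('spark://localhost:5581')
    .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer')
    # ... same spark.jars.packages, extensions, and my_catalog settings as above ...
    .getOrCreate()
)
spark.sql('CREATE TABLE my_catalog.table9 (name string) USING iceberg;')
spark.sql("INSERT INTO my_catalog.table9 VALUES ('Alex')")
```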
## Settings
Sedona version = 1.7.1
Apache Spark version = 3.5
API type = Python
Scala version = 2.12
JRE version = 11.0.25
Python version = 3.12.0
Environment = Standalone