HelloJowet opened a new issue, #1724:
URL: https://github.com/apache/sedona/issues/1724
## Expected behavior
Data should be inserted into the Iceberg table without serialisation errors
when Sedona and Iceberg are used together.
## Actual behavior
The `INSERT INTO` operation fails with a Kryo serialisation exception. The
error trace indicates an `IndexOutOfBoundsException` in the Kryo serializer
while handling Iceberg's `GenericDataFile` and `SparkWrite.TaskCommit` objects.
Error message:
```
py4j.protocol.Py4JJavaError: An error occurred while calling o55.sql.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Exception while getting task result:
com.esotericsoftware.kryo.KryoException:
java.lang.IndexOutOfBoundsException: Index 44 out of bounds for length 14
Serialization trace:
partitionType (org.apache.iceberg.GenericDataFile)
taskFiles (org.apache.iceberg.spark.source.SparkWrite$TaskCommit)
writerCommitMessage (org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult)
```
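
Before the failing statement, the active serializer settings can be confirmed
at runtime. A minimal check, assuming the `sedona` session created in the
reproduction steps below:
```py
# Assumes the `sedona` session from the reproduction steps below. Confirms that
# the Kryo serializer and the Sedona registrator are actually in effect.
print(sedona.conf.get('spark.serializer'))
print(sedona.conf.get('spark.kryo.registrator'))
```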
## Steps to reproduce the problem
1. Configure Sedona and Iceberg with the following settings:
```py
from sedona.spark import SedonaContext
config = (
    SedonaContext.builder()
    .master('spark://localhost:5581')
    .config(
        'spark.jars.packages',
        'org.apache.sedona:sedona-spark-3.5_2.12:1.7.0,'
        'org.datasyslab:geotools-wrapper:1.7.0-28.5,'
        'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1,'
        'org.apache.iceberg:iceberg-aws-bundle:1.7.1,'
        'org.postgresql:postgresql:42.7.4',
    )
    .config('spark.sql.extensions',
            'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
    .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer')
    .config('spark.kryo.registrator',
            'org.apache.sedona.core.serde.SedonaKryoRegistrator')
    .config('spark.sql.catalog.my_catalog', 'org.apache.iceberg.spark.SparkCatalog')
    .config('spark.sql.catalog.my_catalog.type', 'jdbc')
    .config('spark.sql.catalog.my_catalog.uri',
            'jdbc:postgresql://localhost:5500/data_catalog_apache_iceberg')
    .config('spark.sql.catalog.my_catalog.jdbc.user', 'postgres')
    .config('spark.sql.catalog.my_catalog.jdbc.password', 'postgres')
    .config('spark.sql.catalog.my_catalog.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO')
    .config('spark.sql.catalog.my_catalog.warehouse', 's3a://data-lakehouse')
    .config('spark.sql.catalog.my_catalog.s3.endpoint', 'http://localhost:5561')
    .config('spark.sql.catalog.my_catalog.s3.access-key-id', 'admin')
    .config('spark.sql.catalog.my_catalog.s3.secret-access-key', 'password')
    .getOrCreate()
)
sedona = SedonaContext.create(config)
```
2. Execute the following queries:
```py
sedona.sql('CREATE TABLE my_catalog.table2 (name string) USING iceberg;')
sedona.sql("INSERT INTO my_catalog.table2 VALUES ('Alex'), ('Dipankar'),
('Jason')")
```
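
Not yet verified, but since the trace only implicates the Kryo path, one
candidate mitigation is to leave the two Kryo settings unset so Spark falls
back to its default JavaSerializer for task results. A sketch of that variant
(the `my_catalog` settings are the same as in step 1 and elided here):
```py
from sedona.spark import SedonaContext

# Hypothetical, untested variant of the step 1 builder: identical configuration
# except that spark.serializer and spark.kryo.registrator are left unset, so
# task results (including Iceberg's TaskCommit) use the default JavaSerializer.
config = (
    SedonaContext.builder()
    .master('spark://localhost:5581')
    .config(
        'spark.jars.packages',
        'org.apache.sedona:sedona-spark-3.5_2.12:1.7.0,'
        'org.datasyslab:geotools-wrapper:1.7.0-28.5,'
        'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1,'
        'org.apache.iceberg:iceberg-aws-bundle:1.7.1,'
        'org.postgresql:postgresql:42.7.4',
    )
    .config('spark.sql.extensions',
            'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
    # ... same spark.sql.catalog.my_catalog.* settings as in step 1 ...
    .getOrCreate()
)
sedona = SedonaContext.create(config)
sedona.sql("INSERT INTO my_catalog.table2 VALUES ('Alex'), ('Dipankar'), ('Jason')")
```
Even if this avoids the error, it trades away Sedona's tuned Kryo geometry
serialization, so it would be a workaround rather than a fix.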
## Additional information
If I perform the same operations using Spark without Sedona, everything
works seamlessly:
```py
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .master('spark://localhost:5581')
    .config(
        'spark.jars.packages',
        'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1,'
        'org.apache.iceberg:iceberg-aws-bundle:1.7.1,'
        'org.postgresql:postgresql:42.7.4',
    )
    .config('spark.sql.extensions',
            'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
    .config('spark.sql.catalog.my_catalog', 'org.apache.iceberg.spark.SparkCatalog')
    .config('spark.sql.catalog.my_catalog.type', 'jdbc')
    .config('spark.sql.catalog.my_catalog.uri',
            'jdbc:postgresql://localhost:5500/data_catalog_apache_iceberg')
    .config('spark.sql.catalog.my_catalog.jdbc.user', 'postgres')
    .config('spark.sql.catalog.my_catalog.jdbc.password', 'postgres')
    .config('spark.sql.catalog.my_catalog.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO')
    .config('spark.sql.catalog.my_catalog.warehouse', 's3a://data-lakehouse')
    .config('spark.sql.catalog.my_catalog.s3.endpoint', 'http://localhost:5561')
    .config('spark.sql.catalog.my_catalog.s3.access-key-id', 'admin')
    .config('spark.sql.catalog.my_catalog.s3.secret-access-key', 'password')
    .getOrCreate()
)
spark.sql('CREATE TABLE my_catalog.table8 (name string) USING iceberg;')
spark.sql("INSERT INTO my_catalog.table8 VALUES ('Alex'), ('Dipankar'), ('Jason')")
```
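
To narrow the trigger further, an isolation experiment (an assumption on my
side, not yet run) would be to take this working Spark-only setup and add back
only `spark.serializer`, with no Sedona jars or registrator. If the same
`IndexOutOfBoundsException` appears, the incompatibility is between Iceberg's
`TaskCommit` serialization and Kryo itself rather than anything specific to
`SedonaKryoRegistrator` (hypothetical table name `table9`):
```py
from pyspark.sql import SparkSession

# Isolation sketch (not yet run): the working Spark-only builder above with
# only the Kryo serializer added back, and no Sedona packages or registrator.
spark = (
    SparkSession.builder.master('spark://localhost:5581')
    .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer')
    # ... same spark.jars.packages, extensions, and my_catalog settings as above ...
    .getOrCreate()
)
spark.sql('CREATE TABLE my_catalog.table9 (name string) USING iceberg;')
spark.sql("INSERT INTO my_catalog.table9 VALUES ('Alex')")
```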
## Settings
Sedona version = 1.7.1
Apache Spark version = 3.5
API type = Python
Scala version = 2.12
JRE version = 11.0.25
Python version = 3.12.0
Environment = Standalone