sauliusvl opened a new issue #4038:
URL: https://github.com/apache/iceberg/issues/4038
Consider a table created in Trino (it has a native UUID type that maps to
UUID in Iceberg):
```sql
CREATE TABLE test (id UUID) WITH (location =
'hdfs://rr-hdpz1/user/iceberg/test', format = 'PARQUET');
insert into test values (uuid '12151fd2-7586-11e9-8f9e-2a86e4085a59')
```
Everything looks good: I can query the table, and running `parquet meta` on the
resulting file in HDFS shows it is written correctly according to the Parquet
specification (a fixed-length byte array of 16 bytes with logical type `UUID`):
```
message table {
optional fixed_len_byte_array(16) id (UUID) = 1;
}
```
Now I try to read it in Spark:
```
scala> spark.sql("select * from test").printSchema
root
|-- id: string (nullable = true)
scala> spark.sql("select * from test").show(false)
22/02/03 12:16:18 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0)
(bn-hdpz1.vinted.net executor 2): java.lang.ClassCastException: class [B cannot
be cast to class org.apache.spark.unsafe.types.UTF8String ([B is in module
java.base of loader 'bootstrap'; org.apache.spark.unsafe.types.UTF8String is in
unnamed module of loader 'app')
at
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getUTF8String(rows.scala:46)
```
Same thing when trying to insert from spark:
```
scala> spark.sql("insert into test values
('e9238c3e-3aa6-4668-aceb-f9507a8f8d59')")
22/02/03 12:42:29 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 4)
(hx-hdpz2.vinted.net executor 3): java.lang.ClassCastException: class
org.apache.spark.unsafe.types.UTF8String cannot be cast to class [B
(org.apache.spark.unsafe.types.UTF8String is in unnamed module of loader 'app';
[B is in module java.base of loader 'bootstrap')
at
org.apache.iceberg.spark.data.SparkParquetWriters$ByteArrayWriter.write(SparkParquetWriters.java:291)
```
The [docs](https://iceberg.apache.org/#spark-writes/#type-compatibility)
seem to suggest that UUID should be converted to a string in Spark, but after
reading the source code I don't see how this is supposed to work: the `UUID`
type is simply mapped to `String`
[here](https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/TypeToSparkType.java#L112),
which is not enough, since the byte representation of a UUID can't be cast to
`String` directly.
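To illustrate the point: interpreting the raw 16 bytes as text does not yield the UUID string; an explicit rebuild through `java.util.UUID` is required. A standalone sketch in plain Java (no Spark or Iceberg dependencies, my own naming):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class UuidCast {
    public static void main(String[] args) {
        UUID u = UUID.fromString("e9238c3e-3aa6-4668-aceb-f9507a8f8d59");

        // The on-disk representation: two big-endian longs, 16 bytes total.
        byte[] raw = ByteBuffer.allocate(16)
                .putLong(u.getMostSignificantBits())
                .putLong(u.getLeastSignificantBits())
                .array();

        // "Casting" the bytes to a string just decodes arbitrary binary as
        // text; it does not produce the canonical 36-character form.
        String naive = new String(raw, StandardCharsets.UTF_8);
        System.out.println(naive.equals(u.toString())); // false

        // The conversion a reader would actually have to perform:
        ByteBuffer buf = ByteBuffer.wrap(raw);
        String proper = new UUID(buf.getLong(), buf.getLong()).toString();
        System.out.println(proper.equals(u.toString())); // true
    }
}
```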
The way I see it, Iceberg should do one of the following:
- convert to and from the string representation automatically on
reading/inserting (I'm not sure whether this is possible, and it isn't optimal
performance-wise, as the canonical string representation is 36 characters vs. the raw 16 bytes)
- map it to a byte array and force users to convert to `java.util.UUID` themselves if
they really want the text representation - optimal, but not user-friendly
- implement a Spark [user defined
type](https://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/sql/types/UserDefinedType.html)
for UUID and convert to it - I'm not sure about the performance implications here
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]