sauliusvl opened a new issue #4038:
URL: https://github.com/apache/iceberg/issues/4038


   Consider a table created in Trino (Trino has a native `UUID` type that maps to `UUID` in Iceberg):
   
   ```sql
   CREATE TABLE test (id UUID) WITH (location = 'hdfs://rr-hdpz1/user/iceberg/test', format = 'PARQUET');
   INSERT INTO test VALUES (UUID '12151fd2-7586-11e9-8f9e-2a86e4085a59');
   ```
   
   Everything looks good: I can query the table, and running `parquet meta` on the resulting file in HDFS suggests it's written correctly according to the Parquet specification (a fixed-length byte array of 16 bytes with logical type `UUID`):
   
   ```
   message table {
     optional fixed_len_byte_array(16) id (UUID) = 1;
   }
   ```
   
   Now I try to read the table in Spark:
   
   ```scala
   scala> spark.sql("select * from test").printSchema
   root
    |-- id: string (nullable = true)
   
   scala> spark.sql("select * from test").show(false)
   22/02/03 12:16:18 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) 
(bn-hdpz1.vinted.net executor 2): java.lang.ClassCastException: class [B cannot 
be cast to class org.apache.spark.unsafe.types.UTF8String ([B is in module 
java.base of loader 'bootstrap'; org.apache.spark.unsafe.types.UTF8String is in 
unnamed module of loader 'app')
        at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getUTF8String(rows.scala:46)
   ```
   The same thing happens when trying to insert from Spark:
   
   ```scala
   scala> spark.sql("insert into test values 
('e9238c3e-3aa6-4668-aceb-f9507a8f8d59')")
   22/02/03 12:42:29 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 4) 
(hx-hdpz2.vinted.net executor 3): java.lang.ClassCastException: class 
org.apache.spark.unsafe.types.UTF8String cannot be cast to class [B 
(org.apache.spark.unsafe.types.UTF8String is in unnamed module of loader 'app'; 
[B is in module java.base of loader 'bootstrap')
        at 
org.apache.iceberg.spark.data.SparkParquetWriters$ByteArrayWriter.write(SparkParquetWriters.java:291)
   ```
   
   
   The [docs](https://iceberg.apache.org/#spark-writes/#type-compatibility) seem to suggest that UUID should be converted to a string in Spark, but after reading the source code I don't see how this is supposed to work: the `UUID` type is simply mapped to `String` [here](https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/TypeToSparkType.java#L112), which is not enough - the byte representation of a UUID can't be cast to `String` directly.
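   
   To illustrate the gap: the 16 raw bytes only become the canonical string by going through `java.util.UUID`; there is no direct cast in either direction. A minimal round-trip sketch in plain Java (standalone demo code, not Iceberg internals):
   
   ```java
   import java.nio.ByteBuffer;
   import java.util.UUID;
   
   public class UuidBytesDemo {
     // Decode a Parquet fixed_len_byte_array(16) value into the canonical string form.
     // Parquet stores UUIDs big-endian, which matches ByteBuffer's default byte order.
     static String uuidStringFromBytes(byte[] bytes) {
       ByteBuffer buf = ByteBuffer.wrap(bytes);
       return new UUID(buf.getLong(), buf.getLong()).toString();
     }
   
     // Encode the canonical string form back into the 16 raw bytes.
     static byte[] uuidBytesFromString(String str) {
       UUID uuid = UUID.fromString(str);
       return ByteBuffer.allocate(16)
           .putLong(uuid.getMostSignificantBits())
           .putLong(uuid.getLeastSignificantBits())
           .array();
     }
   
     public static void main(String[] args) {
       byte[] raw = uuidBytesFromString("12151fd2-7586-11e9-8f9e-2a86e4085a59");
       System.out.println(raw.length);               // 16
       System.out.println(uuidStringFromBytes(raw)); // 12151fd2-7586-11e9-8f9e-2a86e4085a59
     }
   }
   ```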
   
   The way I see it, Iceberg should do one of the following:
    - somehow do the conversion back and forth to string automatically upon reading/inserting (a sketch of this conversion follows after this list) - not sure if this is possible, and it's not optimal performance-wise, as the canonical string representation is 36 characters vs. the raw 16 bytes
    - map it to a byte array and force users to convert to `java.util.UUID` themselves if they really want the text representation - optimal, but not user friendly
    - implement a Spark [user defined type](https://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/sql/types/UserDefinedType.html) for UUID and convert to it - not sure about the performance implications here
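   
   For the first option, the conversions would have to happen exactly where the two `ClassCastException`s above are thrown: the reader would decode the raw 16 bytes into the `UTF8String` that Spark's `StringType` expects, and the writer would do the reverse. A sketch of just the conversion logic (class and method names are illustrative, not the actual Iceberg reader/writer plumbing):
   
   ```java
   import java.nio.ByteBuffer;
   import java.util.UUID;
   
   import org.apache.spark.unsafe.types.UTF8String;
   
   // Hypothetical conversion shims; only the byte[] <-> UTF8String logic is concrete.
   final class UuidConversions {
     // Read path: Parquet fixed_len_byte_array(16) -> the UTF8String Spark expects.
     static UTF8String toUtf8(byte[] fixed16) {
       ByteBuffer buf = ByteBuffer.wrap(fixed16);
       return UTF8String.fromString(new UUID(buf.getLong(), buf.getLong()).toString());
     }
   
     // Write path: the UTF8String coming from Spark -> the raw 16 bytes for Parquet.
     static byte[] fromUtf8(UTF8String str) {
       UUID uuid = UUID.fromString(str.toString());
       return ByteBuffer.allocate(16)
           .putLong(uuid.getMostSignificantBits())
           .putLong(uuid.getLeastSignificantBits())
           .array();
     }
   }
   ```
   
   This still pays the string formatting/parsing cost on every row, which is the performance concern raised in the first option.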
   

