Grant Henke created KUDU-2454:
---------------------------------

             Summary: Avro Import/Export does not round trip
                 Key: KUDU-2454
                 URL: https://issues.apache.org/jira/browse/KUDU-2454
             Project: Kudu
          Issue Type: Bug
            Reporter: Grant Henke


When exporting to Avro, columns with type Byte or Short are treated as Integers 
because Avro doesn't have a Byte or Short type. When re-importing the data, the 
job fails because the column types do not match.

Ideally spark-avro would solve this by safely casting the values back to the 
smaller type; Guava has utilities to make this straightforward (e.g. 
Shorts.checkedCast(i)). We could send a pull request to spark-avro to fix this, 
or add some special handling on the Kudu side to do the safe downconversion, 
as sketched below.
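A minimal sketch of what Kudu-side handling could look like. NarrowingCasts and 
narrow are hypothetical names (not existing Kudu APIs), and it assumes the Kudu 
schema is available to tell which columns are int8/int16:

{code:scala}
import com.google.common.primitives.{Shorts, SignedBytes}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf

// Hypothetical Kudu-side helper: narrow the Integer columns that spark-avro
// produces back to the Byte/Short types declared in the Kudu schema.
// checkedCast throws on overflow instead of silently truncating.
object NarrowingCasts {
  private val toShort = udf((i: Int) => Shorts.checkedCast(i.toLong))
  private val toByte  = udf((i: Int) => SignedBytes.checkedCast(i.toLong))

  def narrow(df: DataFrame, column: String, kuduType: String): DataFrame =
    kuduType match {
      case "int16" => df.withColumn(column, toShort(df(column)))
      case "int8"  => df.withColumn(column, toByte(df(column)))
      case _       => df
    }
}
{code}

The same checked-cast approach would apply inside spark-avro itself if we go 
the pull-request route.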

Another type issue when exporting is that Decimal values are written as Strings 
instead of the Avro decimal logical type. There are a few unmerged pull 
requests to fix that (an import-side workaround is sketched after the list): 
 * [https://github.com/databricks/spark-avro/pull/276]
 * [https://github.com/databricks/spark-avro/pull/121]
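
Until one of those is merged, an import-side workaround could cast the exported 
String column back to a decimal. restoreDecimal is a hypothetical helper; the 
precision and scale would have to come from the target Kudu column:

{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.DecimalType

// Hypothetical workaround: restore a Decimal column that spark-avro exported
// as a String, using the precision/scale of the target Kudu column.
def restoreDecimal(df: DataFrame, column: String, precision: Int, scale: Int): DataFrame =
  df.withColumn(column, df(column).cast(DecimalType(precision, scale)))
{code}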

Additionally, Timestamp values are written as longs instead of Timestamp 
logical types (timestamp-micros). This is a data corruption issue because the 
long [value that is 
output|https://github.com/databricks/spark-avro/blob/0764d699015975acf87dc5210cca8a43db84196a/src/main/scala/com/databricks/spark/avro/AvroOutputWriter.scala#L103]
 is in milliseconds (Timestamp.getTime()), but the expected long value for a 
Kudu Timestamp column is in microseconds.
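
To illustrate the off-by-1000 problem (the timestamp below is just a 
placeholder, and the conversion ignores sub-millisecond precision):

{code:scala}
import java.sql.Timestamp

// spark-avro writes Timestamp.getTime(), i.e. milliseconds since the epoch,
// but a Kudu UNIXTIME_MICROS column stores microseconds since the epoch.
val ts = Timestamp.valueOf("2018-06-01 12:00:00")
val exportedMillis = ts.getTime          // what spark-avro currently writes
val kuduMicros     = ts.getTime * 1000L  // what Kudu expects (ignoring sub-ms precision)
// Re-importing exportedMillis unchanged interprets the value as a time
// roughly 1000x closer to the epoch than the original.
{code}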

Given all these issues, ImportExportFiles needs a lot more test coverage before 
we suggest its use. Currently it only tests importing Strings from a CSV and 
does not test Avro or Parquet support.
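
As a starting point, a round-trip test along these lines would catch all three 
problems (the Avro path and column names are placeholders, and it uses the 
databricks spark-avro data source directly rather than ImportExportFiles):

{code:scala}
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

// Sketch of a round-trip test over the problematic types; today the final
// assertion fails because byte/short come back as int, the decimal comes
// back as a string, and the timestamp long is in milliseconds.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val original = Seq(
  (1.toByte, 2.toShort, BigDecimal("12.34"), Timestamp.valueOf("2018-06-01 12:00:00"))
).toDF("byte_col", "short_col", "decimal_col", "ts_col")

original.write.format("com.databricks.spark.avro").save("/tmp/roundtrip.avro")
val reread = spark.read.format("com.databricks.spark.avro").load("/tmp/roundtrip.avro")

assert(reread.schema == original.schema)
{code}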


