Grant Henke created KUDU-2454:
---------------------------------

             Summary: Avro Import/Export does not round trip
                 Key: KUDU-2454
                 URL: https://issues.apache.org/jira/browse/KUDU-2454
             Project: Kudu
          Issue Type: Bug
            Reporter: Grant Henke
When exporting to Avro, columns of type Byte or Short are treated as Integers because Avro has no Byte or Short type. When the data is re-imported, the job fails because the column types do not match. Ideally spark-avro would solve this by safely casting the values back to the smaller type; Guava has utilities that make this straightforward (e.g. Shorts.checkedCast(i)). We could send a pull request to spark-avro to fix this, or add special handling on the Kudu side to perform the safe down-conversion.

Another type issue when exporting is that Decimal values are written as Strings instead of the BigDecimal logical type. There are a few un-merged pull requests to fix that here:
* [https://github.com/databricks/spark-avro/pull/276]
* [https://github.com/databricks/spark-avro/pull/121]

Additionally, Timestamp values are written as plain longs instead of the Timestamp logical type (timestamp-micros). This is a data corruption issue because the long [value that is output|https://github.com/databricks/spark-avro/blob/0764d699015975acf87dc5210cca8a43db84196a/src/main/scala/com/databricks/spark/avro/AvroOutputWriter.scala#L103] is in milliseconds (Timestamp.getTime()), while the expected long value for a Kudu Timestamp column is in microseconds.

Given all these issues, ImportExportFiles needs a lot more test coverage before we suggest its use. Currently it only tests importing Strings from a CSV and does not test Avro or Parquet support.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
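A minimal sketch of the safe down-conversion described above. The helper names here are hypothetical; Guava's Shorts.checkedCast behaves the same way (throw on overflow rather than silently truncate):

```java
// Sketch of safe down-conversion when re-importing Avro ints into Kudu
// Byte/Short columns. Out-of-range values are rejected instead of being
// silently truncated. Mirrors Guava's Shorts.checkedCast semantics.
public class SafeDowncast {

    // Hypothetical helper; Guava's Shorts.checkedCast(long) is equivalent.
    static short checkedShortCast(int value) {
        short result = (short) value;
        if (result != value) {
            throw new IllegalArgumentException("Out of short range: " + value);
        }
        return result;
    }

    // Same idea for Byte columns.
    static byte checkedByteCast(int value) {
        byte result = (byte) value;
        if (result != value) {
            throw new IllegalArgumentException("Out of byte range: " + value);
        }
        return result;
    }

    public static void main(String[] args) {
        // Fits in 16 bits, round-trips cleanly.
        System.out.println(checkedShortCast(1234)); // prints 1234
        try {
            checkedShortCast(100000); // does not fit in 16 bits
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Whether this lives in spark-avro or on the Kudu side, the key point is that a failed cast should fail the row loudly rather than corrupt it.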
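For the Timestamp case, a sketch of the millisecond-to-microsecond conversion a fix would need. toMicros is a hypothetical helper, not an existing API; it assumes a fix would read the full nanosecond precision off java.sql.Timestamp:

```java
import java.sql.Timestamp;

public class TimestampMicros {

    // A Kudu Timestamp column expects microseconds since the epoch, but
    // Timestamp.getTime() returns milliseconds -- writing it directly is
    // off by a factor of 1000, which is the corruption described above.
    static long toMicros(Timestamp ts) {
        // getTime() already includes the millisecond portion of the nanos
        // field, so only the sub-millisecond remainder is added back.
        return ts.getTime() * 1000L + (ts.getNanos() % 1_000_000) / 1_000;
    }

    public static void main(String[] args) {
        Timestamp ts = new Timestamp(1528000000123L); // 123 ms after the second
        ts.setNanos(123_456_789);                     // 123.456789 ms fractional part
        System.out.println(toMicros(ts));             // prints 1528000000123456
    }
}
```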