Grant Henke created KUDU-2454:
---------------------------------

             Summary: Avro Import/Export does not round trip
                 Key: KUDU-2454
                 URL: https://issues.apache.org/jira/browse/KUDU-2454
             Project: Kudu
          Issue Type: Bug
            Reporter: Grant Henke


When exporting to Avro, columns with type Byte or Short are treated as Integers 
because Avro doesn't have a Byte or Short type. When re-importing the data, the 
job fails because the column types do not match.

Ideally spark-avro would solve this by safely casting the values back to the 
smaller type; Guava has utilities to make this straightforward (e.g. 
Shorts.checkedCast(i)). We could send a pull request to spark-avro to fix this, 
or add some special handling on the Kudu side to do the safe downconversion, 
as sketched below.
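A minimal sketch of what Kudu-side handling could look like. NarrowingCasts and 
narrow are hypothetical names (not existing Kudu APIs), and it assumes the Kudu 
schema is available to tell which columns are int8/int16:

{code:scala}
import com.google.common.primitives.{Shorts, SignedBytes}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf

// Hypothetical Kudu-side helper: narrow the Integer columns that spark-avro
// produces back to the Byte/Short types declared in the Kudu schema.
// checkedCast throws on overflow instead of silently truncating.
object NarrowingCasts {
  private val toShort = udf((i: Int) => Shorts.checkedCast(i.toLong))
  private val toByte  = udf((i: Int) => SignedBytes.checkedCast(i.toLong))

  def narrow(df: DataFrame, column: String, kuduType: String): DataFrame =
    kuduType match {
      case "int16" => df.withColumn(column, toShort(df(column)))
      case "int8"  => df.withColumn(column, toByte(df(column)))
      case _       => df
    }
}
{code}

The same checked-cast approach would apply inside spark-avro itself if we go 
the pull-request route.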

Another type issue when exporting is that Decimal values are written as Strings 
instead of the Avro decimal logical type. There are a few unmerged pull 
requests to fix that (an import-side workaround is sketched after the list): 
 * [https://github.com/databricks/spark-avro/pull/276]
 * [https://github.com/databricks/spark-avro/pull/121]
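
Until one of those is merged, an import-side workaround could cast the exported 
String column back to a decimal. restoreDecimal is a hypothetical helper; the 
precision and scale would have to come from the target Kudu column:

{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.DecimalType

// Hypothetical workaround: restore a Decimal column that spark-avro exported
// as a String, using the precision/scale of the target Kudu column.
def restoreDecimal(df: DataFrame, column: String, precision: Int, scale: Int): DataFrame =
  df.withColumn(column, df(column).cast(DecimalType(precision, scale)))
{code}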

Additionally, Timestamp values are written as longs instead of Timestamp 
logical types (timestamp-micros). This is a data corruption issue because the 
long [value that is 
output|https://github.com/databricks/spark-avro/blob/0764d699015975acf87dc5210cca8a43db84196a/src/main/scala/com/databricks/spark/avro/AvroOutputWriter.scala#L103]
 is in milliseconds (Timestamp.getTime()), but the expected long value for a 
Kudu Timestamp column is in microseconds.
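
To illustrate the off-by-1000 problem (the timestamp below is just a 
placeholder, and the conversion ignores sub-millisecond precision):

{code:scala}
import java.sql.Timestamp

// spark-avro writes Timestamp.getTime(), i.e. milliseconds since the epoch,
// but a Kudu UNIXTIME_MICROS column stores microseconds since the epoch.
val ts = Timestamp.valueOf("2018-06-01 12:00:00")
val exportedMillis = ts.getTime          // what spark-avro currently writes
val kuduMicros     = ts.getTime * 1000L  // what Kudu expects (ignoring sub-ms precision)
// Re-importing exportedMillis unchanged interprets the value as a time
// roughly 1000x closer to the epoch than the original.
{code}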

Given all these issues, ImportExportFiles needs a lot more test coverage before 
we suggest its use. Currently it only tests importing Strings from a CSV and 
does not test Avro or Parquet support.
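
As a starting point, a round-trip test along these lines would catch all three 
problems (the Avro path and column names are placeholders, and it uses the 
databricks spark-avro data source directly rather than ImportExportFiles):

{code:scala}
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

// Sketch of a round-trip test over the problematic types; today the final
// assertion fails because byte/short come back as int, the decimal comes
// back as a string, and the timestamp long is in milliseconds.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val original = Seq(
  (1.toByte, 2.toShort, BigDecimal("12.34"), Timestamp.valueOf("2018-06-01 12:00:00"))
).toDF("byte_col", "short_col", "decimal_col", "ts_col")

original.write.format("com.databricks.spark.avro").save("/tmp/roundtrip.avro")
val reread = spark.read.format("com.databricks.spark.avro").load("/tmp/roundtrip.avro")

assert(reread.schema == original.schema)
{code}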


