[ https://issues.apache.org/jira/browse/PARQUET-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155322#comment-17155322 ]
Gabor Szadovszky commented on PARQUET-1883:
-------------------------------------------

[~sha...@uber.com], [~satishkotha],

INT96 IS already deprecated. See PARQUET-323 and PARQUET-1870 for details. Hive has also implemented support for INT64 timestamps (see HIVE-21215), but unfortunately only in 4.0. (Impala has also already moved to INT64 timestamps: IMPALA-5049.)

I would also like to mention that parquet-avro has never supported INT96 timestamps, so this is not a regression.

> int96 support in parquet-avro
> -----------------------------
>
>                 Key: PARQUET-1883
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1883
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-avro
>    Affects Versions: 1.10.1
>            Reporter: satish
>            Priority: Major
>
> Hi
> It looks like 'timestamp' is being converted to 'int64' primitive type in
> parquet-avro. This is incompatible with hive2. Hive throws below error
> {code:java}
> Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException:
> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be
> cast to org.apache.hadoop.hive.serde2.io.TimestampWritable (state=,code=0)
> {code}
> What does it take to write timestamp field as 'int96'?
> Hive seems to write timestamp field as int96. See example below
> {code:java}
> $ hadoop jar parquet-tools-1.9.0.jar meta hdfs://timestamp_test/000000_0
> creator: parquet-mr version 1.10.6 (build
> 098c6199a821edd3d6af56b962fd0f1558af849b)
> file schema: hive_schema
> --------------------------------------------------------------------------------
> ts: OPTIONAL INT96 R:0 D:1
> row group 1: RC:4 TS:88 OFFSET:4
> --------------------------------------------------------------------------------
> ts: INT96 UNCOMPRESSED DO:0 FPO:4 SZ:88/88/1.00 VC:4
> ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
> {code}
> Writing a spark dataframe into parquet format (without using avro) is also
> using int96.
> {code:java}
> scala> testDS.printSchema()
> root
> |-- ts: timestamp (nullable = true)
> scala> testDS.write.mode(Overwrite).save("/tmp/x");
> $ parquet-tools meta
> /tmp/x/part-00000-99720ebd-0aea-45ac-9b8c-0eb7ad6f4e3c-c000.gz.parquet
> file:
> file:/tmp/x/part-00000-99720ebd-0aea-45ac-9b8c-0eb7ad6f4e3c-c000.gz.parquet
> creator: parquet-mr version 1.10.1 (build
> a89df8f9932b6ef6633d06069e50c9b7970bebd1)
> extra: org.apache.spark.sql.parquet.row.metadata =
> {"type":"struct","fields":[{"name":"ts","type":"timestamp","nullable":true,"metadata":{}}]}
>
> file schema: spark_schema
> --------------------------------------------------------------------------------
> ts: OPTIONAL INT96 R:0 D:1
> row group 1: RC:4 TS:93 OFFSET:4
> --------------------------------------------------------------------------------
> ts: INT96 GZIP DO:0 FPO:4 SZ:130/93/0.72 VC:4
> ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED ST:[no stats for this column]
> {code}
> I saw some explanation for deprecating int96 [support
> here|https://issues.apache.org/jira/browse/PARQUET-1870?focusedCommentId=17127963&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17127963]
> from [~gszadovszky]. But given hive and serialization in other parquet
> modules (non-avro) support int96, I'm trying to understand the reasoning for
> not implementing it in parquet-avro.
> A bit more context: we are trying to migrate some of our data to [hudi
> format|https://hudi.apache.org/]. Hudi adds a lot of efficiency for our use
> cases. But, when we write data using hudi, hudi uses parquet-avro and
> timestamp is being converted to int64. As mentioned earlier, this breaks
> compatibility with hive. A lot of columns in our tables have 'timestamp' as
> type in hive DDL. It is almost impossible to change DDL to long as there are
> large number of tables and columns.
> We are happy to contribute if there is a clear path forward to support int96
> in parquet-avro. Please also let me know if you are aware of a workaround in
> hive that can read int64 correctly as timestamp.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
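
Editorial illustration of the INT96 versus INT64 difference discussed in this thread. This is a minimal sketch, not code from parquet-mr, Hive, or Spark. It assumes the commonly used INT96 timestamp convention (12 bytes, little-endian: an 8-byte nanoseconds-of-day value followed by a 4-byte Julian day number) and an INT64 value holding microseconds since the Unix epoch. The function names are hypothetical, and time zone adjustments that individual engines may apply are deliberately ignored.

```python
import struct

# Julian day number of 1970-01-01, the Unix epoch.
JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588
MICROS_PER_DAY = 86_400 * 1_000_000


def int96_to_epoch_micros(raw: bytes) -> int:
    """Decode a 12-byte INT96 timestamp into microseconds since the epoch.

    Assumed layout: little-endian int64 nanoseconds within the day,
    then little-endian int32 Julian day number.
    """
    if len(raw) != 12:
        raise ValueError("an INT96 value must be exactly 12 bytes")
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    # Sub-microsecond precision is dropped by the integer division.
    return days_since_epoch * MICROS_PER_DAY + nanos_of_day // 1000


def epoch_micros_to_int96(micros: int) -> bytes:
    """Encode microseconds since the epoch as a 12-byte INT96 timestamp."""
    # divmod floors, so pre-1970 values still get a non-negative time of day.
    days_since_epoch, micros_of_day = divmod(micros, MICROS_PER_DAY)
    return struct.pack(
        "<qi",
        micros_of_day * 1000,
        days_since_epoch + JULIAN_DAY_OF_UNIX_EPOCH,
    )
```

Under these assumptions, `int96_to_epoch_micros(epoch_micros_to_int96(x))` returns `x` for any microsecond value whose Julian day fits in 32 bits. The reverse round trip is lossy when the INT96 value carries sub-microsecond nanoseconds.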