[ https://issues.apache.org/jira/browse/PARQUET-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155322#comment-17155322 ]

Gabor Szadovszky commented on PARQUET-1883:
-------------------------------------------

[~sha...@uber.com], [~satishkotha],

INT96 _is_ already deprecated. See PARQUET-323 and PARQUET-1870 for details. Hive 
has also implemented support for INT64 timestamps (see HIVE-21215), 
unfortunately only in 4.0. (Impala has already moved to INT64 timestamps as 
well: IMPALA-5049.)
I would also like to mention that parquet-avro has never supported INT96 
timestamps, so this is not a regression. 
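
For reference, a minimal sketch of the INT64 path with parquet-avro: declaring 
the Avro field with a timestamp logical type makes parquet-avro store it as the 
INT64 physical type. (This is only an untested sketch; the class, record and 
field names and the output path are illustrative.)

{code:java}
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class Int64TimestampExample {
  public static void main(String[] args) throws Exception {
    // timestamp-micros logical type on an Avro long; parquet-avro writes this
    // field as the INT64 physical type (annotated as a timestamp).
    Schema tsType = LogicalTypes.timestampMicros()
        .addToSchema(Schema.create(Schema.Type.LONG));
    Schema schema = SchemaBuilder.record("Event").fields()
        .name("ts").type(tsType).noDefault()
        .endRecord();

    try (ParquetWriter<GenericRecord> writer =
             AvroParquetWriter.<GenericRecord>builder(new Path("/tmp/event.parquet"))
                 .withSchema(schema)
                 .build()) {
      GenericRecord record = new GenericData.Record(schema);
      record.put("ts", System.currentTimeMillis() * 1000L); // microseconds since epoch
      writer.write(record);
    }
  }
}
{code}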

> int96 support in parquet-avro
> -----------------------------
>
>                 Key: PARQUET-1883
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1883
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-avro
>    Affects Versions: 1.10.1
>            Reporter: satish
>            Priority: Major
>
> Hi
> It looks like 'timestamp' is being converted to the 'int64' primitive type in 
> parquet-avro. This is incompatible with hive2. Hive throws the error below: 
> {code:java}
> Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be 
> cast to org.apache.hadoop.hive.serde2.io.TimestampWritable (state=,code=0)
> {code}
> What does it take to write the timestamp field as 'int96'? 
> Hive seems to write the timestamp field as int96. See the example below:
> {code:java}
> $ hadoop jar parquet-tools-1.9.0.jar meta hdfs://timestamp_test/000000_0
> creator:     parquet-mr version 1.10.6 (build 098c6199a821edd3d6af56b962fd0f1558af849b)
> file schema: hive_schema
> --------------------------------------------------------------------------------
> ts:          OPTIONAL INT96 R:0 D:1
> row group 1: RC:4 TS:88 OFFSET:4
> --------------------------------------------------------------------------------
> ts:           INT96 UNCOMPRESSED DO:0 FPO:4 SZ:88/88/1.00 VC:4 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
> {code}
> Writing a spark dataframe into parquet format (without using avro) also 
> uses int96.
> {code:java}
> scala> testDS.printSchema()
> root
>  |-- ts: timestamp (nullable = true)
> scala> testDS.write.mode(Overwrite).save("/tmp/x");
> $ parquet-tools meta /tmp/x/part-00000-99720ebd-0aea-45ac-9b8c-0eb7ad6f4e3c-c000.gz.parquet 
> file:        file:/tmp/x/part-00000-99720ebd-0aea-45ac-9b8c-0eb7ad6f4e3c-c000.gz.parquet 
> creator:     parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1) 
> extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"ts","type":"timestamp","nullable":true,"metadata":{}}]}
> file schema: spark_schema 
> --------------------------------------------------------------------------------
> ts:          OPTIONAL INT96 R:0 D:1
> row group 1: RC:4 TS:93 OFFSET:4 
> --------------------------------------------------------------------------------
> ts:           INT96 GZIP DO:0 FPO:4 SZ:130/93/0.72 VC:4 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED ST:[no stats for this column]
> {code}
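> As an aside, I think the physical type Spark uses for Parquet timestamps is 
> configurable on Spark 2.3+ via spark.sql.parquet.outputTimestampType, with 
> INT96 as the default (which is why the file above comes out as INT96). A 
> rough sketch using the Java API (the app name is only illustrative):
> {code:java}
> import org.apache.spark.sql.SparkSession;
> 
> // Assumes Spark 2.3 or later running locally; "ts-demo" is a made-up app name.
> SparkSession spark = SparkSession.builder()
>     .appName("ts-demo")
>     .master("local[*]")
>     .getOrCreate();
> 
> // INT96 is the default output type for timestamp columns; switching to
> // TIMESTAMP_MICROS would make Spark write INT64 timestamps instead.
> spark.conf().set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS");
> {code}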
> I saw some explanation for deprecating int96 
> [support here|https://issues.apache.org/jira/browse/PARQUET-1870?focusedCommentId=17127963&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17127963]
> from [~gszadovszky]. But given that hive and serialization in other parquet 
> modules (non-avro) support int96, I'm trying to understand the reasoning for 
> not implementing it in parquet-avro.
> A bit more context: we are trying to migrate some of our data to the [hudi 
> format|https://hudi.apache.org/]. Hudi adds a lot of efficiency for our use 
> cases. But when we write data using hudi, hudi uses parquet-avro and 
> timestamps are converted to int64. As mentioned earlier, this breaks 
> compatibility with hive. A lot of columns in our tables have 'timestamp' as 
> the type in the hive DDL. It is almost impossible to change the DDL to long 
> because there are a large number of tables and columns. 
> We are happy to contribute if there is a clear path forward to support int96 
> in parquet-avro. Please also let me know if you are aware of a workaround in 
> hive that can read int64 correctly as timestamp.


