[ 
https://issues.apache.org/jira/browse/HIVE-21290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karen Coppage updated HIVE-21290:
---------------------------------
    Attachment: HIVE-21290.4.patch
        Status: Patch Available  (was: Open)

> Restore historical way of handling timestamps in Parquet while keeping the 
> new semantics at the same time
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-21290
>                 URL: https://issues.apache.org/jira/browse/HIVE-21290
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Zoltan Ivanfi
>            Assignee: Karen Coppage
>            Priority: Major
>         Attachments: HIVE-21290.1.patch, HIVE-21290.2.patch, 
> HIVE-21290.2.patch, HIVE-21290.3.patch, HIVE-21290.4.patch, HIVE-21290.4.patch
>
>
> This sub-task is for implementing the Parquet-specific parts of the following 
> plan:
> h1. Problem
> Historically, the semantics of the TIMESTAMP type in Hive depended on the 
> file format. Timestamps in Avro, Parquet and RCFiles with a binary SerDe had 
> _Instant_ semantics, while timestamps in ORC, textfiles and RCFiles with a 
> text SerDe had _LocalDateTime_ semantics.
> The Hive community wanted to get rid of this inconsistency and have 
> _LocalDateTime_ semantics in Avro, Parquet and RCFiles with a binary SerDe as 
> well. *Hive 3.1 turned off normalization to UTC* to achieve this. While this 
> leads to the desired new semantics, it also leads to incorrect results when 
> new Hive versions read timestamps written by old Hive versions or when old 
> Hive versions or any other component not aware of this change (including 
> legacy Impala and Spark versions) read timestamps written by new Hive 
> versions.
> h1. Solution
> To work around this issue, Hive *should restore the practice of normalizing 
> to UTC* when writing timestamps to Avro, Parquet and RCFiles with a binary 
> SerDe. In itself, this would restore the historical _Instant_ semantics, 
> which is undesirable. In order to achieve the desired _LocalDateTime_ 
> semantics in spite of normalizing to UTC, newer Hive versions should record 
> the session-local local time zone in the file metadata fields serving 
> arbitrary key-value storage purposes.
> When reading back files with this time zone metadata, newer Hive versions (or 
> any other new component aware of this extra metadata) can achieve 
> _LocalDateTime_ semantics by *converting from UTC to the saved time zone 
> (instead of to the local time zone)*. Legacy components that are unaware of 
> the new metadata can read the files without any problem and the timestamps 
> will show the historical Instant behaviour to them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to