Re: [DISCUSS] Consistent Timestamps across Hadoop

Zoltan Ivanfi Tue, 11 Dec 2018 10:41:01 -0800

Hi Owen,

Thanks, I think your email contains a great summary of the problems tackled
in the proposal. I would like to highlight two particular topics from the
discussion that we are having in the comments (details can be read in the
document
<https://docs.google.com/document/d/1gNRww9mZJcHvUDCXklzjFEQGpefsuR_akCDfWsdE35Q/edit>
):


   - It seems that we have agreement on the desired semantics of the more
   explicit SQL types. In particular, I was glad to hear that the TIMESTAMP
   WITH LOCAL TIME ZONE type that is already implemented in Hive is supposed
   to have Instant semantics. (In fact, it already does have Instant
   semantics, but it also has additional time zone information that is unused
   at this moment, and I wasn't sure whether that will be utilized, changing
   the semantics, or whether the semantics will remain and the superfluous time
   zone data will be removed.)
   - We are still discussing what is the best course of action to take with
   the plain TIMESTAMP type, which behaved differently in different file
   formats in Hive 2 and was made to behave the same way in a
   compatibility-breaking manner in Hive 3. My take on this type is that it
   has already been used to write huge amounts of data and for this reason we
   should restore its Avro- and Parquet-specific inconsistent behaviour
   (possibly controlled by a feature flag), so that legacy data remains
   readable and legacy workarounds remain functional. The new, more explicit
   SQL types will provide a clear migration path away from the messy TIMESTAMP
   type.
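
To make the difference between these semantics concrete, here is a minimal
Python sketch. It is not Hive code: the two fixed-offset zones, the sample
value and the function names are assumptions made up for illustration. It
only shows how one written value would appear to a reader under Instant,
LocalDateTime and OffsetDateTime semantics.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical fixed-offset zones standing in for real writer/reader zones.
WRITER_TZ = timezone(timedelta(hours=1))
READER_TZ = timezone(timedelta(hours=-8))

# The wall-clock value the writer sees when it stores the row (made up).
written = datetime(2018, 12, 11, 10, 41, 1, tzinfo=WRITER_TZ)


def read_as_instant(value, reader_tz):
    """Instant: the same point in time, rendered in the reader's zone."""
    return value.astimezone(reader_tz).replace(tzinfo=None)


def read_as_local_date_time(value):
    """LocalDateTime: the writer's wall-clock fields, with no zone attached."""
    return value.replace(tzinfo=None)


def read_as_offset_date_time(value):
    """OffsetDateTime: the writer's wall-clock fields plus the writer's offset."""
    return value


instant_view = read_as_instant(written, READER_TZ)
local_view = read_as_local_date_time(written)
offset_view = read_as_offset_date_time(written)

# How far apart the Instant and LocalDateTime readings of the same row are.
shift = local_view - instant_view

print("Instant:       ", instant_view)
print("LocalDateTime: ", local_view)
print("OffsetDateTime:", offset_view.isoformat())
print("Shift between Instant and LocalDateTime readings:", shift)
```

With these assumed zones the two readings of the same row differ by several
hours, which is the kind of discrepancy a reader could see if data written
under one interpretation of TIMESTAMP is later read under the other.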

All in all, I feel that we are converging towards a common goal and I have
high hopes that the more explicit timestamp types will have much better
interoperability and consistency across different Hadoop SQL engines.

Thanks,

Zoltan


On Mon, Dec 10, 2018 at 7:54 PM Owen O'Malley <owen.omal...@gmail.com>
wrote:

> Thank you for starting this discussion. Clearly the Hive semantics on
> timestamp are very messed up, but they have been moving in the right direction of
> becoming more SQL standard compliant. I'm pulling this discussion back to
> the list rather than the personal GoogleDoc, which isn't very
> collaborative.
>
> I like your breakdown of the semantics:
>
>    - Instant - point in time that will appear different depending on the
>    reader time zone
>    - LocalDateTime - consistent hour and minute regardless of the reader
>    time zone.
>    - OffsetDateTime - consistent hour and minute with the offset of the
>    writer time zone
>
> The SQL standard has:
>
>    - Timestamp & Timestamp without time zone = LocalDateTime
>    - Timestamp with time zone = OffsetDateTime
>
> Hive 2 had very confused semantics for timestamp:
>
>    - When storage was ORC, text, or RCFile with a text serde it was
>    LocalDateTime
>    - When storage was Avro, Parquet, or RCFile with a binary serde it was
>    Instant
>
> Hive 3.1 has moved toward the SQL standard extended with Oracle's
> timestamp with local time zone:
>
>    - Timestamp = LocalDateTime
>    - Timestamp with local time zone = Instant
>
> This leaves us with a few problems:
>
>    - The Hive bindings to Parquet and Avro don't handle timestamps
>    correctly.
>    - ORC doesn't support timestamps with local time zone. I started working
>    on it in ORC-189.
>    - We don't have timestamp with time zone support.
>
> .. Owen
>
> On Thu, Dec 6, 2018 at 7:55 AM Marta Kuczora <kuczo...@cloudera.com.invalid>
> wrote:
>
>> Hi Hive Community,
>>
>> I would like to share the following document on our "Consistent Timestamp
>> types in Hadoop" plans for review.
>>
>> https://docs.google.com/document/d/1gNRww9mZJcHvUDCXklzjFEQGpefsuR_akCDfWsdE35Q/edit
>>
>> With this plan we would like to get an agreement on consistent timestamp
>> behavior on Hive, Spark and Impala and in order to achieve this, we are
>> sharing this document with all three communities.
>>
>> Please review and comment; any feedback is much appreciated!
>>
>> Regards,
>> Marta
>>
>
