Re: [DISCUSS] Consistent Timestamps across Hadoop

Zoltan Ivanfi Fri, 25 Jan 2019 06:26:28 -0800

Dear Hive Developers,

I would like to briefly summarize an offline discussion a few of us had
about the timestamp harmonization proposal <https://goo.gl/VV88c5>. Please
let me know if you agree with the outcome or if you have any concerns or
questions.


[Meeting notes]

Participants: Owen O'Malley and Jesús Camacho Rodríguez from Hive, Anna
Szonyi and Zoltan Ivanfi representing the original proposal.

Owen and Jesús reasoned that the TIMESTAMP type must have the same
semantics in all file formats in Hive.

Anna and Zoltan reasoned that different Hive versions (and other components
as well) must be able to correctly read timestamps written by each other,
but there is a historical practice of normalizing to UTC in selected file
formats and eliminating that practice would be a breaking change.

Owen and Jesús suggested a solution that would change the semantics without
eliminating the practice of normalizing to UTC. This makes the solution
completely backwards- and forwards-compatible. The solution involves
recording the session-local time zone in the file metadata fields
that allow arbitrary key-value storage. When reading back files with this
time zone metadata, newer Hive versions (or any other new component aware
of this extra metadata) can achieve LocalDateTime semantics by converting
from UTC to the saved time zone (instead of to the local time zone). Legacy
components that are unaware of the new metadata can read the files without
any problem and the timestamps will show the historical Instant behaviour
to them.

[End of meeting notes]
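
To make the read and write paths in the notes concrete, here is a minimal
sketch in Python. It is an illustration only, not Hive code: the metadata key
name and both helper functions are hypothetical, and the writer's zone is
stored as a fixed UTC offset in minutes (a simplification of a real time zone
ID) so that the example runs with the standard library alone.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical metadata key; the thread does not name the real one.
WRITER_TZ_KEY = "writer.utc.offset.minutes"


def write_timestamp(wall_clock, session_offset_min):
    """Normalize a session-local wall-clock time to UTC and record the zone."""
    session_tz = timezone(timedelta(minutes=session_offset_min))
    utc_value = wall_clock.replace(tzinfo=session_tz).astimezone(timezone.utc)
    # The UTC value is stored as before; the zone goes into key-value metadata.
    return utc_value, {WRITER_TZ_KEY: str(session_offset_min)}


def read_timestamp(utc_value, metadata, reader_offset_min):
    """Return the wall-clock time shown to a reader at reader_offset_min."""
    # Metadata-aware readers convert to the saved zone (LocalDateTime
    # semantics). Without the key, fall back to the reader's own zone
    # (the historical Instant behaviour).
    offset_min = int(metadata.get(WRITER_TZ_KEY, reader_offset_min))
    display_tz = timezone(timedelta(minutes=offset_min))
    return utc_value.astimezone(display_tz).replace(tzinfo=None)


# Writer session at UTC-8 stores 09:00 local time.
stored, meta = write_timestamp(datetime(2019, 1, 25, 9, 0), -480)
# Metadata-aware reader at UTC+1 sees the writer's wall-clock time.
new_reader = read_timestamp(stored, meta, 60)
# Legacy reader at UTC+1 ignores the metadata, modelled here as an empty dict.
legacy_reader = read_timestamp(stored, {}, 60)
```

In this sketch the stored value is the same UTC-normalized timestamp in both
cases, which is why legacy readers are unaffected; only the conversion applied
at read time differs.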

Since this solution achieves both goals, I have updated the proposal
<https://goo.gl/VV88c5> accordingly. Hopefully with this change we have
resolved any remaining disagreements. I will wait a little for feedback and
if everybody is fine with the updated proposal, I will move forward and ask
the affected components to incorporate it into their plans.

Thanks,

Zoltan


On Fri, Jan 11, 2019 at 4:53 PM Zoltan Ivanfi <z...@cloudera.com> wrote:

> Hi,
>
> My past experience with fixing timestamps was that no fix was going to
> work if even one of the major SQL engines of the Hadoop stack disagreed
> with the approach and was not willing to implement it. For this
> reason, we can't just add a specification to Hive, we need an
> agreement from said communities.
>
> Indeed, there are more projects than these three that deal with
> timestamps, but these are the ones that deal with them on the SQL
> level and this proposal is about the semantics of the SQL timestamp
> types. I have planned to write a small summary from the file format
> perspective as well and send it to affected groups. I had Avro,
> Parquet, ORC, Arrow and Kudu in mind. Based on your suggestion, I will
> add Iceberg to that list.
>
> While I agree that a Google Doc would not be adequate for the final
> version of the plan, I think it is a better tool for doing the review
> and the design discussion than the individual mailing lists, for the
> following reasons:
>
> - It allows separate discussions around separate parts of the proposal
> and these discussions can happen in context (they are tied to specific
> parts of the document).
> - It allows adding suggestions to the proposal, in-context and
> immediately visible to everyone.
> - Most importantly, it is equally accessible to the Hive, Spark and
> Impala communities, therefore allows a real cross-component
> discussion.
>
> Br,
>
> Zoltan
>
> On Fri, Jan 11, 2019 at 12:10 AM Owen O'Malley <owen.omal...@gmail.com>
> wrote:
> >
> > ---------- Forwarded message ---------
> > From: Owen O'Malley <owen.omal...@gmail.com>
> > Date: Thu, Jan 10, 2019 at 3:09 PM
> > Subject: Re: [DISCUSS] Consistent Timestamps across Hadoop
> > To: Zoltan Ivanfi <z...@cloudera.com>
> >
> >
> > No, that isn't right.
> >
> > The discussion for Apache projects needs to happen in the open and not
> the private google doc that isn't archived at Apache.
> >
> > Three is a severe underestimate of the projects that care about
> timestamps. The Apache projects that care about parts of that document are:
> >
> > Avro
> > Hive
> > Iceberg
> > Impala
> > ORC
> > Parquet
> > Spark
> >
> > That said, Hive needs to make its decisions about what the semantics of
> Hive should be. Impala, Iceberg, and Spark may make separate choices. Avro,
> ORC, and Parquet need their bindings for each engine to agree with the
> semantics for that engine.
> >
> > My point is that Hive should have a page that describes its current
> semantics with respect to timestamps, but those discussions need to happen
> on the Hive list and result in documents in the Hive wiki. Hive can't tell
> other projects what to do, but by clarifying their semantics it makes
> inter-operation better. In my opinion, Spark SQL should move to local date
> time semantics for timestamp. But they should want to do that to make
> themselves more compatible with the SQL standard. Clearly Hive can't force
> them to change their semantics.
> >
> > .. Owen
> >
> >
> > On Thu, Jan 10, 2019 at 8:04 AM Zoltan Ivanfi <z...@cloudera.com> wrote:
> >>
> >> Hi,
> >>
> >> Once we are through the discussion phase and hopefully have reached
> >> agreement, I support moving the document to a more permanent place.
> >> I'm unsure about what the best place would be though. Since it needs
> >> the agreement of three communities, it does not strictly belong to any
> >> single one of them (although the Hive Metastore is certainly a central
> >> component in this ecosystem, so we could put it in the Hive
> >> documentation based on that). I am also uncertain about whether we
> >> should use a wiki page, because it is too easily editable and after
> >> reaching an agreement it should not be modified without asking or
> >> notifying the same communities again.
> >>
> >> Is there a documentation space where modifications are subject to
> >> review? Or can we protect a wiki page to achieve that? I'm open to
> >> your suggestions.
> >>
> >> Thanks,
> >>
> >> Zoltan
> >>
> >>
> >> On Wed, Jan 9, 2019 at 9:32 PM Owen O'Malley <owen.omal...@gmail.com>
> wrote:
> >> >
> >> > From an Apache point of view, we really need to move this document
> and the discussion to the Apache wiki and mailing lists.
> >> >
> >> > Did you want to take a first pass at moving it to Hive's wiki?
> >> >
> >> > .. Owen
> >> >
> >> > On Tue, Dec 11, 2018 at 10:40 AM Zoltan Ivanfi <z...@cloudera.com>
> wrote:
> >> >>
> >> >> Hi Owen,
> >> >>
> >> >> Thanks, I think your email contains a great summary of the problems
> tackled in the proposal. I would like to highlight two particular topics from
> the discussion that we are having in the comments (details can be read in
> the document):
> >> >>
> >> >> It seems that we have agreement on the desired semantics of the more
> explicit SQL types. In particular, I was glad to hear that the TIMESTAMP
> WITH LOCAL TIME ZONE type that is already implemented in Hive is supposed
> to have Instant semantics. (In fact, it already does have Instant
> semantics, but it also has additional time zone information that is unused
> at this moment, and I wasn't sure whether that will be utilized, changing
> the semantics, or whether the semantics will remain and the superfluous time
> zone data will be removed.)
> >> >> We are still discussing what is the best course of action to take
> with the plain TIMESTAMP type, which behaved differently in different file
> formats in Hive 2 and was made to behave the same way in a
> compatibility-breaking manner in Hive 3. My take on this type is that it
> has already been used to write huge amounts of data and for this reason we
> should restore its Avro- and Parquet-specific inconsistent behaviour
> (possibly controlled by a feature flag), so that legacy data remains
> readable and legacy workarounds remain functional. The new, more explicit
> SQL types will provide a clear migration path away from the messy TIMESTAMP
> type.
> >> >>
> >> >> All in all, I feel that we are converging towards a common goal and
> I have high hopes that the more explicit timestamp types will have much
> better interoperability and consistency across different Hadoop SQL engines.
> >> >>
> >> >> Thanks,
> >> >>
> >> >> Zoltan
> >> >>
> >> >>
> >> >> On Mon, Dec 10, 2018 at 7:54 PM Owen O'Malley <
> owen.omal...@gmail.com> wrote:
> >> >>>
> >> >>> Thank you for starting this discussion. Clearly the Hive semantics
> on timestamp are very messed up, but have been moving in the right direction
> of becoming more SQL standard compliant. I'm pulling this discussion back
> to the list rather than the personal GoogleDoc, which isn't very
> collaborative.
> >> >>>
> >> >>> I like your breakdown of the semantics:
> >> >>>
> >> >>> Instant - point in time that will appear different depending on the
> reader time zone
> >> >>> LocalDateTime - consistent hour and minute regardless of the reader
> time zone.
> >> >>> OffsetDateTime - consistent hour and minute with the offset of the
> writer time zone
> >> >>>
> >> >>> The SQL standard has:
> >> >>>
> >> >>> Timestamp & Timestamp without time zone = LocalDateTime
> >> >>> Timestamp with time zone = OffsetDateTime
> >> >>>
> >> >>> Hive 2 had very confused semantics for timestamp:
> >> >>>
> >> >>> When storage was ORC, text, or RCFile with a text serde it was
> LocalDateTime
> >> >>> When storage was Avro, Parquet, or RCFile with a binary serde it
> was Instant
> >> >>>
> >> >>> Hive 3.1 has moved toward the SQL standard extended with Oracle's
> timestamp with local time zone:
> >> >>>
> >> >>> Timestamp = LocalDateTime
> >> >>> Timestamp with local time zone = Instant
> >> >>>
> >> >>> This leaves us with a few problems:
> >> >>>
> >> >>> The Hive bindings to Parquet and Avro don't handle timestamps
> correctly.
> >> >>> ORC doesn't support timestamps with local time zone. I start
> working on it in ORC-189.
> >> >>> We don't have timestamp with time zone support.
> >> >>>
> >> >>> .. Owen
> >> >>>
> >> >>> On Thu, Dec 6, 2018 at 7:55 AM Marta Kuczora
> <kuczo...@cloudera.com.invalid> wrote:
> >> >>>>
> >> >>>> Hi Hive Community,
> >> >>>>
> >> >>>> I would like to share the following document on our "Consistent
> Timestamp
> >> >>>> types in Hadoop" plans for review.
> >> >>>>
> https://docs.google.com/document/d/1gNRww9mZJcHvUDCXklzjFEQGpefsuR_akCDfWsdE35Q/edit
> >> >>>>
> >> >>>> With this plan we would like to get an agreement on consistent
> timestamp
> >> >>>> behavior on Hive, Spark and Impala and in order to achieve this,
> we are
> >> >>>> sharing this document with all three communities.
> >> >>>>
> >> >>>> Please review and comment, any feedback is much appreciated!
> >> >>>>
> >> >>>> Regards,
> >> >>>> Marta
>
