Thank you everyone for participating so far; really important and
useful discussion.

I think of this discussion as a set of test cases over behavior:

Parameterization:
* Timestamp(ms, None)
* Timestamp(ms, "00:00")
* Timestamp(ms, "01:00")

Cases:
* its string representation equals ...
* adding a duration equals ...
* adding an interval equals ...
* subtracting a Timestamp(ms, None) equals ...
* subtracting a Timestamp(ms, "01:00") equals ...
* subtracting a Date32 equals ...
* subtracting a Time32(ms) equals ...
* extracting the day equals ...
* extracting the timezone equals ...
* casting to Timestamp(ms, None) equals ...
* casting to Timestamp(ms, "01:00") equals ...
* writing to Parquet v2 equals ... (physical value and logical type)

In all cases, the result may be either valid or invalid. If valid, we
would need a datatype and an actual value.
I was hoping to be able to answer each of the above by the end of this
discussion.
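
To make the matrix concrete, here is a rough pyarrow sketch of the
parameterization (the value 136_800_000 ms is the 1970-01-02 14:00:00
example used later in the thread, and I'm mapping my "00:00"/"01:00"
notation to "+00:00"/"+01:00" offset strings; whether each operation is
even valid is exactly what we need to decide, so treat this as an outline
of the test harness, not as expected behaviour):

    import pyarrow as pa

    value = 136_800_000  # 1970-01-02 14:00:00 as ms since the epoch
    naive = pa.array([value], type=pa.timestamp("ms"))               # Timestamp(ms, None)
    utc   = pa.array([value], type=pa.timestamp("ms", tz="+00:00"))  # Timestamp(ms, "00:00")
    plus1 = pa.array([value], type=pa.timestamp("ms", tz="+01:00"))  # Timestamp(ms, "01:00")

    # A few of the cases; each comment is a blank to be filled in by this discussion.
    print(naive)                                        # string representation equals ...
    print(naive.cast(pa.timestamp("ms", tz="+01:00")))  # cast to Timestamp(ms, "01:00") equals ...
    print(plus1.cast(pa.timestamp("ms")))               # cast to Timestamp(ms, None) equals ...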

I've suggested adding these to the Google Doc.

Best,
Jorge

On Fri, Jun 18, 2021 at 12:15 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> I've posted the examples above in
> https://docs.google.com/document/d/1QDwX4ypfNvESc2ywcT1ygaf2Y1R8SmkpifMV7gpJdBI/edit?usp=sharing
> because I think it would be better to collaborate there instead of in a linear
> e-mail history, and then bring the consensus back to the list.
>
> On Thu, Jun 17, 2021 at 2:56 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > I feel like we might still be talking past each other here or at least I
> > don't understand the two sides of this.  I'll try to expand Weston's
> > example because I think it provides the best clarification.
> >
> > (1970, 1, 2, 14, 0) is stored as 0x0000000008276700 (136800000, assuming
> > ms) for a timestamp column without timezone (always). This represents an
> > offset from the Unix epoch. This interpretation should not change based on
> > the local system timezone. Extracting the hour field always yields 14
> > (extraction is done relative to UTC).
> >
> > The alternative here seems to be that we can encode (1970, 1, 2, 14, 0) in
> > multiple different ways depending on what the current local system time
> > zone is. As a note, I think ORC and Spark do this, and it leads to
> > confusion/misinterpretation when trying to transfer data.
> >
> > If we then convert this column to a timestamp with timezone "UTC",
> > extracting the hour field still yields 14. If the column is instead
> > converted to a timestamp with timezone PST, extracting the hour would
> > yield 6 (assuming PST = GMT-8). Through all of these changes the data
> > bits do not change.
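> >
> > In plain Python terms (just a sketch of the intended semantics, not of
> > any particular implementation):
> >
> >     from datetime import datetime, timedelta, timezone
> >
> >     ms = 136_800_000  # the stored 64-bit value: ms since the Unix epoch
> >     instant = datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(milliseconds=ms)
> >     pst = timezone(timedelta(hours=-8))
> >
> >     print(instant.hour)                  # 14: no tz / "UTC" -> extraction relative to UTC
> >     print(instant.astimezone(pst).hour)  # 6: "PST" -> same bits, different field extraction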
> >
> > Display is not mentioned because I think the points already made about how
> > a time is displayed are correct: applications can choose whatever they feel
> > makes sense to them (as long as they don't start automatically tacking
> > timezones onto naive timestamps). My interpretation of the specification
> > has been that "display" was a kind of shorthand for field extraction.
> >
> > Could others on the thread confirm this is the issue up for debate?  Are
> > there subtleties/operations we need to consider?
> >
> > I also agree that we should document recommended conversion practices from
> > other systems.
> >
> > -Micah
> >
> >
> >> So let's invent a third way.  I could use
> >> the first 16 bits for the year, the next 8 bits for the month, the
> >> next 8 bits for the day of month, the next 8 bits for the hour, the
> >> next 8 bits for the minute, and the remaining bits for the seconds.
> >> Using this method I would store (1970, 1, 2, 14, 0) as
> >> 0x07B201020E000000.
> >
> > As an aside, with some small variation this is what ZetaSQL uses [2].
> >
> > [1]
> > https://arrow.apache.org/docs/python/timestamps.html#pandas-arrow-spark
> > [2]
> > https://github.com/google/zetasql/blob/master/zetasql/public/civil_time.h#L62
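> >
> > For concreteness, a rough sketch of that packing (assuming the
> > 16/8/8/8/8/16-bit split for year/month/day/hour/minute/seconds quoted above):
> >
> >     year, month, day, hour, minute, second = 1970, 1, 2, 14, 0, 0
> >     packed = (year << 48) | (month << 40) | (day << 32) | (hour << 24) | (minute << 16) | second
> >     print(hex(packed))  # 0x7b201020e000000, i.e. 0x07B201020E000000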
> >
> >
> >
> > On Thu, Jun 17, 2021 at 1:58 PM Adam Hooper <a...@adamhooper.com> wrote:
> >
> >> On Thu, Jun 17, 2021 at 2:59 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >>
> >> >
> >> > The SQL standard (e.g. PostgreSQL) has two timestamp types:
> >> > with/without time zone — in some SQL implementations each slot can
> >> > have a different time zone
> >> > https://www.postgresql.org/docs/9.1/datatype-datetime.html
> >> > WITHOUT TIME ZONE: "timestamp without time zone value should be taken
> >> > or given as timezone local time"
> >> >
> >>
> >> RDBMSs conflict (universally) with ANSI.
> >>
> >> PostgreSQL TIMESTAMP WITH TIME ZONE is a 64-bit int Instant since the epoch.
> >> It has no timezone.
> >>
> >> MySQL/MariaDB/BigTable/[your fork here] TIMESTAMP is also an int Instant
> >> since the epoch. It has no timezone.
> >>
> >> TIMESTAMP *WITHOUT* TIME ZONE is indeed akin to Numpy "naive datetime" in
> >> *function*, but not in implementation:
> >>
> >>    - MySQL DATETIME
> >>    <https://dev.mysql.com/doc/internals/en/date-and-time-data-type-representation.html>
> >>    is weird: 1-bit sign, 17-bit month, 5-bit day, ...
> >>    - MSSQL
> >>    <https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15>
> >>    uses 6, 7 or 8 bytes
> >>    - PostgreSQL stores an integer, but I think its epoch is still different
> >>    <https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/include/datatype/timestamp.h>
> >>    (it used to store doubles since 2000-01-01)
> >>
> >> ... so in general, moving datetimes from these systems into 64-bit
> >> integers
> >> is nontrivial and lossy.
> >>
> >> > Spark / Databricks discusses how Spark handles this
> >> >
> >> > https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps
> >> > * WITHOUT TIME ZONE: "These timestamps are not bound to any time zone,
> >> > and are wall clock timestamps." — not UTC-normalized
> >> > * WITH TIME ZONE: "does not affect the physical point in time that the
> >> > timestamp represents, as that is fully represented by the UTC time
> >> > instant given by the other timestamp components"
> >> >
> >>
> >> I don't use Spark, but I read that page twice. First reading, I got the
> >> same thing out of it. But the second time I read it, I read the opposite!
> >>
> >> The key part is: "*Spark SQL defines the timestamp type as TIMESTAMP WITH
> >> SESSION TIME ZONE*" -- in other words, Spark doesn't have a TIMESTAMP WITH
> >> TIME ZONE *or* a TIMESTAMP WITHOUT TIME ZONE. It has just one Timestamp
> >> type: a 64-bit Instant since the epoch. (It also has a Date type.)
> >>
> >> If I'm reading correctly, this is exactly the same as PostgreSQL TIMESTAMP
> >> WITH TIME ZONE and MySQL TIMESTAMP: servers always store and transmit
> >> timestamps as bare 64-bit integers since the epoch -- without timezone.
> >>
> >>
> >> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> >> > interpreted as UTC-normalized, that would force all of these other
> >> > systems (and more) to serialize their data to be UTC-normalized
> >>
> >>
> >> Are those systems' 64-bit integers interoperable in the first place?
> >>
> >> As I understand it, there's a ton of variance out there when encoding
> >> datetimes as 64-bit integers. The Numpy+Arrow+Parquet+Postgres encoding
> >> is one of many. As I mentioned in another thread, programming languages
> >> all
> >> use structs.
> >>
> >> > pandas as discussed has non-UTC-normalized WITHOUT TIME ZONE "naive"
> >> > timestamps and UTC-normalized WITH TIME ZONE.
> >> >
> >> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> >> > interpreted as UTC-normalized, that would force all of these other
> >> > systems (and more) to serialize their data to be UTC-normalized (i.e.
> >> > calling the equivalent of pandas's tz_localize function) when they
> >> > convert to Arrow.
> >>
> >>
> >> Alternatives:
> >>
> >>    - int64
> >>    - date32+time64
> >>    - date32+time32
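> >>
> >> For instance, a date32+time64 pair keeps a wall-clock datetime losslessly
> >> (just a sketch; the struct field names here are arbitrary):
> >>
> >>     import datetime
> >>     import pyarrow as pa
> >>
> >>     # (1970, 1, 2, 14, 0) as calendar fields rather than an epoch offset
> >>     dates = pa.array([datetime.date(1970, 1, 2)], type=pa.date32())
> >>     times = pa.array([datetime.time(14, 0)], type=pa.time64("us"))
> >>     wall_clock = pa.StructArray.from_arrays([dates, times], names=["date", "time"])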
> >>
> >> > This seems very harmful to me, and will make data
> >> > from these systems not accurately representable in Arrow and unable to
> >> > be round-tripped.
> >> >
> >>
> >> Certainly nobody wants to go backwards.
> >>
> >> We need to clarify: how do we store these *common* types -- MySQL
> >> TIMESTAMP, Spark TIMESTAMP, PostgreSQL TIMESTAMP WITH TIME ZONE -- in
> >> Arrow?
> >>
> >> Secondarily, I think: how do we recommend users store *datetimes* in
> >> Arrow?
> >> (I'd expect this to be messier, since every system/language uses a
> >> different byte structure.)
> >>
> >> > Perhaps we can make a spreadsheet and look comprehensively at how many
> >> > use cases would be disenfranchised by requiring UTC normalization
> >> > always.
> >>
> >>
> >> Hear, hear!
> >>
> >> Can we also poll people to find out how they're storing Instants today?
> >>
> >> Enjoy life,
> >> Adam
> >>
> >> --
> >> Adam Hooper
> >> +1-514-882-9694
> >> http://adamhooper.com
> >>
> >
