I agree that a vote would be a good idea.  Do you want to start a
dedicated vote thread?  I can write one up too if you'd rather.

-Weston

On Mon, Jun 21, 2021 at 4:54 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> I think comments on the doc are tailing off.  I think Jorge's test cases
> still need some more careful analysis, but Weston has provided an initial
> pass.
>
> The matter not resolved in the document is whether Timestamp without a
> timezone logically represents a multi-field date and time (one that does
> not represent a specific instant) or whether it logically represents an
> instant (some measurable offset from an epoch).  Based on comments on the
> document, both the C++/Python and Java implementations (those that have
> generally been considered "reference") show evidence that the former
> representation is what was intended (some links are in the document).
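To make the two readings concrete, here is a pure-Python sketch (not Arrow itself; the stored value and zones are hypothetical) of how the same int64 payload behaves under each interpretation:

```python
from datetime import datetime, timezone, timedelta

raw_ms = 136_800_000  # hypothetical stored int64 payload, in milliseconds

# Instant interpretation: the payload is an offset from the Unix epoch;
# field extraction is defined relative to UTC, but a consumer may
# legitimately re-render the same instant in any zone.
instant = datetime.fromtimestamp(raw_ms / 1000, tz=timezone.utc)
assert instant.hour == 14                 # extracted relative to UTC
pst = timezone(timedelta(hours=-8))
assert instant.astimezone(pst).hour == 6  # same instant, PST wall clock

# Wall-clock ("multi-field date and time") interpretation: the payload
# encodes the fields themselves; no zone may ever be applied to it.
wall = datetime(1970, 1, 2, 14, 0)
assert wall.hour == 14                    # always 14, everywhere
```

The contested question is exactly which of `instant` and `wall` a timezone-less Arrow column denotes.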
>
> We can probably continue to debate what is useful, but it seems we
> ultimately need to pick one or the other and clarify the specification.
> Given how the reference implementations currently work, I think we should
> err on the side of interpreting these values as date-times.  Given the
> contention here, we will likely need to vote on this.
>
> More comments on the document or here are still useful in case we've missed
> an interpretation or there are other facts to consider.
>
> Cheers,
> Micah
>
> On Thu, Jun 17, 2021 at 9:59 PM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
> > Thank you everyone for participating so far; really important and
> > useful discussion.
> >
> > I think of this discussion as a set of test cases over behavior:
> >
> > parameterization:
> > * Timestamp(ms, None)
> > * Timestamp(ms, "00:00")
> > * Timestamp(ms, "01:00")
> >
> > Cases:
> > * its string representation equals to
> > * add a duration equals to
> > * add an interval equals to
> > * subtract a Timestamp(ms, None) equals to
> > * subtract a Timestamp(ms, "01:00") equals to
> > * subtract a Date32 equals to
> > * subtract a Time32(ms) equals to
> > * extract the day equals to
> > * extract the timezone equals to
> > * cast to Timestamp(ms, None) equals to
> > * cast to Timestamp(ms, "01:00") equals to
> > * write to parquet v2 equals to (physical value and logical type)
> >
> > In all cases, the result may either be valid or invalid. If valid, we
> > would need a datatype and an actual value.
> > I was hoping to be able to answer each of the above at the end of this
> > discussion.
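As one illustration of how a single row of that matrix might be filled in, here is a pure-Python sketch of the string-representation case under the "instant" interpretation (the stored value is hypothetical; this is not an Arrow implementation):

```python
from datetime import datetime, timezone, timedelta

ms = 0  # hypothetical stored value shared by each parameterization

# Timestamp(ms, None): no zone to consult, so render the fields as-is.
naive_repr = datetime(1970, 1, 1).isoformat(sep=" ")

# Timestamp(ms, "01:00"): the same instant, rendered in the +01:00 zone.
tz = timezone(timedelta(hours=1))
aware_repr = datetime.fromtimestamp(ms / 1000, tz).isoformat(sep=" ")

print(naive_repr)  # 1970-01-01 00:00:00
print(aware_repr)  # 1970-01-01 01:00:00+01:00
```

Under the "wall clock" interpretation the naive case would render the same, but no cast to a zoned type could be a pure metadata change.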
> >
> > I've suggested adding these in the Google doc.
> >
> > Best,
> > Jorge
> >
> > On Fri, Jun 18, 2021 at 12:15 AM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> > >
> > > I've posted the examples above in
> > > https://docs.google.com/document/d/1QDwX4ypfNvESc2ywcT1ygaf2Y1R8SmkpifMV7gpJdBI/edit?usp=sharing
> > > because I think it would be better to collaborate there instead of in a
> > > linear e-mail history and then bring the consensus back to the list.
> > >
> > > On Thu, Jun 17, 2021 at 2:56 PM Micah Kornfield <emkornfi...@gmail.com>
> > > wrote:
> > >
> > > > I feel like we might still be talking past each other here, or at least
> > > > I don't understand the two sides of this.  I'll try to expand Weston's
> > > > example because I think it provides the best clarification.
> > > >
> > > > (1970, 1, 2, 14, 0) is stored as 0x0000000008276700 (136800000,
> > > > assuming ms) for a timestamp column without timezone (always).  This
> > > > represents an offset from the unix epoch.  This interpretation should
> > > > not change based on the local system timezone.  Extracting the hour
> > > > field always yields 14 (extraction is done relative to UTC).
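As a check on the epoch-offset arithmetic (a sketch; it assumes milliseconds and the Unix epoch):

```python
from datetime import datetime, timezone

# Offset of the wall-clock fields (1970, 1, 2, 14, 0) from the Unix epoch:
# one day plus fourteen hours, in milliseconds.
offset_ms = (24 + 14) * 60 * 60 * 1000
assert offset_ms == 136_800_000 == 0x08276700

# Decoding that payload relative to UTC recovers the same fields.
decoded = datetime.fromtimestamp(offset_ms / 1000, tz=timezone.utc)
assert (decoded.year, decoded.month, decoded.day, decoded.hour) == (1970, 1, 2, 14)
```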
> > > >
> > > > The alternative here seems to be that we can encode (1970, 1, 2, 14, 0)
> > > > in multiple different ways depending on what the current local system
> > > > time is.  As a note, I think ORC and Spark do this, and it leads to
> > > > confusion/misinterpretation when trying to transfer data.
> > > >
> > > > If we then convert this column to a timestamp with a timezone of
> > > > "UTC", extracting the hour field still yields 14.  If the column is
> > > > converted to a timestamp with a timezone of "PST", extracting the hour
> > > > yields 6 (assuming PST = GMT-8).  Through all of these changes the data
> > > > bits do not change.
> > > >
> > > > Display is not mentioned because I think the points made about how a
> > > > time is displayed are correct.  Applications can choose what they feel
> > > > makes sense to them (as long as they don't start automatically tacking
> > > > on timezones to naive timestamps).  My interpretation of the
> > > > specification has been that display was kind of a shorthand for field
> > > > extraction.
> > > >
> > > > Could others on the thread confirm this is the issue up for debate?
> > > > Are there subtleties/operations we need to consider?
> > > >
> > > > I also agree that we should document recommended conversion practices
> > > > from other systems.
> > > >
> > > > -Micah
> > > >
> > > >
> > > >> So let's invent a third way.  I could use the first 16 bits for the
> > > >> year, the next 8 bits for the month, the next 8 bits for the day of
> > > >> month, the next 8 bits for the hour, the next 8 bits for the minute,
> > > >> and the remaining bits for the seconds.  Using this method I would
> > > >> store (1970, 1, 2, 14, 0) as 0x07B201020E000000.
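That packing can be sketched directly (field widths as described in the quote; purely illustrative, not any implementation's actual layout):

```python
def pack_civil(year, month, day, hour, minute, seconds=0):
    """Pack wall-clock fields into one uint64: 16-bit year, then 8 bits
    each for month, day, hour and minute, the remaining 16 for seconds."""
    return (year << 48) | (month << 40) | (day << 32) | \
           (hour << 24) | (minute << 16) | seconds

packed = pack_civil(1970, 1, 2, 14, 0)
assert packed == 0x07B201020E000000
```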
> > > >
> > > > As an aside: with some small variation, this is what ZetaSQL uses [2]
> > > >
> > > > [1] https://arrow.apache.org/docs/python/timestamps.html#pandas-arrow-spark
> > > > [2] https://github.com/google/zetasql/blob/master/zetasql/public/civil_time.h#L62
> > > >
> > > >
> > > >
> > > > On Thu, Jun 17, 2021 at 1:58 PM Adam Hooper <a...@adamhooper.com> wrote:
> > > >
> > > >> On Thu, Jun 17, 2021 at 2:59 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > > >>
> > > >> >
> > > >> > The SQL standard (e.g. PostgreSQL) has two timestamp types:
> > > >> > with/without time zone — in some SQL implementations each slot can
> > > >> > have a different time zone
> > > >> > https://www.postgresql.org/docs/9.1/datatype-datetime.html
> > > >> > WITHOUT TIME ZONE: "timestamp without time zone value should be
> > > >> > taken or given as timezone local time"
> > > >> >
> > > >>
> > > >> RDBMSs conflict (universally) with ANSI.
> > > >>
> > > >> PostgreSQL TIMESTAMP WITH TIME ZONE is a 64-bit int Instant since the
> > > >> epoch.  It has no timezone.
> > > >>
> > > >> MySQL/MariaDB/BigTable/[your fork here] TIMESTAMP is also an int
> > > >> Instant since the epoch.  It has no timezone.
> > > >>
> > > >> TIMESTAMP *WITHOUT* TIME ZONE is indeed akin to Numpy "naive
> > > >> datetime" in *function*, but not in implementation:
> > > >>
> > > >>    - MySQL DATETIME
> > > >>      <https://dev.mysql.com/doc/internals/en/date-and-time-data-type-representation.html>
> > > >>      is weird: 1-bit sign, 17-bit month, 5-bit day, ....
> > > >>    - MSSQL
> > > >>      <https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15>
> > > >>      uses 6, 7 or 8 bytes
> > > >>    - PostgreSQL stores an integer, but I think its epoch is still
> > > >>      different
> > > >>      <https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/include/datatype/timestamp.h>
> > > >>      (it used to store doubles since 2000-01-01)
> > > >>
> > > >> ... so in general, moving datetimes from these systems into 64-bit
> > > >> integers is nontrivial and lossy.
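Even when both sides do store a plain integer, the epochs differ. A sketch of the PostgreSQL case (microseconds since 2000-01-01, per the header linked above) shifted onto the Unix epoch:

```python
from datetime import datetime, timezone

# Microseconds between the Unix epoch (1970-01-01) and the PostgreSQL
# epoch (2000-01-01), both taken in UTC.
PG_EPOCH_OFFSET_US = int(
    (datetime(2000, 1, 1, tzinfo=timezone.utc)
     - datetime(1970, 1, 1, tzinfo=timezone.utc)).total_seconds()
) * 1_000_000

def pg_to_unix_us(pg_us: int) -> int:
    """Shift a PostgreSQL timestamp (us since 2000-01-01) to us since 1970."""
    return pg_us + PG_EPOCH_OFFSET_US

# PostgreSQL's zero maps to 2000-01-01 00:00:00 UTC.
assert pg_to_unix_us(0) == 946_684_800 * 1_000_000
```

This handles only the epoch shift; the struct-based encodings (MySQL DATETIME, MSSQL) need genuine field unpacking.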
> > > >>
> > > >> > Spark / Databricks discusses how Spark handles this
> > > >> > https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps
> > > >> > * WITHOUT TIME ZONE: "These timestamps are not bound to any time
> > > >> > zone, and are wall clock timestamps." — not UTC-normalized
> > > >> > * WITH TIME ZONE: "does not affect the physical point in time that
> > > >> > the timestamp represents, as that is fully represented by the UTC
> > > >> > time instant given by the other timestamp components"
> > > >> >
> > > >>
> > > >> I don't use Spark, but I read that page twice.  First reading, I got
> > > >> the same thing out of it.  But the second time I read it, I read the
> > > >> opposite!
> > > >>
> > > >> The key part is: "*Spark SQL defines the timestamp type as TIMESTAMP
> > > >> WITH SESSION TIME ZONE*" -- in other words, Spark doesn't have a
> > > >> TIMESTAMP WITH TIME ZONE *or* a TIMESTAMP WITHOUT TIME ZONE.  It has
> > > >> just one Timestamp type: a 64-bit Instant since the epoch.  (It also
> > > >> has a Date type.)
> > > >>
> > > >> If I'm reading correctly, this is exactly the same as PostgreSQL
> > > >> TIMESTAMP WITH TIME ZONE and MySQL TIMESTAMP: servers always store and
> > > >> transmit timestamps as bare 64-bit integers since the epoch -- without
> > > >> timezone.
> > > >>
> > > >>
> > > >> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> > > >> > interpreted as UTC-normalized, that would force all of these other
> > > >> > systems (and more) to serialize their data to be UTC-normalized
> > > >>
> > > >>
> > > >> Are those systems' 64-bit integers interoperable in the first place?
> > > >>
> > > >> As I understand it, there's a ton of variance out there when encoding
> > > >> datetimes as 64-bit integers.  The Numpy+Arrow+Parquet+Postgres
> > > >> encoding is one of many.  As I mentioned in another thread,
> > > >> programming languages all use structs.
> > > >>
> > > >> > pandas as discussed has non-UTC-normalized WITHOUT TIME ZONE
> > > >> > "naive" timestamps and UTC-normalized WITH TIME ZONE.
> > > >> >
> > > >> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> > > >> > interpreted as UTC-normalized, that would force all of these other
> > > >> > systems (and more) to serialize their data to be UTC-normalized
> > > >> > (i.e. calling the equivalent of pandas's tz_localize function) when
> > > >> > they convert to Arrow.
> > > >>
> > > >>
> > > >> Alternatives:
> > > >>
> > > >>    - int64
> > > >>    - date32+time64
> > > >>    - date32+time32
> > > >>
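The date32+time32 alternative is simple arithmetic on the same payload (a sketch; it assumes a millisecond-resolution timestamp):

```python
MS_PER_DAY = 86_400_000

def split_timestamp_ms(ts_ms: int):
    """Split epoch-milliseconds into (date32 days, time32 ms-of-day)."""
    days, ms_of_day = divmod(ts_ms, MS_PER_DAY)
    return days, ms_of_day

# 1970-01-02 14:00 -> one day since the epoch, 14h into that day.
assert split_timestamp_ms(136_800_000) == (1, 50_400_000)
```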
> > > >> > This seems very harmful to me, and will make data from these
> > > >> > systems not accurately representable in Arrow and unable to be
> > > >> > round-tripped.
> > > >> >
> > > >>
> > > >> Certainly nobody wants to go backwards.
> > > >>
> > > >> We need to clarify: how do we store these *common* types -- MySQL
> > > >> TIMESTAMP, Spark TIMESTAMP, PostgreSQL TIMESTAMP WITH TIME ZONE -- in
> > > >> Arrow?
> > > >>
> > > >> Secondarily, I think: how do we recommend users store *datetimes* in
> > > >> Arrow?
> > > >> (I'd expect this to be messier, since every system/language uses a
> > > >> different byte structure.)
> > > >>
> > > >> > Perhaps we can make a spreadsheet and look comprehensively at how
> > > >> > many use cases would be disenfranchised by requiring UTC
> > > >> > normalization always.
> > > >>
> > > >>
> > > >> Hear, hear!
> > > >>
> > > >> Can we also poll people to find out how they're storing Instants
> > > >> today?
> > > >>
> > > >> Enjoy life,
> > > >> Adam
> > > >>
> > > >> --
> > > >> Adam Hooper
> > > >> +1-514-882-9694
> > > >> http://adamhooper.com
> > > >>
> > > >
> >
