Maybe Arrow should add a new type, "INSTANT"?

Instant (seconds since the epoch) is the gold standard for storing moments
in time. All programming languages; all RDBMSs; Parquet ... everybody uses
this.

I use Instants, too. I interpreted TIMESTAMP with no metadata to mean
Instant because I read the docs *assuming* Arrow stores Instant.

I know, I know, no vote can stop me from interpreting "timestamp without
timezone" however the heck I want. But it's painful for me to transition
from happy user to heretic.

Voting to clarify that Arrow doesn't store Instants is voting to clarify
that Arrow *doesn't* do something extremely useful. It's voting for a
negative. That sounds painful! What if there were positives to vote for? An
"INSTANT" type? A new TIMESTAMP metadata field, "instant" (on by default)?
A fiat that timezone=UTC means Instant, not ZonedDateTime?
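To make the Instant/wall-clock distinction concrete, here's a plain-Python sketch (stdlib only, not Arrow API) of why "timestamp without timezone" is ambiguous as a storage format: the same wall-clock reading maps to different epoch offsets depending on which zone you interpret it in.

```python
from datetime import datetime, timezone, timedelta

# Not Arrow API -- an illustration of the ambiguity: interpret the same
# wall-clock reading (1970, 1, 2, 14, 0) in two different timezones and
# you get two different Instants (epoch offsets).
wall = (1970, 1, 2, 14, 0)

as_utc = datetime(*wall, tzinfo=timezone.utc)
as_pst = datetime(*wall, tzinfo=timezone(timedelta(hours=-8)))  # PST = UTC-8

ms_utc = int(as_utc.timestamp() * 1000)
ms_pst = int(as_pst.timestamp() * 1000)

print(ms_utc)  # 136800000
print(ms_pst)  # 165600000 -- same wall clock, different Instant
```

An INSTANT type would pin down which of these two 64-bit values the column holds; today that choice lives in the reader's head.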

Enjoy life,
Adam

On Mon, Jun 21, 2021 at 11:39 PM Weston Pace <weston.p...@gmail.com> wrote:

> I agree that a vote would be a good idea.  Do you want to start a
> dedicated vote thread?  I can write one up too if you'd rather.
>
> -Weston
>
> On Mon, Jun 21, 2021 at 4:54 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> >
> > I think comments on the doc are tailing off.  I think Jorge's test
> > cases still need some more careful analysis, but Weston has provided an
> > initial pass.
> >
> > The matter not resolved in the document is whether Timestamp with
> > timezone logically represents a multi-field date and time (that does not
> > represent a specific instant) or whether it logically represents an
> > instant (some measurable offset from an epoch). Based on comments on the
> > documentation, the C++/Python implementations and the Java
> > implementations (those that have generally been considered "reference")
> > both show evidence that the former representation is what is intended
> > (some links are in the document).
> >
> > We can probably continue to debate what is useful, but it seems
> > ultimately we need to pick one or the other and clarify the
> > specification. Given how the reference implementations currently work, I
> > think we should err on the side of interpreting these values as date
> > times. Ultimately, given the contention here, we will likely need to
> > vote on this.
> >
> > More comments on the document or here are still useful in case we've
> > missed an interpretation or there are other facts to consider.
> >
> > Cheers,
> > Micah
> >
> > On Thu, Jun 17, 2021 at 9:59 PM Jorge Cardoso Leitão <
> > jorgecarlei...@gmail.com> wrote:
> >
> > > Thank you everyone for participating so far; really important and
> > > useful discussion.
> > >
> > > I think of this discussion as a set of test cases over behavior:
> > >
> > > parameterization:
> > > * Timestamp(ms, None)
> > > * Timestamp(ms, "00:00")
> > > * Timestamp(ms, "01:00")
> > >
> > > Cases:
> > > * its string representation equals to
> > > * add a duration equals to
> > > * add an interval equals to
> > > * subtract a Timestamp(ms, None) equals to
> > > * subtract a Timestamp(ms, "01:00") equals to
> > > * subtract a Date32 equals to
> > > * subtract a Time32(ms) equals to
> > > * extract the day equals to
> > > * extract the timezone equals to
> > > * cast to Timestamp(ms, None) equals to
> > > * cast to Timestamp(ms, "01:00") equals to
> > > * write to parquet v2 equals to (physical value and logical type)
> > >
> > > In all cases, the result may be either valid or invalid. If valid, we
> > > would need a datatype and an actual value.
> > > I was hoping to be able to answer each of the above at the end of this
> > > discussion.
> > >
> > > I've suggested adding these in the google docs.
> > >
> > > Best,
> > > Jorge
> > >
> > > On Fri, Jun 18, 2021 at 12:15 AM Micah Kornfield <
> emkornfi...@gmail.com>
> > > wrote:
> > > >
> > > > I've posted the examples above in
> > > >
> > >
> https://docs.google.com/document/d/1QDwX4ypfNvESc2ywcT1ygaf2Y1R8SmkpifMV7gpJdBI/edit?usp=sharing
> > > > because I think it would be better to collaborate there instead of
> linear
> > > > e-mail history and then bring the consensus back to the list.
> > > >
> > > > On Thu, Jun 17, 2021 at 2:56 PM Micah Kornfield <
> emkornfi...@gmail.com>
> > > > wrote:
> > > >
> > > > > I feel like we might still be talking past each other here or at
> least
> > > I
> > > > > don't understand the two sides of this.  I'll try to expand
> Weston's
> > > > > example because I think it provides the best clarification.
> > > > >
> > > > > (1970, 1, 2, 14, 0) is stored as 0x0000000008276700 (136800000,
> > > > > assuming ms) for a timestamp column without timezone (always). This
> > > > > represents an offset from the unix epoch. This interpretation should
> > > > > not change based on the local system timezone. Extracting the hour
> > > > > field always yields 14 (extraction is done relative to UTC).
> > > > >
> > > > > The alternative here seems to be that we can encode (1970, 1, 2,
> 14,
> > > 0) in
> > > > > multiple different ways depending on what the current local system
> time
> > > > > is.  As a note, I think ORC and Spark do this, and it leads to
> > > > > confusion/misinterpretation when trying to transfer data.
> > > > >
> > > > > If we then convert this column to a timestamp with a timezone in
> > > > > "UTC" timezone, extracting the hour field still yields 14. If the
> > > > > column is converted to timestamp with timezone PST, extracting an
> > > > > hour would yield 6 (assume PST = UTC-8). Through all of these
> > > > > changes the data bits do not change.
> > > > >
> > > > > Display is not mentioned because I think the points made about how
> > > > > a time is displayed are correct. Applications can choose what they
> > > > > feel makes sense to them (as long as they don't start automatically
> > > > > tacking on timezones to naive timestamps). My interpretation of the
> > > > > specification has been that display was kind of shorthand for field
> > > > > extraction.
> > > > >
> > > > > Could others on the thread confirm this is the issue up for debate?
> > > Are
> > > > > there subtleties/operations we need to consider?
> > > > >
> > > > > I also agree that we should document recommended conversion
> practices
> > > from
> > > > > other systems.
> > > > >
> > > > > -Micah
> > > > >
> > > > >
> > > > >  So let's invent a third way.  I could use
> > > > >> the first 16 bits for the year, the next 8 bits for the month, the
> > > > >> next 8 bits for the day of month, the next 8 bits for the hour,
> the
> > > > >> next 8 bits for the minute, and the remaining bits for the
> seconds.
> > > > >> Using this method I would store (1970, 1, 2, 14, 0) as
> > > > >> 0x07B201020E000000.
> > > > >
> > > > > As an aside: with some small variation, this is what ZetaSQL uses [2]
> > > > >
> > > > > [1]
> > > > >
> > >
> https://arrow.apache.org/docs/python/timestamps.html#pandas-arrow-spark
> > > > > [2]
> > > > >
> > >
> https://github.com/google/zetasql/blob/master/zetasql/public/civil_time.h#L62
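For concreteness, the "third way" packing quoted above (16 bits year, 8 bits month, 8 bits day, 8 bits hour, 8 bits minute, remaining bits seconds) can be sketched in a few lines. This is an illustration of that hypothetical encoding only, not ZetaSQL's actual bit layout:

```python
# Sketch of the hypothetical packed encoding quoted above: 16-bit year,
# 8-bit month, 8-bit day, 8-bit hour, 8-bit minute, and the remaining
# 16 bits for the seconds.
def pack_civil(year: int, month: int, day: int,
               hour: int, minute: int, second: int = 0) -> int:
    return (
        (year << 48) | (month << 40) | (day << 32)
        | (hour << 24) | (minute << 16) | second
    )

packed = pack_civil(1970, 1, 2, 14, 0)
print(hex(packed))  # 0x7b201020e000000
```

Note the packed value matches the 0x07B201020E000000 figure in the quote, which shows this encoding is ordered-comparable but useless for epoch arithmetic.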
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Jun 17, 2021 at 1:58 PM Adam Hooper <a...@adamhooper.com>
> > > wrote:
> > > > >
> > > > >> On Thu, Jun 17, 2021 at 2:59 PM Wes McKinney <wesmck...@gmail.com
> >
> > > wrote:
> > > > >>
> > > > >> >
> > > > >> > The SQL standard (e.g. PostgreSQL) has two timestamp types:
> > > > >> > with/without time zone — in some SQL implementations each slot
> can
> > > > >> > have a different time zone
> > > > >> > https://www.postgresql.org/docs/9.1/datatype-datetime.html
> > > > >> > WITHOUT TIME ZONE: "timestamp without time zone value should be
> > > taken
> > > > >> > or given as timezone local time"
> > > > >> >
> > > > >>
> > > > >> RDBMSs conflict (universally) with ANSI.
> > > > >>
> > > > >> PostgreSQL TIMESTAMP WITH TIME ZONE is a 64-bit integer Instant
> > > > >> since the epoch.
> > > > >> It has no timezone.
> > > > >>
> > > > >> MySQL/MariaDB/BigTable/[your fork here] TIMESTAMP is also an int
> > > Instant
> > > > >> since the epoch. It has no timezone.
> > > > >>
> > > > >> TIMESTAMP *WITHOUT* TIME ZONE is indeed akin to Numpy "naive
> > > datetime" in
> > > > >> *function*, but not in implementation:
> > > > >>
> > > > >>    - MySQL DATETIME
> > > > >>    <
> > > > >>
> > >
> https://dev.mysql.com/doc/internals/en/date-and-time-data-type-representation.html
> > > > >> >
> > > > >>    is weird: 1-bit sign, a 17-bit year*13+month field, 5-bit day, ....
> > > > >>    - MSSQL
> > > > >>    <
> > > > >>
> > >
> https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15
> > > > >> >
> > > > >>    uses 6, 7 or 8 bytes
> > > > >>    - PostgreSQL stores an integer, but I think its epoch is still
> > > > >> different
> > > > >>    <
> > > > >>
> > >
> https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/include/datatype/timestamp.h
> > > > >> >
> > > > >> (it
> > > > >>    used to store doubles since 2000-01-01)
> > > > >>
> > > > >> ... so in general, moving datetimes from these systems into 64-bit
> > > > >> integers
> > > > >> is nontrivial and lossy.
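As one example of that lossiness, here is a rough sketch of the MySQL 5.6+ DATETIME packing linked above, from my reading of the internals doc; treat the exact field widths as an assumption to verify against that page:

```python
# Rough sketch of MySQL 5.6+ DATETIME packing (field widths are an
# assumption to verify against the linked internals doc): 1-bit sign,
# 17-bit year*13+month, 5-bit day, 5-bit hour, 6-bit minute, 6-bit
# second -- 40 bits, plus optional fractional seconds not shown here.
def mysql_datetime_pack(year, month, day, hour, minute, second):
    year_month = year * 13 + month
    return (
        (1 << 39)              # sign bit: 1 for non-negative
        | (year_month << 22)
        | (day << 17)
        | (hour << 12)
        | (minute << 6)
        | second
    )

packed = mysql_datetime_pack(1970, 1, 2, 14, 0, 0)
# Decoding the combined year*13+month field recovers the original parts:
year_month = (packed >> 22) & 0x1FFFF
print(year_month // 13, year_month % 13)  # 1970 1
```

Nothing here is an epoch offset, so converting it to an Arrow 64-bit timestamp means re-deriving an Instant (or a naive value) field by field.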
> > > > >>
> > > > >> Spark / Databricks discusses how Spark handles this
> > > > >> >
> > > > >> >
> > > > >>
> > >
> https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps
> > > > >> > * WITHOUT TIME ZONE: "These timestamps are not bound to any time
> > > zone,
> > > > >> > and are wall clock timestamps." — not UTC-normalized
> > > > >> > * WITH TIME ZONE: "does not affect the physical point in time
> that
> > > the
> > > > >> > timestamp represents, as that is fully represented by the UTC
> time
> > > > >> > instant given by the other timestamp components"
> > > > >> >
> > > > >>
> > > > >> I don't use Spark, but I read that page twice. First reading, I
> got
> > > the
> > > > >> same thing out of it. But the second time I read it, I read the
> > > opposite!
> > > > >>
> > > > >> The key part is: "*Spark SQL defines the timestamp type as
> TIMESTAMP
> > > WITH
> > > > >> SESSION TIME ZONE*," -- in other words, Spark doesn't have a
> TIMESTAMP
> > > > >> WITH
> > > > >> TIME ZONE *or* TIMESTAMP WITHOUT TIME ZONE. It has just one
> Timestamp
> > > > >> type:
> > > > >> a 64-bit Instant since the epoch. (It also has a Date type.)
> > > > >>
> > > > >> If I'm reading correctly, this is exactly the same as PostgreSQL
> > > TIMESTAMP
> > > > >> WITH TIME ZONE and MySQL TIMESTAMP: servers always store and
> transmit
> > > > >> timestamps as bare 64-bit integers since the epoch -- without
> > > timezone.
> > > > >>
> > > > >>
> > > > >> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> > > > >> > interpreted as UTC-normalized, that would force all of these
> other
> > > > >> > systems (and more) to serialize their data to be UTC-normalized
> > > > >>
> > > > >>
> > > > >> Are those systems' 64-bit integers interoperable in the first
> place?
> > > > >>
> > > > >> As I understand it, there's a ton of variance out there when
> encoding
> > > > >> datetimes as 64-bit integers. The Numpy+Arrow+Parquet+Postgres
> > > encoding
> > > > >> is one of many. As I mentioned in another thread, programming
> > > languages
> > > > >> all
> > > > >> use structs.
> > > > >>
> > > > >> pandas as discussed has non-UTC-normalized WITHOUT TIME ZONE
> "naive"
> > > > >> > timestamps and UTC-normalized WITH TIME ZONE.
> > > > >> >
> > > > >> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> > > > >> > interpreted as UTC-normalized, that would force all of these
> other
> > > > >> > systems (and more) to serialize their data to be UTC-normalized
> > > (i.e.
> > > > >> > calling the equivalent of pandas's tz_localize function) when
> they
> > > > >> > convert to Arrow.
> > > > >>
> > > > >>
> > > > >> Alternatives:
> > > > >>
> > > > >>    - int64
> > > > >>    - date32+time64
> > > > >>    - date32+time32
> > > > >>
> > > > >> This seems very harmful to me, and will make data
> > > > >> > from these systems not accurately representable in Arrow and
> unable
> > > to
> > > > >> > be round-tripped.
> > > > >> >
> > > > >>
> > > > >> Certainly nobody wants to go backwards.
> > > > >>
> > > > >> We need to clarify: how do we store these *common* types -- MySQL
> > > > >> TIMESTAMP, Spark TIMESTAMP, PostgreSQL TIMESTAMP WITH TIME ZONE
> -- in
> > > > >> Arrow?
> > > > >>
> > > > >> Secondarily, I think: how do we recommend users store *datetimes*
> in
> > > > >> Arrow?
> > > > >> (I'd expect this to be messier, since every system/language uses a
> > > > >> different byte structure.)
> > > > >>
> > > > >> Perhaps we can make a spreadsheet and look comprehensively at how
> many
> > > > >> > use cases would be disenfranchised by requiring UTC
> normalization
> > > > >> > always.
> > > > >>
> > > > >>
> > > > >> Hear, hear!
> > > > >>
> > > > >> Can we also poll people to find out how they're storing Instants
> > > today?
> > > > >>
> > > > >> Enjoy life,
> > > > >> Adam
> > > > >>
> > > > >> --
> > > > >> Adam Hooper
> > > > >> +1-514-882-9694
> > > > >> http://adamhooper.com
> > > > >>
> > > > >
> > >
>


-- 
Adam Hooper
+1-514-882-9694
http://adamhooper.com
