My proposal is that Arrow should support three different kinds of date-times: 
zoneless, zoned, and instant. (Not necessarily with those names.)

All three kinds occur frequently in the industry.

Many systems only have two, and users of those systems have figured out how to 
make do. (For example, you can implement an instant using a zoneless, and vice 
versa, if you are careful.) But let’s suppose that Arrow has two of the three, 
and needs to interoperate with a system that has a different two of the three. 
Chaos ensues.

Let’s just implement all three.
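
For concreteness, java.time already models exactly this trichotomy
(LocalDateTime, ZonedDateTime, Instant). A sketch of the distinction (my
illustration, not a naming proposal):

    import java.time.Instant;
    import java.time.LocalDateTime;
    import java.time.ZoneId;
    import java.time.ZonedDateTime;

    // Zoneless: a wall-clock reading, not tied to a point on the timeline.
    LocalDateTime zoneless = LocalDateTime.of(1970, 1, 2, 14, 0);

    // Zoned: the same wall-clock reading pinned to a time zone.
    ZonedDateTime zoned = zoneless.atZone(ZoneId.of("America/Los_Angeles"));

    // Instant: an offset from the epoch; no zone, no calendar fields.
    Instant instant = zoned.toInstant();   // 1970-01-02T22:00:00Z

Note that converting between the kinds needs extra information (a zone or
an offset), which is why emulating one kind with another takes care.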

Julian




> On Jun 22, 2021, at 8:46 AM, Wes McKinney <wesmck...@gmail.com> wrote:
> 
> Let's see a streamlined document about what we are voting on — I have
> limited bandwidth to read through and synthesize the discussion myself
> and I am probably not the only one. It has always been my
> understanding that time elapsed from the UNIX epoch
> (1970-01-01 00:00:00 UTC) is represented as a timestamp with tz=UTC (which
> I believe is what is being called an "instant").
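> 
> For reference, a sketch of declaring that in the Java implementation (my
> illustration, assuming the current pojo API):
> 
>     import org.apache.arrow.vector.types.TimeUnit;
>     import org.apache.arrow.vector.types.pojo.ArrowType;
>     import org.apache.arrow.vector.types.pojo.Field;
>     import org.apache.arrow.vector.types.pojo.FieldType;
> 
>     // An "instant" under this reading: epoch offsets with tz set to UTC.
>     ArrowType.Timestamp type =
>         new ArrowType.Timestamp(TimeUnit.MILLISECOND, "UTC");
>     Field ts = new Field("ts", FieldType.nullable(type), null);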
> 
> On Tue, Jun 22, 2021 at 9:01 AM Adam Hooper <a...@adamhooper.com> wrote:
>> 
>> Maybe Arrow should add a new type, "INSTANT"?
>> 
>> Instant (seconds since the epoch) is the gold standard for storing moments
>> in time. All programming languages; all RDBMSs; Parquet ... everybody uses
>> this.
>> 
>> I use Instants, too. I interpreted TIMESTAMP with no metadata to mean
>> Instant because I read the docs *assuming* Arrow stores Instant.
>> 
>> I know, I know, no vote can stop me from interpreting "timestamp without
>> timezone" however the heck I want. But it's painful for me to transition
>> from happy user to heretic.
>> 
>> Voting to clarify that Arrow doesn't store Instants is voting to clarify
>> that Arrow *doesn't* do something extremely useful. It's voting for a
>> negative. That sounds painful! What if there were positives to vote for? An
>> "INSTANT" type? A new TIMESTAMP metadata field, "instant" (on by default)?
>> A fiat that timezone=UTC means Instant, not ZonedDateTime?
>> 
>> Enjoy life,
>> Adam
>> 
>> On Mon, Jun 21, 2021 at 11:39 PM Weston Pace <weston.p...@gmail.com> wrote:
>> 
>>> I agree that a vote would be a good idea.  Do you want to start a
>>> dedicated vote thread?  I can write one up too if you'd rather.
>>> 
>>> -Weston
>>> 
>>> On Mon, Jun 21, 2021 at 4:54 PM Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>>> 
>>>> I think comments on the doc are tailing off. I think Jorge's test cases
>>>> still need some more careful analysis, but Weston has provided an
>>>> initial pass.
>>>> 
>>>> The matter not resolved in the document is whether Timestamp with
>>>> timezone logically represents a multi-field date and time (one that does
>>>> not represent a specific instant) or whether it logically represents an
>>>> instant (some measurable offset from an epoch). Based on comments on the
>>>> document, both the C++/Python implementations and the Java implementation
>>>> (those generally considered "reference") show evidence that the former
>>>> representation is what was intended (some links are in the document).
>>>> 
>>>> We can probably continue to debate what is useful, but ultimately we
>>>> need to pick one or the other and clarify the specification. Given how
>>>> the reference implementations currently work, I think we should err on
>>>> the side of interpreting these values as date-times. Given the contention
>>>> here, we will likely need to vote on this.
>>>> 
>>>> More comments on the document or here are still useful in case we've
>>>> missed an interpretation or there are other facts to consider.
>>>> 
>>>> Cheers,
>>>> Micah
>>>> 
>>>> On Thu, Jun 17, 2021 at 9:59 PM Jorge Cardoso Leitão <
>>>> jorgecarlei...@gmail.com> wrote:
>>>> 
>>>>> Thank you everyone for participating so far; really important and
>>>>> useful discussion.
>>>>> 
>>>>> I think of this discussion as a set of test cases over behavior:
>>>>> 
>>>>> parameterization:
>>>>> * Timestamp(ms, None)
>>>>> * Timestamp(ms, "00:00")
>>>>> * Timestamp(ms, "01:00")
>>>>> 
>>>>> Cases:
>>>>> * its string representation equals to
>>>>> * add a duration equals to
>>>>> * add an interval equals to
>>>>> * subtract a Timestamp(ms, None) equals to
>>>>> * subtract a Timestamp(ms, "01:00") equals to
>>>>> * subtract a Date32 equals to
>>>>> * subtract a Time32(ms) equals to
>>>>> * extract the day equals to
>>>>> * extract the timezone equals to
>>>>> * cast to Timestamp(ms, None) equals to
>>>>> * cast to Timestamp(ms, "01:00") equals to
>>>>> * write to parquet v2 equals to (physical value and logical type)
>>>>> 
>>>>> In all cases, the result may either be valid or invalid. If valid, we
>>>>> would need a datatype and an actual value.
>>>>> I was hoping to be able to answer each of the above at the end of this
>>>>> discussion.
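>>>>>
>>>>> To make the shape concrete, a hypothetical table-driven skeleton (the
>>>>> names and the "?" placeholders are mine, not proposals):
>>>>>
>>>>>     import java.util.List;
>>>>>
>>>>>     // One row per (parameterization, operation); "expected" is either a
>>>>>     // (datatype, value) pair or "invalid" -- exactly what we must decide.
>>>>>     record Case(String timestampType, String operation, String expected) {}
>>>>>
>>>>>     List<Case> cases = List.of(
>>>>>         new Case("Timestamp(ms, None)", "extract the day", "?"),
>>>>>         new Case("Timestamp(ms, \"01:00\")", "extract the day", "?"),
>>>>>         new Case("Timestamp(ms, \"01:00\")", "cast to Timestamp(ms, None)", "?"));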
>>>>> 
>>>>> I've suggested adding these in the google docs.
>>>>> 
>>>>> Best,
>>>>> Jorge
>>>>> 
>>>>> On Fri, Jun 18, 2021 at 12:15 AM Micah Kornfield <emkornfi...@gmail.com>
>>>>> wrote:
>>>>>> 
>>>>>> I've posted the examples above in
>>>>>> https://docs.google.com/document/d/1QDwX4ypfNvESc2ywcT1ygaf2Y1R8SmkpifMV7gpJdBI/edit?usp=sharing
>>>>>> because I think it would be better to collaborate there instead of
>>>>>> linear e-mail history and then bring the consensus back to the list.
>>>>>> 
>>>>>> On Thu, Jun 17, 2021 at 2:56 PM Micah Kornfield <emkornfi...@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> I feel like we might still be talking past each other here, or at
>>>>>>> least I don't understand the two sides of this. I'll try to expand
>>>>>>> Weston's example because I think it provides the best clarification.
>>>>>>> 
>>>>>>> (1970, 1, 2, 14, 0) is stored as 0x0000000008276700 (136800000,
>>>>>>> assuming ms) for a timestamp column without timezone (always). This
>>>>>>> represents an offset from the unix epoch. This interpretation should
>>>>>>> not change based on the local system timezone. Extracting the hour
>>>>>>> field always yields 14 (extraction is done relative to UTC).
>>>>>>> 
>>>>>>> The alternative here seems to be that we can encode (1970, 1, 2, 14,
>>>>>>> 0) in multiple different ways depending on what the current local
>>>>>>> system time is. As a note, I think ORC and Spark do this, and it leads
>>>>>>> to confusion/misinterpretation when trying to transfer data.
>>>>>>> 
>>>>>>> If we then convert this column to a timestamp with timezone "UTC",
>>>>>>> extracting the hour field still yields 14. If the column is converted
>>>>>>> to a timestamp with timezone PST, extracting an hour would yield 6
>>>>>>> (assume PST = GMT-8). Through all of these changes the data bits do
>>>>>>> not change.
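>>>>>>>
>>>>>>> A sketch of exactly that, using java.time for illustration (my own
>>>>>>> example, not anything the spec mandates):
>>>>>>>
>>>>>>>     import java.time.Instant;
>>>>>>>     import java.time.ZoneOffset;
>>>>>>>
>>>>>>>     // The stored bits: 136800000 ms from the epoch = (1970, 1, 2, 14, 0) UTC.
>>>>>>>     Instant stored = Instant.ofEpochMilli(136_800_000L);
>>>>>>>
>>>>>>>     // Without timezone (or tz="UTC"): extract fields relative to UTC.
>>>>>>>     stored.atOffset(ZoneOffset.UTC).getHour();         // 14
>>>>>>>
>>>>>>>     // With timezone PST (GMT-8): same bits, different extraction.
>>>>>>>     stored.atOffset(ZoneOffset.ofHours(-8)).getHour(); // 6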
>>>>>>> 
>>>>>>> Display is not mentioned because I think the points above about how a
>>>>>>> time is displayed are correct: applications can choose what they feel
>>>>>>> makes sense to them (as long as they don't start automatically tacking
>>>>>>> on timezones to naive timestamps). My interpretation of the
>>>>>>> specification has been that display was kind of shorthand for field
>>>>>>> extraction.
>>>>>>> 
>>>>>>> Could others on the thread confirm this is the issue up for debate?
>>>>>>> Are there subtleties/operations we need to consider?
>>>>>>> 
>>>>>>> I also agree that we should document recommended conversion
>>>>>>> practices from other systems.
>>>>>>> 
>>>>>>> -Micah
>>>>>>> 
>>>>>>> 
>>>>>>>> So let's invent a third way.  I could use
>>>>>>>> the first 16 bits for the year, the next 8 bits for the month, the
>>>>>>>> next 8 bits for the day of month, the next 8 bits for the hour, the
>>>>>>>> next 8 bits for the minute, and the remaining bits for the seconds.
>>>>>>>> Using this method I would store (1970, 1, 2, 14, 0) as
>>>>>>>> 0x07B201020E000000.
>>>>>>> 
>>>>>>> Aside: with some small variation, this is what ZetaSql uses [2].
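>>>>>>>
>>>>>>> To check the arithmetic of that packing, a quick sketch (mine):
>>>>>>>
>>>>>>>     // 16-bit year | 8-bit month | 8-bit day | 8-bit hour
>>>>>>>     // | 8-bit minute | 16-bit seconds, for (1970, 1, 2, 14, 0, 0).
>>>>>>>     long packed = ((long) 1970 << 48)
>>>>>>>                 | ((long) 1 << 40)
>>>>>>>                 | ((long) 2 << 32)
>>>>>>>                 | ((long) 14 << 24)
>>>>>>>                 | ((long) 0 << 16)
>>>>>>>                 | (long) 0;
>>>>>>>     System.out.printf("0x%016X%n", packed); // 0x07B201020E000000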
>>>>>>> 
>>>>>>> [1] https://arrow.apache.org/docs/python/timestamps.html#pandas-arrow-spark
>>>>>>> [2] https://github.com/google/zetasql/blob/master/zetasql/public/civil_time.h#L62
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, Jun 17, 2021 at 1:58 PM Adam Hooper <a...@adamhooper.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> On Thu, Jun 17, 2021 at 2:59 PM Wes McKinney <wesmck...@gmail.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> The SQL standard (e.g. PostgreSQL) has two timestamp types:
>>>>>>>>> with/without time zone — in some SQL implementations each slot can
>>>>>>>>> have a different time zone
>>>>>>>>> https://www.postgresql.org/docs/9.1/datatype-datetime.html
>>>>>>>>> WITHOUT TIME ZONE: "timestamp without time zone value should be
>>>>>>>>> taken or given as timezone local time"
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> RDBMSs conflict (universally) with ANSI.
>>>>>>>> 
>>>>>>>> PostgreSQL TIMESTAMP WITH TIME ZONE is a 64-bit int Instant since
>>>>>>>> the epoch. It has no timezone.
>>>>>>>> 
>>>>>>>> MySQL/MariaDB/BigTable/[your fork here] TIMESTAMP is also an int
>>>>>>>> Instant since the epoch. It has no timezone.
>>>>>>>> 
>>>>>>>> TIMESTAMP *WITHOUT* TIME ZONE is indeed akin to Numpy "naive
>>>>>>>> datetime" in *function*, but not in implementation:
>>>>>>>> 
>>>>>>>>   - MySQL DATETIME
>>>>>>>>     <https://dev.mysql.com/doc/internals/en/date-and-time-data-type-representation.html>
>>>>>>>>     is weird: 1-bit sign, 17-bit year+month, 5-bit day, ...
>>>>>>>>   - MSSQL
>>>>>>>>     <https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15>
>>>>>>>>     uses 6, 7 or 8 bytes
>>>>>>>>   - PostgreSQL stores an integer, but I think its epoch is still
>>>>>>>>     different
>>>>>>>>     <https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/include/datatype/timestamp.h>
>>>>>>>>     (it used to store doubles since 2000-01-01)
>>>>>>>> 
>>>>>>>> ... so in general, moving datetimes from these systems into 64-bit
>>>>>>>> integers is nontrivial and lossy.
>>>>>>>> 
>>>>>>>>> Spark / Databricks discusses how Spark handles this
>>>>>>>>> https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps
>>>>>>>>> * WITHOUT TIME ZONE: "These timestamps are not bound to any time
>>>>>>>>> zone, and are wall clock timestamps." — not UTC-normalized
>>>>>>>>> * WITH TIME ZONE: "does not affect the physical point in time that
>>>>>>>>> the timestamp represents, as that is fully represented by the UTC
>>>>>>>>> time instant given by the other timestamp components"
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> I don't use Spark, but I read that page twice. First reading, I got
>>>>>>>> the same thing out of it. But the second time I read it, I read the
>>>>>>>> opposite!
>>>>>>>> 
>>>>>>>> The key part is: "*Spark SQL defines the timestamp type as TIMESTAMP
>>>>>>>> WITH SESSION TIME ZONE*," -- in other words, Spark doesn't have a
>>>>>>>> TIMESTAMP WITH TIME ZONE *or* TIMESTAMP WITHOUT TIME ZONE. It has
>>>>>>>> just one Timestamp type: a 64-bit Instant since the epoch. (It also
>>>>>>>> has a Date type.)
>>>>>>>> 
>>>>>>>> If I'm reading correctly, this is exactly the same as PostgreSQL
>>>>>>>> TIMESTAMP WITH TIME ZONE and MySQL TIMESTAMP: servers always store
>>>>>>>> and transmit timestamps as bare 64-bit integers since the epoch --
>>>>>>>> without timezone.
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
>>>>>>>>> interpreted as UTC-normalized, that would force all of these other
>>>>>>>>> systems (and more) to serialize their data to be UTC-normalized
>>>>>>>> 
>>>>>>>> Are those systems' 64-bit integers interoperable in the first
>>>>>>>> place?
>>>>>>>> 
>>>>>>>> As I understand it, there's a ton of variance out there when
>>>>>>>> encoding datetimes as 64-bit integers. The Numpy+Arrow+Parquet+Postgres
>>>>>>>> encoding is one of many. As I mentioned in another thread, programming
>>>>>>>> languages all use structs.
>>>>>>>> 
>>>>>>>>> pandas as discussed has non-UTC-normalized WITHOUT TIME ZONE "naive"
>>>>>>>>> timestamps and UTC-normalized WITH TIME ZONE.
>>>>>>>>>
>>>>>>>>> If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
>>>>>>>>> interpreted as UTC-normalized, that would force all of these other
>>>>>>>>> systems (and more) to serialize their data to be UTC-normalized
>>>>>>>>> (i.e. calling the equivalent of pandas's tz_localize function) when
>>>>>>>>> they convert to Arrow.
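>>>>>>>>
>>>>>>>> (For reference, the java.time analogue of that localize step -- my
>>>>>>>> illustration: pin a naive reading to a zone without changing its
>>>>>>>> fields:)
>>>>>>>>
>>>>>>>>     import java.time.LocalDateTime;
>>>>>>>>     import java.time.ZoneId;
>>>>>>>>     import java.time.ZonedDateTime;
>>>>>>>>
>>>>>>>>     // tz_localize, roughly: keep wall-clock fields, attach a zone.
>>>>>>>>     LocalDateTime naive = LocalDateTime.of(1970, 1, 2, 14, 0);
>>>>>>>>     ZonedDateTime localized = naive.atZone(ZoneId.of("America/New_York"));
>>>>>>>>     // localized.toInstant() is now a real point on the timeline.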
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Alternatives:
>>>>>>>> 
>>>>>>>>   - int64
>>>>>>>>   - date32+time64
>>>>>>>>   - date32+time32
>>>>>>>> 
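>>>>>>>> A sketch of the date32+time64 option (my arithmetic, not an API):
>>>>>>>>
>>>>>>>>     import java.time.LocalDateTime;
>>>>>>>>
>>>>>>>>     LocalDateTime naive = LocalDateTime.of(1970, 1, 2, 14, 0);
>>>>>>>>     int date32 = (int) naive.toLocalDate().toEpochDay();      // 1 day
>>>>>>>>     long time64 = naive.toLocalTime().toNanoOfDay() / 1_000;  // 50_400_000_000 us
>>>>>>>>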
>>>>>>>>> This seems very harmful to me, and will make data from these
>>>>>>>>> systems not accurately representable in Arrow and unable to be
>>>>>>>>> round-tripped.
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> Certainly nobody wants to go backwards.
>>>>>>>> 
>>>>>>>> We need to clarify: how do we store these *common* types -- MySQL
>>>>>>>> TIMESTAMP, Spark TIMESTAMP, PostgreSQL TIMESTAMP WITH TIME ZONE --
>>>>>>>> in Arrow?
>>>>>>>> 
>>>>>>>> Secondarily, I think: how do we recommend users store *datetimes*
>>>>>>>> in Arrow? (I'd expect this to be messier, since every system/language
>>>>>>>> uses a different byte structure.)
>>>>>>>> 
>>>>>>>>> Perhaps we can make a spreadsheet and look comprehensively at how
>>>>>>>>> many use cases would be disenfranchised by requiring UTC
>>>>>>>>> normalization always.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hear, hear!
>>>>>>>> 
>>>>>>>> Can we also poll people to find out how they're storing Instants
>>>>>>>> today?
>>>>>>>> 
>>>>>>>> Enjoy life,
>>>>>>>> Adam
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Adam Hooper
>>>>>>>> +1-514-882-9694
>>>>>>>> http://adamhooper.com
>>>>>>>> 
>>>>>>> 
>>>>> 
>>> 
>> 
>> 
>> --
>> Adam Hooper
>> +1-514-882-9694
>> http://adamhooper.com
