Maybe Arrow should add a new type, "INSTANT"? Instant (seconds since the epoch) is the gold standard in storing moments in time. All programming languages; all RDBMSs; Parquet ... everybody uses this.
I use Instants, too. I interpreted TIMESTAMP with no metadata to mean Instant because I read the docs *assuming* Arrow stores Instant. I know, I know, no vote can stop me from interpreting "timestamp without timezone" however the heck I want. But it's painful for me to transition from happy user to heretic. Voting to clarify that Arrow doesn't store Instants is voting to clarify that Arrow *doesn't* do something extremely useful. It's voting for a negative. That sounds painful! What if there were positives to vote for? An "INSTANT" type? A new TIMESTAMP metadata field, "instant" (on by default)? A fiat that timezone=UTC means Instant, not ZonedDateTime? Enjoy life, Adam On Mon, Jun 21, 2021 at 11:39 PM Weston Pace <weston.p...@gmail.com> wrote: > I agree that a vote would be a good idea. Do you want to start a > dedicated vote thread? I can write one up too if you'd rather. > > -Weston > > On Mon, Jun 21, 2021 at 4:54 PM Micah Kornfield <emkornfi...@gmail.com> > wrote: > > > > I think comments on the doc are tailing off. Jorge's test cases I think > > still need some more careful analysis but Weston has provided an > > initial pass. > > > > The matter not resolved on the document is whether Timestamp with > timezone > > logically represents multi-field date and time (that does not represent a > > specific instant) or whether it logically represents an instant (some > > measurable offset from an epoch). Based on comments on the > documentation > > both C++/Python implementations and the Java implementations (those that > > have generally been considered "reference") both have evidence the the > > former representation is what is intended (some links are in the > document). > > > > We can probably continue to debate what is useful but it seems ultimately > > we need to pick one or the other and clarify the specification. Given > how > > the reference implementations currently work I think we should error on > the > > side of interpreting these values as date times. Ultimately, given the > > contention here we will likely need to vote on this. > > > > More comments on the document or here are still useful in case we've > missed > > an interpretation or there are other facts to consider. > > > > Cheers, > > Micah > > > > On Thu, Jun 17, 2021 at 9:59 PM Jorge Cardoso Leitão < > > jorgecarlei...@gmail.com> wrote: > > > > > Thank you everyone for participating so far; really important and > > > useful discussion. > > > > > > I think of this discussion as a set of test cases over behavior: > > > > > > parameterization: > > > * Timestamp(ms, None) > > > * Timestamp(ms, "00:00") > > > * Timestamp(ms, "01:00") > > > > > > Cases: > > > * its string representation equals to > > > * add a duration equals to > > > * add an interval equals to > > > * subtract a Timestamp(ms, None) equals to > > > * subtract a Timestamp(ms, "01:00") equals to > > > * subtract a Date32 equals to > > > * subtract a Time32(ms) equals to > > > * extract the day equals to > > > * extract the timezone equals to > > > * cast to Timestamp(ms, None) equals to > > > * cast to Timestamp(ms, "01:00") equals to > > > * write to parquet v2 equals to (physical value and logical type) > > > > > > In all cases, the result may either be valid or invalid. If valid, we > > > would need a datatype and an actual value. > > > I was hoping to be able to answer each of the above at the end of this > > > discussion. > > > > > > I've suggested adding these in the google docs. > > > > > > Best, > > > Jorge > > > > > > On Fri, Jun 18, 2021 at 12:15 AM Micah Kornfield < > emkornfi...@gmail.com> > > > wrote: > > > > > > > > I've posted the examples above in > > > > > > > > https://docs.google.com/document/d/1QDwX4ypfNvESc2ywcT1ygaf2Y1R8SmkpifMV7gpJdBI/edit?usp=sharing > > > > because I think it would be better to collaborate there instead of > linear > > > > e-mail history and then bring the consensus back to the list. > > > > > > > > On Thu, Jun 17, 2021 at 2:56 PM Micah Kornfield < > emkornfi...@gmail.com> > > > > wrote: > > > > > > > > > I feel like we might still be talking past each other here or at > least > > > I > > > > > don't understand the two sides of this. I'll try to expand > Weston's > > > > > example because I think it provides the best clarification. > > > > > > > > > > (1970, 1, 2, 14, 0) is stored as 0x000000000A4CB800 (172800000, > > > assuming > > > > > ms) for a timestamp column without timezone (always). This > > > represents an > > > > > offset from the unix epoch. This interpretation should not change > > > based on > > > > > the local system timezone. Extracting the hour field always > yields 14 > > > > > (extraction is done relative to UTC). > > > > > > > > > > The alternative here seems to be that we can encode (1970, 1, 2, > 14, > > > 0) in > > > > > multiple different ways depending on what the current local system > time > > > > > is. As a note, I think ORC and Spark do this, and it leads to > > > > > confusion/misinterpretation when trying to transfer data. > > > > > > > > > > If we then convert this column to a timestamp with a timezone in > "UTC" > > > > > timezone extracting the hour field still yields 14. If the column > is > > > > > converted to Timezone with timestamp PST. Extracting an hour would > > > yield 6 > > > > > (assume PST = -8GMT). Through all of these changes the data > bits do > > > not > > > > > change. > > > > > > > > > > Display is not mentioned because I think the points about how a > time > > > > > display is correct. Applications can choose what they feel makes > sense > > > to > > > > > them (as long as they don't start automatically tacking on > timezones to > > > > > naive timestamps). My interpretation of the specification has been > > > display > > > > > was kind of shorthand for field extraction. > > > > > > > > > > Could others on the thread confirm this is the issue up for debate? > > > Are > > > > > there subtleties/operations we need to consider? > > > > > > > > > > I also agree that we should document recommended conversion > practices > > > from > > > > > other systems. > > > > > > > > > > -Micah > > > > > > > > > > > > > > > So let's invent a third way. I could use > > > > >> the first 16 bits for the year, the next 8 bits for the month, the > > > > >> next 8 bits for the day of month, the next 8 bits for the hour, > the > > > > >> next 8 bits for the minute, and the remaining bits for the > seconds. > > > > >> Using this method I would store (1970, 1, 2, 14, 0) as > > > > >> 0x07B201020E000000. > > > > > > > > > > Aside, With some small variation this is what ZetaSql uses [2] > > > > > > > > > > [1] > > > > > > > > > https://arrow.apache.org/docs/python/timestamps.html#pandas-arrow-spark > > > > > [2] > > > > > > > > > https://github.com/google/zetasql/blob/master/zetasql/public/civil_time.h#L62 > > > > > > > > > > > > > > > > > > > > On Thu, Jun 17, 2021 at 1:58 PM Adam Hooper <a...@adamhooper.com> > > > wrote: > > > > > > > > > >> On Thu, Jun 17, 2021 at 2:59 PM Wes McKinney <wesmck...@gmail.com > > > > > wrote: > > > > >> > > > > >> > > > > > >> > The SQL standard (e.g. PostgresSQL) has two timestamp types: > > > > >> > with/without time zone — in some SQL implementations each slot > can > > > > >> > have a different time zone > > > > >> > https://www.postgresql.org/docs/9.1/datatype-datetime.html > > > > >> > WITHOUT TIME ZONE: "timestamp without time zone value should be > > > taken > > > > >> > or given as timezone local time" > > > > >> > > > > > >> > > > > >> RDBMSs conflict (universally) with ANSI. > > > > >> > > > > >> PostgreSQL TIMESTAMP WITH TIME ZONE is 64-bit int Instant since > the > > > epoch. > > > > >> It has no timezone. > > > > >> > > > > >> MySQL/MariaDB/BigTable/[your fork here] TIMESTAMP is also an int > > > Instant > > > > >> since the epoch. It has no timezone. > > > > >> > > > > >> TIMESTAMP *WITHOUT* TIME ZONE is indeed akin to Numpy "naive > > > datetime" in > > > > >> *function*, but not in implementation: > > > > >> > > > > >> - MySQL DATETIME > > > > >> < > > > > >> > > > > https://dev.mysql.com/doc/internals/en/date-and-time-data-type-representation.html > > > > >> > > > > > >> is weird: 1-bit sign, 17-bit month, 5-bit day, .... > > > > >> - MSSQL > > > > >> < > > > > >> > > > > https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15 > > > > >> > > > > > >> uses 6, 7 or 8 bytes > > > > >> - PostgreSQL stores an integer, but I think its epoch is still > > > > >> different > > > > >> < > > > > >> > > > > https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/include/datatype/timestamp.h > > > > >> > > > > > >> (it > > > > >> used to store doubles since 2000-01-01) > > > > >> > > > > >> ... so in general, moving datetimes from these systems into 64-bit > > > > >> integers > > > > >> is nontrivial and lossy. > > > > >> > > > > >> Spark / Databricks discusses how Spark handles this > > > > >> > > > > > >> > > > > > >> > > > > https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps > > > > >> > * WITHOUT TIME ZONE: "These timestamps are not bound to any time > > > zone, > > > > >> > and are wall clock timestamps." — not UTC-normalized > > > > >> > * WITH TIME ZONE: "does not affect the physical point in time > that > > > the > > > > >> > timestamp represents, as that is fully represented by the UTC > time > > > > >> > instant given by the other timestamp components" > > > > >> > > > > > >> > > > > >> I don't use Spark, but I read that page twice. First reading, I > got > > > the > > > > >> same thing out of it. But the second time I read it, I read the > > > opposite! > > > > >> > > > > >> The key part is: "*Spark SQL defines the timestamp type as > TIMESTAMP > > > WITH > > > > >> SESSION TIME ZONE*," -- in other words, Spark doesn't have a > TIMESTAMP > > > > >> WITH > > > > >> TIME ZONE *or* TIMESTAMP WITHOUT TIME ZONE. It has just one > Timestamp > > > > >> type: > > > > >> a 64-bit Instant since the epoch. (It also has a Date type.) > > > > >> > > > > >> If I'm reading correctly, this is exactly the same as PostgreSQL > > > TIMESTAMP > > > > >> WITH TIME ZONE and MySQL TIMESTAMP: servers always store and > transmit > > > > >> timestamps as bare 64-bit integers since the epoch -- without > > > timezone. > > > > >> > > > > >> > > > > >> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be > > > > >> > interpreted as UTC-normalized, that would force all of these > other > > > > >> > systems (and more) to serialize their data to be UTC-normalized > > > > >> > > > > >> > > > > >> Are those systems' 64-bit integers interoperable in the first > place? > > > > >> > > > > >> As I understand it, there's a ton of variance out there when > encoding > > > > >> datetimes as 64-bit integers. The Numpy+Arrow+Parquet+Postgres > > > encoding > > > > >> is one of many. As I mentioned in another thread, programming > > > languages > > > > >> all > > > > >> use structs. > > > > >> > > > > >> pandas as discussed has non-UTC-normalized WITHOUT TIME ZONE > "naive" > > > > >> > timestamps and UTC-normalized WITH TIME ZONE. > > > > >> > > > > > >> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be > > > > >> > interpreted as UTC-normalized, that would force all of these > other > > > > >> > systems (and more) to serialize their data to be UTC-normalized > > > (i.e. > > > > >> > calling the equivalent of pandas's tz_localize function) when > they > > > > >> > convert to Arrow. > > > > >> > > > > >> > > > > >> Alternatives: > > > > >> > > > > >> - int64 > > > > >> - date32+time64 > > > > >> - date32+time32 > > > > >> > > > > >> This seems very harmful to me, and will make data > > > > >> > from these systems not accurately representable in Arrow and > unable > > > to > > > > >> > be round-tripped. > > > > >> > > > > > >> > > > > >> Certainly nobody wants to go backwards. > > > > >> > > > > >> We need to clarify: how do we store these *common* types -- MySQL > > > > >> TIMESTAMP, Spark TIMESTAMP, PostgreSQL TIMESTAMP WITH TIME ZONE > -- in > > > > >> Arrow? > > > > >> > > > > >> Secondarily, I think: how do we recommend users store *datetimes* > in > > > > >> Arrow? > > > > >> (I'd expect this to be messier, since every system/language uses a > > > > >> different byte structure.) > > > > >> > > > > >> Perhaps we can make a spreadsheet and look comprehensively at how > many > > > > >> > use cases would be disenfranchised by requiring UTC > normalization > > > > >> > always. > > > > >> > > > > >> > > > > >> Hear, hear! > > > > >> > > > > >> Can we also poll people to find out how they're storing Instants > > > today? > > > > >> > > > > >> Enjoy life, > > > > >> Adam > > > > >> > > > > >> -- > > > > >> Adam Hooper > > > > >> +1-514-882-9694 > > > > >> http://adamhooper.com > > > > >> > > > > > > > > > -- Adam Hooper +1-514-882-9694 http://adamhooper.com