I agree that a vote would be a good idea. Do you want to start a dedicated vote thread? I can write one up if you'd rather.
-Weston

On Mon, Jun 21, 2021 at 4:54 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> I think comments on the doc are tailing off. Jorge's test cases I think
> still need some more careful analysis, but Weston has provided an
> initial pass.
>
> The matter not resolved on the document is whether Timestamp with timezone
> logically represents a multi-field date and time (that does not represent
> a specific instant) or whether it logically represents an instant (some
> measurable offset from an epoch). Based on comments on the documentation,
> both the C++/Python and Java implementations (those that have generally
> been considered "reference") have evidence that the former representation
> is what is intended (some links are in the document).
>
> We can probably continue to debate what is useful, but it seems ultimately
> we need to pick one or the other and clarify the specification. Given how
> the reference implementations currently work, I think we should err on the
> side of interpreting these values as date-times. Ultimately, given the
> contention here, we will likely need to vote on this.
>
> More comments on the document or here are still useful in case we've
> missed an interpretation or there are other facts to consider.
>
> Cheers,
> Micah
>
> On Thu, Jun 17, 2021 at 9:59 PM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
> > Thank you everyone for participating so far; really important and
> > useful discussion.
> >
> > I think of this discussion as a set of test cases over behavior:
> >
> > parameterization:
> > * Timestamp(ms, None)
> > * Timestamp(ms, "00:00")
> > * Timestamp(ms, "01:00")
> >
> > Cases:
> > * its string representation equals to
> > * add a duration equals to
> > * add an interval equals to
> > * subtract a Timestamp(ms, None) equals to
> > * subtract a Timestamp(ms, "01:00") equals to
> > * subtract a Date32 equals to
> > * subtract a Time32(ms) equals to
> > * extract the day equals to
> > * extract the timezone equals to
> > * cast to Timestamp(ms, None) equals to
> > * cast to Timestamp(ms, "01:00") equals to
> > * write to parquet v2 equals to (physical value and logical type)
> >
> > In all cases, the result may be either valid or invalid. If valid, we
> > would need a datatype and an actual value. I was hoping to be able to
> > answer each of the above at the end of this discussion.
> >
> > I've suggested adding these in the google docs.
> >
> > Best,
> > Jorge
> >
> > On Fri, Jun 18, 2021 at 12:15 AM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> > >
> > > I've posted the examples above in
> > > https://docs.google.com/document/d/1QDwX4ypfNvESc2ywcT1ygaf2Y1R8SmkpifMV7gpJdBI/edit?usp=sharing
> > > because I think it would be better to collaborate there instead of a
> > > linear e-mail history and then bring the consensus back to the list.
> > >
> > > On Thu, Jun 17, 2021 at 2:56 PM Micah Kornfield <emkornfi...@gmail.com>
> > > wrote:
> > >
> > > > I feel like we might still be talking past each other here, or at
> > > > least I don't understand the two sides of this. I'll try to expand
> > > > Weston's example because I think it provides the best clarification.
> > > >
> > > > (1970, 1, 2, 14, 0) is stored as 0x0000000008276700 (136800000,
> > > > assuming ms) for a timestamp column without timezone (always). This
> > > > represents an offset from the unix epoch.
> > > > This interpretation should not change based on the local system
> > > > timezone. Extracting the hour field always yields 14 (extraction is
> > > > done relative to UTC).
> > > >
> > > > The alternative here seems to be that we can encode (1970, 1, 2, 14,
> > > > 0) in multiple different ways depending on what the current local
> > > > system time is. As a note, I think ORC and Spark do this, and it
> > > > leads to confusion/misinterpretation when trying to transfer data.
> > > >
> > > > If we then convert this column to a timestamp with timezone in the
> > > > "UTC" timezone, extracting the hour field still yields 14. If the
> > > > column is converted to timestamp with timezone PST, extracting an
> > > > hour would yield 6 (assume PST = GMT-8). Through all of these
> > > > changes, the data bits do not change.
> > > >
> > > > Display is not mentioned because I think the points about how a
> > > > timestamp is displayed are correct. Applications can choose what they
> > > > feel makes sense to them (as long as they don't start automatically
> > > > tacking on timezones to naive timestamps). My interpretation of the
> > > > specification has been that "display" was kind of shorthand for field
> > > > extraction.
> > > >
> > > > Could others on the thread confirm this is the issue up for debate?
> > > > Are there subtleties/operations we need to consider?
> > > >
> > > > I also agree that we should document recommended conversion practices
> > > > from other systems.
> > > >
> > > > -Micah
> > > >
> > > >> So let's invent a third way. I could use
> > > >> the first 16 bits for the year, the next 8 bits for the month, the
> > > >> next 8 bits for the day of month, the next 8 bits for the hour, the
> > > >> next 8 bits for the minute, and the remaining bits for the seconds.
> > > >> Using this method I would store (1970, 1, 2, 14, 0) as
> > > >> 0x07B201020E000000.
> > > >
> > > > Aside: with some small variation, this is what ZetaSQL uses [2].
> > > >
> > > > [1] https://arrow.apache.org/docs/python/timestamps.html#pandas-arrow-spark
> > > > [2] https://github.com/google/zetasql/blob/master/zetasql/public/civil_time.h#L62
> > > >
> > > > On Thu, Jun 17, 2021 at 1:58 PM Adam Hooper <a...@adamhooper.com> wrote:
> > > >
> > > >> On Thu, Jun 17, 2021 at 2:59 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > > >>
> > > >> > The SQL standard (e.g. PostgreSQL) has two timestamp types:
> > > >> > with/without time zone — in some SQL implementations each slot can
> > > >> > have a different time zone
> > > >> > https://www.postgresql.org/docs/9.1/datatype-datetime.html
> > > >> > WITHOUT TIME ZONE: "timestamp without time zone value should be
> > > >> > taken or given as timezone local time"
> > > >>
> > > >> RDBMSs conflict (universally) with ANSI.
> > > >>
> > > >> PostgreSQL TIMESTAMP WITH TIME ZONE is a 64-bit int Instant since
> > > >> the epoch. It has no timezone.
> > > >>
> > > >> MySQL/MariaDB/BigTable/[your fork here] TIMESTAMP is also an int
> > > >> Instant since the epoch. It has no timezone.
> > > >>
> > > >> TIMESTAMP *WITHOUT* TIME ZONE is indeed akin to NumPy "naive
> > > >> datetime" in *function*, but not in implementation:
> > > >>
> > > >> - MySQL DATETIME
> > > >>   <https://dev.mysql.com/doc/internals/en/date-and-time-data-type-representation.html>
> > > >>   is weird: 1-bit sign, 17-bit year-month, 5-bit day, ....
> > > >> - MSSQL
> > > >>   <https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15>
> > > >>   uses 6, 7 or 8 bytes
> > > >> - PostgreSQL stores an integer, but I think its epoch is still
> > > >>   different
> > > >>   <https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/include/datatype/timestamp.h>
> > > >>   (it used to store doubles; its epoch is 2000-01-01)
> > > >>
> > > >> ... so in general, moving datetimes from these systems into 64-bit
> > > >> integers is nontrivial and lossy.
> > > >>
> > > >> > Spark / Databricks discusses how Spark handles this
> > > >> > https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps
> > > >> > * WITHOUT TIME ZONE: "These timestamps are not bound to any time
> > > >> > zone, and are wall clock timestamps." — not UTC-normalized
> > > >> > * WITH TIME ZONE: "does not affect the physical point in time that
> > > >> > the timestamp represents, as that is fully represented by the UTC
> > > >> > time instant given by the other timestamp components"
> > > >>
> > > >> I don't use Spark, but I read that page twice. The first time, I got
> > > >> the same thing out of it. But the second time I read it, I read the
> > > >> opposite!
> > > >>
> > > >> The key part is: "*Spark SQL defines the timestamp type as TIMESTAMP
> > > >> WITH SESSION TIME ZONE*" -- in other words, Spark doesn't have a
> > > >> TIMESTAMP WITH TIME ZONE *or* a TIMESTAMP WITHOUT TIME ZONE. It has
> > > >> just one timestamp type: a 64-bit Instant since the epoch. (It also
> > > >> has a Date type.)
> > > >>
> > > >> If I'm reading correctly, this is exactly the same as PostgreSQL
> > > >> TIMESTAMP WITH TIME ZONE and MySQL TIMESTAMP: servers always store
> > > >> and transmit timestamps as bare 64-bit integers since the epoch --
> > > >> without timezone.
> > > >>
> > > >> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> > > >> > interpreted as UTC-normalized, that would force all of these other
> > > >> > systems (and more) to serialize their data to be UTC-normalized
> > > >>
> > > >> Are those systems' 64-bit integers interoperable in the first place?
> > > >>
> > > >> As I understand it, there's a ton of variance out there when
> > > >> encoding datetimes as 64-bit integers. The
> > > >> NumPy+Arrow+Parquet+Postgres encoding is one of many. As I mentioned
> > > >> in another thread, programming languages all use structs.
> > > >>
> > > >> > pandas as discussed has non-UTC-normalized WITHOUT TIME ZONE
> > > >> > "naive" timestamps and UTC-normalized WITH TIME ZONE.
> > > >> >
> > > >> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> > > >> > interpreted as UTC-normalized, that would force all of these other
> > > >> > systems (and more) to serialize their data to be UTC-normalized
> > > >> > (i.e. calling the equivalent of pandas's tz_localize function)
> > > >> > when they convert to Arrow.
> > > >>
> > > >> Alternatives:
> > > >>
> > > >> - int64
> > > >> - date32+time64
> > > >> - date32+time32
> > > >>
> > > >> > This seems very harmful to me, and will make data from these
> > > >> > systems not accurately representable in Arrow and unable to be
> > > >> > round-tripped.
> > > >>
> > > >> Certainly nobody wants to go backwards.
> > > >>
> > > >> We need to clarify: how do we store these *common* types -- MySQL
> > > >> TIMESTAMP, Spark TIMESTAMP, PostgreSQL TIMESTAMP WITH TIME ZONE --
> > > >> in Arrow?
> > > >>
> > > >> Secondarily, I think: how do we recommend users store *datetimes* in
> > > >> Arrow? (I'd expect this to be messier, since every system/language
> > > >> uses a different byte structure.)
> > > >>
> > > >> > Perhaps we can make a spreadsheet and look comprehensively at how
> > > >> > many use cases would be disenfranchised by requiring UTC
> > > >> > normalization always.
> > > >>
> > > >> Hear, hear!
> > > >>
> > > >> Can we also poll people to find out how they're storing Instants
> > > >> today?
> > > >>
> > > >> Enjoy life,
> > > >> Adam
> > > >>
> > > >> --
> > > >> Adam Hooper
> > > >> +1-514-882-9694
> > > >> http://adamhooper.com
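[Editor's note: on the tz_localize point quoted in the thread, a small pandas sketch of what localization does to the stored integer. Not from the original emails; the value is invented for illustration.]

```python
import pandas as pd

# A naive wall-clock value, stored unnormalized: 1970-01-02 14:00.
naive = pd.Series(pd.to_datetime(["1970-01-02 14:00"]))

# tz_localize pins it to a zone, shifting the stored instant by the
# zone's offset (Los Angeles was on PST, UTC-8, in January 1970).
aware = naive.dt.tz_localize("America/Los_Angeles")

print(naive[0].value)  # 136800000000000 ns: the wall-clock value itself
print(aware[0].value)  # 165600000000000 ns: the UTC instant, +8 h
```

This shift is exactly the serialization cost Wes describes: systems holding naive values would have to perform it on every conversion to Arrow if "WITHOUT TIMEZONE" were redefined as UTC-normalized.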