A lot of terms in this thread confuse me. Here are my interpretations: *"Timestamp" vs "datetime"*: most languages/libraries have two types (with various names). "Timestamp" is a moment on the timeline; "datetime" is a date plus a time, as seen on a calendar and clock.
The only outliers that spring to mind are SQL (more on this later) and Arrow's own ecosystem. Arrow mentions "naive timestamps" (more on this later); Parquet calls its datetimes "timestamp not adjusted for UTC"; Pandas calls its datetimes "Timestamp". The names aren't always "timestamp" and "datetime". (C calls them "time_t" and "struct tm"; Python calls them "time" and "datetime"; Java calls them "Instant" and "LocalDateTime"; and so on.) But the *operations* are universal: - *"Timestamp"* lets you sort times and compute differences along the timeline. You can't extract fields. - *"Datetime"* lets you extract fields, parse strings, format to string. You can't sort (because clocks sometimes go backwards). You can't convert between timestamps and future datetimes (because timezones change). *"Naive timestamp"*: I googled this, and I think it's unique to Arrow and Pandas (not even Parquet or Numpy or Python). Python and other libraries have "naive *datetime*". That seems common. But by the above definitions, "naive timestamp" is an oxymoron. I propose that whatever the outcome of this discussion, all Arrow-related documentation should avoid the term "naive timestamp". (Parquet spec <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#local-semantics-timestamps-not-normalized-to-utc> allows int64-encoded datetimes, and it avoids the word "naive".) *"TIMESTAMP WITH TIME ZONE" vs "TIMESTAMP WITHOUT TIME ZONE"*: I've seen these terms in SQL. They mean different things on different RDBMSs. I don't know who uses which RDBMS on this mailing list, so I'm not sure what people mean. Oracle <https://docs.oracle.com/cd/B19306_01/server.102/b14225/ch4datetime.htm> "TIMESTAMP WITH TIME ZONE" stores (timestamp, tz) tuple; "TIMESTAMP WITH LOCAL TIME ZONE" means (datetime stored as server timezone); "TIMESTAMP" means (datetime). I suppose the de-facto way to store UTC timestamps without timezone information is to set server timezone to UTC and store "TIMESTAMP WITH LOCAL TIME ZONE"? PostgreSQL <https://www.postgresql.org/docs/13/datatype-datetime.html> "TIMESTAMP WITHOUT TIME ZONE" means (datetime) and "TIMESTAMP WITH TIME ZONE" means (timestamp). The way to store UTC timestamps *without timezone information* is "TIMESTAMP WITH TIME ZONE". MySQL <https://dev.mysql.com/doc/refman/8.0/en/datetime.html> "TIMESTAMP" means (timestamp) and "DATETIME" means (datetime). The way to store UTC timestamps without timezone information is "TIMESTAMP". I think all RDBMSs encode datetimes as int64, though not always as epoch offsets. None of these RDBMSs embed timezone information on timestamp columns. *Why we use time zones* Time zones are for converting timestamp=>datetime and datetime=>timestamp. In my experience, the easiest approach is to specify the timezone during conversion. I propose Arrow's documentation should encourage explicit timezone parameters during conversions. *Timestamps without timezone information* My selfish aim here is to figure out: how do I store a timestamp without timezone information? I'm +1 Antoine's suggestion of using naive timestamps, because A) my pedantic reading of the existing spec supports this; B) it's a near-universal best practice; and C) I'm already doing it ;). I dislike storing timezone=UTC because timezone=UTC suggests the values are simultaneously "timestamp" and "datetime". In Java, for instance: would a timezone=UTC column convert to ZonedDateTime or to Instant? I hope we all agree that UTC timestamps without timezones are best practice in many domains. I hope we also agree that Arrow should heavily encourage UTC timestamps without timezones. *Parquet, Pandas* Another important piece of the puzzle is, "how to convert between Parquet/Pandas and Arrow?" Perhaps another metadata field could be added to TIMESTAMP: a Parquet-alike "isAdjustedToUTC". And users could be advised to avoid the "timestamp" field. Or datetimes could be stored as int64 (without metadata, or with Pandas-specific metadata). Or Arrow could introduce a new type, datetime64 (without metadata). I think the most *confusing* outcome would be for timezone=UTC to mean "Instant" and timezone=null to mean "ZonedDateTime". That would require very, very clear documentation because the documentation would contradict the actual terms: "Timestamp is only timestamp if you set timezone=UTC"; "Timestamp with timezone=null means datetime, not timestamp"; etc. Arrow would be just another voice in SQL's sea of confusion. Enjoy life, Adam On Tue, Jun 15, 2021 at 8:33 AM Antoine Pitrou <anto...@python.org> wrote: > > Le 15/06/2021 à 12:57, Joris Van den Bossche a écrit : > > A general observation: it might be useful to get back to the message > > of Julian Hyde in the previous email thread about this 2 weeks ago > > ( > https://lists.apache.org/thread.html/r5a89aa20b1cb812dc01a3817a5bfb365971577986d586dcc7ee21e72%40%3Cdev.arrow.apache.org%3E > ). > > Quoting part of that email: > > > > On Wed, 2 Jun 2021 at 21:39, Julian Hyde <jh...@apache.org> wrote: > >> > >> Good time libraries support all. E.g. Jodatime [1] has > >> > >> * Instant - an instantaneous point on the time-line > >> * DateTime - full date and time with time-zone > >> * LocalDateTime - date-time without a time-zone > >> > >> ... > >> > >> I recommend that Arrow supports all three. Choose clear, distinct > >> names for all three, consistent with names used elsewhere in the > >> industry. > > > > It seems to me that we are discussing whether our "timestamp without > > timezone" should be interpreted as a LocalDateTime or as an Instant > > (since interpreting it as UTC makes it an Instant, I think). Is that a > > correct / helpful framing? > > That is correct, IMHO. > > -- Adam Hooper +1-514-882-9694 http://adamhooper.com