A lot of terms in this thread confuse me. Here are my interpretations:

*"Timestamp" vs "datetime"*: most languages/libraries have two types (with
various names). "Timestamp" is a moment on the timeline; "datetime" is a
date plus a time, as seen on a calendar and clock.

The only outliers that spring to mind are SQL (more on this later) and
Arrow's own ecosystem. Arrow mentions "naive timestamps" (more on this
later); Parquet calls its datetimes "timestamp not adjusted for UTC";
Pandas calls its datetimes "Timestamp".

The names aren't always "timestamp" and "datetime". (C calls them "time_t"
and "struct tm"; Python calls them "time" and "datetime"; Java calls them
"Instant" and "LocalDateTime"; and so on.) But the *operations* are
universal:

   - *"Timestamp"* lets you sort times and compute differences along the
   timeline. You can't extract fields.
   - *"Datetime"* lets you extract fields, parse strings, format to string.
   You can't sort (because clocks sometimes go backwards). You can't convert
   between timestamps and future datetimes (because timezones change).

*"Naive timestamp"*: I googled this, and I think it's unique to Arrow and
Pandas (not even Parquet or Numpy or Python).

Python and other libraries have "naive *datetime*". That seems common.

But by the above definitions, "naive timestamp" is an oxymoron.

I propose that whatever the outcome of this discussion, all Arrow-related
documentation should avoid the term "naive timestamp". (Parquet spec
<https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#local-semantics-timestamps-not-normalized-to-utc>
allows
int64-encoded datetimes, and it avoids the word "naive".)

*"TIMESTAMP WITH TIME ZONE" vs "TIMESTAMP WITHOUT TIME ZONE"*: I've seen
these terms in SQL. They mean different things on different RDBMSs. I don't
know who uses which RDBMS on this mailing list, so I'm not sure what people
mean.

Oracle
<https://docs.oracle.com/cd/B19306_01/server.102/b14225/ch4datetime.htm>
"TIMESTAMP WITH TIME ZONE" stores (timestamp, tz) tuple; "TIMESTAMP WITH
LOCAL TIME ZONE" means (datetime stored as server timezone); "TIMESTAMP"
means (datetime). I suppose the de-facto way to store UTC timestamps
without timezone information is to set server timezone to UTC and store
"TIMESTAMP WITH LOCAL TIME ZONE"?

PostgreSQL <https://www.postgresql.org/docs/13/datatype-datetime.html>
"TIMESTAMP WITHOUT TIME ZONE" means (datetime) and "TIMESTAMP WITH TIME
ZONE" means (timestamp). The way to store UTC timestamps *without timezone
information* is "TIMESTAMP WITH TIME ZONE".

MySQL <https://dev.mysql.com/doc/refman/8.0/en/datetime.html> "TIMESTAMP"
means (timestamp) and "DATETIME" means (datetime). The way to store UTC
timestamps without timezone information is "TIMESTAMP".

I think all RDBMSs encode datetimes as int64, though not always as epoch
offsets.

None of these RDBMSs embed timezone information on timestamp columns.

*Why we use time zones*

Time zones are for converting timestamp=>datetime and datetime=>timestamp.

In my experience, the easiest approach is to specify the timezone during
conversion.

I propose Arrow's documentation should encourage explicit timezone
parameters during conversions.

*Timestamps without timezone information*

My selfish aim here is to figure out: how do I store a timestamp without
timezone information?

I'm +1 Antoine's suggestion of using naive timestamps, because A) my
pedantic reading of the existing spec supports this; B) it's a
near-universal best practice; and C) I'm already doing it ;).

I dislike storing timezone=UTC because timezone=UTC suggests the values are
simultaneously "timestamp" and "datetime". In Java, for instance: would a
timezone=UTC column convert to ZonedDateTime or to Instant?

I hope we all agree that UTC timestamps without timezones are best practice
in many domains. I hope we also agree that Arrow should heavily encourage
UTC timestamps without timezones.

*Parquet, Pandas*

Another important piece of the puzzle is, "how to convert between
Parquet/Pandas and Arrow?"

Perhaps another metadata field could be added to TIMESTAMP: a Parquet-alike
"isAdjustedToUTC". And users could be advised to avoid the "timestamp"
field.

Or datetimes could be stored as int64 (without metadata, or with
Pandas-specific metadata).

Or Arrow could introduce a new type, datetime64 (without metadata).

I think the most *confusing* outcome would be for timezone=UTC to mean
"Instant" and timezone=null to mean "ZonedDateTime". That would require
very, very clear documentation because the documentation would contradict
the actual terms: "Timestamp is only timestamp if you set timezone=UTC";
"Timestamp with timezone=null means datetime, not timestamp"; etc. Arrow
would be just another voice in SQL's sea of confusion.

Enjoy life,
Adam



On Tue, Jun 15, 2021 at 8:33 AM Antoine Pitrou <anto...@python.org> wrote:

>
> Le 15/06/2021 à 12:57, Joris Van den Bossche a écrit :
> > A general observation: it might be useful to get back to the message
> > of Julian Hyde in the previous email thread about this 2 weeks ago
> > (
> https://lists.apache.org/thread.html/r5a89aa20b1cb812dc01a3817a5bfb365971577986d586dcc7ee21e72%40%3Cdev.arrow.apache.org%3E
> ).
> > Quoting part of that email:
> >
> > On Wed, 2 Jun 2021 at 21:39, Julian Hyde <jh...@apache.org> wrote:
> >>
> >> Good time libraries support all. E.g. Jodatime [1] has
> >>
> >> * Instant - an instantaneous point on the time-line
> >> * DateTime - full date and time with time-zone
> >> * LocalDateTime - date-time without a time-zone
> >>
> >> ...
> >>
> >> I recommend that Arrow supports all three. Choose clear, distinct
> >> names for all three, consistent with names used elsewhere in the
> >> industry.
> >
> > It seems to me that we are discussing whether our "timestamp without
> > timezone" should be interpreted as a LocalDateTime or as an Instant
> > (since interpreting it as UTC makes it an Instant, I think). Is that a
> > correct / helpful framing?
>
> That is correct, IMHO.
>
>

-- 
Adam Hooper
+1-514-882-9694
http://adamhooper.com

Reply via email to