Also, as a secondary (but IMHO important) concern, if we choose the "always UTC" interpretation, we should stop using the "time zone naive" wording in the spec, because there is a high risk of confusion with Python's different "naive timestamp" concept:

https://docs.python.org/3/library/datetime.html

"""A naive object does not contain enough information to unambiguously locate itself relative to other date/time objects. Whether a naive object represents Coordinated Universal Time (UTC), local time, or time in some other timezone is purely up to the program, just like it is up to the program whether a particular number represents metres, miles, or mass. Naive objects are easy to understand and to work with, at the cost of ignoring some aspects of reality."""
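
To make the difference concrete, here is a minimal sketch using only the standard `datetime` module. The helper name `is_naive` and the sample date are mine, chosen for illustration; the check itself follows the definition of "naive" given in the Python docs linked above.

```python
from datetime import datetime, timezone


def is_naive(dt: datetime) -> bool:
    """Return True if dt carries no usable UTC offset."""
    return dt.tzinfo is None or dt.tzinfo.utcoffset(dt) is None


# Same calendar fields, once without and once with a time zone attached.
naive = datetime(2021, 6, 14, 17, 57)
aware = datetime(2021, 6, 14, 17, 57, tzinfo=timezone.utc)

# Python refuses to order a naive value against an aware one, because it
# does not assume that the naive value is UTC (or anything else).
try:
    naive < aware
    comparable = True
except TypeError:
    comparable = False
```

In Python, then, "naive" means "meaning left to the program", which is not the same thing as "always UTC".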


On 14/06/2021 at 17:57, Antoine Pitrou wrote:

Hello,

In ARROW-13033, there was a disagreement as to how the specification
about timezone-less timestamps should be interpreted.

Here is the wording in the Schema specification:

   /// * If the time zone is null or equal to an empty string, the data is "time
   ///   zone naive" and shall be displayed *as is* to the user, not localized
   ///   to the locale of the user. This data can be though of as UTC but
   ///   without having "UTC" as the time zone, it is not considered to be
   ///   localized to any time zone

My interpretation is that timestamp *values* are always expressed in
UTC.  The timezone is an optional piece of metadata that describes the
context in which they were obtained, but does not impact how the *values*
should be interpreted.

Joris' interpretation is that timestamp *values* are expressed in an
arbitrary "local time" that is unknown and unspecified. It is therefore
difficult to exactly interpret them, since the timezone information is
unavailable.

(I'll let Joris express his thoughts more accurately, but the gist of
his opinion is that "can be thought of as UTC" is only an indication,
not a prescription)
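
As an illustration of how the two readings diverge, here is a small Python sketch. The stored value, the microsecond unit and the UTC+2 offset are all assumptions made up for the example; nothing in the spec mandates them.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1)

# Hypothetical stored value: microseconds since the epoch, with an empty
# time zone field in the schema.
value_us = 1_623_693_420_000_000

# Field-by-field reading of the value, with no zone attached yet.
wall_clock = EPOCH + timedelta(microseconds=value_us)

# Reading 1: the value is expressed in UTC.
as_utc = wall_clock.replace(tzinfo=timezone.utc)

# Reading 2: the value is a wall-clock time in some unknown zone.
# UTC+2 is only a guess; the data itself cannot confirm it.
guessed_zone = timezone(timedelta(hours=2))
as_local = wall_clock.replace(tzinfo=guessed_zone)

# The two readings designate instants two hours apart.
gap = as_utc - as_local
```

Under the second reading, the size of `gap` depends entirely on a guess that a consumer of the data has no way to check.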


To me, the problem with the "unknown local timezone" interpretation is
that it renders the data essentially ambiguous and useless.  The problem
is very similar to that of having string data without a known
encoding. Python users know it as the Python 2 encoding hell (to the
point that it motivated the heavy and disruptive Python 3
transition).

(Note that the problem is even worse for timestamps. At least, you can
detect with a high degree of probability that an arbitrary binary string
is *not* UTF-8-encoded. You cannot do so with timestamp values: any 64-bit
timestamp may or may not be a UTC timestamp. Once you have lost that
information, you cannot regain it.)
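
The string half of this comparison can be sketched in a few lines of Python. The helper name `looks_like_utf8` and the sample word are made up for the example.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same word encoded two ways: the Latin-1 bytes are not valid UTF-8,
# so the mismatch can be detected from the bytes alone.
latin1_bytes = "café".encode("latin-1")
utf8_bytes = "café".encode("utf-8")

latin1_detected = not looks_like_utf8(latin1_bytes)
utf8_accepted = looks_like_utf8(utf8_bytes)
```

There is no equivalent check for a 64-bit timestamp: every value is equally valid whether it was meant as UTC or as some local time.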

In any case, I think this must be clarified, first on this mailing-list,
then by making the spec wording stronger and more prescriptive.

Regards

Antoine.
