Hello,

In ARROW-13033, there was a disagreement as to how the specification about timezone-less timestamps should be interpreted.

Here is the wording in the Schema specification:

  /// * If the time zone is null or equal to an empty string, the data is "time
  ///   zone naive" and shall be displayed *as is* to the user, not localized
  ///   to the locale of the user. This data can be though of as UTC but
  ///   without having "UTC" as the time zone, it is not considered to be
  ///   localized to any time zone

My interpretation is that timestamp *values* are always expressed in UTC. The timezone is an optional piece of metadata that describes the context in which they were obtained, but it does not affect how the *values* should be interpreted.

Joris' interpretation is that timestamp *values* are expressed in an arbitrary "local time" that is unknown and unspecified. It is therefore difficult to exactly interpret them, since the timezone information is unavailable.

(I'll let Joris express his thoughts more accurately, but the gist of his opinion is that "can be thought of as UTC" is only an indication, not a prescription)
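
To make the difference concrete, here is a minimal Python sketch using only the standard library. The stored value and the UTC+2 offset are arbitrary choices for illustration; neither comes from the JIRA discussion.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical stored value: seconds since the epoch, with no timezone metadata.
value = 1_600_000_000

# Interpretation 1: the value is an instant expressed in UTC.
as_utc = datetime.fromtimestamp(value, tz=timezone.utc)

# Interpretation 2: the value encodes wall-clock fields in some unknown local
# zone. Decoding yields a naive datetime; the instant stays undetermined until
# a zone is assumed.
wall = datetime(1970, 1, 1) + timedelta(seconds=value)

# Assumption for illustration only: the producer's clock was at UTC+2.
assumed_zone = timezone(timedelta(hours=2))
as_local = wall.replace(tzinfo=assumed_zone)

print(as_utc.isoformat())
print(as_local.astimezone(timezone.utc).isoformat())
print(as_utc - as_local)  # gap between the two readings of the same value
```

The same stored integer ends up denoting two different instants, separated by exactly the offset of whichever zone is assumed under the second interpretation.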


To me, the problem with the "unknown local timezone" interpretation is that it renders the data ambiguous and therefore essentially useless. The situation is very similar to having string data without a known encoding, which Python users remember as the Python 2 encoding hell (to the point that it motivated the heavy and disruptive Python 3 transition).

(Note that the problem is even worse for timestamps. With strings, you can at least detect with a high degree of probability that an arbitrary binary string is *not* UTF8-encoded. You cannot do so with timestamp values: any 64-bit timestamp may or may not be a UTC timestamp. Once that information is lost, you cannot regain it.)
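
As a rough illustration of that asymmetry, here is a short Python sketch. The helper name `looks_like_utf8` and the sample values are made up for this example.

```python
from datetime import datetime, timedelta, timezone


def looks_like_utf8(data: bytes) -> bool:
    """Return False when the bytes cannot be valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Bytes produced with another encoding are often rejected outright.
latin1_bytes = "café".encode("latin-1")
utf8_bytes = "café".encode("utf-8")
print(looks_like_utf8(latin1_bytes))
print(looks_like_utf8(utf8_bytes))

# No comparable check exists for a timestamp: the same integer decodes
# without error both as a UTC instant and as naive wall-clock fields.
value = 1_600_000_000  # hypothetical stored value, in seconds
utc_reading = datetime.fromtimestamp(value, tz=timezone.utc)
wall_reading = datetime(1970, 1, 1) + timedelta(seconds=value)
print(utc_reading.isoformat(), wall_reading.isoformat())
```

The byte string carries enough redundancy to expose a wrong guess, while the integer carries none, so nothing in the value itself can say which reading was intended.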

In any case, I think this must be clarified, first on this mailing list, then by making the spec wording stronger and more prescriptive.

Regards

Antoine.
