On Thu, Jun 3, 2021 at 1:17 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> That is my understanding as well, a timestamp either has a timezone or it
> has not. If it does not have a timezone, it should be presented as is and
> no assumptions can be made about its timezone. In particular, but given two
> fields X and Y, one with a timezone and another without, e.g. it is not
> meaningful to compute X - Y.
>

My understanding is that a timestamp is always UTC. The full spec:

/// Time elapsed from the Unix epoch, 00:00:00.000 on 1 January 1970, excluding
/// leap seconds, as a 64-bit integer. Note that UNIX time does not include
/// leap seconds.
///
/// The Timestamp metadata supports both "time zone naive" and "time zone
/// aware" timestamps. Read about the timezone attribute for more detail
table Timestamp {
  unit: TimeUnit;

  /// The time zone is a string indicating the name of a time zone, one of:
  ///
  /// * As used in the Olson time zone database (the "tz database" or
  ///   "tzdata"), such as "America/New_York"
  /// * An absolute time zone offset of the form +XX:XX or -XX:XX, such as +07:30
  ///
  /// Whether a timezone string is present indicates different semantics about
  /// the data:
  ///
  /// * If the time zone is null or equal to an empty string, the data is "time
  ///   zone naive" and shall be displayed *as is* to the user, not localized
  ///   to the locale of the user. This data can be thought of as UTC but
  ///   without having "UTC" as the time zone, it is not considered to be
  ///   localized to any time zone
  ///
  /// * If the time zone is set to a valid value, values can be displayed as
  ///   "localized" to that time zone, even though the underlying 64-bit
  ///   integers are identical to the same data stored in UTC. Converting
  ///   between time zones is a metadata-only operation and does not change the
  ///   underlying values
  timezone: string;
}

I *think*, from that description, that "naive timestamp" is UTC. I think
that because *the spec literally tells me I can think of it that way*. So
the way I see it, "X [timezone-naive] - Y [any timezone]" is a valid
computation.

I know that "can be thought of" isn't very strong wording. But there's only
one UNIX epoch, so I don't think the spec will let me think of it as
anything else :).
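
To make the subtraction concrete, here is a rough Python sketch (standard
library only, not Arrow's API). The names Timestamp and diff_seconds are made
up for illustration, and I'm assuming a unit of seconds. Under my reading, the
arithmetic happens on the epoch-based integers and never consults the
timezone field:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Timestamp:
    """Hypothetical stand-in for an Arrow timestamp value (unit: seconds)."""
    value: int                # seconds since the Unix epoch
    timezone: Optional[str]   # None or "" means "time zone naive"


def diff_seconds(x: Timestamp, y: Timestamp) -> int:
    """Subtract two timestamps using only the stored integers."""
    return x.value - y.value


x = Timestamp(1622740620, None)                # time zone naive
y = Timestamp(1622737020, "America/New_York")  # time zone aware
print(diff_seconds(x, y))  # 3600
```

If the naive value were *not* counted from the one UNIX epoch, this
subtraction would be meaningless, which is Jorge's position.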

Good thing, too. Because storing a datetime is very different from storing
a timestamp. When you store a datetime (SQL-style), you have two timezones:

1. The timezone in which the datetime is *stored*
2. The timezone in which the datetime is *displayed*

It would be strange for the spec to imply, "displayed NOT NULL means stored
= UTC" and "displayed NULL means stored = NULL" because that would prevent
Arrow from handling the most important combination: "stored = UTC,
displayed = NULL".

TL;DR by my reading, "timezone" is merely for presentation, and its value
(or null-ness) doesn't alter the data at all.
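
A small sketch of that reading, again in plain Python rather than Arrow. The
render function is hypothetical, and a fixed UTC offset stands in for the
timezone field (the spec's "+XX:XX" form). The stored integer is the same in
every call; only the rendering changes:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def render(value: int, offset: Optional[timedelta]) -> str:
    """Format epoch seconds; `offset` plays the role of the timezone field."""
    utc = datetime.fromtimestamp(value, tz=timezone.utc)
    if offset is None:
        # Time zone naive: show the value as is, with no zone label.
        return utc.strftime("%Y-%m-%d %H:%M:%S")
    # Time zone aware: same instant, localized for display only.
    return utc.astimezone(timezone(offset)).strftime("%Y-%m-%d %H:%M:%S%z")


stored = 1622740620  # the integer never changes below
print(render(stored, None))                            # 2021-06-03 17:17:00
print(render(stored, timedelta(0)))                    # 2021-06-03 17:17:00+0000
print(render(stored, timedelta(hours=7, minutes=30)))  # 2021-06-04 00:47:00+0730
```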

> Maybe a way of framing: given an uint64 X in parquet with
> "isAdjustedToUTC=true", which arrow's datatype and value Y should it
> correspond to? A timestamp with or without a timezone? I have so far
> understood that "isAdjustedToUTC=false" corresponds to no timezone in
> arrow, and "isAdjustedToUTC=true" corresponds to "+00:00". (But maybe this
> is incorrect?)
>

I understand isAdjustedToUTC=true to mean "timestamp", and
isAdjustedToUTC=false to mean, "int64 and I hope somebody attached some
docs because
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#local-semantics-timestamps-not-normalized-to-utc
lists a whole slew of potential meanings and without extra metadata I'll
never be able to figure out what this column means."

Enjoy life,
Adam

-- 
Adam Hooper
+1-514-882-9694
http://adamhooper.com
