On Thu, Jun 3, 2021 at 1:17 PM Jorge Cardoso Leitão < jorgecarlei...@gmail.com> wrote:
> That is my understanding as well, a timestamp either has a timezone or it
> has not. If it does not have a timezone, it should be presented as is and
> no assumptions can be made about its timezone. In particular, but given two
> fields X and Y, one with a timezone and another without, e.g. it is not
> meaningful to compute X - Y.
>

My understanding is, timestamp is always UTC. The full spec:

    /// Time elapsed from the Unix epoch, 00:00:00.000 on 1 January 1970, excluding
    /// leap seconds, as a 64-bit integer. Note that UNIX time does not include
    /// leap seconds.
    ///
    /// The Timestamp metadata supports both "time zone naive" and "time zone
    /// aware" timestamps. Read about the timezone attribute for more detail
    table Timestamp {
      unit: TimeUnit;

      /// The time zone is a string indicating the name of a time zone, one of:
      ///
      /// * As used in the Olson time zone database (the "tz database" or
      ///   "tzdata"), such as "America/New_York"
      /// * An absolute time zone offset of the form +XX:XX or -XX:XX, such as +07:30
      ///
      /// Whether a timezone string is present indicates different semantics about
      /// the data:
      ///
      /// * If the time zone is null or equal to an empty string, the data is "time
      ///   zone naive" and shall be displayed *as is* to the user, not localized
      ///   to the locale of the user. This data can be though of as UTC but
      ///   without having "UTC" as the time zone, it is not considered to be
      ///   localized to any time zone
      ///
      /// * If the time zone is set to a valid value, values can be displayed as
      ///   "localized" to that time zone, even though the underlying 64-bit
      ///   integers are identical to the same data stored in UTC. Converting
      ///   between time zones is a metadata-only operation and does not change the
      ///   underlying values
      timezone: string;
    }

I *think*, from that description, that "naive timestamp" is UTC. I think
that because *the spec literally tells me I can think of it that way*.
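A minimal sketch of that reading, in plain Python rather than any Arrow API. The names `render` and `stored` are made up for illustration, and the unit is assumed to be milliseconds. The stored integer never changes; only the timezone metadata passed to `render` decides how it is shown.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def render(value_ms, tz_offset=None):
    """Display a stored int64 (assumed: milliseconds since the Unix epoch).

    tz_offset=None models "time zone naive": shown as is, with no zone label.
    Otherwise tz_offset is a timedelta such as +07:30 and only affects display.
    """
    instant = EPOCH + timedelta(milliseconds=value_ms)
    if tz_offset is None:
        return instant.replace(tzinfo=None).isoformat()
    return instant.astimezone(timezone(tz_offset)).isoformat()

stored = 1622740620000  # one integer, three presentations

print(render(stored))                                  # 2021-06-03T17:17:00
print(render(stored, timedelta(0)))                    # 2021-06-03T17:17:00+00:00
print(render(stored, timedelta(hours=7, minutes=30)))  # 2021-06-04T00:47:00+07:30
```

Under this interpretation, the naive rendering and the "+00:00" rendering differ only by the zone label, which is why the underlying integers stay directly comparable.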
So the way I see it, "X [timezone-naive] - Y [any timezone]" is a valid
computation. I know that "can be thought of" isn't very strong wording. But
there's only one UNIX epoch, so I don't think the spec will let me think of
it as anything else :).

Good thing, too. Because storing a datetime is very different from storing
a timestamp. When you store a datetime (SQL-style), you have two timezones:

1. The timezone in which the datetime is *stored*
2. The timezone in which the datetime is *displayed*

It would be strange for the spec to imply, "displayed NOT NULL means stored
= UTC" and "displayed NULL means stored = NULL" because that would prevent
Arrow from handling the most important combination: "stored = UTC,
displayed = NULL".

TL;DR by my reading, "timezone" is merely for presentation, and its value
(or null-ness) doesn't alter the data at all.

> Maybe a way of framing: given an uint64 X in parquet with
> "isAdjustedToUTC=true", which arrow's datatype and value Y should it
> correspond to? A timestamp with or without a timezone? I have so far
> understood that "isAdjustedToUTC=false" corresponds to no timezone in
> arrow, and "isAdjustedToUTC=true" corresponds to "+00:00". (But maybe this
> is incorrect?)
>

I understand isAdjustedToUTC=true to mean "timestamp", and
isAdjustedToUTC=false to mean, "int64 and I hope somebody attached some
docs because
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#local-semantics-timestamps-not-normalized-to-utc
lists a whole slew of potential meanings and without extra metadata I'll
never be able to figure out what this column means."

Enjoy life,
Adam

--
Adam Hooper
+1-514-882-9694
http://adamhooper.com