jorisvandenbossche commented on a change in pull request #10997:
URL: https://github.com/apache/arrow/pull/10997#discussion_r696345906



##########
File path: format/Schema.fbs
##########
@@ -214,58 +214,123 @@ table Time {
   bitWidth: int = 32;
 }
 
-/// Time elapsed from the Unix epoch, 00:00:00.000 on 1 January 1970, excluding
-/// leap seconds, as a 64-bit integer. Note that UNIX time does not include
-/// leap seconds.
+/// Timestamp is a 64-bit signed integer representing an elapsed time since a
+/// fixed epoch, stored in either of four units: seconds, milliseconds,
+/// microseconds or nanoseconds, and is optionally annotated with a timezone.
+///
+/// Timestamp values do not include any leap seconds (in other words, all
+/// days are considered 86400 seconds long).
+///
+/// Timestamps with a non-empty timezone
+/// ------------------------------------
+///
+/// If a Timestamp column has a non-empty timezone value, its epoch is
+/// 1970-01-01 00:00:00 (January 1st 1970, midnight) in the *UTC* timezone
+/// (the Unix epoch), regardless of the Timestamp's own timezone.
+///
+/// Therefore, timestamp values with a non-empty timezone correspond to
+/// physical points in time together with some additional information about
+/// how the data was obtained and/or how to display it (the timezone).
+///
+///   For example, the timestamp value 0 with the timezone string "Europe/Paris"
+///   corresponds to "January 1st 1970, 00h00" in the UTC timezone, but could
+///   also be displayed as "January 1st 1970, 01h00" in the Europe/Paris timezone
+///   (which is the same physical point in time).
+///
+/// One consequence is that timestamp values with a non-empty timezone
+/// can be compared and ordered directly, since they all share the same
+/// well-known point of reference (the Unix epoch).
+///
+/// Timestamps with an unset / empty timezone
+/// -----------------------------------------
+///
+/// If a Timestamp column has no timezone value, its epoch is
+/// 1970-01-01 00:00:00 (January 1st 1970, midnight) in an *unknown* timezone.
+///
+/// Therefore, timestamp values without a timezone cannot be meaningfully
+/// interpreted as physical points in time, but only as calendar / clock
+/// indications ("wall clock time") in an unspecified timezone.
+///
+///   For example, the timestamp value 0 with an empty timezone string
+///   corresponds to "January 1st 1970, 00h00" in an unknown timezone: there
+///   is not enough information to interpret it as a well-defined physical
+///   point in time.
+///
+/// One consequence is that timestamp values without an timezone cannot
+/// be reliably compared or ordered, since they may have different points of
+/// reference.  In particular, it is *not* possible to interpret an unset
+/// or empty timezone as the same as "UTC".
+///
+/// Conversion between timezones
+/// ----------------------------
+///
+/// If a Timestamp column has a non-empty timezone, changing the timezone
+/// to a different non-empty value is a metadata-only operation:
+/// the timestamp values need not change as their point of reference remains
+/// the same (the Unix epoch).
+///
+/// However, if a Timestamp column has no timezone value, changing it to a
+/// non-empty value requires to think about the desired semantics.
+/// One possibility is to assume that the original timestamp values are
+/// relative to the epoch of the timezone being set; timestamp values should
+/// then be "unlocalized" by adjusting them to the Unix epoch

Review comment:
       ```suggestion
   /// then be adjusted to the Unix epoch
   ```
   
   (just one possible suggestion)
   
   In the PR adding a kernel for this operation, one of the proposed names is "localize" rather than "unlocalize", so I would avoid using the term here.
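
To make the operation under discussion concrete, here is a minimal sketch of the adjustment described in the quoted doc comment. It uses only the Python standard library; `attach_timezone` is a hypothetical helper name, not the Arrow kernel, and the fixed +01:00 offset is an assumption standing in for "Europe/Paris" in January 1970.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed +01:00 offset stands in for "Europe/Paris" in January 1970.
PARIS_1970 = timezone(timedelta(hours=1), "Europe/Paris")
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def attach_timezone(naive_seconds: int, tz: timezone) -> int:
    """Hypothetical helper: read a timezone-less value as wall clock time
    in `tz` and return the matching number of seconds since the Unix epoch."""
    # The timezone-less value counts seconds from "1970-01-01 00:00" on the wall clock.
    wall_clock = datetime(1970, 1, 1) + timedelta(seconds=naive_seconds)
    # Attach the timezone without changing the wall clock reading.
    aware = wall_clock.replace(tzinfo=tz)
    # Subtracting aware datetimes accounts for the UTC offset.
    return int((aware - UNIX_EPOCH).total_seconds())


adjusted = attach_timezone(0, PARIS_1970)
print(adjusted)  # -3600: midnight in Paris is 23:00 UTC on the previous day
```

Whatever name is chosen, the values move by the zone's UTC offset while the wall clock reading stays the same.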

##########
File path: format/Schema.fbs
##########
@@ -214,58 +214,123 @@ table Time {
   bitWidth: int = 32;
 }
 
-/// Time elapsed from the Unix epoch, 00:00:00.000 on 1 January 1970, excluding
-/// leap seconds, as a 64-bit integer. Note that UNIX time does not include
-/// leap seconds.
+/// Timestamp is a 64-bit signed integer representing an elapsed time since a
+/// fixed epoch, stored in either of four units: seconds, milliseconds,
+/// microseconds or nanoseconds, and is optionally annotated with a timezone.
+///
+/// Timestamp values do not include any leap seconds (in other words, all
+/// days are considered 86400 seconds long).
+///
+/// Timestamps with a non-empty timezone
+/// ------------------------------------
+///
+/// If a Timestamp column has a non-empty timezone value, its epoch is
+/// 1970-01-01 00:00:00 (January 1st 1970, midnight) in the *UTC* timezone
+/// (the Unix epoch), regardless of the Timestamp's own timezone.
+///
+/// Therefore, timestamp values with a non-empty timezone correspond to
+/// physical points in time together with some additional information about
+/// how the data was obtained and/or how to display it (the timezone).
+///
+///   For example, the timestamp value 0 with the timezone string "Europe/Paris"
+///   corresponds to "January 1st 1970, 00h00" in the UTC timezone, but could
+///   also be displayed as "January 1st 1970, 01h00" in the Europe/Paris timezone
+///   (which is the same physical point in time).
+///
+/// One consequence is that timestamp values with a non-empty timezone
+/// can be compared and ordered directly, since they all share the same
+/// well-known point of reference (the Unix epoch).
+///
+/// Timestamps with an unset / empty timezone
+/// -----------------------------------------
+///
+/// If a Timestamp column has no timezone value, its epoch is
+/// 1970-01-01 00:00:00 (January 1st 1970, midnight) in an *unknown* timezone.
+///
+/// Therefore, timestamp values without a timezone cannot be meaningfully
+/// interpreted as physical points in time, but only as calendar / clock
+/// indications ("wall clock time") in an unspecified timezone.
+///
+///   For example, the timestamp value 0 with an empty timezone string
+///   corresponds to "January 1st 1970, 00h00" in an unknown timezone: there
+///   is not enough information to interpret it as a well-defined physical
+///   point in time.
+///
+/// One consequence is that timestamp values without an timezone cannot

Review comment:
       ```suggestion
   /// One consequence is that timestamp values without a timezone cannot
   ```

##########
File path: format/Schema.fbs
##########
@@ -214,58 +214,123 @@ table Time {
   bitWidth: int = 32;
 }
 
-/// Time elapsed from the Unix epoch, 00:00:00.000 on 1 January 1970, excluding
-/// leap seconds, as a 64-bit integer. Note that UNIX time does not include
-/// leap seconds.
+/// Timestamp is a 64-bit signed integer representing an elapsed time since a
+/// fixed epoch, stored in either of four units: seconds, milliseconds,
+/// microseconds or nanoseconds, and is optionally annotated with a timezone.
+///
+/// Timestamp values do not include any leap seconds (in other words, all
+/// days are considered 86400 seconds long).
+///
+/// Timestamps with a non-empty timezone
+/// ------------------------------------
+///
+/// If a Timestamp column has a non-empty timezone value, its epoch is
+/// 1970-01-01 00:00:00 (January 1st 1970, midnight) in the *UTC* timezone
+/// (the Unix epoch), regardless of the Timestamp's own timezone.
+///
+/// Therefore, timestamp values with a non-empty timezone correspond to
+/// physical points in time together with some additional information about
+/// how the data was obtained and/or how to display it (the timezone).
+///
+///   For example, the timestamp value 0 with the timezone string "Europe/Paris"
+///   corresponds to "January 1st 1970, 00h00" in the UTC timezone, but could
+///   also be displayed as "January 1st 1970, 01h00" in the Europe/Paris timezone
+///   (which is the same physical point in time).
+///
+/// One consequence is that timestamp values with a non-empty timezone
+/// can be compared and ordered directly, since they all share the same
+/// well-known point of reference (the Unix epoch).
+///
+/// Timestamps with an unset / empty timezone
+/// -----------------------------------------
+///
+/// If a Timestamp column has no timezone value, its epoch is
+/// 1970-01-01 00:00:00 (January 1st 1970, midnight) in an *unknown* timezone.
+///
+/// Therefore, timestamp values without a timezone cannot be meaningfully
+/// interpreted as physical points in time, but only as calendar / clock
+/// indications ("wall clock time") in an unspecified timezone.
+///
+///   For example, the timestamp value 0 with an empty timezone string
+///   corresponds to "January 1st 1970, 00h00" in an unknown timezone: there
+///   is not enough information to interpret it as a well-defined physical
+///   point in time.
+///
+/// One consequence is that timestamp values without an timezone cannot
+/// be reliably compared or ordered, since they may have different points of
+/// reference.  In particular, it is *not* possible to interpret an unset
+/// or empty timezone as the same as "UTC".
+///
+/// Conversion between timezones
+/// ----------------------------
+///
+/// If a Timestamp column has a non-empty timezone, changing the timezone
+/// to a different non-empty value is a metadata-only operation:
+/// the timestamp values need not change as their point of reference remains
+/// the same (the Unix epoch).
+///
+/// However, if a Timestamp column has no timezone value, changing it to a
+/// non-empty value requires to think about the desired semantics.
+/// One possibility is to assume that the original timestamp values are
+/// relative to the epoch of the timezone being set; timestamp values should
+/// then be "unlocalized" by adjusting them to the Unix epoch
+/// (for example, changing the timezone from empty to "Europe/Paris" would
+///  require converting the timestamp values from "Europe/Paris" to "UTC",
+///  which seems counter-intuitive but is nevertheless correct).
+///
+/// Guidelines for encoding data from external libraries
+/// ----------------------------------------------------
 ///
 /// Date & time libraries often have multiple different data types for temporal
-/// data.  In order to ease interoperability between different implementations the
+/// data. In order to ease interoperability between different implementations the
 /// Arrow project has some recommendations for encoding these types into a Timestamp
 /// column.
 ///
-/// An "instant" represents a single moment in time that has no meaningful time zone
-/// or the time zone is unknown.  A column of instants can also contain values from
-/// multiple time zones.  To encode an instant set the timezone string to "UTC".
+/// An "instant" represents a physical point in time that has no relevant time zone
+/// (for example, astronomical data). To encode an instant, use a Timestamp with
+/// the timezone string set to "UTC", and make sure the Timestamp values
+/// are relative to the UTC epoch (January 1st 1970, midnight).
+///
+/// A "zoned date-time" represents a physical point in time annotated with an
+/// informative time zone (for example, the time zone in which the data was
+/// recorded).  To encode a zoned date-time, use a Timestamp with the timezone
+/// string set to the name of the timezone, and make sure the Timestamp values
+/// are relative to the UTC epoch (January 1st 1970, midnight).
 ///
-/// A "zoned date-time" represents a single moment in time that has a meaningful
-/// reference time zone.  To encode a zoned date-time as a Timestamp set the timezone
-/// string to the name of the timezone.  There is some ambiguity between an instant
-/// and a zoned date-time with the UTC time zone.  Both of these are stored the same.
-/// Typically, this distinction does not matter.  If it does, then an application should
-/// use custom metadata or an extension type to distinguish between the two cases.
+///  (There is some ambiguity between an instant and a zoned date-time with the
+///   UTC time zone.  Both of these are stored the same in Arrow.  Typically,
+///   this distinction does not matter.  If it does, then an application should
+///   use custom metadata or an extension type to distinguish between the two cases.)
 ///
-/// An "offset date-time" represents a single moment in time combined with a meaningful
-/// offset from UTC.  To encode an offset date-time as a Timestamp set the timezone string
-/// to the numeric time zone offset string (e.g. "+03:00").
+/// An "offset date-time" represents a physical point in time combined with an
+/// explicit offset from UTC.  To encode an offset date-time, use a Timestamp
+/// with the timezone string set to the numeric time zone offset string
+/// (e.g. "+03:00"), and make sure the Timestamp values are relative to
+/// the UTC epoch (January 1st 1970, midnight).
 ///
-/// A "local date-time" does not represent a single moment in time.  It represents a wall
-/// clock time combined with a date.  Because of daylight savings time there may multiple
-/// instants that correspond to a single local date-time in any given time zone.  A
-/// local date-time is often stored as a struct or a Date32/Time64 pair.  However, it can
-/// also be encoded into a Timestamp column.  To do so the value should be the the time
-/// elapsed from the Unix epoch so that a wall clock in UTC would display the desired time.
-/// The timezone string should be set to null or the empty string.
+/// A "naive date-time" represents a wall clock time combined with a calendar

Review comment:
   +1 on using "naive" instead of "local"; I personally find that less ambiguous (though I also have a Python background, where the term "naive" is already used).
   ("Local" could easily be read as "with a *local timezone* attached", which is actually a zoned date-time.)
   
   But maybe it's worth keeping a reference to "local", e.g. `(also known as "local date-time")`, since other systems (e.g. Java, Parquet) use that name.
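
For readers without the Python background mentioned above, this short sketch shows how the standard library's `datetime` draws the naive/aware line. `can_order` is an illustrative helper written for this note, not a library function.

```python
from datetime import datetime, timezone

naive = datetime(1970, 1, 1)                       # no tzinfo: wall clock time only
aware = datetime(1970, 1, 1, tzinfo=timezone.utc)  # a physical point in time


def can_order(a: datetime, b: datetime) -> bool:
    """Illustrative helper: True if `a < b` is well-defined in Python."""
    try:
        a < b
        return True
    except TypeError:
        # Python refuses to order a naive datetime against an aware one.
        return False


print(naive.tzinfo)             # None
print(can_order(naive, aware))  # False
print(can_order(aware, aware))  # True
```

The refusal to order naive against aware values mirrors the point in the doc comment that a timestamp without a timezone is not a well-defined physical point in time.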

##########
File path: format/Schema.fbs
##########
@@ -214,58 +214,123 @@ table Time {
   bitWidth: int = 32;
 }
 
-/// Time elapsed from the Unix epoch, 00:00:00.000 on 1 January 1970, excluding
-/// leap seconds, as a 64-bit integer. Note that UNIX time does not include
-/// leap seconds.
+/// Timestamp is a 64-bit signed integer representing an elapsed time since a
+/// fixed epoch, stored in either of four units: seconds, milliseconds,
+/// microseconds or nanoseconds, and is optionally annotated with a timezone.
+///
+/// Timestamp values do not include any leap seconds (in other words, all
+/// days are considered 86400 seconds long).
+///
+/// Timestamps with a non-empty timezone
+/// ------------------------------------
+///
+/// If a Timestamp column has a non-empty timezone value, its epoch is
+/// 1970-01-01 00:00:00 (January 1st 1970, midnight) in the *UTC* timezone
+/// (the Unix epoch), regardless of the Timestamp's own timezone.
+///
+/// Therefore, timestamp values with a non-empty timezone correspond to
+/// physical points in time together with some additional information about
+/// how the data was obtained and/or how to display it (the timezone).
+///
+///   For example, the timestamp value 0 with the timezone string "Europe/Paris"
+///   corresponds to "January 1st 1970, 00h00" in the UTC timezone, but could
+///   also be displayed as "January 1st 1970, 01h00" in the Europe/Paris timezone
+///   (which is the same physical point in time).

Review comment:
   I am not sure the reference to "data producer" necessarily helps. How the timezone is interpreted / used for display will also typically depend on the application/library you are using. Although it's of course true that in the end it's only the data producer that knows what they meant with the timezone.
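
As an illustration of the display side, here is a minimal sketch of how a consuming application might render the same stored value in two zones. `render` is a hypothetical helper, and the fixed +01:00 offset is an assumption standing in for "Europe/Paris" in January 1970; a real library would consult a timezone database.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed +01:00 offset stands in for "Europe/Paris" in January 1970.
PARIS_1970 = timezone(timedelta(hours=1), "Europe/Paris")
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def render(value_seconds: int, tz: timezone) -> str:
    """Hypothetical helper: format seconds since the Unix epoch for display in `tz`."""
    instant = UNIX_EPOCH + timedelta(seconds=value_seconds)
    # astimezone changes only the presentation, not the physical point in time.
    return instant.astimezone(tz).strftime("%Y-%m-%d %H:%M")


print(render(0, timezone.utc))  # 1970-01-01 00:00
print(render(0, PARIS_1970))    # 1970-01-01 01:00
```

The stored value is identical in both calls; only the zone chosen by the consumer differs.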



