Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
felipecrv commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2546738845 ## docs/source/format/CanonicalExtensions.rst: ## @@ -544,6 +544,49 @@ Primitive Type Mappings | UUID extension type | UUID | +--++ +.. _timestamp_with_offset_extension: + +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. +This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP WITH TIME ZONE``, which is supported by multiple database engines. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a preferably non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a preferably non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). + +* Extension type parameters: + + * ``time_unit``: the time-unit of each of the stored UTC timestamps. + +* Description of the serialization: + + Extension metadata is an empty string. + +.. note:: + + It is also *permissible* for the ``offset_minutes`` field to be dictionary-encoded or run-end-encoded. + +.. note:: + + It is also *permissible* for ``timestamp`` and ``offset_minutes`` to be nullable, but not recommended. + + If ``timestamp`` is nullable and a value is found to be null, then the whole ``TimestampWithOffset`` value should be interpreted as null. One way of achieving this is to drop ``timestamp``'s validity buffer (V1) and replace the top-level struct's validity buffer (V2) with the result of ``V1 AND V2``. + + If ``offset_minutes`` is nullable and a value is found to be null, then this value should be interpreted as if the offset value were were zero. Review Comment: Using the validity buffer of the `timestamp` field for both the struct and the field itself is good for both *producers* and *consumers*. A producer doesn't have to set default values for the nulls. And consumers don't have to look at both validity buffers for anything they do. Sharing the buffer (if it exists at all) is just a ref-count bump on the buffer. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
pitrou commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2546706495 ## docs/source/format/CanonicalExtensions.rst: ## @@ -544,6 +544,49 @@ Primitive Type Mappings | UUID extension type | UUID | +--++ +.. _timestamp_with_offset_extension: + +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. +This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP WITH TIME ZONE``, which is supported by multiple database engines. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a preferably non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a preferably non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). + +* Extension type parameters: + + * ``time_unit``: the time-unit of each of the stored UTC timestamps. + +* Description of the serialization: + + Extension metadata is an empty string. + +.. note:: + + It is also *permissible* for the ``offset_minutes`` field to be dictionary-encoded or run-end-encoded. + +.. note:: + + It is also *permissible* for ``timestamp`` and ``offset_minutes`` to be nullable, but not recommended. + + If ``timestamp`` is nullable and a value is found to be null, then the whole ``TimestampWithOffset`` value should be interpreted as null. One way of achieving this is to drop ``timestamp``'s validity buffer (V1) and replace the top-level struct's validity buffer (V2) with the result of ``V1 AND V2``. + + If ``offset_minutes`` is nullable and a value is found to be null, then this value should be interpreted as if the offset value were were zero. + + It is *recommended* that implementations normalize this type's representation by dropping the inner validity buffers and applying the aforementioned transformations, only keeping the top-level struct's validity buffer. + +.. note:: Review Comment: I think it doesn't belong indeed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
felipecrv commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2546691725 ## docs/source/format/CanonicalExtensions.rst: ## @@ -544,6 +544,49 @@ Primitive Type Mappings | UUID extension type | UUID | +--++ +.. _timestamp_with_offset_extension: + +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. +This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP WITH TIME ZONE``, which is supported by multiple database engines. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a preferably non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a preferably non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). + +* Extension type parameters: + + * ``time_unit``: the time-unit of each of the stored UTC timestamps. + +* Description of the serialization: + + Extension metadata is an empty string. + +.. note:: + + It is also *permissible* for the ``offset_minutes`` field to be dictionary-encoded or run-end-encoded. + +.. note:: + + It is also *permissible* for ``timestamp`` and ``offset_minutes`` to be nullable, but not recommended. + + If ``timestamp`` is nullable and a value is found to be null, then the whole ``TimestampWithOffset`` value should be interpreted as null. One way of achieving this is to drop ``timestamp``'s validity buffer (V1) and replace the top-level struct's validity buffer (V2) with the result of ``V1 AND V2``. + + If ``offset_minutes`` is nullable and a value is found to be null, then this value should be interpreted as if the offset value were were zero. Review Comment: >> And return an error on instantiation? > > Which instantiation are we talking about exactly? The ExtensionArray subclass in Arrow C++? In Arrow implementations for any language. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
serramatutu commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2546657710
##
docs/source/format/CanonicalExtensions.rst:
##
@@ -544,6 +544,49 @@ Primitive Type Mappings
| UUID extension type | UUID |
+--++
+.. _timestamp_with_offset_extension:
+
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different
timezone offsets per value. The timestamp is stored in UTC alongside the
original timezone offset in minutes.
+This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP
WITH TIME ZONE``, which is supported by multiple database engines.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+ * ``timestamp``: a preferably non-nullable ``Timestamp(time_unit, "UTC")``,
where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+ * ``offset_minutes``: a preferably non-nullable signed 16-bit integer
(``Int16``) representing the offset in minutes from the UTC timezone. Negative
offsets represent time zones west of UTC, while positive offsets represent
east. Offsets range from -779 (-12:59) to +780 (+13:00).
+
+* Extension type parameters:
+
+ * ``time_unit``: the time-unit of each of the stored UTC timestamps.
+
+* Description of the serialization:
+
+ Extension metadata is an empty string.
+
+.. note::
+
+ It is also *permissible* for the ``offset_minutes`` field to be
dictionary-encoded or run-end-encoded.
+
+.. note::
+
+ It is also *permissible* for ``timestamp`` and ``offset_minutes`` to be
nullable, but not recommended.
+
+ If ``timestamp`` is nullable and a value is found to be null, then the
whole ``TimestampWithOffset`` value should be interpreted as null. One way of
achieving this is to drop ``timestamp``'s validity buffer (V1) and replace the
top-level struct's validity buffer (V2) with the result of ``V1 AND V2``.
+
+ If ``offset_minutes`` is nullable and a value is found to be null, then
this value should be interpreted as if the offset value were were zero.
+
+ It is *recommended* that implementations normalize this type's
representation by dropping the inner validity buffers and applying the
aforementioned transformations, only keeping the top-level struct's validity
buffer.
+
+.. note::
Review Comment:
I added this mostly to make integrating to Arrow via JSON more ergonomic so
people can do stuff like `time.Parse()` in Go or `datetime.fromisoformat()` in
Python, instead of having to manually parse something like `{"timestamp":
"...", "offset": 123}` to get timezone-aware objects in their language.
I can take this off of the spec if you think it doesn't belong here though.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
pitrou commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2546650112 ## docs/source/format/CanonicalExtensions.rst: ## @@ -544,6 +544,49 @@ Primitive Type Mappings | UUID extension type | UUID | +--++ +.. _timestamp_with_offset_extension: + +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. +This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP WITH TIME ZONE``, which is supported by multiple database engines. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a preferably non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a preferably non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). + +* Extension type parameters: + + * ``time_unit``: the time-unit of each of the stored UTC timestamps. + +* Description of the serialization: + + Extension metadata is an empty string. + +.. note:: + + It is also *permissible* for the ``offset_minutes`` field to be dictionary-encoded or run-end-encoded. + +.. note:: + + It is also *permissible* for ``timestamp`` and ``offset_minutes`` to be nullable, but not recommended. + + If ``timestamp`` is nullable and a value is found to be null, then the whole ``TimestampWithOffset`` value should be interpreted as null. One way of achieving this is to drop ``timestamp``'s validity buffer (V1) and replace the top-level struct's validity buffer (V2) with the result of ``V1 AND V2``. + + If ``offset_minutes`` is nullable and a value is found to be null, then this value should be interpreted as if the offset value were were zero. Review Comment: > And return an error on instantiation? Which instantiation are we talking about exactly? The `ExtensionArray` subclass in Arrow C++? > I think we can allow nulls on the `timestamp` field and on instantiation put it on the struct array also. > > Would you be fine with both points above? We can, but that places a burden on the consumer for the sake of making things easier for the producer. I'm not sure that's a good tradeoff to make. > I think many operations will care only about the timestamp field and having the validity buffer there helps them simply delegate to functions on the `Timestamp` type. At the same time, many function only look at the top-level validity buffer (i.e. the validity of the struct). So during instantiation it would make sense to share the validity buffer if there is one. I admit that can make sense. Let's hope for other opinions on the topic. @paleolimbot @alamb @tustvold -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
felipecrv commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2546637086 ## docs/source/format/CanonicalExtensions.rst: ## @@ -544,6 +544,49 @@ Primitive Type Mappings | UUID extension type | UUID | +--++ +.. _timestamp_with_offset_extension: + +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. +This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP WITH TIME ZONE``, which is supported by multiple database engines. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a preferably non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a preferably non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). + +* Extension type parameters: + + * ``time_unit``: the time-unit of each of the stored UTC timestamps. + +* Description of the serialization: + + Extension metadata is an empty string. + +.. note:: + + It is also *permissible* for the ``offset_minutes`` field to be dictionary-encoded or run-end-encoded. + +.. note:: + + It is also *permissible* for ``timestamp`` and ``offset_minutes`` to be nullable, but not recommended. + + If ``timestamp`` is nullable and a value is found to be null, then the whole ``TimestampWithOffset`` value should be interpreted as null. One way of achieving this is to drop ``timestamp``'s validity buffer (V1) and replace the top-level struct's validity buffer (V2) with the result of ``V1 AND V2``. + + If ``offset_minutes`` is nullable and a value is found to be null, then this value should be interpreted as if the offset value were were zero. Review Comment: I think many operations will care only about the timestamp field and having the validity buffer there helps them simply delegate to functions on the `Timestamp` type. At the same time, many function only look at the top-level validity buffer (i.e. the validity of the struct). So during instantiation it would make sense to share the validity buffer if there is one. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
felipecrv commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2546631460 ## docs/source/format/CanonicalExtensions.rst: ## @@ -544,6 +544,49 @@ Primitive Type Mappings | UUID extension type | UUID | +--++ +.. _timestamp_with_offset_extension: + +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. +This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP WITH TIME ZONE``, which is supported by multiple database engines. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a preferably non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a preferably non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). + +* Extension type parameters: + + * ``time_unit``: the time-unit of each of the stored UTC timestamps. + +* Description of the serialization: + + Extension metadata is an empty string. + +.. note:: + + It is also *permissible* for the ``offset_minutes`` field to be dictionary-encoded or run-end-encoded. + +.. note:: + + It is also *permissible* for ``timestamp`` and ``offset_minutes`` to be nullable, but not recommended. + + If ``timestamp`` is nullable and a value is found to be null, then the whole ``TimestampWithOffset`` value should be interpreted as null. One way of achieving this is to drop ``timestamp``'s validity buffer (V1) and replace the top-level struct's validity buffer (V2) with the result of ``V1 AND V2``. + + If ``offset_minutes`` is nullable and a value is found to be null, then this value should be interpreted as if the offset value were were zero. Review Comment: And return an error on instantiation? I think we can allow nulls on the `timestamp` field and on instantiation put it on the struct array also. Would you be fine with both points above? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
pitrou commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2546556204 ## docs/source/format/CanonicalExtensions.rst: ## @@ -544,6 +544,49 @@ Primitive Type Mappings | UUID extension type | UUID | +--++ +.. _timestamp_with_offset_extension: + +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. +This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP WITH TIME ZONE``, which is supported by multiple database engines. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a preferably non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a preferably non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). + +* Extension type parameters: + + * ``time_unit``: the time-unit of each of the stored UTC timestamps. + +* Description of the serialization: + + Extension metadata is an empty string. + +.. note:: + + It is also *permissible* for the ``offset_minutes`` field to be dictionary-encoded or run-end-encoded. + +.. note:: + + It is also *permissible* for ``timestamp`` and ``offset_minutes`` to be nullable, but not recommended. + + If ``timestamp`` is nullable and a value is found to be null, then the whole ``TimestampWithOffset`` value should be interpreted as null. One way of achieving this is to drop ``timestamp``'s validity buffer (V1) and replace the top-level struct's validity buffer (V2) with the result of ``V1 AND V2``. + + If ``offset_minutes`` is nullable and a value is found to be null, then this value should be interpreted as if the offset value were were zero. + + It is *recommended* that implementations normalize this type's representation by dropping the inner validity buffers and applying the aforementioned transformations, only keeping the top-level struct's validity buffer. + +.. note:: Review Comment: This seems slightly out of scope. We don't have any similar wording for built-in Arrow types such as Timestamp. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
pitrou commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2546553261 ## docs/source/format/CanonicalExtensions.rst: ## @@ -544,6 +544,49 @@ Primitive Type Mappings | UUID extension type | UUID | +--++ +.. _timestamp_with_offset_extension: + +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. +This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP WITH TIME ZONE``, which is supported by multiple database engines. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a preferably non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a preferably non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). + +* Extension type parameters: + + * ``time_unit``: the time-unit of each of the stored UTC timestamps. + +* Description of the serialization: + + Extension metadata is an empty string. + +.. note:: + + It is also *permissible* for the ``offset_minutes`` field to be dictionary-encoded or run-end-encoded. + +.. note:: + + It is also *permissible* for ``timestamp`` and ``offset_minutes`` to be nullable, but not recommended. + + If ``timestamp`` is nullable and a value is found to be null, then the whole ``TimestampWithOffset`` value should be interpreted as null. One way of achieving this is to drop ``timestamp``'s validity buffer (V1) and replace the top-level struct's validity buffer (V2) with the result of ``V1 AND V2``. + + If ``offset_minutes`` is nullable and a value is found to be null, then this value should be interpreted as if the offset value were were zero. Review Comment: This makes consumption of these data more complicated. Why not mandate that struct fields are always non-null? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
pitrou commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2546541747 ## docs/source/format/CanonicalExtensions.rst: ## @@ -544,6 +544,49 @@ Primitive Type Mappings | UUID extension type | UUID | +--++ +.. _timestamp_with_offset_extension: + +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. +This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP WITH TIME ZONE``, which is supported by multiple database engines. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a preferably non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a preferably non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). + +* Extension type parameters: + + * ``time_unit``: the time-unit of each of the stored UTC timestamps. Review Comment: This is not an extension type parameter, it's a parameter of the storage type. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
rok commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2545537567 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). Review Comment: Sorry for a late reply. Yes, I think this is good, let's close the thread. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
serramatutu commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2544923839 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). Review Comment: Hey all! To move this forward, I: 1. Changed the wording from `... non-nullable ...` to `... preferably non-nullable ...` 2. Added a new note section outlining the expected semantics when the inner fields do have a validity buffer. This should be enough to avoid implementations drifting and having their own freestyle interpretation. 3. Added a recommendation that implementations _should_ (but aren't required to) normalize the type by dropping the inner validity buffers and only keep the top-level one. This should make it easier for compute kernels to interpret this type correctly without having to implement the logic themselves. Let me know if this is enough to resolve this thread! https://github.com/apache/arrow/pull/48002/commits/b0d9be3be93098e7932236b1e5aa921d3d1c8469 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
serramatutu commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2544923839 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). Review Comment: Hey all! To move this forward, I: 1. Changed the wording from `... non-nullable ...` to `... preferably non-nullable ...` 2. Added a new note section outlining the expected semantics when the inner fields do have a validity buffer. This should be enough to avoid implementations drifting and having their own freestyle interpretation. 3. Added a recommendation that implementations _should_ (but aren't required to) normalize the type by dropping the inner validity buffers and only keeping the top-level one. This should make it easier for compute kernels to interpret this type correctly without having to implement the logic themselves. Let me know if this is enough to resolve this thread! https://github.com/apache/arrow/pull/48002/commits/b0d9be3be93098e7932236b1e5aa921d3d1c8469 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
serramatutu commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2544923839 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). Review Comment: Hey all! To move this forward, I: 1. Changed the wording from `... non-nullable ...` to `... preferably non-nullable ...` 2. Added a new note section outlining the expected semantics when the inner fields do have a validity buffer 3. Added a recommendation that implementations _should_ (but aren't required to) normalize the type by dropping the inner validity buffers and only keeping the top-level one. This should make it easier for compute kernels to interpret this type correctly without having to implement the logic themselves. Let me know if this is enough to resolve this thread! https://github.com/apache/arrow/pull/48002/commits/b0d9be3be93098e7932236b1e5aa921d3d1c8469 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
serramatutu commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2544923839 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). Review Comment: Hey all! To move this forward, I: 1. Changed the wording from `... non-nullable ...` to `... preferably non-nullable ...` 2. Added a new note section outlining the expected semantics when the inner fields do have a validity buffer 3. Added a recommendation that implementations _should_ (but aren't required to) normalize the type by dropping the inner validity buffers and only keeping the top-level one. Let me know if this is enough to resolve this thread! https://github.com/apache/arrow/pull/48002/commits/b0d9be3be93098e7932236b1e5aa921d3d1c8469 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
serramatutu commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2544923839 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). Review Comment: Hey all! To move this forward, I: 1. Changed the wording from `... non-nullable ...` to `... preferably non-nullable ...` 2. Added a new note section outlining the expected semantics when the inner fields do have a validity buffer 3. Added a recommendation that implementations _should_ (but aren't required to) normalize the type by dropping the inner validity buffers and only keeping the top-level one. Let me know if this is enough to resolve this thread! https://github.com/apache/arrow/pull/48002/commits/d40c0bb2383ec3fa82bfb6df8e4ee47d82ca152a -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
serramatutu commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2533749555 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). Review Comment: @rok Are you satisfied with the current wording for the spec? I want to keep this moving forward, and so I am wondering if I can close this thread :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
lidavidm commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2508392727 ## docs/source/format/CanonicalExtensions.rst: ## @@ -544,6 +544,39 @@ Primitive Type Mappings | UUID extension type | UUID | +--++ +.. _timestamp_with_offset_extension: + +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. +This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP WITH TIME ZONE``, which is supported by multiple database engines. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). + +* Extension type parameters: + + * ``time_unit``: the time-unit of each of the stored UTC timestamps. + +* Description of the serialization: + + Extension metadata is an empty string. + +.. note:: + + It is also *permissible* for the ``offset_minutes`` field to be dictionary-encoded with a preferred (*but not required*) index type of ``int8``, or run-end-encoded with a preferred (*but not required*) runs type of ``int8``. + +.. note:: + + Although not required, it is *recommended* that implementations represent this type as an RFC3339 string when de/serializing to/from JSON, respecting the ``TimeUnit`` precision and time zone offset without loss of information. For example ``2025-01-01T00:00:00Z`` represents January 1st 2025 in UTC with second precision, and ``2025-01-01T00:00:00.1-07:00`` represents one nanosecond after January 1st 2025 in UTC-07. + + The rationale behind this recommendation is that many programming languages provide support for parsing RFC3339 out of the box, facilitating consumption of timezone-aware JSON-encoded Arrow arrays without extra boilerplate just for integrating with Arrow. Review Comment: ```suggestion .. note:: Although not required, it is *recommended* that implementations represent this type as an RFC3339 string when de/serializing to/from JSON, respecting the ``TimeUnit`` precision and time zone offset without loss of information. For example ``2025-01-01T00:00:00Z`` represents January 1st 2025 in UTC with second precision, and ``2025-01-01T00:00:00.1-07:00`` represents one nanosecond after January 1st 2025 in UTC-07. The rationale behind this recommendation is that many programming languages provide support for parsing RFC3339 out of the box, facilitating consumption of timezone-aware JSON-encoded Arrow arrays without extra boilerplate just for integrating with Arrow. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
rok commented on PR #48002: URL: https://github.com/apache/arrow/pull/48002#issuecomment-3503602930 FWIW, forgot to mention this before in arrow-cpp we have an internal OffsetZone that represents offset in minutes. We define it per array, not per row of course. https://github.com/apache/arrow/blob/cd23a765442bdbaaef43d0e4b239094fb01e37ae/cpp/src/arrow/util/date_internal.h#L29-L33 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
rok commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2503880802 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). Review Comment: One thing we could do with nullable `offset_minutes` field is to have nulls indicate timezone-naive timestamps. As per discussion on the Arrow call - cases with mixed timezone-naive and timezone-aware columns are probably not common, so I'm only bringing this up here for completeness. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
felipecrv commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2503778940 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). Review Comment: @jorisvandenbossche now I understand what you meant. One complexity here is that some compute kernels might want to look at just the UTC timestamp field because they only care about the instant, so we should at least warn/recommend what to do when two bitmaps exist. If we make the spec require that the top-level bitmap can be more selective than the inner bitmap, kernels looking at just the timestamp would be allowed to grab the top-level bitmap and apply it to the processing and the output. @lidavidm I think top-level bitmaps is the best, but inevitably someone will have to make a decision on what to do when more than one bitmap exists and the spec having recommendations could prevent divergence between implementations. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
serramatutu commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2503708486 ## docs/source/format/CanonicalExtensions.rst: ## @@ -544,6 +544,29 @@ Primitive Type Mappings | UUID extension type | UUID | +--++ +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. +This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP WITH TIME ZONE``, which is supported by multiple database engines. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). Review Comment: Resolved: https://github.com/apache/arrow/pull/48002/commits/06ffdf958c34405918300789dd88a1b6ac5684ee -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
serramatutu commented on PR #48002: URL: https://github.com/apache/arrow/pull/48002#issuecomment-3502663004 Hey friends, I pushed 2 new patches, as discussed in the Arrow dev community meeting: 1. Made RFC3339 encoding to/from JSON a _recommendation_ but not a _requirement_, and added a rationale on why this is recommended. Implementations can still provide ways to encode it as JSON objects without violating the spec if they want, though. 2. The `offset_minutes` field now can also be run-end or dictionary encoded for columns with lots of repeating offsets. Next, I need to change the Go and Rust implementation drafts to accept the new supported encodings. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
serramatutu commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2495148508 ## docs/source/format/CanonicalExtensions.rst: ## @@ -544,6 +544,29 @@ Primitive Type Mappings | UUID extension type | UUID | +--++ +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. +This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP WITH TIME ZONE``, which is supported by multiple database engines. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). Review Comment: Actually, my bad. I overlooked a line that says: > (4) The index type of a Dictionary type can only be an integer type, preferably signed, with width 8 to 64 bits. I had interpreted it differently since the only example of dictionary encoding is `int32`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
lidavidm commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2493949264 ## docs/source/format/CanonicalExtensions.rst: ## @@ -544,6 +544,29 @@ Primitive Type Mappings | UUID extension type | UUID | +--++ +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. +This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP WITH TIME ZONE``, which is supported by multiple database engines. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). Review Comment: Any of the integer types possible, so uint8 is perfectly fine. Where do you see it saying that int32 should be used? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
serramatutu commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2493412961 ## docs/source/format/CanonicalExtensions.rst: ## @@ -544,6 +544,29 @@ Primitive Type Mappings | UUID extension type | UUID | +--++ +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. +This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP WITH TIME ZONE``, which is supported by multiple database engines. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). Review Comment: There was a request in the mailing list to add dictionary encoding and run-end encoding to the offset column. I don't see why we wouldn't wanna do run-end encoding, especially for large columns with lots of repeated offsets it could save a lot of space. **Should we add it to the spec already to avoid breaking changes?** For dictionary encoding: is it possible to use `uint8` or possibly something even smaller to represent the dictionary indices? Otherwise it only adds extra abstraction without saving that much space... The docs suggest using `int32` for dictionary encoding which would actually be worse than just using the `int16` offsets directly. We can keep the implementations simple (only primitive encoding for now), and then patch them later to support all the encodings we decide to add to the spec. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
serramatutu commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2493412961 ## docs/source/format/CanonicalExtensions.rst: ## @@ -544,6 +544,29 @@ Primitive Type Mappings | UUID extension type | UUID | +--++ +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. +This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP WITH TIME ZONE``, which is supported by multiple database engines. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). Review Comment: There was a request in the mailing list to add dictionary encoding and run-end encoding to the offset column. I don't see why we wouldn't wanna do run-end encoding, especially for large columns with lots of repeated offsets it could save a lot of space. **Should we add it to the spec already to avoid breaking changes?** For dictionary encoding: is it possible to use `uint8` or possibly something even smaller to represent the dictionary indices? Otherwise it only adds extra abstraction without saving that much space... The docs suggest using `int32` for dictionary encoding which would actually be worse... We can keep the implementations simple (only primitive encoding for now), and then patch them later to support all the encodings we decide to add to the spec. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
serramatutu commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2493412961 ## docs/source/format/CanonicalExtensions.rst: ## @@ -544,6 +544,29 @@ Primitive Type Mappings | UUID extension type | UUID | +--++ +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. +This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP WITH TIME ZONE``, which is supported by multiple database engines. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). Review Comment: There was a request in the mailing list to add dictionary encoding and run-end encoding to the offset column. I don't see why we wouldn't wanna do that, especially for large columns with lots of repeated offsets it could save a lot of space. **Should we add it to the spec already to avoid breaking changes?** We can keep the implementations simple (only primitive encoding for now), and then patch them later to support all the encodings. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
jorisvandenbossche commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2486512426 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). Review Comment: To be clear, I am not arguing for assigning a specific meaning to a certain combination of nullability, but just for allowing the fields to be null as well. For example, we could say that if the element is null (top-level struct validity), the individual fields are allowed to contain a null as well. Of course, when constructing a timestamp with offset from the individual fields, it is relatively straightforward to just drop the validity bitmaps of the individual fields, and ensure a union of both bitmaps is assigned to the struct. (it is just that the current pyarrow APIs don't make this particularly easy .. but that is something we can also improve in the exposed APIs) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
serramatutu commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2486474463 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset Review Comment: Fixed: https://github.com/apache/arrow/pull/48002/commits/080220a36401e6468f381c241c6dae245c76967d -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
serramatutu commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2486475138 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. Review Comment: Fixed: https://github.com/apache/arrow/pull/48002/commits/d8b900fd29603a34670ea2e45ae4f8eb5cbc8432 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
lidavidm commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2485119505 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). Review Comment: I think I prefer the current wording (require nullability to be handled at the struct level) instead of trying to assign semantics to the other combinations. ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. Review Comment: Can we explicitly call out that this is intended to be 1:1 with ANSI SQL? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
serramatutu commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2481948785 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). Review Comment: > Alternatively we could also specify that if one is null, the other should be null as well? Yea, that's more or less what I was thinking about. In principle this type only has meaning if both fields are set. To relax these constraints we'd need to come up with a meaning for what a null timestamp and non-null offset would mean and vice versa. Could be: - If timestamp is set and offset is null, assume `offset=0`, i.e timestamp is UTC - If timestamp is null and offset is set, assume the whole value is null (a standalone offset floating around has no meaning) Or, alternatively: - If any of the fields is null, assume the whole value is null as well The second option has the advantage of being able to and the two bitmasks together to figure out the global nullable buffer, and it reduces branching. ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). Review Comment: > Alternatively we could also specify that if one is null, the other should be null as well? Yea, that's more or less what I was thinking about. In principle this type only has meaning if both fields are set. To relax these constraints we'd need to come up with a meaning for what a null timestamp and non-null offset would mean and vice versa. Could be: - If timestamp is set and offset is null, assume `offset=0`, i.e timestamp is UTC - If timestamp is null and offset is set, assume the whole value is null (a standalone offset floating around has no meaning) Or, alternatively: - If any of the fields is null, assume the whole value is null as well -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
serramatutu commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2481948785 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). Review Comment: > Alternatively we could also specify that if one is null, the other should be null as well? Yea, that's more or less what I was thinking about. In principle this type only has meaning if both fields are set. To relax these constraints we'd need to come up with a meaning for what a null timestamp and non-null offset would mean and vice versa. Could be: - If timestamp is set and offset is null, assume `offset=0`, i.e timestamp is UTC - If timestamp is null and offset is set, assume the whole value is null (a standalone offset floating around has no meaning) Or, alternatively: - If any of the fields is null, assume the whole value is null as well None of these is inherently better or worse IMHO, it's just a matter of standardizing. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
serramatutu commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2481782459 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset Review Comment: Oops, yes will do! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
serramatutu commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2481766888 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). Review Comment: @rok we just sent this to the mailing list yesterday. The [discussion thread](https://lists.apache.org/thread/yhbr3rj9l59yoxv92o2s6dqlop16sfnk) has a more extensive argumentation around why we chose these constraints. > Out of curiosity - would the proposed memory layout of match any existing system? The systems we're referencing are Snowflake, MS SQL Server, Oracle DB and Trino, of which only one of them (Trino) is open source. It's hard to know for a fact what is the internal memory layout of proprietary systems... We do know Oracle and Trino store IANA timezones instead of offsets, so the layout doesn't match there and some Arrow conversion layer would need to resolve the timezone names to offsets. This (resolving offsets on the server) is an explicit choice so that consumer systems don't need to mess with the IANA database or reasoning about daylight savings etc. Arrow consumers just get the offset, add it to the timestamp and voila you have the original timestamp in the original timezone. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
serramatutu commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2481766888 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). Review Comment: @rok we just sent this to the mailing list yesterday. The [discussion thread](https://lists.apache.org/thread/yhbr3rj9l59yoxv92o2s6dqlop16sfnk) has a more extensive argumentation around why we chose these constraints. > Out of curiosity - would the proposed memory layout of match any existing system? The systems we're referencing are Snowflake, MS SQL Server, Oracle DB and Trino, of which only one of them (Trino) is open source. It's hard to know for a fact what is the internal memory layout of proprietary systems... We do know Oracle and Trino store IANA timezones instead of offsets, so the layout doesn't match there and some Arrow conversion layer would need to resolve the timezone names to offsets. This (resolving offsets on the server) is an explicit choice so that consumer systems don't need to mess with the IANA database or reasoning about daylight savings etc. Arrow consumers just get the offset, add it to the timestamp and voila. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
serramatutu commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2481766888 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). Review Comment: @rok we just sent this to the mailing list yesterday. The [discussion thread](https://lists.apache.org/thread/yhbr3rj9l59yoxv92o2s6dqlop16sfnk) has a more extensive argumentation around why we chose these constraints. > Out of curiosity - would the proposed memory layout of match any existing system? The systems we're referencing are Snowflake, MS SQL Server, Oracle DB and Trino, of which only one of them (Trino) is open source. It's hard to know for a fact what is the internal memory layout of proprietary systems... We do know Oracle and Trino store IANA timezones instead of offsets, so the layout doesn't match there and some Arrow conversion layer would need to resolve the timezone names to offsets. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
rok commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2481129311 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). Review Comment: @serramatutu aligning with ANSI SQL seems like a good idea (and doesn't create a new convention), perhaps we could state this in the docs? Out of curiosity - would the proposed memory layout of match any existing system? Hey @felipecrv ! I was thinking about int8 for 15 min offset blocks as well, but I'm not sure it's worth it. Politically I would not expect new sub-60 minutes offsets. But ANSI SQL does seem safer. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
felipecrv commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2480139536 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). Review Comment: Interesting points! We should expand the spec text here and clarify expectations. Since I can see many operations on this array not caring about the two fields, having a validity buffer on the timestamp field could be a simplification in these cases. It would reduce the risk of computation being performed on garbage values if the struct's validity bitmap is being ignored. But a top-level validity buffer is necessary to keep generic code going through columns processing nulls correctly. One way we can adapt to this reality is to make a recommendation against validity on the timestamp field and a warning that even when the offset field is not touched, the validity bitmap of the computation's result should come from the struct validity, or, if both have validity buffers, the & of the two bitmaps. For the offset column we can recommend the absence of validity bitmap as well (non-nullable) but if a value is null, process it as if it were zero. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
felipecrv commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2480139536 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). Review Comment: Interesting points! We should expand the spec text here and clarify expectations. Since I can see many operations on this array not caring about the two fields, having a validity buffer on the timestamp field could be a simplification in these cases. It would reduce the risk of computation being performed on garbage values if the struct's validity bitmap is being ignored. But a top-level validity buffer is necessary to keep generic code going through columns processing nulls correctly. One way we can adapt to this reality is to make a recommendation against validity on the timestamp field and a warning that even when the offset field is not touched, the validity bitmap of the computation's result should come from the struct validity, or, if both have validity buffers, the & of the two bitmaps. For the offset column we can recommend the absence of validity bitmap also (non-nullable) but if it's null, process the array as if it were zero. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
jorisvandenbossche commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2479813864 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). Review Comment: Why explicitly saying that it should be non-nullable? Is that because the nullability can (should) be defined at the struct level, and you want to avoid having an inconsistency between the "timestamp" and "offset_minutes" fields? (e.g. the case where only the "offset_minutes" field would be null for a given row, what does that mean?) I am only not sure how practical this limitation is in practice. For example when creating a struct from its individual fields, typically the fields itself will contain nulls. Alternatively we could also specify that if one is null, the other should be null as well? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
jorisvandenbossche commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2479807607 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset Review Comment: Formatting: can you move this new section below after the "Primitive Type Mappings" section? Because that is a subsection of the "Parquet Variant" section above -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
serramatutu commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2479304385 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). Review Comment: Oh and @rok you're right in saying timezones can go up until +14:00 in the wild, even if that's not standard. Politics is weird... We should maybe take these hard limits off of the format spec. Anyways, I digress. Let's discuss these things in the mailing list. Would love if you chimed in too @rok ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
felipecrv commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2479277740 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). Review Comment: @rok minutes is coarse enough to fit in 16 bits. 15-min blocks would give us the ability of using just 8 bits, but I'm not so comfortable with the promise of the 15-minute convention holding forever everywhere in the planet. And it would create awkwardness when parsing inputs that contain non-15-minute-multiple offsets as @serramatutu pointed above. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
serramatutu commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2479249353 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). Review Comment: But... the main reason behind this proposal is compatibility with ANSI SQL `TIMESTAMP WITH TIME ZONE`, which is supported by multiple database systems (Snowflake, MS SQL Server, Oracle, Trino). This is the reasoning behind why we're proposing an offset in minutes as signed 16-bit int: > In ANSI SQL, the time zone information is defined in terms of an "INTERVAL" offset ranging from "INTERVAL - '12:59' HOUR TO MINUTE" to "INTERVAL + '13:00' HOUR TO MINUTE". Since "MINUTE" is the smallest granularity with which you can represent a time zone offset, and the maximum minutes in the offset is 13*60=780, we believe it makes sense for the offset to be stored as a 16-bit integer in minutes. > It is important to point out that some systems such as MS SQL Server do implement data types that can represent offsets with sub-minute granularity. We believe representing sub-minute granularity is out of scope for this proposal given that no current or past time zone standards have ever specified sub-minute offsets [9], and that is what we're trying to solve for. Furthermore, representing the offset in seconds rather than minutes would mean the maximum offset is 13*60*60=46800, which is greater than the maximum positive integer an int16 can represent (32768), and thus the offset type would need to be wider (int32). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
serramatutu commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2479249353 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). Review Comment: But... the main reason behind this proposal is compatibility with ANSI SQL `TIMESTAMP WITH TIME ZONE`, which is supported by multiple database systems. This is the reasoning behind why we're proposing an offset in minutes as signed 16-bit int: > In ANSI SQL, the time zone information is defined in terms of an "INTERVAL" offset ranging from "INTERVAL - '12:59' HOUR TO MINUTE" to "INTERVAL + '13:00' HOUR TO MINUTE". Since "MINUTE" is the smallest granularity with which you can represent a time zone offset, and the maximum minutes in the offset is 13*60=780, we believe it makes sense for the offset to be stored as a 16-bit integer in minutes. > It is important to point out that some systems such as MS SQL Server do implement data types that can represent offsets with sub-minute granularity. We believe representing sub-minute granularity is out of scope for this proposal given that no current or past time zone standards have ever specified sub-minute offsets [9], and that is what we're trying to solve for. Furthermore, representing the offset in seconds rather than minutes would mean the maximum offset is 13*60*60=46800, which is greater than the maximum positive integer an int16 can represent (32768), and thus the offset type would need to be wider (int32). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
serramatutu commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2479245886 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). Review Comment: Hey! We will send a [DISCUSS] in the mailing list to discuss this shortly (next few days, still drafting it). Let's discuss it there! 😄 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
serramatutu commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2479249353 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). Review Comment: But... the main reason behind this is compatibility with ANSI SQL `TIMESTAMP WITH TIME ZONE`, which is supported by multiple database systems. This is the reasoning behind why we're proposing an offset in minutes as signed 16-bit int: > In ANSI SQL, the time zone information is defined in terms of an "INTERVAL" offset ranging from "INTERVAL - '12:59' HOUR TO MINUTE" to "INTERVAL + '13:00' HOUR TO MINUTE". Since "MINUTE" is the smallest granularity with which you can represent a time zone offset, and the maximum minutes in the offset is 13*60=780, we believe it makes sense for the offset to be stored as a 16-bit integer in minutes. > It is important to point out that some systems such as MS SQL Server do implement data types that can represent offsets with sub-minute granularity. We believe representing sub-minute granularity is out of scope for this proposal given that no current or past time zone standards have ever specified sub-minute offsets [9], and that is what we're trying to solve for. Furthermore, representing the offset in seconds rather than minutes would mean the maximum offset is 13*60*60=46800, which is greater than the maximum positive integer an int16 can represent (32768), and thus the offset type would need to be wider (int32). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
serramatutu commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2479245886 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). Review Comment: Hey! We will send a [DISCUSS] in the mailing list to discuss this shortly (next few days, still drafting it). Let's discuss it there! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
serramatutu commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2479245886 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). Review Comment: Hey! We will send a proposal in the mailing list to discuss this shortly (next few days, still drafting it). Let's discuss it there! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
rok commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2478871611 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). Review Comment: I believe (current) [timezones in the wild](https://en.wikipedia.org/wiki/List_of_UTC_offsets) cover a range of -12:00 to + 14:00. We could specify offsets should preferably be multiples of 15 minutes as [suggested here](https://en.wikipedia.org/wiki/UTC_offset): > By convention, every inhabited place in the world has a UTC offset that is a multiple of 15 minutes but the majority of offsets are stated in whole hours. There are many cases where the national [standard time](https://en.wikipedia.org/wiki/Standard_time) uses a UTC offset that is not defined solely by longitude. Alternatively - if we wanted to represent old sun time offsets - we'd have to consider [fractions of seconds](https://github.com/eggert/tz/blob/8e43ecbb256178b4fa7314491abdc0ce8f5f1c9c/northamerica#L329-L331). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]
rok commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2478871611 ## docs/source/format/CanonicalExtensions.rst: ## @@ -483,6 +483,28 @@ binary values look like. .. _variant_primitive_type_mapping: +Timestamp With Offset += +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). Review Comment: I believe (current) [timezones in the wild](https://en.wikipedia.org/wiki/List_of_UTC_offsets) cover a range of -12:00 to + 14:00. We could specify minute offsets should preferably be multiples of 15 minutes as [suggested here](https://en.wikipedia.org/wiki/UTC_offset): > By convention, every inhabited place in the world has a UTC offset that is a multiple of 15 minutes but the majority of offsets are stated in whole hours. There are many cases where the national [standard time](https://en.wikipedia.org/wiki/Standard_time) uses a UTC offset that is not defined solely by longitude. Alternatively - if we wanted to represent old sun time offsets - we'd have to consider [fractions of seconds](https://github.com/eggert/tz/blob/8e43ecbb256178b4fa7314491abdc0ce8f5f1c9c/northamerica#L329-L331). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
