Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-20 Thread via GitHub


felipecrv commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2546738845


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -544,6 +544,49 @@ Primitive Type Mappings
 | UUID extension type  | UUID   |
 +--++
 
+.. _timestamp_with_offset_extension:
+
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP 
WITH TIME ZONE``, which is supported by multiple database engines.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a preferably non-nullable ``Timestamp(time_unit, "UTC")``, 
where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a preferably non-nullable signed 16-bit integer 
(``Int16``) representing the offset in minutes from the UTC timezone. Negative 
offsets represent time zones west of UTC, while positive offsets represent 
east. Offsets range from -779 (-12:59) to +780 (+13:00).
+
+* Extension type parameters:
+
+  * ``time_unit``: the time-unit of each of the stored UTC timestamps.
+
+* Description of the serialization:
+
+  Extension metadata is an empty string.
+
+.. note::
+
+   It is also *permissible* for the ``offset_minutes`` field to be 
dictionary-encoded or run-end-encoded.
+
+.. note::
+
+   It is also *permissible* for ``timestamp`` and ``offset_minutes`` to be 
nullable, but not recommended.
+
+   If ``timestamp`` is nullable and a value is found to be null, then the 
whole ``TimestampWithOffset`` value should be interpreted as null. One way of 
achieving this is to drop ``timestamp``'s validity buffer (V1) and replace the 
top-level struct's validity buffer (V2) with the result of ``V1 AND V2``.
+
+   If ``offset_minutes`` is nullable and a value is found to be null, then 
this value should be interpreted as if the offset value were were zero.

Review Comment:
   Using the validity buffer of the `timestamp` field for both the struct and 
the field itself is good for both *producers* and *consumers*. A producer 
doesn't have to set default values for the nulls. And consumers don't have to 
look at both validity buffers for anything they do. Sharing the buffer (if it 
exists at all) is just a ref-count bump on the buffer.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-20 Thread via GitHub


pitrou commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2546706495


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -544,6 +544,49 @@ Primitive Type Mappings
 | UUID extension type  | UUID   |
 +--++
 
+.. _timestamp_with_offset_extension:
+
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP 
WITH TIME ZONE``, which is supported by multiple database engines.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a preferably non-nullable ``Timestamp(time_unit, "UTC")``, 
where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a preferably non-nullable signed 16-bit integer 
(``Int16``) representing the offset in minutes from the UTC timezone. Negative 
offsets represent time zones west of UTC, while positive offsets represent 
east. Offsets range from -779 (-12:59) to +780 (+13:00).
+
+* Extension type parameters:
+
+  * ``time_unit``: the time-unit of each of the stored UTC timestamps.
+
+* Description of the serialization:
+
+  Extension metadata is an empty string.
+
+.. note::
+
+   It is also *permissible* for the ``offset_minutes`` field to be 
dictionary-encoded or run-end-encoded.
+
+.. note::
+
+   It is also *permissible* for ``timestamp`` and ``offset_minutes`` to be 
nullable, but not recommended.
+
+   If ``timestamp`` is nullable and a value is found to be null, then the 
whole ``TimestampWithOffset`` value should be interpreted as null. One way of 
achieving this is to drop ``timestamp``'s validity buffer (V1) and replace the 
top-level struct's validity buffer (V2) with the result of ``V1 AND V2``.
+
+   If ``offset_minutes`` is nullable and a value is found to be null, then 
this value should be interpreted as if the offset value were were zero.
+
+   It is *recommended* that implementations normalize this type's 
representation by dropping the inner validity buffers and applying the 
aforementioned transformations, only keeping the top-level struct's validity 
buffer.
+
+.. note::

Review Comment:
   I think it doesn't belong indeed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-20 Thread via GitHub


felipecrv commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2546691725


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -544,6 +544,49 @@ Primitive Type Mappings
 | UUID extension type  | UUID   |
 +--++
 
+.. _timestamp_with_offset_extension:
+
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP 
WITH TIME ZONE``, which is supported by multiple database engines.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a preferably non-nullable ``Timestamp(time_unit, "UTC")``, 
where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a preferably non-nullable signed 16-bit integer 
(``Int16``) representing the offset in minutes from the UTC timezone. Negative 
offsets represent time zones west of UTC, while positive offsets represent 
east. Offsets range from -779 (-12:59) to +780 (+13:00).
+
+* Extension type parameters:
+
+  * ``time_unit``: the time-unit of each of the stored UTC timestamps.
+
+* Description of the serialization:
+
+  Extension metadata is an empty string.
+
+.. note::
+
+   It is also *permissible* for the ``offset_minutes`` field to be 
dictionary-encoded or run-end-encoded.
+
+.. note::
+
+   It is also *permissible* for ``timestamp`` and ``offset_minutes`` to be 
nullable, but not recommended.
+
+   If ``timestamp`` is nullable and a value is found to be null, then the 
whole ``TimestampWithOffset`` value should be interpreted as null. One way of 
achieving this is to drop ``timestamp``'s validity buffer (V1) and replace the 
top-level struct's validity buffer (V2) with the result of ``V1 AND V2``.
+
+   If ``offset_minutes`` is nullable and a value is found to be null, then 
this value should be interpreted as if the offset value were were zero.

Review Comment:
   >> And return an error on instantiation?
   >
   > Which instantiation are we talking about exactly? The ExtensionArray 
subclass in Arrow C++?
   
   In Arrow implementations for any language.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-20 Thread via GitHub


serramatutu commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2546657710


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -544,6 +544,49 @@ Primitive Type Mappings
 | UUID extension type  | UUID   |
 +--++
 
+.. _timestamp_with_offset_extension:
+
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP 
WITH TIME ZONE``, which is supported by multiple database engines.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a preferably non-nullable ``Timestamp(time_unit, "UTC")``, 
where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a preferably non-nullable signed 16-bit integer 
(``Int16``) representing the offset in minutes from the UTC timezone. Negative 
offsets represent time zones west of UTC, while positive offsets represent 
east. Offsets range from -779 (-12:59) to +780 (+13:00).
+
+* Extension type parameters:
+
+  * ``time_unit``: the time-unit of each of the stored UTC timestamps.
+
+* Description of the serialization:
+
+  Extension metadata is an empty string.
+
+.. note::
+
+   It is also *permissible* for the ``offset_minutes`` field to be 
dictionary-encoded or run-end-encoded.
+
+.. note::
+
+   It is also *permissible* for ``timestamp`` and ``offset_minutes`` to be 
nullable, but not recommended.
+
+   If ``timestamp`` is nullable and a value is found to be null, then the 
whole ``TimestampWithOffset`` value should be interpreted as null. One way of 
achieving this is to drop ``timestamp``'s validity buffer (V1) and replace the 
top-level struct's validity buffer (V2) with the result of ``V1 AND V2``.
+
+   If ``offset_minutes`` is nullable and a value is found to be null, then 
this value should be interpreted as if the offset value were were zero.
+
+   It is *recommended* that implementations normalize this type's 
representation by dropping the inner validity buffers and applying the 
aforementioned transformations, only keeping the top-level struct's validity 
buffer.
+
+.. note::

Review Comment:
   I added this mostly to make integrating to Arrow via JSON more ergonomic so 
people can do stuff like `time.Parse()` in Go or `datetime.fromisoformat()` in 
Python, instead of having to manually parse something like `{"timestamp": 
"...", "offset": 123}` to get timezone-aware objects in their language.
   
   I can take this off of the spec if you think it doesn't belong here though.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-20 Thread via GitHub


pitrou commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2546650112


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -544,6 +544,49 @@ Primitive Type Mappings
 | UUID extension type  | UUID   |
 +--++
 
+.. _timestamp_with_offset_extension:
+
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP 
WITH TIME ZONE``, which is supported by multiple database engines.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a preferably non-nullable ``Timestamp(time_unit, "UTC")``, 
where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a preferably non-nullable signed 16-bit integer 
(``Int16``) representing the offset in minutes from the UTC timezone. Negative 
offsets represent time zones west of UTC, while positive offsets represent 
east. Offsets range from -779 (-12:59) to +780 (+13:00).
+
+* Extension type parameters:
+
+  * ``time_unit``: the time-unit of each of the stored UTC timestamps.
+
+* Description of the serialization:
+
+  Extension metadata is an empty string.
+
+.. note::
+
+   It is also *permissible* for the ``offset_minutes`` field to be 
dictionary-encoded or run-end-encoded.
+
+.. note::
+
+   It is also *permissible* for ``timestamp`` and ``offset_minutes`` to be 
nullable, but not recommended.
+
+   If ``timestamp`` is nullable and a value is found to be null, then the 
whole ``TimestampWithOffset`` value should be interpreted as null. One way of 
achieving this is to drop ``timestamp``'s validity buffer (V1) and replace the 
top-level struct's validity buffer (V2) with the result of ``V1 AND V2``.
+
+   If ``offset_minutes`` is nullable and a value is found to be null, then 
this value should be interpreted as if the offset value were were zero.

Review Comment:
   > And return an error on instantiation?
   
   Which instantiation are we talking about exactly? The `ExtensionArray` 
subclass in Arrow C++?
   
   > I think we can allow nulls on the `timestamp` field and on instantiation 
put it on the struct array also.
   > 
   > Would you be fine with both points above?
   
   We can, but that places a burden on the consumer for the sake of making 
things easier for the producer. I'm not sure that's a good tradeoff to make.
   
   > I think many operations will care only about the timestamp field and 
having the validity buffer there helps them simply delegate to functions on the 
`Timestamp` type. At the same time, many function only look at the top-level 
validity buffer (i.e. the validity of the struct). So during instantiation it 
would make sense to share the validity buffer if there is one.
   
   I admit that can make sense. Let's hope for other opinions on the topic. 
@paleolimbot @alamb @tustvold 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-20 Thread via GitHub


felipecrv commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2546637086


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -544,6 +544,49 @@ Primitive Type Mappings
 | UUID extension type  | UUID   |
 +--++
 
+.. _timestamp_with_offset_extension:
+
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP 
WITH TIME ZONE``, which is supported by multiple database engines.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a preferably non-nullable ``Timestamp(time_unit, "UTC")``, 
where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a preferably non-nullable signed 16-bit integer 
(``Int16``) representing the offset in minutes from the UTC timezone. Negative 
offsets represent time zones west of UTC, while positive offsets represent 
east. Offsets range from -779 (-12:59) to +780 (+13:00).
+
+* Extension type parameters:
+
+  * ``time_unit``: the time-unit of each of the stored UTC timestamps.
+
+* Description of the serialization:
+
+  Extension metadata is an empty string.
+
+.. note::
+
+   It is also *permissible* for the ``offset_minutes`` field to be 
dictionary-encoded or run-end-encoded.
+
+.. note::
+
+   It is also *permissible* for ``timestamp`` and ``offset_minutes`` to be 
nullable, but not recommended.
+
+   If ``timestamp`` is nullable and a value is found to be null, then the 
whole ``TimestampWithOffset`` value should be interpreted as null. One way of 
achieving this is to drop ``timestamp``'s validity buffer (V1) and replace the 
top-level struct's validity buffer (V2) with the result of ``V1 AND V2``.
+
+   If ``offset_minutes`` is nullable and a value is found to be null, then 
this value should be interpreted as if the offset value were were zero.

Review Comment:
   I think many operations will care only about the timestamp field and having 
the validity buffer there helps them simply delegate to functions on the 
`Timestamp` type. At the same time, many function only look at the top-level 
validity buffer (i.e. the validity of the struct). So during instantiation it 
would make sense to share the validity buffer if there is one.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-20 Thread via GitHub


felipecrv commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2546631460


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -544,6 +544,49 @@ Primitive Type Mappings
 | UUID extension type  | UUID   |
 +--++
 
+.. _timestamp_with_offset_extension:
+
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP 
WITH TIME ZONE``, which is supported by multiple database engines.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a preferably non-nullable ``Timestamp(time_unit, "UTC")``, 
where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a preferably non-nullable signed 16-bit integer 
(``Int16``) representing the offset in minutes from the UTC timezone. Negative 
offsets represent time zones west of UTC, while positive offsets represent 
east. Offsets range from -779 (-12:59) to +780 (+13:00).
+
+* Extension type parameters:
+
+  * ``time_unit``: the time-unit of each of the stored UTC timestamps.
+
+* Description of the serialization:
+
+  Extension metadata is an empty string.
+
+.. note::
+
+   It is also *permissible* for the ``offset_minutes`` field to be 
dictionary-encoded or run-end-encoded.
+
+.. note::
+
+   It is also *permissible* for ``timestamp`` and ``offset_minutes`` to be 
nullable, but not recommended.
+
+   If ``timestamp`` is nullable and a value is found to be null, then the 
whole ``TimestampWithOffset`` value should be interpreted as null. One way of 
achieving this is to drop ``timestamp``'s validity buffer (V1) and replace the 
top-level struct's validity buffer (V2) with the result of ``V1 AND V2``.
+
+   If ``offset_minutes`` is nullable and a value is found to be null, then 
this value should be interpreted as if the offset value were were zero.

Review Comment:
   And return an error on instantiation?
   
   I think we can allow nulls on the `timestamp` field and on instantiation put 
it on the struct array also.
   
   Would you be fine with both points above?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-20 Thread via GitHub


pitrou commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2546556204


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -544,6 +544,49 @@ Primitive Type Mappings
 | UUID extension type  | UUID   |
 +--++
 
+.. _timestamp_with_offset_extension:
+
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP 
WITH TIME ZONE``, which is supported by multiple database engines.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a preferably non-nullable ``Timestamp(time_unit, "UTC")``, 
where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a preferably non-nullable signed 16-bit integer 
(``Int16``) representing the offset in minutes from the UTC timezone. Negative 
offsets represent time zones west of UTC, while positive offsets represent 
east. Offsets range from -779 (-12:59) to +780 (+13:00).
+
+* Extension type parameters:
+
+  * ``time_unit``: the time-unit of each of the stored UTC timestamps.
+
+* Description of the serialization:
+
+  Extension metadata is an empty string.
+
+.. note::
+
+   It is also *permissible* for the ``offset_minutes`` field to be 
dictionary-encoded or run-end-encoded.
+
+.. note::
+
+   It is also *permissible* for ``timestamp`` and ``offset_minutes`` to be 
nullable, but not recommended.
+
+   If ``timestamp`` is nullable and a value is found to be null, then the 
whole ``TimestampWithOffset`` value should be interpreted as null. One way of 
achieving this is to drop ``timestamp``'s validity buffer (V1) and replace the 
top-level struct's validity buffer (V2) with the result of ``V1 AND V2``.
+
+   If ``offset_minutes`` is nullable and a value is found to be null, then 
this value should be interpreted as if the offset value were were zero.
+
+   It is *recommended* that implementations normalize this type's 
representation by dropping the inner validity buffers and applying the 
aforementioned transformations, only keeping the top-level struct's validity 
buffer.
+
+.. note::

Review Comment:
   This seems slightly out of scope. We don't have any similar wording for 
built-in Arrow types such as Timestamp.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-20 Thread via GitHub


pitrou commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2546553261


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -544,6 +544,49 @@ Primitive Type Mappings
 | UUID extension type  | UUID   |
 +--++
 
+.. _timestamp_with_offset_extension:
+
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP 
WITH TIME ZONE``, which is supported by multiple database engines.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a preferably non-nullable ``Timestamp(time_unit, "UTC")``, 
where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a preferably non-nullable signed 16-bit integer 
(``Int16``) representing the offset in minutes from the UTC timezone. Negative 
offsets represent time zones west of UTC, while positive offsets represent 
east. Offsets range from -779 (-12:59) to +780 (+13:00).
+
+* Extension type parameters:
+
+  * ``time_unit``: the time-unit of each of the stored UTC timestamps.
+
+* Description of the serialization:
+
+  Extension metadata is an empty string.
+
+.. note::
+
+   It is also *permissible* for the ``offset_minutes`` field to be 
dictionary-encoded or run-end-encoded.
+
+.. note::
+
+   It is also *permissible* for ``timestamp`` and ``offset_minutes`` to be 
nullable, but not recommended.
+
+   If ``timestamp`` is nullable and a value is found to be null, then the 
whole ``TimestampWithOffset`` value should be interpreted as null. One way of 
achieving this is to drop ``timestamp``'s validity buffer (V1) and replace the 
top-level struct's validity buffer (V2) with the result of ``V1 AND V2``.
+
+   If ``offset_minutes`` is nullable and a value is found to be null, then 
this value should be interpreted as if the offset value were were zero.

Review Comment:
   This makes consumption of these data more complicated. Why not mandate that 
struct fields are always non-null?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-20 Thread via GitHub


pitrou commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2546541747


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -544,6 +544,49 @@ Primitive Type Mappings
 | UUID extension type  | UUID   |
 +--++
 
+.. _timestamp_with_offset_extension:
+
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP 
WITH TIME ZONE``, which is supported by multiple database engines.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a preferably non-nullable ``Timestamp(time_unit, "UTC")``, 
where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a preferably non-nullable signed 16-bit integer 
(``Int16``) representing the offset in minutes from the UTC timezone. Negative 
offsets represent time zones west of UTC, while positive offsets represent 
east. Offsets range from -779 (-12:59) to +780 (+13:00).
+
+* Extension type parameters:
+
+  * ``time_unit``: the time-unit of each of the stored UTC timestamps.

Review Comment:
   This is not an extension type parameter, it's a parameter of the storage 
type.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-20 Thread via GitHub


rok commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2545537567


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) 
representing the offset in minutes from the UTC timezone. Negative offsets 
represent time zones west of UTC, while positive offsets represent east. 
Offsets range from -779 (-12:59) to +780 (+13:00).

Review Comment:
   Sorry for a late reply. Yes, I think this is good, let's close the thread.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-20 Thread via GitHub


serramatutu commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2544923839


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).

Review Comment:
   Hey all! To move this forward, I:
   
   1. Changed the wording from `... non-nullable ...` to `... preferably 
non-nullable ...`
   2. Added a new note section outlining the expected semantics when the inner 
fields do have a validity buffer. This should be enough to avoid 
implementations drifting and having their own freestyle interpretation.
   3. Added a recommendation that implementations _should_ (but aren't required 
to) normalize the type by dropping the inner validity buffers and only keep the 
top-level one. This should make it easier for compute kernels to interpret this 
type correctly without having to implement the logic themselves.
   
   Let me know if this is enough to resolve this thread!
   
   
https://github.com/apache/arrow/pull/48002/commits/b0d9be3be93098e7932236b1e5aa921d3d1c8469



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-20 Thread via GitHub


serramatutu commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2544923839


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).

Review Comment:
   Hey all! To move this forward, I:
   
   1. Changed the wording from `... non-nullable ...` to `... preferably 
non-nullable ...`
   2. Added a new note section outlining the expected semantics when the inner 
fields do have a validity buffer. This should be enough to avoid 
implementations drifting and having their own freestyle interpretation.
   3. Added a recommendation that implementations _should_ (but aren't required 
to) normalize the type by dropping the inner validity buffers and only keeping 
the top-level one. This should make it easier for compute kernels to interpret 
this type correctly without having to implement the logic themselves.
   
   Let me know if this is enough to resolve this thread!
   
   
https://github.com/apache/arrow/pull/48002/commits/b0d9be3be93098e7932236b1e5aa921d3d1c8469



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-20 Thread via GitHub


serramatutu commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2544923839


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).

Review Comment:
   Hey all! To move this forward, I:
   
   1. Changed the wording from `... non-nullable ...` to `... preferably 
non-nullable ...`
   2. Added a new note section outlining the expected semantics when the inner 
fields do have a validity buffer
   3. Added a recommendation that implementations _should_ (but aren't required 
to) normalize the type by dropping the inner validity buffers and only keeping 
the top-level one. This should make it easier for compute kernels to interpret 
this type correctly without having to implement the logic themselves.
   
   Let me know if this is enough to resolve this thread!
   
   
https://github.com/apache/arrow/pull/48002/commits/b0d9be3be93098e7932236b1e5aa921d3d1c8469



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-20 Thread via GitHub


serramatutu commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2544923839


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).

Review Comment:
   Hey all! To move this forward, I:
   
   1. Changed the wording from `... non-nullable ...` to `... preferably 
non-nullable ...`
   2. Added a new note section outlining the expected semantics when the inner 
fields do have a validity buffer
   3. Added a recommendation that implementations _should_ (but aren't required 
to) normalize the type by dropping the inner validity buffers and only keeping 
the top-level one.
   
   Let me know if this is enough to resolve this thread!
   
   
https://github.com/apache/arrow/pull/48002/commits/b0d9be3be93098e7932236b1e5aa921d3d1c8469



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-20 Thread via GitHub


serramatutu commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2544923839


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).

Review Comment:
   Hey all! To move this forward, I:
   
   1. Changed the wording from `... non-nullable ...` to `... preferably 
non-nullable ...`
   2. Added a new note section outlining the expected semantics when the inner 
fields do have a validity buffer
   3. Added a recommendation that implementations _should_ (but aren't required 
to) normalize the type by dropping the inner validity buffers and only keeping 
the top-level one.
   
   Let me know if this is enough to resolve this thread!
   
   
https://github.com/apache/arrow/pull/48002/commits/d40c0bb2383ec3fa82bfb6df8e4ee47d82ca152a



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-17 Thread via GitHub


serramatutu commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2533749555


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) 
representing the offset in minutes from the UTC timezone. Negative offsets 
represent time zones west of UTC, while positive offsets represent east. 
Offsets range from -779 (-12:59) to +780 (+13:00).

Review Comment:
   @rok Are you satisfied with the current wording for the spec? I want to keep 
this moving forward, and so I am wondering if I can close this thread :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-09 Thread via GitHub


lidavidm commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2508392727


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -544,6 +544,39 @@ Primitive Type Mappings
 | UUID extension type  | UUID   |
 +--++
 
+.. _timestamp_with_offset_extension:
+
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP 
WITH TIME ZONE``, which is supported by multiple database engines.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) 
representing the offset in minutes from the UTC timezone. Negative offsets 
represent time zones west of UTC, while positive offsets represent east. 
Offsets range from -779 (-12:59) to +780 (+13:00).
+
+* Extension type parameters:
+
+  * ``time_unit``: the time-unit of each of the stored UTC timestamps.
+
+* Description of the serialization:
+
+  Extension metadata is an empty string.
+
+.. note::
+
+   It is also *permissible* for the ``offset_minutes`` field to be 
dictionary-encoded with a preferred (*but not required*) index type of 
``int8``, or run-end-encoded with a preferred (*but not required*) runs type of 
``int8``.
+
+.. note::
+
+  Although not required, it is *recommended* that implementations represent 
this type as an RFC3339 string when de/serializing to/from JSON, respecting the 
``TimeUnit`` precision and time zone offset without loss of information. For 
example ``2025-01-01T00:00:00Z`` represents January 1st 2025 in UTC with second 
precision, and ``2025-01-01T00:00:00.1-07:00`` represents one 
nanosecond after January 1st 2025 in UTC-07.
+
+  The rationale behind this recommendation is that many programming languages 
provide support for parsing RFC3339 out of the box, facilitating consumption of 
timezone-aware JSON-encoded Arrow arrays without extra boilerplate just for 
integrating with Arrow.

Review Comment:
   ```suggestion
   .. note::
   
  Although not required, it is *recommended* that implementations represent 
this type as an RFC3339 string when de/serializing to/from JSON, respecting the 
``TimeUnit`` precision and time zone offset without loss of information. For 
example ``2025-01-01T00:00:00Z`` represents January 1st 2025 in UTC with second 
precision, and ``2025-01-01T00:00:00.1-07:00`` represents one 
nanosecond after January 1st 2025 in UTC-07.
   
  The rationale behind this recommendation is that many programming 
languages provide support for parsing RFC3339 out of the box, facilitating 
consumption of timezone-aware JSON-encoded Arrow arrays without extra 
boilerplate just for integrating with Arrow.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-07 Thread via GitHub


rok commented on PR #48002:
URL: https://github.com/apache/arrow/pull/48002#issuecomment-3503602930

   FWIW, forgot to mention this before in arrow-cpp we have an internal 
OffsetZone that represents offset in minutes. We define it per array, not per 
row of course.
   
https://github.com/apache/arrow/blob/cd23a765442bdbaaef43d0e4b239094fb01e37ae/cpp/src/arrow/util/date_internal.h#L29-L33


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-07 Thread via GitHub


rok commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2503880802


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).

Review Comment:
   One thing we could do with nullable `offset_minutes` field is to have nulls 
indicate timezone-naive timestamps. As per discussion on the Arrow call - cases 
with mixed timezone-naive and timezone-aware columns are probably not common, 
so I'm only bringing this up here for completeness.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-07 Thread via GitHub


felipecrv commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2503778940


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).

Review Comment:
   @jorisvandenbossche now I understand what you meant.
   
   One complexity here is that some compute kernels might want to look at just 
the UTC timestamp field because they only care about the instant, so we should 
at least warn/recommend what to do when two bitmaps exist. If we make the spec 
require that the top-level bitmap can be more selective than the inner bitmap, 
kernels looking at just the timestamp would be allowed to grab the top-level 
bitmap and apply it to the processing and the output.
   
   @lidavidm I think top-level bitmaps is the best, but inevitably someone will 
have to make a decision on what to do when more than one bitmap exists and the 
spec having recommendations could prevent divergence between implementations.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-07 Thread via GitHub


serramatutu commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2503708486


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -544,6 +544,29 @@ Primitive Type Mappings
 | UUID extension type  | UUID   |
 +--++
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP 
WITH TIME ZONE``, which is supported by multiple database engines.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) 
representing the offset in minutes from the UTC timezone. Negative offsets 
represent time zones west of UTC, while positive offsets represent east. 
Offsets range from -779 (-12:59) to +780 (+13:00).

Review Comment:
   Resolved: 
https://github.com/apache/arrow/pull/48002/commits/06ffdf958c34405918300789dd88a1b6ac5684ee



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-07 Thread via GitHub


serramatutu commented on PR #48002:
URL: https://github.com/apache/arrow/pull/48002#issuecomment-3502663004

   Hey friends, I pushed 2 new patches, as discussed in the Arrow dev community 
meeting:
   
   1. Made RFC3339 encoding to/from JSON a _recommendation_ but not a 
_requirement_, and added a rationale on why this is recommended. 
Implementations can still provide ways to encode it as JSON objects without 
violating the spec if they want, though.
   
   2. The `offset_minutes` field now can also be run-end or dictionary encoded 
for columns with lots of repeating offsets.
   
   Next, I need to change the Go and Rust implementation drafts to accept the 
new supported encodings.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-05 Thread via GitHub


serramatutu commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2495148508


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -544,6 +544,29 @@ Primitive Type Mappings
 | UUID extension type  | UUID   |
 +--++
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP 
WITH TIME ZONE``, which is supported by multiple database engines.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) 
representing the offset in minutes from the UTC timezone. Negative offsets 
represent time zones west of UTC, while positive offsets represent east. 
Offsets range from -779 (-12:59) to +780 (+13:00).

Review Comment:
   Actually, my bad. I overlooked a line that says:
   
   > (4) The index type of a Dictionary type can only be an integer type, 
preferably signed, with width 8 to 64 bits.
   
   I had interpreted it differently since the only example of dictionary 
encoding is `int32`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-05 Thread via GitHub


lidavidm commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2493949264


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -544,6 +544,29 @@ Primitive Type Mappings
 | UUID extension type  | UUID   |
 +--++
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP 
WITH TIME ZONE``, which is supported by multiple database engines.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) 
representing the offset in minutes from the UTC timezone. Negative offsets 
represent time zones west of UTC, while positive offsets represent east. 
Offsets range from -779 (-12:59) to +780 (+13:00).

Review Comment:
   Any of the integer types possible, so uint8 is perfectly fine. Where do you 
see it saying that int32 should be used? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-05 Thread via GitHub


serramatutu commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2493412961


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -544,6 +544,29 @@ Primitive Type Mappings
 | UUID extension type  | UUID   |
 +--++
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP 
WITH TIME ZONE``, which is supported by multiple database engines.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) 
representing the offset in minutes from the UTC timezone. Negative offsets 
represent time zones west of UTC, while positive offsets represent east. 
Offsets range from -779 (-12:59) to +780 (+13:00).

Review Comment:
   There was a request in the mailing list to add dictionary encoding and 
run-end encoding to the offset column. 
   
   I don't see why we wouldn't wanna do run-end encoding, especially for large 
columns with lots of repeated offsets it could save a lot of space.
   
   **Should we add it to the spec already to avoid breaking changes?**
   
   For dictionary encoding: is it possible to use `uint8` or possibly something 
even smaller to represent the dictionary indices? Otherwise it only adds extra 
abstraction without saving that much space... The docs suggest using `int32` 
for dictionary encoding which would actually be worse than just using the 
`int16` offsets directly.
   
   We can keep the implementations simple (only primitive encoding for now), 
and then patch them later to support all the encodings we decide to add to the 
spec.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-05 Thread via GitHub


serramatutu commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2493412961


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -544,6 +544,29 @@ Primitive Type Mappings
 | UUID extension type  | UUID   |
 +--++
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP 
WITH TIME ZONE``, which is supported by multiple database engines.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) 
representing the offset in minutes from the UTC timezone. Negative offsets 
represent time zones west of UTC, while positive offsets represent east. 
Offsets range from -779 (-12:59) to +780 (+13:00).

Review Comment:
   There was a request in the mailing list to add dictionary encoding and 
run-end encoding to the offset column. 
   
   I don't see why we wouldn't wanna do run-end encoding, especially for large 
columns with lots of repeated offsets it could save a lot of space.
   
   **Should we add it to the spec already to avoid breaking changes?**
   
   For dictionary encoding: is it possible to use `uint8` or possibly something 
even smaller to represent the dictionary indices? Otherwise it only adds extra 
abstraction without saving that much space... The docs suggest using `int32` 
for dictionary encoding which would actually be worse...
   
   We can keep the implementations simple (only primitive encoding for now), 
and then patch them later to support all the encodings we decide to add to the 
spec.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-05 Thread via GitHub


serramatutu commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2493412961


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -544,6 +544,29 @@ Primitive Type Mappings
 | UUID extension type  | UUID   |
 +--++
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP 
WITH TIME ZONE``, which is supported by multiple database engines.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) 
representing the offset in minutes from the UTC timezone. Negative offsets 
represent time zones west of UTC, while positive offsets represent east. 
Offsets range from -779 (-12:59) to +780 (+13:00).

Review Comment:
   There was a request in the mailing list to add dictionary encoding and 
run-end encoding to the offset column. I don't see why we wouldn't wanna do 
that, especially for large columns with lots of repeated offsets it could save 
a lot of space.
   
   **Should we add it to the spec already to avoid breaking changes?**
   
   We can keep the implementations simple (only primitive encoding for now), 
and then patch them later to support all the encodings.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-03 Thread via GitHub


jorisvandenbossche commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2486512426


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).

Review Comment:
   To be clear, I am not arguing for assigning a specific meaning to a certain 
combination of nullability, but just for allowing the fields to be null as well.
   
   For example, we could say that if the element is null (top-level struct 
validity), the individual fields are allowed to contain a null as well.
   
   Of course, when constructing a timestamp with offset from the individual 
fields, it is relatively straightforward to just drop the validity bitmaps of 
the individual fields, and ensure a union of both bitmaps is assigned to the 
struct. 
   (it is just that the current pyarrow APIs don't make this particularly easy 
.. but that is something we can also improve in the exposed APIs)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-03 Thread via GitHub


serramatutu commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2486474463


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset

Review Comment:
   Fixed: 
https://github.com/apache/arrow/pull/48002/commits/080220a36401e6468f381c241c6dae245c76967d



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-03 Thread via GitHub


serramatutu commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2486475138


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.

Review Comment:
   Fixed: 
https://github.com/apache/arrow/pull/48002/commits/d8b900fd29603a34670ea2e45ae4f8eb5cbc8432



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-11-02 Thread via GitHub


lidavidm commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2485119505


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).

Review Comment:
   I think I prefer the current wording (require nullability to be handled at 
the struct level) instead of trying to assign semantics to the other 
combinations.



##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.

Review Comment:
   Can we explicitly call out that this is intended to be 1:1 with ANSI SQL?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-10-31 Thread via GitHub


serramatutu commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2481948785


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).

Review Comment:
   > Alternatively we could also specify that if one is null, the other should 
be null as well?
   
   Yea, that's more or less what I was thinking about. In principle this type 
only has meaning if both fields are set. To relax these constraints we'd need 
to come up with a meaning for what a null timestamp and non-null offset would 
mean and vice versa.
   
   Could be:
   - If timestamp is set and offset is null, assume `offset=0`, i.e timestamp 
is UTC
   - If timestamp is null and offset is set, assume the whole value is null (a 
standalone offset floating around has no meaning)
   
   Or, alternatively:
   - If any of the fields is null, assume the whole value is null as well
   
   The second option has the advantage of being able to and the two bitmasks 
together to figure out the global nullable buffer, and it reduces branching.



##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).

Review Comment:
   > Alternatively we could also specify that if one is null, the other should 
be null as well?
   
   Yea, that's more or less what I was thinking about. In principle this type 
only has meaning if both fields are set. To relax these constraints we'd need 
to come up with a meaning for what a null timestamp and non-null offset would 
mean and vice versa.
   
   Could be:
   - If timestamp is set and offset is null, assume `offset=0`, i.e timestamp 
is UTC
   - If timestamp is null and offset is set, assume the whole value is null (a 
standalone offset floating around has no meaning)
   
   Or, alternatively:
   - If any of the fields is null, assume the whole value is null as well



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-10-31 Thread via GitHub


serramatutu commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2481948785


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).

Review Comment:
   > Alternatively we could also specify that if one is null, the other should 
be null as well?
   
   Yea, that's more or less what I was thinking about. In principle this type 
only has meaning if both fields are set. To relax these constraints we'd need 
to come up with a meaning for what a null timestamp and non-null offset would 
mean and vice versa.
   
   Could be:
   - If timestamp is set and offset is null, assume `offset=0`, i.e timestamp 
is UTC
   - If timestamp is null and offset is set, assume the whole value is null (a 
standalone offset floating around has no meaning)
   
   Or, alternatively:
   - If any of the fields is null, assume the whole value is null as well
   
   None of these is inherently better or worse IMHO, it's just a matter of 
standardizing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-10-31 Thread via GitHub


serramatutu commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2481782459


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset

Review Comment:
   Oops, yes will do!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-10-31 Thread via GitHub


serramatutu commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2481766888


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) 
representing the offset in minutes from the UTC timezone. Negative offsets 
represent time zones west of UTC, while positive offsets represent east. 
Offsets range from -779 (-12:59) to +780 (+13:00).

Review Comment:
   @rok we just sent this to the mailing list yesterday. The [discussion 
thread](https://lists.apache.org/thread/yhbr3rj9l59yoxv92o2s6dqlop16sfnk) has a 
more extensive argumentation around why we chose these constraints.
   
   > Out of curiosity - would the proposed memory layout of match any existing 
system?
   
   The systems we're referencing are Snowflake, MS SQL Server, Oracle DB and 
Trino, of which only one of them (Trino) is open source. It's hard to know for 
a fact what is the internal memory layout of proprietary systems... We do know 
Oracle and Trino store IANA timezones instead of offsets, so the layout doesn't 
match there and some Arrow conversion layer would need to resolve the timezone 
names to offsets. 
   
   This (resolving offsets on the server) is an explicit choice so that 
consumer systems don't need to mess with the IANA database or reasoning about 
daylight savings etc. Arrow consumers just get the offset, add it to the 
timestamp and voila you have the original timestamp in the original timezone.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-10-31 Thread via GitHub


serramatutu commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2481766888


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) 
representing the offset in minutes from the UTC timezone. Negative offsets 
represent time zones west of UTC, while positive offsets represent east. 
Offsets range from -779 (-12:59) to +780 (+13:00).

Review Comment:
   @rok we just sent this to the mailing list yesterday. The [discussion 
thread](https://lists.apache.org/thread/yhbr3rj9l59yoxv92o2s6dqlop16sfnk) has a 
more extensive argumentation around why we chose these constraints.
   
   > Out of curiosity - would the proposed memory layout of match any existing 
system?
   
   The systems we're referencing are Snowflake, MS SQL Server, Oracle DB and 
Trino, of which only one of them (Trino) is open source. It's hard to know for 
a fact what is the internal memory layout of proprietary systems... We do know 
Oracle and Trino store IANA timezones instead of offsets, so the layout doesn't 
match there and some Arrow conversion layer would need to resolve the timezone 
names to offsets. 
   
   This (resolving offsets on the server) is an explicit choice so that 
consumer systems don't need to mess with the IANA database or reasoning about 
daylight savings etc. Arrow consumers just get the offset, add it to the 
timestamp and voila.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-10-31 Thread via GitHub


serramatutu commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2481766888


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) 
representing the offset in minutes from the UTC timezone. Negative offsets 
represent time zones west of UTC, while positive offsets represent east. 
Offsets range from -779 (-12:59) to +780 (+13:00).

Review Comment:
   @rok we just sent this to the mailing list yesterday. The [discussion 
thread](https://lists.apache.org/thread/yhbr3rj9l59yoxv92o2s6dqlop16sfnk) has a 
more extensive argumentation around why we chose these constraints.
   
   > Out of curiosity - would the proposed memory layout of match any existing 
system?
   
   The systems we're referencing are Snowflake, MS SQL Server, Oracle DB and 
Trino, of which only one of them (Trino) is open source. It's hard to know for 
a fact what is the internal memory layout of proprietary systems... We do know 
Oracle and Trino store IANA timezones instead of offsets, so the layout doesn't 
match there and some Arrow conversion layer would need to resolve the timezone 
names to offsets.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-10-31 Thread via GitHub


rok commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2481129311


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) 
representing the offset in minutes from the UTC timezone. Negative offsets 
represent time zones west of UTC, while positive offsets represent east. 
Offsets range from -779 (-12:59) to +780 (+13:00).

Review Comment:
   @serramatutu aligning with ANSI SQL seems like a good idea (and doesn't 
create a new convention), perhaps we could state this in the docs?
   
   Out of curiosity - would the proposed memory layout of match any existing 
system?
   
   Hey @felipecrv ! I was thinking about int8 for 15 min offset blocks as well, 
but I'm not sure it's worth it. Politically I would not expect new sub-60 
minutes offsets. But ANSI SQL does seem safer.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-10-30 Thread via GitHub


felipecrv commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2480139536


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).

Review Comment:
   Interesting points! We should expand the spec text here and clarify 
expectations.
   
   Since I can see many operations on this array not caring about the two 
fields, having a validity buffer on the timestamp field could be a 
simplification in these cases. It would reduce the risk of computation being 
performed on garbage values if the struct's validity bitmap is being ignored.
   
   But a top-level validity buffer is necessary to keep generic code going 
through columns processing nulls correctly.
   
   One way we can adapt to this reality is to make a recommendation against 
validity on the timestamp field and a warning that even when the offset field 
is not touched, the validity bitmap of the computation's result should come 
from the struct validity, or, if both have validity buffers, the & of the two 
bitmaps.
   
   For the offset column we can recommend the absence of validity bitmap as 
well (non-nullable) but if a value is null, process it as if it were zero.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-10-30 Thread via GitHub


felipecrv commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2480139536


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).

Review Comment:
   Interesting points! We should expand the spec text here and clarify 
expectations.
   
   Since I can see many operations on this array not caring about the two 
fields, having a validity buffer on the timestamp field could be a 
simplification in these cases. It would reduce the risk of computation being 
performed on garbage values if the struct's validity bitmap is being ignored.
   
   But a top-level validity buffer is necessary to keep generic code going 
through columns processing nulls correctly.
   
   One way we can adapt to this reality is to make a recommendation against 
validity on the timestamp field and a warning that even when the offset field 
is not touched, the validity bitmap of the computation's result should come 
from the struct validity, or, if both have validity buffers, the & of the two 
bitmaps.
   
   For the offset column we can recommend the absence of validity bitmap also 
(non-nullable) but if it's null, process the array as if it were zero.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-10-30 Thread via GitHub


jorisvandenbossche commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2479813864


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).

Review Comment:
   Why explicitly saying that it should be non-nullable? 
   
   Is that because the nullability can (should) be defined at the struct level, 
and you want to avoid having an inconsistency between the "timestamp" and 
"offset_minutes" fields? (e.g. the case where only the "offset_minutes" field 
would be null for a given row, what does that mean?)
   
   I am only not sure how practical this limitation is in practice. For example 
when creating a struct from its individual fields, typically the fields itself 
will contain nulls. Alternatively we could also specify that if one is null, 
the other should be null as well?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-10-30 Thread via GitHub


jorisvandenbossche commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2479807607


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset

Review Comment:
   Formatting: can you move this new section below after the "Primitive Type 
Mappings" section? Because that is a subsection of the "Parquet Variant" 
section above



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-10-30 Thread via GitHub


serramatutu commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2479304385


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) 
representing the offset in minutes from the UTC timezone. Negative offsets 
represent time zones west of UTC, while positive offsets represent east. 
Offsets range from -779 (-12:59) to +780 (+13:00).

Review Comment:
   Oh and @rok you're right in saying timezones can go up until +14:00 in the 
wild, even if that's not standard. Politics is weird... We should maybe take 
these hard limits off of the format spec.
   
   Anyways, I digress. Let's discuss these things in the mailing list. Would 
love if you chimed in too @rok !



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-10-30 Thread via GitHub


felipecrv commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2479277740


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) 
representing the offset in minutes from the UTC timezone. Negative offsets 
represent time zones west of UTC, while positive offsets represent east. 
Offsets range from -779 (-12:59) to +780 (+13:00).

Review Comment:
   @rok minutes is coarse enough to fit in 16 bits. 15-min blocks would give us 
the ability of using just 8 bits, but I'm not so comfortable with the promise 
of the 15-minute convention holding forever everywhere in the planet.
   
   And it would create awkwardness when parsing inputs that contain 
non-15-minute-multiple offsets as @serramatutu pointed above.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-10-30 Thread via GitHub


serramatutu commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2479249353


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) 
representing the offset in minutes from the UTC timezone. Negative offsets 
represent time zones west of UTC, while positive offsets represent east. 
Offsets range from -779 (-12:59) to +780 (+13:00).

Review Comment:
   But...
   
   the main reason behind this proposal is compatibility with ANSI SQL 
`TIMESTAMP WITH TIME ZONE`, which is supported by multiple database systems 
(Snowflake, MS SQL Server, Oracle, Trino).
   
   This is the reasoning behind why we're proposing an offset in minutes as 
signed 16-bit int:
   
   > In ANSI SQL, the time zone information is defined in terms of an 
"INTERVAL" offset ranging from "INTERVAL - '12:59' HOUR TO MINUTE" to "INTERVAL 
+ '13:00' HOUR TO MINUTE". Since "MINUTE" is the smallest granularity with 
which you can represent a time zone offset, and the maximum minutes in the 
offset is 13*60=780, we believe it makes sense for the offset to be stored as a 
16-bit integer in minutes.
   
   > It is important to point out that some systems such as MS SQL Server do 
implement data types that can represent offsets with sub-minute granularity. We 
believe representing sub-minute granularity is out of scope for this proposal 
given that no current or past time zone standards have ever specified 
sub-minute offsets [9], and that is what we're trying to solve for. 
Furthermore, representing the offset in seconds rather than minutes would mean 
the maximum offset is 13*60*60=46800, which is greater than the maximum 
positive integer an int16 can represent (32768), and thus the offset type would 
need to be wider (int32).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-10-30 Thread via GitHub


serramatutu commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2479249353


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) 
representing the offset in minutes from the UTC timezone. Negative offsets 
represent time zones west of UTC, while positive offsets represent east. 
Offsets range from -779 (-12:59) to +780 (+13:00).

Review Comment:
   But...
   
   the main reason behind this proposal is compatibility with ANSI SQL 
`TIMESTAMP WITH TIME ZONE`, which is supported by multiple database systems.
   
   This is the reasoning behind why we're proposing an offset in minutes as 
signed 16-bit int:
   
   > In ANSI SQL, the time zone information is defined in terms of an 
"INTERVAL" offset ranging from "INTERVAL - '12:59' HOUR TO MINUTE" to "INTERVAL 
+ '13:00' HOUR TO MINUTE". Since "MINUTE" is the smallest granularity with 
which you can represent a time zone offset, and the maximum minutes in the 
offset is 13*60=780, we believe it makes sense for the offset to be stored as a 
16-bit integer in minutes.
   
   > It is important to point out that some systems such as MS SQL Server do 
implement data types that can represent offsets with sub-minute granularity. We 
believe representing sub-minute granularity is out of scope for this proposal 
given that no current or past time zone standards have ever specified 
sub-minute offsets [9], and that is what we're trying to solve for. 
Furthermore, representing the offset in seconds rather than minutes would mean 
the maximum offset is 13*60*60=46800, which is greater than the maximum 
positive integer an int16 can represent (32768), and thus the offset type would 
need to be wider (int32).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-10-30 Thread via GitHub


serramatutu commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2479245886


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) 
representing the offset in minutes from the UTC timezone. Negative offsets 
represent time zones west of UTC, while positive offsets represent east. 
Offsets range from -779 (-12:59) to +780 (+13:00).

Review Comment:
   Hey! We will send a [DISCUSS] in the mailing list to discuss this shortly 
(next few days, still drafting it). Let's discuss it there! 😄 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-10-30 Thread via GitHub


serramatutu commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2479249353


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) 
representing the offset in minutes from the UTC timezone. Negative offsets 
represent time zones west of UTC, while positive offsets represent east. 
Offsets range from -779 (-12:59) to +780 (+13:00).

Review Comment:
   But...
   
   the main reason behind this is compatibility with ANSI SQL `TIMESTAMP WITH 
TIME ZONE`, which is supported by multiple database systems.
   
   This is the reasoning behind why we're proposing an offset in minutes as 
signed 16-bit int:
   
   > In ANSI SQL, the time zone information is defined in terms of an 
"INTERVAL" offset ranging from "INTERVAL - '12:59' HOUR TO MINUTE" to "INTERVAL 
+ '13:00' HOUR TO MINUTE". Since "MINUTE" is the smallest granularity with 
which you can represent a time zone offset, and the maximum minutes in the 
offset is 13*60=780, we believe it makes sense for the offset to be stored as a 
16-bit integer in minutes.
   
   > It is important to point out that some systems such as MS SQL Server do 
implement data types that can represent offsets with sub-minute granularity. We 
believe representing sub-minute granularity is out of scope for this proposal 
given that no current or past time zone standards have ever specified 
sub-minute offsets [9], and that is what we're trying to solve for. 
Furthermore, representing the offset in seconds rather than minutes would mean 
the maximum offset is 13*60*60=46800, which is greater than the maximum 
positive integer an int16 can represent (32768), and thus the offset type would 
need to be wider (int32).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-10-30 Thread via GitHub


serramatutu commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2479245886


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) 
representing the offset in minutes from the UTC timezone. Negative offsets 
represent time zones west of UTC, while positive offsets represent east. 
Offsets range from -779 (-12:59) to +780 (+13:00).

Review Comment:
   Hey! We will send a [DISCUSS] in the mailing list to discuss this shortly 
(next few days, still drafting it). Let's discuss it there!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-10-30 Thread via GitHub


serramatutu commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2479245886


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) 
representing the offset in minutes from the UTC timezone. Negative offsets 
represent time zones west of UTC, while positive offsets represent east. 
Offsets range from -779 (-12:59) to +780 (+13:00).

Review Comment:
   Hey! We will send a proposal in the mailing list to discuss this shortly 
(next few days, still drafting it). Let's discuss it there!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-10-30 Thread via GitHub


rok commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2478871611


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) 
representing the offset in minutes from the UTC timezone. Negative offsets 
represent time zones west of UTC, while positive offsets represent east. 
Offsets range from -779 (-12:59) to +780 (+13:00).

Review Comment:
   I believe (current) [timezones in the 
wild](https://en.wikipedia.org/wiki/List_of_UTC_offsets) cover a range of 
-12:00 to + 14:00.
   
   We could specify offsets should preferably be multiples of 15 minutes as 
[suggested here](https://en.wikipedia.org/wiki/UTC_offset):
   > By convention, every inhabited place in the world has a UTC offset that is 
a multiple of 15 minutes but the majority of offsets are stated in whole hours. 
There are many cases where the national [standard 
time](https://en.wikipedia.org/wiki/Standard_time) uses a UTC offset that is 
not defined solely by longitude.
   
   Alternatively - if we wanted to represent old sun time offsets - we'd have 
to consider [fractions of 
seconds](https://github.com/eggert/tz/blob/8e43ecbb256178b4fa7314491abdc0ce8f5f1c9c/northamerica#L329-L331).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type [arrow]

2025-10-30 Thread via GitHub


rok commented on code in PR #48002:
URL: https://github.com/apache/arrow/pull/48002#discussion_r2478871611


##
docs/source/format/CanonicalExtensions.rst:
##
@@ -483,6 +483,28 @@ binary values look like.
 
 .. _variant_primitive_type_mapping:
 
+Timestamp With Offset
+=
+This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.
+
+* Extension name: ``arrow.timestamp_with_offset``.
+
+* The storage type of the extension is a ``Struct`` with 2 fields, in order:
+
+  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
+
+  * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) 
representing the offset in minutes from the UTC timezone. Negative offsets 
represent time zones west of UTC, while positive offsets represent east. 
Offsets range from -779 (-12:59) to +780 (+13:00).

Review Comment:
   I believe (current) [timezones in the 
wild](https://en.wikipedia.org/wiki/List_of_UTC_offsets) cover a range of 
-12:00 to + 14:00.
   
   We could specify minute offsets should preferably be multiples of 15 minutes 
as [suggested here](https://en.wikipedia.org/wiki/UTC_offset):
   > By convention, every inhabited place in the world has a UTC offset that is 
a multiple of 15 minutes but the majority of offsets are stated in whole hours. 
There are many cases where the national [standard 
time](https://en.wikipedia.org/wiki/Standard_time) uses a UTC offset that is 
not defined solely by longitude.
   
   Alternatively - if we wanted to represent old sun time offsets - we'd have 
to consider [fractions of 
seconds](https://github.com/eggert/tz/blob/8e43ecbb256178b4fa7314491abdc0ce8f5f1c9c/northamerica#L329-L331).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]