Hello all,
This proposal looks fine to me on the principle. I've posted a couple
comments on the format PR.
Regards
Antoine.
Le 30/10/2025 à 22:02, serramatutu a écrit :
Hello everyone!
We (me and @felipecrv) would like to propose a new canonical extension type:
"TimestampWithOffset". Before we start an official voting, we would like to
discuss our proposal in this thread.
A draft of the format documentation change can be found at [1]. A copy of its
text is attached under the FORMAT section.
A draft Go implementation can be found at [2].
A draft Rust implementation can be found at [3].
THE PROBLEM
---
"TIMESTAMP WITH TIME ZONE" is a standard ANSI SQL type that associates a
timezone offset to each timestamp entry in a database. Many database systems support this
data type (some use aliases). Some examples are Snowflake [4], Trino [5], Oracle DB [6]
and MS SQL Server [7].
The current set of Arrow types can only keep one timezone that applies to the entire
column. This limits the expressiveness of data when interacting with such SQL databases.
Consumer systems currently need to either convert from source "(timestamp,
timezone_offset)" to a normalized arrow UTC timestamp and throw away the original
time zone information, or use bespoke formats if the time zone needs to be preserved. For
example, the ADBC implementation for Snowflake currently implements the former [8].
Dropping the time zone has correctness implications for some applications. Consider a global business which
performs monthly reporting. Each business unit is located in a different continent, and processes a number of
orders every month. Each order is placed in an "orders" fact table, where "ordered_at" is
a "TIMESTAMP WITH TIME ZONE" column. Now, assume there is an order that was placed at 23:00 (11pm)
in California (UTC-8), on the 31st of January. A system that simply casts this timestamp to UTC and drops the
time zone will think this order was placed in February, which is an incorrect assumption given the business'
reporting needs of associating orders to the business unit where it was processed. There is no way to fetch
the original time zone of the order, and using the local client time zone does not help either. And so one
cannot generate a correct report without workarounds to preserve the time zone.
We think Arrow should have a canonical way representation for this.
THE PROPOSAL
---
We propose "TimestampWithOffset" to be a new canonical extension type that stores
timestamps as "struct(timestamp=timestamp[time_unit=any, timezone=utc],
offset_minutes=int16)", such that timestamps can have a per-row timezone offset instead of
having one global timezone attached to the entire column.
CONSIDERATIONS AND LIMITATIONS
---
1. Why use a 16-bit integer offset in minutes?
In ANSI SQL, the time zone information is defined in terms of an "INTERVAL" offset ranging from
"INTERVAL - '12:59' HOUR TO MINUTE" to "INTERVAL + '13:00' HOUR TO MINUTE". Since
"MINUTE" is the smallest granularity with which you can represent a time zone offset, and the maximum minutes
in the offset is 13*60=780, we believe it makes sense for the offset to be stored as a 16-bit integer in minutes.
Nonetheless, 16-bits is large enough to fit a much wider offset
It is important to point out that some systems such as MS SQL Server do
implement data types that can represent offsets with sub-minute granularity. We
believe representing sub-minute granularity is out of scope for this proposal
given that no current or past time zone standards have ever specified
sub-minute offsets [9], and that is what we're trying to solve for.
Furthermore, representing the offset in seconds rather than minutes would mean
the maximum offset is 13*60*60=46800, which is greater than the maximum
positive integer an int16 can represent (32768), and thus the offset type would
need to be wider (int32).
2. This type can still be "lossy"
Systems like Trino and Oracle DB store the time zone information as an IANA time zone
name, not as an "HOUR TO MINUTE" interval as specified by the ANSI SQL
standard. This means the source system (or the arrow compatibility layer, such as ADBC)
needs to cast time zone strings to the offset in minutes. In other words, the integer
offset is calculated at the source, not at the consumer.
This means that the consumer cannot render the original IANA time zone string, and needs to use
something like "UTC-03:00" instead of "America/Sao_Paulo", for example. It's
impossible for the consumer to lookup the IANA time zone with only the offset, as that is a
one-to-many mapping, and so the type is lossy with respect to the original IANA time zone.
There is an upside to this, which is reducing consumer complexity. It only
needs to add an offset in minutes to the UTC date to get the original date in
its time zone, without any access to the IANA time zone database nor performing
complicated conversions, like reasoning about variable-offset time zones (e.g
daylight savings).
3. JSON representation
We propose that the de/serialization to/from JSON must use RFC3339 strings [10], without loss of information.
RFC3339 is a widely accepted format across programming languages and databases, and we argue encoding
"TimestampWithOffset" with it would make JSON integration with external non-Arrow systems easier.
It enables JSON consumers to decode to their own timezone-aware representation of timestamps (like Go's
"time.Time" or JavaScript's "Date") by leveraging existing RFC3339 de/encoders without
having to implement wrapper boilerplate just for Arrow.
RELEVANT LINKS
---
[1] Format specification pull request.
https://github.com/apache/arrow/pull/48002
[2] Golang implementation draft. https://github.com/apache/arrow-go/pull/558
[3] Rust implementation draft. https://github.com/apache/arrow-rs/pull/8743
[4] Snowflake's TIMESTAMP_TZ.
https://docs.snowflake.com/en/sql-reference/data-types-datetime#timestamp-ltz-timestamp-ntz-timestamp-tz
[5] Trino's TIMESTAMP WITH TIME ZONE.
https://trino.io/docs/current/language/types.html#timestamp-p-with-time-zone
[6] Oracle's TIMESTAMP WITH TIME ZONE.
https://docs.oracle.com/en/database/oracle/oracle-database/19/nlspg/datetime-data-types-and-time-zone-support.html
[7] MS SQL Server's DATETIMEOFFSET.
https://learn.microsoft.com/en-us/sql/t-sql/data-types/datetimeoffset-transact-sql?view=sql-server-ver17
[8] ADBC for Snowflake converts to UTC and drops time zone.
https://github.com/apache/arrow-adbc/blob/a67ab5a509676feaec8e24dba479d4de8dc083e2/go/adbc/driver/snowflake/record_reader.go#L228
[9] Current time zones in effect.
https://en.wikipedia.org/wiki/List_of_tz_database_time_zones
[10] RFC3339 representation of timezone-aware timestamps.
https://datatracker.ietf.org/doc/html/rfc3339
FORMAT
---
Timestamp With Offset
=============
This type represents a timestamp column that stores potentially different
timezone offsets per value. The timestamp is stored in UTC alongside the
original timezone offset in minutes.
* Extension name: ``arrow.timestamp_with_offset``.
* The storage type of the extension is a ``Struct`` with 2 fields, in order:
* ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
* ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``)
representing the offset in minutes from the UTC timezone. Negative offsets
represent time zones west of UTC, while positive offsets represent east.
Offsets range from -779 (-12:59) to +780 (+13:00).
* Extension type parameters:
* ``time_unit``: the time-unit of each of the stored UTC timestamps.
* Description of the serialization:
Extension metadata is an empty string.
When de/serializing to/from JSON, this type must be represented as an
RFC3339 string, respecting the ``TimeUnit`` precision and time zone offset
without loss of information. For example ``2025-01-01T00:00:00Z`` represents
January 1st 2025 in UTC with second precision, and
``2025-01-01T00:00:00.000000001-07:00`` represents one nanosecond after January
1st 2025 in UTC-07.