> Unless there are some concrete performance or memory usage experiments
that suggest concrete benefits, I would probably err on the side of not
allowing this or at least not recommending it.

It's easy to see that dict-encoding or run-end encoding the timezone
offsets will be beneficial in many scenarios (time zones repeat a lot). I
think it's OK to *ALLOW* the alternative encodings, but as always, the
simpler the encoding, the more compatible the data is. Applications tend to
be conservative in adopting advanced encodings like REE and even dictionary
encoding of non-string arrays.

> This proposal requires all timestamps to be expressed in UTC. I'm curious
if there's a need in the community to be able to express part of timestamps
as timezone naive (timestamp is known, but its offset to UTC is not) and
part as timezone aware (timestamp and UTC offset are known)?

That creates too many conditionals per value which becomes a pain for
compute kernels. Original timestamps in Arrow are very uniform (a single
but optional timezone offset), the intention of this type is too only one
new degree of freedom -- one timezone offset per value. In the FORMAT, we
are not recommending nulls for the timezone offset field but if it's
present, treat null as zero. Top-level validity (on the struct) is, of
course, allowed and very meaningful.

Initially, we didn't even put these concerns in the spec, but they arose
from code reviews and discussion here. I think we covered many of the
concerns. Can you all give another look at the FORMAT PR and gives us a
green light to start the VOTE?

--
Felipe


On Fri, Nov 7, 2025 at 12:28 PM Rok Mihevc <[email protected]> wrote:

> +1 on introducing such an extension type and thanks for preparing the
> proposal. One question:
>
> This proposal requires all timestamps to be expressed in UTC. I'm curious
> if there's a need in the community to be able to express part of timestamps
> as timezone naive (timestamp is known, but its offset to UTC is not) and
> part as timezone aware (timestamp and UTC offset are known)?
>
> Rok
>
> On Thu, Nov 6, 2025 at 11:28 PM Dewey Dunnington <
> [email protected]>
> wrote:
>
> > +1! I can't comment on the timestamp-specific details but have had
> similar
> > experiences writing database drivers where Arrow types require a
> type-level
> > parameter but the database has a row-level parameter (Postgres numerics
> > were my last adventure with this one).
> >
> > > Will dictionary encoding of the time zone component be allowed
> >
> > Unless there are some concrete performance or memory usage experiments
> that
> > suggest concrete benefits, I would probably err on the side of not
> allowing
> > this or at least not recommending it. Having a single allowed type is
> quite
> > nice (GeoArrow might be an extreme case, but we have many allowed types
> > there and it's a lot to deal with).
> >
> > > I didn't think we generally specify JSON serialization formats for
> types?
> >
> > It's true that we don't generally specify the cast behaviour to and from
> > string, although most Arrow implementations have to choose how to
> display a
> > value at some point and it might be a nice addition to ensure the value
> is
> > displayed consistently.
> >
> > On Wed, Nov 5, 2025 at 1:47 AM serramatutu <[email protected]>
> > wrote:
> >
> > > Hey!
> > >
> > > 1. I didn't think we generally specify JSON serialization formats for
> > > types?
> > >
> > > We included this in the proposal mostly because we thought it would be
> > > nicer for non-Arrow programs to parse a list of RFC3339 JSON like
> > > `["2025-01-01T00:01:00T+01:00"]` instead of something like
> > `[{"timestamp":
> > > "2025-01-01T00:00:00Z", "offset": 60}]`. I understand if JSON
> > serialization
> > > is not something we want to put in the spec. I do want to point out
> that
> > > the arrow-go implementation of the UUID extension type does have
> > > specialized JSON serialization [1], so I thought we could do the same
> for
> > > TimestampWithOffset as well. I see that the specialized UUID
> > serialization
> > > is not a part of the UUID type spec though, so I assume it's something
> > the
> > > Go implementers thought was best.
> > >
> > >
> > > 2. I assume if there is a high enough demand we can always have a
> > > different variant that uses IANA strings?
> > >
> > > Yes, this is something that we can discuss in the future if folks ask
> for
> > > it, either through extending this type somehow or adding a new
> > > `TimestampWithTimezone`. We thought of keeping things simple for now
> with
> > > this proposal, though.
> > >
> > >
> > > 3. Will dictionary encoding of the time zone component be allowed (for
> > > common cases I would expect one could save 1 byte but this is perhaps
> > over
> > > optimization).
> > >
> > > This is a good point. We could add both dictionary encoding and run-end
> > > encoding to the offsets. I think we can add this to the spec now to
> avoid
> > > future breaking changes to the data types, and keep the implementation
> > PRs
> > > simple (only primitive encoding) for now. We can later add support for
> > > other encodings in followup PRs.
> > >
> > >
> > > [1]
> > >
> >
> https://github.com/apache/arrow-go/blob/f6575665bbf95d92795c37a1a792682d9da69b08/arrow/extensions/uuid_test.go#L242
> > >
> > >
> > > On Tuesday, November 4th, 2025 at 16:18, Micah Kornfield <
> > > [email protected]> wrote:
> > >
> > > >
> > >
> > > >
> > >
> > > > Generally, seems like a good idea two questions:
> > > >
> > >
> > > > 1. I didn't think we generally specify JSON serialization formats for
> > > > types?
> > > > 2. I assume if there is a high enough demand we can always have a
> > > > different variant that uses IANA strings?
> > > > 3. Will dictionary encoding of the time zone component be allowed
> (for
> > > > common cases I would expect one could save 1 byte but this is perhaps
> > > over
> > > > optimization).
> > > >
> > >
> > > > Thanks,
> > > > Micah
> > > >
> > >
> > > > On Tue, Nov 4, 2025 at 5:25 AM Weston Pace [email protected]
> > wrote:
> > > >
> > >
> > > > > +1
> > > > >
> > >
> > > > > This has always felt like a missing option and hopefully its
> presence
> > > will
> > > > > actually make the intent of the other timestamp types clearer.
> > > > >
> > >
> > > > > On Thu, Oct 30, 2025 at 2:04 PM serramatutu
> [email protected]
> > > > > wrote:
> > > > >
> > >
> > > > > > Hello everyone!
> > > > > >
> > >
> > > > > > We (me and @felipecrv) would like to propose a new canonical
> > > extension
> > > > > > type: "TimestampWithOffset". Before we start an official voting,
> we
> > > would
> > > > > > like to discuss our proposal in this thread.
> > > > > >
> > >
> > > > > > A draft of the format documentation change can be found at [1]. A
> > > copy of
> > > > > > its text is attached under the FORMAT section.
> > > > > > A draft Go implementation can be found at [2].
> > > > > > A draft Rust implementation can be found at [3].
> > > > > >
> > >
> > > > > > THE PROBLEM
> > > > > > ---
> > > > > > "TIMESTAMP WITH TIME ZONE" is a standard ANSI SQL type that
> > > associates a
> > > > > > timezone offset to each timestamp entry in a database. Many
> > database
> > > > > > systems support this data type (some use aliases). Some examples
> > are
> > > > > > Snowflake [4], Trino [5], Oracle DB [6] and MS SQL Server [7].
> > > > > >
> > >
> > > > > > The current set of Arrow types can only keep one timezone that
> > > applies to
> > > > > > the entire column. This limits the expressiveness of data when
> > > > > > interacting
> > > > > > with such SQL databases. Consumer systems currently need to
> either
> > > > > > convert
> > > > > > from source "(timestamp, timezone_offset)" to a normalized arrow
> > UTC
> > > > > > timestamp and throw away the original time zone information, or
> use
> > > > > > bespoke
> > > > > > formats if the time zone needs to be preserved. For example, the
> > ADBC
> > > > > > implementation for Snowflake currently implements the former [8].
> > > > > >
> > >
> > > > > > Dropping the time zone has correctness implications for some
> > > > > > applications.
> > > > > > Consider a global business which performs monthly reporting. Each
> > > > > > business
> > > > > > unit is located in a different continent, and processes a number
> of
> > > > > > orders
> > > > > > every month. Each order is placed in an "orders" fact table,
> where
> > > > > > "ordered_at" is a "TIMESTAMP WITH TIME ZONE" column. Now, assume
> > > there is
> > > > > > an order that was placed at 23:00 (11pm) in California (UTC-8),
> on
> > > the
> > > > > > 31st
> > > > > > of January. A system that simply casts this timestamp to UTC and
> > > drops
> > > > > > the
> > > > > > time zone will think this order was placed in February, which is
> an
> > > > > > incorrect assumption given the business' reporting needs of
> > > associating
> > > > > > orders to the business unit where it was processed. There is no
> way
> > > to
> > > > > > fetch the original time zone of the order, and using the local
> > client
> > > > > > time
> > > > > > zone does not help either. And so one cannot generate a correct
> > > report
> > > > > > without workarounds to preserve the time zone.
> > > > > >
> > >
> > > > > > We think Arrow should have a canonical way representation for
> this.
> > > > > >
> > >
> > > > > > THE PROPOSAL
> > > > > > ---
> > > > > > We propose "TimestampWithOffset" to be a new canonical extension
> > type
> > > > > > that
> > > > > > stores timestamps as "struct(timestamp=timestamp[time_unit=any,
> > > > > > timezone=utc], offset_minutes=int16)", such that timestamps can
> > have
> > > a
> > > > > > per-row timezone offset instead of having one global timezone
> > > attached to
> > > > > > the entire column.
> > > > > >
> > >
> > > > > > CONSIDERATIONS AND LIMITATIONS
> > > > > > ---
> > > > > > 1. Why use a 16-bit integer offset in minutes?
> > > > > > In ANSI SQL, the time zone information is defined in terms of an
> > > > > > "INTERVAL" offset ranging from "INTERVAL - '12:59' HOUR TO
> MINUTE"
> > to
> > > > > > "INTERVAL + '13:00' HOUR TO MINUTE". Since "MINUTE" is the
> smallest
> > > > > > granularity with which you can represent a time zone offset, and
> > the
> > > > > > maximum minutes in the offset is 13*60=780, we believe it makes
> > > sense for
> > > > > > the offset to be stored as a 16-bit integer in minutes.
> > Nonetheless,
> > > > > > 16-bits is large enough to fit a much wider offset
> > > > > >
> > >
> > > > > > It is important to point out that some systems such as MS SQL
> > Server
> > > do
> > > > > > implement data types that can represent offsets with sub-minute
> > > > > > granularity. We believe representing sub-minute granularity is
> out
> > of
> > > > > > scope
> > > > > > for this proposal given that no current or past time zone
> standards
> > > have
> > > > > > ever specified sub-minute offsets [9], and that is what we're
> > trying
> > > to
> > > > > > solve for. Furthermore, representing the offset in seconds rather
> > > than
> > > > > > minutes would mean the maximum offset is 136060=46800, which is
> > > greater
> > > > > > than the maximum positive integer an int16 can represent (32768),
> > and
> > > > > > thus
> > > > > > the offset type would need to be wider (int32).
> > > > > >
> > >
> > > > > > 2. This type can still be "lossy"
> > > > > > Systems like Trino and Oracle DB store the time zone information
> as
> > > an
> > > > > > IANA time zone name, not as an "HOUR TO MINUTE" interval as
> > > specified by
> > > > > > the ANSI SQL standard. This means the source system (or the arrow
> > > > > > compatibility layer, such as ADBC) needs to cast time zone
> strings
> > > to the
> > > > > > offset in minutes. In other words, the integer offset is
> calculated
> > > at
> > > > > > the
> > > > > > source, not at the consumer.
> > > > > >
> > >
> > > > > > This means that the consumer cannot render the original IANA time
> > > zone
> > > > > > string, and needs to use something like "UTC-03:00" instead of
> > > > > > "America/Sao_Paulo", for example. It's impossible for the
> consumer
> > to
> > > > > > lookup the IANA time zone with only the offset, as that is a
> > > one-to-many
> > > > > > mapping, and so the type is lossy with respect to the original
> IANA
> > > time
> > > > > > zone.
> > > > > >
> > >
> > > > > > There is an upside to this, which is reducing consumer
> complexity.
> > It
> > > > > > only
> > > > > > needs to add an offset in minutes to the UTC date to get the
> > original
> > > > > > date
> > > > > > in its time zone, without any access to the IANA time zone
> database
> > > nor
> > > > > > performing complicated conversions, like reasoning about
> > > variable-offset
> > > > > > time zones (e.g daylight savings).
> > > > > >
> > >
> > > > > > 3. JSON representation
> > > > > > We propose that the de/serialization to/from JSON must use
> RFC3339
> > > > > > strings
> > > > > > [10], without loss of information. RFC3339 is a widely accepted
> > > format
> > > > > > across programming languages and databases, and we argue encoding
> > > > > > "TimestampWithOffset" with it would make JSON integration with
> > > external
> > > > > > non-Arrow systems easier. It enables JSON consumers to decode to
> > > their
> > > > > > own
> > > > > > timezone-aware representation of timestamps (like Go's
> "time.Time"
> > or
> > > > > > JavaScript's "Date") by leveraging existing RFC3339 de/encoders
> > > without
> > > > > > having to implement wrapper boilerplate just for Arrow.
> > > > > >
> > >
> > > > > > RELEVANT LINKS
> > > > > > ---
> > > > > > [1] Format specification pull request.
> > > > > > https://github.com/apache/arrow/pull/48002
> > > > > > [2] Golang implementation draft.
> > > > > > https://github.com/apache/arrow-go/pull/558
> > > > > > [3] Rust implementation draft.
> > > > > > https://github.com/apache/arrow-rs/pull/8743
> > > > > > [4] Snowflake's TIMESTAMP_TZ.
> > > > >
> > >
> > > > >
> > >
> >
> https://docs.snowflake.com/en/sql-reference/data-types-datetime#timestamp-ltz-timestamp-ntz-timestamp-tz
> > > > >
> > >
> > > > > > [5] Trino's TIMESTAMP WITH TIME ZONE.
> > > > >
> > >
> > > > >
> > >
> >
> https://trino.io/docs/current/language/types.html#timestamp-p-with-time-zone
> > > > >
> > >
> > > > > > [6] Oracle's TIMESTAMP WITH TIME ZONE.
> > > > >
> > >
> > > > >
> > >
> >
> https://docs.oracle.com/en/database/oracle/oracle-database/19/nlspg/datetime-data-types-and-time-zone-support.html
> > > > >
> > >
> > > > > > [7] MS SQL Server's DATETIMEOFFSET.
> > > > >
> > >
> > > > >
> > >
> >
> https://learn.microsoft.com/en-us/sql/t-sql/data-types/datetimeoffset-transact-sql?view=sql-server-ver17
> > > > >
> > >
> > > > > > [8] ADBC for Snowflake converts to UTC and drops time zone.
> > > > >
> > >
> > > > >
> > >
> >
> https://github.com/apache/arrow-adbc/blob/a67ab5a509676feaec8e24dba479d4de8dc083e2/go/adbc/driver/snowflake/record_reader.go#L228
> > > > >
> > >
> > > > > > [9] Current time zones in effect.
> > > > > > https://en.wikipedia.org/wiki/List_of_tz_database_time_zones
> > > > > > [10] RFC3339 representation of timezone-aware timestamps.
> > > > > > https://datatracker.ietf.org/doc/html/rfc3339
> > > > > >
> > >
> > > > > > FORMAT
> > > > > > ---
> > > > > > Timestamp With Offset
> > > > > > =============
> > > > > > This type represents a timestamp column that stores potentially
> > > different
> > > > > > timezone offsets per value. The timestamp is stored in UTC
> > alongside
> > > the
> > > > > > original timezone offset in minutes.
> > > > > >
> > >
> > > > > > * Extension name: `arrow.timestamp_with_offset`.
> > > > > >
> > >
> > > > > > * The storage type of the extension is a `Struct` with 2 fields,
> in
> > > > > > order:
> > > > > >
> > >
> > > > > > * `timestamp`: a non-nullable `Timestamp(time_unit, "UTC")`,
> where
> > > > > > `time_unit` is any Arrow `TimeUnit` (s, ms, us or ns).
> > > > > >
> > >
> > > > > > * `offset_minutes`: a non-nullable signed 16-bit integer
> (`Int16`)
> > > > > > representing the offset in minutes from the UTC timezone.
> Negative
> > > > > > offsets
> > > > > > represent time zones west of UTC, while positive offsets
> represent
> > > east.
> > > > > > Offsets range from -779 (-12:59) to +780 (+13:00).
> > > > > >
> > >
> > > > > > * Extension type parameters:
> > > > > >
> > >
> > > > > > * `time_unit`: the time-unit of each of the stored UTC
> timestamps.
> > > > > >
> > >
> > > > > > * Description of the serialization:
> > > > > >
> > >
> > > > > > Extension metadata is an empty string.
> > > > > >
> > >
> > > > > > When de/serializing to/from JSON, this type must be represented
> as
> > an
> > > > > > RFC3339 string, respecting the `TimeUnit` precision and time zone
> > > > > > offset
> > > > > > without loss of information. For example `2025-01-01T00:00:00Z`
> > > > > > represents January 1st 2025 in UTC with second precision, and
> > > > > > `2025-01-01T00:00:00.000000001-07:00` represents one nanosecond
> > after
> > > > > > January 1st 2025 in UTC-07.
> >
>

Reply via email to