Re: [DISCUSS] SPIP: Nano-second timestamps: micros + nanos of micro

Max Gekk Tue, 19 May 2026 01:47:29 -0700

Hello Xiaoxuan,

>  If we are committing to standard-first for new types going forward,
that's a useful precedent to set explicitly.


This is not the first time: we already separate SQL logical types from
engine/storage physical encodings — e.g. ANSI intervals (logical qualifiers
vs compact internal structs), DECIMAL(p,s) (ANSI semantics vs BigDecimal
with engine max precision), CHAR/VARCHAR(n) vs unbounded STRING, TIME(p),
TIMESTAMP_NTZ / TIMESTAMP_LTZ, and ANSI mode casting rules.

Datasources add another layer: Parquet/Arrow/Avro/JDBC each have their own
limits (max string length, timestamp wire types, partition path character
sets, etc.). Those constraints belong at read/write boundaries, not in the
definition of the SQL type.

> The SPIP doesn't have to solve all of these in v1, but it should at least
state which behavior we're committing to (throw / silently truncate /
nullable representation), so users and downstream connectors know what to
expect.

The SPIP is scoped to the compute engine: logical types, in-memory physical
encoding, SQL semantics, and engine egress (Arrow/Connect/toPandas from
Spark). Per-format limits are owned by each datasource/connector — Parquet,
Avro, Iceberg, JDBC, etc. each define what they can store on disk or over
the wire.

That split already exists today: Spark SQL does not try to specify every
sink’s behavior in one place. Datasources typically fail on write/read when
a value does not fit — e.g. Avro IncompatibleSchemaException for
unsupported logical types, JDBC numeric out-of-range, path-length limits on
file sources. For timestamps, Parquet’s PARQUET_*_REBASE_MODE=EXCEPTION is
the precedent: out-of-range / ambiguous datetime values raise
SparkUpgradeException.
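
To make the "fail at the boundary" behavior concrete, here is a minimal
sketch of a range check a connector could apply before writing the
composite value to an INT64 epoch-nanos sink. The function names and the
choice of OverflowError are assumptions for illustration only; they are not
Spark APIs, and each connector would define its own error.

```python
from datetime import datetime, timedelta

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
EPOCH = datetime(1970, 1, 1)


def composite_to_epoch_nanos(epoch_micros: int, nanos_of_micro: int) -> int:
    """Convert (epochMicros, nanosOfMicro) to INT64 epoch-nanos or raise."""
    if not 0 <= nanos_of_micro <= 999:
        raise ValueError("nanosOfMicro must be in [0, 999]")
    nanos = epoch_micros * 1000 + nanos_of_micro
    if not INT64_MIN <= nanos <= INT64_MAX:
        # Hypothetical error type; a real connector would pick its own.
        raise OverflowError("timestamp does not fit INT64 epoch-nanos")
    return nanos


def epoch_micros_of(dt: datetime) -> int:
    """Microseconds since 1970-01-01 on the proleptic Gregorian calendar."""
    return (dt - EPOCH) // timedelta(microseconds=1)


# A present-day value fits into the INT64 sink.
in_range = composite_to_epoch_nanos(epoch_micros_of(datetime(2026, 5, 19)), 123)

# A year-1500 value is representable in the composite form but not in the sink.
try:
    composite_to_epoch_nanos(epoch_micros_of(datetime(1500, 1, 1)), 0)
    year_1500_fits = True
except OverflowError:
    year_1500_fits = False
```

The point of the sketch is only that the check lives in the conversion at
the write boundary, not in the definition of the SQL type.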

> accepting some regression as documented

First of all, it should not be considered a regression because we are
introducing a new feature. Second, let us compare the composite
(INT64 + UINT16) and INT64 internal representations.

Composite adds a second field (sub-micro nanos) -> extra load/store vs a
single long. On modern CPUs that is usually one more cache line touch in a
hot struct, often amortized by sequential access and prefetch when scanning
columns (Arrow/Parquet/UnsafeRow are already memory-bandwidth–heavy; the
incremental cost is not automatically “2×”).
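
As a rough illustration of the size side of that argument, the sketch below
packs composite values into two parallel buffers (one of 64-bit longs, one
of 16-bit shorts) and compares the byte count with a single-long column.
The layout is an assumption chosen for the example; Spark's actual row and
vector encodings may differ, and this says nothing about CPU time.

```python
import struct

# Standard sizes with the "<" prefix: q = signed 64-bit, H = unsigned 16-bit.
SINGLE_LONG = struct.calcsize("<q")   # one INT64 epoch value
COMPOSITE = struct.calcsize("<qH")    # epoch micros + nanos of micro, unpadded


def pack_columns(values):
    """Pack (epoch_micros, nanos_of_micro) pairs into two parallel buffers."""
    micros = struct.pack(f"<{len(values)}q", *(m for m, _ in values))
    nanos = struct.pack(f"<{len(values)}H", *(n for _, n in values))
    return micros, nanos


values = [(0, 0), (1_700_000_000_000_000, 123), (-1, 999)]
micros_buf, nanos_buf = pack_columns(values)

# Bytes used by the composite columns relative to a single-long column.
ratio = (len(micros_buf) + len(nanos_buf)) / len(micros_buf)
```

Under these assumptions the composite columns take 10 bytes per value
instead of 8, so the ratio is 1.25 rather than 2.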

INT64 epoch-nanos is not free in Spark either: most datetime paths still
decompose into seconds + nanos-of-second for java.time (Instant,
formatters, TZ rules, rebasing). You pay div/mod or similar on every call
into the JDK regardless of how the value sits in the row.
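
A small sketch of that decomposition, written in Python for brevity. Both
helper names are hypothetical. Python's divmod floors toward negative
infinity, which gives the non-negative nanos-of-second that a
seconds-plus-nanos model such as java.time.Instant expects.

```python
from typing import Tuple

NANOS_PER_SECOND = 1_000_000_000
MICROS_PER_SECOND = 1_000_000


def from_epoch_nanos(epoch_nanos: int) -> Tuple[int, int]:
    """INT64 epoch-nanos -> (epoch seconds, nanos of second)."""
    return divmod(epoch_nanos, NANOS_PER_SECOND)


def from_composite(epoch_micros: int, nanos_of_micro: int) -> Tuple[int, int]:
    """(epochMicros, nanosOfMicro) -> (epoch seconds, nanos of second)."""
    seconds, micros_of_second = divmod(epoch_micros, MICROS_PER_SECOND)
    return seconds, micros_of_second * 1000 + nanos_of_micro


# The same instant, one nanosecond before the epoch, in both layouts.
a = from_epoch_nanos(-1)
b = from_composite(-1, 999)
```

Both layouts need one floor division and one remainder per conversion; the
composite form adds a multiply and an add for the sub-micro part.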

> A concrete plan for performance mitigation. Just curious how you're
thinking about this — benchmarks will tell us where the gap lands

We will run end-to-end benchmarks on workloads that matter for migration —
ns timestamp pipelines from systems like Snowflake, Trino, Oracle, and DB2
— and compare Spark before vs after nano types, and where useful Spark vs
those systems on the same logical query/data. I would not treat the
in-engine INT64 layout as a baseline.

If profiling shows slowdowns, we will optimize proven hot spots first —
including dedicated UnsafeRow getters for the composite layout (epoch
micros + sub-micro nanos) so scan-heavy paths stay competitive with today’s
single-long timestamp access. Broader rewrites wait on data from those
end-to-end runs.

Yours faithfully,
Max Gekk

On Mon, May 18, 2026 at 11:57 PM Xiaoxuan Li <[email protected]>
wrote:
>
> Hi Max,
>
> Sorry for the late response, and thanks all for the thoughtful
discussion. If the community's view is that we should strictly adhere to
the SQL standard's 0001–9999 range for the new nano types, then I agree
composite is the right call given the bit-count constraint, and the rest of
the discussion follows from there. I'd just want to make sure that's a
deliberate decision rather than something assumed implicitly, since today
Spark's posture is mixed (ANSI mode opt-in, TimestampType permits years far
outside 0001–9999, several legacy configs deviate from the standard). If we
are committing to standard-first for new types going forward, that's a
useful precedent to set explicitly.
>
> Assuming that's the direction, a few specifics I'd like to see nailed
down in the SPIP before vote, since they affect implementation scope and
user experience:
>
> 1. Egress behavior beyond Parquet. Wenchen suggested a custom Parquet
struct for wider-range nanos; that's a reasonable Parquet-side fix, but two
things I'd flag: first, a custom struct is a Spark-specific encoding —
other engines reading the same file would see a struct rather than a
timestamp, so we'd lose cross-engine interoperability that Spark's Parquet
write path has today. Second, even if we accept that trade-off for Parquet,
the rest of the egress paths are still all INT64 epoch-nanos — Arrow
Timestamp(NANOSECOND), Iceberg V3 timestamp_ns, Pandas datetime64[ns], Avro
timestamp-nanos, and Spark Connect (Arrow IPC end-to-end) are all INT64.
What's the intended behavior when a user's TIMESTAMP_NTZ(9) column has a
year-1500 value and they:
>
> Call df.toPandas() or df.toArrow()?
> Fetch results through PySpark Connect or a non-JVM Connect client?
> Write to Iceberg V3 / Avro?
> The SPIP doesn't have to solve all of these in v1, but it should at least
state which behavior we're committing to (throw / silently truncate /
nullable representation), so users and downstream connectors know what to
expect.
>
> 2. A concrete plan for performance mitigation. Just curious how you're
thinking about this — benchmarks will tell us where the gap lands, but the
cost surface is wide enough (UnsafeRow operators, codegen, sort/hash/join,
shuffle, Arrow egress) that a single number probably won't cover it. If
numbers come back unfavorable on some paths, what's the rough plan? Native
optimization for the hot paths, accepting some regression as documented, an
opt-out config, something else? Even a high-level direction would help size
the implementation work. Happy to help benchmark an INT64 implementation
alongside composite if that's useful for grounding the comparison.
>
> Thanks,
> Xiaoxuan
>
> On Thu, May 14, 2026 at 11:40 PM Max Gekk <[email protected]> wrote:
>>
>> I would like to kick off voting for the SPIP today if there are no
objections.
>>
>> On Wed, May 13, 2026 at 8:13 PM serge rielau.com <[email protected]>
wrote:
>>>
>>> Fair enough, I was not aware of the Java limitation and the resulting
dependency.
>>>
>>> On May 13, 2026, at 10:28 AM, Max Gekk <[email protected]> wrote:
>>>
>>> Hi Serge,
>>>
>>> > If we agree that any performance (and memory) cliff is going
composite and not whether the extra bytes are 2 or 4 bytes, then would it
make sense to match Trino? We would:
>>>
>>> If we supported picosecond precision, it could cause the
following issues, IMHO:
>>> 1. Spark's datetime stack today is “nanos‑native,” not “picos‑native.”
>>> java.time (Instant, LocalDateTime, ZonedDateTime, Duration, etc.)
exposes nanoseconds as the finest supported unit in the public model.
Supporting p > 9 in Spark SQL means either rounding away picos at almost
every boundary or building custom arithmetic, normalization, parsing, and
calendar logic for the sub‑nano tail. That is a large, long‑lived surface
area, with high regression risk anywhere we already struggle: LTZ vs NTZ,
session time zone, legacy rebasing, Julian/Gregorian, pushdown, codegen,
etc. So "same cost as going composite for nanos" does not imply "picos are
free once we went composite."
>>> 2. Memory is not only “+2 vs +4 bytes” — it is “+delta bytes * row
width * shuffle fanout.”
>>> Picos widen rows further than nanos, which increases OOM / GC / shuffle
spill risk on the same heap and cluster sizes — especially for wide fact
tables and skewed joins on timestamp keys.
>>> 3. Interchange and “federation” still do not become automatic.
>>> Even if Trino is aligned internally, Parquet / Arrow / Pandas / JDBC
paths overwhelmingly standardize on nanos at best for compact physical
encodings.
>>>
>>> Best regards,
>>> Max Gekk
>>>
>>> On Wed, May 13, 2026 at 4:04 PM serge rielau.com <[email protected]>
wrote:
>>> >
>>> > A few questions to ponder:
>>> >
>>> > Are we committed to the SQL Standard, even when it may be tactically
inconvenient?
>>> > Why did Trino and Db2 go to pico? I can answer for Db2 as I was in
the room: We wanted to build for the future and rip off the band-aid, and there
was no extra design or QA cost. What was Trino’s thinking?
>>> > In my career I have seen DBMS needs go from milli to micro to nano.
Nano will not be the end of it. While for all intents and purposes
“antique” nanoseconds are too esoteric to sweat about, sticking with int64
will not be an option for pico.
>>> > Storage is data at rest. It is “easy” to add another format. Engines
like Spark outlive storage formats, and so do their APIs.
>>> >
>>> > If we agree that any performance (and memory) cliff is going
composite and not whether the extra bytes are 2 or 4 bytes, then would it
make sense to match Trino? We would:
>>> >
>>> > Have an actual external benefit outside of the corner case of range
>>> > Peace of mind for the API for at least a decade, perhaps more (if we
go Femto .. which is free upgrade at 4 bytes)
>>> > Full compatibility with any federated datasource
>>> > Standard compliance
>>> >
>>> >
>>> >
>>> >
>>> > On May 13, 2026, at 2:40 AM, Wenchen Fan <[email protected]> wrote:
>>> >
>>> > Sorry, I misclicked the send button, let me finish.
>>> >
>>> > We can throw out-of-range errors if the actual timestamp value does
not fit the Parquet INT64, and we can work with the Parquet and
other data format communities to add support for timestamp nanos with a
wider year range. Before that, we can write a custom struct in Parquet to
save this timestamp nano type.
>>> >
>>> > On Wed, May 13, 2026 at 5:38 PM Wenchen Fan <[email protected]>
wrote:
>>> >>
>>> >> I think the main question is what the requirements are for this new
timestamp nano type. Personally I think it's better to follow the SQL standard
and support the year range 0000 to 9999. This kills the INT64 option. For data
sources, we can throw an out-of-range error if the actual timestamp value does
not fit the Parquet INT64
>>> >>
>>> >> On Tue, May 12, 2026 at 5:38 PM Max Gekk <[email protected]> wrote:
>>> >>>
>>> >>> Hi Xiaoxuan,
>>> >>>
>>> >>> Thank you for the detailed clarification of your proposal.
>>> >>>
>>> >>> > the key difference is internal representation, our draft uses
INT64 epoch-nanos, yours uses composite (epochMicros, nanosOfMicro).
>>> >>>
>>> >>> I think the main difference between our proposals is how we answer
the
>>> >>> question: shall Spark SQL conform to the SQL standard or not? The
>>> >>> standard says clearly that the year range is from 0001 to 9999.
Rough
>>> >>> count of distinct nanosecond instants on a proleptic-Gregorian line
>>> >>> from 0001‑01‑01 through 9999‑12‑31:
>>> >>> - About 3.65*10^6 civil days in that span (order of magnitude is
enough).
>>> >>> - Each day has 86400*10^9 = 8.64*10^13 distinct nanosecond offsets
>>> >>> from midnight.
>>> >>> So the number of distinct values is about: N ≈ 3.65*10^6 *
>>> >>> 8.64*10^13 ≈ 3.2*10^20
>>> >>> Then: log2(N) ≈ 68-69 bits.
>>> >>> Any mapping from that full set would need at least about 69 bits.
>>> >>>
>>> >>> > Four concerns, and I'd value your read on whether they're
solvable:
>>> >>> > Composite doesn't fit UnsafeRow's 8-byte slot, so every
sort/hash/join/shuffle pays the variable-length cost: extra memory access,
worse cache locality, ~2–3x memory per value.
>>> >>>
>>> >>> You are right for UnsafeRows but built-in datasources like Parquet
and
>>> >>> ORC might return Column Vectors where values are stored as arrays of
>>> >>> long, short. And such values could be processed in vectorized ways.
I
>>> >>> believe the new data type will have worse performance, but not so
>>> >>> significant.
>>> >>>
>>> >>> > The range benefit doesn't survive egress. Spark's main egress
paths are all INT64 epoch-nanos: Parquet NANOS, Arrow (so PySpark and Spark
Connect), Iceberg V3 timestamp_ns, Avro, Pandas datetime64[ns].
>>> >>>
>>> >>> Below are the sources from which timestamps with nanosecond
precision
>>> >>> outside the range 1677-2262 could come:
>>> >>> 1. Parquet: Spark's TIMESTAMP_LTZ is still saved/loaded from INT96
by
>>> >>> default which has nanoseconds precision.
>>> >>> 2. Another built-in datasource ORC stores timestamps with nanosecond
>>> >>> precision, see https://orc.apache.org/specification/ORCv2/
>>> >>> 3. Spark SQL can have access to some external DBMSs that support
>>> >>> nanoseconds precision, for instance Oracle, MS SQL Server,
Snowflake,
>>> >>> Trino, Teradata.
>>> >>>
>>> >>> > Nanosecond precision tends to go with modern-measurement data
(HFT, traces, IoT, logs); wide calendar range tends to go with archival
data where milli or second precision is enough.
>>> >>>
>>> >>> I would imagine that Spark users might need timestamps with nanos
>>> >>> outside the range 1677-2262:
>>> >>> - Simulating some physical processes in the future or in the past.
>>> >>> - Migration from other systems.
>>> >>>
>>> >>> > Composite is hard to walk back once shipped. The two directions
aren't symmetric. Starting with INT64 and upgrading to composite later is
SQL-layer compatible
>>> >>>
>>> >>> INT64 epoch-nanos is also a one-way semantic bet in the other
>>> >>> direction: once users store physics-time workloads in that encoding,
>>> >>> widening later without reinterpretation is not free either.
>>> >>>
>>> >>> > The other thing that pulled us toward INT64 is that it's the
choice most open-source columnar and lakehouse engines have already made.
DuckDB's TIMESTAMP_NS, ClickHouse's DateTime64(9), and InfluxDB's timestamp
storage all use INT64 epoch-nanos with the 1678–2262 bound.
>>> >>>
>>> >>> Matching open columnar consensus for wire formats is a strong
default
>>> >>> for interchange, I agree. I would separate that from the question of
>>> >>> Spark’s in-memory representation.
>>> >>>
>>> >>> > Given the perf concern especially, we'd prefer INT64 for now.
@Unstable keeps the door open to the composite layout later
>>> >>>
>>> >>> How about measuring the performance of an MVP on end-to-end
>>> >>> benchmarks? We could address perf concerns later.
>>> >>>
>>> >>> Yours faithfully,
>>> >>> Max Gekk
>>> >>>
>>> >>>
>>> >>> On Tue, May 12, 2026 at 1:52 AM Xiaoxuan Li <
[email protected]> wrote:
>>> >>> >
>>> >>> > Hi Max,
>>> >>> > Thanks for the writeup. I've been working on a related proposal
in parallel — SPIP: Support NanoSecond Timestamp Types. The user-visible
surface overlaps a lot (SQL syntax, new catalyst types, Parquet NANOS
interop); the key difference is internal representation, our draft uses
INT64 epoch-nanos, yours uses composite (epochMicros, nanosOfMicro).
>>> >>> >
>>> >>> > If we decide to go with composite, I agree your layout is the
right one: it reuses micros-based DateTimeUtils, aligns the calendar range
with TimestampType, and keeps the extra precision as a small bounded correction.
>>> >>> >
>>> >>> > We started with INT64 because we're worried about paying
composite's cost without getting the real benefit. Four concerns, and I'd
value your read on whether they're solvable:
>>> >>> >
>>> >>> > Hot-path performance. Composite doesn't fit UnsafeRow's 8-byte
slot, so every sort/hash/join/shuffle pays the variable-length cost: extra
memory access, worse cache locality, ~2–3x memory per value. Trino is the
closest precedent — they went composite for TIMESTAMP(p>6) because their
ceiling is picoseconds, and even so the perf gap between short and long
representations was significant enough that they added a
hive.timestamp-precision toggle so users could force high-precision columns
back to micros. Our ceiling is nanoseconds, so we'd take on Trino's cost
without Trino's reason. Curious how you see it playing out differently.
>>> >>> > The range benefit doesn't survive egress. Spark's main egress
paths are all INT64 epoch-nanos: Parquet NANOS, Arrow (so PySpark and Spark
Connect), Iceberg V3 timestamp_ns, Avro, Pandas datetime64[ns]. A year-1500
value can live in Spark memory under composite but can't leave — it either
throws on write/fetch or gets silently truncated, depending on how the
boundary is specified. Curious what you have in mind for the egress side.
>>> >>> > Do workloads actually need both? Nanosecond precision tends to go
with modern-measurement data (HFT, traces, IoT, logs); wide calendar range
tends to go with archival data where milli or second precision is enough.
We haven't found a case where a single column needs both — same assumption
Parquet, Arrow, Iceberg, and Pandas seem to make. The one case where they
do intersect is sentinel values — 9999-12-31 for "no end date," 0001-01-01
for "unknown start" — mixed into columns that otherwise hold
nanosecond-precise timestamps. Your proposal handles this natively; ours
asks users to either use NULL or pick a sentinel within range. That's a real
user-facing ask. Curious whether you've seen other patterns, since
sentinels alone feel like something that could also be addressed at the
data-modeling layer.
>>> >>> > Composite is hard to walk back once shipped. The two directions
aren't symmetric. Starting with INT64 and upgrading to composite later is
SQL-layer compatible — user queries and declared schemas don't move, the
existing Parquet files keep meaning the same thing (Spark just reads INT64
nanos into composite at the edge), and new writes can carry the wider range
once Parquet or Arrow grow support. Starting with composite is effectively
a one-way commitment: the moment users persist year-1500 values into
tables, Spark owns supporting those values forever, because narrowing the
type after the fact would be data loss from the user's perspective. So
starting narrow preserves the option to go wider if the evidence shifts;
starting wide locks in the cost on day one.
>>> >>> >
>>> >>> > The other thing that pulled us toward INT64 is that it's the
choice most open-source columnar and lakehouse engines have already made.
DuckDB's TIMESTAMP_NS, ClickHouse's DateTime64(9), and InfluxDB's timestamp
storage all use INT64 epoch-nanos with the 1678–2262 bound. Parquet, Arrow,
Iceberg V3, Avro, and Pandas datetime64[ns] do too. Engines that offer
full-range nanos — Snowflake, Oracle, DB2 — either run on proprietary
storage formats they control end-to-end or are row-based OLTP with
different cost structures. Trino is the one open-source columnar engine
that went wider — it supports TIMESTAMP(p) up to picoseconds (p=12), which
simply doesn't fit in INT64, so composite was necessary. Even so, the
performance penalty is real. For a columnar engine like Spark whose data
plane runs through Parquet and Arrow, matching the open-source columnar
consensus seemed like the less surprising default.
>>> >>> >
>>> >>> > Given the perf concern especially, we'd prefer INT64 for now.
@Unstable keeps the door open to the composite layout later — if the
ecosystem grows full-range nanos, workloads push us there, or we need
sub-nanosecond precision where INT64 isn't enough.
>>> >>> >
>>> >>> > Would love any thoughts on this; it would be good to align on a single
direction before either moves forward.
>>> >>> >
>>> >>> > Thanks,
>>> >>> > Xiaoxuan Li
>>> >>> >
>>> >>> > On Fri, May 8, 2026 at 1:43 AM Wenchen Fan <[email protected]>
wrote:
>>> >>> >>
>>> >>> >> This new design makes sense to me. So we just add 2 more bytes
to store nanosOfMicro, and the rest is the same as the current timestamp
types: same value range, but higher precision.
>>> >>> >>
>>> >>> >> On Thu, May 7, 2026 at 5:16 PM Max Gekk <[email protected]>
wrote:
>>> >>> >>>
>>> >>> >>> Hi Spark devs,
>>> >>> >>>
>>> >>> >>> I’d like to share a proposal for nano-second-capable timestamp
support
>>> >>> >>> and ask for your feedback.
>>> >>> >>>
>>> >>> >>> Here is the SPIP:
>>> >>> >>>
https://docs.google.com/document/d/1DeW15QueI4PdRyPm6C6jsTZFmIjbXX2j4h-Ja5W_fsg/edit?usp=sharing
>>> >>> >>>
>>> >>> >>> My proposal uses a logical split representation:
>>> >>> >>> - epochMicros: Long
>>> >>> >>> - nanosOfMicro: Short in [0, 999]
>>> >>> >>>
>>> >>> >>> This applies to both NTZ and LTZ nano-capable types; timezone
>>> >>> >>> semantics remain unchanged and are handled at interpretation
>>> >>> >>> boundaries (as today).
>>> >>> >>>
>>> >>> >>> Why this approach? I believe this is the most practical path
for Spark
>>> >>> >>> because it:
>>> >>> >>> 0. Conforms to the SQL standard.
>>> >>> >>> 1. Preserves Spark’s existing microsecond approach. Most
>>> >>> >>> Catalyst/runtime datetime logic already uses micros. The split
model
>>> >>> >>> extends it rather than replacing it.
>>> >>> >>> 2. Avoids INT64 epoch-nanos range cliff as the primary engine
model. A
>>> >>> >>> single Long epoch-nanos representation constrains calendar
range much
>>> >>> >>> more aggressively than Long micros.
>>> >>> >>> 3. Keeps migration risk lower. Existing microsecond behavior
remains
>>> >>> >>> default; nano precision is opt-in via parameterized
types/syntax.
>>> >>> >>> 4. Allows efficient implementation paths. Internals can still
choose
>>> >>> >>> compact physical encodings (row/vector/file boundaries), while
keeping
>>> >>> >>> one canonical logical contract.
>>> >>> >>>
>>> >>> >>> Related SPIPs considered. I reviewed and compared against these
two drafts:
>>> >>> >>> - SPIP: Support NanoSecond Timestamps:
>>> >>> >>>
https://docs.google.com/document/d/1wjFsBdlV2YK75x7UOk2HhDOqWVA0yC7iEiqOMnNnxlA/edit?tab=t.0#heading=h.4kibaxwtx2xo
>>> >>> >>> - SPIP: Support NanoSecond Timestamp Types:
>>> >>> >>>
https://docs.google.com/document/d/1Q5u1whAO_KcT6d4dFFaIMy_S3RoQEo4Znwz2U-nbhls/edit?tab=t.0#heading=h.xk16mmomv6il
>>> >>> >>>
>>> >>> >>> Those drafts are valuable and informed this design. The key
difference
>>> >>> >>> is that I prioritize micros-first engine continuity with a
bounded
>>> >>> >>> nano remainder, instead of making epoch-nanos the primary
internal
>>> >>> >>> semantic unit.
>>> >>> >>> In short: I think epochMicros + nanosOfMicro is a better fit for
>>> >>> >>> Spark’s current architecture and compatibility constraints,
while
>>> >>> >>> still delivering practical nanosecond support.
>>> >>> >>>
>>> >>> >>> Thanks in advance for your feedback.
>>> >>> >>>
>>> >>> >>> Best regards,
>>> >>> >>> Max Gekk
>>> >>> >>>
>>> >>> >>>
---------------------------------------------------------------------
>>> >>> >>> To unsubscribe e-mail: [email protected]
>>> >>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >
>>>
>>>
