Re: [DISCUSS] SPIP: Nano-second timestamps: micros + nanos of micro

Xiaoxuan Li Mon, 11 May 2026 16:52:34 -0700

Hi Max,
Thanks for the writeup. I've been working on a related proposal in parallel
— SPIP: Support NanoSecond Timestamp Types
<https://docs.google.com/document/d/1Q5u1whAO_KcT6d4dFFaIMy_S3RoQEo4Znwz2U-nbhls/edit?usp=sharing>.
The user-visible surface overlaps a lot (SQL syntax, new Catalyst types,
Parquet NANOS interop); the key difference is internal representation: our
draft uses INT64 epoch-nanos, yours uses a composite (epochMicros,
nanosOfMicro).

If we decide to go with composite, I agree your layout is the right one: it
reuses micros-based DateTimeUtils, aligns the calendar range with
TimestampType, and keeps the extra precision as a small bounded correction.
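
To make the relationship between the two layouts concrete, here is a minimal
sketch in Python. The helper names are hypothetical and not taken from either
SPIP; the only assumption carried over from the thread is that nanosOfMicro
stays in [0, 999], including for pre-1970 values.

```python
from typing import Tuple

NANOS_PER_MICRO = 1_000


def split_epoch_nanos(epoch_nanos: int) -> Tuple[int, int]:
    """Split INT64 epoch-nanos into (epochMicros, nanosOfMicro).

    Hypothetical helper. divmod floors toward negative infinity, so the
    remainder stays in [0, 999] even for timestamps before 1970.
    """
    epoch_micros, nanos_of_micro = divmod(epoch_nanos, NANOS_PER_MICRO)
    return epoch_micros, nanos_of_micro


def join_epoch_nanos(epoch_micros: int, nanos_of_micro: int) -> int:
    """Recombine a composite value into a single epoch-nanos integer.

    Python integers are unbounded, so this does not check INT64 overflow.
    """
    if not 0 <= nanos_of_micro < NANOS_PER_MICRO:
        raise ValueError("nanosOfMicro must be in [0, 999]")
    return epoch_micros * NANOS_PER_MICRO + nanos_of_micro


# One nanosecond before the epoch maps to micros = -1 with a 999 ns remainder.
print(split_epoch_nanos(-1))
print(join_epoch_nanos(-1, 999))
```

The floor-based split is what keeps the nano part a small non-negative
correction on top of the existing micros value, rather than a signed offset.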

We started with INT64 because we're worried about paying composite's cost
without getting the real benefit. Four concerns, and I'd value your read on
whether they're solvable:

   1. *Hot-path performance*. Composite doesn't fit UnsafeRow's 8-byte
   slot, so every sort/hash/join/shuffle pays the variable-length cost: extra
   memory access, worse cache locality, ~2–3x memory per value. Trino is the
   closest precedent — they went composite for TIMESTAMP(p>6) because their
   ceiling is picoseconds, and even so the perf gap between short and long
   representations was significant enough that they added a
   hive.timestamp-precision toggle so users could force high-precision columns
   back to micros. Our ceiling is nanoseconds, so we'd take on Trino's cost
   without Trino's reason. Curious how you see it playing out differently.
   2. *The range benefit doesn't survive egress*. Spark's main egress paths
   are all INT64 epoch-nanos: Parquet NANOS, Arrow (so PySpark and Spark
   Connect), Iceberg V3 timestamp_ns, Avro, Pandas datetime64[ns]. A year-1500
   value can live in Spark memory under composite but can't leave — it either
   throws on write/fetch or gets silently truncated, depending on how the
   boundary is specified. Curious what you have in mind for the egress side.
   3. *Do workloads actually need both?* Nanosecond precision tends to go
   with modern-measurement data (HFT, traces, IoT, logs); wide calendar range
   tends to go with archival data where milli or second precision is enough.
   We haven't found a case where a single column needs both — same assumption
   Parquet, Arrow, Iceberg, and Pandas seem to make. The one case where they
   do intersect is sentinel values — 9999-12-31 for "no end date," 0001-01-01
   for "unknown start" — mixed into columns that otherwise hold
   nanosecond-precise timestamps. Your proposal handles this natively; ours
   asks users to either use NULL or pick a sentinel within range. That's a real
   user-facing ask. Curious whether you've seen other patterns, since
   sentinels alone feel like something that could also be addressed at the
   data-modeling layer.
   4. *Composite is hard to walk back once shipped.* The two directions
   aren't symmetric. Starting with INT64 and upgrading to composite later is
   SQL-layer compatible — user queries and declared schemas don't move, the
   existing Parquet files keep meaning the same thing (Spark just reads INT64
   nanos into composite at the edge), and new writes can carry the wider range
   once Parquet or Arrow grow support. Starting with composite is effectively
   a one-way commitment: the moment users persist year-1500 values into
   tables, Spark owns supporting those values forever, because narrowing the
   type after the fact would be data loss from the user's perspective. So
   starting narrow preserves the option to go wider if the evidence shifts;
   starting wide locks in the cost on day one.
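
The asymmetry in points 2 and 4 can be sketched in a few lines of Python. This
is an illustration only: the function names are hypothetical, and raising on
overflow is just one of the two boundary behaviours mentioned above (the other
being truncation). Reading INT64 nanos into a composite always succeeds;
writing a composite back out as INT64 nanos can fail.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
NANOS_PER_MICRO = 1_000
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_micros_of(dt: datetime) -> int:
    """Microseconds since the Unix epoch for a timezone-aware datetime."""
    return (dt - EPOCH) // timedelta(microseconds=1)


def from_ingress_nanos(epoch_nanos: int) -> Tuple[int, int]:
    """INT64 nanos -> composite. Every INT64 value has a valid composite form."""
    return divmod(epoch_nanos, NANOS_PER_MICRO)


def to_egress_nanos(epoch_micros: int, nanos_of_micro: int) -> int:
    """Composite -> INT64 nanos, raising if the value does not fit."""
    epoch_nanos = epoch_micros * NANOS_PER_MICRO + nanos_of_micro
    if not INT64_MIN <= epoch_nanos <= INT64_MAX:
        raise OverflowError("value is outside the INT64 epoch-nanos range")
    return epoch_nanos


# A 2026 value round-trips through the INT64 boundary without loss.
modern = epoch_micros_of(datetime(2026, 5, 11, tzinfo=timezone.utc))
assert from_ingress_nanos(to_egress_nanos(modern, 999)) == (modern, 999)

# A year-1500 value is representable as a composite but cannot be written out.
archival = epoch_micros_of(datetime(1500, 1, 1, tzinfo=timezone.utc))
try:
    to_egress_nanos(archival, 0)
except OverflowError as exc:
    print("egress failed:", exc)
```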

The other thing that pulled us toward INT64 is that it's the choice most
open-source columnar and lakehouse engines have already made. DuckDB's
TIMESTAMP_NS, ClickHouse's DateTime64(9), and InfluxDB's timestamp storage
all use INT64 epoch-nanos with the 1678–2262 bound. Parquet, Arrow, Iceberg
V3, Avro, and Pandas datetime64[ns] do too. Engines that offer full-range
nanos — Snowflake, Oracle, DB2 — either run on proprietary storage formats
they control end-to-end or are row-based OLTP with different cost
structures. Trino is the one open-source columnar engine that went wider —
it supports TIMESTAMP(p) up to picoseconds (p=12), which simply doesn't fit
in INT64, so composite was necessary. Even so, the performance penalty is
real. For a columnar engine like Spark whose data plane runs through
Parquet and Arrow, matching the open-source columnar consensus seemed like
the less surprising default.
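
For reference, the calendar bound of INT64 epoch-nanos can be derived directly
from the integer limits. The sketch below uses only the Python standard
library; the helper names are hypothetical. The lower edge lands in late 1677,
so 1678 is the first full year covered.

```python
from datetime import datetime, timedelta, timezone

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def nanos_to_utc(epoch_nanos: int) -> datetime:
    """Convert epoch-nanos to a UTC datetime.

    datetime only has microsecond resolution, so the value is floored
    to whole microseconds first.
    """
    return EPOCH + timedelta(microseconds=epoch_nanos // 1_000)


def half_range_years(units_per_second: int) -> float:
    """Approximate years on each side of 1970 that a signed 64-bit count covers."""
    seconds = 2**63 / units_per_second
    return seconds / 86_400 / 365.2425


print(nanos_to_utc(INT64_MIN))       # earliest instant, in September 1677
print(nanos_to_utc(INT64_MAX))       # latest instant, in April 2262
print(half_range_years(10**9))       # nanos: roughly 292 years each way
print(half_range_years(10**6))       # micros: roughly 292,000 years each way
```

The factor-of-1000 difference between the last two lines is the range gap
between a Long of nanos and a Long of micros.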

Given the perf concern especially, we'd prefer INT64 for now. @Unstable
keeps the door open to the composite layout later — if the ecosystem grows
full-range nanos, workloads push us there, or we need sub-nanosecond
precision where INT64 isn't enough.

Would love any thoughts on this; it would be good to align on a single
direction before either proposal moves forward.

Thanks,
Xiaoxuan Li

On Fri, May 8, 2026 at 1:43 AM Wenchen Fan <[email protected]> wrote:

> This new design makes sense to me. So we just add 2 more bytes to store
> nanosOfMicro, and the rest is the same as the current timestamp types: same
> value range, but higher precision.
>
> On Thu, May 7, 2026 at 5:16 PM Max Gekk <[email protected]> wrote:
>
>> Hi Spark devs,
>>
>> I’d like to share a proposal for nano-second-capable timestamp support
>> and ask for your feedback.
>>
>> Here is the SPIP:
>>
>> https://docs.google.com/document/d/1DeW15QueI4PdRyPm6C6jsTZFmIjbXX2j4h-Ja5W_fsg/edit?usp=sharing
>>
>> My proposal uses a logical split representation:
>> - epochMicros: Long
>> - nanosOfMicro: Short in [0, 999]
>>
>> This applies to both NTZ and LTZ nano-capable types; timezone
>> semantics remain unchanged and are handled at interpretation
>> boundaries (as today).
>>
>> Why this approach? I believe this is the most practical path for Spark
>> because it:
>> 0. Conforms to the SQL standard.
>> 1. Preserves Spark’s existing microsecond approach. Most
>> Catalyst/runtime datetime logic already uses micros. The split model
>> extends it rather than replacing it.
>> 2. Avoids INT64 epoch-nanos range cliff as the primary engine model. A
>> single Long epoch-nanos representation constrains calendar range much
>> more aggressively than Long micros.
>> 3. Keeps migration risk lower. Existing microsecond behavior remains
>> default; nano precision is opt-in via parameterized types/syntax.
>> 4. Allows efficient implementation paths. Internals can still choose
>> compact physical encodings (row/vector/file boundaries), while keeping
>> one canonical logical contract.
>>
>> Related SPIPs considered. I reviewed and compared against these two
>> drafts:
>> - SPIP: Support NanoSecond Timestamps:
>>
>> https://docs.google.com/document/d/1wjFsBdlV2YK75x7UOk2HhDOqWVA0yC7iEiqOMnNnxlA/edit?tab=t.0#heading=h.4kibaxwtx2xo
>> - SPIP: Support NanoSecond Timestamp Types:
>>
>> https://docs.google.com/document/d/1Q5u1whAO_KcT6d4dFFaIMy_S3RoQEo4Znwz2U-nbhls/edit?tab=t.0#heading=h.xk16mmomv6il
>>
>> Those drafts are valuable and informed this design. The key difference
>> is that I prioritize micros-first engine continuity with a bounded
>> nano remainder, instead of making epoch-nanos the primary internal
>> semantic unit.
>> In short: I think epochMicros + nanosOfMicro is a better fit for
>> Spark’s current architecture and compatibility constraints, while
>> still delivering practical nanosecond support.
>>
>> Thanks in advance for your feedback.
>>
>> Best regards,
>> Max Gekk
>>
