Hi Xiaoxuan,

Thank you for the detailed clarification of your proposal.

> the key difference is internal representation, our draft uses INT64 
> epoch-nanos, yours uses composite (epochMicros, nanosOfMicro).

I think the main difference between our proposals is how we answer the
question: should Spark SQL conform to the SQL standard or not? The
standard says clearly that the year range is from 0001 to 9999. Here is
a rough count of distinct nanosecond instants on a proleptic Gregorian
timeline from 0001-01-01 through 9999-12-31:
- About 3.65*10^6 civil days in that span (order of magnitude is enough).
- Each day has 86400*10^9 = 8.64*10^13 distinct nanosecond offsets
from midnight.
So the number of distinct values is about: N ≈ 3.65*10^6 *
8.64*10^13 ≈ 3.2*10^20
Then: log2(N) ≈ 68.1.
Any mapping from that full set would need at least 69 bits, which is
more than the 64 bits of an INT64.
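
The estimate above can be double-checked with a short Python sketch. It
assumes Python's date type, which uses the proleptic Gregorian calendar,
and counts both end dates inclusively:

```python
from datetime import date

# Civil days from 0001-01-01 through 9999-12-31, both ends included.
days = date(9999, 12, 31).toordinal() - date(1, 1, 1).toordinal() + 1
nanos_per_day = 86_400 * 10**9

# Number of distinct nanosecond instants in the SQL-standard year range.
n_values = days * nanos_per_day

# Bits needed to give every instant its own code: ceil(log2(n_values)).
bits_needed = (n_values - 1).bit_length()

print(days, n_values, bits_needed)
```

The exact day count only changes the result in the low digits; the
number of bits stays above 64 either way.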

> Four concerns, and I'd value your read on whether they're solvable:
> Composite doesn't fit UnsafeRow's 8-byte slot, so every 
> sort/hash/join/shuffle pays the variable-length cost: extra memory access, 
> worse cache locality, ~2–3x memory per value.

You are right about UnsafeRow, but built-in datasources like Parquet and
ORC can return column vectors where the values are stored as arrays of
long and short, and such values can be processed in a vectorized way. I
believe the new data type will perform worse, but not significantly
worse.
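
To illustrate what I mean by arrays of long and short, here is a minimal
Python sketch. The two-array layout and the variable names are
assumptions for illustration only, not Spark's actual column vector
classes:

```python
from array import array

# Hypothetical columnar layout: two parallel fixed-width arrays.
epoch_micros = array("q", [0, 0, -1, 1_700_000_000_000_000])  # signed 64-bit
nanos_of_micro = array("h", [500, 1, 999, 0])                 # signed 16-bit, in [0, 999]

# Fixed per-value footprint of the two arrays together.
bytes_per_value = epoch_micros.itemsize + nanos_of_micro.itemsize

# Ordering is lexicographic on (epochMicros, nanosOfMicro): the second
# array is consulted only when the micros are equal.
order = sorted(
    range(len(epoch_micros)),
    key=lambda i: (epoch_micros[i], nanos_of_micro[i]),
)

print(bytes_per_value, order)
```

Both arrays stay fixed-width and contiguous, so the variable-length
cost applies to the row format and not necessarily to the columnar
path.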

> The range benefit doesn't survive egress. Spark's main egress paths are all 
> INT64 epoch-nanos: Parquet NANOS, Arrow (so PySpark and Spark Connect), 
> Iceberg V3 timestamp_ns, Avro, Pandas datetime64[ns].

Below are the sources from which timestamps with nanosecond precision
outside of the range 1677-2262 could come:
1. Parquet: Spark's TIMESTAMP_LTZ is still saved to/loaded from INT96 by
default, and INT96 has nanosecond precision.
2. Another built-in datasource, ORC, stores timestamps with nanosecond
precision, see https://orc.apache.org/specification/ORCv2/
3. Spark SQL can access external DBMSs that support nanosecond
precision, for instance Oracle, MS SQL Server, Snowflake, Trino,
Teradata.
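
As an illustration of the first item, here is a Python sketch of how an
INT96 value maps onto (epochMicros, nanosOfMicro). It relies on the
usual INT96 layout (8 bytes of nanoseconds within the day followed by a
4-byte Julian day number, little-endian). The function name is
hypothetical, and the sketch ignores the Julian/Gregorian calendar
rebasing that matters for old dates:

```python
import struct
from datetime import date
from typing import Tuple

JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588  # Julian day number of 1970-01-01
MICROS_PER_DAY = 86_400 * 10**6

def int96_to_split(raw: bytes) -> Tuple[int, int]:
    """Decode a 12-byte INT96 timestamp into (epochMicros, nanosOfMicro)."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    micros_of_day, nanos_of_micro = divmod(nanos_of_day, 1000)
    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return days_since_epoch * MICROS_PER_DAY + micros_of_day, nanos_of_micro

# 1500-01-01 00:00:00.000000001 on the proleptic Gregorian calendar.
# A proleptic Gregorian ordinal plus 1,721,425 gives the Julian day number.
raw = struct.pack("<qi", 1, date(1500, 1, 1).toordinal() + 1_721_425)
epoch_micros, nanos_of_micro = int96_to_split(raw)

# The same instant expressed as epoch-nanos does not fit in a signed INT64.
fits_int64_nanos = -2**63 <= epoch_micros * 1000 + nanos_of_micro < 2**63
print(epoch_micros, nanos_of_micro, fits_int64_nanos)
```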

> Nanosecond precision tends to go with modern-measurement data (HFT, traces, 
> IoT, logs); wide calendar range tends to go with archival data where milli or 
> second precision is enough.

I would imagine that Spark users might need nanosecond timestamps
outside of the range 1677-2262 for:
- Simulating some physical processes in the future or in the past.
- Migration from other systems.

> Composite is hard to walk back once shipped. The two directions aren't 
> symmetric. Starting with INT64 and upgrading to composite later is SQL-layer 
> compatible

INT64 epoch-nanos is also a one-way semantic bet in the other
direction: once users store physics-time workloads in that encoding,
widening later without reinterpretation is not free either.

> The other thing that pulled us toward INT64 is that it's the choice most 
> open-source columnar and lakehouse engines have already made. DuckDB's 
> TIMESTAMP_NS, ClickHouse's DateTime64(9), and InfluxDB's timestamp storage 
> all use INT64 epoch-nanos with the 1678–2262 bound.

Matching open columnar consensus for wire formats is a strong default
for interchange, I agree. I would separate that from the question of
Spark’s in-memory representation.
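
To make the separation concrete, here is a minimal Python sketch of a
conversion at the interchange boundary. The function names and the
choice to raise on overflow are assumptions for illustration; the
actual boundary behavior would be up to the SPIP:

```python
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

def epoch_nanos_to_split(epoch_nanos: int):
    """INT64 epoch-nanos -> (epochMicros, nanosOfMicro); always succeeds."""
    # Floor division keeps nanosOfMicro in [0, 999] for negative inputs too.
    return divmod(epoch_nanos, 1000)

def split_to_epoch_nanos(epoch_micros: int, nanos_of_micro: int) -> int:
    """(epochMicros, nanosOfMicro) -> INT64 epoch-nanos, or raise if out of range."""
    if not 0 <= nanos_of_micro <= 999:
        raise ValueError("nanos_of_micro must be in [0, 999]")
    nanos = epoch_micros * 1000 + nanos_of_micro
    if not INT64_MIN <= nanos <= INT64_MAX:
        raise OverflowError("value is outside the INT64 epoch-nanos range")
    return nanos

# Every INT64 epoch-nanos value has a split form and round-trips exactly.
lowest = epoch_nanos_to_split(INT64_MIN)
assert split_to_epoch_nanos(*lowest) == INT64_MIN
print(lowest)
```

Reading from an INT64 epoch-nanos format is lossless in this sketch;
only the write direction needs a range check.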

> Given the perf concern especially, we'd prefer INT64 for now. @Unstable keeps 
> the door open to the composite layout later

How about measuring the performance of the MVP on end-to-end benchmarks?
We could address perf concerns later.

Yours faithfully,
Max Gekk


On Tue, May 12, 2026 at 1:52 AM Xiaoxuan Li <[email protected]> wrote:
>
> Hi Max,
> Thanks for the writeup. I've been working on a related proposal in parallel — 
> SPIP: Support NanoSecond Timestamp Types. The user-visible surface overlaps a 
> lot (SQL syntax, new catalyst types, Parquet NANOS interop); the key 
> difference is internal representation, our draft uses INT64 epoch-nanos, 
> yours uses composite (epochMicros, nanosOfMicro).
>
> If we decide to go with composite, I agree your layout is the right one, 
> reuses micros-based DateTimeUtils, aligns the calendar range with 
> TimestampType, keeps the extra precision as a small bounded correction.
>
> We started with INT64 because we're worried about paying composite's cost 
> without getting the real benefit. Four concerns, and I'd value your read on 
> whether they're solvable:
>
> Hot-path performance. Composite doesn't fit UnsafeRow's 8-byte slot, so every 
> sort/hash/join/shuffle pays the variable-length cost: extra memory access, 
> worse cache locality, ~2–3x memory per value. Trino is the closest precedent 
> — they went composite for TIMESTAMP(p>6) because their ceiling is 
> picoseconds, and even so the perf gap between short and long representations 
> was significant enough that they added a hive.timestamp-precision toggle so 
> users could force high-precision columns back to micros. Our ceiling is 
> nanoseconds, so we'd take on Trino's cost without Trino's reason. Curious how 
> you see it playing out differently.
>
> The range benefit doesn't survive egress. Spark's main egress paths are all 
> INT64 epoch-nanos: Parquet NANOS, Arrow (so PySpark and Spark Connect), 
> Iceberg V3 timestamp_ns, Avro, Pandas datetime64[ns]. A year-1500 value can 
> live in Spark memory under composite but can't leave — it either throws on 
> write/fetch or gets silently truncated, depending on how the boundary is 
> specified. Curious what you have in mind for the egress side.
>
> Do workloads actually need both? Nanosecond precision tends to go with 
> modern-measurement data (HFT, traces, IoT, logs); wide calendar range tends 
> to go with archival data where milli or second precision is enough. We 
> haven't found a case where a single column needs both — same assumption 
> Parquet, Arrow, Iceberg, and Pandas seem to make. The one case where they do 
> intersect is sentinel values — 9999-12-31 for "no end date," 0001-01-01 for 
> "unknown start" — mixed into columns that otherwise hold nanosecond-precise 
> timestamps. Your proposal handles this natively; ours asks users to either 
> use NULL or pick a sentinel within range. That's a real user-facing ask. 
> Curious whether you've seen other patterns, since sentinels alone feel like 
> something that could also be addressed at the data-modeling layer.
>
> Composite is hard to walk back once shipped. The two directions aren't 
> symmetric. Starting with INT64 and upgrading to composite later is SQL-layer 
> compatible — user queries and declared schemas don't move, the existing 
> Parquet files keep meaning the same thing (Spark just reads INT64 nanos into 
> composite at the edge), and new writes can carry the wider range once Parquet 
> or Arrow grow support. Starting with composite is effectively a one-way 
> commitment: the moment users persist year-1500 values into tables, Spark owns 
> supporting those values forever, because narrowing the type after the fact 
> would be data loss from the user's perspective. So starting narrow preserves 
> the option to go wider if the evidence shifts; starting wide locks in the 
> cost on day one.
>
> The other thing that pulled us toward INT64 is that it's the choice most 
> open-source columnar and lakehouse engines have already made. DuckDB's 
> TIMESTAMP_NS, ClickHouse's DateTime64(9), and InfluxDB's timestamp storage 
> all use INT64 epoch-nanos with the 1678–2262 bound. Parquet, Arrow, Iceberg 
> V3, Avro, and Pandas datetime64[ns] do too. Engines that offer full-range 
> nanos — Snowflake, Oracle, DB2 — either run on proprietary storage formats 
> they control end-to-end or are row-based OLTP with different cost structures. 
> Trino is the one open-source columnar engine that went wider — it supports 
> TIMESTAMP(p) up to picoseconds (p=12), which simply doesn't fit in INT64, so 
> composite was necessary. Even so, the performance penalty is real. For a 
> columnar engine like Spark whose data plane runs through Parquet and Arrow, 
> matching the open-source columnar consensus seemed like the less surprising 
> default.
>
> Given the perf concern especially, we'd prefer INT64 for now. @Unstable keeps 
> the door open to the composite layout later — if the ecosystem grows 
> full-range nanos, workloads push us there, or we need sub-nanosecond 
> precision where INT64 isn't enough.
>
> Would love any thoughts on this; it would be good to align in a single 
> direction before either moves forward.
>
> Thanks,
> Xiaoxuan Li
>
> On Fri, May 8, 2026 at 1:43 AM Wenchen Fan <[email protected]> wrote:
>>
>> This new design makes sense to me. So we just add 2 more bytes to store 
>> nanosOfMicro, and the rest is the same as the current timestamp types: same 
>> value range, but higher precision.
>>
>> On Thu, May 7, 2026 at 5:16 PM Max Gekk <[email protected]> wrote:
>>>
>>> Hi Spark devs,
>>>
>>> I’d like to share a proposal for nano-second-capable timestamp support
>>> and ask for your feedback.
>>>
>>> Here is the SPIP:
>>> https://docs.google.com/document/d/1DeW15QueI4PdRyPm6C6jsTZFmIjbXX2j4h-Ja5W_fsg/edit?usp=sharing
>>>
>>> My proposal uses a logical split representation:
>>> - epochMicros: Long
>>> - nanosOfMicro: Short in [0, 999]
>>>
>>> This applies to both NTZ and LTZ nano-capable types; timezone
>>> semantics remain unchanged and are handled at interpretation
>>> boundaries (as today).
>>>
>>> Why this approach? I believe this is the most practical path for Spark
>>> because it:
>>> 0. Conforms to the SQL standard.
>>> 1. Preserves Spark’s existing microsecond approach. Most
>>> Catalyst/runtime datetime logic already uses micros. The split model
>>> extends it rather than replacing it.
>>> 2. Avoids INT64 epoch-nanos range cliff as the primary engine model. A
>>> single Long epoch-nanos representation constrains calendar range much
>>> more aggressively than Long micros.
>>> 3. Keeps migration risk lower. Existing microsecond behavior remains
>>> default; nano precision is opt-in via parameterized types/syntax.
>>> 4. Allows efficient implementation paths. Internals can still choose
>>> compact physical encodings (row/vector/file boundaries), while keeping
>>> one canonical logical contract.
>>>
>>> Related SPIPs considered. I reviewed and compared against these two drafts:
>>> - SPIP: Support NanoSecond Timestamps:
>>> https://docs.google.com/document/d/1wjFsBdlV2YK75x7UOk2HhDOqWVA0yC7iEiqOMnNnxlA/edit?tab=t.0#heading=h.4kibaxwtx2xo
>>> - SPIP: Support NanoSecond Timestamp Types:
>>> https://docs.google.com/document/d/1Q5u1whAO_KcT6d4dFFaIMy_S3RoQEo4Znwz2U-nbhls/edit?tab=t.0#heading=h.xk16mmomv6il
>>>
>>> Those drafts are valuable and informed this design. The key difference
>>> is that I prioritize micros-first engine continuity with a bounded
>>> nano remainder, instead of making epoch-nanos the primary internal
>>> semantic unit.
>>> In short: I think epochMicros + nanosOfMicro is a better fit for
>>> Spark’s current architecture and compatibility constraints, while
>>> still delivering practical nanosecond support.
>>>
>>> Thanks in advance for your feedback.
>>>
>>> Best regards,
>>> Max Gekk
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: [email protected]
>>>
