Re: [DISCUSS] SPIP: Nano-second timestamps: micros + nanos of micro

Max Gekk Thu, 14 May 2026 23:40:09 -0700

I would like to kick off voting on the SPIP today if there are no
objections.


On Wed, May 13, 2026 at 8:13 PM serge rielau.com <[email protected]> wrote:

> Fair enough, I was not aware of the Java limitation and the resulting
> dependency.
>
> On May 13, 2026, at 10:28 AM, Max Gekk <[email protected]> wrote:
>
> Hi Serge,
>
> > If we agree that any performance (and memory) cliff is going composite
> and not whether the extra bytes are 2 or 4 bytes, then would it make sense
> to match Trino? We would:
>
> If we supported picosecond precision, it could cause the following
> issues, IMHO:
> 1. Spark's datetime stack today is “nanos‑native,” not “picos‑native.”
> java.time (Instant, LocalDateTime, ZonedDateTime, Duration, etc.) exposes
> nanoseconds as the finest supported unit in the public model. Supporting
> p > 9 in Spark SQL means either rounding away picos at almost every boundary
> or building custom arithmetic, normalization, parsing, and calendar logic
> for the sub‑nano tail. That is a large, long‑lived surface area, with high
> regression risk anywhere we already struggle: LTZ vs NTZ, session time
> zone, legacy rebasing, Julian/Gregorian, pushdown, codegen, etc. So "same
> cost as going composite for nanos" does not imply "picos are free once we
> went composite."
> 2. Memory is not only “+2 vs +4 bytes” — it is “+delta bytes * row width *
> shuffle fanout.”
> Picos widen rows further than nanos, which increases OOM / GC / shuffle
> spill risk on the same heap and cluster sizes — especially for wide fact
> tables and skewed joins on timestamp keys.
> 3. Interchange and “federation” still do not become automatic.
> Even if Trino is aligned internally, Parquet / Arrow / Pandas / JDBC paths
> overwhelmingly standardize on nanos at best for compact physical encodings.
>
> Best regards,
> Max Gekk
>
> On Wed, May 13, 2026 at 4:04 PM serge rielau.com <[email protected]> wrote:
> >
> > A few questions to ponder:
> >
> > Are we committed to the SQL Standard, even when it may be tactically
> inconvenient?
> > Why did Trino and Db2 go to pico? I can answer for Db2 as I was in the
> room: We wanted to build for the future and rip the band aid and there was
> no extra design or QA cost. What was Trino’s thinking?
> > In my career I have seen DBMS needs go from milli to micro to nano. Nano
> will not be the end of it. While for all intents and purposes “antique”
> nanoseconds are too esoteric to sweat about, sticking with int64 will not
> be an option for pico.
> > Storage is data at rest. It is “easy” to add another format. Engines
> like Spark outlive storage formats, and so do their APIs.
> >
> > If we agree that any performance (and memory) cliff is going composite
> and not whether the extra bytes are 2 or 4 bytes, then would it make sense
> to match Trino? We would:
> >
> > - Have an actual external benefit outside of the corner case of range
> > - Peace of mind for the API for at least a decade, perhaps more (if we go
> Femto .. which is free upgrade at 4 bytes)
> > - Full compatibility with any federated datasource
> > - Standard compliance
> >
> >
> >
> >
> > On May 13, 2026, at 2:40 AM, Wenchen Fan <[email protected]> wrote:
> >
> > Sorry, I misclicked the send button, let me finish.
> >
> > We can throw out of range errors if the actual timestamp value does not
> fit the Parquet INT64, and we can work with the Parquet and other
> data format communities to add support for timestamp nanos with a wider
> year range. Before that, we can write a custom struct in Parquet to save
> this timestamp nano type.
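
The out-of-range check described above could look roughly like the following. This is a minimal sketch, assuming the composite (epochMicros, nanosOfMicro) layout from the SPIP; the function name and error messages are hypothetical and not taken from Spark or Parquet.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
NANOS_PER_MICRO = 1_000


def to_int64_epoch_nanos(epoch_micros: int, nanos_of_micro: int) -> int:
    """Hypothetical egress helper: collapse the composite value into a single
    INT64 epoch-nanos value, or fail loudly instead of truncating."""
    if not 0 <= nanos_of_micro < NANOS_PER_MICRO:
        raise ValueError(f"nanos_of_micro out of [0, 999]: {nanos_of_micro}")
    # Python ints are arbitrary precision, so the product cannot overflow here;
    # the range check below plays the role of the INT64 bound.
    epoch_nanos = epoch_micros * NANOS_PER_MICRO + nanos_of_micro
    if not INT64_MIN <= epoch_nanos <= INT64_MAX:
        raise OverflowError(
            f"timestamp does not fit INT64 epoch-nanos: {epoch_nanos}"
        )
    return epoch_nanos


# A value near the epoch converts; a value far before 1677 is rejected.
print(to_int64_epoch_nanos(1, 500))
try:
    to_int64_epoch_nanos(-15 * 10**15, 0)  # roughly the year 1494
except OverflowError as e:
    print("rejected:", e)
```

Raising rather than clamping keeps the failure visible at the write boundary, which matches the "throw out of range errors" option discussed here.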
> >
> > On Wed, May 13, 2026 at 5:38 PM Wenchen Fan <[email protected]> wrote:
> >>
> >> I think the main question is what the requirements for this new
> timestamp nano type are. Personally, I think it's better to follow the SQL
> standard and support the year range 0001 to 9999. This kills the INT64
> option. For data sources, we can throw an out-of-range error if the actual
> timestamp value does not fit the Parquet INT64
> >>
> >> On Tue, May 12, 2026 at 5:38 PM Max Gekk <[email protected]> wrote:
> >>>
> >>> Hi Xiaoxuan,
> >>>
> >>> Thank you for the detailed clarification of your proposal.
> >>>
> >>> > the key difference is internal representation, our draft uses INT64
> epoch-nanos, yours uses composite (epochMicros, nanosOfMicro).
> >>>
> >>> I think the main difference between our proposals is how we answer the
> >>> question: shall Spark SQL conform to the SQL standard or not? The
> >>> standard says clearly that the year range is from 0001 to 9999. Rough
> >>> count of distinct nanosecond instants on a proleptic-Gregorian line
> >>> from 0001‑01‑01 through 9999‑12‑31:
> >>> - About 3.65*10^6 civil days in that span (order of magnitude is
> enough).
> >>> - Each day has 86400*10^9 = 8.64*10^13 distinct nanosecond offsets
> >>> from midnight.
> >>> So the number of distinct values is about: N ≈ 3.65*10^6 *
> >>> 8.64*10^13 ≈ 3.2*10^20
> >>> Then: log2(N) ≈ 68-69 bits.
> >>> Any mapping from that full set would need at least about 69 bits.
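
The estimate above can be checked with exact integer arithmetic. A minimal sketch, assuming the proleptic Gregorian calendar implemented by Python's standard `datetime.date` (whose supported range is years 0001 to 9999); the variable names are illustrative only.

```python
from datetime import date

# Civil days from 0001-01-01 through 9999-12-31, inclusive.
days = date(9999, 12, 31).toordinal() - date(1, 1, 1).toordinal() + 1

NANOS_PER_DAY = 86_400 * 10**9
MICROS_PER_DAY = 86_400 * 10**6

distinct_nanos = days * NANOS_PER_DAY
distinct_micros = days * MICROS_PER_DAY

# Bits needed to number n distinct values as 0 .. n-1.
bits_for_nanos = (distinct_nanos - 1).bit_length()
bits_for_micros = (distinct_micros - 1).bit_length()

print(days, bits_for_nanos, bits_for_micros)
```

Under these assumptions the nanosecond count needs more than 64 bits, while the microsecond count fits comfortably in a signed 64-bit long.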
> >>>
> >>> > Four concerns, and I'd value your read on whether they're solvable:
> >>> > Composite doesn't fit UnsafeRow's 8-byte slot, so every
> sort/hash/join/shuffle pays the variable-length cost: extra memory access,
> worse cache locality, ~2–3x memory per value.
> >>>
> >>> You are right for UnsafeRows, but built-in datasources like Parquet and
> >>> ORC might return Column Vectors where values are stored as arrays of
> >>> long and short, and such values could be processed in a vectorized way.
> >>> I believe the new data type will perform worse, but not significantly
> >>> so.
> >>>
> >>> > The range benefit doesn't survive egress. Spark's main egress paths
> are all INT64 epoch-nanos: Parquet NANOS, Arrow (so PySpark and Spark
> Connect), Iceberg V3 timestamp_ns, Avro, Pandas datetime64[ns].
> >>>
> >>> Below are sources from which timestamps with nanosecond precision
> >>> outside the range 1677-2262 could come:
> >>> 1. Parquet: Spark's TIMESTAMP_LTZ is still saved/loaded from INT96 by
> >>> default which has nanoseconds precision.
> >>> 2. Another built-in datasource ORC stores timestamps with nanosecond
> >>> precision, see https://orc.apache.org/specification/ORCv2/
> >>> 3. Spark SQL can have access to some external DBMSs that support
> >>> nanoseconds precision, for instance Oracle, MS SQL Server, Snowflake,
> >>> Trino, Teradata.
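
For reference, the 1677-2262 window follows from the range of a signed 64-bit count of nanoseconds since 1970-01-01. A minimal sketch in Python, assuming a UTC epoch and flooring to microseconds because the standard `datetime` type has no nanosecond field; the names are illustrative.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1

# Floor division keeps both bounds inside the representable nanos range
# after dropping the sub-microsecond digits.
earliest = EPOCH + timedelta(microseconds=INT64_MIN // 1_000 + 1)
latest = EPOCH + timedelta(microseconds=INT64_MAX // 1_000)

print(earliest.date(), latest.date())
```

Any source that can hold nanosecond timestamps before `earliest` or after `latest` produces values that INT64 epoch-nanos cannot carry.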
> >>>
> >>> > Nanosecond precision tends to go with modern-measurement data (HFT,
> traces, IoT, logs); wide calendar range tends to go with archival data
> where milli or second precision is enough.
> >>>
> >>> I would imagine that Spark users might need timestamps with nanos
> >>> outside the range 1677-2262:
> >>> - Simulating some physical processes in the future or in the past.
> >>> - Migration from other systems.
> >>>
> >>> > Composite is hard to walk back once shipped. The two directions
> aren't symmetric. Starting with INT64 and upgrading to composite later is
> SQL-layer compatible
> >>>
> >>> INT64 epoch-nanos is also a one-way semantic bet in the other
> >>> direction: once users store physics-time workloads in that encoding,
> >>> widening later without reinterpretation is not free either.
> >>>
> >>> > The other thing that pulled us toward INT64 is that it's the choice
> most open-source columnar and lakehouse engines have already made. DuckDB's
> TIMESTAMP_NS, ClickHouse's DateTime64(9), and InfluxDB's timestamp storage
> all use INT64 epoch-nanos with the 1678–2262 bound.
> >>>
> >>> Matching open columnar consensus for wire formats is a strong default
> >>> for interchange, I agree. I would separate that from the question of
> >>> Spark’s in-memory representation.
> >>>
> >>> > Given the perf concern especially, we'd prefer INT64 for now.
> @Unstable keeps the door open to the composite layout later
> >>>
> >>> How about measuring the performance of an MVP on end-to-end benchmarks?
> >>> We could address perf concerns later.
> >>>
> >>> Yours faithfully,
> >>> Max Gekk
> >>>
> >>>
> >>> On Tue, May 12, 2026 at 1:52 AM Xiaoxuan Li <[email protected]>
> wrote:
> >>> >
> >>> > Hi Max,
> >>> > Thanks for the writeup. I've been working on a related proposal in
> parallel — SPIP: Support NanoSecond Timestamp Types. The user-visible
> surface overlaps a lot (SQL syntax, new catalyst types, Parquet NANOS
> interop); the key difference is internal representation, our draft uses
> INT64 epoch-nanos, yours uses composite (epochMicros, nanosOfMicro).
> >>> >
> >>> > If we decide to go with composite, I agree your layout is the right
> one: it reuses micros-based DateTimeUtils, aligns the calendar range with
> TimestampType, and keeps the extra precision as a small bounded correction.
> >>> >
> >>> > We started with INT64 because we're worried about paying composite's
> cost without getting the real benefit. Four concerns, and I'd value your
> read on whether they're solvable:
> >>> >
> >>> > Hot-path performance. Composite doesn't fit UnsafeRow's 8-byte slot,
> so every sort/hash/join/shuffle pays the variable-length cost: extra memory
> access, worse cache locality, ~2–3x memory per value. Trino is the closest
> precedent — they went composite for TIMESTAMP(p>6) because their ceiling is
> picoseconds, and even so the perf gap between short and long
> representations was significant enough that they added a
> hive.timestamp-precision toggle so users could force high-precision columns
> back to micros. Our ceiling is nanoseconds, so we'd take on Trino's cost
> without Trino's reason. Curious how you see it playing out differently.
> >>> > The range benefit doesn't survive egress. Spark's main egress paths
> are all INT64 epoch-nanos: Parquet NANOS, Arrow (so PySpark and Spark
> Connect), Iceberg V3 timestamp_ns, Avro, Pandas datetime64[ns]. A year-1500
> value can live in Spark memory under composite but can't leave — it either
> throws on write/fetch or gets silently truncated, depending on how the
> boundary is specified. Curious what you have in mind for the egress side.
> >>> > Do workloads actually need both? Nanosecond precision tends to go
> with modern-measurement data (HFT, traces, IoT, logs); wide calendar range
> tends to go with archival data where milli or second precision is enough.
> We haven't found a case where a single column needs both — same assumption
> Parquet, Arrow, Iceberg, and Pandas seem to make. The one case where they
> do intersect is sentinel values — 9999-12-31 for "no end date," 0001-01-01
> for "unknown start" — mixed into columns that otherwise hold
> nanosecond-precise timestamps. Your proposal handles this natively; ours
> asks users to either use NULL or pick a sentinel within range. That's a real
> user-facing ask. Curious whether you've seen other patterns, since
> sentinels alone feel like something that could also be addressed at the
> data-modeling layer.
> >>> > Composite is hard to walk back once shipped. The two directions
> aren't symmetric. Starting with INT64 and upgrading to composite later is
> SQL-layer compatible — user queries and declared schemas don't move, the
> existing Parquet files keep meaning the same thing (Spark just reads INT64
> nanos into composite at the edge), and new writes can carry the wider range
> once Parquet or Arrow grow support. Starting with composite is effectively
> a one-way commitment: the moment users persist year-1500 values into
> tables, Spark owns supporting those values forever, because narrowing the
> type after the fact would be data loss from the user's perspective. So
> starting narrow preserves the option to go wider if the evidence shifts;
> starting wide locks in the cost on day one.
> >>> >
> >>> > The other thing that pulled us toward INT64 is that it's the choice
> most open-source columnar and lakehouse engines have already made. DuckDB's
> TIMESTAMP_NS, ClickHouse's DateTime64(9), and InfluxDB's timestamp storage
> all use INT64 epoch-nanos with the 1678–2262 bound. Parquet, Arrow, Iceberg
> V3, Avro, and Pandas datetime64[ns] do too. Engines that offer full-range
> nanos — Snowflake, Oracle, DB2 — either run on proprietary storage formats
> they control end-to-end or are row-based OLTP with different cost
> structures. Trino is the one open-source columnar engine that went wider —
> it supports TIMESTAMP(p) up to picoseconds (p=12), which simply doesn't fit
> in INT64, so composite was necessary. Even so, the performance penalty is
> real. For a columnar engine like Spark whose data plane runs through
> Parquet and Arrow, matching the open-source columnar consensus seemed like
> the less surprising default.
> >>> >
> >>> > Given the perf concern especially, we'd prefer INT64 for now.
> @Unstable keeps the door open to the composite layout later — if the
> ecosystem grows full-range nanos, workloads push us there, or we need
> sub-nanosecond precision where INT64 isn't enough.
> >>> >
> >>> > Would love any thoughts on this; it would be good to align on a single
> direction before either moves forward.
> >>> >
> >>> > Thanks,
> >>> > Xiaoxuan Li
> >>> >
> >>> > On Fri, May 8, 2026 at 1:43 AM Wenchen Fan <[email protected]>
> wrote:
> >>> >>
> >>> >> This new design makes sense to me. So we just add 2 more bytes to
> store nanosOfMicro, and the rest is the same as the current timestamp
> types: same value range, but higher precision.
> >>> >>
> >>> >> On Thu, May 7, 2026 at 5:16 PM Max Gekk <[email protected]> wrote:
> >>> >>>
> >>> >>> Hi Spark devs,
> >>> >>>
> >>> >>> I’d like to share a proposal for nano-second-capable timestamp
> support
> >>> >>> and ask for your feedback.
> >>> >>>
> >>> >>> Here is the SPIP:
> >>> >>>
> https://docs.google.com/document/d/1DeW15QueI4PdRyPm6C6jsTZFmIjbXX2j4h-Ja5W_fsg/edit?usp=sharing
> >>> >>>
> >>> >>> My proposal uses a logical split representation:
> >>> >>> - epochMicros: Long
> >>> >>> - nanosOfMicro: Short in [0, 999]
> >>> >>>
> >>> >>> This applies to both NTZ and LTZ nano-capable types; timezone
> >>> >>> semantics remain unchanged and are handled at interpretation
> >>> >>> boundaries (as today).
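
The split representation above can be illustrated with a short sketch. It assumes floor semantics for values before 1970 so that nanosOfMicro always lands in [0, 999]; the function names are hypothetical and do not come from the SPIP or from Spark.

```python
from typing import Tuple

NANOS_PER_MICRO = 1_000


def split_epoch_nanos(epoch_nanos: int) -> Tuple[int, int]:
    """Split epoch-nanos into (epoch_micros, nanos_of_micro)."""
    # divmod floors toward negative infinity, so the remainder stays in
    # [0, 999] for pre-1970 (negative) values as well.
    return divmod(epoch_nanos, NANOS_PER_MICRO)


def join_epoch_nanos(epoch_micros: int, nanos_of_micro: int) -> int:
    """Inverse of split_epoch_nanos; rejects an unnormalized remainder."""
    if not 0 <= nanos_of_micro < NANOS_PER_MICRO:
        raise ValueError(f"nanos_of_micro out of [0, 999]: {nanos_of_micro}")
    return epoch_micros * NANOS_PER_MICRO + nanos_of_micro


print(split_epoch_nanos(1_500))  # 1 microsecond plus 500 nanoseconds
print(split_epoch_nanos(-1))     # one nanosecond before the epoch
```

With a normalized remainder, comparing the pairs lexicographically gives the same order as comparing the underlying nanosecond values, so existing micros-based ordering only needs a tie-break on the second field.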
> >>> >>>
> >>> >>> Why this approach? I believe this is the most practical path for
> Spark
> >>> >>> because it:
> >>> >>> 0. Conforms to the SQL standard.
> >>> >>> 1. Preserves Spark’s existing microsecond approach. Most
> >>> >>> Catalyst/runtime datetime logic already uses micros. The split
> model
> >>> >>> extends it rather than replacing it.
> >>> >>> 2. Avoids INT64 epoch-nanos range cliff as the primary engine
> model. A
> >>> >>> single Long epoch-nanos representation constrains calendar range
> much
> >>> >>> more aggressively than Long micros.
> >>> >>> 3. Keeps migration risk lower. Existing microsecond behavior
> remains
> >>> >>> default; nano precision is opt-in via parameterized types/syntax.
> >>> >>> 4. Allows efficient implementation paths. Internals can still
> choose
> >>> >>> compact physical encodings (row/vector/file boundaries), while
> keeping
> >>> >>> one canonical logical contract.
> >>> >>>
> >>> >>> Related SPIPs considered. I reviewed and compared against these
> two drafts:
> >>> >>> - SPIP: Support NanoSecond Timestamps:
> >>> >>>
> https://docs.google.com/document/d/1wjFsBdlV2YK75x7UOk2HhDOqWVA0yC7iEiqOMnNnxlA/edit?tab=t.0#heading=h.4kibaxwtx2xo
> >>> >>> - SPIP: Support NanoSecond Timestamp Types:
> >>> >>>
> https://docs.google.com/document/d/1Q5u1whAO_KcT6d4dFFaIMy_S3RoQEo4Znwz2U-nbhls/edit?tab=t.0#heading=h.xk16mmomv6il
> >>> >>>
> >>> >>> Those drafts are valuable and informed this design. The key
> difference
> >>> >>> is that I prioritize micros-first engine continuity with a bounded
> >>> >>> nano remainder, instead of making epoch-nanos the primary internal
> >>> >>> semantic unit.
> >>> >>> In short: I think epochMicros + nanosOfMicro is a better fit for
> >>> >>> Spark’s current architecture and compatibility constraints, while
> >>> >>> still delivering practical nanosecond support.
> >>> >>>
> >>> >>> Thanks in advance for your feedback.
> >>> >>>
> >>> >>> Best regards,
> >>> >>> Max Gekk
> >>> >>>
> >>> >>>
> ---------------------------------------------------------------------
> >>> >>> To unsubscribe e-mail: [email protected]
> >>> >>>
> >>>
> >>>
> >
>
>
>
