I would like to kick off voting for the SPIP today if there are no objections.
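
For anyone who wants to check the range arithmetic quoted further down in this thread, here is a minimal Python sketch. It is not part of either SPIP, and the function names are hypothetical. It counts the nanosecond instants between 0001-01-01 and 9999-12-31, derives the bit width needed to label them, and shows the calendar window a signed INT64 epoch-nanos value can cover (rounded to whole microseconds, because Python's datetime stops there).

```python
from datetime import date, datetime, timedelta

NANOS_PER_DAY = 86_400 * 10**9  # 8.64 * 10^13 nanoseconds per civil day


def nanos_in_sql_range() -> int:
    """Distinct nanosecond instants from 0001-01-01 through 9999-12-31
    on the proleptic Gregorian calendar (both end days included)."""
    days = date(9999, 12, 31).toordinal() - date(1, 1, 1).toordinal() + 1
    return days * NANOS_PER_DAY


def bits_needed(n: int) -> int:
    """Smallest bit width that can give n distinct values distinct codes."""
    return (n - 1).bit_length()


def int64_epoch_nanos_window() -> tuple:
    """Earliest and latest instants a signed 64-bit epoch-nanos value can
    hold, expressed as naive UTC datetimes at microsecond resolution."""
    epoch = datetime(1970, 1, 1)
    lo = epoch + timedelta(microseconds=(-(2**63)) // 1000)
    hi = epoch + timedelta(microseconds=(2**63 - 1) // 1000)
    return lo, hi


if __name__ == "__main__":
    n = nanos_in_sql_range()
    print("distinct nanosecond instants:", n)
    print("bits needed:", bits_needed(n))
    lo, hi = int64_epoch_nanos_window()
    print("INT64 epoch-nanos window:", lo.isoformat(), "to", hi.isoformat())
```

The full 0001 to 9999 range needs 69 bits, which is why it cannot fit in a single 64-bit slot, while INT64 epoch-nanos covers roughly 1677-09-21 through 2262-04-11.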

On Wed, May 13, 2026 at 8:13 PM serge rielau.com <[email protected]> wrote:
> Fair enough, I was not aware of the Java limitation and resulting dependency.
>
> On May 13, 2026, at 10:28 AM, Max Gekk <[email protected]> wrote:
>
> Hi Serge,
>
> > If we agree that any performance (and memory) cliff is going composite and not whether the extra bytes are 2 or 4 bytes, then would it make sense to match Trino? We would:
>
> If we supported picosecond precision, this could cause the following issues, IMHO:
> 1. Spark's datetime stack today is “nanos‑native,” not “picos‑native.” java.time (Instant, LocalDateTime, ZonedDateTime, Duration, etc.) exposes nanoseconds as the finest supported unit in the public model. Supporting p > 9 in Spark SQL means either rounding away picos at almost every boundary or building custom arithmetic, normalization, parsing, and calendar logic for the sub‑nano tail. That is a large, long‑lived surface area, with high regression risk anywhere we already struggle: LTZ vs NTZ, session time zone, legacy rebasing, Julian/Gregorian, pushdown, codegen, etc. So "same cost as going composite for nanos" does not imply "picos are free once we went composite."
> 2. Memory is not only “+2 vs +4 bytes” — it is “+delta bytes * row width * shuffle fanout.” Picos widen rows further than nanos, which increases OOM / GC / shuffle spill risk on the same heap and cluster sizes — especially for wide fact tables and skewed joins on timestamp keys.
> 3. Interchange and “federation” still do not become automatic. Even if Trino is aligned internally, Parquet / Arrow / Pandas / JDBC paths overwhelmingly standardize on nanos at best for compact physical encodings.
>
> Best regards,
> Max Gekk
>
> On Wed, May 13, 2026 at 4:04 PM serge rielau.com <[email protected]> wrote:
> >
> > A few questions to ponder:
> >
> > Are we committed to the SQL Standard, even when it may be tactically inconvenient?
> >
> > Why did Trino and Db2 go to pico? I can answer for Db2 as I was in the room: We wanted to build for the future and rip off the band-aid, and there was no extra design or QA cost. What was Trino’s thinking?
> >
> > In my career I have seen DBMS needs go from milli to micro to nano. Nano will not be the end of it. While for all intents and purposes “antique” nanoseconds are too esoteric to sweat about, sticking with int64 will not be an option for pico.
> >
> > Storage is data at rest. It is “easy” to add another format. Engines like Spark outlive storage formats, and so do their APIs.
> >
> > If we agree that any performance (and memory) cliff is going composite and not whether the extra bytes are 2 or 4 bytes, then would it make sense to match Trino? We would:
> >
> > Have an actual external benefit outside of the corner case of range
> > Peace of mind for the API for at least a decade, perhaps more (if we go femto, which is a free upgrade at 4 bytes)
> > Full compatibility with any federated datasource
> > Standard compliance
> >
> > On May 13, 2026, at 2:40 AM, Wenchen Fan <[email protected]> wrote:
> >
> > Sorry, I misclicked the send button, let me finish.
> >
> > We can throw out-of-range errors if the actual timestamp value does not fit the Parquet INT64, and we can work with the Parquet and other data format communities to add support for timestamp nanos with a wider year range. Before that, we can write a custom struct in Parquet to save this timestamp nano type.
> >
> > On Wed, May 13, 2026 at 5:38 PM Wenchen Fan <[email protected]> wrote:
> >>
> >> I think the main question is what the requirements are for this new timestamp nano type. Personally I think it's better to follow the SQL standard and support year range 0000 to 9999. This kills the INT64 option.
> >> For data sources, we can throw an out-of-range error if the actual timestamp value does not fit the Parquet INT64
> >>
> >> On Tue, May 12, 2026 at 5:38 PM Max Gekk <[email protected]> wrote:
> >>>
> >>> Hi Xiaoxuan,
> >>>
> >>> Thank you for the detailed clarification of your proposal.
> >>>
> >>> > the key difference is internal representation, our draft uses INT64 epoch-nanos, yours uses composite (epochMicros, nanosOfMicro).
> >>>
> >>> I think the main difference between our proposals is how we answer the question: shall Spark SQL conform to the SQL standard or not? The standard says clearly that the year range is from 0001 to 9999. Rough count of distinct nanosecond instants on a proleptic-Gregorian line from 0001‑01‑01 through 9999‑12‑31:
> >>> - About 3.65*10^6 civil days in that span (order of magnitude is enough).
> >>> - Each day has 86400*10^9 = 8.64*10^13 distinct nanosecond offsets from midnight.
> >>> So the number of distinct values is about: N ≈ 3.65*10^6 * 8.64*10^13 ≈ 3.2*10^20
> >>> Then: log2(N) ≈ 68.1.
> >>> Any mapping from that full set would need at least 69 bits.
> >>>
> >>> > Four concerns, and I'd value your read on whether they're solvable:
> >>> > Composite doesn't fit UnsafeRow's 8-byte slot, so every sort/hash/join/shuffle pays the variable-length cost: extra memory access, worse cache locality, ~2–3x memory per value.
> >>>
> >>> You are right for UnsafeRows, but built-in datasources like Parquet and ORC might return Column Vectors where values are stored as arrays of long, short. And such values could be processed in vectorized ways. I believe the new data type will have worse performance, but not significantly so.
> >>>
> >>> > The range benefit doesn't survive egress. Spark's main egress paths are all INT64 epoch-nanos: Parquet NANOS, Arrow (so PySpark and Spark Connect), Iceberg V3 timestamp_ns, Avro, Pandas datetime64[ns].
> >>>
> >>> Below are the sources from which timestamps with nanosecond precision outside the range 1677-2262 could come:
> >>> 1. Parquet: Spark's TIMESTAMP_LTZ is still saved/loaded from INT96 by default, which has nanosecond precision.
> >>> 2. Another built-in datasource, ORC, stores timestamps with nanosecond precision, see https://orc.apache.org/specification/ORCv2/
> >>> 3. Spark SQL can have access to some external DBMSs that support nanosecond precision, for instance Oracle, MS SQL Server, Snowflake, Trino, Teradata.
> >>>
> >>> > Nanosecond precision tends to go with modern-measurement data (HFT, traces, IoT, logs); wide calendar range tends to go with archival data where milli or second precision is enough.
> >>>
> >>> I would imagine that Spark users might need timestamps with nanos outside the range 1677-2262 for:
> >>> - Simulating some physical processes in the future or in the past.
> >>> - Migration from other systems.
> >>>
> >>> > Composite is hard to walk back once shipped. The two directions aren't symmetric. Starting with INT64 and upgrading to composite later is SQL-layer compatible
> >>>
> >>> INT64 epoch-nanos is also a one-way semantic bet in the other direction: once users store physics-time workloads in that encoding, widening later without reinterpretation is not free either.
> >>>
> >>> > The other thing that pulled us toward INT64 is that it's the choice most open-source columnar and lakehouse engines have already made. DuckDB's TIMESTAMP_NS, ClickHouse's DateTime64(9), and InfluxDB's timestamp storage all use INT64 epoch-nanos with the 1678–2262 bound.
> >>>
> >>> Matching open columnar consensus for wire formats is a strong default for interchange, I agree. I would separate that from the question of Spark’s in-memory representation.
> >>>
> >>> > Given the perf concern especially, we'd prefer INT64 for now.
> >>> > @Unstable keeps the door open to the composite layout later
> >>>
> >>> How about measuring the performance of an MVP on end-to-end benchmarks? We could address perf concerns later.
> >>>
> >>> Yours faithfully,
> >>> Max Gekk
> >>>
> >>>
> >>> On Tue, May 12, 2026 at 1:52 AM Xiaoxuan Li <[email protected]> wrote:
> >>> >
> >>> > Hi Max,
> >>> > Thanks for the writeup. I've been working on a related proposal in parallel — SPIP: Support NanoSecond Timestamp Types. The user-visible surface overlaps a lot (SQL syntax, new catalyst types, Parquet NANOS interop); the key difference is internal representation, our draft uses INT64 epoch-nanos, yours uses composite (epochMicros, nanosOfMicro).
> >>> >
> >>> > If we decide to go with composite, I agree your layout is the right one: it reuses micros-based DateTimeUtils, aligns the calendar range with TimestampType, and keeps the extra precision as a small bounded correction.
> >>> >
> >>> > We started with INT64 because we're worried about paying composite's cost without getting the real benefit. Four concerns, and I'd value your read on whether they're solvable:
> >>> >
> >>> > Hot-path performance. Composite doesn't fit UnsafeRow's 8-byte slot, so every sort/hash/join/shuffle pays the variable-length cost: extra memory access, worse cache locality, ~2–3x memory per value. Trino is the closest precedent — they went composite for TIMESTAMP(p>6) because their ceiling is picoseconds, and even so the perf gap between short and long representations was significant enough that they added a hive.timestamp-precision toggle so users could force high-precision columns back to micros. Our ceiling is nanoseconds, so we'd take on Trino's cost without Trino's reason. Curious how you see it playing out differently.
> >>> > The range benefit doesn't survive egress.
> >>> > Spark's main egress paths are all INT64 epoch-nanos: Parquet NANOS, Arrow (so PySpark and Spark Connect), Iceberg V3 timestamp_ns, Avro, Pandas datetime64[ns]. A year-1500 value can live in Spark memory under composite but can't leave — it either throws on write/fetch or gets silently truncated, depending on how the boundary is specified. Curious what you have in mind for the egress side.
> >>> > Do workloads actually need both? Nanosecond precision tends to go with modern-measurement data (HFT, traces, IoT, logs); wide calendar range tends to go with archival data where milli or second precision is enough. We haven't found a case where a single column needs both — same assumption Parquet, Arrow, Iceberg, and Pandas seem to make. The one case where they do intersect is sentinel values — 9999-12-31 for "no end date," 0001-01-01 for "unknown start" — mixed into columns that otherwise hold nanosecond-precise timestamps. Your proposal handles this natively; ours asks users to either use NULL or pick a sentinel within range. That's a real user-facing ask. Curious whether you've seen other patterns, since sentinels alone feel like something that could also be addressed at the data-modeling layer.
> >>> > Composite is hard to walk back once shipped. The two directions aren't symmetric. Starting with INT64 and upgrading to composite later is SQL-layer compatible — user queries and declared schemas don't move, the existing Parquet files keep meaning the same thing (Spark just reads INT64 nanos into composite at the edge), and new writes can carry the wider range once Parquet or Arrow grow support. Starting with composite is effectively a one-way commitment: the moment users persist year-1500 values into tables, Spark owns supporting those values forever, because narrowing the type after the fact would be data loss from the user's perspective.
> >>> > So starting narrow preserves the option to go wider if the evidence shifts; starting wide locks in the cost on day one.
> >>> >
> >>> > The other thing that pulled us toward INT64 is that it's the choice most open-source columnar and lakehouse engines have already made. DuckDB's TIMESTAMP_NS, ClickHouse's DateTime64(9), and InfluxDB's timestamp storage all use INT64 epoch-nanos with the 1678–2262 bound. Parquet, Arrow, Iceberg V3, Avro, and Pandas datetime64[ns] do too. Engines that offer full-range nanos — Snowflake, Oracle, DB2 — either run on proprietary storage formats they control end-to-end or are row-based OLTP with different cost structures. Trino is the one open-source columnar engine that went wider — it supports TIMESTAMP(p) up to picoseconds (p=12), which simply doesn't fit in INT64, so composite was necessary. Even so, the performance penalty is real. For a columnar engine like Spark whose data plane runs through Parquet and Arrow, matching the open-source columnar consensus seemed like the less surprising default.
> >>> >
> >>> > Given the perf concern especially, we'd prefer INT64 for now. @Unstable keeps the door open to the composite layout later — if the ecosystem grows full-range nanos, workloads push us there, or we need sub-nanosecond precision where INT64 isn't enough.
> >>> >
> >>> > Would love any thoughts on this; it would be good to align on a single direction before either moves forward.
> >>> >
> >>> > Thanks,
> >>> > Xiaoxuan Li
> >>> >
> >>> > On Fri, May 8, 2026 at 1:43 AM Wenchen Fan <[email protected]> wrote:
> >>> >>
> >>> >> This new design makes sense to me. So we just add 2 more bytes to store nanosOfMicro, and the rest is the same as the current timestamp types: same value range, but higher precision.
> >>> >>
> >>> >> On Thu, May 7, 2026 at 5:16 PM Max Gekk <[email protected]> wrote:
> >>> >>>
> >>> >>> Hi Spark devs,
> >>> >>>
> >>> >>> I’d like to share a proposal for nanosecond-capable timestamp support and ask for your feedback.
> >>> >>>
> >>> >>> Here is the SPIP:
> >>> >>> https://docs.google.com/document/d/1DeW15QueI4PdRyPm6C6jsTZFmIjbXX2j4h-Ja5W_fsg/edit?usp=sharing
> >>> >>>
> >>> >>> My proposal uses a logical split representation:
> >>> >>> - epochMicros: Long
> >>> >>> - nanosOfMicro: Short in [0, 999]
> >>> >>>
> >>> >>> This applies to both NTZ and LTZ nano-capable types; timezone semantics remain unchanged and are handled at interpretation boundaries (as today).
> >>> >>>
> >>> >>> Why this approach? I believe this is the most practical path for Spark because it:
> >>> >>> 0. Conforms to the SQL standard.
> >>> >>> 1. Preserves Spark’s existing microsecond approach. Most Catalyst/runtime datetime logic already uses micros. The split model extends it rather than replacing it.
> >>> >>> 2. Avoids the INT64 epoch-nanos range cliff as the primary engine model. A single Long epoch-nanos representation constrains calendar range much more aggressively than Long micros.
> >>> >>> 3. Keeps migration risk lower. Existing microsecond behavior remains default; nano precision is opt-in via parameterized types/syntax.
> >>> >>> 4. Allows efficient implementation paths. Internals can still choose compact physical encodings (row/vector/file boundaries), while keeping one canonical logical contract.
> >>> >>>
> >>> >>> Related SPIPs considered.
> >>> >>> I reviewed and compared against these two drafts:
> >>> >>> - SPIP: Support NanoSecond Timestamps:
> >>> >>> https://docs.google.com/document/d/1wjFsBdlV2YK75x7UOk2HhDOqWVA0yC7iEiqOMnNnxlA/edit?tab=t.0#heading=h.4kibaxwtx2xo
> >>> >>> - SPIP: Support NanoSecond Timestamp Types:
> >>> >>> https://docs.google.com/document/d/1Q5u1whAO_KcT6d4dFFaIMy_S3RoQEo4Znwz2U-nbhls/edit?tab=t.0#heading=h.xk16mmomv6il
> >>> >>>
> >>> >>> Those drafts are valuable and informed this design. The key difference is that I prioritize micros-first engine continuity with a bounded nano remainder, instead of making epoch-nanos the primary internal semantic unit.
> >>> >>> In short: I think epochMicros + nanosOfMicro is a better fit for Spark’s current architecture and compatibility constraints, while still delivering practical nanosecond support.
> >>> >>>
> >>> >>> Thanks in advance for your feedback.
> >>> >>>
> >>> >>> Best regards,
> >>> >>> Max Gekk
> >>> >>>
> >>> >>> ---------------------------------------------------------------------
> >>> >>> To unsubscribe e-mail: [email protected]
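
The split representation from the original proposal (epochMicros: Long, nanosOfMicro: Short in [0, 999]) can be illustrated with a small Python sketch. This is only an illustration under stated assumptions, not Spark code: the class NanoTimestamp and its method names are hypothetical, and the out-of-range error on conversion to INT64 epoch-nanos models just one of the egress behaviors discussed in the thread.

```python
from dataclasses import dataclass

NANOS_PER_MICRO = 1_000
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


@dataclass(frozen=True, order=True)
class NanoTimestamp:
    """Hypothetical composite value: microseconds since the epoch plus a
    bounded nanosecond remainder. Field order makes the generated
    comparison operators agree with chronological order."""
    epoch_micros: int    # must fit a signed 64-bit Long
    nanos_of_micro: int  # must lie in [0, 999]

    def __post_init__(self):
        if not INT64_MIN <= self.epoch_micros <= INT64_MAX:
            raise ValueError("epoch_micros does not fit in 64 bits")
        if not 0 <= self.nanos_of_micro < NANOS_PER_MICRO:
            raise ValueError("nanos_of_micro must be in [0, 999]")

    @staticmethod
    def from_epoch_nanos(nanos: int) -> "NanoTimestamp":
        # divmod floors, so the remainder stays in [0, 999] even for
        # instants before 1970 (negative epoch values).
        micros, rem = divmod(nanos, NANOS_PER_MICRO)
        return NanoTimestamp(micros, rem)

    def to_epoch_nanos(self) -> int:
        """Exact value as an unbounded Python int."""
        return self.epoch_micros * NANOS_PER_MICRO + self.nanos_of_micro

    def to_int64_epoch_nanos(self) -> int:
        """Value for an INT64 epoch-nanos sink; raises when out of range
        (assumed behavior, one of the options raised in the thread)."""
        nanos = self.to_epoch_nanos()
        if not INT64_MIN <= nanos <= INT64_MAX:
            raise OverflowError("timestamp outside INT64 epoch-nanos range")
        return nanos
```

A value one nanosecond before the epoch becomes NanoTimestamp(-1, 999), and a year-1500 value can be constructed but raises OverflowError when converted for an INT64 epoch-nanos sink.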
