Fair enough, I was not aware of the Java limitation and the resulting dependency.

On May 13, 2026, at 10:28 AM, Max Gekk <[email protected]> wrote:

Hi Serge,

> If we agree that any performance (and memory) cliff is going composite and
> not whether the extra bytes are 2 or 4 bytes, then would it make sense to
> match Trino? We would:

If we supported picosecond precision, this could cause the following issues, IMHO:

1. Spark's datetime stack today is “nanos-native,” not “picos-native.” java.time (Instant, LocalDateTime, ZonedDateTime, Duration, etc.) exposes nanoseconds as the finest supported unit in the public model. Supporting p > 9 in Spark SQL means either rounding away picos at almost every boundary or building custom arithmetic, normalization, parsing, and calendar logic for the sub-nano tail. That is a large, long-lived surface area, with high regression risk anywhere we already struggle: LTZ vs NTZ, session time zone, legacy rebasing, Julian/Gregorian, pushdown, codegen, etc. So "same cost as going composite for nanos" does not imply "picos are free once we went composite."

2. Memory is not only “+2 vs +4 bytes”; it is “+delta bytes * row width * shuffle fanout.” Picos widen rows further than nanos, which increases OOM / GC / shuffle spill risk on the same heap and cluster sizes, especially for wide fact tables and skewed joins on timestamp keys.

3. Interchange and “federation” still do not become automatic. Even if Trino is aligned internally, Parquet / Arrow / Pandas / JDBC paths overwhelmingly standardize on nanos at best for compact physical encodings.

Best regards,
Max Gekk

On Wed, May 13, 2026 at 4:04 PM serge rielau.com <[email protected]> wrote:
>
> A few questions to ponder:
>
> - Are we committed to the SQL Standard, even when it may be tactically
>   inconvenient?
> - Why did Trino and Db2 go to pico? I can answer for Db2 as I was in the
>   room: we wanted to build for the future and rip off the band-aid, and
>   there was no extra design or QA cost. What was Trino's thinking?
> - In my career I have seen DBMS needs go from milli to micro to nano. Nano
>   will not be the end of it. While for all intents and purposes “antique”
>   nanoseconds are too esoteric to sweat about, sticking with int64 will not
>   be an option for pico.
> - Storage is data at rest. It is “easy” to add another format. Engines like
>   Spark outlive storage formats, and so do their APIs.
>
> If we agree that any performance (and memory) cliff is going composite and
> not whether the extra bytes are 2 or 4 bytes, then would it make sense to
> match Trino? We would:
>
> - Have an actual external benefit outside of the corner case of range
> - Peace of mind for the API for at least a decade, perhaps more (if we go
>   femto, which is a free upgrade at 4 bytes)
> - Full compatibility with any federated datasource
> - Standard compliance
>
> On May 13, 2026, at 2:40 AM, Wenchen Fan <[email protected]> wrote:
>
> Sorry, I misclicked the send button, let me finish.
>
> We can throw out-of-range errors if the actual timestamp value does not fit
> the Parquet INT64, and we can work with the Parquet and other data
> format communities to add support for timestamp nanos with a wider year
> range. Before that, we can write a custom struct in Parquet to save this
> timestamp nano type.
>
> On Wed, May 13, 2026 at 5:38 PM Wenchen Fan <[email protected]> wrote:
>>
>> I think the main question is what the requirements are for this new
>> timestamp nano type. Personally I think it's better to follow the SQL
>> standard and support year range 0000 to 9999. This kills the INT64 option.
>> For data sources, we can throw an out-of-range error if the actual
>> timestamp value does not fit the Parquet INT64
>>
>> On Tue, May 12, 2026 at 5:38 PM Max Gekk <[email protected]> wrote:
>>>
>>> Hi Xiaoxuan,
>>>
>>> Thank you for the detailed clarification of your proposal.
>>>
>>> > the key difference is internal representation, our draft uses INT64
>>> > epoch-nanos, yours uses composite (epochMicros, nanosOfMicro).
>>>
>>> I think the main difference between our proposals is how we answer the
>>> question: shall Spark SQL conform to the SQL standard or not? The
>>> standard says clearly that the year range is from 0001 to 9999. Rough
>>> count of distinct nanosecond instants on a proleptic-Gregorian line
>>> from 0001-01-01 through 9999-12-31:
>>> - About 3.65*10^6 civil days in that span (order of magnitude is enough).
>>> - Each day has 86400*10^9 = 8.64*10^13 distinct nanosecond offsets
>>>   from midnight.
>>> So the number of distinct values is about:
>>>   N ≈ 3.65*10^6 * 8.64*10^13 ≈ 3.2*10^20
>>> Then: log2(N) ≈ 68-69 bits.
>>> Any mapping from that full set would need at least about 69 bits.
>>>
>>> > Four concerns, and I'd value your read on whether they're solvable:
>>> > Composite doesn't fit UnsafeRow's 8-byte slot, so every
>>> > sort/hash/join/shuffle pays the variable-length cost: extra memory
>>> > access, worse cache locality, ~2–3x memory per value.
>>>
>>> You are right about UnsafeRow, but built-in datasources like Parquet and
>>> ORC might return column vectors where values are stored as arrays of
>>> long and short, and such values could be processed in a vectorized way.
>>> I believe the new data type will perform worse, but not significantly so.
>>>
>>> > The range benefit doesn't survive egress. Spark's main egress paths are
>>> > all INT64 epoch-nanos: Parquet NANOS, Arrow (so PySpark and Spark
>>> > Connect), Iceberg V3 timestamp_ns, Avro, Pandas datetime64[ns].
>>>
>>> Below are the sources from which timestamps with nanosecond precision
>>> outside the range 1677-2262 could come:
>>> 1. Parquet: Spark's TIMESTAMP_LTZ is still saved to and loaded from INT96
>>>    by default, which has nanosecond precision.
>>> 2. Another built-in datasource, ORC, stores timestamps with nanosecond
>>>    precision, see https://orc.apache.org/specification/ORCv2/
>>> 3. Spark SQL can access external DBMSs that support nanosecond
>>>    precision, for instance Oracle, MS SQL Server, Snowflake, Trino,
>>>    Teradata.
>>>
>>> > Nanosecond precision tends to go with modern-measurement data (HFT,
>>> > traces, IoT, logs); wide calendar range tends to go with archival data
>>> > where milli or second precision is enough.
>>>
>>> I would imagine that Spark users might need timestamps with nanos
>>> outside the range 1677-2262 for:
>>> - Simulating some physical processes in the future or in the past.
>>> - Migration from other systems.
>>>
>>> > Composite is hard to walk back once shipped. The two directions aren't
>>> > symmetric. Starting with INT64 and upgrading to composite later is
>>> > SQL-layer compatible
>>>
>>> INT64 epoch-nanos is also a one-way semantic bet in the other
>>> direction: once users store physics-time workloads in that encoding,
>>> widening later without reinterpretation is not free either.
>>>
>>> > The other thing that pulled us toward INT64 is that it's the choice most
>>> > open-source columnar and lakehouse engines have already made. DuckDB's
>>> > TIMESTAMP_NS, ClickHouse's DateTime64(9), and InfluxDB's timestamp
>>> > storage all use INT64 epoch-nanos with the 1678–2262 bound.
>>>
>>> Matching open columnar consensus for wire formats is a strong default
>>> for interchange, I agree. I would separate that from the question of
>>> Spark's in-memory representation.
>>>
>>> > Given the perf concern especially, we'd prefer INT64 for now. @Unstable
>>> > keeps the door open to the composite layout later
>>>
>>> How about measuring the performance of an MVP on end-to-end benchmarks?
>>> We could address perf concerns later.
>>>
>>> Yours faithfully,
>>> Max Gekk
>>>
>>>
>>> On Tue, May 12, 2026 at 1:52 AM Xiaoxuan Li <[email protected]> wrote:
>>> >
>>> > Hi Max,
>>> > Thanks for the writeup. I've been working on a related proposal in
>>> > parallel, SPIP: Support NanoSecond Timestamp Types. The user-visible
>>> > surface overlaps a lot (SQL syntax, new catalyst types, Parquet NANOS
>>> > interop); the key difference is internal representation: our draft uses
>>> > INT64 epoch-nanos, yours uses composite (epochMicros, nanosOfMicro).
>>> >
>>> > If we decide to go with composite, I agree your layout is the right one:
>>> > it reuses micros-based DateTimeUtils, aligns the calendar range with
>>> > TimestampType, and keeps the extra precision as a small bounded
>>> > correction.
>>> >
>>> > We started with INT64 because we're worried about paying composite's cost
>>> > without getting the real benefit. Four concerns, and I'd value your read
>>> > on whether they're solvable:
>>> >
>>> > Hot-path performance. Composite doesn't fit UnsafeRow's 8-byte slot, so
>>> > every sort/hash/join/shuffle pays the variable-length cost: extra memory
>>> > access, worse cache locality, ~2–3x memory per value. Trino is the
>>> > closest precedent: they went composite for TIMESTAMP(p>6) because their
>>> > ceiling is picoseconds, and even so the perf gap between short and long
>>> > representations was significant enough that they added a
>>> > hive.timestamp-precision toggle so users could force high-precision
>>> > columns back to micros. Our ceiling is nanoseconds, so we'd take on
>>> > Trino's cost without Trino's reason. Curious how you see it playing out
>>> > differently.
>>> >
>>> > The range benefit doesn't survive egress. Spark's main egress paths are
>>> > all INT64 epoch-nanos: Parquet NANOS, Arrow (so PySpark and Spark
>>> > Connect), Iceberg V3 timestamp_ns, Avro, Pandas datetime64[ns]. A
>>> > year-1500 value can live in Spark memory under composite but can't
>>> > leave: it either throws on write/fetch or gets silently truncated,
>>> > depending on how the boundary is specified. Curious what you have in
>>> > mind for the egress side.
>>> >
>>> > Do workloads actually need both? Nanosecond precision tends to go with
>>> > modern-measurement data (HFT, traces, IoT, logs); wide calendar range
>>> > tends to go with archival data where milli or second precision is enough.
>>> > We haven't found a case where a single column needs both, which is the
>>> > same assumption Parquet, Arrow, Iceberg, and Pandas seem to make. The one
>>> > case where they do intersect is sentinel values (9999-12-31 for "no end
>>> > date," 0001-01-01 for "unknown start") mixed into columns that otherwise
>>> > hold nanosecond-precise timestamps. Your proposal handles this natively;
>>> > ours asks users to either use NULL or pick a sentinel within range.
>>> > That's a real user-facing ask. Curious whether you've seen other
>>> > patterns, since sentinels alone feel like something that could also be
>>> > addressed at the data-modeling layer.
>>> >
>>> > Composite is hard to walk back once shipped. The two directions aren't
>>> > symmetric. Starting with INT64 and upgrading to composite later is
>>> > SQL-layer compatible: user queries and declared schemas don't move, the
>>> > existing Parquet files keep meaning the same thing (Spark just reads
>>> > INT64 nanos into composite at the edge), and new writes can carry the
>>> > wider range once Parquet or Arrow grow support. Starting with composite
>>> > is effectively a one-way commitment: the moment users persist year-1500
>>> > values into tables, Spark owns supporting those values forever, because
>>> > narrowing the type after the fact would be data loss from the user's
>>> > perspective. So starting narrow preserves the option to go wider if the
>>> > evidence shifts; starting wide locks in the cost on day one.
>>> >
>>> > The other thing that pulled us toward INT64 is that it's the choice most
>>> > open-source columnar and lakehouse engines have already made. DuckDB's
>>> > TIMESTAMP_NS, ClickHouse's DateTime64(9), and InfluxDB's timestamp
>>> > storage all use INT64 epoch-nanos with the 1678–2262 bound. Parquet,
>>> > Arrow, Iceberg V3, Avro, and Pandas datetime64[ns] do too. Engines that
>>> > offer full-range nanos (Snowflake, Oracle, DB2) either run on
>>> > proprietary storage formats they control end-to-end or are row-based OLTP
>>> > with different cost structures. Trino is the one open-source columnar
>>> > engine that went wider: it supports TIMESTAMP(p) up to picoseconds
>>> > (p=12), which simply doesn't fit in INT64, so composite was necessary.
>>> > Even so, the performance penalty is real. For a columnar engine like
>>> > Spark whose data plane runs through Parquet and Arrow, matching the
>>> > open-source columnar consensus seemed like the less surprising default.
>>> >
>>> > Given the perf concern especially, we'd prefer INT64 for now. @Unstable
>>> > keeps the door open to the composite layout later, if the ecosystem
>>> > grows full-range nanos, workloads push us there, or we need
>>> > sub-nanosecond precision where INT64 isn't enough.
>>> >
>>> > Would love any thoughts on this; it would be good to align on a single
>>> > direction before either moves forward.
>>> >
>>> > Thanks,
>>> > Xiaoxuan Li
>>> >
>>> > On Fri, May 8, 2026 at 1:43 AM Wenchen Fan <[email protected]> wrote:
>>> >>
>>> >> This new design makes sense to me. So we just add 2 more bytes to store
>>> >> nanosOfMicro, and the rest is the same as the current timestamp types:
>>> >> same value range, but higher precision.
>>> >>
>>> >> On Thu, May 7, 2026 at 5:16 PM Max Gekk <[email protected]> wrote:
>>> >>>
>>> >>> Hi Spark devs,
>>> >>>
>>> >>> I'd like to share a proposal for nanosecond-capable timestamp support
>>> >>> and ask for your feedback.
>>> >>>
>>> >>> Here is the SPIP:
>>> >>> https://docs.google.com/document/d/1DeW15QueI4PdRyPm6C6jsTZFmIjbXX2j4h-Ja5W_fsg/edit?usp=sharing
>>> >>>
>>> >>> My proposal uses a logical split representation:
>>> >>> - epochMicros: Long
>>> >>> - nanosOfMicro: Short in [0, 999]
>>> >>>
>>> >>> This applies to both NTZ and LTZ nano-capable types; timezone
>>> >>> semantics remain unchanged and are handled at interpretation
>>> >>> boundaries (as today).
>>> >>>
>>> >>> Why this approach? I believe this is the most practical path for Spark
>>> >>> because it:
>>> >>> 0. Conforms to the SQL standard.
>>> >>> 1. Preserves Spark's existing microsecond approach. Most
>>> >>> Catalyst/runtime datetime logic already uses micros. The split model
>>> >>> extends it rather than replacing it.
>>> >>> 2. Avoids the INT64 epoch-nanos range cliff as the primary engine
>>> >>> model. A single Long epoch-nanos representation constrains calendar
>>> >>> range much more aggressively than Long micros.
>>> >>> 3. Keeps migration risk lower. Existing microsecond behavior remains
>>> >>> the default; nano precision is opt-in via parameterized types/syntax.
>>> >>> 4. Allows efficient implementation paths. Internals can still choose
>>> >>> compact physical encodings (row/vector/file boundaries), while keeping
>>> >>> one canonical logical contract.
>>> >>>
>>> >>> Related SPIPs considered. I reviewed and compared against these two
>>> >>> drafts:
>>> >>> - SPIP: Support NanoSecond Timestamps:
>>> >>> https://docs.google.com/document/d/1wjFsBdlV2YK75x7UOk2HhDOqWVA0yC7iEiqOMnNnxlA/edit?tab=t.0#heading=h.4kibaxwtx2xo
>>> >>> - SPIP: Support NanoSecond Timestamp Types:
>>> >>> https://docs.google.com/document/d/1Q5u1whAO_KcT6d4dFFaIMy_S3RoQEo4Znwz2U-nbhls/edit?tab=t.0#heading=h.xk16mmomv6il
>>> >>>
>>> >>> Those drafts are valuable and informed this design. The key difference
>>> >>> is that I prioritize micros-first engine continuity with a bounded
>>> >>> nano remainder, instead of making epoch-nanos the primary internal
>>> >>> semantic unit.
>>> >>> In short: I think epochMicros + nanosOfMicro is a better fit for
>>> >>> Spark's current architecture and compatibility constraints, while
>>> >>> still delivering practical nanosecond support.
>>> >>>
>>> >>> Thanks in advance for your feedback.
>>> >>>
>>> >>> Best regards,
>>> >>> Max Gekk
>>> >>>
>>> >>> ---------------------------------------------------------------------
>>> >>> To unsubscribe e-mail: [email protected]
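
For readers who want to check the arithmetic in this thread, the following is a minimal Python sketch, not code from Spark or from any of the SPIPs. All function names are hypothetical and chosen for illustration. It assumes a proleptic Gregorian calendar (which Python's datetime uses) and checks three claims made above: the year bounds of signed 64-bit epoch-nanos, the roughly 69 bits needed to cover 0001 to 9999 at nanosecond precision, and that an (epochMicros, nanosOfMicro) split keeps the remainder in [0, 999] while epochMicros still fits in a 64-bit Long.

```python
"""Back-of-the-envelope checks for the two timestamp layouts discussed above."""
from datetime import date, datetime, timedelta

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
NANOS_PER_MICRO = 1_000
NANOS_PER_DAY = 86_400 * 10**9
EPOCH = datetime(1970, 1, 1)  # naive datetime, read as UTC


def int64_epoch_nanos_years():
    """Years of the earliest and latest instants a signed 64-bit epoch-nanos holds."""
    # datetime has microsecond resolution, so truncate nanos toward the epoch.
    lo = EPOCH - timedelta(microseconds=(-INT64_MIN) // NANOS_PER_MICRO)
    hi = EPOCH + timedelta(microseconds=INT64_MAX // NANOS_PER_MICRO)
    return lo.year, hi.year


def bits_for_full_sql_range():
    """Bits needed to number every nanosecond from 0001-01-01 through 9999-12-31."""
    days = date(9999, 12, 31).toordinal() - date(1, 1, 1).toordinal() + 1
    distinct_values = days * NANOS_PER_DAY
    return (distinct_values - 1).bit_length()


def split_epoch_nanos(epoch_nanos):
    """Split epoch-nanos into (epochMicros, nanosOfMicro), nanosOfMicro in [0, 999]."""
    # Floor division keeps the remainder non-negative for pre-1970 instants too.
    return divmod(epoch_nanos, NANOS_PER_MICRO)


def join_epoch_nanos(epoch_micros, nanos_of_micro):
    """Inverse of split_epoch_nanos."""
    return epoch_micros * NANOS_PER_MICRO + nanos_of_micro


def composite_covers_sql_range():
    """True if epochMicros for years 0001..9999 fits in a signed 64-bit Long."""
    one_micro = timedelta(microseconds=1)
    lo = (datetime(1, 1, 1) - EPOCH) // one_micro
    hi = (datetime(9999, 12, 31, 23, 59, 59, 999999) - EPOCH) // one_micro
    return INT64_MIN <= lo and hi <= INT64_MAX


if __name__ == "__main__":
    print("INT64 epoch-nanos years:", int64_epoch_nanos_years())
    print("bits for 0001..9999 at nanos:", bits_for_full_sql_range())
    print("split(-1):", split_epoch_nanos(-1))
    print("composite covers 0001..9999:", composite_covers_sql_range())
```

The split uses floor division so that an instant one nanosecond before the epoch maps to (-1, 999) rather than to a negative remainder. Whether Spark would normalize pre-epoch values the same way is an assumption here, not something the thread specifies.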
