[
https://issues.apache.org/jira/browse/SPARK-57103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-57103:
-----------------------------------
Labels: pull-request-available (was: )
> Add ordering, compare, and hash for nanosecond timestamp types
> --------------------------------------------------------------
>
> Key: SPARK-57103
> URL: https://issues.apache.org/jira/browse/SPARK-57103
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.3.0
> Reporter: Max Gekk
> Priority: Major
> Labels: pull-request-available
>
> h3. Summary
> SPARK-56981 added physical storage for TimestampNTZNanosType(p) and
> TimestampLTZNanosType(p) (p in [7, 9]) as TimestampNanosVal (epochMicros +
> nanosWithinMicro). Values can be written and read from InternalRow /
> UnsafeRow, but ordering, comparison, and hashing are not implemented:
> PhysicalTimestampNTZNanosType and PhysicalTimestampLTZNanosType throw on
> ordering, and hash expressions do not handle the composite value.
> This issue adds compare, PhysicalDataType.ordering, and hash support so
> queries using ORDER BY, sort, join keys, GROUP BY, DISTINCT, BETWEEN, and
> hash() / xxhash64() on nanosecond timestamp columns work.
> h3. Background
> * Parent: SPARK-56822 (SPIP: Timestamps with nanosecond precision)
> * Depends on: SPARK-56981 (TimestampNanosVal, row accessors)
> * Logical types: SPARK-56876
> * TimestampNanosVal already implements equals and hashCode (manual mix on
> epochMicros and nanosWithinMicro); compareTo / Ordering is missing.
> * PhysicalDataType.ordering for PhysicalTimestampNTZNanosType /
> PhysicalTimestampLTZNanosType currently throws
> orderedOperationUnsupportedByDataTypeError (deferred in SPARK-56981).
> * hash.scala handles TimestampType and TimestampNTZType as microsecond long
> values; there is no branch for nanos composite values.
> Comparison semantics: total order on (epochMicros, nanosWithinMicro) on the
> proleptic-Gregorian epoch-micro timeline. NTZ and LTZ share the same pair
> layout (zone affects interpretation elsewhere, not the stored pair). NTZ and
> LTZ columns are not mutually comparable unless explicit cast rules say
> otherwise (out of scope here).
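
A minimal sketch of why lexicographic order on the pair follows timeline order. It assumes the pair is normalized with floor semantics, so nanosWithinMicro stays in [0, 999] before the epoch as well; the issue does not state this. `NanosVal`, `fromEpochNanos`, and `comparePairs` are hypothetical stand-ins, not Spark APIs.

```scala
object NanosPairSketch {
  // Hypothetical stand-in for the stored pair; not Spark's TimestampNanosVal.
  // The Int field type is an assumption made for this sketch.
  final case class NanosVal(epochMicros: Long, nanosWithinMicro: Int)

  // Assumption: floor-based normalization, so nanosWithinMicro is in [0, 999]
  // for instants before the epoch too.
  def fromEpochNanos(epochNanos: Long): NanosVal =
    NanosVal(
      java.lang.Math.floorDiv(epochNanos, 1000L),
      java.lang.Math.floorMod(epochNanos, 1000L).toInt)

  // Lexicographic comparison: epochMicros first, then nanosWithinMicro.
  def comparePairs(a: NanosVal, b: NanosVal): Int = {
    val c = java.lang.Long.compare(a.epochMicros, b.epochMicros)
    if (c != 0) c else java.lang.Integer.compare(a.nanosWithinMicro, b.nanosWithinMicro)
  }

  val oneNanoBeforeEpoch: NanosVal = fromEpochNanos(-1L) // NanosVal(-1, 999)
  val oneNanoAfterEpoch: NanosVal = fromEpochNanos(1L)   // NanosVal(0, 1)
}
```

Under that normalization, one nanosecond before the epoch is stored as (-1, 999) and still sorts before (0, 1), so no special handling of negative values is needed in the comparison itself.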
> h3. What to do
> h4. 1. Compare on TimestampNanosVal
> * Add compareTo (or a shared Ordering[TimestampNanosVal]) that orders by
> epochMicros, then nanosWithinMicro.
> * Handle nulls via existing Catalyst null ordering, not inside compareTo.
> * Align with equals: for normalized values, compare == 0 must imply that the
> values are equal.
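
The bullets above could be sketched as a shared Ordering. `NanosVal` is a hypothetical stand-in for TimestampNanosVal (the field types are assumed), and nulls are deliberately not handled here, per the second bullet.

```scala
object NanosOrderingSketch {
  // Hypothetical stand-in for TimestampNanosVal; field names follow the issue
  // text, field types are an assumption.
  final case class NanosVal(epochMicros: Long, nanosWithinMicro: Int)

  // Orders by epochMicros, then nanosWithinMicro. Uses compare rather than
  // subtraction so Long.MinValue / Long.MaxValue cannot overflow.
  implicit val nanosValOrdering: Ordering[NanosVal] = new Ordering[NanosVal] {
    override def compare(a: NanosVal, b: NanosVal): Int = {
      val c = java.lang.Long.compare(a.epochMicros, b.epochMicros)
      if (c != 0) c else java.lang.Integer.compare(a.nanosWithinMicro, b.nanosWithinMicro)
    }
  }

  val unsorted: Seq[NanosVal] = Seq(
    NanosVal(Long.MaxValue, 0),
    NanosVal(0L, 999),
    NanosVal(Long.MinValue, 5),
    NanosVal(0L, 1))

  // Picks up the implicit ordering defined above.
  val sorted: Seq[NanosVal] = unsorted.sorted
}
```

Because both fields take part in the comparison, compare returns 0 only when the case-class equals also holds, which is the alignment the last bullet asks for.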
> h4. 2. PhysicalDataType.ordering
> * Implement ordering on PhysicalTimestampNTZNanosType and
> PhysicalTimestampLTZNanosType returning Ordering[TimestampNanosVal] (or
> Ordering[Any] as other physical types do).
> * Remove orderedOperationUnsupportedByDataTypeError from these physical types.
> * Update the scaladoc that says ordering was deferred.
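
One possible shape for this step, as a simplified sketch. `PhysicalTypeSketch` and the two objects are hypothetical and much smaller than Spark's PhysicalDataType; `NanosVal` again stands in for TimestampNanosVal. The only points illustrated are that both physical types can share one ordering and that it can be exposed as Ordering[Any] if needed.

```scala
object PhysicalOrderingSketch {
  // Hypothetical stand-in for TimestampNanosVal.
  final case class NanosVal(epochMicros: Long, nanosWithinMicro: Int)

  // Hypothetical, simplified shape of a physical type.
  trait PhysicalTypeSketch {
    type InternalType
    def ordering: Ordering[InternalType]
  }

  // Shared ordering: epochMicros first, then nanosWithinMicro.
  val nanosOrdering: Ordering[NanosVal] =
    Ordering.by[NanosVal, (Long, Int)](v => (v.epochMicros, v.nanosWithinMicro))

  // NTZ and LTZ store the same pair, so both reuse the same ordering
  // instead of throwing.
  object PhysicalNtzNanosSketch extends PhysicalTypeSketch {
    type InternalType = NanosVal
    val ordering: Ordering[NanosVal] = nanosOrdering
  }

  object PhysicalLtzNanosSketch extends PhysicalTypeSketch {
    type InternalType = NanosVal
    val ordering: Ordering[NanosVal] = nanosOrdering
  }

  // Type-erased view for callers that only see Ordering[Any]. Callers must
  // pass values of the matching internal type.
  def asAnyOrdering(t: PhysicalTypeSketch): Ordering[Any] =
    t.ordering.asInstanceOf[Ordering[Any]]
}
```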
> h4. 3. Hash expressions (interpreted + codegen)
> * Extend hash.scala (and related codegen paths) for TimestampNTZNanosType and
> TimestampLTZNanosType.
> * Hash the composite consistently with TimestampNanosVal.hashCode
> (epochMicros and nanosWithinMicro); follow the pattern used for
> CalendarInterval or other multi-field physical types where applicable.
> * Cover hash and xxhash64 (and murmur3 if other timestamp types do).
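
A rough sketch of hashing the composite, assuming the multi-field pattern means chaining: the hash of one field is used as the seed for the next. The `hashInt` / `hashLong` helpers below are stand-ins built on scala.util.hashing.MurmurHash3, not Spark's Murmur3 or XXH64 code, and the field order is an arbitrary choice made for this example.

```scala
import scala.util.hashing.MurmurHash3

object NanosHashSketch {
  // Hypothetical stand-in for TimestampNanosVal.
  final case class NanosVal(epochMicros: Long, nanosWithinMicro: Int)

  // Stand-in primitive: mix one Int into the seed and finalize (4 bytes).
  def hashInt(i: Int, seed: Int): Int =
    MurmurHash3.finalizeHash(MurmurHash3.mix(seed, i), 4)

  // Stand-in primitive: mix the low and high halves of a Long (8 bytes).
  def hashLong(l: Long, seed: Int): Int = {
    val low = MurmurHash3.mix(seed, l.toInt)
    MurmurHash3.finalizeHash(MurmurHash3.mix(low, (l >>> 32).toInt), 8)
  }

  // Chain the fields: the hash of epochMicros becomes the seed for
  // nanosWithinMicro, so both fields contribute to the result.
  def hashNanos(v: NanosVal, seed: Int): Int =
    hashInt(v.nanosWithinMicro, hashLong(v.epochMicros, seed))
}
```

Since the result depends only on the two fields and the seed, values that are equal under equals always hash the same, which is the consistency the tests below in the issue call for.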
> h4. 4. Codegen comparison
> * Ensure CodeGenerator.genComp / ordering paths for AtomicType or
> physical-type-specific branches can compare nanos timestamp columns (may
> already route through PhysicalDataType.ordering once implemented; verify
> whole-stage codegen and interpreted paths).
> h4. 5. Tests
> * Unit: compareTo / Ordering on TimestampNanosVal (including negatives, equal
> epochMicros different nanosWithinMicro, Long.MinValue / Long.MaxValue
> epochMicros).
> * SQL: ORDER BY asc/desc on nanos NTZ and LTZ columns.
> * SQL: join on nanos timestamp key (equi-join).
> * SQL: GROUP BY and DISTINCT on nanos column.
> * SQL: hash(expr) and xxhash64(expr) stable and consistent with equals.
> * Regression: microsecond TimestampType / TimestampNTZType behavior unchanged.
> h3. Acceptance criteria
> * ORDER BY on a column of TimestampNTZNanosType or TimestampLTZNanosType
> succeeds and sorts by (epochMicros, nanosWithinMicro).
> * Equi-join and GROUP BY / DISTINCT on nanos timestamp columns succeed in
> tests.
> * hash() / xxhash64() on nanos timestamp values match expected semantics and
> align with equals.
> * PhysicalDataType.ordering no longer throws for
> PhysicalTimestampNTZNanosType / PhysicalTimestampLTZNanosType.
> * No change to comparison or hash behavior of existing microsecond timestamp
> types.
> h3. Out of scope
> * Cast matrix, type coercion, Parquet read/write, string parsing, java.time
> encoders
> * Cross-type comparison (nanos LTZ vs micro LTZ, NTZ vs LTZ) except what the
> existing analyzer already allows via casts
> * Types Framework registration (SPARK-57101)
> * ColumnVector / vectorized hash (can follow SPARK-57100 separately if needed)
> * ANSI interval / timestamp subtraction at nanos precision
> h3. Unblocks
> * Mid-term SPIP goal: filters, joins, aggregations, and sort on nanosecond
> timestamp columns
> * Expression and benchmark work that assumes comparable, hashable keys
> h3. References
> * org.apache.spark.unsafe.types.TimestampNanosVal
> * sql/catalyst/.../PhysicalDataType.scala (PhysicalTimestampNTZNanosType /
> PhysicalTimestampLTZNanosType)
> * sql/catalyst/.../expressions/hash.scala
> * Precedent: TIME type hash support (SPARK-51664)