[ 
https://issues.apache.org/jira/browse/SPARK-57103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-57103.
------------------------------
    Fix Version/s: 4.3.0
       Resolution: Fixed

Issue resolved by pull request 56187
[https://github.com/apache/spark/pull/56187]

> Add ordering, compare, and hash for nanosecond timestamp types
> --------------------------------------------------------------
>
>                 Key: SPARK-57103
>                 URL: https://issues.apache.org/jira/browse/SPARK-57103
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.3.0
>            Reporter: Max Gekk
>            Assignee: Stevo Mitric
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.3.0
>
>
> h3. Summary
> SPARK-56981 added physical storage for TimestampNTZNanosType(p) and 
> TimestampLTZNanosType(p) (p in [7, 9]) as TimestampNanosVal (epochMicros + 
> nanosWithinMicro). Values can be written and read from InternalRow / 
> UnsafeRow, but ordering, comparison, and hashing are not implemented: 
> PhysicalTimestampNTZNanosType and PhysicalTimestampLTZNanosType throw on 
> ordering, and hash expressions do not handle the composite value.
> This issue adds compare, PhysicalDataType.ordering, and hash support so 
> queries using ORDER BY, sort, join keys, GROUP BY, DISTINCT, BETWEEN, and 
> hash() / xxhash64() on nanosecond timestamp columns work.
> h3. Background
> * Parent: SPARK-56822 (SPIP: Timestamps with nanosecond precision)
> * Depends on: SPARK-56981 (TimestampNanosVal, row accessors)
> * Logical types: SPARK-56876
> * TimestampNanosVal already implements equals and hashCode (manual mix on 
> epochMicros and nanosWithinMicro); compareTo / Ordering is missing.
> * PhysicalDataType.ordering for PhysicalTimestampNTZNanosType / 
> PhysicalTimestampLTZNanosType currently throws 
> orderedOperationUnsupportedByDataTypeError (deferred in SPARK-56981).
> * hash.scala handles TimestampType and TimestampNTZType as microsecond long 
> values; there is no branch for nanos composite values.
> Comparison semantics: a total order on the pair (epochMicros, 
> nanosWithinMicro) on the proleptic-Gregorian epoch-microsecond timeline. NTZ 
> and LTZ use the same pair layout (the zone affects interpretation elsewhere, 
> not the stored pair). NTZ and LTZ columns are not mutually comparable unless 
> explicit cast rules say otherwise (out of scope here).
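To illustrate the pair layout, the sketch below splits a nanosecond count since the epoch into (epochMicros, nanosWithinMicro) using floor division, so the nanosecond part stays in [0, 999] even for pre-epoch instants. The class and method names are hypothetical, and the [0, 999] normalization is an assumption about TimestampNanosVal that this issue does not state.

```java
// Hypothetical helper, not part of Spark. It only illustrates a normalized
// (epochMicros, nanosWithinMicro) pair and lexicographic comparison on it.
final class NanosPairSketch {
    static final long NANOS_PER_MICRO = 1_000L;

    // Floor division rounds toward negative infinity, so pre-epoch values
    // land in the microsecond that actually contains them.
    static long epochMicros(long epochNanos) {
        return Math.floorDiv(epochNanos, NANOS_PER_MICRO);
    }

    // Floor modulus is never negative for a positive divisor (assumed range: 0..999).
    static int nanosWithinMicro(long epochNanos) {
        return (int) Math.floorMod(epochNanos, NANOS_PER_MICRO);
    }

    // Compare by epochMicros first, then by nanosWithinMicro.
    static int comparePair(long epochNanosA, long epochNanosB) {
        int c = Long.compare(epochMicros(epochNanosA), epochMicros(epochNanosB));
        return c != 0
            ? c
            : Integer.compare(nanosWithinMicro(epochNanosA), nanosWithinMicro(epochNanosB));
    }

    public static void main(String[] args) {
        long t = -1L; // one nanosecond before the epoch
        System.out.println(epochMicros(t) + ", " + nanosWithinMicro(t)); // prints "-1, 999"
    }
}
```

With this normalization, ordering the pairs lexicographically gives the same result as ordering the underlying nanosecond counts, including across the epoch boundary.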
> h3. What to do
> h4. 1. Compare on TimestampNanosVal
> * Add compareTo (or a shared Ordering[TimestampNanosVal]) that orders by 
> epochMicros, then nanosWithinMicro.
> * Handle nulls via existing Catalyst null ordering, not inside compareTo.
> * Align with equals: for normalized values, compare == 0 must imply that the 
> values are equal.
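A minimal sketch of what such a compareTo could look like, using a hypothetical stand-in class rather than the real TimestampNanosVal. Only the two field names come from the issue text. The field types and the hashCode formula are assumptions chosen for illustration, not Spark's actual mix.

```java
import java.util.Objects;

// Hypothetical stand-in for TimestampNanosVal (field types are assumed).
final class NanosValSketch implements Comparable<NanosValSketch> {
    final long epochMicros;
    final int nanosWithinMicro;

    NanosValSketch(long epochMicros, int nanosWithinMicro) {
        this.epochMicros = epochMicros;
        this.nanosWithinMicro = nanosWithinMicro;
    }

    // Order by epochMicros, then by nanosWithinMicro. Nulls are deliberately
    // not handled here; the caller's null ordering is expected to do that.
    @Override
    public int compareTo(NanosValSketch other) {
        int c = Long.compare(epochMicros, other.epochMicros);
        return c != 0 ? c : Integer.compare(nanosWithinMicro, other.nanosWithinMicro);
    }

    // equals looks at the same two fields, so compareTo == 0 exactly when equals is true.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof NanosValSketch)) return false;
        NanosValSketch that = (NanosValSketch) o;
        return epochMicros == that.epochMicros && nanosWithinMicro == that.nanosWithinMicro;
    }

    // Illustrative hash only; not the mix used by the real class.
    @Override
    public int hashCode() {
        return Objects.hash(epochMicros, nanosWithinMicro);
    }
}
```

Because compareTo and equals read the same two fields, the "compare == 0 implies equal" requirement holds by construction for this stand-in.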
> h4. 2. PhysicalDataType.ordering
> * Implement ordering on PhysicalTimestampNTZNanosType and 
> PhysicalTimestampLTZNanosType returning Ordering[TimestampNanosVal] (or 
> Ordering[Any] as other physical types do).
> * Remove orderedOperationUnsupportedByDataTypeError from these physical types.
> * Update the scaladoc that says ordering was deferred.
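The idea of a type-erased ordering with null handling layered on top, rather than baked into the value comparison, can be sketched in plain Java as follows. This is an analogy only: the class names are hypothetical, `Comparator<Object>` stands in for Scala's `Ordering[Any]`, and the nulls-first wrapper is an arbitrary choice for the example, not a statement about how Catalyst wires null ordering.

```java
import java.util.Comparator;

final class ErasedOrderingSketch {
    // Hypothetical stand-in for TimestampNanosVal.
    static final class Val {
        final long epochMicros;
        final int nanosWithinMicro;

        Val(long epochMicros, int nanosWithinMicro) {
            this.epochMicros = epochMicros;
            this.nanosWithinMicro = nanosWithinMicro;
        }
    }

    // Typed comparator: epochMicros first, then nanosWithinMicro.
    static final Comparator<Val> TYPED =
        Comparator.comparingLong((Val v) -> v.epochMicros)
                  .thenComparingInt(v -> v.nanosWithinMicro);

    // Type-erased view: callers pass Object and the comparator casts internally.
    static final Comparator<Object> ERASED =
        (a, b) -> TYPED.compare((Val) a, (Val) b);

    // Null handling is added by a wrapper, so the value comparison never sees null.
    static final Comparator<Object> NULLS_FIRST = Comparator.nullsFirst(ERASED);
}
```

Keeping the null policy in a wrapper means the same value comparison can serve both nulls-first and nulls-last sorts.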
> h4. 3. Hash expressions (interpreted + codegen)
> * Extend hash.scala (and related codegen paths) for TimestampNTZNanosType and 
> TimestampLTZNanosType.
> * Hash the composite consistently with TimestampNanosVal.hashCode 
> (epochMicros and nanosWithinMicro); follow the pattern used for 
> CalendarInterval or other multi-field physical types where applicable.
> * Cover hash and xxhash64 (and murmur3 if other timestamp types do).
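One common way to hash a multi-field value is to chain the fields: the hash of the first field becomes the seed for the next. The sketch below shows that pattern for the (epochMicros, nanosWithinMicro) pair. Everything in it is hypothetical: the class and method names are made up, the field order is one of several possible choices, and the mixer is the SplitMix64 finalizer used as a stand-in, not Spark's Murmur3 or XXH64 implementation.

```java
// Hypothetical illustration of seed-chained hashing over a two-field value.
final class CompositeHashSketch {
    // Stand-in 64-bit mixer (SplitMix64 finalizer). NOT the hash Spark uses.
    static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    static long hashLong(long value, long seed) {
        return mix(seed ^ value);
    }

    static long hashInt(int value, long seed) {
        return mix(seed ^ (long) value);
    }

    // Chain the fields: the hash of epochMicros seeds the hash of nanosWithinMicro.
    static long hashNanosTimestamp(long epochMicros, int nanosWithinMicro, long seed) {
        return hashInt(nanosWithinMicro, hashLong(epochMicros, seed));
    }
}
```

Since the result depends only on the two fields and the seed, values that are equal hash equally, which is the consistency with equals that the issue asks for.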
> h4. 4. Codegen comparison
> * Ensure that CodeGenerator.genComp / ordering paths for AtomicType or 
> physical-type-specific branches can compare nanos timestamp columns. These 
> may already route through PhysicalDataType.ordering once it is implemented; 
> verify both the whole-stage codegen and interpreted paths.
> h4. 5. Tests
> * Unit: compareTo / Ordering on TimestampNanosVal (including negatives, equal 
> epochMicros different nanosWithinMicro, Long.MinValue / Long.MaxValue 
> epochMicros).
> * SQL: ORDER BY asc/desc on nanos NTZ and LTZ columns.
> * SQL: join on nanos timestamp key (equi-join).
> * SQL: GROUP BY and DISTINCT on nanos column.
> * SQL: hash(expr) and xxhash64(expr) are stable and consistent with equals.
> * Regression: microsecond TimestampType / TimestampNTZType behavior unchanged.
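The Long.MinValue / Long.MaxValue cases in the unit-test list matter because a comparison written as a subtraction overflows at the extremes. The following sketch, with hypothetical names, contrasts a subtraction-based comparison with Long.compare on those boundary values.

```java
// Hypothetical demo of why boundary epochMicros values need explicit tests.
final class OrderingEdgeCasesSketch {
    // Buggy pattern: a - b overflows when the operands are far apart.
    static int compareBySubtraction(long a, long b) {
        return Long.signum(a - b);
    }

    // Safe pattern: no arithmetic on the operands.
    static int compareSafely(long a, long b) {
        return Long.compare(a, b);
    }

    public static void main(String[] args) {
        long min = Long.MIN_VALUE;
        long max = Long.MAX_VALUE;
        // MIN_VALUE - MAX_VALUE wraps around to 1, so the sign is wrong.
        System.out.println(compareBySubtraction(min, max)); // prints 1 (incorrect)
        System.out.println(compareSafely(min, max) < 0);    // prints true
    }
}
```

A test that only uses small positive values would pass with either version, which is why the extreme values are worth covering.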
> h3. Acceptance criteria
> * ORDER BY on a column of TimestampNTZNanosType or TimestampLTZNanosType 
> succeeds and sorts by (epochMicros, nanosWithinMicro).
> * Equi-join and GROUP BY / DISTINCT on nanos timestamp columns succeed in 
> tests.
> * hash() / xxhash64() on nanos timestamp values match expected semantics and 
> align with equals.
> * PhysicalDataType.ordering no longer throws for 
> PhysicalTimestampNTZNanosType / PhysicalTimestampLTZNanosType.
> * No change to comparison or hash behavior of existing microsecond timestamp 
> types.
> h3. Out of scope
> * Cast matrix, type coercion, Parquet read/write, string parsing, java.time 
> encoders
> * Cross-type comparison (nanos LTZ vs micro LTZ, NTZ vs LTZ) except what 
> existing analyzer already allows via casts
> * Types Framework registration (SPARK-57101)
> * ColumnVector / vectorized hash (can follow SPARK-57100 separately if needed)
> * ANSI interval / timestamp subtraction at nanos precision
> h3. Unblocks
> * Mid-term SPIP goal: filters, joins, aggregations, and sort on nanosecond 
> timestamp columns
> * Expression and benchmark work that assumes comparable, hashable keys
> h3. References
> * org.apache.spark.unsafe.types.TimestampNanosVal
> * sql/catalyst/.../PhysicalDataType.scala (PhysicalTimestampNTZNanosType / 
> PhysicalTimestampLTZNanosType)
> * sql/catalyst/.../expressions/hash.scala
> * Precedent: TIME type hash support (SPARK-51664)



