[ 
https://issues.apache.org/jira/browse/SPARK-57100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-57100:
-----------------------------------
    Labels: pull-request-available  (was: )

> Add columnar (ColumnVector) support for nanosecond timestamp types
> ------------------------------------------------------------------
>
>                 Key: SPARK-57100
>                 URL: https://issues.apache.org/jira/browse/SPARK-57100
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.3.0
>            Reporter: Max Gekk
>            Assignee: Max Gekk
>            Priority: Major
>              Labels: pull-request-available
>
> h3. Summary
> SPARK-56981 added physical row storage for TimestampNTZNanosType(p) and 
> TimestampLTZNanosType(p) (p in [7, 9]) via TimestampNanosVal and UnsafeRow. 
> Columnar execution still cannot hold or move these values: 
> ColumnVector.getTimestampNTZNanos / getTimestampLTZNanos throw 
> SparkUnsupportedOperationException, and RowToColumnConverter / 
> ColumnVectorUtils do not handle these types.
> This issue implements the columnar layer so ColumnarBatch can store 
> nanosecond timestamps and interoperate with InternalRow / UnsafeRow 
> (ColumnarToRow, RowToColumnar, whole-stage codegen paths that read column 
> vectors).
> Parquet vectorized decode (ParquetVectorUpdaterFactory, TIMESTAMP(NANOS) 
> pages) is a separate follow-up that depends on this issue.
> h3. Background
> * Logical types and parser: SPARK-56876, SPARK-56965
> * Physical / UnsafeRow layer: SPARK-56981 (merged, PR #56059)
> * SPIP composite value: epochMicros (long) + nanosWithinMicro (short, 0-999)
> * UnsafeRow stores each value as a 16-byte payload in its variable-length 
> region; column batches should use a fixed struct-like layout (see below), not 
> the UnsafeRow blob layout.
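
The composite value in the Background list can be illustrated with a small, self-contained sketch. It assumes, purely for illustration, that the input is a signed count of nanoseconds since the epoch; the class and method names are hypothetical and are not Spark APIs. Floor-based division is one way to keep nanosWithinMicro in the 0-999 range for pre-epoch instants.

```java
// Hypothetical helper, not part of Spark: splits an epoch-nanosecond count
// into the (epochMicros, nanosWithinMicro) pair described in the issue.
final class NanosSplitSketch {
    static final long NANOS_PER_MICRO = 1_000L;

    // Floor division rounds toward negative infinity, so the remainder
    // below stays non-negative even for instants before the epoch.
    static long epochMicros(long epochNanos) {
        return Math.floorDiv(epochNanos, NANOS_PER_MICRO);
    }

    // Always in [0, 999] because floorMod takes the sign of the divisor.
    static int nanosWithinMicro(long epochNanos) {
        return (int) Math.floorMod(epochNanos, NANOS_PER_MICRO);
    }

    // Inverse of the split; throws ArithmeticException on long overflow.
    static long toEpochNanos(long epochMicros, int nanosWithinMicro) {
        if (nanosWithinMicro < 0 || nanosWithinMicro > 999) {
            throw new IllegalArgumentException("nanosWithinMicro out of range: " + nanosWithinMicro);
        }
        return Math.addExact(Math.multiplyExact(epochMicros, NANOS_PER_MICRO), nanosWithinMicro);
    }
}
```

With this split, one nanosecond before the epoch is represented as epochMicros = -1 and nanosWithinMicro = 999, rather than as a negative sub-microsecond part.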
> h3. Recommended column layout
> Mirror CalendarInterval (multi-child column), not a single primitive column:
> || Child || Spark type || Field ||
> | 0 | LongType | epochMicros |
> | 1 | IntegerType | nanosWithinMicro (0-999) |
> NTZ and LTZ share the same physical column layout; SQL semantics stay on the 
> logical type (same pattern as row layer).
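
As a rough illustration of the two-child layout in the table above, the following sketch models a column as two parallel arrays plus a null mask. It is not Spark's WritableColumnVector; every class and method name here is hypothetical, and the value holder only stands in for TimestampNanosVal. The same structure would serve both NTZ and LTZ, since only the logical type differs.

```java
// Hypothetical value holder standing in for TimestampNanosVal.
final class NanosValueSketch {
    final long epochMicros;
    final int nanosWithinMicro;

    NanosValueSketch(long epochMicros, int nanosWithinMicro) {
        this.epochMicros = epochMicros;
        this.nanosWithinMicro = nanosWithinMicro;
    }
}

// Hypothetical fixed-capacity column with two children, mirroring the table:
// child 0 holds epochMicros (long), child 1 holds nanosWithinMicro (int).
final class TwoChildNanosColumnSketch {
    private final long[] epochMicros;     // child 0
    private final int[] nanosWithinMicro; // child 1
    private final boolean[] isNull;       // null mask shared by both children

    TwoChildNanosColumnSketch(int capacity) {
        this.epochMicros = new long[capacity];
        this.nanosWithinMicro = new int[capacity];
        this.isNull = new boolean[capacity];
    }

    // Writes both children for one row and clears the null flag.
    void put(int rowId, long micros, int nanos) {
        if (nanos < 0 || nanos > 999) {
            throw new IllegalArgumentException("nanosWithinMicro out of range: " + nanos);
        }
        epochMicros[rowId] = micros;
        nanosWithinMicro[rowId] = nanos;
        isNull[rowId] = false;
    }

    void putNull(int rowId) {
        isNull[rowId] = true;
    }

    boolean isNullAt(int rowId) {
        return isNull[rowId];
    }

    // Reassembles the composite value from the two children; null for null rows.
    NanosValueSketch get(int rowId) {
        if (isNull[rowId]) {
            return null;
        }
        return new NanosValueSketch(epochMicros[rowId], nanosWithinMicro[rowId]);
    }
}
```

The getter reads one entry from each child and combines them, which is the shape the issue proposes for the default getTimestampNTZNanos / getTimestampLTZNanos implementations.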
> h3. What to do
> *ColumnVector API (sql/catalyst)*
> * Implement default getTimestampNTZNanos / getTimestampLTZNanos on 
> ColumnVector using getChild(0).getLong + getChild(1).getInt (remove throw).
> * WritableColumnVector: allocate two child columns for TimestampNTZNanosType 
> / TimestampLTZNanosType in the constructor (like CalendarIntervalType).
> * Add putTimestampNanos (or putTimestampNTZNanos / LTZ) and append paths 
> writing both children.
> *On-heap / off-heap vectors (sql/core)*
> * OnHeapColumnVector / OffHeapColumnVector: read/write/append for nanos 
> columns.
> * ConstantColumnVector: set/get for constant nanos values.
> * MutableColumnarRow: ensure setters write through to WritableColumnVector 
> (getters already delegate).
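
To make the off-heap and constant cases above more concrete, here is a minimal sketch under stated assumptions. It uses direct ByteBuffers as a stand-in for off-heap memory, which is not necessarily how OffHeapColumnVector manages its storage, and it omits null tracking for brevity. All class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical off-heap style column: two fixed-width buffers, one per child.
final class DirectNanosColumnSketch {
    private final ByteBuffer micros; // 8 bytes per row (epochMicros)
    private final ByteBuffer nanos;  // 4 bytes per row (nanosWithinMicro)

    DirectNanosColumnSketch(int capacity) {
        this.micros = ByteBuffer
            .allocateDirect(Math.multiplyExact(capacity, Long.BYTES))
            .order(ByteOrder.nativeOrder());
        this.nanos = ByteBuffer
            .allocateDirect(Math.multiplyExact(capacity, Integer.BYTES))
            .order(ByteOrder.nativeOrder());
    }

    // Absolute writes at a computed byte offset; buffer positions are untouched.
    void put(int rowId, long epochMicros, int nanosWithinMicro) {
        micros.putLong(rowId * Long.BYTES, epochMicros);
        nanos.putInt(rowId * Integer.BYTES, nanosWithinMicro);
    }

    long getEpochMicros(int rowId) {
        return micros.getLong(rowId * Long.BYTES);
    }

    int getNanosWithinMicro(int rowId) {
        return nanos.getInt(rowId * Integer.BYTES);
    }
}

// Hypothetical constant column: one stored value answers every rowId.
final class ConstantNanosColumnSketch {
    private final long epochMicros;
    private final int nanosWithinMicro;

    ConstantNanosColumnSketch(long epochMicros, int nanosWithinMicro) {
        this.epochMicros = epochMicros;
        this.nanosWithinMicro = nanosWithinMicro;
    }

    long getEpochMicros(int rowId) {
        return epochMicros;
    }

    int getNanosWithinMicro(int rowId) {
        return nanosWithinMicro;
    }
}
```

Because each child is fixed-width, a row's position in either buffer is a plain multiplication, which is the property the issue contrasts with the UnsafeRow blob layout.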
> *Row <-> column bridges*
> * RowToColumnConverter (Columnar.scala): TimestampNanosConverter (like 
> CalendarConverter) using row.getTimestampNTZNanos / LTZ.
> * ColumnVectorUtils: populate and appendValue for 
> PhysicalTimestampNTZNanosType / PhysicalTimestampLTZNanosType.
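
The row-to-column bridge described above might look roughly like the following sketch. It does not use Spark's RowToColumnConverter, InternalRow, or ColumnVectorUtils; the row type, the growable column, and the converter are all hypothetical stand-ins meant only to show the append path writing both children and preserving nulls.

```java
import java.util.Arrays;

// Hypothetical row with a single nullable nanosecond-timestamp field.
final class NanosRowSketch {
    final boolean isNull;
    final long epochMicros;
    final int nanosWithinMicro;

    private NanosRowSketch(boolean isNull, long epochMicros, int nanosWithinMicro) {
        this.isNull = isNull;
        this.epochMicros = epochMicros;
        this.nanosWithinMicro = nanosWithinMicro;
    }

    static NanosRowSketch of(long epochMicros, int nanosWithinMicro) {
        return new NanosRowSketch(false, epochMicros, nanosWithinMicro);
    }

    static NanosRowSketch ofNull() {
        return new NanosRowSketch(true, 0L, 0);
    }
}

// Hypothetical append-only column: two children plus a null mask.
final class GrowableNanosColumnSketch {
    private long[] epochMicros = new long[4];
    private int[] nanosWithinMicro = new int[4];
    private boolean[] isNull = new boolean[4];
    private int size = 0;

    // Grows all three arrays together so the children stay aligned.
    private void reserve(int required) {
        if (required <= epochMicros.length) {
            return;
        }
        int newCapacity = Math.max(required, epochMicros.length * 2);
        epochMicros = Arrays.copyOf(epochMicros, newCapacity);
        nanosWithinMicro = Arrays.copyOf(nanosWithinMicro, newCapacity);
        isNull = Arrays.copyOf(isNull, newCapacity);
    }

    void appendValue(long micros, int nanos) {
        reserve(size + 1);
        epochMicros[size] = micros;
        nanosWithinMicro[size] = nanos;
        isNull[size] = false;
        size++;
    }

    void appendNull() {
        reserve(size + 1);
        isNull[size] = true;
        size++;
    }

    int size() {
        return size;
    }

    // Column-to-row direction: rebuilds a row from both children.
    NanosRowSketch rowAt(int rowId) {
        if (rowId < 0 || rowId >= size) {
            throw new IndexOutOfBoundsException("rowId " + rowId);
        }
        return isNull[rowId]
            ? NanosRowSketch.ofNull()
            : NanosRowSketch.of(epochMicros[rowId], nanosWithinMicro[rowId]);
    }
}

// Hypothetical converter playing the role of a TimestampNanosConverter.
final class RowToNanosColumnSketch {
    static GrowableNanosColumnSketch convert(NanosRowSketch[] rows) {
        GrowableNanosColumnSketch column = new GrowableNanosColumnSketch();
        for (NanosRowSketch row : rows) {
            if (row.isNull) {
                column.appendNull();
            } else {
                column.appendValue(row.epochMicros, row.nanosWithinMicro);
            }
        }
        return column;
    }
}
```

Converting an array of rows and then calling rowAt for each index gives a simple roundtrip check of the kind the Tests section asks for, including null rows and inputs large enough to trigger growth.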
> *Columnar surface stubs*
> * ColumnVector / ColumnarRow / ColumnarArray / ColumnarBatchRow: already 
> delegate to ColumnVector; verify once the base implementation is in place.
> * ColumnVector stubs that must keep throwing UnsupportedOperationException 
> until vectorized Parquet / columnar writers land may remain, provided they are 
> documented; this ticket focuses on read/get/put/append and the row roundtrip.
> *Codegen*
> * CodeGenerator already emits getTimestampNTZNanos / getTimestampLTZNanos for 
> columnar inputs; no change expected once ColumnVector implements getters.
> h3. Tests
> * Unit tests: write/read/append/null handling on OnHeapColumnVector (and 
> OffHeap if enabled in tests).
> * RowToColumnar -> ColumnarToRow -> UnsafeProjection roundtrip for NTZ and 
> LTZ nanos types (null and non-null).
> * Regression: microsecond TimestampType / TimestampNTZType column vectors 
> unchanged.
> h3. Acceptance criteria
> * ColumnarBatch can be built from InternalRow rows containing 
> TimestampNanosVal for nanos timestamp columns.
> * ColumnarBatch.rowIterator() + UnsafeProjection produces UnsafeRow values 
> equal to the source row for nanos columns.
> * getTimestampNTZNanos / getTimestampLTZNanos on column vectors return 
> correct TimestampNanosVal for batch rows.
> * RowToColumnConverter no longer throws unsupportedDataTypeError for 
> TimestampNTZNanosType / TimestampLTZNanosType.
> h3. Unblocks
> * Parquet vectorized read of TIMESTAMP(NANOS) into ColumnarBatch.
> * Vectorized scan performance for nanos columns; RowToColumnarExec / 
> ColumnarToRowExec in nanos pipelines.
> h3. References
> * Parent: SPARK-56822 (SPIP: Timestamps with nanosecond precision)
> * Precedent: CalendarInterval column layout in WritableColumnVector and 
> Columnar.scala
> * Physical value: org.apache.spark.unsafe.types.TimestampNanosVal



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
