[ https://issues.apache.org/jira/browse/SPARK-57100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SPARK-57100:
-----------------------------------
Labels: pull-request-available (was: )
> Add columnar (ColumnVector) support for nanosecond timestamp types
> ------------------------------------------------------------------
>
> Key: SPARK-57100
> URL: https://issues.apache.org/jira/browse/SPARK-57100
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.3.0
> Reporter: Max Gekk
> Assignee: Max Gekk
> Priority: Major
> Labels: pull-request-available
>
> h3. Summary
> SPARK-56981 added physical row storage for TimestampNTZNanosType(p) and
> TimestampLTZNanosType(p) (p in [7, 9]) via TimestampNanosVal and UnsafeRow.
> Columnar execution still cannot hold or move these values:
> ColumnVector.getTimestampNTZNanos / getTimestampLTZNanos throw
> SparkUnsupportedOperationException, and RowToColumnConverter /
> ColumnVectorUtils have no support for either type.
> This issue implements the columnar layer so ColumnarBatch can store
> nanosecond timestamps and interoperate with InternalRow / UnsafeRow
> (ColumnarToRow, RowToColumnar, whole-stage codegen paths that read column
> vectors).
> Parquet vectorized decode (ParquetVectorUpdaterFactory, TIMESTAMP(NANOS)
> pages) is a separate follow-up that depends on this issue.
> h3. Background
> * Logical types and parser: SPARK-56876, SPARK-56965
> * Physical / UnsafeRow layer: SPARK-56981 (merged, PR #56059)
> * SPIP composite value: epochMicros (long) + nanosWithinMicro (short, 0-999)
> * UnsafeRow uses a 16-byte variable-length payload; column batches should use
> a fixed struct-like layout (see below), not the UnsafeRow blob layout.
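As a rough illustration of the composite value described in the Background list, the sketch below splits a java.time.Instant into epochMicros and nanosWithinMicro and joins them back. The class and method names (NanosSplitSketch, split, join) are hypothetical and are not part of Spark; this only shows the arithmetic under the stated 0-999 range.

```java
import java.time.Instant;

/** Hypothetical helper for illustration only; not part of Spark. */
final class NanosSplitSketch {
    /** Returns {epochMicros, nanosWithinMicro} for the given instant. */
    static long[] split(Instant ts) {
        // Instant.getNano() is always in [0, 999_999_999], also before the epoch,
        // so plain division and remainder give non-negative parts.
        long epochMicros = Math.addExact(
            Math.multiplyExact(ts.getEpochSecond(), 1_000_000L),
            ts.getNano() / 1_000);
        long nanosWithinMicro = ts.getNano() % 1_000;
        return new long[] {epochMicros, nanosWithinMicro};
    }

    /** Rebuilds the instant from the two fields. */
    static Instant join(long epochMicros, int nanosWithinMicro) {
        // floorDiv / floorMod keep the sub-second part non-negative for pre-epoch values.
        long seconds = Math.floorDiv(epochMicros, 1_000_000L);
        long microsOfSecond = Math.floorMod(epochMicros, 1_000_000L);
        return Instant.ofEpochSecond(seconds, microsOfSecond * 1_000 + nanosWithinMicro);
    }
}
```

Because the remainder is taken from the non-negative nano-of-second field, nanosWithinMicro stays in 0-999 for timestamps before 1970 as well.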
> h3. Recommended column layout
> Mirror CalendarInterval (multi-child column), not a single primitive column:
> || Child || Spark type || Field ||
> | 0 | LongType | epochMicros |
> | 1 | IntegerType | nanosWithinMicro (0-999) |
> NTZ and LTZ share the same physical column layout; SQL semantics stay on the
> logical type (the same pattern as the row layer).
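The two-child layout in the table above can be pictured with plain arrays. This is a minimal sketch, assuming a parent-level null flag per row; NanosColumnSketch and its methods are hypothetical names for illustration and do not correspond to Spark's ColumnVector classes.

```java
/** Hypothetical stand-in for a two-child nanosecond timestamp column; not Spark's API. */
final class NanosColumnSketch {
    private final long[] epochMicros;     // child 0 (LongType in the table)
    private final int[] nanosWithinMicro; // child 1 (IntegerType in the table), 0-999
    private final boolean[] isNull;       // parent-level null flags (an assumption)

    NanosColumnSketch(int capacity) {
        epochMicros = new long[capacity];
        nanosWithinMicro = new int[capacity];
        isNull = new boolean[capacity];
    }

    /** Writes both children for one row and clears its null flag. */
    void put(int rowId, long micros, int nanos) {
        if (nanos < 0 || nanos > 999) {
            throw new IllegalArgumentException("nanosWithinMicro out of range: " + nanos);
        }
        epochMicros[rowId] = micros;
        nanosWithinMicro[rowId] = nanos;
        isNull[rowId] = false;
    }

    void putNull(int rowId) { isNull[rowId] = true; }

    boolean isNullAt(int rowId) { return isNull[rowId]; }

    long getEpochMicros(int rowId) { return epochMicros[rowId]; }

    int getNanosWithinMicro(int rowId) { return nanosWithinMicro[rowId]; }
}
```

Since NTZ and LTZ share the layout, one such structure would serve both; only the logical type attached to the column would differ.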
> h3. What to do
> *ColumnVector API (sql/catalyst)*
> * Implement default getTimestampNTZNanos / getTimestampLTZNanos on
> ColumnVector using getChild(0).getLong and getChild(1).getInt (remove the throw).
> * WritableColumnVector: allocate two child columns for TimestampNTZNanosType
> / TimestampLTZNanosType in the constructor (like CalendarIntervalType).
> * Add putTimestampNanos (or putTimestampNTZNanos / LTZ) and append paths
> writing both children.
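The append path mentioned in the list above has to advance both children together. The following is a simplified sketch under the assumption that a null append also advances each child with a placeholder, so child indexes stay aligned with the parent row id; AppendableNanosColumnSketch and its method names are hypothetical, not WritableColumnVector's actual API.

```java
import java.util.Arrays;

/** Hypothetical growable two-child column; illustrates the append path only. */
final class AppendableNanosColumnSketch {
    private long[] epochMicros = new long[4];     // child 0
    private int[] nanosWithinMicro = new int[4];  // child 1
    private boolean[] isNull = new boolean[4];    // parent-level null flags
    private int size = 0;

    /** Grows all three arrays together so they always have the same capacity. */
    private void reserve(int required) {
        if (required <= epochMicros.length) return;
        int newCapacity = Math.max(required, epochMicros.length * 2);
        epochMicros = Arrays.copyOf(epochMicros, newCapacity);
        nanosWithinMicro = Arrays.copyOf(nanosWithinMicro, newCapacity);
        isNull = Arrays.copyOf(isNull, newCapacity);
    }

    /** Appends one value to both children and returns its row id. */
    int appendTimestampNanos(long micros, int nanos) {
        reserve(size + 1);
        epochMicros[size] = micros;
        nanosWithinMicro[size] = nanos;
        return size++;
    }

    /** Appends a null; the children keep zero placeholders at this row id. */
    int appendNull() {
        reserve(size + 1);
        isNull[size] = true;
        return size++;
    }

    int size() { return size; }
    boolean isNullAt(int rowId) { return isNull[rowId]; }
    long epochMicrosAt(int rowId) { return epochMicros[rowId]; }
    int nanosAt(int rowId) { return nanosWithinMicro[rowId]; }
}
```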
> *On-heap / off-heap vectors (sql/core)*
> * OnHeapColumnVector / OffHeapColumnVector: read/write/append for nanos
> columns.
> * ConstantColumnVector: set/get for constant nanos values.
> * MutableColumnarRow: ensure setters write through to WritableColumnVector
> (getters already delegate).
> *Row <-> column bridges*
> * RowToColumnConverter (Columnar.scala): TimestampNanosConverter (like
> CalendarConverter) using row.getTimestampNTZNanos / LTZ.
> * ColumnVectorUtils: populate and appendValue for
> PhysicalTimestampNTZNanosType / PhysicalTimestampLTZNanosType.
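The row-to-column bridge and its inverse can be sketched without Spark as a pair of loops over nullable values. All names here (RowToColumnSketch, NanosVal, Column, toColumn, toRows) are hypothetical stand-ins, not the RowToColumnConverter or TimestampNanosVal classes themselves; the sketch only shows the roundtrip shape the tests below in this ticket ask for.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical row <-> column roundtrip for one nanos timestamp column. */
final class RowToColumnSketch {
    /** Hypothetical row value: epochMicros plus nanosWithinMicro (0-999). */
    static final class NanosVal {
        final long epochMicros;
        final int nanosWithinMicro;

        NanosVal(long epochMicros, int nanosWithinMicro) {
            this.epochMicros = epochMicros;
            this.nanosWithinMicro = nanosWithinMicro;
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof NanosVal)) return false;
            NanosVal other = (NanosVal) o;
            return epochMicros == other.epochMicros
                && nanosWithinMicro == other.nanosWithinMicro;
        }

        @Override public int hashCode() {
            return Long.hashCode(epochMicros) * 31 + nanosWithinMicro;
        }
    }

    /** Hypothetical column: two children plus parent-level null flags. */
    static final class Column {
        final List<Long> micros = new ArrayList<>();
        final List<Integer> nanos = new ArrayList<>();
        final List<Boolean> nulls = new ArrayList<>();
    }

    /** Appends each row value to both children; nulls get zero placeholders. */
    static Column toColumn(List<NanosVal> rows) {
        Column c = new Column();
        for (NanosVal v : rows) {
            boolean isNull = (v == null);
            c.nulls.add(isNull);
            c.micros.add(isNull ? 0L : v.epochMicros);
            c.nanos.add(isNull ? 0 : v.nanosWithinMicro);
        }
        return c;
    }

    /** Reads the column back into row values, restoring nulls. */
    static List<NanosVal> toRows(Column c) {
        List<NanosVal> rows = new ArrayList<>();
        for (int i = 0; i < c.nulls.size(); i++) {
            rows.add(c.nulls.get(i) ? null : new NanosVal(c.micros.get(i), c.nanos.get(i)));
        }
        return rows;
    }
}
```

A roundtrip check then reduces to comparing toRows(toColumn(rows)) with the original list, covering both null and non-null entries.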
> *Columnar surface stubs*
> * ColumnVector / ColumnarRow / ColumnarArray / ColumnarBatchRow: already
> delegate to ColumnVector; verify after base implementation.
> * ColumnVector stubs that must keep throwing UnsupportedOperationException
> until vectorized Parquet/columnar writers land may remain, as long as they
> are documented; this ticket focuses on read/get/put/append and row roundtrip.
> *Codegen*
> * CodeGenerator already emits getTimestampNTZNanos / getTimestampLTZNanos for
> columnar inputs; no change expected once ColumnVector implements getters.
> h3. Tests
> * Unit tests: write/read/append/null handling on OnHeapColumnVector (and
> OffHeap if enabled in tests).
> * RowToColumnar -> ColumnarToRow -> UnsafeProjection roundtrip for NTZ and
> LTZ nanos types (null and non-null).
> * Regression: microsecond TimestampType / TimestampNTZType column vectors
> unchanged.
> h3. Acceptance criteria
> * ColumnarBatch can be built from InternalRow rows containing
> TimestampNanosVal for nanos timestamp columns.
> * ColumnarBatch.rowIterator() + UnsafeProjection produces UnsafeRow values
> equal to the source row for nanos columns.
> * getTimestampNTZNanos / getTimestampLTZNanos on column vectors return
> correct TimestampNanosVal for batch rows.
> * RowToColumnConverter no longer throws unsupportedDataTypeError for
> TimestampNTZNanosType / TimestampLTZNanosType.
> h3. Unblocks
> * Parquet vectorized read of TIMESTAMP(NANOS) into ColumnarBatch.
> * Vectorized scan performance for nanos columns; RowToColumnarExec /
> ColumnarToRowExec in nanos pipelines.
> h3. References
> * Parent: SPARK-56822 (SPIP: Timestamps with nanosecond precision)
> * Precedent: CalendarInterval column layout in WritableColumnVector and
> Columnar.scala
> * Physical value: org.apache.spark.unsafe.types.TimestampNanosVal