[
https://issues.apache.org/jira/browse/SPARK-56159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xiaoxuan Li updated SPARK-56159:
--------------------------------
Description:
Add two new nanosecond-precision timestamp types to Spark SQL:
{{TimestampNSType}} (with local timezone) and {{TimestampNTZNSType}} (without
timezone), following the same pattern as {{TimestampNTZType}} (SPARK-35662).
Both types store epoch nanoseconds as a single {{Long}} (INT64). This directly
matches Parquet {{{}TIMESTAMP(NANOS){}}}, Arrow {{{}Timestamp(NANOSECOND){}}},
Iceberg V3 {{{}timestamp_ns{}}}, DuckDB {{{}TIMESTAMP_NS{}}}, and ClickHouse
{{DateTime64(9)}} -- enabling zero-conversion-overhead read/write. The INT64
representation fits in UnsafeRow's 8-byte fixed-width slot, requiring no
changes to Tungsten memory layout or CodeGen.
The representable range is ~1677-2262 (same as the Parquet INT64 nanos
specification). Users needing wider range need to use existing microsecond
types (0001-9999).
We also support {{TIMESTAMP(p)}} parameterized SQL syntax: p=0..6 maps to
existing microsecond types, p=7..9 maps to the new nanosecond types.
h3. *Milestone 1 – Core type system (TimestampNSType / TimestampNTZNSType meets
or exceeds all function of the existing TimestampType / TimestampNTZType):*
* Add new DataType implementations for TimestampNSType and TimestampNTZNSType
* Ability to read/write Parquet files with TIMESTAMP(NANOS, true/false) columns
* TimestampNSType / TimestampNTZNSType literals
* TimestampNSType arithmetic (e.g. TimestampNSType - TimestampNSType,
TimestampNSType - Date)
* Datetime functions/operators: dayofweek, weekofyear, year, etc
* Cast to and from TimestampNSType / TimestampNTZNSType (Long, String,
TimestampType, TimestampNTZType, DateType), with the SQL syntax to specify the
types
* Support sorting and hashing TimestampNSType / TimestampNTZNSType
* Type coercion rules for mixed-precision expressions
* {{timestamp_nanos()}} and {{unix_nanos()}} functions
* Configuration: {{spark.sql.parquet.inferTimestampNS.enabled}}
h3. *Milestone 2 – Persistence:*
* Support {{TIMESTAMP(p)}} parameterized SQL syntax (p=0..6 -> micros, p=7..9
-> nanos)
* Ability to read/write ORC, CSV, JSON, Avro files with nanosecond timestamps
* Arrow type mapping: {{Timestamp(NANOSECOND)}} <-> TimestampNSType /
TimestampNTZNSType
* INSERT, SELECT, UPDATE, MERGE
* Precision enforcement for {{TIMESTAMP(p)}} using CHAR/VARCHAR metadata
pattern
* Discovery (schema inference)
h3. *Milestone 3 – Client support*
* JDBC support
* Hive Thrift server
* Spark Connect protocol support
h3. *Milestone 4 – PySpark integration*
* Python UDF can take and return TimestampNSType / TimestampNTZNSType
* PySpark {{toPandas()}} / {{createDataFrame()}} nanosecond support
* Pandas UDF support
* DataFrame support
* Dataset/Encoder support
was:
Add two new nanosecond-precision timestamp types to Spark SQL:
{{TimestampNSType}} (with local timezone) and {{TimestampNTZNSType}} (without
timezone), following the same pattern as {{TimestampNTZType}} (SPARK-35662).
Both types store epoch nanoseconds as a single {{Long}} (INT64). This directly
matches Parquet {{{}TIMESTAMP(NANOS){}}}, Arrow {{{}Timestamp(NANOSECOND){}}},
Iceberg V3 {{{}timestamp_ns{}}}, DuckDB {{{}TIMESTAMP_NS{}}}, and ClickHouse
{{DateTime64(9)}} -- enabling zero-conversion-overhead read/write. The INT64
representation fits in UnsafeRow's 8-byte fixed-width slot, requiring no
changes to Tungsten memory layout or CodeGen.
The representable range is ~1677-2262 (same as the Parquet INT64 nanos
specification). Users needing wider range need to use existing microsecond
types (0001-9999).
We also support {{TIMESTAMP(p)}} parameterized SQL syntax: p=0..6 maps to
existing microsecond types, p=7..9 maps to the new nanosecond types.
h3. *Milestone 1 – Core type system (TimestampNSType / TimestampNTZNSType meets
or exceeds all function of the existing TimestampType / TimestampNTZType):*
* Add new DataType implementations for TimestampNSType and TimestampNTZNSType
*
* TimestampNSType / TimestampNTZNSType literals
* TimestampNSType arithmetic (e.g. TimestampNSType - TimestampNSType,
TimestampNSType - Date)
* Datetime functions/operators: dayofweek, weekofyear, year, etc
* Cast to and from TimestampNSType / TimestampNTZNSType (Long, String,
TimestampType, TimestampNTZType, DateType), with the SQL syntax to specify the
types
* Support sorting and hashing TimestampNSType / TimestampNTZNSType
* Type coercion rules for mixed-precision expressions
* {{timestamp_nanos()}} and {{unix_nanos()}} functions
* Configuration: {{spark.sql.parquet.inferTimestampNS.enabled}}
h3. *Milestone 2 – Persistence:*
* Support {{TIMESTAMP(p)}} parameterized SQL syntax (p=0..6 -> micros, p=7..9
-> nanos)
* Ability to read/write ORC, CSV, JSON, Avro files with nanosecond timestamps
* Arrow type mapping: {{Timestamp(NANOSECOND)}} <-> TimestampNSType /
TimestampNTZNSType
* INSERT, SELECT, UPDATE, MERGE
* Precision enforcement for {{TIMESTAMP(p)}} using CHAR/VARCHAR metadata
pattern
* Discovery (schema inference)
h3. *Milestone 3 – Client support*
* JDBC support
* Hive Thrift server
* Spark Connect protocol support
h3. *Milestone 4 – PySpark integration*
* Python UDF can take and return TimestampNSType / TimestampNTZNSType
* PySpark {{toPandas()}} / {{createDataFrame()}} nanosecond support
* Pandas UDF support
* DataFrame support
* Dataset/Encoder support
> Support Nano Second Timestamp Data Types
> ----------------------------------------
>
> Key: SPARK-56159
> URL: https://issues.apache.org/jira/browse/SPARK-56159
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.2.0
> Reporter: Xiaoxuan Li
> Priority: Major
>
> Add two new nanosecond-precision timestamp types to Spark SQL:
> {{TimestampNSType}} (with local timezone) and {{TimestampNTZNSType}} (without
> timezone), following the same pattern as {{TimestampNTZType}} (SPARK-35662).
> Both types store epoch nanoseconds as a single {{Long}} (INT64). This
> directly matches Parquet {{{}TIMESTAMP(NANOS){}}}, Arrow
> {{{}Timestamp(NANOSECOND){}}}, Iceberg V3 {{{}timestamp_ns{}}}, DuckDB
> {{{}TIMESTAMP_NS{}}}, and ClickHouse {{DateTime64(9)}} -- enabling
> zero-conversion-overhead read/write. The INT64 representation fits in
> UnsafeRow's 8-byte fixed-width slot, requiring no changes to Tungsten memory
> layout or CodeGen.
> The representable range is ~1677-2262 (same as the Parquet INT64 nanos
> specification). Users needing wider range need to use existing microsecond
> types (0001-9999).
> We also support {{TIMESTAMP(p)}} parameterized SQL syntax: p=0..6 maps to
> existing microsecond types, p=7..9 maps to the new nanosecond types.
>
> h3. *Milestone 1 – Core type system (TimestampNSType / TimestampNTZNSType
> meets or exceeds all function of the existing TimestampType /
> TimestampNTZType):*
> * Add new DataType implementations for TimestampNSType and TimestampNTZNSType
> * Ability to read/write Parquet files with TIMESTAMP(NANOS, true/false)
> columns
> * TimestampNSType / TimestampNTZNSType literals
> * TimestampNSType arithmetic (e.g. TimestampNSType - TimestampNSType,
> TimestampNSType - Date)
> * Datetime functions/operators: dayofweek, weekofyear, year, etc
> * Cast to and from TimestampNSType / TimestampNTZNSType (Long, String,
> TimestampType, TimestampNTZType, DateType), with the SQL syntax to specify
> the types
> * Support sorting and hashing TimestampNSType / TimestampNTZNSType
> * Type coercion rules for mixed-precision expressions
> * {{timestamp_nanos()}} and {{unix_nanos()}} functions
> * Configuration: {{spark.sql.parquet.inferTimestampNS.enabled}}
> h3. *Milestone 2 – Persistence:*
> * Support {{TIMESTAMP(p)}} parameterized SQL syntax (p=0..6 -> micros,
> p=7..9 -> nanos)
> * Ability to read/write ORC, CSV, JSON, Avro files with nanosecond timestamps
> * Arrow type mapping: {{Timestamp(NANOSECOND)}} <-> TimestampNSType /
> TimestampNTZNSType
> * INSERT, SELECT, UPDATE, MERGE
> * Precision enforcement for {{TIMESTAMP(p)}} using CHAR/VARCHAR metadata
> pattern
> * Discovery (schema inference)
> h3. *Milestone 3 – Client support*
> * JDBC support
> * Hive Thrift server
> * Spark Connect protocol support
> h3. *Milestone 4 – PySpark integration*
> * Python UDF can take and return TimestampNSType / TimestampNTZNSType
> * PySpark {{toPandas()}} / {{createDataFrame()}} nanosecond support
> * Pandas UDF support
> * DataFrame support
> * Dataset/Encoder support
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]