[ 
https://issues.apache.org/jira/browse/SPARK-56159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoxuan Li updated SPARK-56159:
--------------------------------
    Description: 
Add two new nanosecond-precision timestamp types to Spark SQL: 
TimestampNanosType (with local timezone) and TimestampNTZNanosType (without 
timezone), following the same pattern as {{TimestampNTZType}} (SPARK-35662).

Both types store epoch nanoseconds as a single {{Long}} (INT64). This directly 
matches Parquet {{TIMESTAMP(NANOS)}}, Arrow {{Timestamp(NANOSECOND)}}, 
Iceberg V3 {{timestamp_ns}}, DuckDB {{TIMESTAMP_NS}}, and ClickHouse 
{{DateTime64(9)}} -- enabling zero-conversion-overhead read/write. The INT64 
representation fits in UnsafeRow's 8-byte fixed-width slot, requiring no 
changes to Tungsten memory layout or CodeGen.

The representable range is approximately 1677 to 2262 (the same as the Parquet 
INT64 nanos specification). Users who need a wider range should keep using the 
existing microsecond types (0001-9999).
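
The 1677-2262 bounds follow directly from storing epoch nanoseconds in a signed 
64-bit integer. The sketch below is plain Python, not Spark code, and the helper 
name epoch_nanos_to_utc is made up for illustration. It converts the smallest 
and largest INT64 values to UTC dates; Python's datetime only resolves 
microseconds, so the value is floored before conversion.

```python
from datetime import datetime, timedelta, timezone

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_nanos_to_utc(nanos: int) -> datetime:
    """Convert epoch nanoseconds to a UTC datetime, floored to microseconds."""
    if not INT64_MIN <= nanos <= INT64_MAX:
        raise OverflowError("value does not fit in a signed 64-bit integer")
    # datetime has microsecond resolution, so the last three digits are dropped.
    return EPOCH + timedelta(microseconds=nanos // 1000)


earliest = epoch_nanos_to_utc(INT64_MIN)
latest = epoch_nanos_to_utc(INT64_MAX)
print(earliest.date(), latest.date())  # 1677-09-21 2262-04-11
```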

We also support {{TIMESTAMP(p)}} parameterized SQL syntax: p=0..6 maps to 
existing microsecond types, p=7..9 maps to the new nanosecond types.
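
As a rough illustration of the precision mapping, the sketch below resolves a 
declared precision to a type name. The function resolve_timestamp_type and its 
parameters are hypothetical and do not reflect the actual parser change. 
Rejecting p outside 0..9 is an assumption, since the description only defines 
that range.

```python
def resolve_timestamp_type(precision: int, with_local_time_zone: bool = True) -> str:
    """Map TIMESTAMP(p) to the name of the type that would back it."""
    if not 0 <= precision <= 9:
        # Assumption: precisions outside 0..9 are rejected.
        raise ValueError(f"unsupported timestamp precision: {precision}")
    if precision <= 6:
        # p=0..6 fits in the existing microsecond types.
        return "TimestampType" if with_local_time_zone else "TimestampNTZType"
    # p=7..9 needs the new nanosecond types.
    return "TimestampNanosType" if with_local_time_zone else "TimestampNTZNanosType"


print(resolve_timestamp_type(6))         # TimestampType
print(resolve_timestamp_type(9, False))  # TimestampNTZNanosType
```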

 
h3. *Milestone 1 – Core type system:*
 * Add new DataType implementations for TimestampNanosType and 
TimestampNTZNanosType
 * Ability to create tables of type TimestampNanosType / TimestampNTZNanosType
 * Ability to read/write Parquet files with TIMESTAMP(NANOS, true/false) columns
 * TimestampNanosType / TimestampNTZNanosType literals
 * TimestampNanosType arithmetic (e.g. TimestampNanosType - TimestampNanosType, 
TimestampNanosType - Date)
 * Datetime functions/operators: dayofweek, weekofyear, year, etc.
 * Cast to and from TimestampNanosType / TimestampNTZNanosType (Long, String, 
TimestampType, TimestampNTZType, DateType), with the SQL syntax to specify the 
types
 * Support sorting and hashing TimestampNanosType / TimestampNTZNanosType
 * Type coercion rules for mixed-precision expressions
 * {{timestamp_nanos()}} and {{unix_nanos()}} functions
 * Configuration: {{spark.sql.parquet.inferTimestampNanos.enabled}}
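
Casting between the microsecond and nanosecond types has two edge cases: 
widening can overflow INT64 because the microsecond types cover 0001-9999, and 
narrowing discards sub-microsecond digits. The sketch below is plain Python 
with hypothetical helper names. It assumes that overflow raises an error and 
that narrowing floors toward negative infinity; the description does not 
specify either behavior.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
NANOS_PER_MICRO = 1000


def micros_to_nanos(micros: int) -> int:
    """Widen epoch microseconds to epoch nanoseconds, failing on INT64 overflow."""
    nanos = micros * NANOS_PER_MICRO
    if not INT64_MIN <= nanos <= INT64_MAX:
        raise OverflowError("timestamp is outside the nanosecond range")
    return nanos


def nanos_to_micros(nanos: int) -> int:
    """Narrow epoch nanoseconds to epoch microseconds (floor division)."""
    # Floor division keeps pre-1970 values ordered: -1 ns becomes -1 us, not 0.
    return nanos // NANOS_PER_MICRO


print(micros_to_nanos(1_700_000_000_000_000))  # 1700000000000000000
print(nanos_to_micros(-1))                     # -1
```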

h3. *Milestone 2 – Persistence:*
 * Support {{TIMESTAMP(p)}} parameterized SQL syntax (p=0..6 -> micros, p=7..9 
-> nanos)
 * Ability to read/write ORC, CSV, JSON, Avro files with nanosecond timestamps
 * Arrow type mapping: {{Timestamp(NANOSECOND)}} <-> TimestampNanosType / 
TimestampNTZNanosType
 * INSERT, SELECT, UPDATE, MERGE
 * Precision enforcement for {{TIMESTAMP(p)}} using CHAR/VARCHAR metadata 
pattern
 * Discovery (schema inference)
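
One way to picture precision enforcement for p=7 and p=8: the value is still 
stored as epoch nanoseconds, while the declared precision lives in metadata and 
is applied on write. The sketch below uses a hypothetical helper and assumes 
that enforcement truncates extra digits. The description does not say whether 
the real behavior is to truncate, round, or reject.

```python
def truncate_to_precision(nanos: int, precision: int) -> int:
    """Drop fractional-second digits beyond the declared precision (7..9)."""
    if not 7 <= precision <= 9:
        raise ValueError("nanosecond types cover precisions 7 through 9")
    step = 10 ** (9 - precision)  # 100 ns for p=7, 10 ns for p=8, 1 ns for p=9
    # With a positive step, Python's % is non-negative, so this floors.
    return nanos - nanos % step


print(truncate_to_precision(123_456_789, 7))  # 123456700
print(truncate_to_precision(123_456_789, 8))  # 123456780
```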

h3. *Milestone 3 – Client support*
 * JDBC support
 * Hive Thrift server
 * Spark Connect protocol support

h3. *Milestone 4 – PySpark integration*
 * Python UDF can take and return TimestampNanosType / TimestampNTZNanosType
 * PySpark {{toPandas()}} / {{createDataFrame()}} nanosecond support
 * Pandas UDF support
 * DataFrame support
 * Dataset/Encoder support
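
Part of the Python-side work comes from the standard library: datetime.datetime 
has microsecond resolution, so a nanosecond value cannot round-trip through it 
unchanged. The sketch below is not PySpark code and its helper names are 
hypothetical. It carries the sub-microsecond remainder next to the datetime to 
show what a lossless conversion has to preserve.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def split_nanos(nanos: int) -> tuple[datetime, int]:
    """Split epoch nanoseconds into a UTC datetime and leftover nanoseconds."""
    micros, remainder = divmod(nanos, 1000)  # remainder is always 0..999
    return EPOCH + timedelta(microseconds=micros), remainder


def join_nanos(moment: datetime, remainder: int) -> int:
    """Rebuild epoch nanoseconds from the pair returned by split_nanos."""
    delta = moment - EPOCH
    # Use integer fields rather than total_seconds() to avoid float rounding.
    micros = (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds
    return micros * 1000 + remainder


value = 1_700_000_000_123_456_789
moment, remainder = split_nanos(value)
print(remainder)                              # 789
print(join_nanos(moment, remainder) == value)  # True
```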

 

  was:
Add two new nanosecond-precision timestamp types to Spark SQL: 
{{TimestampNSType}} (with local timezone) and {{TimestampNTZNSType}} (without 
timezone), following the same pattern as {{TimestampNTZType}} (SPARK-35662).

Both types store epoch nanoseconds as a single {{Long}} (INT64). This directly 
matches Parquet {{TIMESTAMP(NANOS)}}, Arrow {{Timestamp(NANOSECOND)}}, 
Iceberg V3 {{timestamp_ns}}, DuckDB {{TIMESTAMP_NS}}, and ClickHouse 
{{DateTime64(9)}} -- enabling zero-conversion-overhead read/write. The INT64 
representation fits in UnsafeRow's 8-byte fixed-width slot, requiring no 
changes to Tungsten memory layout or CodeGen.

The representable range is ~1677-2262 (same as the Parquet INT64 nanos 
specification). Users needing wider range need to use existing microsecond 
types (0001-9999).

We also support {{TIMESTAMP(p)}} parameterized SQL syntax: p=0..6 maps to 
existing microsecond types, p=7..9 maps to the new nanosecond types.

 
h3. *Milestone 1 – Core type system (TimestampNSType / TimestampNTZNSType meets 
or exceeds all function of the existing TimestampType / TimestampNTZType):*
 * Add new DataType implementations for TimestampNSType and TimestampNTZNSType
 * Ability to read/write Parquet files with TIMESTAMP(NANOS, true/false) columns
 * TimestampNSType / TimestampNTZNSType literals
 * TimestampNSType arithmetic (e.g. TimestampNSType - TimestampNSType, 
TimestampNSType - Date)
 * Datetime functions/operators: dayofweek, weekofyear, year, etc
 * Cast to and from TimestampNSType / TimestampNTZNSType (Long, String, 
TimestampType, TimestampNTZType, DateType), with the SQL syntax to specify the 
types
 * Support sorting and hashing TimestampNSType / TimestampNTZNSType
 * Type coercion rules for mixed-precision expressions
 * {{timestamp_nanos()}} and {{unix_nanos()}} functions
 * Configuration: {{spark.sql.parquet.inferTimestampNS.enabled}}

h3. *Milestone 2 – Persistence:*
 * Support {{TIMESTAMP(p)}} parameterized SQL syntax (p=0..6 -> micros, p=7..9 
-> nanos)
 * Ability to read/write ORC, CSV, JSON, Avro files with nanosecond timestamps
 * Arrow type mapping: {{Timestamp(NANOSECOND)}} <-> TimestampNSType / 
TimestampNTZNSType
 * INSERT, SELECT, UPDATE, MERGE
 * Precision enforcement for {{TIMESTAMP(p)}} using CHAR/VARCHAR metadata 
pattern
 * Discovery (schema inference)

h3. *Milestone 3 – Client support*
 * JDBC support
 * Hive Thrift server
 * Spark Connect protocol support

h3. *Milestone 4 – PySpark integration*
 * Python UDF can take and return TimestampNSType / TimestampNTZNSType
 * PySpark {{toPandas()}} / {{createDataFrame()}} nanosecond support
 * Pandas UDF support
 * DataFrame support
 * Dataset/Encoder support

 


> Support Nano Second Timestamp Data Types
> ----------------------------------------
>
>                 Key: SPARK-56159
>                 URL: https://issues.apache.org/jira/browse/SPARK-56159
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.2.0
>            Reporter: Xiaoxuan Li
>            Priority: Major


