[
https://issues.apache.org/jira/browse/SPARK-51162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Max Gekk updated SPARK-51162:
-----------------------------
Description:
*Q1. What are you trying to do? Articulate your objectives using absolutely no
jargon.*
Add a new data type *TIME* to Spark SQL which represents a time-of-day value with
hour, minute, and second fields, with precision up to microseconds. All operations
over the type are performed without taking any time zone into account. The new
data type should conform to the type *TIME\(n\) WITHOUT TIME ZONE* defined by the
SQL standard.
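To illustrate the intended semantics (this is plain Python, not Spark code), the standard library's {{datetime.time}} models the same kind of wall-clock value: hour, minute, second, and microsecond fields with no time zone attached.

```python
from datetime import time

# A wall-clock time of day: fields hour, minute, second, microsecond.
t = time(23, 59, 59, 999999)
assert (t.hour, t.minute, t.second, t.microsecond) == (23, 59, 59, 999999)

# No time zone is involved: a naive time carries tzinfo=None,
# mirroring TIME(n) WITHOUT TIME ZONE semantics.
assert t.tzinfo is None
```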
*Q3. How is it done today, and what are the limits of current practice?*
The TIME type can be emulated via the TIMESTAMP_NTZ data type by setting the
date part to some constant value like 1970-01-01, 0001-01-01 or 0000-00-00
(though the latter is outside the supported range of dates).
Although the type can be emulated via TIMESTAMP_NTZ, Spark SQL cannot
recognize it in data sources, and for instance cannot load TIME values from
Parquet files.
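The emulation above can be sketched in plain Python (not Spark code): embed the time-of-day into a timestamp-without-time-zone by pinning the date part to a constant epoch date.

```python
from datetime import datetime, time

# Emulate TIME via a timestamp-without-time-zone by pinning the
# date part to a constant value (1970-01-01), as described above.
EPOCH_DATE = datetime(1970, 1, 1).date()

def time_as_ntz(t: time) -> datetime:
    """Embed a time-of-day value into a TIMESTAMP_NTZ-like value."""
    return datetime.combine(EPOCH_DATE, t)

ntz = time_as_ntz(time(12, 30, 45))
assert ntz == datetime(1970, 1, 1, 12, 30, 45)
# Round-trip back to the time-of-day part:
assert ntz.time() == time(12, 30, 45)
```

The limitation noted above is that nothing in the stored value marks it as a time rather than a timestamp, so readers of the data cannot recognize the intent.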
*Q4. What is new in your approach and why do you think it will be successful?*
The approach is not new, and we have a clear picture of how to split the work
into sub-tasks based on our experience of adding the ANSI interval and
TIMESTAMP_NTZ types.
*Q6. What are the risks?*
Handling the new type in additional operators, expressions and data sources can
cause performance regressions. This risk can be mitigated by developing TIME
benchmarks in parallel with supporting the new type in different places in Spark
SQL.
*Q7. How long will it take?*
In total it might take around *9 months*. The estimate is based on similar
tasks: ANSI intervals
([SPARK-27790|https://issues.apache.org/jira/browse/SPARK-27790]) and
TIMESTAMP_NTZ
([SPARK-35662|https://issues.apache.org/jira/browse/SPARK-35662]). We can split
the work into functional blocks:
# Base functionality - *3 weeks*
Add a new type TimeType, forming/parsing time literals, a type constructor, and
external types.
# Persistence - *3.5 months*
Ability to create tables of the type TIME, read/write from/to Parquet and other
built-in data sources, partitioning, stats, predicate push down.
# Time operators - *2 months*
Arithmetic ops, field extraction, sorting, and aggregations.
# Client support - *1 month*
JDBC, Hive, Thrift server, Spark Connect
# PySpark integration - *1 month*
DataFrame support, pandas API, Python UDFs, Arrow column vectors
# Docs + testing/benchmarking - *1 month*
was:
*Q1. What are you trying to do? Articulate your objectives using absolutely no
jargon.*
Add a new data type *TIME* to Spark SQL which represents a time-of-day value with
hour, minute, and second fields, with precision up to microseconds. All operations
over the type are performed without taking any time zone into account. The new
data type should conform to the type *TIME\(n\) WITHOUT TIME ZONE* defined by the
SQL standard.
*Q3. How is it done today, and what are the limits of current practice?*
The TIME type can be emulated via the TIMESTAMP_NTZ data type by setting the
date part to some constant value like 1970-01-01, 0001-01-01 or 0000-00-00
(though the latter is outside the supported range of dates).
Although the type can be emulated via TIMESTAMP_NTZ, Spark SQL cannot
recognize it in data sources, and for instance cannot load TIME values from
Parquet files.
*Q6. What are the risks?*
Handling the new type in additional operators, expressions and data sources can
cause performance regressions. This risk can be mitigated by developing TIME
benchmarks in parallel with supporting the new type in different places in Spark
SQL.
*Q7. How long will it take?*
In total it might take around *9 months*. The estimate is based on similar
tasks: ANSI intervals
([SPARK-27790|https://issues.apache.org/jira/browse/SPARK-27790]) and
TIMESTAMP_NTZ
([SPARK-35662|https://issues.apache.org/jira/browse/SPARK-35662]). We can split
the work into functional blocks:
# Base functionality - *3 weeks*
Add a new type TimeType, forming/parsing time literals, a type constructor, and
external types.
# Persistence - *3.5 months*
Ability to create tables of the type TIME, read/write from/to Parquet and other
built-in data sources, partitioning, stats, predicate push down.
# Time operators - *2 months*
Arithmetic ops, field extraction, sorting, and aggregations.
# Client support - *1 month*
JDBC, Hive, Thrift server, Spark Connect
# PySpark integration - *1 month*
DataFrame support, pandas API, Python UDFs, Arrow column vectors
# Docs + testing/benchmarking - *1 month*
> [WIP] SPIP: Add the TIME data type
> ----------------------------------
>
> Key: SPARK-51162
> URL: https://issues.apache.org/jira/browse/SPARK-51162
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Max Gekk
> Assignee: Max Gekk
> Priority: Major
> Labels: SPIP
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]