[
https://issues.apache.org/jira/browse/SPARK-51162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Max Gekk updated SPARK-51162:
-----------------------------
Description:
*Q1. What are you trying to do? Articulate your objectives using absolutely no
jargon.*
Add a new data type *TIME* to Spark SQL which represents a time value with the fields
hour, minute, and second, with up to microsecond precision. All operations over the type
are performed without taking any time zone into account. The new data type should
conform to the type *TIME\(n\) WITHOUT TIME ZONE* defined by the SQL standard.
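For illustration, the intended semantics can be sketched with Python's timezone-naive `datetime.time` as an analogue of the proposed type (an illustration only, not the Spark implementation):

```python
from datetime import time

# A TIME-like value: hour, minute, second, microsecond fields, no time zone.
t = time(hour=23, minute=59, second=59, microsecond=999999)

# Field access mirrors SQL EXTRACT semantics on such a value.
print(t.hour, t.minute, t.second, t.microsecond)  # 23 59 59 999999

# No time zone is attached: tzinfo stays None for a naive time-of-day.
assert t.tzinfo is None
```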
*Q2. What problem is this proposal NOT designed to solve?*
The proposal does not cover the TIME type with time zone defined by the SQL
standard: *TIME\(n\) WITH TIME ZONE*.
*Q3. How is it done today, and what are the limits of current practice?*
The TIME type can be emulated via the TIMESTAMP_NTZ data type by setting the
date part to some constant value like 1970-01-01, 0001-01-01, or 0000-00-00
(though the last one is outside the supported range of dates).
Although the type can be emulated via TIMESTAMP_NTZ, Spark SQL cannot
recognize it in data sources, and, for instance, cannot load TIME values from
Parquet files.
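The emulation and its main drawback can be sketched with Python's stdlib `datetime` standing in for TIMESTAMP_NTZ (the helper name `time_as_ntz` is hypothetical, for illustration only):

```python
from datetime import datetime, time

# Emulate TIME via a timezone-naive timestamp (the TIMESTAMP_NTZ analogue)
# by pinning the date part to a constant, e.g. 1970-01-01.
def time_as_ntz(t: time) -> datetime:
    """Store a time-of-day as a timestamp with a constant date part."""
    return datetime(1970, 1, 1, t.hour, t.minute, t.second, t.microsecond)

ts = time_as_ntz(time(12, 30, 45))
print(ts)  # 1970-01-01 12:30:45

# The drawback: the constant-date convention lives only in user code, so
# external data (e.g. a Parquet TIME column) cannot be recognized as this.
recovered = ts.time()
assert recovered == time(12, 30, 45)
```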
*Q4. What is new in your approach and why do you think it will be successful?*
The approach is not new, and we have a clear picture of how to split the work
into sub-tasks based on our experience of adding the new types ANSI intervals
and TIMESTAMP_NTZ.
*Q5. Who cares? If you are successful, what difference will it make?*
The new type simplifies migrations to Spark SQL from other DBMSs like
PostgreSQL, Snowflake, Google SQL, Amazon Redshift, Teradata, and DB2. Such
users don't have to rewrite their SQL code to emulate the TIME type. Also, the
new functionality benefits existing Spark SQL users who need to load data with
TIME values that were stored by other systems.
*Q6. What are the risks?*
Additional handling of the new type in operators, expressions, and data sources
can cause performance regressions. Such risk can be mitigated by developing
time benchmarks in parallel with supporting the new type in different places in
Spark SQL.
*Q7. How long will it take?*
In total it might take around *9 months*. The estimation is based on similar
tasks: ANSI intervals
([SPARK-27790|https://issues.apache.org/jira/browse/SPARK-27790]) and
TIMESTAMP_NTZ
([SPARK-35662|https://issues.apache.org/jira/browse/SPARK-35662]). We can split
the work into functional blocks:
# Base functionality - *3 weeks*
Add the new type TimeType, forming/parsing of time literals, a type
constructor, and external types.
# Persistence - *3.5 months*
Ability to create tables of the type TIME, read/write from/to Parquet and other
built-in data sources, partitioning, stats, and predicate push down.
# Time operators - *2 months*
Arithmetic ops, field extraction, sorting, and aggregations.
# Clients support - *1 month*
JDBC, Hive, Thrift server, and Spark Connect.
# PySpark integration - *1 month*
DataFrame support, pandas API, Python UDFs, and Arrow column vectors.
# Docs + testing/benchmarking - *1 month*
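The "Time operators" block can be sketched as follows, assuming a microseconds-since-midnight integer encoding of time values. This encoding is an illustration chosen here, not the committed internal layout of TimeType:

```python
MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000

def make_time(hour: int, minute: int, second: int, micros: int = 0) -> int:
    """Encode a time-of-day as microseconds since midnight."""
    return ((hour * 60 + minute) * 60 + second) * 1_000_000 + micros

def add_micros(t: int, delta: int) -> int:
    """Shift a time by a duration, wrapping around midnight."""
    return (t + delta) % MICROS_PER_DAY

def extract_hour(t: int) -> int:
    """EXTRACT(HOUR FROM t) on the encoded value."""
    return t // 3_600_000_000

# Sorting and comparison come for free from the integer encoding;
# arithmetic wraps at midnight, staying within one day.
noon = make_time(12, 0, 0)
assert extract_hour(add_micros(noon, 13 * 3_600_000_000)) == 1  # 12:00 + 13h -> 01:00
```

An integer encoding like this also makes sorting and aggregations reuse plain integer comparisons, which is one reason similar layouts were used for DATE and the timestamp types.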
*Q8. What are the mid-term and final “exams” to check for success?*
The mid-term is in 4 months: basic functionality, read/write of the new type to
built-in data sources, and basic time operations such as arithmetic ops and
casting. The final "exams" are to support the same functionality as the other
datetime types: TIMESTAMP_NTZ, DATE, and TIMESTAMP.
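The casting mentioned in Q8 could behave as sketched below, again with Python's naive `datetime` standing in for TIMESTAMP_NTZ; the reference date 1970-01-01 for the TIME-to-timestamp direction is an assumption of this sketch, not a decided semantic:

```python
from datetime import datetime, time

def cast_ntz_to_time(ts: datetime) -> time:
    """CAST(ts AS TIME): drop the date part of a timezone-naive timestamp."""
    return ts.time()

def cast_time_to_ntz(t: time) -> datetime:
    """CAST(t AS TIMESTAMP_NTZ): attach a reference date (assumed 1970-01-01)."""
    return datetime.combine(datetime(1970, 1, 1).date(), t)

ts = datetime(2025, 2, 11, 9, 15, 30)
assert cast_ntz_to_time(ts) == time(9, 15, 30)
assert cast_time_to_ntz(time(9, 15, 30)) == datetime(1970, 1, 1, 9, 15, 30)
```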
> [WIP] SPIP: Add the TIME data type
> ----------------------------------
>
> Key: SPARK-51162
> URL: https://issues.apache.org/jira/browse/SPARK-51162
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Max Gekk
> Assignee: Max Gekk
> Priority: Major
> Labels: SPIP
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)