[
https://issues.apache.org/jira/browse/SPARK-51162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Max Gekk updated SPARK-51162:
-----------------------------
Description:
*Q1. What are you trying to do? Articulate your objectives using absolutely no
jargon.*
Add a new data type *TIME* to Spark SQL which represents a time value with the fields
hour, minute, and second, with up to microsecond precision. All operations over the type
are performed without taking any time zone into account. The new data type should
conform to the type *TIME\(n\) WITHOUT TIME ZONE* defined by the SQL standard.
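For illustration, the intended semantics can be sketched with Python's timezone-naive `datetime.time` as an analogue of the proposed type (an illustration only, not the Spark implementation):

```python
from datetime import time

# A TIME-like value: hour, minute, second, microsecond fields, no time zone.
t = time(hour=23, minute=59, second=59, microsecond=999999)

# Field access mirrors SQL EXTRACT semantics on such a value.
print(t.hour, t.minute, t.second, t.microsecond)  # 23 59 59 999999

# No time zone is attached: tzinfo stays None for a naive time-of-day.
assert t.tzinfo is None
```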
*Q2. What problem is this proposal NOT designed to solve?*
The proposal does not cover the TIME type with time zone defined by the SQL
standard: *TIME\(n\) WITH TIME ZONE*.
*Q3. How is it done today, and what are the limits of current practice?*
The TIME type can be emulated via the TIMESTAMP_NTZ data type by setting the
date part to some constant value like 1970-01-01, 0001-01-01, or 0000-00-00
(though the last one is outside the supported range of dates).
Although the type can be emulated via TIMESTAMP_NTZ, Spark SQL cannot
recognize it in data sources, and, for instance, cannot load TIME values from
Parquet files.
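The emulation and its main drawback can be sketched with Python's stdlib `datetime` standing in for TIMESTAMP_NTZ (the helper name `time_as_ntz` is hypothetical, for illustration only):

```python
from datetime import datetime, time

# Emulate TIME via a timezone-naive timestamp (the TIMESTAMP_NTZ analogue)
# by pinning the date part to a constant, e.g. 1970-01-01.
def time_as_ntz(t: time) -> datetime:
    """Store a time-of-day as a timestamp with a constant date part."""
    return datetime(1970, 1, 1, t.hour, t.minute, t.second, t.microsecond)

ts = time_as_ntz(time(12, 30, 45))
print(ts)  # 1970-01-01 12:30:45

# The drawback: the constant-date convention lives only in user code, so
# external data (e.g. a Parquet TIME column) cannot be recognized as this.
recovered = ts.time()
assert recovered == time(12, 30, 45)
```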
*Q4. What is new in your approach and why do you think it will be successful?*
The approach is not new, and we have a clear picture of how to split the work
into sub-tasks based on our experience of adding the new types ANSI intervals
and TIMESTAMP_NTZ.
*Q5. Who cares? If you are successful, what difference will it make?*
The new type simplifies migrations to Spark SQL from other DBMSs like
PostgreSQL, Snowflake, Google SQL, Amazon Redshift, Teradata, and DB2. Such
users don't have to rewrite their SQL code to emulate the TIME type. Also, the
new functionality benefits existing Spark SQL users who need to load data with
TIME values that were stored by other systems.
*Q6. What are the risks?*
Additional handling of the new type in operators, expressions, and data sources
can cause performance regressions. Such risk can be mitigated by developing
time benchmarks in parallel with supporting the new type in different places in
Spark SQL.
*Q7. How long will it take?*
In total it might take around *9 months*. The estimation is based on similar
tasks: ANSI intervals
([SPARK-27790|https://issues.apache.org/jira/browse/SPARK-27790]) and
TIMESTAMP_NTZ
([SPARK-35662|https://issues.apache.org/jira/browse/SPARK-35662]). We can split
the work into functional blocks:
# Base functionality - *3 weeks*
Add the new type TimeType, forming/parsing of time literals, a type
constructor, and external types.
# Persistence - *3.5 months*
Ability to create tables of the type TIME, read/write from/to Parquet and other
built-in data sources, partitioning, stats, and predicate push down.
# Time operators - *2 months*
Arithmetic ops, field extraction, sorting, and aggregations.
# Clients support - *1 month*
JDBC, Hive, Thrift server, and Spark Connect.
# PySpark integration - *1 month*
DataFrame support, pandas API, Python UDFs, and Arrow column vectors.
# Docs + testing/benchmarking - *1 month*
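The "Time operators" block can be sketched as follows, assuming a microseconds-since-midnight integer encoding of time values. This encoding is an illustration chosen here, not the committed internal layout of TimeType:

```python
MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000

def make_time(hour: int, minute: int, second: int, micros: int = 0) -> int:
    """Encode a time-of-day as microseconds since midnight."""
    return ((hour * 60 + minute) * 60 + second) * 1_000_000 + micros

def add_micros(t: int, delta: int) -> int:
    """Shift a time by a duration, wrapping around midnight."""
    return (t + delta) % MICROS_PER_DAY

def extract_hour(t: int) -> int:
    """EXTRACT(HOUR FROM t) on the encoded value."""
    return t // 3_600_000_000

# Sorting and comparison come for free from the integer encoding;
# arithmetic wraps at midnight, staying within one day.
noon = make_time(12, 0, 0)
assert extract_hour(add_micros(noon, 13 * 3_600_000_000)) == 1  # 12:00 + 13h -> 01:00
```

An integer encoding like this also makes sorting and aggregations reuse plain integer comparisons, which is one reason similar layouts were used for DATE and the timestamp types.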
*Q8. What are the mid-term and final “exams” to check for success?*
The mid-term is in 4 months: basic functionality, read/write of the new type to
built-in data sources, and basic time operations such as arithmetic ops and
casting. The final "exams" are to support the same functionality as the other
datetime types: TIMESTAMP_NTZ, DATE, and TIMESTAMP.
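The casting mentioned in Q8 could behave as sketched below, again with Python's naive `datetime` standing in for TIMESTAMP_NTZ; the reference date 1970-01-01 for the TIME-to-timestamp direction is an assumption of this sketch, not a decided semantic:

```python
from datetime import datetime, time

def cast_ntz_to_time(ts: datetime) -> time:
    """CAST(ts AS TIME): drop the date part of a timezone-naive timestamp."""
    return ts.time()

def cast_time_to_ntz(t: time) -> datetime:
    """CAST(t AS TIMESTAMP_NTZ): attach a reference date (assumed 1970-01-01)."""
    return datetime.combine(datetime(1970, 1, 1).date(), t)

ts = datetime(2025, 2, 11, 9, 15, 30)
assert cast_ntz_to_time(ts) == time(9, 15, 30)
assert cast_time_to_ntz(time(9, 15, 30)) == datetime(1970, 1, 1, 9, 15, 30)
```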
> [WIP] SPIP: Add the TIME data type
> ----------------------------------
>
> Key: SPARK-51162
> URL: https://issues.apache.org/jira/browse/SPARK-51162
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Max Gekk
> Assignee: Max Gekk
> Priority: Major
> Labels: SPIP
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)