[
https://issues.apache.org/jira/browse/SPARK-51162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Max Gekk updated SPARK-51162:
-----------------------------
Description:
*Q1. What are you trying to do? Articulate your objectives using absolutely no
jargon.*
Add a new data type *TIME* to Spark SQL which represents a time-of-day value with
hour, minute, and second fields, with precision up to microseconds. All operations
over the type are performed without taking any time zone into account. The new
data type should conform to the type *TIME\(n\) WITHOUT TIME ZONE* defined by the
SQL standard.
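To illustrate the intended semantics (this is plain Python, not Spark code), the standard library's {{datetime.time}} models the same kind of wall-clock value: hour, minute, second, and microsecond fields with no time zone attached.

```python
from datetime import time

# A wall-clock time of day: fields hour, minute, second, microsecond.
t = time(23, 59, 59, 999999)
assert (t.hour, t.minute, t.second, t.microsecond) == (23, 59, 59, 999999)

# No time zone is involved: a naive time carries tzinfo=None,
# mirroring TIME(n) WITHOUT TIME ZONE semantics.
assert t.tzinfo is None
```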
*Q3. How is it done today, and what are the limits of current practice?*
The TIME type can be emulated via the TIMESTAMP_NTZ data type by setting the
date part to some constant value like 1970-01-01, 0001-01-01 or 0000-00-00
(though the latter is outside the supported range of dates).
Although the type can be emulated via TIMESTAMP_NTZ, Spark SQL cannot
recognize it in data sources, and for instance cannot load TIME values from
Parquet files.
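The emulation above can be sketched in plain Python (not Spark code): embed the time-of-day into a timestamp-without-time-zone by pinning the date part to a constant epoch date.

```python
from datetime import datetime, time

# Emulate TIME via a timestamp-without-time-zone by pinning the
# date part to a constant value (1970-01-01), as described above.
EPOCH_DATE = datetime(1970, 1, 1).date()

def time_as_ntz(t: time) -> datetime:
    """Embed a time-of-day value into a TIMESTAMP_NTZ-like value."""
    return datetime.combine(EPOCH_DATE, t)

ntz = time_as_ntz(time(12, 30, 45))
assert ntz == datetime(1970, 1, 1, 12, 30, 45)
# Round-trip back to the time-of-day part:
assert ntz.time() == time(12, 30, 45)
```

The limitation noted above is that nothing in the stored value marks it as a time rather than a timestamp, so readers of the data cannot recognize the intent.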
*Q4. What is new in your approach and why do you think it will be successful?*
The approach is not new, and we have a clear picture of how to split the work
into sub-tasks based on our experience of adding the ANSI interval and
TIMESTAMP_NTZ types.
*Q6. What are the risks?*
Handling the new type in additional operators, expressions and data sources can
cause performance regressions. This risk can be mitigated by developing TIME
benchmarks in parallel with supporting the new type in different places in Spark
SQL.
*Q7. How long will it take?*
In total it might take around *9 months*. The estimate is based on similar
tasks: ANSI intervals
([SPARK-27790|https://issues.apache.org/jira/browse/SPARK-27790]) and
TIMESTAMP_NTZ
([SPARK-35662|https://issues.apache.org/jira/browse/SPARK-35662]). We can split
the work into functional blocks:
# Base functionality - *3 weeks*
Add a new type TimeType, forming/parsing time literals, a type constructor, and
external types.
# Persistence - *3.5 months*
Ability to create tables of the type TIME, read/write from/to Parquet and other
built-in data sources, partitioning, stats, predicate push down.
# Time operators - *2 months*
Arithmetic ops, field extraction, sorting, and aggregations.
# Client support - *1 month*
JDBC, Hive, Thrift server, Spark Connect
# PySpark integration - *1 month*
DataFrame support, pandas API, Python UDFs, Arrow column vectors
# Docs + testing/benchmarking - *1 month*
was:
*Q1. What are you trying to do? Articulate your objectives using absolutely no
jargon.*
Add a new data type *TIME* to Spark SQL which represents a time-of-day value with
hour, minute, and second fields, with precision up to microseconds. All operations
over the type are performed without taking any time zone into account. The new
data type should conform to the type *TIME\(n\) WITHOUT TIME ZONE* defined by the
SQL standard.
*Q3. How is it done today, and what are the limits of current practice?*
The TIME type can be emulated via the TIMESTAMP_NTZ data type by setting the
date part to some constant value like 1970-01-01, 0001-01-01 or 0000-00-00
(though the latter is outside the supported range of dates).
Although the type can be emulated via TIMESTAMP_NTZ, Spark SQL cannot
recognize it in data sources, and for instance cannot load TIME values from
Parquet files.
*Q6. What are the risks?*
Handling the new type in additional operators, expressions and data sources can
cause performance regressions. This risk can be mitigated by developing TIME
benchmarks in parallel with supporting the new type in different places in Spark
SQL.
*Q7. How long will it take?*
In total it might take around *9 months*. The estimate is based on similar
tasks: ANSI intervals
([SPARK-27790|https://issues.apache.org/jira/browse/SPARK-27790]) and
TIMESTAMP_NTZ
([SPARK-35662|https://issues.apache.org/jira/browse/SPARK-35662]). We can split
the work into functional blocks:
# Base functionality - *3 weeks*
Add a new type TimeType, forming/parsing time literals, a type constructor, and
external types.
# Persistence - *3.5 months*
Ability to create tables of the type TIME, read/write from/to Parquet and other
built-in data sources, partitioning, stats, predicate push down.
# Time operators - *2 months*
Arithmetic ops, field extraction, sorting, and aggregations.
# Client support - *1 month*
JDBC, Hive, Thrift server, Spark Connect
# PySpark integration - *1 month*
DataFrame support, pandas API, Python UDFs, Arrow column vectors
# Docs + testing/benchmarking - *1 month*
> [WIP] SPIP: Add the TIME data type
> ----------------------------------
>
> Key: SPARK-51162
> URL: https://issues.apache.org/jira/browse/SPARK-51162
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Max Gekk
> Assignee: Max Gekk
> Priority: Major
> Labels: SPIP
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]