[
https://issues.apache.org/jira/browse/SPARK-51162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18042780#comment-18042780
]
Uroš Bojanić commented on SPARK-51162:
--------------------------------------
Good point about possibly protecting the feature behind a flag, since it's not
fully ready for 4.1 at this time. [~maxgekk] What do you say? Also cc
[~davidm-db]
> SPIP: Add the TIME data type
> ----------------------------
>
> Key: SPARK-51162
> URL: https://issues.apache.org/jira/browse/SPARK-51162
> Project: Spark
> Issue Type: Umbrella
> Components: SQL
> Affects Versions: 4.1.0
> Reporter: Max Gekk
> Assignee: Max Gekk
> Priority: Major
> Labels: SPIP, pull-request-available, releasenotes
> Fix For: 4.1.0
>
>
> *Q1. What are you trying to do? Articulate your objectives using absolutely
> no jargon.*
> Add a new data type *TIME* to Spark SQL which represents a time value with the
> fields hour, minute, and second, up to microsecond precision. All operations
> over the type are performed without taking any time zone into account. The new
> data type should conform to the type *TIME\(n\) WITHOUT TIME ZONE* defined by
> the SQL standard, where 0 <= n <= 6.
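> For intuition, here is a small runnable sketch (plain Scala, no Spark APIs) of the
> value space that *TIME\(n\) WITHOUT TIME ZONE* describes, using java.time.LocalTime,
> which Appendix B proposes as the external Java/Scala type. The truncation step only
> illustrates the precision field n and is not the proposed implementation:
> {code:scala}
> import java.time.LocalTime
> import java.time.temporal.ChronoUnit
>
> object TimeValueSpaceSketch extends App {
>   // A TIME(6) value carries hour, minute, second and up to 6 fractional digits
>   // (microseconds); no time zone is attached to the value.
>   val t = LocalTime.parse("23:59:59.999999")
>   println(t)                                            // 23:59:59.999999
>
>   // A smaller precision n keeps fewer fractional digits, e.g. n = 3 (milliseconds).
>   println(t.truncatedTo(ChronoUnit.MILLIS))             // 23:59:59.999
>
>   // The supported range in the proposal is 00:00:00.000000 to 23:59:59.999999.
>   println(LocalTime.MIN)                                // 00:00
>   println(LocalTime.MAX.truncatedTo(ChronoUnit.MICROS)) // 23:59:59.999999
> }
> {code}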
> *Q2. What problem is this proposal NOT designed to solve?*
> Don't support the TIME type with time zone defined by the SQL standard:
> {*}TIME\(n\) WITH TIME ZONE{*}.
> Also don't support TIME with local timezone.
> *Q3. How is it done today, and what are the limits of current practice?*
> The TIME type can be emulated via the TIMESTAMP_NTZ data type by setting the
> date part to some constant value like 1970-01-01, 0001-01-01 or 0000-00-00
> (though the last one is outside the supported range of dates).
> Although the type can be emulated via TIMESTAMP_NTZ, Spark SQL cannot
> recognize it in data sources, and for instance cannot load TIME values
> from Parquet files.
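> As a concrete illustration of this workaround and its limits, a minimal sketch
> (the column name, pinned date, and output path are illustrative only):
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> object TimeEmulationSketch {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder()
>       .master("local[*]")
>       .appName("time-emulation")
>       .getOrCreate()
>
>     // Emulate a time-of-day value by pinning the date part of a TIMESTAMP_NTZ
>     // to a constant day (1970-01-01 here).
>     val df = spark.sql(
>       "SELECT CAST('1970-01-01 12:34:56.123456' AS TIMESTAMP_NTZ) AS emulated_time")
>     df.show(truncate = false)
>
>     // The emulation is lost at the data source boundary: Parquet stores a plain
>     // timestamp, so a reader cannot tell the value was meant to carry only the
>     // time-of-day part, and TIME columns written by other systems cannot be read.
>     df.write.mode("overwrite").parquet("/tmp/emulated_time_parquet")
>
>     spark.stop()
>   }
> }
> {code}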
> *Q4. What is new in your approach and why do you think it will be successful?*
> The approach is not new, and we have a clear picture of how to split the work
> into sub-tasks based on our experience of adding the new types ANSI intervals
> and TIMESTAMP_NTZ.
> *Q5. Who cares? If you are successful, what difference will it make?*
> The new type simplifies migrations to Spark SQL from other DBMSs like
> PostgreSQL, Snowflake, Google SQL, Amazon Redshift, Teradata, and DB2. Such
> users won't have to rewrite their SQL code to emulate the TIME type. The new
> functionality also benefits existing Spark SQL users who need to load data
> with TIME values that were stored by other systems.
> *Q6. What are the risks?*
> Additional handling of the new type in operators, expressions, and data sources
> can cause performance regressions. Such risk can be mitigated by developing
> time benchmarks in parallel with supporting the new type in different places
> in Spark SQL.
>
> *Q7. How long will it take?*
> In total it might take around {*}9 months{*}. The estimation is based on
> similar tasks: ANSI intervals (SPARK-27790) and TIMESTAMP_NTZ (SPARK-35662).
> We can split the work into functional blocks:
> # Base functionality - *3 weeks*
> Add the new type TimeType, forming/parsing time literals, the type constructor,
> and external types.
> # Persistence - *3.5 months*
> Ability to create tables of the type TIME, read/write from/to Parquet and
> other built-in data sources, partitioning, stats, predicate push down.
> # Time operators - *2 months*
> Arithmetic ops, field extraction, sorting, and aggregations.
> # Client support - *1 month*
> JDBC, Hive, Thrift server, Connect.
> # PySpark integration - *1 month*
> DataFrame support, pandas API, Python UDFs, Arrow column vectors.
> # Docs + testing/benchmarking - *1 month*
> *Q8. What are the mid-term and final “exams” to check for success?*
> The mid-term is in 4 months: basic functionality, reading/writing the new type
> from/to built-in data sources, and basic time operations such as arithmetic ops
> and casting.
> The final "exam" is to support the same functionality as the other datetime
> types: TIMESTAMP_NTZ, DATE, TIMESTAMP.
> *Appendix A. Proposed API Changes.*
> Add new case class *TimeType* to {_}org.apache.spark.sql.types{_}:
> {code:scala}
> /**
>  * The time type represents a time value with fields hour, minute, second, up
>  * to microseconds.
>  * The range of times supported is 00:00:00.000000 to 23:59:59.999999.
>  *
>  * Please use the singleton `DataTypes.TimeType` to refer to the type.
>  */
> class TimeType(precisionField: Byte) extends DatetimeType {
>
>   /**
>    * The default size of a value of the TimeType is 8 bytes.
>    */
>   override def defaultSize: Int = 8
>
>   private[spark] override def asNullable: TimeType = this
> }
> {code}
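> A usage sketch, assuming the proposed class lands roughly as written above (the
> constructor argument is the precision field from the snippet; field names and the
> final constructor shape may change during implementation):
> {code:scala}
> import org.apache.spark.sql.types.{StringType, StructField, StructType}
>
> // Hypothetical once TimeType is available in org.apache.spark.sql.types;
> // this does not compile against Spark versions that predate the proposal.
> val schema = StructType(Seq(
>   StructField("event_name", StringType),
>   StructField("event_time", new TimeType(6.toByte)) // TIME(6): microsecond precision
> ))
> {code}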
> *Appendix B:* As the external types for the new TIME type, we propose:
> - Java/Scala:
> [java.time.LocalTime|https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/time/LocalTime.html]
> - PySpark:
> [time|https://docs.python.org/3/library/datetime.html#time-objects]
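> For a sense of how the external Java/Scala type lines up with a microsecond-precision
> time-of-day value, a small runnable sketch; the microseconds-since-midnight encoding
> below is purely hypothetical and only shows how a LocalTime round-trips through a
> compact integer representation (the SPIP text does not prescribe the internal encoding):
> {code:scala}
> import java.time.LocalTime
>
> object ExternalTypeSketch extends App {
>   // Appendix B proposes java.time.LocalTime as the external Java/Scala type.
>   val t = LocalTime.of(12, 34, 56, 123456000)    // 12:34:56.123456
>
>   // Hypothetical compact encoding: microseconds since midnight fits in a Long
>   // (and in the 8-byte default size mentioned in Appendix A).
>   val micros: Long = t.toNanoOfDay / 1000L
>   println(micros)                                 // 45296123456
>   println(LocalTime.ofNanoOfDay(micros * 1000L))  // 12:34:56.123456
> }
> {code}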