[ https://issues.apache.org/jira/browse/SPARK-31030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yuanjian Li updated SPARK-31030:
--------------------------------
Description:

*Background*

In Spark version 2.4 and earlier, datetime parsing, formatting and conversion are performed using the hybrid calendar ([Julian + Gregorian|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html]). Since the Proleptic Gregorian calendar is the de facto calendar worldwide, as well as the one chosen by the ANSI SQL standard, Spark 3.0 switches to it by using the Java 8 API classes (the java.time packages, which are based on [ISO chronology|https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]). The switch was completed in SPARK-26651.

*Problem*

Switching to the Java 8 datetime API breaks backward compatibility with Spark 2.4 and earlier when parsing datetimes. Moreover, the built-in SQL expressions such as to_date and to_timestamp in the existing Spark 3.0 implementation catch all exceptions and return `null` on parsing errors. This causes silent result changes, which are hard to debug for end users when the data volume is huge and the business logic is complex.

*Solution*

To avoid unexpected result changes after the underlying datetime API switch, we propose the following solution.
* Introduce a fallback mechanism: when the Java 8-based parser fails, detect behavior differences by falling back to the legacy parser, and fail with a user-friendly error message telling users what changed and how to fix the pattern.
* Document Spark's datetime patterns: the date-time formatter of Spark is decoupled from the Java patterns.
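The fallback mechanism described in the first bullet can be sketched in plain Java. This is a minimal illustration, not Spark's actual implementation: the method name `parseOrExplain` and the use of a single shared pattern string for both APIs are simplifying assumptions (Spark really rewrites the pattern between the two APIs, as the rules below describe).

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.temporal.TemporalAccessor;

public class FallbackSketch {
    /**
     * Try the Java 8 parser first; on failure, probe the legacy Java 7 parser.
     * If the legacy parser accepts the input, the two Spark versions disagree,
     * so fail loudly instead of silently changing the result. If both parsers
     * fail, return null, as Spark 2.4 does.
     */
    static TemporalAccessor parseOrExplain(String text, String pattern) {
        try {
            // Java 8 path: the behavior Spark 3.0 follows.
            return DateTimeFormatter.ofPattern(pattern).parse(text);
        } catch (DateTimeParseException java8Failure) {
            try {
                // Legacy probe: would Spark 2.4 have accepted this input?
                new SimpleDateFormat(pattern).parse(text);
            } catch (ParseException java7Failure) {
                return null; // neither parser understands the input
            }
            // Legacy parser succeeded: a silent behavior change, so raise it.
            throw new IllegalStateException(
                "Pattern '" + pattern + "' parses differently in Spark 3.0; "
                + "fix the pattern or turn on the legacy parser mode");
        }
    }
}
```

For example, with the pattern `yyyy z`, the input `2020 +0800` parses under the legacy API but not under the Java 8 one, so the sketch raises an error instead of silently returning a different result.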
Spark's patterns are mainly based on the [Java 7 patterns|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html] (for better backward compatibility), with customized logic to handle the breaking changes between the [Java 7|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html] and [Java 8|https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html] pattern strings. Below are the customized rules:

||Pattern||Java 7||Java 8||
|u|Day number of week (1 = Monday, ..., 7 = Sunday)|Year (unlike y, u accepts a negative value to represent BC, while y must be combined with G to do the same)|

Example:
||SQL||2.4||3.0||
|to_timestamp('7', 'u')|1970-01-04 00:00:00|0007-01-01 00:00:00|
|to_timestamp('7', 'uu')|1970-01-04 00:00:00|null|
|to_timestamp('700', 'uuu')|null|0700-01-01 00:00:00|
|to_timestamp('2000', 'uuuu')|null|2000-01-01 00:00:00|

Rule: substitute ‘u’ with ‘e’ and use the Java 8 parser to parse the string. If it is parsable, return the result; otherwise, fall back to ‘u’ and use the legacy Java 7 parser. If the legacy parser succeeds, throw an exception asking users to change the pattern string or to turn on the legacy mode; otherwise, return NULL as Spark 2.4 does.

||Pattern||Java 7||Java 8||
|z|General time zone, which also accepts [RFC 822 time zones|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html#rfc822timezone]|Only accepts time-zone names, e.g. Pacific Standard Time; PST|

Example:
||SQL||2.4||3.0||
|to_timestamp('2020 +0800', 'yyyy z')|2019-12-31 16:00:00|null|

Rule: the semantics of ‘z’ differ between Java 7 and Java 8; Spark 3.0 follows the Java 8 semantics. Use the Java 8 parser to parse the string. If it is parsable, return the result; otherwise, use the legacy Java 7 parser. If the legacy parser succeeds, throw an exception asking users to change the pattern string or to turn on the legacy mode; otherwise, return NULL as Spark 2.4 does.
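The two customized rules above come from concrete semantic differences between `SimpleDateFormat` (the Java 7 API) and `DateTimeFormatter` (the Java 8 API), which can be observed directly. The following is a standalone demonstration, independent of Spark; the helper method names are invented for this sketch.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.temporal.ChronoField;

public class PatternSemantics {
    // Java 8 API: 'u' is the (proleptic) year, so "7" parses to year 7.
    static int java8YearFor(String text) {
        return DateTimeFormatter.ofPattern("u").parse(text).get(ChronoField.YEAR);
    }

    // Java 7 API: 'u' is the day number of the week (1 = Monday ... 7 = Sunday),
    // so "7" is a valid input on its own.
    static boolean java7AcceptsDayNumber(String text) {
        try { new SimpleDateFormat("u").parse(text); return true; }
        catch (ParseException e) { return false; }
    }

    // Java 7 API: 'z' (general time zone) also accepts RFC 822 offsets like "+0800".
    static boolean java7ZoneAccepts(String text) {
        try { new SimpleDateFormat("yyyy z").parse(text); return true; }
        catch (ParseException e) { return false; }
    }

    // Java 8 API: 'z' only matches time-zone names, so "+0800" is rejected.
    static boolean java8ZoneAccepts(String text) {
        try { DateTimeFormatter.ofPattern("yyyy z").parse(text); return true; }
        catch (DateTimeParseException e) { return false; }
    }

    public static void main(String[] args) {
        System.out.println("Java 8 'u' parses \"7\" as year " + java8YearFor("7"));
        System.out.println("Java 7 'u' accepts \"7\" as a day number: " + java7AcceptsDayNumber("7"));
        System.out.println("Java 7 'z' accepts \"2020 +0800\": " + java7ZoneAccepts("2020 +0800"));
        System.out.println("Java 8 'z' accepts \"2020 +0800\": " + java8ZoneAccepts("2020 +0800"));
    }
}
```

This is exactly why `to_timestamp('2020 +0800', 'yyyy z')` returns a result in 2.4 but null in 3.0, and why 'u' patterns flip between day-of-week and year semantics.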
> Backward Compatibility for Parsing Datetime
> -------------------------------------------
>
> Key: SPARK-31030
> URL: https://issues.apache.org/jira/browse/SPARK-31030
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Yuanjian Li
> Priority: Major
--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org