[ https://issues.apache.org/jira/browse/SPARK-31030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yuanjian Li updated SPARK-31030:
--------------------------------
Description:

*Background*

In Spark version 2.4 and earlier, datetime parsing, formatting and conversion are performed using the hybrid calendar ([Julian + Gregorian|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html]). Since the Proleptic Gregorian calendar is the de facto calendar worldwide, as well as the one chosen by the ANSI SQL standard, Spark 3.0 switches to it by using the Java 8 API classes (the java.time packages, which are based on [ISO chronology|https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]). The switch was completed in SPARK-26651.

*Problem*

Switching to the Java 8 datetime API breaks backward compatibility with Spark 2.4 and earlier when parsing datetimes. Moreover, the built-in SQL expressions such as to_date and to_timestamp in the existing Spark 3.0 implementation catch all exceptions and return `null` on parsing errors. This causes silent result changes, which are hard to debug for end users when the data volume is huge and the business logic is complex.

*Solution*

To avoid unexpected result changes after the underlying datetime API switch, we propose the following solution.
* Introduce a fallback mechanism: when the Java 8-based parser fails, detect behavior differences by falling back to the legacy parser, and fail with a user-friendly error message telling users what changed and how to fix the pattern.
* Document Spark's datetime patterns: the date-time formatter of Spark is decoupled from the Java patterns.
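The fallback mechanism described in the first bullet can be sketched in plain Java. This is a minimal illustration, not Spark's actual implementation: the method name `parseOrExplain` and the use of a single shared pattern string for both APIs are simplifying assumptions (Spark really rewrites the pattern between the two APIs, as the rules below describe).

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.temporal.TemporalAccessor;

public class FallbackSketch {
    /**
     * Try the Java 8 parser first; on failure, probe the legacy Java 7 parser.
     * If the legacy parser accepts the input, the two Spark versions disagree,
     * so fail loudly instead of silently changing the result. If both parsers
     * fail, return null, as Spark 2.4 does.
     */
    static TemporalAccessor parseOrExplain(String text, String pattern) {
        try {
            // Java 8 path: the behavior Spark 3.0 follows.
            return DateTimeFormatter.ofPattern(pattern).parse(text);
        } catch (DateTimeParseException java8Failure) {
            try {
                // Legacy probe: would Spark 2.4 have accepted this input?
                new SimpleDateFormat(pattern).parse(text);
            } catch (ParseException java7Failure) {
                return null; // neither parser understands the input
            }
            // Legacy parser succeeded: a silent behavior change, so raise it.
            throw new IllegalStateException(
                "Pattern '" + pattern + "' parses differently in Spark 3.0; "
                + "fix the pattern or turn on the legacy parser mode");
        }
    }
}
```

For example, with the pattern `yyyy z`, the input `2020 +0800` parses under the legacy API but not under the Java 8 one, so the sketch raises an error instead of silently returning a different result.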
Spark's patterns are mainly based on the [Java 7 patterns|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html] (for better backward compatibility), with customized logic to handle the breaking changes between the [Java 7|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html] and [Java 8|https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html] pattern strings. Below are the customized rules:

||Pattern||Java 7||Java 8||
|u|Day number of week (1 = Monday, ..., 7 = Sunday)|Year (unlike y, u accepts a negative value to represent BC, while y must be combined with G to do the same)|

Example:
||SQL||2.4||3.0||
|to_timestamp('7', 'u')|1970-01-04 00:00:00|0007-01-01 00:00:00|
|to_timestamp('7', 'uu')|1970-01-04 00:00:00|null|
|to_timestamp('700', 'uuu')|null|0700-01-01 00:00:00|
|to_timestamp('2000', 'uuuu')|null|2000-01-01 00:00:00|

Rule: substitute ‘u’ with ‘e’ and use the Java 8 parser to parse the string. If it is parsable, return the result; otherwise, fall back to ‘u’ and use the legacy Java 7 parser. If the legacy parser succeeds, throw an exception asking users to change the pattern string or to turn on the legacy mode; otherwise, return NULL as Spark 2.4 does.

||Pattern||Java 7||Java 8||
|z|General time zone, which also accepts [RFC 822 time zones|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html#rfc822timezone]|Only accepts time-zone names, e.g. Pacific Standard Time; PST|

Example:
||SQL||2.4||3.0||
|to_timestamp('2020 +0800', 'yyyy z')|2019-12-31 16:00:00|null|

Rule: the semantics of ‘z’ differ between Java 7 and Java 8; Spark 3.0 follows the Java 8 semantics. Use the Java 8 parser to parse the string. If it is parsable, return the result; otherwise, use the legacy Java 7 parser. If the legacy parser succeeds, throw an exception asking users to change the pattern string or to turn on the legacy mode; otherwise, return NULL as Spark 2.4 does.
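The two customized rules above come from concrete semantic differences between `SimpleDateFormat` (the Java 7 API) and `DateTimeFormatter` (the Java 8 API), which can be observed directly. The following is a standalone demonstration, independent of Spark; the helper method names are invented for this sketch.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.temporal.ChronoField;

public class PatternSemantics {
    // Java 8 API: 'u' is the (proleptic) year, so "7" parses to year 7.
    static int java8YearFor(String text) {
        return DateTimeFormatter.ofPattern("u").parse(text).get(ChronoField.YEAR);
    }

    // Java 7 API: 'u' is the day number of the week (1 = Monday ... 7 = Sunday),
    // so "7" is a valid input on its own.
    static boolean java7AcceptsDayNumber(String text) {
        try { new SimpleDateFormat("u").parse(text); return true; }
        catch (ParseException e) { return false; }
    }

    // Java 7 API: 'z' (general time zone) also accepts RFC 822 offsets like "+0800".
    static boolean java7ZoneAccepts(String text) {
        try { new SimpleDateFormat("yyyy z").parse(text); return true; }
        catch (ParseException e) { return false; }
    }

    // Java 8 API: 'z' only matches time-zone names, so "+0800" is rejected.
    static boolean java8ZoneAccepts(String text) {
        try { DateTimeFormatter.ofPattern("yyyy z").parse(text); return true; }
        catch (DateTimeParseException e) { return false; }
    }

    public static void main(String[] args) {
        System.out.println("Java 8 'u' parses \"7\" as year " + java8YearFor("7"));
        System.out.println("Java 7 'u' accepts \"7\" as a day number: " + java7AcceptsDayNumber("7"));
        System.out.println("Java 7 'z' accepts \"2020 +0800\": " + java7ZoneAccepts("2020 +0800"));
        System.out.println("Java 8 'z' accepts \"2020 +0800\": " + java8ZoneAccepts("2020 +0800"));
    }
}
```

This is exactly why `to_timestamp('2020 +0800', 'yyyy z')` returns a result in 2.4 but null in 3.0, and why 'u' patterns flip between day-of-week and year semantics.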
> Backward Compatibility for Parsing Datetime
> -------------------------------------------
>
> Key: SPARK-31030
> URL: https://issues.apache.org/jira/browse/SPARK-31030
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Yuanjian Li
> Priority: Major
--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org