[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...

2018-12-08 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/23196#discussion_r24119
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -33,6 +33,8 @@ displayTitle: Spark SQL Upgrading Guide
 
   - Spark applications which are built with Spark version 2.4 and prior, 
and call methods of `UserDefinedFunction`, need to be re-compiled with Spark 
3.0, as they are not binary compatible with Spark 3.0.
 
+  - Since Spark 3.0, JSON datasource uses java.time API for parsing and 
generating JSON content. New formatting implementation supports date/timestamp 
patterns conformed to ISO 8601. To switch back to the implementation used in 
Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`.
--- End diff --

I added an example showing where the behavior differs, and updated the migration guide.


---




[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...

2018-12-08 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/23196#discussion_r239998126
  
--- Diff: 
sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala
 ---
@@ -49,8 +49,8 @@ class HiveCompatibilitySuite extends HiveQueryFileTest 
with BeforeAndAfter {
   override def beforeAll() {
 super.beforeAll()
 TestHive.setCacheTables(true)
-// Timezone is fixed to America/Los_Angeles for those timezone 
sensitive tests (timestamp_*)
-TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
+// Timezone is fixed to GMT for those timezone sensitive tests 
(timestamp_*)
--- End diff --

Tests passed with the new parser. I reverted all settings for `HiveCompatibilitySuite`.


---




[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...

2018-12-05 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23196#discussion_r239068840
  
--- Diff: 
sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala
 ---
@@ -49,8 +49,8 @@ class HiveCompatibilitySuite extends HiveQueryFileTest 
with BeforeAndAfter {
   override def beforeAll() {
 super.beforeAll()
 TestHive.setCacheTables(true)
-// Timezone is fixed to America/Los_Angeles for those timezone 
sensitive tests (timestamp_*)
-TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
+// Timezone is fixed to GMT for those timezone sensitive tests 
(timestamp_*)
--- End diff --

I think consistency is indeed a problem, but why disable the new parser, 
rather than make this consistent? I haven't looked into whether there's a good 
reason they behave differently but suspect not.


---




[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...

2018-12-05 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/23196#discussion_r239010321
  
--- Diff: 
sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala
 ---
@@ -49,8 +49,8 @@ class HiveCompatibilitySuite extends HiveQueryFileTest 
with BeforeAndAfter {
   override def beforeAll() {
 super.beforeAll()
 TestHive.setCacheTables(true)
-// Timezone is fixed to America/Los_Angeles for those timezone 
sensitive tests (timestamp_*)
-TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
+// Timezone is fixed to GMT for those timezone sensitive tests 
(timestamp_*)
--- End diff --

Our current approach to converting dates is inconsistent in a few places, for example:
- `UTF8String` -> `num days` uses a hardcoded `GMT` and ignores the SQL config: https://github.com/apache/spark/blob/f982ca07e80074bdc1e3b742c5e21cf368e4ede2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L493
- `String` -> `java.util.Date` ignores Spark's time zone settings and uses the system time zone: https://github.com/apache/spark/blob/f982ca07e80074bdc1e3b742c5e21cf368e4ede2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L186
- In many places, even when a function accepts a timeZone parameter, it is not passed, and the default time zone is used (**not from the config but from TimeZone.getDefault()** — see the sketch below). For example: https://github.com/apache/spark/blob/36edbac1c8337a4719f90e4abd58d38738b2e1fb/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala#L187
- Casting to the date type depends on the type of the argument: if it is `TimestampType`, the expression's time zone is used, otherwise `GMT`: https://github.com/apache/spark/blob/d03e0af80d7659f12821cc2442efaeaee94d3985/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L403-L410

I really do think we should disable the new parser/formatter outside of the CSV/JSON datasources, because it is hard to guarantee consistent behavior in combination with other date/timestamp functions. @srowen @gatorsmile @HyukjinKwon WDYT?
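
For reference, a minimal sketch of the mismatch between the SQL config and the JVM default time zone that the code paths above rely on (runnable in spark-shell, where `spark` is the predefined SparkSession):

```scala
// In spark-shell:
import java.util.TimeZone

// The session time zone is a SQL config...
spark.conf.set("spark.sql.session.timeZone", "UTC")
println(spark.conf.get("spark.sql.session.timeZone")) // UTC

// ...but code paths that call TimeZone.getDefault() bypass it and see
// whatever zone the JVM was started with, e.g. America/Los_Angeles.
println(TimeZone.getDefault.getID) // JVM default, unaffected by the config
```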
 


---




[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...

2018-12-04 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/23196#discussion_r238853314
  
--- Diff: 
sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala
 ---
@@ -49,8 +49,8 @@ class HiveCompatibilitySuite extends HiveQueryFileTest 
with BeforeAndAfter {
   override def beforeAll() {
 super.beforeAll()
 TestHive.setCacheTables(true)
-// Timezone is fixed to America/Los_Angeles for those timezone 
sensitive tests (timestamp_*)
-TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
+// Timezone is fixed to GMT for those timezone sensitive tests 
(timestamp_*)
--- End diff --

While porting to the new parser/formatter, I faced at least two problems:
- The time zone from the SQL config is not taken into account during parsing at all. The functions involved take the default time zone from the JVM settings. It could be fixed with `TimeZone.setDefault` or by using absolute values.
- Round-tripping a date string to `DateType` and back to a string can give a different string, because `DateType` stores only days since the epoch (as an `Int`, in `UTC`), and that representation loses the time zone offset. So exact matching is impossible due to the lack of information. (By contrast, the round trip for `TimestampType` works perfectly.) This is what happens with these changes: previously it worked because the specified time zone was not used at all (it did not affect the number of days when converting a string to `DateType`). With the new parser/formatter it starts to matter, and I had to change the time zone to `GMT` to eliminate the problem of losing time zone offsets (the offset is zero for `GMT`). See the sketch below.
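
A small java.time sketch of the round-trip problem (the zone `Asia/Tokyo` is just an illustrative choice of a session time zone east of UTC):

```scala
import java.time.{LocalDate, ZoneId}

// DateType keeps only days since the epoch in UTC, so a date parsed
// in a zone east of UTC maps to the previous UTC day: the zone offset
// is lost and the original string cannot be reproduced exactly.
val zone = ZoneId.of("Asia/Tokyo") // UTC+9
val instant = LocalDate.parse("2018-12-04").atStartOfDay(zone).toInstant
val utcDate = instant.atZone(ZoneId.of("UTC")).toLocalDate
println(utcDate) // 2018-12-03, not 2018-12-04
```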


---




[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...

2018-12-03 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23196#discussion_r238496050
  
--- Diff: 
sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala
 ---
@@ -49,8 +49,8 @@ class HiveCompatibilitySuite extends HiveQueryFileTest 
with BeforeAndAfter {
   override def beforeAll() {
 super.beforeAll()
 TestHive.setCacheTables(true)
-// Timezone is fixed to America/Los_Angeles for those timezone 
sensitive tests (timestamp_*)
-TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
+// Timezone is fixed to GMT for those timezone sensitive tests 
(timestamp_*)
--- End diff --

@MaxGekk, BTW, why does this have to be GMT?


---




[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...

2018-12-03 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23196#discussion_r238495344
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -33,6 +33,8 @@ displayTitle: Spark SQL Upgrading Guide
 
   - Spark applications which are built with Spark version 2.4 and prior, 
and call methods of `UserDefinedFunction`, need to be re-compiled with Spark 
3.0, as they are not binary compatible with Spark 3.0.
 
+  - Since Spark 3.0, JSON datasource uses java.time API for parsing and 
generating JSON content. New formatting implementation supports date/timestamp 
patterns conformed to ISO 8601. To switch back to the implementation used in 
Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`.
--- End diff --

I think we can add an example that shows the diff. IIRC there is a difference regarding exact vs. non-exact matching, as in the sketch below.
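
For instance, a minimal sketch of that difference (the input string is illustrative):

```scala
import java.text.SimpleDateFormat
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Legacy parser: SimpleDateFormat stops at the first character it
// cannot parse and silently ignores the trailing text.
val legacy = new SimpleDateFormat("yyyy-MM-dd")
println(legacy.parse("2018-12-02T11:22:33")) // parses as 2018-12-02

// New parser: DateTimeFormatter requires the whole input to match the
// pattern and throws DateTimeParseException on unparsed trailing text.
val modern = DateTimeFormatter.ofPattern("yyyy-MM-dd")
println(LocalDate.parse("2018-12-02T11:22:33", modern)) // throws
```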


---




[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...

2018-12-02 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/23196#discussion_r238116870
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -33,6 +33,8 @@ displayTitle: Spark SQL Upgrading Guide
 
   - Spark applications which are built with Spark version 2.4 and prior, 
and call methods of `UserDefinedFunction`, need to be re-compiled with Spark 
3.0, as they are not binary compatible with Spark 3.0.
 
+  - Since Spark 3.0, JSON datasource uses java.time API for parsing and 
generating JSON content. New formatting implementation supports date/timestamp 
patterns conformed to ISO 8601. To switch back to the implementation used in 
Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`.
--- End diff --

@gatorsmile What would you recommend to improve the text? I can add the links above so that a user can figure out what the difference is in their particular case. Our tests don't show any difference with our default timestamp/date patterns, but a user may use something more specific and face a behaviour change. One such difference is sketched below.
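
For example, the same pattern letter can mean different things in the two APIs. A sketch:

```scala
import java.text.SimpleDateFormat
import java.time.LocalDate
import java.time.format.DateTimeFormatter

val date = new SimpleDateFormat("yyyy-MM-dd").parse("2018-12-02")
// 'u' is the day-of-week number in SimpleDateFormat (7 = Sunday)...
println(new SimpleDateFormat("u").format(date)) // 7
// ...but the year in DateTimeFormatter.
println(LocalDate.parse("2018-12-02").format(DateTimeFormatter.ofPattern("u"))) // 2018
```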


---




[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...

2018-12-02 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/23196#discussion_r238111358
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -33,6 +33,8 @@ displayTitle: Spark SQL Upgrading Guide
 
   - Spark applications which are built with Spark version 2.4 and prior, 
and call methods of `UserDefinedFunction`, need to be re-compiled with Spark 
3.0, as they are not binary compatible with Spark 3.0.
 
+  - Since Spark 3.0, JSON datasource uses java.time API for parsing and 
generating JSON content. New formatting implementation supports date/timestamp 
patterns conformed to ISO 8601. To switch back to the implementation used in 
Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`.
--- End diff --

The new implementation and the old one have slightly different pattern formats. See https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html and https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html . The two Java APIs can also behave differently for the same pattern. Besides that, the new one can parse timestamps with microsecond precision as a consequence of using the Java 8 java.time API. See the sketch below.
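
A sketch of the precision difference (the legacy output also depends on the JVM default time zone):

```scala
import java.text.SimpleDateFormat
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// java.time: 'S' is fraction-of-second, so microseconds parse exactly.
val dtf = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSSSS")
println(LocalDateTime.parse("2018-12-02T11:22:33.123456", dtf))
// 2018-12-02T11:22:33.123456

// SimpleDateFormat: 'S' is milliseconds, so 123456 is read as 123456 ms
// and silently shifts the timestamp by about two minutes.
val sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSSSS")
println(sdf.parse("2018-12-02T11:22:33.123456"))
// e.g. Sun Dec 02 11:24:36 ... 2018
```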


---




[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...

2018-12-02 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/23196#discussion_r238110375
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -33,6 +33,8 @@ displayTitle: Spark SQL Upgrading Guide
 
   - Spark applications which are built with Spark version 2.4 and prior, 
and call methods of `UserDefinedFunction`, need to be re-compiled with Spark 
3.0, as they are not binary compatible with Spark 3.0.
 
+  - Since Spark 3.0, JSON datasource uses java.time API for parsing and 
generating JSON content. New formatting implementation supports date/timestamp 
patterns conformed to ISO 8601. To switch back to the implementation used in 
Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`.
--- End diff --

The impact is not clearly documented. 


---




[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...

2018-12-02 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/23196#discussion_r238110416
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -33,6 +33,8 @@ displayTitle: Spark SQL Upgrading Guide
 
   - Spark applications which are built with Spark version 2.4 and prior, 
and call methods of `UserDefinedFunction`, need to be re-compiled with Spark 
3.0, as they are not binary compatible with Spark 3.0.
 
+  - Since Spark 3.0, JSON datasource uses java.time API for parsing and 
generating JSON content. New formatting implementation supports date/timestamp 
patterns conformed to ISO 8601. To switch back to the implementation used in 
Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`.
--- End diff --

What are the behavior changes? 


---




[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...

2018-12-01 Thread MaxGekk
GitHub user MaxGekk opened a pull request:

https://github.com/apache/spark/pull/23196

[SPARK-26243][SQL] Use java.time API for parsing timestamps and dates from 
JSON

## What changes were proposed in this pull request?

In this PR, I propose switching to the **java.time API** for parsing timestamps and dates from JSON input with microsecond precision. The SQL config `spark.sql.legacy.timeParser.enabled` allows switching back to the previous behavior, which uses `java.text.SimpleDateFormat`/`FastDateFormat` for parsing/generating timestamps/dates. A usage sketch follows below.
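
A usage sketch of the new flag (the read path and the `timestampFormat` value are illustrative; the config name is the one introduced by this PR; runnable in spark-shell, where `spark` is the predefined SparkSession):

```scala
// In spark-shell:

// Default after this PR: the java.time-based parser (microsecond precision).
spark.read
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSSSS")
  .json("/tmp/events.json") // illustrative path
  .show()

// Opt back into the SimpleDateFormat/FastDateFormat implementation:
spark.conf.set("spark.sql.legacy.timeParser.enabled", "true")
```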

## How was this patch tested?

It was tested by `JsonExpressionsSuite`, `JsonFunctionsSuite` and 
`JsonSuite`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MaxGekk/spark-1 json-time-parser

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23196.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23196


commit fb10b91502b67b98f2904a06b017a6e56dd6e39f
Author: Maxim Gekk 
Date:   2018-12-01T11:01:47Z

Adding DateTimeFormatter

commit a9b39eccfdc3bd9da8fe52eda24ed240588b282f
Author: Maxim Gekk 
Date:   2018-12-01T11:15:31Z

Support DateTimeFormatter by JacksonParser and JacksonGenerator

commit ff589f505f7aaa8c94ce304aaf77792505998c16
Author: Maxim Gekk 
Date:   2018-12-01T12:31:43Z

Make test independent from current time zone

commit 4646dededae832185a35a85244baab6507d28f0d
Author: Maxim Gekk 
Date:   2018-12-01T18:19:31Z

Fix a test by new fallback




---
