[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23196#discussion_r24119 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -33,6 +33,8 @@ displayTitle: Spark SQL Upgrading Guide - Spark applications which are built with Spark version 2.4 and prior, and call methods of `UserDefinedFunction`, need to be re-compiled with Spark 3.0, as they are not binary compatible with Spark 3.0. + - Since Spark 3.0, JSON datasource uses java.time API for parsing and generating JSON content. New formatting implementation supports date/timestamp patterns conformed to ISO 8601. To switch back to the implementation used in Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`. --- End diff -- I added an example when there is a difference, and updated the migration guide. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23196#discussion_r239998126 --- Diff: sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala --- @@ -49,8 +49,8 @@ class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter { override def beforeAll() { super.beforeAll() TestHive.setCacheTables(true) -// Timezone is fixed to America/Los_Angeles for those timezone sensitive tests (timestamp_*) -TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles")) +// Timezone is fixed to GMT for those timezone sensitive tests (timestamp_*) --- End diff -- Tests passed on new parser. I reverted back all settings for `HiveCompatibilitySuite` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23196#discussion_r239068840 --- Diff: sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala --- @@ -49,8 +49,8 @@ class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter { override def beforeAll() { super.beforeAll() TestHive.setCacheTables(true) -// Timezone is fixed to America/Los_Angeles for those timezone sensitive tests (timestamp_*) -TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles")) +// Timezone is fixed to GMT for those timezone sensitive tests (timestamp_*) --- End diff -- I think consistency is indeed a problem, but why disable the new parser, rather than make this consistent? I haven't looked into whether there's a good reason they behave differently but suspect not. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23196#discussion_r239010321 --- Diff: sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala --- @@ -49,8 +49,8 @@ class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter { override def beforeAll() { super.beforeAll() TestHive.setCacheTables(true) -// Timezone is fixed to America/Los_Angeles for those timezone sensitive tests (timestamp_*) -TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles")) +// Timezone is fixed to GMT for those timezone sensitive tests (timestamp_*) --- End diff -- Our current approach for converting dates is inconsistent in a few places, for example: - `UTF8String` -> `num days` uses hardcoded `GMT` and ignores SQL config: https://github.com/apache/spark/blob/f982ca07e80074bdc1e3b742c5e21cf368e4ede2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L493 - `String` -> `java.util.Date` ignores Spark's time zone settings, and uses system time zone: https://github.com/apache/spark/blob/f982ca07e80074bdc1e3b742c5e21cf368e4ede2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L186 - In many places even a function accepts timeZone parameter, it is not passed (used default time zone - **not from config but from TimeZone.getDefault()**). For example: https://github.com/apache/spark/blob/36edbac1c8337a4719f90e4abd58d38738b2e1fb/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala#L187 . - Casting to the date type depends on type of argument, if it is `TimestampType`, expression-wise timezone is used, otherwise `GMT`: https://github.com/apache/spark/blob/d03e0af80d7659f12821cc2442efaeaee94d3985/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L403-L410 I do really think to disable new parser/formatter outside of CSV/JSON datasources because it is hard to guarantee consistent behavior in combination with other date/timestamp functions. @srowen @gatorsmile @HyukjinKwon WDYT? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23196#discussion_r238853314 --- Diff: sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala --- @@ -49,8 +49,8 @@ class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter { override def beforeAll() { super.beforeAll() TestHive.setCacheTables(true) -// Timezone is fixed to America/Los_Angeles for those timezone sensitive tests (timestamp_*) -TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles")) +// Timezone is fixed to GMT for those timezone sensitive tests (timestamp_*) --- End diff -- While porting on new parser/formatter, I faced to 2problems at least: - The time zone from SQL config is not taken into account on parsing at all. Basically used functions take default time zone from jvm settings. It could be fixed by `TimeZone.setDefault` or using absolute values. - Round trip in parsing a date to `DateType` and back to a date as a string could give different string because `DateType` stores only days (as `Int`) since epoch (in `UTC`). And such representation loses time zone offset. So, exact matching is impossible due to lack of information. For example, roundtrip converting for `TimestampType` works perfectly. This is the case for the changes. Previously, it worked because the specified time zone is not used at all (did not impact on number of days while converting a string to `DateType`). With new parser/formatter, it becomes matter, and I have to change time zone to `GMT` to eliminate the problem of loosing timezone offsets (it is zero for `GMT`). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23196#discussion_r238496050 --- Diff: sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala --- @@ -49,8 +49,8 @@ class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter { override def beforeAll() { super.beforeAll() TestHive.setCacheTables(true) -// Timezone is fixed to America/Los_Angeles for those timezone sensitive tests (timestamp_*) -TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles")) +// Timezone is fixed to GMT for those timezone sensitive tests (timestamp_*) --- End diff -- @MaxGekk, BTW, why does this have to be GMT? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23196#discussion_r238495344 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -33,6 +33,8 @@ displayTitle: Spark SQL Upgrading Guide - Spark applications which are built with Spark version 2.4 and prior, and call methods of `UserDefinedFunction`, need to be re-compiled with Spark 3.0, as they are not binary compatible with Spark 3.0. + - Since Spark 3.0, JSON datasource uses java.time API for parsing and generating JSON content. New formatting implementation supports date/timestamp patterns conformed to ISO 8601. To switch back to the implementation used in Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`. --- End diff -- I think we can add an example that shows the diff. IIRC it has a difference about exact match or non-exact match. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23196#discussion_r238116870 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -33,6 +33,8 @@ displayTitle: Spark SQL Upgrading Guide - Spark applications which are built with Spark version 2.4 and prior, and call methods of `UserDefinedFunction`, need to be re-compiled with Spark 3.0, as they are not binary compatible with Spark 3.0. + - Since Spark 3.0, JSON datasource uses java.time API for parsing and generating JSON content. New formatting implementation supports date/timestamp patterns conformed to ISO 8601. To switch back to the implementation used in Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`. --- End diff -- @gatorsmile What would I you recommend to improve the text? I can add the links above, so, an user can figure out what is difference in their particular case. Our tests don't show any difference on our default timestamp/date patterns but the user can use something more specific and face to behaviour change. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23196#discussion_r238111358 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -33,6 +33,8 @@ displayTitle: Spark SQL Upgrading Guide - Spark applications which are built with Spark version 2.4 and prior, and call methods of `UserDefinedFunction`, need to be re-compiled with Spark 3.0, as they are not binary compatible with Spark 3.0. + - Since Spark 3.0, JSON datasource uses java.time API for parsing and generating JSON content. New formatting implementation supports date/timestamp patterns conformed to ISO 8601. To switch back to the implementation used in Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`. --- End diff -- New implementation and old one have slightly different pattern formats. See https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html and https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html . And two Java API can have different behaviours. Besides of that, new one can parse timestamps with microseconds precision as a consequence of using Java 8 java.time API. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/23196#discussion_r238110375 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -33,6 +33,8 @@ displayTitle: Spark SQL Upgrading Guide - Spark applications which are built with Spark version 2.4 and prior, and call methods of `UserDefinedFunction`, need to be re-compiled with Spark 3.0, as they are not binary compatible with Spark 3.0. + - Since Spark 3.0, JSON datasource uses java.time API for parsing and generating JSON content. New formatting implementation supports date/timestamp patterns conformed to ISO 8601. To switch back to the implementation used in Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`. --- End diff -- The impact is not clearly documented. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/23196#discussion_r238110416 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -33,6 +33,8 @@ displayTitle: Spark SQL Upgrading Guide - Spark applications which are built with Spark version 2.4 and prior, and call methods of `UserDefinedFunction`, need to be re-compiled with Spark 3.0, as they are not binary compatible with Spark 3.0. + - Since Spark 3.0, JSON datasource uses java.time API for parsing and generating JSON content. New formatting implementation supports date/timestamp patterns conformed to ISO 8601. To switch back to the implementation used in Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`. --- End diff -- What is the behavior changes? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/23196 [SPARK-26243][SQL] Use java.time API for parsing timestamps and dates from JSON ## What changes were proposed in this pull request? In the PR, I propose to switch on **java.time API** for parsing timestamps and dates from JSON inputs with microseconds precision. The SQL config `spark.sql.legacy.timeParser.enabled` allow to switch back to previous behavior with using `java.text.SimpleDateFormat`/`FastDateFormat` for parsing/generating timestamps/dates. ## How was this patch tested? It was tested by `JsonExpressionsSuite`, `JsonFunctionsSuite` and `JsonSuite`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/MaxGekk/spark-1 json-time-parser Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23196.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23196 commit fb10b91502b67b98f2904a06b017a6e56dd6e39f Author: Maxim Gekk Date: 2018-12-01T11:01:47Z Adding DateTimeFormatter commit a9b39eccfdc3bd9da8fe52eda24ed240588b282f Author: Maxim Gekk Date: 2018-12-01T11:15:31Z Support DateTimeFormatter by JacksonParser and JacksonGenerator commit ff589f505f7aaa8c94ce304aaf77792505998c16 Author: Maxim Gekk Date: 2018-12-01T12:31:43Z Make test independent from current time zone commit 4646dededae832185a35a85244baab6507d28f0d Author: Maxim Gekk Date: 2018-12-01T18:19:31Z Fix a test by new fallback --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org