[spark] branch branch-3.0 updated: [SPARK-31439][SQL] Fix perf regression of fromJavaDate
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new e415a42  [SPARK-31439][SQL] Fix perf regression of fromJavaDate
e415a42 is described below

commit e415a428bba8da40fe79523a8ad1be44d4e50d82
Author: Max Gekk
AuthorDate: Tue Apr 14 14:44:00 2020 +

    [SPARK-31439][SQL] Fix perf regression of fromJavaDate

### What changes were proposed in this pull request?

In this PR, I propose to reuse the optimized implementation of the days rebase function `rebaseJulianToGregorianDays()`, introduced by PR #28067, in the conversion of `java.sql.Date` values to Catalyst's `DATE` values. The function `fromJavaDate` in `DateTimeUtils` was rewritten by taking the implementation from Spark 2.4 and rebasing the final result via `rebaseJulianToGregorianDays()`. I also updated `DateTimeBenchmark` and added a benchmark for conversion from `java.sql.Date`.

### Why are the changes needed?

The PR fixes the regression in parallelizing a collection of `java.sql.Date` values, and improves performance of converting external values to Catalyst's `DATE` values:
- 4x faster than the master branch
- 30% faster than Spark 2.4.6-SNAPSHOT

Spark 2.4.6-SNAPSHOT:
```
To/from java.sql.Timestamp:       Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
From java.sql.Date                          614            655          43         8.1         122.8       1.0X
```
Before the changes:
```
To/from java.sql.Timestamp:       Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
From java.sql.Date                         1154           1206          46         4.3         230.9       1.0X
```
After:
```
To/from java.sql.Timestamp:       Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
From java.sql.Date                          427            434           7        11.7          85.3       1.0X
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- By existing test suites, in particular `DateTimeUtilsSuite`, `RebaseDateTimeSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`.
- By re-running `DateTimeBenchmark` in the environment:

| Item | Description |
| ---- | ----------- |
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |

Closes #28205 from MaxGekk/optimize-fromJavaDate.

Authored-by: Max Gekk
Signed-off-by: Wenchen Fan
(cherry picked from commit 2c5d489679ba3814973680d65853877664bcd931)
Signed-off-by: Wenchen Fan
---
 .../spark/sql/catalyst/util/DateTimeUtils.scala    |  14 +-
 .../benchmarks/DateTimeBenchmark-jdk11-results.txt | 221 +++--
 sql/core/benchmarks/DateTimeBenchmark-results.txt  | 221 +++--
 .../execution/benchmark/DateTimeBenchmark.scala    |   7 +-
 4 files changed, 232 insertions(+), 231 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
index 88ffd48..dede92f 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
@@ -103,16 +103,10 @@ object DateTimeUtils {
    * @return The number of days since epoch from java.sql.Date.
    */
   def fromJavaDate(date: Date): SQLDate = {
-    val era = if (date.before(julianCommonEraStart)) 0 else 1
-    val localDate = LocalDate
-      .of(date.getYear + 1900, date.getMonth + 1, 1)
-      .`with`(ChronoField.ERA, era)
-      // Add days separately to convert dates existed in Julian calendar but not
-      // in Proleptic Gregorian calendar. For example, 1000-02-29 is valid date
-      // in Julian calendar because 1000 is a leap year but 1000 is not a leap
-      // year in Proleptic Gregorian calendar. And 1000-02-29 doesn't exist in it.
-      .plusDays(date.getDate - 1) // Returns the next valid date after `date.getDate - 1` days
-    localDateToDays(localD
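For context on the technique the commit describes: the fast Spark 2.4-style path interprets the `java.sql.Date`'s epoch millis in the default time zone and floor-divides by millis-per-day to get a day count in the hybrid Julian calendar, which `rebaseJulianToGregorianDays()` then shifts onto Spark 3's Proleptic Gregorian day numbering. The sketch below is a minimal Java illustration of the millis-to-days step only; `rebaseJulianToGregorianDays` is Spark-internal (in `RebaseDateTime`) and is not reproduced here. For modern dates (after the 1582 Gregorian cutover) the rebase is a no-op, which the `main` method demonstrates by comparing against `java.time.LocalDate`.

```java
import java.sql.Date;
import java.time.LocalDate;
import java.util.TimeZone;
import java.util.concurrent.TimeUnit;

public class FromJavaDateSketch {
    static final long MILLIS_PER_DAY = TimeUnit.DAYS.toMillis(1);

    // Millis-to-days step of the optimized fromJavaDate: shift the UTC
    // epoch millis by the default zone's offset at that instant, then
    // floor-divide to get days since 1970-01-01 in the hybrid Julian
    // calendar. Spark would pass this result to rebaseJulianToGregorianDays.
    static int julianDays(Date date) {
        long millisUtc = date.getTime();
        long millisLocal = millisUtc + TimeZone.getDefault().getOffset(millisUtc);
        return Math.toIntExact(Math.floorDiv(millisLocal, MILLIS_PER_DAY));
    }

    public static void main(String[] args) {
        // For a modern date, the hybrid Julian day count already equals the
        // Proleptic Gregorian one, so the rebase step would change nothing.
        int days = julianDays(Date.valueOf("2020-04-14"));
        int gregorianDays = (int) LocalDate.of(2020, 4, 14).toEpochDay();
        System.out.println(days == gregorianDays); // prints "true"
    }
}
```

The floor division (rather than truncation) matters for pre-1970 dates, whose local millis are negative; this mirrors why the commit message reports the rewrite as both correct and cheap compared with building a `LocalDate` per value.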