[spark] branch branch-3.0 updated: [SPARK-31439][SQL] Fix perf regression of fromJavaDate

2020-04-14 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
 new e415a42  [SPARK-31439][SQL] Fix perf regression of fromJavaDate
e415a42 is described below

commit e415a428bba8da40fe79523a8ad1be44d4e50d82
Author: Max Gekk 
AuthorDate: Tue Apr 14 14:44:00 2020 +

[SPARK-31439][SQL] Fix perf regression of fromJavaDate

### What changes were proposed in this pull request?
In the PR, I propose to re-use optimized implementation of days rebase 
function `rebaseJulianToGregorianDays()` introduced by the PR #28067 in 
conversion of `java.sql.Date` values to Catalyst's `DATE` values. The function 
`fromJavaDate` in `DateTimeUtils` was re-written by taking the implementation 
from Spark 2.4, and by rebasing the final results via 
`rebaseJulianToGregorianDays()`.

Also I updated `DateTimeBenchmark`, and added a benchmark for conversion 
from `java.sql.Date`.

### Why are the changes needed?
The PR fixes the regression of parallelizing a collection of 
`java.sql.Date` values, and improves performance of converting external values 
to Catalyst's `DATE` values:
- x4 on the master branch
- 30% against Spark 2.4.6-SNAPSHOT

Spark 2.4.6-SNAPSHOT:
```
To/from java.sql.Timestamp:   Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative


From java.sql.Date  614655  
43  8.1 122.8   1.0X
```

Before the changes:
```
To/from java.sql.Timestamp:   Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative


From java.sql.Date 1154   1206  
46  4.3 230.9   1.0X
```

After:
```
To/from java.sql.Timestamp:   Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative


From java.sql.Date  427434  
 7 11.7  85.3   1.0X
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- By existing tests suites, in particular, `DateTimeUtilsSuite`, 
`RebaseDateTimeSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`.
- Re-run `DateTimeBenchmark` in the environment:

| Item | Description |
|  | |
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 
(ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 
11.0.6+10 |

Closes #28205 from MaxGekk/optimize-fromJavaDate.

Authored-by: Max Gekk 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 2c5d489679ba3814973680d65853877664bcd931)
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/util/DateTimeUtils.scala|  14 +-
 .../benchmarks/DateTimeBenchmark-jdk11-results.txt | 221 +++--
 sql/core/benchmarks/DateTimeBenchmark-results.txt  | 221 +++--
 .../execution/benchmark/DateTimeBenchmark.scala|   7 +-
 4 files changed, 232 insertions(+), 231 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
index 88ffd48..dede92f 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
@@ -103,16 +103,10 @@ object DateTimeUtils {
* @return The number of days since epoch from java.sql.Date.
*/
   def fromJavaDate(date: Date): SQLDate = {
-val era = if (date.before(julianCommonEraStart)) 0 else 1
-val localDate = LocalDate
-  .of(date.getYear + 1900, date.getMonth + 1, 1)
-  .`with`(ChronoField.ERA, era)
-  // Add days separately to convert dates existed in Julian calendar but 
not
-  // in Proleptic Gregorian calendar. For example, 1000-02-29 is valid date
-  // in Julian calendar because 1000 is a leap year but 1000 is not a leap
-  // year in Proleptic Gregorian calendar. And 1000-02-29 doesn't exist in 
it.
-  .plusDays(date.getDate - 1) // Returns the next valid date after 
`date.getDate - 1` days
-localDateToDays(localD

[spark] branch branch-3.0 updated: [SPARK-31439][SQL] Fix perf regression of fromJavaDate

2020-04-14 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
 new e415a42  [SPARK-31439][SQL] Fix perf regression of fromJavaDate
e415a42 is described below

commit e415a428bba8da40fe79523a8ad1be44d4e50d82
Author: Max Gekk 
AuthorDate: Tue Apr 14 14:44:00 2020 +

[SPARK-31439][SQL] Fix perf regression of fromJavaDate

### What changes were proposed in this pull request?
In the PR, I propose to re-use optimized implementation of days rebase 
function `rebaseJulianToGregorianDays()` introduced by the PR #28067 in 
conversion of `java.sql.Date` values to Catalyst's `DATE` values. The function 
`fromJavaDate` in `DateTimeUtils` was re-written by taking the implementation 
from Spark 2.4, and by rebasing the final results via 
`rebaseJulianToGregorianDays()`.

Also I updated `DateTimeBenchmark`, and added a benchmark for conversion 
from `java.sql.Date`.

### Why are the changes needed?
The PR fixes the regression of parallelizing a collection of 
`java.sql.Date` values, and improves performance of converting external values 
to Catalyst's `DATE` values:
- x4 on the master branch
- 30% against Spark 2.4.6-SNAPSHOT

Spark 2.4.6-SNAPSHOT:
```
To/from java.sql.Timestamp:   Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative


From java.sql.Date  614655  
43  8.1 122.8   1.0X
```

Before the changes:
```
To/from java.sql.Timestamp:   Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative


From java.sql.Date 1154   1206  
46  4.3 230.9   1.0X
```

After:
```
To/from java.sql.Timestamp:   Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative


From java.sql.Date  427434  
 7 11.7  85.3   1.0X
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- By existing tests suites, in particular, `DateTimeUtilsSuite`, 
`RebaseDateTimeSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`.
- Re-run `DateTimeBenchmark` in the environment:

| Item | Description |
|  | |
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 
(ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 
11.0.6+10 |

Closes #28205 from MaxGekk/optimize-fromJavaDate.

Authored-by: Max Gekk 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 2c5d489679ba3814973680d65853877664bcd931)
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/util/DateTimeUtils.scala|  14 +-
 .../benchmarks/DateTimeBenchmark-jdk11-results.txt | 221 +++--
 sql/core/benchmarks/DateTimeBenchmark-results.txt  | 221 +++--
 .../execution/benchmark/DateTimeBenchmark.scala|   7 +-
 4 files changed, 232 insertions(+), 231 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
index 88ffd48..dede92f 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
@@ -103,16 +103,10 @@ object DateTimeUtils {
* @return The number of days since epoch from java.sql.Date.
*/
   def fromJavaDate(date: Date): SQLDate = {
-val era = if (date.before(julianCommonEraStart)) 0 else 1
-val localDate = LocalDate
-  .of(date.getYear + 1900, date.getMonth + 1, 1)
-  .`with`(ChronoField.ERA, era)
-  // Add days separately to convert dates existed in Julian calendar but 
not
-  // in Proleptic Gregorian calendar. For example, 1000-02-29 is valid date
-  // in Julian calendar because 1000 is a leap year but 1000 is not a leap
-  // year in Proleptic Gregorian calendar. And 1000-02-29 doesn't exist in 
it.
-  .plusDays(date.getDate - 1) // Returns the next valid date after 
`date.getDate - 1` days
-localDateToDays(localD