[spark] branch master updated (98ec4a8 -> c76c31e)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 98ec4a8 [SPARK-31330][INFRA][FOLLOW-UP] Exclude 'ui' and 'UI.scala' in CORE and 'dev/.rat-excludes' in BUILD autolabeller add c76c31e [SPARK-31455][SQL] Fix rebasing of not-existed timestamps No new revisions were added by this update. Summary of changes: .../resources/gregorian-julian-rebase-micros.json | 2384 ++-- .../spark/sql/catalyst/util/RebaseDateTime.scala | 20 +- .../sql/catalyst/util/RebaseDateTimeSuite.scala| 37 +- .../execution/datasources/orc/OrcSourceSuite.scala |8 +- 4 files changed, 1244 insertions(+), 1205 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (92c1b24 -> 98ec4a8)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 92c1b24 [SPARK-31428][SQL][DOCS] Document Common Table Expression in SQL Reference add 98ec4a8 [SPARK-31330][INFRA][FOLLOW-UP] Exclude 'ui' and 'UI.scala' in CORE and 'dev/.rat-excludes' in BUILD autolabeller No new revisions were added by this update. Summary of changes: .github/autolabeler.yml | 4 1 file changed, 4 insertions(+) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.0 updated: [SPARK-31428][SQL][DOCS] Document Common Table Expression in SQL Reference
This is an automated email from the ASF dual-hosted git repository. yamamuro pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new 4476c85 [SPARK-31428][SQL][DOCS] Document Common Table Expression in SQL Reference 4476c85 is described below commit 4476c85775d231c8bb26399284c0baf4292bec7c Author: Huaxin Gao AuthorDate: Thu Apr 16 08:34:26 2020 +0900 [SPARK-31428][SQL][DOCS] Document Common Table Expression in SQL Reference ### What changes were proposed in this pull request? Document Common Table Expression in SQL Reference ### Why are the changes needed? Make SQL Reference complete ### Does this PR introduce any user-facing change? Yes https://user-images.githubusercontent.com/13592258/79100257-f61def00-7d1a-11ea-8402-17017059232e.png https://user-images.githubusercontent.com/13592258/79100260-f7e7b280-7d1a-11ea-9408-058c0851f0b6.png https://user-images.githubusercontent.com/13592258/79100262-fa4a0c80-7d1a-11ea-8862-eb1d8960296b.png Also link to Select page https://user-images.githubusercontent.com/13592258/79082246-217fea00-7cd9-11ea-8d96-1a69769d1e19.png ### How was this patch tested? Manually build and check Closes #28196 from huaxingao/cte. 
Authored-by: Huaxin Gao Signed-off-by: Takeshi Yamamuro (cherry picked from commit 92c1b246174948d0c1f4d0833e1ceac265b17be7) Signed-off-by: Takeshi Yamamuro --- docs/_data/menu-sql.yaml | 2 + docs/sql-ref-syntax-qry-select-cte.md | 109 +- docs/sql-ref-syntax-qry-select.md | 3 +- 3 files changed, 112 insertions(+), 2 deletions(-) diff --git a/docs/_data/menu-sql.yaml b/docs/_data/menu-sql.yaml index badb98d..7827a0f 100644 --- a/docs/_data/menu-sql.yaml +++ b/docs/_data/menu-sql.yaml @@ -166,6 +166,8 @@ url: sql-ref-syntax-qry-select-tvf.html - text: Inline Table url: sql-ref-syntax-qry-select-inline-table.html +- text: Common Table Expression + url: sql-ref-syntax-qry-select-cte.html - text: EXPLAIN url: sql-ref-syntax-qry-explain.html - text: Auxiliary Statements diff --git a/docs/sql-ref-syntax-qry-select-cte.md b/docs/sql-ref-syntax-qry-select-cte.md index 2bd7748..2146f8e 100644 --- a/docs/sql-ref-syntax-qry-select-cte.md +++ b/docs/sql-ref-syntax-qry-select-cte.md @@ -19,4 +19,111 @@ license: | limitations under the License. --- -**This page is under construction** +### Description + +A common table expression (CTE) defines a temporary result set that a user can reference possibly multiple times within the scope of a SQL statement. A CTE is used mainly in a SELECT statement. + +### Syntax + +{% highlight sql %} +WITH common_table_expression [ , ... ] +{% endhighlight %} + +While `common_table_expression` is defined as +{% highlight sql %} +expression_name [ ( column_name [ , ... ] ) ] [ AS ] ( [ common_table_expression ] query ) +{% endhighlight %} + +### Parameters + + + expression_name + +Specifies a name for the common table expression. + + + + query + +A SELECT statement. 
+ + + +### Examples + +{% highlight sql %} +-- CTE with multiple column aliases +WITH t(x, y) AS (SELECT 1, 2) +SELECT * FROM t WHERE x = 1 AND y = 2; + +---+---+ + | x| y| + +---+---+ + | 1| 2| + +---+---+ + +-- CTE in CTE definition +WITH t as ( +WITH t2 AS (SELECT 1) +SELECT * FROM t2 +) +SELECT * FROM t; + +---+ + | 1| + +---+ + | 1| + +---+ + +-- CTE in subquery +SELECT max(c) FROM ( +WITH t(c) AS (SELECT 1) +SELECT * FROM t +); + +--+ + |max(c)| + +--+ + | 1| + +--+ + +-- CTE in subquery expression +SELECT ( +WITH t AS (SELECT 1) +SELECT * FROM t +); + ++ + |scalarsubquery()| + ++ + | 1| + ++ + +-- CTE in CREATE VIEW statement +CREATE VIEW v AS +WITH t(a, b, c, d) AS (SELECT 1, 2, 3, 4) +SELECT * FROM t; +SELECT * FROM v; + +---+---+---+---+ + | a| b| c| d| + +---+---+---+---+ + | 1| 2| 3| 4| + +---+---+---+---+ + +-- If name conflict is detected in nested CTE, then AnalysisException is thrown by default. +-- SET spark.sql.legacy.ctePrecedencePolicy = CORRECTED (which is recommended), +-- inner CTE definitions take precedence over outer definitions. +SET spark.sql.legacy.ctePrecedencePolicy = CORRECTED; +WITH +t AS (SELECT 1), +t2 AS ( +WITH t AS (SELECT 2) +SELECT * FROM t +) +SELECT * FROM t2; + +---+ + | 2| + +---+ + | 2| + +---+ +{% endhighlight %} + +### Related Statements + + * [SELECT](sql-ref-syntax-qry-select.html) diff --git a/docs/sql-ref-syntax-qry-select.md b/docs/sql-ref-syntax-qry-select.md index 94f69d4..bc2cc02
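The `WITH` syntax documented in the diff above is standard SQL, so its basic shape can be sketched outside Spark too. Below is a minimal, self-contained illustration using Python's built-in `sqlite3` as a stand-in for `spark-sql` (the engine choice is an assumption for runnability; the CTE syntax itself is the one the new doc page describes):

```python
import sqlite3

# Minimal sketch of the documented WITH syntax, run through Python's
# built-in sqlite3 as a stand-in engine for spark-sql.
conn = sqlite3.connect(":memory:")

# CTE with multiple column aliases, as in the doc page's first example.
rows1 = conn.execute(
    "WITH t(x, y) AS (SELECT 1, 2) "
    "SELECT * FROM t WHERE x = 1 AND y = 2"
).fetchall()
print(rows1)  # [(1, 2)]

# Several CTEs in one WITH clause; a later CTE may reference an earlier one.
rows2 = conn.execute(
    "WITH t AS (SELECT 1 AS c), "
    "     t2 AS (SELECT c + 1 AS d FROM t) "
    "SELECT d FROM t2"
).fetchall()
print(rows2)  # [(2,)]
```

Note that Spark-specific behavior such as the `spark.sql.legacy.ctePrecedencePolicy` name-conflict handling shown in the doc examples does not carry over to other engines.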
[spark] branch master updated: [SPARK-31428][SQL][DOCS] Document Common Table Expression in SQL Reference
This is an automated email from the ASF dual-hosted git repository. yamamuro pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 92c1b24 [SPARK-31428][SQL][DOCS] Document Common Table Expression in SQL Reference 92c1b24 is described below commit 92c1b246174948d0c1f4d0833e1ceac265b17be7 Author: Huaxin Gao AuthorDate: Thu Apr 16 08:34:26 2020 +0900 [SPARK-31428][SQL][DOCS] Document Common Table Expression in SQL Reference ### What changes were proposed in this pull request? Document Common Table Expression in SQL Reference ### Why are the changes needed? Make SQL Reference complete ### Does this PR introduce any user-facing change? Yes https://user-images.githubusercontent.com/13592258/79100257-f61def00-7d1a-11ea-8402-17017059232e.png https://user-images.githubusercontent.com/13592258/79100260-f7e7b280-7d1a-11ea-9408-058c0851f0b6.png https://user-images.githubusercontent.com/13592258/79100262-fa4a0c80-7d1a-11ea-8862-eb1d8960296b.png Also link to Select page https://user-images.githubusercontent.com/13592258/79082246-217fea00-7cd9-11ea-8d96-1a69769d1e19.png ### How was this patch tested? Manually build and check Closes #28196 from huaxingao/cte. 
Authored-by: Huaxin Gao Signed-off-by: Takeshi Yamamuro --- docs/_data/menu-sql.yaml | 2 + docs/sql-ref-syntax-qry-select-cte.md | 109 +- docs/sql-ref-syntax-qry-select.md | 3 +- 3 files changed, 112 insertions(+), 2 deletions(-) diff --git a/docs/_data/menu-sql.yaml b/docs/_data/menu-sql.yaml index badb98d..7827a0f 100644 --- a/docs/_data/menu-sql.yaml +++ b/docs/_data/menu-sql.yaml @@ -166,6 +166,8 @@ url: sql-ref-syntax-qry-select-tvf.html - text: Inline Table url: sql-ref-syntax-qry-select-inline-table.html +- text: Common Table Expression + url: sql-ref-syntax-qry-select-cte.html - text: EXPLAIN url: sql-ref-syntax-qry-explain.html - text: Auxiliary Statements diff --git a/docs/sql-ref-syntax-qry-select-cte.md b/docs/sql-ref-syntax-qry-select-cte.md index 2bd7748..2146f8e 100644 --- a/docs/sql-ref-syntax-qry-select-cte.md +++ b/docs/sql-ref-syntax-qry-select-cte.md @@ -19,4 +19,111 @@ license: | limitations under the License. --- -**This page is under construction** +### Description + +A common table expression (CTE) defines a temporary result set that a user can reference possibly multiple times within the scope of a SQL statement. A CTE is used mainly in a SELECT statement. + +### Syntax + +{% highlight sql %} +WITH common_table_expression [ , ... ] +{% endhighlight %} + +While `common_table_expression` is defined as +{% highlight sql %} +expression_name [ ( column_name [ , ... ] ) ] [ AS ] ( [ common_table_expression ] query ) +{% endhighlight %} + +### Parameters + + + expression_name + +Specifies a name for the common table expression. + + + + query + +A SELECT statement. 
+ + + +### Examples + +{% highlight sql %} +-- CTE with multiple column aliases +WITH t(x, y) AS (SELECT 1, 2) +SELECT * FROM t WHERE x = 1 AND y = 2; + +---+---+ + | x| y| + +---+---+ + | 1| 2| + +---+---+ + +-- CTE in CTE definition +WITH t as ( +WITH t2 AS (SELECT 1) +SELECT * FROM t2 +) +SELECT * FROM t; + +---+ + | 1| + +---+ + | 1| + +---+ + +-- CTE in subquery +SELECT max(c) FROM ( +WITH t(c) AS (SELECT 1) +SELECT * FROM t +); + +--+ + |max(c)| + +--+ + | 1| + +--+ + +-- CTE in subquery expression +SELECT ( +WITH t AS (SELECT 1) +SELECT * FROM t +); + ++ + |scalarsubquery()| + ++ + | 1| + ++ + +-- CTE in CREATE VIEW statement +CREATE VIEW v AS +WITH t(a, b, c, d) AS (SELECT 1, 2, 3, 4) +SELECT * FROM t; +SELECT * FROM v; + +---+---+---+---+ + | a| b| c| d| + +---+---+---+---+ + | 1| 2| 3| 4| + +---+---+---+---+ + +-- If name conflict is detected in nested CTE, then AnalysisException is thrown by default. +-- SET spark.sql.legacy.ctePrecedencePolicy = CORRECTED (which is recommended), +-- inner CTE definitions take precedence over outer definitions. +SET spark.sql.legacy.ctePrecedencePolicy = CORRECTED; +WITH +t AS (SELECT 1), +t2 AS ( +WITH t AS (SELECT 2) +SELECT * FROM t +) +SELECT * FROM t2; + +---+ + | 2| + +---+ + | 2| + +---+ +{% endhighlight %} + +### Related Statements + + * [SELECT](sql-ref-syntax-qry-select.html) diff --git a/docs/sql-ref-syntax-qry-select.md b/docs/sql-ref-syntax-qry-select.md index 94f69d4..bc2cc02 100644 --- a/docs/sql-ref-syntax-qry-select.md +++ b/docs/sql-ref-syntax-qry-select.md @@ -53,7 +53,7 @@ SELECT [
[spark] branch branch-2.4 updated (49abdc4 -> d34590c)
This is an automated email from the ASF dual-hosted git repository. ueshin pushed a change to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git. from 49abdc4 [SPARK-31186][PYSPARK][SQL][2.4] toPandas should not fail on duplicate column names add d34590c [SPARK-31441][PYSPARK][SQL][2.4] Support duplicated column names for toPandas with arrow execution No new revisions were added by this update. Summary of changes: python/pyspark/sql/dataframe.py | 6 +- python/pyspark/sql/tests.py | 27 +-- 2 files changed, 26 insertions(+), 7 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.0 updated: [SPARK-31018][CORE][DOCS] Deprecate support of multiple workers on the same host in Standalone
This is an automated email from the ASF dual-hosted git repository. jiangxb1987 pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new d286db1 [SPARK-31018][CORE][DOCS] Deprecate support of multiple workers on the same host in Standalone d286db1 is described below commit d286db145433d6d7610c69980512369f389930ca Author: yi.wu AuthorDate: Wed Apr 15 11:29:55 2020 -0700 [SPARK-31018][CORE][DOCS] Deprecate support of multiple workers on the same host in Standalone ### What changes were proposed in this pull request? Update the document and shell script to warn user about the deprecation of multiple workers on the same host support. ### Why are the changes needed? This is a sub-task of [SPARK-30978](https://issues.apache.org/jira/browse/SPARK-30978), which plans to totally remove support of multiple workers in Spark 3.1. This PR makes the first step to deprecate it firstly in Spark 3.0. ### Does this PR introduce any user-facing change? Yeah, user see warning when they run start worker script. ### How was this patch tested? Tested manually. Closes #27768 from Ngone51/deprecate_spark_worker_instances. Authored-by: yi.wu Signed-off-by: Xingbo Jiang (cherry picked from commit 0d4e4df06105cf2985dde17c1af76093b3ae8c13) Signed-off-by: Xingbo Jiang --- docs/core-migration-guide.md | 2 ++ docs/hardware-provisioning.md | 8 sbin/start-slave.sh | 2 +- 3 files changed, 7 insertions(+), 5 deletions(-) diff --git a/docs/core-migration-guide.md b/docs/core-migration-guide.md index 66a489b..cde6e07 100644 --- a/docs/core-migration-guide.md +++ b/docs/core-migration-guide.md @@ -38,3 +38,5 @@ license: | - Event log file will be written as UTF-8 encoding, and Spark History Server will replay event log files as UTF-8 encoding. 
Previously Spark wrote the event log file as default charset of driver JVM process, so Spark History Server of Spark 2.x is needed to read the old event log files in case of incompatible encoding. - A new protocol for fetching shuffle blocks is used. It's recommended that external shuffle services be upgraded when running Spark 3.0 apps. You can still use old external shuffle services by setting the configuration `spark.shuffle.useOldFetchProtocol` to `true`. Otherwise, Spark may run into errors with messages like `IllegalArgumentException: Unexpected message type: `. + +- `SPARK_WORKER_INSTANCES` is deprecated in Standalone mode. It's recommended to launch multiple executors in one worker and launch one worker per node instead of launching multiple workers per node and launching one executor per worker. diff --git a/docs/hardware-provisioning.md b/docs/hardware-provisioning.md index 4e5d681..fc87995f 100644 --- a/docs/hardware-provisioning.md +++ b/docs/hardware-provisioning.md @@ -63,10 +63,10 @@ Note that memory usage is greatly affected by storage level and serialization fo the [tuning guide](tuning.html) for tips on how to reduce it. Finally, note that the Java VM does not always behave well with more than 200 GiB of RAM. If you -purchase machines with more RAM than this, you can run _multiple worker JVMs per node_. In -Spark's [standalone mode](spark-standalone.html), you can set the number of workers per node -with the `SPARK_WORKER_INSTANCES` variable in `conf/spark-env.sh`, and the number of cores -per worker with `SPARK_WORKER_CORES`. +purchase machines with more RAM than this, you can launch multiple executors in a single node. In +Spark's [standalone mode](spark-standalone.html), a worker is responsible for launching multiple +executors according to its available memory and cores, and each executor will be launched in a +separate Java VM. 
# Network diff --git a/sbin/start-slave.sh b/sbin/start-slave.sh index 2cb17a0..9b3b26b 100755 --- a/sbin/start-slave.sh +++ b/sbin/start-slave.sh @@ -22,7 +22,7 @@ # Environment Variables # # SPARK_WORKER_INSTANCES The number of worker instances to run on this -# slave. Default is 1. +# slave. Default is 1. Note it has been deprecate since Spark 3.0. # SPARK_WORKER_PORT The base port number for the first worker. If set, # subsequent workers will increment this number. If # unset, Spark will find a valid port number, but - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
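As a concrete illustration of the migration this commit recommends (a sketch with hypothetical resource values, not taken from the commit itself): replace N workers per host with a single worker that owns the whole node, and let per-executor sizing carve it into N executors.

```sh
# conf/spark-env.sh -- before (deprecated as of Spark 3.0):
#   SPARK_WORKER_INSTANCES=4
#   SPARK_WORKER_CORES=4
#   SPARK_WORKER_MEMORY=16g

# conf/spark-env.sh -- after: one worker per node with all resources
SPARK_WORKER_CORES=16
SPARK_WORKER_MEMORY=64g

# Then size executors at submit time so the scheduler launches four
# executors inside the single worker (values are illustrative):
#   spark-submit --executor-cores 4 --executor-memory 16g ...
```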
[spark] branch master updated (2b10d70 -> 0d4e4df)
This is an automated email from the ASF dual-hosted git repository. jiangxb1987 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 2b10d70 [SPARK-31423][SQL] Fix rebasing of not-existed dates add 0d4e4df [SPARK-31018][CORE][DOCS] Deprecate support of multiple workers on the same host in Standalone No new revisions were added by this update. Summary of changes: docs/core-migration-guide.md | 2 ++ docs/hardware-provisioning.md | 8 sbin/start-slave.sh | 2 +- 3 files changed, 7 insertions(+), 5 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-31018][CORE][DOCS] Deprecate support of multiple workers on the same host in Standalone
This is an automated email from the ASF dual-hosted git repository. jiangxb1987 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 0d4e4df [SPARK-31018][CORE][DOCS] Deprecate support of multiple workers on the same host in Standalone 0d4e4df is described below commit 0d4e4df06105cf2985dde17c1af76093b3ae8c13 Author: yi.wu AuthorDate: Wed Apr 15 11:29:55 2020 -0700 [SPARK-31018][CORE][DOCS] Deprecate support of multiple workers on the same host in Standalone ### What changes were proposed in this pull request? Update the document and shell script to warn user about the deprecation of multiple workers on the same host support. ### Why are the changes needed? This is a sub-task of [SPARK-30978](https://issues.apache.org/jira/browse/SPARK-30978), which plans to totally remove support of multiple workers in Spark 3.1. This PR makes the first step to deprecate it firstly in Spark 3.0. ### Does this PR introduce any user-facing change? Yeah, user see warning when they run start worker script. ### How was this patch tested? Tested manually. Closes #27768 from Ngone51/deprecate_spark_worker_instances. Authored-by: yi.wu Signed-off-by: Xingbo Jiang --- docs/core-migration-guide.md | 2 ++ docs/hardware-provisioning.md | 8 sbin/start-slave.sh | 2 +- 3 files changed, 7 insertions(+), 5 deletions(-) diff --git a/docs/core-migration-guide.md b/docs/core-migration-guide.md index 66a489b..cde6e07 100644 --- a/docs/core-migration-guide.md +++ b/docs/core-migration-guide.md @@ -38,3 +38,5 @@ license: | - Event log file will be written as UTF-8 encoding, and Spark History Server will replay event log files as UTF-8 encoding. Previously Spark wrote the event log file as default charset of driver JVM process, so Spark History Server of Spark 2.x is needed to read the old event log files in case of incompatible encoding. - A new protocol for fetching shuffle blocks is used. 
It's recommended that external shuffle services be upgraded when running Spark 3.0 apps. You can still use old external shuffle services by setting the configuration `spark.shuffle.useOldFetchProtocol` to `true`. Otherwise, Spark may run into errors with messages like `IllegalArgumentException: Unexpected message type: `. + +- `SPARK_WORKER_INSTANCES` is deprecated in Standalone mode. It's recommended to launch multiple executors in one worker and launch one worker per node instead of launching multiple workers per node and launching one executor per worker. diff --git a/docs/hardware-provisioning.md b/docs/hardware-provisioning.md index 4e5d681..fc87995f 100644 --- a/docs/hardware-provisioning.md +++ b/docs/hardware-provisioning.md @@ -63,10 +63,10 @@ Note that memory usage is greatly affected by storage level and serialization fo the [tuning guide](tuning.html) for tips on how to reduce it. Finally, note that the Java VM does not always behave well with more than 200 GiB of RAM. If you -purchase machines with more RAM than this, you can run _multiple worker JVMs per node_. In -Spark's [standalone mode](spark-standalone.html), you can set the number of workers per node -with the `SPARK_WORKER_INSTANCES` variable in `conf/spark-env.sh`, and the number of cores -per worker with `SPARK_WORKER_CORES`. +purchase machines with more RAM than this, you can launch multiple executors in a single node. In +Spark's [standalone mode](spark-standalone.html), a worker is responsible for launching multiple +executors according to its available memory and cores, and each executor will be launched in a +separate Java VM. # Network diff --git a/sbin/start-slave.sh b/sbin/start-slave.sh index 2cb17a0..9b3b26b 100755 --- a/sbin/start-slave.sh +++ b/sbin/start-slave.sh @@ -22,7 +22,7 @@ # Environment Variables # # SPARK_WORKER_INSTANCES The number of worker instances to run on this -# slave. Default is 1. +# slave. Default is 1. Note it has been deprecate since Spark 3.0. 
# SPARK_WORKER_PORT The base port number for the first worker. If set, # subsequent workers will increment this number. If # unset, Spark will find a valid port number, but - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.0 updated: [SPARK-31423][SQL] Fix rebasing of not-existed dates
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new 3702327 [SPARK-31423][SQL] Fix rebasing of not-existed dates 3702327 is described below commit 37023273fe0171ab758a81956dcc0ae9f8d2253b Author: Max Gekk AuthorDate: Wed Apr 15 16:33:56 2020 + [SPARK-31423][SQL] Fix rebasing of not-existed dates ### What changes were proposed in this pull request? In the PR, I propose to change rebasing of not-existed dates in the hybrid calendar (Julian + Gregorian since 1582-10-15) in the range (1582-10-04, 1582-10-15). Not existed dates from the range are shifted to the first valid date in the hybrid calendar - 1582-10-15. The changes affect only `rebaseGregorianToJulianDays()` because reverse rebasing from the hybrid dates to Proleptic Gregorian dates does not have such problem. ### Why are the changes needed? Currently, not-existed dates are shifted by standard difference between Julian and Gregorian calendar on 1582-10-04, for example 1582-10-14 -> 1582-10-24. That's contradict to shifting not existed dates in other cases, for example: ``` scala> sql("select date'1990-9-31'").show +-+ |DATE '1990-10-01'| +-+ | 1990-10-01| +-+ ``` ### Does this PR introduce any user-facing change? Yes, this impacts on conversion of Spark SQL `DATE` values to external dates based on non-Proleptic Gregorian calendar. For example, while saving the 1582-10-14 date to ORC files, it will be shifted to the next valid date 1582-10-15. ### How was this patch tested? - Added tests to `RebaseDateTimeSuite` and to `OrcSourceSuite` - By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`, `CollectionExpressionsSuite`, `ParquetIOSuite`. Closes #28225 from MaxGekk/fix-not-exist-dates. 
Authored-by: Max Gekk
Signed-off-by: Wenchen Fan
(cherry picked from commit 2b10d70bad30fb7b7c293338c2acc908031af0b8)
Signed-off-by: Wenchen Fan
---
 .../spark/sql/catalyst/util/RebaseDateTime.scala   | 16
 .../sql/catalyst/util/RebaseDateTimeSuite.scala    | 22 ++
 .../execution/datasources/orc/OrcSourceSuite.scala |  8 +---
 3 files changed, 39 insertions(+), 7 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/RebaseDateTime.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/RebaseDateTime.scala
index 50b552e..6338a59 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/RebaseDateTime.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/RebaseDateTime.scala
@@ -131,7 +131,8 @@ object RebaseDateTime {
   // The differences in days between Proleptic Gregorian and Julian dates.
   // The diff at the index `i` is applicable for all days in the date interval:
   // [gregJulianDiffSwitchDay(i), gregJulianDiffSwitchDay(i+1))
-  private val gregJulianDiffs = Array(-2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0)
+  private val gregJulianDiffs = Array(
+    -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0)
   // The sorted days in Proleptic Gregorian calendar when difference in days between
   // Proleptic Gregorian and Julian was changed.
   // The starting point is the `0001-01-01` (-719162 days since the epoch in
@@ -139,13 +140,17 @@ object RebaseDateTime {
   // Rebasing switch days and diffs `gregJulianDiffSwitchDay` and `gregJulianDiffs`
   // was generated by the `localRebaseGregorianToJulianDays` function.
   private val gregJulianDiffSwitchDay = Array(
-    -719162, -682944, -646420, -609896, -536847, -500323, -463799,
-    -390750, -354226, -317702, -244653, -208129, -171605, -141427)
+    -719162, -682944, -646420, -609896, -536847, -500323, -463799, -390750,
+    -354226, -317702, -244653, -208129, -171605, -141436, -141435, -141434,
+    -141433, -141432, -141431, -141430, -141429, -141428, -141427)
   // The first days of Common Era (CE) which is mapped to the '0001-01-01' date
   // in Proleptic Gregorian calendar.
   private final val gregorianCommonEraStartDay = gregJulianDiffSwitchDay(0)

+  private final val gregorianStartDay = LocalDate.of(1582, 10, 15)
+  private final val julianEndDay = LocalDate.of(1582, 10, 4)
+
   /**
    * Converts the given number of days since the epoch day 1970-01-01 to a local date in Proleptic
    * Gregorian calendar, interprets the result as a local date in Julian calendar, and takes the
@@ -165,7 +170,10 @@ object RebaseDateTime {
    * @return The rebased number of days in Julian calendar.
    */
   private[sql] def localRebaseGregorianToJulianDays(days: Int): Int = {
-    val localDate = LocalDate.ofEpochDay(days)
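The fix above can be illustrated outside of Spark: dates falling in the calendar gap (1582-10-04, 1582-10-15) are clamped to the first valid hybrid-calendar date before being reinterpreted. Below is a minimal Java sketch of that idea; the class and method names are hypothetical, and Spark's actual implementation additionally uses the precomputed `gregJulianDiffs`/`gregJulianDiffSwitchDay` arrays for speed instead of constructing a calendar per call.

```java
import java.time.LocalDate;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class RebaseSketch {
    // First valid date of the hybrid (Julian + Gregorian) calendar,
    // and the last valid Julian date before the cutover.
    static final LocalDate GREGORIAN_START = LocalDate.of(1582, 10, 15);
    static final LocalDate JULIAN_END = LocalDate.of(1582, 10, 4);
    static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    /**
     * Rebases a Proleptic Gregorian epoch day to the hybrid calendar.
     * Non-existent dates in the gap (1582-10-04, 1582-10-15) are shifted
     * to 1582-10-15, mirroring the SPARK-31423 fix.
     */
    static long rebaseGregorianToJulianDays(long days) {
        LocalDate localDate = LocalDate.ofEpochDay(days);
        if (localDate.isAfter(JULIAN_END) && localDate.isBefore(GREGORIAN_START)) {
            // Without this clamp, a lenient hybrid calendar would
            // "normalize" 1582-10-14 to 1582-10-24 (the reported bug).
            localDate = GREGORIAN_START;
        }
        // GregorianCalendar uses the hybrid calendar by default
        // (Julian before 1582-10-15, Gregorian after).
        GregorianCalendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.clear();
        cal.set(localDate.getYear(), localDate.getMonthValue() - 1, localDate.getDayOfMonth());
        return Math.floorDiv(cal.getTimeInMillis(), MILLIS_PER_DAY);
    }

    public static void main(String[] args) {
        long oct14 = LocalDate.of(1582, 10, 14).toEpochDay();
        long oct15 = LocalDate.of(1582, 10, 15).toEpochDay();
        // 1582-10-14 does not exist in the hybrid calendar, so it rebases
        // to the same hybrid day as 1582-10-15.
        System.out.println(rebaseGregorianToJulianDays(oct14)
            == rebaseGregorianToJulianDays(oct15));
    }
}
```

Note that the reverse direction needs no such clamp: every hybrid-calendar date has a valid Proleptic Gregorian counterpart, which is why only `rebaseGregorianToJulianDays()` changes.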
[spark] branch master updated (7699f76 -> 2b10d70)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 7699f76  [SPARK-31394][K8S] Adds support for Kubernetes NFS volume mounts
 add 2b10d70  [SPARK-31423][SQL] Fix rebasing of not-existed dates

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/util/RebaseDateTime.scala   | 16
 .../sql/catalyst/util/RebaseDateTimeSuite.scala    | 22 ++
 .../execution/datasources/orc/OrcSourceSuite.scala |  8 +---
 3 files changed, 39 insertions(+), 7 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (744c248 -> 7699f76)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 744c248  [SPARK-31443][SQL] Fix perf regression of toJavaDate
 add 7699f76  [SPARK-31394][K8S] Adds support for Kubernetes NFS volume mounts

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/spark/deploy/k8s/Config.scala |  2 +
 .../spark/deploy/k8s/KubernetesVolumeSpec.scala    |  5 ++
 .../spark/deploy/k8s/KubernetesVolumeUtils.scala   |  7 +++
 .../k8s/features/MountVolumesFeatureStep.scala     |  4 ++
 .../spark/deploy/k8s/KubernetesTestConf.scala      |  5 ++
 .../deploy/k8s/KubernetesVolumeUtilsSuite.scala    | 54 ++
 .../features/MountVolumesFeatureStepSuite.scala    | 44 ++
 7 files changed, 121 insertions(+)
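The files changed above (`Config.scala`, `KubernetesVolumeSpec.scala`, `MountVolumesFeatureStep.scala`) extend Spark's existing Kubernetes volume-mount configuration scheme with an `nfs` volume type. As a rough illustration only — the exact option names below are assumptions based on the established `spark.kubernetes.<role>.volumes.<type>.<name>.*` pattern used by the `hostPath` and `emptyDir` types, not taken from this commit — an NFS mount might be configured like this:

```shell
# Hypothetical spark-submit flags for an NFS volume mount on the driver.
# Key names follow the existing volume config pattern; verify against the
# "Running on Kubernetes" docs for the Spark version you run.
spark-submit \
  --master k8s://https://k8s-apiserver:6443 \
  --conf spark.kubernetes.driver.volumes.nfs.images.mount.path=/mnt/images \
  --conf spark.kubernetes.driver.volumes.nfs.images.mount.readOnly=true \
  --conf spark.kubernetes.driver.volumes.nfs.images.options.server=nfs.example.com \
  --conf spark.kubernetes.driver.volumes.nfs.images.options.path=/export/images \
  local:///opt/spark/examples/jars/spark-examples.jar
```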
[spark] branch branch-3.0 updated: [SPARK-31443][SQL] Fix perf regression of toJavaDate
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new a5a4194  [SPARK-31443][SQL] Fix perf regression of toJavaDate
a5a4194 is described below

commit a5a41942138c06cc67ffd1183bbf6da5214f78f0
Author: Max Gekk
AuthorDate: Wed Apr 15 06:19:12 2020 +

    [SPARK-31443][SQL] Fix perf regression of toJavaDate

    ### What changes were proposed in this pull request?
    Optimise the `toJavaDate()` method of `DateTimeUtils` by:
    1. Re-using `rebaseGregorianToJulianDays`, optimised by #28067
    2. Creating `java.sql.Date` instances from milliseconds in UTC since the epoch instead of from date-time fields. This avoids "normalization" inside of `java.sql.Date`.

    A new benchmark for collecting dates is also added to `DateTimeBenchmark`.

    ### Why are the changes needed?
    The changes fix the performance regression of collecting `DATE` values compared to Spark 2.4 (see `DateTimeBenchmark` in https://github.com/MaxGekk/spark/pull/27):

    Spark 2.4.6-SNAPSHOT:
    ```
    To/from Java's date-time:   Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
    ---------------------------------------------------------------------------------------------------------
    From java.sql.Date                    559            603          38         8.9         111.8        1.0X
    Collect dates                        2306           3221        1558         2.2         461.1        0.2X
    ```
    Before the changes:
    ```
    To/from Java's date-time:   Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
    ---------------------------------------------------------------------------------------------------------
    From java.sql.Date                   1052           1130          73         4.8         210.3        1.0X
    Collect dates                        3251           4943        1624         1.5         650.2        0.3X
    ```
    After:
    ```
    To/from Java's date-time:   Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
    ---------------------------------------------------------------------------------------------------------
    From java.sql.Date                    416            419           3        12.0          83.2        1.0X
    Collect dates                        1928           2759        1180         2.6         385.6        0.2X
    ```

    ### Does this PR introduce any user-facing change?
    No

    ### How was this patch tested?
    - By existing test suites, in particular, `DateTimeUtilsSuite`, `RebaseDateTimeSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`.
    - Re-run `DateTimeBenchmark` in the environment:

      | Item | Description |
      | ---- | ----------- |
      | Region | us-west-2 (Oregon) |
      | Instance | r3.xlarge |
      | AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
      | Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |

    Closes #28212 from MaxGekk/optimize-toJavaDate.

Authored-by: Max Gekk
Signed-off-by: Wenchen Fan
(cherry picked from commit 744c2480b580e0b68387328926ef7634cfb93adc)
Signed-off-by: Wenchen Fan
---
 .../spark/sql/catalyst/util/DateTimeUtils.scala    |  11 +-
 .../benchmarks/DateTimeBenchmark-jdk11-results.txt | 221 +++--
 sql/core/benchmarks/DateTimeBenchmark-results.txt  | 221 +++--
 .../execution/benchmark/DateTimeBenchmark.scala    |   5 +
 4 files changed, 236 insertions(+), 222 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
index dede92f..8486bba 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
@@ -26,6 +26,8 @@ import java.util.concurrent.TimeUnit._

 import scala.util.control.NonFatal

+import sun.util.calendar.ZoneInfo
+
 import org.apache.spark.sql.catalyst.util.DateTimeConstants._
 import org.apache.spark.sql.catalyst.util.RebaseDateTime._
 import org.apache.spark.sql.types.Decimal
@@ -123,8 +125,13 @@ object DateTimeUtils {
    * @return A `java.sql.Date` from number of days since epoch.
    */
   def toJavaDate(daysSinceEpoch: SQLDate): Date = {
-    val localDate = LocalDate.ofEpochDay(daysSinceEpoch)
+
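The two approaches the commit message contrasts can be sketched side by side in plain Java. This is an illustration with hypothetical names, not Spark's actual code: the real `toJavaDate` first applies `rebaseGregorianToJulianDays` (skipped here, since the example dates are far from 1582) and resolves the zone offset via `sun.util.calendar.ZoneInfo`; also, `TimeZone.getOffset` takes an instant, so midnights that coincide with a DST transition could be off in this simplified version.

```java
import java.sql.Date;
import java.time.LocalDate;
import java.util.TimeZone;

public class ToJavaDateSketch {
    static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    // Field-based approach (before the patch): the deprecated Date
    // constructor triggers internal "normalization" in java.sql.Date,
    // which is what made collecting DATE values slow.
    static Date toJavaDateOld(int daysSinceEpoch) {
        LocalDate ld = LocalDate.ofEpochDay(daysSinceEpoch);
        return new Date(ld.getYear() - 1900, ld.getMonthValue() - 1, ld.getDayOfMonth());
    }

    // Millis-based approach (after the patch, Julian rebase omitted):
    // compute the UTC millis of the day, then shift by the default zone
    // offset so the Date renders as local midnight of the same day.
    static Date toJavaDateNew(int daysSinceEpoch) {
        long utcMillis = daysSinceEpoch * MILLIS_PER_DAY;
        long localMillis = utcMillis - TimeZone.getDefault().getOffset(utcMillis);
        return new Date(localMillis);
    }

    public static void main(String[] args) {
        int days = (int) LocalDate.of(2020, 4, 15).toEpochDay();
        // Both approaches render the same calendar day.
        System.out.println(toJavaDateOld(days));
        System.out.println(toJavaDateNew(days));
    }
}
```

The second variant does one multiplication, one offset lookup, and one cheap `Date(long)` construction per value, which lines up with the roughly 2.5x improvement the benchmark tables above report for `From java.sql.Date`.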
[spark] branch master updated: [SPARK-31443][SQL] Fix perf regression of toJavaDate
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 744c248  [SPARK-31443][SQL] Fix perf regression of toJavaDate
744c248 is described below

commit 744c2480b580e0b68387328926ef7634cfb93adc
Author: Max Gekk
AuthorDate: Wed Apr 15 06:19:12 2020 +

    [SPARK-31443][SQL] Fix perf regression of toJavaDate

    ### What changes were proposed in this pull request?
    Optimise the `toJavaDate()` method of `DateTimeUtils` by:
    1. Re-using `rebaseGregorianToJulianDays`, optimised by #28067
    2. Creating `java.sql.Date` instances from milliseconds in UTC since the epoch instead of from date-time fields. This avoids "normalization" inside of `java.sql.Date`.

    A new benchmark for collecting dates is also added to `DateTimeBenchmark`.

    ### Why are the changes needed?
    The changes fix the performance regression of collecting `DATE` values compared to Spark 2.4 (see `DateTimeBenchmark` in https://github.com/MaxGekk/spark/pull/27):

    Spark 2.4.6-SNAPSHOT:
    ```
    To/from Java's date-time:   Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
    ---------------------------------------------------------------------------------------------------------
    From java.sql.Date                    559            603          38         8.9         111.8        1.0X
    Collect dates                        2306           3221        1558         2.2         461.1        0.2X
    ```
    Before the changes:
    ```
    To/from Java's date-time:   Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
    ---------------------------------------------------------------------------------------------------------
    From java.sql.Date                   1052           1130          73         4.8         210.3        1.0X
    Collect dates                        3251           4943        1624         1.5         650.2        0.3X
    ```
    After:
    ```
    To/from Java's date-time:   Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
    ---------------------------------------------------------------------------------------------------------
    From java.sql.Date                    416            419           3        12.0          83.2        1.0X
    Collect dates                        1928           2759        1180         2.6         385.6        0.2X
    ```

    ### Does this PR introduce any user-facing change?
    No

    ### How was this patch tested?
    - By existing test suites, in particular, `DateTimeUtilsSuite`, `RebaseDateTimeSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`.
    - Re-run `DateTimeBenchmark` in the environment:

      | Item | Description |
      | ---- | ----------- |
      | Region | us-west-2 (Oregon) |
      | Instance | r3.xlarge |
      | AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
      | Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |

    Closes #28212 from MaxGekk/optimize-toJavaDate.

Authored-by: Max Gekk
Signed-off-by: Wenchen Fan
---
 .../spark/sql/catalyst/util/DateTimeUtils.scala    |  11 +-
 .../benchmarks/DateTimeBenchmark-jdk11-results.txt | 221 +++--
 sql/core/benchmarks/DateTimeBenchmark-results.txt  | 221 +++--
 .../execution/benchmark/DateTimeBenchmark.scala    |   5 +
 4 files changed, 236 insertions(+), 222 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
index 56259df..021072c 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
@@ -26,6 +26,8 @@ import java.util.concurrent.TimeUnit._

 import scala.util.control.NonFatal

+import sun.util.calendar.ZoneInfo
+
 import org.apache.spark.sql.catalyst.util.DateTimeConstants._
 import org.apache.spark.sql.catalyst.util.RebaseDateTime._
 import org.apache.spark.sql.types.Decimal
@@ -121,8 +123,13 @@ object DateTimeUtils {
    * @return A `java.sql.Date` from number of days since epoch.
    */
   def toJavaDate(daysSinceEpoch: SQLDate): Date = {
-    val localDate = LocalDate.ofEpochDay(daysSinceEpoch)
-    new Date(localDate.getYear - 1900, localDate.getMonthValue - 1, localDate.getDayOfMonth)
+    val rebasedDays = rebaseGregorianToJulianDays(daysSinceEpoch)
+    val localMillis =