[spark] branch master updated (98ec4a8 -> c76c31e)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 98ec4a8 [SPARK-31330][INFRA][FOLLOW-UP] Exclude 'ui' and 'UI.scala' in CORE and 'dev/.rat-excludes' in BUILD autolabeller add c76c31e [SPARK-31455][SQL] Fix rebasing of not-existed timestamps No new revisions were added by this update. Summary of changes: .../resources/gregorian-julian-rebase-micros.json | 2384 ++-- .../spark/sql/catalyst/util/RebaseDateTime.scala | 20 +- .../sql/catalyst/util/RebaseDateTimeSuite.scala| 37 +- .../execution/datasources/orc/OrcSourceSuite.scala |8 +- 4 files changed, 1244 insertions(+), 1205 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (92c1b24 -> 98ec4a8)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 92c1b24 [SPARK-31428][SQL][DOCS] Document Common Table Expression in SQL Reference add 98ec4a8 [SPARK-31330][INFRA][FOLLOW-UP] Exclude 'ui' and 'UI.scala' in CORE and 'dev/.rat-excludes' in BUILD autolabeller No new revisions were added by this update. Summary of changes: .github/autolabeler.yml | 4 1 file changed, 4 insertions(+) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.0 updated: [SPARK-31428][SQL][DOCS] Document Common Table Expression in SQL Reference
This is an automated email from the ASF dual-hosted git repository. yamamuro pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new 4476c85 [SPARK-31428][SQL][DOCS] Document Common Table Expression in SQL Reference 4476c85 is described below commit 4476c85775d231c8bb26399284c0baf4292bec7c Author: Huaxin Gao AuthorDate: Thu Apr 16 08:34:26 2020 +0900 [SPARK-31428][SQL][DOCS] Document Common Table Expression in SQL Reference ### What changes were proposed in this pull request? Document Common Table Expression in SQL Reference ### Why are the changes needed? Make SQL Reference complete ### Does this PR introduce any user-facing change? Yes https://user-images.githubusercontent.com/13592258/79100257-f61def00-7d1a-11ea-8402-17017059232e.png https://user-images.githubusercontent.com/13592258/79100260-f7e7b280-7d1a-11ea-9408-058c0851f0b6.png https://user-images.githubusercontent.com/13592258/79100262-fa4a0c80-7d1a-11ea-8862-eb1d8960296b.png Also link to Select page https://user-images.githubusercontent.com/13592258/79082246-217fea00-7cd9-11ea-8d96-1a69769d1e19.png ### How was this patch tested? Manually build and check Closes #28196 from huaxingao/cte. 
Authored-by: Huaxin Gao Signed-off-by: Takeshi Yamamuro (cherry picked from commit 92c1b246174948d0c1f4d0833e1ceac265b17be7) Signed-off-by: Takeshi Yamamuro --- docs/_data/menu-sql.yaml | 2 + docs/sql-ref-syntax-qry-select-cte.md | 109 +- docs/sql-ref-syntax-qry-select.md | 3 +- 3 files changed, 112 insertions(+), 2 deletions(-) diff --git a/docs/_data/menu-sql.yaml b/docs/_data/menu-sql.yaml index badb98d..7827a0f 100644 --- a/docs/_data/menu-sql.yaml +++ b/docs/_data/menu-sql.yaml @@ -166,6 +166,8 @@ url: sql-ref-syntax-qry-select-tvf.html - text: Inline Table url: sql-ref-syntax-qry-select-inline-table.html +- text: Common Table Expression + url: sql-ref-syntax-qry-select-cte.html - text: EXPLAIN url: sql-ref-syntax-qry-explain.html - text: Auxiliary Statements diff --git a/docs/sql-ref-syntax-qry-select-cte.md b/docs/sql-ref-syntax-qry-select-cte.md index 2bd7748..2146f8e 100644 --- a/docs/sql-ref-syntax-qry-select-cte.md +++ b/docs/sql-ref-syntax-qry-select-cte.md @@ -19,4 +19,111 @@ license: | limitations under the License. --- -**This page is under construction** +### Description + +A common table expression (CTE) defines a temporary result set that a user can reference possibly multiple times within the scope of a SQL statement. A CTE is used mainly in a SELECT statement. + +### Syntax + +{% highlight sql %} +WITH common_table_expression [ , ... ] +{% endhighlight %} + +While `common_table_expression` is defined as +{% highlight sql %} +expression_name [ ( column_name [ , ... ] ) ] [ AS ] ( [ common_table_expression ] query ) +{% endhighlight %} + +### Parameters + + + expression_name + +Specifies a name for the common table expression. + + + + query + +A SELECT statement. 
+ + + +### Examples + +{% highlight sql %} +-- CTE with multiple column aliases +WITH t(x, y) AS (SELECT 1, 2) +SELECT * FROM t WHERE x = 1 AND y = 2; + +---+---+ + | x| y| + +---+---+ + | 1| 2| + +---+---+ + +-- CTE in CTE definition +WITH t as ( +WITH t2 AS (SELECT 1) +SELECT * FROM t2 +) +SELECT * FROM t; + +---+ + | 1| + +---+ + | 1| + +---+ + +-- CTE in subquery +SELECT max(c) FROM ( +WITH t(c) AS (SELECT 1) +SELECT * FROM t +); + +--+ + |max(c)| + +--+ + | 1| + +--+ + +-- CTE in subquery expression +SELECT ( +WITH t AS (SELECT 1) +SELECT * FROM t +); + ++ + |scalarsubquery()| + ++ + | 1| + ++ + +-- CTE in CREATE VIEW statement +CREATE VIEW v AS +WITH t(a, b, c, d) AS (SELECT 1, 2, 3, 4) +SELECT * FROM t; +SELECT * FROM v; + +---+---+---+---+ + | a| b| c| d| + +---+---+---+---+ + | 1| 2| 3| 4| + +---+---+---+---+ + +-- If name conflict is detected in nested CTE, then AnalysisException is thrown by default. +-- SET spark.sql.legacy.ctePrecedencePolicy = CORRECTED (which is recommended), +-- inner CTE definitions take precedence over outer definitions. +SET spark.sql.legacy.ctePrecedencePolicy = CORRECTED; +WITH +t AS (SELECT 1), +t2 AS ( +WITH t AS (SELECT 2) +SELECT * FROM t +) +SELECT * FROM t2; + +---+ + | 2| + +---+ + | 2| + +---+ +{% endhighlight %} + +### Related Statements + + * [SELECT](sql-ref-syntax-qry-select.html) diff --git a/docs/sql-ref-syntax-qry-select.md b/docs/sql-ref-syntax-qry-select.md index 94f69d4..bc2cc02
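The `WITH` syntax documented in the diff above is standard SQL, so its basic shape can be sketched outside Spark too. Below is a minimal, self-contained illustration using Python's built-in `sqlite3` as a stand-in for `spark-sql` (the engine choice is an assumption for runnability; the CTE syntax itself is the one the new doc page describes):

```python
import sqlite3

# Minimal sketch of the documented WITH syntax, run through Python's
# built-in sqlite3 as a stand-in engine for spark-sql.
conn = sqlite3.connect(":memory:")

# CTE with multiple column aliases, as in the doc page's first example.
rows1 = conn.execute(
    "WITH t(x, y) AS (SELECT 1, 2) "
    "SELECT * FROM t WHERE x = 1 AND y = 2"
).fetchall()
print(rows1)  # [(1, 2)]

# Several CTEs in one WITH clause; a later CTE may reference an earlier one.
rows2 = conn.execute(
    "WITH t AS (SELECT 1 AS c), "
    "     t2 AS (SELECT c + 1 AS d FROM t) "
    "SELECT d FROM t2"
).fetchall()
print(rows2)  # [(2,)]
```

Note that Spark-specific behavior such as the `spark.sql.legacy.ctePrecedencePolicy` name-conflict handling shown in the doc examples does not carry over to other engines.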
[spark] branch master updated: [SPARK-31428][SQL][DOCS] Document Common Table Expression in SQL Reference
This is an automated email from the ASF dual-hosted git repository. yamamuro pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 92c1b24 [SPARK-31428][SQL][DOCS] Document Common Table Expression in SQL Reference 92c1b24 is described below commit 92c1b246174948d0c1f4d0833e1ceac265b17be7 Author: Huaxin Gao AuthorDate: Thu Apr 16 08:34:26 2020 +0900 [SPARK-31428][SQL][DOCS] Document Common Table Expression in SQL Reference ### What changes were proposed in this pull request? Document Common Table Expression in SQL Reference ### Why are the changes needed? Make SQL Reference complete ### Does this PR introduce any user-facing change? Yes https://user-images.githubusercontent.com/13592258/79100257-f61def00-7d1a-11ea-8402-17017059232e.png https://user-images.githubusercontent.com/13592258/79100260-f7e7b280-7d1a-11ea-9408-058c0851f0b6.png https://user-images.githubusercontent.com/13592258/79100262-fa4a0c80-7d1a-11ea-8862-eb1d8960296b.png Also link to Select page https://user-images.githubusercontent.com/13592258/79082246-217fea00-7cd9-11ea-8d96-1a69769d1e19.png ### How was this patch tested? Manually build and check Closes #28196 from huaxingao/cte. 
Authored-by: Huaxin Gao Signed-off-by: Takeshi Yamamuro --- docs/_data/menu-sql.yaml | 2 + docs/sql-ref-syntax-qry-select-cte.md | 109 +- docs/sql-ref-syntax-qry-select.md | 3 +- 3 files changed, 112 insertions(+), 2 deletions(-) diff --git a/docs/_data/menu-sql.yaml b/docs/_data/menu-sql.yaml index badb98d..7827a0f 100644 --- a/docs/_data/menu-sql.yaml +++ b/docs/_data/menu-sql.yaml @@ -166,6 +166,8 @@ url: sql-ref-syntax-qry-select-tvf.html - text: Inline Table url: sql-ref-syntax-qry-select-inline-table.html +- text: Common Table Expression + url: sql-ref-syntax-qry-select-cte.html - text: EXPLAIN url: sql-ref-syntax-qry-explain.html - text: Auxiliary Statements diff --git a/docs/sql-ref-syntax-qry-select-cte.md b/docs/sql-ref-syntax-qry-select-cte.md index 2bd7748..2146f8e 100644 --- a/docs/sql-ref-syntax-qry-select-cte.md +++ b/docs/sql-ref-syntax-qry-select-cte.md @@ -19,4 +19,111 @@ license: | limitations under the License. --- -**This page is under construction** +### Description + +A common table expression (CTE) defines a temporary result set that a user can reference possibly multiple times within the scope of a SQL statement. A CTE is used mainly in a SELECT statement. + +### Syntax + +{% highlight sql %} +WITH common_table_expression [ , ... ] +{% endhighlight %} + +While `common_table_expression` is defined as +{% highlight sql %} +expression_name [ ( column_name [ , ... ] ) ] [ AS ] ( [ common_table_expression ] query ) +{% endhighlight %} + +### Parameters + + + expression_name + +Specifies a name for the common table expression. + + + + query + +A SELECT statement. 
+ + + +### Examples + +{% highlight sql %} +-- CTE with multiple column aliases +WITH t(x, y) AS (SELECT 1, 2) +SELECT * FROM t WHERE x = 1 AND y = 2; + +---+---+ + | x| y| + +---+---+ + | 1| 2| + +---+---+ + +-- CTE in CTE definition +WITH t as ( +WITH t2 AS (SELECT 1) +SELECT * FROM t2 +) +SELECT * FROM t; + +---+ + | 1| + +---+ + | 1| + +---+ + +-- CTE in subquery +SELECT max(c) FROM ( +WITH t(c) AS (SELECT 1) +SELECT * FROM t +); + +--+ + |max(c)| + +--+ + | 1| + +--+ + +-- CTE in subquery expression +SELECT ( +WITH t AS (SELECT 1) +SELECT * FROM t +); + ++ + |scalarsubquery()| + ++ + | 1| + ++ + +-- CTE in CREATE VIEW statement +CREATE VIEW v AS +WITH t(a, b, c, d) AS (SELECT 1, 2, 3, 4) +SELECT * FROM t; +SELECT * FROM v; + +---+---+---+---+ + | a| b| c| d| + +---+---+---+---+ + | 1| 2| 3| 4| + +---+---+---+---+ + +-- If name conflict is detected in nested CTE, then AnalysisException is thrown by default. +-- SET spark.sql.legacy.ctePrecedencePolicy = CORRECTED (which is recommended), +-- inner CTE definitions take precedence over outer definitions. +SET spark.sql.legacy.ctePrecedencePolicy = CORRECTED; +WITH +t AS (SELECT 1), +t2 AS ( +WITH t AS (SELECT 2) +SELECT * FROM t +) +SELECT * FROM t2; + +---+ + | 2| + +---+ + | 2| + +---+ +{% endhighlight %} + +### Related Statements + + * [SELECT](sql-ref-syntax-qry-select.html) diff --git a/docs/sql-ref-syntax-qry-select.md b/docs/sql-ref-syntax-qry-select.md index 94f69d4..bc2cc02 100644 --- a/docs/sql-ref-syntax-qry-select.md +++ b/docs/sql-ref-syntax-qry-select.md @@ -53,7 +53,7 @@ SELECT [
[spark] branch branch-2.4 updated (49abdc4 -> d34590c)
This is an automated email from the ASF dual-hosted git repository. ueshin pushed a change to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git. from 49abdc4 [SPARK-31186][PYSPARK][SQL][2.4] toPandas should not fail on duplicate column names add d34590c [SPARK-31441][PYSPARK][SQL][2.4] Support duplicated column names for toPandas with arrow execution No new revisions were added by this update. Summary of changes: python/pyspark/sql/dataframe.py | 6 +- python/pyspark/sql/tests.py | 27 +-- 2 files changed, 26 insertions(+), 7 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.0 updated: [SPARK-31018][CORE][DOCS] Deprecate support of multiple workers on the same host in Standalone
This is an automated email from the ASF dual-hosted git repository. jiangxb1987 pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new d286db1 [SPARK-31018][CORE][DOCS] Deprecate support of multiple workers on the same host in Standalone d286db1 is described below commit d286db145433d6d7610c69980512369f389930ca Author: yi.wu AuthorDate: Wed Apr 15 11:29:55 2020 -0700 [SPARK-31018][CORE][DOCS] Deprecate support of multiple workers on the same host in Standalone ### What changes were proposed in this pull request? Update the document and shell script to warn user about the deprecation of multiple workers on the same host support. ### Why are the changes needed? This is a sub-task of [SPARK-30978](https://issues.apache.org/jira/browse/SPARK-30978), which plans to totally remove support of multiple workers in Spark 3.1. This PR makes the first step to deprecate it firstly in Spark 3.0. ### Does this PR introduce any user-facing change? Yeah, user see warning when they run start worker script. ### How was this patch tested? Tested manually. Closes #27768 from Ngone51/deprecate_spark_worker_instances. Authored-by: yi.wu Signed-off-by: Xingbo Jiang (cherry picked from commit 0d4e4df06105cf2985dde17c1af76093b3ae8c13) Signed-off-by: Xingbo Jiang --- docs/core-migration-guide.md | 2 ++ docs/hardware-provisioning.md | 8 sbin/start-slave.sh | 2 +- 3 files changed, 7 insertions(+), 5 deletions(-) diff --git a/docs/core-migration-guide.md b/docs/core-migration-guide.md index 66a489b..cde6e07 100644 --- a/docs/core-migration-guide.md +++ b/docs/core-migration-guide.md @@ -38,3 +38,5 @@ license: | - Event log file will be written as UTF-8 encoding, and Spark History Server will replay event log files as UTF-8 encoding. 
Previously Spark wrote the event log file as default charset of driver JVM process, so Spark History Server of Spark 2.x is needed to read the old event log files in case of incompatible encoding. - A new protocol for fetching shuffle blocks is used. It's recommended that external shuffle services be upgraded when running Spark 3.0 apps. You can still use old external shuffle services by setting the configuration `spark.shuffle.useOldFetchProtocol` to `true`. Otherwise, Spark may run into errors with messages like `IllegalArgumentException: Unexpected message type: `. + +- `SPARK_WORKER_INSTANCES` is deprecated in Standalone mode. It's recommended to launch multiple executors in one worker and launch one worker per node instead of launching multiple workers per node and launching one executor per worker. diff --git a/docs/hardware-provisioning.md b/docs/hardware-provisioning.md index 4e5d681..fc87995f 100644 --- a/docs/hardware-provisioning.md +++ b/docs/hardware-provisioning.md @@ -63,10 +63,10 @@ Note that memory usage is greatly affected by storage level and serialization fo the [tuning guide](tuning.html) for tips on how to reduce it. Finally, note that the Java VM does not always behave well with more than 200 GiB of RAM. If you -purchase machines with more RAM than this, you can run _multiple worker JVMs per node_. In -Spark's [standalone mode](spark-standalone.html), you can set the number of workers per node -with the `SPARK_WORKER_INSTANCES` variable in `conf/spark-env.sh`, and the number of cores -per worker with `SPARK_WORKER_CORES`. +purchase machines with more RAM than this, you can launch multiple executors in a single node. In +Spark's [standalone mode](spark-standalone.html), a worker is responsible for launching multiple +executors according to its available memory and cores, and each executor will be launched in a +separate Java VM. 
# Network diff --git a/sbin/start-slave.sh b/sbin/start-slave.sh index 2cb17a0..9b3b26b 100755 --- a/sbin/start-slave.sh +++ b/sbin/start-slave.sh @@ -22,7 +22,7 @@ # Environment Variables # # SPARK_WORKER_INSTANCES The number of worker instances to run on this -# slave. Default is 1. +# slave. Default is 1. Note it has been deprecate since Spark 3.0. # SPARK_WORKER_PORT The base port number for the first worker. If set, # subsequent workers will increment this number. If # unset, Spark will find a valid port number, but - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
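As a concrete illustration of the migration this commit recommends (a sketch with hypothetical resource values, not taken from the commit itself): replace N workers per host with a single worker that owns the whole node, and let per-executor sizing carve it into N executors.

```sh
# conf/spark-env.sh -- before (deprecated as of Spark 3.0):
#   SPARK_WORKER_INSTANCES=4
#   SPARK_WORKER_CORES=4
#   SPARK_WORKER_MEMORY=16g

# conf/spark-env.sh -- after: one worker per node with all resources
SPARK_WORKER_CORES=16
SPARK_WORKER_MEMORY=64g

# Then size executors at submit time so the scheduler launches four
# executors inside the single worker (values are illustrative):
#   spark-submit --executor-cores 4 --executor-memory 16g ...
```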
[spark] branch master updated (2b10d70 -> 0d4e4df)
This is an automated email from the ASF dual-hosted git repository. jiangxb1987 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 2b10d70 [SPARK-31423][SQL] Fix rebasing of not-existed dates add 0d4e4df [SPARK-31018][CORE][DOCS] Deprecate support of multiple workers on the same host in Standalone No new revisions were added by this update. Summary of changes: docs/core-migration-guide.md | 2 ++ docs/hardware-provisioning.md | 8 sbin/start-slave.sh | 2 +- 3 files changed, 7 insertions(+), 5 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-31018][CORE][DOCS] Deprecate support of multiple workers on the same host in Standalone
This is an automated email from the ASF dual-hosted git repository. jiangxb1987 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 0d4e4df [SPARK-31018][CORE][DOCS] Deprecate support of multiple workers on the same host in Standalone 0d4e4df is described below commit 0d4e4df06105cf2985dde17c1af76093b3ae8c13 Author: yi.wu AuthorDate: Wed Apr 15 11:29:55 2020 -0700 [SPARK-31018][CORE][DOCS] Deprecate support of multiple workers on the same host in Standalone ### What changes were proposed in this pull request? Update the document and shell script to warn user about the deprecation of multiple workers on the same host support. ### Why are the changes needed? This is a sub-task of [SPARK-30978](https://issues.apache.org/jira/browse/SPARK-30978), which plans to totally remove support of multiple workers in Spark 3.1. This PR makes the first step to deprecate it firstly in Spark 3.0. ### Does this PR introduce any user-facing change? Yeah, user see warning when they run start worker script. ### How was this patch tested? Tested manually. Closes #27768 from Ngone51/deprecate_spark_worker_instances. Authored-by: yi.wu Signed-off-by: Xingbo Jiang --- docs/core-migration-guide.md | 2 ++ docs/hardware-provisioning.md | 8 sbin/start-slave.sh | 2 +- 3 files changed, 7 insertions(+), 5 deletions(-) diff --git a/docs/core-migration-guide.md b/docs/core-migration-guide.md index 66a489b..cde6e07 100644 --- a/docs/core-migration-guide.md +++ b/docs/core-migration-guide.md @@ -38,3 +38,5 @@ license: | - Event log file will be written as UTF-8 encoding, and Spark History Server will replay event log files as UTF-8 encoding. Previously Spark wrote the event log file as default charset of driver JVM process, so Spark History Server of Spark 2.x is needed to read the old event log files in case of incompatible encoding. - A new protocol for fetching shuffle blocks is used. 
It's recommended that external shuffle services be upgraded when running Spark 3.0 apps. You can still use old external shuffle services by setting the configuration `spark.shuffle.useOldFetchProtocol` to `true`. Otherwise, Spark may run into errors with messages like `IllegalArgumentException: Unexpected message type: `. + +- `SPARK_WORKER_INSTANCES` is deprecated in Standalone mode. It's recommended to launch multiple executors in one worker and launch one worker per node instead of launching multiple workers per node and launching one executor per worker. diff --git a/docs/hardware-provisioning.md b/docs/hardware-provisioning.md index 4e5d681..fc87995f 100644 --- a/docs/hardware-provisioning.md +++ b/docs/hardware-provisioning.md @@ -63,10 +63,10 @@ Note that memory usage is greatly affected by storage level and serialization fo the [tuning guide](tuning.html) for tips on how to reduce it. Finally, note that the Java VM does not always behave well with more than 200 GiB of RAM. If you -purchase machines with more RAM than this, you can run _multiple worker JVMs per node_. In -Spark's [standalone mode](spark-standalone.html), you can set the number of workers per node -with the `SPARK_WORKER_INSTANCES` variable in `conf/spark-env.sh`, and the number of cores -per worker with `SPARK_WORKER_CORES`. +purchase machines with more RAM than this, you can launch multiple executors in a single node. In +Spark's [standalone mode](spark-standalone.html), a worker is responsible for launching multiple +executors according to its available memory and cores, and each executor will be launched in a +separate Java VM. # Network diff --git a/sbin/start-slave.sh b/sbin/start-slave.sh index 2cb17a0..9b3b26b 100755 --- a/sbin/start-slave.sh +++ b/sbin/start-slave.sh @@ -22,7 +22,7 @@ # Environment Variables # # SPARK_WORKER_INSTANCES The number of worker instances to run on this -# slave. Default is 1. +# slave. Default is 1. Note it has been deprecate since Spark 3.0. 
# SPARK_WORKER_PORT The base port number for the first worker. If set, # subsequent workers will increment this number. If # unset, Spark will find a valid port number, but - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.0 updated: [SPARK-31423][SQL] Fix rebasing of not-existed dates
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new 3702327 [SPARK-31423][SQL] Fix rebasing of not-existed dates 3702327 is described below commit 37023273fe0171ab758a81956dcc0ae9f8d2253b Author: Max Gekk AuthorDate: Wed Apr 15 16:33:56 2020 + [SPARK-31423][SQL] Fix rebasing of not-existed dates ### What changes were proposed in this pull request? In the PR, I propose to change rebasing of not-existed dates in the hybrid calendar (Julian + Gregorian since 1582-10-15) in the range (1582-10-04, 1582-10-15). Not existed dates from the range are shifted to the first valid date in the hybrid calendar - 1582-10-15. The changes affect only `rebaseGregorianToJulianDays()` because reverse rebasing from the hybrid dates to Proleptic Gregorian dates does not have such problem. ### Why are the changes needed? Currently, not-existed dates are shifted by standard difference between Julian and Gregorian calendar on 1582-10-04, for example 1582-10-14 -> 1582-10-24. That's contradict to shifting not existed dates in other cases, for example: ``` scala> sql("select date'1990-9-31'").show +-+ |DATE '1990-10-01'| +-+ | 1990-10-01| +-+ ``` ### Does this PR introduce any user-facing change? Yes, this impacts on conversion of Spark SQL `DATE` values to external dates based on non-Proleptic Gregorian calendar. For example, while saving the 1582-10-14 date to ORC files, it will be shifted to the next valid date 1582-10-15. ### How was this patch tested? - Added tests to `RebaseDateTimeSuite` and to `OrcSourceSuite` - By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`, `CollectionExpressionsSuite`, `ParquetIOSuite`. Closes #28225 from MaxGekk/fix-not-exist-dates. 
Authored-by: Max Gekk
Signed-off-by: Wenchen Fan
(cherry picked from commit 2b10d70bad30fb7b7c293338c2acc908031af0b8)
Signed-off-by: Wenchen Fan
---
 .../spark/sql/catalyst/util/RebaseDateTime.scala   | 16
 .../sql/catalyst/util/RebaseDateTimeSuite.scala    | 22 ++
 .../execution/datasources/orc/OrcSourceSuite.scala |  8 +---
 3 files changed, 39 insertions(+), 7 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/RebaseDateTime.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/RebaseDateTime.scala
index 50b552e..6338a59 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/RebaseDateTime.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/RebaseDateTime.scala
@@ -131,7 +131,8 @@ object RebaseDateTime {
   // The differences in days between Proleptic Gregorian and Julian dates.
   // The diff at the index `i` is applicable for all days in the date interval:
   // [gregJulianDiffSwitchDay(i), gregJulianDiffSwitchDay(i+1))
-  private val gregJulianDiffs = Array(-2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0)
+  private val gregJulianDiffs = Array(
+    -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0)
   // The sorted days in Proleptic Gregorian calendar when difference in days between
   // Proleptic Gregorian and Julian was changed.
   // The starting point is the `0001-01-01` (-719162 days since the epoch in
@@ -139,13 +140,17 @@ object RebaseDateTime {
   // Rebasing switch days and diffs `gregJulianDiffSwitchDay` and `gregJulianDiffs`
   // was generated by the `localRebaseGregorianToJulianDays` function.
   private val gregJulianDiffSwitchDay = Array(
-    -719162, -682944, -646420, -609896, -536847, -500323, -463799,
-    -390750, -354226, -317702, -244653, -208129, -171605, -141427)
+    -719162, -682944, -646420, -609896, -536847, -500323, -463799, -390750,
+    -354226, -317702, -244653, -208129, -171605, -141436, -141435, -141434,
+    -141433, -141432, -141431, -141430, -141429, -141428, -141427)
   // The first days of Common Era (CE) which is mapped to the '0001-01-01' date
   // in Proleptic Gregorian calendar.
   private final val gregorianCommonEraStartDay = gregJulianDiffSwitchDay(0)

+  private final val gregorianStartDay = LocalDate.of(1582, 10, 15)
+  private final val julianEndDay = LocalDate.of(1582, 10, 4)
+
   /**
    * Converts the given number of days since the epoch day 1970-01-01 to a local date in Proleptic
    * Gregorian calendar, interprets the result as a local date in Julian calendar, and takes the
@@ -165,7 +170,10 @@ object RebaseDateTime {
    * @return The rebased number of days in Julian calendar.
    */
   private[sql] def localRebaseGregorianToJulianDays(days: Int): Int = {
-    val localDate = LocalDate.ofEpochDay(days)
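The fix above can be illustrated outside of Spark: dates falling in the calendar gap (1582-10-04, 1582-10-15) are clamped to the first valid hybrid-calendar date before being reinterpreted. Below is a minimal Java sketch of that idea; the class and method names are hypothetical, and Spark's actual implementation additionally uses the precomputed `gregJulianDiffs`/`gregJulianDiffSwitchDay` arrays for speed instead of constructing a calendar per call.

```java
import java.time.LocalDate;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class RebaseSketch {
    // First valid date of the hybrid (Julian + Gregorian) calendar,
    // and the last valid Julian date before the cutover.
    static final LocalDate GREGORIAN_START = LocalDate.of(1582, 10, 15);
    static final LocalDate JULIAN_END = LocalDate.of(1582, 10, 4);
    static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    /**
     * Rebases a Proleptic Gregorian epoch day to the hybrid calendar.
     * Non-existent dates in the gap (1582-10-04, 1582-10-15) are shifted
     * to 1582-10-15, mirroring the SPARK-31423 fix.
     */
    static long rebaseGregorianToJulianDays(long days) {
        LocalDate localDate = LocalDate.ofEpochDay(days);
        if (localDate.isAfter(JULIAN_END) && localDate.isBefore(GREGORIAN_START)) {
            // Without this clamp, a lenient hybrid calendar would
            // "normalize" 1582-10-14 to 1582-10-24 (the reported bug).
            localDate = GREGORIAN_START;
        }
        // GregorianCalendar uses the hybrid calendar by default
        // (Julian before 1582-10-15, Gregorian after).
        GregorianCalendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.clear();
        cal.set(localDate.getYear(), localDate.getMonthValue() - 1, localDate.getDayOfMonth());
        return Math.floorDiv(cal.getTimeInMillis(), MILLIS_PER_DAY);
    }

    public static void main(String[] args) {
        long oct14 = LocalDate.of(1582, 10, 14).toEpochDay();
        long oct15 = LocalDate.of(1582, 10, 15).toEpochDay();
        // 1582-10-14 does not exist in the hybrid calendar, so it rebases
        // to the same hybrid day as 1582-10-15.
        System.out.println(rebaseGregorianToJulianDays(oct14)
            == rebaseGregorianToJulianDays(oct15));
    }
}
```

Note that the reverse direction needs no such clamp: every hybrid-calendar date has a valid Proleptic Gregorian counterpart, which is why only `rebaseGregorianToJulianDays()` changes.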
[spark] branch master updated (7699f76 -> 2b10d70)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 7699f76  [SPARK-31394][K8S] Adds support for Kubernetes NFS volume mounts
 add 2b10d70  [SPARK-31423][SQL] Fix rebasing of not-existed dates

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/util/RebaseDateTime.scala   | 16
 .../sql/catalyst/util/RebaseDateTimeSuite.scala    | 22 ++
 .../execution/datasources/orc/OrcSourceSuite.scala |  8 +---
 3 files changed, 39 insertions(+), 7 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (744c248 -> 7699f76)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 744c248  [SPARK-31443][SQL] Fix perf regression of toJavaDate
 add 7699f76  [SPARK-31394][K8S] Adds support for Kubernetes NFS volume mounts

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/spark/deploy/k8s/Config.scala |  2 +
 .../spark/deploy/k8s/KubernetesVolumeSpec.scala    |  5 ++
 .../spark/deploy/k8s/KubernetesVolumeUtils.scala   |  7 +++
 .../k8s/features/MountVolumesFeatureStep.scala     |  4 ++
 .../spark/deploy/k8s/KubernetesTestConf.scala      |  5 ++
 .../deploy/k8s/KubernetesVolumeUtilsSuite.scala    | 54 ++
 .../features/MountVolumesFeatureStepSuite.scala    | 44 ++
 7 files changed, 121 insertions(+)
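The files changed above (`Config.scala`, `KubernetesVolumeSpec.scala`, `MountVolumesFeatureStep.scala`) extend Spark's existing Kubernetes volume-mount configuration scheme with an `nfs` volume type. As a rough illustration only — the exact option names below are assumptions based on the established `spark.kubernetes.<role>.volumes.<type>.<name>.*` pattern used by the `hostPath` and `emptyDir` types, not taken from this commit — an NFS mount might be configured like this:

```shell
# Hypothetical spark-submit flags for an NFS volume mount on the driver.
# Key names follow the existing volume config pattern; verify against the
# "Running on Kubernetes" docs for the Spark version you run.
spark-submit \
  --master k8s://https://k8s-apiserver:6443 \
  --conf spark.kubernetes.driver.volumes.nfs.images.mount.path=/mnt/images \
  --conf spark.kubernetes.driver.volumes.nfs.images.mount.readOnly=true \
  --conf spark.kubernetes.driver.volumes.nfs.images.options.server=nfs.example.com \
  --conf spark.kubernetes.driver.volumes.nfs.images.options.path=/export/images \
  local:///opt/spark/examples/jars/spark-examples.jar
```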
[spark] branch branch-3.0 updated: [SPARK-31443][SQL] Fix perf regression of toJavaDate
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new a5a4194  [SPARK-31443][SQL] Fix perf regression of toJavaDate
a5a4194 is described below

commit a5a41942138c06cc67ffd1183bbf6da5214f78f0
Author: Max Gekk
AuthorDate: Wed Apr 15 06:19:12 2020 +

    [SPARK-31443][SQL] Fix perf regression of toJavaDate

    ### What changes were proposed in this pull request?
    Optimise the `toJavaDate()` method of `DateTimeUtils` by:
    1. Re-using `rebaseGregorianToJulianDays`, optimised by #28067
    2. Creating `java.sql.Date` instances from milliseconds in UTC since the epoch instead of from date-time fields. This avoids "normalization" inside of `java.sql.Date`.

    A new benchmark for collecting dates is also added to `DateTimeBenchmark`.

    ### Why are the changes needed?
    The changes fix the performance regression of collecting `DATE` values compared to Spark 2.4 (see `DateTimeBenchmark` in https://github.com/MaxGekk/spark/pull/27):

    Spark 2.4.6-SNAPSHOT:
    ```
    To/from Java's date-time:   Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
    ---------------------------------------------------------------------------------------------------------
    From java.sql.Date                    559            603          38         8.9         111.8        1.0X
    Collect dates                        2306           3221        1558         2.2         461.1        0.2X
    ```
    Before the changes:
    ```
    To/from Java's date-time:   Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
    ---------------------------------------------------------------------------------------------------------
    From java.sql.Date                   1052           1130          73         4.8         210.3        1.0X
    Collect dates                        3251           4943        1624         1.5         650.2        0.3X
    ```
    After:
    ```
    To/from Java's date-time:   Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
    ---------------------------------------------------------------------------------------------------------
    From java.sql.Date                    416            419           3        12.0          83.2        1.0X
    Collect dates                        1928           2759        1180         2.6         385.6        0.2X
    ```

    ### Does this PR introduce any user-facing change?
    No

    ### How was this patch tested?
    - By existing test suites, in particular, `DateTimeUtilsSuite`, `RebaseDateTimeSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`.
    - Re-run `DateTimeBenchmark` in the environment:

      | Item | Description |
      | ---- | ----------- |
      | Region | us-west-2 (Oregon) |
      | Instance | r3.xlarge |
      | AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
      | Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |

    Closes #28212 from MaxGekk/optimize-toJavaDate.

Authored-by: Max Gekk
Signed-off-by: Wenchen Fan
(cherry picked from commit 744c2480b580e0b68387328926ef7634cfb93adc)
Signed-off-by: Wenchen Fan
---
 .../spark/sql/catalyst/util/DateTimeUtils.scala    |  11 +-
 .../benchmarks/DateTimeBenchmark-jdk11-results.txt | 221 +++--
 sql/core/benchmarks/DateTimeBenchmark-results.txt  | 221 +++--
 .../execution/benchmark/DateTimeBenchmark.scala    |   5 +
 4 files changed, 236 insertions(+), 222 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
index dede92f..8486bba 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
@@ -26,6 +26,8 @@ import java.util.concurrent.TimeUnit._

 import scala.util.control.NonFatal

+import sun.util.calendar.ZoneInfo
+
 import org.apache.spark.sql.catalyst.util.DateTimeConstants._
 import org.apache.spark.sql.catalyst.util.RebaseDateTime._
 import org.apache.spark.sql.types.Decimal
@@ -123,8 +125,13 @@ object DateTimeUtils {
    * @return A `java.sql.Date` from number of days since epoch.
    */
   def toJavaDate(daysSinceEpoch: SQLDate): Date = {
-    val localDate = LocalDate.ofEpochDay(daysSinceEpoch)
+
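The two approaches the commit message contrasts can be sketched side by side in plain Java. This is an illustration with hypothetical names, not Spark's actual code: the real `toJavaDate` first applies `rebaseGregorianToJulianDays` (skipped here, since the example dates are far from 1582) and resolves the zone offset via `sun.util.calendar.ZoneInfo`; also, `TimeZone.getOffset` takes an instant, so midnights that coincide with a DST transition could be off in this simplified version.

```java
import java.sql.Date;
import java.time.LocalDate;
import java.util.TimeZone;

public class ToJavaDateSketch {
    static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    // Field-based approach (before the patch): the deprecated Date
    // constructor triggers internal "normalization" in java.sql.Date,
    // which is what made collecting DATE values slow.
    static Date toJavaDateOld(int daysSinceEpoch) {
        LocalDate ld = LocalDate.ofEpochDay(daysSinceEpoch);
        return new Date(ld.getYear() - 1900, ld.getMonthValue() - 1, ld.getDayOfMonth());
    }

    // Millis-based approach (after the patch, Julian rebase omitted):
    // compute the UTC millis of the day, then shift by the default zone
    // offset so the Date renders as local midnight of the same day.
    static Date toJavaDateNew(int daysSinceEpoch) {
        long utcMillis = daysSinceEpoch * MILLIS_PER_DAY;
        long localMillis = utcMillis - TimeZone.getDefault().getOffset(utcMillis);
        return new Date(localMillis);
    }

    public static void main(String[] args) {
        int days = (int) LocalDate.of(2020, 4, 15).toEpochDay();
        // Both approaches render the same calendar day.
        System.out.println(toJavaDateOld(days));
        System.out.println(toJavaDateNew(days));
    }
}
```

The second variant does one multiplication, one offset lookup, and one cheap `Date(long)` construction per value, which lines up with the roughly 2.5x improvement the benchmark tables above report for `From java.sql.Date`.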
[spark] branch master updated: [SPARK-31443][SQL] Fix perf regression of toJavaDate
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 744c248  [SPARK-31443][SQL] Fix perf regression of toJavaDate
744c248 is described below

commit 744c2480b580e0b68387328926ef7634cfb93adc
Author: Max Gekk
AuthorDate: Wed Apr 15 06:19:12 2020 +

    [SPARK-31443][SQL] Fix perf regression of toJavaDate

    ### What changes were proposed in this pull request?
    Optimise the `toJavaDate()` method of `DateTimeUtils` by:
    1. Re-using `rebaseGregorianToJulianDays`, optimised by #28067
    2. Creating `java.sql.Date` instances from milliseconds in UTC since the epoch instead of from date-time fields. This avoids "normalization" inside of `java.sql.Date`.

    A new benchmark for collecting dates is also added to `DateTimeBenchmark`.

    ### Why are the changes needed?
    The changes fix the performance regression of collecting `DATE` values compared to Spark 2.4 (see `DateTimeBenchmark` in https://github.com/MaxGekk/spark/pull/27):

    Spark 2.4.6-SNAPSHOT:
    ```
    To/from Java's date-time:   Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
    ---------------------------------------------------------------------------------------------------------
    From java.sql.Date                    559            603          38         8.9         111.8        1.0X
    Collect dates                        2306           3221        1558         2.2         461.1        0.2X
    ```
    Before the changes:
    ```
    To/from Java's date-time:   Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
    ---------------------------------------------------------------------------------------------------------
    From java.sql.Date                   1052           1130          73         4.8         210.3        1.0X
    Collect dates                        3251           4943        1624         1.5         650.2        0.3X
    ```
    After:
    ```
    To/from Java's date-time:   Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
    ---------------------------------------------------------------------------------------------------------
    From java.sql.Date                    416            419           3        12.0          83.2        1.0X
    Collect dates                        1928           2759        1180         2.6         385.6        0.2X
    ```

    ### Does this PR introduce any user-facing change?
    No

    ### How was this patch tested?
    - By existing test suites, in particular, `DateTimeUtilsSuite`, `RebaseDateTimeSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`.
    - Re-run `DateTimeBenchmark` in the environment:

      | Item | Description |
      | ---- | ----------- |
      | Region | us-west-2 (Oregon) |
      | Instance | r3.xlarge |
      | AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
      | Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |

    Closes #28212 from MaxGekk/optimize-toJavaDate.

Authored-by: Max Gekk
Signed-off-by: Wenchen Fan
---
 .../spark/sql/catalyst/util/DateTimeUtils.scala    |  11 +-
 .../benchmarks/DateTimeBenchmark-jdk11-results.txt | 221 +++--
 sql/core/benchmarks/DateTimeBenchmark-results.txt  | 221 +++--
 .../execution/benchmark/DateTimeBenchmark.scala    |   5 +
 4 files changed, 236 insertions(+), 222 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
index 56259df..021072c 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
@@ -26,6 +26,8 @@ import java.util.concurrent.TimeUnit._

 import scala.util.control.NonFatal

+import sun.util.calendar.ZoneInfo
+
 import org.apache.spark.sql.catalyst.util.DateTimeConstants._
 import org.apache.spark.sql.catalyst.util.RebaseDateTime._
 import org.apache.spark.sql.types.Decimal
@@ -121,8 +123,13 @@ object DateTimeUtils {
    * @return A `java.sql.Date` from number of days since epoch.
    */
   def toJavaDate(daysSinceEpoch: SQLDate): Date = {
-    val localDate = LocalDate.ofEpochDay(daysSinceEpoch)
-    new Date(localDate.getYear - 1900, localDate.getMonthValue - 1, localDate.getDayOfMonth)
+    val rebasedDays = rebaseGregorianToJulianDays(daysSinceEpoch)
+    val localMillis =