[spark] branch master updated (de59e01 -> 00b63c8)

2021-05-19 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from de59e01  [SPARK-35443][K8S] Mark K8s ConfigMaps and Secrets created by 
Spark as immutable
 add 00b63c8  [SPARK-27991][CORE] Defer the fetch request on Netty OOM

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/network/util/NettyUtils.java  |   4 +
 .../executor/CoarseGrainedExecutorBackend.scala|   9 ++
 .../org/apache/spark/internal/config/package.scala |   9 ++
 .../spark/shuffle/BlockStoreShuffleReader.scala|   1 +
 .../storage/ShuffleBlockFetcherIterator.scala  | 153 ++---
 .../storage/ShuffleBlockFetcherIteratorSuite.scala | 132 ++
 6 files changed, 291 insertions(+), 17 deletions(-)
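
The entry above carries only a diffstat. As a rough, heavily hedged illustration of the idea in the title (park shuffle fetch requests when the network layer reports an out-of-memory condition, then replay them once memory frees up), here is a toy Scala sketch; every name is hypothetical, and the actual change reacts to Netty's specific OOM error rather than `OutOfMemoryError` in general:

```scala
import scala.collection.mutable

// Toy model of deferring fetch requests on an OOM from the network layer.
case class FetchRequest(blockId: String)

class DeferringFetcher(send: FetchRequest => Unit) {
  private val deferred = mutable.Queue.empty[FetchRequest]

  def fetch(req: FetchRequest): Unit =
    try send(req)
    catch {
      // Catching OutOfMemoryError here is only to keep the sketch standalone;
      // the real code keys on Netty's OutOfDirectMemoryError.
      case _: OutOfMemoryError => deferred.enqueue(req)
    }

  // Called later, e.g. when an in-flight request completes and frees memory.
  def retryDeferred(): Unit =
    while (deferred.nonEmpty) send(deferred.dequeue())
}
```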




[spark] branch master updated (a970f85 -> de59e01)

2021-05-19 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from a970f85  [SPARK-35338][PYTHON] Separate arithmetic operations into 
data type based structures
 add de59e01  [SPARK-35443][K8S] Mark K8s ConfigMaps and Secrets created by 
Spark as immutable

No new revisions were added by this update.

Summary of changes:
 .../deploy/k8s/features/DriverKubernetesCredentialsFeatureStep.scala   | 1 +
 .../apache/spark/deploy/k8s/features/HadoopConfDriverFeatureStep.scala | 1 +
 .../spark/deploy/k8s/features/KerberosConfDriverFeatureStep.scala  | 3 +++
 .../apache/spark/deploy/k8s/features/PodTemplateConfigMapStep.scala| 1 +
 .../org/apache/spark/deploy/k8s/submit/KubernetesClientUtils.scala | 1 +
 .../spark/deploy/k8s/features/PodTemplateConfigMapStepSuite.scala  | 1 +
 .../test/scala/org/apache/spark/deploy/k8s/submit/ClientSuite.scala| 1 +
 7 files changed, 9 insertions(+)
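
For background, `immutable` is a standard field on Kubernetes ConfigMaps and Secrets (since K8s 1.19): once set, the API server rejects updates and kubelets can stop watching the object. A sketch with the fabric8 client used by Spark's K8s backend; the builder call shown is an assumption based on the fabric8 model, not code quoted from this commit:

```scala
import io.fabric8.kubernetes.api.model.{ConfigMap, ConfigMapBuilder}

// Build a ConfigMap marked immutable; the name and data here are made up.
val configMap: ConfigMap = new ConfigMapBuilder()
  .withNewMetadata()
    .withName("spark-exec-conf") // hypothetical name
  .endMetadata()
  .addToData("spark.properties", "spark.executor.memory=4g")
  .withImmutable(true)
  .build()
```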




svn commit: r47830 - /release/spark/spark-2.4.7/

2021-05-19 Thread dongjoon
Author: dongjoon
Date: Thu May 20 03:14:53 2021
New Revision: 47830

Log:
Remove Apache Spark 2.4.7 after 2.4.8 release

Removed:
release/spark/spark-2.4.7/





[spark] branch master updated (7eaabf4 -> a970f85)

2021-05-19 Thread ueshin
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 7eaabf4  [SPARK-35408][PYTHON][FOLLOW-UP] Avoid unnecessary f-string 
format
 add a970f85  [SPARK-35338][PYTHON] Separate arithmetic operations into 
data type based structures

No new revisions were added by this update.

Summary of changes:
 dev/sparktestsupport/modules.py|   6 +
 python/pyspark/pandas/base.py  | 268 ++-
 .../tests => pandas/data_type_ops}/__init__.py |   0
 python/pyspark/pandas/data_type_ops/base.py| 120 +++
 python/pyspark/pandas/data_type_ops/boolean_ops.py |  28 ++
 .../pandas/data_type_ops/categorical_ops.py|  28 ++
 python/pyspark/pandas/data_type_ops/date_ops.py|  71 
 .../pyspark/pandas/data_type_ops/datetime_ops.py   |  72 
 python/pyspark/pandas/data_type_ops/num_ops.py | 378 +
 python/pyspark/pandas/data_type_ops/string_ops.py  | 104 ++
 .../tests/data_type_ops}/__init__.py   |   0
 .../pandas/tests/data_type_ops/test_boolean_ops.py | 150 
 .../tests/data_type_ops/test_categorical_ops.py| 128 +++
 .../pandas/tests/data_type_ops/test_date_ops.py| 158 +
 .../tests/data_type_ops/test_datetime_ops.py   | 160 +
 .../pandas/tests/data_type_ops/test_num_ops.py | 195 +++
 .../pandas/tests/data_type_ops/test_string_ops.py  | 140 
 .../pandas/tests/data_type_ops/testing_utils.py|  75 
 .../pyspark/pandas/tests/indexes/test_datetime.py  |  10 +-
 python/pyspark/pandas/tests/test_dataframe.py  |   4 +-
 .../pyspark/pandas/tests/test_series_datetime.py   |  10 +-
 python/pyspark/testing/pandasutils.py  |   2 +-
 python/setup.py|   1 +
 23 files changed, 1849 insertions(+), 259 deletions(-)
 copy python/pyspark/{streaming/tests => pandas/data_type_ops}/__init__.py (100%)
 create mode 100644 python/pyspark/pandas/data_type_ops/base.py
 create mode 100644 python/pyspark/pandas/data_type_ops/boolean_ops.py
 create mode 100644 python/pyspark/pandas/data_type_ops/categorical_ops.py
 create mode 100644 python/pyspark/pandas/data_type_ops/date_ops.py
 create mode 100644 python/pyspark/pandas/data_type_ops/datetime_ops.py
 create mode 100644 python/pyspark/pandas/data_type_ops/num_ops.py
 create mode 100644 python/pyspark/pandas/data_type_ops/string_ops.py
 copy python/pyspark/{streaming/tests => pandas/tests/data_type_ops}/__init__.py (100%)
 create mode 100644 python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py
 create mode 100644 python/pyspark/pandas/tests/data_type_ops/test_categorical_ops.py
 create mode 100644 python/pyspark/pandas/tests/data_type_ops/test_date_ops.py
 create mode 100644 python/pyspark/pandas/tests/data_type_ops/test_datetime_ops.py
 create mode 100644 python/pyspark/pandas/tests/data_type_ops/test_num_ops.py
 create mode 100644 python/pyspark/pandas/tests/data_type_ops/test_string_ops.py
 create mode 100644 python/pyspark/pandas/tests/data_type_ops/testing_utils.py




[spark] branch master updated (c064805 -> 7eaabf4)

2021-05-19 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from c064805  [SPARK-35450][INFRA] Follow checkout-merge way to use the 
latest commit for linter, or other workflows
 add 7eaabf4  [SPARK-35408][PYTHON][FOLLOW-UP] Avoid unnecessary f-string 
format

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/dataframe.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)




[spark] branch master updated (d44e6c7 -> c064805)

2021-05-19 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from d44e6c7  Revert "[SPARK-35338][PYTHON] Separate arithmetic operations 
into data type based structures"
 add c064805  [SPARK-35450][INFRA] Follow checkout-merge way to use the 
latest commit for linter, or other workflows

No new revisions were added by this update.

Summary of changes:
 .github/workflows/build_and_test.yml | 55 
 1 file changed, 55 insertions(+)




[spark] branch master updated (586caae -> d44e6c7)

2021-05-19 Thread ueshin
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 586caae  [SPARK-35438][SQL][DOCS] Minor documentation fix for window 
physical operator
 add d44e6c7  Revert "[SPARK-35338][PYTHON] Separate arithmetic operations 
into data type based structures"

No new revisions were added by this update.

Summary of changes:
 dev/sparktestsupport/modules.py|   6 -
 python/pyspark/pandas/base.py  | 268 +--
 python/pyspark/pandas/data_type_ops/__init__.py|  16 -
 python/pyspark/pandas/data_type_ops/base.py| 120 ---
 python/pyspark/pandas/data_type_ops/boolean_ops.py |  28 --
 .../pandas/data_type_ops/categorical_ops.py|  28 --
 python/pyspark/pandas/data_type_ops/date_ops.py|  71 
 .../pyspark/pandas/data_type_ops/datetime_ops.py   |  72 
 python/pyspark/pandas/data_type_ops/num_ops.py | 378 -
 python/pyspark/pandas/data_type_ops/string_ops.py  | 104 --
 .../pyspark/pandas/tests/data_type_ops/__init__.py |  16 -
 .../pandas/tests/data_type_ops/test_boolean_ops.py | 150 
 .../tests/data_type_ops/test_categorical_ops.py| 128 ---
 .../pandas/tests/data_type_ops/test_date_ops.py| 158 -
 .../tests/data_type_ops/test_datetime_ops.py   | 160 -
 .../pandas/tests/data_type_ops/test_num_ops.py | 195 ---
 .../pandas/tests/data_type_ops/test_string_ops.py  | 140 
 .../pandas/tests/data_type_ops/testing_utils.py|  75 
 .../pyspark/pandas/tests/indexes/test_datetime.py  |  10 +-
 python/pyspark/pandas/tests/test_dataframe.py  |   4 +-
 .../pyspark/pandas/tests/test_series_datetime.py   |  10 +-
 python/pyspark/testing/pandasutils.py  |   2 +-
 python/setup.py|   1 -
 23 files changed, 259 insertions(+), 1881 deletions(-)
 delete mode 100644 python/pyspark/pandas/data_type_ops/__init__.py
 delete mode 100644 python/pyspark/pandas/data_type_ops/base.py
 delete mode 100644 python/pyspark/pandas/data_type_ops/boolean_ops.py
 delete mode 100644 python/pyspark/pandas/data_type_ops/categorical_ops.py
 delete mode 100644 python/pyspark/pandas/data_type_ops/date_ops.py
 delete mode 100644 python/pyspark/pandas/data_type_ops/datetime_ops.py
 delete mode 100644 python/pyspark/pandas/data_type_ops/num_ops.py
 delete mode 100644 python/pyspark/pandas/data_type_ops/string_ops.py
 delete mode 100644 python/pyspark/pandas/tests/data_type_ops/__init__.py
 delete mode 100644 python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py
 delete mode 100644 python/pyspark/pandas/tests/data_type_ops/test_categorical_ops.py
 delete mode 100644 python/pyspark/pandas/tests/data_type_ops/test_date_ops.py
 delete mode 100644 python/pyspark/pandas/tests/data_type_ops/test_datetime_ops.py
 delete mode 100644 python/pyspark/pandas/tests/data_type_ops/test_num_ops.py
 delete mode 100644 python/pyspark/pandas/tests/data_type_ops/test_string_ops.py
 delete mode 100644 python/pyspark/pandas/tests/data_type_ops/testing_utils.py




[spark] branch master updated (d1b24d8 -> 586caae)

2021-05-19 Thread yamamuro
This is an automated email from the ASF dual-hosted git repository.

yamamuro pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from d1b24d8  [SPARK-35338][PYTHON] Separate arithmetic operations into 
data type based structures
 add 586caae  [SPARK-35438][SQL][DOCS] Minor documentation fix for window 
physical operator

No new revisions were added by this update.

Summary of changes:
 .../main/scala/org/apache/spark/sql/execution/window/WindowExec.scala   | 2 +-
 .../scala/org/apache/spark/sql/execution/window/WindowExecBase.scala| 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)




[spark] branch master updated: [SPARK-35338][PYTHON] Separate arithmetic operations into data type based structures

2021-05-19 Thread ueshin
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d1b24d8  [SPARK-35338][PYTHON] Separate arithmetic operations into 
data type based structures
d1b24d8 is described below

commit d1b24d8aba8317c62542a81ca55c12700a07cb80
Author: Xinrong Meng 
AuthorDate: Wed May 19 15:05:32 2021 -0700

[SPARK-35338][PYTHON] Separate arithmetic operations into data type based 
structures

### What changes were proposed in this pull request?

The PR is proposed for **pandas APIs on Spark**, in order to separate 
arithmetic operations shown as below into data-type-based structures.
`__add__, __sub__, __mul__, __truediv__, __floordiv__, __pow__, __mod__,
__radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rpow__, __rmod__`

DataTypeOps and subclasses are introduced.

The existing behaviors of each arithmetic operation should be preserved.

### Why are the changes needed?

Currently, each arithmetic operation is defined for all data types in a
single function, which makes it difficult to adjust or extend the behavior
per data type.

Introducing DataTypeOps would be the foundation for [pandas APIs on Spark: 
Separate basic operations into data type based 
structures.](https://docs.google.com/document/d/12MS6xK0hETYmrcl5b9pX5lgV4FmGVfpmcSKq--_oQlc/edit?usp=sharing).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tests are introduced under pyspark.pandas.tests.data_type_ops. One test 
file per DataTypeOps class.

Closes #32469 from xinrong-databricks/datatypeop_arith.

Authored-by: Xinrong Meng 
Signed-off-by: Takuya UESHIN 
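
The change itself is in Python, but the dispatch pattern the message describes, one ops object per data type resolved from the value's type instead of one shared monolithic operator, fits in a few lines. A toy Scala sketch; all names are hypothetical and none of this is the pyspark.pandas API:

```scala
// One ops object per data type, picked by a resolver, so adding a type
// means adding a class rather than editing a shared __add__-style method.
sealed trait ToyDataType
case object NumType extends ToyDataType
case object StrType extends ToyDataType

trait DataTypeOps {
  def add(left: Any, right: Any): Any
}

object NumOps extends DataTypeOps {
  def add(left: Any, right: Any): Any =
    left.asInstanceOf[Double] + right.asInstanceOf[Double]
}

object StrOps extends DataTypeOps {
  // String "addition" is concatenation, mirroring pandas semantics.
  def add(left: Any, right: Any): Any =
    left.asInstanceOf[String] + right.asInstanceOf[String]
}

def opsFor(dt: ToyDataType): DataTypeOps = dt match {
  case NumType => NumOps
  case StrType => StrOps
}

assert(opsFor(NumType).add(1.0, 2.0) == 3.0)
assert(opsFor(StrType).add("a", "b") == "ab")
```
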
---
 dev/sparktestsupport/modules.py|   6 +
 python/pyspark/pandas/base.py  | 268 ++-
 python/pyspark/pandas/data_type_ops/__init__.py|  16 +
 python/pyspark/pandas/data_type_ops/base.py| 120 +++
 python/pyspark/pandas/data_type_ops/boolean_ops.py |  28 ++
 .../pandas/data_type_ops/categorical_ops.py|  28 ++
 python/pyspark/pandas/data_type_ops/date_ops.py|  71 
 .../pyspark/pandas/data_type_ops/datetime_ops.py   |  72 
 python/pyspark/pandas/data_type_ops/num_ops.py | 378 +
 python/pyspark/pandas/data_type_ops/string_ops.py  | 104 ++
 .../pyspark/pandas/tests/data_type_ops/__init__.py |  16 +
 .../pandas/tests/data_type_ops/test_boolean_ops.py | 150 
 .../tests/data_type_ops/test_categorical_ops.py| 128 +++
 .../pandas/tests/data_type_ops/test_date_ops.py| 158 +
 .../tests/data_type_ops/test_datetime_ops.py   | 160 +
 .../pandas/tests/data_type_ops/test_num_ops.py | 195 +++
 .../pandas/tests/data_type_ops/test_string_ops.py  | 140 
 .../pandas/tests/data_type_ops/testing_utils.py|  75 
 .../pyspark/pandas/tests/indexes/test_datetime.py  |  10 +-
 python/pyspark/pandas/tests/test_dataframe.py  |   4 +-
 .../pyspark/pandas/tests/test_series_datetime.py   |  10 +-
 python/pyspark/testing/pandasutils.py  |   2 +-
 python/setup.py|   1 +
 23 files changed, 1881 insertions(+), 259 deletions(-)

diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py
index ab65ccd..c35618e 100644
--- a/dev/sparktestsupport/modules.py
+++ b/dev/sparktestsupport/modules.py
@@ -611,6 +611,12 @@ pyspark_pandas = Module(
 "pyspark.pandas.spark.utils",
 "pyspark.pandas.typedef.typehints",
 # unittests
+"pyspark.pandas.tests.data_type_ops.test_boolean_ops",
+"pyspark.pandas.tests.data_type_ops.test_categorical_ops",
+"pyspark.pandas.tests.data_type_ops.test_date_ops",
+"pyspark.pandas.tests.data_type_ops.test_datetime_ops",
+"pyspark.pandas.tests.data_type_ops.test_num_ops",
+"pyspark.pandas.tests.data_type_ops.test_string_ops",
 "pyspark.pandas.tests.indexes.test_base",
 "pyspark.pandas.tests.indexes.test_category",
 "pyspark.pandas.tests.indexes.test_datetime",
diff --git a/python/pyspark/pandas/base.py b/python/pyspark/pandas/base.py
index 1082052..6cecb73 100644
--- a/python/pyspark/pandas/base.py
+++ b/python/pyspark/pandas/base.py
@@ -19,7 +19,6 @@
 Base and utility classes for pandas-on-Spark objects.
 """
 from abc import ABCMeta, abstractmethod
-import datetime
 from functools import wraps, partial
 from itertools import chain
 from typing import Any, Callable, Optional, Tuple, Union, cast, TYPE_CHECKING
@@ -35,7 +34,6 @@ from pyspark.sql.types import (
 DateType,
 DoubleType,
 FloatType,
-IntegralType,
 LongType,
 NumericType,
 StringType,
@@ -50,11 +48,9 @@ from pyspark.pandas.internal 

[spark] branch branch-3.0 updated: [SPARK-35373][BUILD][FOLLOWUP][3.1] Update zinc installation

2021-05-19 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
 new d4b637e  [SPARK-35373][BUILD][FOLLOWUP][3.1] Update zinc installation
d4b637e is described below

commit d4b637e13d0893f7fdb71703e56bc7f0dfc93f90
Author: Dongjoon Hyun 
AuthorDate: Wed May 19 09:51:58 2021 -0700

[SPARK-35373][BUILD][FOLLOWUP][3.1] Update zinc installation

### What changes were proposed in this pull request?

This PR is a follow-up of https://github.com/apache/spark/pull/32505 to fix 
`zinc` installation.

### Why are the changes needed?

Currently, branch-3.1/3.0 is broken.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action.

Closes #32591 from dongjoon-hyun/SPARK-35373.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 50edf1246809ffe2b142cab4a9dc4e4e72df3fe7)
Signed-off-by: Dongjoon Hyun 
---
 build/mvn | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/build/mvn b/build/mvn
index 9c9ea4b..d304f86 100755
--- a/build/mvn
+++ b/build/mvn
@@ -150,7 +150,10 @@ install_zinc() {
 local TYPESAFE_MIRROR=${TYPESAFE_MIRROR:-https://downloads.lightbend.com}
 
 install_app \
-  "${TYPESAFE_MIRROR}/zinc/${ZINC_VERSION}/zinc-${ZINC_VERSION}.tgz" \
+  "${TYPESAFE_MIRROR}" \
+  "zinc/${ZINC_VERSION}/zinc-${ZINC_VERSION}.tgz" \
+  "" \
+  "" \
   "zinc-${ZINC_VERSION}.tgz" \
   "${zinc_path}"
 ZINC_BIN="${_DIR}/${zinc_path}"




[spark] branch branch-3.1 updated: [SPARK-35373][BUILD][FOLLOWUP][3.1] Update zinc installation

2021-05-19 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.1 by this push:
 new 50edf12  [SPARK-35373][BUILD][FOLLOWUP][3.1] Update zinc installation
50edf12 is described below

commit 50edf1246809ffe2b142cab4a9dc4e4e72df3fe7
Author: Dongjoon Hyun 
AuthorDate: Wed May 19 09:51:58 2021 -0700

[SPARK-35373][BUILD][FOLLOWUP][3.1] Update zinc installation

### What changes were proposed in this pull request?

This PR is a follow-up of https://github.com/apache/spark/pull/32505 to fix 
`zinc` installation.

### Why are the changes needed?

Currently, branch-3.1/3.0 is broken.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action.

Closes #32591 from dongjoon-hyun/SPARK-35373.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 build/mvn | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/build/mvn b/build/mvn
index 4e3a8fa..e62abf3 100755
--- a/build/mvn
+++ b/build/mvn
@@ -142,7 +142,10 @@ install_zinc() {
 local TYPESAFE_MIRROR=${TYPESAFE_MIRROR:-https://downloads.lightbend.com}
 
 install_app \
-  "${TYPESAFE_MIRROR}/zinc/${ZINC_VERSION}/zinc-${ZINC_VERSION}.tgz" \
+  "${TYPESAFE_MIRROR}" \
+  "zinc/${ZINC_VERSION}/zinc-${ZINC_VERSION}.tgz" \
+  "" \
+  "" \
   "zinc-${ZINC_VERSION}.tgz" \
   "${zinc_path}"
 ZINC_BIN="${_DIR}/${zinc_path}"




[GitHub] [spark-website] HyukjinKwon commented on pull request #343: Make 2.4.8 as EOL release

2021-05-19 Thread GitBox


HyukjinKwon commented on pull request #343:
URL: https://github.com/apache/spark-website/pull/343#issuecomment-844272890


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org






[spark] branch branch-3.0 updated: [SPARK-35373][BUILD] Check Maven artifact checksum in build/mvn

2021-05-19 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
 new 706f91e  [SPARK-35373][BUILD] Check Maven artifact checksum in 
build/mvn
706f91e is described below

commit 706f91e5e0ae3c6aa156232546f815840f977201
Author: Sean Owen 
AuthorDate: Thu May 13 09:06:57 2021 -0500

[SPARK-35373][BUILD] Check Maven artifact checksum in build/mvn

### What changes were proposed in this pull request?

`./build/mvn` now also fetches the .sha512 checksum for each Maven artifact
it downloads, and verifies the checksum afterwards.

### Why are the changes needed?

This ensures the integrity of the Maven artifact during a user's build, 
which may come from several non-ASF mirrors.

### Does this PR introduce _any_ user-facing change?

Should not affect anything about Spark per se, just the build.

### How was this patch tested?

Manual testing wherein I forced Maven/Scala download, verified checksums 
are downloaded and checked, and verified it fails on error with a corrupted 
checksum.

Closes #32505 from srowen/SPARK-35373.

Authored-by: Sean Owen 
Signed-off-by: Sean Owen 
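
The diff below performs the check in shell with `shasum`; purely for reference, the same SHA-512 verification as a self-contained Scala sketch, assuming the `.sha512` file's first whitespace-separated token is the hex digest:

```scala
import java.nio.file.{Files, Paths}
import java.security.MessageDigest

// Hash the downloaded artifact and render the digest as lowercase hex.
def sha512Hex(path: String): String = {
  val bytes = Files.readAllBytes(Paths.get(path))
  MessageDigest.getInstance("SHA-512").digest(bytes)
    .map("%02x".format(_)).mkString
}

// Compare against the published digest; fail loudly on mismatch.
def verify(artifactPath: String, checksumPath: String): Unit = {
  val expected =
    new String(Files.readAllBytes(Paths.get(checksumPath))).trim.split("\\s+")(0)
  require(sha512Hex(artifactPath).equalsIgnoreCase(expected),
    s"Bad checksum for $artifactPath")
}
```
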
---
 build/mvn | 90 +--
 1 file changed, 65 insertions(+), 25 deletions(-)

diff --git a/build/mvn b/build/mvn
index f33dedc..9c9ea4b 100755
--- a/build/mvn
+++ b/build/mvn
@@ -26,14 +26,23 @@ _COMPILE_JVM_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g"
 
 # Installs any application tarball given a URL, the expected tarball name,
 # and, optionally, a checkable binary path to determine if the binary has
-# already been installed
-## Arg1 - URL
-## Arg2 - Tarball Name
-## Arg3 - Checkable Binary
+# already been installed. Arguments:
+# 1 - Mirror host
+# 2 - URL path on host
+# 3 - URL query string
+# 4 - checksum suffix
+# 5 - Tarball Name
+# 6 - Checkable Binary
 install_app() {
-  local remote_tarball="$1"
-  local local_tarball="${_DIR}/$2"
-  local binary="${_DIR}/$3"
+  local mirror_host="$1"
+  local url_path="$2"
+  local url_query="$3"
+  local checksum_suffix="$4"
+  local local_tarball="${_DIR}/$5"
+  local binary="${_DIR}/$6"
+  local remote_tarball="${mirror_host}/${url_path}${url_query}"
+  local local_checksum="${local_tarball}.${checksum_suffix}"
+  local remote_checksum="https://archive.apache.org/dist/${url_path}.${checksum_suffix}"
 
   # setup `curl` and `wget` silent options if we're running on Jenkins
   local curl_opts="-L"
@@ -46,24 +55,46 @@ install_app() {
 wget_opts="--progress=bar:force ${wget_opts}"
   fi
 
-  if [ -z "$3" -o ! -f "$binary" ]; then
+  if [ ! -f "$binary" ]; then
     # check if we already have the tarball
     # check if we have curl installed
     # download application
-    [ ! -f "${local_tarball}" ] && [ $(command -v curl) ] && \
-      echo "exec: curl ${curl_opts} ${remote_tarball}" 1>&2 && \
+    if [ ! -f "${local_tarball}" -a $(command -v curl) ]; then
+      echo "exec: curl ${curl_opts} ${remote_tarball}" 1>&2
       curl ${curl_opts} "${remote_tarball}" > "${local_tarball}"
+      if [ ! -z "${checksum_suffix}" ]; then
+        echo "exec: curl ${curl_opts} ${remote_checksum}" 1>&2
+        curl ${curl_opts} "${remote_checksum}" > "${local_checksum}"
+      fi
+    fi
     # if the file still doesn't exist, lets try `wget` and cross our fingers
-    [ ! -f "${local_tarball}" ] && [ $(command -v wget) ] && \
-      echo "exec: wget ${wget_opts} ${remote_tarball}" 1>&2 && \
+    if [ ! -f "${local_tarball}" -a $(command -v wget) ]; then
+      echo "exec: wget ${wget_opts} ${remote_tarball}" 1>&2
       wget ${wget_opts} -O "${local_tarball}" "${remote_tarball}"
+      if [ ! -z "${checksum_suffix}" ]; then
+        echo "exec: wget ${wget_opts} ${remote_checksum}" 1>&2
+        wget ${wget_opts} -O "${local_checksum}" "${remote_checksum}"
+      fi
+    fi
     # if both were unsuccessful, exit
-    [ ! -f "${local_tarball}" ] && \
-      echo -n "ERROR: Cannot download $2 with cURL or wget; " && \
-      echo "please install manually and try again." && \
+    if [ ! -f "${local_tarball}" ]; then
+      echo -n "ERROR: Cannot download ${remote_tarball} with cURL or wget; please install manually and try again."
       exit 2
-    cd "${_DIR}" && tar -xzf "$2"
-    rm -rf "$local_tarball"
+    fi
+    # Checksum may not have been specified; don't check if doesn't exist
+    if [ -f "${local_checksum}" ]; then
+      echo "  ${local_tarball}" >> ${local_checksum} # two spaces + file are important!
+      # Assuming SHA512 here for now
+      echo "Verifying checksum from ${local_checksum}" 1>&2
+      if ! shasum -a 512 -q -c "${local_checksum}" ; then
+        echo "Bad checksum from ${remote_checksum}"
+        exit 2
+      fi
+    fi
+
+    cd "${_DIR}" && tar -xzf "${local_tarball}"

[spark] branch branch-3.1 updated: [SPARK-35373][BUILD] Check Maven artifact checksum in build/mvn

2021-05-19 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.1 by this push:
 new 5d4de77  [SPARK-35373][BUILD] Check Maven artifact checksum in 
build/mvn
5d4de77 is described below

commit 5d4de77a00324d249df31e61a94812c5b93b7b40
Author: Sean Owen 
AuthorDate: Thu May 13 09:06:57 2021 -0500

[SPARK-35373][BUILD] Check Maven artifact checksum in build/mvn

### What changes were proposed in this pull request?

`./build/mvn` now also fetches the .sha512 checksum for each Maven artifact
it downloads, and verifies the checksum afterwards.

### Why are the changes needed?

This ensures the integrity of the Maven artifact during a user's build, 
which may come from several non-ASF mirrors.

### Does this PR introduce _any_ user-facing change?

Should not affect anything about Spark per se, just the build.

### How was this patch tested?

Manual testing wherein I forced Maven/Scala download, verified checksums 
are downloaded and checked, and verified it fails on error with a corrupted 
checksum.

Closes #32505 from srowen/SPARK-35373.

Authored-by: Sean Owen 
Signed-off-by: Sean Owen 
---
 build/mvn | 90 +--
 1 file changed, 65 insertions(+), 25 deletions(-)

diff --git a/build/mvn b/build/mvn
index b25df09..4e3a8fa 100755
--- a/build/mvn
+++ b/build/mvn
@@ -26,36 +26,67 @@ _COMPILE_JVM_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g"
 
 # Installs any application tarball given a URL, the expected tarball name,
 # and, optionally, a checkable binary path to determine if the binary has
-# already been installed
-## Arg1 - URL
-## Arg2 - Tarball Name
-## Arg3 - Checkable Binary
+# already been installed. Arguments:
+# 1 - Mirror host
+# 2 - URL path on host
+# 3 - URL query string
+# 4 - checksum suffix
+# 5 - Tarball Name
+# 6 - Checkable Binary
 install_app() {
-  local remote_tarball="$1"
-  local local_tarball="${_DIR}/$2"
-  local binary="${_DIR}/$3"
+  local mirror_host="$1"
+  local url_path="$2"
+  local url_query="$3"
+  local checksum_suffix="$4"
+  local local_tarball="${_DIR}/$5"
+  local binary="${_DIR}/$6"
+  local remote_tarball="${mirror_host}/${url_path}${url_query}"
+  local local_checksum="${local_tarball}.${checksum_suffix}"
+  local remote_checksum="https://archive.apache.org/dist/${url_path}.${checksum_suffix}"
 
   local curl_opts="--silent --show-error -L"
   local wget_opts="--no-verbose"
 
-  if [ -z "$3" -o ! -f "$binary" ]; then
+  if [ ! -f "$binary" ]; then
     # check if we already have the tarball
     # check if we have curl installed
     # download application
-    [ ! -f "${local_tarball}" ] && [ $(command -v curl) ] && \
-      echo "exec: curl ${curl_opts} ${remote_tarball}" 1>&2 && \
+    if [ ! -f "${local_tarball}" -a $(command -v curl) ]; then
+      echo "exec: curl ${curl_opts} ${remote_tarball}" 1>&2
       curl ${curl_opts} "${remote_tarball}" > "${local_tarball}"
+      if [ ! -z "${checksum_suffix}" ]; then
+        echo "exec: curl ${curl_opts} ${remote_checksum}" 1>&2
+        curl ${curl_opts} "${remote_checksum}" > "${local_checksum}"
+      fi
+    fi
     # if the file still doesn't exist, lets try `wget` and cross our fingers
-    [ ! -f "${local_tarball}" ] && [ $(command -v wget) ] && \
-      echo "exec: wget ${wget_opts} ${remote_tarball}" 1>&2 && \
+    if [ ! -f "${local_tarball}" -a $(command -v wget) ]; then
+      echo "exec: wget ${wget_opts} ${remote_tarball}" 1>&2
       wget ${wget_opts} -O "${local_tarball}" "${remote_tarball}"
+      if [ ! -z "${checksum_suffix}" ]; then
+        echo "exec: wget ${wget_opts} ${remote_checksum}" 1>&2
+        wget ${wget_opts} -O "${local_checksum}" "${remote_checksum}"
+      fi
+    fi
     # if both were unsuccessful, exit
-    [ ! -f "${local_tarball}" ] && \
-      echo -n "ERROR: Cannot download $2 with cURL or wget; " && \
-      echo "please install manually and try again." && \
+    if [ ! -f "${local_tarball}" ]; then
+      echo -n "ERROR: Cannot download ${remote_tarball} with cURL or wget; please install manually and try again."
       exit 2
-    cd "${_DIR}" && tar -xzf "$2"
-    rm -rf "$local_tarball"
+    fi
+    # Checksum may not have been specified; don't check if doesn't exist
+    if [ -f "${local_checksum}" ]; then
+      echo "  ${local_tarball}" >> ${local_checksum} # two spaces + file are important!
+      # Assuming SHA512 here for now
+      echo "Verifying checksum from ${local_checksum}" 1>&2
+      if ! shasum -a 512 -q -c "${local_checksum}" ; then
+        echo "Bad checksum from ${remote_checksum}"
+        exit 2
+      fi
+    fi
+
+    cd "${_DIR}" && tar -xzf "${local_tarball}"
+    rm -rf "${local_tarball}"
+    rm -f "${local_checksum}"
   fi
 }
 
@@ -71,21 

[spark] branch branch-3.0 updated: [SPARK-35093][SQL] AQE now uses newQueryStage plan as key for looking up cached exchanges for re-use

2021-05-19 Thread tgraves
This is an automated email from the ASF dual-hosted git repository.

tgraves pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
 new 1606791  [SPARK-35093][SQL] AQE now uses newQueryStage plan as key for 
looking up cached exchanges for re-use
1606791 is described below

commit 1606791bf54d785eca1ae42fbd98e68a1d58e884
Author: Andy Grove 
AuthorDate: Wed May 19 07:45:26 2021 -0500

[SPARK-35093][SQL] AQE now uses newQueryStage plan as key for looking up 
cached exchanges for re-use

### What changes were proposed in this pull request?
AQE has an optimization where it attempts to reuse compatible exchanges but 
it does not take into account whether the exchanges are columnar or not, 
resulting in incorrect reuse under some circumstances.

This PR simply changes the key used to look up cached stages. It now uses 
the canonicalized form of the new query stage (potentially created by a plugin) 
rather than using the canonicalized form of the original exchange.

### Why are the changes needed?
When using the [RAPIDS Accelerator for Apache 
Spark](https://github.com/NVIDIA/spark-rapids) we sometimes see a new query 
stage correctly create a row-based exchange and then Spark replaces it with a 
cached columnar exchange, which is not compatible, and this causes queries to 
fail.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
The patch has been tested with the query that highlighted this issue. I 
looked at writing unit tests for this but it would involve implementing a mock 
columnar exchange in the tests so would be quite a bit of work. If anyone has 
ideas on other ways to test this I am happy to hear them.

Closes #32195 from andygrove/SPARK-35093.

Authored-by: Andy Grove 
Signed-off-by: Thomas Graves 
(cherry picked from commit 52e3cf9ff50b4209e29cb06df09b1ef3a18bc83b)
Signed-off-by: Thomas Graves 
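
The one-line diff below is easier to follow with the cache's shape in mind: a concurrent map from a canonicalized plan to the query stage built for it, where the fix keys the lookup on the stage actually created (possibly rewritten by a plugin) rather than on the original exchange. A toy sketch with stand-in types, not the real AdaptiveSparkPlanExec:

```scala
import scala.collection.concurrent.TrieMap

// Stand-ins for SparkPlan / QueryStageExec; only the caching shape matters.
case class Plan(desc: String) {
  // Stand-in for SparkPlan.canonicalized: strip cosmetic differences.
  def canonicalized: Plan = Plan(desc.toLowerCase)
}
case class QueryStage(plan: Plan)

val stageCache = TrieMap.empty[Plan, QueryStage]

// After the fix: key on the canonical form of the new stage's plan, so a
// columnar stage can no longer be swapped for a cached row-based one.
def cacheStage(newStage: QueryStage): QueryStage =
  stageCache.getOrElseUpdate(newStage.plan.canonicalized, newStage)
```
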
---
 .../apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala| 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
index 187827c..e7a8034 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
@@ -349,7 +349,8 @@ case class AdaptiveSparkPlanExec(
           // Check the `stageCache` again for reuse. If a match is found, ditch the new stage
           // and reuse the existing stage found in the `stageCache`, otherwise update the
           // `stageCache` with the new stage.
-          val queryStage = context.stageCache.getOrElseUpdate(e.canonicalized, newStage)
+          val queryStage = context.stageCache.getOrElseUpdate(
+            newStage.plan.canonicalized, newStage)
           if (queryStage.ne(newStage)) {
             newStage = reuseQueryStage(queryStage, e)
           }




[spark] branch branch-3.1 updated: [SPARK-35093][SQL] AQE now uses newQueryStage plan as key for looking up cached exchanges for re-use

2021-05-19 Thread tgraves
This is an automated email from the ASF dual-hosted git repository.

tgraves pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.1 by this push:
 new 73abceb  [SPARK-35093][SQL] AQE now uses newQueryStage plan as key for 
looking up cached exchanges for re-use
73abceb is described below

commit 73abceb05d64eafeb39866c69a84d0b7f3c1f097
Author: Andy Grove 
AuthorDate: Wed May 19 07:45:26 2021 -0500

[SPARK-35093][SQL] AQE now uses newQueryStage plan as key for looking up 
cached exchanges for re-use

### What changes were proposed in this pull request?
AQE has an optimization where it attempts to reuse compatible exchanges but 
it does not take into account whether the exchanges are columnar or not, 
resulting in incorrect reuse under some circumstances.

This PR simply changes the key used to look up cached stages. It now uses 
the canonicalized form of the new query stage (potentially created by a plugin) 
rather than using the canonicalized form of the original exchange.

### Why are the changes needed?
When using the [RAPIDS Accelerator for Apache 
Spark](https://github.com/NVIDIA/spark-rapids) we sometimes see a new query 
stage correctly create a row-based exchange and then Spark replaces it with a 
cached columnar exchange, which is not compatible, and this causes queries to 
fail.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
The patch has been tested with the query that highlighted this issue. I 
looked at writing unit tests for this but it would involve implementing a mock 
columnar exchange in the tests so would be quite a bit of work. If anyone has 
ideas on other ways to test this I am happy to hear them.

Closes #32195 from andygrove/SPARK-35093.

Authored-by: Andy Grove 
Signed-off-by: Thomas Graves 
(cherry picked from commit 52e3cf9ff50b4209e29cb06df09b1ef3a18bc83b)
Signed-off-by: Thomas Graves 
---
 .../apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala| 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
index 89d3b53..596c8b3 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
@@ -418,7 +418,8 @@ case class AdaptiveSparkPlanExec(
           // Check the `stageCache` again for reuse. If a match is found, ditch the new stage
           // and reuse the existing stage found in the `stageCache`, otherwise update the
           // `stageCache` with the new stage.
-          val queryStage = context.stageCache.getOrElseUpdate(e.canonicalized, newStage)
+          val queryStage = context.stageCache.getOrElseUpdate(
+            newStage.plan.canonicalized, newStage)
           if (queryStage.ne(newStage)) {
             newStage = reuseQueryStage(queryStage, e)
           }




[spark] branch master updated: [SPARK-35093][SQL] AQE now uses newQueryStage plan as key for looking up cached exchanges for re-use

2021-05-19 Thread tgraves
This is an automated email from the ASF dual-hosted git repository.

tgraves pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 52e3cf9  [SPARK-35093][SQL] AQE now uses newQueryStage plan as key for 
looking up cached exchanges for re-use
52e3cf9 is described below

commit 52e3cf9ff50b4209e29cb06df09b1ef3a18bc83b
Author: Andy Grove 
AuthorDate: Wed May 19 07:45:26 2021 -0500

[SPARK-35093][SQL] AQE now uses newQueryStage plan as key for looking up 
cached exchanges for re-use

### What changes were proposed in this pull request?
AQE has an optimization where it attempts to reuse compatible exchanges but 
it does not take into account whether the exchanges are columnar or not, 
resulting in incorrect reuse under some circumstances.

This PR simply changes the key used to look up cached stages. It now uses 
the canonicalized form of the new query stage (potentially created by a plugin) 
rather than using the canonicalized form of the original exchange.

### Why are the changes needed?
When using the [RAPIDS Accelerator for Apache 
Spark](https://github.com/NVIDIA/spark-rapids) we sometimes see a new query 
stage correctly create a row-based exchange and then Spark replaces it with a 
cached columnar exchange, which is not compatible, and this causes queries to 
fail.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
The patch has been tested with the query that highlighted this issue. I 
looked at writing unit tests for this but it would involve implementing a mock 
columnar exchange in the tests so would be quite a bit of work. If anyone has 
ideas on other ways to test this I am happy to hear them.

Closes #32195 from andygrove/SPARK-35093.

Authored-by: Andy Grove 
Signed-off-by: Thomas Graves 
---
 .../apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala| 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
index 256aacb..766788a 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
@@ -435,7 +435,8 @@ case class AdaptiveSparkPlanExec(
           // Check the `stageCache` again for reuse. If a match is found, ditch the new stage
           // and reuse the existing stage found in the `stageCache`, otherwise update the
           // `stageCache` with the new stage.
-          val queryStage = context.stageCache.getOrElseUpdate(e.canonicalized, newStage)
+          val queryStage = context.stageCache.getOrElseUpdate(
+            newStage.plan.canonicalized, newStage)
           if (queryStage.ne(newStage)) {
             newStage = reuseQueryStage(queryStage, e)
           }




[spark] branch master updated (9283beb -> 1214213)

2021-05-19 Thread yamamuro
This is an automated email from the ASF dual-hosted git repository.

yamamuro pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 9283beb  [SPARK-35418][SQL] Add sentences function to 
functions.{scala,py}
 add 1214213  [SPARK-35362][SQL] Update null count in the column stats for 
UNION operator stats estimation

No new revisions were added by this update.

Summary of changes:
 .../logical/statsEstimation/FilterEstimation.scala |  2 +-
 .../logical/statsEstimation/UnionEstimation.scala  | 97 ++
 .../BasicStatsEstimationSuite.scala|  2 +-
 .../statsEstimation/UnionEstimationSuite.scala | 65 +--
 4 files changed, 122 insertions(+), 44 deletions(-)
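
The entry carries only a diffstat; presumably a UNION's per-column null count is the sum of the children's null counts when all of them are known. A toy sketch of that combination rule, an assumption rather than the committed code:

```scala
// Combine per-child null counts for one column of a UNION; any unknown
// child makes the union's null count unknown as well.
def unionNullCount(childNullCounts: Seq[Option[BigInt]]): Option[BigInt] =
  if (childNullCounts.forall(_.isDefined)) Some(childNullCounts.flatten.sum)
  else None

assert(unionNullCount(Seq(Some(BigInt(3)), Some(BigInt(5)))) == Some(BigInt(8)))
assert(unionNullCount(Seq(Some(BigInt(3)), None)).isEmpty)
```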




[spark] branch master updated (46f7d78 -> 9283beb)

2021-05-19 Thread sarutak
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 46f7d78  [SPARK-35368][SQL] Update histogram statistics for RANGE 
operator for stats estimation
 add 9283beb  [SPARK-35418][SQL] Add sentences function to 
functions.{scala,py}

No new revisions were added by this update.

Summary of changes:
 python/docs/source/reference/pyspark.sql.rst   |  1 +
 python/pyspark/sql/functions.py| 39 ++
 python/pyspark/sql/functions.pyi   |  5 +++
 .../scala/org/apache/spark/sql/functions.scala | 19 +++
 .../apache/spark/sql/StringFunctionsSuite.scala|  7 
 5 files changed, 71 insertions(+)
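
A usage sketch for the new Scala helper; the sample text and expected output follow the documented behavior of the underlying SQL `sentences` expression, and the exact `functions.scala` signature is inferred from this diffstat:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, sentences}

val spark = SparkSession.builder().master("local[1]").appName("demo").getOrCreate()
import spark.implicits._

// sentences() splits text into an array of sentences, each an array of words;
// the language/country columns select the locale used for tokenization.
val df = Seq("Hi there! Good morning.").toDF("text")
df.select(sentences(col("text"), lit("en"), lit("US"))).show(false)
// [[Hi, there], [Good, morning]]
```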




[spark] branch master updated (a72d05c -> 46f7d78)

2021-05-19 Thread yamamuro
This is an automated email from the ASF dual-hosted git repository.

yamamuro pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from a72d05c  [SPARK-35106][CORE][SQL] Avoid failing rename caused by 
destination directory not exist
 add 46f7d78  [SPARK-35368][SQL] Update histogram statistics for RANGE 
operator for stats estimation

No new revisions were added by this update.

Summary of changes:
 .../plans/logical/basicLogicalOperators.scala  | 43 +++-
 .../BasicStatsEstimationSuite.scala| 81 ++
 2 files changed, 108 insertions(+), 16 deletions(-)




[spark] branch branch-3.0 updated: [SPARK-35106][CORE][SQL] Avoid failing rename caused by destination directory not exist

2021-05-19 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
 new 0f78d8b  [SPARK-35106][CORE][SQL] Avoid failing rename caused by 
destination directory not exist
0f78d8b is described below

commit 0f78d8bd211e2dcf946e7fa658bce600843bad1b
Author: Yuzhou Sun 
AuthorDate: Wed May 19 15:46:27 2021 +0800

[SPARK-35106][CORE][SQL] Avoid failing rename caused by destination 
directory not exist

### What changes were proposed in this pull request?

1. In HadoopMapReduceCommitProtocol, create parent directory before 
renaming custom partition path staging files
2. In InMemoryCatalog and HiveExternalCatalog, create new partition 
directory before renaming old partition path
3. Check the return value of FileSystem#rename and, if false, throw an
exception to avoid silent data loss caused by rename failure
4. Change DebugFilesystem#rename behavior to make it match HDFS's behavior 
(return false without rename when dst parent directory not exist)

### Why are the changes needed?

Depending on the FileSystem#rename implementation, when the destination
directory does not exist, the file system may
1. return false without renaming file nor throwing exception (e.g. HDFS), or
2. create destination directory, rename files, and return true (e.g. 
LocalFileSystem)

In the first case above, renames in HadoopMapReduceCommitProtocol for 
custom partition path will fail silently if the destination partition path does 
not exist. Failed renames can happen when
1. dynamicPartitionOverwrite == true, the custom partition path directories 
are deleted by the job before the rename; or
2. the custom partition path directories do not exist before the job; or
3. something else goes wrong when the file system handles `rename`

The partition renames in InMemoryCatalog and HiveExternalCatalog have a
similar issue.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Modified DebugFilesystem#rename, and added new unit tests.

Without the fix in src code, five InsertSuite tests and one 
AlterTableRenamePartitionSuite test failed:
InsertSuite.SPARK-20236: dynamic partition overwrite with custom partition 
path (existing test with modified FS)
```
== Results ==
!== Correct Answer - 1 ==   == Spark Answer - 0 ==
struct<>   struct<>
![2,1,1]
```

InsertSuite.SPARK-35106: insert overwrite with custom partition path
```
== Results ==
!== Correct Answer - 1 ==   == Spark Answer - 0 ==
struct<>   struct<>
![2,1,1]
```

InsertSuite.SPARK-35106: dynamic partition overwrite with custom partition 
path
```
== Results ==
!== Correct Answer - 2 ==   == Spark Answer - 1 ==
!struct<>   struct
 [1,1,1][1,1,1]
![1,1,2]
```

InsertSuite.SPARK-35106: Throw exception when rename custom partition paths 
returns false
```
Expected exception org.apache.spark.SparkException to be thrown, but no 
exception was thrown
```

InsertSuite.SPARK-35106: Throw exception when rename dynamic partition 
paths returns false
```
Expected exception org.apache.spark.SparkException to be thrown, but no 
exception was thrown
```

AlterTableRenamePartitionSuite.ALTER TABLE .. RENAME PARTITION V1: multi 
part partition (existing test with modified FS)
```
== Results ==
!== Correct Answer - 1 ==   == Spark Answer - 0 ==
 struct<>   struct<>
![3,123,3]
```

Closes #32530 from YuzhouSun/SPARK-35106.

Authored-by: Yuzhou Sun 
Signed-off-by: Wenchen Fan 
(cherry picked from commit a72d05c7e632fbb0d8a6082c3cacdf61f36518b4)
Signed-off-by: Wenchen Fan 
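
Steps 1-3 above amount to a defensive rename: create the destination's parent first, then treat a `false` return as an error instead of silently dropping data. A sketch against the Hadoop FileSystem API; the helper itself is illustrative, not the committed code:

```scala
import java.io.IOException
import org.apache.hadoop.fs.{FileSystem, Path}

// HDFS's rename returns false, without renaming or throwing, when the
// destination's parent does not exist, so create it and check both results.
def renameOrThrow(fs: FileSystem, src: Path, dst: Path): Unit = {
  val parent = dst.getParent
  if (parent != null && !fs.exists(parent) && !fs.mkdirs(parent)) {
    throw new IOException(s"Failed to create directory $parent")
  }
  if (!fs.rename(src, dst)) {
    throw new IOException(s"Failed to rename $src to $dst")
  }
}
```
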
---
 .../io/HadoopMapReduceCommitProtocol.scala |  19 +++-
 .../scala/org/apache/spark/DebugFilesystem.scala   |  14 ++-
 .../sql/catalyst/catalog/InMemoryCatalog.scala |   6 +-
 .../org/apache/spark/sql/sources/InsertSuite.scala | 116 -
 .../spark/sql/hive/HiveExternalCatalog.scala   |   6 +-
 5 files changed, 151 insertions(+), 10 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala b/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
index 11ce608..a03a999 100644
--- a/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
+++ b/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
@@ -173,13 +173,18 @@ class HadoopMapReduceCommitProtocol(
 
   val filesToMove = allAbsPathFiles.foldLeft(Map[String, String]())(_ ++ _)
   

[spark] branch branch-3.1 updated: [SPARK-35106][CORE][SQL] Avoid failing rename caused by destination directory not exist

2021-05-19 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.1 by this push:
 new 484db68  [SPARK-35106][CORE][SQL] Avoid failing rename caused by 
destination directory not exist
484db68 is described below

commit 484db68d6901a98d4dd82411def2534f12c14f29
Author: Yuzhou Sun 
AuthorDate: Wed May 19 15:46:27 2021 +0800

[SPARK-35106][CORE][SQL] Avoid failing rename caused by destination 
directory not exist

### What changes were proposed in this pull request?

1. In HadoopMapReduceCommitProtocol, create parent directory before 
renaming custom partition path staging files
2. In InMemoryCatalog and HiveExternalCatalog, create new partition 
directory before renaming old partition path
3. Check the return value of FileSystem#rename and, if false, throw an
exception to avoid silent data loss caused by rename failure
4. Change DebugFilesystem#rename behavior to make it match HDFS's behavior 
(return false without rename when dst parent directory not exist)

### Why are the changes needed?

Depending on the FileSystem#rename implementation, when the destination
directory does not exist, the file system may
1. return false without renaming file nor throwing exception (e.g. HDFS), or
2. create destination directory, rename files, and return true (e.g. 
LocalFileSystem)

In the first case above, renames in HadoopMapReduceCommitProtocol for 
custom partition path will fail silently if the destination partition path does 
not exist. Failed renames can happen when
1. dynamicPartitionOverwrite == true, the custom partition path directories 
are deleted by the job before the rename; or
2. the custom partition path directories do not exist before the job; or
3. something else goes wrong when the file system handles `rename`

The partition renames in InMemoryCatalog and HiveExternalCatalog have a
similar issue.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Modified DebugFilesystem#rename, and added new unit tests.

Without the fix in src code, five InsertSuite tests and one 
AlterTableRenamePartitionSuite test failed:
InsertSuite.SPARK-20236: dynamic partition overwrite with custom partition 
path (existing test with modified FS)
```
== Results ==
!== Correct Answer - 1 ==   == Spark Answer - 0 ==
struct<>   struct<>
![2,1,1]
```

InsertSuite.SPARK-35106: insert overwrite with custom partition path
```
== Results ==
!== Correct Answer - 1 ==   == Spark Answer - 0 ==
struct<>   struct<>
![2,1,1]
```

InsertSuite.SPARK-35106: dynamic partition overwrite with custom partition 
path
```
== Results ==
!== Correct Answer - 2 ==   == Spark Answer - 1 ==
!struct<>   struct
 [1,1,1][1,1,1]
![1,1,2]
```

InsertSuite.SPARK-35106: Throw exception when rename custom partition paths 
returns false
```
Expected exception org.apache.spark.SparkException to be thrown, but no 
exception was thrown
```

InsertSuite.SPARK-35106: Throw exception when rename dynamic partition 
paths returns false
```
Expected exception org.apache.spark.SparkException to be thrown, but no 
exception was thrown
```

AlterTableRenamePartitionSuite.ALTER TABLE .. RENAME PARTITION V1: multi 
part partition (existing test with modified FS)
```
== Results ==
!== Correct Answer - 1 ==   == Spark Answer - 0 ==
 struct<>   struct<>
![3,123,3]
```

Closes #32530 from YuzhouSun/SPARK-35106.

Authored-by: Yuzhou Sun 
Signed-off-by: Wenchen Fan 
(cherry picked from commit a72d05c7e632fbb0d8a6082c3cacdf61f36518b4)
Signed-off-by: Wenchen Fan 
---
 .../io/HadoopMapReduceCommitProtocol.scala |  19 +++-
 .../scala/org/apache/spark/DebugFilesystem.scala   |  14 ++-
 .../sql/catalyst/catalog/InMemoryCatalog.scala |   6 +-
 .../org/apache/spark/sql/sources/InsertSuite.scala | 116 -
 .../spark/sql/hive/HiveExternalCatalog.scala   |   6 +-
 5 files changed, 151 insertions(+), 10 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala b/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
index 30f9a65..c061d61 100644
--- a/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
+++ b/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
@@ -188,13 +188,18 @@ class HadoopMapReduceCommitProtocol(
 
   val filesToMove = allAbsPathFiles.foldLeft(Map[String, String]())(_ ++ _)
   

[spark] branch master updated (0b3758e -> a72d05c)

2021-05-19 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 0b3758e  [SPARK-35421][SS] Remove redundant ProjectExec from streaming 
queries with V2Relation
 add a72d05c  [SPARK-35106][CORE][SQL] Avoid failing rename caused by 
destination directory not exist

No new revisions were added by this update.

Summary of changes:
 .../io/HadoopMapReduceCommitProtocol.scala |  19 +++-
 .../scala/org/apache/spark/DebugFilesystem.scala   |  14 ++-
 .../sql/catalyst/catalog/InMemoryCatalog.scala |   6 +-
 .../org/apache/spark/sql/sources/InsertSuite.scala | 116 -
 .../spark/sql/hive/HiveExternalCatalog.scala   |   6 +-
 5 files changed, 151 insertions(+), 10 deletions(-)




[spark] branch master updated (b1493d8 -> 0b3758e)

2021-05-19 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from b1493d8  [SPARK-35398][SQL] Simplify the way to get classes from 
ClassBodyEvaluator in `CodeGenerator.updateAndGetCompilationStats` method
 add 0b3758e  [SPARK-35421][SS] Remove redundant ProjectExec from streaming 
queries with V2Relation

No new revisions were added by this update.

Summary of changes:
 .../datasources/v2/DataSourceV2Strategy.scala  | 26 --
 1 file changed, 9 insertions(+), 17 deletions(-)
