[spark] branch branch-3.2 updated: [SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.5

2022-03-17 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 864b89b  [SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.5
864b89b is described below

commit 864b89bbea8c2620938b28661fc8bea6271c7048
Author: Hyukjin Kwon 
AuthorDate: Fri Mar 18 14:00:48 2022 +0900

[SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.5

### What changes were proposed in this pull request?

This PR is a retry of https://github.com/apache/spark/pull/35871, bumping the version up to 0.10.9.5.
It was reverted because it broke Python 3.10, and Python 3.10 was not officially supported in Py4J.

In Py4J 0.10.9.5, the issue was fixed (https://github.com/py4j/py4j/pull/475), and Python 3.10 support was officially added with CI set up (https://github.com/py4j/py4j/pull/477).

### Why are the changes needed?

See https://github.com/apache/spark/pull/35871

### Does this PR introduce _any_ user-facing change?

See https://github.com/apache/spark/pull/35871

### How was this patch tested?

Py4J sets up Python 3.10 CI now, and I manually tested PySpark on Python 3.10 with this patch:

```bash
./bin/pyspark
```

```
import py4j
py4j.__version__
spark.range(10).show()
```

```
Using Python version 3.10.0 (default, Mar  3 2022 03:57:21)
Spark context Web UI available at http://172.30.5.50:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1647571387534).
SparkSession available as 'spark'.
>>> import py4j
>>> py4j.__version__
'0.10.9.5'
>>> spark.range(10).show()
+---+
| id|
+---+
...
```
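
For reference, a minimal sketch (not part of this patch, and assuming `SPARK_HOME` points at a Spark checkout or distribution) of how a plain Python script could pick up the bundled Py4J source zip, mirroring the `PYTHONPATH` change to `bin/pyspark` below:

```python
# Minimal sketch (not part of this patch): put the bundled Py4J source zip on
# sys.path, mirroring the PYTHONPATH export done by bin/pyspark.
import glob
import os
import sys

spark_home = os.environ.get("SPARK_HOME", ".")
pattern = os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")
for zip_path in sorted(glob.glob(pattern)):
    if zip_path not in sys.path:
        sys.path.insert(0, zip_path)

import py4j  # resolves to the bundled zip (0.10.9.5 after this upgrade) if found
print(py4j.__version__)
```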

Closes #35907 from HyukjinKwon/SPARK-38563-followup.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 97335ea037a9a036c013c86ef62d74ca638f808e)
Signed-off-by: Hyukjin Kwon 
---
 bin/pyspark|   2 +-
 bin/pyspark2.cmd   |   2 +-
 core/pom.xml   |   2 +-
 .../org/apache/spark/api/python/PythonUtils.scala  |   2 +-
 dev/deps/spark-deps-hadoop-2.7-hive-2.3|   2 +-
 dev/deps/spark-deps-hadoop-3.2-hive-2.3|   2 +-
 docs/job-scheduling.md |   2 +-
 python/docs/Makefile   |   2 +-
 python/docs/make2.bat  |   2 +-
 python/docs/source/getting_started/install.rst |   2 +-
 python/lib/py4j-0.10.9.3-src.zip   | Bin 42021 -> 0 bytes
 python/lib/py4j-0.10.9.5-src.zip   | Bin 0 -> 42404 bytes
 python/pyspark/context.py  |   6 ++--
 python/pyspark/util.py |  32 +++--
 python/setup.py|   2 +-
 sbin/spark-config.sh   |   2 +-
 16 files changed, 19 insertions(+), 43 deletions(-)

diff --git a/bin/pyspark b/bin/pyspark
index 4840589..21a514e 100755
--- a/bin/pyspark
+++ b/bin/pyspark
@@ -50,7 +50,7 @@ export PYSPARK_DRIVER_PYTHON_OPTS
 
 # Add the PySpark classes to the Python path:
 export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH"
-export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.3-src.zip:$PYTHONPATH"
+export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.5-src.zip:$PYTHONPATH"
 
 # Load the PySpark shell.py script when ./pyspark is used interactively:
 export OLD_PYTHONSTARTUP="$PYTHONSTARTUP"
diff --git a/bin/pyspark2.cmd b/bin/pyspark2.cmd
index a19627a..eec02a4 100644
--- a/bin/pyspark2.cmd
+++ b/bin/pyspark2.cmd
@@ -30,7 +30,7 @@ if "x%PYSPARK_DRIVER_PYTHON%"=="x" (
 )
 
 set PYTHONPATH=%SPARK_HOME%\python;%PYTHONPATH%
-set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.9.3-src.zip;%PYTHONPATH%
+set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.9.5-src.zip;%PYTHONPATH%
 
 set OLD_PYTHONSTARTUP=%PYTHONSTARTUP%
 set PYTHONSTARTUP=%SPARK_HOME%\python\pyspark\shell.py
diff --git a/core/pom.xml b/core/pom.xml
index 3833794..4926664 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -433,7 +433,7 @@
 
   net.sf.py4j
   py4j
-  0.10.9.3
+  0.10.9.5
 
 
   org.apache.spark
diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala 
b/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala
index 8daba86..6336171 100644
--- a/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala
+++ b/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala
@@ -27,7 +27,7 @@ import org.apache.spark.SparkContext
 import org.apache.spark.api.java.{JavaRDD, JavaSparkContext}
 
 private[spark] object PythonUtils {
-  val PY4J_ZIP_NAME = "py4j-0.10.9.3-src.zip"
+  val PY4J_ZIP_NAME = "py4j-0.10.9.5-src.zip"
 
   /** Get the PYTHONPATH for PySpark, either from SPARK_HOME, if it is set, or 
from our JAR */
   def sparkPythonPath: String = {
diff --git 

[spark] branch branch-3.3 updated: [SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.5

2022-03-17 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 7bb1d6f  [SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.5
7bb1d6f is described below

commit 7bb1d6f01148b037acad12de8166cf742cd30ea3
Author: Hyukjin Kwon 
AuthorDate: Fri Mar 18 14:00:48 2022 +0900

[SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.5

### What changes were proposed in this pull request?

This PR is a retry of https://github.com/apache/spark/pull/35871, bumping the version up to 0.10.9.5.
It was reverted because it broke Python 3.10, and Python 3.10 was not officially supported in Py4J.

In Py4J 0.10.9.5, the issue was fixed (https://github.com/py4j/py4j/pull/475), and Python 3.10 support was officially added with CI set up (https://github.com/py4j/py4j/pull/477).

### Why are the changes needed?

See https://github.com/apache/spark/pull/35871

### Does this PR introduce _any_ user-facing change?

See https://github.com/apache/spark/pull/35871

### How was this patch tested?

Py4J sets up Python 3.10 CI now, and I manually tested PySpark on Python 3.10 with this patch:

```bash
./bin/pyspark
```

```
import py4j
py4j.__version__
spark.range(10).show()
```

```
Using Python version 3.10.0 (default, Mar  3 2022 03:57:21)
Spark context Web UI available at http://172.30.5.50:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1647571387534).
SparkSession available as 'spark'.
>>> import py4j
>>> py4j.__version__
'0.10.9.5'
>>> spark.range(10).show()
+---+
| id|
+---+
...
```
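
As an additional, hypothetical smoke test in the same `./bin/pyspark` shell (not part of this patch), the Py4J bridge can also be exercised directly through the SparkContext's internal `_jvm` gateway; `_jvm` is an internal attribute, shown here only for illustration:

```python
# Round-trip a couple of calls through the Py4J gateway from the shell's
# built-in `spark` session; a failure here would point at the bridge itself.
jvm = spark.sparkContext._jvm
print(jvm.java.lang.System.getProperty("java.version"))
print(jvm.java.lang.Math.max(1, 2))
```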

Closes #35907 from HyukjinKwon/SPARK-38563-followup.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 97335ea037a9a036c013c86ef62d74ca638f808e)
Signed-off-by: Hyukjin Kwon 
---
 bin/pyspark|   2 +-
 bin/pyspark2.cmd   |   2 +-
 core/pom.xml   |   2 +-
 .../org/apache/spark/api/python/PythonUtils.scala  |   2 +-
 dev/deps/spark-deps-hadoop-2-hive-2.3  |   2 +-
 dev/deps/spark-deps-hadoop-3-hive-2.3  |   2 +-
 docs/job-scheduling.md |   2 +-
 python/docs/Makefile   |   2 +-
 python/docs/make2.bat  |   2 +-
 python/docs/source/getting_started/install.rst |   2 +-
 python/lib/py4j-0.10.9.3-src.zip   | Bin 42021 -> 0 bytes
 python/lib/py4j-0.10.9.5-src.zip   | Bin 0 -> 42404 bytes
 python/pyspark/context.py  |   6 ++--
 python/pyspark/util.py |  35 +++--
 python/setup.py|   2 +-
 sbin/spark-config.sh   |   2 +-
 16 files changed, 20 insertions(+), 45 deletions(-)

diff --git a/bin/pyspark b/bin/pyspark
index 4840589..21a514e 100755
--- a/bin/pyspark
+++ b/bin/pyspark
@@ -50,7 +50,7 @@ export PYSPARK_DRIVER_PYTHON_OPTS
 
 # Add the PySpark classes to the Python path:
 export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH"
-export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.3-src.zip:$PYTHONPATH"
+export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.5-src.zip:$PYTHONPATH"
 
 # Load the PySpark shell.py script when ./pyspark is used interactively:
 export OLD_PYTHONSTARTUP="$PYTHONSTARTUP"
diff --git a/bin/pyspark2.cmd b/bin/pyspark2.cmd
index a19627a..eec02a4 100644
--- a/bin/pyspark2.cmd
+++ b/bin/pyspark2.cmd
@@ -30,7 +30,7 @@ if "x%PYSPARK_DRIVER_PYTHON%"=="x" (
 )
 
 set PYTHONPATH=%SPARK_HOME%\python;%PYTHONPATH%
-set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.9.3-src.zip;%PYTHONPATH%
+set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.9.5-src.zip;%PYTHONPATH%
 
 set OLD_PYTHONSTARTUP=%PYTHONSTARTUP%
 set PYTHONSTARTUP=%SPARK_HOME%\python\pyspark\shell.py
diff --git a/core/pom.xml b/core/pom.xml
index 9d3b170..a753a59 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -423,7 +423,7 @@
 
   net.sf.py4j
   py4j
-  0.10.9.3
+  0.10.9.5
 
 
   org.apache.spark
diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala 
b/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala
index 8daba86..6336171 100644
--- a/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala
+++ b/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala
@@ -27,7 +27,7 @@ import org.apache.spark.SparkContext
 import org.apache.spark.api.java.{JavaRDD, JavaSparkContext}
 
 private[spark] object PythonUtils {
-  val PY4J_ZIP_NAME = "py4j-0.10.9.3-src.zip"
+  val 

[spark] branch master updated (f36a5fb -> 97335ea)

2022-03-17 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from f36a5fb  [SPARK-38194] Followup: Fix k8s memory overhead passing to 
executor pods
 add 97335ea  [SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.5

No new revisions were added by this update.

Summary of changes:
 bin/pyspark|   2 +-
 bin/pyspark2.cmd   |   2 +-
 core/pom.xml   |   2 +-
 .../org/apache/spark/api/python/PythonUtils.scala  |   2 +-
 dev/deps/spark-deps-hadoop-2-hive-2.3  |   2 +-
 dev/deps/spark-deps-hadoop-3-hive-2.3  |   2 +-
 docs/job-scheduling.md |   2 +-
 python/docs/Makefile   |   2 +-
 python/docs/make2.bat  |   2 +-
 python/docs/source/getting_started/install.rst |   2 +-
 python/lib/py4j-0.10.9.3-src.zip   | Bin 42021 -> 0 bytes
 python/lib/py4j-0.10.9.5-src.zip   | Bin 0 -> 42404 bytes
 python/pyspark/context.py  |   6 ++--
 python/pyspark/util.py |  35 +++--
 python/setup.py|   2 +-
 sbin/spark-config.sh   |   2 +-
 16 files changed, 20 insertions(+), 45 deletions(-)
 delete mode 100644 python/lib/py4j-0.10.9.3-src.zip
 create mode 100644 python/lib/py4j-0.10.9.5-src.zip

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.2 updated: Revert "[SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.4"

2022-03-17 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 8b73b72  Revert "[SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.4"
8b73b72 is described below

commit 8b73b72770996ae4a81f092bddb01f1c26346efd
Author: Hyukjin Kwon 
AuthorDate: Fri Mar 18 10:22:55 2022 +0900

Revert "[SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.4"

This reverts commit 69903200845b68a0474ecb0a3317dc744490c521.
---
 bin/pyspark|   2 +-
 bin/pyspark2.cmd   |   2 +-
 core/pom.xml   |   2 +-
 .../org/apache/spark/api/python/PythonUtils.scala  |   2 +-
 dev/deps/spark-deps-hadoop-2.7-hive-2.3|   2 +-
 dev/deps/spark-deps-hadoop-3.2-hive-2.3|   2 +-
 docs/job-scheduling.md |   2 +-
 python/docs/Makefile   |   2 +-
 python/docs/make2.bat  |   2 +-
 python/docs/source/getting_started/install.rst |   2 +-
 python/lib/py4j-0.10.9.3-src.zip   | Bin 0 -> 42021 bytes
 python/lib/py4j-0.10.9.4-src.zip   | Bin 42404 -> 0 bytes
 python/pyspark/context.py  |   6 ++--
 python/pyspark/util.py |  33 +
 python/setup.py|   2 +-
 sbin/spark-config.sh   |   2 +-
 16 files changed, 43 insertions(+), 20 deletions(-)

diff --git a/bin/pyspark b/bin/pyspark
index 1e16c56..4840589 100755
--- a/bin/pyspark
+++ b/bin/pyspark
@@ -50,7 +50,7 @@ export PYSPARK_DRIVER_PYTHON_OPTS
 
 # Add the PySpark classes to the Python path:
 export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH"
-export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.4-src.zip:$PYTHONPATH"
+export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.3-src.zip:$PYTHONPATH"
 
 # Load the PySpark shell.py script when ./pyspark is used interactively:
 export OLD_PYTHONSTARTUP="$PYTHONSTARTUP"
diff --git a/bin/pyspark2.cmd b/bin/pyspark2.cmd
index f20c320..a19627a 100644
--- a/bin/pyspark2.cmd
+++ b/bin/pyspark2.cmd
@@ -30,7 +30,7 @@ if "x%PYSPARK_DRIVER_PYTHON%"=="x" (
 )
 
 set PYTHONPATH=%SPARK_HOME%\python;%PYTHONPATH%
-set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.9.4-src.zip;%PYTHONPATH%
+set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.9.3-src.zip;%PYTHONPATH%
 
 set OLD_PYTHONSTARTUP=%PYTHONSTARTUP%
 set PYTHONSTARTUP=%SPARK_HOME%\python\pyspark\shell.py
diff --git a/core/pom.xml b/core/pom.xml
index 94b3e58..3833794 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -433,7 +433,7 @@
 
   net.sf.py4j
   py4j
-  0.10.9.4
+  0.10.9.3
 
 
   org.apache.spark
diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala 
b/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala
index a9c35369..8daba86 100644
--- a/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala
+++ b/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala
@@ -27,7 +27,7 @@ import org.apache.spark.SparkContext
 import org.apache.spark.api.java.{JavaRDD, JavaSparkContext}
 
 private[spark] object PythonUtils {
-  val PY4J_ZIP_NAME = "py4j-0.10.9.4-src.zip"
+  val PY4J_ZIP_NAME = "py4j-0.10.9.3-src.zip"
 
   /** Get the PYTHONPATH for PySpark, either from SPARK_HOME, if it is set, or 
from our JAR */
   def sparkPythonPath: String = {
diff --git a/dev/deps/spark-deps-hadoop-2.7-hive-2.3 
b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
index 742710e..c2882bd 100644
--- a/dev/deps/spark-deps-hadoop-2.7-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
@@ -208,7 +208,7 @@ 
parquet-format-structures/1.12.2//parquet-format-structures-1.12.2.jar
 parquet-hadoop/1.12.2//parquet-hadoop-1.12.2.jar
 parquet-jackson/1.12.2//parquet-jackson-1.12.2.jar
 protobuf-java/2.5.0//protobuf-java-2.5.0.jar
-py4j/0.10.9.4//py4j-0.10.9.4.jar
+py4j/0.10.9.3//py4j-0.10.9.3.jar
 pyrolite/4.30//pyrolite-4.30.jar
 rocksdbjni/6.20.3//rocksdbjni-6.20.3.jar
 scala-collection-compat_2.12/2.1.1//scala-collection-compat_2.12-2.1.1.jar
diff --git a/dev/deps/spark-deps-hadoop-3.2-hive-2.3 
b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
index ce0dc17..be4c7b8 100644
--- a/dev/deps/spark-deps-hadoop-3.2-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
@@ -179,7 +179,7 @@ 
parquet-format-structures/1.12.2//parquet-format-structures-1.12.2.jar
 parquet-hadoop/1.12.2//parquet-hadoop-1.12.2.jar
 parquet-jackson/1.12.2//parquet-jackson-1.12.2.jar
 protobuf-java/2.5.0//protobuf-java-2.5.0.jar
-py4j/0.10.9.4//py4j-0.10.9.4.jar
+py4j/0.10.9.3//py4j-0.10.9.3.jar
 pyrolite/4.30//pyrolite-4.30.jar
 rocksdbjni/6.20.3//rocksdbjni-6.20.3.jar
 

[spark] branch branch-3.3 updated: Revert "[SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.4"

2022-03-17 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new ef8773a  Revert "[SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.4"
ef8773a is described below

commit ef8773ad5e90738a00408de87d2fe8566dc4acdc
Author: Dongjoon Hyun 
AuthorDate: Thu Mar 17 17:58:40 2022 -0700

Revert "[SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.4"

### What changes were proposed in this pull request?

This reverts commit 3bbf346d9ca984faa0c3e67cd1387a13b2bd1e37 from 
branch-3.3 to recover Apache Spark 3.3 on Python 3.10.

### Why are the changes needed?

Py4J 0.10.9.4 has a regression that breaks Python 3.10 support.

### Does this PR introduce _any_ user-facing change?

No. This is not released yet.

### How was this patch tested?

Python UT with Python 3.10.

Closes #35904 from dongjoon-hyun/SPARK-38563-3.3.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 bin/pyspark|   2 +-
 bin/pyspark2.cmd   |   2 +-
 core/pom.xml   |   2 +-
 .../org/apache/spark/api/python/PythonUtils.scala  |   2 +-
 dev/deps/spark-deps-hadoop-2-hive-2.3  |   2 +-
 dev/deps/spark-deps-hadoop-3-hive-2.3  |   2 +-
 docs/job-scheduling.md |   2 +-
 python/docs/Makefile   |   2 +-
 python/docs/make2.bat  |   2 +-
 python/docs/source/getting_started/install.rst |   2 +-
 python/lib/py4j-0.10.9.3-src.zip   | Bin 0 -> 42021 bytes
 python/lib/py4j-0.10.9.4-src.zip   | Bin 42404 -> 0 bytes
 python/pyspark/context.py  |   6 ++--
 python/pyspark/util.py |  35 ++---
 python/setup.py|   2 +-
 sbin/spark-config.sh   |   2 +-
 16 files changed, 45 insertions(+), 20 deletions(-)

diff --git a/bin/pyspark b/bin/pyspark
index 1e16c56..4840589 100755
--- a/bin/pyspark
+++ b/bin/pyspark
@@ -50,7 +50,7 @@ export PYSPARK_DRIVER_PYTHON_OPTS
 
 # Add the PySpark classes to the Python path:
 export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH"
-export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.4-src.zip:$PYTHONPATH"
+export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.3-src.zip:$PYTHONPATH"
 
 # Load the PySpark shell.py script when ./pyspark is used interactively:
 export OLD_PYTHONSTARTUP="$PYTHONSTARTUP"
diff --git a/bin/pyspark2.cmd b/bin/pyspark2.cmd
index f20c320..a19627a 100644
--- a/bin/pyspark2.cmd
+++ b/bin/pyspark2.cmd
@@ -30,7 +30,7 @@ if "x%PYSPARK_DRIVER_PYTHON%"=="x" (
 )
 
 set PYTHONPATH=%SPARK_HOME%\python;%PYTHONPATH%
-set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.9.4-src.zip;%PYTHONPATH%
+set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.9.3-src.zip;%PYTHONPATH%
 
 set OLD_PYTHONSTARTUP=%PYTHONSTARTUP%
 set PYTHONSTARTUP=%SPARK_HOME%\python\pyspark\shell.py
diff --git a/core/pom.xml b/core/pom.xml
index 953c76b..9d3b170 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -423,7 +423,7 @@
 
   net.sf.py4j
   py4j
-  0.10.9.4
+  0.10.9.3
 
 
   org.apache.spark
diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala 
b/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala
index a9c35369..8daba86 100644
--- a/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala
+++ b/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala
@@ -27,7 +27,7 @@ import org.apache.spark.SparkContext
 import org.apache.spark.api.java.{JavaRDD, JavaSparkContext}
 
 private[spark] object PythonUtils {
-  val PY4J_ZIP_NAME = "py4j-0.10.9.4-src.zip"
+  val PY4J_ZIP_NAME = "py4j-0.10.9.3-src.zip"
 
   /** Get the PYTHONPATH for PySpark, either from SPARK_HOME, if it is set, or 
from our JAR */
   def sparkPythonPath: String = {
diff --git a/dev/deps/spark-deps-hadoop-2-hive-2.3 
b/dev/deps/spark-deps-hadoop-2-hive-2.3
index f2db663..bcbf8b9 100644
--- a/dev/deps/spark-deps-hadoop-2-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-2-hive-2.3
@@ -233,7 +233,7 @@ parquet-hadoop/1.12.2//parquet-hadoop-1.12.2.jar
 parquet-jackson/1.12.2//parquet-jackson-1.12.2.jar
 pickle/1.2//pickle-1.2.jar
 protobuf-java/2.5.0//protobuf-java-2.5.0.jar
-py4j/0.10.9.4//py4j-0.10.9.4.jar
+py4j/0.10.9.3//py4j-0.10.9.3.jar
 remotetea-oncrpc/1.1.2//remotetea-oncrpc-1.1.2.jar
 rocksdbjni/6.20.3//rocksdbjni-6.20.3.jar
 scala-collection-compat_2.12/2.1.1//scala-collection-compat_2.12-2.1.1.jar
diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 
b/dev/deps/spark-deps-hadoop-3-hive-2.3
index c56b4c9..8ca7880 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3

[spark] branch master updated: [SPARK-38194] Followup: Fix k8s memory overhead passing to executor pods

2022-03-17 Thread tgraves
This is an automated email from the ASF dual-hosted git repository.

tgraves pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f36a5fb  [SPARK-38194] Followup: Fix k8s memory overhead passing to 
executor pods
f36a5fb is described below

commit f36a5fb2b88620c1c490d087b0293c4e58d29979
Author: Adam Binford - Customer Site (Virginia) - CW 121796 

AuthorDate: Thu Mar 17 18:32:29 2022 -0500

[SPARK-38194] Followup: Fix k8s memory overhead passing to executor pods

### What changes were proposed in this pull request?

Follow up to https://github.com/apache/spark/pull/35504 to fix k8s memory 
overhead handling.
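
For context, a rough Python model of the overhead arithmetic this step computes; the 0.1 and 0.4 default factors and the 384 MiB floor are assumed defaults for illustration, not values taken from this diff:

```python
def driver_memory_overhead_mib(driver_memory_mib, overhead_factor=0.1,
                               explicit_overhead_mib=None, min_overhead_mib=384):
    """Sketch of how the driver memory overhead is derived."""
    if explicit_overhead_mib is not None:  # an explicit spark.driver.memoryOverhead wins
        return explicit_overhead_mib
    return max(int(overhead_factor * driver_memory_mib), min_overhead_mib)

print(driver_memory_overhead_mib(4096))                       # JVM app: max(409, 384) = 409
print(driver_memory_overhead_mib(4096, overhead_factor=0.4))  # non-JVM app (assumed 0.4): 1638
```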

### Why are the changes needed?

https://github.com/apache/spark/pull/35504 introduced a bug only caught by 
the K8S integration tests.

### Does this PR introduce _any_ user-facing change?

It fixes the behavior back to the old behavior.

### How was this patch tested?

See if IT passes

Closes #35901 from Kimahriman/k8s-memory-overhead-executors.

Authored-by: Adam Binford - Customer Site (Virginia) - CW 121796 

Signed-off-by: Thomas Graves 
---
 .../k8s/features/BasicDriverFeatureStep.scala  | 30 +++-
 .../k8s/features/BasicDriverFeatureStepSuite.scala | 57 +-
 2 files changed, 63 insertions(+), 24 deletions(-)

diff --git 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala
 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala
index 9715149..413f5bc 100644
--- 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala
+++ 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala
@@ -53,28 +53,32 @@ private[spark] class BasicDriverFeatureStep(conf: 
KubernetesDriverConf)
 
   // Memory settings
   private val driverMemoryMiB = conf.get(DRIVER_MEMORY)
-  private val memoryOverheadFactor = if 
(conf.contains(DRIVER_MEMORY_OVERHEAD_FACTOR)) {
-conf.get(DRIVER_MEMORY_OVERHEAD_FACTOR)
-  } else {
-conf.get(MEMORY_OVERHEAD_FACTOR)
-  }
 
-  // The memory overhead factor to use. If the user has not set it, then use a 
different
-  // value for non-JVM apps. This value is propagated to executors.
-  private val overheadFactor =
+  // The default memory overhead factor to use, derived from the deprecated
+  // `spark.kubernetes.memoryOverheadFactor` config or the default overhead 
values.
+  // If the user has not set it, then use a different default for non-JVM 
apps. This value is
+  // propagated to executors and used if the executor overhead factor is not 
set explicitly.
+  private val defaultOverheadFactor =
 if (conf.mainAppResource.isInstanceOf[NonJVMResource]) {
-  if (conf.contains(MEMORY_OVERHEAD_FACTOR) || 
conf.contains(DRIVER_MEMORY_OVERHEAD_FACTOR)) {
-memoryOverheadFactor
+  if (conf.contains(MEMORY_OVERHEAD_FACTOR)) {
+conf.get(MEMORY_OVERHEAD_FACTOR)
   } else {
 NON_JVM_MEMORY_OVERHEAD_FACTOR
   }
 } else {
-  memoryOverheadFactor
+  conf.get(MEMORY_OVERHEAD_FACTOR)
 }
 
+  // Prefer the driver memory overhead factor if set explicitly
+  private val memoryOverheadFactor = if 
(conf.contains(DRIVER_MEMORY_OVERHEAD_FACTOR)) {
+conf.get(DRIVER_MEMORY_OVERHEAD_FACTOR)
+  } else {
+defaultOverheadFactor
+  }
+
   private val memoryOverheadMiB = conf
 .get(DRIVER_MEMORY_OVERHEAD)
-.getOrElse(math.max((overheadFactor * driverMemoryMiB).toInt,
+.getOrElse(math.max((memoryOverheadFactor * driverMemoryMiB).toInt,
   ResourceProfile.MEMORY_OVERHEAD_MIN_MIB))
   private val driverMemoryWithOverheadMiB = driverMemoryMiB + memoryOverheadMiB
 
@@ -169,7 +173,7 @@ private[spark] class BasicDriverFeatureStep(conf: 
KubernetesDriverConf)
   KUBERNETES_DRIVER_POD_NAME.key -> driverPodName,
   "spark.app.id" -> conf.appId,
   KUBERNETES_DRIVER_SUBMIT_CHECK.key -> "true",
-  DRIVER_MEMORY_OVERHEAD_FACTOR.key -> overheadFactor.toString)
+  MEMORY_OVERHEAD_FACTOR.key -> defaultOverheadFactor.toString)
 // try upload local, resolvable files to a hadoop compatible file system
 Seq(JARS, FILES, ARCHIVES, SUBMIT_PYTHON_FILES).foreach { key =>
   val uris = conf.get(key).filter(uri => 
KubernetesUtils.isLocalAndResolvable(uri))
diff --git 
a/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStepSuite.scala
 
b/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStepSuite.scala
index d45f5f9..9a3b06a 100644
--- 

[spark] branch master updated (2d1d18a -> 54fdb88)

2022-03-17 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 2d1d18a  [SPARK-37425][PYTHON] Inline type hints for 
python/pyspark/mllib/recommendation.py
 add 54fdb88  Revert "[SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.4"

No new revisions were added by this update.

Summary of changes:
 bin/pyspark|   2 +-
 bin/pyspark2.cmd   |   2 +-
 core/pom.xml   |   2 +-
 .../org/apache/spark/api/python/PythonUtils.scala  |   2 +-
 dev/deps/spark-deps-hadoop-2-hive-2.3  |   2 +-
 dev/deps/spark-deps-hadoop-3-hive-2.3  |   2 +-
 docs/job-scheduling.md |   2 +-
 python/docs/Makefile   |   2 +-
 python/docs/make2.bat  |   2 +-
 python/docs/source/getting_started/install.rst |   2 +-
 python/lib/py4j-0.10.9.3-src.zip   | Bin 0 -> 42021 bytes
 python/lib/py4j-0.10.9.4-src.zip   | Bin 42404 -> 0 bytes
 python/pyspark/context.py  |   6 ++--
 python/pyspark/util.py |  35 ++---
 python/setup.py|   2 +-
 sbin/spark-config.sh   |   2 +-
 16 files changed, 45 insertions(+), 20 deletions(-)
 create mode 100644 python/lib/py4j-0.10.9.3-src.zip
 delete mode 100644 python/lib/py4j-0.10.9.4-src.zip

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-37425][PYTHON] Inline type hints for python/pyspark/mllib/recommendation.py

2022-03-17 Thread zero323
This is an automated email from the ASF dual-hosted git repository.

zero323 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2d1d18a  [SPARK-37425][PYTHON] Inline type hints for 
python/pyspark/mllib/recommendation.py
2d1d18a is described below

commit 2d1d18ab4268eb5ba2ff22f74a5b3b85988d65b4
Author: dch nguyen 
AuthorDate: Fri Mar 18 00:14:04 2022 +0100

[SPARK-37425][PYTHON] Inline type hints for 
python/pyspark/mllib/recommendation.py

### What changes were proposed in this pull request?
Inline type hints for recommendation in python/pyspark/mllib/

### Why are the changes needed?
We can take advantage of static type checking within the functions by 
inlining the type hints.
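
A minimal, standalone illustration (mirroring the `Rating` class in this diff) of what inlining buys over a separate `.pyi` stub:

```python
from typing import NamedTuple

class Rating(NamedTuple):
    user: int
    product: int
    rating: float

ok = Rating(user=1, product=2, rating=5.0)
# A checker such as mypy now flags this inline, with no stub file to keep in sync:
# bad = Rating(user="1", product=2, rating=5.0)  # incompatible type "str"; expected "int"
print(ok.rating)
```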

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #35766 from dchvn/SPARK-37425.

Authored-by: dch nguyen 
Signed-off-by: zero323 
---
 python/pyspark/mllib/recommendation.py  | 70 +++-
 python/pyspark/mllib/recommendation.pyi | 71 -
 2 files changed, 42 insertions(+), 99 deletions(-)

diff --git a/python/pyspark/mllib/recommendation.py 
b/python/pyspark/mllib/recommendation.py
index cb412fb..55eae10 100644
--- a/python/pyspark/mllib/recommendation.py
+++ b/python/pyspark/mllib/recommendation.py
@@ -17,7 +17,7 @@
 
 import array
 import sys
-from collections import namedtuple
+from typing import Any, List, NamedTuple, Optional, Tuple, Type, Union
 
 from pyspark import SparkContext, since
 from pyspark.rdd import RDD
@@ -28,7 +28,7 @@ from pyspark.sql import DataFrame
 __all__ = ["MatrixFactorizationModel", "ALS", "Rating"]
 
 
-class Rating(namedtuple("Rating", ["user", "product", "rating"])):
+class Rating(NamedTuple):
 """
 Represents a (user, product, rating) tuple.
 
@@ -43,12 +43,18 @@ class Rating(namedtuple("Rating", ["user", "product", 
"rating"])):
 (1, 2, 5.0)
 """
 
-def __reduce__(self):
+user: int
+product: int
+rating: float
+
+def __reduce__(self) -> Tuple[Type["Rating"], Tuple[int, int, float]]:
 return Rating, (int(self.user), int(self.product), float(self.rating))
 
 
 @inherit_doc
-class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, JavaLoader):
+class MatrixFactorizationModel(
+JavaModelWrapper, JavaSaveable, JavaLoader["MatrixFactorizationModel"]
+):
 
 """A matrix factorisation model trained by regularized alternating
 least-squares.
@@ -135,14 +141,14 @@ class MatrixFactorizationModel(JavaModelWrapper, 
JavaSaveable, JavaLoader):
 """
 
 @since("0.9.0")
-def predict(self, user, product):
+def predict(self, user: int, product: int) -> float:
 """
 Predicts rating for the given user and product.
 """
 return self._java_model.predict(int(user), int(product))
 
 @since("0.9.0")
-def predictAll(self, user_product):
+def predictAll(self, user_product: RDD[Tuple[int, int]]) -> RDD[Rating]:
 """
 Returns a list of predicted ratings for input user and product
 pairs.
@@ -154,7 +160,7 @@ class MatrixFactorizationModel(JavaModelWrapper, 
JavaSaveable, JavaLoader):
 return self.call("predict", user_product)
 
 @since("1.2.0")
-def userFeatures(self):
+def userFeatures(self) -> RDD[Tuple[int, array.array]]:
 """
 Returns a paired RDD, where the first element is the user and the
 second is an array of features corresponding to that user.
@@ -162,7 +168,7 @@ class MatrixFactorizationModel(JavaModelWrapper, 
JavaSaveable, JavaLoader):
 return self.call("getUserFeatures").mapValues(lambda v: 
array.array("d", v))
 
 @since("1.2.0")
-def productFeatures(self):
+def productFeatures(self) -> RDD[Tuple[int, array.array]]:
 """
 Returns a paired RDD, where the first element is the product and the
 second is an array of features corresponding to that product.
@@ -170,7 +176,7 @@ class MatrixFactorizationModel(JavaModelWrapper, 
JavaSaveable, JavaLoader):
 return self.call("getProductFeatures").mapValues(lambda v: 
array.array("d", v))
 
 @since("1.4.0")
-def recommendUsers(self, product, num):
+def recommendUsers(self, product: int, num: int) -> List[Rating]:
 """
 Recommends the top "num" number of users for a given product and
 returns a list of Rating objects sorted by the predicted rating in
@@ -179,7 +185,7 @@ class MatrixFactorizationModel(JavaModelWrapper, 
JavaSaveable, JavaLoader):
 return list(self.call("recommendUsers", product, num))
 
 @since("1.4.0")
-def recommendProducts(self, user, num):
+def recommendProducts(self, user: int, num: int) -> List[Rating]:
 """
 Recommends the top "num" 

[spark] branch branch-3.3 updated: Revert "[SPARK-38194][YARN][MESOS][K8S] Make memory overhead factor configurable"

2022-03-17 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new cf0afa8  Revert "[SPARK-38194][YARN][MESOS][K8S] Make memory overhead 
factor configurable"
cf0afa8 is described below

commit cf0afa8619544ab6008fcec8a25891c2ff43625a
Author: Thomas Graves 
AuthorDate: Thu Mar 17 12:54:50 2022 -0700

Revert "[SPARK-38194][YARN][MESOS][K8S] Make memory overhead factor 
configurable"

### What changes were proposed in this pull request?

This reverts commit 8405ec352dbed6a3199fc2af3c60fae7186d15b5.

### Why are the changes needed?

The original PR broke the K8s integration tests, so let's revert it in branch-3.3 for now and fix it on master.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CI. The K8s IT is recovered, as shown in the following.

```
[info] KubernetesSuite:
[info] - Run SparkPi with no resources (9 seconds, 832 milliseconds)
[info] - Run SparkPi with no resources & statefulset allocation (9 seconds, 
715 milliseconds)
[info] - Run SparkPi with a very long application name. (8 seconds, 672 
milliseconds)
[info] - Use SparkLauncher.NO_RESOURCE (9 seconds, 614 milliseconds)
[info] - Run SparkPi with a master URL without a scheme. (9 seconds, 616 
milliseconds)
[info] - Run SparkPi with an argument. (8 seconds, 633 milliseconds)
[info] - Run SparkPi with custom labels, annotations, and environment 
variables. (8 seconds, 631 milliseconds)
[info] - All pods have the same service account by default (8 seconds, 625 
milliseconds)
[info] - Run extraJVMOptions check on driver (4 seconds, 639 milliseconds)
[info] - Run SparkRemoteFileTest using a remote data file (8 seconds, 699 
milliseconds)
[info] - Verify logging configuration is picked from the provided 
SPARK_CONF_DIR/log4j2.properties (14 seconds, 31 milliseconds)
[info] - Run SparkPi with env and mount secrets. (17 seconds, 878 
milliseconds)
[info] - Run PySpark on simple pi.py example (9 seconds, 642 milliseconds)
[info] - Run PySpark to test a pyfiles example (11 seconds, 883 
milliseconds)
[info] - Run PySpark with memory customization (9 seconds, 602 milliseconds)
[info] - Run in client mode. (6 seconds, 303 milliseconds)
[info] - Start pod creation from template (8 seconds, 864 milliseconds)
[info] - SPARK-38398: Schedule pod creation from template (8 seconds, 665 
milliseconds)
[info] - Test basic decommissioning (41 seconds, 74 milliseconds)
[info] - Test basic decommissioning with shuffle cleanup (41 seconds, 318 
milliseconds)
[info] - Test decommissioning with dynamic allocation & shuffle cleanups (2 
minutes, 40 seconds)
[info] - Test decommissioning timeouts (41 seconds, 892 milliseconds)
[info] - SPARK-37576: Rolling decommissioning (1 minute, 7 seconds)
[info] - Run SparkR on simple dataframe.R example (11 seconds, 643 
milliseconds)
[info] VolcanoSuite:
[info] - Run SparkPi with no resources (9 seconds, 585 milliseconds)
[info] - Run SparkPi with no resources & statefulset allocation (10 
seconds, 607 milliseconds)
[info] - Run SparkPi with a very long application name. (9 seconds, 636 
milliseconds)
[info] - Use SparkLauncher.NO_RESOURCE (10 seconds, 681 milliseconds)
[info] - Run SparkPi with a master URL without a scheme. (10 seconds, 628 
milliseconds)
[info] - Run SparkPi with an argument. (9 seconds, 638 milliseconds)
[info] - Run SparkPi with custom labels, annotations, and environment 
variables. (9 seconds, 626 milliseconds)
[info] - All pods have the same service account by default (10 seconds, 615 
milliseconds)
[info] - Run extraJVMOptions check on driver (4 seconds, 590 milliseconds)
[info] - Run SparkRemoteFileTest using a remote data file (9 seconds, 660 
milliseconds)
[info] - Verify logging configuration is picked from the provided 
SPARK_CONF_DIR/log4j2.properties (15 seconds, 277 milliseconds)
[info] - Run SparkPi with env and mount secrets. (19 seconds, 300 
milliseconds)
[info] - Run PySpark on simple pi.py example (10 seconds, 641 milliseconds)
[info] - Run PySpark to test a pyfiles example (12 seconds, 656 
milliseconds)
[info] - Run PySpark with memory customization (10 seconds, 599 
milliseconds)
[info] - Run in client mode. (7 seconds, 258 milliseconds)
[info] - Start pod creation from template (10 seconds, 664 milliseconds)
[info] - SPARK-38398: Schedule pod creation from template (10 seconds, 891 
milliseconds)
[info] - Test basic decommissioning (42 seconds, 85 milliseconds)
[info] - Test basic decommissioning with shuffle cleanup (42 seconds, 384 
milliseconds)
[info] - Test decommissioning with dynamic allocation & 

[spark] branch master updated (968bb34 -> cd86df8)

2022-03-17 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 968bb34  [SPARK-38575][INFRA][FOLLOW-UP] Use GITHUB_REF to get the 
current branch
 add cd86df8  [SPARK-38575][INFRA][FOLLOW-UP] Pin the branch to `master` 
for forked repositories

No new revisions were added by this update.

Summary of changes:
 .github/workflows/build_and_test.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-38575][INFRA][FOLLOW-UP] Use GITHUB_REF to get the current branch

2022-03-17 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 968bb34  [SPARK-38575][INFRA][FOLLOW-UP] Use GITHUB_REF to get the 
current branch
968bb34 is described below

commit 968bb34ef6f824d2c1b7c68ea5190d0f5ef32654
Author: Hyukjin Kwon 
AuthorDate: Thu Mar 17 18:52:26 2022 +0900

[SPARK-38575][INFRA][FOLLOW-UP] Use GITHUB_REF to get the current branch

### What changes were proposed in this pull request?

The current `git branch --show-current` call doesn't work because we're not in a git repository when we calculate the changes we need.
Instead, this PR proposes to use the `GITHUB_REF` environment variable to get the current branch.
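
For illustration only, the prefix stripping done by the workflow's shell expansion `${GITHUB_REF#refs/heads/}`, written as a small Python sketch (the example ref value is made up):

```python
import os

# GITHUB_REF is set by GitHub Actions, e.g. "refs/heads/branch-3.3" for a branch build.
ref = os.environ.get("GITHUB_REF", "refs/heads/branch-3.3")
branch = ref[len("refs/heads/"):] if ref.startswith("refs/heads/") else ref
print(branch)  # -> "branch-3.3"
```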

### Why are the changes needed?

To fix up the build. See 
https://github.com/apache/spark/runs/5583091595?check_suite_focus=true (branch 
names)

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Should monitor test.

Closes #35892 from HyukjinKwon/SPARK-38575-follow2.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 .github/workflows/build_and_test.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index c589b39..2e9f971 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -96,7 +96,7 @@ jobs:
   echo '::set-output name=hadoop::hadoop3'
 else
   echo '::set-output name=java::8'
-  echo "::set-output name=branch::`git branch --show-current`"
+  echo "::set-output name=branch::${GITHUB_REF#refs/heads/}"
   echo '::set-output name=type::regular'
   echo '::set-output name=envs::{"SPARK_ANSI_SQL_MODE": "${{ 
inputs.ansi_enabled }}"}'
   echo '::set-output name=hadoop::hadoop3'

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.3 updated: [SPARK-38586][INFRA] Trigger notifying workflow in branch-3.3 and other future branches

2022-03-17 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 2ee5705  [SPARK-38586][INFRA] Trigger notifying workflow in branch-3.3 
and other future branches
2ee5705 is described below

commit 2ee57055fd917a1829adce06c2504db0a68e3ba2
Author: Hyukjin Kwon 
AuthorDate: Thu Mar 17 18:43:42 2022 +0900

[SPARK-38586][INFRA] Trigger notifying workflow in branch-3.3 and other 
future branches

This PR fixes the `Notify test workflow` workflow to be triggered against PRs.

In fact, we don't need to check if the branch is `master` since the event 
triggers the workflow that's found in the commit SHA.

To link builds to the CI status in PRs.

No, dev-only.

Will be checked after it gets merged - it's pretty straightforward.

Closes #35891 from HyukjinKwon/SPARK-38575-follow.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 5c4930ab96360fc8c9ec8f15316e36eb7d516560)
Signed-off-by: Hyukjin Kwon 
---
 .github/workflows/notify_test_workflow.yml | 1 -
 1 file changed, 1 deletion(-)

diff --git a/.github/workflows/notify_test_workflow.yml 
b/.github/workflows/notify_test_workflow.yml
index 04e7ab8..eb0da84 100644
--- a/.github/workflows/notify_test_workflow.yml
+++ b/.github/workflows/notify_test_workflow.yml
@@ -34,7 +34,6 @@ jobs:
 steps:
   - name: "Notify test workflow"
 uses: actions/github-script@f05a81df23035049204b043b50c3322045ce7eb3 # 
pin@v3
-if: ${{ github.base_ref == 'master' }}
 with:
   github-token: ${{ secrets.GITHUB_TOKEN }}
   script: |

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-38586][INFRA] Trigger notifying workflow in branch-3.3 and other future branches

2022-03-17 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 5c4930a  [SPARK-38586][INFRA] Trigger notifying workflow in branch-3.3 
and other future branches
5c4930a is described below

commit 5c4930ab96360fc8c9ec8f15316e36eb7d516560
Author: Hyukjin Kwon 
AuthorDate: Thu Mar 17 18:43:42 2022 +0900

[SPARK-38586][INFRA] Trigger notifying workflow in branch-3.3 and other 
future branches

### What changes were proposed in this pull request?

This PR fixes the `Notify test workflow` workflow to be triggered against PRs.

In fact, we don't need to check if the branch is `master` since the event 
triggers the workflow that's found in the commit SHA.

### Why are the changes needed?

To link builds to the CI status in PRs.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Will be checked after it gets merged - it's pretty straightforward.

Closes #35891 from HyukjinKwon/SPARK-38575-follow.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 .github/workflows/notify_test_workflow.yml | 1 -
 1 file changed, 1 deletion(-)

diff --git a/.github/workflows/notify_test_workflow.yml 
b/.github/workflows/notify_test_workflow.yml
index 04e7ab8..eb0da84 100644
--- a/.github/workflows/notify_test_workflow.yml
+++ b/.github/workflows/notify_test_workflow.yml
@@ -34,7 +34,6 @@ jobs:
 steps:
   - name: "Notify test workflow"
 uses: actions/github-script@f05a81df23035049204b043b50c3322045ce7eb3 # 
pin@v3
-if: ${{ github.base_ref == 'master' }}
 with:
   github-token: ${{ secrets.GITHUB_TOKEN }}
   script: |

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.3 updated (ebaee4f -> 1824c69)

2022-03-17 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git.


from ebaee4f  [SPARK-37995][SQL] PlanAdaptiveDynamicPruningFilters should 
use prepareExecutedPlan rather than createSparkPlan to re-plan subquery
 add 1824c69  [SPARK-38566][SQL][3.3] Revert the parser changes for DEFAULT 
column support

No new revisions were added by this update.

Summary of changes:
 docs/sql-ref-ansi-compliance.md|  1 -
 .../spark/sql/catalyst/parser/SqlBaseLexer.g4  |  1 -
 .../spark/sql/catalyst/parser/SqlBaseParser.g4 | 22 ++--
 .../spark/sql/catalyst/parser/AstBuilder.scala | 61 +-
 .../spark/sql/errors/QueryParsingErrors.scala  |  4 --
 .../spark/sql/catalyst/parser/DDLParserSuite.scala | 53 ---
 .../spark/sql/execution/SparkSqlParser.scala   |  2 +-
 7 files changed, 6 insertions(+), 138 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [MINOR][INFRA] Add ANTLR generated files to .gitignore

2022-03-17 Thread sarutak
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 46ccc22  [MINOR][INFRA] Add ANTLR generated files to .gitignore
46ccc22 is described below

commit 46ccc22ee40c780f6ae4a9af4562fb1ad10ccd9f
Author: Yuto Akutsu 
AuthorDate: Thu Mar 17 18:12:13 2022 +0900

[MINOR][INFRA] Add ANTLR generated files to .gitignore

### What changes were proposed in this pull request?

Add git ignore entries for files created by ANTLR.

### Why are the changes needed?

To prevent developers from accidentally adding those files when working on the parser/lexer.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

To make sure those files are ignored by `git status` when they exist.

Closes #35838 from yutoacts/minor_gitignore.

Authored-by: Yuto Akutsu 
Signed-off-by: Kousuke Saruta 
---
 .gitignore | 5 +
 1 file changed, 5 insertions(+)

diff --git a/.gitignore b/.gitignore
index b758781..0e2f59f 100644
--- a/.gitignore
+++ b/.gitignore
@@ -117,3 +117,8 @@ spark-warehouse/
 
 # For Node.js
 node_modules
+
+# For Antlr
+sql/catalyst/gen/
+sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.tokens
+sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/gen/

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.2 updated: [SPARK-37995][SQL] PlanAdaptiveDynamicPruningFilters should use prepareExecutedPlan rather than createSparkPlan to re-plan subquery

2022-03-17 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new a27f13e  [SPARK-37995][SQL] PlanAdaptiveDynamicPruningFilters should 
use prepareExecutedPlan rather than createSparkPlan to re-plan subquery
a27f13e is described below

commit a27f13ea8c14d72558c77bfb75d725d649cd2081
Author: ulysses-you 
AuthorDate: Thu Mar 17 16:58:22 2022 +0800

[SPARK-37995][SQL] PlanAdaptiveDynamicPruningFilters should use 
prepareExecutedPlan rather than createSparkPlan to re-plan subquery

### What changes were proposed in this pull request?

Use the existing adaptive execution context to re-compile the subquery.

### Why are the changes needed?

If a subquery inferred by DPP contains a nested subquery, AQE cannot compile it correctly.

The added test fails before this PR:
```java
java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.plans.logical.Aggregate cannot be cast to 
org.apache.spark.sql.execution.SparkPlan
  at 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
  at scala.collection.TraversableLike.map(TraversableLike.scala:286)
  at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
  at scala.collection.AbstractTraversable.map(Traversable.scala:108)
  at 
org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:75)
  at 
org.apache.spark.sql.execution.SparkPlanInfo$.$anonfun$fromSparkPlan$3(SparkPlanInfo.scala:75)
```
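
For orientation, a rough, hypothetical PySpark sketch of the query shape described above: a join on a partition column (so DPP can infer a pruning filter) whose build side itself contains a subquery. This is not the test added by this patch, and the table and column names are made up:

```python
# Assumes tables with roughly this shape already exist; only the structure matters:
#   fact        - partitioned by part_col
#   dim         - joined on the partition column, filtered through a nested IN-subquery
#   dim_groups  - referenced by the nested subquery
spark.sql("""
    SELECT f.id, f.value
    FROM fact f
    JOIN dim d
      ON f.part_col = d.part_col
    WHERE d.group_id IN (SELECT group_id FROM dim_groups WHERE active = true)
""").explain()
```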

### Does this PR introduce _any_ user-facing change?

yes, bug fix

### How was this patch tested?

add test

Closes #35849 from ulysses-you/SPARK-37995.

Authored-by: ulysses-you 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 3afc4fb08b01597a5677ce706731639c687fd2dd)
Signed-off-by: Wenchen Fan 
---
 .../apache/spark/sql/execution/QueryExecution.scala  | 13 +
 .../adaptive/PlanAdaptiveDynamicPruningFilters.scala | 11 ---
 .../spark/sql/DynamicPartitionPruningSuite.scala | 20 +++-
 3 files changed, 36 insertions(+), 8 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala
index 0d3b752..4c5b0cc 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala
@@ -482,6 +482,19 @@ object QueryExecution {
 prepareExecutedPlan(spark, sparkPlan)
   }
 
+  /**
+   * Prepare the [[SparkPlan]] for execution using exists adaptive execution 
context.
+   * This method is only called by [[PlanAdaptiveDynamicPruningFilters]].
+   */
+  def prepareExecutedPlan(
+  session: SparkSession,
+  plan: LogicalPlan,
+  context: AdaptiveExecutionContext): SparkPlan = {
+val sparkPlan = createSparkPlan(session, session.sessionState.planner, 
plan.clone())
+val preparationRules = preparations(session, 
Option(InsertAdaptiveSparkPlan(context)), true)
+prepareForExecution(preparationRules, sparkPlan.clone())
+  }
+
   private val currentCteMap = new ThreadLocal[mutable.HashMap[Long, 
CTERelationDef]]()
 
   def cteMap: mutable.HashMap[Long, CTERelationDef] = currentCteMap.get()
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/PlanAdaptiveDynamicPruningFilters.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/PlanAdaptiveDynamicPruningFilters.scala
index f9c1bbe..aa8f236 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/PlanAdaptiveDynamicPruningFilters.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/PlanAdaptiveDynamicPruningFilters.scala
@@ -71,13 +71,10 @@ case class PlanAdaptiveDynamicPruningFilters(
   val aggregate = Aggregate(Seq(alias), Seq(alias), buildPlan)
 
   val session = adaptivePlan.context.session
-  val planner = session.sessionState.planner
-  // Here we can't call the QueryExecution.prepareExecutedPlan() 
method to
-  // get the sparkPlan as Non-AQE use case, which will cause the 
physical
-  // plan optimization rules be inserted twice, once in AQE framework 
and
-  // another in prepareExecutedPlan() method.
-  val 

[spark] branch branch-3.3 updated: [SPARK-37995][SQL] PlanAdaptiveDynamicPruningFilters should use prepareExecutedPlan rather than createSparkPlan to re-plan subquery

2022-03-17 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new ebaee4f  [SPARK-37995][SQL] PlanAdaptiveDynamicPruningFilters should 
use prepareExecutedPlan rather than createSparkPlan to re-plan subquery
ebaee4f is described below

commit ebaee4fad1a09d5e19d74ced940ef09f2a416f71
Author: ulysses-you 
AuthorDate: Thu Mar 17 16:58:22 2022 +0800

[SPARK-37995][SQL] PlanAdaptiveDynamicPruningFilters should use 
prepareExecutedPlan rather than createSparkPlan to re-plan subquery

### What changes were proposed in this pull request?

Use the existing adaptive execution context to re-compile the subquery.

### Why are the changes needed?

If a subquery inferred by DPP contains a nested subquery, AQE cannot compile it correctly.

The added test fails before this PR:
```java
java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.plans.logical.Aggregate cannot be cast to 
org.apache.spark.sql.execution.SparkPlan
  at 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
  at scala.collection.TraversableLike.map(TraversableLike.scala:286)
  at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
  at scala.collection.AbstractTraversable.map(Traversable.scala:108)
  at 
org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:75)
  at 
org.apache.spark.sql.execution.SparkPlanInfo$.$anonfun$fromSparkPlan$3(SparkPlanInfo.scala:75)
```

### Does this PR introduce _any_ user-facing change?

yes, bug fix

### How was this patch tested?

add test

Closes #35849 from ulysses-you/SPARK-37995.

Authored-by: ulysses-you 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 3afc4fb08b01597a5677ce706731639c687fd2dd)
Signed-off-by: Wenchen Fan 
---
 .../apache/spark/sql/execution/QueryExecution.scala  | 13 +
 .../adaptive/PlanAdaptiveDynamicPruningFilters.scala | 11 ---
 .../spark/sql/DynamicPartitionPruningSuite.scala | 20 +++-
 3 files changed, 36 insertions(+), 8 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala
index 1b08994..9bf8de5 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala
@@ -485,6 +485,19 @@ object QueryExecution {
 prepareExecutedPlan(spark, sparkPlan)
   }
 
+  /**
+   * Prepare the [[SparkPlan]] for execution using exists adaptive execution 
context.
+   * This method is only called by [[PlanAdaptiveDynamicPruningFilters]].
+   */
+  def prepareExecutedPlan(
+  session: SparkSession,
+  plan: LogicalPlan,
+  context: AdaptiveExecutionContext): SparkPlan = {
+val sparkPlan = createSparkPlan(session, session.sessionState.planner, 
plan.clone())
+val preparationRules = preparations(session, 
Option(InsertAdaptiveSparkPlan(context)), true)
+prepareForExecution(preparationRules, sparkPlan.clone())
+  }
+
   private val currentCteMap = new ThreadLocal[mutable.HashMap[Long, 
CTERelationDef]]()
 
   def cteMap: mutable.HashMap[Long, CTERelationDef] = currentCteMap.get()
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/PlanAdaptiveDynamicPruningFilters.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/PlanAdaptiveDynamicPruningFilters.scala
index 1cc39df..9a780c1 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/PlanAdaptiveDynamicPruningFilters.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/PlanAdaptiveDynamicPruningFilters.scala
@@ -71,13 +71,10 @@ case class PlanAdaptiveDynamicPruningFilters(
   val aggregate = Aggregate(Seq(alias), Seq(alias), buildPlan)
 
   val session = adaptivePlan.context.session
-  val planner = session.sessionState.planner
-  // Here we can't call the QueryExecution.prepareExecutedPlan() 
method to
-  // get the sparkPlan as Non-AQE use case, which will cause the 
physical
-  // plan optimization rules be inserted twice, once in AQE framework 
and
-  // another in prepareExecutedPlan() method.
-  val 

[spark] branch master updated (f0b836b -> 3afc4fb)

2022-03-17 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from f0b836b  [SPARK-38560][SQL] If `Sum`, `Count`, `Any` accompany with 
distinct, cannot do partial agg push down
 add 3afc4fb  [SPARK-37995][SQL] PlanAdaptiveDynamicPruningFilters should 
use prepareExecutedPlan rather than createSparkPlan to re-plan subquery

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/execution/QueryExecution.scala  | 13 +
 .../adaptive/PlanAdaptiveDynamicPruningFilters.scala | 11 ---
 .../spark/sql/DynamicPartitionPruningSuite.scala | 20 +++-
 3 files changed, 36 insertions(+), 8 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.3 updated: [SPARK-38560][SQL] If `Sum`, `Count`, `Any` accompany with distinct, cannot do partial agg push down

2022-03-17 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 801c330  [SPARK-38560][SQL] If `Sum`, `Count`, `Any` accompany with 
distinct, cannot do partial agg push down
801c330 is described below

commit 801c330036ef93789061daeea82f7cbc8b9cdebb
Author: Jiaan Geng 
AuthorDate: Thu Mar 17 16:53:40 2022 +0800

[SPARK-38560][SQL] If `Sum`, `Count`, `Any` accompany with distinct, cannot 
do partial agg push down

### What changes were proposed in this pull request?
Spark could partially push down sum(distinct col) and count(distinct col) if the data source has multiple partitions, and Spark would then sum the partial values again, so the result may be incorrect (see the sketch below).
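
A plain-Python sketch of why re-aggregating per-partition distinct counts is wrong (the partition contents are made up):

```python
partitions = [[1, 2, 2], [2, 3]]  # the value 2 appears in both partitions

per_partition = [len(set(p)) for p in partitions]       # [2, 2] pushed-down partial results
wrong = sum(per_partition)                              # 4: summing the partials again
correct = len({v for p in partitions for v in p})       # 3: the true distinct count
print(wrong, correct)                                   # 4 3
```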

### Why are the changes needed?
Fix the bug where sum(distinct col) and count(distinct col) are pushed down to the data source and return an incorrect result.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Users will see the correct behavior.

### How was this patch tested?
New tests.

Closes #35873 from beliefer/SPARK-38560.

Authored-by: Jiaan Geng 
Signed-off-by: Wenchen Fan 
---
 .../datasources/v2/V2ScanRelationPushDown.scala| 184 +++--
 .../org/apache/spark/sql/jdbc/JDBCV2Suite.scala|  14 +-
 2 files changed, 111 insertions(+), 87 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala
index 3ff9176..b4bd027 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala
@@ -26,7 +26,7 @@ import org.apache.spark.sql.catalyst.planning.ScanOperation
 import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Filter, LeafNode, Limit, LocalLimit, LogicalPlan, Project, Sample, Sort}
 import org.apache.spark.sql.catalyst.rules.Rule
 import org.apache.spark.sql.connector.expressions.SortOrder
-import org.apache.spark.sql.connector.expressions.aggregate.{Aggregation, Avg, GeneralAggregateFunc}
+import org.apache.spark.sql.connector.expressions.aggregate.{Aggregation, Avg, Count, GeneralAggregateFunc, Sum}
 import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownAggregates, SupportsPushDownFilters, V1Scan}
 import org.apache.spark.sql.execution.datasources.DataSourceStrategy
 import org.apache.spark.sql.sources
@@ -156,101 +156,106 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] with PredicateHelper {
 }
   }
 
-  val pushedAggregates = finalTranslatedAggregates.filter(r.pushAggregation)
-  if (pushedAggregates.isEmpty) {
+  if (finalTranslatedAggregates.isEmpty) {
 aggNode // return original plan node
-  } else if (!supportPartialAggPushDown(pushedAggregates.get) &&
-!r.supportCompletePushDown(pushedAggregates.get)) {
+  } else if (!r.supportCompletePushDown(finalTranslatedAggregates.get) &&
+!supportPartialAggPushDown(finalTranslatedAggregates.get)) {
 aggNode // return original plan node
   } else {
-// No need to do column pruning because only the aggregate columns are used as
-// DataSourceV2ScanRelation output columns. All the other columns are not
-// included in the output.
-val scan = sHolder.builder.build()
-
-// scalastyle:off
-// use the group by columns and aggregate columns as the output columns
-// e.g. TABLE t (c1 INT, c2 INT, c3 INT)
-// SELECT min(c1), max(c1) FROM t GROUP BY c2;
-// Use c2, min(c1), max(c1) as output for DataSourceV2ScanRelation
-// We want to have the following logical plan:
-// == Optimized Logical Plan ==
-// Aggregate [c2#10], [min(min(c1)#21) AS min(c1)#17, max(max(c1)#22) AS max(c1)#18]
-// +- RelationV2[c2#10, min(c1)#21, max(c1)#22]
-// scalastyle:on
-val newOutput = scan.readSchema().toAttributes
-assert(newOutput.length == groupingExpressions.length + finalAggregates.length)
-val groupAttrs = normalizedGroupingExpressions.zip(newOutput).map {
-  case (a: Attribute, b: Attribute) => b.withExprId(a.exprId)
-  case (_, b) => b
-}
-val aggOutput = newOutput.drop(groupAttrs.length)
-val output = groupAttrs ++ aggOutput
-
-logInfo(
-

[spark] branch master updated: [SPARK-38560][SQL] If `Sum`, `Count`, `Any` accompany with distinct, cannot do partial agg push down

2022-03-17 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f0b836b  [SPARK-38560][SQL] If `Sum`, `Count`, `Any` accompany with 
distinct, cannot do partial agg push down
f0b836b is described below

commit f0b836b1a6d5926dd09018b67a27461aca5ce739
Author: Jiaan Geng 
AuthorDate: Thu Mar 17 16:53:40 2022 +0800

[SPARK-38560][SQL] If `Sum`, `Count`, `Any` accompany with distinct, cannot 
do partial agg push down

### What changes were proposed in this pull request?
Spark could partially push down sum(distinct col) and count(distinct col); if the
data source has multiple partitions, Spark would then sum the partial values again,
so the result could be incorrect.

### Why are the changes needed?
Fix the bug where pushing sum(distinct col) and count(distinct col) down to the
data source returns an incorrect result.
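
For intuition, a minimal Scala sketch using plain collections (illustrative data only, not the Spark API) contrasts an aggregate that stays correct after partial push down with one that does not once partitions share values:

```scala
// Hypothetical two-partition source; the value 2 appears in both partitions.
val partitions = Seq(Seq(1, 1, 2), Seq(2, 3))

// MAX is safe to re-aggregate: the max of per-partition maxima is the global max.
val maxPushedDown = partitions.map(_.max).max                   // 3
val maxExpected   = partitions.flatten.max                      // 3

// SUM(DISTINCT) is not: per-partition de-duplication misses cross-partition duplicates.
val sumDistinctPushedDown = partitions.map(_.distinct.sum).sum  // (1 + 2) + (2 + 3) = 8
val sumDistinctExpected   = partitions.flatten.distinct.sum     // 1 + 2 + 3 = 6
```

This asymmetry is why Sum, Count and Any with DISTINCT must not be partially pushed down, while complete push down (where the source computes the final result itself) can still be correct.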

### Does this PR introduce _any_ user-facing change?
'Yes'.
Users will see the correct behavior.

### How was this patch tested?
New tests.

Closes #35873 from beliefer/SPARK-38560.

Authored-by: Jiaan Geng 
Signed-off-by: Wenchen Fan 
---
 .../datasources/v2/V2ScanRelationPushDown.scala| 184 +++--
 .../org/apache/spark/sql/jdbc/JDBCV2Suite.scala|  14 +-
 2 files changed, 111 insertions(+), 87 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala
index 3ff9176..b4bd027 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala
@@ -26,7 +26,7 @@ import org.apache.spark.sql.catalyst.planning.ScanOperation
 import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Filter, LeafNode, Limit, LocalLimit, LogicalPlan, Project, Sample, Sort}
 import org.apache.spark.sql.catalyst.rules.Rule
 import org.apache.spark.sql.connector.expressions.SortOrder
-import org.apache.spark.sql.connector.expressions.aggregate.{Aggregation, Avg, GeneralAggregateFunc}
+import org.apache.spark.sql.connector.expressions.aggregate.{Aggregation, Avg, Count, GeneralAggregateFunc, Sum}
 import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownAggregates, SupportsPushDownFilters, V1Scan}
 import org.apache.spark.sql.execution.datasources.DataSourceStrategy
 import org.apache.spark.sql.sources
@@ -156,101 +156,106 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] with PredicateHelper {
 }
   }
 
-  val pushedAggregates = finalTranslatedAggregates.filter(r.pushAggregation)
-  if (pushedAggregates.isEmpty) {
+  if (finalTranslatedAggregates.isEmpty) {
 aggNode // return original plan node
-  } else if (!supportPartialAggPushDown(pushedAggregates.get) &&
-!r.supportCompletePushDown(pushedAggregates.get)) {
+  } else if (!r.supportCompletePushDown(finalTranslatedAggregates.get) &&
+!supportPartialAggPushDown(finalTranslatedAggregates.get)) {
 aggNode // return original plan node
   } else {
-// No need to do column pruning because only the aggregate columns are used as
-// DataSourceV2ScanRelation output columns. All the other columns are not
-// included in the output.
-val scan = sHolder.builder.build()
-
-// scalastyle:off
-// use the group by columns and aggregate columns as the output columns
-// e.g. TABLE t (c1 INT, c2 INT, c3 INT)
-// SELECT min(c1), max(c1) FROM t GROUP BY c2;
-// Use c2, min(c1), max(c1) as output for DataSourceV2ScanRelation
-// We want to have the following logical plan:
-// == Optimized Logical Plan ==
-// Aggregate [c2#10], [min(min(c1)#21) AS min(c1)#17, max(max(c1)#22) AS max(c1)#18]
-// +- RelationV2[c2#10, min(c1)#21, max(c1)#22]
-// scalastyle:on
-val newOutput = scan.readSchema().toAttributes
-assert(newOutput.length == groupingExpressions.length + finalAggregates.length)
-val groupAttrs = normalizedGroupingExpressions.zip(newOutput).map {
-  case (a: Attribute, b: Attribute) => b.withExprId(a.exprId)
-  case (_, b) => b
-}
-val aggOutput = newOutput.drop(groupAttrs.length)
-val output = groupAttrs ++ aggOutput
-
-logInfo(
-  s"""