(spark) branch master updated: [SPARK-46196][PYTHON][DOCS] Add missing `toDegrees/toRadians/atan2/approxCountDistinct` function descriptions

2023-12-01 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f0ec157fe833 [SPARK-46196][PYTHON][DOCS] Add missing `toDegrees/toRadians/atan2/approxCountDistinct` function descriptions
f0ec157fe833 is described below

commit f0ec157fe8339284c1ca301e8af24c144af51e65
Author: Ruifeng Zheng 
AuthorDate: Fri Dec 1 00:19:32 2023 -0800

[SPARK-46196][PYTHON][DOCS] Add missing `toDegrees/toRadians/atan2/approxCountDistinct` function descriptions

### What changes were proposed in this pull request?
Add missing function descriptions

### Why are the changes needed?
They are missing from https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html


![image](https://github.com/apache/spark/assets/7322292/dd6cc2f2-0e5a-4a9d-ba91-3ffbb0ebb3a7)

### Does this PR introduce _any_ user-facing change?
yes, doc changes

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44104 from zhengruifeng/py_doc_fix_atan2.

Authored-by: Ruifeng Zheng 
Signed-off-by: Dongjoon Hyun 
---
 python/pyspark/sql/functions/builtin.py | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/python/pyspark/sql/functions/builtin.py b/python/pyspark/sql/functions/builtin.py
index d985b9e6138f..5e5c70322ec9 100644
--- a/python/pyspark/sql/functions/builtin.py
+++ b/python/pyspark/sql/functions/builtin.py
@@ -2546,6 +2546,9 @@ def tanh(col: "ColumnOrName") -> Column:
 @_try_remote_functions
 def toDegrees(col: "ColumnOrName") -> Column:
 """
+Converts an angle measured in radians to an approximately equivalent angle
+measured in degrees.
+
 .. versionadded:: 1.4.0
 
 .. versionchanged:: 3.4.0
@@ -2561,6 +2564,9 @@ def toDegrees(col: "ColumnOrName") -> Column:
 @_try_remote_functions
 def toRadians(col: "ColumnOrName") -> Column:
 """
+Converts an angle measured in degrees to an approximately equivalent angle
+measured in radians.
+
 .. versionadded:: 1.4.0
 
 .. versionchanged:: 3.4.0
@@ -4025,6 +4031,9 @@ def radians(col: "ColumnOrName") -> Column:
 @_try_remote_functions
def atan2(col1: Union["ColumnOrName", float], col2: Union["ColumnOrName", float]) -> Column:
 """
+Compute the angle in radians between the positive x-axis of a plane
+and the point given by the coordinates
+
 .. versionadded:: 1.4.0
 
 .. versionchanged:: 3.4.0
@@ -4412,6 +4421,9 @@ def percent_rank() -> Column:
 @_try_remote_functions
def approxCountDistinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Column:
 """
+This aggregate function returns a new :class:`~pyspark.sql.Column`, which estimates
+the approximate distinct count of elements in a specified column or a group of columns.
+
 .. versionadded:: 1.3.0
 
 .. versionchanged:: 3.4.0
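
A quick usage sketch (not part of the commit) exercising the four newly documented functions; it assumes a local SparkSession and uses only public `pyspark.sql.functions` APIs:

```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame([(3.141592653589793, 1.0)], ["rad", "x"])
df.select(
    F.toDegrees("rad"),         # radians -> degrees, ~180.0
    F.toRadians(F.lit(180.0)),  # degrees -> radians, ~pi
    F.atan2("x", "x"),          # angle of the point (1.0, 1.0), ~pi/4
).show()
# aggregate function, so it goes in its own select
df.select(F.approxCountDistinct("x")).show()  # approximate distinct count, 1
```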


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-46197][INFRA] Upgrade `memory-profiler>=0.61.0` for Python 3.12

2023-12-01 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2ff9b7241e46 [SPARK-46197][INFRA] Upgrade `memory-profiler>=0.61.0` for Python 3.12
2ff9b7241e46 is described below

commit 2ff9b7241e4659c1918b32ceef35aa30206ae077
Author: Ruifeng Zheng 
AuthorDate: Fri Dec 1 01:23:11 2023 -0800

[SPARK-46197][INFRA] Upgrade `memory-profiler>=0.61.0` for Python 3.12

### What changes were proposed in this pull request?
Upgrade memory-profiler>=0.61.0 for Python 3.12

### Why are the changes needed?
`memory-profiler==0.60.0` does not support Python 3.12
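
A hedged runtime check (illustrative only, not part of the patch) that the installed wheel satisfies the new floor; `importlib.metadata` is stdlib since Python 3.8:

```
from importlib.metadata import version

# memory-profiler 0.60.0 does not support Python 3.12, hence the >=0.61.0 floor
installed = tuple(int(p) for p in version("memory-profiler").split("."))
assert installed >= (0, 61, 0), f"memory-profiler {installed} is too old"
```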

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44105 from zhengruifeng/infra_upgrade_memory-profiler.

Authored-by: Ruifeng Zheng 
Signed-off-by: Dongjoon Hyun 
---
 dev/infra/Dockerfile | 8 ++++----
 dev/requirements.txt | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/dev/infra/Dockerfile b/dev/infra/Dockerfile
index 7348c6af1e05..f0ca4a47d698 100644
--- a/dev/infra/Dockerfile
+++ b/dev/infra/Dockerfile
@@ -93,7 +93,7 @@ RUN Rscript -e "devtools::install_version('preferably', version='0.4', repos='ht
 ENV R_LIBS_SITE "/usr/local/lib/R/site-library:${R_LIBS_SITE}:/usr/lib/R/library"
 
 RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.1.3' scipy coverage matplotlib
-RUN python3.9 -m pip install numpy 'pyarrow>=14.0.0' 'six==1.16.0' 'pandas<=2.1.3' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.8.1' coverage matplotlib openpyxl 'memory-profiler==0.60.0' 'scikit-learn>=1.3.2'
+RUN python3.9 -m pip install numpy 'pyarrow>=14.0.0' 'six==1.16.0' 'pandas<=2.1.3' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.8.1' coverage matplotlib openpyxl 'memory-profiler>=0.61.0' 'scikit-learn>=1.3.2'
 
 # Add Python deps for Spark Connect.
 RUN python3.9 -m pip install 'grpcio==1.59.3' 'grpcio-status==1.59.3' 'protobuf==4.25.1' 'googleapis-common-protos==1.56.4'
@@ -110,7 +110,7 @@ RUN apt-get update && apt-get install -y \
 python3.10 python3.10-distutils \
 && rm -rf /var/lib/apt/lists/*
 RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.10
-RUN python3.10 -m pip install numpy 'pyarrow>=14.0.0' 'six==1.16.0' 'pandas<=2.1.3' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.8.1' coverage matplotlib openpyxl 'memory-profiler==0.60.0' 'scikit-learn>=1.3.2'
+RUN python3.10 -m pip install numpy 'pyarrow>=14.0.0' 'six==1.16.0' 'pandas<=2.1.3' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.8.1' coverage matplotlib openpyxl 'memory-profiler>=0.61.0' 'scikit-learn>=1.3.2'
 RUN python3.10 -m pip install 'grpcio==1.59.3' 'grpcio-status==1.59.3' 'protobuf==4.25.1' 'googleapis-common-protos==1.56.4'
 RUN python3.10 -m pip install 'torch<=2.0.1' torchvision --index-url https://download.pytorch.org/whl/cpu
 RUN python3.10 -m pip install torcheval
@@ -122,7 +122,7 @@ RUN apt-get update && apt-get install -y \
 python3.11 python3.11-distutils \
 && rm -rf /var/lib/apt/lists/*
 RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.11
-RUN python3.11 -m pip install numpy 'pyarrow>=14.0.0' 'six==1.16.0' 'pandas<=2.1.3' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.8.1' coverage matplotlib openpyxl 'memory-profiler==0.60.0' 'scikit-learn>=1.3.2'
+RUN python3.11 -m pip install numpy 'pyarrow>=14.0.0' 'six==1.16.0' 'pandas<=2.1.3' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.8.1' coverage matplotlib openpyxl 'memory-profiler>=0.61.0' 'scikit-learn>=1.3.2'
 RUN python3.11 -m pip install 'grpcio==1.59.3' 'grpcio-status==1.59.3' 'protobuf==4.25.1' 'googleapis-common-protos==1.56.4'
 RUN python3.11 -m pip install 'torch<=2.0.1' torchvision --index-url https://download.pytorch.org/whl/cpu
 RUN python3.11 -m pip install torcheval
@@ -134,7 +134,7 @@ RUN apt-get update && apt-get install -y \
 python3.12 python3.12-distutils \
 && rm -rf /var/lib/apt/lists/*
 RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.12
-RUN python3.12 -m pip install numpy 'pyarrow>=14.0.0' 'six==1.16.0' 'pandas<=2.1.3' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.8.1' coverage matplotlib openpyxl 'scikit-learn>=1.3.2'
+RUN python3.12 -m pip install numpy 'pyarrow>=14.0.0' 'six==1.16.0' 'pandas<=2.1.3' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.8.1' coverage matplotlib openpyxl 'memory-profiler>=0.61.0' 'scikit-learn>=1.3.2'
 RUN python3.12 -m pip install 'grpcio==1.59.3' 'grpcio-status==1.59.3' 'protobuf==4.25.1' 'googleapis-common-protos==1.56.4'
 # TODO(SPARK-46078) Use official one instead of nightly build when it's ready
RUN python3.12 -m pip install --pre torch --index-url https://

(spark) branch master updated: [SPARK-46187][SQL] Align codegen and non-codegen implementation of `StringDecode`

2023-12-01 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e93bff6fc0bc [SPARK-46187][SQL] Align codegen and non-codegen implementation of `StringDecode`
e93bff6fc0bc is described below

commit e93bff6fc0bc3a549de02958ffc17b1bca3d50b8
Author: Max Gekk 
AuthorDate: Fri Dec 1 10:42:31 2023 +0100

[SPARK-46187][SQL] Align codegen and non-codegen implementation of `StringDecode`

### What changes were proposed in this pull request?
In this PR, I propose to change the interpreted-mode implementation of `StringDecode`, and consequently of the `decode` function, to make it consistent with codegen. Both implementations now raise the same error with the error class `INVALID_PARAMETER_VALUE.CHARSET`.

### Why are the changes needed?
To make the codegen and non-codegen paths of the `StringDecode` expression consistent, so that users observe the same behaviour in both modes.

### Does this PR introduce _any_ user-facing change?
Yes, if user code depends on the error raised by `decode()`.
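
A hedged illustration (not from the patch) of the new behaviour, using the same query as the added SQL tests; with an unknown charset, `decode()` now fails with `INVALID_PARAMETER_VALUE.CHARSET` in both interpreted and codegen modes:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
try:
    spark.sql("SELECT decode(X'68656c6c6f', 'Windows-xxx')").collect()
except Exception as e:
    print(e)  # expect error class INVALID_PARAMETER_VALUE.CHARSET
```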

### How was this patch tested?
By running the following test suites:
```
$ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z string-functions.sql"
$ build/sbt "core/testOnly *SparkThrowableSuite"
$ build/sbt "test:testOnly *.StringFunctionsSuite"
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44094 from MaxGekk/align-codegen-stringdecode.

Authored-by: Max Gekk 
Signed-off-by: Max Gekk 
---
 .../catalyst/expressions/stringExpressions.scala   | 18 +++++++++++++-----
 .../analyzer-results/ansi/string-functions.sql.out | 15 +++++++++++++++
 .../analyzer-results/string-functions.sql.out      | 15 +++++++++++++++
 .../sql-tests/inputs/string-functions.sql          |  2 ++
 .../results/ansi/string-functions.sql.out          | 34 ++++++++++++++++++++++++++++++++++
 .../sql-tests/results/string-functions.sql.out     | 34 ++++++++++++++++++++++++++++++++++
 6 files changed, 113 insertions(+), 5 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
index 412422f4da4e..84a5eebd70ec 100755
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
@@ -2648,18 +2648,26 @@ case class StringDecode(bin: Expression, charset: Expression)
 
   protected override def nullSafeEval(input1: Any, input2: Any): Any = {
     val fromCharset = input2.asInstanceOf[UTF8String].toString
-    UTF8String.fromString(new String(input1.asInstanceOf[Array[Byte]], fromCharset))
+    try {
+      UTF8String.fromString(new String(input1.asInstanceOf[Array[Byte]], fromCharset))
+    } catch {
+      case _: UnsupportedEncodingException =>
+        throw QueryExecutionErrors.invalidCharsetError(prettyName, fromCharset)
+    }
   }
 
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
-    nullSafeCodeGen(ctx, ev, (bytes, charset) =>
+    nullSafeCodeGen(ctx, ev, (bytes, charset) => {
+      val fromCharset = ctx.freshName("fromCharset")
       s"""
+        String $fromCharset = $charset.toString();
         try {
-          ${ev.value} = UTF8String.fromString(new String($bytes, $charset.toString()));
+          ${ev.value} = UTF8String.fromString(new String($bytes, $fromCharset));
         } catch (java.io.UnsupportedEncodingException e) {
-          org.apache.spark.unsafe.Platform.throwException(e);
+          throw QueryExecutionErrors.invalidCharsetError("$prettyName", $fromCharset);
         }
-      """)
+      """
+    })
   }
 
   override protected def withNewChildrenInternal(
diff --git a/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/string-functions.sql.out b/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/string-functions.sql.out
index 9d8705e3e862..7ace31d5 100644
--- a/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/string-functions.sql.out
+++ b/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/string-functions.sql.out
@@ -799,6 +799,21 @@ Project [decode(null, 6, Spark, null, SQL, 4, rocks, null, .) AS decode(NULL, 6,
 +- OneRowRelation
 
 
 
 
+-- !query
+select decode(X'68656c6c6f', 'Windows-xxx')
+-- !query analysis
+Project [decode(0x68656C6C6F, Windows-xxx) AS decode(X'68656C6C6F', Windows-xxx)#x]
++- OneRowRelation
+
+
+-- !query
+select decode(scol, ecol) from values(X'68656c6c6f', 'Windows-xxx') as t(scol, ecol)
+-- !query analysis
+Project [decode(scol#x, ecol#x) AS decode(scol, ecol)#x]
++- SubqueryAlias t
+   +- LocalRelation [scol#x, ec

(spark) branch master updated: [SPARK-46193][CORE][TESTS] Add `PersistenceEngineBenchmark`

2023-12-01 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0e689611f099 [SPARK-46193][CORE][TESTS] Add `PersistenceEngineBenchmark`
0e689611f099 is described below

commit 0e689611f09968c3a46689294184de29d097302b
Author: Dongjoon Hyun 
AuthorDate: Fri Dec 1 02:52:29 2023 -0800

[SPARK-46193][CORE][TESTS] Add `PersistenceEngineBenchmark`

### What changes were proposed in this pull request?

This PR aims to provide a new benchmark, `PersistenceEngineBenchmark`.

### Why are the changes needed?

This is beneficial for both the developers and the users by providing a 
consistent measurement.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

```
$ build/sbt "core/Test/runMain org.apache.spark.deploy.master.PersistenceEngineBenchmark"
...
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.2
[info] Apple M1 Max
[info] 1000 Workers:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------
[info] ZooKeeperPersistenceEngine                11179          11198          20          0.0    11179348.5       1.0X
[info] FileSystemPersistenceEngine                 416            422           6          0.0      415745.2      26.9X
[info] BlackHolePersistenceEngine                    0              0           0         22.7          44.1  253597.7X
```

```
$ bin/spark-submit --driver-memory 6g --class 
org.apache.spark.deploy.master.PersistenceEngineBenchmark --jars `find 
~/Library/Caches/Coursier/v1 -name 'curator-test-*.jar'` 
core/target/scala-2.13/spark-core_2.13-4.0.0-SNAPSHOT-tests.jar
...
OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.2
Apple M1 Max
1000 Workers:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------
ZooKeeperPersistenceEngine                11565          11857         373          0.0    11564757.8       1.0X
FileSystemPersistenceEngine                 426            426           1          0.0      425605.0      27.2X
BlackHolePersistenceEngine                    0              0           0         27.4          36.5  316478.5X
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44102 from dongjoon-hyun/SPARK-46193.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/benchmark.yml|   2 +-
 .../PersistenceEngineBenchmark-jdk21-results.txt   |  13 +++
 .../PersistenceEngineBenchmark-results.txt |  13 +++
 .../deploy/master/PersistenceEngineBenchmark.scala | 114 +
 4 files changed, 141 insertions(+), 1 deletion(-)

diff --git a/.github/workflows/benchmark.yml b/.github/workflows/benchmark.yml
index 8e7551fa7738..3cb63404bcac 100644
--- a/.github/workflows/benchmark.yml
+++ b/.github/workflows/benchmark.yml
@@ -177,7 +177,7 @@ jobs:
 # In benchmark, we use local as master so set driver memory only. Note that GitHub Actions has 7 GB memory limit.
 bin/spark-submit \
   --driver-memory 6g --class org.apache.spark.benchmark.Benchmarks \
-  --jars "`find . -name '*-SNAPSHOT-tests.jar' -o -name '*avro*-SNAPSHOT.jar' | paste -sd ',' -`" \
+  --jars "`find . -name '*-SNAPSHOT-tests.jar' -o -name '*avro*-SNAPSHOT.jar' | paste -sd ',' -`,`find ~/.cache/coursier -name 'curator-test-*.jar'`" \
   "`find . -name 'spark-core*-SNAPSHOT-tests.jar'`" \
   "${{ github.event.inputs.class }}"
 # To keep the directory structure and file permissions, tar them
diff --git a/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt b/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt
new file mode 100644
index ..3312d6feff88
--- /dev/null
+++ b/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt
@@ -0,0 +1,13 @@
+================================================================================================
+PersistenceEngineBenchmark
+================================================================================================
+
+OpenJDK 64-Bit Server VM 21.0.1+12-LTS on Linux 5.15.0-1051-azure
+AMD EPYC 7763 64-Core Processor
+1000 Workers:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)

(spark) branch master updated: [SPARK-46204][K8S] Fix `dev-run-integration-tests.sh` to use 17 for `JAVA_VERSION`

2023-12-01 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d78126978799 [SPARK-46204][K8S] Fix `dev-run-integration-tests.sh` to use 17 for `JAVA_VERSION`
d78126978799 is described below

commit d78126978799e2d125c20b7db16855fdd628e874
Author: hannahkamundson 
AuthorDate: Fri Dec 1 12:21:25 2023 -0800

[SPARK-46204][K8S] Fix `dev-run-integration-tests.sh` to use 17 for `JAVA_VERSION`

### What changes were proposed in this pull request?
Tie the Kubernetes integration dev test script to Java 17.

### Why are the changes needed?
The Kubernetes dev integration test shell script was tied to Java 8. When it was run as-is (without forcing a Java version >= 14), the build would fail because features that require Java >= 14 could not be built.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
I ran `./dev/dev-run-integration-tests.sh` and it worked.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44112 from hannahkamundson/SPARK-46204.

Authored-by: hannahkamundson 
Signed-off-by: Dongjoon Hyun 
---
 .../kubernetes/integration-tests/dev/dev-run-integration-tests.sh   | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh b/resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh
index f5f93adeddf6..a0834e1ff23f 100755
--- a/resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh
+++ b/resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh
@@ -38,7 +38,7 @@ CONTEXT=
 INCLUDE_TAGS="k8s"
 EXCLUDE_TAGS=
 DEFAULT_EXCLUDE_TAGS="N/A"
-JAVA_VERSION="8"
+JAVA_VERSION="17"
 BUILD_DEPENDENCIES_MVN_FLAG="-am"
 HADOOP_PROFILE="hadoop-3"
 MVN="$TEST_ROOT_DIR/build/mvn"


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-46200][INFRA] re-org the testing dockerfile

2023-12-01 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c85d41c558b6 [SPARK-46200][INFRA] re-org the testing dockerfile
c85d41c558b6 is described below

commit c85d41c558b671f815fca980c614a22ca267c28b
Author: Ruifeng Zheng 
AuthorDate: Fri Dec 1 12:26:32 2023 -0800

[SPARK-46200][INFRA] re-org the testing dockerfile

### What changes were proposed in this pull request?
Re-organize the testing Dockerfile:
1. move R package installation before the Python part;
2. combine pip install commands to make sure there are no conflicts (except for the torch-related packages, since we cannot restrict the `--index-url` to a subset of packages in a single pip command). With separate commands such as:
```
RUN python3.9 -m pip install pkg-a==x
RUN python3.9 -m pip install pkg-b==y
```
the pkg-b installation can potentially break `pkg-a==x` by installing another version; the combined form is sketched below.
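
A minimal sketch of the combined form, using the same hypothetical packages as above, so pip resolves both constraints in one invocation:

```
RUN python3.9 -m pip install pkg-a==x pkg-b==y
```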

### Why are the changes needed?
To make sure there are no dependency conflicts.

### Does this PR introduce _any_ user-facing change?
no, test-only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44107 from zhengruifeng/infra_docker_refactor.

Authored-by: Ruifeng Zheng 
Signed-off-by: Dongjoon Hyun 
---
 dev/infra/Dockerfile | 50 +++---
 1 file changed, 23 insertions(+), 27 deletions(-)

diff --git a/dev/infra/Dockerfile b/dev/infra/Dockerfile
index f0ca4a47d698..3e449bcb6c82 100644
--- a/dev/infra/Dockerfile
+++ b/dev/infra/Dockerfile
@@ -63,16 +63,6 @@ RUN apt-get update && apt-get install -y \
 zlib1g-dev \
 && rm -rf /var/lib/apt/lists/*
 
-RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.9
-
-RUN add-apt-repository ppa:pypy/ppa
-
-RUN mkdir -p /usr/local/pypy/pypy3.8 && \
-curl -sqL https://downloads.python.org/pypy/pypy3.8-v7.3.11-linux64.tar.bz2 | tar xjf - -C /usr/local/pypy/pypy3.8 --strip-components=1 && \
-ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3.8 && \
-ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3
-
-RUN curl -sS https://bootstrap.pypa.io/get-pip.py | pypy3
 
 RUN echo 'deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/' >> /etc/apt/sources.list
 RUN gpg --keyserver hkps://keyserver.ubuntu.com --recv-key E298A3A825C0D65DFD57CBB651716619E084DAB9
@@ -92,17 +82,28 @@ RUN Rscript -e "devtools::install_version('preferably', version='0.4', repos='ht
 # See more in SPARK-39735
 ENV R_LIBS_SITE "/usr/local/lib/R/site-library:${R_LIBS_SITE}:/usr/lib/R/library"
 
+
+RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.9
+
+
+RUN add-apt-repository ppa:pypy/ppa
+RUN mkdir -p /usr/local/pypy/pypy3.8 && \
+curl -sqL https://downloads.python.org/pypy/pypy3.8-v7.3.11-linux64.tar.bz2 | tar xjf - -C /usr/local/pypy/pypy3.8 --strip-components=1 && \
+ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3.8 && \
+ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3
+RUN curl -sS https://bootstrap.pypa.io/get-pip.py | pypy3
 RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.1.3' scipy coverage matplotlib
-RUN python3.9 -m pip install numpy 'pyarrow>=14.0.0' 'six==1.16.0' 'pandas<=2.1.3' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.8.1' coverage matplotlib openpyxl 'memory-profiler>=0.61.0' 'scikit-learn>=1.3.2'
 
-# Add Python deps for Spark Connect.
-RUN python3.9 -m pip install 'grpcio==1.59.3' 'grpcio-status==1.59.3' 'protobuf==4.25.1' 'googleapis-common-protos==1.56.4'
 
-# Add torch as a testing dependency for TorchDistributor
+ARG BASIC_PIP_PKGS="numpy pyarrow>=14.0.0 six==1.16.0 pandas<=2.1.3 scipy unittest-xml-reporting plotly>=4.8 mlflow>=2.8.1 coverage matplotlib openpyxl memory-profiler>=0.61.0 scikit-learn>=1.3.2"
+# Python deps for Spark Connect
+ARG CONNECT_PIP_PKGS="grpcio==1.59.3 grpcio-status==1.59.3 protobuf==4.25.1 googleapis-common-protos==1.56.4"
+
+
+RUN python3.9 -m pip install $BASIC_PIP_PKGS $CONNECT_PIP_PKGS
+# Add torch as a testing dependency for TorchDistributor and DeepspeedTorchDistributor
 RUN python3.9 -m pip install 'torch<=2.0.1' torchvision --index-url https://download.pytorch.org/whl/cpu
-RUN python3.9 -m pip install torcheval
-# Add Deepspeed as a testing dependency for DeepspeedTorchDistributor
-RUN python3.9 -m pip install deepspeed
+RUN python3.9 -m pip install deepspeed torcheval
 
 # Install Python 3.10 at the last stage to avoid breaking Python 3.9
 RUN add-apt-repository ppa:deadsnakes/ppa
@@ -110,11 +111,9 @@ RUN apt-get update && apt-get install -y \
 python3.10 python3.10-distutils \
 && rm -rf /var/lib/apt/lists/*
 RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.10
-RUN python3.10 -m pip install numpy 'pyarrow>=14.0.0' 'six=

(spark) branch master updated: [SPARK-46199][PYTHON][DOCS] Add PyPi link icon to PySpark doc header

2023-12-01 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8060e7e73170 [SPARK-46199][PYTHON][DOCS] Add PyPi link icon to PySpark doc header
8060e7e73170 is described below

commit 8060e7e73170c0122acb2a005f3c54487e226208
Author: panbingkun 
AuthorDate: Fri Dec 1 12:29:29 2023 -0800

[SPARK-46199][PYTHON][DOCS] Add PyPi link icon to PySpark doc header

### What changes were proposed in this pull request?
This PR proposes to introduce a `PyPI link icon` in the header of the PySpark documentation, similar to what is currently implemented in the `pydata-sphinx-theme` docs.

https://pydata-sphinx-theme.readthedocs.io/en/v0.13.3/user_guide/styling.html#
https://github.com/apache/spark/assets/15246973/4bb66f51-96e7-45d5-890b-03a4700f1ec7

### Why are the changes needed?
This change aligns with the practices of other open-source projects like `pydata-sphinx-theme`, facilitating community engagement and collaboration.

### Does this PR introduce _any_ user-facing change?
No API changes, but a `PyPI link icon` will be added to the top right corner of the PySpark docs header. This icon will link to the PySpark PyPI page, as below:
https://github.com/apache/spark/assets/15246973/4c5a70a0-cb30-4615-827d-97610a71d5e4

### How was this patch tested?
Manually tested.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44106 from panbingkun/SPARK-46199.

Authored-by: panbingkun 
Signed-off-by: Dongjoon Hyun 
---
 python/docs/source/conf.py | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/python/docs/source/conf.py b/python/docs/source/conf.py
index 0ac06f802c95..640155cee5a3 100644
--- a/python/docs/source/conf.py
+++ b/python/docs/source/conf.py
@@ -204,7 +204,18 @@ html_theme_options = {
 "image_light": "_static/spark-logo-light.png",
 "image_dark": "_static/spark-logo-dark.png",
 },
-"github_url": "https://github.com/apache/spark";,
+"icon_links": [
+{
+"name": "GitHub",
+"url": "https://github.com/apache/spark";,
+"icon": "fa-brands fa-github",
+},
+{
+"name": "PyPI",
+"url": "https://pypi.org/project/pyspark";,
+"icon": "fa-solid fa-box",
+},
+]
 }
 
 # Add any paths that contain custom themes here, relative to this directory.


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-46201][PYTHON][DOCS] Correct the typing of `schema_of_{csv, json, xml}`

2023-12-01 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a4678143f6bb [SPARK-46201][PYTHON][DOCS] Correct the typing of `schema_of_{csv, json, xml}`
a4678143f6bb is described below

commit a4678143f6bb9fd4356bf12a1db993cdfa22c7bd
Author: Ruifeng Zheng 
AuthorDate: Fri Dec 1 12:28:10 2023 -0800

[SPARK-46201][PYTHON][DOCS] Correct the typing of `schema_of_{csv, json, xml}`

### What changes were proposed in this pull request?
Correct the typing of `schema_of_{csv, json, xml}`

### Why are the changes needed?
Although `ColumnOrName` is defined as `ColumnOrName = Union[Column, str]`,
we should not use it when the string is not a column name.

e.g. 
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.schema_of_csv.html


![image](https://github.com/apache/spark/assets/7322292/97681f11-a360-4bce-8557-2366aa07a0b5)

In this case, we should follow the `schema` parameter of `from_csv`:

![image](https://github.com/apache/spark/assets/7322292/3e05da85-6aba-4d23-b6a6-6a2fb8c9b8fd)
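
A hedged usage sketch (not from the patch): the string argument here is a literal data sample, never a column name, which is why `Union[Column, str]` is the right annotation:

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, schema_of_json

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.range(1)
# both forms pass a JSON sample; neither string refers to a column
df.select(schema_of_json('{"a": 1, "b": [1.5]}')).show(truncate=False)
df.select(schema_of_json(lit('{"a": 1}'))).show(truncate=False)
```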

### Does this PR introduce _any_ user-facing change?
yes, doc-change

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44108 from zhengruifeng/py_doc_schema_of.

Authored-by: Ruifeng Zheng 
Signed-off-by: Dongjoon Hyun 
---
 python/pyspark/sql/functions/builtin.py | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/python/pyspark/sql/functions/builtin.py b/python/pyspark/sql/functions/builtin.py
index 5e5c70322ec9..ac237f10c2e7 100644
--- a/python/pyspark/sql/functions/builtin.py
+++ b/python/pyspark/sql/functions/builtin.py
@@ -13753,7 +13753,7 @@ def to_json(col: "ColumnOrName", options: Optional[Dict[str, str]] = None) -> Co
 
 
 @_try_remote_functions
-def schema_of_json(json: "ColumnOrName", options: Optional[Dict[str, str]] = None) -> Column:
+def schema_of_json(json: Union[Column, str], options: Optional[Dict[str, str]] = None) -> Column:
 """
 Parses a JSON string and infers its schema in DDL format.
 
@@ -13941,7 +13941,7 @@ def from_xml(
 
 
 @_try_remote_functions
-def schema_of_xml(xml: "ColumnOrName", options: Optional[Dict[str, str]] = None) -> Column:
+def schema_of_xml(xml: Union[Column, str], options: Optional[Dict[str, str]] = None) -> Column:
 """
 Parses a XML string and infers its schema in DDL format.
 
@@ -14055,7 +14055,7 @@ def to_xml(col: "ColumnOrName", options: Optional[Dict[str, str]] = None) -> Col
 
 
 @_try_remote_functions
-def schema_of_csv(csv: "ColumnOrName", options: Optional[Dict[str, str]] = None) -> Column:
+def schema_of_csv(csv: Union[Column, str], options: Optional[Dict[str, str]] = None) -> Column:
 """
 Parses a CSV string and infers its schema in DDL format.
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark-docker) branch master updated: [SPARK-46185] Add official image Dockerfile for Apache Spark 3.4.2

2023-12-01 Thread yikun
This is an automated email from the ASF dual-hosted git repository.

yikun pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark-docker.git


The following commit(s) were added to refs/heads/master by this push:
 new ec69b9c  [SPARK-46185] Add official image Dockerfile for Apache Spark 3.4.2
ec69b9c is described below

commit ec69b9c77bc733ed5937f5068d23f7407eb51ea9
Author: Yikun Jiang 
AuthorDate: Sat Dec 2 10:00:48 2023 +0800

[SPARK-46185] Add official image Dockerfile for Apache Spark 3.4.2

### What changes were proposed in this pull request?
Add Apache Spark 3.4.2 Dockerfiles.

- Add 3.4.2 GPG key
- Add .github/workflows/build_3.4.2.yaml
- `./add-dockerfiles.sh 3.4.2` to generate dockerfiles (and remove master changes: https://github.com/apache/spark-docker/pull/55/commits/24cbf40abdc252fdcf48303efa33ba7f84adefaf)
- Add version and tag info

### Why are the changes needed?
Apache Spark 3.4.2 was released.

### Does this PR introduce _any_ user-facing change?
Docker image will be published.

### How was this patch tested?
Added the workflow, and CI passed.

Closes #57 from Yikun/3.4.2.

Authored-by: Yikun Jiang 
Signed-off-by: Yikun Jiang 
---
 .github/workflows/build_3.4.2.yaml |  41 +++
 .github/workflows/publish.yml  |   1 +
 .github/workflows/test.yml |   1 +
 3.4.2/scala2.12-java11-python3-r-ubuntu/Dockerfile |  29 +
 3.4.2/scala2.12-java11-python3-ubuntu/Dockerfile   |  26 +
 3.4.2/scala2.12-java11-r-ubuntu/Dockerfile |  28 +
 3.4.2/scala2.12-java11-ubuntu/Dockerfile   |  79 +
 3.4.2/scala2.12-java11-ubuntu/entrypoint.sh| 126 +
 tools/template.py  |   2 +
 versions.json  |  28 +
 10 files changed, 361 insertions(+)

diff --git a/.github/workflows/build_3.4.2.yaml b/.github/workflows/build_3.4.2.yaml
new file mode 100644
index 000..8ae17d1
--- /dev/null
+++ b/.github/workflows/build_3.4.2.yaml
@@ -0,0 +1,41 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+
+name: "Build and Test (3.4.2)"
+
+on:
+  pull_request:
+branches:
+  - 'master'
+paths:
+  - '3.4.2/**'
+
+jobs:
+  run-build:
+strategy:
+  matrix:
+image-type: ["all", "python", "scala", "r"]
+name: Run
+secrets: inherit
+uses: ./.github/workflows/main.yml
+with:
+  spark: 3.4.2
+  scala: 2.12
+  java: 11
+  image-type: ${{ matrix.image-type }}
diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml
index 879a9c2..ec0d66c 100644
--- a/.github/workflows/publish.yml
+++ b/.github/workflows/publish.yml
@@ -29,6 +29,7 @@ on:
 type: choice
 options:
 - 3.5.0
+- 3.4.2
 - 3.4.1
 - 3.4.0
 - 3.3.3
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
index 689981a..df79364 100644
--- a/.github/workflows/test.yml
+++ b/.github/workflows/test.yml
@@ -29,6 +29,7 @@ on:
 type: choice
 options:
 - 3.5.0
+- 3.4.2
 - 3.4.1
 - 3.4.0
 - 3.3.3
diff --git a/3.4.2/scala2.12-java11-python3-r-ubuntu/Dockerfile b/3.4.2/scala2.12-java11-python3-r-ubuntu/Dockerfile
new file mode 100644
index 000..7c7e96a
--- /dev/null
+++ b/3.4.2/scala2.12-java11-python3-r-ubuntu/Dockerfile
@@ -0,0 +1,29 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the Licen

(spark) branch master updated: [SPARK-46205][CORE] Improve `PersistenceEngine` performance with `KryoSerializer`

2023-12-01 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0d40b1aea758 [SPARK-46205][CORE] Improve `PersistenceEngine` performance with `KryoSerializer`
0d40b1aea758 is described below

commit 0d40b1aea758b95a4416c8653599af8713a4aa16
Author: Dongjoon Hyun 
AuthorDate: Fri Dec 1 18:29:42 2023 -0800

[SPARK-46205][CORE] Improve `PersistenceEngine` performance with `KryoSerializer`

### What changes were proposed in this pull request?

This PR aims to improve `PersistenceEngine` performance with `KryoSerializer` by introducing a new configuration, `spark.deploy.recoverySerializer`.

### Why are the changes needed?

Allow users to choose a serializer that gets better performance in their environment. In particular, `KryoSerializer` is about **3x faster** than `JavaSerializer` with `FileSystemPersistenceEngine`.

```


================================================================================================
PersistenceEngineBenchmark
================================================================================================

OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Linux 5.15.0-1051-azure
AMD EPYC 7763 64-Core Processor
1000 Workers:                                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------
ZooKeeperPersistenceEngine with JavaSerializer             1202           1298         138          0.0     1201614.2       1.0X
ZooKeeperPersistenceEngine with KryoSerializer              951           1004          48          0.0      950559.0       1.3X
FileSystemPersistenceEngine with JavaSerializer             212            217           6          0.0      211623.2       5.7X
FileSystemPersistenceEngine with KryoSerializer              79             81           2          0.0       79132.5      15.2X
BlackHolePersistenceEngine                                    0              0           0         30.9          32.4    37109.8X
```
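
A hedged configuration sketch (not from the patch) opting the standalone Master into the new serializer; the accepted values are assumed to be `JAVA` (the default) and `KRYO`:

```
# conf/spark-defaults.conf on the standalone Master
spark.deploy.recoveryMode        FILESYSTEM
spark.deploy.recoveryDirectory   /var/spark/recovery
spark.deploy.recoverySerializer  KRYO
```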

### Does this PR introduce _any_ user-facing change?

No. The default behavior is the same.

### How was this patch tested?

Pass the CIs with the new added test cases.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44113 from dongjoon-hyun/SPARK-46205.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../PersistenceEngineBenchmark-jdk21-results.txt   | 12 ++-
 .../PersistenceEngineBenchmark-results.txt | 12 ++-
 .../master/FileSystemPersistenceEngine.scala   |  2 +-
 .../org/apache/spark/deploy/master/Master.scala|  7 +--
 .../org/apache/spark/internal/config/Deploy.scala  | 16 +++
 .../apache/spark/deploy/master/MasterSuite.scala   | 22 
 .../deploy/master/PersistenceEngineBenchmark.scala | 24 +++---
 .../deploy/master/PersistenceEngineSuite.scala | 14 -
 8 files changed, 92 insertions(+), 17 deletions(-)

diff --git a/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt b/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt
index 3312d6feff88..65dbfd0990d3 100644
--- a/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt
+++ b/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt
@@ -4,10 +4,12 @@ PersistenceEngineBenchmark
 
 OpenJDK 64-Bit Server VM 21.0.1+12-LTS on Linux 5.15.0-1051-azure
 AMD EPYC 7763 64-Core Processor
-1000 Workers:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------
-ZooKeeperPersistenceEngine                 1183           1266         129          0.0     1183158.2       1.0X
-FileSystemPersistenceEngine                 218            222           4          0.0      218005.2       5.4X
-BlackHolePersistenceEngine                    0              0           0         29.5          34.0   34846.9X
+1000 Workers:                                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+---------------------------------------------------------------------------------------------------------------------------------
+ZooKeeperPersistenceEngine with JavaSerializer             1100           1255         150          0.0     1099532.9       1.0X
+ZooKeeperPersistenceEngine with KryoSerializer              946            967          20          0.0      946367.3       1.2X
+FileSy