(spark) branch master updated: [SPARK-46216][CORE] Improve `FileSystemPersistenceEngine` to support compressions

2023-12-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 3da2e5c6324 [SPARK-46216][CORE] Improve `FileSystemPersistenceEngine` 
to support compressions
3da2e5c6324 is described below

commit 3da2e5c632468ec7cf7001255c1a44197b46ce30
Author: Dongjoon Hyun 
AuthorDate: Sun Dec 3 00:26:16 2023 -0800

[SPARK-46216][CORE] Improve `FileSystemPersistenceEngine` to support 
compressions

### What changes were proposed in this pull request?

This PR aims to improve `FileSystemPersistenceEngine` to support 
compressions via a new configuration, `spark.deploy.recoveryCompressionCodec`.
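
A minimal usage sketch (not from the commit itself; the recovery directory below
is a placeholder): the new codec setting composes with the existing standalone
Master recovery configuration.

```
import org.apache.spark.SparkConf

// Sketch: enable FILESYSTEM recovery on a standalone Master and pick a codec.
val conf = new SparkConf()
  .set("spark.deploy.recoveryMode", "FILESYSTEM")
  .set("spark.deploy.recoveryDirectory", "/tmp/spark-recovery")  // placeholder path
  .set("spark.deploy.recoveryCompressionCodec", "lz4")           // the new config added here
```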

### Why are the changes needed?

To allow users to choose a proper compression codec for their workloads. For
the `JavaSerializer` case, `LZ4` compression is **2x** faster than the baseline
(no compression).
```
OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Linux 5.15.0-1051-azure
AMD EPYC 7763 64-Core Processor
2000 Workers:                                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------------------------------------------------------
ZooKeeperPersistenceEngine with JavaSerializer                     2276            2360         115          0.0     1137909.6       1.0X
ZooKeeperPersistenceEngine with KryoSerializer                     1883            1906          34          0.0      941364.2       1.2X
FileSystemPersistenceEngine with JavaSerializer                     431             436           7          0.0      215436.9       5.3X
FileSystemPersistenceEngine with JavaSerializer (lz4)               209             216           9          0.0      104404.1      10.9X
FileSystemPersistenceEngine with JavaSerializer (lzf)               199             202           2          0.0       99489.5      11.4X
FileSystemPersistenceEngine with JavaSerializer (snappy)            192             199           9          0.0       95872.9      11.9X
FileSystemPersistenceEngine with JavaSerializer (zstd)              258             264           6          0.0      129249.4       8.8X
FileSystemPersistenceEngine with KryoSerializer                     139             151          13          0.0       69374.5      16.4X
FileSystemPersistenceEngine with KryoSerializer (lz4)               159             165           8          0.0       79588.9      14.3X
FileSystemPersistenceEngine with KryoSerializer (lzf)               180             195          18          0.0       89844.0      12.7X
FileSystemPersistenceEngine with KryoSerializer (snappy)            164             183          18          0.0       82016.0      13.9X
FileSystemPersistenceEngine with KryoSerializer (zstd)              206             218          11          0.0      102838.9      11.1X
BlackHolePersistenceEngine                                            0               0           0         35.1          28.5    39908.5X
```

### Does this PR introduce _any_ user-facing change?

No, this is a new feature.

### How was this patch tested?

Pass the CIs with the newly added test case.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44129 from dongjoon-hyun/SPARK-46216.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../PersistenceEngineBenchmark-jdk21-results.txt   | 22 
 .../PersistenceEngineBenchmark-results.txt | 22 
 .../master/FileSystemPersistenceEngine.scala   | 10 --
 .../spark/deploy/master/RecoveryModeFactory.scala  |  6 ++--
 .../org/apache/spark/internal/config/Deploy.scala  |  7 
 .../apache/spark/deploy/master/MasterSuite.scala   | 40 ++
 .../deploy/master/PersistenceEngineBenchmark.scala | 19 --
 .../deploy/master/PersistenceEngineSuite.scala | 13 +++
 8 files changed, 118 insertions(+), 21 deletions(-)

diff --git a/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt 
b/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt
index 65dbfd0990d..38e74ed6b53 100644
--- a/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt
+++ b/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt
@@ -4,12 +4,20 @@ PersistenceEngineBenchmark
 
 OpenJDK 64-Bit Server VM 21.0.1+12-LTS on Linux 5.15.0-1051-azure
 AMD EPYC 7763 64-Core Processor
-1000 Workers:                                            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
---

(spark) branch master updated: [SPARK-45093][CONNECT][PYTHON] Properly support error handling and conversion for AddArtifactHandler

2023-12-03 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 400573d33af [SPARK-45093][CONNECT][PYTHON] Properly support error 
handling and conversion for AddArtifactHandler
400573d33af is described below

commit 400573d33af39d61748dd0b2960c47161277f85e
Author: Martin Grund 
AuthorDate: Mon Dec 4 08:22:13 2023 +0900

[SPARK-45093][CONNECT][PYTHON] Properly support error handling and 
conversion for AddArtifactHandler

### What changes were proposed in this pull request?
This patch improves error handling for errors that occur in the `AddArtifact`
path. Previously, the `AddArtifactHandler` did not properly convert exceptions,
so all exceptions ended up surfacing as `UNKNOWN` errors. This patch makes sure
we wrap errors in the add-artifact path the same way we wrap errors in the
normal query execution path.

In addition, it adds tests and verification for this behavior in Scala and 
in Python.
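
For context, here is a hedged sketch of the pattern (names are illustrative, not
Spark's actual `ErrorUtils` signature): a helper that returns a
`PartialFunction[Throwable, Unit]` can be used directly as a `catch` handler,
which is how the handler in the diff below routes failures through the common
error-conversion path.

```
import scala.util.control.NonFatal

// Illustrative only: a catch-handler factory in the spirit of ErrorUtils.handleError.
object SketchErrorUtils {
  def handleError(opType: String, cleanup: Option[() => Unit]): PartialFunction[Throwable, Unit] = {
    case NonFatal(e) =>
      // The real helper converts the error into a proper gRPC status for the observer.
      println(s"$opType failed: ${e.getMessage}")
      cleanup.foreach(_.apply())
  }
}

object HandlerSketch {
  def onNext(request: String): Unit = try {
    require(request.nonEmpty, "empty request")
  } catch {
    SketchErrorUtils.handleError("addArtifacts.onNext", Some(() => println("cleaning up")))
  }
}
```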

### Why are the changes needed?
Stability

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT for Scala and Python

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44092 from grundprinzip/SPARK-ADD_ARTIFACT_CRAP.

Authored-by: Martin Grund 
Signed-off-by: Hyukjin Kwon 
---
 .../service/SparkConnectAddArtifactsHandler.scala  | 26 +++---
 .../spark/sql/connect/utils/ErrorUtils.scala   | 25 +++--
 .../connect/service/AddArtifactsHandlerSuite.scala | 20 +++--
 python/pyspark/sql/connect/client/core.py  | 15 -
 .../sql/tests/connect/client/test_artifact.py  | 20 +
 5 files changed, 93 insertions(+), 13 deletions(-)

diff --git 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectAddArtifactsHandler.scala
 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectAddArtifactsHandler.scala
index e664e07dce1..ea3b578be3b 100644
--- 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectAddArtifactsHandler.scala
+++ 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectAddArtifactsHandler.scala
@@ -24,7 +24,6 @@ import scala.collection.mutable
 import scala.util.control.NonFatal
 
 import com.google.common.io.CountingOutputStream
-import io.grpc.StatusRuntimeException
 import io.grpc.stub.StreamObserver
 
 import org.apache.spark.connect.proto
@@ -32,6 +31,7 @@ import org.apache.spark.connect.proto.{AddArtifactsRequest, 
AddArtifactsResponse
 import org.apache.spark.connect.proto.AddArtifactsResponse.ArtifactSummary
 import org.apache.spark.sql.artifact.ArtifactManager
 import org.apache.spark.sql.artifact.util.ArtifactUtils
+import org.apache.spark.sql.connect.utils.ErrorUtils
 import org.apache.spark.util.Utils
 
 /**
@@ -51,7 +51,7 @@ class SparkConnectAddArtifactsHandler(val responseObserver: 
StreamObserver[AddAr
   private var chunkedArtifact: StagedChunkedArtifact = _
   private var holder: SessionHolder = _
 
-  override def onNext(req: AddArtifactsRequest): Unit = {
+  override def onNext(req: AddArtifactsRequest): Unit = try {
 if (this.holder == null) {
   this.holder = SparkConnectService.getOrCreateIsolatedSession(
 req.getUserContext.getUserId,
@@ -78,6 +78,17 @@ class SparkConnectAddArtifactsHandler(val responseObserver: 
StreamObserver[AddAr
 } else {
   throw new UnsupportedOperationException(s"Unsupported data transfer 
request: $req")
 }
+  } catch {
+ErrorUtils.handleError(
+  "addArtifacts.onNext",
+  responseObserver,
+  holder.userId,
+  holder.sessionId,
+  None,
+  false,
+  Some(() => {
+cleanUpStagedArtifacts()
+  }))
   }
 
   override def onError(throwable: Throwable): Unit = {
@@ -128,7 +139,16 @@ class SparkConnectAddArtifactsHandler(val 
responseObserver: StreamObserver[AddAr
   responseObserver.onNext(builder.build())
   responseObserver.onCompleted()
 } catch {
-  case e: StatusRuntimeException => onError(e)
+  ErrorUtils.handleError(
+"addArtifacts.onComplete",
+responseObserver,
+holder.userId,
+holder.sessionId,
+None,
+false,
+Some(() => {
+  cleanUpStagedArtifacts()
+}))
 }
   }
 
diff --git 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/utils/ErrorUtils.scala
 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/utils/ErrorUtils.scala
index 7cb555ca47e..703b11c0c73 100644
--- 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/utils/Er

(spark) branch master updated: [SPARK-46221][PS][DOCS] Change `to_spark_io` to `spark.to_spark_io` in `quickstart_ps.ipynb`

2023-12-03 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 349037567a6 [SPARK-46221][PS][DOCS] Change `to_spark_io` to 
`spark.to_spark_io` in `quickstart_ps.ipynb`
349037567a6 is described below

commit 349037567a67dd5acb71f6e22e6e26fa304ba035
Author: Bjørn Jørgensen 
AuthorDate: Mon Dec 4 08:28:18 2023 +0900

[SPARK-46221][PS][DOCS] Change `to_spark_io` to `spark.to_spark_io` in 
`quickstart_ps.ipynb`

### What changes were proposed in this pull request?
Change `to_spark_io` to `spark.to_spark_io`  in `quickstart_ps.ipynb`

### Why are the changes needed?
`to_spark_io` was changed to `spark.to_spark_io` in Spark 4.0.
To have user guides that work :)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

![image](https://github.com/apache/spark/assets/47577197/76f7067a-eeb4-4acc-88f1-3af94f5e78c7)

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44135 from bjornjorgensen/spark.to_spark_io.

Authored-by: Bjørn Jørgensen 
Signed-off-by: Hyukjin Kwon 
---
 python/docs/source/getting_started/quickstart_ps.ipynb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/docs/source/getting_started/quickstart_ps.ipynb 
b/python/docs/source/getting_started/quickstart_ps.ipynb
index 2b6b3f8142c..efd2753dcdc 100644
--- a/python/docs/source/getting_started/quickstart_ps.ipynb
+++ b/python/docs/source/getting_started/quickstart_ps.ipynb
@@ -14457,7 +14457,7 @@
 }
],
"source": [
-"psdf.to_spark_io('zoo.orc', format=\"orc\")\n",
+"psdf.spark.to_spark_io('zoo.orc', format=\"orc\")\n",
 "ps.read_spark_io('zoo.orc', format=\"orc\").head(10)"
]
   },


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-46217][CORE][TESTS] Include `Driver/App` data in `PersistenceEngineBenchmark`

2023-12-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 5031e52f9e0 [SPARK-46217][CORE][TESTS] Include `Driver/App` data in 
`PersistenceEngineBenchmark`
5031e52f9e0 is described below

commit 5031e52f9e032e8e450af9fcd294f5b53e2c4cfd
Author: Dongjoon Hyun 
AuthorDate: Sun Dec 3 15:50:09 2023 -0800

[SPARK-46217][CORE][TESTS] Include `Driver/App` data in 
`PersistenceEngineBenchmark`

### What changes were proposed in this pull request?

This PR aims to include `DriverInfo` and `ApplicationInfo` data in
`PersistenceEngineBenchmark`.

### Why are the changes needed?

`PersistenceEngine` recovers three kinds of information, but
`PersistenceEngineBenchmark` previously focused on `WorkerInfo` only. This PR
adds the other two kinds to make the benchmark more complete.


https://github.com/apache/spark/blob/3da2e5c632468ec7cf7001255c1a44197b46ce30/core/src/main/scala/org/apache/spark/deploy/master/PersistenceEngine.scala#L56-L78

### Does this PR introduce _any_ user-facing change?

No. This is a test improvement.

### How was this patch tested?

Manual tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44130 from dongjoon-hyun/SPARK-46217.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../PersistenceEngineBenchmark-jdk21-results.txt   | 28 +-
 .../PersistenceEngineBenchmark-results.txt | 28 +-
 .../deploy/master/PersistenceEngineBenchmark.scala | 65 +-
 3 files changed, 80 insertions(+), 41 deletions(-)

diff --git a/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt 
b/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt
index 38e74ed6b53..314fb6958b6 100644
--- a/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt
+++ b/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt
@@ -4,20 +4,20 @@ PersistenceEngineBenchmark
 
 OpenJDK 64-Bit Server VM 21.0.1+12-LTS on Linux 5.15.0-1051-azure
 AMD EPYC 7763 64-Core Processor
-2000 Workers:                                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+1000 Workers:                                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative


-ZooKeeperPersistenceEngine with JavaSerializer                     2254            2329         119          0.0     1126867.1       1.0X
-ZooKeeperPersistenceEngine with KryoSerializer                     1911            1912           1          0.0      955667.1       1.2X
-FileSystemPersistenceEngine with JavaSerializer                     438             448          15          0.0      218868.1       5.1X
-FileSystemPersistenceEngine with JavaSerializer (lz4)               187             195           8          0.0       93337.8      12.1X
-FileSystemPersistenceEngine with JavaSerializer (lzf)               193             216          20          0.0       96678.8      11.7X
-FileSystemPersistenceEngine with JavaSerializer (snappy)            175             183          10          0.0       87652.3      12.9X
-FileSystemPersistenceEngine with JavaSerializer (zstd)              243             255          14          0.0      121695.2       9.3X
-FileSystemPersistenceEngine with KryoSerializer                     150             160          15          0.0       75089.7      15.0X
-FileSystemPersistenceEngine with KryoSerializer (lz4)               170             177          10          0.0       84996.7      13.3X
-FileSystemPersistenceEngine with KryoSerializer (lzf)               192             203          12          0.0       96019.1      11.7X
-FileSystemPersistenceEngine with KryoSerializer (snappy)            184             202          16          0.0       92241.3      12.2X
-FileSystemPersistenceEngine with KryoSerializer (zstd)              232             238           5          0.0      116075.2       9.7X
-BlackHolePersistenceEngine                                            0               0           0         27.3          36.6    30761.0X
+ZooKeeperPersistenceEngine with JavaSerializer                     5402            5546         233          0.0     5402030.8       1.0X
+ZooKeeperPersistenceEngine with KryoSerializer                     4185            4220          32          0.0     4184623.1       1.3X
+FileSystemPersistenceEngine with JavaSerializer                    1591            1634          37          0.0     1590836.4       3.4X
+FileSystemPersistenceEngine with JavaSerializer (lz4) 

(spark) branch master updated: [SPARK-40559][PYTHON] Add applyInArrow to groupBy and cogroup

2023-12-03 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 393c69a515b [SPARK-40559][PYTHON] Add applyInArrow to groupBy and 
cogroup
393c69a515b is described below

commit 393c69a515b4acb6ea0659c0a7f09ae487801c40
Author: Enrico Minack 
AuthorDate: Mon Dec 4 10:21:26 2023 +0900

[SPARK-40559][PYTHON] Add applyInArrow to groupBy and cogroup

### What changes were proposed in this pull request?
Add `applyInArrow` method to PySpark `groupBy` and `groupBy.cogroup` to 
allow for user functions that work on Arrow. Similar to existing `mapInArrow`.

### Why are the changes needed?
PySpark allows transforming a `DataFrame` via the Pandas and Arrow APIs:
```
df.mapInArrow(map_arrow, schema="...")
df.mapInPandas(map_pandas, schema="...")
```

For `df.groupBy(...)` and `df.groupBy(...).cogroup(...)`, there is only a 
Pandas interface, no Arrow interface:
```
df.groupBy("id").applyInPandas(apply_pandas, schema="...")
```

Providing a pure Arrow interface allows user code to use **any** 
Arrow-based data framework, not only Pandas, e.g. Polars:
```
def apply_polars(df: polars.DataFrame) -> polars.DataFrame:
  return df

def apply_arrow(table: pyarrow.Table) -> pyarrow.Table:
  df = polars.from_arrow(table)
  return apply_polars(df).to_arrow()

df.groupBy("id").applyInArrow(apply_arrow, schema="...")
```

### Does this PR introduce _any_ user-facing change?
Yes. This adds the method `applyInArrow` to PySpark `groupBy` and `groupBy.cogroup`.

### How was this patch tested?
Tested with unit tests.

Closes #38624 from EnricoMi/branch-pyspark-grouped-apply-in-arrow.

Authored-by: Enrico Minack 
Signed-off-by: Hyukjin Kwon 
---
 .../org/apache/spark/api/python/PythonRunner.scala |   4 +
 dev/sparktestsupport/modules.py|   2 +
 python/pyspark/errors/error_classes.py |  15 ++
 python/pyspark/rdd.py  |   4 +
 python/pyspark/sql/pandas/_typing/__init__.pyi |  11 +
 python/pyspark/sql/pandas/functions.py |  22 ++
 python/pyspark/sql/pandas/group_ops.py | 230 +++-
 python/pyspark/sql/pandas/serializers.py   |  81 +-
 python/pyspark/sql/tests/arrow/__init__.py |  16 ++
 .../sql/tests/arrow/test_arrow_cogrouped_map.py| 300 +
 .../sql/tests/arrow/test_arrow_grouped_map.py  | 291 
 python/pyspark/sql/udf.py  |  40 +++
 python/pyspark/worker.py   | 207 +-
 .../plans/logical/pythonLogicalOperators.scala |  48 
 .../spark/sql/RelationalGroupedDataset.scala   |  82 +-
 .../spark/sql/execution/SparkStrategies.scala  |   6 +
 .../python/FlatMapCoGroupsInArrowExec.scala|  58 
 .../python/FlatMapCoGroupsInPandasExec.scala   |  62 +
 ...xec.scala => FlatMapCoGroupsInPythonExec.scala} |  43 +--
 .../python/FlatMapGroupsInArrowExec.scala  |  64 +
 .../python/FlatMapGroupsInPandasExec.scala |  60 +
 ...sExec.scala => FlatMapGroupsInPythonExec.scala} |  54 ++--
 22 files changed, 1513 insertions(+), 187 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala 
b/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala
index e6d5a750ea3..148f80540d9 100644
--- a/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala
+++ b/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala
@@ -57,6 +57,8 @@ private[spark] object PythonEvalType {
   val SQL_COGROUPED_MAP_PANDAS_UDF = 206
   val SQL_MAP_ARROW_ITER_UDF = 207
   val SQL_GROUPED_MAP_PANDAS_UDF_WITH_STATE = 208
+  val SQL_GROUPED_MAP_ARROW_UDF = 209
+  val SQL_COGROUPED_MAP_ARROW_UDF = 210
 
   val SQL_TABLE_UDF = 300
   val SQL_ARROW_TABLE_UDF = 301
@@ -74,6 +76,8 @@ private[spark] object PythonEvalType {
 case SQL_COGROUPED_MAP_PANDAS_UDF => "SQL_COGROUPED_MAP_PANDAS_UDF"
 case SQL_MAP_ARROW_ITER_UDF => "SQL_MAP_ARROW_ITER_UDF"
 case SQL_GROUPED_MAP_PANDAS_UDF_WITH_STATE => 
"SQL_GROUPED_MAP_PANDAS_UDF_WITH_STATE"
+case SQL_GROUPED_MAP_ARROW_UDF => "SQL_GROUPED_MAP_ARROW_UDF"
+case SQL_COGROUPED_MAP_ARROW_UDF => "SQL_COGROUPED_MAP_ARROW_UDF"
 case SQL_TABLE_UDF => "SQL_TABLE_UDF"
 case SQL_ARROW_TABLE_UDF => "SQL_ARROW_TABLE_UDF"
   }
diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py
index feb49062316..718a2509741 100644
--- a/dev/sparktestsupport/modules.py
+++ b/dev/sparktestsupport/modules.py
@@ -491,6 +491,8 @@ pyspark_sql = Module(
 "pyspark.sql.pandas.utils",
 "pyspark.sql.observation",
 # unittests
+"pyspark.sql.tests.

(spark) branch master updated: [SPARK-46222][PYTHON][TESTS] Test invalid error class (pyspark.errors.utils)

2023-12-03 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6ddd0081989 [SPARK-46222][PYTHON][TESTS] Test invalid error class 
(pyspark.errors.utils)
6ddd0081989 is described below

commit 6ddd0081989eab9e27f84f006f5d7d76359f17bc
Author: Hyukjin Kwon 
AuthorDate: Mon Dec 4 10:31:11 2023 +0900

[SPARK-46222][PYTHON][TESTS] Test invalid error class (pyspark.errors.utils)

### What changes were proposed in this pull request?

This PR adds a test for invalid error classes

### Why are the changes needed?

![Screenshot 2023-12-04 at 8 50 37 
AM](https://github.com/apache/spark/assets/6477701/47e80d92-3f28-4fb3-a2aa-8745f7fb40ec)


https://app.codecov.io/gh/apache/spark/blob/master/python%2Fpyspark%2Ferrors%2Futils.py

This is not being tested.

### Does this PR introduce _any_ user-facing change?

No, test-only

### How was this patch tested?

Manually ran the new unittest.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44136 from HyukjinKwon/SPARK-46222.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/errors/tests/test_errors.py | 5 +
 1 file changed, 5 insertions(+)

diff --git a/python/pyspark/errors/tests/test_errors.py 
b/python/pyspark/errors/tests/test_errors.py
index 4e743bfb9a0..e191d7eff38 100644
--- a/python/pyspark/errors/tests/test_errors.py
+++ b/python/pyspark/errors/tests/test_errors.py
@@ -19,6 +19,7 @@
 import json
 import unittest
 
+from pyspark.errors import PySparkValueError
 from pyspark.errors.error_classes import ERROR_CLASSES_JSON
 from pyspark.errors.utils import ErrorClassesReader
 
@@ -46,6 +47,10 @@ class ErrorsTest(unittest.TestCase):
 
 json.loads(ERROR_CLASSES_JSON, object_pairs_hook=detect_duplication)
 
+def test_invalid_error_class(self):
+with self.assertRaisesRegex(ValueError, "Cannot find main error 
class"):
+PySparkValueError(error_class="invalid", message_parameters={})
+
 
 if __name__ == "__main__":
 import unittest


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-46220][SQL] Restrict charsets in `decode()`

2023-12-03 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 52b94a6f6a9 [SPARK-46220][SQL] Restrict charsets in `decode()`
52b94a6f6a9 is described below

commit 52b94a6f6a9335ebfaee1456f78d1943c24694d7
Author: Max Gekk 
AuthorDate: Mon Dec 4 10:42:40 2023 +0900

[SPARK-46220][SQL] Restrict charsets in `decode()`

### What changes were proposed in this pull request?
In the PR, I propose to restrict the supported charsets in the `decode()`
function to the list from
[the doc](https://spark.apache.org/docs/latest/api/sql/#decode):
```
'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'
```
and use the existing SQL config `spark.sql.legacy.javaCharsets` for 
restoring the previous behaviour.
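
A hedged sketch of the user-visible effect (assumes an active `SparkSession`
named `spark`; the exact error raised for an unsupported charset is defined by
the patch, not by this sketch):

```
// Documented charsets keep working as before.
spark.sql("SELECT decode(encode('abc', 'UTF-8'), 'UTF-8')").show()

// JDK-specific charsets are rejected by default after this change; the legacy
// flag restores the old JDK-dependent behaviour (whether a given name resolves
// then still depends on the JDK).
spark.conf.set("spark.sql.legacy.javaCharsets", "true")
spark.sql("SELECT decode(encode('abc', 'CP1252'), 'CP1252')").show()
```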

### Why are the changes needed?
Currently the list of supported charsets in `decode()` is not stable and
depends entirely on the JDK version in use. As a result, user code can break
simply because an operator changed the Java version of a Spark cluster.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
By running new checks:
```
$ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly 
org.apache.spark.sql.SQLQueryTestSuite -- -z string-functions.sql"
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44131 from MaxGekk/restrict-charsets-in-stringdecode.

Authored-by: Max Gekk 
Signed-off-by: Hyukjin Kwon 
---
 .../explain-results/function_decode.explain|  2 +-
 docs/sql-migration-guide.md|  2 +-
 .../catalyst/expressions/stringExpressions.scala   | 25 +++-
 .../analyzer-results/ansi/string-functions.sql.out | 42 ++
 .../analyzer-results/string-functions.sql.out  | 42 ++
 .../sql-tests/inputs/string-functions.sql  |  6 ++
 .../results/ansi/string-functions.sql.out  | 66 ++
 .../sql-tests/results/string-functions.sql.out | 66 ++
 8 files changed, 246 insertions(+), 5 deletions(-)

diff --git 
a/connector/connect/common/src/test/resources/query-tests/explain-results/function_decode.explain
 
b/connector/connect/common/src/test/resources/query-tests/explain-results/function_decode.explain
index 3b8e1eea576..165be9b9e12 100644
--- 
a/connector/connect/common/src/test/resources/query-tests/explain-results/function_decode.explain
+++ 
b/connector/connect/common/src/test/resources/query-tests/explain-results/function_decode.explain
@@ -1,2 +1,2 @@
-Project [decode(cast(g#0 as binary), UTF-8) AS decode(g, UTF-8)#0]
+Project [decode(cast(g#0 as binary), UTF-8, false) AS decode(g, UTF-8)#0]
 +- LocalRelation , [id#0L, a#0, b#0, d#0, e#0, f#0, g#0]
diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index cb4f59323c3..9f9c15521c6 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -29,7 +29,7 @@ license: |
 - Since Spark 4.0, `spark.sql.hive.metastore` drops the support of Hive prior 
to 2.0.0 as they require JDK 8 that Spark does not support anymore. Users 
should migrate to higher versions.
 - Since Spark 4.0, `spark.sql.parquet.compression.codec` drops the support of 
codec name `lz4raw`, please use `lz4_raw` instead.
 - Since Spark 4.0, when overflowing during casting timestamp to byte/short/int 
under non-ansi mode, Spark will return null instead a wrapping value.
-- Since Spark 4.0, the `encode()` function supports only the following 
charsets 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'. 
To restore the previous behavior when the function accepts charsets of the 
current JDK used by Spark, set `spark.sql.legacy.javaCharsets` to `true`.
+- Since Spark 4.0, the `encode()` and `decode()` functions support only the 
following charsets 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 
'UTF-16'. To restore the previous behavior when the function accepts charsets 
of the current JDK used by Spark, set `spark.sql.legacy.javaCharsets` to `true`.
 
 ## Upgrading from Spark SQL 3.4 to 3.5
 
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
index 84a5eebd70e..7c5d65d2b95 100755
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
@@ -2638,18 +2638,26 @@ case class Decode(params: Seq[Expression], replacement: 
Expression)
   since = "1.5.0",
   group = "string_funcs")
 // scalastyle:on line.size.limit
-case class StringDecode(bin: Expression, charset: Expression)
+

(spark) branch master updated: [SPARK-46048][PYTHON][CONNECT][FOLLOWUP] Correct the string representation

2023-12-03 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0f3c58aa9e7d [SPARK-46048][PYTHON][CONNECT][FOLLOWUP] Correct the 
string representation
0f3c58aa9e7d is described below

commit 0f3c58aa9e7dd7966e07a391767bdba294a97c9e
Author: Ruifeng Zheng 
AuthorDate: Mon Dec 4 11:02:28 2023 +0900

[SPARK-46048][PYTHON][CONNECT][FOLLOWUP] Correct the string representation

### What changes were proposed in this pull request?
Correct the string representation

### Why are the changes needed?
to fix a minor issue

### Does this PR introduce _any_ user-facing change?
yes

before:
```
In [1]: spark.range(10).groupingSets([])
Out[1]: GroupedData[grouping expressions: [], value: [id: bigint], type: 
Pivot]
```

after:
```
In [2]: spark.range(10).groupingSets([])
Out[2]: GroupedData[grouping expressions: [], value: [id: bigint], type: 
GroupingSets]
```

### How was this patch tested?
Manually tested; it is trivial, so I didn't add a doctest.

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44140 from zhengruifeng/connect_grouping_str.

Authored-by: Ruifeng Zheng 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/connect/group.py | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/python/pyspark/sql/connect/group.py 
b/python/pyspark/sql/connect/group.py
index 610ef036bc5e..bb963c910e2f 100644
--- a/python/pyspark/sql/connect/group.py
+++ b/python/pyspark/sql/connect/group.py
@@ -109,6 +109,8 @@ class GroupedData:
 type_str = "RollUp"
 elif self._group_type == "cube":
 type_str = "Cube"
+elif self._group_type == "grouping_sets":
+type_str = "GroupingSets"
 else:
 type_str = "Pivot"
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-46223][PS] Test SparkPandasNotImplementedError with cleaning up unused code

2023-12-03 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 91a43c6db185 [SPARK-46223][PS] Test SparkPandasNotImplementedError 
with cleaning up unused code
91a43c6db185 is described below

commit 91a43c6db18595f481958c3e1c67c9724a561aca
Author: Hyukjin Kwon 
AuthorDate: Mon Dec 4 11:03:08 2023 +0900

[SPARK-46223][PS] Test SparkPandasNotImplementedError with cleaning up 
unused code

### What changes were proposed in this pull request?

This PR adds a test for `SparkPandasNotImplementedError` and cleans up some
unused code.

### Why are the changes needed?

![Screenshot 2023-12-04 at 9 07 36 
AM](https://github.com/apache/spark/assets/6477701/87f7dcb5-b448-4841-ab6a-dd92eb07a795)

This is not being tested 
(https://app.codecov.io/gh/apache/spark/blob/master/python%2Fpyspark%2Fpandas%2Fexceptions.py).

### Does this PR introduce _any_ user-facing change?

No, test-only

### How was this patch tested?

Manually ran the new unittest.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44137 from HyukjinKwon/SPARK-46223.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/pandas/exceptions.py  | 24 +++-
 python/pyspark/pandas/tests/test_indexing.py |  8 +++-
 2 files changed, 14 insertions(+), 18 deletions(-)

diff --git a/python/pyspark/pandas/exceptions.py 
b/python/pyspark/pandas/exceptions.py
index d93f0bf0b68e..8644d0306bf0 100644
--- a/python/pyspark/pandas/exceptions.py
+++ b/python/pyspark/pandas/exceptions.py
@@ -29,28 +29,18 @@ class SparkPandasIndexingError(Exception):
 pass
 
 
-def code_change_hint(pandas_function: Optional[str], spark_target_function: 
Optional[str]) -> str:
-if pandas_function is not None and spark_target_function is not None:
-return "You are trying to use pandas function {}, use spark function 
{}".format(
-pandas_function, spark_target_function
-)
-elif pandas_function is not None and spark_target_function is None:
-return (
-"You are trying to use pandas function {}, checkout the spark "
-"user guide to find a relevant function"
-).format(pandas_function)
-elif pandas_function is None and spark_target_function is not None:
-return "Use spark function {}".format(spark_target_function)
-else:  # both none
-return "Checkout the spark user guide to find a relevant function"
+def code_change_hint(pandas_function: str, spark_target_function: str) -> str:
+return "You are trying to use pandas function {}, use spark function 
{}".format(
+pandas_function, spark_target_function
+)
 
 
 class SparkPandasNotImplementedError(NotImplementedError):
 def __init__(
 self,
-pandas_function: Optional[str] = None,
-spark_target_function: Optional[str] = None,
-description: str = "",
+pandas_function: str,
+spark_target_function: str,
+description: str,
 ):
 self.pandas_source = pandas_function
 self.spark_target = spark_target_function
diff --git a/python/pyspark/pandas/tests/test_indexing.py 
b/python/pyspark/pandas/tests/test_indexing.py
index 590bef16bb22..a4ca03005b33 100644
--- a/python/pyspark/pandas/tests/test_indexing.py
+++ b/python/pyspark/pandas/tests/test_indexing.py
@@ -22,7 +22,7 @@ import numpy as np
 import pandas as pd
 
 from pyspark import pandas as ps
-from pyspark.pandas.exceptions import SparkPandasIndexingError
+from pyspark.pandas.exceptions import SparkPandasIndexingError, 
SparkPandasNotImplementedError
 from pyspark.testing.pandasutils import ComparisonTestBase, compare_both
 
 
@@ -902,6 +902,12 @@ class IndexingTest(ComparisonTestBase):
 self.assert_eq(psdf.iloc[:-1, indexer], pdf.iloc[:-1, indexer])
 # self.assert_eq(psdf.iloc[psdf.index == 2, indexer], 
pdf.iloc[pdf.index == 2, indexer])
 
+self.assertRaisesRegex(
+SparkPandasNotImplementedError,
+".iloc requires numeric slice, conditional boolean",
+lambda: ps.range(10).iloc["a", :],
+)
+
 def test_iloc_multiindex_columns(self):
 arrays = [np.array(["bar", "bar", "baz", "baz"]), np.array(["one", 
"two", "one", "two"])]
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-40559][PYTHON][FOLLOW-UP] Fix linter for `getfullargspec`

2023-12-03 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c6ad0b0825d7 [SPARK-40559][PYTHON][FOLLOW-UP] Fix linter for 
`getfullargspec`
c6ad0b0825d7 is described below

commit c6ad0b0825d7174a9b24b13156a0fa7c59056158
Author: Hyukjin Kwon 
AuthorDate: Mon Dec 4 11:15:39 2023 +0900

[SPARK-40559][PYTHON][FOLLOW-UP] Fix linter for `getfullargspec`

### What changes were proposed in this pull request?

This PR proposes to use `inspect.getfullargspec` instead of unimported 
`getfullargspec`. This PR is a followup of 
https://github.com/apache/spark/pull/38624.

### Why are the changes needed?

To recover the CI.

It fails as below:

```
./python/pyspark/worker.py:749:19: F821 undefined name 'getfullargspec'
argspec = getfullargspec(chained_func)  # signature was lost when 
wrapping it
  ^
./python/pyspark/worker.py:757:19: F821 undefined name 'getfullargspec'
argspec = getfullargspec(chained_func)  # signature was lost when 
wrapping it
```

https://github.com/apache/spark/actions/runs/7080907452/job/19269484124

It was caused by the logical conflict w/ 
https://github.com/apache/spark/commit/f5e4e84ce3a7f65407d07cfc3eed2f51837527c1

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested via `linter-python`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44141 from HyukjinKwon/SPARK-40559-followup2.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/worker.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/worker.py b/python/pyspark/worker.py
index 2534238b43cb..158e3ae62bb7 100644
--- a/python/pyspark/worker.py
+++ b/python/pyspark/worker.py
@@ -746,7 +746,7 @@ def read_single_udf(pickleSer, infile, eval_type, 
runner_conf, udf_index):
 argspec = inspect.getfullargspec(chained_func)  # signature was lost 
when wrapping it
 return args_offsets, wrap_grouped_map_pandas_udf(func, return_type, 
argspec, runner_conf)
 elif eval_type == PythonEvalType.SQL_GROUPED_MAP_ARROW_UDF:
-argspec = getfullargspec(chained_func)  # signature was lost when 
wrapping it
+argspec = inspect.getfullargspec(chained_func)  # signature was lost 
when wrapping it
 return args_offsets, wrap_grouped_map_arrow_udf(func, return_type, 
argspec, runner_conf)
 elif eval_type == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF_WITH_STATE:
 return args_offsets, wrap_grouped_map_pandas_udf_with_state(func, 
return_type)
@@ -754,7 +754,7 @@ def read_single_udf(pickleSer, infile, eval_type, 
runner_conf, udf_index):
 argspec = inspect.getfullargspec(chained_func)  # signature was lost 
when wrapping it
 return args_offsets, wrap_cogrouped_map_pandas_udf(func, return_type, 
argspec, runner_conf)
 elif eval_type == PythonEvalType.SQL_COGROUPED_MAP_ARROW_UDF:
-argspec = getfullargspec(chained_func)  # signature was lost when 
wrapping it
+argspec = inspect.getfullargspec(chained_func)  # signature was lost 
when wrapping it
 return args_offsets, wrap_cogrouped_map_arrow_udf(func, return_type, 
argspec, runner_conf)
 elif eval_type == PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF:
 return wrap_grouped_agg_pandas_udf(func, args_offsets, kwargs_offsets, 
return_type)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-46224][PYTHON][MLLIB][TESTS] Test string representation of TestResult (pyspark.mllib.stat.test)

2023-12-03 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 108e25086e07 [SPARK-46224][PYTHON][MLLIB][TESTS] Test string 
representation of TestResult (pyspark.mllib.stat.test)
108e25086e07 is described below

commit 108e25086e07e9bcc8290ec946aa421068f4cd59
Author: Hyukjin Kwon 
AuthorDate: Mon Dec 4 11:16:17 2023 +0900

[SPARK-46224][PYTHON][MLLIB][TESTS] Test string representation of 
TestResult (pyspark.mllib.stat.test)

### What changes were proposed in this pull request?

This PR adds a test for `toString` on the model via `TestResult` within 
`pyspark.mllib.stat.test`.

### Why are the changes needed?

![Screenshot 2023-12-04 at 9 19 37 
AM](https://github.com/apache/spark/assets/6477701/8a360172-0913-43ce-a5f0-a68cdefe252c)


https://app.codecov.io/gh/apache/spark/blob/master/python%2Fpyspark%2Fmllib%2Fstat%2Ftest.py#L28

This is not being tested.

### Does this PR introduce _any_ user-facing change?

No, test-only

### How was this patch tested?

Manually ran the new unittest.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44138 from HyukjinKwon/SPARK-46041.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/mllib/tests/test_stat.py | 1 +
 1 file changed, 1 insertion(+)

diff --git a/python/pyspark/mllib/tests/test_stat.py 
b/python/pyspark/mllib/tests/test_stat.py
index 4fcd49f11bdc..c851768a1fb5 100644
--- a/python/pyspark/mllib/tests/test_stat.py
+++ b/python/pyspark/mllib/tests/test_stat.py
@@ -65,6 +65,7 @@ class ChiSqTestTests(MLlibTestCase):
 
 observed = Vectors.dense([4, 6, 5])
 pearson = Statistics.chiSqTest(observed)
+self.assertIn("Chi squared test summary", str(pearson))
 
 # Validated against the R command `chisq.test(c(4, 6, 5), p=c(1/3, 
1/3, 1/3))`
 self.assertEqual(pearson.statistic, 0.4)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-46227][SQL] Move `withSQLConf` from `SQLHelper` to `SQLConfHelper`

2023-12-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new aee6b1582775 [SPARK-46227][SQL] Move `withSQLConf` from `SQLHelper` to 
`SQLConfHelper`
aee6b1582775 is described below

commit aee6b158277537709a717223b518923431bca0a6
Author: ulysses-you 
AuthorDate: Sun Dec 3 21:23:34 2023 -0800

[SPARK-46227][SQL] Move `withSQLConf` from `SQLHelper` to `SQLConfHelper`

### What changes were proposed in this pull request?

This PR moves the method `withSQLConf` from `SQLHelper` in the catalyst test
module to the `SQLConfHelper` trait in the catalyst module. To support usage
such as `val x = withSQLConf(...) { ... }`, it also changes the method's return
type from `Unit` to a generic type.
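
A minimal sketch of the calling convention this enables (`SQLConfHelper` is an
internal catalyst trait; the class and config key below are only illustrative):

```
import org.apache.spark.sql.catalyst.SQLConfHelper

class CaseSensitivityProbe extends SQLConfHelper {
  // The block's result is now returned to the caller instead of being discarded.
  def underCaseSensitiveAnalysis: Boolean =
    withSQLConf("spark.sql.caseSensitive" -> "true") {
      conf.caseSensitiveAnalysis
    }
}
```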

### Why are the changes needed?

A part of https://github.com/apache/spark/pull/44013

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Pass CI

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #44142 from ulysses-you/withSQLConf.

Authored-by: ulysses-you 
Signed-off-by: Dongjoon Hyun 
---
 .../apache/spark/sql/catalyst/SQLConfHelper.scala  | 29 
 .../spark/sql/catalyst/plans/SQLHelper.scala   | 32 ++
 .../sql/internal/ExecutorSideSQLConfSuite.scala|  2 +-
 .../org/apache/spark/sql/test/SQLTestUtils.scala   |  2 +-
 .../spark/sql/hive/execution/HiveSerDeSuite.scala  |  2 +-
 5 files changed, 34 insertions(+), 33 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SQLConfHelper.scala 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SQLConfHelper.scala
index cee35cdb8d84..f4605b9218f0 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SQLConfHelper.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SQLConfHelper.scala
@@ -17,6 +17,7 @@
 
 package org.apache.spark.sql.catalyst
 
+import org.apache.spark.sql.AnalysisException
 import org.apache.spark.sql.internal.SQLConf
 
 /**
@@ -29,4 +30,32 @@ trait SQLConfHelper {
* See [[SQLConf.get]] for more information.
*/
   def conf: SQLConf = SQLConf.get
+
+  /**
+   * Sets all SQL configurations specified in `pairs`, calls `f`, and then 
restores all SQL
+   * configurations.
+   */
+  protected def withSQLConf[T](pairs: (String, String)*)(f: => T): T = {
+val conf = SQLConf.get
+val (keys, values) = pairs.unzip
+val currentValues = keys.map { key =>
+  if (conf.contains(key)) {
+Some(conf.getConfString(key))
+  } else {
+None
+  }
+}
+keys.lazyZip(values).foreach { (k, v) =>
+  if (SQLConf.isStaticConfigKey(k)) {
+throw new AnalysisException(s"Cannot modify the value of a static 
config: $k")
+  }
+  conf.setConfString(k, v)
+}
+try f finally {
+  keys.zip(currentValues).foreach {
+case (key, Some(value)) => conf.setConfString(key, value)
+case (key, None) => conf.unsetConf(key)
+  }
+}
+  }
 }
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/SQLHelper.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/SQLHelper.scala
index eb844e6f057f..92681613bd83 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/SQLHelper.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/SQLHelper.scala
@@ -23,41 +23,13 @@ import scala.util.control.NonFatal
 
 import org.scalatest.Assertions.fail
 
-import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.SQLConfHelper
 import org.apache.spark.sql.catalyst.util.DateTimeTestUtils
 import org.apache.spark.sql.catalyst.util.DateTimeUtils.getZoneId
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.util.Utils
 
-trait SQLHelper {
-
-  /**
-   * Sets all SQL configurations specified in `pairs`, calls `f`, and then 
restores all SQL
-   * configurations.
-   */
-  protected def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
-val conf = SQLConf.get
-val (keys, values) = pairs.unzip
-val currentValues = keys.map { key =>
-  if (conf.contains(key)) {
-Some(conf.getConfString(key))
-  } else {
-None
-  }
-}
-keys.lazyZip(values).foreach { (k, v) =>
-  if (SQLConf.isStaticConfigKey(k)) {
-throw new AnalysisException(s"Cannot modify the value of a static 
config: $k")
-  }
-  conf.setConfString(k, v)
-}
-try f finally {
-  keys.zip(currentValues).foreach {
-case (key, Some(value)) => conf.setConfString(key, value)
-case (key, None) => conf.unsetConf(key)
-  }
-}
-  }
+trait SQLHelper extends SQLConfHelper {
 
   /**
* Generates a temporary path w

(spark) branch master updated: [SPARK-46218][BUILD] Upgrade commons-cli to 1.6.0

2023-12-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0c029e70706c [SPARK-46218][BUILD] Upgrade commons-cli to 1.6.0
0c029e70706c is described below

commit 0c029e70706c7e1a4c3a7bb763dbbcb4fe1ccd9f
Author: panbingkun 
AuthorDate: Sun Dec 3 21:27:21 2023 -0800

[SPARK-46218][BUILD] Upgrade commons-cli to 1.6.0

### What changes were proposed in this pull request?
The pr aims to upgrade `commons-cli` from `1.5.0` to `1.6.0`.

### Why are the changes needed?
- The last upgrade occurred two years ago, 
https://github.com/apache/spark/pull/34707
- The full release notes: 
https://commons.apache.org/proper/commons-cli/changes-report.html#a1.6.0
- The version mainly focuses on fixing bugs:
  - Fix NPE in CommandLine.resolveOption(String). Fixes [CLI-283](https://issues.apache.org/jira/browse/CLI-283).
  - CommandLine.addOption(Option) should not allow a null Option. Fixes [CLI-283](https://issues.apache.org/jira/browse/CLI-283).
  - CommandLine.addArgs(String) should not allow a null String. Fixes [CLI-283](https://issues.apache.org/jira/browse/CLI-283).
  - NullPointerException thrown by CommandLineParser.parse(). Fixes [CLI-317](https://issues.apache.org/jira/browse/CLI-317).
  - StringIndexOutOfBoundsException thrown by CommandLineParser.parse(). Fixes [CLI-313](https://issues.apache.org/jira/browse/CLI-313).

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44132 from panbingkun/SPARK-46218.

Authored-by: panbingkun 
Signed-off-by: Dongjoon Hyun 
---
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 +-
 pom.xml   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 
b/dev/deps/spark-deps-hadoop-3-hive-2.3
index f1d675d92b6d..ebfe6acad960 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -35,7 +35,7 @@ breeze_2.13/2.1.0//breeze_2.13-2.1.0.jar
 cats-kernel_2.13/2.8.0//cats-kernel_2.13-2.8.0.jar
 chill-java/0.10.0//chill-java-0.10.0.jar
 chill_2.13/0.10.0//chill_2.13-0.10.0.jar
-commons-cli/1.5.0//commons-cli-1.5.0.jar
+commons-cli/1.6.0//commons-cli-1.6.0.jar
 commons-codec/1.16.0//commons-codec-1.16.0.jar
 commons-collections/3.2.2//commons-collections-3.2.2.jar
 commons-collections4/4.4//commons-collections4-4.4.jar
diff --git a/pom.xml b/pom.xml
index 2a259cfd322b..27ee42f103dd 100644
--- a/pom.xml
+++ b/pom.xml
@@ -220,7 +220,7 @@
 2.70.0
 3.1.0
 1.1.0
-1.5.0
+1.6.0
 1.70
 1.9.0
 4.1.100.Final


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (0c029e70706c -> 712352e37ec5)

2023-12-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 0c029e70706c [SPARK-46218][BUILD] Upgrade commons-cli to 1.6.0
 add 712352e37ec5 [SPARK-40559][PYTHON][DOCS][FOLLOW-UP] Fix the docstring 
and document both applyInArrows

No new revisions were added by this update.

Summary of changes:
 python/docs/source/reference/pyspark.sql/grouping.rst |  2 ++
 python/pyspark/sql/pandas/group_ops.py| 15 +++
 2 files changed, 9 insertions(+), 8 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-46182][CORE] Track `lastTaskFinishTime` using the exact task finished event

2023-12-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6f112f7b1a50 [SPARK-46182][CORE] Track `lastTaskFinishTime` using the 
exact task finished event
6f112f7b1a50 is described below

commit 6f112f7b1a50a2b8a59952c69f67dd5f80ab6633
Author: Xingbo Jiang 
AuthorDate: Sun Dec 3 22:08:20 2023 -0800

[SPARK-46182][CORE] Track `lastTaskFinishTime` using the exact task 
finished event

### What changes were proposed in this pull request?

We found a race condition between lastTaskRunningTime and
lastShuffleMigrationTime that could lead to a decommissioned executor exiting
before all the shuffle blocks have been discovered. The issue could lead to an
immediate task retry right after the executor exits, and thus a longer query
execution time.

To fix the issue, we choose to update the lastTaskRunningTime only when a task
updates its status to finished through the StatusUpdate event. This is better
than the current approach (which uses a thread to check the number of running
tasks every second), because this way we know clearly whether the shuffle block
refresh happened after all tasks finished running, thus resolving the race
condition mentioned above.
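
A condensed sketch of the idea (not the actual `CoarseGrainedExecutorBackend`
code): record the finish time at the moment a task reports a finished state, and
only trust the "all blocks migrated" snapshot if it was taken after that time.

```
import java.util.concurrent.atomic.AtomicLong

object DecommissionSketch {
  private val lastTaskFinishTime = new AtomicLong(System.nanoTime())

  // Called from the StatusUpdate path when a task reaches a finished state.
  def onTaskFinished(): Unit = lastTaskFinishTime.set(System.nanoTime())

  // Exit is safe only if the migration snapshot is newer than the last task finish.
  def safeToExit(migrationTime: Long, allBlocksMigrated: Boolean): Boolean =
    allBlocksMigrated && migrationTime > lastTaskFinishTime.get()
}
```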

### Why are the changes needed?

To fix a race condition that could lead to shuffle data loss, and thus longer
query execution time.

### How was this patch tested?

This is a very subtle race condition that is hard to cover with a unit test
under the current unit test framework, and we are confident the change is low
risk. Thus we only verify it by passing all the existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44090 from jiangxb1987/SPARK-46182.

Authored-by: Xingbo Jiang 
Signed-off-by: Dongjoon Hyun 
---
 .../spark/executor/CoarseGrainedExecutorBackend.scala| 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git 
a/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
 
b/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
index f1a9aa353e76..4bf4929c1339 100644
--- 
a/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
+++ 
b/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
@@ -21,7 +21,7 @@ import java.net.URL
 import java.nio.ByteBuffer
 import java.util.Locale
 import java.util.concurrent.ConcurrentHashMap
-import java.util.concurrent.atomic.AtomicBoolean
+import java.util.concurrent.atomic.{AtomicBoolean, AtomicLong}
 
 import scala.util.{Failure, Success}
 import scala.util.control.NonFatal
@@ -77,6 +77,10 @@ private[spark] class CoarseGrainedExecutorBackend(
 
   private var decommissioned = false
 
+  // Track the last time in ns that at least one task is running. If no task 
is running and all
+  // shuffle/RDD data migration are done, the decommissioned executor should 
exit.
+  private var lastTaskFinishTime = new AtomicLong(System.nanoTime())
+
   override def onStart(): Unit = {
 if (env.conf.get(DECOMMISSION_ENABLED)) {
   val signal = env.conf.get(EXECUTOR_DECOMMISSION_SIGNAL)
@@ -273,6 +277,7 @@ private[spark] class CoarseGrainedExecutorBackend(
 val msg = StatusUpdate(executorId, taskId, state, data, cpus, resources)
 if (TaskState.isFinished(state)) {
   taskResources.remove(taskId)
+  lastTaskFinishTime.set(System.nanoTime())
 }
 driver match {
   case Some(driverRef) => driverRef.send(msg)
@@ -345,7 +350,6 @@ private[spark] class CoarseGrainedExecutorBackend(
 
   val shutdownThread = new Thread("wait-for-blocks-to-migrate") {
 override def run(): Unit = {
-  var lastTaskRunningTime = System.nanoTime()
   val sleep_time = 1000 // 1s
   // This config is internal and only used by unit tests to force an 
executor
   // to hang around for longer when decommissioned.
@@ -362,7 +366,7 @@ private[spark] class CoarseGrainedExecutorBackend(
 val (migrationTime, allBlocksMigrated) = 
env.blockManager.lastMigrationInfo()
 // We can only trust allBlocksMigrated boolean value if there 
were no tasks running
 // since the start of computing it.
-if (allBlocksMigrated && (migrationTime > 
lastTaskRunningTime)) {
+if (allBlocksMigrated && (migrationTime > 
lastTaskFinishTime.get())) {
   logInfo("No running tasks, all blocks migrated, stopping.")
   exitExecutor(0, ExecutorLossMessage.decommissionFinished, 
notifyDriver = true)
 } else {
@@ -374,12 +378,6 @@ private[spark] class CoarseGrainedExecutorBackend(
   }
 } else {
   logInfo(s"Blocked 

(spark) branch branch-3.5 updated: [SPARK-46182][CORE] Track `lastTaskFinishTime` using the exact task finished event

2023-12-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 273ef5708fc3 [SPARK-46182][CORE] Track `lastTaskFinishTime` using the 
exact task finished event
273ef5708fc3 is described below

commit 273ef5708fc33872cfe3091627617bbac8fdd56f
Author: Xingbo Jiang 
AuthorDate: Sun Dec 3 22:08:20 2023 -0800

[SPARK-46182][CORE] Track `lastTaskFinishTime` using the exact task 
finished event

### What changes were proposed in this pull request?

We found a race condition between lastTaskRunningTime and
lastShuffleMigrationTime that could lead to a decommissioned executor exiting
before all the shuffle blocks have been discovered. The issue could lead to an
immediate task retry right after the executor exits, and thus a longer query
execution time.

To fix the issue, we choose to update the lastTaskRunningTime only when a task
updates its status to finished through the StatusUpdate event. This is better
than the current approach (which uses a thread to check the number of running
tasks every second), because this way we know clearly whether the shuffle block
refresh happened after all tasks finished running, thus resolving the race
condition mentioned above.

### Why are the changes needed?

To fix a race condition that could lead to shuffle data loss, and thus longer
query execution time.

### How was this patch tested?

This is a very subtle race condition that is hard to cover with a unit test
under the current unit test framework, and we are confident the change is low
risk. Thus we only verify it by passing all the existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44090 from jiangxb1987/SPARK-46182.

Authored-by: Xingbo Jiang 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 6f112f7b1a50a2b8a59952c69f67dd5f80ab6633)
Signed-off-by: Dongjoon Hyun 
---
 .../spark/executor/CoarseGrainedExecutorBackend.scala| 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git 
a/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
 
b/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
index c695a9ec2851..537522326fc7 100644
--- 
a/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
+++ 
b/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
@@ -21,7 +21,7 @@ import java.net.URL
 import java.nio.ByteBuffer
 import java.util.Locale
 import java.util.concurrent.ConcurrentHashMap
-import java.util.concurrent.atomic.AtomicBoolean
+import java.util.concurrent.atomic.{AtomicBoolean, AtomicLong}
 
 import scala.util.{Failure, Success}
 import scala.util.control.NonFatal
@@ -80,6 +80,10 @@ private[spark] class CoarseGrainedExecutorBackend(
 
   private var decommissioned = false
 
+  // Track the last time in ns that at least one task is running. If no task 
is running and all
+  // shuffle/RDD data migration are done, the decommissioned executor should 
exit.
+  private var lastTaskFinishTime = new AtomicLong(System.nanoTime())
+
   override def onStart(): Unit = {
 if (env.conf.get(DECOMMISSION_ENABLED)) {
   val signal = env.conf.get(EXECUTOR_DECOMMISSION_SIGNAL)
@@ -269,6 +273,7 @@ private[spark] class CoarseGrainedExecutorBackend(
 val msg = StatusUpdate(executorId, taskId, state, data, cpus, resources)
 if (TaskState.isFinished(state)) {
   taskResources.remove(taskId)
+  lastTaskFinishTime.set(System.nanoTime())
 }
 driver match {
   case Some(driverRef) => driverRef.send(msg)
@@ -341,7 +346,6 @@ private[spark] class CoarseGrainedExecutorBackend(
 
   val shutdownThread = new Thread("wait-for-blocks-to-migrate") {
 override def run(): Unit = {
-  var lastTaskRunningTime = System.nanoTime()
   val sleep_time = 1000 // 1s
   // This config is internal and only used by unit tests to force an executor
   // to hang around for longer when decommissioned.
@@ -358,7 +362,7 @@ private[spark] class CoarseGrainedExecutorBackend(
 val (migrationTime, allBlocksMigrated) = env.blockManager.lastMigrationInfo()
 // We can only trust allBlocksMigrated boolean value if there were no tasks running
 // since the start of computing it.
-if (allBlocksMigrated && (migrationTime > lastTaskRunningTime)) {
+if (allBlocksMigrated && (migrationTime > lastTaskFinishTime.get())) {
   logInfo("No running tasks, all blocks migrated, stopping.")
   exitExecutor(0, ExecutorLossMessage.decommissionFinished, notifyDriver = true)
 } else {
@@ -370,12 +374,6 @@ private[

(spark) branch branch-3.4 updated: [SPARK-46182][CORE] Track `lastTaskFinishTime` using the exact task finished event

2023-12-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new b8750d5c0b41 [SPARK-46182][CORE] Track `lastTaskFinishTime` using the 
exact task finished event
b8750d5c0b41 is described below

commit b8750d5c0b416137ce802cf73dd92b0fc7ff5467
Author: Xingbo Jiang 
AuthorDate: Sun Dec 3 22:08:20 2023 -0800

[SPARK-46182][CORE] Track `lastTaskFinishTime` using the exact task 
finished event

### What changes were proposed in this pull request?

We found a race condition between lastTaskRunningTime and 
lastShuffleMigrationTime that could cause a decommissioned executor to exit 
before all the shuffle blocks have been discovered. The issue could lead to 
immediate task retries right after the executor exits, and thus longer query 
execution time.

To fix the issue, we now update the last task running time (tracked as 
lastTaskFinishTime) only when a task reports a finished status through the 
StatusUpdate event. This is better than the previous approach (which used a 
thread to check the number of running tasks every second), because it makes 
clear whether the shuffle block refresh happened after all tasks finished 
running, thus resolving the race condition mentioned above.

### Why are the changes needed?

To fix a race condition that could lead to shuffle data loss and thus longer 
query execution time.

### How was this patch tested?

This is a very subtle race condition that is hard to cover with a unit test 
in the current test framework, and we are confident the change is low risk. 
Thus we only verify it by passing all the existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44090 from jiangxb1987/SPARK-46182.

Authored-by: Xingbo Jiang 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 6f112f7b1a50a2b8a59952c69f67dd5f80ab6633)
Signed-off-by: Dongjoon Hyun 
---
 .../spark/executor/CoarseGrainedExecutorBackend.scala| 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala b/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
index c695a9ec2851..537522326fc7 100644
--- a/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
+++ b/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
@@ -21,7 +21,7 @@ import java.net.URL
 import java.nio.ByteBuffer
 import java.util.Locale
 import java.util.concurrent.ConcurrentHashMap
-import java.util.concurrent.atomic.AtomicBoolean
+import java.util.concurrent.atomic.{AtomicBoolean, AtomicLong}
 
 import scala.util.{Failure, Success}
 import scala.util.control.NonFatal
@@ -80,6 +80,10 @@ private[spark] class CoarseGrainedExecutorBackend(
 
   private var decommissioned = false
 
+  // Track the last time in ns that at least one task is running. If no task is running and all
+  // shuffle/RDD data migration are done, the decommissioned executor should exit.
+  private var lastTaskFinishTime = new AtomicLong(System.nanoTime())
+
   override def onStart(): Unit = {
 if (env.conf.get(DECOMMISSION_ENABLED)) {
   val signal = env.conf.get(EXECUTOR_DECOMMISSION_SIGNAL)
@@ -269,6 +273,7 @@ private[spark] class CoarseGrainedExecutorBackend(
 val msg = StatusUpdate(executorId, taskId, state, data, cpus, resources)
 if (TaskState.isFinished(state)) {
   taskResources.remove(taskId)
+  lastTaskFinishTime.set(System.nanoTime())
 }
 driver match {
   case Some(driverRef) => driverRef.send(msg)
@@ -341,7 +346,6 @@ private[spark] class CoarseGrainedExecutorBackend(
 
   val shutdownThread = new Thread("wait-for-blocks-to-migrate") {
 override def run(): Unit = {
-  var lastTaskRunningTime = System.nanoTime()
   val sleep_time = 1000 // 1s
   // This config is internal and only used by unit tests to force an executor
   // to hang around for longer when decommissioned.
@@ -358,7 +362,7 @@ private[spark] class CoarseGrainedExecutorBackend(
 val (migrationTime, allBlocksMigrated) = env.blockManager.lastMigrationInfo()
 // We can only trust allBlocksMigrated boolean value if there were no tasks running
 // since the start of computing it.
-if (allBlocksMigrated && (migrationTime > lastTaskRunningTime)) {
+if (allBlocksMigrated && (migrationTime > lastTaskFinishTime.get())) {
   logInfo("No running tasks, all blocks migrated, stopping.")
   exitExecutor(0, ExecutorLossMessage.decommissionFinished, notifyDriver = true)
 } else {
@@ -370,12 +374,6 @@ private[

(spark) branch master updated: [SPARK-46232][PYTHON] Migrate all remaining ValueError into PySpark error framework

2023-12-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b23ae15da019 [SPARK-46232][PYTHON] Migrate all remaining ValueError 
into PySpark error framework
b23ae15da019 is described below

commit b23ae15da019082891d71853682329c2d24c2e9e
Author: Haejoon Lee 
AuthorDate: Sun Dec 3 22:49:30 2023 -0800

[SPARK-46232][PYTHON] Migrate all remaining ValueError into PySpark error 
framework

### What changes were proposed in this pull request?

This PR proposes to migrate all remaining `ValueError` usages in 
`pyspark/sql/*` into the PySpark error framework as `PySparkValueError`, 
assigning dedicated error classes to each.
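
    The migration pattern is the same across the hunks below: a hand-formatted 
`ValueError` becomes a `PySparkValueError` with a dedicated error class and 
explicit message parameters. A minimal sketch of the pattern (the wrapper 
function is illustrative and not part of the commit; the error class and the 
keyword arguments mirror the serializers.py hunk below):
```
from pyspark.errors import PySparkValueError

def check_dataframes_in_group(dataframes_in_group: int) -> None:
    # Mirrors the check in CogroupArrowUDFSerializer: a cogroup carries
    # either 0 or 2 dataframes; anything else is reported through the
    # error framework instead of a hand-formatted ValueError string.
    if dataframes_in_group not in (0, 2):
        raise PySparkValueError(
            error_class="INVALID_NUMBER_OF_DATAFRAMES_IN_GROUP",
            message_parameters={"dataframes_in_group": str(dataframes_in_group)},
        )
```
    With an assigned error class, callers can match on the class (for example 
via PySparkException.getErrorClass()) instead of parsing the message string.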

### Why are the changes needed?

To improve the error handling in PySpark.

### Does this PR introduce _any_ user-facing change?

No API changes, but the user-facing error messages will be improved.

### How was this patch tested?

The existing CI should pass.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44149 from itholic/migrate_value_error.

Authored-by: Haejoon Lee 
Signed-off-by: Dongjoon Hyun 
---
 python/pyspark/errors/error_classes.py   | 19 +--
 python/pyspark/sql/pandas/serializers.py |  5 +++--
 python/pyspark/sql/pandas/typehints.py   | 12 +---
 python/pyspark/sql/pandas/types.py   |  7 +--
 python/pyspark/sql/sql_formatter.py  |  7 ---
 5 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/python/pyspark/errors/error_classes.py b/python/pyspark/errors/error_classes.py
index c7199ac938be..d0c0d1c115b0 100644
--- a/python/pyspark/errors/error_classes.py
+++ b/python/pyspark/errors/error_classes.py
@@ -287,6 +287,11 @@ ERROR_CLASSES_JSON = """
   "NumPy array input should be of  dimensions."
 ]
   },
+  "INVALID_NUMBER_OF_DATAFRAMES_IN_GROUP" : {
+"message" : [
+  "Invalid number of dataframes in group ."
+]
+  },
   "INVALID_PANDAS_UDF" : {
 "message" : [
   "Invalid function: "
@@ -803,9 +808,9 @@ ERROR_CLASSES_JSON = """
   "Expected  values for ``, got ."
 ]
   },
-  "TYPE_HINT_REQUIRED" : {
+  "TYPE_HINT_SHOULD_BE_SPECIFIED" : {
 "message" : [
-  "A  is required ."
+  "Type hints for  should be specified; however, got ."
 ]
   },
   "UDF_RETURN_TYPE" : {
@@ -888,6 +893,11 @@ ERROR_CLASSES_JSON = """
   "Unknown response: ."
 ]
   },
+  "UNKNOWN_VALUE_FOR" : {
+"message" : [
+  "Unknown value for ``."
+]
+  },
   "UNSUPPORTED_DATA_TYPE" : {
 "message" : [
   "Unsupported DataType ``."
@@ -983,6 +993,11 @@ ERROR_CLASSES_JSON = """
   "Value for `` only supports the 'pearson', got ''."
 ]
   },
+  "VALUE_NOT_PLAIN_COLUMN_REFERENCE" : {
+"message" : [
+  "Value  in  should be a plain column reference such as 
`df.col` or `col('column')`."
+]
+  },
   "VALUE_NOT_POSITIVE" : {
 "message" : [
   "Value for `` must be positive, got ''."
diff --git a/python/pyspark/sql/pandas/serializers.py b/python/pyspark/sql/pandas/serializers.py
index 8ffb7407714b..6c5bd826a023 100644
--- a/python/pyspark/sql/pandas/serializers.py
+++ b/python/pyspark/sql/pandas/serializers.py
@@ -707,8 +707,9 @@ class CogroupArrowUDFSerializer(ArrowStreamGroupUDFSerializer):
 yield batches1, batches2
 
 elif dataframes_in_group != 0:
-raise ValueError(
-"Invalid number of dataframes in group 
{0}".format(dataframes_in_group)
+raise PySparkValueError(
+error_class="INVALID_NUMBER_OF_DATAFRAMES_IN_GROUP",
+message_parameters={"dataframes_in_group": 
str(dataframes_in_group)},
 )
 
 
diff --git a/python/pyspark/sql/pandas/typehints.py b/python/pyspark/sql/pandas/typehints.py
index f0c13e66a63d..37ba02a94d58 100644
--- a/python/pyspark/sql/pandas/typehints.py
+++ b/python/pyspark/sql/pandas/typehints.py
@@ -18,7 +18,7 @@ from inspect import Signature
 from typing import Any, Callable, Dict, Optional, Union, TYPE_CHECKING
 
 from pyspark.sql.pandas.utils import require_minimum_pandas_version
-from pyspark.errors import PySparkNotImplementedError
+from pyspark.errors import PySparkNotImplementedError, PySparkValueError
 
 if TYPE_CHECKING:
 from pyspark.sql.pandas._typing import (
@@ -51,12 +51,18 @@ def infer_eval_type(
 annotations[parameter] for parameter in sig.parameters if parameter in annotations
 ]
 if len(parameters_sig) != len(sig.parameters):
-raise ValueError("Type hints for all parameters should be specified; however, got %s" % sig)
+raise PySparkValueError(
+error_class="TYPE_HINT_SHOULD_BE_SPECIFIED",
+message_parameters={"target": "

(spark) branch master updated (b23ae15da019 -> fc56bb53910e)

2023-12-03 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from b23ae15da019 [SPARK-46232][PYTHON] Migrate all remaining ValueError 
into PySpark error framework
 add fc56bb53910e [SPARK-46229][PYTHON][CONNECT] Add applyInArrow to 
groupBy and cogroup in Spark Connect

No new revisions were added by this update.

Summary of changes:
 .../sql/connect/planner/SparkConnectPlanner.scala  | 29 +++--
 dev/sparktestsupport/modules.py|  6 +-
 python/pyspark/sql/connect/_typing.py  | 12 +++-
 python/pyspark/sql/connect/group.py| 53 +++-
 python/pyspark/sql/tests/arrow/__init__.py | 16 -
 .../connect/test_parity_arrow_cogrouped_map.py}| 14 ++---
 .../connect/test_parity_arrow_grouped_map.py}  | 14 +++--
 .../tests/{arrow => }/test_arrow_cogrouped_map.py  | 71 --
 .../tests/{arrow => }/test_arrow_grouped_map.py| 61 ++-
 9 files changed, 176 insertions(+), 100 deletions(-)
 delete mode 100644 python/pyspark/sql/tests/arrow/__init__.py
 copy python/pyspark/{pandas/tests/connect/test_parity_series_datetime.py => sql/tests/connect/test_parity_arrow_cogrouped_map.py} (75%)
 copy python/pyspark/{pandas/tests/connect/series/test_parity_series.py => sql/tests/connect/test_parity_arrow_grouped_map.py} (75%)
 rename python/pyspark/sql/tests/{arrow => }/test_arrow_cogrouped_map.py (92%)
 rename python/pyspark/sql/tests/{arrow => }/test_arrow_grouped_map.py (96%)
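
For reference, a minimal usage sketch of the API this change wires into Spark
Connect. The column names, data, and aggregations are illustrative, and it
assumes a PySpark build that already ships GroupedData.applyInArrow and the
cogrouped variant (e.g. current master); with Spark Connect you would build
the session via `SparkSession.builder.remote(...)` instead.
```
import pyarrow as pa
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))

def summarize(table: pa.Table) -> pa.Table:
    # Receives all rows of one group as a pyarrow.Table, returns a pyarrow.Table.
    return pa.table({"id": [table.column("id")[0].as_py()],
                     "cnt": [table.num_rows]})

df.groupBy("id").applyInArrow(summarize, schema="id long, cnt long").show()

df2 = spark.createDataFrame([(1, "a"), (2, "b")], ("id", "label"))

def count_both(left: pa.Table, right: pa.Table) -> pa.Table:
    # Cogrouped variant: one pyarrow.Table per side for each key.
    key = (left if left.num_rows else right).column("id")[0].as_py()
    return pa.table({"id": [key], "rows": [left.num_rows + right.num_rows]})

df.groupBy("id").cogroup(df2.groupBy("id")).applyInArrow(
    count_both, schema="id long, rows long").show()
```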


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-46236][PYTHON][DOCS] Using brighter color for docs h3 title for better visibility

2023-12-03 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 4398bb5d3732 [SPARK-46236][PYTHON][DOCS] Using brighter color for docs 
h3 title for better visibility
4398bb5d3732 is described below

commit 4398bb5d37328e2f3594302d98f98803a379a2e9
Author: panbingkun 
AuthorDate: Mon Dec 4 16:26:39 2023 +0900

[SPARK-46236][PYTHON][DOCS] Using brighter color for docs h3 title for 
better visibility

### What changes were proposed in this pull request?
The PR aims to use a brighter color for the docs `h3` title for better 
visibility.

### Why are the changes needed?
- Before:
   Dark theme: https://github.com/apache/spark/assets/15246973/ee024e67-d86b-4960-968f-329b32d3e370
   Light theme: https://github.com/apache/spark/assets/15246973/20a562ed-b9b5-44f6-bf52-30967fb7dc3a

- After:
   Dark theme: https://github.com/apache/spark/assets/15246973/546139b5-a94c-4dec-a4fb-24299b3388ba
   Light theme: https://github.com/apache/spark/assets/15246973/4d912002-4c9e-46d0-a397-bcca0024f160

### Does this PR introduce _any_ user-facing change?
No API changes, but the font color of the `h3 title` has been updated to a 
brighter color, improving contrast and readability.

### How was this patch tested?
Manually tested.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44152 from panbingkun/SPARK-46236.

Authored-by: panbingkun 
Signed-off-by: Hyukjin Kwon 
---
 python/docs/source/_static/css/pyspark.css | 4 
 1 file changed, 4 deletions(-)

diff --git a/python/docs/source/_static/css/pyspark.css b/python/docs/source/_static/css/pyspark.css
index 2743629ff61c..565eaea29935 100644
--- a/python/docs/source/_static/css/pyspark.css
+++ b/python/docs/source/_static/css/pyspark.css
@@ -27,10 +27,6 @@ h1,h2 {
 color:#17A2B8!important;
 }
 
-h3 {
-color: #55
-}
-
 /* Top menu */
 #navbar-main {
 background: #1B5162!important;


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org