[spark] branch branch-3.5 updated: [SPARK-44644][PYTHON][3.5] Improve error messages for Python UDTFs with pickling errors

2023-08-06 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new b5ab247aa6e [SPARK-44644][PYTHON][3.5] Improve error messages for 
Python UDTFs with pickling errors
b5ab247aa6e is described below

commit b5ab247aa6e45180c2e826da74fcb615f3da3335
Author: allisonwang-db 
AuthorDate: Mon Aug 7 13:03:03 2023 +0800

[SPARK-44644][PYTHON][3.5] Improve error messages for Python UDTFs with 
pickling errors

### What changes were proposed in this pull request?

Cherry-pick 
https://github.com/apache/spark/commit/62415dc59627e1f7b4e3449ae728e93c1fc0b74f

This PR improves the error messages when a Python UDTF fails to pickle.

### Why are the changes needed?

To make the error message more user-friendly.

### Does this PR introduce _any_ user-facing change?

Yes, before this PR, when a UDTF fails to pickle, it throws this confusing 
exception:
```
_pickle.PicklingError: Cannot pickle files that are not opened for reading: 
w
```
After this PR, the error is more clear:
`[UDTF_SERIALIZATION_ERROR] Cannot serialize the UDTF 'TestUDTF': Please 
check the stack trace and make sure that the function is serializable.`

And for SparkSession access inside a UDTF:
`[UDTF_SERIALIZATION_ERROR] it appears that you are attempting to reference 
SparkSession inside a UDTF. SparkSession can only be used on the driver, not in 
code that runs on workers. Please remove the reference and try again.`
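
For context, a minimal (hypothetical) way to hit the first case is a UDTF that
closes over a non-picklable object such as a file handle opened for writing; the
names below are illustrative only, not taken from the patch:
```
# Illustrative only: a UDTF whose eval() references a module-level file handle.
# cloudpickle cannot serialize a file opened for writing, so invoking the UDTF
# now surfaces UDTF_SERIALIZATION_ERROR instead of a raw _pickle.PicklingError.
from pyspark.sql.functions import lit, udtf

log_file = open("/tmp/udtf_example.log", "w")  # not picklable

@udtf(returnType="x: int")
class TestUDTF:
    def eval(self, a: int):
        log_file.write(f"row: {a}\n")  # closure over the open file handle
        yield (a,)

# TestUDTF(lit(1)).show()  # fails while serializing the UDTF
```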

### How was this patch tested?

New UTs.

Closes #42349 from allisonwang-db/spark-44644-3.5.

Authored-by: allisonwang-db 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/cloudpickle/cloudpickle_fast.py |  2 +-
 python/pyspark/errors/error_classes.py |  5 +
 python/pyspark/sql/connect/plan.py | 15 +++--
 python/pyspark/sql/tests/test_udtf.py  | 30 +-
 python/pyspark/sql/udtf.py | 25 -
 5 files changed, 72 insertions(+), 5 deletions(-)

diff --git a/python/pyspark/cloudpickle/cloudpickle_fast.py 
b/python/pyspark/cloudpickle/cloudpickle_fast.py
index 63aaffa096b..ee1f4b8ee96 100644
--- a/python/pyspark/cloudpickle/cloudpickle_fast.py
+++ b/python/pyspark/cloudpickle/cloudpickle_fast.py
@@ -631,7 +631,7 @@ class CloudPickler(Pickler):
 try:
 return Pickler.dump(self, obj)
 except RuntimeError as e:
-if "recursion" in e.args[0]:
+if len(e.args) > 0 and "recursion" in e.args[0]:
 msg = (
 "Could not pickle object as excessively deep recursion "
 "required."
diff --git a/python/pyspark/errors/error_classes.py 
b/python/pyspark/errors/error_classes.py
index 4ea3e678810..971dc59bbb2 100644
--- a/python/pyspark/errors/error_classes.py
+++ b/python/pyspark/errors/error_classes.py
@@ -743,6 +743,11 @@ ERROR_CLASSES_JSON = """
   "Mismatch in return type for the UDTF ''. Expected a 'StructType', 
but got ''. Please ensure the return type is a correctly formatted 
StructType."
 ]
   },
+  "UDTF_SERIALIZATION_ERROR" : {
+"message" : [
+  "Cannot serialize the UDTF '': "
+]
+  },
   "UNEXPECTED_RESPONSE_FROM_SERVER" : {
 "message" : [
   "Unexpected response from iterator server."
diff --git a/python/pyspark/sql/connect/plan.py 
b/python/pyspark/sql/connect/plan.py
index 3390faa04de..2e918700848 100644
--- a/python/pyspark/sql/connect/plan.py
+++ b/python/pyspark/sql/connect/plan.py
@@ -21,6 +21,7 @@ check_dependencies(__name__)
 from typing import Any, List, Optional, Type, Sequence, Union, cast, 
TYPE_CHECKING, Mapping, Dict
 import functools
 import json
+import pickle
 from threading import Lock
 from inspect import signature, isclass
 
@@ -40,7 +41,7 @@ from pyspark.sql.connect.expressions import (
 LiteralExpression,
 )
 from pyspark.sql.connect.types import pyspark_types_to_proto_types, 
UnparsedDataType
-from pyspark.errors import PySparkTypeError, PySparkNotImplementedError
+from pyspark.errors import PySparkTypeError, PySparkNotImplementedError, 
PySparkRuntimeError
 
 if TYPE_CHECKING:
 from pyspark.sql.connect._typing import ColumnOrName
@@ -2200,7 +2201,17 @@ class PythonUDTF:
 assert self._return_type is not None
 
udtf.return_type.CopyFrom(pyspark_types_to_proto_types(self._return_type))
 udtf.eval_type = self._eval_type
-udtf.command = CloudPickleSerializer().dumps(self._func)
+try:
+udtf.command = CloudPickleSerializer().dumps(self._func)
+except pickle.PicklingError:
+raise PySparkRuntimeError(
+error_class="UDTF_SERIALIZATION_ERROR",
+message_parameters={
+   
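
The hunk above is cut off in this digest. A sketch of the wrapping pattern it
introduces, with the `message_parameters` keys ("name", "detail") assumed from
the `UDTF_SERIALIZATION_ERROR` template shown earlier, not the exact committed
code:
```
# Sketch only: wrap cloudpickle serialization of a UDTF and re-raise pickling
# failures as a PySpark error class. The "name"/"detail" keys are assumptions.
import pickle

from pyspark.errors import PySparkRuntimeError
from pyspark.serializers import CloudPickleSerializer


def serialize_udtf(func, name):
    try:
        return CloudPickleSerializer().dumps(func)
    except pickle.PicklingError as e:
        raise PySparkRuntimeError(
            error_class="UDTF_SERIALIZATION_ERROR",
            message_parameters={"name": name, "detail": str(e)},
        ) from e
```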

[spark] branch branch-3.5 updated: [SPARK-44682][PS] Make pandas error class message_parameters strings

2023-08-06 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 200440e1845 [SPARK-44682][PS] Make pandas error class 
message_parameters strings
200440e1845 is described below

commit 200440e18458e84d178d9eda5a2c54bcedc634ee
Author: Amanda Liu 
AuthorDate: Mon Aug 7 12:01:45 2023 +0900

[SPARK-44682][PS] Make pandas error class message_parameters strings

This PR converts the types of message_parameters for pandas error classes 
to string, to ensure the ability to compare error class messages in tests.

The change ensures the ability to compare error class messages in tests.

No, the PR does not affect the user-facing view of the error messages.

Updated `python/pyspark/pandas/tests/test_utils.py` and existing tests

Closes #42348 from asl3/string-pandas-error-types.

Authored-by: Amanda Liu 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit df8e52d84d1eabf48f68d09491f66a0835f41693)
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/pandas/tests/test_utils.py | 16 -
 python/pyspark/testing/pandasutils.py | 56 +++
 2 files changed, 36 insertions(+), 36 deletions(-)

diff --git a/python/pyspark/pandas/tests/test_utils.py 
b/python/pyspark/pandas/tests/test_utils.py
index 3d658446f27..0bb03dd8749 100644
--- a/python/pyspark/pandas/tests/test_utils.py
+++ b/python/pyspark/pandas/tests/test_utils.py
@@ -208,10 +208,10 @@ class UtilsTestsMixin:
 exception=pe.exception,
 error_class="DIFFERENT_PANDAS_SERIES",
 message_parameters={
-"left": series1,
-"left_dtype": series1.dtype,
-"right": series2,
-"right_dtype": series2.dtype,
+"left": series1.to_string(),
+"left_dtype": str(series1.dtype),
+"right": series2.to_string(),
+"right_dtype": str(series2.dtype),
 },
 )
 
@@ -227,9 +227,9 @@ class UtilsTestsMixin:
 error_class="DIFFERENT_PANDAS_INDEX",
 message_parameters={
 "left": index1,
-"left_dtype": index1.dtype,
+"left_dtype": str(index1.dtype),
 "right": index2,
-"right_dtype": index2.dtype,
+"right_dtype": str(index2.dtype),
 },
 )
 
@@ -247,9 +247,9 @@ class UtilsTestsMixin:
 error_class="DIFFERENT_PANDAS_MULTIINDEX",
 message_parameters={
 "left": multiindex1,
-"left_dtype": multiindex1.dtype,
+"left_dtype": str(multiindex1.dtype),
 "right": multiindex2,
-"right_dtype": multiindex2.dtype,
+"right_dtype": str(multiindex1.dtype),
 },
 )
 
diff --git a/python/pyspark/testing/pandasutils.py 
b/python/pyspark/testing/pandasutils.py
index 58999253521..39196873482 100644
--- a/python/pyspark/testing/pandasutils.py
+++ b/python/pyspark/testing/pandasutils.py
@@ -124,10 +124,10 @@ def _assert_pandas_equal(
 raise PySparkAssertionError(
 error_class="DIFFERENT_PANDAS_SERIES",
 message_parameters={
-"left": left,
-"left_dtype": left.dtype,
-"right": right,
-"right_dtype": right.dtype,
+"left": left.to_string(),
+"left_dtype": str(left.dtype),
+"right": right.to_string(),
+"right_dtype": str(right.dtype),
 },
 )
 elif isinstance(left, pd.Index) and isinstance(right, pd.Index):
@@ -143,9 +143,9 @@ def _assert_pandas_equal(
 error_class="DIFFERENT_PANDAS_INDEX",
 message_parameters={
 "left": left,
-"left_dtype": left.dtype,
+"left_dtype": str(left.dtype),
 "right": right,
-"right_dtype": right.dtype,
+"right_dtype": str(right.dtype),
 },
 )
 else:
@@ -228,10 +228,10 @@ def _assert_pandas_almost_equal(
 raise PySparkAssertionError(
 error_class="DIFFERENT_PANDAS_SERIES",
 message_parameters={
-"left": left,
-"left_dtype": left.dtype,
-"right": right,
-"right_dtype": right.dtype,
+"left": left.to_string(),
+"left_dtype": str(left.dtype),
+"right": right.to_string(),
+"right_dtype": str(right.dtype),
 },
 )
 for lnull, rnull 

[spark] branch master updated: [SPARK-44682][PS] Make pandas error class message_parameters strings

2023-08-06 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new df8e52d84d1 [SPARK-44682][PS] Make pandas error class 
message_parameters strings
df8e52d84d1 is described below

commit df8e52d84d1eabf48f68d09491f66a0835f41693
Author: Amanda Liu 
AuthorDate: Mon Aug 7 12:01:45 2023 +0900

[SPARK-44682][PS] Make pandas error class message_parameters strings

### What changes were proposed in this pull request?
This PR converts the types of message_parameters for pandas error classes 
to string, to ensure the ability to compare error class messages in tests.

### Why are the changes needed?
The change ensures the ability to compare error class messages in tests.

### Does this PR introduce _any_ user-facing change?
No, the PR does not affect the user-facing view of the error messages.

### How was this patch tested?
Updated `python/pyspark/pandas/tests/test_utils.py` and existing tests
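
To illustrate why stringifying helps (a sketch built on the same str()/to_string()
conversions the diff below applies, not an excerpt from it): comparing an expected
dict of plain strings against message_parameters that still hold pandas objects is
unreliable even when the rendered messages match.
```
# Sketch: message_parameters built from pandas objects vs. plain strings.
import pandas as pd

series1 = pd.Series([1, 2, 3])
series2 = pd.Series([4, 5, 6])

# Before: values are pandas/numpy objects, so dict equality against an
# expected dict of strings is unreliable in tests.
params_old = {"left_dtype": series1.dtype, "right_dtype": series2.dtype}

# After: everything converted up front, so the comparison is a plain
# string-to-string check.
params_new = {
    "left": series1.to_string(),
    "left_dtype": str(series1.dtype),
    "right": series2.to_string(),
    "right_dtype": str(series2.dtype),
}
print(params_new["left_dtype"])  # 'int64'
```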

Closes #42348 from asl3/string-pandas-error-types.

Authored-by: Amanda Liu 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/pandas/tests/test_utils.py | 16 -
 python/pyspark/testing/pandasutils.py | 56 +++
 2 files changed, 36 insertions(+), 36 deletions(-)

diff --git a/python/pyspark/pandas/tests/test_utils.py 
b/python/pyspark/pandas/tests/test_utils.py
index 3d658446f27..0bb03dd8749 100644
--- a/python/pyspark/pandas/tests/test_utils.py
+++ b/python/pyspark/pandas/tests/test_utils.py
@@ -208,10 +208,10 @@ class UtilsTestsMixin:
 exception=pe.exception,
 error_class="DIFFERENT_PANDAS_SERIES",
 message_parameters={
-"left": series1,
-"left_dtype": series1.dtype,
-"right": series2,
-"right_dtype": series2.dtype,
+"left": series1.to_string(),
+"left_dtype": str(series1.dtype),
+"right": series2.to_string(),
+"right_dtype": str(series2.dtype),
 },
 )
 
@@ -227,9 +227,9 @@ class UtilsTestsMixin:
 error_class="DIFFERENT_PANDAS_INDEX",
 message_parameters={
 "left": index1,
-"left_dtype": index1.dtype,
+"left_dtype": str(index1.dtype),
 "right": index2,
-"right_dtype": index2.dtype,
+"right_dtype": str(index2.dtype),
 },
 )
 
@@ -247,9 +247,9 @@ class UtilsTestsMixin:
 error_class="DIFFERENT_PANDAS_MULTIINDEX",
 message_parameters={
 "left": multiindex1,
-"left_dtype": multiindex1.dtype,
+"left_dtype": str(multiindex1.dtype),
 "right": multiindex2,
-"right_dtype": multiindex2.dtype,
+"right_dtype": str(multiindex1.dtype),
 },
 )
 
diff --git a/python/pyspark/testing/pandasutils.py 
b/python/pyspark/testing/pandasutils.py
index 58999253521..39196873482 100644
--- a/python/pyspark/testing/pandasutils.py
+++ b/python/pyspark/testing/pandasutils.py
@@ -124,10 +124,10 @@ def _assert_pandas_equal(
 raise PySparkAssertionError(
 error_class="DIFFERENT_PANDAS_SERIES",
 message_parameters={
-"left": left,
-"left_dtype": left.dtype,
-"right": right,
-"right_dtype": right.dtype,
+"left": left.to_string(),
+"left_dtype": str(left.dtype),
+"right": right.to_string(),
+"right_dtype": str(right.dtype),
 },
 )
 elif isinstance(left, pd.Index) and isinstance(right, pd.Index):
@@ -143,9 +143,9 @@ def _assert_pandas_equal(
 error_class="DIFFERENT_PANDAS_INDEX",
 message_parameters={
 "left": left,
-"left_dtype": left.dtype,
+"left_dtype": str(left.dtype),
 "right": right,
-"right_dtype": right.dtype,
+"right_dtype": str(right.dtype),
 },
 )
 else:
@@ -228,10 +228,10 @@ def _assert_pandas_almost_equal(
 raise PySparkAssertionError(
 error_class="DIFFERENT_PANDAS_SERIES",
 message_parameters={
-"left": left,
-"left_dtype": left.dtype,
-"right": right,
-"right_dtype": right.dtype,
+"left": left.to_string(),
+"left_dtype": str(left.dtype),
+"right": right.to_string(),
+"right_dtype": 

[spark] branch master updated: [SPARK-44264][PYTHON][ML][TESTS][FOLLOWUP] Adding Deepspeed To The Test Dockerfile

2023-08-06 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 7515061ec23 [SPARK-44264][PYTHON][ML][TESTS][FOLLOWUP] Adding 
Deepspeed To The Test Dockerfile
7515061ec23 is described below

commit 7515061ec237b0393e2fcc064a309fae29502dff
Author: Mathew Jacob 
AuthorDate: Mon Aug 7 10:57:02 2023 +0800

[SPARK-44264][PYTHON][ML][TESTS][FOLLOWUP] Adding Deepspeed To The Test 
Dockerfile

### What changes were proposed in this pull request?
Added Deepspeed to the Dockerfile used for tests in OSS Spark CI.

### Why are the changes needed?
Otherwise the Deepspeed tests would be skipped.

### Does this PR introduce _any_ user-facing change?
Nope, testing infra.

### How was this patch tested?
Running the tests on a local machine.

Closes #42347 from mathewjacob1002/testing_infra.

Authored-by: Mathew Jacob 
Signed-off-by: Ruifeng Zheng 
---
 dev/infra/Dockerfile | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/dev/infra/Dockerfile b/dev/infra/Dockerfile
index 9d7b29e25b4..b69e682f239 100644
--- a/dev/infra/Dockerfile
+++ b/dev/infra/Dockerfile
@@ -73,3 +73,5 @@ RUN python3.9 -m pip install grpcio protobuf 
googleapis-common-protos grpcio-sta
 # Add torch as a testing dependency for TorchDistributor
 RUN python3.9 -m pip install torch torchvision --index-url 
https://download.pytorch.org/whl/cpu
 RUN python3.9 -m pip install torcheval
+# Add Deepspeed as a testing dependency for DeepspeedTorchDistributor
+RUN python3.9 -m pip install deepspeed





[spark] branch branch-3.5 updated: [SPARK-44693][BUILD] Rename the `object Catalyst` in SparkBuild to `object SqlApi`

2023-08-06 Thread yangjie01
This is an automated email from the ASF dual-hosted git repository.

yangjie01 pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 816ee7e2716 [SPARK-44693][BUILD] Rename the `object Catalyst` in 
SparkBuild to `object SqlApi`
816ee7e2716 is described below

commit 816ee7e2716f2652e7cebe85fe949b6d9b44ffab
Author: yangjie01 
AuthorDate: Mon Aug 7 10:27:53 2023 +0800

[SPARK-44693][BUILD] Rename the `object Catalyst` in SparkBuild to `object 
SqlApi`

### What changes were proposed in this pull request?
This PR renames the Setting object used by the `SqlApi` module in 
`SparkBuild.scala` from `object Catalyst` to `object SqlApi`.

### Why are the changes needed?
The `SqlApi` module should use a more appropriate Setting object name.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass GitHub Actions

Closes #42361 from LuciferYang/rename-catalyst-2-sqlapi.

Authored-by: yangjie01 
Signed-off-by: yangjie01 
(cherry picked from commit a98b11274d95f7c9f6e550ef6394e803bc0c17ca)
Signed-off-by: yangjie01 
---
 project/SparkBuild.scala | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala
index 79007626026..bd65d3c4bd4 100644
--- a/project/SparkBuild.scala
+++ b/project/SparkBuild.scala
@@ -449,7 +449,7 @@ object SparkBuild extends PomBuild {
   enable(Unidoc.settings)(spark)
 
   /* Sql-api ANTLR generation settings */
-  enable(Catalyst.settings)(sqlApi)
+  enable(SqlApi.settings)(sqlApi)
 
   /* Spark SQL Core console settings */
   enable(SQL.settings)(sql)
@@ -1169,7 +1169,7 @@ object OldDeps {
   )
 }
 
-object Catalyst {
+object SqlApi {
   import com.simplytyped.Antlr4Plugin
   import com.simplytyped.Antlr4Plugin.autoImport._
 





[spark] branch master updated (656bf36363c -> a98b11274d9)

2023-08-06 Thread yangjie01
This is an automated email from the ASF dual-hosted git repository.

yangjie01 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 656bf36363c [SPARK-44690][SPARK-44376][BUILD] Downgrade Scala to 2.13.8
 add a98b11274d9 [SPARK-44693][BUILD] Rename the `object Catalyst` in 
SparkBuild to `object SqlApi`

No new revisions were added by this update.

Summary of changes:
 project/SparkBuild.scala | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)





[spark] branch branch-3.5 updated: [SPARK-44690][SPARK-44376][BUILD] Downgrade Scala to 2.13.8

2023-08-06 Thread yangjie01
This is an automated email from the ASF dual-hosted git repository.

yangjie01 pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 1cfa2624c93 [SPARK-44690][SPARK-44376][BUILD] Downgrade Scala to 2.13.8
1cfa2624c93 is described below

commit 1cfa2624c933088deef4281b207631c806230440
Author: yangjie01 
AuthorDate: Mon Aug 7 10:13:59 2023 +0800

[SPARK-44690][SPARK-44376][BUILD] Downgrade Scala to 2.13.8

### What changes were proposed in this pull request?
The aim of this PR is to downgrade the Scala 2.13 dependency to 2.13.8 to 
ensure that Spark can be built with `-target:jvm-1.8` and tested with Java 
11/17.

### Why are the changes needed?
As reported in SPARK-44376, there are issues when building and testing with 
Maven using Java 11/17 with `-target:jvm-1.8`:

- run `build/mvn clean install -Pscala-2.13` with Java 17

```
[INFO] --- scala-maven-plugin:4.8.0:compile (scala-compile-first)  
spark-core_2.13 ---
[INFO] Compiler bridge file: 
/Users/yangjie01/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.13-1.8.0-bin_2.13.11__61.0-1.8.0_20221110T195421.jar
[INFO] compiling 602 Scala sources and 77 Java sources to 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/target/scala-2.13/classes ...
[WARNING] [Warn] : [deprecation   | origin= | version=] -target is 
deprecated: Use -release instead to compile against the correct platform API.
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/serializer/SerializationDebugger.scala:71:
 not found: value sun
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:26:
 not found: object sun
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:27:
 not found: object sun
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:206:
 not found: type DirectBuffer
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:210:
 not found: type Unsafe
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:212:
 not found: type Unsafe
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:213:
 not found: type DirectBuffer
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:216:
 not found: type DirectBuffer
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:236:
 not found: type DirectBuffer
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:26:
 Unused import
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:27:
 Unused import
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala:452:
 not found: value sun
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:26:
 not found: object sun
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:99:
 not found: type SignalHandler
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:99:
 not found: type Signal
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:83:
 not found: type Signal
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:108:
 not found: type SignalHandler
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:108:
 not found: value Signal
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:114:
 not found: type Signal
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:116:
 not found: value Signal
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:128:
 not found: value Signal
[ERROR] [Error] 

[spark] branch master updated: [SPARK-44690][SPARK-44376][BUILD] Downgrade Scala to 2.13.8

2023-08-06 Thread yangjie01
This is an automated email from the ASF dual-hosted git repository.

yangjie01 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 656bf36363c [SPARK-44690][SPARK-44376][BUILD] Downgrade Scala to 2.13.8
656bf36363c is described below

commit 656bf36363c466b60d0045234ccaaa654ed8
Author: yangjie01 
AuthorDate: Mon Aug 7 10:13:59 2023 +0800

[SPARK-44690][SPARK-44376][BUILD] Downgrade Scala to 2.13.8

### What changes were proposed in this pull request?
The aim of this PR is to downgrade the Scala 2.13 dependency to 2.13.8 to 
ensure that Spark can be built with `-target:jvm-1.8` and tested with Java 
11/17.

### Why are the changes needed?
As reported in SPARK-44376, there are issues when building and testing with 
Maven using Java 11/17 with `-target:jvm-1.8`:

- run `build/mvn clean install -Pscala-2.13` with Java 17

```
[INFO] --- scala-maven-plugin:4.8.0:compile (scala-compile-first)  
spark-core_2.13 ---
[INFO] Compiler bridge file: 
/Users/yangjie01/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.13-1.8.0-bin_2.13.11__61.0-1.8.0_20221110T195421.jar
[INFO] compiling 602 Scala sources and 77 Java sources to 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/target/scala-2.13/classes ...
[WARNING] [Warn] : [deprecation   | origin= | version=] -target is 
deprecated: Use -release instead to compile against the correct platform API.
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/serializer/SerializationDebugger.scala:71:
 not found: value sun
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:26:
 not found: object sun
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:27:
 not found: object sun
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:206:
 not found: type DirectBuffer
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:210:
 not found: type Unsafe
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:212:
 not found: type Unsafe
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:213:
 not found: type DirectBuffer
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:216:
 not found: type DirectBuffer
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:236:
 not found: type DirectBuffer
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:26:
 Unused import
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:27:
 Unused import
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala:452:
 not found: value sun
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:26:
 not found: object sun
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:99:
 not found: type SignalHandler
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:99:
 not found: type Signal
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:83:
 not found: type Signal
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:108:
 not found: type SignalHandler
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:108:
 not found: value Signal
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:114:
 not found: type Signal
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:116:
 not found: value Signal
[ERROR] [Error] 
/Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:128:
 not found: value Signal
[ERROR] [Error] 

[spark] branch branch-3.5 updated: [SPARK-41636][SQL] Make sure `selectFilters` returns predicates in deterministic order

2023-08-06 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new c07eb181609 [SPARK-41636][SQL] Make sure `selectFilters` returns 
predicates in deterministic order
c07eb181609 is described below

commit c07eb18160952fab18a6d118ce2bb56e18d56741
Author: Jia Fan 
AuthorDate: Mon Aug 7 09:07:48 2023 +0900

[SPARK-41636][SQL] Make sure `selectFilters` returns predicates in 
deterministic order

### What changes were proposed in this pull request?
Method `DataSourceStrategy#selectFilters`, which is used to determine 
"pushdown-able" filters, does not preserve the order of the input 
Seq[Expression], nor does it return the same order across identical plans. This 
results in CodeGenerator cache misses even when the exact same LogicalPlan 
is executed.

This PR makes sure `selectFilters` returns predicates in a deterministic 
order.

### Why are the changes needed?
Make sure `selectFilters` returns predicates in a deterministic order, to 
reduce the probability of codegen cache misses.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a new test.

Closes #42265 from Hisoka-X/SPARK-41636_selectfilters_order.

Authored-by: Jia Fan 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 9462dcd0e996dd940d4970dc75482f7d088ac2ae)
Signed-off-by: Hyukjin Kwon 
---
 .../sql/execution/datasources/DataSourceStrategy.scala |  6 --
 .../execution/datasources/DataSourceStrategySuite.scala| 14 ++
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
index 5e6e0ad0392..94c2d2ffaca 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
@@ -19,6 +19,7 @@ package org.apache.spark.sql.execution.datasources
 
 import java.util.Locale
 
+import scala.collection.immutable.ListMap
 import scala.collection.mutable
 
 import org.apache.hadoop.fs.Path
@@ -670,9 +671,10 @@ object DataSourceStrategy
 // A map from original Catalyst expressions to corresponding translated 
data source filters.
 // If a predicate is not in this map, it means it cannot be pushed down.
 val supportNestedPredicatePushdown = 
DataSourceUtils.supportNestedPredicatePushdown(relation)
-val translatedMap: Map[Expression, Filter] = predicates.flatMap { p =>
+// SPARK-41636: we keep the order of the predicates to avoid CodeGenerator 
cache misses
+val translatedMap: Map[Expression, Filter] = ListMap(predicates.flatMap { 
p =>
   translateFilter(p, supportNestedPredicatePushdown).map(f => p -> f)
-}.toMap
+}: _*)
 
 val pushedFilters: Seq[Filter] = translatedMap.values.toSeq
 
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategySuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategySuite.scala
index a35fb5f6271..2b9ec97bace 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategySuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategySuite.scala
@@ -324,4 +324,18 @@ class DataSourceStrategySuite extends PlanTest with 
SharedSparkSession {
   DataSourceStrategy.translateFilter(catalystFilter, true)
 }
   }
+
+  test("SPARK-41636: selectFilters returns predicates in deterministic order") 
{
+
+val predicates = Seq(EqualTo($"id", 1), EqualTo($"id", 2),
+  EqualTo($"id", 3), EqualTo($"id", 4), EqualTo($"id", 5), EqualTo($"id", 
6))
+
+val (unhandledPredicates, pushedFilters, handledFilters) =
+  DataSourceStrategy.selectFilters(FakeRelation(), predicates)
+assert(unhandledPredicates.equals(predicates))
+assert(pushedFilters.zipWithIndex.forall { case (f, i) =>
+  f.equals(sources.EqualTo("id", i + 1))
+})
+assert(handledFilters.isEmpty)
+  }
 }





[spark] branch master updated (a640373fff3 -> 9462dcd0e99)

2023-08-06 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from a640373fff3 [SPARK-44629][PYTHON][DOCS] Publish PySpark Test 
Guidelines webpage
 add 9462dcd0e99 [SPARK-41636][SQL] Make sure `selectFilters` returns 
predicates in deterministic order

No new revisions were added by this update.

Summary of changes:
 .../sql/execution/datasources/DataSourceStrategy.scala |  6 --
 .../execution/datasources/DataSourceStrategySuite.scala| 14 ++
 2 files changed, 18 insertions(+), 2 deletions(-)





[spark] branch branch-3.5 updated: [SPARK-44629][PYTHON][DOCS] Publish PySpark Test Guidelines webpage

2023-08-06 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 36e7fe2e49c [SPARK-44629][PYTHON][DOCS] Publish PySpark Test 
Guidelines webpage
36e7fe2e49c is described below

commit 36e7fe2e49cbde10b0faedcf701a04543a5c156f
Author: Amanda Liu 
AuthorDate: Mon Aug 7 08:54:52 2023 +0900

[SPARK-44629][PYTHON][DOCS] Publish PySpark Test Guidelines webpage

### What changes were proposed in this pull request?
This PR adds a webpage to the Spark docs website, 
https://spark.apache.org/docs, to outline PySpark testing best practices.

### Why are the changes needed?
The changes are needed to provide PySpark end users with a guideline for 
how to use PySpark utils (introduced in SPARK-44629) to test PySpark code.

### Does this PR introduce _any_ user-facing change?
Yes, the PR publishes a webpage on the Spark website.

### How was this patch tested?
Existing tests
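
As a pointer to what the new page covers, a minimal test sketch using the
`pyspark.testing` helpers available in Spark 3.5 (illustrative, not an excerpt
from the published notebook):
```
# Minimal sketch of an equality assertion with the PySpark testing utilities.
from pyspark.sql import SparkSession
from pyspark.testing import assertDataFrameEqual

spark = SparkSession.builder.appName("testing-docs-example").getOrCreate()

expected = spark.createDataFrame([("Alice", 25), ("Bob", 35)], ["name", "age"])
actual = spark.createDataFrame([("Alice", 25), ("Bob", 35)], ["name", "age"])

assertDataFrameEqual(actual, expected)  # raises a detailed error on mismatch
```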

Closes #42284 from asl3/testing-guidelines.

Authored-by: Amanda Liu 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit a640373fff38f5c594e4e5c30587bcfe823dee1d)
Signed-off-by: Hyukjin Kwon 
---
 python/docs/source/getting_started/index.rst   |   1 +
 .../source/getting_started/testing_pyspark.ipynb   | 485 +
 2 files changed, 486 insertions(+)

diff --git a/python/docs/source/getting_started/index.rst 
b/python/docs/source/getting_started/index.rst
index 3c1c7d80863..5f6d306651b 100644
--- a/python/docs/source/getting_started/index.rst
+++ b/python/docs/source/getting_started/index.rst
@@ -40,3 +40,4 @@ The list below is the contents of this quickstart page:
quickstart_df
quickstart_connect
quickstart_ps
+   testing_pyspark
diff --git a/python/docs/source/getting_started/testing_pyspark.ipynb 
b/python/docs/source/getting_started/testing_pyspark.ipynb
new file mode 100644
index 000..268ace04376
--- /dev/null
+++ b/python/docs/source/getting_started/testing_pyspark.ipynb
@@ -0,0 +1,485 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "4ee2125b-f889-47e6-9c3d-8bd63a253683",
+   "metadata": {},
+   "source": [
+"# Testing PySpark\n",
+"\n",
+"This guide is a reference for writing robust tests for PySpark code.\n",
+"\n",
+"To view the docs for PySpark test utils, see here. To see the code for 
PySpark built-in test utils, check out the Spark repository here. To see the 
JIRA board tickets for the PySpark test framework, see here."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0e8ee4b6-9544-45e1-8a91-e71ed8ef8b9d",
+   "metadata": {},
+   "source": [
+"## Build a PySpark Application\n",
+"Here is an example for how to start a PySpark application. Feel free to 
skip to the next section, “Testing your PySpark Application,” if you already 
have an application you’re ready to test.\n",
+"\n",
+"First, start your Spark Session."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "9af4a35b-17e8-4e45-816b-34c14c5902f7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+"from pyspark.sql import SparkSession \n",
+"from pyspark.sql.functions import col \n",
+"\n",
+"# Create a SparkSession \n",
+"spark = SparkSession.builder.appName(\"Testing PySpark 
Example\").getOrCreate() "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4a4c6efe-91f5-4e18-b4b2-b0401c2368e4",
+   "metadata": {},
+   "source": [
+"Next, create a DataFrame."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "3b483dd8-3a76-41c6-9206-301d7ef314d6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+"sample_data = [{\"name\": \"JohnD.\", \"age\": 30}, \n",
+"  {\"name\": \"Alice   G.\", \"age\": 25}, \n",
+"  {\"name\": \"Bob  T.\", \"age\": 35}, \n",
+"  {\"name\": \"Eve   A.\", \"age\": 28}] \n",
+"\n",
+"df = spark.createDataFrame(sample_data)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e0f44333-0e08-470b-9fa2-38f59e3dbd63",
+   "metadata": {},
+   "source": [
+"Now, let’s define and apply a transformation function to our DataFrame."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "a6c0b766-af5f-4e1d-acf8-887d7cf0b0b2",
+   "metadata": {},
+   "outputs": [
+{
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+  "+---++\n",
+  "|age|name|\n",
+  "+---++\n",
+  "| 30| John D.|\n",
+  "| 25|Alice G.|\n",
+  "| 35|  Bob T.|\n",
+  "| 28|  Eve A.|\n",
+  "+---++\n",
+  "\n"
+ ]
+}
+   ],
+   "source": [
+"from pyspark.sql.functions import col, regexp_replace\n",
+"\n",
+"# Remove additional spaces in name\n",
+"def 

[spark] branch master updated: [SPARK-44629][PYTHON][DOCS] Publish PySpark Test Guidelines webpage

2023-08-06 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a640373fff3 [SPARK-44629][PYTHON][DOCS] Publish PySpark Test 
Guidelines webpage
a640373fff3 is described below

commit a640373fff38f5c594e4e5c30587bcfe823dee1d
Author: Amanda Liu 
AuthorDate: Mon Aug 7 08:54:52 2023 +0900

[SPARK-44629][PYTHON][DOCS] Publish PySpark Test Guidelines webpage

### What changes were proposed in this pull request?
This PR adds a webpage to the Spark docs website, 
https://spark.apache.org/docs, to outline PySpark testing best practices.

### Why are the changes needed?
The changes are needed to provide PySpark end users with a guideline for 
how to use PySpark utils (introduced in SPARK-44629) to test PySpark code.

### Does this PR introduce _any_ user-facing change?
Yes, the PR publishes a webpage on the Spark website.

### How was this patch tested?
Existing tests

Closes #42284 from asl3/testing-guidelines.

Authored-by: Amanda Liu 
Signed-off-by: Hyukjin Kwon 
---
 python/docs/source/getting_started/index.rst   |   1 +
 .../source/getting_started/testing_pyspark.ipynb   | 485 +
 2 files changed, 486 insertions(+)

diff --git a/python/docs/source/getting_started/index.rst 
b/python/docs/source/getting_started/index.rst
index 3c1c7d80863..5f6d306651b 100644
--- a/python/docs/source/getting_started/index.rst
+++ b/python/docs/source/getting_started/index.rst
@@ -40,3 +40,4 @@ The list below is the contents of this quickstart page:
quickstart_df
quickstart_connect
quickstart_ps
+   testing_pyspark
diff --git a/python/docs/source/getting_started/testing_pyspark.ipynb 
b/python/docs/source/getting_started/testing_pyspark.ipynb
new file mode 100644
index 000..268ace04376
--- /dev/null
+++ b/python/docs/source/getting_started/testing_pyspark.ipynb
@@ -0,0 +1,485 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "4ee2125b-f889-47e6-9c3d-8bd63a253683",
+   "metadata": {},
+   "source": [
+"# Testing PySpark\n",
+"\n",
+"This guide is a reference for writing robust tests for PySpark code.\n",
+"\n",
+"To view the docs for PySpark test utils, see here. To see the code for 
PySpark built-in test utils, check out the Spark repository here. To see the 
JIRA board tickets for the PySpark test framework, see here."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0e8ee4b6-9544-45e1-8a91-e71ed8ef8b9d",
+   "metadata": {},
+   "source": [
+"## Build a PySpark Application\n",
+"Here is an example for how to start a PySpark application. Feel free to 
skip to the next section, “Testing your PySpark Application,” if you already 
have an application you’re ready to test.\n",
+"\n",
+"First, start your Spark Session."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "9af4a35b-17e8-4e45-816b-34c14c5902f7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+"from pyspark.sql import SparkSession \n",
+"from pyspark.sql.functions import col \n",
+"\n",
+"# Create a SparkSession \n",
+"spark = SparkSession.builder.appName(\"Testing PySpark 
Example\").getOrCreate() "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4a4c6efe-91f5-4e18-b4b2-b0401c2368e4",
+   "metadata": {},
+   "source": [
+"Next, create a DataFrame."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "3b483dd8-3a76-41c6-9206-301d7ef314d6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+"sample_data = [{\"name\": \"JohnD.\", \"age\": 30}, \n",
+"  {\"name\": \"Alice   G.\", \"age\": 25}, \n",
+"  {\"name\": \"Bob  T.\", \"age\": 35}, \n",
+"  {\"name\": \"Eve   A.\", \"age\": 28}] \n",
+"\n",
+"df = spark.createDataFrame(sample_data)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e0f44333-0e08-470b-9fa2-38f59e3dbd63",
+   "metadata": {},
+   "source": [
+"Now, let’s define and apply a transformation function to our DataFrame."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "a6c0b766-af5f-4e1d-acf8-887d7cf0b0b2",
+   "metadata": {},
+   "outputs": [
+{
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+  "+---++\n",
+  "|age|name|\n",
+  "+---++\n",
+  "| 30| John D.|\n",
+  "| 25|Alice G.|\n",
+  "| 35|  Bob T.|\n",
+  "| 28|  Eve A.|\n",
+  "+---++\n",
+  "\n"
+ ]
+}
+   ],
+   "source": [
+"from pyspark.sql.functions import col, regexp_replace\n",
+"\n",
+"# Remove additional spaces in name\n",
+"def remove_extra_spaces(df, column_name):\n",
+"# Remove extra spaces from the specified column\n",
+"df_transformed = 

[spark] branch branch-3.5 updated: [SPARK-44634][SQL] Encoders.bean does no longer support nested beans with type arguments

2023-08-06 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new e3b031276e4 [SPARK-44634][SQL] Encoders.bean does no longer support 
nested beans with type arguments
e3b031276e4 is described below

commit e3b031276e4ab626f7db7b8d95f01a598e25a6b1
Author: Giambattista Bloisi 
AuthorDate: Sun Aug 6 21:47:57 2023 +0200

[SPARK-44634][SQL] Encoders.bean does no longer support nested beans with 
type arguments

### What changes were proposed in this pull request?
This PR fixes a regression introduced in Spark 3.4.x where Encoders.bean 
is no longer able to process nested beans having type arguments. For example:

```
class A<T> {
   T value;
   // value getter and setter
}

class B {
   A<String> stringHolder;
   // stringHolder getter and setter
}

Encoders.bean(B.class); // throws "SparkUnsupportedOperationException: 
[ENCODER_NOT_FOUND]..."
```

### Why are the changes needed?
The main match in JavaTypeInference.encoderFor does not handle the 
ParameterizedType and TypeVariable cases. I think this is a regression 
introduced after getting rid of the usage of Guava's TypeToken: [SPARK-42093 SQL 
Move JavaTypeInference to 
AgnosticEncoders](https://github.com/apache/spark/commit/18672003513d5a4aa610b6b94dbbc15c33185d3#diff-1191737b908340a2f4c22b71b1c40ebaa0da9d8b40c958089c346a3bda26943b)
 hvanhovell cloud-fan

In this PR I'm leveraging the commons-lang3 TypeUtils functionality to resolve 
ParameterizedType type arguments for classes.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests have been extended to check correct encoding of a nested 
bean having type arguments.

Closes #42327 from gbloisi-openaire/spark-44634.

Authored-by: Giambattista Bloisi 
Signed-off-by: Herman van Hovell 
(cherry picked from commit d6998979427b6ad3a0f16d6966b3927d40440a60)
Signed-off-by: Herman van Hovell 
---
 .../spark/sql/catalyst/JavaTypeInference.scala | 84 +-
 .../spark/sql/catalyst/JavaBeanWithGenerics.java   | 41 +++
 .../sql/catalyst/JavaTypeInferenceSuite.scala  |  4 ++
 3 files changed, 64 insertions(+), 65 deletions(-)

diff --git 
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala 
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala
index f352d28a7b5..3d536b735db 100644
--- 
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala
+++ 
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala
@@ -18,12 +18,14 @@ package org.apache.spark.sql.catalyst
 
 import java.beans.{Introspector, PropertyDescriptor}
 import java.lang.reflect.{ParameterizedType, Type, TypeVariable}
-import java.util.{ArrayDeque, List => JList, Map => JMap}
+import java.util.{List => JList, Map => JMap}
 import javax.annotation.Nonnull
 
-import scala.annotation.tailrec
+import scala.collection.JavaConverters._
 import scala.reflect.ClassTag
 
+import org.apache.commons.lang3.reflect.{TypeUtils => JavaTypeUtils}
+
 import org.apache.spark.sql.catalyst.encoders.AgnosticEncoder
 import org.apache.spark.sql.catalyst.encoders.AgnosticEncoders.{ArrayEncoder, 
BinaryEncoder, BoxedBooleanEncoder, BoxedByteEncoder, BoxedDoubleEncoder, 
BoxedFloatEncoder, BoxedIntEncoder, BoxedLongEncoder, BoxedShortEncoder, 
DayTimeIntervalEncoder, DEFAULT_JAVA_DECIMAL_ENCODER, EncoderField, 
IterableEncoder, JavaBeanEncoder, JavaBigIntEncoder, JavaEnumEncoder, 
LocalDateTimeEncoder, MapEncoder, PrimitiveBooleanEncoder, 
PrimitiveByteEncoder, PrimitiveDoubleEncoder, PrimitiveFloatEncoder, P [...]
 import org.apache.spark.sql.errors.ExecutionErrors
@@ -57,7 +59,8 @@ object JavaTypeInference {
 encoderFor(beanType, Set.empty).asInstanceOf[AgnosticEncoder[T]]
   }
 
-  private def encoderFor(t: Type, seenTypeSet: Set[Class[_]]): 
AgnosticEncoder[_] = t match {
+  private def encoderFor(t: Type, seenTypeSet: Set[Class[_]],
+typeVariables: Map[TypeVariable[_], Type] = Map.empty): AgnosticEncoder[_] 
= t match {
 
 case c: Class[_] if c == java.lang.Boolean.TYPE => PrimitiveBooleanEncoder
 case c: Class[_] if c == java.lang.Byte.TYPE => PrimitiveByteEncoder
@@ -101,18 +104,24 @@ object JavaTypeInference {
   UDTEncoder(udt, udt.getClass)
 
 case c: Class[_] if c.isArray =>
-  val elementEncoder = encoderFor(c.getComponentType, seenTypeSet)
+  val elementEncoder = encoderFor(c.getComponentType, seenTypeSet, 
typeVariables)
   ArrayEncoder(elementEncoder, elementEncoder.nullable)
 
-case ImplementsList(c, Array(elementCls)) =>
-  val element = encoderFor(elementCls, seenTypeSet)
+case c: Class[_] if classOf[JList[_]].isAssignableFrom(c) =>
+  

[spark] branch master updated (74ae1e3434c -> d6998979427)

2023-08-06 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 74ae1e3434c [SPARK-42500][SQL] ConstantPropagation support more case
 add d6998979427 [SPARK-44634][SQL] Encoders.bean does no longer support 
nested beans with type arguments

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/JavaTypeInference.scala | 84 +-
 .../spark/sql/catalyst/JavaBeanWithGenerics.java}  | 29 
 .../sql/catalyst/JavaTypeInferenceSuite.scala  |  4 ++
 3 files changed, 39 insertions(+), 78 deletions(-)
 copy 
sql/{hive/src/test/java/org/apache/spark/sql/hive/execution/UDFListString.java 
=> 
catalyst/src/test/java/org/apache/spark/sql/catalyst/JavaBeanWithGenerics.java} 
(67%)





[spark] branch master updated (41a2a7daeee -> 74ae1e3434c)

2023-08-06 Thread yumwang
This is an automated email from the ASF dual-hosted git repository.

yumwang pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 41a2a7daeee [SPARK-44650][CORE] `spark.executor.defaultJavaOptions` 
Check illegal java options
 add 74ae1e3434c [SPARK-42500][SQL] ConstantPropagation support more case

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/optimizer/expressions.scala | 37 ++
 .../optimizer/ConstantPropagationSuite.scala   | 32 +--
 2 files changed, 47 insertions(+), 22 deletions(-)





[spark] branch master updated: [SPARK-44650][CORE] `spark.executor.defaultJavaOptions` Check illegal java options

2023-08-06 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 41a2a7daeee [SPARK-44650][CORE] `spark.executor.defaultJavaOptions` 
Check illegal java options
41a2a7daeee is described below

commit 41a2a7daeee0a25d39f30364a694becf54ab37e7
Author: sychen 
AuthorDate: Sun Aug 6 08:24:40 2023 -0500

[SPARK-44650][CORE] `spark.executor.defaultJavaOptions` Check illegal java 
options

### What changes were proposed in this pull request?

### Why are the changes needed?
Command
```bash
 ./bin/spark-shell --conf spark.executor.extraJavaOptions='-Dspark.foo=bar'
```
Error
```
spark.executor.extraJavaOptions is not allowed to set Spark options (was 
'-Dspark.foo=bar'). Set them directly on a SparkConf or in a properties file 
when using ./bin/spark-submit.
```

Command
```bash
./bin/spark-shell --conf spark.executor.defaultJavaOptions='-Dspark.foo=bar'
```
Start up normally.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
local test & add UT

```
./bin/spark-shell --conf spark.executor.defaultJavaOptions='-Dspark.foo=bar'
```

```
spark.executor.defaultJavaOptions is not allowed to set Spark options (was 
'-Dspark.foo=bar'). Set them directly on a SparkConf or in a properties file 
when using ./bin/spark-submit.
```

Closes #42313 from cxzl25/SPARK-44650.

Authored-by: sychen 
Signed-off-by: Sean Owen 
---
 .../main/scala/org/apache/spark/SparkConf.scala| 25 +++---
 .../scala/org/apache/spark/SparkConfSuite.scala| 14 
 2 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/SparkConf.scala 
b/core/src/main/scala/org/apache/spark/SparkConf.scala
index 813a14acd19..8c054d24b10 100644
--- a/core/src/main/scala/org/apache/spark/SparkConf.scala
+++ b/core/src/main/scala/org/apache/spark/SparkConf.scala
@@ -503,8 +503,6 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable 
with Logging with Seria
   logWarning(msg)
 }
 
-val executorOptsKey = EXECUTOR_JAVA_OPTIONS.key
-
 // Used by Yarn in 1.1 and before
 sys.props.get("spark.driver.libraryPath").foreach { value =>
   val warning =
@@ -518,16 +516,19 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable 
with Logging with Seria
 }
 
 // Validate spark.executor.extraJavaOptions
-getOption(executorOptsKey).foreach { javaOpts =>
-  if (javaOpts.contains("-Dspark")) {
-val msg = s"$executorOptsKey is not allowed to set Spark options (was 
'$javaOpts'). " +
-  "Set them directly on a SparkConf or in a properties file when using 
./bin/spark-submit."
-throw new Exception(msg)
-  }
-  if (javaOpts.contains("-Xmx")) {
-val msg = s"$executorOptsKey is not allowed to specify max heap memory 
settings " +
-  s"(was '$javaOpts'). Use spark.executor.memory instead."
-throw new Exception(msg)
+Seq(EXECUTOR_JAVA_OPTIONS.key, 
"spark.executor.defaultJavaOptions").foreach { executorOptsKey =>
+  getOption(executorOptsKey).foreach { javaOpts =>
+if (javaOpts.contains("-Dspark")) {
+  val msg = s"$executorOptsKey is not allowed to set Spark options 
(was '$javaOpts'). " +
+"Set them directly on a SparkConf or in a properties file " +
+"when using ./bin/spark-submit."
+  throw new Exception(msg)
+}
+if (javaOpts.contains("-Xmx")) {
+  val msg = s"$executorOptsKey is not allowed to specify max heap 
memory settings " +
+s"(was '$javaOpts'). Use spark.executor.memory instead."
+  throw new Exception(msg)
+}
   }
 }
 
diff --git a/core/src/test/scala/org/apache/spark/SparkConfSuite.scala 
b/core/src/test/scala/org/apache/spark/SparkConfSuite.scala
index 74fd7816221..75e22e1418b 100644
--- a/core/src/test/scala/org/apache/spark/SparkConfSuite.scala
+++ b/core/src/test/scala/org/apache/spark/SparkConfSuite.scala
@@ -498,6 +498,20 @@ class SparkConfSuite extends SparkFunSuite with 
LocalSparkContext with ResetSyst
 }
 }
   }
+
+  test("SPARK-44650: spark.executor.defaultJavaOptions Check illegal java 
options") {
+val conf = new SparkConf()
+conf.validateSettings()
+conf.set(EXECUTOR_JAVA_OPTIONS.key, "-Dspark.foo=bar")
+intercept[Exception] {
+  conf.validateSettings()
+}
+conf.remove(EXECUTOR_JAVA_OPTIONS.key)
+conf.set("spark.executor.defaultJavaOptions", "-Dspark.foo=bar")
+intercept[Exception] {
+  conf.validateSettings()
+}
+  }
 }
 
 class Class1 {}



[spark] branch master updated: [SPARK-44688][INFRA] Add a file existence check before executing `free_disk_space` and `free_disk_space_container`

2023-08-06 Thread yangjie01
This is an automated email from the ASF dual-hosted git repository.

yangjie01 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d264ee37f31 [SPARK-44688][INFRA] Add a file existence check before 
executing `free_disk_space` and `free_disk_space_container`
d264ee37f31 is described below

commit d264ee37f316b32bf37fff770b72b5841b84f7cc
Author: yangjie01 
AuthorDate: Sun Aug 6 17:41:11 2023 +0800

[SPARK-44688][INFRA] Add a file existence check before executing 
`free_disk_space` and `free_disk_space_container`

### What changes were proposed in this pull request?
This PR adds a file existence check before executing `dev/free_disk_space` 
and `dev/free_disk_space_container`.

### Why are the changes needed?
We added `free_disk_space` and `free_disk_space_container` to clean up the 
disk, but because the daily tests of other branches share the yml file with the 
master branch, we should check that the file exists before executing it; 
otherwise the daily tests of those branches would break.

- branch-3.5: https://github.com/apache/spark/actions/runs/5761479443
- branch-3.4: https://github.com/apache/spark/actions/runs/5760423900
- branch-3.3: https://github.com/apache/spark/actions/runs/5759384052

Screenshot: https://github.com/apache/spark/assets/1475305/6e46b34b-645a-4da5-b9c3-8a89bfacabcb

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass GitHub Action

Closes #42359 from LuciferYang/test-free_disk_space-exist.

Authored-by: yangjie01 
Signed-off-by: yangjie01 
---
 .github/workflows/build_and_test.yml | 19 +++
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 04585481a9c..cd68c0904d9 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -241,7 +241,10 @@ jobs:
 restore-keys: |
   ${{ matrix.java }}-${{ matrix.hadoop }}-coursier-
 - name: Free up disk space
-  run: ./dev/free_disk_space
+  run: |
+if [ -f ./dev/free_disk_space ]; then
+  ./dev/free_disk_space
+fi
 - name: Install Java ${{ matrix.java }}
   uses: actions/setup-java@v3
   with:
@@ -419,7 +422,9 @@ jobs:
   # uninstall libraries dedicated for ML testing
   python3.9 -m pip uninstall -y torch torchvision torcheval torchtnt 
tensorboard mlflow
 fi
-./dev/free_disk_space_container
+if [ -f ./dev/free_disk_space_container ]; then
+  ./dev/free_disk_space_container
+fi
 - name: Install Java ${{ matrix.java }}
   uses: actions/setup-java@v3
   with:
@@ -519,7 +524,10 @@ jobs:
 restore-keys: |
   sparkr-coursier-
 - name: Free up disk space
-  run: ./dev/free_disk_space_container
+  run: |
+if [ -f ./dev/free_disk_space_container ]; then
+  ./dev/free_disk_space_container
+fi
 - name: Install Java ${{ inputs.java }}
   uses: actions/setup-java@v3
   with:
@@ -629,7 +637,10 @@ jobs:
 restore-keys: |
   docs-maven-
 - name: Free up disk space
-  run: ./dev/free_disk_space_container
+  run: |
+if [ -f ./dev/free_disk_space_container ]; then
+  ./dev/free_disk_space_container
+fi
 - name: Install Java 8
   uses: actions/setup-java@v3
   with:

