[spark] branch branch-3.4 updated: [SPARK-44557][INFRA] Clean up untracked/ignored files before running pip packaging test in GitHub Actions

2023-07-26 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 4f9d6d077ff [SPARK-44557][INFRA] Clean up untracked/ignored files 
before running pip packaging test in GitHub Actions
4f9d6d077ff is described below

commit 4f9d6d077ff1d3f359774d21d32833cb80a193b7
Author: Hyukjin Kwon 
AuthorDate: Thu Jul 27 14:08:09 2023 +0900

[SPARK-44557][INFRA] Clean up untracked/ignored files before running pip 
packaging test in GitHub Actions

### What changes were proposed in this pull request?

This PR proposes to remove untracked/ignored files before running pip 
packaging test in GitHub Actions.

### Why are the changes needed?

In order to fix the flakiness in the test such as:

```
...
creating dist
Creating tar archive
error: [Errno 28] No space left on device
Cleaning up temporary directory - /tmp/tmp.CvSzgB7Kyy
```

See also 
https://github.com/apache/spark/actions/runs/5665869112/job/15351515539.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

GitHub Actions build in this PR.

Closes #42159 from HyukjinKwon/debug-ci-failure.

Lead-authored-by: Hyukjin Kwon 
Co-authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 77aab6edb7c2b7be54cb19d40c0eeeb2c699f0f7)
Signed-off-by: Hyukjin Kwon 
---
 dev/run-pip-tests | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/dev/run-pip-tests b/dev/run-pip-tests
index 5da231495a2..773611d9d92 100755
--- a/dev/run-pip-tests
+++ b/dev/run-pip-tests
@@ -25,6 +25,13 @@ shopt -s nullglob
 FWDIR="$(cd "$(dirname "$0")"/..; pwd)"
 cd "$FWDIR"
 
+# Clean ignored/untracked files that do not need
+# for pip packaging test. Machines in GitHub Action do not have
+# enough space, see also SPARK-44557.
+if [[ ! -z "${GITHUB_ACTIONS}" ]]; then
+  git clean -d -f -x -e assembly
+fi
+
 echo "Constructing virtual env for testing"
 VIRTUALENV_BASE=$(mktemp -d)
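For context on the hunk above: the cleanup only runs when the `GITHUB_ACTIONS` environment variable is set (GitHub-hosted runners set it automatically). A minimal Python sketch of the same gate; the helper name is illustrative and not part of the patch, which implements this directly in `dev/run-pip-tests`:

```py
import os
import subprocess

def clean_workspace_if_on_github_actions() -> None:
    """Illustrative equivalent of the guard added to dev/run-pip-tests."""
    if os.environ.get("GITHUB_ACTIONS"):
        # -d: also remove untracked directories, -f: force, -x: include
        # ignored files, -e assembly: exclude the assembly directory from cleaning.
        subprocess.run(["git", "clean", "-d", "-f", "-x", "-e", "assembly"], check=True)

if __name__ == "__main__":
    clean_workspace_if_on_github_actions()
```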
 





[spark] branch branch-3.5 updated: [SPARK-44557][INFRA] Clean up untracked/ignored files before running pip packaging test in GitHub Actions

2023-07-26 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 7539d1992ca [SPARK-44557][INFRA] Clean up untracked/ignored files 
before running pip packaging test in GitHub Actions
7539d1992ca is described below

commit 7539d1992ca2c01988c440c3a0706a23b9112e73
Author: Hyukjin Kwon 
AuthorDate: Thu Jul 27 14:08:09 2023 +0900

[SPARK-44557][INFRA] Clean up untracked/ignored files before running pip 
packaging test in GitHub Actions

### What changes were proposed in this pull request?

This PR proposes to remove untracked/ignored files before running pip 
packaging test in GitHub Actions.

### Why are the changes needed?

In order to fix the flakiness in the test such as:

```
...
creating dist
Creating tar archive
error: [Errno 28] No space left on device
Cleaning up temporary directory - /tmp/tmp.CvSzgB7Kyy
```

See also 
https://github.com/apache/spark/actions/runs/5665869112/job/15351515539.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

GitHub Actions build in this PR.

Closes #42159 from HyukjinKwon/debug-ci-failure.

Lead-authored-by: Hyukjin Kwon 
Co-authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 77aab6edb7c2b7be54cb19d40c0eeeb2c699f0f7)
Signed-off-by: Hyukjin Kwon 
---
 dev/run-pip-tests | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/dev/run-pip-tests b/dev/run-pip-tests
index 5da231495a2..773611d9d92 100755
--- a/dev/run-pip-tests
+++ b/dev/run-pip-tests
@@ -25,6 +25,13 @@ shopt -s nullglob
 FWDIR="$(cd "$(dirname "$0")"/..; pwd)"
 cd "$FWDIR"
 
+# Clean ignored/untracked files that do not need
+# for pip packaging test. Machines in GitHub Action do not have
+# enough space, see also SPARK-44557.
+if [[ ! -z "${GITHUB_ACTIONS}" ]]; then
+  git clean -d -f -x -e assembly
+fi
+
 echo "Constructing virtual env for testing"
 VIRTUALENV_BASE=$(mktemp -d)
 





[spark] branch master updated: [SPARK-44557][INFRA] Clean up untracked/ignored files before running pip packaging test in GitHub Actions

2023-07-26 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 77aab6edb7c [SPARK-44557][INFRA] Clean up untracked/ignored files 
before running pip packaging test in GitHub Actions
77aab6edb7c is described below

commit 77aab6edb7c2b7be54cb19d40c0eeeb2c699f0f7
Author: Hyukjin Kwon 
AuthorDate: Thu Jul 27 14:08:09 2023 +0900

[SPARK-44557][INFRA] Clean up untracked/ignored files before running pip 
packaging test in GitHub Actions

### What changes were proposed in this pull request?

This PR proposes to remove untracked/ignored files before running pip 
packaging test in GitHub Actions.

### Why are the changes needed?

In order to fix the flakiness in the test such as:

```
...
creating dist
Creating tar archive
error: [Errno 28] No space left on device
Cleaning up temporary directory - /tmp/tmp.CvSzgB7Kyy
```

See also 
https://github.com/apache/spark/actions/runs/5665869112/job/15351515539.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

GitHub Actions build in this PR.

Closes #42159 from HyukjinKwon/debug-ci-failure.

Lead-authored-by: Hyukjin Kwon 
Co-authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 dev/run-pip-tests | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/dev/run-pip-tests b/dev/run-pip-tests
index 5da231495a2..773611d9d92 100755
--- a/dev/run-pip-tests
+++ b/dev/run-pip-tests
@@ -25,6 +25,13 @@ shopt -s nullglob
 FWDIR="$(cd "$(dirname "$0")"/..; pwd)"
 cd "$FWDIR"
 
+# Clean ignored/untracked files that do not need
+# for pip packaging test. Machines in GitHub Action do not have
+# enough space, see also SPARK-44557.
+if [[ ! -z "${GITHUB_ACTIONS}" ]]; then
+  git clean -d -f -x -e assembly
+fi
+
 echo "Constructing virtual env for testing"
 VIRTUALENV_BASE=$(mktemp -d)
 





[spark] branch master updated: [SPARK-44533][PYTHON] Add support for accumulator, broadcast, and Spark files in Python UDTF's analyze

2023-07-26 Thread ueshin
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8647b243dee [SPARK-44533][PYTHON] Add support for accumulator, 
broadcast, and Spark files in Python UDTF's analyze
8647b243dee is described below

commit 8647b243deed8f2c3279ed17fe196006b6c923af
Author: Takuya UESHIN 
AuthorDate: Wed Jul 26 21:03:08 2023 -0700

[SPARK-44533][PYTHON] Add support for accumulator, broadcast, and Spark 
files in Python UDTF's analyze

### What changes were proposed in this pull request?

Adds support for `accumulator`, `broadcast` in vanilla PySpark, and Spark 
files in both vanilla PySpark and Spark Connect Python client, in Python UDTF's 
analyze.

For example, in vanilla PySpark:

```py
>>> colname = sc.broadcast("col1")
>>> test_accum = sc.accumulator(0)

>>> @udtf
... class TestUDTF:
...     @staticmethod
...     def analyze(a: AnalyzeArgument) -> AnalyzeResult:
...         test_accum.add(1)
...         return AnalyzeResult(StructType().add(colname.value, a.data_type))
...     def eval(self, a):
...         test_accum.add(1)
...         yield a,
...
>>> df = TestUDTF(lit(10))
>>> df.printSchema()
root
 |-- col1: integer (nullable = true)

>>> df.show()
+----+
|col1|
+----+
|  10|
+----+

>>> test_accum.value
2
```

or

```py
>>> pyfile_path = "my_pyfile.py"
>>> with open(pyfile_path, "w") as f:
...     f.write("my_func = lambda: 'col1'")
...
24
>>> sc.addPyFile(pyfile_path)
>>> # or spark.addArtifacts(pyfile_path, pyfile=True)
>>>
>>> @udtf
... class TestUDTF:
...     @staticmethod
...     def analyze(a: AnalyzeArgument) -> AnalyzeResult:
...         import my_pyfile
...         return AnalyzeResult(StructType().add(my_pyfile.my_func(), a.data_type))
...     def eval(self, a):
...         yield a,
...
>>> df = TestUDTF(lit(10))
>>> df.printSchema()
root
 |-- col1: integer (nullable = true)

>>> df.show()
+----+
|col1|
+----+
|  10|
+----+
```

### Why are the changes needed?

To support missing features: `accumulator`, `broadcast`, and Spark files in 
Python UDTF's analyze.

### Does this PR introduce _any_ user-facing change?

Yes, accumulator, broadcast in vanilla PySpark, and Spark files in both 
vanilla PySpark and Spark Connect Python client will be available.

### How was this patch tested?

Added related tests.

Closes #42135 from ueshin/issues/SPARK-44533/analyze.

Authored-by: Takuya UESHIN 
Signed-off-by: Takuya UESHIN 
---
 .../org/apache/spark/api/python/PythonRDD.scala|   4 +-
 .../org/apache/spark/api/python/PythonRunner.scala |  83 +---
 .../spark/api/python/PythonWorkerUtils.scala   | 152 ++
 .../pyspark/sql/tests/connect/test_parity_udtf.py  |  19 ++
 python/pyspark/sql/tests/test_udtf.py  | 224 -
 python/pyspark/sql/worker/analyze_udtf.py  |  18 +-
 python/pyspark/worker.py   |  91 ++---
 python/pyspark/worker_util.py  | 132 
 .../execution/python/BatchEvalPythonUDTFExec.scala |   7 +-
 .../python/UserDefinedPythonFunction.scala |  23 ++-
 10 files changed, 584 insertions(+), 169 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala 
b/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
index 95fbc145d83..91fd92d4422 100644
--- a/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
+++ b/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
@@ -487,9 +487,7 @@ private[spark] object PythonRDD extends Logging {
   }
 
   def writeUTF(str: String, dataOut: DataOutputStream): Unit = {
-val bytes = str.getBytes(StandardCharsets.UTF_8)
-dataOut.writeInt(bytes.length)
-dataOut.write(bytes)
+PythonWorkerUtils.writeUTF(str, dataOut)
   }
 
   /**
diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala 
b/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala
index 5d719b33a30..0173de75ff2 100644
--- a/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala
+++ b/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala
@@ -309,8 +309,9 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
 val dataOut = new DataOutputStream(stream)
 // Partition index
 dataOut.writeInt(partitionIndex)
-// Python version of driver
-PythonRDD.writeUTF(pythonVer, dataOut)
+
+PythonWorkerUtils.writePythonVersion(pythonVer, dataOut)
+
 // Init a ServerSocket to accept 
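An aside on the `writeUTF` hunk above: the helper frames a string as a 4-byte big-endian length (`writeInt`) followed by the UTF-8 bytes, and the new `PythonWorkerUtils.writeUTF` simply centralizes that logic. A minimal Python sketch of the same framing; the function names here are illustrative, not PySpark APIs:

```py
import io
import struct

def write_utf(s: str, out) -> None:
    # 4-byte big-endian length, then the raw UTF-8 bytes,
    # matching DataOutputStream.writeInt + write on the JVM side.
    data = s.encode("utf-8")
    out.write(struct.pack(">i", len(data)))
    out.write(data)

def read_utf(inp) -> str:
    # The worker-side counterpart: read the length, then that many bytes.
    (length,) = struct.unpack(">i", inp.read(4))
    return inp.read(length).decode("utf-8")

buf = io.BytesIO()
write_utf("3.12", buf)   # e.g. the driver's Python version string
buf.seek(0)
assert read_utf(buf) == "3.12"
```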

[spark] branch master updated: [SPARK-43611][SQL][PS][CONNECT] Make `ExtractWindowExpressions` retain the `PLAN_ID_TAG`

2023-07-26 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 185a0a5a239 [SPARK-43611][SQL][PS][CONNECT] Make 
`ExtractWindowExpressions` retain the `PLAN_ID_TAG`
185a0a5a239 is described below

commit 185a0a5a23958676e4236eaf9e4d78cdfd2dd2d7
Author: Ruifeng Zheng 
AuthorDate: Thu Jul 27 11:00:18 2023 +0800

[SPARK-43611][SQL][PS][CONNECT] Make `ExtractWindowExpressions` retain the 
`PLAN_ID_TAG`

### What changes were proposed in this pull request?
Make the rule `ExtractWindowExpressions` retain the `PLAN_ID_TAG`.

### Why are the changes needed?
In https://github.com/apache/spark/pull/39925, we introduced a new 
mechanism to resolve expressions against a specified plan.

However, the plan ID can sometimes be discarded by analyzer rules, and then 
some expressions cannot be correctly resolved; this issue is the main 
blocker for pandas API on Spark (PS) on Connect.

### Does this PR introduce _any_ user-facing change?
Yes, a lot of Pandas APIs are enabled.

### How was this patch tested?
Enable UTs

Closes #42086 from zhengruifeng/ps_connect_analyze_window.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../computation/test_parity_missing_data.py|  30 --
 .../tests/connect/series/test_parity_compute.py|  16 ---
 .../tests/connect/series/test_parity_cumulative.py |  25 +
 .../tests/connect/series/test_parity_index.py  |   7 +-
 .../connect/series/test_parity_missing_data.py |  35 +-
 .../tests/connect/series/test_parity_stat.py   |  11 +-
 .../pandas/tests/connect/test_parity_ewm.py|  12 +--
 .../pandas/tests/connect/test_parity_expanding.py  | 120 +
 .../test_parity_ops_on_diff_frames_groupby.py  |  48 +
 ..._parity_ops_on_diff_frames_groupby_expanding.py |  42 +---
 .../pandas/tests/connect/test_parity_rolling.py| 120 +
 .../spark/sql/catalyst/analysis/Analyzer.scala |  12 ++-
 12 files changed, 21 insertions(+), 457 deletions(-)

diff --git 
a/python/pyspark/pandas/tests/connect/computation/test_parity_missing_data.py 
b/python/pyspark/pandas/tests/connect/computation/test_parity_missing_data.py
index a88c8692eca..d2ff09e5e8a 100644
--- 
a/python/pyspark/pandas/tests/connect/computation/test_parity_missing_data.py
+++ 
b/python/pyspark/pandas/tests/connect/computation/test_parity_missing_data.py
@@ -29,36 +29,6 @@ class FrameParityMissingDataTests(
 def psdf(self):
 return ps.from_pandas(self.pdf)
 
-@unittest.skip(
-"TODO(SPARK-43611): Fix unexpected `AnalysisException` from Spark 
Connect client."
-)
-def test_backfill(self):
-super().test_backfill()
-
-@unittest.skip(
-"TODO(SPARK-43611): Fix unexpected `AnalysisException` from Spark 
Connect client."
-)
-def test_bfill(self):
-super().test_bfill()
-
-@unittest.skip(
-"TODO(SPARK-43611): Fix unexpected `AnalysisException` from Spark 
Connect client."
-)
-def test_ffill(self):
-super().test_ffill()
-
-@unittest.skip(
-"TODO(SPARK-43611): Fix unexpected `AnalysisException` from Spark 
Connect client."
-)
-def test_fillna(self):
-return super().test_fillna()
-
-@unittest.skip(
-"TODO(SPARK-43611): Fix unexpected `AnalysisException` from Spark 
Connect client."
-)
-def test_pad(self):
-super().test_pad()
-
 
 if __name__ == "__main__":
 from pyspark.pandas.tests.connect.computation.test_parity_missing_data 
import *  # noqa: F401
diff --git a/python/pyspark/pandas/tests/connect/series/test_parity_compute.py 
b/python/pyspark/pandas/tests/connect/series/test_parity_compute.py
index 00e35b27e8f..f757d19ca69 100644
--- a/python/pyspark/pandas/tests/connect/series/test_parity_compute.py
+++ b/python/pyspark/pandas/tests/connect/series/test_parity_compute.py
@@ -22,22 +22,6 @@ from pyspark.testing.pandasutils import 
PandasOnSparkTestUtils
 
 
 class SeriesParityComputeTests(SeriesComputeMixin, PandasOnSparkTestUtils, 
ReusedConnectTestCase):
-@unittest.skip(
-"TODO(SPARK-43611): Fix unexpected `AnalysisException` from Spark 
Connect client."
-)
-def test_diff(self):
-super().test_diff()
-
-@unittest.skip("TODO(SPARK-43620): Support `Column` for 
SparkConnectColumn.__getitem__.")
-def test_factorize(self):
-super().test_factorize()
-
-@unittest.skip(
-"TODO(SPARK-43611): Fix unexpected `AnalysisException` from Spark 
Connect client."
-)
-def test_shift(self):
-super().test_shift()
-
 @unittest.skip(
 "TODO(SPARK-43611): Fix unexpected `AnalysisException` from Spark 
Connect client."
 )
diff --git 

[spark] branch branch-3.4 updated: [SPARK-44479][CONNECT][PYTHON] Fix protobuf conversion from an empty struct type

2023-07-26 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 135bb497eda [SPARK-44479][CONNECT][PYTHON] Fix protobuf conversion 
from an empty struct type
135bb497eda is described below

commit 135bb497edad8d132323257547c86bd405ecba8e
Author: Takuya UESHIN 
AuthorDate: Thu Jul 27 10:10:44 2023 +0900

[SPARK-44479][CONNECT][PYTHON] Fix protobuf conversion from an empty struct 
type

### What changes were proposed in this pull request?

This is a partial backport of #42161.

Fixes protobuf conversion from an empty struct type.

### Why are the changes needed?

The empty struct type was not properly converted to the protobuf message.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #42179 from ueshin/issues/SPARK-44479/3.4/empty_schema.

Authored-by: Takuya UESHIN 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/connect/types.py | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/sql/connect/types.py 
b/python/pyspark/sql/connect/types.py
index dfb0fb5303f..be8d62c805f 100644
--- a/python/pyspark/sql/connect/types.py
+++ b/python/pyspark/sql/connect/types.py
@@ -155,6 +155,7 @@ def pyspark_types_to_proto_types(data_type: DataType) -> 
pb2.DataType:
 ret.day_time_interval.start_field = data_type.startField
 ret.day_time_interval.end_field = data_type.endField
 elif isinstance(data_type, StructType):
+struct = pb2.DataType.Struct()
 for field in data_type.fields:
 struct_field = pb2.DataType.StructField()
 struct_field.name = field.name
@@ -162,7 +163,8 @@ def pyspark_types_to_proto_types(data_type: DataType) -> 
pb2.DataType:
 struct_field.nullable = field.nullable
 if field.metadata is not None and len(field.metadata) > 0:
 struct_field.metadata = json.dumps(field.metadata)
-ret.struct.fields.append(struct_field)
+struct.fields.append(struct_field)
+ret.struct.CopyFrom(struct)
 elif isinstance(data_type, MapType):
 
ret.map.key_type.CopyFrom(pyspark_types_to_proto_types(data_type.keyType))
 
ret.map.value_type.CopyFrom(pyspark_types_to_proto_types(data_type.valueType))
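A quick way to observe the fixed behavior, assuming a PySpark build that includes this patch and the Spark Connect Python dependencies (a usage sketch, not part of the change):

```py
from pyspark.sql.types import StructType
from pyspark.sql.connect.types import pyspark_types_to_proto_types

# Before the fix, an empty schema left the proto's `struct` field unset because
# the per-field loop never ran; building the Struct message separately and
# CopyFrom-ing it populates the field even with zero fields.
proto = pyspark_types_to_proto_types(StructType())
print(proto.HasField("struct"))  # True: the empty struct type now round-trips
```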





[spark] branch branch-3.5 updated: [SPARK-44479][PYTHON][3.5] Fix ArrowStreamPandasUDFSerializer to accept no-column pandas DataFrame

2023-07-26 Thread ueshin
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 803b2854a9e [SPARK-44479][PYTHON][3.5] Fix 
ArrowStreamPandasUDFSerializer to accept no-column pandas DataFrame
803b2854a9e is described below

commit 803b2854a9e82aee4e5691c4a9a697856b963377
Author: Takuya UESHIN 
AuthorDate: Wed Jul 26 17:54:38 2023 -0700

[SPARK-44479][PYTHON][3.5] Fix ArrowStreamPandasUDFSerializer to accept 
no-column pandas DataFrame

### What changes were proposed in this pull request?

Fixes `ArrowStreamPandasUDFSerializer` to accept no-column pandas DataFrame.

```py
>>> def _scalar_f(id):
...   return pd.DataFrame(index=id)
...
>>> scalar_f = pandas_udf(_scalar_f, returnType=StructType())
>>> df = spark.range(3).withColumn("f", scalar_f(col("id")))
>>> df.printSchema()
root
 |-- id: long (nullable = false)
 |-- f: struct (nullable = true)

>>> df.show()
+---+---+
| id|  f|
+---+---+
|  0| {}|
|  1| {}|
|  2| {}|
+---+---+
```

### Why are the changes needed?

The above query fails with the following error:

```py
>>> df.show()
org.apache.spark.api.python.PythonException: Traceback (most recent call 
last):
...
ValueError: not enough values to unpack (expected 2, got 0)
```

### Does this PR introduce _any_ user-facing change?

Yes, Pandas UDF will accept no-column pandas DataFrame.

### How was this patch tested?

Added related tests.

Closes #42176 from ueshin/issues/SPARK-44479/3.5/empty_schema.

Authored-by: Takuya UESHIN 
Signed-off-by: Takuya UESHIN 
---
 python/pyspark/sql/connect/types.py|  4 ++-
 python/pyspark/sql/pandas/serializers.py   | 31 --
 .../sql/tests/pandas/test_pandas_udf_scalar.py | 23 +++-
 python/pyspark/sql/tests/test_udtf.py  | 11 ++--
 4 files changed, 45 insertions(+), 24 deletions(-)

diff --git a/python/pyspark/sql/connect/types.py 
b/python/pyspark/sql/connect/types.py
index 2a21cdf0675..0db2833d2c1 100644
--- a/python/pyspark/sql/connect/types.py
+++ b/python/pyspark/sql/connect/types.py
@@ -170,6 +170,7 @@ def pyspark_types_to_proto_types(data_type: DataType) -> 
pb2.DataType:
 ret.year_month_interval.start_field = data_type.startField
 ret.year_month_interval.end_field = data_type.endField
 elif isinstance(data_type, StructType):
+struct = pb2.DataType.Struct()
 for field in data_type.fields:
 struct_field = pb2.DataType.StructField()
 struct_field.name = field.name
@@ -177,7 +178,8 @@ def pyspark_types_to_proto_types(data_type: DataType) -> 
pb2.DataType:
 struct_field.nullable = field.nullable
 if field.metadata is not None and len(field.metadata) > 0:
 struct_field.metadata = json.dumps(field.metadata)
-ret.struct.fields.append(struct_field)
+struct.fields.append(struct_field)
+ret.struct.CopyFrom(struct)
 elif isinstance(data_type, MapType):
 
ret.map.key_type.CopyFrom(pyspark_types_to_proto_types(data_type.keyType))
 
ret.map.value_type.CopyFrom(pyspark_types_to_proto_types(data_type.valueType))
diff --git a/python/pyspark/sql/pandas/serializers.py 
b/python/pyspark/sql/pandas/serializers.py
index f22a73cbbef..1d326928e23 100644
--- a/python/pyspark/sql/pandas/serializers.py
+++ b/python/pyspark/sql/pandas/serializers.py
@@ -385,37 +385,28 @@ class 
ArrowStreamPandasUDFSerializer(ArrowStreamPandasSerializer):
 """
 import pyarrow as pa
 
-# Input partition and result pandas.DataFrame empty, make empty Arrays 
with struct
-if len(df) == 0 and len(df.columns) == 0:
-arrs_names = [
-(pa.array([], type=field.type), field.name) for field in 
arrow_struct_type
-]
+if len(df.columns) == 0:
+return pa.array([{}] * len(df), arrow_struct_type)
 # Assign result columns by schema name if user labeled with strings
-elif self._assign_cols_by_name and any(isinstance(name, str) for name 
in df.columns):
-arrs_names = [
-(
-self._create_array(df[field.name], field.type, 
arrow_cast=self._arrow_cast),
-field.name,
-)
+if self._assign_cols_by_name and any(isinstance(name, str) for name in 
df.columns):
+struct_arrs = [
+self._create_array(df[field.name], field.type, 
arrow_cast=self._arrow_cast)
 for field in arrow_struct_type
 ]
 # Assign result columns by position
 else:
-arrs_names = [
+
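The key line in the new code path is `pa.array([{}] * len(df), arrow_struct_type)`, which builds a struct array with zero fields but the right number of rows. A standalone pyarrow illustration (not Spark code, assuming pandas and pyarrow are installed):

```py
import pandas as pd
import pyarrow as pa

# A pandas DataFrame with rows but no columns, like the result of a pandas UDF
# declared with returnType=StructType().
df = pd.DataFrame(index=range(3))

empty_struct = pa.struct([])                  # struct type with zero fields
arr = pa.array([{}] * len(df), empty_struct)  # one empty struct value per row
print(len(arr), arr.type)                     # 3 struct<>
```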

[spark] branch master updated: [SPARK-44479][PYTHON] Fix ArrowStreamPandasUDFSerializer to accept no-column pandas DataFrame

2023-07-26 Thread ueshin
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 02e36dd0f07 [SPARK-44479][PYTHON] Fix ArrowStreamPandasUDFSerializer 
to accept no-column pandas DataFrame
02e36dd0f07 is described below

commit 02e36dd0f077d11a75c6e083489dc1a51c870a0d
Author: Takuya UESHIN 
AuthorDate: Wed Jul 26 17:53:46 2023 -0700

[SPARK-44479][PYTHON] Fix ArrowStreamPandasUDFSerializer to accept 
no-column pandas DataFrame

### What changes were proposed in this pull request?

Fixes `ArrowStreamPandasUDFSerializer` to accept no-column pandas DataFrame.

```py
>>> def _scalar_f(id):
...   return pd.DataFrame(index=id)
...
>>> scalar_f = pandas_udf(_scalar_f, returnType=StructType())
>>> df = spark.range(3).withColumn("f", scalar_f(col("id")))
>>> df.printSchema()
root
 |-- id: long (nullable = false)
 |-- f: struct (nullable = true)

>>> df.show()
+---+---+
| id|  f|
+---+---+
|  0| {}|
|  1| {}|
|  2| {}|
+---+---+
```

### Why are the changes needed?

The above query fails with the following error:

```py
>>> df.show()
org.apache.spark.api.python.PythonException: Traceback (most recent call 
last):
...
ValueError: not enough values to unpack (expected 2, got 0)
```

### Does this PR introduce _any_ user-facing change?

Yes, Pandas UDF will accept no-column pandas DataFrame.

### How was this patch tested?

Added related tests.

Closes #42161 from ueshin/issues/SPARK-44479/empty_schema.

Authored-by: Takuya UESHIN 
Signed-off-by: Takuya UESHIN 
---
 python/pyspark/sql/connect/types.py|  4 ++-
 python/pyspark/sql/pandas/serializers.py   | 31 --
 .../sql/tests/pandas/test_pandas_udf_scalar.py | 23 +++-
 python/pyspark/sql/tests/test_udtf.py  | 11 ++--
 4 files changed, 45 insertions(+), 24 deletions(-)

diff --git a/python/pyspark/sql/connect/types.py 
b/python/pyspark/sql/connect/types.py
index 2a21cdf0675..0db2833d2c1 100644
--- a/python/pyspark/sql/connect/types.py
+++ b/python/pyspark/sql/connect/types.py
@@ -170,6 +170,7 @@ def pyspark_types_to_proto_types(data_type: DataType) -> 
pb2.DataType:
 ret.year_month_interval.start_field = data_type.startField
 ret.year_month_interval.end_field = data_type.endField
 elif isinstance(data_type, StructType):
+struct = pb2.DataType.Struct()
 for field in data_type.fields:
 struct_field = pb2.DataType.StructField()
 struct_field.name = field.name
@@ -177,7 +178,8 @@ def pyspark_types_to_proto_types(data_type: DataType) -> 
pb2.DataType:
 struct_field.nullable = field.nullable
 if field.metadata is not None and len(field.metadata) > 0:
 struct_field.metadata = json.dumps(field.metadata)
-ret.struct.fields.append(struct_field)
+struct.fields.append(struct_field)
+ret.struct.CopyFrom(struct)
 elif isinstance(data_type, MapType):
 
ret.map.key_type.CopyFrom(pyspark_types_to_proto_types(data_type.keyType))
 
ret.map.value_type.CopyFrom(pyspark_types_to_proto_types(data_type.valueType))
diff --git a/python/pyspark/sql/pandas/serializers.py 
b/python/pyspark/sql/pandas/serializers.py
index 90a24197f64..15de00782c6 100644
--- a/python/pyspark/sql/pandas/serializers.py
+++ b/python/pyspark/sql/pandas/serializers.py
@@ -385,37 +385,28 @@ class 
ArrowStreamPandasUDFSerializer(ArrowStreamPandasSerializer):
 """
 import pyarrow as pa
 
-# Input partition and result pandas.DataFrame empty, make empty Arrays 
with struct
-if len(df) == 0 and len(df.columns) == 0:
-arrs_names = [
-(pa.array([], type=field.type), field.name) for field in 
arrow_struct_type
-]
+if len(df.columns) == 0:
+return pa.array([{}] * len(df), arrow_struct_type)
 # Assign result columns by schema name if user labeled with strings
-elif self._assign_cols_by_name and any(isinstance(name, str) for name 
in df.columns):
-arrs_names = [
-(
-self._create_array(df[field.name], field.type, 
arrow_cast=self._arrow_cast),
-field.name,
-)
+if self._assign_cols_by_name and any(isinstance(name, str) for name in 
df.columns):
+struct_arrs = [
+self._create_array(df[field.name], field.type, 
arrow_cast=self._arrow_cast)
 for field in arrow_struct_type
 ]
 # Assign result columns by position
 else:
-arrs_names = [
+struct_arrs = [
  

[spark] branch branch-3.4 updated: [SPARK-44553][BUILD][3.4] Ignoring `connect-check-protos` logic in GA testing

2023-07-26 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new dff13979dc2 [SPARK-44553][BUILD][3.4] Ignoring `connect-check-protos` 
logic in GA testing
dff13979dc2 is described below

commit dff13979dc2a7abd9222b0567914b554d6b8baf4
Author: panbingkun 
AuthorDate: Thu Jul 27 08:30:41 2023 +0800

[SPARK-44553][BUILD][3.4] Ignoring `connect-check-protos` logic in GA 
testing

### What changes were proposed in this pull request?
This PR ignores the `connect-check-protos` logic in GA testing for 
branch-3.4.

### Why are the changes needed?
Make GA happy.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

Closes #42166 from panbingkun/branch-3.4_SPARK-44553.

Authored-by: panbingkun 
Signed-off-by: Ruifeng Zheng 
---
 .github/workflows/build_and_test.yml | 9 -
 1 file changed, 9 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 06f94ea0b25..4f9978b0414 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -589,15 +589,6 @@ jobs:
 python3.9 -m pip install 'pandas-stubs==1.2.0.53' ipython 
'grpcio==1.48.1' 'grpc-stubs==1.24.11' 'googleapis-common-protos-stubs==2.2.0'
 - name: Python linter
   run: PYTHON_EXECUTABLE=python3.9 ./dev/lint-python
-- name: Install dependencies for Python code generation check
-  run: |
-# See more in "Installation" 
https://docs.buf.build/installation#tarball
-curl -LO 
https://github.com/bufbuild/buf/releases/download/v1.15.1/buf-Linux-x86_64.tar.gz
-mkdir -p $HOME/buf
-tar -xvzf buf-Linux-x86_64.tar.gz -C $HOME/buf --strip-components 1
-python3.9 -m pip install 'protobuf==3.19.5' 'mypy-protobuf==3.3.0'
-- name: Python code generation check
-  run: if test -f ./dev/connect-check-protos.py; then 
PATH=$PATH:$HOME/buf/bin PYTHON_EXECUTABLE=python3.9 
./dev/connect-check-protos.py; fi
 - name: Install JavaScript linter dependencies
   run: |
 apt update





[spark] branch branch-3.4 updated: [SPARK-44544][INFRA][3.4] Deduplicate `run_python_packaging_tests`

2023-07-26 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 96296aac4f4 [SPARK-44544][INFRA][3.4] Deduplicate 
`run_python_packaging_tests`
96296aac4f4 is described below

commit 96296aac4f4f168a7f30ff1ccb33c3b52b433ba4
Author: Ruifeng Zheng 
AuthorDate: Thu Jul 27 09:22:36 2023 +0900

[SPARK-44544][INFRA][3.4] Deduplicate `run_python_packaging_tests`

### What changes were proposed in this pull request?
cherry-pick https://github.com/apache/spark/pull/42146 to 3.4

### Why are the changes needed?
The original patch cannot be cherry-picked cleanly, so this PR applies it manually.

### Does this PR introduce _any_ user-facing change?
no, infra-only

### How was this patch tested?
updated CI

Closes #42172 from zhengruifeng/cp_fix.

Authored-by: Ruifeng Zheng 
Signed-off-by: Hyukjin Kwon 
---
 .github/workflows/build_and_test.yml | 16 ++--
 dev/run-tests.py |  2 +-
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 657fec27d52..06f94ea0b25 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -192,6 +192,7 @@ jobs:
   HIVE_PROFILE: ${{ matrix.hive }}
   GITHUB_PREV_SHA: ${{ github.event.before }}
   SPARK_LOCAL_IP: localhost
+  SKIP_PACKAGING: true
 steps:
 - name: Checkout Spark repository
   uses: actions/checkout@v3
@@ -328,6 +329,8 @@ jobs:
 java:
   - ${{ inputs.java }}
 modules:
+  - >-
+pyspark-errors
   - >-
 pyspark-sql, pyspark-mllib, pyspark-resource
   - >-
@@ -337,7 +340,7 @@ jobs:
   - >-
 pyspark-pandas-slow
   - >-
-pyspark-connect, pyspark-errors
+pyspark-connect
 env:
   MODULES_TO_TEST: ${{ matrix.modules }}
   HADOOP_PROFILE: ${{ inputs.hadoop }}
@@ -346,6 +349,7 @@ jobs:
   SPARK_LOCAL_IP: localhost
   SKIP_UNIDOC: true
   SKIP_MIMA: true
+  SKIP_PACKAGING: true
   METASPACE_SIZE: 1g
 steps:
 - name: Checkout Spark repository
@@ -394,14 +398,20 @@ jobs:
 python3.9 -m pip list
 pypy3 -m pip list
 - name: Install Conda for pip packaging test
+  if: ${{ matrix.modules == 'pyspark-errors' }}
   run: |
 curl -s 
https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh > 
miniconda.sh
 bash miniconda.sh -b -p $HOME/miniconda
 # Run the tests.
 - name: Run tests
   env: ${{ fromJSON(inputs.envs) }}
+  shell: 'script -q -e -c "bash {0}"'
   run: |
-export PATH=$PATH:$HOME/miniconda/bin
+if [[ "$MODULES_TO_TEST" == "pyspark-errors" ]]; then
+  export PATH=$PATH:$HOME/miniconda/bin
+  export SKIP_PACKAGING=false
+  echo "Python Packaging Tests Enabled!"
+fi
 ./dev/run-tests --parallelism 1 --modules "$MODULES_TO_TEST"
 - name: Upload coverage to Codecov
   if: fromJSON(inputs.envs).PYSPARK_CODECOV == 'true'
@@ -437,6 +447,7 @@ jobs:
   GITHUB_PREV_SHA: ${{ github.event.before }}
   SPARK_LOCAL_IP: localhost
   SKIP_MIMA: true
+  SKIP_PACKAGING: true
 steps:
 - name: Checkout Spark repository
   uses: actions/checkout@v3
@@ -850,6 +861,7 @@ jobs:
   SPARK_LOCAL_IP: localhost
   ORACLE_DOCKER_IMAGE_NAME: gvenzl/oracle-xe:21.3.0
   SKIP_MIMA: true
+  SKIP_PACKAGING: true
 steps:
 - name: Checkout Spark repository
   uses: actions/checkout@v3
diff --git a/dev/run-tests.py b/dev/run-tests.py
index 92768c96905..dab3dcf7fe6 100755
--- a/dev/run-tests.py
+++ b/dev/run-tests.py
@@ -396,7 +396,7 @@ def run_python_tests(test_modules, parallelism, 
with_coverage=False):
 
 
 def run_python_packaging_tests():
-if not os.environ.get("SPARK_JENKINS"):
+if not os.environ.get("SPARK_JENKINS") and 
os.environ.get("SKIP_PACKAGING", "false") != "true":
 set_title_and_block("Running PySpark packaging tests", 
"BLOCK_PYSPARK_PIP_TESTS")
 command = [os.path.join(SPARK_HOME, "dev", "run-pip-tests")]
 run_cmd(command)
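The net effect of the workflow and `dev/run-tests.py` changes above is a single gate: packaging tests are skipped whenever `SKIP_PACKAGING` is `true` (or `SPARK_JENKINS` is set), and the workflow flips the flag back to `false` only in the `pyspark-errors` job. A small illustrative helper, not part of the patch:

```py
def packaging_tests_enabled(env: dict) -> bool:
    # Mirrors the condition added to run_python_packaging_tests():
    # skip on Jenkins, and skip whenever SKIP_PACKAGING is set to "true".
    return not env.get("SPARK_JENKINS") and env.get("SKIP_PACKAGING", "false") != "true"

print(packaging_tests_enabled({"SKIP_PACKAGING": "true"}))   # False: most matrix jobs
print(packaging_tests_enabled({"SKIP_PACKAGING": "false"}))  # True: the pyspark-errors job
print(packaging_tests_enabled({}))                           # True: default, e.g. local runs
```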





[spark] branch branch-3.5 updated: [SPARK-44457][CONNECT][TESTS] Add `truncatedTo(ChronoUnit.MICROS)` to make `ArrowEncoderSuite` in Java 17 daily test GA task pass

2023-07-26 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 17fc3632f23 [SPARK-44457][CONNECT][TESTS] Add 
`truncatedTo(ChronoUnit.MICROS)` to make `ArrowEncoderSuite` in Java 17 daily 
test GA task pass
17fc3632f23 is described below

commit 17fc3632f2344101f8318457e3f9d5f133913997
Author: yangjie01 
AuthorDate: Wed Jul 26 19:17:40 2023 -0500

[SPARK-44457][CONNECT][TESTS] Add `truncatedTo(ChronoUnit.MICROS)` to make 
`ArrowEncoderSuite` in Java 17 daily test GA task pass

### What changes were proposed in this pull request?
Similar to SPARK-42770 (https://github.com/apache/spark/pull/40395), this 
PR calls `truncatedTo(ChronoUnit.MICROS)` on `Instant.now()` and 
`LocalDateTime.now()` to ensure microsecond precision is used in any environment.

### Why are the changes needed?
Make Java 17 daily test GA task run successfully.

The Java 17 daily test GA task currently fails: 
https://github.com/apache/spark/actions/runs/5570003581/jobs/10173767006

```
[info] - nullable fields *** FAILED *** (169 milliseconds)
[info]   NullableData(null, JANUARY, E1, null, 1.00, 
2.00, null, 4, PT0S, null, 2023-07-16, 2023-07-16, null, 
2023-07-16T23:01:54.059339Z, 2023-07-16T23:01:54.059359) did not equal 
NullableData(null, JANUARY, E1, null, 1.00, 
2.00, null, 4, PT0S, null, 2023-07-16, 2023-07-16, null, 
2023-07-16T23:01:54.059339538Z, 2023-07-16T23:01:54.059359638) 
(ArrowEncoderSuite.scala:194)
[info]   Analysis:
[info]   NullableData(instant: 2023-07-16T23:01:54.059339Z -> 
2023-07-16T23:01:54.059339538Z, localDateTime: 2023-07-16T23:01:54.059359 -> 
2023-07-16T23:01:54.059359638)
[info]   org.scalatest.exceptions.TestFailedException:
...
[info] - lenient field serialization - timestamp/instant *** FAILED *** (26 
milliseconds)
[info]   2023-07-16T23:01:55.112838Z did not equal 
2023-07-16T23:01:55.112838568Z (ArrowEncoderSuite.scala:194)
[info]   org.scalatest.exceptions.TestFailedException:
...

```

### Does this PR introduce _any_ user-facing change?
No, just for test

### How was this patch tested?
- Pass GitHub Action
- GitHub Actions test with Java 17 passed: 
https://github.com/LuciferYang/spark/actions/runs/5647253889/job/15297009685

Screenshot: https://github.com/apache/spark/assets/1475305/27a4350a-9475-45e3-b39f-b0b1e8f14e92

Closes #42039 from LuciferYang/ArrowEncoderSuite-Java17.

Authored-by: yangjie01 
Signed-off-by: Sean Owen 
(cherry picked from commit da359259b138864a52ea98a4e19c55e593a5a8fa)
Signed-off-by: Sean Owen 
---
 .../spark/sql/connect/client/arrow/ArrowEncoderSuite.scala| 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/arrow/ArrowEncoderSuite.scala
 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/arrow/ArrowEncoderSuite.scala
index 3f8ac1cb8d1..5c035a613fe 100644
--- 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/arrow/ArrowEncoderSuite.scala
+++ 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/arrow/ArrowEncoderSuite.scala
@@ -18,6 +18,7 @@ package org.apache.spark.sql.connect.client.arrow
 
 import java.math.BigInteger
 import java.time.{Duration, Period, ZoneOffset}
+import java.time.temporal.ChronoUnit
 import java.util
 import java.util.{Collections, Objects}
 
@@ -361,8 +362,10 @@ class ArrowEncoderSuite extends ConnectFunSuite with 
BeforeAndAfterAll {
 
   test("nullable fields") {
 val encoder = ScalaReflection.encoderFor[NullableData]
-val instant = java.time.Instant.now()
-val now = java.time.LocalDateTime.now()
+// SPARK-44457: Similar to SPARK-42770, calling 
`truncatedTo(ChronoUnit.MICROS)`
+// on `Instant.now()` and `LocalDateTime.now()` to ensure microsecond 
accuracy is used.
+val instant = java.time.Instant.now().truncatedTo(ChronoUnit.MICROS)
+val now = java.time.LocalDateTime.now().truncatedTo(ChronoUnit.MICROS)
 val today = java.time.LocalDate.now()
 roundTripAndCheckIdentical(encoder) { () =>
   val maybeNull = MaybeNull(3)
@@ -602,7 +605,9 @@ class ArrowEncoderSuite extends ConnectFunSuite with 
BeforeAndAfterAll {
   }
 
   test("lenient field serialization - timestamp/instant") {
-val base = java.time.Instant.now()
+// SPARK-44457: Similar to SPARK-42770, calling 
`truncatedTo(ChronoUnit.MICROS)`
+// on `Instant.now()` to ensure microsecond accuracy is used.
+val base = java.time.Instant.now().truncatedTo(ChronoUnit.MICROS)
 val instants = () => Iterator.tabulate(10)(i => base.plusSeconds(i * i 

[spark] branch master updated: [SPARK-44457][CONNECT][TESTS] Add `truncatedTo(ChronoUnit.MICROS)` to make `ArrowEncoderSuite` in Java 17 daily test GA task pass

2023-07-26 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new da359259b13 [SPARK-44457][CONNECT][TESTS] Add 
`truncatedTo(ChronoUnit.MICROS)` to make `ArrowEncoderSuite` in Java 17 daily 
test GA task pass
da359259b13 is described below

commit da359259b138864a52ea98a4e19c55e593a5a8fa
Author: yangjie01 
AuthorDate: Wed Jul 26 19:17:40 2023 -0500

[SPARK-44457][CONNECT][TESTS] Add `truncatedTo(ChronoUnit.MICROS)` to make 
`ArrowEncoderSuite` in Java 17 daily test GA task pass

### What changes were proposed in this pull request?
Similar to SPARK-42770 (https://github.com/apache/spark/pull/40395), this 
PR calls `truncatedTo(ChronoUnit.MICROS)` on `Instant.now()` and 
`LocalDateTime.now()` to ensure microsecond precision is used in any environment.

### Why are the changes needed?
Make Java 17 daily test GA task run successfully.

The Java 17 daily test GA task currently fails: 
https://github.com/apache/spark/actions/runs/5570003581/jobs/10173767006

```
[info] - nullable fields *** FAILED *** (169 milliseconds)
[info]   NullableData(null, JANUARY, E1, null, 1.00, 
2.00, null, 4, PT0S, null, 2023-07-16, 2023-07-16, null, 
2023-07-16T23:01:54.059339Z, 2023-07-16T23:01:54.059359) did not equal 
NullableData(null, JANUARY, E1, null, 1.00, 
2.00, null, 4, PT0S, null, 2023-07-16, 2023-07-16, null, 
2023-07-16T23:01:54.059339538Z, 2023-07-16T23:01:54.059359638) 
(ArrowEncoderSuite.scala:194)
[info]   Analysis:
[info]   NullableData(instant: 2023-07-16T23:01:54.059339Z -> 
2023-07-16T23:01:54.059339538Z, localDateTime: 2023-07-16T23:01:54.059359 -> 
2023-07-16T23:01:54.059359638)
[info]   org.scalatest.exceptions.TestFailedException:
...
[info] - lenient field serialization - timestamp/instant *** FAILED *** (26 
milliseconds)
[info]   2023-07-16T23:01:55.112838Z did not equal 
2023-07-16T23:01:55.112838568Z (ArrowEncoderSuite.scala:194)
[info]   org.scalatest.exceptions.TestFailedException:
...

```

### Does this PR introduce _any_ user-facing change?
No, just for test

### How was this patch tested?
- Pass GitHub Action
- GitHub Actions test with Java 17 passed: 
https://github.com/LuciferYang/spark/actions/runs/5647253889/job/15297009685

Screenshot: https://github.com/apache/spark/assets/1475305/27a4350a-9475-45e3-b39f-b0b1e8f14e92

Closes #42039 from LuciferYang/ArrowEncoderSuite-Java17.

Authored-by: yangjie01 
Signed-off-by: Sean Owen 
---
 .../spark/sql/connect/client/arrow/ArrowEncoderSuite.scala| 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/arrow/ArrowEncoderSuite.scala
 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/arrow/ArrowEncoderSuite.scala
index 3f8ac1cb8d1..5c035a613fe 100644
--- 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/arrow/ArrowEncoderSuite.scala
+++ 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/arrow/ArrowEncoderSuite.scala
@@ -18,6 +18,7 @@ package org.apache.spark.sql.connect.client.arrow
 
 import java.math.BigInteger
 import java.time.{Duration, Period, ZoneOffset}
+import java.time.temporal.ChronoUnit
 import java.util
 import java.util.{Collections, Objects}
 
@@ -361,8 +362,10 @@ class ArrowEncoderSuite extends ConnectFunSuite with 
BeforeAndAfterAll {
 
   test("nullable fields") {
 val encoder = ScalaReflection.encoderFor[NullableData]
-val instant = java.time.Instant.now()
-val now = java.time.LocalDateTime.now()
+// SPARK-44457: Similar to SPARK-42770, calling 
`truncatedTo(ChronoUnit.MICROS)`
+// on `Instant.now()` and `LocalDateTime.now()` to ensure microsecond 
accuracy is used.
+val instant = java.time.Instant.now().truncatedTo(ChronoUnit.MICROS)
+val now = java.time.LocalDateTime.now().truncatedTo(ChronoUnit.MICROS)
 val today = java.time.LocalDate.now()
 roundTripAndCheckIdentical(encoder) { () =>
   val maybeNull = MaybeNull(3)
@@ -602,7 +605,9 @@ class ArrowEncoderSuite extends ConnectFunSuite with 
BeforeAndAfterAll {
   }
 
   test("lenient field serialization - timestamp/instant") {
-val base = java.time.Instant.now()
+// SPARK-44457: Similar to SPARK-42770, calling 
`truncatedTo(ChronoUnit.MICROS)`
+// on `Instant.now()` to ensure microsecond accuracy is used.
+val base = java.time.Instant.now().truncatedTo(ChronoUnit.MICROS)
 val instants = () => Iterator.tabulate(10)(i => base.plusSeconds(i * i * 
60))
 val timestamps = () => instants().map(java.sql.Timestamp.from)
 val combo = () => instants() 

[spark] branch master updated: [SPARK-44522][BUILD] Upgrade `scala-xml` to 2.2.0

2023-07-26 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 43b753a3530 [SPARK-44522][BUILD] Upgrade `scala-xml` to 2.2.0
43b753a3530 is described below

commit 43b753a3530bcfdad415765e1348136d70d8125d
Author: yangjie01 
AuthorDate: Wed Jul 26 19:11:00 2023 -0500

[SPARK-44522][BUILD] Upgrade `scala-xml` to 2.2.0

### What changes were proposed in this pull request?
This pr aims to upgrade `scala-xml` from 2.1.0 to 2.2.0.

### Why are the changes needed?
The new version brings some bug fixes, such as:
- https://github.com/scala/scala-xml/pull/651
- https://github.com/scala/scala-xml/pull/677

The full release notes as follows:
- https://github.com/scala/scala-xml/releases/tag/v2.2.0

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass GitHub Actions
- Checked Scala 2.13, all Scala test passed: 
https://github.com/LuciferYang/spark/runs/15278359785

Closes #42119 from LuciferYang/scala-xml-220.

Authored-by: yangjie01 
Signed-off-by: Sean Owen 
---
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 +-
 pom.xml   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 
b/dev/deps/spark-deps-hadoop-3-hive-2.3
index 168b0b34787..3b54ef43f6a 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -229,7 +229,7 @@ scala-compiler/2.12.18//scala-compiler-2.12.18.jar
 scala-library/2.12.18//scala-library-2.12.18.jar
 scala-parser-combinators_2.12/2.3.0//scala-parser-combinators_2.12-2.3.0.jar
 scala-reflect/2.12.18//scala-reflect-2.12.18.jar
-scala-xml_2.12/2.1.0//scala-xml_2.12-2.1.0.jar
+scala-xml_2.12/2.2.0//scala-xml_2.12-2.2.0.jar
 shims/0.9.45//shims-0.9.45.jar
 slf4j-api/2.0.7//slf4j-api-2.0.7.jar
 snakeyaml-engine/2.6//snakeyaml-engine-2.6.jar
diff --git a/pom.xml b/pom.xml
index 5711dba04b9..2e9d1d2d8f3 100644
--- a/pom.xml
+++ b/pom.xml
@@ -1089,7 +1089,7 @@
   
 org.scala-lang.modules
 scala-xml_${scala.binary.version}
-2.1.0
+2.2.0
   
   
 org.scala-lang





[spark] branch branch-3.5 updated: [SPARK-44528][CONNECT] Support proper usage of hasattr() for Connect dataframe

2023-07-26 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 43ac3db1e27 [SPARK-44528][CONNECT] Support proper usage of hasattr() 
for Connect dataframe
43ac3db1e27 is described below

commit 43ac3db1e27a4169183a90b54b6a873f0d26a7ba
Author: Martin Grund 
AuthorDate: Thu Jul 27 08:53:45 2023 +0900

[SPARK-44528][CONNECT] Support proper usage of hasattr() for Connect 
dataframe

### What changes were proposed in this pull request?
Currently, Connect does not allow the proper usage of Python's `hasattr()` 
to check whether an attribute is defined. This patch fixes that bug (it 
already works in regular PySpark).

### Why are the changes needed?
Bugfix

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT

Closes #42132 from grundprinzip/SPARK-44528.

Lead-authored-by: Martin Grund 
Co-authored-by: Martin Grund 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 91e97f92fe76f9718cd16af0c761d5530bdb37ee)
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/connect/dataframe.py|  8 +++
 .../sql/tests/connect/test_connect_basic.py| 17 +++--
 python/pyspark/testing/connectutils.py | 28 ++
 3 files changed, 46 insertions(+), 7 deletions(-)

diff --git a/python/pyspark/sql/connect/dataframe.py 
b/python/pyspark/sql/connect/dataframe.py
index 6429645f0e0..12e424b5ef1 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -1584,8 +1584,16 @@ class DataFrame:
 error_class="NOT_IMPLEMENTED",
 message_parameters={"feature": f"{name}()"},
 )
+
+if name not in self.columns:
+raise AttributeError(
+"'%s' object has no attribute '%s'" % 
(self.__class__.__name__, name)
+)
+
 return self[name]
 
+__getattr__.__doc__ = PySparkDataFrame.__getattr__.__doc__
+
 @overload
 def __getitem__(self, item: Union[int, str]) -> Column:
 ...
diff --git a/python/pyspark/sql/tests/connect/test_connect_basic.py 
b/python/pyspark/sql/tests/connect/test_connect_basic.py
index 5259ea6b5f5..065f1585a9f 100644
--- a/python/pyspark/sql/tests/connect/test_connect_basic.py
+++ b/python/pyspark/sql/tests/connect/test_connect_basic.py
@@ -157,6 +157,19 @@ class SparkConnectSQLTestCase(ReusedConnectTestCase, 
SQLTestUtils, PandasOnSpark
 
 
 class SparkConnectBasicTests(SparkConnectSQLTestCase):
+def test_df_getattr_behavior(self):
+cdf = self.connect.range(10)
+sdf = self.spark.range(10)
+
+sdf._simple_extension = 10
+cdf._simple_extension = 10
+
+self.assertEqual(sdf._simple_extension, cdf._simple_extension)
+self.assertEqual(type(sdf._simple_extension), 
type(cdf._simple_extension))
+
+self.assertTrue(hasattr(cdf, "_simple_extension"))
+self.assertFalse(hasattr(cdf, "_simple_extension_does_not_exsit"))
+
 def test_df_get_item(self):
 # SPARK-41779: test __getitem__
 
@@ -1296,8 +1309,8 @@ class SparkConnectBasicTests(SparkConnectSQLTestCase):
 sdf.drop("a", "x").toPandas(),
 )
 self.assert_eq(
-cdf.drop(cdf.a, cdf.x).toPandas(),
-sdf.drop("a", "x").toPandas(),
+cdf.drop(cdf.a, "x").toPandas(),
+sdf.drop(sdf.a, "x").toPandas(),
 )
 
 def test_subquery_alias(self) -> None:
diff --git a/python/pyspark/testing/connectutils.py 
b/python/pyspark/testing/connectutils.py
index 1b3ac10fce8..b6145d0a006 100644
--- a/python/pyspark/testing/connectutils.py
+++ b/python/pyspark/testing/connectutils.py
@@ -16,6 +16,7 @@
 #
 import shutil
 import tempfile
+import types
 import typing
 import os
 import functools
@@ -67,7 +68,7 @@ should_test_connect: str = typing.cast(str, 
connect_requirement_message is None)
 
 if should_test_connect:
 from pyspark.sql.connect.dataframe import DataFrame
-from pyspark.sql.connect.plan import Read, Range, SQL
+from pyspark.sql.connect.plan import Read, Range, SQL, LogicalPlan
 from pyspark.sql.connect.session import SparkSession
 
 
@@ -88,16 +89,33 @@ class MockRemoteSession:
 return functools.partial(self.hooks[item])
 
 
+class MockDF(DataFrame):
+"""Helper class that must only be used for the mock plan tests."""
+
+def __init__(self, session: SparkSession, plan: LogicalPlan):
+super().__init__(session)
+self._plan = plan
+
+def __getattr__(self, name):
+"""All attributes are resolved to columns, because none really exist 
in the
+mocked DataFrame."""
+return self[name]
+
+
 @unittest.skipIf(not should_test_connect, 
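Why raising `AttributeError` is what makes `hasattr()` behave: `hasattr()` calls `getattr()` and returns `False` only when that raises `AttributeError`. A toy sketch of the pattern the patch adopts (not the real Connect `DataFrame`):

```py
class FakeDataFrame:
    """Toy stand-in: unknown names resolve to columns via __getattr__, so
    hasattr() only behaves correctly if missing names raise AttributeError."""

    columns = ["id", "name"]

    def __getattr__(self, name):
        if name not in self.columns:
            raise AttributeError(
                "'%s' object has no attribute '%s'" % (type(self).__name__, name)
            )
        return f"Column<{name}>"

df = FakeDataFrame()
print(hasattr(df, "id"))         # True: resolved as a column
print(hasattr(df, "not_there"))  # False: the AttributeError is swallowed by hasattr()
```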

[spark] branch master updated: [SPARK-44528][CONNECT] Support proper usage of hasattr() for Connect dataframe

2023-07-26 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 91e97f92fe7 [SPARK-44528][CONNECT] Support proper usage of hasattr() 
for Connect dataframe
91e97f92fe7 is described below

commit 91e97f92fe76f9718cd16af0c761d5530bdb37ee
Author: Martin Grund 
AuthorDate: Thu Jul 27 08:53:45 2023 +0900

[SPARK-44528][CONNECT] Support proper usage of hasattr() for Connect 
dataframe

### What changes were proposed in this pull request?
Currently, Connect does not allow the proper usage of Python's `hasattr()` 
to check whether an attribute is defined. This patch fixes that bug (it 
already works in regular PySpark).

### Why are the changes needed?
Bugfix

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT

Closes #42132 from grundprinzip/SPARK-44528.

Lead-authored-by: Martin Grund 
Co-authored-by: Martin Grund 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/connect/dataframe.py|  8 +++
 .../sql/tests/connect/test_connect_basic.py| 17 +++--
 python/pyspark/testing/connectutils.py | 28 ++
 3 files changed, 46 insertions(+), 7 deletions(-)

diff --git a/python/pyspark/sql/connect/dataframe.py 
b/python/pyspark/sql/connect/dataframe.py
index 6429645f0e0..12e424b5ef1 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -1584,8 +1584,16 @@ class DataFrame:
 error_class="NOT_IMPLEMENTED",
 message_parameters={"feature": f"{name}()"},
 )
+
+if name not in self.columns:
+raise AttributeError(
+"'%s' object has no attribute '%s'" % 
(self.__class__.__name__, name)
+)
+
 return self[name]
 
+__getattr__.__doc__ = PySparkDataFrame.__getattr__.__doc__
+
 @overload
 def __getitem__(self, item: Union[int, str]) -> Column:
 ...
diff --git a/python/pyspark/sql/tests/connect/test_connect_basic.py 
b/python/pyspark/sql/tests/connect/test_connect_basic.py
index 5259ea6b5f5..065f1585a9f 100644
--- a/python/pyspark/sql/tests/connect/test_connect_basic.py
+++ b/python/pyspark/sql/tests/connect/test_connect_basic.py
@@ -157,6 +157,19 @@ class SparkConnectSQLTestCase(ReusedConnectTestCase, 
SQLTestUtils, PandasOnSpark
 
 
 class SparkConnectBasicTests(SparkConnectSQLTestCase):
+def test_df_getattr_behavior(self):
+cdf = self.connect.range(10)
+sdf = self.spark.range(10)
+
+sdf._simple_extension = 10
+cdf._simple_extension = 10
+
+self.assertEqual(sdf._simple_extension, cdf._simple_extension)
+self.assertEqual(type(sdf._simple_extension), 
type(cdf._simple_extension))
+
+self.assertTrue(hasattr(cdf, "_simple_extension"))
+self.assertFalse(hasattr(cdf, "_simple_extension_does_not_exsit"))
+
 def test_df_get_item(self):
 # SPARK-41779: test __getitem__
 
@@ -1296,8 +1309,8 @@ class SparkConnectBasicTests(SparkConnectSQLTestCase):
 sdf.drop("a", "x").toPandas(),
 )
 self.assert_eq(
-cdf.drop(cdf.a, cdf.x).toPandas(),
-sdf.drop("a", "x").toPandas(),
+cdf.drop(cdf.a, "x").toPandas(),
+sdf.drop(sdf.a, "x").toPandas(),
 )
 
 def test_subquery_alias(self) -> None:
diff --git a/python/pyspark/testing/connectutils.py 
b/python/pyspark/testing/connectutils.py
index 1b3ac10fce8..b6145d0a006 100644
--- a/python/pyspark/testing/connectutils.py
+++ b/python/pyspark/testing/connectutils.py
@@ -16,6 +16,7 @@
 #
 import shutil
 import tempfile
+import types
 import typing
 import os
 import functools
@@ -67,7 +68,7 @@ should_test_connect: str = typing.cast(str, 
connect_requirement_message is None)
 
 if should_test_connect:
 from pyspark.sql.connect.dataframe import DataFrame
-from pyspark.sql.connect.plan import Read, Range, SQL
+from pyspark.sql.connect.plan import Read, Range, SQL, LogicalPlan
 from pyspark.sql.connect.session import SparkSession
 
 
@@ -88,16 +89,33 @@ class MockRemoteSession:
 return functools.partial(self.hooks[item])
 
 
+class MockDF(DataFrame):
+"""Helper class that must only be used for the mock plan tests."""
+
+def __init__(self, session: SparkSession, plan: LogicalPlan):
+super().__init__(session)
+self._plan = plan
+
+def __getattr__(self, name):
+"""All attributes are resolved to columns, because none really exist 
in the
+mocked DataFrame."""
+return self[name]
+
+
 @unittest.skipIf(not should_test_connect, connect_requirement_message)
 class PlanOnlyTestFixture(unittest.TestCase, PySparkErrorTestUtils):
 @classmethod
 def 

[spark] branch master updated: [SPARK-44537][BUILD] Upgrade kubernetes-client to 6.8.0

2023-07-26 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6b6216c01ef [SPARK-44537][BUILD] Upgrade kubernetes-client to 6.8.0
6b6216c01ef is described below

commit 6b6216c01ef333aef43c4b78831078576d7834fb
Author: panbingkun 
AuthorDate: Wed Jul 26 09:33:43 2023 -0700

[SPARK-44537][BUILD] Upgrade kubernetes-client to 6.8.0

### What changes were proposed in this pull request?
The pr aims to upgrade kubernetes-client from 6.7.2 to 6.8.0.

### Why are the changes needed?
- The new version brings some bug fixes & improvements, e.g.:
Fix https://github.com/fabric8io/kubernetes-client/issues/5221: Empty kube 
config file causes NPE
Fix https://github.com/fabric8io/kubernetes-client/issues/5281: Ensure the 
KubernetesCrudDispatcher's backing map is accessed w/lock
Fix https://github.com/fabric8io/kubernetes-client/issues/5298: Prevent 
requests needing authentication from causing a 403 response
Fix https://github.com/fabric8io/kubernetes-client/issues/5233: Generalized 
SchemaSwap to allow for cycle expansion
Fix https://github.com/fabric8io/kubernetes-client/issues/5262: all 
built-in collections will omit empty in their serialized form.

- The full release notes:
https://github.com/fabric8io/kubernetes-client/releases/

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

Closes #42142 from panbingkun/SPARK-44537.

Authored-by: panbingkun 
Signed-off-by: Dongjoon Hyun 
---
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 50 +--
 pom.xml   |  2 +-
 2 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 
b/dev/deps/spark-deps-hadoop-3-hive-2.3
index 251c8174fcc..168b0b34787 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -141,31 +141,31 @@ jsr305/3.0.0//jsr305-3.0.0.jar
 jta/1.1//jta-1.1.jar
 jul-to-slf4j/2.0.7//jul-to-slf4j-2.0.7.jar
 kryo-shaded/4.0.2//kryo-shaded-4.0.2.jar
-kubernetes-client-api/6.7.2//kubernetes-client-api-6.7.2.jar
-kubernetes-client/6.7.2//kubernetes-client-6.7.2.jar
-kubernetes-httpclient-okhttp/6.7.2//kubernetes-httpclient-okhttp-6.7.2.jar
-kubernetes-model-admissionregistration/6.7.2//kubernetes-model-admissionregistration-6.7.2.jar
-kubernetes-model-apiextensions/6.7.2//kubernetes-model-apiextensions-6.7.2.jar
-kubernetes-model-apps/6.7.2//kubernetes-model-apps-6.7.2.jar
-kubernetes-model-autoscaling/6.7.2//kubernetes-model-autoscaling-6.7.2.jar
-kubernetes-model-batch/6.7.2//kubernetes-model-batch-6.7.2.jar
-kubernetes-model-certificates/6.7.2//kubernetes-model-certificates-6.7.2.jar
-kubernetes-model-common/6.7.2//kubernetes-model-common-6.7.2.jar
-kubernetes-model-coordination/6.7.2//kubernetes-model-coordination-6.7.2.jar
-kubernetes-model-core/6.7.2//kubernetes-model-core-6.7.2.jar
-kubernetes-model-discovery/6.7.2//kubernetes-model-discovery-6.7.2.jar
-kubernetes-model-events/6.7.2//kubernetes-model-events-6.7.2.jar
-kubernetes-model-extensions/6.7.2//kubernetes-model-extensions-6.7.2.jar
-kubernetes-model-flowcontrol/6.7.2//kubernetes-model-flowcontrol-6.7.2.jar
-kubernetes-model-gatewayapi/6.7.2//kubernetes-model-gatewayapi-6.7.2.jar
-kubernetes-model-metrics/6.7.2//kubernetes-model-metrics-6.7.2.jar
-kubernetes-model-networking/6.7.2//kubernetes-model-networking-6.7.2.jar
-kubernetes-model-node/6.7.2//kubernetes-model-node-6.7.2.jar
-kubernetes-model-policy/6.7.2//kubernetes-model-policy-6.7.2.jar
-kubernetes-model-rbac/6.7.2//kubernetes-model-rbac-6.7.2.jar
-kubernetes-model-resource/6.7.2//kubernetes-model-resource-6.7.2.jar
-kubernetes-model-scheduling/6.7.2//kubernetes-model-scheduling-6.7.2.jar
-kubernetes-model-storageclass/6.7.2//kubernetes-model-storageclass-6.7.2.jar
+kubernetes-client-api/6.8.0//kubernetes-client-api-6.8.0.jar
+kubernetes-client/6.8.0//kubernetes-client-6.8.0.jar
+kubernetes-httpclient-okhttp/6.8.0//kubernetes-httpclient-okhttp-6.8.0.jar
+kubernetes-model-admissionregistration/6.8.0//kubernetes-model-admissionregistration-6.8.0.jar
+kubernetes-model-apiextensions/6.8.0//kubernetes-model-apiextensions-6.8.0.jar
+kubernetes-model-apps/6.8.0//kubernetes-model-apps-6.8.0.jar
+kubernetes-model-autoscaling/6.8.0//kubernetes-model-autoscaling-6.8.0.jar
+kubernetes-model-batch/6.8.0//kubernetes-model-batch-6.8.0.jar
+kubernetes-model-certificates/6.8.0//kubernetes-model-certificates-6.8.0.jar
+kubernetes-model-common/6.8.0//kubernetes-model-common-6.8.0.jar
+kubernetes-model-coordination/6.8.0//kubernetes-model-coordination-6.8.0.jar
+kubernetes-model-core/6.8.0//kubernetes-model-core-6.8.0.jar
+kubernetes-model-discovery/6.8.0//kubernetes-model-discovery-6.8.0.jar

[spark] branch branch-3.5 updated: [SPARK-44154][SQL][FOLLOWUP] `BitmapCount` and `BitmapOrAgg` should use `DataTypeMismatch` to indicate unexpected input data type

2023-07-26 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 5cb7d2b81af [SPARK-44154][SQL][FOLLOWUP] `BitmapCount` and 
`BitmapOrAgg` should use `DataTypeMismatch` to indicate unexpected input data 
type
5cb7d2b81af is described below

commit 5cb7d2b81af5e8c2714cccfb818f9ec9d6c54da3
Author: Bruce Robbins 
AuthorDate: Wed Jul 26 19:31:27 2023 +0800

[SPARK-44154][SQL][FOLLOWUP] `BitmapCount` and `BitmapOrAgg` should use 
`DataTypeMismatch` to indicate unexpected input data type

### What changes were proposed in this pull request?

Change `BitmapCount` and `BitmapOrAgg` to use `DataTypeMismatch` rather 
than `TypeCheckResult.TypeCheckFailure` to indicate incorrect input types.

### Why are the changes needed?

It appears `TypeCheckResult.TypeCheckFailure` has been deprecated: No 
expressions except for the recently added `BitmapCount` and `BitmapOrAgg` are 
using it.

### Does this PR introduce _any_ user-facing change?

This PR changes an error message for two expressions that are not yet in 
any released version of Spark.

Before PR:
```
spark-sql (default)> select bitmap_count(12);
[DATATYPE_MISMATCH.TYPE_CHECK_FAILURE_WITH_HINT] Cannot resolve 
"bitmap_count(12)" due to data type mismatch: Bitmap must be a BinaryType.; 
line 1 pos 7;
'Project [unresolvedalias(bitmap_count(12), None)]
+- OneRowRelation

spark-sql (default)> select bitmap_or_agg(12);
[DATATYPE_MISMATCH.TYPE_CHECK_FAILURE_WITH_HINT] Cannot resolve 
"bitmap_or_agg(12)" due to data type mismatch: Bitmap must be a BinaryType.; 
line 1 pos 7;
'Aggregate [unresolvedalias(bitmap_or_agg(12, 0, 0), None)]
+- OneRowRelation
```
After PR:
```
spark-sql (default)> select bitmap_count(12);
[DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "bitmap_count(12)" 
due to data type mismatch: Parameter 0 requires the "BINARY" type, however "12" 
has the type "INT".; line 1 pos 7;
'Project [unresolvedalias(bitmap_count(12), None)]
+- OneRowRelation

spark-sql (default)> select bitmap_or_agg(12);
[DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve 
"bitmap_or_agg(12)" due to data type mismatch: Parameter 0 requires the 
"BINARY" type, however "12" has the type "INT".; line 1 pos 7;
'Aggregate [unresolvedalias(bitmap_or_agg(12, 0, 0), None)]
+- OneRowRelation
```
### How was this patch tested?

New unit tests.
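
For illustration, a minimal standalone sketch of exercising the new error class (this is not the suite added by this PR; the object name is made up, and it assumes a Spark build that already contains the SPARK-44154 bitmap expressions):

```scala
// Hypothetical sketch: feeding an INT to bitmap_count should now surface the
// DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE error class shown above.
import org.apache.spark.SparkThrowable
import org.apache.spark.sql.SparkSession

object BitmapTypeCheckSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("bitmap-sketch").getOrCreate()
    try {
      spark.sql("select bitmap_count(12)").collect()
    } catch {
      // AnalysisException implements SparkThrowable, which exposes the error class.
      case e: SparkThrowable => println(e.getErrorClass)
    } finally {
      spark.stop()
    }
  }
}
```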

Closes #42139 from bersprockets/bitmap_type_check.

Authored-by: Bruce Robbins 
Signed-off-by: Wenchen Fan 
(cherry picked from commit d0d4aab437843ce5adf5900d2d6088e79323f8d5)
Signed-off-by: Wenchen Fan 
---
 .../catalyst/expressions/bitmapExpressions.scala   | 26 +++--
 .../spark/sql/BitmapExpressionsQuerySuite.scala| 44 ++
 2 files changed, 66 insertions(+), 4 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/bitmapExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/bitmapExpressions.scala
index 2adfddb9383..5c7ef5cde5b 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/bitmapExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/bitmapExpressions.scala
@@ -19,10 +19,12 @@ package org.apache.spark.sql.catalyst.expressions
 
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import 
org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{DataTypeMismatch, 
TypeCheckSuccess}
 import org.apache.spark.sql.catalyst.expressions.aggregate.ImperativeAggregate
 import org.apache.spark.sql.catalyst.expressions.objects.StaticInvoke
 import org.apache.spark.sql.catalyst.trees.UnaryLike
 import org.apache.spark.sql.catalyst.types.DataTypeUtils
+import org.apache.spark.sql.catalyst.util.TypeUtils._
 import org.apache.spark.sql.errors.QueryExecutionErrors
 import org.apache.spark.sql.types.{AbstractDataType, BinaryType, DataType, 
LongType, StructType}
 
@@ -111,9 +113,17 @@ case class BitmapCount(child: Expression)
 
   override def checkInputDataTypes(): TypeCheckResult = {
 if (child.dataType != BinaryType) {
-  TypeCheckResult.TypeCheckFailure("Bitmap must be a BinaryType")
+  DataTypeMismatch(
+errorSubClass = "UNEXPECTED_INPUT_TYPE",
+messageParameters = Map(
+  "paramIndex" -> "0",
+  "requiredType" -> toSQLType(BinaryType),
+  "inputSql" -> toSQLExpr(child),
+  "inputType" -> toSQLType(child.dataType)
+)
+  )
 } else {
-  TypeCheckResult.TypeCheckSuccess
+  TypeCheckSuccess
 }
   }
 
@@ -248,9 +258,17 @@ case class 

[spark] branch master updated: [SPARK-44154][SQL][FOLLOWUP] `BitmapCount` and `BitmapOrAgg` should use `DataTypeMismatch` to indicate unexpected input data type

2023-07-26 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d0d4aab4378 [SPARK-44154][SQL][FOLLOWUP] `BitmapCount` and 
`BitmapOrAgg` should use `DataTypeMismatch` to indicate unexpected input data 
type
d0d4aab4378 is described below

commit d0d4aab437843ce5adf5900d2d6088e79323f8d5
Author: Bruce Robbins 
AuthorDate: Wed Jul 26 19:31:27 2023 +0800

[SPARK-44154][SQL][FOLLOWUP] `BitmapCount` and `BitmapOrAgg` should use 
`DataTypeMismatch` to indicate unexpected input data type

### What changes were proposed in this pull request?

Change `BitmapCount` and `BitmapOrAgg` to use `DataTypeMismatch` rather 
than `TypeCheckResult.TypeCheckFailure` to indicate incorrect input types.

### Why are the changes needed?

It appears `TypeCheckResult.TypeCheckFailure` has been deprecated: No 
expressions except for the recently added `BitmapCount` and `BitmapOrAgg` are 
using it.

### Does this PR introduce _any_ user-facing change?

This PR changes an error message for two expressions that are not yet in 
any released version of Spark.

Before PR:
```
spark-sql (default)> select bitmap_count(12);
[DATATYPE_MISMATCH.TYPE_CHECK_FAILURE_WITH_HINT] Cannot resolve 
"bitmap_count(12)" due to data type mismatch: Bitmap must be a BinaryType.; 
line 1 pos 7;
'Project [unresolvedalias(bitmap_count(12), None)]
+- OneRowRelation

spark-sql (default)> select bitmap_or_agg(12);
[DATATYPE_MISMATCH.TYPE_CHECK_FAILURE_WITH_HINT] Cannot resolve 
"bitmap_or_agg(12)" due to data type mismatch: Bitmap must be a BinaryType.; 
line 1 pos 7;
'Aggregate [unresolvedalias(bitmap_or_agg(12, 0, 0), None)]
+- OneRowRelation
```
After PR:
```
spark-sql (default)> select bitmap_count(12);
[DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "bitmap_count(12)" 
due to data type mismatch: Parameter 0 requires the "BINARY" type, however "12" 
has the type "INT".; line 1 pos 7;
'Project [unresolvedalias(bitmap_count(12), None)]
+- OneRowRelation

spark-sql (default)> select bitmap_or_agg(12);
[DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve 
"bitmap_or_agg(12)" due to data type mismatch: Parameter 0 requires the 
"BINARY" type, however "12" has the type "INT".; line 1 pos 7;
'Aggregate [unresolvedalias(bitmap_or_agg(12, 0, 0), None)]
+- OneRowRelation
```
### How was this patch tested?

New unit tests.

Closes #42139 from bersprockets/bitmap_type_check.

Authored-by: Bruce Robbins 
Signed-off-by: Wenchen Fan 
---
 .../catalyst/expressions/bitmapExpressions.scala   | 26 +++--
 .../spark/sql/BitmapExpressionsQuerySuite.scala| 44 ++
 2 files changed, 66 insertions(+), 4 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/bitmapExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/bitmapExpressions.scala
index 2adfddb9383..5c7ef5cde5b 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/bitmapExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/bitmapExpressions.scala
@@ -19,10 +19,12 @@ package org.apache.spark.sql.catalyst.expressions
 
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import 
org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{DataTypeMismatch, 
TypeCheckSuccess}
 import org.apache.spark.sql.catalyst.expressions.aggregate.ImperativeAggregate
 import org.apache.spark.sql.catalyst.expressions.objects.StaticInvoke
 import org.apache.spark.sql.catalyst.trees.UnaryLike
 import org.apache.spark.sql.catalyst.types.DataTypeUtils
+import org.apache.spark.sql.catalyst.util.TypeUtils._
 import org.apache.spark.sql.errors.QueryExecutionErrors
 import org.apache.spark.sql.types.{AbstractDataType, BinaryType, DataType, 
LongType, StructType}
 
@@ -111,9 +113,17 @@ case class BitmapCount(child: Expression)
 
   override def checkInputDataTypes(): TypeCheckResult = {
 if (child.dataType != BinaryType) {
-  TypeCheckResult.TypeCheckFailure("Bitmap must be a BinaryType")
+  DataTypeMismatch(
+errorSubClass = "UNEXPECTED_INPUT_TYPE",
+messageParameters = Map(
+  "paramIndex" -> "0",
+  "requiredType" -> toSQLType(BinaryType),
+  "inputSql" -> toSQLExpr(child),
+  "inputType" -> toSQLType(child.dataType)
+)
+  )
 } else {
-  TypeCheckResult.TypeCheckSuccess
+  TypeCheckSuccess
 }
   }
 
@@ -248,9 +258,17 @@ case class BitmapOrAgg(child: Expression,
 
   override def checkInputDataTypes(): TypeCheckResult = {
 if (child.dataType != 

[GitHub] [spark-website] peter-toth commented on pull request #469: Add Peter Toth to committers

2023-07-26 Thread via GitHub


peter-toth commented on PR #469:
URL: https://github.com/apache/spark-website/pull/469#issuecomment-1651582097

   Thanks @zhengruifeng!





[spark] branch branch-3.5 updated: [SPARK-44531][CONNECT][SQL] Move encoder inference to sql/api

2023-07-26 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new e0c8f14ce53 [SPARK-44531][CONNECT][SQL] Move encoder inference to 
sql/api
e0c8f14ce53 is described below

commit e0c8f14ce53080e2863c076b7912239bee35003e
Author: Herman van Hovell 
AuthorDate: Wed Jul 26 07:15:27 2023 -0400

[SPARK-44531][CONNECT][SQL] Move encoder inference to sql/api

### What changes were proposed in this pull request?
This PR moves encoder inference
(ScalaReflection/RowEncoder/JavaTypeInference) into sql/api.

### Why are the changes needed?
We want to use encoder inference in the Spark Connect Scala client. The
client's dependency on catalyst is going away, so we need to move this.
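
As a rough illustration of what "encoder inference" refers to here, a minimal sketch using the public `Encoders` API, which delegates to this reflection machinery (the `Point` case class is made up for the example; this is not code from the PR):

```scala
// Minimal sketch: derive an encoder (schema + serializer) for a case class.
// The schema below is produced by ScalaReflection-based inference; no SparkSession is needed.
import org.apache.spark.sql.Encoders

case class Point(x: Double, y: Double, label: String)

object EncoderInferenceSketch {
  def main(args: Array[String]): Unit = {
    val enc = Encoders.product[Point]
    // Expected output (roughly):
    //  root
    //   |-- x: double (nullable = false)
    //   |-- y: double (nullable = false)
    //   |-- label: string (nullable = true)
    println(enc.schema.treeString)
  }
}
```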

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #42134 from hvanhovell/SPARK-44531.

Authored-by: Herman van Hovell 
Signed-off-by: Herman van Hovell 
(cherry picked from commit 071feabbd4325504332679dfa620bc5ee4359370)
Signed-off-by: Herman van Hovell 
---
 .../spark/ml/source/image/ImageFileFormat.scala|  4 +-
 .../spark/ml/source/libsvm/LibSVMRelation.scala|  4 +-
 project/MimaExcludes.scala |  3 +
 sql/api/pom.xml|  4 ++
 .../java/org/apache/spark/sql/types/DataTypes.java |  0
 .../apache/spark/sql/types/SQLUserDefinedType.java |  0
 .../scala/org/apache/spark/sql/SqlApiConf.scala|  2 +
 .../spark/sql/catalyst/JavaTypeInference.scala |  9 ++-
 .../spark/sql/catalyst/ScalaReflection.scala   | 13 ++--
 .../apache/spark/sql/catalyst/WalkedTypePath.scala |  0
 .../spark/sql/catalyst/encoders/RowEncoder.scala   | 19 ++
 .../spark/sql/errors/DataTypeErrorsBase.scala  |  8 ++-
 .../apache/spark/sql/errors/EncoderErrors.scala| 74 ++
 sql/catalyst/pom.xml   |  5 --
 .../sql/catalyst/encoders/ExpressionEncoder.scala  |  8 ++-
 .../spark/sql/catalyst/plans/logical/object.scala  |  8 +--
 .../spark/sql/errors/QueryExecutionErrors.scala| 60 +-
 .../org/apache/spark/sql/internal/SQLConf.scala|  2 +-
 .../spark/sql/CalendarIntervalBenchmark.scala  |  4 +-
 .../scala/org/apache/spark/sql/HashBenchmark.scala |  4 +-
 .../spark/sql/UnsafeProjectionBenchmark.scala  |  4 +-
 .../sql/catalyst/encoders/RowEncoderSuite.scala| 50 +++
 .../expressions/HashExpressionsSuite.scala |  4 +-
 .../expressions/ObjectExpressionsSuite.scala   |  2 +-
 .../optimizer/ObjectSerializerPruningSuite.scala   |  4 +-
 .../catalyst/util/ArrayDataIndexedSeqSuite.scala   |  4 +-
 .../spark/sql/catalyst/util/UnsafeArraySuite.scala |  6 +-
 .../main/scala/org/apache/spark/sql/Dataset.scala  |  6 +-
 .../scala/org/apache/spark/sql/SparkSession.scala  |  2 +-
 .../spark/sql/execution/SparkStrategies.scala  |  4 +-
 .../execution/datasources/DataSourceStrategy.scala |  4 +-
 .../sql/execution/datasources/jdbc/JdbcUtils.scala |  4 +-
 .../execution/datasources/v2/V2CommandExec.scala   |  4 +-
 .../FlatMapGroupsInPandasWithStateExec.scala   |  4 +-
 .../execution/streaming/MicroBatchExecution.scala  |  4 +-
 .../sql/execution/streaming/sources/memory.scala   |  4 +-
 .../spark/sql/DataFrameSessionWindowingSuite.scala |  4 +-
 .../org/apache/spark/sql/DataFrameSuite.scala  |  8 +--
 .../spark/sql/DataFrameTimeWindowingSuite.scala|  4 +-
 .../spark/sql/DatasetOptimizationSuite.scala   |  4 +-
 .../scala/org/apache/spark/sql/DatasetSuite.scala  |  8 +--
 .../spark/sql/execution/GroupedIteratorSuite.scala |  8 +--
 .../binaryfile/BinaryFileFormatSuite.scala |  4 +-
 .../streaming/sources/ForeachBatchSinkSuite.scala  |  4 +-
 .../apache/spark/sql/streaming/StreamTest.scala|  4 +-
 45 files changed, 205 insertions(+), 184 deletions(-)

diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala 
b/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala
index 206ce6f0675..bf6e6b8eec0 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala
@@ -25,7 +25,7 @@ import org.apache.hadoop.mapreduce.Job
 import org.apache.spark.ml.image.ImageSchema
 import org.apache.spark.sql.SparkSession
 import org.apache.spark.sql.catalyst.InternalRow
-import org.apache.spark.sql.catalyst.encoders.RowEncoder
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
 import org.apache.spark.sql.catalyst.expressions.UnsafeRow
 import org.apache.spark.sql.execution.datasources.{FileFormat, 
OutputWriterFactory, PartitionedFile}
 import org.apache.spark.sql.sources.{DataSourceRegister, 

[spark] branch master updated: [SPARK-44531][CONNECT][SQL] Move encoder inference to sql/api

2023-07-26 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 071feabbd43 [SPARK-44531][CONNECT][SQL] Move encoder inference to 
sql/api
071feabbd43 is described below

commit 071feabbd4325504332679dfa620bc5ee4359370
Author: Herman van Hovell 
AuthorDate: Wed Jul 26 07:15:27 2023 -0400

[SPARK-44531][CONNECT][SQL] Move encoder inference to sql/api

### What changes were proposed in this pull request?
This PR moves encoder inference
(ScalaReflection/RowEncoder/JavaTypeInference) into sql/api.

### Why are the changes needed?
We want to use encoder inference in the Spark Connect Scala client. The
client's dependency on catalyst is going away, so we need to move this.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #42134 from hvanhovell/SPARK-44531.

Authored-by: Herman van Hovell 
Signed-off-by: Herman van Hovell 
---
 .../spark/ml/source/image/ImageFileFormat.scala|  4 +-
 .../spark/ml/source/libsvm/LibSVMRelation.scala|  4 +-
 project/MimaExcludes.scala |  3 +
 sql/api/pom.xml|  4 ++
 .../java/org/apache/spark/sql/types/DataTypes.java |  0
 .../apache/spark/sql/types/SQLUserDefinedType.java |  0
 .../scala/org/apache/spark/sql/SqlApiConf.scala|  2 +
 .../spark/sql/catalyst/JavaTypeInference.scala |  9 ++-
 .../spark/sql/catalyst/ScalaReflection.scala   | 13 ++--
 .../apache/spark/sql/catalyst/WalkedTypePath.scala |  0
 .../spark/sql/catalyst/encoders/RowEncoder.scala   | 19 ++
 .../spark/sql/errors/DataTypeErrorsBase.scala  |  8 ++-
 .../apache/spark/sql/errors/EncoderErrors.scala| 74 ++
 sql/catalyst/pom.xml   |  5 --
 .../sql/catalyst/encoders/ExpressionEncoder.scala  |  8 ++-
 .../spark/sql/catalyst/plans/logical/object.scala  |  8 +--
 .../spark/sql/errors/QueryExecutionErrors.scala| 60 +-
 .../org/apache/spark/sql/internal/SQLConf.scala|  2 +-
 .../spark/sql/CalendarIntervalBenchmark.scala  |  4 +-
 .../scala/org/apache/spark/sql/HashBenchmark.scala |  4 +-
 .../spark/sql/UnsafeProjectionBenchmark.scala  |  4 +-
 .../sql/catalyst/encoders/RowEncoderSuite.scala| 50 +++
 .../expressions/HashExpressionsSuite.scala |  4 +-
 .../expressions/ObjectExpressionsSuite.scala   |  2 +-
 .../optimizer/ObjectSerializerPruningSuite.scala   |  4 +-
 .../catalyst/util/ArrayDataIndexedSeqSuite.scala   |  4 +-
 .../spark/sql/catalyst/util/UnsafeArraySuite.scala |  6 +-
 .../main/scala/org/apache/spark/sql/Dataset.scala  |  6 +-
 .../scala/org/apache/spark/sql/SparkSession.scala  |  2 +-
 .../spark/sql/execution/SparkStrategies.scala  |  4 +-
 .../execution/datasources/DataSourceStrategy.scala |  4 +-
 .../sql/execution/datasources/jdbc/JdbcUtils.scala |  4 +-
 .../execution/datasources/v2/V2CommandExec.scala   |  4 +-
 .../FlatMapGroupsInPandasWithStateExec.scala   |  4 +-
 .../execution/streaming/MicroBatchExecution.scala  |  4 +-
 .../sql/execution/streaming/sources/memory.scala   |  4 +-
 .../spark/sql/DataFrameSessionWindowingSuite.scala |  4 +-
 .../org/apache/spark/sql/DataFrameSuite.scala  |  8 +--
 .../spark/sql/DataFrameTimeWindowingSuite.scala|  4 +-
 .../spark/sql/DatasetOptimizationSuite.scala   |  4 +-
 .../scala/org/apache/spark/sql/DatasetSuite.scala  |  8 +--
 .../spark/sql/execution/GroupedIteratorSuite.scala |  8 +--
 .../binaryfile/BinaryFileFormatSuite.scala |  4 +-
 .../streaming/sources/ForeachBatchSinkSuite.scala  |  4 +-
 .../apache/spark/sql/streaming/StreamTest.scala|  4 +-
 45 files changed, 205 insertions(+), 184 deletions(-)

diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala 
b/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala
index 206ce6f0675..bf6e6b8eec0 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala
@@ -25,7 +25,7 @@ import org.apache.hadoop.mapreduce.Job
 import org.apache.spark.ml.image.ImageSchema
 import org.apache.spark.sql.SparkSession
 import org.apache.spark.sql.catalyst.InternalRow
-import org.apache.spark.sql.catalyst.encoders.RowEncoder
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
 import org.apache.spark.sql.catalyst.expressions.UnsafeRow
 import org.apache.spark.sql.execution.datasources.{FileFormat, 
OutputWriterFactory, PartitionedFile}
 import org.apache.spark.sql.sources.{DataSourceRegister, Filter}
@@ -90,7 +90,7 @@ private[image] class ImageFileFormat extends FileFormat with 
DataSourceRegister
 if 

[spark] branch branch-3.5 updated: [SPARK-44525][SQL] Improve error message when Invoke method is not found

2023-07-26 Thread yangjie01
This is an automated email from the ASF dual-hosted git repository.

yangjie01 pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new ad370115609 [SPARK-44525][SQL] Improve error message when Invoke 
method is not found
ad370115609 is described below

commit ad370115609cf302abb486d31c5460468979bd33
Author: Cheng Pan 
AuthorDate: Wed Jul 26 16:34:08 2023 +0800

[SPARK-44525][SQL] Improve error message when Invoke method is not found

### What changes were proposed in this pull request?

This PR aims to improve the error message when `Invoke`'s `method` is not 
found.

### Why are the changes needed?

Currently, the error message is not clear when `Invoke`'s `method` is not 
found.

Here is one error message I encountered, which is not very helpful for
finding the root cause.
```
org.apache.spark.SparkException: [INTERNAL_ERROR] A method named "invoke" 
is not declared in any enclosing class nor any supertype
at 
org.apache.spark.SparkException$.internalError(SparkException.scala:77)
at 
org.apache.spark.SparkException$.internalError(SparkException.scala:81)
at 
org.apache.spark.sql.errors.QueryExecutionErrors$.methodNotDeclaredError(QueryExecutionErrors.scala:452)
at 
org.apache.spark.sql.catalyst.expressions.objects.InvokeLike.findMethod(objects.scala:173)
at 
org.apache.spark.sql.catalyst.expressions.objects.InvokeLike.findMethod$(objects.scala:170)
at 
org.apache.spark.sql.catalyst.expressions.objects.Invoke.findMethod(objects.scala:363)
at 
org.apache.spark.sql.catalyst.expressions.objects.Invoke.method$lzycompute(objects.scala:391)
at 
org.apache.spark.sql.catalyst.expressions.objects.Invoke.method(objects.scala:389)
at 
org.apache.spark.sql.catalyst.expressions.objects.Invoke.eval(objects.scala:401)
at 
org.apache.spark.sql.catalyst.expressions.HashExpression.eval(hash.scala:292)
at 
org.apache.spark.sql.catalyst.expressions.Pmod.eval(arithmetic.scala:1054)
```

### Does this PR introduce _any_ user-facing change?

Yes, Spark returns a clearer error message.

### How was this patch tested?

Add UT.
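
For context, a rough test-style sketch of how this failure mode can be reproduced with catalyst's internal `Invoke` expression (illustrative only, not necessarily the UT added here; the method name is deliberately bogus):

```scala
// Hypothetical sketch: invoking a method that does not exist on the target class
// should now report the class, the method name and the argument classes.
import org.apache.spark.SparkException
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.expressions.objects.Invoke
import org.apache.spark.sql.types.IntegerType

object InvokeErrorSketch {
  def main(args: Array[String]): Unit = {
    val target = Literal.fromObject("some string")  // literal with ObjectType(classOf[String])
    val invoke = Invoke(target, "nonExistingMethod", IntegerType)
    try {
      invoke.eval(InternalRow.empty)
    } catch {
      // Expected to read roughly:
      //   Couldn't find method nonExistingMethod with arguments () on class java.lang.String.
      case e: SparkException => println(e.getMessage)
    }
  }
}
```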

Closes #42128 from pan3793/SPARK-44525.

Authored-by: Cheng Pan 
Signed-off-by: yangjie01 
(cherry picked from commit a84e2b1eee3bb868f140bffeba4e19b1a56fa3fb)
Signed-off-by: yangjie01 
---
 .../sql/catalyst/expressions/objects/objects.scala |  2 +-
 .../spark/sql/errors/QueryExecutionErrors.scala|  9 ++
 .../expressions/ObjectExpressionsSuite.scala   | 34 +-
 3 files changed, 43 insertions(+), 2 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
index fec60aef1bf..32bcdaf8609 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
@@ -185,7 +185,7 @@ trait InvokeLike extends Expression with NonSQLExpression 
with ImplicitCastInput
   final def findMethod(cls: Class[_], functionName: String, argClasses: 
Seq[Class[_]]): Method = {
 val method = MethodUtils.getMatchingAccessibleMethod(cls, functionName, 
argClasses: _*)
 if (method == null) {
-  throw QueryExecutionErrors.methodNotDeclaredError(functionName)
+  throw QueryExecutionErrors.methodNotFoundError(cls, functionName, 
argClasses)
 } else {
   method
 }
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala
index 2e65d672698..2d0e29b1032 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala
@@ -464,6 +464,15 @@ private[sql] object QueryExecutionErrors extends 
QueryErrorsBase {
   s"""A method named "$name" is not declared in any enclosing class nor 
any supertype""")
   }
 
+  def methodNotFoundError(
+  cls: Class[_],
+  functionName: String,
+  argClasses: Seq[Class[_]]): Throwable = {
+SparkException.internalError(
+  s"Couldn't find method $functionName with arguments " +
+s"${argClasses.mkString("(", ", ", ")")} on $cls.")
+  }
+
   def constructorNotFoundError(cls: String): SparkRuntimeException = {
 new SparkRuntimeException(
   errorClass = "_LEGACY_ERROR_TEMP_2020",
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ObjectExpressionsSuite.scala
 

[spark] branch master updated: [SPARK-44525][SQL] Improve error message when Invoke method is not found

2023-07-26 Thread yangjie01
This is an automated email from the ASF dual-hosted git repository.

yangjie01 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a84e2b1eee3 [SPARK-44525][SQL] Improve error message when Invoke 
method is not found
a84e2b1eee3 is described below

commit a84e2b1eee3bb868f140bffeba4e19b1a56fa3fb
Author: Cheng Pan 
AuthorDate: Wed Jul 26 16:34:08 2023 +0800

[SPARK-44525][SQL] Improve error message when Invoke method is not found

### What changes were proposed in this pull request?

This PR aims to improve the error message when `Invoke`'s `method` is not 
found.

### Why are the changes needed?

Currently, the error message is not clear when `Invoke`'s `method` is not 
found.

Here is one error message I encountered, which is not very helpful for
finding the root cause.
```
org.apache.spark.SparkException: [INTERNAL_ERROR] A method named "invoke" 
is not declared in any enclosing class nor any supertype
at 
org.apache.spark.SparkException$.internalError(SparkException.scala:77)
at 
org.apache.spark.SparkException$.internalError(SparkException.scala:81)
at 
org.apache.spark.sql.errors.QueryExecutionErrors$.methodNotDeclaredError(QueryExecutionErrors.scala:452)
at 
org.apache.spark.sql.catalyst.expressions.objects.InvokeLike.findMethod(objects.scala:173)
at 
org.apache.spark.sql.catalyst.expressions.objects.InvokeLike.findMethod$(objects.scala:170)
at 
org.apache.spark.sql.catalyst.expressions.objects.Invoke.findMethod(objects.scala:363)
at 
org.apache.spark.sql.catalyst.expressions.objects.Invoke.method$lzycompute(objects.scala:391)
at 
org.apache.spark.sql.catalyst.expressions.objects.Invoke.method(objects.scala:389)
at 
org.apache.spark.sql.catalyst.expressions.objects.Invoke.eval(objects.scala:401)
at 
org.apache.spark.sql.catalyst.expressions.HashExpression.eval(hash.scala:292)
at 
org.apache.spark.sql.catalyst.expressions.Pmod.eval(arithmetic.scala:1054)
```

### Does this PR introduce _any_ user-facing change?

Yes, Spark returns a clearer error message.

### How was this patch tested?

Add UT.

Closes #42128 from pan3793/SPARK-44525.

Authored-by: Cheng Pan 
Signed-off-by: yangjie01 
---
 .../sql/catalyst/expressions/objects/objects.scala |  2 +-
 .../spark/sql/errors/QueryExecutionErrors.scala|  9 ++
 .../expressions/ObjectExpressionsSuite.scala   | 34 +-
 3 files changed, 43 insertions(+), 2 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
index fec60aef1bf..32bcdaf8609 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
@@ -185,7 +185,7 @@ trait InvokeLike extends Expression with NonSQLExpression 
with ImplicitCastInput
   final def findMethod(cls: Class[_], functionName: String, argClasses: 
Seq[Class[_]]): Method = {
 val method = MethodUtils.getMatchingAccessibleMethod(cls, functionName, 
argClasses: _*)
 if (method == null) {
-  throw QueryExecutionErrors.methodNotDeclaredError(functionName)
+  throw QueryExecutionErrors.methodNotFoundError(cls, functionName, 
argClasses)
 } else {
   method
 }
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala
index 3fb14cd079f..7ddb6ee982e 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala
@@ -464,6 +464,15 @@ private[sql] object QueryExecutionErrors extends 
QueryErrorsBase {
   s"""A method named "$name" is not declared in any enclosing class nor 
any supertype""")
   }
 
+  def methodNotFoundError(
+  cls: Class[_],
+  functionName: String,
+  argClasses: Seq[Class[_]]): Throwable = {
+SparkException.internalError(
+  s"Couldn't find method $functionName with arguments " +
+s"${argClasses.mkString("(", ", ", ")")} on $cls.")
+  }
+
   def constructorNotFoundError(cls: String): SparkRuntimeException = {
 new SparkRuntimeException(
   errorClass = "_LEGACY_ERROR_TEMP_2020",
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ObjectExpressionsSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ObjectExpressionsSuite.scala
index 63edba80ec8..73da5f4d3af 

[spark] branch master updated: [SPARK-44544][INFRA] Deduplicate `run_python_packaging_tests`

2023-07-26 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 748eaff4e21 [SPARK-44544][INFRA] Deduplicate 
`run_python_packaging_tests`
748eaff4e21 is described below

commit 748eaff4e2177466dd746f6fbb82de8544bc7168
Author: Ruifeng Zheng 
AuthorDate: Wed Jul 26 15:52:38 2023 +0800

[SPARK-44544][INFRA] Deduplicate `run_python_packaging_tests`

### What changes were proposed in this pull request?
It seems that `run_python_packaging_tests` requires some disk space and causes
some pyspark modules to fail. This PR enables `run_python_packaging_tests` only
within `pyspark-errors` (which is the smallest pyspark test module).


![image](https://github.com/apache/spark/assets/7322292/2d37c141-15b8-4d9f-bfbd-4dd7782ab62e)

### Why are the changes needed?

1. It seems it is `run_python_packaging_tests` that causes the `No space left`
   error.
2. `run_python_packaging_tests` is run in all `pyspark-*` test modules and
   should be deduplicated.

### Does this PR introduce _any_ user-facing change?
no, infra-only

### How was this patch tested?
updated CI

Closes #42146 from zhengruifeng/infra_skip_py_packing_tests.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .github/workflows/build_and_test.yml | 16 ++--
 dev/run-tests.py |  2 +-
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 7107af66129..02b3814a018 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -205,6 +205,7 @@ jobs:
   HIVE_PROFILE: ${{ matrix.hive }}
   GITHUB_PREV_SHA: ${{ github.event.before }}
   SPARK_LOCAL_IP: localhost
+  SKIP_PACKAGING: true
 steps:
 - name: Checkout Spark repository
   uses: actions/checkout@v3
@@ -344,6 +345,8 @@ jobs:
 java:
   - ${{ inputs.java }}
 modules:
+  - >-
+pyspark-errors
   - >-
 pyspark-sql, pyspark-mllib, pyspark-resource, pyspark-testing
   - >-
@@ -353,7 +356,7 @@ jobs:
   - >-
 pyspark-pandas-slow
   - >-
-pyspark-connect, pyspark-errors
+pyspark-connect
   - >-
 pyspark-pandas-connect
   - >-
@@ -366,6 +369,7 @@ jobs:
   SPARK_LOCAL_IP: localhost
   SKIP_UNIDOC: true
   SKIP_MIMA: true
+  SKIP_PACKAGING: true
   METASPACE_SIZE: 1g
 steps:
 - name: Checkout Spark repository
@@ -414,14 +418,20 @@ jobs:
 python3.9 -m pip list
 pypy3 -m pip list
 - name: Install Conda for pip packaging test
+  if: ${{ matrix.modules == 'pyspark-errors' }}
   run: |
 curl -s 
https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh > 
miniconda.sh
 bash miniconda.sh -b -p $HOME/miniconda
 # Run the tests.
 - name: Run tests
   env: ${{ fromJSON(inputs.envs) }}
+  shell: 'script -q -e -c "bash {0}"'
   run: |
-export PATH=$PATH:$HOME/miniconda/bin
+if [[ "$MODULES_TO_TEST" == "pyspark-errors" ]]; then
+  export PATH=$PATH:$HOME/miniconda/bin
+  export SKIP_PACKAGING=false
+  echo "Python Packaging Tests Enabled!"
+fi
 ./dev/run-tests --parallelism 1 --modules "$MODULES_TO_TEST"
 - name: Upload coverage to Codecov
   if: fromJSON(inputs.envs).PYSPARK_CODECOV == 'true'
@@ -457,6 +467,7 @@ jobs:
   GITHUB_PREV_SHA: ${{ github.event.before }}
   SPARK_LOCAL_IP: localhost
   SKIP_MIMA: true
+  SKIP_PACKAGING: true
 steps:
 - name: Checkout Spark repository
   uses: actions/checkout@v3
@@ -911,6 +922,7 @@ jobs:
   SPARK_LOCAL_IP: localhost
   ORACLE_DOCKER_IMAGE_NAME: gvenzl/oracle-xe:21.3.0
   SKIP_MIMA: true
+  SKIP_PACKAGING: true
 steps:
 - name: Checkout Spark repository
   uses: actions/checkout@v3
diff --git a/dev/run-tests.py b/dev/run-tests.py
index c0c281b549e..9bf3095edb7 100755
--- a/dev/run-tests.py
+++ b/dev/run-tests.py
@@ -395,7 +395,7 @@ def run_python_tests(test_modules, parallelism, 
with_coverage=False):
 
 
 def run_python_packaging_tests():
-if not os.environ.get("SPARK_JENKINS"):
+if not os.environ.get("SPARK_JENKINS") and 
os.environ.get("SKIP_PACKAGING", "false") != "true":
 set_title_and_block("Running PySpark packaging tests", 
"BLOCK_PYSPARK_PIP_TESTS")
 command = [os.path.join(SPARK_HOME, "dev", "run-pip-tests")]
 run_cmd(command)



[spark] branch branch-3.5 updated: [SPARK-44544][INFRA] Deduplicate `run_python_packaging_tests`

2023-07-26 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new a8539688186 [SPARK-44544][INFRA] Deduplicate 
`run_python_packaging_tests`
a8539688186 is described below

commit a8539688186be40c81c39050e70a49a9ef01519f
Author: Ruifeng Zheng 
AuthorDate: Wed Jul 26 15:52:38 2023 +0800

[SPARK-44544][INFRA] Deduplicate `run_python_packaging_tests`

### What changes were proposed in this pull request?
It seems that `run_python_packaging_tests` requires some disk space and causes
some pyspark modules to fail. This PR enables `run_python_packaging_tests` only
within `pyspark-errors` (which is the smallest pyspark test module).


![image](https://github.com/apache/spark/assets/7322292/2d37c141-15b8-4d9f-bfbd-4dd7782ab62e)

### Why are the changes needed?

1. It seems it is `run_python_packaging_tests` that causes the `No space left`
   error.
2. `run_python_packaging_tests` is run in all `pyspark-*` test modules and
   should be deduplicated.

### Does this PR introduce _any_ user-facing change?
no, infra-only

### How was this patch tested?
updated CI

Closes #42146 from zhengruifeng/infra_skip_py_packing_tests.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
(cherry picked from commit 748eaff4e2177466dd746f6fbb82de8544bc7168)
Signed-off-by: Ruifeng Zheng 
---
 .github/workflows/build_and_test.yml | 16 ++--
 dev/run-tests.py |  2 +-
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 54fe9f38ddd..1fcca7e4c39 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -204,6 +204,7 @@ jobs:
   HIVE_PROFILE: ${{ matrix.hive }}
   GITHUB_PREV_SHA: ${{ github.event.before }}
   SPARK_LOCAL_IP: localhost
+  SKIP_PACKAGING: true
 steps:
 - name: Checkout Spark repository
   uses: actions/checkout@v3
@@ -343,6 +344,8 @@ jobs:
 java:
   - ${{ inputs.java }}
 modules:
+  - >-
+pyspark-errors
   - >-
 pyspark-sql, pyspark-mllib, pyspark-resource, pyspark-testing
   - >-
@@ -352,7 +355,7 @@ jobs:
   - >-
 pyspark-pandas-slow
   - >-
-pyspark-connect, pyspark-errors
+pyspark-connect
   - >-
 pyspark-pandas-connect
   - >-
@@ -365,6 +368,7 @@ jobs:
   SPARK_LOCAL_IP: localhost
   SKIP_UNIDOC: true
   SKIP_MIMA: true
+  SKIP_PACKAGING: true
   METASPACE_SIZE: 1g
 steps:
 - name: Checkout Spark repository
@@ -413,14 +417,20 @@ jobs:
 python3.9 -m pip list
 pypy3 -m pip list
 - name: Install Conda for pip packaging test
+  if: ${{ matrix.modules == 'pyspark-errors' }}
   run: |
 curl -s 
https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh > 
miniconda.sh
 bash miniconda.sh -b -p $HOME/miniconda
 # Run the tests.
 - name: Run tests
   env: ${{ fromJSON(inputs.envs) }}
+  shell: 'script -q -e -c "bash {0}"'
   run: |
-export PATH=$PATH:$HOME/miniconda/bin
+if [[ "$MODULES_TO_TEST" == "pyspark-errors" ]]; then
+  export PATH=$PATH:$HOME/miniconda/bin
+  export SKIP_PACKAGING=false
+  echo "Python Packaging Tests Enabled!"
+fi
 ./dev/run-tests --parallelism 1 --modules "$MODULES_TO_TEST"
 - name: Upload coverage to Codecov
   if: fromJSON(inputs.envs).PYSPARK_CODECOV == 'true'
@@ -456,6 +466,7 @@ jobs:
   GITHUB_PREV_SHA: ${{ github.event.before }}
   SPARK_LOCAL_IP: localhost
   SKIP_MIMA: true
+  SKIP_PACKAGING: true
 steps:
 - name: Checkout Spark repository
   uses: actions/checkout@v3
@@ -900,6 +911,7 @@ jobs:
   SPARK_LOCAL_IP: localhost
   ORACLE_DOCKER_IMAGE_NAME: gvenzl/oracle-xe:21.3.0
   SKIP_MIMA: true
+  SKIP_PACKAGING: true
 steps:
 - name: Checkout Spark repository
   uses: actions/checkout@v3
diff --git a/dev/run-tests.py b/dev/run-tests.py
index c0c281b549e..9bf3095edb7 100755
--- a/dev/run-tests.py
+++ b/dev/run-tests.py
@@ -395,7 +395,7 @@ def run_python_tests(test_modules, parallelism, 
with_coverage=False):
 
 
 def run_python_packaging_tests():
-if not os.environ.get("SPARK_JENKINS"):
+if not os.environ.get("SPARK_JENKINS") and 
os.environ.get("SKIP_PACKAGING", "false") != "true":
 set_title_and_block("Running PySpark packaging tests", 
"BLOCK_PYSPARK_PIP_TESTS")
 command = [os.path.join(SPARK_HOME, "dev", "run-pip-tests")]
 run_cmd(command)



[spark] branch branch-3.5 updated: [SPARK-44264][PYTHON][DOCS] Added Example to Deepspeed Distributor

2023-07-26 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new d4b03ec53db [SPARK-44264][PYTHON][DOCS] Added Example to Deepspeed 
Distributor
d4b03ec53db is described below

commit d4b03ec53db98d237e00aa9e097ef69faa19b4b1
Author: Mathew Jacob 
AuthorDate: Wed Jul 26 16:22:45 2023 +0900

[SPARK-44264][PYTHON][DOCS] Added Example to Deepspeed Distributor

### What changes were proposed in this pull request?
Added examples to the docstring of using DeepspeedTorchDistributor

### Why are the changes needed?
More concrete examples, allowing for a better understanding of the feature.

### Does this PR introduce _any_ user-facing change?
Yes, docs changes.

### How was this patch tested?
make html

Closes #42087 from mathewjacob1002/docs_deepspeed.

Lead-authored-by: Mathew Jacob 
Co-authored-by: Mathew Jacob 
<134338709+mathewjacob1...@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit ac8fe83af2178d76a9e3df9fedf008ef26d8d044)
Signed-off-by: Hyukjin Kwon 
---
 .../pyspark/ml/deepspeed/deepspeed_distributor.py  | 50 --
 1 file changed, 38 insertions(+), 12 deletions(-)

diff --git a/python/pyspark/ml/deepspeed/deepspeed_distributor.py 
b/python/pyspark/ml/deepspeed/deepspeed_distributor.py
index d6ae98de5e3..7c2b8c43526 100644
--- a/python/pyspark/ml/deepspeed/deepspeed_distributor.py
+++ b/python/pyspark/ml/deepspeed/deepspeed_distributor.py
@@ -35,11 +35,11 @@ class DeepspeedTorchDistributor(TorchDistributor):
 
 def __init__(
 self,
-num_gpus: int = 1,
+numGpus: int = 1,
 nnodes: int = 1,
-local_mode: bool = True,
-use_gpu: bool = True,
-deepspeed_config: Optional[Union[str, Dict[str, Any]]] = None,
+localMode: bool = True,
+useGpu: bool = True,
+deepspeedConfig: Optional[Union[str, Dict[str, Any]]] = None,
 ):
 """
 This class is used to run deepspeed training workloads with spark 
clusters.
@@ -49,25 +49,51 @@ class DeepspeedTorchDistributor(TorchDistributor):
 
 Parameters
 --
-num_gpus: int
+numGpus: int
The number of GPUs to use per node (analogous to num_gpus in 
deepspeed command).
 nnodes: int
 The number of nodes that should be used for the run.
-local_mode: bool
+localMode: bool
 Whether or not to run the training in a distributed fashion or 
just locally.
-use_gpu: bool
+useGpu: bool
 Boolean flag to determine whether to utilize gpus.
-deepspeed_config: Union[Dict[str,Any], str] or None:
+deepspeedConfig: Union[Dict[str,Any], str] or None:
 The configuration file to be used for launching the deepspeed 
application.
 If it's a dictionary containing the parameters, then we will 
create the file.
 If None, deepspeed will fall back to default parameters.
+
+Examples
+
+Run Deepspeed training function on a single node
+
+>>> def train(learning_rate):
+... import deepspeed
+... # rest of training function
+... return model
+>>> distributor = DeepspeedTorchDistributor(
+... numGpus=4,
+... nnodes=1,
+... useGpu=True,
+... localMode=True,
+... deepspeedConfig="path/to/config.json")
+>>> output = distributor.run(train, 0.01)
+
+Run Deepspeed training function on multiple nodes
+
+>>> distributor = DeepspeedTorchDistributor(
+... numGpus=4,
+... nnodes=3,
+... useGpu=True,
+... localMode=False,
+... deepspeedConfig="path/to/config.json")
+>>> output = distributor.run(train, 0.01)
 """
-num_processes = num_gpus * nnodes
-self.deepspeed_config = deepspeed_config
+num_processes = numGpus * nnodes
+self.deepspeed_config = deepspeedConfig
 super().__init__(
 num_processes,
-local_mode,
-use_gpu,
+localMode,
+useGpu,
 _ssl_conf=DeepspeedTorchDistributor._DEEPSPEED_SSL_CONF,
 )
 self.cleanup_deepspeed_conf = False





[spark] branch master updated: [SPARK-44264][PYTHON][DOCS] Added Example to Deepspeed Distributor

2023-07-26 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ac8fe83af21 [SPARK-44264][PYTHON][DOCS] Added Example to Deepspeed 
Distributor
ac8fe83af21 is described below

commit ac8fe83af2178d76a9e3df9fedf008ef26d8d044
Author: Mathew Jacob 
AuthorDate: Wed Jul 26 16:22:45 2023 +0900

[SPARK-44264][PYTHON][DOCS] Added Example to Deepspeed Distributor

### What changes were proposed in this pull request?
Added examples to the docstring of using DeepspeedTorchDistributor

### Why are the changes needed?
More concrete examples, allowing for a better understanding of the feature.

### Does this PR introduce _any_ user-facing change?
Yes, docs changes.

### How was this patch tested?
make html

Closes #42087 from mathewjacob1002/docs_deepspeed.

Lead-authored-by: Mathew Jacob 
Co-authored-by: Mathew Jacob 
<134338709+mathewjacob1...@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon 
---
 .../pyspark/ml/deepspeed/deepspeed_distributor.py  | 50 --
 1 file changed, 38 insertions(+), 12 deletions(-)

diff --git a/python/pyspark/ml/deepspeed/deepspeed_distributor.py 
b/python/pyspark/ml/deepspeed/deepspeed_distributor.py
index d6ae98de5e3..7c2b8c43526 100644
--- a/python/pyspark/ml/deepspeed/deepspeed_distributor.py
+++ b/python/pyspark/ml/deepspeed/deepspeed_distributor.py
@@ -35,11 +35,11 @@ class DeepspeedTorchDistributor(TorchDistributor):
 
 def __init__(
 self,
-num_gpus: int = 1,
+numGpus: int = 1,
 nnodes: int = 1,
-local_mode: bool = True,
-use_gpu: bool = True,
-deepspeed_config: Optional[Union[str, Dict[str, Any]]] = None,
+localMode: bool = True,
+useGpu: bool = True,
+deepspeedConfig: Optional[Union[str, Dict[str, Any]]] = None,
 ):
 """
 This class is used to run deepspeed training workloads with spark 
clusters.
@@ -49,25 +49,51 @@ class DeepspeedTorchDistributor(TorchDistributor):
 
 Parameters
 --
-num_gpus: int
+numGpus: int
The number of GPUs to use per node (analogous to num_gpus in 
deepspeed command).
 nnodes: int
 The number of nodes that should be used for the run.
-local_mode: bool
+localMode: bool
 Whether or not to run the training in a distributed fashion or 
just locally.
-use_gpu: bool
+useGpu: bool
 Boolean flag to determine whether to utilize gpus.
-deepspeed_config: Union[Dict[str,Any], str] or None:
+deepspeedConfig: Union[Dict[str,Any], str] or None:
 The configuration file to be used for launching the deepspeed 
application.
 If it's a dictionary containing the parameters, then we will 
create the file.
 If None, deepspeed will fall back to default parameters.
+
+Examples
+
+Run Deepspeed training function on a single node
+
+>>> def train(learning_rate):
+... import deepspeed
+... # rest of training function
+... return model
+>>> distributor = DeepspeedTorchDistributor(
+... numGpus=4,
+... nnodes=1,
+... useGpu=True,
+... localMode=True,
+... deepspeedConfig="path/to/config.json")
+>>> output = distributor.run(train, 0.01)
+
+Run Deepspeed training function on multiple nodes
+
+>>> distributor = DeepspeedTorchDistributor(
+... numGpus=4,
+... nnodes=3,
+... useGpu=True,
+... localMode=False,
+... deepspeedConfig="path/to/config.json")
+>>> output = distributor.run(train, 0.01)
 """
-num_processes = num_gpus * nnodes
-self.deepspeed_config = deepspeed_config
+num_processes = numGpus * nnodes
+self.deepspeed_config = deepspeedConfig
 super().__init__(
 num_processes,
-local_mode,
-use_gpu,
+localMode,
+useGpu,
 _ssl_conf=DeepspeedTorchDistributor._DEEPSPEED_SSL_CONF,
 )
 self.cleanup_deepspeed_conf = False





[spark] branch branch-3.5 updated: [SPARK-44264][PYTHON] Added Deepspeed Folder to setup.py

2023-07-26 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 11f9d00caed [SPARK-44264][PYTHON] Added Deepspeed Folder to setup.py
11f9d00caed is described below

commit 11f9d00caedc3cb1dfc94cf5cfbdfd7aa7c93576
Author: Mathew Jacob 
AuthorDate: Wed Jul 26 16:21:01 2023 +0900

[SPARK-44264][PYTHON] Added Deepspeed Folder to setup.py

### What changes were proposed in this pull request?
Added pyspark.ml.deepspeed to the setup.py

### Why are the changes needed?
It allows the deepspeed distributor to be pip installed when you pip 
install pyspark.

### Does this PR introduce _any_ user-facing change?
No

Closes #42123 from mathewjacob1002/add_deepspeed_setup_py.

Authored-by: Mathew Jacob 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 89c3472c5feb7b9625f55f82d08429dcb55a6fee)
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/ml/deepspeed/__init__.py | 16 
 python/setup.py |  1 +
 2 files changed, 17 insertions(+)

diff --git a/python/pyspark/ml/deepspeed/__init__.py 
b/python/pyspark/ml/deepspeed/__init__.py
new file mode 100644
index 000..cce3acad34a
--- /dev/null
+++ b/python/pyspark/ml/deepspeed/__init__.py
@@ -0,0 +1,16 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
diff --git a/python/setup.py b/python/setup.py
index f190930b2a7..3774273c421 100755
--- a/python/setup.py
+++ b/python/setup.py
@@ -241,6 +241,7 @@ try:
 "pyspark.ml.linalg",
 "pyspark.ml.param",
 "pyspark.ml.torch",
+"pyspark.ml.deepspeed",
 "pyspark.sql",
 "pyspark.sql.avro",
 "pyspark.sql.connect",





[spark] branch master updated: [SPARK-44264][PYTHON] Added Deepspeed Folder to setup.py

2023-07-26 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 89c3472c5fe [SPARK-44264][PYTHON] Added Deepspeed Folder to setup.py
89c3472c5fe is described below

commit 89c3472c5feb7b9625f55f82d08429dcb55a6fee
Author: Mathew Jacob 
AuthorDate: Wed Jul 26 16:21:01 2023 +0900

[SPARK-44264][PYTHON] Added Deepspeed Folder to setup.py

### What changes were proposed in this pull request?
Added pyspark.ml.deepspeed to the setup.py

### Why are the changes needed?
It allows the deepspeed distributor to be pip installed when you pip 
install pyspark.

### Does this PR introduce _any_ user-facing change?
No

Closes #42123 from mathewjacob1002/add_deepspeed_setup_py.

Authored-by: Mathew Jacob 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/ml/deepspeed/__init__.py | 16 
 python/setup.py |  1 +
 2 files changed, 17 insertions(+)

diff --git a/python/pyspark/ml/deepspeed/__init__.py 
b/python/pyspark/ml/deepspeed/__init__.py
new file mode 100644
index 000..cce3acad34a
--- /dev/null
+++ b/python/pyspark/ml/deepspeed/__init__.py
@@ -0,0 +1,16 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
diff --git a/python/setup.py b/python/setup.py
index 93ca373c586..fa938b0b4ef 100755
--- a/python/setup.py
+++ b/python/setup.py
@@ -241,6 +241,7 @@ try:
 "pyspark.ml.linalg",
 "pyspark.ml.param",
 "pyspark.ml.torch",
+"pyspark.ml.deepspeed",
 "pyspark.sql",
 "pyspark.sql.avro",
 "pyspark.sql.connect",





[GitHub] [spark-website] zhengruifeng commented on pull request #469: Add Peter Toth to committers

2023-07-26 Thread via GitHub


zhengruifeng commented on PR #469:
URL: https://github.com/apache/spark-website/pull/469#issuecomment-1651112455

   Late LGTM, Congrats!





[GitHub] [spark-website] peter-toth commented on pull request #469: Add Peter Toth to committers

2023-07-26 Thread via GitHub


peter-toth commented on PR #469:
URL: https://github.com/apache/spark-website/pull/469#issuecomment-1651044687

   Thank you @dongjoon-hyun!

