[spark] branch branch-3.4 updated: [SPARK-44557][INFRA] Clean up untracked/ignored files before running pip packaging test in GitHub Actions
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 4f9d6d077ff [SPARK-44557][INFRA] Clean up untracked/ignored files before running pip packaging test in GitHub Actions 4f9d6d077ff is described below commit 4f9d6d077ff1d3f359774d21d32833cb80a193b7 Author: Hyukjin Kwon AuthorDate: Thu Jul 27 14:08:09 2023 +0900 [SPARK-44557][INFRA] Clean up untracked/ignored files before running pip packaging test in GitHub Actions ### What changes were proposed in this pull request? This PR proposes to remove untracked/ignored files before running pip packaging test in GitHub Actions. ### Why are the changes needed? In order to fix the flakiness in the test such as: ``` ... creating dist Creating tar archive error: [Errno 28] No space left on device Cleaning up temporary directory - /tmp/tmp.CvSzgB7Kyy ``` See also https://github.com/apache/spark/actions/runs/5665869112/job/15351515539. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? GitHub Actions build in this PR. Closes #42159 from HyukjinKwon/debug-ci-failure. Lead-authored-by: Hyukjin Kwon Co-authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon (cherry picked from commit 77aab6edb7c2b7be54cb19d40c0eeeb2c699f0f7) Signed-off-by: Hyukjin Kwon --- dev/run-pip-tests | 7 +++ 1 file changed, 7 insertions(+) diff --git a/dev/run-pip-tests b/dev/run-pip-tests index 5da231495a2..773611d9d92 100755 --- a/dev/run-pip-tests +++ b/dev/run-pip-tests @@ -25,6 +25,13 @@ shopt -s nullglob FWDIR="$(cd "$(dirname "$0")"/..; pwd)" cd "$FWDIR" +# Clean ignored/untracked files that do not need +# for pip packaging test. Machines in GitHub Action do not have +# enough space, see also SPARK-44557. +if [[ ! -z "${GITHUB_ACTIONS}" ]]; then + git clean -d -f -x -e assembly +fi + echo "Constructing virtual env for testing" VIRTUALENV_BASE=$(mktemp -d) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.5 updated: [SPARK-44557][INFRA] Clean up untracked/ignored files before running pip packaging test in GitHub Actions
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 7539d1992ca [SPARK-44557][INFRA] Clean up untracked/ignored files before running pip packaging test in GitHub Actions 7539d1992ca is described below commit 7539d1992ca2c01988c440c3a0706a23b9112e73 Author: Hyukjin Kwon AuthorDate: Thu Jul 27 14:08:09 2023 +0900 [SPARK-44557][INFRA] Clean up untracked/ignored files before running pip packaging test in GitHub Actions ### What changes were proposed in this pull request? This PR proposes to remove untracked/ignored files before running pip packaging test in GitHub Actions. ### Why are the changes needed? In order to fix the flakiness in the test such as: ``` ... creating dist Creating tar archive error: [Errno 28] No space left on device Cleaning up temporary directory - /tmp/tmp.CvSzgB7Kyy ``` See also https://github.com/apache/spark/actions/runs/5665869112/job/15351515539. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? GitHub Actions build in this PR. Closes #42159 from HyukjinKwon/debug-ci-failure. Lead-authored-by: Hyukjin Kwon Co-authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon (cherry picked from commit 77aab6edb7c2b7be54cb19d40c0eeeb2c699f0f7) Signed-off-by: Hyukjin Kwon --- dev/run-pip-tests | 7 +++ 1 file changed, 7 insertions(+) diff --git a/dev/run-pip-tests b/dev/run-pip-tests index 5da231495a2..773611d9d92 100755 --- a/dev/run-pip-tests +++ b/dev/run-pip-tests @@ -25,6 +25,13 @@ shopt -s nullglob FWDIR="$(cd "$(dirname "$0")"/..; pwd)" cd "$FWDIR" +# Clean ignored/untracked files that do not need +# for pip packaging test. Machines in GitHub Action do not have +# enough space, see also SPARK-44557. +if [[ ! -z "${GITHUB_ACTIONS}" ]]; then + git clean -d -f -x -e assembly +fi + echo "Constructing virtual env for testing" VIRTUALENV_BASE=$(mktemp -d) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-44557][INFRA] Clean up untracked/ignored files before running pip packaging test in GitHub Actions
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 77aab6edb7c [SPARK-44557][INFRA] Clean up untracked/ignored files before running pip packaging test in GitHub Actions 77aab6edb7c is described below commit 77aab6edb7c2b7be54cb19d40c0eeeb2c699f0f7 Author: Hyukjin Kwon AuthorDate: Thu Jul 27 14:08:09 2023 +0900 [SPARK-44557][INFRA] Clean up untracked/ignored files before running pip packaging test in GitHub Actions ### What changes were proposed in this pull request? This PR proposes to remove untracked/ignored files before running pip packaging test in GitHub Actions. ### Why are the changes needed? In order to fix the flakiness in the test such as: ``` ... creating dist Creating tar archive error: [Errno 28] No space left on device Cleaning up temporary directory - /tmp/tmp.CvSzgB7Kyy ``` See also https://github.com/apache/spark/actions/runs/5665869112/job/15351515539. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? GitHub Actions build in this PR. Closes #42159 from HyukjinKwon/debug-ci-failure. Lead-authored-by: Hyukjin Kwon Co-authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon --- dev/run-pip-tests | 7 +++ 1 file changed, 7 insertions(+) diff --git a/dev/run-pip-tests b/dev/run-pip-tests index 5da231495a2..773611d9d92 100755 --- a/dev/run-pip-tests +++ b/dev/run-pip-tests @@ -25,6 +25,13 @@ shopt -s nullglob FWDIR="$(cd "$(dirname "$0")"/..; pwd)" cd "$FWDIR" +# Clean ignored/untracked files that do not need +# for pip packaging test. Machines in GitHub Action do not have +# enough space, see also SPARK-44557. +if [[ ! -z "${GITHUB_ACTIONS}" ]]; then + git clean -d -f -x -e assembly +fi + echo "Constructing virtual env for testing" VIRTUALENV_BASE=$(mktemp -d) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-44533][PYTHON] Add support for accumulator, broadcast, and Spark files in Python UDTF's analyze
This is an automated email from the ASF dual-hosted git repository. ueshin pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8647b243dee [SPARK-44533][PYTHON] Add support for accumulator, broadcast, and Spark files in Python UDTF's analyze 8647b243dee is described below commit 8647b243deed8f2c3279ed17fe196006b6c923af Author: Takuya UESHIN AuthorDate: Wed Jul 26 21:03:08 2023 -0700 [SPARK-44533][PYTHON] Add support for accumulator, broadcast, and Spark files in Python UDTF's analyze ### What changes were proposed in this pull request? Adds support for `accumulator`, `broadcast` in vanilla PySpark, and Spark files in both vanilla PySpark and Spark Connect Python client, in Python UDTF's analyze. For example, in vanilla PySpark: ```py >>> colname = sc.broadcast("col1") >>> test_accum = sc.accumulator(0) >>> udtf ... class TestUDTF: ... staticmethod ... def analyze(a: AnalyzeArgument) -> AnalyzeResult: ... test_accum.add(1) ... return AnalyzeResult(StructType().add(colname.value, a.data_type)) ... def eval(self, a): ... test_accum.add(1) ... yield a, ... >>> df = TestUDTF(lit(10)) >>> df.printSchema() root |-- col1: integer (nullable = true) >>> df.show() ++ |col1| ++ | 10| ++ >>> test_accum.value 2 ``` or ```py >>> pyfile_path = "my_pyfile.py" >>> with open(pyfile_path, "w") as f: ... f.write("my_func = lambda: 'col1'") ... 24 >>> sc.addPyFile(pyfile_path) >>> # or spark.addArtifacts(pyfile_path, pyfile=True) >>> >>> udtf ... class TestUDTF: ... staticmethod ... def analyze(a: AnalyzeArgument) -> AnalyzeResult: ... import my_pyfile ... return AnalyzeResult(StructType().add(my_pyfile.my_func(), a.data_type)) ... def eval(self, a): ... yield a, ... >>> df = TestUDTF(lit(10)) >>> df.printSchema() root |-- col1: integer (nullable = true) >>> df.show() ++ |col1| ++ | 10| ++ ``` ### Why are the changes needed? To support missing features: `accumulator`, `broadcast`, and Spark files in Python UDTF's analyze. ### Does this PR introduce _any_ user-facing change? Yes, accumulator, broadcast in vanilla PySpark, and Spark files in both vanilla PySpark and Spark Connect Python client will be available. ### How was this patch tested? Added related tests. Closes #42135 from ueshin/issues/SPARK-44533/analyze. Authored-by: Takuya UESHIN Signed-off-by: Takuya UESHIN --- .../org/apache/spark/api/python/PythonRDD.scala| 4 +- .../org/apache/spark/api/python/PythonRunner.scala | 83 +--- .../spark/api/python/PythonWorkerUtils.scala | 152 ++ .../pyspark/sql/tests/connect/test_parity_udtf.py | 19 ++ python/pyspark/sql/tests/test_udtf.py | 224 - python/pyspark/sql/worker/analyze_udtf.py | 18 +- python/pyspark/worker.py | 91 ++--- python/pyspark/worker_util.py | 132 .../execution/python/BatchEvalPythonUDTFExec.scala | 7 +- .../python/UserDefinedPythonFunction.scala | 23 ++- 10 files changed, 584 insertions(+), 169 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala b/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala index 95fbc145d83..91fd92d4422 100644 --- a/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala +++ b/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala @@ -487,9 +487,7 @@ private[spark] object PythonRDD extends Logging { } def writeUTF(str: String, dataOut: DataOutputStream): Unit = { -val bytes = str.getBytes(StandardCharsets.UTF_8) -dataOut.writeInt(bytes.length) -dataOut.write(bytes) +PythonWorkerUtils.writeUTF(str, dataOut) } /** diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala b/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala index 5d719b33a30..0173de75ff2 100644 --- a/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala +++ b/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala @@ -309,8 +309,9 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( val dataOut = new DataOutputStream(stream) // Partition index dataOut.writeInt(partitionIndex) -// Python version of driver -PythonRDD.writeUTF(pythonVer, dataOut) + +PythonWorkerUtils.writePythonVersion(pythonVer, dataOut) + // Init a ServerSocket to accept
[spark] branch master updated: [SPARK-43611][SQL][PS][CONNCECT] Make `ExtractWindowExpressions` retain the `PLAN_ID_TAG`
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 185a0a5a239 [SPARK-43611][SQL][PS][CONNCECT] Make `ExtractWindowExpressions` retain the `PLAN_ID_TAG` 185a0a5a239 is described below commit 185a0a5a23958676e4236eaf9e4d78cdfd2dd2d7 Author: Ruifeng Zheng AuthorDate: Thu Jul 27 11:00:18 2023 +0800 [SPARK-43611][SQL][PS][CONNCECT] Make `ExtractWindowExpressions` retain the `PLAN_ID_TAG` ### What changes were proposed in this pull request? Make rule `ExtractWindowExpressions` retain the `PLAN_ID_TAG ` ### Why are the changes needed? In https://github.com/apache/spark/pull/39925, we introduced a new mechanism to resolve expression with specified plan. However, sometimes the plan ID might be discarded by some analyzer rules, and then some expressions can not be correctly resolved, this issue is the main blocker of PS on Connect. ### Does this PR introduce _any_ user-facing change? yes, a lot of Pandas APIs enabled ### How was this patch tested? Enable UTs Closes #42086 from zhengruifeng/ps_connect_analyze_window. Authored-by: Ruifeng Zheng Signed-off-by: Ruifeng Zheng --- .../computation/test_parity_missing_data.py| 30 -- .../tests/connect/series/test_parity_compute.py| 16 --- .../tests/connect/series/test_parity_cumulative.py | 25 + .../tests/connect/series/test_parity_index.py | 7 +- .../connect/series/test_parity_missing_data.py | 35 +- .../tests/connect/series/test_parity_stat.py | 11 +- .../pandas/tests/connect/test_parity_ewm.py| 12 +-- .../pandas/tests/connect/test_parity_expanding.py | 120 + .../test_parity_ops_on_diff_frames_groupby.py | 48 + ..._parity_ops_on_diff_frames_groupby_expanding.py | 42 +--- .../pandas/tests/connect/test_parity_rolling.py| 120 + .../spark/sql/catalyst/analysis/Analyzer.scala | 12 ++- 12 files changed, 21 insertions(+), 457 deletions(-) diff --git a/python/pyspark/pandas/tests/connect/computation/test_parity_missing_data.py b/python/pyspark/pandas/tests/connect/computation/test_parity_missing_data.py index a88c8692eca..d2ff09e5e8a 100644 --- a/python/pyspark/pandas/tests/connect/computation/test_parity_missing_data.py +++ b/python/pyspark/pandas/tests/connect/computation/test_parity_missing_data.py @@ -29,36 +29,6 @@ class FrameParityMissingDataTests( def psdf(self): return ps.from_pandas(self.pdf) -@unittest.skip( -"TODO(SPARK-43611): Fix unexpected `AnalysisException` from Spark Connect client." -) -def test_backfill(self): -super().test_backfill() - -@unittest.skip( -"TODO(SPARK-43611): Fix unexpected `AnalysisException` from Spark Connect client." -) -def test_bfill(self): -super().test_bfill() - -@unittest.skip( -"TODO(SPARK-43611): Fix unexpected `AnalysisException` from Spark Connect client." -) -def test_ffill(self): -super().test_ffill() - -@unittest.skip( -"TODO(SPARK-43611): Fix unexpected `AnalysisException` from Spark Connect client." -) -def test_fillna(self): -return super().test_fillna() - -@unittest.skip( -"TODO(SPARK-43611): Fix unexpected `AnalysisException` from Spark Connect client." -) -def test_pad(self): -super().test_pad() - if __name__ == "__main__": from pyspark.pandas.tests.connect.computation.test_parity_missing_data import * # noqa: F401 diff --git a/python/pyspark/pandas/tests/connect/series/test_parity_compute.py b/python/pyspark/pandas/tests/connect/series/test_parity_compute.py index 00e35b27e8f..f757d19ca69 100644 --- a/python/pyspark/pandas/tests/connect/series/test_parity_compute.py +++ b/python/pyspark/pandas/tests/connect/series/test_parity_compute.py @@ -22,22 +22,6 @@ from pyspark.testing.pandasutils import PandasOnSparkTestUtils class SeriesParityComputeTests(SeriesComputeMixin, PandasOnSparkTestUtils, ReusedConnectTestCase): -@unittest.skip( -"TODO(SPARK-43611): Fix unexpected `AnalysisException` from Spark Connect client." -) -def test_diff(self): -super().test_diff() - -@unittest.skip("TODO(SPARK-43620): Support `Column` for SparkConnectColumn.__getitem__.") -def test_factorize(self): -super().test_factorize() - -@unittest.skip( -"TODO(SPARK-43611): Fix unexpected `AnalysisException` from Spark Connect client." -) -def test_shift(self): -super().test_shift() - @unittest.skip( "TODO(SPARK-43611): Fix unexpected `AnalysisException` from Spark Connect client." ) diff --git
[spark] branch branch-3.4 updated: [SPARK-44479][CONNECT][PYTHON] Fix protobuf conversion from an empty struct type
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 135bb497eda [SPARK-44479][CONNECT][PYTHON] Fix protobuf conversion from an empty struct type 135bb497eda is described below commit 135bb497edad8d132323257547c86bd405ecba8e Author: Takuya UESHIN AuthorDate: Thu Jul 27 10:10:44 2023 +0900 [SPARK-44479][CONNECT][PYTHON] Fix protobuf conversion from an empty struct type ### What changes were proposed in this pull request? This is a partial backport of #42161. Fixes protobuf conversion from an empty struct type. ### Why are the changes needed? The empty struct type was not properly converted to the protobuf message. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #42179 from ueshin/issues/SPARK-44479/3.4/empty_schema. Authored-by: Takuya UESHIN Signed-off-by: Hyukjin Kwon --- python/pyspark/sql/connect/types.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/python/pyspark/sql/connect/types.py b/python/pyspark/sql/connect/types.py index dfb0fb5303f..be8d62c805f 100644 --- a/python/pyspark/sql/connect/types.py +++ b/python/pyspark/sql/connect/types.py @@ -155,6 +155,7 @@ def pyspark_types_to_proto_types(data_type: DataType) -> pb2.DataType: ret.day_time_interval.start_field = data_type.startField ret.day_time_interval.end_field = data_type.endField elif isinstance(data_type, StructType): +struct = pb2.DataType.Struct() for field in data_type.fields: struct_field = pb2.DataType.StructField() struct_field.name = field.name @@ -162,7 +163,8 @@ def pyspark_types_to_proto_types(data_type: DataType) -> pb2.DataType: struct_field.nullable = field.nullable if field.metadata is not None and len(field.metadata) > 0: struct_field.metadata = json.dumps(field.metadata) -ret.struct.fields.append(struct_field) +struct.fields.append(struct_field) +ret.struct.CopyFrom(struct) elif isinstance(data_type, MapType): ret.map.key_type.CopyFrom(pyspark_types_to_proto_types(data_type.keyType)) ret.map.value_type.CopyFrom(pyspark_types_to_proto_types(data_type.valueType)) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.5 updated: [SPARK-44479][PYTHON][3.5] Fix ArrowStreamPandasUDFSerializer to accept no-column pandas DataFrame
This is an automated email from the ASF dual-hosted git repository. ueshin pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 803b2854a9e [SPARK-44479][PYTHON][3.5] Fix ArrowStreamPandasUDFSerializer to accept no-column pandas DataFrame 803b2854a9e is described below commit 803b2854a9e82aee4e5691c4a9a697856b963377 Author: Takuya UESHIN AuthorDate: Wed Jul 26 17:54:38 2023 -0700 [SPARK-44479][PYTHON][3.5] Fix ArrowStreamPandasUDFSerializer to accept no-column pandas DataFrame ### What changes were proposed in this pull request? Fixes `ArrowStreamPandasUDFSerializer` to accept no-column pandas DataFrame. ```py >>> def _scalar_f(id): ... return pd.DataFrame(index=id) ... >>> scalar_f = pandas_udf(_scalar_f, returnType=StructType()) >>> df = spark.range(3).withColumn("f", scalar_f(col("id"))) >>> df.printSchema() root |-- id: long (nullable = false) |-- f: struct (nullable = true) >>> df.show() +---+---+ | id| f| +---+---+ | 0| {}| | 1| {}| | 2| {}| +---+---+ ``` ### Why are the changes needed? The above query fails with the following error: ```py >>> df.show() org.apache.spark.api.python.PythonException: Traceback (most recent call last): ... ValueError: not enough values to unpack (expected 2, got 0) ``` ### Does this PR introduce _any_ user-facing change? Yes, Pandas UDF will accept no-column pandas DataFrame. ### How was this patch tested? Added related tests. Closes #42176 from ueshin/issues/SPARK-44479/3.5/empty_schema. Authored-by: Takuya UESHIN Signed-off-by: Takuya UESHIN --- python/pyspark/sql/connect/types.py| 4 ++- python/pyspark/sql/pandas/serializers.py | 31 -- .../sql/tests/pandas/test_pandas_udf_scalar.py | 23 +++- python/pyspark/sql/tests/test_udtf.py | 11 ++-- 4 files changed, 45 insertions(+), 24 deletions(-) diff --git a/python/pyspark/sql/connect/types.py b/python/pyspark/sql/connect/types.py index 2a21cdf0675..0db2833d2c1 100644 --- a/python/pyspark/sql/connect/types.py +++ b/python/pyspark/sql/connect/types.py @@ -170,6 +170,7 @@ def pyspark_types_to_proto_types(data_type: DataType) -> pb2.DataType: ret.year_month_interval.start_field = data_type.startField ret.year_month_interval.end_field = data_type.endField elif isinstance(data_type, StructType): +struct = pb2.DataType.Struct() for field in data_type.fields: struct_field = pb2.DataType.StructField() struct_field.name = field.name @@ -177,7 +178,8 @@ def pyspark_types_to_proto_types(data_type: DataType) -> pb2.DataType: struct_field.nullable = field.nullable if field.metadata is not None and len(field.metadata) > 0: struct_field.metadata = json.dumps(field.metadata) -ret.struct.fields.append(struct_field) +struct.fields.append(struct_field) +ret.struct.CopyFrom(struct) elif isinstance(data_type, MapType): ret.map.key_type.CopyFrom(pyspark_types_to_proto_types(data_type.keyType)) ret.map.value_type.CopyFrom(pyspark_types_to_proto_types(data_type.valueType)) diff --git a/python/pyspark/sql/pandas/serializers.py b/python/pyspark/sql/pandas/serializers.py index f22a73cbbef..1d326928e23 100644 --- a/python/pyspark/sql/pandas/serializers.py +++ b/python/pyspark/sql/pandas/serializers.py @@ -385,37 +385,28 @@ class ArrowStreamPandasUDFSerializer(ArrowStreamPandasSerializer): """ import pyarrow as pa -# Input partition and result pandas.DataFrame empty, make empty Arrays with struct -if len(df) == 0 and len(df.columns) == 0: -arrs_names = [ -(pa.array([], type=field.type), field.name) for field in arrow_struct_type -] +if len(df.columns) == 0: +return pa.array([{}] * len(df), arrow_struct_type) # Assign result columns by schema name if user labeled with strings -elif self._assign_cols_by_name and any(isinstance(name, str) for name in df.columns): -arrs_names = [ -( -self._create_array(df[field.name], field.type, arrow_cast=self._arrow_cast), -field.name, -) +if self._assign_cols_by_name and any(isinstance(name, str) for name in df.columns): +struct_arrs = [ +self._create_array(df[field.name], field.type, arrow_cast=self._arrow_cast) for field in arrow_struct_type ] # Assign result columns by position else: -arrs_names = [ +
[spark] branch master updated: [SPARK-44479][PYTHON] Fix ArrowStreamPandasUDFSerializer to accept no-column pandas DataFrame
This is an automated email from the ASF dual-hosted git repository. ueshin pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 02e36dd0f07 [SPARK-44479][PYTHON] Fix ArrowStreamPandasUDFSerializer to accept no-column pandas DataFrame 02e36dd0f07 is described below commit 02e36dd0f077d11a75c6e083489dc1a51c870a0d Author: Takuya UESHIN AuthorDate: Wed Jul 26 17:53:46 2023 -0700 [SPARK-44479][PYTHON] Fix ArrowStreamPandasUDFSerializer to accept no-column pandas DataFrame ### What changes were proposed in this pull request? Fixes `ArrowStreamPandasUDFSerializer` to accept no-column pandas DataFrame. ```py >>> def _scalar_f(id): ... return pd.DataFrame(index=id) ... >>> scalar_f = pandas_udf(_scalar_f, returnType=StructType()) >>> df = spark.range(3).withColumn("f", scalar_f(col("id"))) >>> df.printSchema() root |-- id: long (nullable = false) |-- f: struct (nullable = true) >>> df.show() +---+---+ | id| f| +---+---+ | 0| {}| | 1| {}| | 2| {}| +---+---+ ``` ### Why are the changes needed? The above query fails with the following error: ```py >>> df.show() org.apache.spark.api.python.PythonException: Traceback (most recent call last): ... ValueError: not enough values to unpack (expected 2, got 0) ``` ### Does this PR introduce _any_ user-facing change? Yes, Pandas UDF will accept no-column pandas DataFrame. ### How was this patch tested? Added related tests. Closes #42161 from ueshin/issues/SPARK-44479/empty_schema. Authored-by: Takuya UESHIN Signed-off-by: Takuya UESHIN --- python/pyspark/sql/connect/types.py| 4 ++- python/pyspark/sql/pandas/serializers.py | 31 -- .../sql/tests/pandas/test_pandas_udf_scalar.py | 23 +++- python/pyspark/sql/tests/test_udtf.py | 11 ++-- 4 files changed, 45 insertions(+), 24 deletions(-) diff --git a/python/pyspark/sql/connect/types.py b/python/pyspark/sql/connect/types.py index 2a21cdf0675..0db2833d2c1 100644 --- a/python/pyspark/sql/connect/types.py +++ b/python/pyspark/sql/connect/types.py @@ -170,6 +170,7 @@ def pyspark_types_to_proto_types(data_type: DataType) -> pb2.DataType: ret.year_month_interval.start_field = data_type.startField ret.year_month_interval.end_field = data_type.endField elif isinstance(data_type, StructType): +struct = pb2.DataType.Struct() for field in data_type.fields: struct_field = pb2.DataType.StructField() struct_field.name = field.name @@ -177,7 +178,8 @@ def pyspark_types_to_proto_types(data_type: DataType) -> pb2.DataType: struct_field.nullable = field.nullable if field.metadata is not None and len(field.metadata) > 0: struct_field.metadata = json.dumps(field.metadata) -ret.struct.fields.append(struct_field) +struct.fields.append(struct_field) +ret.struct.CopyFrom(struct) elif isinstance(data_type, MapType): ret.map.key_type.CopyFrom(pyspark_types_to_proto_types(data_type.keyType)) ret.map.value_type.CopyFrom(pyspark_types_to_proto_types(data_type.valueType)) diff --git a/python/pyspark/sql/pandas/serializers.py b/python/pyspark/sql/pandas/serializers.py index 90a24197f64..15de00782c6 100644 --- a/python/pyspark/sql/pandas/serializers.py +++ b/python/pyspark/sql/pandas/serializers.py @@ -385,37 +385,28 @@ class ArrowStreamPandasUDFSerializer(ArrowStreamPandasSerializer): """ import pyarrow as pa -# Input partition and result pandas.DataFrame empty, make empty Arrays with struct -if len(df) == 0 and len(df.columns) == 0: -arrs_names = [ -(pa.array([], type=field.type), field.name) for field in arrow_struct_type -] +if len(df.columns) == 0: +return pa.array([{}] * len(df), arrow_struct_type) # Assign result columns by schema name if user labeled with strings -elif self._assign_cols_by_name and any(isinstance(name, str) for name in df.columns): -arrs_names = [ -( -self._create_array(df[field.name], field.type, arrow_cast=self._arrow_cast), -field.name, -) +if self._assign_cols_by_name and any(isinstance(name, str) for name in df.columns): +struct_arrs = [ +self._create_array(df[field.name], field.type, arrow_cast=self._arrow_cast) for field in arrow_struct_type ] # Assign result columns by position else: -arrs_names = [ +struct_arrs = [
[spark] branch branch-3.4 updated: [SPARK-44553][BUILD][3.4] Ignoring `connect-check-protos` logic in GA testing
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new dff13979dc2 [SPARK-44553][BUILD][3.4] Ignoring `connect-check-protos` logic in GA testing dff13979dc2 is described below commit dff13979dc2a7abd9222b0567914b554d6b8baf4 Author: panbingkun AuthorDate: Thu Jul 27 08:30:41 2023 +0800 [SPARK-44553][BUILD][3.4] Ignoring `connect-check-protos` logic in GA testing ### What changes were proposed in this pull request? The pr aims to ignoring `connect-check-protos` logic in GA testing for branch-3.4. ### Why are the changes needed? Make GA happy. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Closes #42166 from panbingkun/branch-3.4_SPARK-44553. Authored-by: panbingkun Signed-off-by: Ruifeng Zheng --- .github/workflows/build_and_test.yml | 9 - 1 file changed, 9 deletions(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 06f94ea0b25..4f9978b0414 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -589,15 +589,6 @@ jobs: python3.9 -m pip install 'pandas-stubs==1.2.0.53' ipython 'grpcio==1.48.1' 'grpc-stubs==1.24.11' 'googleapis-common-protos-stubs==2.2.0' - name: Python linter run: PYTHON_EXECUTABLE=python3.9 ./dev/lint-python -- name: Install dependencies for Python code generation check - run: | -# See more in "Installation" https://docs.buf.build/installation#tarball -curl -LO https://github.com/bufbuild/buf/releases/download/v1.15.1/buf-Linux-x86_64.tar.gz -mkdir -p $HOME/buf -tar -xvzf buf-Linux-x86_64.tar.gz -C $HOME/buf --strip-components 1 -python3.9 -m pip install 'protobuf==3.19.5' 'mypy-protobuf==3.3.0' -- name: Python code generation check - run: if test -f ./dev/connect-check-protos.py; then PATH=$PATH:$HOME/buf/bin PYTHON_EXECUTABLE=python3.9 ./dev/connect-check-protos.py; fi - name: Install JavaScript linter dependencies run: | apt update - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.4 updated: [SPARK-44544][INFRA][3.4] Deduplicate `run_python_packaging_tests`
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 96296aac4f4 [SPARK-44544][INFRA][3.4] Deduplicate `run_python_packaging_tests` 96296aac4f4 is described below commit 96296aac4f4f168a7f30ff1ccb33c3b52b433ba4 Author: Ruifeng Zheng AuthorDate: Thu Jul 27 09:22:36 2023 +0900 [SPARK-44544][INFRA][3.4] Deduplicate `run_python_packaging_tests` ### What changes were proposed in this pull request? cherry-pick https://github.com/apache/spark/pull/42146 to 3.4 ### Why are the changes needed? can not cherry-pick clearly, so make this PR ### Does this PR introduce _any_ user-facing change? no, infra-only ### How was this patch tested? updated CI Closes #42172 from zhengruifeng/cp_fix. Authored-by: Ruifeng Zheng Signed-off-by: Hyukjin Kwon --- .github/workflows/build_and_test.yml | 16 ++-- dev/run-tests.py | 2 +- 2 files changed, 15 insertions(+), 3 deletions(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 657fec27d52..06f94ea0b25 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -192,6 +192,7 @@ jobs: HIVE_PROFILE: ${{ matrix.hive }} GITHUB_PREV_SHA: ${{ github.event.before }} SPARK_LOCAL_IP: localhost + SKIP_PACKAGING: true steps: - name: Checkout Spark repository uses: actions/checkout@v3 @@ -328,6 +329,8 @@ jobs: java: - ${{ inputs.java }} modules: + - >- +pyspark-errors - >- pyspark-sql, pyspark-mllib, pyspark-resource - >- @@ -337,7 +340,7 @@ jobs: - >- pyspark-pandas-slow - >- -pyspark-connect, pyspark-errors +pyspark-connect env: MODULES_TO_TEST: ${{ matrix.modules }} HADOOP_PROFILE: ${{ inputs.hadoop }} @@ -346,6 +349,7 @@ jobs: SPARK_LOCAL_IP: localhost SKIP_UNIDOC: true SKIP_MIMA: true + SKIP_PACKAGING: true METASPACE_SIZE: 1g steps: - name: Checkout Spark repository @@ -394,14 +398,20 @@ jobs: python3.9 -m pip list pypy3 -m pip list - name: Install Conda for pip packaging test + if: ${{ matrix.modules == 'pyspark-errors' }} run: | curl -s https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh > miniconda.sh bash miniconda.sh -b -p $HOME/miniconda # Run the tests. - name: Run tests env: ${{ fromJSON(inputs.envs) }} + shell: 'script -q -e -c "bash {0}"' run: | -export PATH=$PATH:$HOME/miniconda/bin +if [[ "$MODULES_TO_TEST" == "pyspark-errors" ]]; then + export PATH=$PATH:$HOME/miniconda/bin + export SKIP_PACKAGING=false + echo "Python Packaging Tests Enabled!" +fi ./dev/run-tests --parallelism 1 --modules "$MODULES_TO_TEST" - name: Upload coverage to Codecov if: fromJSON(inputs.envs).PYSPARK_CODECOV == 'true' @@ -437,6 +447,7 @@ jobs: GITHUB_PREV_SHA: ${{ github.event.before }} SPARK_LOCAL_IP: localhost SKIP_MIMA: true + SKIP_PACKAGING: true steps: - name: Checkout Spark repository uses: actions/checkout@v3 @@ -850,6 +861,7 @@ jobs: SPARK_LOCAL_IP: localhost ORACLE_DOCKER_IMAGE_NAME: gvenzl/oracle-xe:21.3.0 SKIP_MIMA: true + SKIP_PACKAGING: true steps: - name: Checkout Spark repository uses: actions/checkout@v3 diff --git a/dev/run-tests.py b/dev/run-tests.py index 92768c96905..dab3dcf7fe6 100755 --- a/dev/run-tests.py +++ b/dev/run-tests.py @@ -396,7 +396,7 @@ def run_python_tests(test_modules, parallelism, with_coverage=False): def run_python_packaging_tests(): -if not os.environ.get("SPARK_JENKINS"): +if not os.environ.get("SPARK_JENKINS") and os.environ.get("SKIP_PACKAGING", "false") != "true": set_title_and_block("Running PySpark packaging tests", "BLOCK_PYSPARK_PIP_TESTS") command = [os.path.join(SPARK_HOME, "dev", "run-pip-tests")] run_cmd(command) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.5 updated: [SPARK-44457][CONNECT][TESTS] Add `truncatedTo(ChronoUnit.MICROS)` to make `ArrowEncoderSuite` in Java 17 daily test GA task pass
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 17fc3632f23 [SPARK-44457][CONNECT][TESTS] Add `truncatedTo(ChronoUnit.MICROS)` to make `ArrowEncoderSuite` in Java 17 daily test GA task pass 17fc3632f23 is described below commit 17fc3632f2344101f8318457e3f9d5f133913997 Author: yangjie01 AuthorDate: Wed Jul 26 19:17:40 2023 -0500 [SPARK-44457][CONNECT][TESTS] Add `truncatedTo(ChronoUnit.MICROS)` to make `ArrowEncoderSuite` in Java 17 daily test GA task pass ### What changes were proposed in this pull request? Similar to SPARK-42770 | https://github.com/apache/spark/pull/40395, this pr call `truncatedTo(ChronoUnit.MICROS)` on `Instant.now()` and `LocalDateTime.now()` to ensure microsecond accuracy is used in any environment. ### Why are the changes needed? Make Java 17 daily test GA task run successfully. The Java 17 daily test GA task failed now: https://github.com/apache/spark/actions/runs/5570003581/jobs/10173767006 ``` [info] - nullable fields *** FAILED *** (169 milliseconds) [info] NullableData(null, JANUARY, E1, null, 1.00, 2.00, null, 4, PT0S, null, 2023-07-16, 2023-07-16, null, 2023-07-16T23:01:54.059339Z, 2023-07-16T23:01:54.059359) did not equal NullableData(null, JANUARY, E1, null, 1.00, 2.00, null, 4, PT0S, null, 2023-07-16, 2023-07-16, null, 2023-07-16T23:01:54.059339538Z, 2023-07-16T23:01:54.059359638) (ArrowEncoderSuite.scala:194) [info] Analysis: [info] NullableData(instant: 2023-07-16T23:01:54.059339Z -> 2023-07-16T23:01:54.059339538Z, localDateTime: 2023-07-16T23:01:54.059359 -> 2023-07-16T23:01:54.059359638) [info] org.scalatest.exceptions.TestFailedException: ... [info] - lenient field serialization - timestamp/instant *** FAILED *** (26 milliseconds) [info] 2023-07-16T23:01:55.112838Z did not equal 2023-07-16T23:01:55.112838568Z (ArrowEncoderSuite.scala:194) [info] org.scalatest.exceptions.TestFailedException: ... ``` ### Does this PR introduce _any_ user-facing change? No, just for test ### How was this patch tested? - Pass GitHub Action - Git Hub Action test with Java 17 passed: https://github.com/LuciferYang/spark/actions/runs/5647253889/job/15297009685 https://github.com/apache/spark/assets/1475305/27a4350a-9475-45e3-b39f-b0b1e8f14e92;> Closes #42039 from LuciferYang/ArrowEncoderSuite-Java17. Authored-by: yangjie01 Signed-off-by: Sean Owen (cherry picked from commit da359259b138864a52ea98a4e19c55e593a5a8fa) Signed-off-by: Sean Owen --- .../spark/sql/connect/client/arrow/ArrowEncoderSuite.scala| 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/arrow/ArrowEncoderSuite.scala b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/arrow/ArrowEncoderSuite.scala index 3f8ac1cb8d1..5c035a613fe 100644 --- a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/arrow/ArrowEncoderSuite.scala +++ b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/arrow/ArrowEncoderSuite.scala @@ -18,6 +18,7 @@ package org.apache.spark.sql.connect.client.arrow import java.math.BigInteger import java.time.{Duration, Period, ZoneOffset} +import java.time.temporal.ChronoUnit import java.util import java.util.{Collections, Objects} @@ -361,8 +362,10 @@ class ArrowEncoderSuite extends ConnectFunSuite with BeforeAndAfterAll { test("nullable fields") { val encoder = ScalaReflection.encoderFor[NullableData] -val instant = java.time.Instant.now() -val now = java.time.LocalDateTime.now() +// SPARK-44457: Similar to SPARK-42770, calling `truncatedTo(ChronoUnit.MICROS)` +// on `Instant.now()` and `LocalDateTime.now()` to ensure microsecond accuracy is used. +val instant = java.time.Instant.now().truncatedTo(ChronoUnit.MICROS) +val now = java.time.LocalDateTime.now().truncatedTo(ChronoUnit.MICROS) val today = java.time.LocalDate.now() roundTripAndCheckIdentical(encoder) { () => val maybeNull = MaybeNull(3) @@ -602,7 +605,9 @@ class ArrowEncoderSuite extends ConnectFunSuite with BeforeAndAfterAll { } test("lenient field serialization - timestamp/instant") { -val base = java.time.Instant.now() +// SPARK-44457: Similar to SPARK-42770, calling `truncatedTo(ChronoUnit.MICROS)` +// on `Instant.now()` to ensure microsecond accuracy is used. +val base = java.time.Instant.now().truncatedTo(ChronoUnit.MICROS) val instants = () => Iterator.tabulate(10)(i => base.plusSeconds(i * i
[spark] branch master updated: [SPARK-44457][CONNECT][TESTS] Add `truncatedTo(ChronoUnit.MICROS)` to make `ArrowEncoderSuite` in Java 17 daily test GA task pass
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new da359259b13 [SPARK-44457][CONNECT][TESTS] Add `truncatedTo(ChronoUnit.MICROS)` to make `ArrowEncoderSuite` in Java 17 daily test GA task pass da359259b13 is described below commit da359259b138864a52ea98a4e19c55e593a5a8fa Author: yangjie01 AuthorDate: Wed Jul 26 19:17:40 2023 -0500 [SPARK-44457][CONNECT][TESTS] Add `truncatedTo(ChronoUnit.MICROS)` to make `ArrowEncoderSuite` in Java 17 daily test GA task pass ### What changes were proposed in this pull request? Similar to SPARK-42770 | https://github.com/apache/spark/pull/40395, this pr call `truncatedTo(ChronoUnit.MICROS)` on `Instant.now()` and `LocalDateTime.now()` to ensure microsecond accuracy is used in any environment. ### Why are the changes needed? Make Java 17 daily test GA task run successfully. The Java 17 daily test GA task failed now: https://github.com/apache/spark/actions/runs/5570003581/jobs/10173767006 ``` [info] - nullable fields *** FAILED *** (169 milliseconds) [info] NullableData(null, JANUARY, E1, null, 1.00, 2.00, null, 4, PT0S, null, 2023-07-16, 2023-07-16, null, 2023-07-16T23:01:54.059339Z, 2023-07-16T23:01:54.059359) did not equal NullableData(null, JANUARY, E1, null, 1.00, 2.00, null, 4, PT0S, null, 2023-07-16, 2023-07-16, null, 2023-07-16T23:01:54.059339538Z, 2023-07-16T23:01:54.059359638) (ArrowEncoderSuite.scala:194) [info] Analysis: [info] NullableData(instant: 2023-07-16T23:01:54.059339Z -> 2023-07-16T23:01:54.059339538Z, localDateTime: 2023-07-16T23:01:54.059359 -> 2023-07-16T23:01:54.059359638) [info] org.scalatest.exceptions.TestFailedException: ... [info] - lenient field serialization - timestamp/instant *** FAILED *** (26 milliseconds) [info] 2023-07-16T23:01:55.112838Z did not equal 2023-07-16T23:01:55.112838568Z (ArrowEncoderSuite.scala:194) [info] org.scalatest.exceptions.TestFailedException: ... ``` ### Does this PR introduce _any_ user-facing change? No, just for test ### How was this patch tested? - Pass GitHub Action - Git Hub Action test with Java 17 passed: https://github.com/LuciferYang/spark/actions/runs/5647253889/job/15297009685 https://github.com/apache/spark/assets/1475305/27a4350a-9475-45e3-b39f-b0b1e8f14e92;> Closes #42039 from LuciferYang/ArrowEncoderSuite-Java17. Authored-by: yangjie01 Signed-off-by: Sean Owen --- .../spark/sql/connect/client/arrow/ArrowEncoderSuite.scala| 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/arrow/ArrowEncoderSuite.scala b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/arrow/ArrowEncoderSuite.scala index 3f8ac1cb8d1..5c035a613fe 100644 --- a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/arrow/ArrowEncoderSuite.scala +++ b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/arrow/ArrowEncoderSuite.scala @@ -18,6 +18,7 @@ package org.apache.spark.sql.connect.client.arrow import java.math.BigInteger import java.time.{Duration, Period, ZoneOffset} +import java.time.temporal.ChronoUnit import java.util import java.util.{Collections, Objects} @@ -361,8 +362,10 @@ class ArrowEncoderSuite extends ConnectFunSuite with BeforeAndAfterAll { test("nullable fields") { val encoder = ScalaReflection.encoderFor[NullableData] -val instant = java.time.Instant.now() -val now = java.time.LocalDateTime.now() +// SPARK-44457: Similar to SPARK-42770, calling `truncatedTo(ChronoUnit.MICROS)` +// on `Instant.now()` and `LocalDateTime.now()` to ensure microsecond accuracy is used. +val instant = java.time.Instant.now().truncatedTo(ChronoUnit.MICROS) +val now = java.time.LocalDateTime.now().truncatedTo(ChronoUnit.MICROS) val today = java.time.LocalDate.now() roundTripAndCheckIdentical(encoder) { () => val maybeNull = MaybeNull(3) @@ -602,7 +605,9 @@ class ArrowEncoderSuite extends ConnectFunSuite with BeforeAndAfterAll { } test("lenient field serialization - timestamp/instant") { -val base = java.time.Instant.now() +// SPARK-44457: Similar to SPARK-42770, calling `truncatedTo(ChronoUnit.MICROS)` +// on `Instant.now()` to ensure microsecond accuracy is used. +val base = java.time.Instant.now().truncatedTo(ChronoUnit.MICROS) val instants = () => Iterator.tabulate(10)(i => base.plusSeconds(i * i * 60)) val timestamps = () => instants().map(java.sql.Timestamp.from) val combo = () => instants()
[spark] branch master updated: [SPARK-44522][BUILD] Upgrade `scala-xml` to 2.2.0
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 43b753a3530 [SPARK-44522][BUILD] Upgrade `scala-xml` to 2.2.0 43b753a3530 is described below commit 43b753a3530bcfdad415765e1348136d70d8125d Author: yangjie01 AuthorDate: Wed Jul 26 19:11:00 2023 -0500 [SPARK-44522][BUILD] Upgrade `scala-xml` to 2.2.0 ### What changes were proposed in this pull request? This pr aims to upgrade `scala-xml` from 2.1.0 to 2.2.0. ### Why are the changes needed? The new version bring some bug fix like: - https://github.com/scala/scala-xml/pull/651 - https://github.com/scala/scala-xml/pull/677 The full release notes as follows: - https://github.com/scala/scala-xml/releases/tag/v2.2.0 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass GitHub Actions - Checked Scala 2.13, all Scala test passed: https://github.com/LuciferYang/spark/runs/15278359785 Closes #42119 from LuciferYang/scala-xml-220. Authored-by: yangjie01 Signed-off-by: Sean Owen --- dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 +- pom.xml | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3 index 168b0b34787..3b54ef43f6a 100644 --- a/dev/deps/spark-deps-hadoop-3-hive-2.3 +++ b/dev/deps/spark-deps-hadoop-3-hive-2.3 @@ -229,7 +229,7 @@ scala-compiler/2.12.18//scala-compiler-2.12.18.jar scala-library/2.12.18//scala-library-2.12.18.jar scala-parser-combinators_2.12/2.3.0//scala-parser-combinators_2.12-2.3.0.jar scala-reflect/2.12.18//scala-reflect-2.12.18.jar -scala-xml_2.12/2.1.0//scala-xml_2.12-2.1.0.jar +scala-xml_2.12/2.2.0//scala-xml_2.12-2.2.0.jar shims/0.9.45//shims-0.9.45.jar slf4j-api/2.0.7//slf4j-api-2.0.7.jar snakeyaml-engine/2.6//snakeyaml-engine-2.6.jar diff --git a/pom.xml b/pom.xml index 5711dba04b9..2e9d1d2d8f3 100644 --- a/pom.xml +++ b/pom.xml @@ -1089,7 +1089,7 @@ org.scala-lang.modules scala-xml_${scala.binary.version} -2.1.0 +2.2.0 org.scala-lang - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.5 updated: [SPARK-44528][CONNECT] Support proper usage of hasattr() for Connect dataframe
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 43ac3db1e27 [SPARK-44528][CONNECT] Support proper usage of hasattr() for Connect dataframe 43ac3db1e27 is described below commit 43ac3db1e27a4169183a90b54b6a873f0d26a7ba Author: Martin Grund AuthorDate: Thu Jul 27 08:53:45 2023 +0900 [SPARK-44528][CONNECT] Support proper usage of hasattr() for Connect dataframe ### What changes were proposed in this pull request? Currently Connect does not allow the proper usage of Python's `hasattr()` to identify if an attribute is defined or not. This patch fixes that bug (it's working in regular PySpark). ### Why are the changes needed? Bugfix ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #42132 from grundprinzip/SPARK-44528. Lead-authored-by: Martin Grund Co-authored-by: Martin Grund Signed-off-by: Hyukjin Kwon (cherry picked from commit 91e97f92fe76f9718cd16af0c761d5530bdb37ee) Signed-off-by: Hyukjin Kwon --- python/pyspark/sql/connect/dataframe.py| 8 +++ .../sql/tests/connect/test_connect_basic.py| 17 +++-- python/pyspark/testing/connectutils.py | 28 ++ 3 files changed, 46 insertions(+), 7 deletions(-) diff --git a/python/pyspark/sql/connect/dataframe.py b/python/pyspark/sql/connect/dataframe.py index 6429645f0e0..12e424b5ef1 100644 --- a/python/pyspark/sql/connect/dataframe.py +++ b/python/pyspark/sql/connect/dataframe.py @@ -1584,8 +1584,16 @@ class DataFrame: error_class="NOT_IMPLEMENTED", message_parameters={"feature": f"{name}()"}, ) + +if name not in self.columns: +raise AttributeError( +"'%s' object has no attribute '%s'" % (self.__class__.__name__, name) +) + return self[name] +__getattr__.__doc__ = PySparkDataFrame.__getattr__.__doc__ + @overload def __getitem__(self, item: Union[int, str]) -> Column: ... diff --git a/python/pyspark/sql/tests/connect/test_connect_basic.py b/python/pyspark/sql/tests/connect/test_connect_basic.py index 5259ea6b5f5..065f1585a9f 100644 --- a/python/pyspark/sql/tests/connect/test_connect_basic.py +++ b/python/pyspark/sql/tests/connect/test_connect_basic.py @@ -157,6 +157,19 @@ class SparkConnectSQLTestCase(ReusedConnectTestCase, SQLTestUtils, PandasOnSpark class SparkConnectBasicTests(SparkConnectSQLTestCase): +def test_df_getattr_behavior(self): +cdf = self.connect.range(10) +sdf = self.spark.range(10) + +sdf._simple_extension = 10 +cdf._simple_extension = 10 + +self.assertEqual(sdf._simple_extension, cdf._simple_extension) +self.assertEqual(type(sdf._simple_extension), type(cdf._simple_extension)) + +self.assertTrue(hasattr(cdf, "_simple_extension")) +self.assertFalse(hasattr(cdf, "_simple_extension_does_not_exsit")) + def test_df_get_item(self): # SPARK-41779: test __getitem__ @@ -1296,8 +1309,8 @@ class SparkConnectBasicTests(SparkConnectSQLTestCase): sdf.drop("a", "x").toPandas(), ) self.assert_eq( -cdf.drop(cdf.a, cdf.x).toPandas(), -sdf.drop("a", "x").toPandas(), +cdf.drop(cdf.a, "x").toPandas(), +sdf.drop(sdf.a, "x").toPandas(), ) def test_subquery_alias(self) -> None: diff --git a/python/pyspark/testing/connectutils.py b/python/pyspark/testing/connectutils.py index 1b3ac10fce8..b6145d0a006 100644 --- a/python/pyspark/testing/connectutils.py +++ b/python/pyspark/testing/connectutils.py @@ -16,6 +16,7 @@ # import shutil import tempfile +import types import typing import os import functools @@ -67,7 +68,7 @@ should_test_connect: str = typing.cast(str, connect_requirement_message is None) if should_test_connect: from pyspark.sql.connect.dataframe import DataFrame -from pyspark.sql.connect.plan import Read, Range, SQL +from pyspark.sql.connect.plan import Read, Range, SQL, LogicalPlan from pyspark.sql.connect.session import SparkSession @@ -88,16 +89,33 @@ class MockRemoteSession: return functools.partial(self.hooks[item]) +class MockDF(DataFrame): +"""Helper class that must only be used for the mock plan tests.""" + +def __init__(self, session: SparkSession, plan: LogicalPlan): +super().__init__(session) +self._plan = plan + +def __getattr__(self, name): +"""All attributes are resolved to columns, because none really exist in the +mocked DataFrame.""" +return self[name] + + @unittest.skipIf(not should_test_connect,
[spark] branch master updated: [SPARK-44528][CONNECT] Support proper usage of hasattr() for Connect dataframe
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 91e97f92fe7 [SPARK-44528][CONNECT] Support proper usage of hasattr() for Connect dataframe 91e97f92fe7 is described below commit 91e97f92fe76f9718cd16af0c761d5530bdb37ee Author: Martin Grund AuthorDate: Thu Jul 27 08:53:45 2023 +0900 [SPARK-44528][CONNECT] Support proper usage of hasattr() for Connect dataframe ### What changes were proposed in this pull request? Currently Connect does not allow the proper usage of Python's `hasattr()` to identify if an attribute is defined or not. This patch fixes that bug (it's working in regular PySpark). ### Why are the changes needed? Bugfix ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #42132 from grundprinzip/SPARK-44528. Lead-authored-by: Martin Grund Co-authored-by: Martin Grund Signed-off-by: Hyukjin Kwon --- python/pyspark/sql/connect/dataframe.py| 8 +++ .../sql/tests/connect/test_connect_basic.py| 17 +++-- python/pyspark/testing/connectutils.py | 28 ++ 3 files changed, 46 insertions(+), 7 deletions(-) diff --git a/python/pyspark/sql/connect/dataframe.py b/python/pyspark/sql/connect/dataframe.py index 6429645f0e0..12e424b5ef1 100644 --- a/python/pyspark/sql/connect/dataframe.py +++ b/python/pyspark/sql/connect/dataframe.py @@ -1584,8 +1584,16 @@ class DataFrame: error_class="NOT_IMPLEMENTED", message_parameters={"feature": f"{name}()"}, ) + +if name not in self.columns: +raise AttributeError( +"'%s' object has no attribute '%s'" % (self.__class__.__name__, name) +) + return self[name] +__getattr__.__doc__ = PySparkDataFrame.__getattr__.__doc__ + @overload def __getitem__(self, item: Union[int, str]) -> Column: ... diff --git a/python/pyspark/sql/tests/connect/test_connect_basic.py b/python/pyspark/sql/tests/connect/test_connect_basic.py index 5259ea6b5f5..065f1585a9f 100644 --- a/python/pyspark/sql/tests/connect/test_connect_basic.py +++ b/python/pyspark/sql/tests/connect/test_connect_basic.py @@ -157,6 +157,19 @@ class SparkConnectSQLTestCase(ReusedConnectTestCase, SQLTestUtils, PandasOnSpark class SparkConnectBasicTests(SparkConnectSQLTestCase): +def test_df_getattr_behavior(self): +cdf = self.connect.range(10) +sdf = self.spark.range(10) + +sdf._simple_extension = 10 +cdf._simple_extension = 10 + +self.assertEqual(sdf._simple_extension, cdf._simple_extension) +self.assertEqual(type(sdf._simple_extension), type(cdf._simple_extension)) + +self.assertTrue(hasattr(cdf, "_simple_extension")) +self.assertFalse(hasattr(cdf, "_simple_extension_does_not_exsit")) + def test_df_get_item(self): # SPARK-41779: test __getitem__ @@ -1296,8 +1309,8 @@ class SparkConnectBasicTests(SparkConnectSQLTestCase): sdf.drop("a", "x").toPandas(), ) self.assert_eq( -cdf.drop(cdf.a, cdf.x).toPandas(), -sdf.drop("a", "x").toPandas(), +cdf.drop(cdf.a, "x").toPandas(), +sdf.drop(sdf.a, "x").toPandas(), ) def test_subquery_alias(self) -> None: diff --git a/python/pyspark/testing/connectutils.py b/python/pyspark/testing/connectutils.py index 1b3ac10fce8..b6145d0a006 100644 --- a/python/pyspark/testing/connectutils.py +++ b/python/pyspark/testing/connectutils.py @@ -16,6 +16,7 @@ # import shutil import tempfile +import types import typing import os import functools @@ -67,7 +68,7 @@ should_test_connect: str = typing.cast(str, connect_requirement_message is None) if should_test_connect: from pyspark.sql.connect.dataframe import DataFrame -from pyspark.sql.connect.plan import Read, Range, SQL +from pyspark.sql.connect.plan import Read, Range, SQL, LogicalPlan from pyspark.sql.connect.session import SparkSession @@ -88,16 +89,33 @@ class MockRemoteSession: return functools.partial(self.hooks[item]) +class MockDF(DataFrame): +"""Helper class that must only be used for the mock plan tests.""" + +def __init__(self, session: SparkSession, plan: LogicalPlan): +super().__init__(session) +self._plan = plan + +def __getattr__(self, name): +"""All attributes are resolved to columns, because none really exist in the +mocked DataFrame.""" +return self[name] + + @unittest.skipIf(not should_test_connect, connect_requirement_message) class PlanOnlyTestFixture(unittest.TestCase, PySparkErrorTestUtils): @classmethod def
[spark] branch master updated: [SPARK-44537][BUILD] Upgrade kubernetes-client to 6.8.0
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 6b6216c01ef [SPARK-44537][BUILD] Upgrade kubernetes-client to 6.8.0 6b6216c01ef is described below commit 6b6216c01ef333aef43c4b78831078576d7834fb Author: panbingkun AuthorDate: Wed Jul 26 09:33:43 2023 -0700 [SPARK-44537][BUILD] Upgrade kubernetes-client to 6.8.0 ### What changes were proposed in this pull request? The pr aims to upgrade kubernetes-client from 6.7.2 to 6.8.0. ### Why are the changes needed? - The newest version brings some bug fixed & improvment, eg: Fix https://github.com/fabric8io/kubernetes-client/issues/5221: Empty kube config file causes NPE Fix https://github.com/fabric8io/kubernetes-client/issues/5281: Ensure the KubernetesCrudDispatcher's backing map is accessed w/lock Fix https://github.com/fabric8io/kubernetes-client/issues/5298: Prevent requests needing authentication from causing a 403 response Fix https://github.com/fabric8io/kubernetes-client/issues/5233: Generalized SchemaSwap to allow for cycle expansion Fix https://github.com/fabric8io/kubernetes-client/issues/5262: all built-in collections will omit empty in their serialized form. - The full release notes: https://github.com/fabric8io/kubernetes-client/releases/ ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Closes #42142 from panbingkun/SPARK-44537. Authored-by: panbingkun Signed-off-by: Dongjoon Hyun --- dev/deps/spark-deps-hadoop-3-hive-2.3 | 50 +-- pom.xml | 2 +- 2 files changed, 26 insertions(+), 26 deletions(-) diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3 index 251c8174fcc..168b0b34787 100644 --- a/dev/deps/spark-deps-hadoop-3-hive-2.3 +++ b/dev/deps/spark-deps-hadoop-3-hive-2.3 @@ -141,31 +141,31 @@ jsr305/3.0.0//jsr305-3.0.0.jar jta/1.1//jta-1.1.jar jul-to-slf4j/2.0.7//jul-to-slf4j-2.0.7.jar kryo-shaded/4.0.2//kryo-shaded-4.0.2.jar -kubernetes-client-api/6.7.2//kubernetes-client-api-6.7.2.jar -kubernetes-client/6.7.2//kubernetes-client-6.7.2.jar -kubernetes-httpclient-okhttp/6.7.2//kubernetes-httpclient-okhttp-6.7.2.jar -kubernetes-model-admissionregistration/6.7.2//kubernetes-model-admissionregistration-6.7.2.jar -kubernetes-model-apiextensions/6.7.2//kubernetes-model-apiextensions-6.7.2.jar -kubernetes-model-apps/6.7.2//kubernetes-model-apps-6.7.2.jar -kubernetes-model-autoscaling/6.7.2//kubernetes-model-autoscaling-6.7.2.jar -kubernetes-model-batch/6.7.2//kubernetes-model-batch-6.7.2.jar -kubernetes-model-certificates/6.7.2//kubernetes-model-certificates-6.7.2.jar -kubernetes-model-common/6.7.2//kubernetes-model-common-6.7.2.jar -kubernetes-model-coordination/6.7.2//kubernetes-model-coordination-6.7.2.jar -kubernetes-model-core/6.7.2//kubernetes-model-core-6.7.2.jar -kubernetes-model-discovery/6.7.2//kubernetes-model-discovery-6.7.2.jar -kubernetes-model-events/6.7.2//kubernetes-model-events-6.7.2.jar -kubernetes-model-extensions/6.7.2//kubernetes-model-extensions-6.7.2.jar -kubernetes-model-flowcontrol/6.7.2//kubernetes-model-flowcontrol-6.7.2.jar -kubernetes-model-gatewayapi/6.7.2//kubernetes-model-gatewayapi-6.7.2.jar -kubernetes-model-metrics/6.7.2//kubernetes-model-metrics-6.7.2.jar -kubernetes-model-networking/6.7.2//kubernetes-model-networking-6.7.2.jar -kubernetes-model-node/6.7.2//kubernetes-model-node-6.7.2.jar -kubernetes-model-policy/6.7.2//kubernetes-model-policy-6.7.2.jar -kubernetes-model-rbac/6.7.2//kubernetes-model-rbac-6.7.2.jar -kubernetes-model-resource/6.7.2//kubernetes-model-resource-6.7.2.jar -kubernetes-model-scheduling/6.7.2//kubernetes-model-scheduling-6.7.2.jar -kubernetes-model-storageclass/6.7.2//kubernetes-model-storageclass-6.7.2.jar +kubernetes-client-api/6.8.0//kubernetes-client-api-6.8.0.jar +kubernetes-client/6.8.0//kubernetes-client-6.8.0.jar +kubernetes-httpclient-okhttp/6.8.0//kubernetes-httpclient-okhttp-6.8.0.jar +kubernetes-model-admissionregistration/6.8.0//kubernetes-model-admissionregistration-6.8.0.jar +kubernetes-model-apiextensions/6.8.0//kubernetes-model-apiextensions-6.8.0.jar +kubernetes-model-apps/6.8.0//kubernetes-model-apps-6.8.0.jar +kubernetes-model-autoscaling/6.8.0//kubernetes-model-autoscaling-6.8.0.jar +kubernetes-model-batch/6.8.0//kubernetes-model-batch-6.8.0.jar +kubernetes-model-certificates/6.8.0//kubernetes-model-certificates-6.8.0.jar +kubernetes-model-common/6.8.0//kubernetes-model-common-6.8.0.jar +kubernetes-model-coordination/6.8.0//kubernetes-model-coordination-6.8.0.jar +kubernetes-model-core/6.8.0//kubernetes-model-core-6.8.0.jar +kubernetes-model-discovery/6.8.0//kubernetes-model-discovery-6.8.0.jar
[spark] branch branch-3.5 updated: [SPARK-44154][SQL][FOLLOWUP] `BitmapCount` and `BitmapOrAgg` should use `DataTypeMismatch` to indicate unexpected input data type
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 5cb7d2b81af [SPARK-44154][SQL][FOLLOWUP] `BitmapCount` and `BitmapOrAgg` should use `DataTypeMismatch` to indicate unexpected input data type 5cb7d2b81af is described below commit 5cb7d2b81af5e8c2714cccfb818f9ec9d6c54da3 Author: Bruce Robbins AuthorDate: Wed Jul 26 19:31:27 2023 +0800 [SPARK-44154][SQL][FOLLOWUP] `BitmapCount` and `BitmapOrAgg` should use `DataTypeMismatch` to indicate unexpected input data type ### What changes were proposed in this pull request? Change `BitmapCount` and `BitmapOrAgg` to use `DataTypeMismatch` rather than `TypeCheckResult.TypeCheckFailure` to indicate incorrect input types. ### Why are the changes needed? It appears `TypeCheckResult.TypeCheckFailure` has been deprecated: No expressions except for the recently added `BitmapCount` and `BitmapOrAgg` are using it. ### Does this PR introduce _any_ user-facing change? This PR changes an error message for two expressions that are not yet in any released version of Spark. Before PR: ``` spark-sql (default)> select bitmap_count(12); [DATATYPE_MISMATCH.TYPE_CHECK_FAILURE_WITH_HINT] Cannot resolve "bitmap_count(12)" due to data type mismatch: Bitmap must be a BinaryType.; line 1 pos 7; 'Project [unresolvedalias(bitmap_count(12), None)] +- OneRowRelation spark-sql (default)> select bitmap_or_agg(12); [DATATYPE_MISMATCH.TYPE_CHECK_FAILURE_WITH_HINT] Cannot resolve "bitmap_or_agg(12)" due to data type mismatch: Bitmap must be a BinaryType.; line 1 pos 7; 'Aggregate [unresolvedalias(bitmap_or_agg(12, 0, 0), None)] +- OneRowRelation ``` After PR: ``` spark-sql (default)> select bitmap_count(12); [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "bitmap_count(12)" due to data type mismatch: Parameter 0 requires the "BINARY" type, however "12" has the type "INT".; line 1 pos 7; 'Project [unresolvedalias(bitmap_count(12), None)] +- OneRowRelation spark-sql (default)> select bitmap_or_agg(12); [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "bitmap_or_agg(12)" due to data type mismatch: Parameter 0 requires the "BINARY" type, however "12" has the type "INT".; line 1 pos 7; 'Aggregate [unresolvedalias(bitmap_or_agg(12, 0, 0), None)] +- OneRowRelation ``` ### How was this patch tested? New unit tests. Closes #42139 from bersprockets/bitmap_type_check. Authored-by: Bruce Robbins Signed-off-by: Wenchen Fan (cherry picked from commit d0d4aab437843ce5adf5900d2d6088e79323f8d5) Signed-off-by: Wenchen Fan --- .../catalyst/expressions/bitmapExpressions.scala | 26 +++-- .../spark/sql/BitmapExpressionsQuerySuite.scala| 44 ++ 2 files changed, 66 insertions(+), 4 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/bitmapExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/bitmapExpressions.scala index 2adfddb9383..5c7ef5cde5b 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/bitmapExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/bitmapExpressions.scala @@ -19,10 +19,12 @@ package org.apache.spark.sql.catalyst.expressions import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.catalyst.analysis.TypeCheckResult +import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{DataTypeMismatch, TypeCheckSuccess} import org.apache.spark.sql.catalyst.expressions.aggregate.ImperativeAggregate import org.apache.spark.sql.catalyst.expressions.objects.StaticInvoke import org.apache.spark.sql.catalyst.trees.UnaryLike import org.apache.spark.sql.catalyst.types.DataTypeUtils +import org.apache.spark.sql.catalyst.util.TypeUtils._ import org.apache.spark.sql.errors.QueryExecutionErrors import org.apache.spark.sql.types.{AbstractDataType, BinaryType, DataType, LongType, StructType} @@ -111,9 +113,17 @@ case class BitmapCount(child: Expression) override def checkInputDataTypes(): TypeCheckResult = { if (child.dataType != BinaryType) { - TypeCheckResult.TypeCheckFailure("Bitmap must be a BinaryType") + DataTypeMismatch( +errorSubClass = "UNEXPECTED_INPUT_TYPE", +messageParameters = Map( + "paramIndex" -> "0", + "requiredType" -> toSQLType(BinaryType), + "inputSql" -> toSQLExpr(child), + "inputType" -> toSQLType(child.dataType) +) + ) } else { - TypeCheckResult.TypeCheckSuccess + TypeCheckSuccess } } @@ -248,9 +258,17 @@ case class
[spark] branch master updated: [SPARK-44154][SQL][FOLLOWUP] `BitmapCount` and `BitmapOrAgg` should use `DataTypeMismatch` to indicate unexpected input data type
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d0d4aab4378 [SPARK-44154][SQL][FOLLOWUP] `BitmapCount` and `BitmapOrAgg` should use `DataTypeMismatch` to indicate unexpected input data type d0d4aab4378 is described below commit d0d4aab437843ce5adf5900d2d6088e79323f8d5 Author: Bruce Robbins AuthorDate: Wed Jul 26 19:31:27 2023 +0800 [SPARK-44154][SQL][FOLLOWUP] `BitmapCount` and `BitmapOrAgg` should use `DataTypeMismatch` to indicate unexpected input data type ### What changes were proposed in this pull request? Change `BitmapCount` and `BitmapOrAgg` to use `DataTypeMismatch` rather than `TypeCheckResult.TypeCheckFailure` to indicate incorrect input types. ### Why are the changes needed? It appears `TypeCheckResult.TypeCheckFailure` has been deprecated: No expressions except for the recently added `BitmapCount` and `BitmapOrAgg` are using it. ### Does this PR introduce _any_ user-facing change? This PR changes an error message for two expressions that are not yet in any released version of Spark. Before PR: ``` spark-sql (default)> select bitmap_count(12); [DATATYPE_MISMATCH.TYPE_CHECK_FAILURE_WITH_HINT] Cannot resolve "bitmap_count(12)" due to data type mismatch: Bitmap must be a BinaryType.; line 1 pos 7; 'Project [unresolvedalias(bitmap_count(12), None)] +- OneRowRelation spark-sql (default)> select bitmap_or_agg(12); [DATATYPE_MISMATCH.TYPE_CHECK_FAILURE_WITH_HINT] Cannot resolve "bitmap_or_agg(12)" due to data type mismatch: Bitmap must be a BinaryType.; line 1 pos 7; 'Aggregate [unresolvedalias(bitmap_or_agg(12, 0, 0), None)] +- OneRowRelation ``` After PR: ``` spark-sql (default)> select bitmap_count(12); [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "bitmap_count(12)" due to data type mismatch: Parameter 0 requires the "BINARY" type, however "12" has the type "INT".; line 1 pos 7; 'Project [unresolvedalias(bitmap_count(12), None)] +- OneRowRelation spark-sql (default)> select bitmap_or_agg(12); [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "bitmap_or_agg(12)" due to data type mismatch: Parameter 0 requires the "BINARY" type, however "12" has the type "INT".; line 1 pos 7; 'Aggregate [unresolvedalias(bitmap_or_agg(12, 0, 0), None)] +- OneRowRelation ``` ### How was this patch tested? New unit tests. Closes #42139 from bersprockets/bitmap_type_check. Authored-by: Bruce Robbins Signed-off-by: Wenchen Fan --- .../catalyst/expressions/bitmapExpressions.scala | 26 +++-- .../spark/sql/BitmapExpressionsQuerySuite.scala| 44 ++ 2 files changed, 66 insertions(+), 4 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/bitmapExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/bitmapExpressions.scala index 2adfddb9383..5c7ef5cde5b 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/bitmapExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/bitmapExpressions.scala @@ -19,10 +19,12 @@ package org.apache.spark.sql.catalyst.expressions import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.catalyst.analysis.TypeCheckResult +import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{DataTypeMismatch, TypeCheckSuccess} import org.apache.spark.sql.catalyst.expressions.aggregate.ImperativeAggregate import org.apache.spark.sql.catalyst.expressions.objects.StaticInvoke import org.apache.spark.sql.catalyst.trees.UnaryLike import org.apache.spark.sql.catalyst.types.DataTypeUtils +import org.apache.spark.sql.catalyst.util.TypeUtils._ import org.apache.spark.sql.errors.QueryExecutionErrors import org.apache.spark.sql.types.{AbstractDataType, BinaryType, DataType, LongType, StructType} @@ -111,9 +113,17 @@ case class BitmapCount(child: Expression) override def checkInputDataTypes(): TypeCheckResult = { if (child.dataType != BinaryType) { - TypeCheckResult.TypeCheckFailure("Bitmap must be a BinaryType") + DataTypeMismatch( +errorSubClass = "UNEXPECTED_INPUT_TYPE", +messageParameters = Map( + "paramIndex" -> "0", + "requiredType" -> toSQLType(BinaryType), + "inputSql" -> toSQLExpr(child), + "inputType" -> toSQLType(child.dataType) +) + ) } else { - TypeCheckResult.TypeCheckSuccess + TypeCheckSuccess } } @@ -248,9 +258,17 @@ case class BitmapOrAgg(child: Expression, override def checkInputDataTypes(): TypeCheckResult = { if (child.dataType !=
[GitHub] [spark-website] peter-toth commented on pull request #469: Add Peter Toth to committers
peter-toth commented on PR #469: URL: https://github.com/apache/spark-website/pull/469#issuecomment-1651582097 Thanks @zhengruifeng! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.5 updated: [SPARK-44531][CONNECT][SQL] Move encoder inference to sql/api
This is an automated email from the ASF dual-hosted git repository. hvanhovell pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new e0c8f14ce53 [SPARK-44531][CONNECT][SQL] Move encoder inference to sql/api e0c8f14ce53 is described below commit e0c8f14ce53080e2863c076b7912239bee35003e Author: Herman van Hovell AuthorDate: Wed Jul 26 07:15:27 2023 -0400 [SPARK-44531][CONNECT][SQL] Move encoder inference to sql/api ### What changes were proposed in this pull request? This PR move encoder inference (ScalaReflection/RowEncoder/JavaTypeInference) into sql/api. ### Why are the changes needed? We want to use encoder inference in the spark connect scala client. The client's dependency to catalyst is going away, so we need to move this. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #42134 from hvanhovell/SPARK-44531. Authored-by: Herman van Hovell Signed-off-by: Herman van Hovell (cherry picked from commit 071feabbd4325504332679dfa620bc5ee4359370) Signed-off-by: Herman van Hovell --- .../spark/ml/source/image/ImageFileFormat.scala| 4 +- .../spark/ml/source/libsvm/LibSVMRelation.scala| 4 +- project/MimaExcludes.scala | 3 + sql/api/pom.xml| 4 ++ .../java/org/apache/spark/sql/types/DataTypes.java | 0 .../apache/spark/sql/types/SQLUserDefinedType.java | 0 .../scala/org/apache/spark/sql/SqlApiConf.scala| 2 + .../spark/sql/catalyst/JavaTypeInference.scala | 9 ++- .../spark/sql/catalyst/ScalaReflection.scala | 13 ++-- .../apache/spark/sql/catalyst/WalkedTypePath.scala | 0 .../spark/sql/catalyst/encoders/RowEncoder.scala | 19 ++ .../spark/sql/errors/DataTypeErrorsBase.scala | 8 ++- .../apache/spark/sql/errors/EncoderErrors.scala| 74 ++ sql/catalyst/pom.xml | 5 -- .../sql/catalyst/encoders/ExpressionEncoder.scala | 8 ++- .../spark/sql/catalyst/plans/logical/object.scala | 8 +-- .../spark/sql/errors/QueryExecutionErrors.scala| 60 +- .../org/apache/spark/sql/internal/SQLConf.scala| 2 +- .../spark/sql/CalendarIntervalBenchmark.scala | 4 +- .../scala/org/apache/spark/sql/HashBenchmark.scala | 4 +- .../spark/sql/UnsafeProjectionBenchmark.scala | 4 +- .../sql/catalyst/encoders/RowEncoderSuite.scala| 50 +++ .../expressions/HashExpressionsSuite.scala | 4 +- .../expressions/ObjectExpressionsSuite.scala | 2 +- .../optimizer/ObjectSerializerPruningSuite.scala | 4 +- .../catalyst/util/ArrayDataIndexedSeqSuite.scala | 4 +- .../spark/sql/catalyst/util/UnsafeArraySuite.scala | 6 +- .../main/scala/org/apache/spark/sql/Dataset.scala | 6 +- .../scala/org/apache/spark/sql/SparkSession.scala | 2 +- .../spark/sql/execution/SparkStrategies.scala | 4 +- .../execution/datasources/DataSourceStrategy.scala | 4 +- .../sql/execution/datasources/jdbc/JdbcUtils.scala | 4 +- .../execution/datasources/v2/V2CommandExec.scala | 4 +- .../FlatMapGroupsInPandasWithStateExec.scala | 4 +- .../execution/streaming/MicroBatchExecution.scala | 4 +- .../sql/execution/streaming/sources/memory.scala | 4 +- .../spark/sql/DataFrameSessionWindowingSuite.scala | 4 +- .../org/apache/spark/sql/DataFrameSuite.scala | 8 +-- .../spark/sql/DataFrameTimeWindowingSuite.scala| 4 +- .../spark/sql/DatasetOptimizationSuite.scala | 4 +- .../scala/org/apache/spark/sql/DatasetSuite.scala | 8 +-- .../spark/sql/execution/GroupedIteratorSuite.scala | 8 +-- .../binaryfile/BinaryFileFormatSuite.scala | 4 +- .../streaming/sources/ForeachBatchSinkSuite.scala | 4 +- .../apache/spark/sql/streaming/StreamTest.scala| 4 +- 45 files changed, 205 insertions(+), 184 deletions(-) diff --git a/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala b/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala index 206ce6f0675..bf6e6b8eec0 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala @@ -25,7 +25,7 @@ import org.apache.hadoop.mapreduce.Job import org.apache.spark.ml.image.ImageSchema import org.apache.spark.sql.SparkSession import org.apache.spark.sql.catalyst.InternalRow -import org.apache.spark.sql.catalyst.encoders.RowEncoder +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder import org.apache.spark.sql.catalyst.expressions.UnsafeRow import org.apache.spark.sql.execution.datasources.{FileFormat, OutputWriterFactory, PartitionedFile} import org.apache.spark.sql.sources.{DataSourceRegister,
[spark] branch master updated: [SPARK-44531][CONNECT][SQL] Move encoder inference to sql/api
This is an automated email from the ASF dual-hosted git repository. hvanhovell pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 071feabbd43 [SPARK-44531][CONNECT][SQL] Move encoder inference to sql/api 071feabbd43 is described below commit 071feabbd4325504332679dfa620bc5ee4359370 Author: Herman van Hovell AuthorDate: Wed Jul 26 07:15:27 2023 -0400 [SPARK-44531][CONNECT][SQL] Move encoder inference to sql/api ### What changes were proposed in this pull request? This PR move encoder inference (ScalaReflection/RowEncoder/JavaTypeInference) into sql/api. ### Why are the changes needed? We want to use encoder inference in the spark connect scala client. The client's dependency to catalyst is going away, so we need to move this. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #42134 from hvanhovell/SPARK-44531. Authored-by: Herman van Hovell Signed-off-by: Herman van Hovell --- .../spark/ml/source/image/ImageFileFormat.scala| 4 +- .../spark/ml/source/libsvm/LibSVMRelation.scala| 4 +- project/MimaExcludes.scala | 3 + sql/api/pom.xml| 4 ++ .../java/org/apache/spark/sql/types/DataTypes.java | 0 .../apache/spark/sql/types/SQLUserDefinedType.java | 0 .../scala/org/apache/spark/sql/SqlApiConf.scala| 2 + .../spark/sql/catalyst/JavaTypeInference.scala | 9 ++- .../spark/sql/catalyst/ScalaReflection.scala | 13 ++-- .../apache/spark/sql/catalyst/WalkedTypePath.scala | 0 .../spark/sql/catalyst/encoders/RowEncoder.scala | 19 ++ .../spark/sql/errors/DataTypeErrorsBase.scala | 8 ++- .../apache/spark/sql/errors/EncoderErrors.scala| 74 ++ sql/catalyst/pom.xml | 5 -- .../sql/catalyst/encoders/ExpressionEncoder.scala | 8 ++- .../spark/sql/catalyst/plans/logical/object.scala | 8 +-- .../spark/sql/errors/QueryExecutionErrors.scala| 60 +- .../org/apache/spark/sql/internal/SQLConf.scala| 2 +- .../spark/sql/CalendarIntervalBenchmark.scala | 4 +- .../scala/org/apache/spark/sql/HashBenchmark.scala | 4 +- .../spark/sql/UnsafeProjectionBenchmark.scala | 4 +- .../sql/catalyst/encoders/RowEncoderSuite.scala| 50 +++ .../expressions/HashExpressionsSuite.scala | 4 +- .../expressions/ObjectExpressionsSuite.scala | 2 +- .../optimizer/ObjectSerializerPruningSuite.scala | 4 +- .../catalyst/util/ArrayDataIndexedSeqSuite.scala | 4 +- .../spark/sql/catalyst/util/UnsafeArraySuite.scala | 6 +- .../main/scala/org/apache/spark/sql/Dataset.scala | 6 +- .../scala/org/apache/spark/sql/SparkSession.scala | 2 +- .../spark/sql/execution/SparkStrategies.scala | 4 +- .../execution/datasources/DataSourceStrategy.scala | 4 +- .../sql/execution/datasources/jdbc/JdbcUtils.scala | 4 +- .../execution/datasources/v2/V2CommandExec.scala | 4 +- .../FlatMapGroupsInPandasWithStateExec.scala | 4 +- .../execution/streaming/MicroBatchExecution.scala | 4 +- .../sql/execution/streaming/sources/memory.scala | 4 +- .../spark/sql/DataFrameSessionWindowingSuite.scala | 4 +- .../org/apache/spark/sql/DataFrameSuite.scala | 8 +-- .../spark/sql/DataFrameTimeWindowingSuite.scala| 4 +- .../spark/sql/DatasetOptimizationSuite.scala | 4 +- .../scala/org/apache/spark/sql/DatasetSuite.scala | 8 +-- .../spark/sql/execution/GroupedIteratorSuite.scala | 8 +-- .../binaryfile/BinaryFileFormatSuite.scala | 4 +- .../streaming/sources/ForeachBatchSinkSuite.scala | 4 +- .../apache/spark/sql/streaming/StreamTest.scala| 4 +- 45 files changed, 205 insertions(+), 184 deletions(-) diff --git a/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala b/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala index 206ce6f0675..bf6e6b8eec0 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala @@ -25,7 +25,7 @@ import org.apache.hadoop.mapreduce.Job import org.apache.spark.ml.image.ImageSchema import org.apache.spark.sql.SparkSession import org.apache.spark.sql.catalyst.InternalRow -import org.apache.spark.sql.catalyst.encoders.RowEncoder +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder import org.apache.spark.sql.catalyst.expressions.UnsafeRow import org.apache.spark.sql.execution.datasources.{FileFormat, OutputWriterFactory, PartitionedFile} import org.apache.spark.sql.sources.{DataSourceRegister, Filter} @@ -90,7 +90,7 @@ private[image] class ImageFileFormat extends FileFormat with DataSourceRegister if
[spark] branch branch-3.5 updated: [SPARK-44525][SQL] Improve error message when Invoke method is not found
This is an automated email from the ASF dual-hosted git repository. yangjie01 pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new ad370115609 [SPARK-44525][SQL] Improve error message when Invoke method is not found ad370115609 is described below commit ad370115609cf302abb486d31c5460468979bd33 Author: Cheng Pan AuthorDate: Wed Jul 26 16:34:08 2023 +0800 [SPARK-44525][SQL] Improve error message when Invoke method is not found ### What changes were proposed in this pull request? This PR aims to improve the error message when `Invoke`'s `method` is not found. ### Why are the changes needed? Currently, the error message is not clear when `Invoke`'s `method` is not found. There is one error message I have encountered, which is not much helpful to find the root cause. ``` org.apache.spark.SparkException: [INTERNAL_ERROR] A method named "invoke" is not declared in any enclosing class nor any supertype at org.apache.spark.SparkException$.internalError(SparkException.scala:77) at org.apache.spark.SparkException$.internalError(SparkException.scala:81) at org.apache.spark.sql.errors.QueryExecutionErrors$.methodNotDeclaredError(QueryExecutionErrors.scala:452) at org.apache.spark.sql.catalyst.expressions.objects.InvokeLike.findMethod(objects.scala:173) at org.apache.spark.sql.catalyst.expressions.objects.InvokeLike.findMethod$(objects.scala:170) at org.apache.spark.sql.catalyst.expressions.objects.Invoke.findMethod(objects.scala:363) at org.apache.spark.sql.catalyst.expressions.objects.Invoke.method$lzycompute(objects.scala:391) at org.apache.spark.sql.catalyst.expressions.objects.Invoke.method(objects.scala:389) at org.apache.spark.sql.catalyst.expressions.objects.Invoke.eval(objects.scala:401) at org.apache.spark.sql.catalyst.expressions.HashExpression.eval(hash.scala:292) at org.apache.spark.sql.catalyst.expressions.Pmod.eval(arithmetic.scala:1054) ``` ### Does this PR introduce _any_ user-facing change? Yes, Spark returns a more clear error message. ### How was this patch tested? Add UT. Closes #42128 from pan3793/SPARK-44525. Authored-by: Cheng Pan Signed-off-by: yangjie01 (cherry picked from commit a84e2b1eee3bb868f140bffeba4e19b1a56fa3fb) Signed-off-by: yangjie01 --- .../sql/catalyst/expressions/objects/objects.scala | 2 +- .../spark/sql/errors/QueryExecutionErrors.scala| 9 ++ .../expressions/ObjectExpressionsSuite.scala | 34 +- 3 files changed, 43 insertions(+), 2 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala index fec60aef1bf..32bcdaf8609 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala @@ -185,7 +185,7 @@ trait InvokeLike extends Expression with NonSQLExpression with ImplicitCastInput final def findMethod(cls: Class[_], functionName: String, argClasses: Seq[Class[_]]): Method = { val method = MethodUtils.getMatchingAccessibleMethod(cls, functionName, argClasses: _*) if (method == null) { - throw QueryExecutionErrors.methodNotDeclaredError(functionName) + throw QueryExecutionErrors.methodNotFoundError(cls, functionName, argClasses) } else { method } diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala index 2e65d672698..2d0e29b1032 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala @@ -464,6 +464,15 @@ private[sql] object QueryExecutionErrors extends QueryErrorsBase { s"""A method named "$name" is not declared in any enclosing class nor any supertype""") } + def methodNotFoundError( + cls: Class[_], + functionName: String, + argClasses: Seq[Class[_]]): Throwable = { +SparkException.internalError( + s"Couldn't find method $functionName with arguments " + +s"${argClasses.mkString("(", ", ", ")")} on $cls.") + } + def constructorNotFoundError(cls: String): SparkRuntimeException = { new SparkRuntimeException( errorClass = "_LEGACY_ERROR_TEMP_2020", diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ObjectExpressionsSuite.scala
[spark] branch master updated: [SPARK-44525][SQL] Improve error message when Invoke method is not found
This is an automated email from the ASF dual-hosted git repository. yangjie01 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a84e2b1eee3 [SPARK-44525][SQL] Improve error message when Invoke method is not found a84e2b1eee3 is described below commit a84e2b1eee3bb868f140bffeba4e19b1a56fa3fb Author: Cheng Pan AuthorDate: Wed Jul 26 16:34:08 2023 +0800 [SPARK-44525][SQL] Improve error message when Invoke method is not found ### What changes were proposed in this pull request? This PR aims to improve the error message when `Invoke`'s `method` is not found. ### Why are the changes needed? Currently, the error message is not clear when `Invoke`'s `method` is not found. There is one error message I have encountered, which is not much helpful to find the root cause. ``` org.apache.spark.SparkException: [INTERNAL_ERROR] A method named "invoke" is not declared in any enclosing class nor any supertype at org.apache.spark.SparkException$.internalError(SparkException.scala:77) at org.apache.spark.SparkException$.internalError(SparkException.scala:81) at org.apache.spark.sql.errors.QueryExecutionErrors$.methodNotDeclaredError(QueryExecutionErrors.scala:452) at org.apache.spark.sql.catalyst.expressions.objects.InvokeLike.findMethod(objects.scala:173) at org.apache.spark.sql.catalyst.expressions.objects.InvokeLike.findMethod$(objects.scala:170) at org.apache.spark.sql.catalyst.expressions.objects.Invoke.findMethod(objects.scala:363) at org.apache.spark.sql.catalyst.expressions.objects.Invoke.method$lzycompute(objects.scala:391) at org.apache.spark.sql.catalyst.expressions.objects.Invoke.method(objects.scala:389) at org.apache.spark.sql.catalyst.expressions.objects.Invoke.eval(objects.scala:401) at org.apache.spark.sql.catalyst.expressions.HashExpression.eval(hash.scala:292) at org.apache.spark.sql.catalyst.expressions.Pmod.eval(arithmetic.scala:1054) ``` ### Does this PR introduce _any_ user-facing change? Yes, Spark returns a more clear error message. ### How was this patch tested? Add UT. Closes #42128 from pan3793/SPARK-44525. Authored-by: Cheng Pan Signed-off-by: yangjie01 --- .../sql/catalyst/expressions/objects/objects.scala | 2 +- .../spark/sql/errors/QueryExecutionErrors.scala| 9 ++ .../expressions/ObjectExpressionsSuite.scala | 34 +- 3 files changed, 43 insertions(+), 2 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala index fec60aef1bf..32bcdaf8609 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala @@ -185,7 +185,7 @@ trait InvokeLike extends Expression with NonSQLExpression with ImplicitCastInput final def findMethod(cls: Class[_], functionName: String, argClasses: Seq[Class[_]]): Method = { val method = MethodUtils.getMatchingAccessibleMethod(cls, functionName, argClasses: _*) if (method == null) { - throw QueryExecutionErrors.methodNotDeclaredError(functionName) + throw QueryExecutionErrors.methodNotFoundError(cls, functionName, argClasses) } else { method } diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala index 3fb14cd079f..7ddb6ee982e 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala @@ -464,6 +464,15 @@ private[sql] object QueryExecutionErrors extends QueryErrorsBase { s"""A method named "$name" is not declared in any enclosing class nor any supertype""") } + def methodNotFoundError( + cls: Class[_], + functionName: String, + argClasses: Seq[Class[_]]): Throwable = { +SparkException.internalError( + s"Couldn't find method $functionName with arguments " + +s"${argClasses.mkString("(", ", ", ")")} on $cls.") + } + def constructorNotFoundError(cls: String): SparkRuntimeException = { new SparkRuntimeException( errorClass = "_LEGACY_ERROR_TEMP_2020", diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ObjectExpressionsSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ObjectExpressionsSuite.scala index 63edba80ec8..73da5f4d3af
[spark] branch master updated: [SPARK-44544][INFRA] Deduplicate `run_python_packaging_tests`
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 748eaff4e21 [SPARK-44544][INFRA] Deduplicate `run_python_packaging_tests` 748eaff4e21 is described below commit 748eaff4e2177466dd746f6fbb82de8544bc7168 Author: Ruifeng Zheng AuthorDate: Wed Jul 26 15:52:38 2023 +0800 [SPARK-44544][INFRA] Deduplicate `run_python_packaging_tests` ### What changes were proposed in this pull request? it seems that `run_python_packaging_tests` requires some disk space and cause some pyspark modules fail, this PR is to make `run_python_packaging_tests` only enabled within `pyspark-errors` (which is the smallest pyspark test module) ![image](https://github.com/apache/spark/assets/7322292/2d37c141-15b8-4d9f-bfbd-4dd7782ab62e) ### Why are the changes needed? 1, it seems it is the `run_python_packaging_tests` that cause the `No space left` error; 2, the `run_python_packaging_tests` is tested in all `pyspark-*` test modules, should be deduplicated; ### Does this PR introduce _any_ user-facing change? no, infra-only ### How was this patch tested? updated CI Closes #42146 from zhengruifeng/infra_skip_py_packing_tests. Authored-by: Ruifeng Zheng Signed-off-by: Ruifeng Zheng --- .github/workflows/build_and_test.yml | 16 ++-- dev/run-tests.py | 2 +- 2 files changed, 15 insertions(+), 3 deletions(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 7107af66129..02b3814a018 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -205,6 +205,7 @@ jobs: HIVE_PROFILE: ${{ matrix.hive }} GITHUB_PREV_SHA: ${{ github.event.before }} SPARK_LOCAL_IP: localhost + SKIP_PACKAGING: true steps: - name: Checkout Spark repository uses: actions/checkout@v3 @@ -344,6 +345,8 @@ jobs: java: - ${{ inputs.java }} modules: + - >- +pyspark-errors - >- pyspark-sql, pyspark-mllib, pyspark-resource, pyspark-testing - >- @@ -353,7 +356,7 @@ jobs: - >- pyspark-pandas-slow - >- -pyspark-connect, pyspark-errors +pyspark-connect - >- pyspark-pandas-connect - >- @@ -366,6 +369,7 @@ jobs: SPARK_LOCAL_IP: localhost SKIP_UNIDOC: true SKIP_MIMA: true + SKIP_PACKAGING: true METASPACE_SIZE: 1g steps: - name: Checkout Spark repository @@ -414,14 +418,20 @@ jobs: python3.9 -m pip list pypy3 -m pip list - name: Install Conda for pip packaging test + if: ${{ matrix.modules == 'pyspark-errors' }} run: | curl -s https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh > miniconda.sh bash miniconda.sh -b -p $HOME/miniconda # Run the tests. - name: Run tests env: ${{ fromJSON(inputs.envs) }} + shell: 'script -q -e -c "bash {0}"' run: | -export PATH=$PATH:$HOME/miniconda/bin +if [[ "$MODULES_TO_TEST" == "pyspark-errors" ]]; then + export PATH=$PATH:$HOME/miniconda/bin + export SKIP_PACKAGING=false + echo "Python Packaging Tests Enabled!" +fi ./dev/run-tests --parallelism 1 --modules "$MODULES_TO_TEST" - name: Upload coverage to Codecov if: fromJSON(inputs.envs).PYSPARK_CODECOV == 'true' @@ -457,6 +467,7 @@ jobs: GITHUB_PREV_SHA: ${{ github.event.before }} SPARK_LOCAL_IP: localhost SKIP_MIMA: true + SKIP_PACKAGING: true steps: - name: Checkout Spark repository uses: actions/checkout@v3 @@ -911,6 +922,7 @@ jobs: SPARK_LOCAL_IP: localhost ORACLE_DOCKER_IMAGE_NAME: gvenzl/oracle-xe:21.3.0 SKIP_MIMA: true + SKIP_PACKAGING: true steps: - name: Checkout Spark repository uses: actions/checkout@v3 diff --git a/dev/run-tests.py b/dev/run-tests.py index c0c281b549e..9bf3095edb7 100755 --- a/dev/run-tests.py +++ b/dev/run-tests.py @@ -395,7 +395,7 @@ def run_python_tests(test_modules, parallelism, with_coverage=False): def run_python_packaging_tests(): -if not os.environ.get("SPARK_JENKINS"): +if not os.environ.get("SPARK_JENKINS") and os.environ.get("SKIP_PACKAGING", "false") != "true": set_title_and_block("Running PySpark packaging tests", "BLOCK_PYSPARK_PIP_TESTS") command = [os.path.join(SPARK_HOME, "dev", "run-pip-tests")] run_cmd(command) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail:
[spark] branch branch-3.5 updated: [SPARK-44544][INFRA] Deduplicate `run_python_packaging_tests`
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new a8539688186 [SPARK-44544][INFRA] Deduplicate `run_python_packaging_tests` a8539688186 is described below commit a8539688186be40c81c39050e70a49a9ef01519f Author: Ruifeng Zheng AuthorDate: Wed Jul 26 15:52:38 2023 +0800 [SPARK-44544][INFRA] Deduplicate `run_python_packaging_tests` ### What changes were proposed in this pull request? it seems that `run_python_packaging_tests` requires some disk space and cause some pyspark modules fail, this PR is to make `run_python_packaging_tests` only enabled within `pyspark-errors` (which is the smallest pyspark test module) ![image](https://github.com/apache/spark/assets/7322292/2d37c141-15b8-4d9f-bfbd-4dd7782ab62e) ### Why are the changes needed? 1, it seems it is the `run_python_packaging_tests` that cause the `No space left` error; 2, the `run_python_packaging_tests` is tested in all `pyspark-*` test modules, should be deduplicated; ### Does this PR introduce _any_ user-facing change? no, infra-only ### How was this patch tested? updated CI Closes #42146 from zhengruifeng/infra_skip_py_packing_tests. Authored-by: Ruifeng Zheng Signed-off-by: Ruifeng Zheng (cherry picked from commit 748eaff4e2177466dd746f6fbb82de8544bc7168) Signed-off-by: Ruifeng Zheng --- .github/workflows/build_and_test.yml | 16 ++-- dev/run-tests.py | 2 +- 2 files changed, 15 insertions(+), 3 deletions(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 54fe9f38ddd..1fcca7e4c39 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -204,6 +204,7 @@ jobs: HIVE_PROFILE: ${{ matrix.hive }} GITHUB_PREV_SHA: ${{ github.event.before }} SPARK_LOCAL_IP: localhost + SKIP_PACKAGING: true steps: - name: Checkout Spark repository uses: actions/checkout@v3 @@ -343,6 +344,8 @@ jobs: java: - ${{ inputs.java }} modules: + - >- +pyspark-errors - >- pyspark-sql, pyspark-mllib, pyspark-resource, pyspark-testing - >- @@ -352,7 +355,7 @@ jobs: - >- pyspark-pandas-slow - >- -pyspark-connect, pyspark-errors +pyspark-connect - >- pyspark-pandas-connect - >- @@ -365,6 +368,7 @@ jobs: SPARK_LOCAL_IP: localhost SKIP_UNIDOC: true SKIP_MIMA: true + SKIP_PACKAGING: true METASPACE_SIZE: 1g steps: - name: Checkout Spark repository @@ -413,14 +417,20 @@ jobs: python3.9 -m pip list pypy3 -m pip list - name: Install Conda for pip packaging test + if: ${{ matrix.modules == 'pyspark-errors' }} run: | curl -s https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh > miniconda.sh bash miniconda.sh -b -p $HOME/miniconda # Run the tests. - name: Run tests env: ${{ fromJSON(inputs.envs) }} + shell: 'script -q -e -c "bash {0}"' run: | -export PATH=$PATH:$HOME/miniconda/bin +if [[ "$MODULES_TO_TEST" == "pyspark-errors" ]]; then + export PATH=$PATH:$HOME/miniconda/bin + export SKIP_PACKAGING=false + echo "Python Packaging Tests Enabled!" +fi ./dev/run-tests --parallelism 1 --modules "$MODULES_TO_TEST" - name: Upload coverage to Codecov if: fromJSON(inputs.envs).PYSPARK_CODECOV == 'true' @@ -456,6 +466,7 @@ jobs: GITHUB_PREV_SHA: ${{ github.event.before }} SPARK_LOCAL_IP: localhost SKIP_MIMA: true + SKIP_PACKAGING: true steps: - name: Checkout Spark repository uses: actions/checkout@v3 @@ -900,6 +911,7 @@ jobs: SPARK_LOCAL_IP: localhost ORACLE_DOCKER_IMAGE_NAME: gvenzl/oracle-xe:21.3.0 SKIP_MIMA: true + SKIP_PACKAGING: true steps: - name: Checkout Spark repository uses: actions/checkout@v3 diff --git a/dev/run-tests.py b/dev/run-tests.py index c0c281b549e..9bf3095edb7 100755 --- a/dev/run-tests.py +++ b/dev/run-tests.py @@ -395,7 +395,7 @@ def run_python_tests(test_modules, parallelism, with_coverage=False): def run_python_packaging_tests(): -if not os.environ.get("SPARK_JENKINS"): +if not os.environ.get("SPARK_JENKINS") and os.environ.get("SKIP_PACKAGING", "false") != "true": set_title_and_block("Running PySpark packaging tests", "BLOCK_PYSPARK_PIP_TESTS") command = [os.path.join(SPARK_HOME, "dev", "run-pip-tests")] run_cmd(command)
[spark] branch branch-3.5 updated: [SPARK-44264][PYTHON][DOCS] Added Example to Deepspeed Distributor
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new d4b03ec53db [SPARK-44264][PYTHON][DOCS] Added Example to Deepspeed Distributor d4b03ec53db is described below commit d4b03ec53db98d237e00aa9e097ef69faa19b4b1 Author: Mathew Jacob AuthorDate: Wed Jul 26 16:22:45 2023 +0900 [SPARK-44264][PYTHON][DOCS] Added Example to Deepspeed Distributor ### What changes were proposed in this pull request? Added examples to the docstring of using DeepspeedTorchDistributor ### Why are the changes needed? More concrete examples, allowing for a better understanding of feature. ### Does this PR introduce _any_ user-facing change? Yes, docs changes. ### How was this patch tested? make html Closes #42087 from mathewjacob1002/docs_deepspeed. Lead-authored-by: Mathew Jacob Co-authored-by: Mathew Jacob <134338709+mathewjacob1...@users.noreply.github.com> Signed-off-by: Hyukjin Kwon (cherry picked from commit ac8fe83af2178d76a9e3df9fedf008ef26d8d044) Signed-off-by: Hyukjin Kwon --- .../pyspark/ml/deepspeed/deepspeed_distributor.py | 50 -- 1 file changed, 38 insertions(+), 12 deletions(-) diff --git a/python/pyspark/ml/deepspeed/deepspeed_distributor.py b/python/pyspark/ml/deepspeed/deepspeed_distributor.py index d6ae98de5e3..7c2b8c43526 100644 --- a/python/pyspark/ml/deepspeed/deepspeed_distributor.py +++ b/python/pyspark/ml/deepspeed/deepspeed_distributor.py @@ -35,11 +35,11 @@ class DeepspeedTorchDistributor(TorchDistributor): def __init__( self, -num_gpus: int = 1, +numGpus: int = 1, nnodes: int = 1, -local_mode: bool = True, -use_gpu: bool = True, -deepspeed_config: Optional[Union[str, Dict[str, Any]]] = None, +localMode: bool = True, +useGpu: bool = True, +deepspeedConfig: Optional[Union[str, Dict[str, Any]]] = None, ): """ This class is used to run deepspeed training workloads with spark clusters. @@ -49,25 +49,51 @@ class DeepspeedTorchDistributor(TorchDistributor): Parameters -- -num_gpus: int +numGpus: int The number of GPUs to use per node (analagous to num_gpus in deepspeed command). nnodes: int The number of nodes that should be used for the run. -local_mode: bool +localMode: bool Whether or not to run the training in a distributed fashion or just locally. -use_gpu: bool +useGpu: bool Boolean flag to determine whether to utilize gpus. -deepspeed_config: Union[Dict[str,Any], str] or None: +deepspeedConfig: Union[Dict[str,Any], str] or None: The configuration file to be used for launching the deepspeed application. If it's a dictionary containing the parameters, then we will create the file. If None, deepspeed will fall back to default parameters. + +Examples + +Run Deepspeed training function on a single node + +>>> def train(learning_rate): +... import deepspeed +... # rest of training function +... return model +>>> distributor = DeepspeedTorchDistributor( +... numGpus=4, +... nnodes=1, +... useGpu=True, +... localMode=True, +... deepspeedConfig="path/to/config.json") +>>> output = distributor.run(train, 0.01) + +Run Deepspeed training function on multiple nodes + +>>> distributor = DeepspeedTorchDistributor( +... numGpus=4, +... nnodes=3, +... useGpu=True, +... localMode=False, +... deepspeedConfig="path/to/config.json") +>>> output = distributor.run(train, 0.01) """ -num_processes = num_gpus * nnodes -self.deepspeed_config = deepspeed_config +num_processes = numGpus * nnodes +self.deepspeed_config = deepspeedConfig super().__init__( num_processes, -local_mode, -use_gpu, +localMode, +useGpu, _ssl_conf=DeepspeedTorchDistributor._DEEPSPEED_SSL_CONF, ) self.cleanup_deepspeed_conf = False - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-44264][PYTHON][DOCS] Added Example to Deepspeed Distributor
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ac8fe83af21 [SPARK-44264][PYTHON][DOCS] Added Example to Deepspeed Distributor ac8fe83af21 is described below commit ac8fe83af2178d76a9e3df9fedf008ef26d8d044 Author: Mathew Jacob AuthorDate: Wed Jul 26 16:22:45 2023 +0900 [SPARK-44264][PYTHON][DOCS] Added Example to Deepspeed Distributor ### What changes were proposed in this pull request? Added examples to the docstring of using DeepspeedTorchDistributor ### Why are the changes needed? More concrete examples, allowing for a better understanding of feature. ### Does this PR introduce _any_ user-facing change? Yes, docs changes. ### How was this patch tested? make html Closes #42087 from mathewjacob1002/docs_deepspeed. Lead-authored-by: Mathew Jacob Co-authored-by: Mathew Jacob <134338709+mathewjacob1...@users.noreply.github.com> Signed-off-by: Hyukjin Kwon --- .../pyspark/ml/deepspeed/deepspeed_distributor.py | 50 -- 1 file changed, 38 insertions(+), 12 deletions(-) diff --git a/python/pyspark/ml/deepspeed/deepspeed_distributor.py b/python/pyspark/ml/deepspeed/deepspeed_distributor.py index d6ae98de5e3..7c2b8c43526 100644 --- a/python/pyspark/ml/deepspeed/deepspeed_distributor.py +++ b/python/pyspark/ml/deepspeed/deepspeed_distributor.py @@ -35,11 +35,11 @@ class DeepspeedTorchDistributor(TorchDistributor): def __init__( self, -num_gpus: int = 1, +numGpus: int = 1, nnodes: int = 1, -local_mode: bool = True, -use_gpu: bool = True, -deepspeed_config: Optional[Union[str, Dict[str, Any]]] = None, +localMode: bool = True, +useGpu: bool = True, +deepspeedConfig: Optional[Union[str, Dict[str, Any]]] = None, ): """ This class is used to run deepspeed training workloads with spark clusters. @@ -49,25 +49,51 @@ class DeepspeedTorchDistributor(TorchDistributor): Parameters -- -num_gpus: int +numGpus: int The number of GPUs to use per node (analagous to num_gpus in deepspeed command). nnodes: int The number of nodes that should be used for the run. -local_mode: bool +localMode: bool Whether or not to run the training in a distributed fashion or just locally. -use_gpu: bool +useGpu: bool Boolean flag to determine whether to utilize gpus. -deepspeed_config: Union[Dict[str,Any], str] or None: +deepspeedConfig: Union[Dict[str,Any], str] or None: The configuration file to be used for launching the deepspeed application. If it's a dictionary containing the parameters, then we will create the file. If None, deepspeed will fall back to default parameters. + +Examples + +Run Deepspeed training function on a single node + +>>> def train(learning_rate): +... import deepspeed +... # rest of training function +... return model +>>> distributor = DeepspeedTorchDistributor( +... numGpus=4, +... nnodes=1, +... useGpu=True, +... localMode=True, +... deepspeedConfig="path/to/config.json") +>>> output = distributor.run(train, 0.01) + +Run Deepspeed training function on multiple nodes + +>>> distributor = DeepspeedTorchDistributor( +... numGpus=4, +... nnodes=3, +... useGpu=True, +... localMode=False, +... deepspeedConfig="path/to/config.json") +>>> output = distributor.run(train, 0.01) """ -num_processes = num_gpus * nnodes -self.deepspeed_config = deepspeed_config +num_processes = numGpus * nnodes +self.deepspeed_config = deepspeedConfig super().__init__( num_processes, -local_mode, -use_gpu, +localMode, +useGpu, _ssl_conf=DeepspeedTorchDistributor._DEEPSPEED_SSL_CONF, ) self.cleanup_deepspeed_conf = False - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.5 updated: [SPARK-44264][PYTHON] Added Deepspeed Folder to setup.py
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 11f9d00caed [SPARK-44264][PYTHON] Added Deepspeed Folder to setup.py 11f9d00caed is described below commit 11f9d00caedc3cb1dfc94cf5cfbdfd7aa7c93576 Author: Mathew Jacob AuthorDate: Wed Jul 26 16:21:01 2023 +0900 [SPARK-44264][PYTHON] Added Deepspeed Folder to setup.py ### What changes were proposed in this pull request? Added pyspark.ml.deepspeed to the setup.py ### Why are the changes needed? It allows the deepspeed distributor to be pip installed when you pip install pyspark. ### Does this PR introduce _any_ user-facing change? No Closes #42123 from mathewjacob1002/add_deepspeed_setup_py. Authored-by: Mathew Jacob Signed-off-by: Hyukjin Kwon (cherry picked from commit 89c3472c5feb7b9625f55f82d08429dcb55a6fee) Signed-off-by: Hyukjin Kwon --- python/pyspark/ml/deepspeed/__init__.py | 16 python/setup.py | 1 + 2 files changed, 17 insertions(+) diff --git a/python/pyspark/ml/deepspeed/__init__.py b/python/pyspark/ml/deepspeed/__init__.py new file mode 100644 index 000..cce3acad34a --- /dev/null +++ b/python/pyspark/ml/deepspeed/__init__.py @@ -0,0 +1,16 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# diff --git a/python/setup.py b/python/setup.py index f190930b2a7..3774273c421 100755 --- a/python/setup.py +++ b/python/setup.py @@ -241,6 +241,7 @@ try: "pyspark.ml.linalg", "pyspark.ml.param", "pyspark.ml.torch", +"pyspark.ml.deepspeed", "pyspark.sql", "pyspark.sql.avro", "pyspark.sql.connect", - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-44264][PYTHON] Added Deepspeed Folder to setup.py
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 89c3472c5fe [SPARK-44264][PYTHON] Added Deepspeed Folder to setup.py 89c3472c5fe is described below commit 89c3472c5feb7b9625f55f82d08429dcb55a6fee Author: Mathew Jacob AuthorDate: Wed Jul 26 16:21:01 2023 +0900 [SPARK-44264][PYTHON] Added Deepspeed Folder to setup.py ### What changes were proposed in this pull request? Added pyspark.ml.deepspeed to the setup.py ### Why are the changes needed? It allows the deepspeed distributor to be pip installed when you pip install pyspark. ### Does this PR introduce _any_ user-facing change? No Closes #42123 from mathewjacob1002/add_deepspeed_setup_py. Authored-by: Mathew Jacob Signed-off-by: Hyukjin Kwon --- python/pyspark/ml/deepspeed/__init__.py | 16 python/setup.py | 1 + 2 files changed, 17 insertions(+) diff --git a/python/pyspark/ml/deepspeed/__init__.py b/python/pyspark/ml/deepspeed/__init__.py new file mode 100644 index 000..cce3acad34a --- /dev/null +++ b/python/pyspark/ml/deepspeed/__init__.py @@ -0,0 +1,16 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# diff --git a/python/setup.py b/python/setup.py index 93ca373c586..fa938b0b4ef 100755 --- a/python/setup.py +++ b/python/setup.py @@ -241,6 +241,7 @@ try: "pyspark.ml.linalg", "pyspark.ml.param", "pyspark.ml.torch", +"pyspark.ml.deepspeed", "pyspark.sql", "pyspark.sql.avro", "pyspark.sql.connect", - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[GitHub] [spark-website] zhengruifeng commented on pull request #469: Add Peter Toth to committers
zhengruifeng commented on PR #469: URL: https://github.com/apache/spark-website/pull/469#issuecomment-1651112455 Late LGTM, Congrats! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[GitHub] [spark-website] peter-toth commented on pull request #469: Add Peter Toth to committers
peter-toth commented on PR #469: URL: https://github.com/apache/spark-website/pull/469#issuecomment-1651044687 Thank you @dongjoon-hyun! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org