[spark] branch master updated: [SPARK-42657][CONNECT][FOLLOWUP] Correct the API version in scaladoc

2023-04-17 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new fc75dab0876 [SPARK-42657][CONNECT][FOLLOWUP] Correct the API version 
in scaladoc
fc75dab0876 is described below

commit fc75dab087696bdc9001a10bd053d52bee8f0ef4
Author: Cheng Pan 
AuthorDate: Tue Apr 18 15:34:12 2023 +0900

[SPARK-42657][CONNECT][FOLLOWUP] Correct the API version in scaladoc

### What changes were proposed in this pull request?

SPARK-42657 is for Spark 3.5.0.

### Why are the changes needed?

Fix the wrong API version

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Passing GA.

Closes #40832 from pan3793/SPARK-42657-followup.

Authored-by: Cheng Pan 
Signed-off-by: Hyukjin Kwon 
---
 .../client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala   | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala
 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala
index e285db39e80..d68988cd435 100644
--- 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala
+++ 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala
@@ -498,7 +498,7 @@ class SparkSession private[sql] (
   /**
* Register a [[ClassFinder]] for dynamically generated classes.
*
-   * @since 3.4.0
+   * @since 3.5.0
*/
   @Experimental
   def registerClassFinder(finder: ClassFinder): Unit = 
client.registerClassFinder(finder)





[spark] branch master updated: [SPARK-43111][PS][CONNECT][PYTHON] Merge nested `if` statements into single `if` statements

2023-04-17 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 462d4565cd2 [SPARK-43111][PS][CONNECT][PYTHON] Merge nested `if` 
statements into single `if` statements
462d4565cd2 is described below

commit 462d4565cd2782fe805c9871eeab2d969c79369f
Author: Bjørn Jørgensen 
AuthorDate: Tue Apr 18 13:13:10 2023 +0900

[SPARK-43111][PS][CONNECT][PYTHON] Merge nested `if` statements into single 
`if` statements

### What changes were proposed in this pull request?
This PR aims to simplify the code by merging nested `if` statements into 
single `if` statements using the `and` operator.

There are 7 of these according to 
[Sonarcloud](https://sonarcloud.io/project/issues?languages=py&resolved=false&rules=python%3AS1066&id=spark-python&open=AYQdnXXBRrJbVxW9ZDpw).
This PR fixes them all.
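
For illustration, a minimal sketch of the pattern being applied; the function and variable names below are made up and not taken from the PR:

```python
def process(ready: bool, payload: list) -> int:
    # Before: nested `if` statements flagged by Sonar rule python:S1066.
    if ready:
        if payload:
            return len(payload)
    return 0


def process_merged(ready: bool, payload: list) -> int:
    # After: the two conditions are merged with `and`; behavior is unchanged.
    if ready and payload:
        return len(payload)
    return 0
```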

### Why are the changes needed?
The changes do not affect the functionality of the code, but they improve 
readability and maintainability.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

Closes #40759 from bjornjorgensen/Merge-if-with-the-enclosing-one.

Lead-authored-by: Bjørn Jørgensen 
Co-authored-by: bjornjorgensen 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/accumulators.py |  5 ++---
 python/pyspark/pandas/frame.py | 17 ++---
 python/pyspark/pandas/groupby.py   | 17 -
 python/pyspark/pandas/indexes/base.py  |  9 -
 python/pyspark/pandas/namespace.py |  5 ++---
 python/pyspark/sql/connect/streaming/readwriter.py | 11 +--
 6 files changed, 31 insertions(+), 33 deletions(-)

diff --git a/python/pyspark/accumulators.py b/python/pyspark/accumulators.py
index fe775a37ed8..ce4bb561814 100644
--- a/python/pyspark/accumulators.py
+++ b/python/pyspark/accumulators.py
@@ -249,9 +249,8 @@ class 
_UpdateRequestHandler(SocketServer.StreamRequestHandler):
 while not self.server.server_shutdown:  # type: 
ignore[attr-defined]
 # Poll every 1 second for new data -- don't block in case of 
shutdown.
 r, _, _ = select.select([self.rfile], [], [], 1)
-if self.rfile in r:
-if func():
-break
+if self.rfile in r and func():
+break
 
 def accum_updates() -> bool:
 num_updates = read_int(self.rfile)
diff --git a/python/pyspark/pandas/frame.py b/python/pyspark/pandas/frame.py
index 8bddcb6bae8..d1c10223432 100644
--- a/python/pyspark/pandas/frame.py
+++ b/python/pyspark/pandas/frame.py
@@ -8915,15 +8915,19 @@ defaultdict(, {'col..., 'col...})]
 if len(index_scols) != other._internal.index_level:
 raise ValueError("Both DataFrames have to have the same number 
of index levels")
 
-if verify_integrity and len(index_scols) > 0:
-if (
+if (
+verify_integrity
+and len(index_scols) > 0
+and (
 self._internal.spark_frame.select(index_scols)
 .intersect(
 
other._internal.spark_frame.select(other._internal.index_spark_columns)
 )
 .count()
-) > 0:
-raise ValueError("Indices have overlapping values")
+)
+> 0
+):
+raise ValueError("Indices have overlapping values")
 
 # Lazy import to avoid circular dependency issues
 from pyspark.pandas.namespace import concat
@@ -11581,9 +11585,8 @@ defaultdict(, {'col..., 'col...})]
 
 index_columns = psdf._internal.index_spark_column_names
 num_indices = len(index_columns)
-if level:
-if level < 0 or level >= num_indices:
-raise ValueError("level should be an integer between [0, 
%s)" % num_indices)
+if level is not None and (level < 0 or level >= num_indices):
+raise ValueError("level should be an integer between [0, %s)" 
% num_indices)
 
 @pandas_udf(returnType=index_mapper_ret_stype)  # type: 
ignore[call-overload]
 def index_mapper_udf(s: pd.Series) -> pd.Series:
diff --git a/python/pyspark/pandas/groupby.py b/python/pyspark/pandas/groupby.py
index 01687c3fd16..01bc72cd809 100644
--- a/python/pyspark/pandas/groupby.py
+++ b/python/pyspark/pandas/groupby.py
@@ -3550,15 +3550,14 @@ class GroupBy(Generic[FrameLike], metaclass=ABCMeta):
 if isinstance(self, SeriesGroupBy):
 raise TypeError("Only numeric aggregation column is accepted.")

[spark] branch branch-3.4 updated: [SPARK-43113][SQL] Evaluate stream-side variables when generating code for a bound condition

2023-04-17 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 55e152acceb [SPARK-43113][SQL] Evaluate stream-side variables when 
generating code for a bound condition
55e152acceb is described below

commit 55e152accebb1100e18a0d51d44bb6552953139c
Author: Bruce Robbins 
AuthorDate: Tue Apr 18 13:09:41 2023 +0900

[SPARK-43113][SQL] Evaluate stream-side variables when generating code for 
a bound condition

### What changes were proposed in this pull request?

In `JoinCodegenSupport#getJoinCondition`, evaluate any referenced 
stream-side variables before using them in the generated code.

This patch doesn't evaluate the passed stream-side variables directly, but 
instead evaluates a copy (`streamVars2`). This is because 
`SortMergeJoin#codegenFullOuter` will want to evaluate the stream-side vars 
within a different scope than the condition check, so we mustn't delete the 
initialization code from the original `ExprCode` instances.

### Why are the changes needed?

When a bound condition of a full outer join references the same stream-side 
column more than once, wholestage codegen generates bad code.

For example, the following query fails with a compilation error:

```
create or replace temp view v1 as
select * from values
(1, 1),
(2, 2),
(3, 1)
as v1(key, value);

create or replace temp view v2 as
select * from values
(1, 22, 22),
(3, -1, -1),
(7, null, null)
as v2(a, b, c);

select *
from v1
full outer join v2
on key = a
and value > b
and value > c;
```
The error is:
```
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
277, Column 9: Redefinition of local variable "smj_isNull_7"
```
The same error occurs with code generated from ShuffleHashJoinExec:
```
select /*+ SHUFFLE_HASH(v2) */ *
from v1
full outer join v2
on key = a
and value > b
and value > c;
```
In this case, the error is:
```
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
174, Column 5: Redefinition of local variable "shj_value_1"
```
Neither `SortMergeJoin#codegenFullOuter` nor 
`ShuffledHashJoinExec#doProduce` evaluate the stream-side variables before 
calling `consumeFullOuterJoinRow#getJoinCondition`. As a result, 
`getJoinCondition` generates definition/initialization code for each referenced 
stream-side variable at the point of use. If a stream-side variable is used 
more than once in the bound condition, the definition/initialization code is 
generated more than once, resulting in the "Redefinition of local varia [...]

In the end, the query succeeds, since Spark disables wholestage codegen and 
tries again.

(For other join-type/strategy pairs, either the implementations don't call 
`JoinCodegenSupport#getJoinCondition`, or the stream-side variables are 
pre-evaluated before the call is made, so no error happens in those cases.)
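
As an illustration (not part of the patch), a PySpark sketch that reproduces the query from the description and prints the whole-stage generated code; it assumes a plain local session:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()

# The views from the description above.
spark.sql(
    "create or replace temp view v1 as "
    "select * from values (1, 1), (2, 2), (3, 1) as v1(key, value)"
)
spark.sql(
    "create or replace temp view v2 as "
    "select * from values (1, 22, 22), (3, -1, -1), (7, null, null) as v2(a, b, c)"
)

df = spark.sql(
    "select * from v1 full outer join v2 on key = a and value > b and value > c"
)

# Print the code produced by whole-stage codegen; with the bug, the duplicated
# stream-side variable definitions described above appear in this output.
df.explain(mode="codegen")

# The query itself still succeeds, because Spark falls back to non-codegen
# execution when compilation fails.
print(df.collect())
```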

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit tests.

Closes #40766 from bersprockets/full_join_codegen_issue.

Authored-by: Bruce Robbins 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 119ec5b2ea86b73afaeabcb1d52136029326cac7)
Signed-off-by: Hyukjin Kwon 
---
 .../sql/execution/joins/JoinCodegenSupport.scala   |  8 +++--
 .../scala/org/apache/spark/sql/JoinSuite.scala | 35 ++
 2 files changed, 40 insertions(+), 3 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/JoinCodegenSupport.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/JoinCodegenSupport.scala
index 75f0a359a79..ae91615da0f 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/JoinCodegenSupport.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/JoinCodegenSupport.scala
@@ -42,13 +42,15 @@ trait JoinCodegenSupport extends CodegenSupport with 
BaseJoinExec {
   buildRow: Option[String] = None): (String, String, Seq[ExprCode]) = {
 val buildSideRow = buildRow.getOrElse(ctx.freshName("buildRow"))
 val buildVars = genOneSideJoinVars(ctx, buildSideRow, buildPlan, 
setDefaultValue = false)
+val streamVars2 = streamVars.map(_.copy())
 val checkCondition = if (condition.isDefined) {
   val expr = condition.get
-  // evaluate the variables from build side that used by condition
-  val eval = evaluateRequiredVariables(buildPlan.output, buildVars, 
expr.references)
+  // evaluate the variables that are used by the condition
+  val eval = evaluateRequiredVariables(streamPlan.output ++ 

[spark] branch master updated (dc84e529ba9 -> 119ec5b2ea8)

2023-04-17 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from dc84e529ba9 [SPARK-43122][CONNECT][PYTHON][ML][TESTS] Reenable 
TorchDistributorLocalUnitTestsOnConnect and 
TorchDistributorLocalUnitTestsIIOnConnect
 add 119ec5b2ea8 [SPARK-43113][SQL] Evaluate stream-side variables when 
generating code for a bound condition

No new revisions were added by this update.

Summary of changes:
 .../sql/execution/joins/JoinCodegenSupport.scala   |  8 +++--
 .../scala/org/apache/spark/sql/JoinSuite.scala | 35 ++
 2 files changed, 40 insertions(+), 3 deletions(-)





[spark] branch master updated: [SPARK-43122][CONNECT][PYTHON][ML][TESTS] Reenable TorchDistributorLocalUnitTestsOnConnect and TorchDistributorLocalUnitTestsIIOnConnect

2023-04-17 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new dc84e529ba9 [SPARK-43122][CONNECT][PYTHON][ML][TESTS] Reenable 
TorchDistributorLocalUnitTestsOnConnect and 
TorchDistributorLocalUnitTestsIIOnConnect
dc84e529ba9 is described below

commit dc84e529ba96ba8afc24216e5fc28d95ce8ce290
Author: Ruifeng Zheng 
AuthorDate: Tue Apr 18 11:22:58 2023 +0800

[SPARK-43122][CONNECT][PYTHON][ML][TESTS] Reenable 
TorchDistributorLocalUnitTestsOnConnect and 
TorchDistributorLocalUnitTestsIIOnConnect

### What changes were proposed in this pull request?
`TorchDistributorLocalUnitTestsOnConnect` and 
`TorchDistributorLocalUnitTestsIIOnConnect` were not stable and occasionally 
got stuck; however, I cannot reproduce the issue locally.

The two UTs were disabled, and this PR reenables them. I found that all the 
tests for PyTorch set up the regular or Connect sessions in `setUp` and close 
them in `tearDown`; such session operations are very expensive and should be 
placed into `setUpClass` and `tearDownClass` instead. After this change, the 
related tests seem much more stable. So I think the root cause is still 
related to resources, since TorchDistributor works in barrier mode, when 
there is n [...]
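
For illustration, a minimal sketch of the `setUpClass`/`tearDownClass` pattern described above (the class and test names are made up and not taken from the diff):

```python
import unittest

from pyspark.sql import SparkSession


class SharedSessionTest(unittest.TestCase):
    # Create the expensive session once per test class rather than once per test.
    @classmethod
    def setUpClass(cls):
        cls.spark = SparkSession.builder.master("local[4]").getOrCreate()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_session_is_reused(self):
        # Every test method in the class shares the session from setUpClass.
        self.assertEqual(self.spark.range(10).count(), 10)


if __name__ == "__main__":
    unittest.main()
```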

### Why are the changes needed?
For test coverage.

### Does this PR introduce _any_ user-facing change?
No, test-only

### How was this patch tested?
CI

Closes #40793 from zhengruifeng/torch_reenable.

Lead-authored-by: Ruifeng Zheng 
Co-authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../tests/connect/test_parity_torch_distributor.py | 111 +++--
 python/pyspark/ml/torch/tests/test_distributor.py  | 177 +++--
 2 files changed, 158 insertions(+), 130 deletions(-)

diff --git a/python/pyspark/ml/tests/connect/test_parity_torch_distributor.py 
b/python/pyspark/ml/tests/connect/test_parity_torch_distributor.py
index 8f5699afdf2..55ea99a6540 100644
--- a/python/pyspark/ml/tests/connect/test_parity_torch_distributor.py
+++ b/python/pyspark/ml/tests/connect/test_parity_torch_distributor.py
@@ -17,7 +17,6 @@
 
 import os
 import shutil
-import tempfile
 import unittest
 
 have_torch = True
@@ -33,6 +32,9 @@ from pyspark.ml.torch.tests.test_distributor import (
 TorchDistributorLocalUnitTestsMixin,
 TorchDistributorDistributedUnitTestsMixin,
 TorchWrapperUnitTestsMixin,
+set_up_test_dirs,
+get_local_mode_conf,
+get_distributed_mode_conf,
 )
 
 
@@ -40,31 +42,35 @@ from pyspark.ml.torch.tests.test_distributor import (
 class TorchDistributorBaselineUnitTestsOnConnect(
 TorchDistributorBaselineUnitTestsMixin, unittest.TestCase
 ):
-def setUp(self) -> None:
-self.spark = SparkSession.builder.remote("local[4]").getOrCreate()
+@classmethod
+def setUpClass(cls):
+cls.spark = SparkSession.builder.remote("local[4]").getOrCreate()
 
-def tearDown(self) -> None:
-self.spark.stop()
+@classmethod
+def tearDownClass(cls):
+cls.spark.stop()
 
 
-@unittest.skip("unstable, ignore for now")
+@unittest.skipIf(not have_torch, "torch is required")
 class TorchDistributorLocalUnitTestsOnConnect(
 TorchDistributorLocalUnitTestsMixin, unittest.TestCase
 ):
-def setUp(self) -> None:
-class_name = self.__class__.__name__
-conf = self._get_spark_conf()
-builder = SparkSession.builder.appName(class_name)
-for k, v in conf.getAll():
-if k not in ["spark.master", "spark.remote", "spark.app.name"]:
-builder = builder.config(k, v)
-self.spark = builder.remote("local-cluster[2,2,1024]").getOrCreate()
-self.mnist_dir_path = tempfile.mkdtemp()
-
-def tearDown(self) -> None:
-shutil.rmtree(self.mnist_dir_path)
-os.unlink(self.gpu_discovery_script_file.name)
-self.spark.stop()
+@classmethod
+def setUpClass(cls):
+(cls.gpu_discovery_script_file_name, cls.mnist_dir_path) = 
set_up_test_dirs()
+builder = SparkSession.builder.appName(cls.__name__)
+for k, v in get_local_mode_conf().items():
+builder = builder.config(k, v)
+builder = builder.config(
+"spark.driver.resource.gpu.discoveryScript", 
cls.gpu_discovery_script_file_name
+)
+cls.spark = builder.remote("local-cluster[2,2,1024]").getOrCreate()
+
+@classmethod
+def tearDownClass(cls):
+shutil.rmtree(cls.mnist_dir_path)
+os.unlink(cls.gpu_discovery_script_file_name)
+cls.spark.stop()
 
 def _get_inputs_for_test_local_training_succeeds(self):
 return [
@@ -75,24 +81,27 @@ class TorchDistributorLocalUnitTestsOnConnect(
 ]
 
 
-@unittest.skip("unstable, ign

[spark-docker] branch master updated: [SPARK-43148] Add Apache Spark 3.4.0 Dockerfiles

2023-04-17 Thread yikun
This is an automated email from the ASF dual-hosted git repository.

yikun pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark-docker.git


The following commit(s) were added to refs/heads/master by this push:
 new fe05e38  [SPARK-43148] Add Apache Spark 3.4.0 Dockerfiles
fe05e38 is described below

commit fe05e38f0ffad271edccd6ae40a77d5f14f3eef7
Author: Yikun Jiang 
AuthorDate: Tue Apr 18 10:58:59 2023 +0800

[SPARK-43148] Add Apache Spark 3.4.0 Dockerfiles

### What changes were proposed in this pull request?
Add Apache Spark 3.4.0 Dockerfiles.
- Add 3.4.0 GPG key
- Add .github/workflows/build_3.4.0.yaml
- ./add-dockerfiles.sh 3.4.0

### Why are the changes needed?
Apache Spark 3.4.0 released:
https://spark.apache.org/releases/spark-release-3-4-0.html

### Does this PR introduce _any_ user-facing change?
Yes, in the future: new images will be published after the DOI review.

### How was this patch tested?
Added the workflow, and CI passed.

Closes #33 from Yikun/3.4.0.

Authored-by: Yikun Jiang 
Signed-off-by: Yikun Jiang 
---
 .github/workflows/build_3.4.0.yaml |  43 
 .github/workflows/publish.yml  |   6 +-
 .github/workflows/test.yml |   6 +-
 3.4.0/scala2.12-java11-python3-r-ubuntu/Dockerfile |  86 
 .../entrypoint.sh  | 114 +
 3.4.0/scala2.12-java11-python3-ubuntu/Dockerfile   |  83 +++
 .../scala2.12-java11-python3-ubuntu/entrypoint.sh  | 114 +
 3.4.0/scala2.12-java11-r-ubuntu/Dockerfile |  82 +++
 3.4.0/scala2.12-java11-r-ubuntu/entrypoint.sh  | 107 +++
 3.4.0/scala2.12-java11-ubuntu/Dockerfile   |  79 ++
 3.4.0/scala2.12-java11-ubuntu/entrypoint.sh| 107 +++
 tools/template.py  |   2 +
 versions.json  |  42 ++--
 13 files changed, 860 insertions(+), 11 deletions(-)

diff --git a/.github/workflows/build_3.4.0.yaml 
b/.github/workflows/build_3.4.0.yaml
new file mode 100644
index 000..8dd4e1e
--- /dev/null
+++ b/.github/workflows/build_3.4.0.yaml
@@ -0,0 +1,43 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+
+name: "Build and Test (3.4.0)"
+
+on:
+  pull_request:
+branches:
+  - 'master'
+paths:
+  - '3.4.0/**'
+  - '.github/workflows/build_3.4.0.yaml'
+  - '.github/workflows/main.yml'
+
+jobs:
+  run-build:
+strategy:
+  matrix:
+image-type: ["all", "python", "scala", "r"]
+name: Run
+secrets: inherit
+uses: ./.github/workflows/main.yml
+with:
+  spark: 3.4.0
+  scala: 2.12
+  java: 11
+  image-type: ${{ matrix.image-type }}
diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml
index 2941cfb..70b88b8 100644
--- a/.github/workflows/publish.yml
+++ b/.github/workflows/publish.yml
@@ -25,11 +25,13 @@ on:
   spark:
 description: 'The Spark version of Spark image.'
 required: true
-default: '3.3.0'
+default: '3.4.0'
 type: choice
 options:
-- 3.3.0
+- 3.4.0
+- 3.3.2
 - 3.3.1
+- 3.3.0
   publish:
 description: 'Publish the image or not.'
 default: false
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
index efb401b..06e2321 100644
--- a/.github/workflows/test.yml
+++ b/.github/workflows/test.yml
@@ -25,11 +25,13 @@ on:
   spark:
 description: 'The Spark version of Spark image.'
 required: true
-default: '3.3.1'
+default: '3.4.0'
 type: choice
 options:
-- 3.3.0
+- 3.4.0
+- 3.3.2
 - 3.3.1
+- 3.3.0
   java:
 description: 'The Java version of Spark image.'
 default: 11
diff --git a/3.4.0/scala2.12-java11-python3-r-ubuntu/Dockerfile 
b/3.4.0/scala2.12-java11-python3-r-ubuntu/Dockerfile
new file mode 100644
index 000..4f62e8d
--- /dev/null
+++ b/3.4.0/scala2.12-java11-python3-r-ubuntu/Doc

[spark] branch master updated: [SPARK-43168][SQL] Remove get PhysicalDataType method from Datatype class

2023-04-17 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new db2625c70a8 [SPARK-43168][SQL] Remove get PhysicalDataType method from 
Datatype class
db2625c70a8 is described below

commit db2625c70a8c3aff64e6a9466981c8dd49a4ca51
Author: Rui Wang 
AuthorDate: Mon Apr 17 22:16:25 2023 -0400

[SPARK-43168][SQL] Remove get PhysicalDataType method from Datatype class

### What changes were proposed in this pull request?

DataType is a public API, while PhysicalDataType can remain an internal 
API/implementation detail, so we can remove PhysicalDataType from DataType. 
As a result, DataType no longer needs a class dependency on PhysicalDataType.

### Why are the changes needed?

Simplify DataType.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

UT

Closes #40826 from amaliujia/catalyst_datatype_refactor_8.

Authored-by: Rui Wang 
Signed-off-by: Herman van Hovell 
---
 .../spark/sql/catalyst/expressions/SpecializedGettersReader.java | 2 +-
 .../java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java| 4 ++--
 .../main/java/org/apache/spark/sql/vectorized/ColumnarBatchRow.java  | 2 +-
 .../src/main/java/org/apache/spark/sql/vectorized/ColumnarRow.java   | 2 +-
 .../src/main/scala/org/apache/spark/sql/catalyst/InternalRow.scala   | 2 +-
 .../spark/sql/catalyst/expressions/InterpretedUnsafeProjection.scala | 2 +-
 .../spark/sql/catalyst/expressions/codegen/CodeGenerator.scala   | 4 ++--
 .../scala/org/apache/spark/sql/catalyst/expressions/literals.scala   | 2 +-
 .../org/apache/spark/sql/catalyst/expressions/namedExpressions.scala | 2 +-
 .../scala/org/apache/spark/sql/catalyst/types/PhysicalDataType.scala | 5 -
 .../src/main/scala/org/apache/spark/sql/types/ArrayType.scala| 4 
 .../src/main/scala/org/apache/spark/sql/types/BinaryType.scala   | 3 ---
 .../src/main/scala/org/apache/spark/sql/types/BooleanType.scala  | 3 ---
 .../src/main/scala/org/apache/spark/sql/types/ByteType.scala | 3 ---
 .../main/scala/org/apache/spark/sql/types/CalendarIntervalType.scala | 3 ---
 .../src/main/scala/org/apache/spark/sql/types/CharType.scala | 2 --
 .../src/main/scala/org/apache/spark/sql/types/DataType.scala | 4 +---
 .../src/main/scala/org/apache/spark/sql/types/DateType.scala | 3 ---
 .../main/scala/org/apache/spark/sql/types/DayTimeIntervalType.scala  | 3 ---
 .../src/main/scala/org/apache/spark/sql/types/DecimalType.scala  | 4 
 .../src/main/scala/org/apache/spark/sql/types/DoubleType.scala   | 3 ---
 .../src/main/scala/org/apache/spark/sql/types/FloatType.scala| 3 ---
 .../src/main/scala/org/apache/spark/sql/types/IntegerType.scala  | 3 ---
 .../src/main/scala/org/apache/spark/sql/types/LongType.scala | 3 ---
 sql/catalyst/src/main/scala/org/apache/spark/sql/types/MapType.scala | 4 
 .../src/main/scala/org/apache/spark/sql/types/NullType.scala | 3 ---
 .../src/main/scala/org/apache/spark/sql/types/ShortType.scala| 3 ---
 .../src/main/scala/org/apache/spark/sql/types/StringType.scala   | 3 ---
 .../src/main/scala/org/apache/spark/sql/types/StructType.scala   | 4 +---
 .../src/main/scala/org/apache/spark/sql/types/TimestampNTZType.scala | 3 ---
 .../src/main/scala/org/apache/spark/sql/types/TimestampType.scala| 3 ---
 .../src/main/scala/org/apache/spark/sql/types/VarcharType.scala  | 3 ---
 .../scala/org/apache/spark/sql/types/YearMonthIntervalType.scala | 3 ---
 .../org/apache/spark/sql/execution/vectorized/ColumnVectorUtils.java | 2 +-
 34 files changed, 18 insertions(+), 84 deletions(-)

diff --git 
a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/SpecializedGettersReader.java
 
b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/SpecializedGettersReader.java
index c5a7d34281f..be50350b106 100644
--- 
a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/SpecializedGettersReader.java
+++ 
b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/SpecializedGettersReader.java
@@ -29,7 +29,7 @@ public final class SpecializedGettersReader {
   DataType dataType,
   boolean handleNull,
   boolean handleUserDefinedType) {
-PhysicalDataType physicalDataType = dataType.physicalDataType();
+PhysicalDataType physicalDataType = PhysicalDataType.apply(dataType);
 if (handleNull && (obj.isNullAt(ordinal) || physicalDataType instanceof 
PhysicalNullType)) {
   return null;
 }
diff --git 
a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java
 
b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java
index a3097cd6770..d2433292fc7 1

[spark] branch master updated: [SPARK-42984][CONNECT][PYTHON][TESTS] Enable test_createDataFrame_with_single_data_type

2023-04-17 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 61e8c5b1fe4 [SPARK-42984][CONNECT][PYTHON][TESTS] Enable 
test_createDataFrame_with_single_data_type
61e8c5b1fe4 is described below

commit 61e8c5b1fe46173464e2e04acf77806086c89a8d
Author: Takuya UESHIN 
AuthorDate: Tue Apr 18 10:13:56 2023 +0800

[SPARK-42984][CONNECT][PYTHON][TESTS] Enable 
test_createDataFrame_with_single_data_type

### What changes were proposed in this pull request?

Enables `ArrowParityTests.test_createDataFrame_with_single_data_type`.

### Why are the changes needed?

The test is already fixed by previous commits.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Enabled/updated the related tests.

Closes #40828 from ueshin/issues/SPARK-42984/test.

Authored-by: Takuya UESHIN 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/tests/connect/test_parity_arrow.py | 2 --
 python/pyspark/sql/tests/test_arrow.py| 6 --
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/python/pyspark/sql/tests/connect/test_parity_arrow.py 
b/python/pyspark/sql/tests/connect/test_parity_arrow.py
index ec33bb22a4b..fd05821f052 100644
--- a/python/pyspark/sql/tests/connect/test_parity_arrow.py
+++ b/python/pyspark/sql/tests/connect/test_parity_arrow.py
@@ -43,8 +43,6 @@ class ArrowParityTests(ArrowTestsMixin, 
ReusedConnectTestCase):
 def test_createDataFrame_with_ndarray(self):
 self.check_createDataFrame_with_ndarray(True)
 
-# TODO(SPARK-42984): ValueError not raised
-@unittest.skip("Fails in Spark Connect, should enable.")
 def test_createDataFrame_with_single_data_type(self):
 self.check_createDataFrame_with_single_data_type()
 
diff --git a/python/pyspark/sql/tests/test_arrow.py 
b/python/pyspark/sql/tests/test_arrow.py
index 1c96273a22f..cf28d32c903 100644
--- a/python/pyspark/sql/tests/test_arrow.py
+++ b/python/pyspark/sql/tests/test_arrow.py
@@ -533,8 +533,10 @@ class ArrowTestsMixin:
 self.check_createDataFrame_with_single_data_type()
 
 def check_createDataFrame_with_single_data_type(self):
-with self.assertRaisesRegex(ValueError, ".*IntegerType.*not 
supported.*"):
-self.spark.createDataFrame(pd.DataFrame({"a": [1]}), 
schema="int").collect()
+for schema in ["int", IntegerType()]:
+with self.subTest(schema=schema):
+with self.assertRaisesRegex(ValueError, ".*IntegerType.*not 
supported.*"):
+self.spark.createDataFrame(pd.DataFrame({"a": [1]}), 
schema=schema).collect()
 
 def test_createDataFrame_does_not_modify_input(self):
 # Some series get converted for Spark to consume, this makes sure 
input is unchanged





[spark] branch master updated: [SPARK-41210][K8S] Port executor failure tracker from Spark on YARN to K8s

2023-04-17 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 40872e9a094 [SPARK-41210][K8S] Port executor failure tracker from 
Spark on YARN to K8s
40872e9a094 is described below

commit 40872e9a094f8459b0b6f626937ced48a8d98efb
Author: Cheng Pan 
AuthorDate: Tue Apr 18 09:51:28 2023 +0800

[SPARK-41210][K8S] Port executor failure tracker from Spark on YARN to K8s

### What changes were proposed in this pull request?

Fail the Spark application when the number of executor failures reaches the 
threshold.

### Why are the changes needed?

Sometimes the executors cannot launch successfully because of a wrong 
configuration, but on K8s the driver does not know that and just keeps 
requesting new executors.

This PR ports the window-based executor failure tracking mechanism to 
K8s (it only takes effect when `spark.kubernetes.allocation.pods.allocator` is 
set to 'direct'), to reduce the functionality gap between YARN and K8s.

Note that YARN mode also supports a host-based executor allocation failure 
tracking and application terminating mechanism [2]; this PR does not port that 
functionality to Kubernetes, since it is a fairly independent and large 
feature and relies on some YARN features that I'm not sure K8s has 
equivalents for.

[1] [SPARK-6735](https://issues.apache.org/jira/browse/SPARK-6735)
[2] [SPARK-17675](https://issues.apache.org/jira/browse/SPARK-17675)

### Does this PR introduce _any_ user-facing change?

Yes, this PR provides two new configurations:

- `spark.executor.maxNumFailures`
- `spark.executor.failuresValidityInterval`

which take effect on YARN, or on Kubernetes when 
`spark.kubernetes.allocation.pods.allocator` is set to 'direct'; a usage 
sketch follows below.
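
For illustration only, a minimal PySpark sketch (not part of the PR) of enabling the ported failure tracker on K8s with the direct pods allocator; the master URL, container image, and threshold values are assumptions:

```python
from pyspark.sql import SparkSession

# Hypothetical master URL, image tag, and thresholds -- adjust for a real cluster.
spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.default.svc")
    .config("spark.kubernetes.container.image", "apache/spark:latest")
    # The ported tracker only takes effect with the 'direct' pods allocator.
    .config("spark.kubernetes.allocation.pods.allocator", "direct")
    # Fail the application once 10 executor failures accumulate...
    .config("spark.executor.maxNumFailures", "10")
    # ...within a sliding one-hour validity window.
    .config("spark.executor.failuresValidityInterval", "1h")
    .getOrCreate()
)
```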

### How was this patch tested?

New UT added, and manually tested in internal K8s cluster.

Closes #40774 from pan3793/SPARK-41210.

Authored-by: Cheng Pan 
Signed-off-by: Kent Yao 
---
 .../main/scala/org/apache/spark/SparkConf.scala|   6 +-
 .../spark/deploy/ExecutorFailureTracker.scala  | 102 +
 .../org/apache/spark/internal/config/package.scala |  18 
 .../org/apache/spark/util/SparkExitCode.scala  |   3 +
 .../spark/deploy/ExecutorFailureTrackerSuite.scala |  10 +-
 .../cluster/k8s/ExecutorPodsAllocator.scala|  36 +++-
 .../cluster/k8s/ExecutorPodsAllocatorSuite.scala   |  44 -
 .../k8s/KubernetesClusterManagerSuite.scala|   6 +-
 .../spark/deploy/yarn/ApplicationMaster.scala  |  27 +-
 .../apache/spark/deploy/yarn/YarnAllocator.scala   |   3 +-
 .../yarn/YarnAllocatorNodeHealthTracker.scala  |  61 +---
 .../org/apache/spark/deploy/yarn/config.scala  |  13 ---
 .../yarn/YarnAllocatorHealthTrackerSuite.scala |   3 +-
 .../deploy/yarn/YarnShuffleIntegrationSuite.scala  |   1 -
 14 files changed, 224 insertions(+), 109 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/SparkConf.scala 
b/core/src/main/scala/org/apache/spark/SparkConf.scala
index 08344d8e547..813a14acd19 100644
--- a/core/src/main/scala/org/apache/spark/SparkConf.scala
+++ b/core/src/main/scala/org/apache/spark/SparkConf.scala
@@ -709,7 +709,11 @@ private[spark] object SparkConf extends Logging {
   AlternateConfig("spark.yarn.access.namenodes", "2.2"),
   AlternateConfig("spark.yarn.access.hadoopFileSystems", "3.0")),
 "spark.kafka.consumer.cache.capacity" -> Seq(
-  AlternateConfig("spark.sql.kafkaConsumerCache.capacity", "3.0"))
+  AlternateConfig("spark.sql.kafkaConsumerCache.capacity", "3.0")),
+MAX_EXECUTOR_FAILURES.key -> Seq(
+  AlternateConfig("spark.yarn.max.executor.failures", "3.5")),
+EXECUTOR_ATTEMPT_FAILURE_VALIDITY_INTERVAL_MS.key -> Seq(
+  AlternateConfig("spark.yarn.executor.failuresValidityInterval", "3.5"))
   )
 
   /**
diff --git 
a/core/src/main/scala/org/apache/spark/deploy/ExecutorFailureTracker.scala 
b/core/src/main/scala/org/apache/spark/deploy/ExecutorFailureTracker.scala
new file mode 100644
index 000..7c7b9c60b47
--- /dev/null
+++ b/core/src/main/scala/org/apache/spark/deploy/ExecutorFailureTracker.scala
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License i

[spark] branch master updated (3941369d13a -> cbe94a172ca)

2023-04-17 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 3941369d13a [SPARK-42657][CONNECT] Support to find and transfer 
client-side REPL classfiles to server as artifacts
 add cbe94a172ca [SPARK-43084][SS] Add applyInPandasWithState support for 
spark connect

No new revisions were added by this update.

Summary of changes:
 .../main/protobuf/spark/connect/relations.proto|  24 ++
 .../sql/connect/planner/SparkConnectPlanner.scala  |  23 ++
 dev/sparktestsupport/modules.py|   1 +
 python/pyspark/sql/connect/_typing.py  |   5 +
 python/pyspark/sql/connect/group.py|  46 +++-
 python/pyspark/sql/connect/plan.py |  39 
 python/pyspark/sql/connect/proto/relations_pb2.py  | 258 +++--
 python/pyspark/sql/connect/proto/relations_pb2.pyi |  80 +++
 python/pyspark/sql/pandas/group_ops.py |   3 +
 .../sql/tests/connect/test_connect_basic.py|   7 -
 .../test_parity_pandas_grouped_map_with_state.py}  |  19 +-
 .../pandas/test_pandas_grouped_map_with_state.py   |   8 +-
 12 files changed, 375 insertions(+), 138 deletions(-)
 copy python/pyspark/{pandas/tests/connect/test_parity_dataframe_spark_io.py => 
sql/tests/connect/test_parity_pandas_grouped_map_with_state.py} (59%)
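
For context, a minimal sketch (not taken from the PR) of what `applyInPandasWithState` does; with this change the same code can also run against a Spark Connect session. The rate source, column names, and output schema are illustrative assumptions:

```python
from typing import Iterator, Tuple

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

spark = SparkSession.builder.getOrCreate()

# Illustrative streaming source: a rate stream bucketed into three keys.
events = (
    spark.readStream.format("rate").load()
    .withColumn("key", (col("value") % 3).cast("string"))
)


def count_events(
    key: Tuple[str], pdfs: Iterator[pd.DataFrame], state: GroupState
) -> Iterator[pd.DataFrame]:
    # Keep a running per-key count in the group state across micro-batches.
    running = state.get[0] if state.exists else 0
    for pdf in pdfs:
        running += len(pdf)
    state.update((running,))
    yield pd.DataFrame({"key": [key[0]], "count": [running]})


query = (
    events.groupBy("key")
    .applyInPandasWithState(
        count_events,
        outputStructType="key string, count long",
        stateStructType="count long",
        outputMode="update",
        timeoutConf=GroupStateTimeout.NoTimeout,
    )
    .writeStream.outputMode("update")
    .format("console")
    .start()
)
query.awaitTermination()
```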





[GitHub] [spark-website] viirya commented on a diff in pull request #454: Improve instructions for the release process

2023-04-17 Thread via GitHub


viirya commented on code in PR #454:
URL: https://github.com/apache/spark-website/pull/454#discussion_r1169138426


##
release-process.md:
##
@@ -186,6 +186,18 @@ that looks something like `[VOTE][RESULT] ...`.
 
 Finalize the release
 
+Note that `dev/create-release/do-release-docker.sh` script (`finalize` step ) 
automates most of the following steps **except** for:
+- Publish to CRAN

Review Comment:
   Hm? Last time I did a release, I remember I still needed to do `twine upload`; has it changed?
   
   EDIT: it was changed by SPARK-35872.






[GitHub] [spark-website] viirya commented on a diff in pull request #454: Improve instructions for the release process

2023-04-17 Thread via GitHub


viirya commented on code in PR #454:
URL: https://github.com/apache/spark-website/pull/454#discussion_r1169148125


##
release-process.md:
##
@@ -186,6 +186,18 @@ that looks something like `[VOTE][RESULT] ...`.
 
 Finalize the release
 
+Note that `dev/create-release/do-release-docker.sh` script (`finalize` step ) 
automates most of the following steps **except** for:
+- Publish to CRAN
+- Update the configuration of Algolia Crawler
+- Remove old releases from Mirror Network
+- Update the rest of the Spark website
+- Create and upload Spark Docker Images
+- Create an announcement
+
+Please manually verify the result after each step.
+
+Upload to Apache release directory

Review Comment:
   Oh, it was changed by SPARK-35872. 👍 






[GitHub] [spark-website] viirya commented on a diff in pull request #454: Improve instructions for the release process

2023-04-17 Thread via GitHub


viirya commented on code in PR #454:
URL: https://github.com/apache/spark-website/pull/454#discussion_r1169138426


##
release-process.md:
##
@@ -186,6 +186,18 @@ that looks something like `[VOTE][RESULT] ...`.
 
 Finalize the release
 
+Note that `dev/create-release/do-release-docker.sh` script (`finalize` step ) 
automates most of the following steps **except** for:
+- Publish to CRAN

Review Comment:
   Hm? Last time I did a release, I remember I still needed to do `twine upload`; has it changed?






[GitHub] [spark-website] xinrong-meng commented on a diff in pull request #454: Improve instructions for the release process

2023-04-17 Thread via GitHub


xinrong-meng commented on code in PR #454:
URL: https://github.com/apache/spark-website/pull/454#discussion_r1169034199


##
release-process.md:
##
@@ -186,6 +186,18 @@ that looks something like `[VOTE][RESULT] ...`.
 
 Finalize the release
 
+Note that `dev/create-release/do-release-docker.sh` script (`finalize` step ) 
automates most of the following steps **except** for:
+- Publish to CRAN
+- Update the configuration of Algolia Crawler
+- Remove old releases from Mirror Network
+- Update the rest of the Spark website
+- Create and upload Spark Docker Images
+- Create an announcement
+
+Please manually verify the result after each step.
+
+Upload to Apache release directory

Review Comment:
   See more at 
https://github.com/xinrong-meng/spark/blob/master/dev/create-release/release-build.sh#L97.
 : )






[GitHub] [spark-website] xinrong-meng commented on a diff in pull request #454: Improve instructions for the release process

2023-04-17 Thread via GitHub


xinrong-meng commented on code in PR #454:
URL: https://github.com/apache/spark-website/pull/454#discussion_r1169032103


##
release-process.md:
##
@@ -186,6 +186,18 @@ that looks something like `[VOTE][RESULT] ...`.
 
 Finalize the release
 
+Note that `dev/create-release/do-release-docker.sh` script (`finalize` step ) 
automates most of the following steps **except** for:
+- Publish to CRAN

Review Comment:
   "Uploading PySpark to PyPi" is automated.






[GitHub] [spark-website] xinrong-meng commented on pull request #454: Improve instructions for the release process

2023-04-17 Thread via GitHub


xinrong-meng commented on PR #454:
URL: https://github.com/apache/spark-website/pull/454#issuecomment-1511752634

   Thanks @srowen !





[GitHub] [spark-website] xinrong-meng commented on a diff in pull request #454: Improve instructions for the release process

2023-04-17 Thread via GitHub


xinrong-meng commented on code in PR #454:
URL: https://github.com/apache/spark-website/pull/454#discussion_r1169032739


##
release-process.md:
##
@@ -186,6 +186,18 @@ that looks something like `[VOTE][RESULT] ...`.
 
 Finalize the release
 
+Note that `dev/create-release/do-release-docker.sh` script (`finalize` step ) 
automates most of the following steps **except** for:
+- Publish to CRAN
+- Update the configuration of Algolia Crawler
+- Remove old releases from Mirror Network
+- Update the rest of the Spark website
+- Create and upload Spark Docker Images
+- Create an announcement
+
+Please manually verify the result after each step.
+
+Upload to Apache release directory

Review Comment:
   "Moving Spark binaries to the release directory" is automated.






[spark] branch master updated: [SPARK-42657][CONNECT] Support to find and transfer client-side REPL classfiles to server as artifacts

2023-04-17 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 3941369d13a [SPARK-42657][CONNECT] Support to find and transfer 
client-side REPL classfiles to server as artifacts
3941369d13a is described below

commit 3941369d13ad885eac21bd8ac1769aaf1a325c5a
Author: vicennial 
AuthorDate: Mon Apr 17 09:18:00 2023 -0400

[SPARK-42657][CONNECT] Support to find and transfer client-side REPL 
classfiles to server as artifacts

### What changes were proposed in this pull request?

This PR introduces the concept of a `ClassFinder` that is able to scrape 
the REPL output (either file-based or in-memory based) for generated class 
files.  The `ClassFinder` is registered during initialization of the REPL and 
aids in uploading the generated class files as artifacts to the Spark Connect 
server.

### Why are the changes needed?

To run UDFs that are defined in the client-side REPL, we require a 
mechanism that can find the local REPL classfiles and then utilise the 
mechanism from https://issues.apache.org/jira/browse/SPARK-42653 to transfer 
them to the server as artifacts.

### Does this PR introduce _any_ user-facing change?

Yes, users can now run UDFs in the default (Ammonite) REPL with Spark 
Connect.

Input (in REPL):
```
class A(x: Int) { def get = x * 5 + 19 }
def dummyUdf(x: Int): Int = new A(x).get
val myUdf = udf(dummyUdf _)
spark.range(5).select(myUdf(col("id"))).as[Int].collect()
```

Output:
```
Array[Int] = Array(19, 24, 29, 34, 39)
```

### How was this patch tested?

Unit tests + E2E tests.

Closes #40675 from vicennial/SPARK-42657.

Lead-authored-by: vicennial 
Co-authored-by: Venkata Sai Akhil Gudesa 
Signed-off-by: Herman van Hovell 
---
 .github/workflows/build_and_test.yml   |   3 +
 .../scala/org/apache/spark/sql/SparkSession.scala  |  10 +-
 .../apache/spark/sql/application/ConnectRepl.scala |  37 --
 .../spark/sql/connect/client/ArtifactManager.scala |  32 +-
 .../spark/sql/connect/client/ClassFinder.scala |  80 +
 .../sql/connect/client/SparkConnectClient.scala|  10 +-
 .../org/apache/spark/sql/ClientE2ETestSuite.scala  |  11 --
 .../spark/sql/application/ReplE2ESuite.scala   | 128 +
 .../sql/connect/client/ClassFinderSuite.scala  |  57 +
 .../connect/client/util/RemoteSparkSession.scala   |   6 +-
 .../sql/connect/SimpleSparkConnectService.scala|   2 +-
 11 files changed, 349 insertions(+), 27 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 311274c9203..630956a9e73 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -249,7 +249,10 @@ jobs:
 # Run the tests.
 - name: Run tests
   env: ${{ fromJSON(inputs.envs) }}
+  shell: 'script -q -e -c "bash {0}"'
   run: |
+# Fix for TTY related issues when launching the Ammonite REPL in tests.
+export TERM=vt100 && script -qfc 'echo exit | amm -s' && rm typescript
 # Hive "other tests" test needs larger metaspace size based on 
experiment.
 if [[ "$MODULES_TO_TEST" == "hive" ]] && [[ "$EXCLUDED_TAGS" == 
"org.apache.spark.tags.SlowHiveTest" ]]; then export METASPACE_SIZE=2g; fi
 export SERIAL_SBT_TESTS=1
diff --git 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala
 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala
index b1b779f0f08..e285db39e80 100644
--- 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala
+++ 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala
@@ -33,7 +33,7 @@ import org.apache.spark.sql.catalog.Catalog
 import org.apache.spark.sql.catalyst.{JavaTypeInference, ScalaReflection}
 import org.apache.spark.sql.catalyst.encoders.{AgnosticEncoder, RowEncoder}
 import 
org.apache.spark.sql.catalyst.encoders.AgnosticEncoders.{BoxedLongEncoder, 
UnboundRowEncoder}
-import org.apache.spark.sql.connect.client.{SparkConnectClient, SparkResult}
+import org.apache.spark.sql.connect.client.{ClassFinder, SparkConnectClient, 
SparkResult}
 import org.apache.spark.sql.connect.client.util.{Cleaner, ConvertToArrow}
 import 
org.apache.spark.sql.connect.common.LiteralValueProtoConverter.toLiteralProto
 import org.apache.spark.sql.internal.CatalogImpl
@@ -495,6 +495,14 @@ class SparkSession private[sql] (
   @scala.annotation.varargs
   def addArtifacts(uri: URI*): Unit = client.addArtifacts(uri)
 
+  /**
+   * Register a [[ClassFinder]] for dynamically generated classes.
+   *
+   * @since 3.4.0
+   */
+  @Experimental
+  def registerClassFi

[spark] branch master updated: [MINOR][CONNECT][PYTHON] Add missing `super().__init__()` in expressions

2023-04-17 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 7a5b6c837a4 [MINOR][CONNECT][PYTHON] Add missing `super().__init__()` 
in expressions
7a5b6c837a4 is described below

commit 7a5b6c837a448f7ede4ef679cac6fd4a6f8babcd
Author: Ruifeng Zheng 
AuthorDate: Mon Apr 17 16:13:00 2023 +0800

[MINOR][CONNECT][PYTHON] Add missing `super().__init__()` in expressions

### What changes were proposed in this pull request?
Add missing `super().__init__()` in expressions

### Why are the changes needed?
To make IDEA happy; see the screenshot attached to the PR:
https://user-images.githubusercontent.com/7322292/232402659-20e7f740-7816-495f-967f-d90c3ac7eedc.png
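
For illustration, a generic sketch (the attribute name is made up, not the real `Expression` internals) of why a subclass should call `super().__init__()` when the base class initializes state:

```python
class Expression:
    def __init__(self) -> None:
        # Base-class state that every expression instance is expected to carry.
        self._children: list = []


class ColumnAlias(Expression):
    def __init__(self, alias: str) -> None:
        super().__init__()  # without this call, self._children is never set
        self._alias = alias


alias = ColumnAlias("c")
print(alias._children)  # [] -- present because super().__init__() ran
```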

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existing UT

Closes #40818 from zhengruifeng/connect_super.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/connect/expressions.py | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/python/pyspark/sql/connect/expressions.py 
b/python/pyspark/sql/connect/expressions.py
index 1c332e56226..8ed365091fc 100644
--- a/python/pyspark/sql/connect/expressions.py
+++ b/python/pyspark/sql/connect/expressions.py
@@ -106,6 +106,7 @@ class CaseWhen(Expression):
 def __init__(
 self, branches: Sequence[Tuple[Expression, Expression]], else_value: 
Optional[Expression]
 ):
+super().__init__()
 
 assert isinstance(branches, list)
 for branch in branches:
@@ -142,6 +143,7 @@ class CaseWhen(Expression):
 
 class ColumnAlias(Expression):
 def __init__(self, parent: Expression, alias: Sequence[str], metadata: 
Any):
+super().__init__()
 
 self._alias = alias
 self._metadata = metadata
@@ -649,6 +651,7 @@ class CommonInlineUserDefinedFunction(Expression):
 deterministic: bool = False,
 arguments: Sequence[Expression] = [],
 ):
+super().__init__()
 self._function_name = function_name
 self._deterministic = deterministic
 self._arguments = arguments

