[spark] branch master updated: [SPARK-44129][INFRA] Use "3.5.0" for "master" branch until creating `branch-3.5`

2023-06-20 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 692ec99bc78 [SPARK-44129][INFRA] Use "3.5.0" for "master" branch until creating `branch-3.5`
692ec99bc78 is described below

commit 692ec99bc7884cc998afbf63e4ef53053c0c9dd7
Author: Hyukjin Kwon 
AuthorDate: Tue Jun 20 19:20:20 2023 -0700

[SPARK-44129][INFRA] Use "3.5.0" for "master" branch until creating `branch-3.5`

### What changes were proposed in this pull request?

This adds a couple of changes to make the script work.
See https://github.com/apache/spark/pull/41682

### Why are the changes needed?

See https://github.com/apache/spark/pull/41682

### Does this PR introduce _any_ user-facing change?

See https://github.com/apache/spark/pull/41682

### How was this patch tested?

See https://github.com/apache/spark/pull/41682

Closes #41684 from HyukjinKwon/SPARK-44129.

Authored-by: Hyukjin Kwon 
Signed-off-by: Dongjoon Hyun 
---
 dev/merge_spark_pr.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/dev/merge_spark_pr.py b/dev/merge_spark_pr.py
index 1621432c01c..e9024573a21 100755
--- a/dev/merge_spark_pr.py
+++ b/dev/merge_spark_pr.py
@@ -240,7 +240,8 @@ def cherry_pick(pr_num, merge_hash, default_branch):
 def fix_version_from_branch(branch, versions):
     # Note: Assumes this is a sorted (newest->oldest) list of un-released versions
     if branch == "master":
-        return versions[0]
+        # TODO(SPARK-44130) Revert SPARK-44129 after creating branch-3.5
+        return [v for v in versions if v.name == "3.5.0"][0]
     else:
         branch_ver = branch.replace("branch-", "")
         return list(filter(lambda x: x.name.startswith(branch_ver), versions))[-1]
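
For context, a minimal standalone sketch of the selection logic in the hunk above; the `Version` namedtuple here is a hypothetical stand-in for the JIRA version objects that `merge_spark_pr.py` actually works with (they expose a `name` attribute):

```
from collections import namedtuple

# Hypothetical stand-in for the JIRA version objects used by merge_spark_pr.py.
Version = namedtuple("Version", ["name"])

def fix_version_from_branch(branch, versions):
    # versions is assumed to be sorted newest -> oldest, as the script's comment states.
    if branch == "master":
        # Temporarily pin master to 3.5.0 instead of taking the newest version (4.0.0).
        return [v for v in versions if v.name == "3.5.0"][0]
    else:
        branch_ver = branch.replace("branch-", "")
        return list(filter(lambda x: x.name.startswith(branch_ver), versions))[-1]

# Example: with 4.0.0 present in JIRA, master still maps to 3.5.0.
unreleased = [Version("4.0.0"), Version("3.5.0"), Version("3.4.2")]
assert fix_version_from_branch("master", unreleased).name == "3.5.0"
assert fix_version_from_branch("branch-3.4", unreleased).name == "3.4.2"
```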


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: Revert "[SPARK-44129][INFRA] Use "3.5.0" for "master" branch until creating `branch-3.5`"

2023-06-20 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 5a55061df25 Revert "[SPARK-44129][INFRA] Use "3.5.0" for "master" branch until creating `branch-3.5`"
5a55061df25 is described below

commit 5a55061df25a9f9c1c35c272b1563705d957eb84
Author: Hyukjin Kwon 
AuthorDate: Wed Jun 21 11:11:57 2023 +0900

Revert "[SPARK-44129][INFRA] Use "3.5.0" for "master" branch until creating 
`branch-3.5`"

This reverts commit 6cc63cbccfca67d13b2e4166382ccd4f2bd49681.
---
 dev/merge_spark_pr.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/dev/merge_spark_pr.py b/dev/merge_spark_pr.py
index c3d524d5433..1621432c01c 100755
--- a/dev/merge_spark_pr.py
+++ b/dev/merge_spark_pr.py
@@ -240,7 +240,7 @@ def cherry_pick(pr_num, merge_hash, default_branch):
 def fix_version_from_branch(branch, versions):
     # Note: Assumes this is a sorted (newest->oldest) list of un-released versions
     if branch == "master":
-        return "3.5.0" # TODO(SPARK-44130) Revert SPARK-44129 after creating branch-3.5
+        return versions[0]
     else:
         branch_ver = branch.replace("branch-", "")
         return list(filter(lambda x: x.name.startswith(branch_ver), versions))[-1]


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-44107][CONNECT][PYTHON] Hide unsupported Column methods from auto-completion

2023-06-20 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 3d52ba49946 [SPARK-44107][CONNECT][PYTHON] Hide unsupported Column methods from auto-completion
3d52ba49946 is described below

commit 3d52ba49946cb0054a58e0026e26ba442c64988d
Author: Ruifeng Zheng 
AuthorDate: Wed Jun 21 10:11:36 2023 +0800

[SPARK-44107][CONNECT][PYTHON] Hide unsupported Column methods from auto-completion

### What changes were proposed in this pull request?
Hide the unsupported Column method `_jc` from auto-completion; it is already handled in `__getattr__`, see https://github.com/apache/spark/blob/e6c6d444ae07f1ece127cea6332cce906b5aa1c5/python/pyspark/sql/connect/column.py#L445-L454

### Why are the changes needed?
no need to show unsupported methods

### Does this PR introduce _any_ user-facing change?
yes

before this PR (screenshot): https://github.com/apache/spark/assets/7322292/bee3c41d-8fa5-4981-9392-cde93a1e9f34

after this PR (screenshot): https://github.com/apache/spark/assets/7322292/85e5c7cc-86b7-4919-8c8a-db8dba2c94a9

### How was this patch tested?
existing UTs and manually check in ipython

Closes #41675 from zhengruifeng/connect_col_hide.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/connect/column.py | 6 --
 1 file changed, 6 deletions(-)

diff --git a/python/pyspark/sql/connect/column.py b/python/pyspark/sql/connect/column.py
index 4d32da56192..05292938163 100644
--- a/python/pyspark/sql/connect/column.py
+++ b/python/pyspark/sql/connect/column.py
@@ -478,12 +478,6 @@ class Column:
 
     __bool__ = __nonzero__
 
-    @property
-    def _jc(self) -> None:
-        raise PySparkAttributeError(
-            error_class="JVM_ATTRIBUTE_NOT_SUPPORTED", message_parameters={"attr_name": "_jc"}
-        )
-
 
 Column.__doc__ = PySparkColumn.__doc__
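
As a rough illustration of why the removal is safe (a toy sketch, not the actual pyspark code): attributes that are only intercepted in `__getattr__` do not show up in `dir()`-based auto-completion, while a defined `@property` does.

```
class FakeColumn:
    """Toy stand-in for a Connect Column; `_jc` is not defined as a property."""

    def __getattr__(self, name):
        # Mirrors the JVM_ATTRIBUTE_NOT_SUPPORTED handling referenced above,
        # but raises a plain AttributeError for simplicity.
        if name == "_jc":
            raise AttributeError("_jc is not supported in Spark Connect")
        raise AttributeError(name)

c = FakeColumn()
print("_jc" in dir(c))   # False -> hidden from auto-completion
try:
    c._jc
except AttributeError as e:
    print(e)             # "_jc is not supported in Spark Connect"
```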
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-44125][R] Support Java 21 in SparkR

2023-06-20 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 95f071cf5f3 [SPARK-44125][R] Support Java 21 in SparkR
95f071cf5f3 is described below

commit 95f071cf5f34d73d193b9c4f28f5459fa92aaeef
Author: Dongjoon Hyun 
AuthorDate: Wed Jun 21 11:10:54 2023 +0900

[SPARK-44125][R] Support Java 21 in SparkR

### What changes were proposed in this pull request?

This PR aims to support Java 21 in SparkR. The Arrow-related issues will be fixed when we upgrade the Arrow library. Also, the following JIRA is created to re-enable them in Java 21:
- SPARK-44127 Reenable `test_sparkSQL_arrow.R` in Java 21

### Why are the changes needed?

To be ready for Java 21.

### Does this PR introduce _any_ user-facing change?

No, this is additional support.

### How was this patch tested?

Pass the CIs and do manual tests.

```
$ java -version
openjdk version "21-ea" 2023-09-19
OpenJDK Runtime Environment (build 21-ea+27-2343)
OpenJDK 64-Bit Server VM (build 21-ea+27-2343, mixed mode, sharing)

$ build/sbt test:package -Psparkr -Phive

$ R/install-dev.sh; R/run-tests.sh
...
══ Skipped ══════════════════════════════════════════════════════════════════
1. createDataFrame/collect Arrow optimization ('test_sparkSQL_arrow.R:29:3') - Reason: sparkR.callJStatic("org.apache.spark.util.Utils", "isJavaVersionAtLeast21") is TRUE

2. createDataFrame/collect Arrow optimization - many partitions (partition order test) ('test_sparkSQL_arrow.R:47:3') - Reason: sparkR.callJStatic("org.apache.spark.util.Utils", "isJavaVersionAtLeast21") is TRUE

3. createDataFrame/collect Arrow optimization - type specification ('test_sparkSQL_arrow.R:54:3') - Reason: sparkR.callJStatic("org.apache.spark.util.Utils", "isJavaVersionAtLeast21") is TRUE

4. dapply() Arrow optimization ('test_sparkSQL_arrow.R:79:3') - Reason: sparkR.callJStatic("org.apache.spark.util.Utils", "isJavaVersionAtLeast21") is TRUE

5. dapply() Arrow optimization - type specification ('test_sparkSQL_arrow.R:114:3') - Reason: sparkR.callJStatic("org.apache.spark.util.Utils", "isJavaVersionAtLeast21") is TRUE

6. dapply() Arrow optimization - type specification (date and timestamp) ('test_sparkSQL_arrow.R:144:3') - Reason: sparkR.callJStatic("org.apache.spark.util.Utils", "isJavaVersionAtLeast21") is TRUE

7. gapply() Arrow optimization ('test_sparkSQL_arrow.R:154:3') - Reason: sparkR.callJStatic("org.apache.spark.util.Utils", "isJavaVersionAtLeast21") is TRUE

8. gapply() Arrow optimization - type specification ('test_sparkSQL_arrow.R:198:3') - Reason: sparkR.callJStatic("org.apache.spark.util.Utils", "isJavaVersionAtLeast21") is TRUE

9. gapply() Arrow optimization - type specification (date and timestamp) ('test_sparkSQL_arrow.R:231:3') - Reason: sparkR.callJStatic("org.apache.spark.util.Utils", "isJavaVersionAtLeast21") is TRUE

10. Arrow optimization - unsupported types ('test_sparkSQL_arrow.R:243:3') - Reason: sparkR.callJStatic("org.apache.spark.util.Utils", "isJavaVersionAtLeast21") is TRUE

11. SPARK-32478: gapply() Arrow optimization - error message for schema mismatch ('test_sparkSQL_arrow.R:255:3') - Reason: sparkR.callJStatic("org.apache.spark.util.Utils", "isJavaVersionAtLeast21") is TRUE

12. SPARK-43789: Automatically pick the number of partitions based on Arrow batch size ('test_sparkSQL_arrow.R:265:3') - Reason: sparkR.callJStatic("org.apache.spark.util.Utils", "isJavaVersionAtLeast21") is TRUE

13. sparkJars tag in SparkContext ('test_Windows.R:22:5') - Reason: This test is only for Windows, skipped

══ DONE ═════════════════════════════════════════════════════════════════════
...
* DONE

Status: 2 NOTEs
See
  ‘/Users/dongjoon/APACHE/spark-merge/R/SparkR.Rcheck/00check.log’
for details.

+ popd
Tests passed.
```

Closes #41680 from dongjoon-hyun/SPARK-44125.

Authored-by: Dongjoon Hyun 
Signed-off-by: Hyukjin Kwon 
---
 R/pkg/R/client.R|  6 --
 R/pkg/tests/fulltests/test_sparkSQL_arrow.R | 24 
 2 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/R/pkg/R/client.R b/R/pkg/R/client.R
index 797a5c7da15..88f9e9fe857 100644
--- a/R/pkg/R/client.R
+++ b/R/pkg/R/client.R
@@ -93,8 +93,10 @@ checkJavaVersion <- function() {
   }, javaVersionOut)
 
   javaVersionStr <- strsplit(javaVersionFilter[[1]], '"', fixed = TRUE)[[1L]][2]
-  # javaVersionStr is of the form 1.8.0_92/9.0.x/11.0.x.
-  # We are using 8, 9, 10, 11 for sparkJavaVersion.
+  # javaVersionStr is of the form 
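
The rewritten comment above is truncated here; per the removed line, `javaVersionStr` looks like `1.8.0_92`, `9.0.x`, or `11.0.x` (and now `21-ea` for Java 21). A minimal Python sketch of that kind of parsing, purely illustrative (SparkR does this in R inside `checkJavaVersion`):

```
import re

def java_major_version(version_str: str) -> int:
    """Map strings like '1.8.0_292', '11.0.19', '21-ea' to 8, 11, 21."""
    # Legacy scheme: 1.x.y -> major version is the second component.
    if version_str.startswith("1."):
        return int(version_str.split(".")[1])
    # Modern scheme: the leading integer is the major version ('21-ea' -> 21).
    return int(re.match(r"(\d+)", version_str).group(1))

assert java_major_version("1.8.0_292") == 8
assert java_major_version("11.0.19") == 11
assert java_major_version("21-ea") == 21
```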

[spark] branch master updated: [SPARK-44103][INFRA] Make pending container jobs cancelable

2023-06-20 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 526831b8f07 [SPARK-44103][INFRA] Make pending container jobs cancelable
526831b8f07 is described below

commit 526831b8f072109df464d503076fda32f8fd5981
Author: Ruifeng Zheng 
AuthorDate: Wed Jun 21 09:48:26 2023 +0800

[SPARK-44103][INFRA] Make pending container jobs cancelable

### What changes were proposed in this pull request?
Make pending container jobs (pyspark/sparkr/linter) cancelable, to release the resources ASAP.

### Why are the changes needed?
to release the resources ASAP.

Before this PR, when we click `Cancel Workflow`, container jobs cannot be cancelled, no matter their status (running or pending).

With this PR, we can cancel the pending container jobs. Note that the running ones still cannot be cancelled.

### Does this PR introduce _any_ user-facing change?
no, test-only

### How was this patch tested?
manually check

Closes #41668 from zhengruifeng/infra_docker_stop.

Lead-authored-by: Ruifeng Zheng 
Co-authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .github/workflows/build_and_test.yml | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index 93d69f18740..a03aa53dc88 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -333,7 +333,7 @@ jobs:
   pyspark:
     needs: [precondition, infra-image]
     # always run if pyspark == 'true', even infra-image is skip (such as non-master job)
-    if: always() && fromJson(needs.precondition.outputs.required).pyspark == 'true'
+    if: (!cancelled()) && fromJson(needs.precondition.outputs.required).pyspark == 'true'
     name: "Build modules: ${{ matrix.modules }}"
     runs-on: ubuntu-22.04
     container:
@@ -446,7 +446,7 @@ jobs:
   sparkr:
     needs: [precondition, infra-image]
     # always run if sparkr == 'true', even infra-image is skip (such as non-master job)
-    if: always() && fromJson(needs.precondition.outputs.required).sparkr == 'true'
+    if: (!cancelled()) && fromJson(needs.precondition.outputs.required).sparkr == 'true'
     name: "Build modules: sparkr"
     runs-on: ubuntu-22.04
     container:
@@ -548,7 +548,7 @@ jobs:
   lint:
     needs: [precondition, infra-image]
     # always run if lint == 'true', even infra-image is skip (such as non-master job)
-    if: always() && fromJson(needs.precondition.outputs.required).lint == 'true'
+    if: (!cancelled()) && fromJson(needs.precondition.outputs.required).lint == 'true'
     name: Linters, licenses, dependencies and documentation generation
     runs-on: ubuntu-22.04
     env:


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-44129][INFRA] Use "3.5.0" for "master" branch until creating `branch-3.5`

2023-06-20 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6cc63cbccfc [SPARK-44129][INFRA] Use "3.5.0" for "master" branch until creating `branch-3.5`
6cc63cbccfc is described below

commit 6cc63cbccfca67d13b2e4166382ccd4f2bd49681
Author: Dongjoon Hyun 
AuthorDate: Wed Jun 21 10:47:04 2023 +0900

[SPARK-44129][INFRA] Use "3.5.0" for "master" branch until creating `branch-3.5`

### What changes were proposed in this pull request?

This PR aims to use a hard-coded "3.5.0" temporarily for the "master" branch, because we have "4.0.0" in the Jira system and we cannot bump the `master` version until we create `branch-3.5`.

### Why are the changes needed?

To avoid setting 4.0.0 as 'Fixed Version' mistakenly.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual.

Closes #41682 from dongjoon-hyun/SPARK-44129.

Authored-by: Dongjoon Hyun 
Signed-off-by: Hyukjin Kwon 
---
 dev/merge_spark_pr.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/dev/merge_spark_pr.py b/dev/merge_spark_pr.py
index 1621432c01c..c3d524d5433 100755
--- a/dev/merge_spark_pr.py
+++ b/dev/merge_spark_pr.py
@@ -240,7 +240,7 @@ def cherry_pick(pr_num, merge_hash, default_branch):
 def fix_version_from_branch(branch, versions):
     # Note: Assumes this is a sorted (newest->oldest) list of un-released versions
     if branch == "master":
-        return versions[0]
+        return "3.5.0" # TODO(SPARK-44130) Revert SPARK-44129 after creating branch-3.5
     else:
         branch_ver = branch.replace("branch-", "")
         return list(filter(lambda x: x.name.startswith(branch_ver), versions))[-1]


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-42941][SS][CONNECT] 1/2] StreamingQueryListener - Event Serde in JSON format

2023-06-20 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6bfc01188e9 [SPARK-42941][SS][CONNECT] 1/2] StreamingQueryListener - Event Serde in JSON format
6bfc01188e9 is described below

commit 6bfc01188e96af065218e9f4574c3c0b8c87fde0
Author: Wei Liu 
AuthorDate: Wed Jun 21 10:34:06 2023 +0900

[SPARK-42941][SS][CONNECT] 1/2] StreamingQueryListener - Event Serde in JSON format

### What changes were proposed in this pull request?

Following the discussion of the `foreachBatch` implementation, we decided to implement the Connect StreamingQueryListener in a way that the server runs the listener code, rather than the client.

Following this POC: https://github.com/apache/spark/pull/41096, this is going to be done such that:
1. The client sends serialized Python code to the server.
2. The server initializes a Scala `StreamingQueryListener`, which initializes the Python process and runs the Python code. (Details of this step still depend on the `foreachBatch` implementation.)
3. When a new StreamingQuery event comes in, the JVM serializes it to JSON and sends it to the Python process to process.

This PR focuses on step 3, the serialization and deserialization of the events.

It also finishes a TODO to check the exception in QueryTerminatedEvent.

### Why are the changes needed?

For implementing Connect StreamingQueryListener

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New unit tests

Closes #41540 from WweiL/SPARK-42941-listener-python-new-1.

Authored-by: Wei Liu 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/streaming/listener.py   | 452 +
 .../sql/tests/streaming/test_streaming_listener.py | 221 +-
 .../sql/streaming/StreamingQueryListener.scala |  44 +-
 3 files changed, 618 insertions(+), 99 deletions(-)

diff --git a/python/pyspark/sql/streaming/listener.py b/python/pyspark/sql/streaming/listener.py
index 33482664a7b..198af0c9cbe 100644
--- a/python/pyspark/sql/streaming/listener.py
+++ b/python/pyspark/sql/streaming/listener.py
@@ -15,7 +15,8 @@
 # limitations under the License.
 #
 import uuid
-from typing import Optional, Dict, List
+import json
+from typing import Any, Dict, List, Optional
 from abc import ABC, abstractmethod
 
 from py4j.java_gateway import JavaObject
@@ -129,16 +130,16 @@ class JStreamingQueryListener:
         self.pylistener = pylistener
 
     def onQueryStarted(self, jevent: JavaObject) -> None:
-        self.pylistener.onQueryStarted(QueryStartedEvent(jevent))
+        self.pylistener.onQueryStarted(QueryStartedEvent.fromJObject(jevent))
 
     def onQueryProgress(self, jevent: JavaObject) -> None:
-        self.pylistener.onQueryProgress(QueryProgressEvent(jevent))
+        self.pylistener.onQueryProgress(QueryProgressEvent.fromJObject(jevent))
 
     def onQueryIdle(self, jevent: JavaObject) -> None:
-        self.pylistener.onQueryIdle(QueryIdleEvent(jevent))
+        self.pylistener.onQueryIdle(QueryIdleEvent.fromJObject(jevent))
 
     def onQueryTerminated(self, jevent: JavaObject) -> None:
-        self.pylistener.onQueryTerminated(QueryTerminatedEvent(jevent))
+        self.pylistener.onQueryTerminated(QueryTerminatedEvent.fromJObject(jevent))
 
     class Java:
         implements = ["org.apache.spark.sql.streaming.PythonStreamingQueryListener"]
@@ -155,11 +156,31 @@ class QueryStartedEvent:
     This API is evolving.
     """
 
-    def __init__(self, jevent: JavaObject) -> None:
-        self._id: uuid.UUID = uuid.UUID(jevent.id().toString())
-        self._runId: uuid.UUID = uuid.UUID(jevent.runId().toString())
-        self._name: Optional[str] = jevent.name()
-        self._timestamp: str = jevent.timestamp()
+    def __init__(
+        self, id: uuid.UUID, runId: uuid.UUID, name: Optional[str], timestamp: str
+    ) -> None:
+        self._id: uuid.UUID = id
+        self._runId: uuid.UUID = runId
+        self._name: Optional[str] = name
+        self._timestamp: str = timestamp
+
+    @classmethod
+    def fromJObject(cls, jevent: JavaObject) -> "QueryStartedEvent":
+        return cls(
+            id=uuid.UUID(jevent.id().toString()),
+            runId=uuid.UUID(jevent.runId().toString()),
+            name=jevent.name(),
+            timestamp=jevent.timestamp(),
+        )
+
+    @classmethod
+    def fromJson(cls, j: Dict[str, Any]) -> "QueryStartedEvent":
+        return cls(
+            id=uuid.UUID(j["id"]),
+            runId=uuid.UUID(j["runId"]),
+            name=j["name"],
+            timestamp=j["timestamp"],
+        )
 
     @property
     def id(self) -> uuid.UUID:
@@ -203,8 +224,16 @@ class QueryProgressEvent:
 This API is evolving.
 """
 
-  
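
To illustrate the step-3 serde pattern the hunks above add (the diff is truncated here), a simplified stand-in for the event classes, not the actual pyspark listener code: each event gains a JSON representation plus a `fromJson` constructor so the JVM can ship it to the Python process as JSON.

```
import json
import uuid
from typing import Any, Dict, Optional

class QueryStartedEventSketch:
    """Simplified stand-in for pyspark.sql.streaming.listener.QueryStartedEvent."""

    def __init__(self, id: uuid.UUID, runId: uuid.UUID, name: Optional[str], timestamp: str):
        self._id, self._runId, self._name, self._timestamp = id, runId, name, timestamp

    def json(self) -> str:
        # Roughly what the JVM side would produce for this event.
        return json.dumps(
            {"id": str(self._id), "runId": str(self._runId), "name": self._name, "timestamp": self._timestamp}
        )

    @classmethod
    def fromJson(cls, j: Dict[str, Any]) -> "QueryStartedEventSketch":
        return cls(uuid.UUID(j["id"]), uuid.UUID(j["runId"]), j["name"], j["timestamp"])

event = QueryStartedEventSketch(uuid.uuid4(), uuid.uuid4(), "my_query", "2023-06-20T00:00:00.000Z")
roundtripped = QueryStartedEventSketch.fromJson(json.loads(event.json()))
assert roundtripped._name == "my_query"
```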

[spark] branch master updated: [SPARK-44122][CONNECT][TESTS] Make `connect` module pass except Arrow-related ones in Java 21

2023-06-20 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new cafbea5b136 [SPARK-44122][CONNECT][TESTS] Make `connect` module pass except Arrow-related ones in Java 21
cafbea5b136 is described below

commit cafbea5b13623276517a9d716f75745eff91f616
Author: Dongjoon Hyun 
AuthorDate: Tue Jun 20 17:22:14 2023 -0700

[SPARK-44122][CONNECT][TESTS] Make `connect` module pass except Arrow-related ones in Java 21

### What changes were proposed in this pull request?

This PR aims to make the `connect` module tests pass, except the `Arrow-based` ones, in a Java 21 environment. In addition, the following JIRA is created to re-enable them:

- SPARK-44121 Renable Arrow-based connect tests in Java 21

### Why are the changes needed?

Although `Arrow` is crucial in the `connect` module, this PR identifies those tests and helps us monitor newly added ones in the future, because they would cause new failures.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs and manual tests.

```
$ java -version
openjdk version "21-ea" 2023-09-19
OpenJDK Runtime Environment (build 21-ea+27-2343)
OpenJDK 64-Bit Server VM (build 21-ea+27-2343, mixed mode, sharing)
```

**BEFORE**
```
$ build/sbt "connect/test"
...
[info] *** 9 TESTS FAILED ***
[error] Failed tests:
[error] org.apache.spark.sql.connect.planner.SparkConnectProtoSuite
[error] org.apache.spark.sql.connect.planner.SparkConnectPlannerSuite
[error] org.apache.spark.sql.connect.planner.SparkConnectServiceSuite
[error] (connect / Test / test) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 67 s (01:07), completed Jun 20, 2023, 2:42:10 PM
```

**AFTER**
```
$ build/sbt "connect/test"
...
[info] Tests: succeeded 742, failed 0, canceled 10, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 66 s (01:06), completed Jun 20, 2023, 2:40:35 PM
```

Closes #41679 from dongjoon-hyun/SPARK-44122.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../spark/sql/connect/planner/SparkConnectPlannerSuite.scala  |  7 +++
 .../spark/sql/connect/planner/SparkConnectProtoSuite.scala| 11 +++
 .../spark/sql/connect/planner/SparkConnectServiceSuite.scala  |  5 +
 3 files changed, 23 insertions(+)

diff --git a/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectPlannerSuite.scala b/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectPlannerSuite.scala
index ab01f2a6c14..14fdc8c0073 100644
--- a/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectPlannerSuite.scala
+++ b/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectPlannerSuite.scala
@@ -21,6 +21,7 @@ import scala.collection.JavaConverters._
 
 import com.google.protobuf.ByteString
 import io.grpc.stub.StreamObserver
+import org.apache.commons.lang3.{JavaVersion, SystemUtils}
 
 import org.apache.spark.SparkFunSuite
 import org.apache.spark.connect.proto
@@ -439,6 +440,8 @@ class SparkConnectPlannerSuite extends SparkFunSuite with SparkConnectPlanTest {
   }
 
   test("transform LocalRelation") {
+    // TODO(SPARK-44121) Renable Arrow-based connect tests in Java 21
+    assume(SystemUtils.isJavaVersionAtMost(JavaVersion.JAVA_17))
     val rows = (0 until 10).map { i =>
       InternalRow(i, UTF8String.fromString(s"str-$i"), InternalRow(i))
     }
@@ -540,6 +543,8 @@ class SparkConnectPlannerSuite extends SparkFunSuite with SparkConnectPlanTest {
   }
 
   test("transform UnresolvedStar and ExpressionString") {
+    // TODO(SPARK-44121) Renable Arrow-based connect tests in Java 21
+    assume(SystemUtils.isJavaVersionAtMost(JavaVersion.JAVA_17))
     val sql =
       "SELECT * FROM VALUES (1,'spark',1), (2,'hadoop',2), (3,'kafka',3) AS tab(id, name, value)"
     val input = proto.Relation
@@ -576,6 +581,8 @@ class SparkConnectPlannerSuite extends SparkFunSuite with SparkConnectPlanTest {
   }
 
   test("transform UnresolvedStar with target field") {
+    // TODO(SPARK-44121) Renable Arrow-based connect tests in Java 21
+    assume(SystemUtils.isJavaVersionAtMost(JavaVersion.JAVA_17))
     val rows = (0 until 10).map { i =>
       InternalRow(InternalRow(InternalRow(i, i + 1)))
     }
diff --git a/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala b/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala
index 8cb5c1a2919..181564a3b60 100644
--- 

[spark] branch master updated: [SPARK-43952][CORE][CONNECT][SQL] Add SparkContext APIs for query cancellation by tag

2023-06-20 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 607469b2fd2 [SPARK-43952][CORE][CONNECT][SQL] Add SparkContext APIs for query cancellation by tag
607469b2fd2 is described below

commit 607469b2fd2ee6d70739c5e8b3aca15f67a45cde
Author: Juliusz Sompolski 
AuthorDate: Wed Jun 21 09:21:30 2023 +0900

[SPARK-43952][CORE][CONNECT][SQL] Add SparkContext APIs for query cancellation by tag

### What changes were proposed in this pull request?

Currently, the only way to cancel running Spark jobs is `SparkContext.cancelJobGroup`, using a job group name that was previously set with `SparkContext.setJobGroup`. This is problematic if multiple different parts of the system want to do cancellation and set their own ids.

For example, BroadcastExchangeExec sets its own job group, which may override the job group set by the user. This way, if the user cancels the job group they set in the "parent" execution, it will not cancel the broadcast jobs launched from within their jobs. It would also be useful in e.g. Spark Connect to be able to cancel jobs without overriding the jobGroupId, which may be used and needed for other purposes.

As a solution, this adds APIs to set tags on jobs and to cancel jobs using tags:
* `SparkContext.addJobTag(tag: String): Unit`
* `SparkContext.removeJobTag(tag: String): Unit`
* `SparkContext.getJobTags(): Set[String]`
* `SparkContext.clearJobTags(): Unit`
* `SparkContext.cancelJobsWithTag(tag: String): Unit`
* `DAGScheduler.cancelJobsWithTag(tag: String): Unit`

It also adds `SparkContext.setInterruptOnCancel(interruptOnCancel: Boolean): Unit`, which previously could only be set in `setJobGroup`.
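
For illustration, a minimal usage sketch of tag-based cancellation, assuming the Python `SparkContext` exposes the same tag methods as the Scala API listed above (that Python availability is an assumption here, not something this PR adds):

```
import threading
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("tag-cancel-sketch").getOrCreate()
sc = spark.sparkContext

# Assumption: these mirror the Scala SparkContext tag APIs listed above.
sc.addJobTag("etl-step-1")
try:
    # Cancel everything tagged "etl-step-1" from another thread after a short delay.
    threading.Timer(1.0, lambda: sc.cancelJobsWithTag("etl-step-1")).start()
    sc.parallelize(range(10**9)).map(lambda x: x * x).count()  # long-running job
except Exception as e:
    print("job cancelled:", e)
finally:
    sc.removeJobTag("etl-step-1")
```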

The tags are also added to `JobData` and `AppStatusTracker`. A new API is added to `SparkStatusTracker`:
* `SparkStatusTracker.getJobIdsForTag(jobTag: String): Array[Int]`

The new API is used internally in BroadcastExchangeExec instead of cancellation via job group, to fix the issue of broadcast jobs not being cancelled by a user-set jobGroupId. Now, the user-set jobGroupId should propagate into broadcast execution.

Also, switch cancellation in Spark Connect to use tag instead of jobgroup.

### Why are the changes needed?

Currently, there may be multiple places that want to cancel a set of jobs, 
with different scopes.

### Does this PR introduce _any_ user-facing change?

The APIs described above are added.

### How was this patch tested?

Added test to JobCancellationSuite.

Closes #41440 from juliuszsompolski/SPARK-43952.

Authored-by: Juliusz Sompolski 
Signed-off-by: Hyukjin Kwon 
---
 .../sql/connect/service/ExecutePlanHolder.scala|   7 +-
 .../service/SparkConnectStreamHandler.scala|   8 +-
 .../apache/spark/status/protobuf/store_types.proto |   1 +
 .../main/scala/org/apache/spark/SparkContext.scala |  77 
 .../org/apache/spark/SparkStatusTracker.scala  |  11 ++
 .../org/apache/spark/scheduler/DAGScheduler.scala  |  25 
 .../apache/spark/scheduler/DAGSchedulerEvent.scala |   2 +
 .../apache/spark/status/AppStatusListener.scala|   7 ++
 .../scala/org/apache/spark/status/LiveEntity.scala |   2 +
 .../scala/org/apache/spark/status/api/v1/api.scala |   1 +
 .../status/protobuf/JobDataWrapperSerializer.scala |   2 +
 ...from_multi_attempt_app_json_1__expectation.json |   1 +
 ...from_multi_attempt_app_json_2__expectation.json |   1 +
 .../job_list_json_expectation.json |   3 +
 .../one_job_json_expectation.json  |   1 +
 ...succeeded_failed_job_list_json_expectation.json |   3 +
 .../succeeded_job_list_json_expectation.json   |   2 +
 .../org/apache/spark/JobCancellationSuite.scala| 129 -
 .../org/apache/spark/StatusTrackerSuite.scala  |  41 +++
 .../protobuf/KVStoreProtobufSerializerSuite.scala  |   8 +-
 project/MimaExcludes.scala |   4 +-
 .../execution/exchange/BroadcastExchangeExec.scala |  11 +-
 .../sql/execution/BroadcastExchangeSuite.scala |   4 +-
 23 files changed, 335 insertions(+), 16 deletions(-)

diff --git a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/ExecutePlanHolder.scala b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/ExecutePlanHolder.scala
index a3c17b9826e..9bf9df07e01 100644
--- a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/ExecutePlanHolder.scala
+++ b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/ExecutePlanHolder.scala
@@ -27,15 +27,16 @@ case class ExecutePlanHolder(
     sessionHolder: SessionHolder,
     request: proto.ExecutePlanRequest) {

[spark] branch master updated: [SPARK-44012][SS] KafkaDataConsumer to print some read status

2023-06-20 Thread kabhwan
This is an automated email from the ASF dual-hosted git repository.

kabhwan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ef8f22e53d3 [SPARK-44012][SS] KafkaDataConsumer to print some read status
ef8f22e53d3 is described below

commit ef8f22e53d31225b3429e2f24ca588d113fc7462
Author: Siying Dong 
AuthorDate: Wed Jun 21 07:33:46 2023 +0900

[SPARK-44012][SS] KafkaDataConsumer to print some read status

### What changes were proposed in this pull request?
At the end of each KafkaDataConsumer's lifetime, it logs some stats. Here is a sample log line:

23/06/08 23:48:14 INFO KafkaDataConsumer: From Kafka topicPartition=topic-121-2 groupId=spark-kafka-source-623fa0a8-04a5-4f34-a9ad-adbf31494e85-711383366-executor read 1 records, taking 504554479 nanos, during time span of 504620999 nanos

### Why are the changes needed?
For each task, the Kafka source should report the fraction of time spent in KafkaConsumer fetching records. It should also report the overall read bandwidth (bytes or records read / time spent fetching).

This will be useful in verifying whether fetching is the bottleneck.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
1. Run unit tests and validate log line is correct
2. Run some benchmarks and see it doesn't show up much in CPU profiling.

Closes #41525 from siying/kafka_logging2.

Authored-by: Siying Dong 
Signed-off-by: Jungtaek Lim 
---
 .../sql/kafka010/consumer/KafkaDataConsumer.scala  | 55 --
 1 file changed, 51 insertions(+), 4 deletions(-)

diff --git a/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/consumer/KafkaDataConsumer.scala b/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/consumer/KafkaDataConsumer.scala
index d88e9821489..a9e394d3c88 100644
--- a/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/consumer/KafkaDataConsumer.scala
+++ b/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/consumer/KafkaDataConsumer.scala
@@ -258,6 +258,17 @@ private[kafka010] class KafkaDataConsumer(
*/
   private val fetchedRecord: FetchedRecord = FetchedRecord(null, UNKNOWN_OFFSET)
 
+  // Total duration spent on reading from Kafka
+  private var totalTimeReadNanos: Long = 0
+  // Number of times we poll Kafka consumers.
+  private var numPolls: Long = 0
+  // Number of times we poll Kafka consumers.
+  private var numRecordsPolled: Long = 0
+  // Total number of records fetched from Kafka
+  private var totalRecordsRead: Long = 0
+  // Starting timestamp when the consumer is created.
+  private var startTimestampNano: Long = System.nanoTime()
+
   /**
* Get the record for the given offset if available.
*
@@ -343,6 +354,7 @@ private[kafka010] class KafkaDataConsumer(
 }
 
 if (isFetchComplete) {
+  totalRecordsRead += 1
   fetchedRecord.record
 } else {
   fetchedData.reset()
@@ -356,7 +368,9 @@ private[kafka010] class KafkaDataConsumer(
*/
   def getAvailableOffsetRange(): AvailableOffsetRange = runUninterruptiblyIfPossible {
 val consumer = getOrRetrieveConsumer()
-consumer.getAvailableOffsetRange()
+timeNanos {
+  consumer.getAvailableOffsetRange()
+}
   }
 
   def getNumOffsetOutOfRange(): Long = offsetOutOfRange
@@ -367,6 +381,17 @@ private[kafka010] class KafkaDataConsumer(
* must call method after using the instance to make sure resources are not leaked.
*/
   def release(): Unit = {
+val kafkaMeta = _consumer
+  .map(c => s"topicPartition=${c.topicPartition} groupId=${c.groupId}")
+  .getOrElse("")
+val walTime = System.nanoTime() - startTimestampNano
+
+    logInfo(
+      s"From Kafka $kafkaMeta read $totalRecordsRead records through $numPolls polls (polled " +
+      s" out $numRecordsPolled records), taking $totalTimeReadNanos nanos, during time span of " +
+      s"$walTime nanos."
+    )
+
 releaseConsumer()
 releaseFetchedData()
   }
@@ -394,7 +419,9 @@ private[kafka010] class KafkaDataConsumer(
   consumer: InternalKafkaConsumer,
   offset: Long,
   untilOffset: Long): Long = {
-val range = consumer.getAvailableOffsetRange()
+val range = timeNanos {
+  consumer.getAvailableOffsetRange()
+}
 logWarning(s"Some data may be lost. Recovering from the earliest offset: 
${range.earliest}")
 
 val topicPartition = consumer.topicPartition
@@ -548,7 +575,11 @@ private[kafka010] class KafkaDataConsumer(
   fetchedData: FetchedData,
   offset: Long,
   pollTimeoutMs: Long): Unit = {
-    val (records, offsetAfterPoll, range) = consumer.fetch(offset, pollTimeoutMs)
+val (records, offsetAfterPoll, range) = timeNanos {
+  consumer.fetch(offset, pollTimeoutMs)
+}
+numPolls += 1
+
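
The hunks above (truncated here) wrap each fetch in a `timeNanos` helper and accumulate counters that are logged on `release()`. A rough Python analogue of that pattern, purely illustrative and not the Kafka connector's code:

```
import time

class TimedReader:
    """Accumulate time spent fetching and counts, then log a summary on release()."""

    def __init__(self, fetch_fn):
        self._fetch_fn = fetch_fn
        self._total_fetch_nanos = 0
        self._num_polls = 0
        self._records_read = 0
        self._start_nanos = time.monotonic_ns()

    def fetch(self, *args):
        start = time.monotonic_ns()
        try:
            records = self._fetch_fn(*args)
        finally:
            self._total_fetch_nanos += time.monotonic_ns() - start
            self._num_polls += 1
        self._records_read += len(records)
        return records

    def release(self):
        wall = time.monotonic_ns() - self._start_nanos
        print(f"read {self._records_read} records in {self._num_polls} polls, "
              f"taking {self._total_fetch_nanos} nanos, during time span of {wall} nanos")

reader = TimedReader(lambda offset: [b"record"] * 10)  # fake fetch function
reader.fetch(0)
reader.release()
```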

[spark] branch master updated (68b30053f78 -> 66f25e31403)

2023-06-20 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 68b30053f78 [SPARK-43939][CONNECT][PYTHON] Add try_* functions to Scala and Python
 add 66f25e31403 [SPARK-43876][SQL] Enable fast hashmap for distinct queries

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala   | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-43939][CONNECT][PYTHON] Add try_* functions to Scala and Python

2023-06-20 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 68b30053f78 [SPARK-43939][CONNECT][PYTHON] Add try_* functions to Scala and Python
68b30053f78 is described below

commit 68b30053f786e8178e6bdba736734e91adb51088
Author: panbingkun 
AuthorDate: Wed Jun 21 00:38:22 2023 +0800

[SPARK-43939][CONNECT][PYTHON] Add try_* functions to Scala and Python

### What changes were proposed in this pull request?
Add following functions:

- try_add
- try_avg
- try_divide
- try_element_at
- try_multiply
- try_subtract
- try_sum
- try_to_binary
- try_to_number
- try_to_timestamp

to:

- Scala API
- Python API
- Spark Connect Scala Client
- Spark Connect Python Client
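
A hedged usage sketch of a few of the listed functions via the Python API (assuming a Spark 3.5+ session where they are available):

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[1]").appName("try-funcs-sketch").getOrCreate()

df = spark.createDataFrame([(6, 3), (1, 0)], ["a", "b"])
# try_divide returns NULL instead of failing on division by zero (the ANSI-safe variant);
# the second row's quotient is NULL because 1/0 has no result.
df.select(
    F.try_add("a", "b").alias("sum"),
    F.try_divide("a", "b").alias("quotient"),
).show()
```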

### Why are the changes needed?
for parity

### Does this PR introduce _any_ user-facing change?
Yes, new functions.

### How was this patch tested?
- Add New UT.

Closes #41653 from panbingkun/SPARK-43939.

Authored-by: panbingkun 
Signed-off-by: Ruifeng Zheng 
---
 .../scala/org/apache/spark/sql/functions.scala | 115 +++
 .../apache/spark/sql/PlanGenerationTestSuite.scala |  52 
 .../explain-results/function_try_add.explain   |   2 +
 .../explain-results/function_try_avg.explain   |   2 +
 .../explain-results/function_try_divide.explain|   2 +
 .../function_try_element_at_array.explain  |   2 +
 .../function_try_element_at_map.explain|   2 +
 .../explain-results/function_try_multiply.explain  |   2 +
 .../explain-results/function_try_subtract.explain  |   2 +
 .../explain-results/function_try_sum.explain   |   2 +
 .../explain-results/function_try_to_binary.explain |   2 +
 .../function_try_to_binary_without_format.explain  |   2 +
 .../explain-results/function_try_to_number.explain |   2 +
 .../function_try_to_timestamp.explain  |   2 +
 ...unction_try_to_timestamp_without_format.explain |   2 +
 .../query-tests/queries/function_try_add.json  |  29 ++
 .../query-tests/queries/function_try_add.proto.bin | Bin 0 -> 183 bytes
 .../query-tests/queries/function_try_avg.json  |  25 ++
 .../query-tests/queries/function_try_avg.proto.bin | Bin 0 -> 176 bytes
 .../query-tests/queries/function_try_divide.json   |  29 ++
 .../queries/function_try_divide.proto.bin  | Bin 0 -> 186 bytes
 .../queries/function_try_element_at_array.json |  29 ++
 .../function_try_element_at_array.proto.bin| Bin 0 -> 190 bytes
 .../queries/function_try_element_at_map.json   |  29 ++
 .../queries/function_try_element_at_map.proto.bin  | Bin 0 -> 190 bytes
 .../query-tests/queries/function_try_multiply.json |  29 ++
 .../queries/function_try_multiply.proto.bin| Bin 0 -> 188 bytes
 .../query-tests/queries/function_try_subtract.json |  29 ++
 .../queries/function_try_subtract.proto.bin| Bin 0 -> 188 bytes
 .../query-tests/queries/function_try_sum.json  |  25 ++
 .../query-tests/queries/function_try_sum.proto.bin | Bin 0 -> 176 bytes
 .../queries/function_try_to_binary.json|  29 ++
 .../queries/function_try_to_binary.proto.bin   | Bin 0 -> 194 bytes
 .../function_try_to_binary_without_format.json |  25 ++
 ...function_try_to_binary_without_format.proto.bin | Bin 0 -> 182 bytes
 .../queries/function_try_to_number.json|  29 ++
 .../queries/function_try_to_number.proto.bin   | Bin 0 -> 194 bytes
 .../queries/function_try_to_timestamp.json |  29 ++
 .../queries/function_try_to_timestamp.proto.bin| Bin 0 -> 192 bytes
 .../function_try_to_timestamp_without_format.json  |  25 ++
 ...ction_try_to_timestamp_without_format.proto.bin | Bin 0 -> 185 bytes
 .../source/reference/pyspark.sql/functions.rst |  10 +
 python/pyspark/sql/connect/functions.py|  76 +
 python/pyspark/sql/functions.py| 341 +
 .../scala/org/apache/spark/sql/functions.scala | 137 +
 .../org/apache/spark/sql/DateFunctionsSuite.scala  |  11 +
 .../org/apache/spark/sql/MathFunctionsSuite.scala  |  91 ++
 .../apache/spark/sql/StringFunctionsSuite.scala|  17 +
 48 files changed, 1237 insertions(+)

diff --git a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
index a3f4a273661..d258abcecfa 100644
--- a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
+++ b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
@@ -1807,6 +1807,58 @@ object functions {
*/
   def sqrt(colName: String): Column = sqrt(Column(colName))
 
+  /**
+   * Returns the sum of `left` and `right` and the result is 

[spark] branch master updated: [SPARK-44022][BUILD] Enforce max bytecode version on Maven dependencies

2023-06-20 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 27e1e56d54b [SPARK-44022][BUILD] Enforce max bytecode version on Maven dependencies
27e1e56d54b is described below

commit 27e1e56d54b1c6986369469b27774fa7bc267cf4
Author: liangbowen 
AuthorDate: Tue Jun 20 08:22:30 2023 -0700

[SPARK-44022][BUILD] Enforce max bytecode version on Maven dependencies

### What changes were proposed in this pull request?

- enforce Java's max bytecode version on Maven dependencies, by using the enforceBytecodeVersion enforcer rule (https://www.mojohaus.org/extra-enforcer-rules/enforceBytecodeVersion.html)
- exclude `org.threeten:threeten-extra` in the enforcer rule, as its package-info.class requires bytecode version 53 but has no side effects on other classes

### Why are the changes needed?

- to avoid introducing dependencies that require a higher Java version

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #41546 from bowenliang123/enforce-bytecode-version.

Authored-by: liangbowen 
Signed-off-by: Dongjoon Hyun 
---
 pom.xml | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/pom.xml b/pom.xml
index c322112fbc6..74f64da9fd3 100644
--- a/pom.xml
+++ b/pom.xml
@@ -2782,6 +2782,17 @@
 
 true
   
+              <enforceBytecodeVersion>
+                <maxJdkVersion>${java.version}</maxJdkVersion>
+                <ignoredScopes>
+                  <ignoredScope>test</ignoredScope>
+                  <ignoredScope>provided</ignoredScope>
+                </ignoredScopes>
+                <excludes>
+                  <exclude>org.threeten:threeten-extra</exclude>
+                </excludes>
+              </enforceBytecodeVersion>
 
   
 
@@ -2797,6 +2808,13 @@
   
 
   
+              <dependencies>
+                <dependency>
+                  <groupId>org.codehaus.mojo</groupId>
+                  <artifactId>extra-enforcer-rules</artifactId>
+                  <version>1.7.0</version>
+                </dependency>
+              </dependencies>
 
 
   org.codehaus.mojo
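
For background on what the rule checks (an illustration, unrelated to the Maven rule's own implementation): a compiled `.class` file carries its required bytecode major version in its header, e.g. 52 for Java 8 and 53 for Java 9, which is why `threeten-extra`'s `package-info.class` (version 53) needs an exclusion. A small Python reader of that header:

```
import struct

def class_file_major_version(path: str) -> int:
    """Read the bytecode major version from a .class file header."""
    with open(path, "rb") as f:
        magic, _minor, major = struct.unpack(">IHH", f.read(8))
    if magic != 0xCAFEBABE:
        raise ValueError(f"{path} is not a class file")
    return major  # 52 = Java 8, 53 = Java 9, ..., 65 = Java 21

# Example usage (the path is hypothetical):
# print(class_file_major_version("Foo.class"))
```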


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-44105][SQL] `LastNonNull` should be lazily resolved

2023-06-20 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 4c15fa784dc [SPARK-44105][SQL] `LastNonNull` should be lazily resolved
4c15fa784dc is described below

commit 4c15fa784dc268ac0844771585677a90de2d064f
Author: Ruifeng Zheng 
AuthorDate: Tue Jun 20 07:49:24 2023 -0700

[SPARK-44105][SQL] `LastNonNull` should be lazily resolved

### What changes were proposed in this pull request?
`LastNonNull` should be lazily resolved

### Why are the changes needed?
to fix https://github.com/apache/spark/pull/41670/files#r1234805869

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
existing GA and manually check

Closes #41672 from zhengruifeng/ps_fix_last_not_null.

Authored-by: Ruifeng Zheng 
Signed-off-by: Dongjoon Hyun 
---
 .../apache/spark/sql/catalyst/expressions/windowExpressions.scala | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
index 1b9d232bf8a..50c98c01645 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
@@ -1164,19 +1164,19 @@ case class LastNonNull(input: Expression)
 
   override def dataType: DataType = input.dataType
 
-  private val last = AttributeReference("last", dataType, nullable = true)()
+  private lazy val last = AttributeReference("last", dataType, nullable = true)()
 
   override def aggBufferAttributes: Seq[AttributeReference] = last :: Nil
 
-  override val initialValues: Seq[Expression] = Seq(Literal.create(null, dataType))
+  override lazy val initialValues: Seq[Expression] = Seq(Literal.create(null, dataType))
 
-  override val updateExpressions: Seq[Expression] = {
+  override lazy val updateExpressions: Seq[Expression] = {
     Seq(
       /* last = */ If(IsNull(input), last, input)
     )
   }
 
-  override val evaluateExpression: Expression = last
+  override lazy val evaluateExpression: Expression = last
 
   override def prettyName: String = "last_non_null"
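
The change above defers these `val`s until first access so that `input.dataType` is not evaluated while `input` is still unresolved. A loose Python analogue of that lazy-initialization idea (not Catalyst code, just the deferral pattern):

```
from functools import cached_property

class LastNonNullSketch:
    def __init__(self, get_input_type):
        # get_input_type may fail until the plan is "resolved".
        self._get_input_type = get_input_type

    @cached_property
    def buffer_type(self):
        # Computed on first access only, analogous to Scala's `lazy val`.
        return self._get_input_type()

resolved = {"ready": False}

def input_type():
    if not resolved["ready"]:
        raise RuntimeError("unresolved input")
    return "LongType"

expr = LastNonNullSketch(input_type)   # no error at construction time
resolved["ready"] = True
print(expr.buffer_type)                # "LongType"
```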
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-44104][BUILD] Enabled `protobuf` module mima check for Spark 3.5.0

2023-06-20 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new bfad72b0a86 [SPARK-44104][BUILD] Enabled `protobuf` module mima check for Spark 3.5.0
bfad72b0a86 is described below

commit bfad72b0a8689f7a3361785cc5004030bc94da3d
Author: yangjie01 
AuthorDate: Tue Jun 20 07:45:49 2023 -0700

[SPARK-44104][BUILD] Enabled `protobuf` module mima check for Spark 3.5.0

### What changes were proposed in this pull request?
This pr adds a mima check for the `protobuf` module for Apache Spark 3.5.0

### Why are the changes needed?
The `protobuf` module is a new module introduced in Spark 3.4.0, which includes some client APIs, so it should be added to Spark 3.5.0's mima check.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass GitHub Actions
- Manual checked:

```
dev/mima
```

and

```
dev/change-scala-version.sh 2.13
dev/mima -Pscala-2.13
```

Closes #41671 from LuciferYang/SPARK-44104.

Authored-by: yangjie01 
Signed-off-by: Dongjoon Hyun 
---
 project/MimaExcludes.scala | 4 
 project/SparkBuild.scala   | 2 +-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/project/MimaExcludes.scala b/project/MimaExcludes.scala
index bba20534f44..7cac416838d 100644
--- a/project/MimaExcludes.scala
+++ b/project/MimaExcludes.scala
@@ -93,6 +93,10 @@ object MimaExcludes {
     ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.ErrorInfo$"),
     ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.ErrorSubInfo$"),
 
+    // SPARK-44104: shaded protobuf code and Apis with parameters relocated
+    ProblemFilters.exclude[Problem]("org.sparkproject.spark_protobuf.protobuf.*"),
+    ProblemFilters.exclude[Problem]("org.apache.spark.sql.protobuf.utils.SchemaConverters.*"),
+
     (problem: Problem) => problem match {
       case MissingClassProblem(cls) => !cls.fullName.startsWith("org.sparkproject.jpmml") &&
         !cls.fullName.startsWith("org.sparkproject.dmg.pmml")
diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala
index 607daa67138..761b8f905f5 100644
--- a/project/SparkBuild.scala
+++ b/project/SparkBuild.scala
@@ -415,7 +415,7 @@ object SparkBuild extends PomBuild {
   val mimaProjects = allProjects.filterNot { x =>
     Seq(
       spark, hive, hiveThriftServer, repl, networkCommon, networkShuffle, networkYarn,
-      unsafe, tags, tokenProviderKafka010, sqlKafka010, connectCommon, connect, connectClient, protobuf,
+      unsafe, tags, tokenProviderKafka010, sqlKafka010, connectCommon, connect, connectClient,
       commonUtils, sqlApi
     ).contains(x)
   }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-43929][SPARK-44073][SQL][PYTHON][CONNECT][FOLLOWUP] Add extract, date_part, datepart to Scala, Python and Connect API

2023-06-20 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8dc02863b92 [SPARK-43929][SPARK-44073][SQL][PYTHON][CONNECT][FOLLOWUP] Add extract, date_part, datepart to Scala, Python and Connect API
8dc02863b92 is described below

commit 8dc02863b926b9e0780b994f9ee6c5c1058d49a0
Author: Jiaan Geng 
AuthorDate: Tue Jun 20 21:47:05 2023 +0800

[SPARK-43929][SPARK-44073][SQL][PYTHON][CONNECT][FOLLOWUP] Add extract, date_part, datepart to Scala, Python and Connect API

### What changes were proposed in this pull request?
This PR follows up https://github.com/apache/spark/pull/41636 and https://github.com/apache/spark/pull/41651 and adds extract, date_part, and datepart to the Scala, Python and Connect APIs.

### Why are the changes needed?
Add extract, date_part, datepart to Scala, Python and Connect API

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New test cases.

Closes #41667 from beliefer/datetime_functions_followup.

Authored-by: Jiaan Geng 
Signed-off-by: Ruifeng Zheng 
---
 .../scala/org/apache/spark/sql/functions.scala |  50 ++
 .../apache/spark/sql/PlanGenerationTestSuite.scala |  12 +++
 .../explain-results/function_date_part.explain |   2 +
 .../explain-results/function_datepart.explain  |   2 +
 .../explain-results/function_extract.explain   |   2 +
 .../query-tests/queries/function_date_part.json|  29 ++
 .../queries/function_date_part.proto.bin   | Bin 0 -> 133 bytes
 .../query-tests/queries/function_datepart.json |  29 ++
 .../queries/function_datepart.proto.bin| Bin 0 -> 132 bytes
 .../query-tests/queries/function_extract.json  |  29 ++
 .../query-tests/queries/function_extract.proto.bin | Bin 0 -> 131 bytes
 .../source/reference/pyspark.sql/functions.rst |   3 +
 python/pyspark/sql/connect/functions.py|  21 
 python/pyspark/sql/functions.py| 110 +
 .../scala/org/apache/spark/sql/functions.scala |  41 
 .../org/apache/spark/sql/DateFunctionsSuite.scala  |  68 +
 16 files changed, 398 insertions(+)

diff --git a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
index 2ac20bd5911..a3f4a273661 100644
--- a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
+++ b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
@@ -4438,6 +4438,56 @@ object functions {
*/
   def hour(e: Column): Column = Column.fn("hour", e)
 
+  /**
+   * Extracts a part of the date/timestamp or interval source.
+   *
+   * @param field
+   *   selects which part of the source should be extracted.
+   * @param source
+   *   a date/timestamp or interval column from where `field` should be extracted.
+   * @return
+   *   a part of the date/timestamp or interval source
+   * @group datetime_funcs
+   * @since 3.5.0
+   */
+  def extract(field: Column, source: Column): Column = {
+Column.fn("extract", field, source)
+  }
+
+  /**
+   * Extracts a part of the date/timestamp or interval source.
+   *
+   * @param field
+   *   selects which part of the source should be extracted, and supported string values are as
+   *   same as the fields of the equivalent function `extract`.
+   * @param source
+   *   a date/timestamp or interval column from where `field` should be extracted.
+   * @return
+   *   a part of the date/timestamp or interval source
+   * @group datetime_funcs
+   * @since 3.5.0
+   */
+  def date_part(field: Column, source: Column): Column = {
+Column.fn("date_part", field, source)
+  }
+
+  /**
+   * Extracts a part of the date/timestamp or interval source.
+   *
+   * @param field
+   *   selects which part of the source should be extracted, and supported string values are as
+   *   same as the fields of the equivalent function `extract`.
+   * @param source
+   *   a date/timestamp or interval column from where `field` should be extracted.
+   * @return
+   *   a part of the date/timestamp or interval source
+   * @group datetime_funcs
+   * @since 3.5.0
+   */
+  def datepart(field: Column, source: Column): Column = {
+Column.fn("datepart", field, source)
+  }
+
   /**
* Returns the last day of the month which the given date belongs to. For example, input
* "2015-07-27" returns "2015-07-31" since July 31 is the last day of the month in July 2015.
diff --git a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala 
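
A hedged PySpark usage sketch of the new functions (assuming a Spark 3.5+ session where they are available); each takes a field column and a source column, as the Scaladoc above describes:

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[1]").appName("date-part-sketch").getOrCreate()

df = spark.sql("SELECT TIMESTAMP '2023-06-20 19:20:20' AS ts")
df.select(
    F.extract(F.lit("YEAR"), F.col("ts")).alias("year"),
    F.date_part(F.lit("HOUR"), F.col("ts")).alias("hour"),
    F.datepart(F.lit("MINUTE"), F.col("ts")).alias("minute"),
).show()
# year=2023, hour=19, minute=20
```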
 

[spark] branch master updated (0865c0db923 -> 35b3a18ff04)

2023-06-20 Thread weichenxu123
This is an automated email from the ASF dual-hosted git repository.

weichenxu123 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 0865c0db923 [SPARK-43944][SPARK-43942][SQL][FOLLOWUP] Directly leverage `UnresolvedFunction` for functions `startswith`/`endswith`/`contains`
 add 35b3a18ff04 [SPARK-44100][ML][CONNECT][PYTHON] Move namespace from `pyspark.mlv2` to `pyspark.ml.connect`

No new revisions were added by this update.

Summary of changes:
 dev/sparktestsupport/modules.py| 18 +-
 python/mypy.ini|  3 --
 python/pyspark/ml/connect/__init__.py  | 24 +
 python/pyspark/{mlv2 => ml/connect}/base.py|  2 +-
 .../pyspark/{mlv2 => ml/connect}/classification.py |  6 ++--
 python/pyspark/{mlv2 => ml/connect}/evaluation.py  |  6 ++--
 python/pyspark/{mlv2 => ml/connect}/feature.py |  6 ++--
 python/pyspark/{mlv2 => ml/connect}/io_utils.py|  0
 python/pyspark/{mlv2 => ml/connect}/pipeline.py|  8 ++---
 python/pyspark/{mlv2 => ml/connect}/summarizer.py  |  2 +-
 python/pyspark/{mlv2 => ml/connect}/util.py|  0
 .../tests/connect/test_connect_classification.py}  |  6 ++--
 .../tests/connect/test_connect_evaluation.py}  |  4 +--
 .../tests/connect/test_connect_feature.py} |  4 +--
 .../tests/connect/test_connect_pipeline.py}|  6 ++--
 .../tests/connect/test_connect_summarizer.py}  |  4 +--
 .../connect/test_legacy_mode_classification.py}|  6 ++--
 .../tests/connect/test_legacy_mode_evaluation.py}  |  4 +--
 .../tests/connect/test_legacy_mode_feature.py} |  4 +--
 .../tests/connect/test_legacy_mode_pipeline.py}| 10 +++---
 .../tests/connect/test_legacy_mode_summarizer.py}  |  4 +--
 python/pyspark/mlv2/__init__.py| 40 --
 22 files changed, 75 insertions(+), 92 deletions(-)
 rename python/pyspark/{mlv2 => ml/connect}/base.py (99%)
 rename python/pyspark/{mlv2 => ml/connect}/classification.py (98%)
 rename python/pyspark/{mlv2 => ml/connect}/evaluation.py (95%)
 rename python/pyspark/{mlv2 => ml/connect}/feature.py (97%)
 rename python/pyspark/{mlv2 => ml/connect}/io_utils.py (100%)
 rename python/pyspark/{mlv2 => ml/connect}/pipeline.py (96%)
 rename python/pyspark/{mlv2 => ml/connect}/summarizer.py (98%)
 rename python/pyspark/{mlv2 => ml/connect}/util.py (100%)
 copy python/pyspark/{mlv2/tests/connect/test_parity_classification.py => ml/tests/connect/test_connect_classification.py} (84%)
 rename python/pyspark/{mlv2/tests/connect/test_parity_evaluation.py => ml/tests/connect/test_connect_evaluation.py} (89%)
 rename python/pyspark/{mlv2/tests/connect/test_parity_feature.py => ml/tests/connect/test_connect_feature.py} (89%)
 rename python/pyspark/{mlv2/tests/connect/test_parity_classification.py => ml/tests/connect/test_connect_pipeline.py} (85%)
 rename python/pyspark/{mlv2/tests/connect/test_parity_summarizer.py => ml/tests/connect/test_connect_summarizer.py} (88%)
 rename python/pyspark/{mlv2/tests/test_classification.py => ml/tests/connect/test_legacy_mode_classification.py} (98%)
 rename python/pyspark/{mlv2/tests/test_evaluation.py => ml/tests/connect/test_legacy_mode_evaluation.py} (94%)
 rename python/pyspark/{mlv2/tests/test_feature.py => ml/tests/connect/test_legacy_mode_feature.py} (97%)
 rename python/pyspark/{mlv2/tests/test_pipeline.py => ml/tests/connect/test_legacy_mode_pipeline.py} (95%)
 rename python/pyspark/{mlv2/tests/test_summarizer.py => ml/tests/connect/test_legacy_mode_summarizer.py} (94%)
 delete mode 100644 python/pyspark/mlv2/__init__.py


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-43944][SPARK-43942][SQL][FOLLOWUP] Directly leverage `UnresolvedFunction` for functions `startswith`/`endswith`/`contains`

2023-06-20 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0865c0db923 [SPARK-43944][SPARK-43942][SQL][FOLLOWUP] Directly leverage `UnresolvedFunction` for functions `startswith`/`endswith`/`contains`
0865c0db923 is described below

commit 0865c0db923eadb840dd9af5834dc72ba19e43c4
Author: Ruifeng Zheng 
AuthorDate: Tue Jun 20 18:00:33 2023 +0800

[SPARK-43944][SPARK-43942][SQL][FOLLOWUP] Directly leverage 
`UnresolvedFunction` for functions `startswith`/`endswith`/`contains`

### What changes were proposed in this pull request?
Directly leverage `UnresolvedFunction` for `startswith`/`endswith`/`contains` instead of routing them through `call_udf`.

### Why are the changes needed?
to be more consistent with existing functions, like
[ceil](https://github.com/apache/spark/blob/6b36a9368d6e97f7f1f94c4ca7f6ee76dcd0015f/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2242-L2260),
[floor](https://github.com/apache/spark/blob/6b36a9368d6e97f7f1f94c4ca7f6ee76dcd0015f/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2397-L2416),
[lpad](https://github.com/apache/spark/blob/6b36a9368d6e97f7f1f94c4ca7f6ee76dcd0015f/sql/core/src/main/scala/org/apa [...]
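
A minimal usage sketch of the resulting API, assuming only an active SparkSession named `spark` on a Spark 3.5.0 snapshot build:

    // startswith/endswith/contains are now resolved by the analyzer against
    // the function registry (via UnresolvedFunction), so the DataFrame calls
    // behave like the SQL built-ins of the same name.
    import org.apache.spark.sql.functions.{contains, endswith, startswith}
    import spark.implicits._

    val df = Seq(("Spark SQL", "Spark")).toDF("a", "b")
    df.select(startswith($"a", $"b"), endswith($"a", $"b"), contains($"a", $"b")).show()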

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
Covered by the existing GitHub Actions (GA) test runs.

Closes #41669 from zhengruifeng/use_unresolved_func.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../main/scala/org/apache/spark/sql/functions.scala| 18 ++
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
index 68b81810da4..7c3f65e2495 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
@@ -4051,10 +4051,8 @@ object functions {
    * @group string_funcs
    * @since 3.5.0
    */
-  def endswith(str: Column, suffix: Column): Column = {
-    // 'EndsWith' expression only supports StringType,
-    // use 'call_udf' to support both StringType and BinaryType.
-    call_udf("endswith", str, suffix)
+  def endswith(str: Column, suffix: Column): Column = withExpr {
+    UnresolvedFunction(Seq("endswith"), Seq(str.expr, suffix.expr), isDistinct = false)
   }
 
   /**
@@ -4065,10 +4063,8 @@ object functions {
    * @group string_funcs
    * @since 3.5.0
    */
-  def startswith(str: Column, prefix: Column): Column = {
-    // 'StartsWith' expression only supports StringType,
-    // use 'call_udf' to support both StringType and BinaryType.
-    call_udf("startswith", str, prefix)
+  def startswith(str: Column, prefix: Column): Column = withExpr {
+    UnresolvedFunction(Seq("startswith"), Seq(str.expr, prefix.expr), isDistinct = false)
   }
 
   /**
@@ -4145,10 +4141,8 @@ object functions {
    * @group string_funcs
    * @since 3.5.0
    */
-  def contains(left: Column, right: Column): Column = {
-    // 'Contains' expression only supports StringType
-    // use 'call_udf' to support both StringType and BinaryType.
-    call_udf("contains", left, right)
+  def contains(left: Column, right: Column): Column = withExpr {
+    UnresolvedFunction(Seq("contains"), Seq(left.expr, right.expr), isDistinct = false)
   }
 
   /**





[spark] branch master updated: [SPARK-43944][SQL][CONNECT][PYTHON][FOLLOW-UP] Make `startswith` & `endswith` support binary type

2023-06-20 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6b36a9368d6 [SPARK-43944][SQL][CONNECT][PYTHON][FOLLOW-UP] Make 
`startswith` & `endswith` support binary type
6b36a9368d6 is described below

commit 6b36a9368d6e97f7f1f94c4ca7f6ee76dcd0015f
Author: Ruifeng Zheng 
AuthorDate: Tue Jun 20 14:08:56 2023 +0800

[SPARK-43944][SQL][CONNECT][PYTHON][FOLLOW-UP] Make `startswith` & 
`endswith` support binary type

### What changes were proposed in this pull request?
Make `startswith` and `endswith` support binary type:
1. in the Connect API, `startswith` and `endswith` already accept binary input;
2. in the vanilla API, add binary support by routing through `call_udf` (see the sketch just below).
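
A rough sketch of the `call_udf` route described in point 2, assuming only an active SparkSession named `spark` on a Spark 3.5.0 snapshot build; the `cast("binary")` step is there purely to exercise the binary path:

    import org.apache.spark.sql.functions.{call_udf, col}
    import spark.implicits._

    // Routing through call_udf resolves the registered SQL functions,
    // which accept BINARY as well as STRING inputs.
    val df = Seq(("Spark SQL", "SQL")).toDF("a", "b")
      .select(col("a").cast("binary").as("a"), col("b").cast("binary").as("b"))
    df.select(call_udf("endswith", col("a"), col("b")),
      call_udf("startswith", col("a"), col("b"))).show()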

### Why are the changes needed?
For parity with the corresponding SQL functions, which already accept both STRING and BINARY.

### Does this PR introduce _any_ user-facing change?
Yes, `startswith` and `endswith` now accept BINARY input in addition to STRING.

### How was this patch tested?
Added unit tests.

Closes #41659 from zhengruifeng/sql_func_sw.

Lead-authored-by: Ruifeng Zheng 
Co-authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../scala/org/apache/spark/sql/functions.scala | 14 +++--
 python/pyspark/sql/functions.py| 36 --
 .../scala/org/apache/spark/sql/functions.scala | 24 ++-
 .../apache/spark/sql/StringFunctionsSuite.scala| 14 +++--
 4 files changed, 52 insertions(+), 36 deletions(-)

diff --git a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
index 93cf8f521b2..2ac20bd5911 100644
--- a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
+++ b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
@@ -3945,11 +3945,8 @@ object functions {
 
   /**
    * Returns a boolean. The value is True if str ends with suffix. Returns NULL if either input
-   * expression is NULL. Otherwise, returns False. Both str or suffix must be of STRING type.
-   *
-   * @note
-   *   Only STRING type is supported in this function, while `endswith` in SQL supports both
-   *   STRING and BINARY.
+   * expression is NULL. Otherwise, returns False. Both str or suffix must be of STRING or BINARY
+   * type.
    *
    * @group string_funcs
    * @since 3.5.0
@@ -3959,11 +3956,8 @@ object functions {
 
   /**
    * Returns a boolean. The value is True if str starts with prefix. Returns NULL if either input
-   * expression is NULL. Otherwise, returns False. Both str or prefix must be of STRING type.
-   *
-   * @note
-   *   Only STRING type is supported in this function, while `startswith` in SQL supports both
-   *   STRING and BINARY.
+   * expression is NULL. Otherwise, returns False. Both str or prefix must be of STRING or BINARY
+   * type.
    *
    * @group string_funcs
    * @since 3.5.0
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 3eaccdc1ea1..0cfc19615be 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -9660,11 +9660,6 @@ def endswith(str: "ColumnOrName", suffix: "ColumnOrName") -> Column:
 
     .. versionadded:: 3.5.0
 
-    Notes
-    -----
-    Only STRING type is supported in this function,
-    while `startswith` in SQL supports both STRING and BINARY.
-
     Parameters
     ----------
     str : :class:`~pyspark.sql.Column` or str
@@ -9677,6 +9672,19 @@ def endswith(str: "ColumnOrName", suffix: "ColumnOrName") -> Column:
     >>> df = spark.createDataFrame([("Spark SQL", "Spark",)], ["a", "b"])
     >>> df.select(endswith(df.a, df.b).alias('r')).collect()
     [Row(r=False)]
+
+    >>> df = spark.createDataFrame([("414243", "4243",)], ["e", "f"])
+    >>> df = df.select(to_binary("e").alias("e"), to_binary("f").alias("f"))
+    >>> df.printSchema()
+    root
+     |-- e: binary (nullable = true)
+     |-- f: binary (nullable = true)
+    >>> df.select(endswith("e", "f"), endswith("f", "e")).show()
+    +--------------+--------------+
+    |endswith(e, f)|endswith(f, e)|
+    +--------------+--------------+
+    |          true|         false|
+    +--------------+--------------+
     """
     return _invoke_function_over_columns("endswith", str, suffix)
 
@@ -9690,11 +9698,6 @@ def startswith(str: "ColumnOrName", prefix: "ColumnOrName") -> Column:
 
     .. versionadded:: 3.5.0
 
-    Notes
-    -----
-    Only STRING type is supported in this function,
-    while `startswith` in SQL supports both STRING and BINARY.
-
     Parameters
     ----------
     str : :class:`~pyspark.sql.Column` or str
@@ -9707,6 +9710,19 @@ def startswith(str: "ColumnOrName", prefix: "ColumnOrName") -> Column:
     >>> df = spark.createDataFrame([("Spark SQL", "Spark",)], ["a", "b"])
     >>> df.select(startswith(df.a,