[spark] branch master updated: [SPARK-44129][INFRA] Use "3.5.0" for "master" branch until creating `branch-3.5`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 692ec99bc78 [SPARK-44129][INFRA] Use "3.5.0" for "master" branch until creating `branch-3.5`
692ec99bc78 is described below

commit 692ec99bc7884cc998afbf63e4ef53053c0c9dd7
Author: Hyukjin Kwon
AuthorDate: Tue Jun 20 19:20:20 2023 -0700

    [SPARK-44129][INFRA] Use "3.5.0" for "master" branch until creating `branch-3.5`

    ### What changes were proposed in this pull request?

    This adds a couple of changes to make the script work. See https://github.com/apache/spark/pull/41682

    ### Why are the changes needed?

    See https://github.com/apache/spark/pull/41682

    ### Does this PR introduce _any_ user-facing change?

    See https://github.com/apache/spark/pull/41682

    ### How was this patch tested?

    See https://github.com/apache/spark/pull/41682

    Closes #41684 from HyukjinKwon/SPARK-44129.

    Authored-by: Hyukjin Kwon
    Signed-off-by: Dongjoon Hyun
---
 dev/merge_spark_pr.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/dev/merge_spark_pr.py b/dev/merge_spark_pr.py
index 1621432c01c..e9024573a21 100755
--- a/dev/merge_spark_pr.py
+++ b/dev/merge_spark_pr.py
@@ -240,7 +240,8 @@ def cherry_pick(pr_num, merge_hash, default_branch):
 def fix_version_from_branch(branch, versions):
     # Note: Assumes this is a sorted (newest->oldest) list of un-released versions
     if branch == "master":
-        return versions[0]
+        # TODO(SPARK-44130) Revert SPARK-44129 after creating branch-3.5
+        return [v for v in versions if v.name == "3.5.0"][0]
     else:
         branch_ver = branch.replace("branch-", "")
         return list(filter(lambda x: x.name.startswith(branch_ver), versions))[-1]

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
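The patched version-selection logic is small enough to sketch standalone. The following is an illustrative Python sketch, not the merge script itself: the `Version` namedtuple stands in for the JIRA version objects the script receives (only their `.name` field matters here).

```python
from collections import namedtuple

# Stand-in for the JIRA version objects the merge script works with.
Version = namedtuple("Version", ["name"])

def fix_version_from_branch(branch, versions):
    # versions is sorted newest -> oldest, as the script's own comment notes
    if branch == "master":
        # TODO(SPARK-44130): go back to versions[0] once branch-3.5 exists,
        # so that "4.0.0" is not mistakenly picked as the Fixed Version
        return [v for v in versions if v.name == "3.5.0"][0]
    else:
        branch_ver = branch.replace("branch-", "")
        # oldest unreleased version matching the branch prefix
        return [v for v in versions if v.name.startswith(branch_ver)][-1]
```

With a newest-to-oldest list like `["4.0.0", "3.5.0", "3.4.2"]`, `master` now resolves to `3.5.0` instead of `4.0.0`, while release branches are unaffected.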
[spark] branch master updated: Revert "[SPARK-44129][INFRA] Use "3.5.0" for "master" branch until creating `branch-3.5`"
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 5a55061df25 Revert "[SPARK-44129][INFRA] Use "3.5.0" for "master" branch until creating `branch-3.5`"
5a55061df25 is described below

commit 5a55061df25a9f9c1c35c272b1563705d957eb84
Author: Hyukjin Kwon
AuthorDate: Wed Jun 21 11:11:57 2023 +0900

    Revert "[SPARK-44129][INFRA] Use "3.5.0" for "master" branch until creating `branch-3.5`"

    This reverts commit 6cc63cbccfca67d13b2e4166382ccd4f2bd49681.
---
 dev/merge_spark_pr.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/dev/merge_spark_pr.py b/dev/merge_spark_pr.py
index c3d524d5433..1621432c01c 100755
--- a/dev/merge_spark_pr.py
+++ b/dev/merge_spark_pr.py
@@ -240,7 +240,7 @@ def cherry_pick(pr_num, merge_hash, default_branch):
 def fix_version_from_branch(branch, versions):
     # Note: Assumes this is a sorted (newest->oldest) list of un-released versions
     if branch == "master":
-        return "3.5.0"  # TODO(SPARK-44130) Revert SPARK-44129 after creating branch-3.5
+        return versions[0]
     else:
         branch_ver = branch.replace("branch-", "")
         return list(filter(lambda x: x.name.startswith(branch_ver), versions))[-1]
[spark] branch master updated: [SPARK-44107][CONNECT][PYTHON] Hide unsupported Column methods from auto-completion
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 3d52ba49946 [SPARK-44107][CONNECT][PYTHON] Hide unsupported Column methods from auto-completion
3d52ba49946 is described below

commit 3d52ba49946cb0054a58e0026e26ba442c64988d
Author: Ruifeng Zheng
AuthorDate: Wed Jun 21 10:11:36 2023 +0800

    [SPARK-44107][CONNECT][PYTHON] Hide unsupported Column methods from auto-completion

    ### What changes were proposed in this pull request?

    Hide the unsupported Column method `_jc` from auto-completion; it is already handled in `__getattr__`, see https://github.com/apache/spark/blob/e6c6d444ae07f1ece127cea6332cce906b5aa1c5/python/pyspark/sql/connect/column.py#L445-L454

    ### Why are the changes needed?

    No need to show unsupported methods.

    ### Does this PR introduce _any_ user-facing change?

    Yes.

    before this PR:
    https://github.com/apache/spark/assets/7322292/bee3c41d-8fa5-4981-9392-cde93a1e9f34

    after this PR:
    https://github.com/apache/spark/assets/7322292/85e5c7cc-86b7-4919-8c8a-db8dba2c94a9

    ### How was this patch tested?

    Existing UTs and manually checked in ipython.

    Closes #41675 from zhengruifeng/connect_col_hide.

    Authored-by: Ruifeng Zheng
    Signed-off-by: Ruifeng Zheng
---
 python/pyspark/sql/connect/column.py | 6 --
 1 file changed, 6 deletions(-)

diff --git a/python/pyspark/sql/connect/column.py b/python/pyspark/sql/connect/column.py
index 4d32da56192..05292938163 100644
--- a/python/pyspark/sql/connect/column.py
+++ b/python/pyspark/sql/connect/column.py
@@ -478,12 +478,6 @@ class Column:
     __bool__ = __nonzero__

-    @property
-    def _jc(self) -> None:
-        raise PySparkAttributeError(
-            error_class="JVM_ATTRIBUTE_NOT_SUPPORTED", message_parameters={"attr_name": "_jc"}
-        )
-

 Column.__doc__ = PySparkColumn.__doc__
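The completion-hiding trick relies on a Python detail: `dir()` and auto-completion only list attributes that are actually defined, while `__getattr__` intercepts access to anything else. A minimal sketch of the idea (a hypothetical class, not the real Connect `Column`):

```python
class Column:
    """Sketch: _jc is not defined as a property, so it never appears in
    dir()/auto-completion; __getattr__ still turns any access into an
    explicit error rather than a silent miss."""

    _UNSUPPORTED = frozenset({"_jc"})

    def __getattr__(self, name):
        # Only called for attributes NOT found by normal lookup.
        if name in self._UNSUPPORTED:
            raise AttributeError(f"'{name}' is not supported in Spark Connect")
        raise AttributeError(name)
```

Removing the explicit `@property` (as the diff above does) is what drops `_jc` from `dir()`; the `__getattr__` fallback still produces a clear error on access.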
[spark] branch master updated: [SPARK-44125][R] Support Java 21 in SparkR
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 95f071cf5f3 [SPARK-44125][R] Support Java 21 in SparkR
95f071cf5f3 is described below

commit 95f071cf5f34d73d193b9c4f28f5459fa92aaeef
Author: Dongjoon Hyun
AuthorDate: Wed Jun 21 11:10:54 2023 +0900

    [SPARK-44125][R] Support Java 21 in SparkR

    ### What changes were proposed in this pull request?

    This PR aims to support Java 21 in SparkR. Arrow-related issues will be fixed when we upgrade the Arrow library. The following JIRA was created to re-enable the affected tests in Java 21:
    - SPARK-44127 Reenable `test_sparkSQL_arrow.R` in Java 21

    ### Why are the changes needed?

    To be ready for Java 21.

    ### Does this PR introduce _any_ user-facing change?

    No, this is additional support.

    ### How was this patch tested?

    Pass the CIs and do manual tests.

    ```
    $ java -version
    openjdk version "21-ea" 2023-09-19
    OpenJDK Runtime Environment (build 21-ea+27-2343)
    OpenJDK 64-Bit Server VM (build 21-ea+27-2343, mixed mode, sharing)

    $ build/sbt test:package -Psparkr -Phive
    $ R/install-dev.sh; R/run-tests.sh
    ...
    ══ Skipped ═
    1. createDataFrame/collect Arrow optimization ('test_sparkSQL_arrow.R:29:3') - Reason: sparkR.callJStatic("org.apache.spark.util.Utils", "isJavaVersionAtLeast21") is TRUE
    2. createDataFrame/collect Arrow optimization - many partitions (partition order test) ('test_sparkSQL_arrow.R:47:3') - Reason: sparkR.callJStatic("org.apache.spark.util.Utils", "isJavaVersionAtLeast21") is TRUE
    3. createDataFrame/collect Arrow optimization - type specification ('test_sparkSQL_arrow.R:54:3') - Reason: sparkR.callJStatic("org.apache.spark.util.Utils", "isJavaVersionAtLeast21") is TRUE
    4. dapply() Arrow optimization ('test_sparkSQL_arrow.R:79:3') - Reason: sparkR.callJStatic("org.apache.spark.util.Utils", "isJavaVersionAtLeast21") is TRUE
    5. dapply() Arrow optimization - type specification ('test_sparkSQL_arrow.R:114:3') - Reason: sparkR.callJStatic("org.apache.spark.util.Utils", "isJavaVersionAtLeast21") is TRUE
    6. dapply() Arrow optimization - type specification (date and timestamp) ('test_sparkSQL_arrow.R:144:3') - Reason: sparkR.callJStatic("org.apache.spark.util.Utils", "isJavaVersionAtLeast21") is TRUE
    7. gapply() Arrow optimization ('test_sparkSQL_arrow.R:154:3') - Reason: sparkR.callJStatic("org.apache.spark.util.Utils", "isJavaVersionAtLeast21") is TRUE
    8. gapply() Arrow optimization - type specification ('test_sparkSQL_arrow.R:198:3') - Reason: sparkR.callJStatic("org.apache.spark.util.Utils", "isJavaVersionAtLeast21") is TRUE
    9. gapply() Arrow optimization - type specification (date and timestamp) ('test_sparkSQL_arrow.R:231:3') - Reason: sparkR.callJStatic("org.apache.spark.util.Utils", "isJavaVersionAtLeast21") is TRUE
    10. Arrow optimization - unsupported types ('test_sparkSQL_arrow.R:243:3') - Reason: sparkR.callJStatic("org.apache.spark.util.Utils", "isJavaVersionAtLeast21") is TRUE
    11. SPARK-32478: gapply() Arrow optimization - error message for schema mismatch ('test_sparkSQL_arrow.R:255:3') - Reason: sparkR.callJStatic("org.apache.spark.util.Utils", "isJavaVersionAtLeast21") is TRUE
    12. SPARK-43789: Automatically pick the number of partitions based on Arrow batch size ('test_sparkSQL_arrow.R:265:3') - Reason: sparkR.callJStatic("org.apache.spark.util.Utils", "isJavaVersionAtLeast21") is TRUE
    13. sparkJars tag in SparkContext ('test_Windows.R:22:5') - Reason: This test is only for Windows, skipped
    ══ DONE ...

    * DONE
    Status: 2 NOTEs
    See ‘/Users/dongjoon/APACHE/spark-merge/R/SparkR.Rcheck/00check.log’ for details.
    + popd
    Tests passed.
    ```

    Closes #41680 from dongjoon-hyun/SPARK-44125.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Hyukjin Kwon
---
 R/pkg/R/client.R                            |  6 --
 R/pkg/tests/fulltests/test_sparkSQL_arrow.R | 24 
 2 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/R/pkg/R/client.R b/R/pkg/R/client.R
index 797a5c7da15..88f9e9fe857 100644
--- a/R/pkg/R/client.R
+++ b/R/pkg/R/client.R
@@ -93,8 +93,10 @@ checkJavaVersion <- function() {
   }, javaVersionOut)

   javaVersionStr <- strsplit(javaVersionFilter[[1]], '"', fixed = TRUE)[[1L]][2]
-  # javaVersionStr is of the form 1.8.0_92/9.0.x/11.0.x.
-  # We are using 8, 9, 10, 11 for sparkJavaVersion.
+  # javaVersionStr is of the form
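The R helper's job here — mapping a `java -version` string to a major version number — can be sketched in Python. `java_major_version` is a hypothetical name for illustration; the version strings follow the formats quoted in the diff ("1.8.0_92", "11.0.x", "21-ea"):

```python
def java_major_version(version_str: str) -> int:
    """Map a Java version string to its major version.
    Handles the legacy "1.x" scheme ("1.8.0_92" -> 8), the modern scheme
    ("11.0.2" -> 11), and early-access builds ("21-ea" -> 21)."""
    first = version_str.split(".")[0].split("-")[0]
    if first == "1":
        # Legacy scheme: the major version is the second component.
        return int(version_str.split(".")[1])
    return int(first)
```

This mirrors the parsing concern the SparkR `checkJavaVersion` change addresses when extending support from 8–17 up to 21.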
[spark] branch master updated: [SPARK-44103][INFRA] Make pending container jobs cancelable
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 526831b8f07 [SPARK-44103][INFRA] Make pending container jobs cancelable
526831b8f07 is described below

commit 526831b8f072109df464d503076fda32f8fd5981
Author: Ruifeng Zheng
AuthorDate: Wed Jun 21 09:48:26 2023 +0800

    [SPARK-44103][INFRA] Make pending container jobs cancelable

    ### What changes were proposed in this pull request?

    Make pending container jobs (pyspark/sparkr/linter) cancelable, to release the resources ASAP.

    ### Why are the changes needed?

    To release the resources ASAP. Before this PR, clicking `Cancel Workflow` could not cancel container jobs at all, regardless of their status (running or pending). With this PR, pending container jobs can be cancelled; running ones still cannot be.

    ### Does this PR introduce _any_ user-facing change?

    No, test-only.

    ### How was this patch tested?

    Manually checked.

    Closes #41668 from zhengruifeng/infra_docker_stop.

    Lead-authored-by: Ruifeng Zheng
    Co-authored-by: Ruifeng Zheng
    Signed-off-by: Ruifeng Zheng
---
 .github/workflows/build_and_test.yml | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index 93d69f18740..a03aa53dc88 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -333,7 +333,7 @@ jobs:
   pyspark:
     needs: [precondition, infra-image]
     # always run if pyspark == 'true', even infra-image is skip (such as non-master job)
-    if: always() && fromJson(needs.precondition.outputs.required).pyspark == 'true'
+    if: (!cancelled()) && fromJson(needs.precondition.outputs.required).pyspark == 'true'
     name: "Build modules: ${{ matrix.modules }}"
     runs-on: ubuntu-22.04
     container:
@@ -446,7 +446,7 @@ jobs:
   sparkr:
     needs: [precondition, infra-image]
     # always run if sparkr == 'true', even infra-image is skip (such as non-master job)
-    if: always() && fromJson(needs.precondition.outputs.required).sparkr == 'true'
+    if: (!cancelled()) && fromJson(needs.precondition.outputs.required).sparkr == 'true'
     name: "Build modules: sparkr"
     runs-on: ubuntu-22.04
     container:
@@ -548,7 +548,7 @@ jobs:
   lint:
     needs: [precondition, infra-image]
     # always run if lint == 'true', even infra-image is skip (such as non-master job)
-    if: always() && fromJson(needs.precondition.outputs.required).lint == 'true'
+    if: (!cancelled()) && fromJson(needs.precondition.outputs.required).lint == 'true'
     name: Linters, licenses, dependencies and documentation generation
     runs-on: ubuntu-22.04
     env:
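The behavioral difference between the two GitHub Actions conditions can be modeled as predicates: `always()` makes a job's `if:` evaluate to true even after the workflow is cancelled (so a pending job can never be cancelled), while `!cancelled()` lets cancellation win. A toy Python model of that logic (names are illustrative, not the Actions expression API):

```python
# Toy model of a job-level `if:` condition.
# `condition(cancelled)` models the status-check function;
# `required` models the fromJson(...).pyspark == 'true' check.

def always(workflow_cancelled: bool) -> bool:
    # always() is true regardless of workflow state
    return True

def not_cancelled(workflow_cancelled: bool) -> bool:
    # !cancelled() is false once the workflow is cancelled
    return not workflow_cancelled

def job_runs(condition, workflow_cancelled: bool, required: bool) -> bool:
    return condition(workflow_cancelled) and required
```

With `always()`, a required pending job still runs after cancellation; with `!cancelled()`, it is released — which is exactly the resource-saving behavior the PR description claims for pending container jobs.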
[spark] branch master updated: [SPARK-44129][INFRA] Use "3.5.0" for "master" branch until creating `branch-3.5`
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 6cc63cbccfc [SPARK-44129][INFRA] Use "3.5.0" for "master" branch until creating `branch-3.5`
6cc63cbccfc is described below

commit 6cc63cbccfca67d13b2e4166382ccd4f2bd49681
Author: Dongjoon Hyun
AuthorDate: Wed Jun 21 10:47:04 2023 +0900

    [SPARK-44129][INFRA] Use "3.5.0" for "master" branch until creating `branch-3.5`

    ### What changes were proposed in this pull request?

    This PR aims to use a hard-coded "3.5.0" temporarily for the "master" branch because we have "4.0.0" in the Jira system and we cannot bump the `master` version until we create `branch-3.5`.

    ### Why are the changes needed?

    To avoid setting 4.0.0 as 'Fixed Version' mistakenly.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Manual.

    Closes #41682 from dongjoon-hyun/SPARK-44129.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Hyukjin Kwon
---
 dev/merge_spark_pr.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/dev/merge_spark_pr.py b/dev/merge_spark_pr.py
index 1621432c01c..c3d524d5433 100755
--- a/dev/merge_spark_pr.py
+++ b/dev/merge_spark_pr.py
@@ -240,7 +240,7 @@ def cherry_pick(pr_num, merge_hash, default_branch):
 def fix_version_from_branch(branch, versions):
     # Note: Assumes this is a sorted (newest->oldest) list of un-released versions
     if branch == "master":
-        return versions[0]
+        return "3.5.0"  # TODO(SPARK-44130) Revert SPARK-44129 after creating branch-3.5
     else:
         branch_ver = branch.replace("branch-", "")
         return list(filter(lambda x: x.name.startswith(branch_ver), versions))[-1]
[spark] branch master updated: [SPARK-42941][SS][CONNECT][1/2] StreamingQueryListener - Event Serde in JSON format
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 6bfc01188e9 [SPARK-42941][SS][CONNECT][1/2] StreamingQueryListener - Event Serde in JSON format
6bfc01188e9 is described below

commit 6bfc01188e96af065218e9f4574c3c0b8c87fde0
Author: Wei Liu
AuthorDate: Wed Jun 21 10:34:06 2023 +0900

    [SPARK-42941][SS][CONNECT][1/2] StreamingQueryListener - Event Serde in JSON format

    ### What changes were proposed in this pull request?

    Following the discussion of the `foreachBatch` implementation, we decided to implement the Connect StreamingQueryListener so that the server runs the listener code, rather than the client. Following this POC: https://github.com/apache/spark/pull/41096, this is done as follows:

    1. The client sends serialized Python code to the server.
    2. The server initializes a Scala `StreamingQueryListener`, which starts a Python process and runs the Python code in it. (Details of this step still depend on the `foreachBatch` implementation.)
    3. When a new StreamingQuery event comes in, the JVM serializes it to JSON and sends it to the Python process for handling.

    This PR focuses on step 3: the serialization and deserialization of the events. It also finishes a TODO to check the exception in QueryTerminatedEvent.

    ### Why are the changes needed?

    For implementing the Connect StreamingQueryListener.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    New unit tests.

    Closes #41540 from WweiL/SPARK-42941-listener-python-new-1.

    Authored-by: Wei Liu
    Signed-off-by: Hyukjin Kwon
---
 python/pyspark/sql/streaming/listener.py           | 452 +
 .../sql/tests/streaming/test_streaming_listener.py | 221 +-
 .../sql/streaming/StreamingQueryListener.scala     |  44 +-
 3 files changed, 618 insertions(+), 99 deletions(-)

diff --git a/python/pyspark/sql/streaming/listener.py b/python/pyspark/sql/streaming/listener.py
index 33482664a7b..198af0c9cbe 100644
--- a/python/pyspark/sql/streaming/listener.py
+++ b/python/pyspark/sql/streaming/listener.py
@@ -15,7 +15,8 @@
 # limitations under the License.
 #
 import uuid
-from typing import Optional, Dict, List
+import json
+from typing import Any, Dict, List, Optional

 from abc import ABC, abstractmethod
 from py4j.java_gateway import JavaObject
@@ -129,16 +130,16 @@ class JStreamingQueryListener:
         self.pylistener = pylistener

     def onQueryStarted(self, jevent: JavaObject) -> None:
-        self.pylistener.onQueryStarted(QueryStartedEvent(jevent))
+        self.pylistener.onQueryStarted(QueryStartedEvent.fromJObject(jevent))

     def onQueryProgress(self, jevent: JavaObject) -> None:
-        self.pylistener.onQueryProgress(QueryProgressEvent(jevent))
+        self.pylistener.onQueryProgress(QueryProgressEvent.fromJObject(jevent))

     def onQueryIdle(self, jevent: JavaObject) -> None:
-        self.pylistener.onQueryIdle(QueryIdleEvent(jevent))
+        self.pylistener.onQueryIdle(QueryIdleEvent.fromJObject(jevent))

     def onQueryTerminated(self, jevent: JavaObject) -> None:
-        self.pylistener.onQueryTerminated(QueryTerminatedEvent(jevent))
+        self.pylistener.onQueryTerminated(QueryTerminatedEvent.fromJObject(jevent))

     class Java:
         implements = ["org.apache.spark.sql.streaming.PythonStreamingQueryListener"]
@@ -155,11 +156,31 @@ class QueryStartedEvent:
     This API is evolving.
     """

-    def __init__(self, jevent: JavaObject) -> None:
-        self._id: uuid.UUID = uuid.UUID(jevent.id().toString())
-        self._runId: uuid.UUID = uuid.UUID(jevent.runId().toString())
-        self._name: Optional[str] = jevent.name()
-        self._timestamp: str = jevent.timestamp()
+    def __init__(
+        self, id: uuid.UUID, runId: uuid.UUID, name: Optional[str], timestamp: str
+    ) -> None:
+        self._id: uuid.UUID = id
+        self._runId: uuid.UUID = runId
+        self._name: Optional[str] = name
+        self._timestamp: str = timestamp
+
+    @classmethod
+    def fromJObject(cls, jevent: JavaObject) -> "QueryStartedEvent":
+        return cls(
+            id=uuid.UUID(jevent.id().toString()),
+            runId=uuid.UUID(jevent.runId().toString()),
+            name=jevent.name(),
+            timestamp=jevent.timestamp(),
+        )
+
+    @classmethod
+    def fromJson(cls, j: Dict[str, Any]) -> "QueryStartedEvent":
+        return cls(
+            id=uuid.UUID(j["id"]),
+            runId=uuid.UUID(j["runId"]),
+            name=j["name"],
+            timestamp=j["timestamp"],
+        )

     @property
     def id(self) -> uuid.UUID:
@@ -203,8 +224,16 @@ class QueryProgressEvent:
     This API is evolving.
     """

-
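The serde pattern in the diff above — events constructed from plain fields, with alternate constructors for Java objects and JSON — reduces to a small sketch. This mirrors `QueryStartedEvent.fromJson` from the diff, but the `json()` serializer shown here is an assumption added for round-trip illustration, not necessarily the Scala side's exact format:

```python
import json
import uuid
from typing import Any, Dict, Optional

class QueryStartedEvent:
    """Sketch of the event-serde pattern: plain-field constructor plus a
    fromJson classmethod, so events can cross the JVM/Python boundary as JSON."""

    def __init__(self, id: uuid.UUID, runId: uuid.UUID,
                 name: Optional[str], timestamp: str) -> None:
        self._id, self._runId = id, runId
        self._name, self._timestamp = name, timestamp

    @classmethod
    def fromJson(cls, j: Dict[str, Any]) -> "QueryStartedEvent":
        return cls(
            id=uuid.UUID(j["id"]),
            runId=uuid.UUID(j["runId"]),
            name=j["name"],
            timestamp=j["timestamp"],
        )

    def json(self) -> str:
        # Hypothetical serializer for round-trip illustration.
        return json.dumps({
            "id": str(self._id),
            "runId": str(self._runId),
            "name": self._name,
            "timestamp": self._timestamp,
        })
```

A JSON payload deserialized with `fromJson` and re-serialized with `json()` round-trips unchanged, which is the property the listener relies on when the JVM ships events to the Python process.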
[spark] branch master updated: [SPARK-44122][CONNECT][TESTS] Make `connect` module pass except Arrow-related ones in Java 21
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new cafbea5b136 [SPARK-44122][CONNECT][TESTS] Make `connect` module pass except Arrow-related ones in Java 21
cafbea5b136 is described below

commit cafbea5b13623276517a9d716f75745eff91f616
Author: Dongjoon Hyun
AuthorDate: Tue Jun 20 17:22:14 2023 -0700

    [SPARK-44122][CONNECT][TESTS] Make `connect` module pass except Arrow-related ones in Java 21

    ### What changes were proposed in this pull request?

    This PR aims to make the `connect` module pass, except for Arrow-based tests, in a Java 21 environment. In addition, the following JIRA was created to re-enable them:
    - SPARK-44121 Renable Arrow-based connect tests in Java 21

    ### Why are the changes needed?

    Although Arrow is crucial to the `connect` module, this PR identifies the affected tests and helps us monitor newly added ones in the future, because any new Arrow-dependent test would cause a new failure.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Pass the CIs and manual tests.

    ```
    $ java -version
    openjdk version "21-ea" 2023-09-19
    OpenJDK Runtime Environment (build 21-ea+27-2343)
    OpenJDK 64-Bit Server VM (build 21-ea+27-2343, mixed mode, sharing)
    ```

    **BEFORE**
    ```
    $ build/sbt "connect/test"
    ...
    [info] *** 9 TESTS FAILED ***
    [error] Failed tests:
    [error]   org.apache.spark.sql.connect.planner.SparkConnectProtoSuite
    [error]   org.apache.spark.sql.connect.planner.SparkConnectPlannerSuite
    [error]   org.apache.spark.sql.connect.planner.SparkConnectServiceSuite
    [error] (connect / Test / test) sbt.TestsFailedException: Tests unsuccessful
    [error] Total time: 67 s (01:07), completed Jun 20, 2023, 2:42:10 PM
    ```

    **AFTER**
    ```
    $ build/sbt "connect/test"
    ...
    [info] Tests: succeeded 742, failed 0, canceled 10, ignored 0, pending 0
    [info] All tests passed.
    [success] Total time: 66 s (01:06), completed Jun 20, 2023, 2:40:35 PM
    ```

    Closes #41679 from dongjoon-hyun/SPARK-44122.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
---
 .../spark/sql/connect/planner/SparkConnectPlannerSuite.scala |  7 +++
 .../spark/sql/connect/planner/SparkConnectProtoSuite.scala   | 11 +++
 .../spark/sql/connect/planner/SparkConnectServiceSuite.scala |  5 +
 3 files changed, 23 insertions(+)

diff --git a/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectPlannerSuite.scala b/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectPlannerSuite.scala
index ab01f2a6c14..14fdc8c0073 100644
--- a/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectPlannerSuite.scala
+++ b/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectPlannerSuite.scala
@@ -21,6 +21,7 @@ import scala.collection.JavaConverters._

 import com.google.protobuf.ByteString
 import io.grpc.stub.StreamObserver
+import org.apache.commons.lang3.{JavaVersion, SystemUtils}

 import org.apache.spark.SparkFunSuite
 import org.apache.spark.connect.proto
@@ -439,6 +440,8 @@ class SparkConnectPlannerSuite extends SparkFunSuite with SparkConnectPlanTest {
   }

   test("transform LocalRelation") {
+    // TODO(SPARK-44121) Renable Arrow-based connect tests in Java 21
+    assume(SystemUtils.isJavaVersionAtMost(JavaVersion.JAVA_17))
     val rows = (0 until 10).map { i =>
       InternalRow(i, UTF8String.fromString(s"str-$i"), InternalRow(i))
     }
@@ -540,6 +543,8 @@ class SparkConnectPlannerSuite extends SparkFunSuite with SparkConnectPlanTest {
   }

   test("transform UnresolvedStar and ExpressionString") {
+    // TODO(SPARK-44121) Renable Arrow-based connect tests in Java 21
+    assume(SystemUtils.isJavaVersionAtMost(JavaVersion.JAVA_17))
     val sql =
       "SELECT * FROM VALUES (1,'spark',1), (2,'hadoop',2), (3,'kafka',3) AS tab(id, name, value)"
     val input = proto.Relation
@@ -576,6 +581,8 @@ class SparkConnectPlannerSuite extends SparkFunSuite with SparkConnectPlanTest {
   }

   test("transform UnresolvedStar with target field") {
+    // TODO(SPARK-44121) Renable Arrow-based connect tests in Java 21
+    assume(SystemUtils.isJavaVersionAtMost(JavaVersion.JAVA_17))
     val rows = (0 until 10).map { i =>
       InternalRow(InternalRow(InternalRow(i, i + 1)))
     }
diff --git a/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala b/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala
index 8cb5c1a2919..181564a3b60 100644
---
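Scala's `assume(...)`, used throughout the diff above, aborts a test as *canceled* (not failed) when the precondition does not hold — hence "canceled 10" in the AFTER log. The closest Python analogue is `unittest`'s skip machinery; a hedged sketch (the `JAVA_MAJOR` constant is a stand-in for a real JVM version probe):

```python
import unittest

# Stand-in for querying the actual JVM version; here we pretend Java 17.
JAVA_MAJOR = 17

class LocalRelationSuite(unittest.TestCase):
    @unittest.skipUnless(
        JAVA_MAJOR <= 17,
        "TODO(SPARK-44121): re-enable Arrow-based tests in Java 21",
    )
    def test_transform_local_relation(self):
        # Placeholder body: would exercise Arrow-based LocalRelation transforms.
        self.assertTrue(True)
```

With `JAVA_MAJOR = 21` the test would be reported as skipped rather than failed, mirroring how `assume` turns Java-21 Arrow failures into cancellations.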
[spark] branch master updated: [SPARK-43952][CORE][CONNECT][SQL] Add SparkContext APIs for query cancellation by tag
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 607469b2fd2 [SPARK-43952][CORE][CONNECT][SQL] Add SparkContext APIs for query cancellation by tag
607469b2fd2 is described below

commit 607469b2fd2ee6d70739c5e8b3aca15f67a45cde
Author: Juliusz Sompolski
AuthorDate: Wed Jun 21 09:21:30 2023 +0900

    [SPARK-43952][CORE][CONNECT][SQL] Add SparkContext APIs for query cancellation by tag

    ### What changes were proposed in this pull request?

    Currently, the only way to cancel running Spark jobs is `SparkContext.cancelJobGroup`, using a job group name that was previously set with `SparkContext.setJobGroup`. This is problematic if multiple different parts of the system want to do cancellation and set their own ids. For example, `BroadcastExchangeExec` sets its own job group, which may override the job group set by the user. As a result, if the user cancels the job group they set in the "parent" execution, it will not cancel these broadcast jobs launched from within their jobs. It would also be useful in e.g. Spark Connect to be able to cancel jobs without overriding the jobGroupId, which may be used and needed for other purposes.

    As a solution, this PR adds APIs to set tags on jobs and to cancel jobs by tag:
    * `SparkContext.addJobTag(tag: String): Unit`
    * `SparkContext.removeJobTag(tag: String): Unit`
    * `SparkContext.getJobTags(): Set[String]`
    * `SparkContext.clearJobTags(): Unit`
    * `SparkContext.cancelJobsWithTag(tag: String): Unit`
    * `DAGScheduler.cancelJobsWithTag(tag: String): Unit`

    It also adds `SparkContext.setInterruptOnCancel(interruptOnCancel: Boolean): Unit`, which previously could only be set via `setJobGroup`. The tags are also added to `JobData` and `AppStatusTracker`.

    A new API is added to `SparkStatusTracker`:
    * `SparkStatusTracker.getJobIdsForTag(jobTag: String): Array[Int]`

    The new API is used internally in `BroadcastExchangeExec` instead of cancellation via job group, fixing the issue of broadcast jobs not being cancelled by a user-set jobGroupId. Now, the user-set jobGroupId propagates into broadcast execution. Cancellation in Spark Connect is also switched from job groups to tags.

    ### Why are the changes needed?

    Currently, multiple places may want to cancel a set of jobs, with different scopes.

    ### Does this PR introduce _any_ user-facing change?

    The APIs described above are added.

    ### How was this patch tested?

    Added a test to JobCancellationSuite.

    Closes #41440 from juliuszsompolski/SPARK-43952.

    Authored-by: Juliusz Sompolski
    Signed-off-by: Hyukjin Kwon
---
 .../sql/connect/service/ExecutePlanHolder.scala    |   7 +-
 .../service/SparkConnectStreamHandler.scala        |   8 +-
 .../apache/spark/status/protobuf/store_types.proto |   1 +
 .../main/scala/org/apache/spark/SparkContext.scala |  77 
 .../org/apache/spark/SparkStatusTracker.scala      |  11 ++
 .../org/apache/spark/scheduler/DAGScheduler.scala  |  25 
 .../apache/spark/scheduler/DAGSchedulerEvent.scala |   2 +
 .../apache/spark/status/AppStatusListener.scala    |   7 ++
 .../scala/org/apache/spark/status/LiveEntity.scala |   2 +
 .../scala/org/apache/spark/status/api/v1/api.scala |   1 +
 .../status/protobuf/JobDataWrapperSerializer.scala |   2 +
 ...from_multi_attempt_app_json_1__expectation.json |   1 +
 ...from_multi_attempt_app_json_2__expectation.json |   1 +
 .../job_list_json_expectation.json                 |   3 +
 .../one_job_json_expectation.json                  |   1 +
 ...succeeded_failed_job_list_json_expectation.json |   3 +
 .../succeeded_job_list_json_expectation.json       |   2 +
 .../org/apache/spark/JobCancellationSuite.scala    | 129 -
 .../org/apache/spark/StatusTrackerSuite.scala      |  41 +++
 .../protobuf/KVStoreProtobufSerializerSuite.scala  |   8 +-
 project/MimaExcludes.scala                         |   4 +-
 .../execution/exchange/BroadcastExchangeExec.scala |  11 +-
 .../sql/execution/BroadcastExchangeSuite.scala     |   4 +-
 23 files changed, 335 insertions(+), 16 deletions(-)

diff --git a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/ExecutePlanHolder.scala b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/ExecutePlanHolder.scala
index a3c17b9826e..9bf9df07e01 100644
--- a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/ExecutePlanHolder.scala
+++ b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/ExecutePlanHolder.scala
@@ -27,15 +27,16 @@ case class ExecutePlanHolder(
     sessionHolder: SessionHolder,
     request: proto.ExecutePlanRequest) {
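The core idea — jobs carry a *set* of tags, so several subsystems can cancel independently without clobbering one another's single job-group id — can be sketched as a tiny registry. This is an illustration of the concept only, not the `SparkContext`/`DAGScheduler` implementation:

```python
import threading

class JobRegistry:
    """Sketch of tag-based job cancellation: each job carries a SET of tags,
    so a user tag and an internal tag (e.g. for a broadcast) can coexist."""

    def __init__(self):
        self._lock = threading.Lock()
        self._tags = {}           # job_id -> set of tags
        self._cancelled = set()

    def submit(self, job_id, tags):
        with self._lock:
            self._tags[job_id] = set(tags)

    def cancel_jobs_with_tag(self, tag):
        # Cancels every job carrying the tag, without touching other tags.
        with self._lock:
            for job_id, tags in self._tags.items():
                if tag in tags:
                    self._cancelled.add(job_id)

    def job_ids_for_tag(self, tag):
        # Analogous in spirit to SparkStatusTracker.getJobIdsForTag.
        with self._lock:
            return sorted(j for j, t in self._tags.items() if tag in t)

    def cancelled_jobs(self):
        with self._lock:
            return set(self._cancelled)
```

Because a broadcast job can carry both the user's tag and an internal one, cancelling by the user's tag reaches it — which a single overwritable job-group id could not guarantee.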
[spark] branch master updated: [SPARK-44012][SS] KafkaDataConsumer to print some read status
This is an automated email from the ASF dual-hosted git repository. kabhwan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ef8f22e53d3 [SPARK-44012][SS] KafkaDataConsumer to print some read status ef8f22e53d3 is described below commit ef8f22e53d31225b3429e2f24ca588d113fc7462 Author: Siying Dong AuthorDate: Wed Jun 21 07:33:46 2023 +0900 [SPARK-44012][SS] KafkaDataConsumer to print some read status ### What changes were proposed in this pull request? In the end of each KafkaDataConsumer, it logs some stats. Here is an sample log line: 23/06/08 23:48:14 INFO KafkaDataConsumer: From Kafka topicPartition=topic-121-2 groupId=spark-kafka-source-623fa0a8-04a5-4f34-a9ad-adbf31494e85-711383366-executor read 1 records, taking 504554479 nanos, during time span of 504620999 nanos ### Why are the changes needed? For each task, Kafka source should report fraction of time spent in KafkaConsumer to fetch records. It should also report overall read bandwidth (bytes or records read / time spent fetching). This will be useful in verifying if fetching is the bottleneck. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? 1. Run unit tests and validate log line is correct 2. Run some benchmarks and see it doesn't show up much in CPU profiling. Closes #41525 from siying/kafka_logging2. 
Authored-by: Siying Dong Signed-off-by: Jungtaek Lim --- .../sql/kafka010/consumer/KafkaDataConsumer.scala | 55 -- 1 file changed, 51 insertions(+), 4 deletions(-) diff --git a/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/consumer/KafkaDataConsumer.scala b/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/consumer/KafkaDataConsumer.scala index d88e9821489..a9e394d3c88 100644 --- a/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/consumer/KafkaDataConsumer.scala +++ b/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/consumer/KafkaDataConsumer.scala @@ -258,6 +258,17 @@ private[kafka010] class KafkaDataConsumer( */ private val fetchedRecord: FetchedRecord = FetchedRecord(null, UNKNOWN_OFFSET) + // Total duration spent on reading from Kafka + private var totalTimeReadNanos: Long = 0 + // Number of times we poll Kafka consumers. + private var numPolls: Long = 0 + // Number of times we poll Kafka consumers. + private var numRecordsPolled: Long = 0 + // Total number of records fetched from Kafka + private var totalRecordsRead: Long = 0 + // Starting timestamp when the consumer is created. + private var startTimestampNano: Long = System.nanoTime() + /** * Get the record for the given offset if available. * @@ -343,6 +354,7 @@ private[kafka010] class KafkaDataConsumer( } if (isFetchComplete) { + totalRecordsRead += 1 fetchedRecord.record } else { fetchedData.reset() @@ -356,7 +368,9 @@ private[kafka010] class KafkaDataConsumer( */ def getAvailableOffsetRange(): AvailableOffsetRange = runUninterruptiblyIfPossible { val consumer = getOrRetrieveConsumer() -consumer.getAvailableOffsetRange() +timeNanos { + consumer.getAvailableOffsetRange() +} } def getNumOffsetOutOfRange(): Long = offsetOutOfRange @@ -367,6 +381,17 @@ private[kafka010] class KafkaDataConsumer( * must call method after using the instance to make sure resources are not leaked. 
*/ def release(): Unit = { +val kafkaMeta = _consumer + .map(c => s"topicPartition=${c.topicPartition} groupId=${c.groupId}") + .getOrElse("") +val walTime = System.nanoTime() - startTimestampNano + +logInfo( + s"From Kafka $kafkaMeta read $totalRecordsRead records through $numPolls polls (polled " + + s" out $numRecordsPolled records), taking $totalTimeReadNanos nanos, during time span of " + + s"$walTime nanos." +) + releaseConsumer() releaseFetchedData() } @@ -394,7 +419,9 @@ private[kafka010] class KafkaDataConsumer( consumer: InternalKafkaConsumer, offset: Long, untilOffset: Long): Long = { -val range = consumer.getAvailableOffsetRange() +val range = timeNanos { + consumer.getAvailableOffsetRange() +} logWarning(s"Some data may be lost. Recovering from the earliest offset: ${range.earliest}") val topicPartition = consumer.topicPartition @@ -548,7 +575,11 @@ private[kafka010] class KafkaDataConsumer( fetchedData: FetchedData, offset: Long, pollTimeoutMs: Long): Unit = { -val (records, offsetAfterPoll, range) = consumer.fetch(offset, pollTimeoutMs) +val (records, offsetAfterPoll, range) = timeNanos { + consumer.fetch(offset, pollTimeoutMs) +} +numPolls += 1 +
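The patch above times every Kafka fetch through a `timeNanos` wrapper, counts polls and records, and logs one summary line when the consumer is released. A minimal Python sketch of that accumulate-and-report pattern (the class and method names here are illustrative, not Spark's API):

```python
import time


class TimedReader:
    """Accumulates time spent in an underlying fetch call and reports a
    summary when released -- a sketch of the instrumentation pattern the
    KafkaDataConsumer patch adds (names are illustrative)."""

    def __init__(self, fetch_fn):
        self._fetch_fn = fetch_fn
        self._total_time_read_nanos = 0
        self._num_polls = 0
        self._total_records_read = 0
        self._start_nanos = time.monotonic_ns()

    def _time_nanos(self, fn, *args):
        # Time a single call and fold the duration into the running total.
        start = time.monotonic_ns()
        try:
            return fn(*args)
        finally:
            self._total_time_read_nanos += time.monotonic_ns() - start

    def get(self, offset):
        self._num_polls += 1
        record = self._time_nanos(self._fetch_fn, offset)
        if record is not None:
            self._total_records_read += 1
        return record

    def release(self):
        # One summary line on release, like the log line in the commit message.
        wall_time = time.monotonic_ns() - self._start_nanos
        return (f"read {self._total_records_read} records through "
                f"{self._num_polls} polls, taking {self._total_time_read_nanos} "
                f"nanos, during time span of {wall_time} nanos")
```

The `try`/`finally` in `_time_nanos` mirrors why a wrapper is used: the elapsed time is recorded even when the wrapped fetch throws.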
[spark] branch master updated (68b30053f78 -> 66f25e31403)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 68b30053f78 [SPARK-43939][CONNECT][PYTHON] Add try_* functions to Scala and Python add 66f25e31403 [SPARK-43876][SQL] Enable fast hashmap for distinct queries No new revisions were added by this update. Summary of changes: .../org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-43939][CONNECT][PYTHON] Add try_* functions to Scala and Python
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 68b30053f78 [SPARK-43939][CONNECT][PYTHON] Add try_* functions to Scala and Python 68b30053f78 is described below commit 68b30053f786e8178e6bdba736734e91adb51088 Author: panbingkun AuthorDate: Wed Jun 21 00:38:22 2023 +0800 [SPARK-43939][CONNECT][PYTHON] Add try_* functions to Scala and Python ### What changes were proposed in this pull request? Add following functions: - try_add - try_avg - try_divide - try_element_at - try_multiply - try_subtract - try_sum - try_to_binary - try_to_number - try_to_timestamp to: - Scala API - Python API - Spark Connect Scala Client - Spark Connect Python Client ### Why are the changes needed? for parity ### Does this PR introduce _any_ user-facing change? Yes, new functions. ### How was this patch tested? - Add New UT. Closes #41653 from panbingkun/SPARK-43939. 
Authored-by: panbingkun Signed-off-by: Ruifeng Zheng --- .../scala/org/apache/spark/sql/functions.scala | 115 +++ .../apache/spark/sql/PlanGenerationTestSuite.scala | 52 .../explain-results/function_try_add.explain | 2 + .../explain-results/function_try_avg.explain | 2 + .../explain-results/function_try_divide.explain| 2 + .../function_try_element_at_array.explain | 2 + .../function_try_element_at_map.explain| 2 + .../explain-results/function_try_multiply.explain | 2 + .../explain-results/function_try_subtract.explain | 2 + .../explain-results/function_try_sum.explain | 2 + .../explain-results/function_try_to_binary.explain | 2 + .../function_try_to_binary_without_format.explain | 2 + .../explain-results/function_try_to_number.explain | 2 + .../function_try_to_timestamp.explain | 2 + ...unction_try_to_timestamp_without_format.explain | 2 + .../query-tests/queries/function_try_add.json | 29 ++ .../query-tests/queries/function_try_add.proto.bin | Bin 0 -> 183 bytes .../query-tests/queries/function_try_avg.json | 25 ++ .../query-tests/queries/function_try_avg.proto.bin | Bin 0 -> 176 bytes .../query-tests/queries/function_try_divide.json | 29 ++ .../queries/function_try_divide.proto.bin | Bin 0 -> 186 bytes .../queries/function_try_element_at_array.json | 29 ++ .../function_try_element_at_array.proto.bin| Bin 0 -> 190 bytes .../queries/function_try_element_at_map.json | 29 ++ .../queries/function_try_element_at_map.proto.bin | Bin 0 -> 190 bytes .../query-tests/queries/function_try_multiply.json | 29 ++ .../queries/function_try_multiply.proto.bin| Bin 0 -> 188 bytes .../query-tests/queries/function_try_subtract.json | 29 ++ .../queries/function_try_subtract.proto.bin| Bin 0 -> 188 bytes .../query-tests/queries/function_try_sum.json | 25 ++ .../query-tests/queries/function_try_sum.proto.bin | Bin 0 -> 176 bytes .../queries/function_try_to_binary.json| 29 ++ .../queries/function_try_to_binary.proto.bin | Bin 0 -> 194 bytes .../function_try_to_binary_without_format.json 
| 25 ++ ...function_try_to_binary_without_format.proto.bin | Bin 0 -> 182 bytes .../queries/function_try_to_number.json| 29 ++ .../queries/function_try_to_number.proto.bin | Bin 0 -> 194 bytes .../queries/function_try_to_timestamp.json | 29 ++ .../queries/function_try_to_timestamp.proto.bin| Bin 0 -> 192 bytes .../function_try_to_timestamp_without_format.json | 25 ++ ...ction_try_to_timestamp_without_format.proto.bin | Bin 0 -> 185 bytes .../source/reference/pyspark.sql/functions.rst | 10 + python/pyspark/sql/connect/functions.py| 76 + python/pyspark/sql/functions.py| 341 + .../scala/org/apache/spark/sql/functions.scala | 137 + .../org/apache/spark/sql/DateFunctionsSuite.scala | 11 + .../org/apache/spark/sql/MathFunctionsSuite.scala | 91 ++ .../apache/spark/sql/StringFunctionsSuite.scala| 17 + 48 files changed, 1237 insertions(+) diff --git a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala index a3f4a273661..d258abcecfa 100644 --- a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala +++ b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala @@ -1807,6 +1807,58 @@ object functions { */ def sqrt(colName: String): Column = sqrt(Column(colName)) + /** + * Returns the sum of `left` and `right` and the result is
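The `try_*` family added above follows "try" semantics: instead of raising on bad input (division by zero, out-of-range index), the function evaluates to NULL. A plain-Python sketch of that behavior contract — not the PySpark API itself, which operates on Columns:

```python
from typing import Optional


def try_divide(left: Optional[float], right: Optional[float]) -> Optional[float]:
    # "Try" semantics: return None (SQL NULL) instead of raising
    # when the operation cannot be evaluated.
    if left is None or right is None or right == 0:
        return None
    return left / right


def try_element_at(arr, index):
    # SQL element_at is 1-based; index 0 and out-of-range indices
    # yield NULL under try semantics; negative indices count from the end.
    if arr is None or index is None or index == 0 or abs(index) > len(arr):
        return None
    return arr[index - 1] if index > 0 else arr[index]
```

The same NULL-instead-of-error contract applies across the whole family (`try_add`, `try_sum`, `try_to_timestamp`, and so on).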
[spark] branch master updated: [SPARK-44022][BUILD] Enforce max bytecode version on Maven dependencies
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 27e1e56d54b [SPARK-44022][BUILD] Enforce max bytecode version on Maven dependencies 27e1e56d54b is described below commit 27e1e56d54b1c6986369469b27774fa7bc267cf4 Author: liangbowen AuthorDate: Tue Jun 20 08:22:30 2023 -0700 [SPARK-44022][BUILD] Enforce max bytecode version on Maven dependencies ### What changes were proposed in this pull request? - enforce Java's max bytecode version on Maven dependencies, using the enforceBytecodeVersion enforcer rule (https://www.mojohaus.org/extra-enforcer-rules/enforceBytecodeVersion.html) - exclude `org.threeten:threeten-extra` in the enforcer rule, as its package-info.class requires bytecode version 53 but has no side effects on other classes ### Why are the changes needed? - to avoid introducing dependencies that require a higher Java version ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #41546 from bowenliang123/enforce-bytecode-version. Authored-by: liangbowen Signed-off-by: Dongjoon Hyun --- pom.xml | 18 ++ 1 file changed, 18 insertions(+) diff --git a/pom.xml b/pom.xml index c322112fbc6..74f64da9fd3 100644 --- a/pom.xml +++ b/pom.xml @@ -2782,6 +2782,17 @@ true + +${java.version} +test +provided + + + org.threeten:threeten-extra + + @@ -2797,6 +2808,13 @@ + + + org.codehaus.mojo + extra-enforcer-rules + 1.7.0 + + org.codehaus.mojo - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
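The enforcer rule above works by reading each dependency class file's header: bytes 0-3 are the magic `0xCAFEBABE`, bytes 4-5 the minor version, and bytes 6-7 the major bytecode version (52 = Java 8, 53 = Java 9). A sketch of that check in Python (function names are illustrative):

```python
import struct

JAVA_CLASS_MAGIC = 0xCAFEBABE


def class_major_version(class_bytes: bytes) -> int:
    """Return the major bytecode version declared in a .class file header.

    Class file layout: u4 magic, u2 minor_version, u2 major_version,
    all big-endian."""
    magic, _minor, major = struct.unpack(">IHH", class_bytes[:8])
    if magic != JAVA_CLASS_MAGIC:
        raise ValueError("not a Java class file")
    return major


def exceeds_max(class_bytes: bytes, max_major: int) -> bool:
    # This is the predicate the enforcer applies to every class in
    # every dependency jar: reject when the declared version is too new.
    return class_major_version(class_bytes) > max_major
```

This also shows why `threeten-extra` could be safely excluded: the version is declared per class file, so one version-53 `package-info.class` does not affect the other classes in the jar.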
[spark] branch master updated: [SPARK-44105][SQL] `LastNonNull` should be lazily resolved
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 4c15fa784dc [SPARK-44105][SQL] `LastNonNull` should be lazily resolved 4c15fa784dc is described below commit 4c15fa784dc268ac0844771585677a90de2d064f Author: Ruifeng Zheng AuthorDate: Tue Jun 20 07:49:24 2023 -0700 [SPARK-44105][SQL] `LastNonNull` should be lazily resolved ### What changes were proposed in this pull request? `LastNonNull` should be lazily resolved ### Why are the changes needed? to fix https://github.com/apache/spark/pull/41670/files#r1234805869 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing GA and manually check Closes #41672 from zhengruifeng/ps_fix_last_not_null. Authored-by: Ruifeng Zheng Signed-off-by: Dongjoon Hyun --- .../apache/spark/sql/catalyst/expressions/windowExpressions.scala | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala index 1b9d232bf8a..50c98c01645 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala @@ -1164,19 +1164,19 @@ case class LastNonNull(input: Expression) override def dataType: DataType = input.dataType - private val last = AttributeReference("last", dataType, nullable = true)() + private lazy val last = AttributeReference("last", dataType, nullable = true)() override def aggBufferAttributes: Seq[AttributeReference] = last :: Nil - override val initialValues: Seq[Expression] = Seq(Literal.create(null, dataType)) + override lazy val initialValues: Seq[Expression] = 
Seq(Literal.create(null, dataType)) - override val updateExpressions: Seq[Expression] = { + override lazy val updateExpressions: Seq[Expression] = { Seq( /* last = */ If(IsNull(input), last, input) ) } - override val evaluateExpression: Expression = last + override lazy val evaluateExpression: Expression = last override def prettyName: String = "last_non_null" - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
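The fix above replaces eager `val`s with `lazy val`s so that nothing touches `input.dataType` while the child expression is still unresolved. The same eager-vs-deferred distinction can be illustrated in Python with `functools.cached_property` (an analogue only, not Spark code):

```python
from functools import cached_property


class Unresolved:
    """Stand-in for a not-yet-resolved child expression: asking for its
    data type is an error, just like calling dataType on an unresolved
    Catalyst expression."""
    @property
    def data_type(self):
        raise RuntimeError("expression not yet resolved")


class LastNonNull:
    """If `buffer` were computed eagerly in __init__, construction would
    fail for an unresolved child. Deferring it (like Scala's `lazy val`)
    means child.data_type is only evaluated on first access."""

    def __init__(self, child):
        self.child = child  # safe: does not touch child.data_type

    @cached_property
    def buffer(self):
        # Evaluated once, on first access, then cached -- the Python
        # analogue of `lazy val last = AttributeReference(...)`.
        return ("last", self.child.data_type)
```

Construction succeeds even with an unresolved child; only reading `buffer` before resolution fails, which matches when the analyzer actually needs the aggregation buffer.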
[spark] branch master updated: [SPARK-44104][BUILD] Enabled `protobuf` module mima check for Spark 3.5.0
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new bfad72b0a86 [SPARK-44104][BUILD] Enabled `protobuf` module mima check for Spark 3.5.0 bfad72b0a86 is described below commit bfad72b0a8689f7a3361785cc5004030bc94da3d Author: yangjie01 AuthorDate: Tue Jun 20 07:45:49 2023 -0700 [SPARK-44104][BUILD] Enabled `protobuf` module mima check for Spark 3.5.0 ### What changes were proposed in this pull request? This PR adds a MiMa check for the `protobuf` module for Apache Spark 3.5.0 ### Why are the changes needed? The `protobuf` module is a new module introduced in Spark 3.4.0, which includes some client APIs, so it should be added to Spark 3.5.0's MiMa check ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass GitHub Actions - Manually checked: ``` dev/mima ``` and ``` dev/change-scala-version.sh 2.13 dev/mima -Pscala-2.13 ``` Closes #41671 from LuciferYang/SPARK-44104. 
Authored-by: yangjie01 Signed-off-by: Dongjoon Hyun --- project/MimaExcludes.scala | 4 project/SparkBuild.scala | 2 +- 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/project/MimaExcludes.scala b/project/MimaExcludes.scala index bba20534f44..7cac416838d 100644 --- a/project/MimaExcludes.scala +++ b/project/MimaExcludes.scala @@ -93,6 +93,10 @@ object MimaExcludes { ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.ErrorInfo$"), ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.ErrorSubInfo$"), +// SPARK-44104: shaded protobuf code and Apis with parameters relocated + ProblemFilters.exclude[Problem]("org.sparkproject.spark_protobuf.protobuf.*"), + ProblemFilters.exclude[Problem]("org.apache.spark.sql.protobuf.utils.SchemaConverters.*"), + (problem: Problem) => problem match { case MissingClassProblem(cls) => !cls.fullName.startsWith("org.sparkproject.jpmml") && !cls.fullName.startsWith("org.sparkproject.dmg.pmml") diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala index 607daa67138..761b8f905f5 100644 --- a/project/SparkBuild.scala +++ b/project/SparkBuild.scala @@ -415,7 +415,7 @@ object SparkBuild extends PomBuild { val mimaProjects = allProjects.filterNot { x => Seq( spark, hive, hiveThriftServer, repl, networkCommon, networkShuffle, networkYarn, - unsafe, tags, tokenProviderKafka010, sqlKafka010, connectCommon, connect, connectClient, protobuf, + unsafe, tags, tokenProviderKafka010, sqlKafka010, connectCommon, connect, connectClient, commonUtils, sqlApi ).contains(x) } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
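The MiMa excludes in the diff above are wildcard filters: a reported binary-compatibility problem is suppressed when its fully qualified name matches an excluded pattern (here, the shaded protobuf classes and the relocated `SchemaConverters` APIs). A small sketch of that wildcard suppression (patterns copied from the diff; the function name is illustrative):

```python
from fnmatch import fnmatchcase

# Patterns from the MimaExcludes.scala change above.
EXCLUDED_PATTERNS = [
    "org.sparkproject.spark_protobuf.protobuf.*",
    "org.apache.spark.sql.protobuf.utils.SchemaConverters.*",
]


def is_reportable(problem_name: str, patterns=EXCLUDED_PATTERNS) -> bool:
    """A compatibility problem is reported only if no exclusion matches.

    fnmatchcase is used so matching stays case-sensitive on every
    platform, as class names are."""
    return not any(fnmatchcase(problem_name, p) for p in patterns)
```

Anything under the shaded/relocated namespaces is filtered out, while problems in the rest of the `protobuf` module's public API still fail the check.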
[spark] branch master updated: [SPARK-43929][SPARK-44073][SQL][PYTHON][CONNECT][FOLLOWUP] Add extract, date_part, datepart to Scala, Python and Connect API
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8dc02863b92 [SPARK-43929][SPARK-44073][SQL][PYTHON][CONNECT][FOLLOWUP] Add extract, date_part, datepart to Scala, Python and Connect API 8dc02863b92 is described below commit 8dc02863b926b9e0780b994f9ee6c5c1058d49a0 Author: Jiaan Geng AuthorDate: Tue Jun 20 21:47:05 2023 +0800 [SPARK-43929][SPARK-44073][SQL][PYTHON][CONNECT][FOLLOWUP] Add extract, date_part, datepart to Scala, Python and Connect API ### What changes were proposed in this pull request? This PR follows up https://github.com/apache/spark/pull/41636 and https://github.com/apache/spark/pull/41651 and adds extract, date_part, datepart to the Scala, Python and Connect APIs. ### Why are the changes needed? To add extract, date_part, datepart to the Scala, Python and Connect APIs. ### Does this PR introduce _any_ user-facing change? 'No'. New feature. ### How was this patch tested? New test cases. Closes #41667 from beliefer/datetime_functions_followup. 
Authored-by: Jiaan Geng Signed-off-by: Ruifeng Zheng --- .../scala/org/apache/spark/sql/functions.scala | 50 ++ .../apache/spark/sql/PlanGenerationTestSuite.scala | 12 +++ .../explain-results/function_date_part.explain | 2 + .../explain-results/function_datepart.explain | 2 + .../explain-results/function_extract.explain | 2 + .../query-tests/queries/function_date_part.json| 29 ++ .../queries/function_date_part.proto.bin | Bin 0 -> 133 bytes .../query-tests/queries/function_datepart.json | 29 ++ .../queries/function_datepart.proto.bin| Bin 0 -> 132 bytes .../query-tests/queries/function_extract.json | 29 ++ .../query-tests/queries/function_extract.proto.bin | Bin 0 -> 131 bytes .../source/reference/pyspark.sql/functions.rst | 3 + python/pyspark/sql/connect/functions.py| 21 python/pyspark/sql/functions.py| 110 + .../scala/org/apache/spark/sql/functions.scala | 41 .../org/apache/spark/sql/DateFunctionsSuite.scala | 68 + 16 files changed, 398 insertions(+) diff --git a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala index 2ac20bd5911..a3f4a273661 100644 --- a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala +++ b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala @@ -4438,6 +4438,56 @@ object functions { */ def hour(e: Column): Column = Column.fn("hour", e) + /** + * Extracts a part of the date/timestamp or interval source. + * + * @param field + * selects which part of the source should be extracted. + * @param source + * a date/timestamp or interval column from where `field` should be extracted. + * @return + * a part of the date/timestamp or interval source + * @group datetime_funcs + * @since 3.5.0 + */ + def extract(field: Column, source: Column): Column = { +Column.fn("extract", field, source) + } + + /** + * Extracts a part of the date/timestamp or interval source. 
+ * + * @param field + * selects which part of the source should be extracted, and supported string values are as + * same as the fields of the equivalent function `extract`. + * @param source + * a date/timestamp or interval column from where `field` should be extracted. + * @return + * a part of the date/timestamp or interval source + * @group datetime_funcs + * @since 3.5.0 + */ + def date_part(field: Column, source: Column): Column = { +Column.fn("date_part", field, source) + } + + /** + * Extracts a part of the date/timestamp or interval source. + * + * @param field + * selects which part of the source should be extracted, and supported string values are as + * same as the fields of the equivalent function `extract`. + * @param source + * a date/timestamp or interval column from where `field` should be extracted. + * @return + * a part of the date/timestamp or interval source + * @group datetime_funcs + * @since 3.5.0 + */ + def datepart(field: Column, source: Column): Column = { +Column.fn("datepart", field, source) + } + /** * Returns the last day of the month which the given date belongs to. For example, input * "2015-07-27" returns "2015-07-31" since July 31 is the last day of the month in July 2015. diff --git a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala
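As the Scaladoc above describes, `extract`, `date_part`, and `datepart` all pull one named field out of a date/timestamp, differing only in spelling and field-argument convention. A plain-`datetime` sketch of the field dispatch — a behavioral analogue, not the PySpark implementation, with only a few fields shown:

```python
from datetime import datetime

# Field names follow the SQL EXTRACT convention; this is a subset.
_FIELDS = {
    "YEAR": lambda ts: ts.year,
    "MONTH": lambda ts: ts.month,
    "DAY": lambda ts: ts.day,
    "HOUR": lambda ts: ts.hour,
    "MINUTE": lambda ts: ts.minute,
    "SECOND": lambda ts: ts.second,
}


def date_part(field: str, source: datetime):
    """Analogue of extract(field FROM source): field names are matched
    case-insensitively, and an unsupported field is an error."""
    try:
        return _FIELDS[field.upper()](source)
    except KeyError:
        raise ValueError(f"unsupported field: {field}") from None
```

In Spark the `field` argument is a Column rather than a plain string, but the lookup-by-field-name semantics are the same across all three function spellings.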
[spark] branch master updated (0865c0db923 -> 35b3a18ff04)
This is an automated email from the ASF dual-hosted git repository. weichenxu123 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 0865c0db923 [SPARK-43944][SPARK-43942][SQL][FOLLOWUP] Directly leverage `UnresolvedFunction` for functions `startswith`/`endswith`/`contains` add 35b3a18ff04 [SPARK-44100][ML][CONNECT][PYTHON] Move namespace from `pyspark.mlv2` to `pyspark.ml.connect` No new revisions were added by this update. Summary of changes: dev/sparktestsupport/modules.py| 18 +- python/mypy.ini| 3 -- python/pyspark/ml/connect/__init__.py | 24 + python/pyspark/{mlv2 => ml/connect}/base.py| 2 +- .../pyspark/{mlv2 => ml/connect}/classification.py | 6 ++-- python/pyspark/{mlv2 => ml/connect}/evaluation.py | 6 ++-- python/pyspark/{mlv2 => ml/connect}/feature.py | 6 ++-- python/pyspark/{mlv2 => ml/connect}/io_utils.py| 0 python/pyspark/{mlv2 => ml/connect}/pipeline.py| 8 ++--- python/pyspark/{mlv2 => ml/connect}/summarizer.py | 2 +- python/pyspark/{mlv2 => ml/connect}/util.py| 0 .../tests/connect/test_connect_classification.py} | 6 ++-- .../tests/connect/test_connect_evaluation.py} | 4 +-- .../tests/connect/test_connect_feature.py} | 4 +-- .../tests/connect/test_connect_pipeline.py}| 6 ++-- .../tests/connect/test_connect_summarizer.py} | 4 +-- .../connect/test_legacy_mode_classification.py}| 6 ++-- .../tests/connect/test_legacy_mode_evaluation.py} | 4 +-- .../tests/connect/test_legacy_mode_feature.py} | 4 +-- .../tests/connect/test_legacy_mode_pipeline.py}| 10 +++--- .../tests/connect/test_legacy_mode_summarizer.py} | 4 +-- python/pyspark/mlv2/__init__.py| 40 -- 22 files changed, 75 insertions(+), 92 deletions(-) rename python/pyspark/{mlv2 => ml/connect}/base.py (99%) rename python/pyspark/{mlv2 => ml/connect}/classification.py (98%) rename python/pyspark/{mlv2 => ml/connect}/evaluation.py (95%) rename python/pyspark/{mlv2 => ml/connect}/feature.py (97%) rename python/pyspark/{mlv2 => ml/connect}/io_utils.py (100%) 
rename python/pyspark/{mlv2 => ml/connect}/pipeline.py (96%) rename python/pyspark/{mlv2 => ml/connect}/summarizer.py (98%) rename python/pyspark/{mlv2 => ml/connect}/util.py (100%) copy python/pyspark/{mlv2/tests/connect/test_parity_classification.py => ml/tests/connect/test_connect_classification.py} (84%) rename python/pyspark/{mlv2/tests/connect/test_parity_evaluation.py => ml/tests/connect/test_connect_evaluation.py} (89%) rename python/pyspark/{mlv2/tests/connect/test_parity_feature.py => ml/tests/connect/test_connect_feature.py} (89%) rename python/pyspark/{mlv2/tests/connect/test_parity_classification.py => ml/tests/connect/test_connect_pipeline.py} (85%) rename python/pyspark/{mlv2/tests/connect/test_parity_summarizer.py => ml/tests/connect/test_connect_summarizer.py} (88%) rename python/pyspark/{mlv2/tests/test_classification.py => ml/tests/connect/test_legacy_mode_classification.py} (98%) rename python/pyspark/{mlv2/tests/test_evaluation.py => ml/tests/connect/test_legacy_mode_evaluation.py} (94%) rename python/pyspark/{mlv2/tests/test_feature.py => ml/tests/connect/test_legacy_mode_feature.py} (97%) rename python/pyspark/{mlv2/tests/test_pipeline.py => ml/tests/connect/test_legacy_mode_pipeline.py} (95%) rename python/pyspark/{mlv2/tests/test_summarizer.py => ml/tests/connect/test_legacy_mode_summarizer.py} (94%) delete mode 100644 python/pyspark/mlv2/__init__.py - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-43944][SPARK-43942][SQL][FOLLOWUP] Directly leverage `UnresolvedFunction` for functions `startswith`/`endswith`/`contains`
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 0865c0db923 [SPARK-43944][SPARK-43942][SQL][FOLLOWUP] Directly leverage `UnresolvedFunction` for functions `startswith`/`endswith`/`contains` 0865c0db923 is described below commit 0865c0db923eadb840dd9af5834dc72ba19e43c4 Author: Ruifeng Zheng AuthorDate: Tue Jun 20 18:00:33 2023 +0800 [SPARK-43944][SPARK-43942][SQL][FOLLOWUP] Directly leverage `UnresolvedFunction` for functions `startswith`/`endswith`/`contains` ### What changes were proposed in this pull request? Directly leverage UnresolvedFunction for `startswith`/`endswith`/`contains` ### Why are the changes needed? to be more consistent with existing functions, like [ceil](https://github.com/apache/spark/blob/6b36a9368d6e97f7f1f94c4ca7f6ee76dcd0015f/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2242-L2260), [floor](https://github.com/apache/spark/blob/6b36a9368d6e97f7f1f94c4ca7f6ee76dcd0015f/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2397-L2416), [lpad](https://github.com/apache/spark/blob/6b36a9368d6e97f7f1f94c4ca7f6ee76dcd0015f/sql/core/src/main/scala/org/apa [...] ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing GA Closes #41669 from zhengruifeng/use_unresolved_func. 
Authored-by: Ruifeng Zheng Signed-off-by: Ruifeng Zheng --- .../main/scala/org/apache/spark/sql/functions.scala| 18 ++ 1 file changed, 6 insertions(+), 12 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala index 68b81810da4..7c3f65e2495 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala @@ -4051,10 +4051,8 @@ object functions { * @group string_funcs * @since 3.5.0 */ - def endswith(str: Column, suffix: Column): Column = { -// 'EndsWith' expression only supports StringType, -// use 'call_udf' to support both StringType and BinaryType. -call_udf("endswith", str, suffix) + def endswith(str: Column, suffix: Column): Column = withExpr { +UnresolvedFunction(Seq("endswith"), Seq(str.expr, suffix.expr), isDistinct = false) } /** @@ -4065,10 +4063,8 @@ object functions { * @group string_funcs * @since 3.5.0 */ - def startswith(str: Column, prefix: Column): Column = { -// 'StartsWith' expression only supports StringType, -// use 'call_udf' to support both StringType and BinaryType. -call_udf("startswith", str, prefix) + def startswith(str: Column, prefix: Column): Column = withExpr { +UnresolvedFunction(Seq("startswith"), Seq(str.expr, prefix.expr), isDistinct = false) } /** @@ -4145,10 +4141,8 @@ object functions { * @group string_funcs * @since 3.5.0 */ - def contains(left: Column, right: Column): Column = { -// 'Contains' expression only supports StringType -// use 'call_udf' to support both StringType and BinaryType. -call_udf("contains", left, right) + def contains(left: Column, right: Column): Column = withExpr { +UnresolvedFunction(Seq("contains"), Seq(left.expr, right.expr), isDistinct = false) } /** - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
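Routing the wrappers through `UnresolvedFunction` means the Scala side only records a function name and its arguments; the analyzer resolves the call by name later, which is why the same wrapper can serve both STRING and BINARY inputs. A toy Python registry showing that by-name dispatch (registry contents are illustrative, not Spark's function registry):

```python
# By-name function resolution: the wrapper records a name plus arguments,
# and a registry resolves the call later -- analogous to building an
# UnresolvedFunction node instead of binding a typed expression eagerly.
REGISTRY = {
    "startswith": lambda s, prefix: s.startswith(prefix),
    "endswith": lambda s, suffix: s.endswith(suffix),
    "contains": lambda s, sub: sub in s,
}


def call_function(name, *args):
    """Look the function up by name at call time; unknown names fail
    at resolution, not at wrapper-definition time."""
    fn = REGISTRY.get(name)
    if fn is None:
        raise ValueError(f"undefined function: {name}")
    return fn(*args)
```

Because dispatch is by name, the registered implementation decides which argument types it accepts, so `str` and `bytes` both work without any special-casing in the wrapper.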
[spark] branch master updated: [SPARK-43944][SQL][CONNECT][PYTHON][FOLLOW-UP] Make `startswith` & `endswith` support binary type
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 6b36a9368d6 [SPARK-43944][SQL][CONNECT][PYTHON][FOLLOW-UP] Make `startswith` & `endswith` support binary type 6b36a9368d6 is described below commit 6b36a9368d6e97f7f1f94c4ca7f6ee76dcd0015f Author: Ruifeng Zheng AuthorDate: Tue Jun 20 14:08:56 2023 +0800 [SPARK-43944][SQL][CONNECT][PYTHON][FOLLOW-UP] Make `startswith` & `endswith` support binary type ### What changes were proposed in this pull request? Make `startswith`, `endswith` support binary type: 1, in Connect API, `startswith` & `endswith` actually already support binary type; 2, in vanilla API, support binary type via `call_udf` ### Why are the changes needed? for parity ### Does this PR introduce _any_ user-facing change? yes ### How was this patch tested? added ut Closes #41659 from zhengruifeng/sql_func_sw. Lead-authored-by: Ruifeng Zheng Co-authored-by: Ruifeng Zheng Signed-off-by: Ruifeng Zheng --- .../scala/org/apache/spark/sql/functions.scala | 14 +++-- python/pyspark/sql/functions.py| 36 -- .../scala/org/apache/spark/sql/functions.scala | 24 ++- .../apache/spark/sql/StringFunctionsSuite.scala| 14 +++-- 4 files changed, 52 insertions(+), 36 deletions(-) diff --git a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala index 93cf8f521b2..2ac20bd5911 100644 --- a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala +++ b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala @@ -3945,11 +3945,8 @@ object functions { /** * Returns a boolean. The value is True if str ends with suffix. Returns NULL if either input - * expression is NULL. Otherwise, returns False. Both str or suffix must be of STRING type. 
- * - * @note - * Only STRING type is supported in this function, while `endswith` in SQL supports both - * STRING and BINARY. + * expression is NULL. Otherwise, returns False. Both str or suffix must be of STRING or BINARY + * type. * * @group string_funcs * @since 3.5.0 @@ -3959,11 +3956,8 @@ object functions { /** * Returns a boolean. The value is True if str starts with prefix. Returns NULL if either input - * expression is NULL. Otherwise, returns False. Both str or prefix must be of STRING type. - * - * @note - * Only STRING type is supported in this function, while `startswith` in SQL supports both - * STRING and BINARY. + * expression is NULL. Otherwise, returns False. Both str or prefix must be of STRING or BINARY + * type. * * @group string_funcs * @since 3.5.0 diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py index 3eaccdc1ea1..0cfc19615be 100644 --- a/python/pyspark/sql/functions.py +++ b/python/pyspark/sql/functions.py @@ -9660,11 +9660,6 @@ def endswith(str: "ColumnOrName", suffix: "ColumnOrName") -> Column: .. versionadded:: 3.5.0 -Notes -- -Only STRING type is supported in this function, -while `startswith` in SQL supports both STRING and BINARY. 
- Parameters -- str : :class:`~pyspark.sql.Column` or str @@ -9677,6 +9672,19 @@ def endswith(str: "ColumnOrName", suffix: "ColumnOrName") -> Column: >>> df = spark.createDataFrame([("Spark SQL", "Spark",)], ["a", "b"]) >>> df.select(endswith(df.a, df.b).alias('r')).collect() [Row(r=False)] + +>>> df = spark.createDataFrame([("414243", "4243",)], ["e", "f"]) +>>> df = df.select(to_binary("e").alias("e"), to_binary("f").alias("f")) +>>> df.printSchema() +root + |-- e: binary (nullable = true) + |-- f: binary (nullable = true) +>>> df.select(endswith("e", "f"), endswith("f", "e")).show() ++--+--+ +|endswith(e, f)|endswith(f, e)| ++--+--+ +| true| false| ++--+--+ """ return _invoke_function_over_columns("endswith", str, suffix) @@ -9690,11 +9698,6 @@ def startswith(str: "ColumnOrName", prefix: "ColumnOrName") -> Column: .. versionadded:: 3.5.0 -Notes -- -Only STRING type is supported in this function, -while `startswith` in SQL supports both STRING and BINARY. - Parameters -- str : :class:`~pyspark.sql.Column` or str @@ -9707,6 +9710,19 @@ def startswith(str: "ColumnOrName", prefix: "ColumnOrName") -> Column: >>> df = spark.createDataFrame([("Spark SQL", "Spark",)], ["a", "b"]) >>> df.select(startswith(df.a,
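Python's own `startswith`/`endswith` are already type-generic over `str` and `bytes`, which mirrors the STRING/BINARY support this patch enables. A small helper adding SQL NULL propagation on top (the function name is hypothetical, for illustration only):

```python
from typing import Optional, Union

StrOrBytes = Union[str, bytes]


def sql_startswith(s: Optional[StrOrBytes],
                   prefix: Optional[StrOrBytes]) -> Optional[bool]:
    """SQL-style startswith: NULL (None) if either input is NULL,
    otherwise a boolean; works uniformly for str (STRING) and
    bytes (BINARY), like the patched Spark function."""
    if s is None or prefix is None:
        return None
    return s.startswith(prefix)
```

The `b"\x41\x42\x43"` case below corresponds to the `to_binary("414243")` example in the doctest above: the same predicate applies to raw bytes without any string conversion.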