(spark) branch master updated: [SPARK-47969][PYTHON][TESTS][FOLLOWUP] Make Test `test_creation_index` deterministic
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 5d1f976f85fe [SPARK-47969][PYTHON][TESTS][FOLLOWUP] Make Test `test_creation_index` deterministic

5d1f976f85fe is described below

commit 5d1f976f85fe1ee39ca3cc4f0f2e6afa8b43e5ea
Author: Ruifeng Zheng
AuthorDate: Fri May 3 20:42:30 2024 -0700

[SPARK-47969][PYTHON][TESTS][FOLLOWUP] Make Test `test_creation_index` deterministic

### What changes were proposed in this pull request?
Follow-up of https://github.com/apache/spark/pull/46200.

### Why are the changes needed?
There is still non-deterministic code in this test:

```
Traceback (most recent call last):
  File "/home/jenkins/python/pyspark/testing/pandasutils.py", line 91, in _assert_pandas_equal
    assert_frame_equal(
  File "/databricks/python3/lib/python3.11/site-packages/pandas/_testing/asserters.py", line 1257, in assert_frame_equal
    assert_index_equal(
  File "/databricks/python3/lib/python3.11/site-packages/pandas/_testing/asserters.py", line 407, in assert_index_equal
    raise_assert_detail(obj, msg, left, right)
  File "/databricks/python3/lib/python3.11/site-packages/pandas/_testing/asserters.py", line 665, in raise_assert_detail
    raise AssertionError(msg)
AssertionError: DataFrame.index are different

DataFrame.index values are different (75.0 %)
[left]:  DatetimeIndex(['2022-09-02', '2022-09-03', '2022-08-31', '2022-09-05'], dtype='datetime64[ns]', freq=None)
[right]: DatetimeIndex(['2022-08-31', '2022-09-02', '2022-09-03', '2022-09-05'], dtype='datetime64[ns]', freq=None)
```

### Does this PR introduce _any_ user-facing change?
No, test only.

### How was this patch tested?
CI.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46378 from zhengruifeng/ps_test_create_index.
Authored-by: Ruifeng Zheng
Signed-off-by: Dongjoon Hyun
---
 python/pyspark/pandas/tests/frame/test_constructor.py | 16
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/python/pyspark/pandas/tests/frame/test_constructor.py b/python/pyspark/pandas/tests/frame/test_constructor.py
index d7581895c6c9..e093adfa7ba3 100644
--- a/python/pyspark/pandas/tests/frame/test_constructor.py
+++ b/python/pyspark/pandas/tests/frame/test_constructor.py
@@ -269,11 +269,11 @@ class FrameConstructorMixin:
             ps.DataFrame(
                 data=pdf,
                 index=pd.DatetimeIndex(["2022-08-31", "2022-09-02", "2022-09-03", "2022-09-05"]),
-            ),
+            ).sort_index(),
             pd.DataFrame(
                 data=pdf,
                 index=pd.DatetimeIndex(["2022-08-31", "2022-09-02", "2022-09-03", "2022-09-05"]),
-            ),
+            ).sort_index(),
         )
         # test with pd.DataFrame and ps.DatetimeIndex
@@ -281,11 +281,11 @@ class FrameConstructorMixin:
             ps.DataFrame(
                 data=pdf,
                 index=ps.DatetimeIndex(["2022-08-31", "2022-09-02", "2022-09-03", "2022-09-05"]),
-            ),
+            ).sort_index(),
             pd.DataFrame(
                 data=pdf,
                 index=pd.DatetimeIndex(["2022-08-31", "2022-09-02", "2022-09-03", "2022-09-05"]),
-            ),
+            ).sort_index(),
         )
         with ps.option_context("compute.ops_on_diff_frames", True):
@@ -296,13 +296,13 @@ class FrameConstructorMixin:
                     index=pd.DatetimeIndex(
                         ["2022-08-31", "2022-09-02", "2022-09-03", "2022-09-05"]
                     ),
-                ),
+                ).sort_index(),
                 pd.DataFrame(
                     data=pdf,
                     index=pd.DatetimeIndex(
                         ["2022-08-31", "2022-09-02", "2022-09-03", "2022-09-05"]
                     ),
-                ),
+                ).sort_index(),
             )
             # test with ps.DataFrame and ps.DatetimeIndex
@@ -312,13 +312,13 @@ class FrameConstructorMixin:
                 ps.DataFrame(
                     data=pdf,
                     index=ps.DatetimeIndex(
                         ["2022-08-31", "2022-09-02", "2022-09-03", "2022-09-05"]
                     ),
-                ),
+                ).sort_index(),
                 pd.DataFrame(
                     data
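For readers following the fix above: the comparison becomes order-independent once both sides are sorted by index, because a distributed backend may return rows in any order. A minimal self-contained sketch of the idea in plain pandas (the shuffle below only simulates that non-determinism; it is not the pandas-on-Spark code from the patch):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

dates = ["2022-08-31", "2022-09-02", "2022-09-03", "2022-09-05"]
pdf = pd.DataFrame({"a": [1, 2, 3, 4]}, index=pd.DatetimeIndex(dates))

# Simulate the non-deterministic row order a distributed engine may produce.
shuffled = pdf.sample(frac=1.0, random_state=7)

# Comparing the raw frames can fail with "DataFrame.index are different"
# even though they hold the same rows:
# assert_frame_equal(shuffled, pdf)

# Sorting both sides first makes the assertion deterministic.
assert_frame_equal(shuffled.sort_index(), pdf.sort_index())
print("frames match after sort_index()")
```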
(spark) branch master updated: [SPARK-47097][CONNECT][TESTS][FOLLOWUP] Increase timeout to `1 minute` for `interrupt tag` test
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 7f08df4af95d [SPARK-47097][CONNECT][TESTS][FOLLOWUP] Increase timeout to `1 minute` for `interrupt tag` test 7f08df4af95d is described below commit 7f08df4af95d20f3fd056588b5a3cfa5f5c57654 Author: Dongjoon Hyun AuthorDate: Fri May 3 16:54:24 2024 -0700 [SPARK-47097][CONNECT][TESTS][FOLLOWUP] Increase timeout to `1 minute` for `interrupt tag` test ### What changes were proposed in this pull request? This is a follow-up to increase `timeout` from `30s` to `1 minute` like the other timeouts of the same test case. - #45173 ### Why are the changes needed? To reduce the flakiness more. The following is the recent failure on `master` branch. - https://github.com/apache/spark/actions/runs/8944948827/job/24572965877 - https://github.com/apache/spark/actions/runs/8945375279/job/24574263993 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46374 from dongjoon-hyun/SPARK-47097. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../src/test/scala/org/apache/spark/sql/SparkSessionE2ESuite.scala| 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/SparkSessionE2ESuite.scala b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/SparkSessionE2ESuite.scala index b967245d90c2..d1015d55b1df 100644 --- a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/SparkSessionE2ESuite.scala +++ b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/SparkSessionE2ESuite.scala @@ -196,7 +196,7 @@ class SparkSessionE2ESuite extends RemoteSparkSession { // q2 and q3 should be cancelled interrupted.clear() -eventually(timeout(30.seconds), interval(1.seconds)) { +eventually(timeout(1.minute), interval(1.seconds)) { val ids = spark.interruptTag("two") interrupted ++= ids assert(interrupted.length == 2, s"Interrupted operations: $interrupted.") @@ -213,7 +213,7 @@ class SparkSessionE2ESuite extends RemoteSparkSession { // q1 and q4 should be cancelled interrupted.clear() -eventually(timeout(30.seconds), interval(1.seconds)) { +eventually(timeout(1.minute), interval(1.seconds)) { val ids = spark.interruptTag("one") interrupted ++= ids assert(interrupted.length == 2, s"Interrupted operations: $interrupted.") - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
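For context, `eventually(timeout(...), interval(...))` in the test above re-runs its assertion block until it succeeds or the timeout elapses, so raising the timeout only widens the retry window on a slow CI machine. A rough Python sketch of that polling pattern (the helper and the condition below are illustrative only, not Spark Connect APIs):

```python
import time

def eventually(condition, timeout=60.0, interval=1.0):
    """Re-evaluate `condition` until it is truthy or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    last_error = None
    while time.monotonic() < deadline:
        try:
            if condition():
                return
        except AssertionError as exc:  # allow assertions inside the condition
            last_error = exc
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout}s (last error: {last_error})")

# Usage sketch: wait up to 1 minute for a condition that becomes true after ~2 seconds.
start = time.monotonic()
eventually(lambda: time.monotonic() - start > 2.0, timeout=60.0, interval=1.0)
print("condition met")
```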
(spark) branch master updated: [SPARK-48121][K8S] Promote `KubernetesDriverConf` to `DeveloperApi`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c3a462ce2966 [SPARK-48121][K8S] Promote `KubernetesDriverConf` to `DeveloperApi` c3a462ce2966 is described below commit c3a462ce2966d42a3cebf238b809e2c2e2631c08 Author: zhou-jiang AuthorDate: Fri May 3 16:25:38 2024 -0700 [SPARK-48121][K8S] Promote `KubernetesDriverConf` to `DeveloperApi` ### What changes were proposed in this pull request? This PR aims to promote `KubernetesDriverConf` to `DeveloperApi` ### Why are the changes needed? Since Apache Spark Kubernetes Operator requires this, we had better maintain it as a developer API officially from Apache Spark 4.0.0. https://github.com/apache/spark-kubernetes-operator/pull/10 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the CIs ### Was this patch authored or co-authored using generative AI tooling? No Closes #46373 from jiangzho/driver_conf. Authored-by: zhou-jiang Signed-off-by: Dongjoon Hyun --- .../scala/org/apache/spark/deploy/k8s/KubernetesConf.scala| 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala index fda772b737fe..f62204a8a9c0 100644 --- a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala +++ b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala @@ -22,6 +22,7 @@ import io.fabric8.kubernetes.api.model.{LocalObjectReference, LocalObjectReferen import org.apache.commons.lang3.StringUtils import org.apache.spark.{SPARK_VERSION, SparkConf} +import org.apache.spark.annotation.{DeveloperApi, Since, Unstable} import org.apache.spark.deploy.k8s.Config._ import org.apache.spark.deploy.k8s.Constants._ import org.apache.spark.deploy.k8s.features.DriverServiceFeatureStep._ @@ -78,7 +79,15 @@ private[spark] abstract class KubernetesConf(val sparkConf: SparkConf) { def getOption(key: String): Option[String] = sparkConf.getOption(key) } -private[spark] class KubernetesDriverConf( +/** + * :: DeveloperApi :: + * + * Used for K8s operations internally and Spark K8s operator. + */ +@Unstable +@DeveloperApi +@Since("4.0.0") +class KubernetesDriverConf( sparkConf: SparkConf, val appId: String, val mainAppResource: MainAppResource, - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark-kubernetes-operator) branch main updated: [SPARK-48120] Enable autolink to SPARK jira issue
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/spark-kubernetes-operator.git The following commit(s) were added to refs/heads/main by this push: new 91ecc93 [SPARK-48120] Enable autolink to SPARK jira issue 91ecc93 is described below commit 91ecc932096f0f41f395d2b6e935daa075c7d47a Author: Dongjoon Hyun AuthorDate: Fri May 3 15:52:27 2024 -0700 [SPARK-48120] Enable autolink to SPARK jira issue ### What changes were proposed in this pull request? This PR aims to enable `autolink` feature to `SPARK` jira issue like `Apache Spark` repository. ### Why are the changes needed? Since we share the same JIRA project name, we need to link it. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #11 from dongjoon-hyun/SPARK-48120. Lead-authored-by: Dongjoon Hyun Co-authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .asf.yaml | 1 + 1 file changed, 1 insertion(+) diff --git a/.asf.yaml b/.asf.yaml index c7e6ae7..c1409a7 100644 --- a/.asf.yaml +++ b/.asf.yaml @@ -26,6 +26,7 @@ github: merge: false squash: true rebase: true + autolink_jira: SPARK notifications: pullrequests: revi...@spark.apache.org - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (85902880d709 -> b42d235c2930)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 85902880d709 [SPARK-48119][K8S] Promote `KubernetesDriverSpec` to `DeveloperApi` add b42d235c2930 [SPARK-48114][CORE] Precompile template regex to avoid unnecessary work No new revisions were added by this update. Summary of changes: .../src/main/scala/org/apache/spark/ErrorClassesJSONReader.scala | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
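Only the change summary is shown above; the patch itself touches the Scala `ErrorClassesJSONReader`. As a general illustration of the technique named in the title — compiling the template regex once instead of on every call — here is a hypothetical Python sketch (pattern and function names are assumptions, not from the Spark patch):

```python
import re

# Compile the placeholder pattern once, at module load, rather than per call.
_TEMPLATE_PATTERN = re.compile(r"<([a-zA-Z0-9_-]+)>")

def render_error(template: str, params: dict) -> str:
    """Substitute <placeholder> tokens in an error-message template."""
    return _TEMPLATE_PATTERN.sub(
        lambda m: str(params.get(m.group(1), m.group(0))), template
    )

print(render_error("Cannot cast <srcType> to <targetType>.",
                   {"srcType": "STRING", "targetType": "INT"}))
```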
(spark) branch master updated: [SPARK-48119][K8S] Promote `KubernetesDriverSpec` to `DeveloperApi`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 85902880d709 [SPARK-48119][K8S] Promote `KubernetesDriverSpec` to `DeveloperApi` 85902880d709 is described below commit 85902880d709a66ef89bd6a5e0e7f1233f4d4fec Author: zhou-jiang AuthorDate: Fri May 3 15:02:56 2024 -0700 [SPARK-48119][K8S] Promote `KubernetesDriverSpec` to `DeveloperApi` ### What changes were proposed in this pull request? This PR aims to promote ` KubernetesDriverSpec` to `DeveloperApi` ### Why are the changes needed? Since Apache Spark Kubernetes Operator requires this, we had better maintain it as a developer API officially from Apache Spark 4.0.0. https://github.com/apache/spark-kubernetes-operator/pull/10 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the CIs ### Was this patch authored or co-authored using generative AI tooling? No Closes #46371 from jiangzho/k8s_dev_apis. Authored-by: zhou-jiang Signed-off-by: Dongjoon Hyun --- .../org/apache/spark/deploy/k8s/KubernetesDriverSpec.scala | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesDriverSpec.scala b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesDriverSpec.scala index a603cb08ba9a..0fd2cf16e74e 100644 --- a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesDriverSpec.scala +++ b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesDriverSpec.scala @@ -18,7 +18,18 @@ package org.apache.spark.deploy.k8s import io.fabric8.kubernetes.api.model.HasMetadata -private[spark] case class KubernetesDriverSpec( +import org.apache.spark.annotation.{DeveloperApi, Since, Unstable} + +/** + * :: DeveloperApi :: + * + * Spec for driver pod and resources, used for K8s operations internally + * and Spark K8s operator. + */ +@Unstable +@DeveloperApi +@Since("3.3.0") +case class KubernetesDriverSpec( pod: SparkPod, driverPreKubernetesResources: Seq[HasMetadata], driverKubernetesResources: Seq[HasMetadata], - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (aa00b00c18e6 -> d6ca2c5c3c4b)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from aa00b00c18e6 [SPARK-48115][INFRA] Remove `Python 3.11` from `build_python.yml` add d6ca2c5c3c4b [SPARK-48118][SQL] Support `SPARK_SQL_LEGACY_CREATE_HIVE_TABLE` env variable No new revisions were added by this update. Summary of changes: docs/sql-migration-guide.md | 2 +- sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48115][INFRA] Remove `Python 3.11` from `build_python.yml`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new aa00b00c18e6 [SPARK-48115][INFRA] Remove `Python 3.11` from `build_python.yml` aa00b00c18e6 is described below commit aa00b00c18e6a714dc02e9444576e063c8e49db7 Author: Dongjoon Hyun AuthorDate: Fri May 3 14:10:39 2024 -0700 [SPARK-48115][INFRA] Remove `Python 3.11` from `build_python.yml` ### What changes were proposed in this pull request? This PR aims to remove `Python 3.11` from `build_python.yml` Daily CI because `Python 3.11` is the main python version in the PR and commit build. - https://github.com/apache/spark/actions/workflows/build_python.yml ### Why are the changes needed? To reduce GitHub Action usage to meet ASF INFRA policy. - https://infra.apache.org/github-actions-policy.html > The average number of minutes a project uses in any consecutive five-day period MUST NOT exceed the equivalent of 30 full-time runners (216,000 minutes, or 3,600 hours). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46366 from dongjoon-hyun/SPARK-48115. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .github/workflows/build_python.yml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/.github/workflows/build_python.yml b/.github/workflows/build_python.yml index 2249dd230265..761fd20f0c79 100644 --- a/.github/workflows/build_python.yml +++ b/.github/workflows/build_python.yml @@ -17,7 +17,7 @@ # under the License. # -name: "Build / Python-only (master, PyPy 3.9/Python 3.10/Python 3.11/Python 3.12)" +name: "Build / Python-only (master, PyPy 3.9/Python 3.10/Python 3.12)" on: schedule: @@ -28,7 +28,7 @@ jobs: strategy: fail-fast: false matrix: -pyversion: ["pypy3", "python3.10", "python3.11", "python3.12"] +pyversion: ["pypy3", "python3.10", "python3.12"] permissions: packages: write name: Run - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48088][PYTHON][CONNECT][TESTS] Prepare backward compatibility test 4.0 <> above
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new cd789acb5e51 [SPARK-48088][PYTHON][CONNECT][TESTS] Prepare backward compatibility test 4.0 <> above cd789acb5e51 is described below commit cd789acb5e51172e43052b59c4b610e64f380a16 Author: Hyukjin Kwon AuthorDate: Fri May 3 01:08:05 2024 -0700 [SPARK-48088][PYTHON][CONNECT][TESTS] Prepare backward compatibility test 4.0 <> above ### What changes were proposed in this pull request? This PR forward ports https://github.com/apache/spark/pull/46334 to reduce conflicts. ### Why are the changes needed? To reduce the conflict against branch-3.5, and prepare 4.0 <> above test. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? CI in this PR should verify them. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46358 from HyukjinKwon/SPARK-48088-40. Authored-by: Hyukjin Kwon Signed-off-by: Dongjoon Hyun --- python/pyspark/util.py | 3 +++ python/run-tests.py| 18 +++--- 2 files changed, 14 insertions(+), 7 deletions(-) diff --git a/python/pyspark/util.py b/python/pyspark/util.py index bf1cf5b59553..f0fa4a2413ce 100644 --- a/python/pyspark/util.py +++ b/python/pyspark/util.py @@ -747,6 +747,9 @@ def is_remote_only() -> bool: """ global _is_remote_only +if "SPARK_SKIP_CONNECT_COMPAT_TESTS" in os.environ: +return True + if _is_remote_only is not None: return _is_remote_only try: diff --git a/python/run-tests.py b/python/run-tests.py index ebdd4a9a2179..64ac48e210db 100755 --- a/python/run-tests.py +++ b/python/run-tests.py @@ -62,13 +62,15 @@ LOGGER = logging.getLogger() # Find out where the assembly jars are located. # TODO: revisit for Scala 2.13 -for scala in ["2.13"]: -build_dir = os.path.join(SPARK_HOME, "assembly", "target", "scala-" + scala) -if os.path.isdir(build_dir): -SPARK_DIST_CLASSPATH = os.path.join(build_dir, "jars", "*") -break -else: -raise RuntimeError("Cannot find assembly build directory, please build Spark first.") +SPARK_DIST_CLASSPATH = "" +if "SPARK_SKIP_CONNECT_COMPAT_TESTS" not in os.environ: +for scala in ["2.13"]: +build_dir = os.path.join(SPARK_HOME, "assembly", "target", "scala-" + scala) +if os.path.isdir(build_dir): +SPARK_DIST_CLASSPATH = os.path.join(build_dir, "jars", "*") +break +else: +raise RuntimeError("Cannot find assembly build directory, please build Spark first.") def run_individual_python_test(target_dir, test_name, pyspark_python, keep_test_output): @@ -100,6 +102,8 @@ def run_individual_python_test(target_dir, test_name, pyspark_python, keep_test_ if "SPARK_CONNECT_TESTING_REMOTE" in os.environ: env.update({"SPARK_CONNECT_TESTING_REMOTE": os.environ["SPARK_CONNECT_TESTING_REMOTE"]}) +if "SPARK_SKIP_CONNECT_COMPAT_TESTS" in os.environ: +env.update({"SPARK_SKIP_JVM_REQUIRED_TESTS": os.environ["SPARK_SKIP_CONNECT_COMPAT_TESTS"]}) # Create a unique temp directory under 'target/' for each run. The TMPDIR variable is # recognized by the tempfile module to override the default system temp directory. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
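The gating pattern above — an environment variable that flips PySpark into "remote only" mode so JVM-dependent tests are skipped — can be mirrored in a plain unittest module. A hedged sketch of that pattern (the helper and test names are illustrative, not the actual PySpark test scaffolding):

```python
import os
import unittest

def is_remote_only() -> bool:
    # Same idea as the patch: presence of the env var forces remote-only mode.
    return "SPARK_SKIP_CONNECT_COMPAT_TESTS" in os.environ

class ExampleSuite(unittest.TestCase):
    @unittest.skipIf(is_remote_only(), "requires a local JVM; skipped in connect-compat runs")
    def test_needs_local_jvm(self):
        self.assertTrue(True)

if __name__ == "__main__":
    unittest.main()
```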
(spark) branch master updated: [SPARK-48111][INFRA] Disable Docker integration test and TPC-DS in commit builder
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 2d346fbb9c5c [SPARK-48111][INFRA] Disable Docker integration test and TPC-DS in commit builder 2d346fbb9c5c is described below commit 2d346fbb9c5c5e58f8fba076fc7f2348565bea91 Author: Hyukjin Kwon AuthorDate: Fri May 3 00:16:09 2024 -0700 [SPARK-48111][INFRA] Disable Docker integration test and TPC-DS in commit builder ### What changes were proposed in this pull request? This PR proposes to disable Docker integration test and TPC-DS in commit builder ### Why are the changes needed? This is being tested in daily scheduled build: https://github.com/apache/spark/blob/master/.github/workflows/build_java21.yml#L48-L49 Both are pretty unlikely broken in my experience. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? CI in this PR should verify them ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46361 from HyukjinKwon/SPARK-48111. Authored-by: Hyukjin Kwon Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml | 8 +++- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index f7e83854c1f7..0dc217570ba0 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -76,12 +76,10 @@ jobs: id: set-outputs run: | if [ -z "${{ inputs.jobs }}" ]; then - pyspark=true; sparkr=true; tpcds=true; docker=true; + pyspark=true; sparkr=true; pyspark_modules=`cd dev && python -c "import sparktestsupport.modules as m; print(','.join(m.name for m in m.all_modules if m.name.startswith('pyspark')))"` pyspark=`./dev/is-changed.py -m $pyspark_modules` sparkr=`./dev/is-changed.py -m sparkr` - tpcds=`./dev/is-changed.py -m sql` - docker=`./dev/is-changed.py -m docker-integration-tests` kubernetes=`./dev/is-changed.py -m kubernetes` # 'build' is always true for now. # It does not save significant time and most of PRs trigger the build. @@ -90,8 +88,8 @@ jobs: \"build\": \"true\", \"pyspark\": \"$pyspark\", \"sparkr\": \"$sparkr\", - \"tpcds-1g\": \"$tpcds\", - \"docker-integration-tests\": \"$docker\", + \"tpcds-1g\": \"false\", + \"docker-integration-tests\": \"false\", \"lint\" : \"true\", \"k8s-integration-tests\" : \"$kubernetes\", \"buf\" : \"true\", - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48110][INFRA] Remove all Maven compilation build
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new defda8663c05 [SPARK-48110][INFRA] Remove all Maven compilation build defda8663c05 is described below commit defda8663c05fdba122325b36c45ef8f2da6624e Author: Hyukjin Kwon AuthorDate: Fri May 3 00:13:28 2024 -0700 [SPARK-48110][INFRA] Remove all Maven compilation build ### What changes were proposed in this pull request? This PR proposes to reduce the concurrency of GitHub Action Job, by removing all Maven-only builds because this is tested in daily build (https://github.com/apache/spark/actions/workflows/build_maven_java21_macos14.yml) ### Why are the changes needed? Same as https://github.com/apache/spark/pull/46347 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI in this PR ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46360 from HyukjinKwon/SPARK-48110. Authored-by: Hyukjin Kwon Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml | 59 +--- 1 file changed, 1 insertion(+), 58 deletions(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 3bb37e74805f..f7e83854c1f7 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -83,7 +83,7 @@ jobs: tpcds=`./dev/is-changed.py -m sql` docker=`./dev/is-changed.py -m docker-integration-tests` kubernetes=`./dev/is-changed.py -m kubernetes` - # 'build' and 'maven-build' are always true for now. + # 'build' is always true for now. # It does not save significant time and most of PRs trigger the build. 
precondition=" { @@ -92,7 +92,6 @@ jobs: \"sparkr\": \"$sparkr\", \"tpcds-1g\": \"$tpcds\", \"docker-integration-tests\": \"$docker\", - \"maven-build\": \"true\", \"lint\" : \"true\", \"k8s-integration-tests\" : \"$kubernetes\", \"buf\" : \"true\", @@ -789,62 +788,6 @@ jobs: path: site.tar.bz2 retention-days: 1 - maven-build: -needs: precondition -if: fromJson(needs.precondition.outputs.required).maven-build == 'true' -name: Java ${{ matrix.java }} build with Maven (${{ matrix.os }}) -strategy: - fail-fast: false - matrix: -include: - - java: 21 -os: macos-14 -runs-on: ${{ matrix.os }} -timeout-minutes: 180 -steps: -- name: Checkout Spark repository - uses: actions/checkout@v4 - with: -fetch-depth: 0 -repository: apache/spark -ref: ${{ inputs.branch }} -- name: Sync the current branch with the latest in Apache Spark - if: github.repository != 'apache/spark' - run: | -git fetch https://github.com/$GITHUB_REPOSITORY.git ${GITHUB_REF#refs/heads/} -git -c user.name='Apache Spark Test Account' -c user.email='sparktest...@gmail.com' merge --no-commit --progress --squash FETCH_HEAD -git -c user.name='Apache Spark Test Account' -c user.email='sparktest...@gmail.com' commit -m "Merged commit" --allow-empty -- name: Cache SBT and Maven - uses: actions/cache@v4 - with: -path: | - build/apache-maven-* - build/*.jar - ~/.sbt -key: build-${{ hashFiles('**/pom.xml', 'project/build.properties', 'build/mvn', 'build/sbt', 'build/sbt-launch-lib.bash', 'build/spark-build-info') }} -restore-keys: | - build- -- name: Cache Maven local repository - uses: actions/cache@v4 - with: -path: ~/.m2/repository -key: java${{ matrix.java }}-maven-${{ hashFiles('**/pom.xml') }} -restore-keys: | - java${{ matrix.java }}-maven- -- name: Install Java ${{ matrix.java }} - uses: actions/setup-java@v4 - with: -distribution: zulu -java-version: ${{ matrix.java }} -- name: Build with Maven - run: | -export MAVEN_OPTS="-Xss64m -Xmx2g -XX:ReservedCodeCacheSize=1g -Dorg.slf4j.simpleLogger.defaultLogLevel=WARN" -export MAVEN_CLI_OPTS="--no-transfer-progress" -export JAVA_VERSION=${{ matrix.java }} -# It uses Maven's 'install' intentionally, see
(spark) branch master updated: [SPARK-48107][PYTHON] Exclude tests from Python distribution
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8d70f4ba5396 [SPARK-48107][PYTHON] Exclude tests from Python distribution 8d70f4ba5396 is described below commit 8d70f4ba53962de540fb3dc5bdedd32754be974d Author: Nicholas Chammas AuthorDate: Thu May 2 23:52:43 2024 -0700 [SPARK-48107][PYTHON] Exclude tests from Python distribution ### What changes were proposed in this pull request? Change the Python manifest so that tests are excluded from the packages that are built for distribution. ### Why are the changes needed? Tests were unintentionally included in the distributions as part of #44920. See [this comment](https://github.com/apache/spark/pull/44920/files#r1586979834). ### Does this PR introduce _any_ user-facing change? No, since #44920 hasn't been released to any users yet. ### How was this patch tested? I built Python packages and inspected `SOURCES.txt` to confirm that tests were excluded: ```sh cd python rm -rf pyspark.egg-info || echo "No existing egg info file, skipping deletion" python3 packaging/classic/setup.py sdist python3 packaging/connect/setup.py sdist find dist -name '*.tar.gz' | xargs -I _ tar xf _ --directory=dist cd .. open python/dist find python/dist -name SOURCES.txt | xargs code ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46354 from nchammas/SPARK-48107-package-json. Authored-by: Nicholas Chammas Signed-off-by: Dongjoon Hyun --- python/MANIFEST.in | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/python/MANIFEST.in b/python/MANIFEST.in index 0374b3096d47..45c9dca8b474 100644 --- a/python/MANIFEST.in +++ b/python/MANIFEST.in @@ -16,7 +16,7 @@ # Reference: https://setuptools.pypa.io/en/latest/userguide/miscellaneous.html -graft pyspark +recursive-include pyspark *.pyi py.typed *.json recursive-include deps/jars *.jar graft deps/bin recursive-include deps/sbin spark-config.sh spark-daemon.sh start-history-server.sh stop-history-server.sh - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48106][INFRA] Use `Python 3.11` in `pyspark` tests of `build_and_test.yml`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ddc1f6b2a466 [SPARK-48106][INFRA] Use `Python 3.11` in `pyspark` tests of `build_and_test.yml` ddc1f6b2a466 is described below commit ddc1f6b2a466892110ea0010c36f83847b9dc36e Author: Dongjoon Hyun AuthorDate: Thu May 2 23:34:47 2024 -0700 [SPARK-48106][INFRA] Use `Python 3.11` in `pyspark` tests of `build_and_test.yml` ### What changes were proposed in this pull request? This PR aims to use Python `3.11` instead of `3.9` in `pyspark` tests of `build_and_test.yml`. ### Why are the changes needed? To reduce GitHub Action usage to meet ASF INFRA policy. - https://infra.apache.org/github-actions-policy.html > The average number of minutes a project uses in any consecutive five-day period MUST NOT exceed the equivalent of 30 full-time runners (216,000 minutes, or 3,600 hours). `Python 3.11` is faster in general. - https://docs.python.org/3/whatsnew/3.11.html#summary-release-highlights > Python 3.11 is between 10-60% faster than Python 3.10. On average, we measured a 1.25x speedup on the standard benchmark suite. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46353 from dongjoon-hyun/SPARK-48106. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 56516c95dcb8..3bb37e74805f 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -366,7 +366,7 @@ jobs: pyspark-pandas-connect-part3 env: MODULES_TO_TEST: ${{ matrix.modules }} - PYTHON_TO_TEST: 'python3.9' + PYTHON_TO_TEST: 'python3.11' HADOOP_PROFILE: ${{ inputs.hadoop }} HIVE_PROFILE: hive2.3 GITHUB_PREV_SHA: ${{ github.event.before }} - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (63837020ed29 -> f044748efeac)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 63837020ed29 [SPARK-48109][INFRA] Enable `k8s-integration-tests` only for `kubernetes` module change add f044748efeac [SPARK-48103][K8S] Promote `KubernetesDriverBuilder` to `DeveloperApi` No new revisions were added by this update. Summary of changes: .../spark/deploy/k8s/submit/KubernetesDriverBuilder.scala| 12 +++- 1 file changed, 11 insertions(+), 1 deletion(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48109][INFRA] Enable `k8s-integration-tests` only for `kubernetes` module change
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 63837020ed29 [SPARK-48109][INFRA] Enable `k8s-integration-tests` only for `kubernetes` module change 63837020ed29 is described below commit 63837020ed29c9e6003f24117ad21f8b97f40f0f Author: Dongjoon Hyun AuthorDate: Thu May 2 23:21:59 2024 -0700 [SPARK-48109][INFRA] Enable `k8s-integration-tests` only for `kubernetes` module change ### What changes were proposed in this pull request? This PR aims to enable `k8s-integration-tests` only for `kubernetes` module change. Although there is a chance of missing `core` module change, the daily CI test coverage will reveal that. ### Why are the changes needed? To reduce GitHub Action usage to meet ASF INFRA policy. - https://infra.apache.org/github-actions-policy.html > The average number of minutes a project uses in any consecutive five-day period MUST NOT exceed the equivalent of 30 full-time runners (216,000 minutes, or 3,600 hours). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46356 from dongjoon-hyun/SPARK-48109. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml | 3 ++- .github/workflows/build_branch34.yml | 1 + .github/workflows/build_branch35.yml | 1 + .github/workflows/build_java21.yml | 3 ++- 4 files changed, 6 insertions(+), 2 deletions(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 13a05e824f6a..56516c95dcb8 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -82,6 +82,7 @@ jobs: sparkr=`./dev/is-changed.py -m sparkr` tpcds=`./dev/is-changed.py -m sql` docker=`./dev/is-changed.py -m docker-integration-tests` + kubernetes=`./dev/is-changed.py -m kubernetes` # 'build' and 'maven-build' are always true for now. # It does not save significant time and most of PRs trigger the build. 
precondition=" @@ -93,7 +94,7 @@ jobs: \"docker-integration-tests\": \"$docker\", \"maven-build\": \"true\", \"lint\" : \"true\", - \"k8s-integration-tests\" : \"true\", + \"k8s-integration-tests\" : \"$kubernetes\", \"buf\" : \"true\", \"ui\" : \"true\", }" diff --git a/.github/workflows/build_branch34.yml b/.github/workflows/build_branch34.yml index deb43d82c979..68887970d4d8 100644 --- a/.github/workflows/build_branch34.yml +++ b/.github/workflows/build_branch34.yml @@ -47,5 +47,6 @@ jobs: "sparkr": "true", "tpcds-1g": "true", "docker-integration-tests": "true", + "k8s-integration-tests": "true", "lint" : "true" } diff --git a/.github/workflows/build_branch35.yml b/.github/workflows/build_branch35.yml index 9e6fe13c020e..55616c2f1f01 100644 --- a/.github/workflows/build_branch35.yml +++ b/.github/workflows/build_branch35.yml @@ -47,5 +47,6 @@ jobs: "sparkr": "true", "tpcds-1g": "true", "docker-integration-tests": "true", + "k8s-integration-tests": "true", "lint" : "true" } diff --git a/.github/workflows/build_java21.yml b/.github/workflows/build_java21.yml index b1ef5a321835..bfeedd4174cf 100644 --- a/.github/workflows/build_java21.yml +++ b/.github/workflows/build_java21.yml @@ -46,5 +46,6 @@ jobs: "pyspark": "true", "sparkr": "true", "tpcds-1g": "true", - "docker-integration-tests": "true" + "docker-integration-tests": "true", + "k8s-integration-tests": "true" } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48108][INFRA] Skip `tpcds-1g` and `docker-integration-tests` tests from `RocksDB UI-Backend` job
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 111df27d21ee [SPARK-48108][INFRA] Skip `tpcds-1g` and `docker-integration-tests` tests from `RocksDB UI-Backend` job 111df27d21ee is described below commit 111df27d21ee4b9353d053628d76ae26c7f8f8f0 Author: Dongjoon Hyun AuthorDate: Thu May 2 21:51:34 2024 -0700 [SPARK-48108][INFRA] Skip `tpcds-1g` and `docker-integration-tests` tests from `RocksDB UI-Backend` job ### What changes were proposed in this pull request? This PR aims to skip `tpcds-1g` and `docker-integration-tests` tests from `RocksDB UI-Backend` job, `build_rockdb_as_ui_backend.yml`. ### Why are the changes needed? To reduce GitHub Action usage to meet ASF INFRA policy. - https://infra.apache.org/github-actions-policy.html > The average number of minutes a project uses in any consecutive five-day period MUST NOT exceed the equivalent of 30 full-time runners (216,000 minutes, or 3,600 hours). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review because this is a daily CI update. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46355 from dongjoon-hyun/SPARK-48108. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .github/workflows/build_rockdb_as_ui_backend.yml | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/.github/workflows/build_rockdb_as_ui_backend.yml b/.github/workflows/build_rockdb_as_ui_backend.yml index e11ec85b8b17..a1cc34f7b54f 100644 --- a/.github/workflows/build_rockdb_as_ui_backend.yml +++ b/.github/workflows/build_rockdb_as_ui_backend.yml @@ -42,7 +42,5 @@ jobs: { "build": "true", "pyspark": "true", - "sparkr": "true", - "tpcds-1g": "true", - "docker-integration-tests": "true" + "sparkr": "true" } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48104][INFRA] Run `publish_snapshot.yml` once per day
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 7b472e30db99 [SPARK-48104][INFRA] Run `publish_snapshot.yml` once per day 7b472e30db99 is described below commit 7b472e30db99fe935b22a748d3f2adbce474ea37 Author: Dongjoon Hyun AuthorDate: Thu May 2 20:15:03 2024 -0700 [SPARK-48104][INFRA] Run `publish_snapshot.yml` once per day ### What changes were proposed in this pull request? This PR aims to reduce `publish_snapshot.yml` frequency from twice per day to once per day. Technically, this is a revert of - #45686 ### Why are the changes needed? To reduce GitHub Action usage to meet ASF INFRA policy. - https://infra.apache.org/github-actions-policy.html > The average number of minutes a project uses in any consecutive five-day period MUST NOT exceed the equivalent of 30 full-time runners (216,000 minutes, or 3,600 hours). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46352 from dongjoon-hyun/SPARK-48104. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .github/workflows/publish_snapshot.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/publish_snapshot.yml b/.github/workflows/publish_snapshot.yml index d09babd37240..006ccf239e6f 100644 --- a/.github/workflows/publish_snapshot.yml +++ b/.github/workflows/publish_snapshot.yml @@ -21,7 +21,7 @@ name: Publish Snapshot on: schedule: - - cron: '0 0,12 * * *' + - cron: '0 0 * * *' workflow_dispatch: inputs: branch: - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-47671][CORE] Enable structured logging in log4j2.properties.template and update docs
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c6696cdcd611 [SPARK-47671][CORE] Enable structured logging in log4j2.properties.template and update docs c6696cdcd611 is described below commit c6696cdcd611a682ebf5b7a183e2970ecea3b58c Author: Gengliang Wang AuthorDate: Thu May 2 19:45:48 2024 -0700 [SPARK-47671][CORE] Enable structured logging in log4j2.properties.template and update docs ### What changes were proposed in this pull request? - Rename the current log4j2.properties.template as log4j2.properties.pattern-layout-template - Enable structured logging in log4j2.properties.template - Update `configuration.md` on how to configure logging ### Why are the changes needed? Providing a structured logging template and document how to configure loggings in Spark 4.0.0 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual test ### Was this patch authored or co-authored using generative AI tooling? No Closes #46349 from gengliangwang/logTemplate. Authored-by: Gengliang Wang Signed-off-by: Dongjoon Hyun --- ...template => log4j2.properties.pattern-layout-template} | 0 conf/log4j2.properties.template | 10 ++ docs/configuration.md | 15 +-- 3 files changed, 11 insertions(+), 14 deletions(-) diff --git a/conf/log4j2.properties.template b/conf/log4j2.properties.pattern-layout-template similarity index 100% copy from conf/log4j2.properties.template copy to conf/log4j2.properties.pattern-layout-template diff --git a/conf/log4j2.properties.template b/conf/log4j2.properties.template index ab96e03baed2..876724531444 100644 --- a/conf/log4j2.properties.template +++ b/conf/log4j2.properties.template @@ -19,17 +19,11 @@ rootLogger.level = info rootLogger.appenderRef.stdout.ref = console -# In the pattern layout configuration below, we specify an explicit `%ex` conversion -# pattern for logging Throwables. If this was omitted, then (by default) Log4J would -# implicitly add an `%xEx` conversion pattern which logs stacktraces with additional -# class packaging information. That extra information can sometimes add a substantial -# performance overhead, so we disable it in our default logging config. -# For more information, see SPARK-39361. appender.console.type = Console appender.console.name = console appender.console.target = SYSTEM_ERR -appender.console.layout.type = PatternLayout -appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n%ex +appender.console.layout.type = JsonTemplateLayout +appender.console.layout.eventTemplateUri = classpath:org/apache/spark/SparkLayout.json # Set the default spark-shell/spark-sql log level to WARN. When running the # spark-shell/spark-sql, the log level for these classes is used to overwrite diff --git a/docs/configuration.md b/docs/configuration.md index 2e612ffd9ab9..a3b4e731f057 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -3670,14 +3670,17 @@ Note: When running Spark on YARN in `cluster` mode, environment variables need t # Configuring Logging Spark uses [log4j](http://logging.apache.org/log4j/) for logging. You can configure it by adding a -`log4j2.properties` file in the `conf` directory. One way to start is to copy the existing -`log4j2.properties.template` located there. +`log4j2.properties` file in the `conf` directory. 
One way to start is to copy the existing templates `log4j2.properties.template` or `log4j2.properties.pattern-layout-template` located there. -By default, Spark adds 1 record to the MDC (Mapped Diagnostic Context): `mdc.taskName`, which shows something -like `task 1.0 in stage 0.0`. You can add `%X{mdc.taskName}` to your patternLayout in -order to print it in the logs. +## Structured Logging +Starting from version 4.0.0, Spark has adopted the [JSON Template Layout](https://logging.apache.org/log4j/2.x/manual/json-template-layout.html) for logging, which outputs logs in JSON format. This format facilitates querying logs using Spark SQL with the JSON data source. Additionally, the logs include all Mapped Diagnostic Context (MDC) information for search and debugging purposes. + +To implement structured logging, start with the `log4j2.properties.template` file. + +## Plain Text Logging +If you prefer plain text logging, you can use the `log4j2.properties.pattern-layout-template` file as a starting point. This is the default configuration used by Spark before the 4.0.0 release. This configuration uses the [PatternLayout](https://logging.apache.org/log4j/2.x/manual/layouts.html#PatternLayout) to log all the logs in plain text.
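As the updated docs note, JSON-formatted logs can be read back with Spark's JSON data source and queried with SQL. A small sketch of what that might look like; the log path and the `level` field name are assumptions about the JSON template layout output, not taken from the patch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("log-analysis").getOrCreate()

# Path and field names are assumptions; adjust them to your deployment and to
# the fields your JSON layout actually emits.
logs = spark.read.json("/var/log/spark/driver.log")
logs.createOrReplaceTempView("logs")

spark.sql("""
    SELECT level, count(*) AS entries
    FROM logs
    GROUP BY level
    ORDER BY entries DESC
""").show()
```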
(spark) branch master updated (7f6a1399a56b -> 8d9e7c9c6623)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 7f6a1399a56b [SPARK-48098][INFRA] Enable `NOLINT_ON_COMPILE` for all except `lint` job add 8d9e7c9c6623 [SPARK-48099][INFRA] Run `maven-build` test only on `Java 21 on MacOS14 (Apple Silicon)` No new revisions were added by this update. Summary of changes: .github/workflows/build_and_test.yml | 4 1 file changed, 4 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48098][INFRA] Enable `NOLINT_ON_COMPILE` for all except `lint` job
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 7f6a1399a56b [SPARK-48098][INFRA] Enable `NOLINT_ON_COMPILE` for all except `lint` job 7f6a1399a56b is described below commit 7f6a1399a56b07fa253a85dac757fdd788285274 Author: Dongjoon Hyun AuthorDate: Thu May 2 19:26:13 2024 -0700 [SPARK-48098][INFRA] Enable `NOLINT_ON_COMPILE` for all except `lint` job ### What changes were proposed in this pull request? This PR aims to enable `NOLINT_ON_COMPILE` for all except `lint` job. ### Why are the changes needed? This will reduce the redundant CPU cycle and GitHub action usage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46346 from dongjoon-hyun/SPARK-48098. Lead-authored-by: Dongjoon Hyun Co-authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml | 2 ++ project/SparkBuild.scala | 4 +++- 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 92fda7adeb33..3f5a8087885e 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -193,6 +193,7 @@ jobs: HIVE_PROFILE: ${{ matrix.hive }} GITHUB_PREV_SHA: ${{ github.event.before }} SPARK_LOCAL_IP: localhost + NOLINT_ON_COMPILE: true SKIP_UNIDOC: true SKIP_MIMA: true SKIP_PACKAGING: true @@ -606,6 +607,7 @@ jobs: env: LC_ALL: C.UTF-8 LANG: C.UTF-8 + NOLINT_ON_COMPILE: false PYSPARK_DRIVER_PYTHON: python3.9 PYSPARK_PYTHON: python3.9 GITHUB_PREV_SHA: ${{ github.event.before }} diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala index 9d2ee6077d11..5bb7745d77bf 100644 --- a/project/SparkBuild.scala +++ b/project/SparkBuild.scala @@ -255,9 +255,11 @@ object SparkBuild extends PomBuild { } ) + val noLintOnCompile = sys.env.contains("NOLINT_ON_COMPILE") && + !sys.env.get("NOLINT_ON_COMPILE").contains("false") lazy val sharedSettings = sparkGenjavadocSettings ++ compilerWarningSettings ++ - (if (sys.env.contains("NOLINT_ON_COMPILE")) Nil else enableScalaStyle) ++ Seq( + (if (noLintOnCompile) Nil else enableScalaStyle) ++ Seq( (Compile / exportJars) := true, (Test / exportJars) := false, javaHome := sys.env.get("JAVA_HOME") - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48097][INFRA] Limit GHA job execution time to up to 3 hours in `build_and_test.yml`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 1321dd604480 [SPARK-48097][INFRA] Limit GHA job execution time to up to 3 hours in `build_and_test.yml` 1321dd604480 is described below commit 1321dd6044809dbbdd8c1887b8345b0f8d76797d Author: Dongjoon Hyun AuthorDate: Thu May 2 15:10:33 2024 -0700 [SPARK-48097][INFRA] Limit GHA job execution time to up to 3 hours in `build_and_test.yml` ### What changes were proposed in this pull request? This PR aims to limit GHA job execution time to up to 3 hours in `build_and_test.yml` in order to avoid idle hung time. New limit is applied for all jobs except three jobs (`precondition`, `infra-image`, and `breaking-changes-buf`) which didn't get a hung situation before. ### Why are the changes needed? Since SPARK-45010, Apache spark used 5 hours. - #42727 This is shorter than GitHub Action's the default value (6 hour) is used. - https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idtimeout-minutes > The maximum number of minutes to let a job run before GitHub automatically cancels it. Default: 360 This PR reduces to `3 hour` to follow new ASF INFRA policy which has been applied since April 20, 2024. - https://infra.apache.org/github-actions-policy.html ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46344 from dongjoon-hyun/SPARK-48097. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 7e59f7b792b4..92fda7adeb33 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -123,7 +123,7 @@ jobs: needs: precondition if: fromJson(needs.precondition.outputs.required).build == 'true' runs-on: ubuntu-latest -timeout-minutes: 300 +timeout-minutes: 180 strategy: fail-fast: false matrix: @@ -333,7 +333,7 @@ jobs: if: (!cancelled()) && fromJson(needs.precondition.outputs.required).pyspark == 'true' name: "Build modules: ${{ matrix.modules }}" runs-on: ubuntu-latest -timeout-minutes: 300 +timeout-minutes: 180 container: image: ${{ needs.precondition.outputs.image_url }} strategy: @@ -480,7 +480,7 @@ jobs: if: (!cancelled()) && fromJson(needs.precondition.outputs.required).sparkr == 'true' name: "Build modules: sparkr" runs-on: ubuntu-latest -timeout-minutes: 300 +timeout-minutes: 180 container: image: ${{ needs.precondition.outputs.image_url }} env: @@ -602,7 +602,7 @@ jobs: if: (!cancelled()) && fromJson(needs.precondition.outputs.required).lint == 'true' name: Linters, licenses, dependencies and documentation generation runs-on: ubuntu-latest -timeout-minutes: 300 +timeout-minutes: 180 env: LC_ALL: C.UTF-8 LANG: C.UTF-8 @@ -801,7 +801,7 @@ jobs: - java: 21 os: macos-14 runs-on: ${{ matrix.os }} -timeout-minutes: 300 +timeout-minutes: 180 steps: - name: Checkout Spark repository uses: actions/checkout@v4 @@ -853,7 +853,7 @@ jobs: name: Run TPC-DS queries with SF=1 # Pin to 'Ubuntu 20.04' due to 'databricks/tpcds-kit' compilation runs-on: ubuntu-20.04 -timeout-minutes: 300 +timeout-minutes: 180 env: SPARK_LOCAL_IP: localhost steps: @@ -954,7 +954,7 @@ jobs: if: 
fromJson(needs.precondition.outputs.required).docker-integration-tests == 'true' name: Run Docker integration tests runs-on: ubuntu-latest -timeout-minutes: 300 +timeout-minutes: 180 env: HADOOP_PROFILE: ${{ inputs.hadoop }} HIVE_PROFILE: hive2.3 @@ -1022,7 +1022,7 @@ jobs: if: fromJson(needs.precondition.outputs.required).k8s-integration-tests == 'true' name: Run Spark on Kubernetes Integration test runs-on: ubuntu-latest -timeout-minutes: 300 +timeout-minutes: 180 steps: - name: Checkout Spark repository uses: actions/checkout@v4 @@ -1094,7 +1094,7 @@ jobs: if: fromJson(needs.precondition.outputs.required).ui == 'true' name: Run Spark UI tests runs-on: ubuntu-latest -timeout-minutes: 300 +timeout-minutes: 180
(spark) branch master updated: [SPARK-48096][INFRA] Run `build_maven_java21_macos14.yml` every two days
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 53d4cdb4eefa [SPARK-48096][INFRA] Run `build_maven_java21_macos14.yml` every two days 53d4cdb4eefa is described below commit 53d4cdb4eefa66161315f04d58d2742f52bfbcce Author: Dongjoon Hyun AuthorDate: Thu May 2 14:12:16 2024 -0700 [SPARK-48096][INFRA] Run `build_maven_java21_macos14.yml` every two days ### What changes were proposed in this pull request? This PR aims to reduce `build_maven_java21_macos14.yml` frequency from once per day to every two days. ### Why are the changes needed? To reduce GitHub Action usage to meet ASF INFRA policy. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46343 from dongjoon-hyun/SPARK-48096. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .github/workflows/build_maven_java21_macos14.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/build_maven_java21_macos14.yml b/.github/workflows/build_maven_java21_macos14.yml index 70b47fcecb26..fb5e609f4eae 100644 --- a/.github/workflows/build_maven_java21_macos14.yml +++ b/.github/workflows/build_maven_java21_macos14.yml @@ -21,7 +21,7 @@ name: "Build / Maven (master, Scala 2.13, Hadoop 3, JDK 21, macos-14)" on: schedule: -- cron: '0 20 * * *' +- cron: '0 20 */2 * *' jobs: run-build: - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48095][INFRA] Run `build_non_ansi.yml` once per day
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 48df28f6b311 [SPARK-48095][INFRA] Run `build_non_ansi.yml` once per day 48df28f6b311 is described below commit 48df28f6b3112b949c0057f0c4ecb1d334f3662c Author: Dongjoon Hyun AuthorDate: Thu May 2 14:00:36 2024 -0700 [SPARK-48095][INFRA] Run `build_non_ansi.yml` once per day ### What changes were proposed in this pull request? This PR aims to reduce `build_non_ansi.yml` frequency from twice per day to once per day. ### Why are the changes needed? To reduce GitHub Action usage to meet ASF INFRA policy. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46342 from dongjoon-hyun/SPARK-48095. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .github/workflows/build_non_ansi.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/build_non_ansi.yml b/.github/workflows/build_non_ansi.yml index cf97cdd4bfa1..ff3fda4625cc 100644 --- a/.github/workflows/build_non_ansi.yml +++ b/.github/workflows/build_non_ansi.yml @@ -21,7 +21,7 @@ name: "Build / NON-ANSI (master, Hadoop 3, JDK 17, Scala 2.13)" on: schedule: -- cron: '0 1,13 * * *' +- cron: '0 1 * * *' jobs: run-build: - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.4 updated: [SPARK-48081][SQL][3.4] Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 681a1de72bdf [SPARK-48081][SQL][3.4] Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type 681a1de72bdf is described below commit 681a1de72bdf749e0a0782dde9bddfcbb3248d99 Author: Josh Rosen AuthorDate: Thu May 2 12:50:54 2024 -0700 [SPARK-48081][SQL][3.4] Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type branch-3.4 pick of PR https://github.com/apache/spark/pull/46333 , fixing test issue due to difference in expected error message parameter formatting across branches; original description follows below: --- ### What changes were proposed in this pull request? While migrating the `NTile` expression's type check failures to the new error class framework, PR https://github.com/apache/spark/pull/38457 removed a pair of not-unnecessary `return` statements and thus caused certain branches' values to be discarded rather than returned. As a result, invalid usages like ``` select ntile(99.9) OVER (order by id) from range(10) ``` trigger internal errors like errors like ``` java.lang.ClassCastException: class org.apache.spark.sql.types.Decimal cannot be cast to class java.lang.Integer (org.apache.spark.sql.types.Decimal is in unnamed module of loader 'app'; java.lang.Integer is in module java.base of loader 'bootstrap') at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:99) at org.apache.spark.sql.catalyst.expressions.NTile.checkInputDataTypes(windowExpressions.scala:877) ``` instead of clear error framework errors like ``` org.apache.spark.sql.catalyst.ExtendedAnalysisException: [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "ntile(99.9)" due to data type mismatch: The first parameter requires the "INT" type, however "99.9" has the type "DECIMAL(3,1)". SQLSTATE: 42K09; line 1 pos 7; 'Project [unresolvedalias(ntile(99.9) windowspecdefinition(id#0L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$(] +- Range (0, 10, step=1, splits=None) at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$7(CheckAnalysis.scala:315) ``` ### Why are the changes needed? Improve error messages. ### Does this PR introduce _any_ user-facing change? Yes, it improves an error message. ### How was this patch tested? Added a new test case to AnalysisErrorSuite. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46337 from JoshRosen/SPARK-48081-branch-3.4. 
Authored-by: Josh Rosen Signed-off-by: Dongjoon Hyun --- .../catalyst/expressions/windowExpressions.scala | 4 +-- .../sql/catalyst/analysis/AnalysisErrorSuite.scala | 34 ++ 2 files changed, 36 insertions(+), 2 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala index 2d11b581ee4c..adc32866f58d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala @@ -848,7 +848,7 @@ case class NTile(buckets: Expression) extends RowNumberLike with SizeBasedWindow // for each partition. override def checkInputDataTypes(): TypeCheckResult = { if (!buckets.foldable) { - DataTypeMismatch( + return DataTypeMismatch( errorSubClass = "NON_FOLDABLE_INPUT", messageParameters = Map( "inputName" -> "buckets", @@ -859,7 +859,7 @@ case class NTile(buckets: Expression) extends RowNumberLike with SizeBasedWindow } if (buckets.dataType != IntegerType) { - DataTypeMismatch( + return DataTypeMismatch( errorSubClass = "UNEXPECTED_INPUT_TYPE", messageParameters = Map( "paramIndex" -> "1", diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala index cbd6749807f7..ebc133719238 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisE
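For readers less familiar with Scala's expression semantics, the root cause described above is that an `if` block which is not the last expression of a method simply evaluates and discards its value unless it is explicitly returned. The following is a minimal standalone sketch, not Spark code; the type names only echo the real `TypeCheckResult`/`DataTypeMismatch` classes for illustration.

```scala
object EarlyReturnSketch {
  sealed trait CheckResult
  case object TypeCheckSuccess extends CheckResult
  case class DataTypeMismatch(reason: String) extends CheckResult

  // Without `return`, the mismatch computed in the first branch is discarded
  // and the method falls through, reporting success for invalid input.
  def checkWithoutReturn(foldable: Boolean): CheckResult = {
    if (!foldable) {
      DataTypeMismatch("NON_FOLDABLE_INPUT") // value silently dropped
    }
    TypeCheckSuccess
  }

  // With the explicit `return` restored, the method exits on the first failed
  // check, which is the behavior the patch reinstates in
  // NTile.checkInputDataTypes().
  def checkWithReturn(foldable: Boolean): CheckResult = {
    if (!foldable) {
      return DataTypeMismatch("NON_FOLDABLE_INPUT")
    }
    TypeCheckSuccess
  }

  def main(args: Array[String]): Unit = {
    println(checkWithoutReturn(foldable = false)) // TypeCheckSuccess  -- the bug
    println(checkWithReturn(foldable = false))    // DataTypeMismatch(...) -- the fix
  }
}
```

In the buggy version the discarded mismatch leaves the later cast-to-Int code running on a non-integer literal, which is where the `ClassCastException` quoted above comes from.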
(spark) branch branch-3.5 updated: [SPARK-48081][SQL][3.5] Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 9cd312574e97 [SPARK-48081][SQL][3.5] Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type 9cd312574e97 is described below commit 9cd312574e9706e9a1784c18ef1c1bccb957bcba Author: Josh Rosen AuthorDate: Thu May 2 12:49:54 2024 -0700 [SPARK-48081][SQL][3.5] Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type branch-3.5 pick of PR https://github.com/apache/spark/pull/46333 , fixing test issue due to difference in expected error message parameter formatting across branches; original description follows below: --- ### What changes were proposed in this pull request? While migrating the `NTile` expression's type check failures to the new error class framework, PR https://github.com/apache/spark/pull/38457 removed a pair of not-unnecessary `return` statements and thus caused certain branches' values to be discarded rather than returned. As a result, invalid usages like ``` select ntile(99.9) OVER (order by id) from range(10) ``` trigger internal errors like errors like ``` java.lang.ClassCastException: class org.apache.spark.sql.types.Decimal cannot be cast to class java.lang.Integer (org.apache.spark.sql.types.Decimal is in unnamed module of loader 'app'; java.lang.Integer is in module java.base of loader 'bootstrap') at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:99) at org.apache.spark.sql.catalyst.expressions.NTile.checkInputDataTypes(windowExpressions.scala:877) ``` instead of clear error framework errors like ``` org.apache.spark.sql.catalyst.ExtendedAnalysisException: [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "ntile(99.9)" due to data type mismatch: The first parameter requires the "INT" type, however "99.9" has the type "DECIMAL(3,1)". SQLSTATE: 42K09; line 1 pos 7; 'Project [unresolvedalias(ntile(99.9) windowspecdefinition(id#0L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$(] +- Range (0, 10, step=1, splits=None) at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$7(CheckAnalysis.scala:315) ``` ### Why are the changes needed? Improve error messages. ### Does this PR introduce _any_ user-facing change? Yes, it improves an error message. ### How was this patch tested? Added a new test case to AnalysisErrorSuite. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46336 from JoshRosen/SPARK-48081-branch-3.5. 
Authored-by: Josh Rosen Signed-off-by: Dongjoon Hyun --- .../catalyst/expressions/windowExpressions.scala | 4 +-- .../sql/catalyst/analysis/AnalysisErrorSuite.scala | 34 ++ 2 files changed, 36 insertions(+), 2 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala index 50c98c01645d..a4ce78d1bb6d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala @@ -850,7 +850,7 @@ case class NTile(buckets: Expression) extends RowNumberLike with SizeBasedWindow // for each partition. override def checkInputDataTypes(): TypeCheckResult = { if (!buckets.foldable) { - DataTypeMismatch( + return DataTypeMismatch( errorSubClass = "NON_FOLDABLE_INPUT", messageParameters = Map( "inputName" -> "buckets", @@ -861,7 +861,7 @@ case class NTile(buckets: Expression) extends RowNumberLike with SizeBasedWindow } if (buckets.dataType != IntegerType) { - DataTypeMismatch( + return DataTypeMismatch( errorSubClass = "UNEXPECTED_INPUT_TYPE", messageParameters = Map( "paramIndex" -> "1", diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala index e8dc9061199c..a7df53db936f 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisE
(spark) branch branch-3.4 updated: [SPARK-45988][SPARK-45989][PYTHON] Fix typehints to handle `list` GenericAlias in Python 3.11+
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new a75c93be9c0a [SPARK-45988][SPARK-45989][PYTHON] Fix typehints to handle `list` GenericAlias in Python 3.11+ a75c93be9c0a is described below commit a75c93be9c0a9c96de788db9fc74125590d2d26f Author: Dongjoon Hyun AuthorDate: Mon Nov 20 08:30:42 2023 +0900 [SPARK-45988][SPARK-45989][PYTHON] Fix typehints to handle `list` GenericAlias in Python 3.11+ ### What changes were proposed in this pull request? This PR aims to fix `type hints` to handle `list` GenericAlias in Python 3.11+ for Apache Spark 4.0.0 and 3.5.1. - https://github.com/apache/spark/actions/workflows/build_python.yml ### Why are the changes needed? PEP 646 changes `GenericAlias` instances into `Iterable` ones at Python 3.11. - https://peps.python.org/pep-0646/ This behavior changes introduce the following failure on Python 3.11. - **Python 3.11.6** ```python Python 3.11.6 (main, Nov 1 2023, 07:46:30) [Clang 14.0.0 (clang-1400.0.28.1)] on darwin Type "help", "copyright", "credits" or "license" for more information. Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 23/11/18 16:34:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 3.5.0 /_/ Using Python version 3.11.6 (main, Nov 1 2023 07:46:30) Spark context Web UI available at http://localhost:4040 Spark context available as 'sc' (master = local[*], app id = local-1700354049391). SparkSession available as 'spark'. >>> from pyspark import pandas as ps >>> from typing import List >>> ps.DataFrame[float, [int, List[int]]] Traceback (most recent call last): File "", line 1, in File "/Users/dongjoon/APACHE/spark-release/spark-3.5.0-bin-hadoop3/python/pyspark/pandas/frame.py", line 13647, in __class_getitem__ return create_tuple_for_frame_type(params) ^^^ File "/Users/dongjoon/APACHE/spark-release/spark-3.5.0-bin-hadoop3/python/pyspark/pandas/typedef/typehints.py", line 717, in create_tuple_for_frame_type return Tuple[_to_type_holders(params)] File "/Users/dongjoon/APACHE/spark-release/spark-3.5.0-bin-hadoop3/python/pyspark/pandas/typedef/typehints.py", line 762, in _to_type_holders data_types = _new_type_holders(data_types, NameTypeHolder) ^ File "/Users/dongjoon/APACHE/spark-release/spark-3.5.0-bin-hadoop3/python/pyspark/pandas/typedef/typehints.py", line 828, in _new_type_holders raise TypeError( TypeError: Type hints should be specified as one of: - DataFrame[type, type, ...] - DataFrame[name: type, name: type, ...] - DataFrame[dtypes instance] - DataFrame[zip(names, types)] - DataFrame[index_type, [type, ...]] - DataFrame[(index_name, index_type), [(name, type), ...]] - DataFrame[dtype instance, dtypes instance] - DataFrame[(index_name, index_type), zip(names, types)] - DataFrame[[index_type, ...], [type, ...]] - DataFrame[[(index_name, index_type), ...], [(name, type), ...]] - DataFrame[dtypes instance, dtypes instance] - DataFrame[zip(index_names, index_types), zip(names, types)] However, got (, typing.List[int]). 
``` - **Python 3.10.13** ```python Python 3.10.13 (main, Sep 29 2023, 16:03:45) [Clang 14.0.0 (clang-1400.0.28.1)] on darwin Type "help", "copyright", "credits" or "license" for more information. Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 23/11/18 16:33:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 3.5.0 /_/ Using Python version 3.10.13 (main, Sep 29 2023 16:03:45) Spark context Web UI available at http://localhost:4040 Spark context available as 'sc' (master =
(spark) branch branch-3.4 updated: Revert "[SPARK-48081] Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type"
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 4baf5ee19ba4 Revert "[SPARK-48081] Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type" 4baf5ee19ba4 is described below commit 4baf5ee19ba410ea39d784380b8e5ae434cf8601 Author: Dongjoon Hyun AuthorDate: Thu May 2 08:42:48 2024 -0700 Revert "[SPARK-48081] Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type" This reverts commit 32789ba3bbaa98dd14537d80204ed4aab8f77d9b. --- .../catalyst/expressions/windowExpressions.scala | 4 +-- .../sql/catalyst/analysis/AnalysisErrorSuite.scala | 34 -- 2 files changed, 2 insertions(+), 36 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala index adc32866f58d..2d11b581ee4c 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala @@ -848,7 +848,7 @@ case class NTile(buckets: Expression) extends RowNumberLike with SizeBasedWindow // for each partition. override def checkInputDataTypes(): TypeCheckResult = { if (!buckets.foldable) { - return DataTypeMismatch( + DataTypeMismatch( errorSubClass = "NON_FOLDABLE_INPUT", messageParameters = Map( "inputName" -> "buckets", @@ -859,7 +859,7 @@ case class NTile(buckets: Expression) extends RowNumberLike with SizeBasedWindow } if (buckets.dataType != IntegerType) { - return DataTypeMismatch( + DataTypeMismatch( errorSubClass = "UNEXPECTED_INPUT_TYPE", messageParameters = Map( "paramIndex" -> "1", diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala index 5a2aa87d7a83..cbd6749807f7 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala @@ -316,40 +316,6 @@ class AnalysisErrorSuite extends AnalysisTest { listRelation.select(Explode($"list").as("a"), Explode($"list").as("b")), "only one generator" :: "explode" :: Nil) - errorClassTest( -"the buckets of ntile window function is not foldable", -testRelation2.select( - WindowExpression( -NTile(Literal(99.9f)), -WindowSpecDefinition( - UnresolvedAttribute("a") :: Nil, - SortOrder(UnresolvedAttribute("b"), Ascending) :: Nil, - UnspecifiedFrame)).as("window")), -errorClass = "DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE", -messageParameters = Map( - "sqlExpr" -> "\"ntile(99.9)\"", - "paramIndex" -> "first", - "inputSql" -> "\"99.9\"", - "inputType" -> "\"FLOAT\"", - "requiredType" -> "\"INT\"")) - - - errorClassTest( -"the buckets of ntile window function is not int literal", -testRelation2.select( - WindowExpression( -NTile(AttributeReference("b", IntegerType)()), -WindowSpecDefinition( - UnresolvedAttribute("a") :: Nil, - SortOrder(UnresolvedAttribute("b"), Ascending) :: Nil, - UnspecifiedFrame)).as("window")), -errorClass = "DATATYPE_MISMATCH.NON_FOLDABLE_INPUT", -messageParameters = Map( - "sqlExpr" -> "\"ntile(b)\"", - "inputName" -> 
"`buckets`", - "inputExpr" -> "\"b\"", - "inputType" -> "\"INT\"")) - errorClassTest( "unresolved attributes", testRelation.select($"abcd"), - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.5 updated: Revert "[SPARK-48081] Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type"
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new d82403f98033 Revert "[SPARK-48081] Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type" d82403f98033 is described below commit d82403f980334cd40b1f24518c9c766827710c8c Author: Dongjoon Hyun AuthorDate: Thu May 2 08:42:24 2024 -0700 Revert "[SPARK-48081] Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type" This reverts commit 3d72063ccec6167bd3fe92e24a0ebd11bec8637b. --- .../catalyst/expressions/windowExpressions.scala | 4 +-- .../sql/catalyst/analysis/AnalysisErrorSuite.scala | 34 -- 2 files changed, 2 insertions(+), 36 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala index a4ce78d1bb6d..50c98c01645d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala @@ -850,7 +850,7 @@ case class NTile(buckets: Expression) extends RowNumberLike with SizeBasedWindow // for each partition. override def checkInputDataTypes(): TypeCheckResult = { if (!buckets.foldable) { - return DataTypeMismatch( + DataTypeMismatch( errorSubClass = "NON_FOLDABLE_INPUT", messageParameters = Map( "inputName" -> "buckets", @@ -861,7 +861,7 @@ case class NTile(buckets: Expression) extends RowNumberLike with SizeBasedWindow } if (buckets.dataType != IntegerType) { - return DataTypeMismatch( + DataTypeMismatch( errorSubClass = "UNEXPECTED_INPUT_TYPE", messageParameters = Map( "paramIndex" -> "1", diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala index 48d9266542f1..e8dc9061199c 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala @@ -344,40 +344,6 @@ class AnalysisErrorSuite extends AnalysisTest { "inputType" -> "\"BOOLEAN\"", "requiredType" -> "\"INT\"")) - errorClassTest( -"the buckets of ntile window function is not foldable", -testRelation2.select( - WindowExpression( -NTile(Literal(99.9f)), -WindowSpecDefinition( - UnresolvedAttribute("a") :: Nil, - SortOrder(UnresolvedAttribute("b"), Ascending) :: Nil, - UnspecifiedFrame)).as("window")), -errorClass = "DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE", -messageParameters = Map( - "sqlExpr" -> "\"ntile(99.9)\"", - "paramIndex" -> "first", - "inputSql" -> "\"99.9\"", - "inputType" -> "\"FLOAT\"", - "requiredType" -> "\"INT\"")) - - - errorClassTest( -"the buckets of ntile window function is not int literal", -testRelation2.select( - WindowExpression( -NTile(AttributeReference("b", IntegerType)()), -WindowSpecDefinition( - UnresolvedAttribute("a") :: Nil, - SortOrder(UnresolvedAttribute("b"), Ascending) :: Nil, - UnspecifiedFrame)).as("window")), -errorClass = "DATATYPE_MISMATCH.NON_FOLDABLE_INPUT", -messageParameters = Map( - "sqlExpr" -> "\"ntile(b)\"", - "inputName" -> "`buckets`", - "inputExpr" -> "\"b\"", - "inputType" -> 
"\"INT\"")) - errorClassTest( "unresolved attributes", testRelation.select($"abcd"), - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (b99a64b0fd1c -> bf1300835503)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from b99a64b0fd1c [SPARK-48081] Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type add bf1300835503 [SPARK-48079][BUILD] Upgrade maven-install/deploy-plugin to 3.1.2 No new revisions were added by this update. Summary of changes: pom.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.4 updated: [SPARK-48081] Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 32789ba3bbaa [SPARK-48081] Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type 32789ba3bbaa is described below commit 32789ba3bbaa98dd14537d80204ed4aab8f77d9b Author: Josh Rosen AuthorDate: Thu May 2 07:22:44 2024 -0700 [SPARK-48081] Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type ### What changes were proposed in this pull request? While migrating the `NTile` expression's type check failures to the new error class framework, PR https://github.com/apache/spark/pull/38457 removed a pair of not-unnecessary `return` statements and thus caused certain branches' values to be discarded rather than returned. As a result, invalid usages like ``` select ntile(99.9) OVER (order by id) from range(10) ``` trigger internal errors like errors like ``` java.lang.ClassCastException: class org.apache.spark.sql.types.Decimal cannot be cast to class java.lang.Integer (org.apache.spark.sql.types.Decimal is in unnamed module of loader 'app'; java.lang.Integer is in module java.base of loader 'bootstrap') at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:99) at org.apache.spark.sql.catalyst.expressions.NTile.checkInputDataTypes(windowExpressions.scala:877) ``` instead of clear error framework errors like ``` org.apache.spark.sql.catalyst.ExtendedAnalysisException: [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "ntile(99.9)" due to data type mismatch: The first parameter requires the "INT" type, however "99.9" has the type "DECIMAL(3,1)". SQLSTATE: 42K09; line 1 pos 7; 'Project [unresolvedalias(ntile(99.9) windowspecdefinition(id#0L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$(] +- Range (0, 10, step=1, splits=None) at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$7(CheckAnalysis.scala:315) ``` ### Why are the changes needed? Improve error messages. ### Does this PR introduce _any_ user-facing change? Yes, it improves an error message. ### How was this patch tested? Added a new test case to AnalysisErrorSuite. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46333 from JoshRosen/SPARK-48081. Authored-by: Josh Rosen Signed-off-by: Dongjoon Hyun (cherry picked from commit b99a64b0fd1cf4b32dd2f17423775db87bae20a6) Signed-off-by: Dongjoon Hyun --- .../catalyst/expressions/windowExpressions.scala | 4 +-- .../sql/catalyst/analysis/AnalysisErrorSuite.scala | 34 ++ 2 files changed, 36 insertions(+), 2 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala index 2d11b581ee4c..adc32866f58d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala @@ -848,7 +848,7 @@ case class NTile(buckets: Expression) extends RowNumberLike with SizeBasedWindow // for each partition. 
override def checkInputDataTypes(): TypeCheckResult = { if (!buckets.foldable) { - DataTypeMismatch( + return DataTypeMismatch( errorSubClass = "NON_FOLDABLE_INPUT", messageParameters = Map( "inputName" -> "buckets", @@ -859,7 +859,7 @@ case class NTile(buckets: Expression) extends RowNumberLike with SizeBasedWindow } if (buckets.dataType != IntegerType) { - DataTypeMismatch( + return DataTypeMismatch( errorSubClass = "UNEXPECTED_INPUT_TYPE", messageParameters = Map( "paramIndex" -> "1", diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala index cbd6749807f7..5a2aa87d7a83 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala @@ -316,6 +316,40 @@ class AnalysisErrorSui
(spark) branch branch-3.5 updated: [SPARK-48081] Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 3d72063ccec6 [SPARK-48081] Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type 3d72063ccec6 is described below commit 3d72063ccec6167bd3fe92e24a0ebd11bec8637b Author: Josh Rosen AuthorDate: Thu May 2 07:22:44 2024 -0700 [SPARK-48081] Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type ### What changes were proposed in this pull request? While migrating the `NTile` expression's type check failures to the new error class framework, PR https://github.com/apache/spark/pull/38457 removed a pair of not-unnecessary `return` statements and thus caused certain branches' values to be discarded rather than returned. As a result, invalid usages like ``` select ntile(99.9) OVER (order by id) from range(10) ``` trigger internal errors like errors like ``` java.lang.ClassCastException: class org.apache.spark.sql.types.Decimal cannot be cast to class java.lang.Integer (org.apache.spark.sql.types.Decimal is in unnamed module of loader 'app'; java.lang.Integer is in module java.base of loader 'bootstrap') at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:99) at org.apache.spark.sql.catalyst.expressions.NTile.checkInputDataTypes(windowExpressions.scala:877) ``` instead of clear error framework errors like ``` org.apache.spark.sql.catalyst.ExtendedAnalysisException: [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "ntile(99.9)" due to data type mismatch: The first parameter requires the "INT" type, however "99.9" has the type "DECIMAL(3,1)". SQLSTATE: 42K09; line 1 pos 7; 'Project [unresolvedalias(ntile(99.9) windowspecdefinition(id#0L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$(] +- Range (0, 10, step=1, splits=None) at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$7(CheckAnalysis.scala:315) ``` ### Why are the changes needed? Improve error messages. ### Does this PR introduce _any_ user-facing change? Yes, it improves an error message. ### How was this patch tested? Added a new test case to AnalysisErrorSuite. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46333 from JoshRosen/SPARK-48081. Authored-by: Josh Rosen Signed-off-by: Dongjoon Hyun (cherry picked from commit b99a64b0fd1cf4b32dd2f17423775db87bae20a6) Signed-off-by: Dongjoon Hyun --- .../catalyst/expressions/windowExpressions.scala | 4 +-- .../sql/catalyst/analysis/AnalysisErrorSuite.scala | 34 ++ 2 files changed, 36 insertions(+), 2 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala index 50c98c01645d..a4ce78d1bb6d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala @@ -850,7 +850,7 @@ case class NTile(buckets: Expression) extends RowNumberLike with SizeBasedWindow // for each partition. 
override def checkInputDataTypes(): TypeCheckResult = { if (!buckets.foldable) { - DataTypeMismatch( + return DataTypeMismatch( errorSubClass = "NON_FOLDABLE_INPUT", messageParameters = Map( "inputName" -> "buckets", @@ -861,7 +861,7 @@ case class NTile(buckets: Expression) extends RowNumberLike with SizeBasedWindow } if (buckets.dataType != IntegerType) { - DataTypeMismatch( + return DataTypeMismatch( errorSubClass = "UNEXPECTED_INPUT_TYPE", messageParameters = Map( "paramIndex" -> "1", diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala index e8dc9061199c..48d9266542f1 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala @@ -344,6 +344,40 @@ class AnalysisErrorSui
(spark) branch master updated: [SPARK-48081] Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new b99a64b0fd1c [SPARK-48081] Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type b99a64b0fd1c is described below commit b99a64b0fd1cf4b32dd2f17423775db87bae20a6 Author: Josh Rosen AuthorDate: Thu May 2 07:22:44 2024 -0700 [SPARK-48081] Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type ### What changes were proposed in this pull request? While migrating the `NTile` expression's type check failures to the new error class framework, PR https://github.com/apache/spark/pull/38457 removed a pair of not-unnecessary `return` statements and thus caused certain branches' values to be discarded rather than returned. As a result, invalid usages like ``` select ntile(99.9) OVER (order by id) from range(10) ``` trigger internal errors like errors like ``` java.lang.ClassCastException: class org.apache.spark.sql.types.Decimal cannot be cast to class java.lang.Integer (org.apache.spark.sql.types.Decimal is in unnamed module of loader 'app'; java.lang.Integer is in module java.base of loader 'bootstrap') at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:99) at org.apache.spark.sql.catalyst.expressions.NTile.checkInputDataTypes(windowExpressions.scala:877) ``` instead of clear error framework errors like ``` org.apache.spark.sql.catalyst.ExtendedAnalysisException: [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "ntile(99.9)" due to data type mismatch: The first parameter requires the "INT" type, however "99.9" has the type "DECIMAL(3,1)". SQLSTATE: 42K09; line 1 pos 7; 'Project [unresolvedalias(ntile(99.9) windowspecdefinition(id#0L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$(] +- Range (0, 10, step=1, splits=None) at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$7(CheckAnalysis.scala:315) ``` ### Why are the changes needed? Improve error messages. ### Does this PR introduce _any_ user-facing change? Yes, it improves an error message. ### How was this patch tested? Added a new test case to AnalysisErrorSuite. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46333 from JoshRosen/SPARK-48081. Authored-by: Josh Rosen Signed-off-by: Dongjoon Hyun --- .../catalyst/expressions/windowExpressions.scala | 4 +-- .../sql/catalyst/analysis/AnalysisErrorSuite.scala | 34 ++ 2 files changed, 36 insertions(+), 2 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala index 00711332350c..5881c456f6e8 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala @@ -853,7 +853,7 @@ case class NTile(buckets: Expression) extends RowNumberLike with SizeBasedWindow // for each partition. 
override def checkInputDataTypes(): TypeCheckResult = { if (!buckets.foldable) { - DataTypeMismatch( + return DataTypeMismatch( errorSubClass = "NON_FOLDABLE_INPUT", messageParameters = Map( "inputName" -> toSQLId("buckets"), @@ -864,7 +864,7 @@ case class NTile(buckets: Expression) extends RowNumberLike with SizeBasedWindow } if (buckets.dataType != IntegerType) { - DataTypeMismatch( + return DataTypeMismatch( errorSubClass = "UNEXPECTED_INPUT_TYPE", messageParameters = Map( "paramIndex" -> ordinalNumber(0), diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala index f12d22409691..19eb3a418543 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala @@ -360,6 +360,40 @@ class AnalysisErrorSuite extends AnalysisTest with DataTypeErrorsBase { "inputType" -> "\"BO
(spark) branch master updated: [SPARK-48072][SQL][TESTS] Improve SQLQuerySuite test output - use `===` instead of `sameElements` for Arrays
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 5bbbc6c25bb7 [SPARK-48072][SQL][TESTS] Improve SQLQuerySuite test output - use `===` instead of `sameElements` for Arrays 5bbbc6c25bb7 is described below commit 5bbbc6c25bb7cb7cf24330a384c67bc3e8b3a5e4 Author: Vladimir Golubev AuthorDate: Thu May 2 07:18:11 2024 -0700 [SPARK-48072][SQL][TESTS] Improve SQLQuerySuite test output - use `===` instead of `sameElements` for Arrays ### What changes were proposed in this pull request? Improve test output for the actual query to be printed alongside of expected ### Why are the changes needed? To reduce confusion later ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? `testOnly org.apache.spark.sql.SQLQuerySuite -- -z SPARK-47939` `testOnly org.apache.spark.sql.SQLQuerySuite -- -z SPARK-37965` `testOnly org.apache.spark.sql.SQLQuerySuite -- -z SPARK-27442` ### Was this patch authored or co-authored using generative AI tooling? No Closes #46318 from vladimirg-db/vladimirg-db/improve-test-output-for-sql-query-suite. Authored-by: Vladimir Golubev Signed-off-by: Dongjoon Hyun --- .../src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala| 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala index 470f8ff4cd85..56c364e20846 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala @@ -4399,8 +4399,8 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark checkAnswer(df, Row(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) :: Row(2, 4, 6, 8, 10, 12, 14, 16, 18, 20) :: Nil) - assert(df.schema.names.sameElements( -Array("max(t)", "max(t", "=", "\n", ";", "a b", "{", ".", "a.b", "a"))) + assert(df.schema.names === +Array("max(t)", "max(t", "=", "\n", ";", "a b", "{", ".", "a.b", "a")) checkAnswer(df.select("`max(t)`", "`a b`", "`{`", "`.`", "`a.b`"), Row(1, 6, 7, 8, 9) :: Row(2, 12, 14, 16, 18) :: Nil) checkAnswer(df.where("`a.b` > 10"), @@ -4418,8 +4418,8 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark checkAnswer(df, Row(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11) :: Row(2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22) :: Nil) - assert(df.schema.names.sameElements( -Array("max(t)", "max(t", "=", "\n", ";", "a b", "{", ".", "a.b", "a", ","))) + assert(df.schema.names === +Array("max(t)", "max(t", "=", "\n", ";", "a b", "{", ".", "a.b", "a", ",")) checkAnswer(df.select("`max(t)`", "`a b`", "`{`", "`.`", "`a.b`"), Row(1, 6, 7, 8, 9) :: Row(2, 12, 14, 16, 18) :: Nil) checkAnswer(df.where("`a.b` > 10"), @@ -4754,7 +4754,7 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark df.collect() .map(_.getString(0)) .map(_.replaceAll("#[0-9]+", "#N")) -.sameElements(Array(plan.stripMargin)) +=== Array(plan.stripMargin) ) checkQueryPlan( - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
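The motivation is easier to see with a toy ScalaTest suite: `Array.sameElements` collapses the comparison to a bare `Boolean`, so a failure only reports "false was not true", whereas ScalaTest's `===` keeps both operands and prints them in the failure message. A hedged sketch, assuming ScalaTest 3.x on the classpath; the suite and array values are invented for illustration.

```scala
import org.scalatest.funsuite.AnyFunSuite

class SchemaNamesExampleSuite extends AnyFunSuite {
  private val actual   = Array("max(t)", "a b", "a.b")
  private val expected = Array("max(t)", "a b", "a.b")

  test("sameElements reduces the comparison to a Boolean") {
    // If the arrays ever differ, this only prints "false was not true";
    // the actual and expected contents are lost from the report.
    assert(actual.sameElements(expected))
  }

  test("=== reports both sides on failure") {
    // If the arrays differ, ScalaTest prints the actual and expected arrays
    // side by side, which is the output improvement SPARK-48072 is after.
    assert(actual === expected)
  }
}
```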
(spark) branch master updated: [SPARK-48080][K8S] Promote `*MainAppResource` and `NonJVMResource` to `DeveloperApi`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 66e2a350fc55 [SPARK-48080][K8S] Promote `*MainAppResource` and `NonJVMResource` to `DeveloperApi` 66e2a350fc55 is described below commit 66e2a350fc55946315b52557a41d276d52124938 Author: Dongjoon Hyun AuthorDate: Wed May 1 20:41:48 2024 -0700 [SPARK-48080][K8S] Promote `*MainAppResource` and `NonJVMResource` to `DeveloperApi` ### What changes were proposed in this pull request? This PR aims to promote `*MainAppResource` and `NonJVMResource` to `DeveloperApi`. ### Why are the changes needed? Since `Apache Spark Kubernetes Operator` depends on these traits and classes, we had better maintain it as a developer API officially from `Apache Spark 4.0.0`. - https://github.com/apache/spark-kubernetes-operator/pull/10 Since there are no changes after `3.0.0`, these are defined as `Stable`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46332 from dongjoon-hyun/SPARK-48080. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../spark/deploy/k8s/submit/MainAppResource.scala | 33 ++ 1 file changed, 28 insertions(+), 5 deletions(-) diff --git a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/MainAppResource.scala b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/MainAppResource.scala index a2e01fa2d9a0..398bb76376cf 100644 --- a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/MainAppResource.scala +++ b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/MainAppResource.scala @@ -16,15 +16,38 @@ */ package org.apache.spark.deploy.k8s.submit -private[spark] sealed trait MainAppResource +import org.apache.spark.annotation.{DeveloperApi, Since, Stable} -private[spark] sealed trait NonJVMResource +/** + * :: DeveloperApi :: + * + * All traits and classes in this file are used by K8s module and Spark K8s operator. + */ + +@Stable +@DeveloperApi +@Since("2.3.0") +sealed trait MainAppResource + +@Stable +@DeveloperApi +@Since("2.4.0") +sealed trait NonJVMResource -private[spark] case class JavaMainAppResource(primaryResource: Option[String]) +@Stable +@DeveloperApi +@Since("3.0.0") +case class JavaMainAppResource(primaryResource: Option[String]) extends MainAppResource -private[spark] case class PythonMainAppResource(primaryResource: String) +@Stable +@DeveloperApi +@Since("2.4.0") +case class PythonMainAppResource(primaryResource: String) extends MainAppResource with NonJVMResource -private[spark] case class RMainAppResource(primaryResource: String) +@Stable +@DeveloperApi +@Since("2.4.0") +case class RMainAppResource(primaryResource: String) extends MainAppResource with NonJVMResource - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
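Since the traits and case classes are now public, downstream tooling such as the Spark K8s operator can pattern-match on them directly. A speculative usage sketch follows; the `describe` helper is invented for illustration, and only the imported types come from the diff above.

```scala
import org.apache.spark.deploy.k8s.submit.{
  JavaMainAppResource, MainAppResource, PythonMainAppResource, RMainAppResource}

object MainAppResourceExample {
  // Hypothetical helper: describe a submission based on its main app resource,
  // the kind of dispatch an external operator might perform.
  def describe(resource: MainAppResource): String = resource match {
    case JavaMainAppResource(primary) =>
      s"JVM app, primary resource: ${primary.getOrElse("<none>")}"
    case PythonMainAppResource(primary) =>
      s"Python app, primary resource: $primary"
    case RMainAppResource(primary) =>
      s"R app, primary resource: $primary"
  }

  def main(args: Array[String]): Unit = {
    println(describe(JavaMainAppResource(None)))
    println(describe(PythonMainAppResource("local:///opt/app/main.py")))
  }
}
```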
(spark) branch master updated: [SPARK-48078][K8S] Promote `o.a.s.d.k8s.Constants` to `DeveloperApi`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 4b16238784e0 [SPARK-48078][K8S] Promote `o.a.s.d.k8s.Constants` to `DeveloperApi` 4b16238784e0 is described below commit 4b16238784e0a3bb1a6555c90a913b54f2aec2b1 Author: Dongjoon Hyun AuthorDate: Wed May 1 19:30:52 2024 -0700 [SPARK-48078][K8S] Promote `o.a.s.d.k8s.Constants` to `DeveloperApi` ### What changes were proposed in this pull request? This PR aims to promote `org.apache.spark.deploy.k8s.Constants` to `DeveloperApi` ### Why are the changes needed? Since `Apache Spark Kubernetes Operator` depends on this, we had better maintain it as a developer API officially from `Apache Spark 4.0.0`. - https://github.com/apache/spark-kubernetes-operator/pull/10 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46329 from dongjoon-hyun/SPARK-48078. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../main/scala/org/apache/spark/deploy/k8s/Constants.scala| 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala index 385734c557a3..ead3188aa649 100644 --- a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala +++ b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala @@ -16,7 +16,16 @@ */ package org.apache.spark.deploy.k8s -private[spark] object Constants { +import org.apache.spark.annotation.{DeveloperApi, Stable} + +/** + * :: DeveloperApi :: + * + * This is used in both K8s module and Spark K8s Operator. + */ +@Stable +@DeveloperApi +object Constants { // Labels val SPARK_VERSION_LABEL = "spark-version" - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
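With the object public, external code can reuse the same label keys Spark applies to the pods it creates. A hedged sketch; the label map and version value are made up, and only `Constants.SPARK_VERSION_LABEL` is visible in the diff above.

```scala
import org.apache.spark.deploy.k8s.Constants

object SparkPodLabelsExample {
  def main(args: Array[String]): Unit = {
    // Build a label selector using the key Spark itself uses for versioning.
    val labels = Map(Constants.SPARK_VERSION_LABEL -> "4.0.0")
    println(labels.map { case (k, v) => s"$k=$v" }.mkString(","))
  }
}
```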
(spark) branch master updated: [SPARK-48077][K8S] Promote `KubernetesClientUtils` to `DeveloperApi`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a42eef9e029a [SPARK-48077][K8S] Promote `KubernetesClientUtils` to `DeveloperApi` a42eef9e029a is described below commit a42eef9e029a388559e461f856af435457406a6d Author: Dongjoon Hyun AuthorDate: Wed May 1 18:10:53 2024 -0700 [SPARK-48077][K8S] Promote `KubernetesClientUtils` to `DeveloperApi` ### What changes were proposed in this pull request? This PR aims to promote `KubernetesClientUtils` to `DeveloperApi`. ### Why are the changes needed? Since `Apache Spark Kubernetes Operator` requires this, we had better maintain it as a developer API officially from `Apache Spark 4.0.0`. - https://github.com/apache/spark-kubernetes-operator/pull/10 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46327 from dongjoon-hyun/SPARK-48077. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../spark/deploy/k8s/submit/KubernetesClientUtils.scala | 16 +++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientUtils.scala b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientUtils.scala index 930588fb0077..d6b1da39bcbb 100644 --- a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientUtils.scala +++ b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientUtils.scala @@ -28,6 +28,7 @@ import scala.jdk.CollectionConverters._ import io.fabric8.kubernetes.api.model.{ConfigMap, ConfigMapBuilder, KeyToPath} import org.apache.spark.SparkConf +import org.apache.spark.annotation.{DeveloperApi, Since, Unstable} import org.apache.spark.deploy.k8s.{Config, Constants, KubernetesUtils} import org.apache.spark.deploy.k8s.Config.{KUBERNETES_DNS_SUBDOMAIN_NAME_MAX_LENGTH, KUBERNETES_NAMESPACE} import org.apache.spark.deploy.k8s.Constants.ENV_SPARK_CONF_DIR @@ -35,16 +36,26 @@ import org.apache.spark.internal.{Logging, MDC} import org.apache.spark.internal.LogKeys.{CONFIG, PATH, PATHS} import org.apache.spark.util.ArrayImplicits._ -private[spark] object KubernetesClientUtils extends Logging { +/** + * :: DeveloperApi :: + * + * A utility class used for K8s operations internally and Spark K8s operator. + */ +@Unstable +@DeveloperApi +object KubernetesClientUtils extends Logging { // Config map name can be KUBERNETES_DNS_SUBDOMAIN_NAME_MAX_LENGTH chars at max. + @Since("3.3.0") def configMapName(prefix: String): String = { val suffix = "-conf-map" s"${prefix.take(KUBERNETES_DNS_SUBDOMAIN_NAME_MAX_LENGTH - suffix.length)}$suffix" } + @Since("3.1.0") val configMapNameExecutor: String = configMapName(s"spark-exec-${KubernetesUtils.uniqueID()}") + @Since("3.1.0") val configMapNameDriver: String = configMapName(s"spark-drv-${KubernetesUtils.uniqueID()}") private def buildStringFromPropertiesMap(configMapName: String, @@ -62,6 +73,7 @@ private[spark] object KubernetesClientUtils extends Logging { /** * Build, file -> 'file's content' map of all the selected files in SPARK_CONF_DIR. 
*/ + @Since("3.1.1") def buildSparkConfDirFilesMap( configMapName: String, sparkConf: SparkConf, @@ -77,6 +89,7 @@ private[spark] object KubernetesClientUtils extends Logging { } } + @Since("3.1.0") def buildKeyToPathObjects(confFilesMap: Map[String, String]): Seq[KeyToPath] = { confFilesMap.map { case (fileName: String, _: String) => @@ -89,6 +102,7 @@ private[spark] object KubernetesClientUtils extends Logging { * Build a Config Map that will hold the content for environment variable SPARK_CONF_DIR * on remote pods. */ + @Since("3.1.0") def buildConfigMap(configMapName: String, confFileMap: Map[String, String], withLabels: Map[String, String] = Map()): ConfigMap = { val configMapNameSpace = - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
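A hedged usage sketch of the now-public utility, assuming the Spark K8s core artifact is on the classpath; the prefix string is arbitrary. As the code above shows, `configMapName` appends a `-conf-map` suffix and truncates the prefix so the result stays within the Kubernetes DNS subdomain length limit.

```scala
import org.apache.spark.deploy.k8s.submit.KubernetesClientUtils

object ConfigMapNamingExample {
  def main(args: Array[String]): Unit = {
    // For short prefixes the result is simply "<prefix>-conf-map"; very long
    // prefixes are truncated before the suffix is appended.
    println(KubernetesClientUtils.configMapName("spark-drv-1234567890abcdef"))
    // expected output: spark-drv-1234567890abcdef-conf-map
  }
}
```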
(spark) branch master updated (04f3a938895c -> 0fc7c4a29c46)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 04f3a938895c [SPARK-48076][K8S] Promote `KubernetesVolumeUtils` to `DeveloperApi` add 0fc7c4a29c46 [SPARK-45891][SQL][FOLLOW-UP] Added length check to the is_variant_null expression No new revisions were added by this update. Summary of changes: .../expressions/variant/VariantExpressionEvalUtils.scala | 10 +++--- .../org/apache/spark/sql/errors/QueryExecutionErrors.scala | 5 + .../catalyst/expressions/variant/VariantExpressionSuite.scala | 7 +++ 3 files changed, 19 insertions(+), 3 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (e521d3c1f357 -> 04f3a938895c)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from e521d3c1f357 [MINOR] Fix the grammar of some comments on renaming error classes add 04f3a938895c [SPARK-48076][K8S] Promote `KubernetesVolumeUtils` to `DeveloperApi` No new revisions were added by this update. Summary of changes: .../org/apache/spark/deploy/k8s/KubernetesVolumeUtils.scala | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (69ea082fc69a -> fd57c3493af7)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 69ea082fc69a [SPARK-47934][CORE] Ensure trailing slashes in `HistoryServer` URL redirections add fd57c3493af7 [SPARK-47911][SQL] Introduces a universal BinaryFormatter to make binary output consistent No new revisions were added by this update. Summary of changes: .../spark/sql/catalyst/expressions/Cast.scala | 2 - .../sql/catalyst/expressions/ToPrettyString.scala | 2 +- .../sql/catalyst/expressions/ToStringBase.scala| 40 ++- .../org/apache/spark/sql/internal/SQLConf.scala| 42 .../apache/spark/sql/execution/HiveResult.scala| 34 +++- .../sql-tests/analyzer-results/binary.sql.out | 27 + .../analyzer-results/binary_base64.sql.out | 27 + .../analyzer-results/binary_basic.sql.out | 27 + .../sql-tests/analyzer-results/binary_hex.sql.out | 27 + .../src/test/resources/sql-tests/inputs/binary.sql | 6 +++ .../resources/sql-tests/inputs/binary_base64.sql | 3 ++ .../resources/sql-tests/inputs/binary_basic.sql| 4 ++ .../test/resources/sql-tests/inputs/binary_hex.sql | 3 ++ .../resources/sql-tests/results/binary.sql.out | 31 +++ .../sql-tests/results/binary_base64.sql.out| 31 +++ .../sql-tests/results/binary_basic.sql.out | 31 +++ .../resources/sql-tests/results/binary_hex.sql.out | 31 +++ .../org/apache/spark/sql/DataFrameShowSuite.scala | 8 +++- .../org/apache/spark/sql/DataFrameSuite.scala | 45 +++--- .../spark/sql/execution/HiveResultSuite.scala | 3 +- .../spark/sql/hive/thriftserver/RowSetUtils.scala | 33 +--- .../SparkExecuteStatementOperation.scala | 3 +- .../thriftserver/ThriftServerQueryTestSuite.scala | 24 +++- 23 files changed, 429 insertions(+), 55 deletions(-) create mode 100644 sql/core/src/test/resources/sql-tests/analyzer-results/binary.sql.out create mode 100644 sql/core/src/test/resources/sql-tests/analyzer-results/binary_base64.sql.out create mode 100644 sql/core/src/test/resources/sql-tests/analyzer-results/binary_basic.sql.out create mode 100644 sql/core/src/test/resources/sql-tests/analyzer-results/binary_hex.sql.out create mode 100644 sql/core/src/test/resources/sql-tests/inputs/binary.sql create mode 100644 sql/core/src/test/resources/sql-tests/inputs/binary_base64.sql create mode 100644 sql/core/src/test/resources/sql-tests/inputs/binary_basic.sql create mode 100644 sql/core/src/test/resources/sql-tests/inputs/binary_hex.sql create mode 100644 sql/core/src/test/resources/sql-tests/results/binary.sql.out create mode 100644 sql/core/src/test/resources/sql-tests/results/binary_base64.sql.out create mode 100644 sql/core/src/test/resources/sql-tests/results/binary_basic.sql.out create mode 100644 sql/core/src/test/resources/sql-tests/results/binary_hex.sql.out - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (5ac803079b30 -> 69ea082fc69a)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 5ac803079b30 [SPARK-48074][CORE] Improve the readability of JSON loggings add 69ea082fc69a [SPARK-47934][CORE] Ensure trailing slashes in `HistoryServer` URL redirections No new revisions were added by this update. Summary of changes: .../spark/deploy/history/HistoryServer.scala | 4 +- .../spark/deploy/history/HistoryServerSuite.scala | 58 ++ 2 files changed, 38 insertions(+), 24 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (35767bb09fe1 -> 5ac803079b30)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 35767bb09fe1 [SPARK-48070][SQL][TESTS] Support `AdaptiveQueryExecSuite.runAdaptiveAndVerifyResult` to skip check results add 5ac803079b30 [SPARK-48074][CORE] Improve the readability of JSON loggings No new revisions were added by this update. Summary of changes: .../resources/org/apache/spark/SparkLayout.json| 31 +++--- .../apache/spark/util/StructuredLoggingSuite.scala | 2 +- 2 files changed, 28 insertions(+), 5 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48070][SQL][TESTS] Support `AdaptiveQueryExecSuite.runAdaptiveAndVerifyResult` to skip check results
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 35767bb09fe1 [SPARK-48070][SQL][TESTS] Support `AdaptiveQueryExecSuite.runAdaptiveAndVerifyResult` to skip check results 35767bb09fe1 is described below commit 35767bb09fe13468c03ffbb3a45e106e8b8eb179 Author: sychen AuthorDate: Wed May 1 12:37:24 2024 -0700 [SPARK-48070][SQL][TESTS] Support `AdaptiveQueryExecSuite.runAdaptiveAndVerifyResult` to skip check results ### What changes were proposed in this pull request? This PR aims to support AdaptiveQueryExecSuite to skip check results. ### Why are the changes needed? https://github.com/apache/spark/pull/46273#discussion_r1585445992 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? GA ### Was this patch authored or co-authored using generative AI tooling? No Closes #46316 from cxzl25/SPARK-48070. Authored-by: sychen Signed-off-by: Dongjoon Hyun --- .../spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala| 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala index f6ca7ff3cdcc..d74ecb32971c 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala @@ -64,7 +64,8 @@ class AdaptiveQueryExecSuite setupTestData() - private def runAdaptiveAndVerifyResult(query: String): (SparkPlan, SparkPlan) = { + private def runAdaptiveAndVerifyResult(query: String, + skipCheckAnswer: Boolean = false): (SparkPlan, SparkPlan) = { var finalPlanCnt = 0 var hasMetricsEvent = false val listener = new SparkListener { @@ -88,8 +89,10 @@ class AdaptiveQueryExecSuite assert(planBefore.toString.startsWith("AdaptiveSparkPlan isFinalPlan=false")) val result = dfAdaptive.collect() withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") { - val df = sql(query) - checkAnswer(df, result.toImmutableArraySeq) + if (!skipCheckAnswer) { +val df = sql(query) +checkAnswer(df, result.toImmutableArraySeq) + } } val planAfter = dfAdaptive.queryExecution.executedPlan assert(planAfter.toString.startsWith("AdaptiveSparkPlan isFinalPlan=true")) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
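The change relies on a trailing parameter with a default value, so every existing call site keeps compiling unchanged while new tests with non-deterministic output can opt out of the answer comparison. A toy standalone sketch of that pattern follows; the real method is private to `AdaptiveQueryExecSuite` and returns `SparkPlan`s, so the strings here are stand-ins.

```scala
object DefaultParamSketch {
  // Mirrors the signature change above: the new flag defaults to false,
  // so existing callers are source-compatible.
  def runAdaptiveAndVerifyResult(
      query: String,
      skipCheckAnswer: Boolean = false): (String, String) = {
    if (!skipCheckAnswer) {
      // checkAnswer(sql(query), result)  // the comparison skipped when true
    }
    (s"plan-before[$query]", s"plan-after[$query]")
  }

  def main(args: Array[String]): Unit = {
    // Existing callers are unaffected...
    println(runAdaptiveAndVerifyResult("SELECT * FROM testData"))
    // ...while tests with non-deterministic results can skip the check.
    println(runAdaptiveAndVerifyResult("SELECT rand()", skipCheckAnswer = true))
  }
}
```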
(spark) branch master updated: [SPARK-46009][SQL][FOLLOWUP] Remove unused PERCENTILE_CONT and PERCENTILE_DISC in g4
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ad63eef20617 [SPARK-46009][SQL][FOLLOWUP] Remove unused PERCENTILE_CONT and PERCENTILE_DISC in g4 ad63eef20617 is described below commit ad63eef20617db7cdecce465af54e4787d0deeac Author: beliefer AuthorDate: Wed May 1 11:25:54 2024 -0700 [SPARK-46009][SQL][FOLLOWUP] Remove unused PERCENTILE_CONT and PERCENTILE_DISC in g4 ### What changes were proposed in this pull request? This PR propose to remove unused `PERCENTILE_CONT` and `PERCENTILE_DISC` in g4 ### Why are the changes needed? https://github.com/apache/spark/pull/43910 merged the parse rule of `PercentileCont` and `PercentileDisc` into `functionCall`, but forgot to remove unused `PERCENTILE_CONT` and `PERCENTILE_DISC` in g4. ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? GA. ### Was this patch authored or co-authored using generative AI tooling? 'No'. Closes #46272 from beliefer/SPARK-46009_followup2. Authored-by: beliefer Signed-off-by: Dongjoon Hyun --- docs/sql-ref-ansi-compliance.md| 2 - .../spark/sql/catalyst/parser/SqlBaseLexer.g4 | 2 - .../spark/sql/catalyst/parser/SqlBaseParser.g4 | 2 - .../sql-tests/analyzer-results/window2.sql.out | 126 + .../sql-tests/results/ansi/keywords.sql.out| 4 - .../resources/sql-tests/results/keywords.sql.out | 2 - .../ThriftServerWithSparkContextSuite.scala| 2 +- 7 files changed, 127 insertions(+), 13 deletions(-) diff --git a/docs/sql-ref-ansi-compliance.md b/docs/sql-ref-ansi-compliance.md index 011bd671ca1f..84416ffd5f83 100644 --- a/docs/sql-ref-ansi-compliance.md +++ b/docs/sql-ref-ansi-compliance.md @@ -608,8 +608,6 @@ Below is a list of all the keywords in Spark SQL. 
|PARTITIONED|non-reserved|non-reserved|non-reserved| |PARTITIONS|non-reserved|non-reserved|non-reserved| |PERCENT|non-reserved|non-reserved|non-reserved| -|PERCENTILE_CONT|reserved|non-reserved|non-reserved| -|PERCENTILE_DISC|reserved|non-reserved|non-reserved| |PIVOT|non-reserved|non-reserved|non-reserved| |PLACING|non-reserved|non-reserved|non-reserved| |POSITION|non-reserved|non-reserved|reserved| diff --git a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4 b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4 index 83e40c4a20a2..86e16af7ff10 100644 --- a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4 +++ b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4 @@ -298,8 +298,6 @@ OVERWRITE: 'OVERWRITE'; PARTITION: 'PARTITION'; PARTITIONED: 'PARTITIONED'; PARTITIONS: 'PARTITIONS'; -PERCENTILE_CONT: 'PERCENTILE_CONT'; -PERCENTILE_DISC: 'PERCENTILE_DISC'; PERCENTLIT: 'PERCENT'; PIVOT: 'PIVOT'; PLACING: 'PLACING'; diff --git a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4 b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4 index 71bd75f934ca..653224c5475f 100644 --- a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4 +++ b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4 @@ -1829,8 +1829,6 @@ nonReserved | PARTITION | PARTITIONED | PARTITIONS -| PERCENTILE_CONT -| PERCENTILE_DISC | PERCENTLIT | PIVOT | PLACING diff --git a/sql/core/src/test/resources/sql-tests/analyzer-results/window2.sql.out b/sql/core/src/test/resources/sql-tests/analyzer-results/window2.sql.out new file mode 100644 index ..6fd41286959a --- /dev/null +++ b/sql/core/src/test/resources/sql-tests/analyzer-results/window2.sql.out @@ -0,0 +1,126 @@ +-- Automatically generated by SQLQueryTestSuite +-- !query +CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES +(null, 1L, 1.0D, date("2017-08-01"), timestamp_seconds(1501545600), "a"), +(1, 1L, 1.0D, date("2017-08-01"), timestamp_seconds(1501545600), "a"), +(1, 2L, 2.5D, date("2017-08-02"), timestamp_seconds(150200), "a"), +(2, 2147483650L, 100.001D, date("2020-12-31"), timestamp_seconds(1609372800), "a"), +(1, null, 1.0D, date("2017-08-01"), timestamp_seconds(1501545600), "b"), +(2, 3L, 3.3D, date("2017-08-03"), timestamp_seconds(150300), "b"), +(3, 2147483650L, 100.001D, date("2020-12-31"), timestamp_seconds(1609372800), "b"), +(null, null, null, null, null, null), +(3, 1L, 1.0D, date("2017-
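After this cleanup, `percentile_cont` and `percentile_disc` are parsed through the regular `functionCall` rule rather than dedicated lexer keywords. A small hedged sketch confirming they still work as ordinary function calls with a `WITHIN GROUP` clause (the inline data and aliases are placeholders, not from the commit):
```
import org.apache.spark.sql.SparkSession

// Sketch only: percentile_cont / percentile_disc as ordinary function calls.
object PercentileParseDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("percentile-demo") // placeholder
      .getOrCreate()
    spark.sql(
      """SELECT
        |  percentile_cont(0.5) WITHIN GROUP (ORDER BY v) AS median_cont,
        |  percentile_disc(0.5) WITHIN GROUP (ORDER BY v) AS median_disc
        |FROM VALUES (0), (10), (20) AS t(v)""".stripMargin).show()
    spark.stop()
  }
}
```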
(spark) branch branch-3.5 updated: Revert "[SPARK-48016][SQL] Fix a bug in try_divide function when with decimals"
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new fc0ef07f2949 Revert "[SPARK-48016][SQL] Fix a bug in try_divide function when with decimals" fc0ef07f2949 is described below commit fc0ef07f2949c399537c6d9b5fb7b81f546de212 Author: Dongjoon Hyun AuthorDate: Wed May 1 11:18:29 2024 -0700 Revert "[SPARK-48016][SQL] Fix a bug in try_divide function when with decimals" This reverts commit e78ee2c5770218a521340cb84f57a02dd00f7f3a. --- .../sql/catalyst/analysis/DecimalPrecision.scala | 14 ++--- .../spark/sql/catalyst/analysis/TypeCoercion.scala | 10 ++-- sql/core/src/test/resources/log4j2.properties | 2 +- .../analyzer-results/ansi/try_arithmetic.sql.out | 56 --- .../analyzer-results/try_arithmetic.sql.out| 56 --- .../resources/sql-tests/inputs/try_arithmetic.sql | 8 --- .../sql-tests/results/ansi/try_arithmetic.sql.out | 64 -- .../sql-tests/results/try_arithmetic.sql.out | 64 -- 8 files changed, 13 insertions(+), 261 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala index f51127f53b38..09cf61a77955 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala @@ -83,7 +83,7 @@ object DecimalPrecision extends TypeCoercionRule { val resultType = widerDecimalType(p1, s1, p2, s2) val newE1 = if (e1.dataType == resultType) e1 else Cast(e1, resultType) val newE2 = if (e2.dataType == resultType) e2 else Cast(e2, resultType) - b.withNewChildren(Seq(newE1, newE2)) + b.makeCopy(Array(newE1, newE2)) } /** @@ -202,21 +202,21 @@ object DecimalPrecision extends TypeCoercionRule { case (l: Literal, r) if r.dataType.isInstanceOf[DecimalType] && l.dataType.isInstanceOf[IntegralType] && literalPickMinimumPrecision => - b.withNewChildren(Seq(Cast(l, DataTypeUtils.fromLiteral(l)), r)) + b.makeCopy(Array(Cast(l, DataTypeUtils.fromLiteral(l)), r)) case (l, r: Literal) if l.dataType.isInstanceOf[DecimalType] && r.dataType.isInstanceOf[IntegralType] && literalPickMinimumPrecision => - b.withNewChildren(Seq(l, Cast(r, DataTypeUtils.fromLiteral(r + b.makeCopy(Array(l, Cast(r, DataTypeUtils.fromLiteral(r // Promote integers inside a binary expression with fixed-precision decimals to decimals, // and fixed-precision decimals in an expression with floats / doubles to doubles case (l @ IntegralTypeExpression(), r @ DecimalExpression(_, _)) => - b.withNewChildren(Seq(Cast(l, DecimalType.forType(l.dataType)), r)) + b.makeCopy(Array(Cast(l, DecimalType.forType(l.dataType)), r)) case (l @ DecimalExpression(_, _), r @ IntegralTypeExpression()) => - b.withNewChildren(Seq(l, Cast(r, DecimalType.forType(r.dataType + b.makeCopy(Array(l, Cast(r, DecimalType.forType(r.dataType case (l, r @ DecimalExpression(_, _)) if isFloat(l.dataType) => - b.withNewChildren(Seq(l, Cast(r, DoubleType))) + b.makeCopy(Array(l, Cast(r, DoubleType))) case (l @ DecimalExpression(_, _), r) if isFloat(r.dataType) => - b.withNewChildren(Seq(Cast(l, DoubleType), r)) + b.makeCopy(Array(Cast(l, DoubleType), r)) case _ => b } } diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala index c9a4a2d40246..190e72a8e669 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala @@ -1102,22 +1102,22 @@ object TypeCoercion extends TypeCoercionBase { case a @ BinaryArithmetic(left @ StringTypeExpression(), right) if right.dataType != CalendarIntervalType => -a.withNewChildren(Seq(Cast(left, DoubleType), right)) +a.makeCopy(Array(Cast(left, DoubleType), right)) case a @ BinaryArithmetic(left, right @ StringTypeExpression()) if left.dataType != CalendarIntervalType => -a.withNewChildren(Seq(left, Cast(right, DoubleType))) +a.makeCopy(Array(left, Cast(right, DoubleType))) // For equality between string and timestamp we cast the string to a timestam
(spark) branch branch-3.4 updated: [SPARK-48068][PYTHON] `mypy` should have `--python-executable` parameter
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 70ce67cc77cc [SPARK-48068][PYTHON] `mypy` should have `--python-executable` parameter 70ce67cc77cc is described below commit 70ce67cc77ccce3a4509bba608dbab69b45cc2b9 Author: Dongjoon Hyun AuthorDate: Wed May 1 10:42:26 2024 -0700 [SPARK-48068][PYTHON] `mypy` should have `--python-executable` parameter ### What changes were proposed in this pull request? This PR aims to fix `mypy` failure by propagating `lint-python`'s `PYTHON_EXECUTABLE` to `mypy`'s parameter correctly. ### Why are the changes needed? We assumed that `PYTHON_EXECUTABLE` is used for `dev/lint-python` like the following. That's not always guaranteed. We need to use `mypy`'s parameter to make it sure. https://github.com/apache/spark/blob/ff401dde50343c9bbc1c49a0294272f2da7d01e2/.github/workflows/build_and_test.yml#L705 This patch is useful whose `python3` chooses one of multiple Python installation like our CI environment. ``` $ docker run -it --rm ghcr.io/apache/apache-spark-ci-image:master-8905641334 bash WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested root2ef6ce08d2c4:/# python3 --version Python 3.10.12 root2ef6ce08d2c4:/# python3.9 --version Python 3.9.19 ``` For example, the following shows that `PYTHON_EXECUTABLE` is not considered by `mypy`. ``` root18c8eae5791e:/spark# PYTHON_EXECUTABLE=python3.9 mypy --python-executable=python3.11 --namespace-packages --config-file python/mypy.ini python/pyspark | wc -l 3428 root18c8eae5791e:/spark# PYTHON_EXECUTABLE=python3.9 mypy --namespace-packages --config-file python/mypy.ini python/pyspark | wc -l 1 root18c8eae5791e:/spark# PYTHON_EXECUTABLE=python3.11 mypy --namespace-packages --config-file python/mypy.ini python/pyspark | wc -l 1 ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46314 from dongjoon-hyun/SPARK-48068. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun (cherry picked from commit 26c871f180306fbf86ce65f14f8e7a71f89885ed) Signed-off-by: Dongjoon Hyun --- dev/lint-python | 2 ++ 1 file changed, 2 insertions(+) diff --git a/dev/lint-python b/dev/lint-python index b5ee63e38690..9b60ca75eb9b 100755 --- a/dev/lint-python +++ b/dev/lint-python @@ -69,6 +69,7 @@ function mypy_annotation_test { echo "starting mypy annotations test..." MYPY_REPORT=$( ($MYPY_BUILD \ + --python-executable $PYTHON_EXECUTABLE \ --namespace-packages \ --config-file python/mypy.ini \ --cache-dir /tmp/.mypy_cache/ \ @@ -128,6 +129,7 @@ function mypy_examples_test { echo "starting mypy examples test..." MYPY_REPORT=$( (MYPYPATH=python $MYPY_BUILD \ + --python-executable $PYTHON_EXECUTABLE \ --namespace-packages \ --config-file python/mypy.ini \ --exclude "mllib/*" \ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.5 updated: [SPARK-48068][PYTHON] `mypy` should have `--python-executable` parameter
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 953d7f90c6db [SPARK-48068][PYTHON] `mypy` should have `--python-executable` parameter 953d7f90c6db is described below commit 953d7f90c6dbee597b0360c551dfac2a1d87d961 Author: Dongjoon Hyun AuthorDate: Wed May 1 10:42:26 2024 -0700 [SPARK-48068][PYTHON] `mypy` should have `--python-executable` parameter ### What changes were proposed in this pull request? This PR aims to fix `mypy` failure by propagating `lint-python`'s `PYTHON_EXECUTABLE` to `mypy`'s parameter correctly. ### Why are the changes needed? We assumed that `PYTHON_EXECUTABLE` is used for `dev/lint-python` like the following. That's not always guaranteed. We need to use `mypy`'s parameter to make it sure. https://github.com/apache/spark/blob/ff401dde50343c9bbc1c49a0294272f2da7d01e2/.github/workflows/build_and_test.yml#L705 This patch is useful whose `python3` chooses one of multiple Python installation like our CI environment. ``` $ docker run -it --rm ghcr.io/apache/apache-spark-ci-image:master-8905641334 bash WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested root2ef6ce08d2c4:/# python3 --version Python 3.10.12 root2ef6ce08d2c4:/# python3.9 --version Python 3.9.19 ``` For example, the following shows that `PYTHON_EXECUTABLE` is not considered by `mypy`. ``` root18c8eae5791e:/spark# PYTHON_EXECUTABLE=python3.9 mypy --python-executable=python3.11 --namespace-packages --config-file python/mypy.ini python/pyspark | wc -l 3428 root18c8eae5791e:/spark# PYTHON_EXECUTABLE=python3.9 mypy --namespace-packages --config-file python/mypy.ini python/pyspark | wc -l 1 root18c8eae5791e:/spark# PYTHON_EXECUTABLE=python3.11 mypy --namespace-packages --config-file python/mypy.ini python/pyspark | wc -l 1 ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46314 from dongjoon-hyun/SPARK-48068. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun (cherry picked from commit 26c871f180306fbf86ce65f14f8e7a71f89885ed) Signed-off-by: Dongjoon Hyun --- dev/lint-python | 2 ++ 1 file changed, 2 insertions(+) diff --git a/dev/lint-python b/dev/lint-python index d040493c86c4..7ccd32451acc 100755 --- a/dev/lint-python +++ b/dev/lint-python @@ -118,6 +118,7 @@ function mypy_annotation_test { echo "starting mypy annotations test..." MYPY_REPORT=$( ($MYPY_BUILD \ + --python-executable $PYTHON_EXECUTABLE \ --namespace-packages \ --config-file python/mypy.ini \ --cache-dir /tmp/.mypy_cache/ \ @@ -177,6 +178,7 @@ function mypy_examples_test { echo "starting mypy examples test..." MYPY_REPORT=$( (MYPYPATH=python $MYPY_BUILD \ + --python-executable $PYTHON_EXECUTABLE \ --namespace-packages \ --config-file python/mypy.ini \ --exclude "mllib/*" \ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48068][PYTHON] `mypy` should have `--python-executable` parameter
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 26c871f18030 [SPARK-48068][PYTHON] `mypy` should have `--python-executable` parameter 26c871f18030 is described below commit 26c871f180306fbf86ce65f14f8e7a71f89885ed Author: Dongjoon Hyun AuthorDate: Wed May 1 10:42:26 2024 -0700 [SPARK-48068][PYTHON] `mypy` should have `--python-executable` parameter ### What changes were proposed in this pull request? This PR aims to fix `mypy` failure by propagating `lint-python`'s `PYTHON_EXECUTABLE` to `mypy`'s parameter correctly. ### Why are the changes needed? We assumed that `PYTHON_EXECUTABLE` is used for `dev/lint-python` like the following. That's not always guaranteed. We need to use `mypy`'s parameter to make it sure. https://github.com/apache/spark/blob/ff401dde50343c9bbc1c49a0294272f2da7d01e2/.github/workflows/build_and_test.yml#L705 This patch is useful whose `python3` chooses one of multiple Python installation like our CI environment. ``` $ docker run -it --rm ghcr.io/apache/apache-spark-ci-image:master-8905641334 bash WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested root2ef6ce08d2c4:/# python3 --version Python 3.10.12 root2ef6ce08d2c4:/# python3.9 --version Python 3.9.19 ``` For example, the following shows that `PYTHON_EXECUTABLE` is not considered by `mypy`. ``` root18c8eae5791e:/spark# PYTHON_EXECUTABLE=python3.9 mypy --python-executable=python3.11 --namespace-packages --config-file python/mypy.ini python/pyspark | wc -l 3428 root18c8eae5791e:/spark# PYTHON_EXECUTABLE=python3.9 mypy --namespace-packages --config-file python/mypy.ini python/pyspark | wc -l 1 root18c8eae5791e:/spark# PYTHON_EXECUTABLE=python3.11 mypy --namespace-packages --config-file python/mypy.ini python/pyspark | wc -l 1 ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46314 from dongjoon-hyun/SPARK-48068. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- dev/lint-python | 2 ++ 1 file changed, 2 insertions(+) diff --git a/dev/lint-python b/dev/lint-python index 6bd843103bd7..b8703310bc4b 100755 --- a/dev/lint-python +++ b/dev/lint-python @@ -125,6 +125,7 @@ function mypy_annotation_test { echo "starting mypy annotations test..." MYPY_REPORT=$( ($MYPY_BUILD \ + --python-executable $PYTHON_EXECUTABLE \ --namespace-packages \ --config-file python/mypy.ini \ --cache-dir /tmp/.mypy_cache/ \ @@ -184,6 +185,7 @@ function mypy_examples_test { echo "starting mypy examples test..." MYPY_REPORT=$( (MYPYPATH=python $MYPY_BUILD \ + --python-executable $PYTHON_EXECUTABLE \ --namespace-packages \ --config-file python/mypy.ini \ --exclude "mllib/*" \ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48069][INFRA] Handle `PEP-632` by checking `ModuleNotFoundError` on `setuptools` in Python 3.12
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ff401dde5034 [SPARK-48069][INFRA] Handle `PEP-632` by checking `ModuleNotFoundError` on `setuptools` in Python 3.12 ff401dde5034 is described below commit ff401dde50343c9bbc1c49a0294272f2da7d01e2 Author: Dongjoon Hyun AuthorDate: Tue Apr 30 23:54:06 2024 -0700 [SPARK-48069][INFRA] Handle `PEP-632` by checking `ModuleNotFoundError` on `setuptools` in Python 3.12 ### What changes were proposed in this pull request? This PR aims to handle `PEP-632` by checking `ModuleNotFoundError` on `setuptools`. - [PEP 632 – Deprecate distutils module](https://peps.python.org/pep-0632/) ### Why are the changes needed? Use `Python 3.12`. ``` $ python3 --version Python 3.12.2 ``` **BEFORE** ``` $ dev/lint-python --mypy | grep ModuleNotFoundError Traceback (most recent call last): File "", line 1, in ModuleNotFoundError: No module named 'setuptools' ``` **AFTER** ``` $ dev/lint-python --mypy | grep ModuleNotFoundError ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs and manual test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46315 from dongjoon-hyun/SPARK-48069. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- dev/lint-python | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/dev/lint-python b/dev/lint-python index 8d587bd52aca..6bd843103bd7 100755 --- a/dev/lint-python +++ b/dev/lint-python @@ -84,7 +84,10 @@ function satisfies_min_version { local expected_version="$2" echo "$( "$PYTHON_EXECUTABLE" << EOM -from setuptools.extern.packaging import version +try: +from setuptools.extern.packaging import version +except ModuleNotFoundError: +from packaging import version print(version.parse('$provided_version') >= version.parse('$expected_version')) EOM )" - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48016][SQL][TESTS][FOLLOWUP] Update Java 21 golden file
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 65cf5b18648a [SPARK-48016][SQL][TESTS][FOLLOWUP] Update Java 21 golden file 65cf5b18648a is described below commit 65cf5b18648a81fc9b0787d03f23f7465c20f3ec Author: Dongjoon Hyun AuthorDate: Tue Apr 30 22:42:02 2024 -0700 [SPARK-48016][SQL][TESTS][FOLLOWUP] Update Java 21 golden file ### What changes were proposed in this pull request? This is a follow-up of SPARK-48016 to update the missed Java 21 golden file. - #46286 ### Why are the changes needed? To recover Java 21 CIs: - https://github.com/apache/spark/actions/workflows/build_java21.yml - https://github.com/apache/spark/actions/workflows/build_maven_java21.yml - https://github.com/apache/spark/actions/workflows/build_maven_java21_macos14.yml ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual tests. I regenerated all in Java 21 and this was the only one affected. ``` $ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite" ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46313 from dongjoon-hyun/SPARK-48016. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../results/try_arithmetic.sql.out.java21 | 64 ++ 1 file changed, 64 insertions(+) diff --git a/sql/core/src/test/resources/sql-tests/results/try_arithmetic.sql.out.java21 b/sql/core/src/test/resources/sql-tests/results/try_arithmetic.sql.out.java21 index dcdb9d0dcb19..002a0dfcf37e 100644 --- a/sql/core/src/test/resources/sql-tests/results/try_arithmetic.sql.out.java21 +++ b/sql/core/src/test/resources/sql-tests/results/try_arithmetic.sql.out.java21 @@ -15,6 +15,22 @@ struct NULL +-- !query +SELECT try_add(2147483647, decimal(1)) +-- !query schema +struct +-- !query output +2147483648 + + +-- !query +SELECT try_add(2147483647, "1") +-- !query schema +struct +-- !query output +2.147483648E9 + + -- !query SELECT try_add(-2147483648, -1) -- !query schema @@ -249,6 +265,22 @@ struct NULL +-- !query +SELECT try_divide(1, decimal(0)) +-- !query schema +struct +-- !query output +NULL + + +-- !query +SELECT try_divide(1, "0") +-- !query schema +struct +-- !query output +NULL + + -- !query SELECT try_divide(interval 2 year, 2) -- !query schema @@ -313,6 +345,22 @@ struct NULL +-- !query +SELECT try_subtract(2147483647, decimal(-1)) +-- !query schema +struct +-- !query output +2147483648 + + +-- !query +SELECT try_subtract(2147483647, "-1") +-- !query schema +struct +-- !query output +2.147483648E9 + + -- !query SELECT try_subtract(-2147483648, 1) -- !query schema @@ -409,6 +457,22 @@ struct NULL +-- !query +SELECT try_multiply(2147483647, decimal(-2)) +-- !query schema +struct +-- !query output +-4294967294 + + +-- !query +SELECT try_multiply(2147483647, "-2") +-- !query schema +struct +-- !query output +-4.294967294E9 + + -- !query SELECT try_multiply(-2147483648, 2) -- !query schema - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
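The new golden-file queries above can be replayed directly; a hedged Scala sketch running a few of them (the expected values in the comments are copied from the golden output in this commit, and the local session setup is illustrative):
```
import org.apache.spark.sql.SparkSession

// Sketch: re-runs a few of the queries added to the golden file above.
object TryArithmeticDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("try-arithmetic-demo") // placeholder
      .getOrCreate()
    spark.sql("SELECT try_add(2147483647, decimal(1))").show()       // 2147483648
    spark.sql("SELECT try_add(2147483647, '1')").show()              // 2.147483648E9
    spark.sql("SELECT try_divide(1, decimal(0))").show()             // NULL
    spark.sql("SELECT try_multiply(2147483647, decimal(-2))").show() // -4294967294
    spark.stop()
  }
}
```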
(spark) branch master updated: [SPARK-48047][SQL] Reduce memory pressure of empty TreeNode tags
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 02206cd66dbf [SPARK-48047][SQL] Reduce memory pressure of empty TreeNode tags 02206cd66dbf is described below commit 02206cd66dbfc8de602a685b032f1805bcf8e36f Author: Nick Young AuthorDate: Tue Apr 30 22:07:20 2024 -0700 [SPARK-48047][SQL] Reduce memory pressure of empty TreeNode tags ### What changes were proposed in this pull request? - Changed the `tags` variable of the `TreeNode` class to initialize lazily. This will reduce unnecessary driver memory pressure. ### Why are the changes needed? - Plans with large expression or operator trees are known to cause driver memory pressure; this is one step in alleviating that issue. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UT covers behavior. Outwards facing behavior does not change. ### Was this patch authored or co-authored using generative AI tooling? No Closes #46285 from n-young-db/treenode-tags. Authored-by: Nick Young Signed-off-by: Dongjoon Hyun --- .../apache/spark/sql/catalyst/trees/TreeNode.scala | 24 ++ 1 file changed, 20 insertions(+), 4 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala index 94e893d468b3..dd39f3182bfb 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala @@ -78,8 +78,16 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] /** * A mutable map for holding auxiliary information of this tree node. It will be carried over * when this node is copied via `makeCopy`, or transformed via `transformUp`/`transformDown`. + * We lazily evaluate the `tags` since the default size of a `mutable.Map` is nonzero. This + * will reduce unnecessary memory pressure. */ - private val tags: mutable.Map[TreeNodeTag[_], Any] = mutable.Map.empty + private[this] var _tags: mutable.Map[TreeNodeTag[_], Any] = null + private def tags: mutable.Map[TreeNodeTag[_], Any] = { +if (_tags eq null) { + _tags = mutable.Map.empty +} +_tags + } /** * Default tree pattern [[BitSet] for a [[TreeNode]]. @@ -147,11 +155,13 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] ineffectiveRules.get(ruleId.id) } + def isTagsEmpty: Boolean = (_tags eq null) || _tags.isEmpty + def copyTagsFrom(other: BaseType): Unit = { // SPARK-32753: it only makes sense to copy tags to a new node // but it's too expensive to detect other cases likes node removal // so we make a compromise here to copy tags to node with no tags -if (tags.isEmpty) { +if (isTagsEmpty && !other.isTagsEmpty) { tags ++= other.tags } } @@ -161,11 +171,17 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] } def getTagValue[T](tag: TreeNodeTag[T]): Option[T] = { -tags.get(tag).map(_.asInstanceOf[T]) +if (isTagsEmpty) { + None +} else { + tags.get(tag).map(_.asInstanceOf[T]) +} } def unsetTagValue[T](tag: TreeNodeTag[T]): Unit = { -tags -= tag +if (!isTagsEmpty) { + tags -= tag +} } /** - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
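The pattern used above, allocating the mutable map only on first write so tag-free nodes never pay for an empty map, can be illustrated outside of Catalyst. A minimal self-contained sketch (the `Node` and `Tag` names are invented; only the allocation strategy mirrors `TreeNode`):
```
import scala.collection.mutable

// Invented example type standing in for TreeNodeTag.
final class Tag[T](val name: String)

class Node {
  // Left null until a tag is actually set, so most nodes allocate nothing.
  private[this] var _tags: mutable.Map[Tag[_], Any] = null
  private def tags: mutable.Map[Tag[_], Any] = {
    if (_tags eq null) _tags = mutable.Map.empty
    _tags
  }

  def isTagsEmpty: Boolean = (_tags eq null) || _tags.isEmpty

  def setTagValue[T](tag: Tag[T], value: T): Unit = tags(tag) = value

  // Read paths check the null/empty state first and never force the allocation.
  def getTagValue[T](tag: Tag[T]): Option[T] =
    if (isTagsEmpty) None else tags.get(tag).map(_.asInstanceOf[T])
}

object LazyTagsDemo {
  def main(args: Array[String]): Unit = {
    val n = new Node
    val t = new Tag[Int]("maxRows")
    println(n.getTagValue(t)) // None, and no map was allocated
    n.setTagValue(t, 42)
    println(n.getTagValue(t)) // Some(42)
  }
}
```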
(spark) branch master updated: [SPARK-48063][CORE] Enable `spark.stage.ignoreDecommissionFetchFailure` by default
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f3cc8f930383 [SPARK-48063][CORE] Enable `spark.stage.ignoreDecommissionFetchFailure` by default f3cc8f930383 is described below commit f3cc8f930383659b9f99e56b38de4b97d588e20b Author: Dongjoon Hyun AuthorDate: Tue Apr 30 15:19:00 2024 -0700 [SPARK-48063][CORE] Enable `spark.stage.ignoreDecommissionFetchFailure` by default ### What changes were proposed in this pull request? This PR aims to **enable `spark.stage.ignoreDecommissionFetchFailure` by default** while keeping `spark.scheduler.maxRetainedRemovedDecommissionExecutors=0` without any change for Apache Spark 4.0.0 in order to help a user use this feature more easily by setting only one configuration, `spark.scheduler.maxRetainedRemovedDecommissionExecutors`. ### Why are the changes needed? This feature was added at Apache Spark 3.4.0 via SPARK-40481 and SPARK-40979 and has been used for two years to support executor decommissioning features in the production. - #37924 - #38441 ### Does this PR introduce _any_ user-facing change? No because `spark.scheduler.maxRetainedRemovedDecommissionExecutors` is still `0`. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46308 from dongjoon-hyun/SPARK-48063. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- core/src/main/scala/org/apache/spark/internal/config/package.scala | 2 +- docs/configuration.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/internal/config/package.scala b/core/src/main/scala/org/apache/spark/internal/config/package.scala index b2cbb6f6deb6..2e207422ae06 100644 --- a/core/src/main/scala/org/apache/spark/internal/config/package.scala +++ b/core/src/main/scala/org/apache/spark/internal/config/package.scala @@ -2403,7 +2403,7 @@ package object config { s"count ${STAGE_MAX_CONSECUTIVE_ATTEMPTS.key}") .version("3.4.0") .booleanConf - .createWithDefault(false) + .createWithDefault(true) private[spark] val SCHEDULER_MAX_RETAINED_REMOVED_EXECUTORS = ConfigBuilder("spark.scheduler.maxRetainedRemovedDecommissionExecutors") diff --git a/docs/configuration.md b/docs/configuration.md index d5e2a569fdea..2e612ffd9ab9 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -3072,7 +3072,7 @@ Apart from these, the following properties are also available, and may be useful spark.stage.ignoreDecommissionFetchFailure - false + true Whether ignore stage fetch failure caused by executor decommission when count spark.stage.maxConsecutiveAttempts - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
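For users who want the pre-4.0 behavior back, the flag can simply be turned off again. A hedged sketch, where the config keys come from the commit above and the master, app name, and retained-executor count are placeholders:
```
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object DecommissionConfExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[*]")                      // placeholder
      .setAppName("decommission-conf-example")    // placeholder
      // Restore the pre-4.0 default explicitly:
      .set("spark.stage.ignoreDecommissionFetchFailure", "false")
      // Retaining removed decommissioned executors stays opt-in (default is still 0):
      .set("spark.scheduler.maxRetainedRemovedDecommissionExecutors", "10")
    val spark = SparkSession.builder().config(conf).getOrCreate()
    println(spark.sparkContext.getConf.get("spark.stage.ignoreDecommissionFetchFailure"))
    spark.stop()
  }
}
```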
(spark) branch master updated: [SPARK-48060][SS][TESTS] Fix `StreamingQueryHashPartitionVerifySuite` to update golden files correctly
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new faab553cac70 [SPARK-48060][SS][TESTS] Fix `StreamingQueryHashPartitionVerifySuite` to update golden files correctly faab553cac70 is described below commit faab553cac70eefeec286b1823b70ad62bed87f8 Author: Dongjoon Hyun AuthorDate: Tue Apr 30 12:50:07 2024 -0700 [SPARK-48060][SS][TESTS] Fix `StreamingQueryHashPartitionVerifySuite` to update golden files correctly ### What changes were proposed in this pull request? This PR aims to fix `StreamingQueryHashPartitionVerifySuite` to update golden files correctly. - The documentation is added. - Newly generated files are updated. ### Why are the changes needed? Previously, `SPARK_GENERATE_GOLDEN_FILES` doesn't work as expected because it updates the files under `target` directory. We need to update `src/test` files. **BEFORE** ``` $ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *StreamingQueryHashPartitionVerifySuite" $ git status On branch master Your branch is up to date with 'apache/master'. nothing to commit, working tree clean ``` **AFTER** ``` $ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *StreamingQueryHashPartitionVerifySuite" \ -Dspark.sql.test.randomDataGenerator.maxStrLen=100 \ -Dspark.sql.test.randomDataGenerator.maxArraySize=4 $ git status On branch SPARK-48060 Your branch is up to date with 'dongjoon/SPARK-48060'. Changes not staged for commit: (use "git add ..." to update what will be committed) (use "git restore ..." to discard changes in working directory) modified: sql/core/src/test/resources/structured-streaming/partition-tests/randomSchemas modified: sql/core/src/test/resources/structured-streaming/partition-tests/rowsAndPartIds no changes added to commit (use "git add" and/or "git commit -a") ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. I regenerate the data like the following. ``` $ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *StreamingQueryHashPartitionVerifySuite" \ -Dspark.sql.test.randomDataGenerator.maxStrLen=100 \ -Dspark.sql.test.randomDataGenerator.maxArraySize=4 ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46304 from dongjoon-hyun/SPARK-48060. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../partition-tests/randomSchemas | 2 +- .../partition-tests/rowsAndPartIds | Bin 4862115 -> 13341426 bytes .../StreamingQueryHashPartitionVerifySuite.scala | 22 +++-- 3 files changed, 17 insertions(+), 7 deletions(-) diff --git a/sql/core/src/test/resources/structured-streaming/partition-tests/randomSchemas b/sql/core/src/test/resources/structured-streaming/partition-tests/randomSchemas index 8d6ff942610c..f6eadd776cc6 100644 --- a/sql/core/src/test/resources/structured-streaming/partition-tests/randomSchemas +++ b/sql/core/src/test/resources/structured-streaming/partition-tests/randomSchemas @@ -1 +1 @@ -col_0 STRUCT NOT NULL, col_3: FLOAT NOT NULL, col_4: INT NOT NULL>,col_1 STRUCT, col_3: ARRAY NOT NULL, col_4: ARRAY, col_5: TIMESTAMP NOT NULL, col_6: STRUCT, col_1: BIGINT NOT NULL> NOT NULL, col_7: ARRAY NOT NULL, col_8: ARRAY, col_9: BIGINT NOT NULL> NOT NULL,col_2 BIGINT NOT NULL,col_3 STRUCT,col_1 STRUCT NOT NULL,col_2 STRING NOT NULL,col_3 STRUCT, col_2: ARRAY NOT NULL> NOT NULL,col_4 BINARY NOT NULL,col_5 ARRAY NOT NULL,col_6 ARRAY,col_7 DOUBLE NOT NULL,col_8 ARRAY NOT NULL,col_9 ARRAY,col_10 FLOAT NOT NULL,col_11 STRUCT NOT NULL>, col_1: STRUCT NOT NULL, col_1: INT, col_2: STRUCT
(spark) branch master updated: [SPARK-48057][PYTHON][CONNECT][TESTS] Enable `GroupedApplyInPandasTests.test_grouped_with_empty_partition`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new dab20b31388b [SPARK-48057][PYTHON][CONNECT][TESTS] Enable `GroupedApplyInPandasTests.test_grouped_with_empty_partition` dab20b31388b is described below commit dab20b31388ba7bcd2ab4d4424cbbd072bf84c30 Author: Ruifeng Zheng AuthorDate: Tue Apr 30 12:19:18 2024 -0700 [SPARK-48057][PYTHON][CONNECT][TESTS] Enable `GroupedApplyInPandasTests.test_grouped_with_empty_partition` ### What changes were proposed in this pull request? Enable `GroupedApplyInPandasTests. test_grouped_with_empty_partition` ### Why are the changes needed? test coverage ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #46299 from zhengruifeng/fix_test_grouped_with_empty_partition. Authored-by: Ruifeng Zheng Signed-off-by: Dongjoon Hyun --- python/pyspark/sql/tests/connect/test_parity_pandas_grouped_map.py | 4 python/pyspark/sql/tests/pandas/test_pandas_grouped_map.py | 4 ++-- 2 files changed, 2 insertions(+), 6 deletions(-) diff --git a/python/pyspark/sql/tests/connect/test_parity_pandas_grouped_map.py b/python/pyspark/sql/tests/connect/test_parity_pandas_grouped_map.py index 1cc4ce012623..8a1da440c799 100644 --- a/python/pyspark/sql/tests/connect/test_parity_pandas_grouped_map.py +++ b/python/pyspark/sql/tests/connect/test_parity_pandas_grouped_map.py @@ -38,10 +38,6 @@ class GroupedApplyInPandasTests(GroupedApplyInPandasTestsMixin, ReusedConnectTes def test_apply_in_pandas_returning_incompatible_type(self): super().test_apply_in_pandas_returning_incompatible_type() -@unittest.skip("Spark Connect doesn't support RDD but the test depends on it.") -def test_grouped_with_empty_partition(self): -super().test_grouped_with_empty_partition() - if __name__ == "__main__": from pyspark.sql.tests.connect.test_parity_pandas_grouped_map import * # noqa: F401 diff --git a/python/pyspark/sql/tests/pandas/test_pandas_grouped_map.py b/python/pyspark/sql/tests/pandas/test_pandas_grouped_map.py index f43dafc0a4a1..1e86e12eb74f 100644 --- a/python/pyspark/sql/tests/pandas/test_pandas_grouped_map.py +++ b/python/pyspark/sql/tests/pandas/test_pandas_grouped_map.py @@ -680,13 +680,13 @@ class GroupedApplyInPandasTestsMixin: data = [Row(id=1, x=2), Row(id=1, x=3), Row(id=2, x=4)] expected = [Row(id=1, x=5), Row(id=1, x=5), Row(id=2, x=4)] num_parts = len(data) + 1 -df = self.spark.createDataFrame(self.sc.parallelize(data, numSlices=num_parts)) +df = self.spark.createDataFrame(data).repartition(num_parts) f = pandas_udf( lambda pdf: pdf.assign(x=pdf["x"].sum()), "id long, x int", PandasUDFType.GROUPED_MAP ) -result = df.groupBy("id").apply(f).collect() +result = df.groupBy("id").apply(f).sort("id").collect() self.assertEqual(result, expected) def test_grouped_over_window(self): - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (0329479acb67 -> 9caa6f7f8b8e)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 0329479acb67 [SPARK-47359][SQL] Support TRANSLATE function to work with collated strings add 9caa6f7f8b8e [SPARK-48061][SQL][TESTS] Parameterize max limits of `spark.sql.test.randomDataGenerator` No new revisions were added by this update. Summary of changes: .../test/scala/org/apache/spark/sql/RandomDataGenerator.scala| 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46122][SQL] Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 9e8c4aa3f43a [SPARK-46122][SQL] Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default 9e8c4aa3f43a is described below commit 9e8c4aa3f43a3d99bff56cca319db623abc473ee Author: Dongjoon Hyun AuthorDate: Tue Apr 30 01:44:37 2024 -0700 [SPARK-46122][SQL] Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default ### What changes were proposed in this pull request? This PR aims to switch `spark.sql.legacy.createHiveTableByDefault` to `false` by default in order to move away from this legacy behavior from `Apache Spark 4.0.0` while the legacy functionality will be preserved during Apache Spark 4.x period by setting `spark.sql.legacy.createHiveTableByDefault=true`. ### Why are the changes needed? Historically, this behavior change was merged at `Apache Spark 3.0.0` activity in SPARK-30098 and reverted officially during the `3.0.0 RC` period. - 2019-12-06: #26736 (58be82a) - 2019-12-06: https://lists.apache.org/thread/g90dz1og1zt4rr5h091rn1zqo50y759j - 2020-05-16: #28517 At `Apache Spark 3.1.0`, we had another discussion and defined it as `Legacy` behavior via a new configuration by reusing the JIRA ID, SPARK-30098. - 2020-12-01: https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204 - 2020-12-03: #30554 Last year, this was proposed again twice and `Apache Spark 4.0.0` is a good time to make a decision for Apache Spark future direction. - SPARK-42603 on 2023-02-27 as an independent idea. - SPARK-46122 on 2023-11-27 as a part of Apache Spark 4.0.0 idea ### Does this PR introduce _any_ user-facing change? Yes, the migration document is updated. ### How was this patch tested? Pass the CIs with the adjusted test cases. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46207 from dongjoon-hyun/SPARK-46122. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- docs/sql-migration-guide.md | 1 + python/pyspark/sql/tests/test_readwriter.py | 5 ++--- .../src/main/scala/org/apache/spark/sql/internal/SQLConf.scala| 2 +- .../apache/spark/sql/execution/command/PlanResolutionSuite.scala | 8 +++- 4 files changed, 7 insertions(+), 9 deletions(-) diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md index 1e0fdadde1e3..07562babc87d 100644 --- a/docs/sql-migration-guide.md +++ b/docs/sql-migration-guide.md @@ -25,6 +25,7 @@ license: | ## Upgrading from Spark SQL 3.5 to 4.0 - Since Spark 4.0, `spark.sql.ansi.enabled` is on by default. To restore the previous behavior, set `spark.sql.ansi.enabled` to `false` or `SPARK_ANSI_SQL_MODE` to `false`. +- Since Spark 4.0, `CREATE TABLE` syntax without `USING` and `STORED AS` will use the value of `spark.sql.sources.default` as the table provider instead of `Hive`. To restore the previous behavior, set `spark.sql.legacy.createHiveTableByDefault` to `true`. - Since Spark 4.0, the default behaviour when inserting elements in a map is changed to first normalize keys -0.0 to 0.0. The affected SQL functions are `create_map`, `map_from_arrays`, `map_from_entries`, and `map_concat`. To restore the previous behaviour, set `spark.sql.legacy.disableMapKeyNormalization` to `true`. - Since Spark 4.0, the default value of `spark.sql.maxSinglePartitionBytes` is changed from `Long.MaxValue` to `128m`. 
To restore the previous behavior, set `spark.sql.maxSinglePartitionBytes` to `9223372036854775807`(`Long.MaxValue`). - Since Spark 4.0, any read of SQL tables takes into consideration the SQL configs `spark.sql.files.ignoreCorruptFiles`/`spark.sql.files.ignoreMissingFiles` instead of the core config `spark.files.ignoreCorruptFiles`/`spark.files.ignoreMissingFiles`. diff --git a/python/pyspark/sql/tests/test_readwriter.py b/python/pyspark/sql/tests/test_readwriter.py index 5784d2c72973..e752856d0316 100644 --- a/python/pyspark/sql/tests/test_readwriter.py +++ b/python/pyspark/sql/tests/test_readwriter.py @@ -247,10 +247,9 @@ class ReadwriterV2TestsMixin: def test_create_without_provider(self): df = self.df -with self.assertRaisesRegex( -AnalysisException, "NOT_SUPPORTED_COMMAND_WITHOUT_HIVE_SUPPORT" -): +with self.table("test_table"): df.writeTo("test_table").create() +self.assertEqual(100, self.spark.sql("select * from test_table").count()) def test_table_overwrite(self): df = self.df diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.
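A short sketch of what the switch means in practice, following the migration note above; the table name and session setup are placeholders, and the legacy path additionally needs Hive support on the classpath:
```
import org.apache.spark.sql.SparkSession

object CreateTableDefaultExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("create-table-default") // placeholder
      .getOrCreate()

    // Since Spark 4.0, CREATE TABLE without USING / STORED AS picks up
    // spark.sql.sources.default (parquet by default) as the table provider.
    spark.sql("CREATE TABLE t_default (c INT)")
    spark.sql("DESCRIBE TABLE EXTENDED t_default").show(truncate = false)

    // Opting back into the legacy Hive behavior; actually creating such a table
    // afterwards also requires a session built with enableHiveSupport().
    spark.sql("SET spark.sql.legacy.createHiveTableByDefault=true")

    spark.stop()
  }
}
```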
(spark) branch master updated: [SPARK-48042][SQL] Use a timestamp formatter with timezone at class level instead of making copies at method level
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c9ed9dfccb72 [SPARK-48042][SQL] Use a timestamp formatter with timezone at class level instead of making copies at method level c9ed9dfccb72 is described below commit c9ed9dfccb72bc8d30557dcd2809c298a75c3f69 Author: Kent Yao AuthorDate: Mon Apr 29 11:13:39 2024 -0700 [SPARK-48042][SQL] Use a timestamp formatter with timezone at class level instead of making copies at method level ### What changes were proposed in this pull request? This PR creates a timestamp formatter with the timezone directly for formatting. Previously, we called `withZone` for every value in the `format` function. Because the original `zoneId` in the formatter is null and never equals the one we pass in, it creates new copies of the formatter over and over. ```java ... * * param zone the new override zone, null if no override * return a formatter based on this formatter with the requested override zone, not null */ public DateTimeFormatter withZone(ZoneId zone) { if (Objects.equals(this.zone, zone)) { return this; } return new DateTimeFormatter(printerParser, locale, decimalStyle, resolverStyle, resolverFields, chrono, zone); } ``` ### Why are the changes needed? improvement ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - Existing tests - I also ran the DateTimeBenchmark result locally, there's no performance gain at least for these cases. ### Was this patch authored or co-authored using generative AI tooling? no Closes #46282 from yaooqinn/SPARK-48042. Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun --- .../org/apache/spark/sql/catalyst/util/TimestampFormatter.scala | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala index d59b52a3818a..9f57f8375c54 100644 --- a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala +++ b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala @@ -162,6 +162,9 @@ class Iso8601TimestampFormatter( protected lazy val formatter: DateTimeFormatter = getOrCreateFormatter(pattern, locale, isParsing) + @transient + private lazy val zonedFormatter: DateTimeFormatter = formatter.withZone(zoneId) + @transient protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter( pattern, zoneId, locale, legacyFormat) @@ -231,7 +234,7 @@ class Iso8601TimestampFormatter( override def format(instant: Instant): String = { try { - formatter.withZone(zoneId).format(instant) + zonedFormatter.format(instant) } catch checkFormattedDiff(toJavaTimestamp(instantToMicros(instant)), (t: Timestamp) => format(t)) } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
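The underlying `java.time` behavior can be observed directly. A small self-contained sketch (the pattern and zone are arbitrary examples) showing that `withZone` on a zone-less formatter builds a new copy on every call, which is exactly what the class-level `zonedFormatter` above avoids doing per value:
```
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

object ZonedFormatterDemo {
  def main(args: Array[String]): Unit = {
    val base = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
    val zone = ZoneId.of("America/Los_Angeles")

    // `base` has a null override zone, so every withZone(zone) call returns a new copy.
    val copy1 = base.withZone(zone)
    val copy2 = base.withZone(zone)
    println(copy1 eq copy2)                // false: two distinct formatter instances

    // Calling withZone with the same zone on an already-zoned formatter returns `this`.
    println(copy1.withZone(zone) eq copy1) // true

    // The cached, zoned formatter is what gets reused for every value being formatted.
    println(copy1.format(Instant.now()))
  }
}
```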
(spark) branch master updated (f781d153a5e4 -> c35a21e5984f)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from f781d153a5e4 [SPARK-48046][K8S] Remove `clock` parameter from `DriverServiceFeatureStep` add c35a21e5984f [SPARK-48044][PYTHON][CONNECT] Cache `DataFrame.isStreaming` No new revisions were added by this update. Summary of changes: python/pyspark/sql/connect/dataframe.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (d42c10d9411d -> f781d153a5e4)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from d42c10d9411d [SPARK-47693][TESTS][FOLLOWUP] Reduce CollationBenchmarks time add f781d153a5e4 [SPARK-48046][K8S] Remove `clock` parameter from `DriverServiceFeatureStep` No new revisions were added by this update. Summary of changes: .../apache/spark/deploy/k8s/features/DriverServiceFeatureStep.scala | 4 +--- .../spark/deploy/k8s/features/DriverServiceFeatureStepSuite.scala | 2 +- 2 files changed, 2 insertions(+), 4 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (ccb0eb699f7c -> d42c10d9411d)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from ccb0eb699f7c [SPARK-48038][K8S] Promote driverServiceName to KubernetesDriverConf add d42c10d9411d [SPARK-47693][TESTS][FOLLOWUP] Reduce CollationBenchmarks time No new revisions were added by this update. Summary of changes: .../execution/benchmark/CollationBenchmark.scala | 38 -- 1 file changed, 20 insertions(+), 18 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48038][K8S] Promote driverServiceName to KubernetesDriverConf
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ccb0eb699f7c [SPARK-48038][K8S] Promote driverServiceName to KubernetesDriverConf ccb0eb699f7c is described below commit ccb0eb699f7c54aa3902d1ebbb34684693b563de Author: Cheng Pan AuthorDate: Mon Apr 29 08:35:13 2024 -0700 [SPARK-48038][K8S] Promote driverServiceName to KubernetesDriverConf ### What changes were proposed in this pull request? Promote `driverServiceName` from `DriverServiceFeatureStep` to `KubernetesDriverConf`. ### Why are the changes needed? To allow other feature steps, e.g. ingress(proposed in SPARK-47954), to access `driverServiceName`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? UT has been updated. ### Was this patch authored or co-authored using generative AI tooling? No Closes #46276 from pan3793/SPARK-48038. Authored-by: Cheng Pan Signed-off-by: Dongjoon Hyun --- .../apache/spark/deploy/k8s/KubernetesConf.scala | 22 +++--- .../k8s/features/DriverServiceFeatureStep.scala| 14 ++ .../spark/deploy/k8s/KubernetesTestConf.scala | 6 -- .../features/DriverServiceFeatureStepSuite.scala | 17 + 4 files changed, 34 insertions(+), 25 deletions(-) diff --git a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala index b55f9317d10b..fda772b737fe 100644 --- a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala +++ b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala @@ -24,12 +24,13 @@ import org.apache.commons.lang3.StringUtils import org.apache.spark.{SPARK_VERSION, SparkConf} import org.apache.spark.deploy.k8s.Config._ import org.apache.spark.deploy.k8s.Constants._ +import org.apache.spark.deploy.k8s.features.DriverServiceFeatureStep._ import org.apache.spark.deploy.k8s.submit._ import org.apache.spark.internal.{Logging, MDC} import org.apache.spark.internal.LogKeys.{CONFIG, EXECUTOR_ENV_REGEX} import org.apache.spark.internal.config.ConfigEntry import org.apache.spark.resource.ResourceProfile.DEFAULT_RESOURCE_PROFILE_ID -import org.apache.spark.util.Utils +import org.apache.spark.util.{Clock, SystemClock, Utils} /** * Structure containing metadata for Kubernetes logic to build Spark pods. @@ -83,12 +84,27 @@ private[spark] class KubernetesDriverConf( val mainAppResource: MainAppResource, val mainClass: String, val appArgs: Array[String], -val proxyUser: Option[String]) - extends KubernetesConf(sparkConf) { +val proxyUser: Option[String], +clock: Clock = new SystemClock()) + extends KubernetesConf(sparkConf) with Logging { def driverNodeSelector: Map[String, String] = KubernetesUtils.parsePrefixedKeyValuePairs(sparkConf, KUBERNETES_DRIVER_NODE_SELECTOR_PREFIX) + lazy val driverServiceName: String = { +val preferredServiceName = s"$resourceNamePrefix$DRIVER_SVC_POSTFIX" +if (preferredServiceName.length <= MAX_SERVICE_NAME_LENGTH) { + preferredServiceName +} else { + val randomServiceId = KubernetesUtils.uniqueID(clock) + val shorterServiceName = s"spark-$randomServiceId$DRIVER_SVC_POSTFIX" + logWarning(s"Driver's hostname would preferably be $preferredServiceName, but this is " + +s"too long (must be <= $MAX_SERVICE_NAME_LENGTH characters). 
Falling back to use " + +s"$shorterServiceName as the driver service's name.") + shorterServiceName +} + } + override val resourceNamePrefix: String = { val custom = if (Utils.isTesting) get(KUBERNETES_DRIVER_POD_NAME_PREFIX) else None custom.getOrElse(KubernetesConf.getResourceNamePrefix(appName)) diff --git a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverServiceFeatureStep.scala b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverServiceFeatureStep.scala index cba4f442371c..9adfb2b8de49 100644 --- a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverServiceFeatureStep.scala +++ b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverServiceFeatureStep.scala @@ -20,7 +20,7 @@ import scala.jdk.CollectionConverters._ import io.fabric8.kubernetes.api.model.{HasMetadata, ServiceBuilder} -import org.apache.spark.deploy.k8s.{KubernetesDriverConf, KubernetesUtils, SparkPod} +
(spark) branch master updated: [MINOR][DOCS] Remove space in the middle of configuration name in Arrow-optimized Python UDF page
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ff0751a56f01 [MINOR][DOCS] Remove space in the middle of configuration name in Arrow-optimized Python UDF page ff0751a56f01 is described below commit ff0751a56f010a6bf8a9ae86ddf0868bee615848 Author: Hyukjin Kwon AuthorDate: Sun Apr 28 22:34:30 2024 -0700 [MINOR][DOCS] Remove space in the middle of configuration name in Arrow-optimized Python UDF page ### What changes were proposed in this pull request? This PR removes a space in the middle of configuration name in Arrow-optimized Python UDF page. ![Screenshot 2024-04-29 at 1 53 42  PM](https://github.com/apache/spark/assets/6477701/46b7c448-fb30-4838-a5ba-c8f1c23398fd) https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#arrow-python-udfs ### Why are the changes needed? So users can copy and paste the configuration names properly. ### Does this PR introduce _any_ user-facing change? Yes it fixes the doc. ### How was this patch tested? Manually built the docs, and checked. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46274 from HyukjinKwon/fix-minor-typo. Authored-by: Hyukjin Kwon Signed-off-by: Dongjoon Hyun --- python/docs/source/user_guide/sql/arrow_pandas.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/python/docs/source/user_guide/sql/arrow_pandas.rst b/python/docs/source/user_guide/sql/arrow_pandas.rst index a5dfb9aa4e52..1d6a4df60690 100644 --- a/python/docs/source/user_guide/sql/arrow_pandas.rst +++ b/python/docs/source/user_guide/sql/arrow_pandas.rst @@ -339,9 +339,9 @@ Arrow Python UDFs Arrow Python UDFs are user defined functions that are executed row-by-row, utilizing Arrow for efficient batch data transfer and serialization. To define an Arrow Python UDF, you can use the :meth:`udf` decorator or wrap the function with the :meth:`udf` method, ensuring the ``useArrow`` parameter is set to True. Additionally, you can enable Arrow -optimization for Python UDFs throughout the entire SparkSession by setting the Spark configuration ``spark.sql -.execution.pythonUDF.arrow.enabled`` to true. It's important to note that the Spark configuration takes effect only -when ``useArrow`` is either not set or set to None. +optimization for Python UDFs throughout the entire SparkSession by setting the Spark configuration +``spark.sql.execution.pythonUDF.arrow.enabled`` to true. It's important to note that the Spark configuration takes +effect only when ``useArrow`` is either not set or set to None. The type hints for Arrow Python UDFs should be specified in the same way as for default, pickled Python UDFs. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (9a42610d5ad8 -> e1445e3f1cf5)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 9a42610d5ad8 [SPARK-48029][INFRA] Update the packages name removed in building the spark docker image add e1445e3f1cf5 [SPARK-48036][DOCS] Update `sql-ref-ansi-compliance.md` and `sql-ref-identifier.md` No new revisions were added by this update. Summary of changes: docs/sql-ref-ansi-compliance.md | 14 ++ docs/sql-ref-identifier.md | 2 +- 2 files changed, 7 insertions(+), 9 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48029][INFRA] Update the packages name removed in building the spark docker image
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 9a42610d5ad8 [SPARK-48029][INFRA] Update the packages name removed in building the spark docker image 9a42610d5ad8 is described below commit 9a42610d5ad8ae0ded92fb68c7617861cfe975e1 Author: panbingkun AuthorDate: Sun Apr 28 21:43:47 2024 -0700 [SPARK-48029][INFRA] Update the packages name removed in building the spark docker image ### What changes were proposed in this pull request? The pr aims to update the packages name removed in building the spark docker image. ### Why are the changes needed? When our default image base was switched from `ubuntu 20.04` to `ubuntu 22.04`, the unused installation package in the base image has changed, in order to eliminate some warnings in building images and free disk space more accurately, we need to correct it. Before: ``` #35 [29/31] RUN apt-get remove --purge -y '^aspnet.*' '^dotnet-.*' '^llvm-.*' 'php.*' '^mongodb-.*' snapd google-chrome-stable microsoft-edge-stable firefox azure-cli google-cloud-sdk mono-devel powershell libgl1-mesa-dri || true #35 0.489 Reading package lists... #35 0.505 Building dependency tree... #35 0.507 Reading state information... #35 0.511 E: Unable to locate package ^aspnet.* #35 0.511 E: Couldn't find any package by glob '^aspnet.*' #35 0.511 E: Couldn't find any package by regex '^aspnet.*' #35 0.511 E: Unable to locate package ^dotnet-.* #35 0.511 E: Couldn't find any package by glob '^dotnet-.*' #35 0.511 E: Couldn't find any package by regex '^dotnet-.*' #35 0.511 E: Unable to locate package ^llvm-.* #35 0.511 E: Couldn't find any package by glob '^llvm-.*' #35 0.511 E: Couldn't find any package by regex '^llvm-.*' #35 0.511 E: Unable to locate package ^mongodb-.* #35 0.511 E: Couldn't find any package by glob '^mongodb-.*' #35 0.511 EPackage 'php-crypt-gpg' is not installed, so not removed #35 0.511 Package 'php' is not installed, so not removed #35 0.511 : Couldn't find any package by regex '^mongodb-.*' #35 0.511 E: Unable to locate package snapd #35 0.511 E: Unable to locate package google-chrome-stable #35 0.511 E: Unable to locate package microsoft-edge-stable #35 0.511 E: Unable to locate package firefox #35 0.511 E: Unable to locate package azure-cli #35 0.511 E: Unable to locate package google-cloud-sdk #35 0.511 E: Unable to locate package mono-devel #35 0.511 E: Unable to locate package powershell #35 DONE 0.5s #36 [30/31] RUN apt-get autoremove --purge -y #36 0.063 Reading package lists... #36 0.079 Building dependency tree... #36 0.082 Reading state information... #36 0.088 0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded. #36 DONE 0.4s ``` After: ``` #38 [32/36] RUN apt-get remove --purge -y 'gfortran-11' 'humanity-icon-theme' 'nodejs-doc' || true #38 0.066 Reading package lists... #38 0.087 Building dependency tree... #38 0.089 Reading state information... 
#38 0.094 The following packages were automatically installed and are no longer required: #38 0.094 at-spi2-core bzip2-doc dbus-user-session dconf-gsettings-backend #38 0.095 dconf-service gsettings-desktop-schemas gtk-update-icon-cache #38 0.095 hicolor-icon-theme libatk-bridge2.0-0 libatk1.0-0 libatk1.0-data #38 0.095 libatspi2.0-0 libbz2-dev libcairo-gobject2 libcolord2 libdconf1 libepoxy0 #38 0.095 libgfortran-11-dev libgtk-3-common libjs-highlight.js libllvm11 #38 0.095 libncurses-dev libncurses5-dev libphobos2-ldc-shared98 libreadline-dev #38 0.095 librsvg2-2 librsvg2-common libvte-2.91-common libwayland-client0 #38 0.095 libwayland-cursor0 libwayland-egl1 libxdamage1 libxkbcommon0 #38 0.095 session-migration tilix-common xkb-data #38 0.095 Use 'apt autoremove' to remove them. #38 0.096 The following packages will be REMOVED: #38 0.096 adwaita-icon-theme* gfortran* gfortran-11* humanity-icon-theme* libgtk-3-0* #38 0.096 libgtk-3-bin* libgtkd-3-0* libvte-2.91-0* libvted-3-0* nodejs-doc* #38 0.096 r-base-dev* tilix* ubuntu-mono* #38 0.248 0 upgraded, 0 newly installed, 13 to remove and 0 not upgraded. #38 0.248 After this operation, 99.6 MB disk space will be freed. ... (Reading database ... 70597 files and directories currently installed.) #38 0.304 Removing r-base-dev (4.1.2-1ubuntu2) ... #38 0.319 Removing gfortran (4:11
(spark) branch master updated (3d62dd72a58f -> 8f1634e833ce)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 3d62dd72a58f [SPARK-47730][K8S] Support `APP_ID` and `EXECUTOR_ID` placeholders in labels add 8f1634e833ce [SPARK-48032][BUILD] Upgrade `commons-codec` to 1.17.0 No new revisions were added by this update. Summary of changes: dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 +- pom.xml | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-47730][K8S] Support `APP_ID` and `EXECUTOR_ID` placeholders in labels
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 3d62dd72a58f [SPARK-47730][K8S] Support `APP_ID` and `EXECUTOR_ID` placeholders in labels 3d62dd72a58f is described below commit 3d62dd72a58f5a19e9a371acc09604ab9ceb9e68 Author: Xi Chen AuthorDate: Sun Apr 28 18:30:06 2024 -0700 [SPARK-47730][K8S] Support `APP_ID` and `EXECUTOR_ID` placeholders in labels ### What changes were proposed in this pull request? Currently, only the pod annotations supports `APP_ID` and `EXECUTOR_ID` placeholders. This commit aims to add the same function to pod labels. ### Why are the changes needed? The use case is to support using customized labels for availability zone based topology pod affinity. We want to use the Spark application ID as the customized label value, to allow Spark executor pods to run in the same availability zone as Spark driver pod. Although we can use the Spark internal label `spark-app-selector` directly, this is not a good practice when using it along with YuniKorn Gang Scheduling. When Gang Scheduling is enabled, the YuniKorn placeholder pods should use the same affinity as real Spark pods. In this way, we have to add the internal `spark-app-selector` label to the placeholder pods. This is not good because the placeholder pods could be recognized as Spark pods in the monitoring system. Thus we propose supporting the `APP_ID` and `EXECUTOR_ID` placeholders in Spark pod labels as well for flexibility. ### Does this PR introduce _any_ user-facing change? No because the pattern strings are very specific. ### How was this patch tested? Unit tests. ### Was this patch authored or co-authored using generative AI tooling? No Closes #46149 from jshmchenxi/SPARK-47730/support-app-placeholder-in-labels. 
Authored-by: Xi Chen Signed-off-by: Dongjoon Hyun --- .../scala/org/apache/spark/deploy/k8s/KubernetesConf.scala | 10 ++ .../org/apache/spark/deploy/k8s/KubernetesConfSuite.scala | 13 ++--- .../deploy/k8s/features/BasicDriverFeatureStepSuite.scala | 11 +++ .../spark/deploy/k8s/integrationtest/BasicTestsSuite.scala | 6 -- .../spark/deploy/k8s/integrationtest/KubernetesSuite.scala | 6 -- 5 files changed, 31 insertions(+), 15 deletions(-) diff --git a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala index a1ef04f4e311..b55f9317d10b 100644 --- a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala +++ b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala @@ -100,8 +100,9 @@ private[spark] class KubernetesDriverConf( SPARK_APP_ID_LABEL -> appId, SPARK_APP_NAME_LABEL -> KubernetesConf.getAppNameLabel(appName), SPARK_ROLE_LABEL -> SPARK_POD_DRIVER_ROLE) -val driverCustomLabels = KubernetesUtils.parsePrefixedKeyValuePairs( - sparkConf, KUBERNETES_DRIVER_LABEL_PREFIX) +val driverCustomLabels = + KubernetesUtils.parsePrefixedKeyValuePairs(sparkConf, KUBERNETES_DRIVER_LABEL_PREFIX) +.map { case(k, v) => (k, Utils.substituteAppNExecIds(v, appId, "")) } presetLabels.keys.foreach { key => require( @@ -173,8 +174,9 @@ private[spark] class KubernetesExecutorConf( SPARK_ROLE_LABEL -> SPARK_POD_EXECUTOR_ROLE, SPARK_RESOURCE_PROFILE_ID_LABEL -> resourceProfileId.toString) -val executorCustomLabels = KubernetesUtils.parsePrefixedKeyValuePairs( - sparkConf, KUBERNETES_EXECUTOR_LABEL_PREFIX) +val executorCustomLabels = + KubernetesUtils.parsePrefixedKeyValuePairs(sparkConf, KUBERNETES_EXECUTOR_LABEL_PREFIX) +.map { case(k, v) => (k, Utils.substituteAppNExecIds(v, appId, executorId)) } presetLabels.keys.foreach { key => require( diff --git a/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/KubernetesConfSuite.scala b/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/KubernetesConfSuite.scala index 9963db016ad9..3c53e9b74f92 100644 --- a/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/KubernetesConfSuite.scala +++ b/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/KubernetesConfSuite.scala @@ -40,7 +40,9 @@ class KubernetesConfSuite extends SparkFunSuite { "execNodeSelectorKey2" -> "execNodeSelectorValue2") private val CUSTOM_LABELS = Map( "customLabel1Key" -> "customLabe
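For illustration, a hedged sketch of how the new label placeholders might be used from the submitting application. The label keys are hypothetical, and the `{{APP_ID}}`/`{{EXECUTOR_ID}}` token syntax is assumed to match the placeholder syntax already supported for pod annotations:
```
from pyspark.sql import SparkSession

# Hypothetical label keys; the placeholder tokens are substituted with the
# actual application and executor IDs when the pods are created.
spark = (
    SparkSession.builder
    .config("spark.kubernetes.driver.label.topology-group", "{{APP_ID}}")
    .config("spark.kubernetes.executor.label.topology-group", "{{APP_ID}}")
    .config("spark.kubernetes.executor.label.executor-slot", "{{EXECUTOR_ID}}")
    .getOrCreate()
)
```
An affinity rule (or a YuniKorn placeholder pod) can then match on the custom `topology-group` label instead of the internal `spark-app-selector` label, which is the use case described in the commit message above.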
(spark) branch master updated: [SPARK-48021][ML][BUILD] Add `--add-modules=jdk.incubator.vector` to `JavaModuleOptions`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 64d321926bbc [SPARK-48021][ML][BUILD] Add `--add-modules=jdk.incubator.vector` to `JavaModuleOptions` 64d321926bbc is described below commit 64d321926bbcede05d1c145405d503b3431f185b Author: panbingkun AuthorDate: Sat Apr 27 17:38:55 2024 -0700 [SPARK-48021][ML][BUILD] Add `--add-modules=jdk.incubator.vector` to `JavaModuleOptions` ### What changes were proposed in this pull request? The pr aims to: - add `--add-modules=jdk.incubator.vector` to `JavaModuleOptions` - remove `jdk.incubator.foreign` and `-Dforeign.restricted=warn` from `SparkBuild.scala` ### Why are the changes needed? 1.`jdk.incubator.vector` First introduction: https://github.com/apache/spark/pull/30810 https://github.com/apache/spark/pull/30810/files#diff-6f545c33f2fcc975200bf208c900a600a593ce6b170180f81e2f93b3efb6cb3e https://github.com/apache/spark/assets/15246973/6ac7919a-5d82-475c-b8a2-7d9de71acacc";> Why should we add `--add-modules=jdk.incubator.vector` to `JavaModuleOptions`, Because when we only add `--add-modules=jdk.incubator.vector` to `SparkBuild.scala`, it will only take effect when compiling, as follows: ``` build/sbt "mllib-local/Test/runMain org.apache.spark.ml.linalg.BLASBenchmark" ... ``` https://github.com/apache/spark/assets/15246973/54d5f55f-cefe-4126-b255-69488f8699a6";> However, when we use `spark-submit`, it is as follows: ``` ./bin/spark-submit --class org.apache.spark.ml.linalg.BLASBenchmark /Users/panbingkun/Developer/spark/spark-community/mllib-local/target/scala-2.13/spark-mllib-local_2.13-4.0.0-SNAPSHOT-tests.jar ``` https://github.com/apache/spark/assets/15246973/8e02fa93-fef4-4cdc-96bd-908b3e9baea1";> Obviously, `--add-modules=jdk.incubator.vector` does not take effect in the `Spark runtime`, so I propose adding `--add-modules=jdk.incubator.vector` to the `JavaModuleOptions`(`Spark runtime options`) so that we can improve `performance` by using `hardware-accelerated BLAS operations` by default. 
After this patch(add `--add-modules=jdk.incubator.vector` to the `JavaModuleOptions`), as follows: https://github.com/apache/spark/assets/15246973/da7aa494-0d3c-4c60-9991-e7cd29a1cec5";> 2.`jdk.incubator.foreign` and `-Dforeign.restricted=warn` A.First introduction: https://github.com/apache/spark/pull/32253 https://github.com/apache/spark/pull/32253/files#diff-6f545c33f2fcc975200bf208c900a600a593ce6b170180f81e2f93b3efb6cb3e https://github.com/apache/spark/assets/15246973/3f526019-c389-4e60-ab2a-f8e99cfb";> Use `dev.ludovic.netlib:blas:1.3.2`, the class `ForeignLinkerBLAS` uses `jdk.incubator.foreign.*` in this version, so we need to add `jdk.incubator.foreign` and `-Dforeign.restricted=warn` to `SparkBuild.scala` https://github.com/apache/spark/pull/32253/files#diff-9c5fb3d1b7e3b0f54bc5c4182965c4fe1f9023d449017cece3005d3f90e8e4d8 https://github.com/apache/spark/assets/15246973/4fd35e96-0da2-4456-a3f6-6b57ad2e9b64";> https://github.com/luhenry/netlib/blob/v1.3.2/blas/src/main/java/dev/ludovic/netlib/blas/ForeignLinkerBLAS.java#L36 https://github.com/apache/spark/assets/15246973/4b7e3bd1-4650-4c7d-bdb4-c1761d48d478";> However, with the iterative development of `dev.ludovic.netlib`, `ForeignLinkerBLAS` has experienced one `major` change, as following: https://github.com/luhenry/netlib/commit/48e923c3e5e84560139eb25b3c9df9873c05e41d https://github.com/apache/spark/assets/15246973/7ba30b19-00c7-4cc4-bea7-a6ab4b326ad8";> From now on (V3.0.0), `jdk.incubator.foreign.*` will not be used in `dev.ludovic.netlib` Currently, Spark has used the `dev.ludovic.netlib` of version `v3.0.3`. In this version, `ForeignLinkerBLAS` has be removed. https://github.com/apache/spark/blob/master/pom.xml#L191 Double check (`jdk.incubator.foreign` cannot be found in the `netlib` source code): https://github.com/apache/spark/assets/15246973/5c6c6d73-6a5d-427a-9fb4-f626f02335ca";> So we can completely remove options `jdk.incubator.foreign` and `-Dforeign.restricted=warn`. B.For JDK 21 (PS: This is to explain the historical reasons for the differences between the current code logic and the initial ones) (Just because `Spark` made changes to support `JDK 21`) https://issues.apache.org/jira/browse/SPARK-44088 https://github.com/apache/spark/assets/15246973/34e7e7e8-4e72-470e-abc0-d79406ad25e5";> ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Manually test - Pass G
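For Spark builds that do not yet ship this flag in `JavaModuleOptions`, the module could presumably be passed through the standard extra JVM options instead; a sketch under that assumption (note that driver-side JVM options generally have to be supplied at launch time, e.g. via `spark-defaults.conf` or `spark-submit --conf`, rather than after the driver JVM is already running):
```
from pyspark.sql import SparkSession

# Sketch only: pass the incubator module to executor JVMs explicitly; the
# driver-side equivalent (spark.driver.extraJavaOptions) must be set at submit time.
spark = (
    SparkSession.builder
    .config("spark.executor.extraJavaOptions", "--add-modules=jdk.incubator.vector")
    .getOrCreate()
)
```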
(spark) branch master updated: [SPARK-47408][SQL] Fix mathExpressions that use StringType
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new b623601910a3 [SPARK-47408][SQL] Fix mathExpressions that use StringType b623601910a3 is described below commit b623601910a37c863edac56d18e79a44b93c5b36 Author: Mihailo Milosevic AuthorDate: Fri Apr 26 19:48:27 2024 -0700 [SPARK-47408][SQL] Fix mathExpressions that use StringType ### What changes were proposed in this pull request? Support more functions that use strings with collations. ### Why are the changes needed? Hex, Unhex, Conv are widely used and need to be enabled wih collations ### Does this PR introduce _any_ user-facing change? Yes, enabled more functions. ### How was this patch tested? With new tests in `CollationSQLExpressionsSuite.scala`. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46227 from mihailom-db/SPARK-47408. Lead-authored-by: Mihailo Milosevic Co-authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com> Signed-off-by: Dongjoon Hyun --- .../sql/catalyst/expressions/mathExpressions.scala | 21 ++-- .../catalyst/expressions/stringExpressions.scala | 2 +- .../spark/sql/CollationSQLExpressionsSuite.scala | 124 + 3 files changed, 138 insertions(+), 9 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala index 0c09e9be12e9..dc50c18f2ebb 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala @@ -30,6 +30,7 @@ import org.apache.spark.sql.catalyst.expressions.codegen.Block._ import org.apache.spark.sql.catalyst.util.{MathUtils, NumberConverter, TypeUtils} import org.apache.spark.sql.errors.{QueryCompilationErrors, QueryExecutionErrors} import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.internal.types.StringTypeAnyCollation import org.apache.spark.sql.types._ import org.apache.spark.unsafe.types.UTF8String @@ -450,8 +451,9 @@ case class Conv( override def first: Expression = numExpr override def second: Expression = fromBaseExpr override def third: Expression = toBaseExpr - override def inputTypes: Seq[AbstractDataType] = Seq(StringType, IntegerType, IntegerType) - override def dataType: DataType = StringType + override def inputTypes: Seq[AbstractDataType] = +Seq(StringTypeAnyCollation, IntegerType, IntegerType) + override def dataType: DataType = first.dataType override def nullable: Boolean = true override def nullSafeEval(num: Any, fromBase: Any, toBase: Any): Any = { @@ -1002,7 +1004,7 @@ case class Bin(child: Expression) extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant with Serializable { override def inputTypes: Seq[DataType] = Seq(LongType) - override def dataType: DataType = StringType + override def dataType: DataType = SQLConf.get.defaultStringType protected override def nullSafeEval(input: Any): Any = UTF8String.fromString(jl.Long.toBinaryString(input.asInstanceOf[Long])) @@ -1108,21 +1110,24 @@ case class Hex(child: Expression) extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant { override def inputTypes: Seq[AbstractDataType] = -Seq(TypeCollection(LongType, BinaryType, StringType)) 
+Seq(TypeCollection(LongType, BinaryType, StringTypeAnyCollation)) - override def dataType: DataType = StringType + override def dataType: DataType = child.dataType match { +case st: StringType => st +case _ => SQLConf.get.defaultStringType + } protected override def nullSafeEval(num: Any): Any = child.dataType match { case LongType => Hex.hex(num.asInstanceOf[Long]) case BinaryType => Hex.hex(num.asInstanceOf[Array[Byte]]) -case StringType => Hex.hex(num.asInstanceOf[UTF8String].getBytes) +case _: StringType => Hex.hex(num.asInstanceOf[UTF8String].getBytes) } override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { nullSafeCodeGen(ctx, ev, (c) => { val hex = Hex.getClass.getName.stripSuffix("$") s"${ev.value} = " + (child.dataType match { -case StringType => s"""$hex.hex($c.getBytes());""" +case _: StringType => s"""$hex.hex($c.getBytes());""" case _ => s"""$hex.hex($c);""" }) }) @@ -1149,7 +1154,7 @@ case class Unhex(child: Expression, failOnError: Boolean
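For illustration, the kind of query this change enables — string arguments carrying a non-default collation flowing into `hex`, `unhex`, and `conv`. The `UNICODE_CI` collation name and the `COLLATE` operator syntax are assumptions for the example:
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With the change above, collated string inputs should pass type checking for
# hex/unhex/conv instead of requiring the default UTF8_BINARY collation.
spark.sql("""
    SELECT
      hex('Spark' COLLATE UNICODE_CI)        AS hexed,
      unhex(hex('Spark' COLLATE UNICODE_CI)) AS round_trip,
      conv('ff' COLLATE UNICODE_CI, 16, 10)  AS as_decimal
""").show()
```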
(spark-kubernetes-operator) branch main updated: [SPARK-48015] Update `build.gradle` to fix deprecation warnings
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/spark-kubernetes-operator.git The following commit(s) were added to refs/heads/main by this push: new 167047a [SPARK-48015] Update `build.gradle` to fix deprecation warnings 167047a is described below commit 167047abed12ea8e6d709dbb3c6c326330d5787e Author: Dongjoon Hyun AuthorDate: Fri Apr 26 14:58:08 2024 -0700 [SPARK-48015] Update `build.gradle` to fix deprecation warnings ### What changes were proposed in this pull request? This PR aims to update `build.gradle` to fix deprecation warnings. ### Why are the changes needed? **AFTER** ``` $ ./gradlew build --warning-mode all > Configure project :spark-operator-api Updating PrinterColumns for generated CRD BUILD SUCCESSFUL in 331ms 16 actionable tasks: 16 up-to-date ``` **BEFORE** ``` $ ./gradlew build --warning-mode all > Configure project : Build file '/Users/dongjoon/APACHE/spark-kubernetes-operator/build.gradle': line 20 The org.gradle.api.plugins.JavaPluginConvention type has been deprecated. This is scheduled to be removed in Gradle 9.0. Consult the upgrading guide for further information: https://docs.gradle.org/8.7/userguide/upgrading_version_8.html#java_convention_deprecation at build_1ab30mf3g41rlj3ezxkowdftr$_run_closure1.doCall$original(/Users/dongjoon/APACHE/spark-kubernetes-operator/build.gradle:20) (Run with --stacktrace to get the full stack trace of this deprecation warning.) at build_1ab30mf3g41rlj3ezxkowdftr.run(/Users/dongjoon/APACHE/spark-kubernetes-operator/build.gradle:16) (Run with --stacktrace to get the full stack trace of this deprecation warning.) Build file '/Users/dongjoon/APACHE/spark-kubernetes-operator/build.gradle': line 21 The org.gradle.api.plugins.JavaPluginConvention type has been deprecated. This is scheduled to be removed in Gradle 9.0. Consult the upgrading guide for further information: https://docs.gradle.org/8.7/userguide/upgrading_version_8.html#java_convention_deprecation at build_1ab30mf3g41rlj3ezxkowdftr$_run_closure1.doCall$original(/Users/dongjoon/APACHE/spark-kubernetes-operator/build.gradle:21) (Run with --stacktrace to get the full stack trace of this deprecation warning.) at build_1ab30mf3g41rlj3ezxkowdftr.run(/Users/dongjoon/APACHE/spark-kubernetes-operator/build.gradle:16) (Run with --stacktrace to get the full stack trace of this deprecation warning.) Build file '/Users/dongjoon/APACHE/spark-kubernetes-operator/build.gradle': line 25 The RepositoryHandler.jcenter() method has been deprecated. This is scheduled to be removed in Gradle 9.0. JFrog announced JCenter's sunset in February 2021. Use mavenCentral() instead. Consult the upgrading guide for further information: https://docs.gradle.org/8.7/userguide/upgrading_version_6.html#jcenter_deprecation at build_1ab30mf3g41rlj3ezxkowdftr$_run_closure1$_closure2.doCall$original(/Users/dongjoon/APACHE/spark-kubernetes-operator/build.gradle:25) (Run with --stacktrace to get the full stack trace of this deprecation warning.) at build_1ab30mf3g41rlj3ezxkowdftr$_run_closure1.doCall$original(/Users/dongjoon/APACHE/spark-kubernetes-operator/build.gradle:23) (Run with --stacktrace to get the full stack trace of this deprecation warning.) > Configure project :spark-operator-api Updating PrinterColumns for generated CRD BUILD SUCCESSFUL in 353ms 16 actionable tasks: 16 up-to-date ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? 
Manually build with `--warning-mode all`. ``` $ ./gradlew build --warning-mode all > Configure project :spark-operator-api Updating PrinterColumns for generated CRD BUILD SUCCESSFUL in 331ms 16 actionable tasks: 16 up-to-date ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #9 from dongjoon-hyun/SPARK-48015. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- build.gradle | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/build.gradle b/build.gradle index ed54f7b..a6c1701 100644 --- a/build.gradle +++ b/build.gradle @@ -17,12 +17,14 @@ subprojects { apply plugin: 'idea' apply plugin: 'eclipse' apply plugin: 'java' - sourceCompatibility = 17 - targetCompatibility = 17 + + java { +sourceCompatibility = 17 +targetCompatibility = 17 + } repositories { mavenCentral() -jcenter() } apply plugin: 'chec
(spark-kubernetes-operator) branch main updated: [SPARK-47950] Add Java API Module for Spark Operator
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/spark-kubernetes-operator.git The following commit(s) were added to refs/heads/main by this push: new 28ff3e0 [SPARK-47950] Add Java API Module for Spark Operator 28ff3e0 is described below commit 28ff3e069e80bffa2a3be69fc4905ad3a0f76fd5 Author: zhou-jiang AuthorDate: Fri Apr 26 14:18:09 2024 -0700 [SPARK-47950] Add Java API Module for Spark Operator ### What changes were proposed in this pull request? This PR adds Java API library for Spark Operator, with the ability to generate yaml spec. ### Why are the changes needed? Spark Operator API refers to the CustomResourceDefinition(https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/) that represents the spec for Spark Application in k8s. This module would be used by operator controller and reconciler. It can also serve external services that access k8s server with Java library. ### Does this PR introduce _any_ user-facing change? No API changes in Apache Spark core API. Spark Operator API is proposed. To view generate SparkApplication spec yaml, use ``` ./gradlew :spark-operator-api:finalizeGeneratedCRD ``` (this requires yq to be installed for patching additional printer columns) Generated yaml file would be located at ``` spark-operator-api/build/classes/java/main/META-INF/fabric8/sparkapplications.org.apache.spark-v1.yml ``` For more details, please also refer `spark-operator-docs/spark_application.md` ### How was this patch tested? This is tested locally. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #8 from jiangzho/api. Authored-by: zhou-jiang Signed-off-by: Dongjoon Hyun --- .github/.licenserc.yaml| 1 + build.gradle | 2 + dev/.rat-excludes | 2 + gradle.properties | 16 ++ settings.gradle| 2 + spark-operator-api/build.gradle| 32 .../apache/spark/k8s/operator/BaseResource.java| 36 + .../org/apache/spark/k8s/operator/Constants.java | 82 ++ .../spark/k8s/operator/SparkApplication.java | 57 +++ .../spark/k8s/operator/SparkApplicationList.java | 26 +++ .../k8s/operator/decorators/ResourceDecorator.java | 26 +++ .../apache/spark/k8s/operator/diff/Diffable.java | 22 +++ .../spark/k8s/operator/spec/ApplicationSpec.java | 57 +++ .../operator/spec/ApplicationTimeoutConfig.java| 66 .../k8s/operator/spec/ApplicationTolerations.java | 45 ++ .../operator/spec/BaseApplicationTemplateSpec.java | 38 + .../apache/spark/k8s/operator/spec/BaseSpec.java | 36 + .../spark/k8s/operator/spec/DeploymentMode.java| 25 +++ .../spark/k8s/operator/spec/InstanceConfig.java| 68 .../k8s/operator/spec/ResourceRetainPolicy.java| 39 + .../spark/k8s/operator/spec/RestartConfig.java | 39 + .../spark/k8s/operator/spec/RestartPolicy.java | 39 + .../spark/k8s/operator/spec/RuntimeVersions.java | 40 + .../operator/status/ApplicationAttemptSummary.java | 53 ++ .../k8s/operator/status/ApplicationState.java | 50 ++ .../operator/status/ApplicationStateSummary.java | 151 + .../k8s/operator/status/ApplicationStatus.java | 170 .../spark/k8s/operator/status/AttemptInfo.java | 44 + .../k8s/operator/status/BaseAttemptSummary.java| 37 + .../spark/k8s/operator/status/BaseState.java | 37 + .../k8s/operator/status/BaseStateSummary.java | 29 .../spark/k8s/operator/status/BaseStatus.java | 64 .../spark/k8s/operator/utils/ModelUtils.java | 110 + .../src/main/resources/printer-columns.sh | 14 +- .../k8s/operator/spec/ApplicationSpecTest.java | 42 + 
.../spark/k8s/operator/spec/RestartPolicyTest.java | 62 +++ .../k8s/operator/status/ApplicationStatusTest.java | 178 + .../spark/k8s/operator/utils/ModelUtilsTest.java | 124 ++ 38 files changed, 1956 insertions(+), 5 deletions(-) diff --git a/.github/.licenserc.yaml b/.github/.licenserc.yaml index 26ac0c1..d1d65e2 100644 --- a/.github/.licenserc.yaml +++ b/.github/.licenserc.yaml @@ -16,5 +16,6 @@ header: - '.asf.yaml' - '**/*.gradle' - gradlew +- 'build/**' comment: on-failure diff --git a/build.gradle b/build.gradle index f64212b..ed54f7b 100644 --- a/build.gradle +++ b/build.gradle
(spark) branch master updated: [SPARK-48011][CORE] Store LogKey name as a value to avoid generating new string instances
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 2b2a33cc35a8 [SPARK-48011][CORE] Store LogKey name as a value to avoid generating new string instances 2b2a33cc35a8 is described below commit 2b2a33cc35a880fafc569c707674313a56c15811 Author: Gengliang Wang AuthorDate: Fri Apr 26 13:25:15 2024 -0700 [SPARK-48011][CORE] Store LogKey name as a value to avoid generating new string instances ### What changes were proposed in this pull request? Store LogKey name as a value to avoid generating new string instances ### Why are the changes needed? To save memory usage on getting the names of `LogKey`s. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #46249 from gengliangwang/addKeyName. Authored-by: Gengliang Wang Signed-off-by: Dongjoon Hyun --- common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala | 6 +- common/utils/src/main/scala/org/apache/spark/internal/Logging.scala | 4 +--- 2 files changed, 6 insertions(+), 4 deletions(-) diff --git a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala index 04990ddc4c9d..2ca80a496ccb 100644 --- a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala +++ b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala @@ -16,10 +16,14 @@ */ package org.apache.spark.internal +import java.util.Locale + /** * All structured logging `keys` used in `MDC` must be extends `LogKey` */ -trait LogKey +trait LogKey { + val name: String = this.toString.toLowerCase(Locale.ROOT) +} /** * Various keys used for mapped diagnostic contexts(MDC) in logging. diff --git a/common/utils/src/main/scala/org/apache/spark/internal/Logging.scala b/common/utils/src/main/scala/org/apache/spark/internal/Logging.scala index 085b22bee5f3..24a60f88c24a 100644 --- a/common/utils/src/main/scala/org/apache/spark/internal/Logging.scala +++ b/common/utils/src/main/scala/org/apache/spark/internal/Logging.scala @@ -17,8 +17,6 @@ package org.apache.spark.internal -import java.util.Locale - import scala.jdk.CollectionConverters._ import org.apache.logging.log4j.{CloseableThreadContext, Level, LogManager} @@ -110,7 +108,7 @@ trait Logging { val value = if (mdc.value != null) mdc.value.toString else null sb.append(value) if (Logging.isStructuredLoggingEnabled) { - context.put(mdc.key.toString.toLowerCase(Locale.ROOT), value) + context.put(mdc.key.name, value) } if (processedParts.hasNext) { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48010][SQL] Avoid repeated calls to conf.resolver in resolveExpression
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 6098bd944f66 [SPARK-48010][SQL] Avoid repeated calls to conf.resolver in resolveExpression 6098bd944f66 is described below commit 6098bd944f6603546601a9d5b5da5f756ce2257c Author: Nikhil Sheoran <125331115+nikhilsheoran...@users.noreply.github.com> AuthorDate: Fri Apr 26 11:23:12 2024 -0700 [SPARK-48010][SQL] Avoid repeated calls to conf.resolver in resolveExpression ### What changes were proposed in this pull request? - This PR instead of calling `conf.resolver` for each call in `resolveExpression`, reuses the `resolver` obtained once. ### Why are the changes needed? - Consider a view with large number of columns (~1000s). When looking at the RuleExecutor metrics and flamegraph for a query that only does `DESCRIBE SELECT * FROM large_view`, observed that a large fraction of time is spent in `ResolveReferences` and `ResolveRelations`. Of these, the majority of the driver time went in initializing the `conf` to obtain `conf.resolver` for each of the column in the view. - Since, the same `conf` is used in each of these calls, calling the `conf.resolver` again and again can be avoided by initializing it once and reusing the same resolver. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Created a dummy view with 3000 columns. - Observed the `RuleExecutor` metrics using `RuleExecutor.dumpTimeSpent()`. - `RuleExecutor` metrics before this change (after multiple runs) ``` === Metrics of Analyzer/Optimizer Rules === Total number of runs: 1483 Total time: 8.026801698 seconds Rule Effective Time / Total Time Effective Runs / Total Runs org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations 4060159342 / 4062186814 1 / 6 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences 3789405037 / 3809203288 2 / 6 org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$CombinedTypeCoercionRule 0 / 207411640 / 6 org.apache.spark.sql.catalyst.analysis.ResolveTimeZone 17800584 / 19431350 1 / 6 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast 15036018 / 15060440 1 / 6 org.apache.spark.sql.catalyst.analysis.UpdateAttributeNullability 0 / 149298100 / 7 ``` - `RuleExecutor` metrics after this change (after multiple runs) ``` === Metrics of Analyzer/Optimizer Rules === Total number of runs: 1483 Total time: 2.892630859 seconds Rule Effective Time / Total Time Effective Runs / Total Runs org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations 1490357745 / 1492398446 1 / 6 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences 1212205822 / 1241729981 2 / 6 org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$CombinedTypeCoercionRule 0 / 238571610 / 6 org.apache.spark.sql.catalyst.analysis.ResolveTimeZone 16603250 / 18806065 1 / 6 org.apache.spark.sql.catalyst.analysis.UpdateAttributeNullability 0 / 167493060 / 7 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast 11158299 / 11183593 1 / 6 ``` ### Was this patch authored or co-authored using generative AI tooling? No Closes #46248 from nikhilsheoran-db/SPARK-48010. 
Authored-by: Nikhil Sheoran <125331115+nikhilsheoran...@users.noreply.github.com> Signed-off-by: Dongjoon Hyun --- .../spark/sql/catalyst/analysis/ColumnResolutionHelper.scala | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala index 6e27192ead32..c10e000a098c 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala +++ b/sql/
(spark) branch master updated: [SPARK-48005][PS][CONNECT][TESTS] Enable `DefaultIndexParityTests.test_index_distributed_sequence_cleanup`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 78b19d5af08e [SPARK-48005][PS][CONNECT][TESTS] Enable `DefaultIndexParityTests.test_index_distributed_sequence_cleanup` 78b19d5af08e is described below commit 78b19d5af08ea772eaea9c13b7b984a13294 Author: Ruifeng Zheng AuthorDate: Fri Apr 26 09:58:54 2024 -0700 [SPARK-48005][PS][CONNECT][TESTS] Enable `DefaultIndexParityTests.test_index_distributed_sequence_cleanup` ### What changes were proposed in this pull request? Enable `DefaultIndexParityTests. test_index_distributed_sequence_cleanup` ### Why are the changes needed? this test requires `sc` access, can be enabled in `Spark Connect with JVM` mode ### Does this PR introduce _any_ user-facing change? no, test only ### How was this patch tested? ci, also manually test: ``` python/run-tests -k --python-executables python3 --testnames 'pyspark.pandas.tests.connect.indexes.test_parity_default DefaultIndexParityTests.test_index_distributed_sequence_cleanup' Running PySpark tests. Output is in /Users/ruifeng.zheng/Dev/spark/python/unit-tests.log Will test against the following Python executables: ['python3'] Will test the following Python tests: ['pyspark.pandas.tests.connect.indexes.test_parity_default DefaultIndexParityTests.test_index_distributed_sequence_cleanup'] python3 python_implementation is CPython python3 version is: Python 3.12.2 Starting test(python3): pyspark.pandas.tests.connect.indexes.test_parity_default DefaultIndexParityTests.test_index_distributed_sequence_cleanup (temp output: /Users/ruifeng.zheng/Dev/spark/python/target/ccd3da45-f774-4f5f-8283-a91a8ee12212/python3__pyspark.pandas.tests.connect.indexes.test_parity_default_DefaultIndexParityTests.test_index_distributed_sequence_cleanup__p9yved3e.log) Finished test(python3): pyspark.pandas.tests.connect.indexes.test_parity_default DefaultIndexParityTests.test_index_distributed_sequence_cleanup (16s) Tests passed in 16 seconds ``` ### Was this patch authored or co-authored using generative AI tooling? no Closes #46242 from zhengruifeng/enable_test_index_distributed_sequence_cleanup. 
Authored-by: Ruifeng Zheng Signed-off-by: Dongjoon Hyun --- .../pyspark/pandas/tests/connect/indexes/test_parity_default.py | 3 ++- python/pyspark/pandas/tests/indexes/test_default.py | 8 2 files changed, 6 insertions(+), 5 deletions(-) diff --git a/python/pyspark/pandas/tests/connect/indexes/test_parity_default.py b/python/pyspark/pandas/tests/connect/indexes/test_parity_default.py index d6f0cadbf0cd..4240eb8fdbc8 100644 --- a/python/pyspark/pandas/tests/connect/indexes/test_parity_default.py +++ b/python/pyspark/pandas/tests/connect/indexes/test_parity_default.py @@ -19,6 +19,7 @@ import unittest from pyspark.pandas.tests.indexes.test_default import DefaultIndexTestsMixin from pyspark.testing.connectutils import ReusedConnectTestCase from pyspark.testing.pandasutils import PandasOnSparkTestUtils +from pyspark.util import is_remote_only class DefaultIndexParityTests( @@ -26,7 +27,7 @@ class DefaultIndexParityTests( PandasOnSparkTestUtils, ReusedConnectTestCase, ): -@unittest.skip("Test depends on SparkContext which is not supported from Spark Connect.") +@unittest.skipIf(is_remote_only(), "Requires JVM access") def test_index_distributed_sequence_cleanup(self): super().test_index_distributed_sequence_cleanup() diff --git a/python/pyspark/pandas/tests/indexes/test_default.py b/python/pyspark/pandas/tests/indexes/test_default.py index 3d19eb407b42..5cd9fae76dfb 100644 --- a/python/pyspark/pandas/tests/indexes/test_default.py +++ b/python/pyspark/pandas/tests/indexes/test_default.py @@ -44,7 +44,7 @@ class DefaultIndexTestsMixin: "compute.default_index_type", "distributed-sequence" ), ps.option_context("compute.ops_on_diff_frames", True): with ps.option_context("compute.default_index_cache", "LOCAL_CHECKPOINT"): -cached_rdd_ids = [rdd_id for rdd_id in self.spark._jsc.getPersistentRDDs()] +cached_rdd_ids = [rdd_id for rdd_id in self._legacy_sc._jsc.getPersistentRDDs()] psdf1 = ( self.spark.range(0, 100, 1, 10).withColumn("Key", F.col("id") % 33).pandas_api() @@ -61,13 +61,13 @@ class DefaultIndexTestsMixin: self.assertTrue( any( rdd_id not in cached_rdd_ids -for rdd_id in self.spark._jsc.getPersistentRDDs() +
(spark) branch master updated: [SPARK-48007][BUILD][TESTS] Upgrade `mssql.jdbc` to `12.6.1.jre11`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 4ee528f9b29f [SPARK-48007][BUILD][TESTS] Upgrade `mssql.jdbc` to `12.6.1.jre11` 4ee528f9b29f is described below commit 4ee528f9b29f5cd52b70b27a4b8c250c8ca1a17c Author: Kent Yao AuthorDate: Fri Apr 26 08:08:57 2024 -0700 [SPARK-48007][BUILD][TESTS] Upgrade `mssql.jdbc` to `12.6.1.jre11` ### What changes were proposed in this pull request? This PR upgrades mssql.jdbc.version to 12.6.1.jre11, https://mvnrepository.com/artifact/com.microsoft.sqlserver/mssql-jdbc. ### Why are the changes needed? test dependency management ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #46244 from yaooqinn/SPARK-48007. Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun --- .../scala/org/apache/spark/sql/jdbc/MsSQLServerDatabaseOnDocker.scala | 3 ++- pom.xml| 2 +- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSQLServerDatabaseOnDocker.scala b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSQLServerDatabaseOnDocker.scala index b351b2ad1ec7..61530f713eb8 100644 --- a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSQLServerDatabaseOnDocker.scala +++ b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSQLServerDatabaseOnDocker.scala @@ -28,5 +28,6 @@ class MsSQLServerDatabaseOnDocker extends DatabaseOnDocker { override val jdbcPort: Int = 1433 override def getJdbcUrl(ip: String, port: Int): String = -s"jdbc:sqlserver://$ip:$port;user=sa;password=Sapass123;" +s"jdbc:sqlserver://$ip:$port;user=sa;password=Sapass123;" + + "encrypt=true;trustServerCertificate=true" } diff --git a/pom.xml b/pom.xml index 9c8f8fbb2ab0..b916659fdbfa 100644 --- a/pom.xml +++ b/pom.xml @@ -325,7 +325,7 @@ 8.3.0 42.7.3 11.5.9.0 -9.4.1.jre8 +12.6.1.jre11 23.3.0.23.09 - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
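The appended `encrypt`/`trustServerCertificate` options in the test URL above presumably account for recent `mssql-jdbc` releases defaulting to encrypted connections. A hedged sketch of a JDBC read using the same options (host, database, table, and credentials are hypothetical):
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical connection details; the two trailing URL options mirror the
# integration-test change above.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://example-host:1433;databaseName=testdb;"
                   "encrypt=true;trustServerCertificate=true")
    .option("dbtable", "dbo.some_table")
    .option("user", "sa")
    .option("password", "********")
    .load()
)
df.printSchema()
```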
(spark) branch master updated: [SPARK-47991][SQL][TEST] Arrange the test cases for window frames and window functions
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ea4b7a242910 [SPARK-47991][SQL][TEST] Arrange the test cases for window frames and window functions ea4b7a242910 is described below commit ea4b7a2429106067eb30b6b47bf7c42059053d31 Author: beliefer AuthorDate: Thu Apr 25 20:54:27 2024 -0700 [SPARK-47991][SQL][TEST] Arrange the test cases for window frames and window functions ### What changes were proposed in this pull request? This PR propose to arrange the test cases for window frames and window functions. ### Why are the changes needed? Currently, `DataFrameWindowFramesSuite` and `DataFrameWindowFunctionsSuite` have different testing objectives. The comments for the above two classes are as follows: `DataFrameWindowFramesSuite` is `Window frame testing for DataFrame API.` `DataFrameWindowFunctionsSuite` is `Window function testing for DataFrame API.` But there are some test cases for window frame placed into `DataFrameWindowFunctionsSuite`. ### Does this PR introduce _any_ user-facing change? 'No'. Just arrange the test cases for window frames and window functions. ### How was this patch tested? GA ### Was this patch authored or co-authored using generative AI tooling? 'No'. Closes #46226 from beliefer/SPARK-47991. Authored-by: beliefer Signed-off-by: Dongjoon Hyun --- .../spark/sql/DataFrameWindowFramesSuite.scala | 48 ++ .../spark/sql/DataFrameWindowFunctionsSuite.scala | 48 -- 2 files changed, 48 insertions(+), 48 deletions(-) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameWindowFramesSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameWindowFramesSuite.scala index fe1393af8174..95f4cc78d156 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameWindowFramesSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameWindowFramesSuite.scala @@ -32,6 +32,28 @@ import org.apache.spark.sql.types.CalendarIntervalType class DataFrameWindowFramesSuite extends QueryTest with SharedSparkSession { import testImplicits._ + test("reuse window partitionBy") { +val df = Seq((1, "1"), (2, "2"), (1, "1"), (2, "2")).toDF("key", "value") +val w = Window.partitionBy("key").orderBy("value") + +checkAnswer( + df.select( +lead("key", 1).over(w), +lead("value", 1).over(w)), + Row(1, "1") :: Row(2, "2") :: Row(null, null) :: Row(null, null) :: Nil) + } + + test("reuse window orderBy") { +val df = Seq((1, "1"), (2, "2"), (1, "1"), (2, "2")).toDF("key", "value") +val w = Window.orderBy("value").partitionBy("key") + +checkAnswer( + df.select( +lead("key", 1).over(w), +lead("value", 1).over(w)), + Row(1, "1") :: Row(2, "2") :: Row(null, null) :: Row(null, null) :: Nil) + } + test("lead/lag with empty data frame") { val df = Seq.empty[(Int, String)].toDF("key", "value") val window = Window.partitionBy($"key").orderBy($"value") @@ -570,4 +592,30 @@ class DataFrameWindowFramesSuite extends QueryTest with SharedSparkSession { } } } + + test("SPARK-34227: WindowFunctionFrame should clear its states during preparation") { +// This creates a single partition dataframe with 3 records: +// "a", 0, null +// "a", 1, "x" +// "b", 0, null +val df = spark.range(0, 3, 1, 1).select( + when($"id" < 2, lit("a")).otherwise(lit("b")).as("key"), + ($"id" % 2).cast("int").as("order"), + when($"id" % 2 === 0, lit(null)).otherwise(lit("x")).as("value")) + +val window1 = 
Window.partitionBy($"key").orderBy($"order") + .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing) +val window2 = Window.partitionBy($"key").orderBy($"order") + .rowsBetween(Window.unboundedPreceding, Window.currentRow) +checkAnswer( + df.select( +$"key", +$"order", +nth_value($"value", 1, ignoreNulls = true).over(window1), +nth_value($"value", 1, ignoreNulls = true).over(window2)), + Seq( +Row("a", 0, "x", null), +Row("a&quo
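For illustration, a PySpark equivalent of the frame-focused cases moved above (the same partitioning reused with different frames, and `nth_value` over an unbounded versus a growing frame); the sample data is made up:
```
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 0, None), ("a", 1, "x"), ("b", 0, None)], ["key", "order", "value"]
)

# An unbounded frame vs. a frame that grows up to the current row,
# over the same partitioning and ordering.
w_all = (
    Window.partitionBy("key").orderBy("order")
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
)
w_to_current = (
    Window.partitionBy("key").orderBy("order")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)

df.select(
    "key", "order",
    F.nth_value("value", 1, ignoreNulls=True).over(w_all).alias("first_overall"),
    F.nth_value("value", 1, ignoreNulls=True).over(w_to_current).alias("first_so_far"),
).show()
```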
(spark) branch master updated: [SPARK-47933][CONNECT][PYTHON][FOLLOW-UP] Avoid referencing _to_seq in `pyspark-connect`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 79357c8ccd22 [SPARK-47933][CONNECT][PYTHON][FOLLOW-UP] Avoid referencing _to_seq in `pyspark-connect` 79357c8ccd22 is described below commit 79357c8ccd22729a074c42f700544e7e3f023a8d Author: Hyukjin Kwon AuthorDate: Thu Apr 25 14:49:21 2024 -0700 [SPARK-47933][CONNECT][PYTHON][FOLLOW-UP] Avoid referencing _to_seq in `pyspark-connect` ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/46155 that removes the reference of `_to_seq` that `pyspark-connect` package does not have. ### Why are the changes needed? To recover the CI https://github.com/apache/spark/actions/runs/8821919392/job/24218893631 ### Does this PR introduce _any_ user-facing change? No, the main change has not been released out yet. ### How was this patch tested? Manually tested. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46229 from HyukjinKwon/SPARK-47933-followuptmp. Authored-by: Hyukjin Kwon Signed-off-by: Dongjoon Hyun --- python/pyspark/sql/group.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/python/pyspark/sql/group.py b/python/pyspark/sql/group.py index d26e23bc7160..34c3531c8302 100644 --- a/python/pyspark/sql/group.py +++ b/python/pyspark/sql/group.py @@ -43,9 +43,9 @@ def dfapi(f: Callable[..., DataFrame]) -> Callable[..., DataFrame]: def df_varargs_api(f: Callable[..., DataFrame]) -> Callable[..., DataFrame]: -from pyspark.sql.classic.column import _to_seq - def _api(self: "GroupedData", *cols: str) -> DataFrame: +from pyspark.sql.classic.column import _to_seq + name = f.__name__ jdf = getattr(self._jgd, name)(_to_seq(self.session._sc, cols)) return DataFrame(jdf, self.session) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-45425][DOCS][FOLLOWUP] Add a migration guide for TINYINT type mapping change
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new e1d021214c61 [SPARK-45425][DOCS][FOLLOWUP] Add a migration guide for TINYINT type mapping change e1d021214c61 is described below commit e1d021214c6130588e69dfa05e0391d89b463f9d Author: Kent Yao AuthorDate: Thu Apr 25 08:19:40 2024 -0700 [SPARK-45425][DOCS][FOLLOWUP] Add a migration guide for TINYINT type mapping change ### What changes were proposed in this pull request? Followup of SPARK-45425, adding migration guide. ### Why are the changes needed? migration guide ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing build ### Was this patch authored or co-authored using generative AI tooling? no Closes #46224 from yaooqinn/SPARK-45425. Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun --- docs/sql-migration-guide.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md index 9b189eee6ad1..024423fb145a 100644 --- a/docs/sql-migration-guide.md +++ b/docs/sql-migration-guide.md @@ -47,6 +47,7 @@ license: | - Since Spark 4.0, MySQL JDBC datasource will read BIT(n > 1) as BinaryType, while in Spark 3.5 and previous, read as LongType. To restore the previous behavior, set `spark.sql.legacy.mysql.bitArrayMapping.enabled` to `true`. - Since Spark 4.0, MySQL JDBC datasource will write ShortType as SMALLINT, while in Spark 3.5 and previous, write as INTEGER. To restore the previous behavior, you can replace the column with IntegerType whenever before writing. - Since Spark 4.0, Oracle JDBC datasource will write TimestampType as TIMESTAMP WITH LOCAL TIME ZONE, while in Spark 3.5 and previous, write as TIMESTAMP. To restore the previous behavior, set `spark.sql.legacy.oracle.timestampMapping.enabled` to `true`. +- Since Spark 4.0, MsSQL Server JDBC datasource will read TINYINT as ShortType, while in Spark 3.5 and previous, read as IntegerType. To restore the previous behavior, set `spark.sql.legacy.mssqlserver.numericMapping.enabled` to `true`. - Since Spark 4.0, The default value for `spark.sql.legacy.ctePrecedencePolicy` has been changed from `EXCEPTION` to `CORRECTED`. Instead of raising an error, inner CTE definitions take precedence over outer definitions. - Since Spark 4.0, The default value for `spark.sql.legacy.timeParserPolicy` has been changed from `EXCEPTION` to `CORRECTED`. Instead of raising an `INCONSISTENT_BEHAVIOR_CROSS_VERSION` error, `CANNOT_PARSE_TIMESTAMP` will be raised if ANSI mode is enable. `NULL` will be returned if ANSI mode is disabled. See [Datetime Patterns for Formatting and Parsing](sql-ref-datetime-pattern.html). - Since Spark 4.0, A bug falsely allowing `!` instead of `NOT` when `!` is not a prefix operator has been fixed. Clauses such as `expr ! IN (...)`, `expr ! BETWEEN ...`, or `col ! NULL` now raise syntax errors. To restore the previous behavior, set `spark.sql.legacy.bangEqualsNot` to `true`. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
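For illustration, a minimal sketch of restoring the pre-4.0 TINYINT mapping described in the migration note added above (assuming the flag is set before the SQL Server table is read):
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Legacy behaviour: read SQL Server TINYINT columns back as IntegerType.
spark.conf.set("spark.sql.legacy.mssqlserver.numericMapping.enabled", "true")
```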
(spark) branch master updated (de5c512e0179 -> 287d02073929)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from de5c512e0179 [SPARK-47987][PYTHON][CONNECT][TESTS] Enable `ArrowParityTests.test_createDataFrame_empty_partition` add 287d02073929 [SPARK-47989][SQL] MsSQLServer: Fix the scope of spark.sql.legacy.mssqlserver.numericMapping.enabled No new revisions were added by this update. Summary of changes: .../sql/jdbc/MsSqlServerIntegrationSuite.scala | 177 +++-- .../org/apache/spark/sql/internal/SQLConf.scala| 2 +- .../apache/spark/sql/jdbc/MsSqlServerDialect.scala | 29 ++-- 3 files changed, 104 insertions(+), 104 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-47987][PYTHON][CONNECT][TESTS] Enable `ArrowParityTests.test_createDataFrame_empty_partition`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new de5c512e0179 [SPARK-47987][PYTHON][CONNECT][TESTS] Enable `ArrowParityTests.test_createDataFrame_empty_partition` de5c512e0179 is described below commit de5c512e017965b5c726e254f8969fb17d5c17ea Author: Ruifeng Zheng AuthorDate: Thu Apr 25 08:16:56 2024 -0700 [SPARK-47987][PYTHON][CONNECT][TESTS] Enable `ArrowParityTests.test_createDataFrame_empty_partition` ### What changes were proposed in this pull request? Reenable `ArrowParityTests.test_createDataFrame_empty_partition` We actually already had set up Classic SparkContext `_legacy_sc ` for Spark Connect test, so only need to add `_legacy_sc` in Classic PySpark test. ### Why are the changes needed? to improve test coverage ### Does this PR introduce _any_ user-facing change? no, test only ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #46220 from zhengruifeng/enable_test_createDataFrame_empty_partition. Authored-by: Ruifeng Zheng Signed-off-by: Dongjoon Hyun --- python/pyspark/sql/tests/connect/test_parity_arrow.py | 4 python/pyspark/sql/tests/test_arrow.py| 4 +++- python/pyspark/testing/sqlutils.py| 1 + 3 files changed, 4 insertions(+), 5 deletions(-) diff --git a/python/pyspark/sql/tests/connect/test_parity_arrow.py b/python/pyspark/sql/tests/connect/test_parity_arrow.py index 93d0b6cf0f5f..8727cc279641 100644 --- a/python/pyspark/sql/tests/connect/test_parity_arrow.py +++ b/python/pyspark/sql/tests/connect/test_parity_arrow.py @@ -24,10 +24,6 @@ from pyspark.testing.pandasutils import PandasOnSparkTestUtils class ArrowParityTests(ArrowTestsMixin, ReusedConnectTestCase, PandasOnSparkTestUtils): -@unittest.skip("Spark Connect does not support Spark Context but the test depends on that.") -def test_createDataFrame_empty_partition(self): -super().test_createDataFrame_empty_partition() - @unittest.skip("Spark Connect does not support fallback.") def test_createDataFrame_fallback_disabled(self): super().test_createDataFrame_fallback_disabled() diff --git a/python/pyspark/sql/tests/test_arrow.py b/python/pyspark/sql/tests/test_arrow.py index 5235e021bae9..03cb35feb994 100644 --- a/python/pyspark/sql/tests/test_arrow.py +++ b/python/pyspark/sql/tests/test_arrow.py @@ -56,6 +56,7 @@ from pyspark.testing.sqlutils import ( ExamplePointUDT, ) from pyspark.errors import ArithmeticException, PySparkTypeError, UnsupportedOperationException +from pyspark.util import is_remote_only if have_pandas: import pandas as pd @@ -830,7 +831,8 @@ class ArrowTestsMixin: pdf = pd.DataFrame({"c1": [1], "c2": ["string"]}) df = self.spark.createDataFrame(pdf) self.assertEqual([Row(c1=1, c2="string")], df.collect()) -self.assertGreater(self.spark.sparkContext.defaultParallelism, len(pdf)) +if not is_remote_only(): +self.assertGreater(self._legacy_sc.defaultParallelism, len(pdf)) def test_toPandas_error(self): for arrow_enabled in [True, False]: diff --git a/python/pyspark/testing/sqlutils.py b/python/pyspark/testing/sqlutils.py index 690d5c37b22e..a0fdada72972 100644 --- a/python/pyspark/testing/sqlutils.py +++ b/python/pyspark/testing/sqlutils.py @@ -258,6 +258,7 @@ class ReusedSQLTestCase(ReusedPySparkTestCase, SQLTestUtils, PySparkErrorTestUti @classmethod def setUpClass(cls): super(ReusedSQLTestCase, cls).setUpClass() +cls._legacy_sc = cls.sc cls.spark = 
SparkSession(cls.sc) cls.tempdir = tempfile.NamedTemporaryFile(delete=False) os.unlink(cls.tempdir.name) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
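Editor's note — a minimal sketch of the guard pattern this diff introduces: assertions that need the classic SparkContext run only when PySpark is not remote-only. `is_remote_only` and `_legacy_sc` come from the diff above; the helper name `assert_empty_partition_roundtrip` is hypothetical.

```python
from pyspark.util import is_remote_only

def assert_empty_partition_roundtrip(test, spark, pdf):
    # Round-trip the pandas DataFrame through createDataFrame.
    df = spark.createDataFrame(pdf)
    test.assertEqual(df.count(), len(pdf))
    if not is_remote_only():
        # Only meaningful when a classic SparkContext sits behind the session.
        test.assertGreater(test._legacy_sc.defaultParallelism, len(pdf))
```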
(spark) branch master updated: [SPARK-47990][BUILD] Upgrade `zstd-jni` to 1.5.6-3
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 5810554ce0fa [SPARK-47990][BUILD] Upgrade `zstd-jni` to 1.5.6-3 5810554ce0fa is described below commit 5810554ce0faba4cb8e7f3ca3dd5812bd2cf179f Author: panbingkun AuthorDate: Thu Apr 25 08:10:04 2024 -0700 [SPARK-47990][BUILD] Upgrade `zstd-jni` to 1.5.6-3 ### What changes were proposed in this pull request? The pr aims to upgrade `zstd-jni` from `1.5.6-2` to `1.5.6-3`. ### Why are the changes needed? 1.This version fix a potential memory leak problem, as follows: https://github.com/apache/spark/assets/15246973/eeae3e7f-0c44-443d-838b-fa39b9e45d64";> 2.https://github.com/luben/zstd-jni/compare/v1.5.6-2...v1.5.6-3 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46225 from panbingkun/SPARK-47990. Authored-by: panbingkun Signed-off-by: Dongjoon Hyun --- dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 +- pom.xml | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3 index f6adb6d18b85..005cc7bfb435 100644 --- a/dev/deps/spark-deps-hadoop-3-hive-2.3 +++ b/dev/deps/spark-deps-hadoop-3-hive-2.3 @@ -278,4 +278,4 @@ xz/1.9//xz-1.9.jar zjsonpatch/0.3.0//zjsonpatch-0.3.0.jar zookeeper-jute/3.9.2//zookeeper-jute-3.9.2.jar zookeeper/3.9.2//zookeeper-3.9.2.jar -zstd-jni/1.5.6-2//zstd-jni-1.5.6-2.jar +zstd-jni/1.5.6-3//zstd-jni-1.5.6-3.jar diff --git a/pom.xml b/pom.xml index c98514efa356..9c8f8fbb2ab0 100644 --- a/pom.xml +++ b/pom.xml @@ -800,7 +800,7 @@ com.github.luben zstd-jni -1.5.6-2 +1.5.6-3 com.clearspring.analytics - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-47979][SQL][TESTS] Use Hive tables explicitly for Hive table capability tests
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 0fcced63be99 [SPARK-47979][SQL][TESTS] Use Hive tables explicitly for Hive table capability tests 0fcced63be99 is described below commit 0fcced63be99302593591d29370c00e7c0d73cec Author: Dongjoon Hyun AuthorDate: Wed Apr 24 18:57:29 2024 -0700 [SPARK-47979][SQL][TESTS] Use Hive tables explicitly for Hive table capability tests ### What changes were proposed in this pull request? This PR aims to use `Hive` tables explicitly for Hive table capability tests in `hive` and `hive-thriftserver` module. ### Why are the changes needed? To make Hive test coverage robust by making it independent from Apache Spark configuration changes. ### Does this PR introduce _any_ user-facing change? No, this is a test only change. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46211 from dongjoon-hyun/SPARK-47979. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../org/apache/spark/sql/hive/thriftserver/UISeleniumSuite.scala | 2 +- .../scala/org/apache/spark/sql/hive/client/HiveClientSuite.scala | 1 + .../scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala | 9 +++-- .../org/apache/spark/sql/hive/execution/HiveQuerySuite.scala | 6 +++--- .../spark/sql/hive/execution/command/ShowCreateTableSuite.scala | 4 5 files changed, 12 insertions(+), 10 deletions(-) diff --git a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/UISeleniumSuite.scala b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/UISeleniumSuite.scala index b552611b75d1..2b2cbec41d64 100644 --- a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/UISeleniumSuite.scala +++ b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/UISeleniumSuite.scala @@ -108,7 +108,7 @@ class UISeleniumSuite val baseURL = s"http://$localhost:$uiPort"; val queries = Seq( -"CREATE TABLE test_map(key INT, value STRING)", +"CREATE TABLE test_map (key INT, value STRING) USING HIVE", s"LOAD DATA LOCAL INPATH '${TestData.smallKv}' OVERWRITE INTO TABLE test_map") queries.foreach(statement.execute) diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientSuite.scala index 0bc288501a01..b60adfb6f4cf 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientSuite.scala @@ -686,6 +686,7 @@ class HiveClientSuite(version: String) extends HiveVersionSuite(version) { versionSpark.sql( s""" |CREATE TABLE tab(c1 string) + |USING HIVE |location '${tmpDir.toURI.toString}' """.stripMargin) diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala index 241fdd4b9ec5..965db22b78f1 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala @@ -216,7 +216,7 @@ class HiveDDLSuite test("SPARK-22431: alter table tests with nested types") { withTable("t1", "t2", "t3") { - spark.sql("CREATE TABLE t1 (q STRUCT, i1 INT)") + 
spark.sql("CREATE TABLE t1 (q STRUCT, i1 INT) USING HIVE") spark.sql("ALTER TABLE t1 ADD COLUMNS (newcol1 STRUCT<`col1`:STRING, col2:Int>)") val newcol = spark.sql("SELECT * FROM t1").schema.fields(2).name assert("newcol1".equals(newcol)) @@ -2614,7 +2614,7 @@ class HiveDDLSuite "msg" -> "java.lang.UnsupportedOperationException: Unknown field type: void") ) - sql("CREATE TABLE t3 AS SELECT NULL AS null_col") + sql("CREATE TABLE t3 USING HIVE AS SELECT NULL AS null_col") checkAnswer(sql("SELECT * FROM t3"), Row(null)) } @@ -2642,9 +2642,6 @@ class HiveDDLSuite sql("CREATE TABLE t3 (v VOID) USING hive") checkAnswer(sql("SELECT * FROM t3"), Seq.empty) - - sql("CREATE TABLE t4 (v VOID)") - checkAnswer(sql("SELECT * FROM t4"), Seq.empty) } //
(spark) branch branch-3.5 updated: [SPARK-47633][SQL][3.5] Include right-side plan output in `LateralJoin#allAttributes` for more consistent canonicalization
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new ce19bfc10682 [SPARK-47633][SQL][3.5] Include right-side plan output in `LateralJoin#allAttributes` for more consistent canonicalization ce19bfc10682 is described below commit ce19bfc1068229897454c5f5cb78aeb435821bd2 Author: Bruce Robbins AuthorDate: Wed Apr 24 09:48:21 2024 -0700 [SPARK-47633][SQL][3.5] Include right-side plan output in `LateralJoin#allAttributes` for more consistent canonicalization This is a backport of #45763 to branch-3.5. ### What changes were proposed in this pull request? Modify `LateralJoin` to include right-side plan output in `allAttributes`. ### Why are the changes needed? In the following example, the view v1 is cached, but a query of v1 does not use the cache: ``` CREATE or REPLACE TEMP VIEW t1(c1, c2) AS VALUES (0, 1), (1, 2); CREATE or REPLACE TEMP VIEW t2(c1, c2) AS VALUES (0, 1), (1, 2); create or replace temp view v1 as select * from t1 join lateral ( select c1 as a, c2 as b from t2) on c1 = a; cache table v1; explain select * from v1; == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- BroadcastHashJoin [c1#180], [a#173], Inner, BuildRight, false :- LocalTableScan [c1#180, c2#181] +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [plan_id=113] +- LocalTableScan [a#173, b#174] ``` The canonicalized version of the `LateralJoin` node is not consistent when there is a join condition. For example, for the above query, the join condition is canonicalized as follows: ``` Before canonicalization: Some((c1#174 = a#167)) After canonicalization: Some((none#0 = none#167)) ``` You can see that the `exprId` for the second operand of `EqualTo` is not normalized (it remains 167). That's because the attribute `a` from the right-side plan is not included `allAttributes`. This PR adds right-side attributes to `allAttributes` so that references to right-side attributes in the join condition are normalized during canonicalization. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46190 from bersprockets/lj_canonical_issue_35. 
Authored-by: Bruce Robbins Signed-off-by: Dongjoon Hyun --- .../plans/logical/basicLogicalOperators.scala | 2 ++ .../scala/org/apache/spark/sql/CachedTableSuite.scala | 19 +++ 2 files changed, 21 insertions(+) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala index 58c03ee72d6d..ca2c6a850561 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala @@ -2017,6 +2017,8 @@ case class LateralJoin( joinType: JoinType, condition: Option[Expression]) extends UnaryNode { + override lazy val allAttributes: AttributeSeq = left.output ++ right.plan.output + require(Seq(Inner, LeftOuter, Cross).contains(joinType), s"Unsupported lateral join type $joinType") diff --git a/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala index 8331a3c10fc9..9815cb816c99 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala @@ -1710,4 +1710,23 @@ class CachedTableSuite extends QueryTest with SQLTestUtils } } } + + test("SPARK-47633: Cache hit for lateral join with join condition") { +withTempView("t", "q1") { + sql("create or replace temp view t(c1, c2) as values (0, 1), (1, 2)") + val query = """select * +|from t +|join lateral ( +| select c1 as a, c2 as b +| from t) +|on c1 = a; +|""".stripMargin + sql(s"cache table q1 as $query") + val df = sql(query) + checkAnswer(df, +Row(0, 1, 0, 1) :: Row(1, 2, 1, 2) :: Nil) + assert(getNumInMemoryRelations(df) == 1) +} + + } } ---
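Editor's note — the cache-hit scenario from the new test, rewritten as a hedged PySpark sketch (assumes an active SparkSession `spark`). With the fix, the final query should reuse the cached view, i.e. the plan should contain an InMemoryTableScan rather than a freshly planned join.

```python
spark.sql("CREATE OR REPLACE TEMP VIEW t(c1, c2) AS VALUES (0, 1), (1, 2)")
spark.sql("""
    CREATE OR REPLACE TEMP VIEW v1 AS
    SELECT * FROM t
    JOIN LATERAL (SELECT c1 AS a, c2 AS b FROM t) ON c1 = a
""")
spark.sql("CACHE TABLE v1")

# Expect InMemoryRelation / InMemoryTableScan in the physical plan after the fix.
spark.sql("SELECT * FROM v1").explain()
```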
(spark) branch master updated (09ed09cb18e7 -> 03d4ea6a707c)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 09ed09cb18e7 [SPARK-47958][TESTS] Change LocalSchedulerBackend to notify scheduler of executor on start add 03d4ea6a707c [SPARK-47974][BUILD] Remove `install_scala` from `build/mvn` No new revisions were added by this update. Summary of changes: .github/workflows/benchmark.yml| 6 ++ .github/workflows/build_and_test.yml | 24 .github/workflows/build_python_connect.yml | 3 +-- .github/workflows/maven_test.yml | 3 +-- build/mvn | 24 5 files changed, 12 insertions(+), 48 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-47969][PYTHON][TESTS] Make `test_creation_index` deterministic
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new cb1e1f5cd49a [SPARK-47969][PYTHON][TESTS] Make `test_creation_index` deterministic cb1e1f5cd49a is described below commit cb1e1f5cd49a612c0c081949759c1f931883c263 Author: Ruifeng Zheng AuthorDate: Tue Apr 23 23:09:10 2024 -0700 [SPARK-47969][PYTHON][TESTS] Make `test_creation_index` deterministic ### What changes were proposed in this pull request? Make `test_creation_index` deterministic ### Why are the changes needed? it may fail in some env ``` FAIL [16.261s]: test_creation_index (pyspark.pandas.tests.frame.test_constructor.FrameConstructorTests.test_creation_index) -- Traceback (most recent call last): File "/home/jenkins/python/pyspark/testing/pandasutils.py", line 91, in _assert_pandas_equal assert_frame_equal( File "/databricks/python3/lib/python3.11/site-packages/pandas/_testing/asserters.py", line 1257, in assert_frame_equal assert_index_equal( File "/databricks/python3/lib/python3.11/site-packages/pandas/_testing/asserters.py", line 407, in assert_index_equal raise_assert_detail(obj, msg, left, right) File "/databricks/python3/lib/python3.11/site-packages/pandas/_testing/asserters.py", line 665, in raise_assert_detail raise AssertionError(msg) AssertionError: DataFrame.index are different DataFrame.index values are different (40.0 %) [left]: Int64Index([2, 3, 4, 6, 5], dtype='int64') [right]: Int64Index([2, 3, 4, 5, 6], dtype='int64') ``` ### Does this PR introduce _any_ user-facing change? no. test only ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #46200 from zhengruifeng/fix_test_creation_index. Authored-by: Ruifeng Zheng Signed-off-by: Dongjoon Hyun --- python/pyspark/pandas/tests/frame/test_constructor.py | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/python/pyspark/pandas/tests/frame/test_constructor.py b/python/pyspark/pandas/tests/frame/test_constructor.py index ee010d8f023d..d7581895c6c9 100644 --- a/python/pyspark/pandas/tests/frame/test_constructor.py +++ b/python/pyspark/pandas/tests/frame/test_constructor.py @@ -195,14 +195,14 @@ class FrameConstructorMixin: with ps.option_context("compute.ops_on_diff_frames", True): # test with ps.DataFrame and pd.Index self.assert_eq( -ps.DataFrame(data=psdf, index=pd.Index([2, 3, 4, 5, 6])), -pd.DataFrame(data=pdf, index=pd.Index([2, 3, 4, 5, 6])), +ps.DataFrame(data=psdf, index=pd.Index([2, 3, 4, 5, 6])).sort_index(), +pd.DataFrame(data=pdf, index=pd.Index([2, 3, 4, 5, 6])).sort_index(), ) # test with ps.DataFrame and ps.Index self.assert_eq( -ps.DataFrame(data=psdf, index=ps.Index([2, 3, 4, 5, 6])), -pd.DataFrame(data=pdf, index=pd.Index([2, 3, 4, 5, 6])), +ps.DataFrame(data=psdf, index=ps.Index([2, 3, 4, 5, 6])).sort_index(), +pd.DataFrame(data=pdf, index=pd.Index([2, 3, 4, 5, 6])).sort_index(), ) # test String Index - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
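Editor's note — a hedged illustration of why `.sort_index()` makes these assertions deterministic (assumes a running Spark session; the toy data is hypothetical): pandas-on-Spark does not guarantee row order once data is distributed, so both sides are normalized before comparison.

```python
import pandas as pd
import pyspark.pandas as ps

pdf = pd.DataFrame({"a": [1, 2, 3]}, index=[2, 4, 3])
psdf = ps.from_pandas(pdf)

# Row order of the distributed frame is not stable, so sort both sides first.
expected = pdf.sort_index()
actual = psdf.sort_index().to_pandas()
pd.testing.assert_frame_equal(actual, expected)
```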
(spark) branch master updated: [SPARK-47956][SQL] Sanity check for unresolved LCA reference
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 66613ba042c4 [SPARK-47956][SQL] Sanity check for unresolved LCA reference 66613ba042c4 is described below commit 66613ba042c4b73b45b3c71e79ce05c225f527e7 Author: Wenchen Fan AuthorDate: Tue Apr 23 08:44:48 2024 -0700 [SPARK-47956][SQL] Sanity check for unresolved LCA reference ### What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/40558. The sanity check should apply to all plan nodes, not only Project/Aggregate/Window, as we don't know what bug can happen. Maybe the bug moves LCA references to other plan nodes. ### Why are the changes needed? better error message when bug happens ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? N/A ### Was this patch authored or co-authored using generative AI tooling? No Closes #46185 from cloud-fan/small. Authored-by: Wenchen Fan Signed-off-by: Dongjoon Hyun --- .../spark/sql/catalyst/analysis/CheckAnalysis.scala | 20 ++-- 1 file changed, 6 insertions(+), 14 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala index 10bff5e6e59a..d1b336b08955 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala @@ -110,9 +110,8 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB } /** Check and throw exception when a given resolved plan contains LateralColumnAliasReference. */ - private def checkNotContainingLCA(exprSeq: Seq[NamedExpression], plan: LogicalPlan): Unit = { -if (!plan.resolved) return - exprSeq.foreach(_.transformDownWithPruning(_.containsPattern(LATERAL_COLUMN_ALIAS_REFERENCE)) { + private def checkNotContainingLCA(exprs: Seq[Expression], plan: LogicalPlan): Unit = { + exprs.foreach(_.transformDownWithPruning(_.containsPattern(LATERAL_COLUMN_ALIAS_REFERENCE)) { case lcaRef: LateralColumnAliasReference => throw SparkException.internalError("Resolved plan should not contain any " + s"LateralColumnAliasReference.\nDebugging information: plan:\n$plan", @@ -789,17 +788,10 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB msg = s"Found the unresolved operator: ${o.simpleString(SQLConf.get.maxToStringFields)}", context = o.origin.getQueryContext, summary = o.origin.context.summary) - // If the plan is resolved, the resolved Project, Aggregate or Window should have restored or - // resolved all lateral column alias references. Add check for extra safe. - case p @ Project(pList, _) -if pList.exists(_.containsPattern(LATERAL_COLUMN_ALIAS_REFERENCE)) => -checkNotContainingLCA(pList, p) - case agg @ Aggregate(_, aggList, _) -if aggList.exists(_.containsPattern(LATERAL_COLUMN_ALIAS_REFERENCE)) => -checkNotContainingLCA(aggList, agg) - case w @ Window(pList, _, _, _) -if pList.exists(_.containsPattern(LATERAL_COLUMN_ALIAS_REFERENCE)) => -checkNotContainingLCA(pList, w) + // If the plan is resolved, all lateral column alias references should have been either + // restored or resolved. Add check for extra safe. 
+ case o if o.expressions.exists(_.containsPattern(LATERAL_COLUMN_ALIAS_REFERENCE)) => +checkNotContainingLCA(o.expressions, o) case _ => } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
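Editor's note — for context, a hedged example of the feature whose leftover references this check guards against: a lateral column alias (LCA) lets a SELECT item refer to an alias defined earlier in the same list. The analyzer rewrites such references away; the check above fails fast if any `LateralColumnAliasReference` survives in a resolved plan. Assumes an active SparkSession `spark`.

```python
spark.sql("""
    SELECT salary * 2 AS double_salary,
           double_salary + 100 AS adjusted_salary   -- lateral reference to the alias on its left
    FROM VALUES (1000), (2000) AS t(salary)
""").show()
```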
(spark) branch master updated: [SPARK-47948][PYTHON] Upgrade the minimum `Pandas` version to 2.0.0
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 2b01755f2791 [SPARK-47948][PYTHON] Upgrade the minimum `Pandas` version to 2.0.0 2b01755f2791 is described below commit 2b01755f27917b1d391835e6f8b1b2f9a34cc832 Author: Haejoon Lee AuthorDate: Tue Apr 23 07:49:15 2024 -0700 [SPARK-47948][PYTHON] Upgrade the minimum `Pandas` version to 2.0.0 ### What changes were proposed in this pull request? This PR proposes to bump Pandas version up to 2.0.0. ### Why are the changes needed? From Apache Spark 4.0.0, Pandas API on Spark supports Pandas 2.0.0 and above and some of features will be broken from Pandas 1.x, so installing Pandas 2.x is required. See the full list of breaking changes from [Upgrading from PySpark 3.5 to 4.0](https://github.com/apache/spark/blob/master/python/docs/source/migration_guide/pyspark_upgrade.rst#upgrading-from-pyspark-35-to-40). ### Does this PR introduce _any_ user-facing change? No API changes, but the minimum Pandas version from user-facing documentation will be changed. ### How was this patch tested? The existing CI should pass. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46175 from itholic/bump_pandas_2. Authored-by: Haejoon Lee Signed-off-by: Dongjoon Hyun --- dev/create-release/spark-rm/Dockerfile | 2 +- python/docs/source/getting_started/install.rst | 6 +++--- python/docs/source/migration_guide/pyspark_upgrade.rst | 3 +-- python/docs/source/user_guide/sql/arrow_pandas.rst | 2 +- python/packaging/classic/setup.py | 2 +- python/packaging/connect/setup.py | 2 +- python/pyspark/sql/pandas/utils.py | 2 +- 7 files changed, 9 insertions(+), 10 deletions(-) diff --git a/dev/create-release/spark-rm/Dockerfile b/dev/create-release/spark-rm/Dockerfile index f51b24d58394..8d5ca38ba88e 100644 --- a/dev/create-release/spark-rm/Dockerfile +++ b/dev/create-release/spark-rm/Dockerfile @@ -37,7 +37,7 @@ ENV DEBCONF_NONINTERACTIVE_SEEN true # These arguments are just for reuse and not really meant to be customized. ARG APT_INSTALL="apt-get install --no-install-recommends -y" -ARG PIP_PKGS="sphinx==4.5.0 mkdocs==1.1.2 numpy==1.20.3 pydata_sphinx_theme==0.13.3 ipython==7.19.0 nbsphinx==0.8.0 numpydoc==1.1.0 jinja2==3.1.2 twine==3.4.1 sphinx-plotly-directive==0.1.3 sphinx-copybutton==0.5.2 pandas==1.5.3 pyarrow==10.0.1 plotly==5.4.0 markupsafe==2.0.1 docutils<0.17 grpcio==1.62.0 protobuf==4.21.6 grpcio-status==1.62.0 googleapis-common-protos==1.56.4" +ARG PIP_PKGS="sphinx==4.5.0 mkdocs==1.1.2 numpy==1.20.3 pydata_sphinx_theme==0.13.3 ipython==7.19.0 nbsphinx==0.8.0 numpydoc==1.1.0 jinja2==3.1.2 twine==3.4.1 sphinx-plotly-directive==0.1.3 sphinx-copybutton==0.5.2 pandas==2.0.3 pyarrow==10.0.1 plotly==5.4.0 markupsafe==2.0.1 docutils<0.17 grpcio==1.62.0 protobuf==4.21.6 grpcio-status==1.62.0 googleapis-common-protos==1.56.4" ARG GEM_PKGS="bundler:2.3.8" # Install extra needed repos and refresh. diff --git a/python/docs/source/getting_started/install.rst b/python/docs/source/getting_started/install.rst index 08b6cc813cba..33a0560764df 100644 --- a/python/docs/source/getting_started/install.rst +++ b/python/docs/source/getting_started/install.rst @@ -205,7 +205,7 @@ Installable with ``pip install "pyspark[connect]"``. 
== = == PackageSupported version Note == = == -`pandas` >=1.4.4 Required for Spark Connect +`pandas` >=2.0.0 Required for Spark Connect `pyarrow` >=10.0.0 Required for Spark Connect `grpcio` >=1.62.0 Required for Spark Connect `grpcio-status`>=1.62.0 Required for Spark Connect @@ -220,7 +220,7 @@ Installable with ``pip install "pyspark[sql]"``. = = == Package Supported version Note = = == -`pandas` >=1.4.4 Required for Spark SQL +`pandas` >=2.0.0 Required for Spark SQL `pyarrow` >=10.0.0 Required for Spark SQL = = == @@ -233,7 +233,7 @@ Installable with ``pip install "pyspark[pandas_on_spark]"``. = = Package Supported version Note = = -`p
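Editor's note — a hedged sketch of checking the new pandas floor at runtime. `require_minimum_pandas_version` lives in `python/pyspark/sql/pandas/utils.py`, one of the files this diff touches; the manual version check below is only an illustrative fallback, not part of the change.

```python
from pyspark.sql.pandas.utils import require_minimum_pandas_version

# Raises ImportError if pandas is missing or below the supported minimum.
require_minimum_pandas_version()

import pandas as pd
major = int(pd.__version__.split(".")[0])
assert major >= 2, "Pandas API on Spark in 4.0.0 requires pandas >= 2.0.0"
```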
(spark) branch master updated (cf5fc0c720ee -> 9c4f12ca04ac)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from cf5fc0c720ee [MINOR][DOCS] Fix type hint of 3 functions add 9c4f12ca04ac [SPARK-47949][SQL][DOCKER][TESTS] MsSQLServer: Bump up mssql docker image version to 2022-CU12-GDR1-ubuntu-22.04 No new revisions were added by this update. Summary of changes: ...OnDocker.scala => MsSQLServerDatabaseOnDocker.scala} | 13 +++-- .../spark/sql/jdbc/MsSqlServerIntegrationSuite.scala| 14 +- .../spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala | 16 ++-- .../spark/sql/jdbc/v2/MsSqlServerNamespaceSuite.scala | 17 ++--- 4 files changed, 12 insertions(+), 48 deletions(-) copy connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/{MySQLDatabaseOnDocker.scala => MsSQLServerDatabaseOnDocker.scala} (72%) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [MINOR][DOCS] Fix type hint of 3 functions
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new cf5fc0c720ee [MINOR][DOCS] Fix type hint of 3 functions cf5fc0c720ee is described below commit cf5fc0c720eef01c5fe86a6ce05160adbdbf4678 Author: Ruifeng Zheng AuthorDate: Tue Apr 23 07:42:44 2024 -0700 [MINOR][DOCS] Fix type hint of 3 functions ### What changes were proposed in this pull request? Fix type hint of 3 functions I did a quick scan of the functions, don't find other similar places. ### Why are the changes needed? a string input will be treated as literal instead of column name ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #46179 from zhengruifeng/correct_con. Authored-by: Ruifeng Zheng Signed-off-by: Dongjoon Hyun --- python/pyspark/sql/connect/functions/builtin.py | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/python/pyspark/sql/connect/functions/builtin.py b/python/pyspark/sql/connect/functions/builtin.py index 519e53c3a13f..8fffb1831466 100644 --- a/python/pyspark/sql/connect/functions/builtin.py +++ b/python/pyspark/sql/connect/functions/builtin.py @@ -2141,7 +2141,7 @@ def sequence( sequence.__doc__ = pysparkfuncs.sequence.__doc__ -def schema_of_csv(csv: "ColumnOrName", options: Optional[Dict[str, str]] = None) -> Column: +def schema_of_csv(csv: Union[str, Column], options: Optional[Dict[str, str]] = None) -> Column: if isinstance(csv, Column): _csv = csv elif isinstance(csv, str): @@ -2161,7 +2161,7 @@ def schema_of_csv(csv: "ColumnOrName", options: Optional[Dict[str, str]] = None) schema_of_csv.__doc__ = pysparkfuncs.schema_of_csv.__doc__ -def schema_of_json(json: "ColumnOrName", options: Optional[Dict[str, str]] = None) -> Column: +def schema_of_json(json: Union[str, Column], options: Optional[Dict[str, str]] = None) -> Column: if isinstance(json, Column): _json = json elif isinstance(json, str): @@ -2181,7 +2181,7 @@ def schema_of_json(json: "ColumnOrName", options: Optional[Dict[str, str]] = Non schema_of_json.__doc__ = pysparkfuncs.schema_of_json.__doc__ -def schema_of_xml(xml: "ColumnOrName", options: Optional[Dict[str, str]] = None) -> Column: +def schema_of_xml(xml: Union[str, Column], options: Optional[Dict[str, str]] = None) -> Column: if isinstance(xml, Column): _xml = xml elif isinstance(xml, str): - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
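Editor's note — what the corrected hints mean in practice: `schema_of_csv`, `schema_of_json` and `schema_of_xml` take a literal example string (or a `Column` wrapping one via `lit()`), not the name of a DataFrame column. A small usage sketch, assuming an active SparkSession `spark`:

```python
from pyspark.sql import functions as F

df = spark.range(1)
# Plain str literal: inferred as STRUCT<_c0: INT, _c1: STRING>.
df.select(F.schema_of_csv("1,abc")).show(truncate=False)
# Column literal built with lit(): inferred as STRUCT<a: ARRAY<BIGINT>>.
df.select(F.schema_of_json(F.lit('{"a": [1, 2]}'))).show(truncate=False)
```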
(spark) branch master updated (ca916258b991 -> 33fa77cb4868)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from ca916258b991 [SPARK-47953][DOCS] MsSQLServer: Document Mapping Spark SQL Data Types to Microsoft SQL Server add 33fa77cb4868 [MINOR][DOCS] Add `docs/_generated/` to .gitignore No new revisions were added by this update. Summary of changes: .gitignore | 1 + 1 file changed, 1 insertion(+) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-47953][DOCS] MsSQLServer: Document Mapping Spark SQL Data Types to Microsoft SQL Server
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ca916258b991 [SPARK-47953][DOCS] MsSQLServer: Document Mapping Spark SQL Data Types to Microsoft SQL Server ca916258b991 is described below commit ca916258b9916452aa2f377608e6be8df65550e5 Author: Kent Yao AuthorDate: Tue Apr 23 07:41:04 2024 -0700 [SPARK-47953][DOCS] MsSQLServer: Document Mapping Spark SQL Data Types to Microsoft SQL Server ### What changes were proposed in this pull request? This PR adds Document Mapping Spark SQL Data Types to Microsoft SQL Server ### Why are the changes needed? doc improvement ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? doc build ![image](https://github.com/apache/spark/assets/8326978/7220d96a-c5ca-4780-9fc5-f93c99f91c10) ### Was this patch authored or co-authored using generative AI tooling? no Closes #46177 from yaooqinn/SPARK-47953. Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun --- docs/sql-data-sources-jdbc.md | 106 ++ 1 file changed, 106 insertions(+) diff --git a/docs/sql-data-sources-jdbc.md b/docs/sql-data-sources-jdbc.md index 51c0886430a3..734ed43f912a 100644 --- a/docs/sql-data-sources-jdbc.md +++ b/docs/sql-data-sources-jdbc.md @@ -1630,3 +1630,109 @@ as the activated JDBC Driver. + +### Mapping Spark SQL Data Types to Microsoft SQL Server + +The below table describes the data type conversions from Spark SQL Data Types to Microsoft SQL Server data types, +when creating, altering, or writing data to a Microsoft SQL Server table using the built-in jdbc data source with +the mssql-jdbc as the activated JDBC Driver. + + + + + Spark SQL Data Type + SQL Server Data Type + Remarks + + + + + BooleanType + bit + + + + ByteType + smallint + Supported since Spark 4.0.0, previous versions throw errors + + + ShortType + smallint + + + + IntegerType + int + + + + LongType + bigint + + + + FloatType + real + + + + DoubleType + double precision + + + + DecimalType(p, s) + number(p,s) + + + + DateType + date + + + + TimestampType + datetime + + + + TimestampNTZType + datetime + + + + StringType + nvarchar(max) + + + + BinaryType + varbinary(max) + + + + CharType(n) + char(n) + + + + VarcharType(n) + varchar(n) + + + + + +The Spark Catalyst data types below are not supported with suitable SQL Server types. + +- DayTimeIntervalType +- YearMonthIntervalType +- CalendarIntervalType +- ArrayType +- MapType +- StructType +- UserDefinedType +- NullType +- ObjectType +- VariantType - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
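Editor's note — a hedged sketch of exercising the documented mapping through the built-in jdbc data source with the mssql-jdbc driver. The URL, table name and credentials are placeholders, not values from the docs above.

```python
from pyspark.sql import functions as F

df = spark.range(3).select(
    F.col("id").cast("int").alias("c_int"),   # maps to int
    F.lit("text").alias("c_string"),          # maps to nvarchar(max)
    F.lit(True).alias("c_bool"),              # maps to bit
)

(df.write.format("jdbc")
    .option("url", "jdbc:sqlserver://host:1433;databaseName=testdb")
    .option("dbtable", "dbo.spark_types_demo")
    .option("user", "user")
    .option("password", "secret")
    .mode("overwrite")
    .save())
```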
(spark-kubernetes-operator) branch main updated: [SPARK-47943] Add `GitHub Action` CI for Java Build and Test
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/spark-kubernetes-operator.git The following commit(s) were added to refs/heads/main by this push: new 4a5febd [SPARK-47943] Add `GitHub Action` CI for Java Build and Test 4a5febd is described below commit 4a5febd8f48716c0506738fc6a5fd58afb95779f Author: zhou-jiang AuthorDate: Mon Apr 22 22:44:17 2024 -0700 [SPARK-47943] Add `GitHub Action` CI for Java Build and Test ### What changes were proposed in this pull request? This PR adds an additional CI build task for operator. ### Why are the changes needed? The additional CI task is needed in order to build and test Java code for upcoming operator pull requests. When Java plugin is enabled and Java source is checked in, `./gradlew build` [task](https://docs.gradle.org/3.3/userguide/java_plugin.html#sec:java_tasks) by default includes a set of tasks to compile and run tests. This can serve as pull request build. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? tested locally. ### Was this patch authored or co-authored using generative AI tooling? no Closes #7 from jiangzho/ci. Authored-by: zhou-jiang Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml | 18 +- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 6a5a147..887119f 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -26,4 +26,20 @@ jobs: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} with: config: .github/.licenserc.yaml - + build-test: +name: "Build Test CI" +runs-on: ubuntu-latest +strategy: + matrix: +java-version: [ 17, 21 ] +steps: + - name: Checkout repository +uses: actions/checkout@v3 + - name: Set up JDK ${{ matrix.java-version }} +uses: actions/setup-java@v2 +with: + java-version: ${{ matrix.java-version }} + distribution: 'adopt' + - name: Build with Gradle +run: | + set -o pipefail; ./gradlew build; set +o pipefail - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark-kubernetes-operator) branch main updated: [SPARK-47929] Setup Static Analysis for Operator
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/spark-kubernetes-operator.git The following commit(s) were added to refs/heads/main by this push: new 798ca15 [SPARK-47929] Setup Static Analysis for Operator 798ca15 is described below commit 798ca15844c71baf5d7f1f8842e461a73c1009a9 Author: zhou-jiang AuthorDate: Mon Apr 22 22:42:23 2024 -0700 [SPARK-47929] Setup Static Analysis for Operator ### What changes were proposed in this pull request? This is a breakdown PR from #2 - setting up common build Java tasks and corresponding plugins. ### Why are the changes needed? This PR includes checkstyle, pmd, spotbugs. Also includes jacoco for coverage analysis, spotless for formatting. These tasks can help to enhance the quality of future Java contributions. They can also be referred in CI tasks for automation. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tested manually. ### Was this patch authored or co-authored using generative AI tooling? no Closes #6 from jiangzho/builder_task. Authored-by: zhou-jiang Signed-off-by: Dongjoon Hyun --- build.gradle | 76 - config/checkstyle/checkstyle.xml | 208 +++ config/pmd/ruleset.xml | 33 ++ config/spotbugs/spotbugs_exclude.xml | 25 + gradle.properties| 22 5 files changed, 362 insertions(+), 2 deletions(-) diff --git a/build.gradle b/build.gradle index 6732f5a..f64212b 100644 --- a/build.gradle +++ b/build.gradle @@ -1,3 +1,18 @@ +buildscript { + repositories { +maven { + url = uri("https://plugins.gradle.org/m2/";) +} + } + dependencies { +classpath "com.github.spotbugs.snom:spotbugs-gradle-plugin:${spotBugsGradlePluginVersion}" +classpath "com.diffplug.spotless:spotless-plugin-gradle:${spotlessPluginVersion}" + } +} + +assert JavaVersion.current().isCompatibleWith(JavaVersion.VERSION_17): "Java 17 or newer is " + +"required" + subprojects { apply plugin: 'idea' apply plugin: 'eclipse' @@ -6,7 +21,64 @@ subprojects { targetCompatibility = 17 repositories { - mavenCentral() - jcenter() +mavenCentral() +jcenter() + } + + apply plugin: 'checkstyle' + checkstyle { +toolVersion = checkstyleVersion +configFile = file("$rootDir/config/checkstyle/checkstyle.xml") +ignoreFailures = false +showViolations = true + } + + apply plugin: 'pmd' + pmd { +ruleSets = ["java-basic", "java-braces"] +ruleSetFiles = files("$rootDir/config/pmd/ruleset.xml") +toolVersion = pmdVersion +consoleOutput = true +ignoreFailures = false + } + + apply plugin: 'com.github.spotbugs' + spotbugs { +toolVersion = spotBugsVersion +afterEvaluate { + reportsDir = file("${project.reporting.baseDir}/findbugs") +} +excludeFilter = file("$rootDir/config/spotbugs/spotbugs_exclude.xml") +ignoreFailures = false + } + + apply plugin: 'jacoco' + jacoco { +toolVersion = jacocoVersion + } + jacocoTestReport { +dependsOn test + } + + apply plugin: 'com.diffplug.spotless' + spotless { +java { + endWithNewline() + googleJavaFormat('1.17.0') + importOrder( +'java', +'javax', +'scala', +'', +'org.apache.spark', + ) + trimTrailingWhitespace() + removeUnusedImports() +} +format 'misc', { + target '*.md', '*.gradle', '**/*.properties', '**/*.xml', '**/*.yaml', '**/*.yml' + endWithNewline() + trimTrailingWhitespace() +} } } diff --git a/config/checkstyle/checkstyle.xml b/config/checkstyle/checkstyle.xml new file mode 100644 index 000..90161fe --- /dev/null +++ b/config/checkstyle/checkstyle.xml @@ -0,0 +1,208 @@ + + +https://checkstyle.org/dtds/configuration_1_3.dtd";> 
+ + + + + + + + + + + + + + + + + + + + + + + + +ftp://"/> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

(spark) branch master updated (9d715ba49171 -> 876c2cf34a35)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 9d715ba49171 [SPARK-47938][SQL] MsSQLServer: Cannot find data type BYTE error add 876c2cf34a35 [SPARK-44170][BUILD][FOLLOWUP] Align JUnit5 dependency's version and clean up exclusions No new revisions were added by this update. Summary of changes: pom.xml | 69 +++-- 1 file changed, 41 insertions(+), 28 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-47938][SQL] MsSQLServer: Cannot find data type BYTE error
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 9d715ba49171 [SPARK-47938][SQL] MsSQLServer: Cannot find data type BYTE error 9d715ba49171 is described below commit 9d715ba491710969340d9e8a49a21d11f51ef7d3 Author: Kent Yao AuthorDate: Mon Apr 22 22:31:13 2024 -0700 [SPARK-47938][SQL] MsSQLServer: Cannot find data type BYTE error ### What changes were proposed in this pull request? This PR uses SMALLINT (as TINYINT ranges [0, 255]) instead of BYTE to fix the ByteType mapping for MsSQLServer JDBC ```java [info] com.microsoft.sqlserver.jdbc.SQLServerException: Column, parameter, or variable #1: Cannot find data type BYTE. [info] at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:265) [info] at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1662) [info] at com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:898) [info] at com.microsoft.sqlserver.jdbc.SQLServerStatement$StmtExecCmd.doExecute(SQLServerStatement.java:793) [info] at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7417) [info] at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:3488) [info] at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:262) [info] at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:237) [info] at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeUpdate(SQLServerStatement.java:733) [info] at org.apache.spark.sql.jdbc.JdbcDialect.createTable(JdbcDialects.scala:267) ``` ### Why are the changes needed? bugfix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #46164 from yaooqinn/SPARK-47938. 
Lead-authored-by: Kent Yao Co-authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../org/apache/spark/sql/jdbc/MsSqlServerIntegrationSuite.scala | 8 .../main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala | 1 + 2 files changed, 9 insertions(+) diff --git a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSqlServerIntegrationSuite.scala b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSqlServerIntegrationSuite.scala index 8bceb9506e85..273e8c35dd07 100644 --- a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSqlServerIntegrationSuite.scala +++ b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSqlServerIntegrationSuite.scala @@ -437,4 +437,12 @@ class MsSqlServerIntegrationSuite extends DockerJDBCIntegrationSuite { .load() assert(df.collect().toSet === expectedResult) } + + test("SPARK-47938: Fix 'Cannot find data type BYTE' in SQL Server") { +spark.sql("select cast(1 as byte) as c0") + .write + .jdbc(jdbcUrl, "test_byte", new Properties) +val df = spark.read.jdbc(jdbcUrl, "test_byte", new Properties) +checkAnswer(df, Row(1.toShort)) + } } diff --git a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala index 862e99adc3b0..1d05c0d7c24e 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala @@ -136,6 +136,7 @@ private case class MsSqlServerDialect() extends JdbcDialect { case BinaryType => Some(JdbcType("VARBINARY(MAX)", java.sql.Types.VARBINARY)) case ShortType if !SQLConf.get.legacyMsSqlServerNumericMappingEnabled => Some(JdbcType("SMALLINT", java.sql.Types.SMALLINT)) +case ByteType => Some(JdbcType("SMALLINT", java.sql.Types.TINYINT)) case _ => None } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
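Editor's note — a hedged Python analogue of the Scala regression test added above: before the fix, writing a ByteType column generated DDL with the nonexistent `BYTE` type on SQL Server; with the fix it becomes `SMALLINT`. `jdbc_url` is a placeholder for a real SQL Server connection string.

```python
df = spark.sql("SELECT CAST(1 AS BYTE) AS c0")
df.write.jdbc(jdbc_url, "test_byte", mode="overwrite")

# Reads back as a short (smallint) value of 1.
spark.read.jdbc(jdbc_url, "test_byte").show()
```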
(spark) branch master updated (e4fb7dd98219 -> a97e72cfa7d4)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from e4fb7dd98219 [MINOR] Remove unnecessary `imports` add a97e72cfa7d4 [SPARK-47937][PYTHON][DOCS] Fix docstring of `hll_sketch_agg` No new revisions were added by this update. Summary of changes: python/pyspark/sql/connect/functions/builtin.py | 8 +--- python/pyspark/sql/functions/builtin.py | 12 +++- 2 files changed, 12 insertions(+), 8 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (b335dd366fb1 -> e4fb7dd98219)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from b335dd366fb1 [SPARK-47909][CONNECT][PYTHON][TESTS][FOLLOW-UP] Move `pyspark.classic` references add e4fb7dd98219 [MINOR] Remove unnecessary `imports` No new revisions were added by this update. Summary of changes: core/src/main/scala/org/apache/spark/util/Distribution.scala| 2 -- .../scala/org/apache/spark/input/WholeTextFileInputFormatSuite.scala| 2 -- .../scala/org/apache/spark/input/WholeTextFileRecordReaderSuite.scala | 2 -- sql/api/src/main/scala/org/apache/spark/sql/types/UpCastRule.scala | 2 -- .../src/main/scala/org/apache/spark/sql/execution/CacheManager.scala| 2 -- .../scala/org/apache/spark/sql/CollationRegexpExpressionsSuite.scala| 2 -- .../scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala| 2 -- sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala | 1 - .../test/scala/org/apache/spark/sql/hive/client/HiveClientSuites.scala | 2 -- .../org/apache/spark/sql/hive/client/HiveClientUserNameSuites.scala | 2 -- .../scala/org/apache/spark/sql/hive/client/HiveClientVersions.scala | 2 -- .../org/apache/spark/sql/hive/client/HivePartitionFilteringSuites.scala | 2 -- 12 files changed, 23 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.5 updated: [SPARK-47904][SQL][3.5] Preserve case in Avro schema when using enableStableIdentifiersForUnionType
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new d7c3794a0c56 [SPARK-47904][SQL][3.5] Preserve case in Avro schema when using enableStableIdentifiersForUnionType d7c3794a0c56 is described below commit d7c3794a0c567b12e8c8e18132aa362f11acdf5f Author: Ivan Sadikov AuthorDate: Mon Apr 22 15:36:13 2024 -0700 [SPARK-47904][SQL][3.5] Preserve case in Avro schema when using enableStableIdentifiersForUnionType ### What changes were proposed in this pull request? Backport of https://github.com/apache/spark/pull/46126 to branch-3.5. When `enableStableIdentifiersForUnionType` is enabled, all of the types are lowercased which creates a problem when field types are case-sensitive: Union type with fields: ``` Schema.createEnum("myENUM", "", null, List[String]("E1", "e2").asJava), Schema.createRecord("myRecord2", "", null, false, List[Schema.Field](new Schema.Field("F", Schema.create(Type.FLOAT))).asJava) ``` would become ``` struct> ``` but instead should be ``` struct> ``` ### Why are the changes needed? Fixes a bug of lowercasing the field name (the type portion). ### Does this PR introduce _any_ user-facing change? Yes, if a user enables `enableStableIdentifiersForUnionType` and has Union types, all fields will preserve the case. Previously, the field names would be all in lowercase. ### How was this patch tested? I added a test case to verify the new field names. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46169 from sadikovi/SPARK-47904-3.5. Authored-by: Ivan Sadikov Signed-off-by: Dongjoon Hyun --- .../apache/spark/sql/avro/SchemaConverters.scala | 10 +++ .../org/apache/spark/sql/avro/AvroSuite.scala | 31 -- 2 files changed, 34 insertions(+), 7 deletions(-) diff --git a/connector/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala b/connector/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala index 06abe977e3b0..af358a8d1c96 100644 --- a/connector/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala +++ b/connector/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala @@ -183,14 +183,14 @@ object SchemaConverters { // Avro's field name may be case sensitive, so field names for two named type // could be "a" and "A" and we need to distinguish them. In this case, we throw // an exception. - val temp_name = s"member_${s.getName.toLowerCase(Locale.ROOT)}" - if (fieldNameSet.contains(temp_name)) { + // Stable id prefix can be empty so the name of the field can be just the type. 
+ val tempFieldName = s"member_${s.getName}" + if (!fieldNameSet.add(tempFieldName.toLowerCase(Locale.ROOT))) { throw new IncompatibleSchemaException( - "Cannot generate stable indentifier for Avro union type due to name " + + "Cannot generate stable identifier for Avro union type due to name " + s"conflict of type name ${s.getName}") } - fieldNameSet.add(temp_name) - temp_name + tempFieldName } else { s"member$i" } diff --git a/connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala b/connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala index 1df99210a55a..01c9dfb57a19 100644 --- a/connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala +++ b/connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala @@ -370,7 +370,7 @@ abstract class AvroSuite "", Seq()) } - assert(e.getMessage.contains("Cannot generate stable indentifier")) + assert(e.getMessage.contains("Cannot generate stable identifier")) } { val e = intercept[Exception] { @@ -381,7 +381,7 @@ abstract class AvroSuite "", Seq()) } - assert(e.getMessage.contains("Cannot generate stable indentifier")) + assert(e.getMessage.contains("Cannot generate stable identifier")) } // Two array types or two map types are not allowed in union. { @@ -434,6 +434,33 @@ abstract class AvroSuite } } +
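Editor's note — a hedged usage sketch of the reader option involved here (the input path is a placeholder): with stable identifiers enabled, union branches are exposed as `member_<TypeName>` fields, and after this fix the original case of the Avro type name is preserved (e.g. `member_myENUM` rather than `member_myenum`).

```python
df = (spark.read.format("avro")
      .option("enableStableIdentifiersForUnionType", "true")
      .load("/tmp/union_data.avro"))
df.printSchema()
```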
(spark) branch master updated: [SPARK-47942][K8S][DOCS] Drop K8s v1.26 Support
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ac9a12ef6e06 [SPARK-47942][K8S][DOCS] Drop K8s v1.26 Support ac9a12ef6e06 is described below commit ac9a12ef6e062ae07e878e202521b22de9979a17 Author: Dongjoon Hyun AuthorDate: Mon Apr 22 14:46:03 2024 -0700 [SPARK-47942][K8S][DOCS] Drop K8s v1.26 Support ### What changes were proposed in this pull request? This PR aims to update K8s docs to recommend K8s v1.27+ for Apache Spark 4.0.0. This is a kind of follow-up of the following previous PR because Apache Spark 4.0.0 schedule is delayed slightly. - #43069 ### Why are the changes needed? **1. K8s community starts to release v1.30.0 from 2024-04-17.** - https://kubernetes.io/releases/#release-v1-30 **2. Default K8s Version in Public Cloud environments** The default K8s versions of public cloud providers are already K8s 1.27+. - EKS: v1.29 (Default) - GKE: v1.29 (Rapid), v1.28 (Regular), v1.27 (Stable) - AKS: v1.27 **3. End Of Support** In addition, K8s 1.26 is going to reach EOL when Apache Spark 4.0.0 arrives because K8s 1.26 is also going to reach EOL on June. | K8s | AKS | GKE | EKS | | | --- | --- | --- | | 1.26 | 2024-03 | 2024-06 | 2024-06 | - [AKS EOL Schedule](https://docs.microsoft.com/en-us/azure/aks/supported-kubernetes-versions?tabs=azure-cli#aks-kubernetes-release-calendar) - [GKE EOL Schedule](https://cloud.google.com/kubernetes-engine/docs/release-schedule) - [EKS EOL Schedule](https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html#kubernetes-release-calendar) ### Does this PR introduce _any_ user-facing change? - No, this is a documentation-only change about K8s versions. - Apache Spark K8s Integration Test is currently using K8s v1.30.0 on Minikube already. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46168 from dongjoon-hyun/SPARK-47942. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- docs/running-on-kubernetes.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/running-on-kubernetes.md b/docs/running-on-kubernetes.md index 778af5f0751a..606b5eb6f900 100644 --- a/docs/running-on-kubernetes.md +++ b/docs/running-on-kubernetes.md @@ -44,7 +44,7 @@ Cluster administrators should use [Pod Security Policies](https://kubernetes.io/ # Prerequisites -* A running Kubernetes cluster at version >= 1.26 with access configured to it using +* A running Kubernetes cluster at version >= 1.27 with access configured to it using [kubectl](https://kubernetes.io/docs/reference/kubectl/). If you do not already have a working Kubernetes cluster, you may set up a test cluster on your local machine using [minikube](https://kubernetes.io/docs/getting-started-guides/minikube/). - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (f2d0cf23018f -> fc0c8553ea05)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from f2d0cf23018f [SPARK-47907][SQL] Put bang under a config add fc0c8553ea05 [SPARK-47904][SQL] Preserve case in Avro schema when using enableStableIdentifiersForUnionType No new revisions were added by this update. Summary of changes: .../apache/spark/sql/avro/SchemaConverters.scala | 8 +++--- .../org/apache/spark/sql/avro/AvroSuite.scala | 31 -- 2 files changed, 32 insertions(+), 7 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-47940][BUILD][TESTS] Upgrade `guava` dependency to `33.1.0-jre` in Docker IT
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 86563169eef8 [SPARK-47940][BUILD][TESTS] Upgrade `guava` dependency to `33.1.0-jre` in Docker IT 86563169eef8 is described below commit 86563169eef899040e1ec70dd9963c64311dbaa1 Author: Cheng Pan AuthorDate: Mon Apr 22 13:34:20 2024 -0700 [SPARK-47940][BUILD][TESTS] Upgrade `guava` dependency to `33.1.0-jre` in Docker IT ### What changes were proposed in this pull request? This PR aims to upgrade `guava` dependency to `33.1.0-jre` in Docker Integration tests. ### Why are the changes needed? This is a preparation of the following PR. - #45372 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46167 from dongjoon-hyun/SPARK-47940. Authored-by: Cheng Pan Signed-off-by: Dongjoon Hyun --- connector/docker-integration-tests/pom.xml | 2 +- project/SparkBuild.scala | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/connector/docker-integration-tests/pom.xml b/connector/docker-integration-tests/pom.xml index bb7647c72491..9003c2190be2 100644 --- a/connector/docker-integration-tests/pom.xml +++ b/connector/docker-integration-tests/pom.xml @@ -39,7 +39,7 @@ com.google.guava guava - 33.0.0-jre + 33.1.0-jre test diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala index bcaa51ec30ff..1bcc9c893393 100644 --- a/project/SparkBuild.scala +++ b/project/SparkBuild.scala @@ -952,7 +952,7 @@ object Unsafe { object DockerIntegrationTests { // This serves to override the override specified in DependencyOverrides: lazy val settings = Seq( -dependencyOverrides += "com.google.guava" % "guava" % "33.0.0-jre" +dependencyOverrides += "com.google.guava" % "guava" % "33.1.0-jre" ) } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (256fc51508e4 -> 676d47ffe091)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 256fc51508e4 [SPARK-47411][SQL] Support StringInstr & FindInSet functions to work with collated strings add 676d47ffe091 [SPARK-47935][INFRA][PYTHON] Pin `pandas==2.0.3` for `pypy3.8` No new revisions were added by this update. Summary of changes: dev/infra/Dockerfile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org