(spark) branch master updated (65db87697949 -> 7cba1ab4d6ac)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 65db87697949 [SPARK-48513][SS] Add error class for state schema compatibility and minor refactoring add 7cba1ab4d6ac [SPARK-48554][INFRA] Use R 4.4.0 in `windows` R GitHub Action Window job No new revisions were added by this update. Summary of changes: .github/workflows/build_sparkr_window.yml | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark-kubernetes-operator) branch main updated: [SPARK-48528] Refine K8s Operator `merge_spark_pr.py` to use `kubernetes-operator-x.y.z` version
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/spark-kubernetes-operator.git The following commit(s) were added to refs/heads/main by this push: new d7734bb [SPARK-48528] Refine K8s Operator `merge_spark_pr.py` to use `kubernetes-operator-x.y.z` version d7734bb is described below commit d7734bbc4413163cf60fe67e23c541929a9a37a8 Author: Dongjoon Hyun AuthorDate: Tue Jun 4 12:04:21 2024 -0700 [SPARK-48528] Refine K8s Operator `merge_spark_pr.py` to use `kubernetes-operator-x.y.z` version ### What changes were proposed in this pull request? This PR aims to refine `merge_spark_pr.py` to use `kubernetes-operator-x.y.z` versions. ### Why are the changes needed? Previously, it uses Apache Spark's versions like 4.0.0. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I manually tested like the following by printing the versions. ``` Enter number of user, or userid, to assign to (blank to leave unassigned):0 [] ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #14 from dongjoon-hyun/SPARK-48528. Lead-authored-by: Dongjoon Hyun Co-authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- dev/merge_spark_pr.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/dev/merge_spark_pr.py b/dev/merge_spark_pr.py index 24e956d..9a8d39f 100755 --- a/dev/merge_spark_pr.py +++ b/dev/merge_spark_pr.py @@ -305,7 +305,9 @@ def resolve_jira_issue(merge_branches, comment, default_jira_id=""): versions = [ x for x in versions -if not x.raw["released"] and not x.raw["archived"] and re.match(r"\d+\.\d+\.\d+", x.name) +if not x.raw["released"] +and not x.raw["archived"] +and re.match(r"kubernetes-operator-\d+\.\d+\.\d+", x.name) ] versions = sorted(versions, key=lambda x: x.name, reverse=True) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
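The effect of the new filter can be sketched with the regex alone. The snippet below mirrors the Python check from `merge_spark_pr.py` in Scala for illustration; the candidate version names are hypothetical, not real JIRA versions.

```
// Sketch only: mirrors the new merge_spark_pr.py name filter in Scala.
// The candidate version names below are hypothetical, not JIRA data.
object OperatorVersionFilterSketch {
  def main(args: Array[String]): Unit = {
    val pattern = """kubernetes-operator-\d+\.\d+\.\d+""".r
    val candidates = Seq("4.0.0", "0.1.0", "kubernetes-operator-0.1.0", "kubernetes-operator-0.2.0")

    // Python's re.match anchors at the start of the string; findPrefixOf does the same.
    val kept = candidates.filter(name => pattern.findPrefixOf(name).isDefined)

    println(kept) // List(kubernetes-operator-0.1.0, kubernetes-operator-0.2.0)
  }
}
```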
(spark) branch master updated: [SPARK-48531][INFRA] Fix `Black` target version to Python 3.9
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 651f68782ab7 [SPARK-48531][INFRA] Fix `Black` target version to Python 3.9 651f68782ab7 is described below commit 651f68782ab705f277b2548382900cdf986e017e Author: Dongjoon Hyun AuthorDate: Tue Jun 4 10:28:50 2024 -0700 [SPARK-48531][INFRA] Fix `Black` target version to Python 3.9 ### What changes were proposed in this pull request? This PR aims to fix `Black` target version to `Python 3.9`. ### Why are the changes needed? Since SPARK-47993 dropped Python 3.8 support officially at Apache Spark 4.0.0, we had better update target version to `Python 3.9`. - #46228 `py39` is the version for `Python 3.9`. ``` $ black --help | grep target -t, --target-version [py33|py34|py35|py36|py37|py38|py39|py310|py311|py312] ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs with Python linter. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46867 from dongjoon-hyun/SPARK-48531. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- dev/pyproject.toml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dev/pyproject.toml b/dev/pyproject.toml index 4f462d14c783..f19107b3782a 100644 --- a/dev/pyproject.toml +++ b/dev/pyproject.toml @@ -29,6 +29,6 @@ testpaths = [ # GitHub workflow version and dev/reformat-python required-version = "23.9.1" line-length = 100 -target-version = ['py38'] +target-version = ['py39'] include = '\.pyi?$' extend-exclude = 'cloudpickle|error_classes.py' - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark-kubernetes-operator) branch main updated: [SPARK-48326] Use the official Apache Spark 4.0.0-preview1
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/spark-kubernetes-operator.git The following commit(s) were added to refs/heads/main by this push: new cd23de3 [SPARK-48326] Use the official Apache Spark 4.0.0-preview1 cd23de3 is described below commit cd23de3ff5ee4dbc13d55c8552d86acc94cd8411 Author: Dongjoon Hyun AuthorDate: Tue Jun 4 09:24:09 2024 -0700 [SPARK-48326] Use the official Apache Spark 4.0.0-preview1 ### What changes were proposed in this pull request? This PR aims to use the official Apache Spark `4.0.0-preview1` artifacts. ### Why are the changes needed? The current used artifact is not the latest RC3 and will be removed. - https://repository.apache.org/content/repositories/orgapachespark-1454/ For the record, the latest RC was the following. And, it becomes the official artifact. - https://repository.apache.org/content/repositories/orgapachespark-1456/ (RC3) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #13 from dongjoon-hyun/SPARK-48326. Lead-authored-by: Dongjoon Hyun Co-authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- build.gradle | 4 gradle.properties | 1 - 2 files changed, 5 deletions(-) diff --git a/build.gradle b/build.gradle index c0c75d0..a6c1701 100644 --- a/build.gradle +++ b/build.gradle @@ -25,10 +25,6 @@ subprojects { repositories { mavenCentral() -// TODO(SPARK-48326) Upgrade submission worker base Spark version to 4.0.0-preview2 -maven { - url "https://repository.apache.org/content/repositories/orgapachespark-1454/; -} } apply plugin: 'checkstyle' diff --git a/gradle.properties b/gradle.properties index ffa8302..31b75dc 100644 --- a/gradle.properties +++ b/gradle.properties @@ -26,7 +26,6 @@ lombokVersion=1.18.32 # Spark scalaVersion=2.13 -# TODO(SPARK-48326) Upgrade submission worker base Spark version to 4.0.0-preview2 sparkVersion=4.0.0-preview1 # Logging - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (1a536f01ead3 -> 6cd1ccc56321)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 1a536f01ead3 [SPARK-48407][SQL][DOCS] Teradata: Document Type Conversion rules between Spark SQL and teradata add 6cd1ccc56321 [SPARK-48394][CORE] Cleanup mapIdToMapIndex on mapoutput unregister No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/MapOutputTracker.scala | 26 ++ .../org/apache/spark/MapOutputTrackerSuite.scala | 55 ++ 2 files changed, 72 insertions(+), 9 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48407][SQL][DOCS] Teradata: Document Type Conversion rules between Spark SQL and teradata
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 1a536f01ead3 [SPARK-48407][SQL][DOCS] Teradata: Document Type Conversion rules between Spark SQL and teradata 1a536f01ead3 is described below commit 1a536f01ead35b770467381c476e093338d81e7c Author: Kent Yao AuthorDate: Fri May 24 15:56:19 2024 -0700 [SPARK-48407][SQL][DOCS] Teradata: Document Type Conversion rules between Spark SQL and teradata ### What changes were proposed in this pull request? This PR adds documentation for the builtin teradata jdbc dialect's data type conversion rules ### Why are the changes needed? doc improvement ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ![image](https://github.com/apache/spark/assets/8326978/e1ec0de5-cd83-4339-896a-50c58ad01c4d) ### Was this patch authored or co-authored using generative AI tooling? no Closes #46728 from yaooqinn/SPARK-48407. Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun --- docs/sql-data-sources-jdbc.md | 214 ++ 1 file changed, 214 insertions(+) diff --git a/docs/sql-data-sources-jdbc.md b/docs/sql-data-sources-jdbc.md index 371dc0595071..9ffd96cd40ee 100644 --- a/docs/sql-data-sources-jdbc.md +++ b/docs/sql-data-sources-jdbc.md @@ -1991,3 +1991,217 @@ The Spark Catalyst data types below are not supported with suitable DB2 types. - NullType - ObjectType - VariantType + +### Mapping Spark SQL Data Types from Teradata + +The below table describes the data type conversions from Teradata data types to Spark SQL Data Types, +when reading data from a Teradata table using the built-in jdbc data source with the [Teradata JDBC Driver](https://mvnrepository.com/artifact/com.teradata.jdbc/terajdbc) +as the activated JDBC Driver. + + + + + Teradata Data Type + Spark SQL Data Type + Remarks + + + + + BYTEINT + ByteType + + + + SMALLINT + ShortType + + + + INTEGER, INT + IntegerType + + + + BIGINT + LongType + + + + REAL, DOUBLE PRECISION, FLOAT + DoubleType + + + + DECIMAL, NUMERIC, NUMBER + DecimalType + + + + DATE + DateType + + + + TIMESTAMP, TIMESTAMP WITH TIME ZONE + TimestampType + (Default)preferTimestampNTZ=false or spark.sql.timestampType=TIMESTAMP_LTZ + + + TIMESTAMP, TIMESTAMP WITH TIME ZONE + TimestampNTZType + preferTimestampNTZ=true or spark.sql.timestampType=TIMESTAMP_NTZ + + + TIME, TIME WITH TIME ZONE + TimestampType + (Default)preferTimestampNTZ=false or spark.sql.timestampType=TIMESTAMP_LTZ + + + TIME, TIME WITH TIME ZONE + TimestampNTZType + preferTimestampNTZ=true or spark.sql.timestampType=TIMESTAMP_NTZ + + + CHARACTER(n), CHAR(n), GRAPHIC(n) + CharType(n) + + + + VARCHAR(n), VARGRAPHIC(n) + VarcharType(n) + + + + BYTE(n), VARBYTE(n) + BinaryType + + + + CLOB + StringType + + + + BLOB + BinaryType + + + + INTERVAL Data Types + - + The INTERVAL data types are unknown yet + + + Period Data Types, ARRAY, UDT + - + Not Supported + + + + +### Mapping Spark SQL Data Types to Teradata + +The below table describes the data type conversions from Spark SQL Data Types to Teradata data types, +when creating, altering, or writing data to a Teradata table using the built-in jdbc data source with +the [Teradata JDBC Driver](https://mvnrepository.com/artifact/com.teradata.jdbc/terajdbc) as the activated JDBC Driver. 
+ + + + + Spark SQL Data Type + Teradata Data Type + Remarks + + + + + BooleanType + CHAR(1) + + + + ByteType + BYTEINT + + + + ShortType + SMALLINT + + + + IntegerType + INTEGER + + + + LongType + BIGINT + + + + FloatType + REAL + + + + DoubleType + DOUBLE PRECISION + + + + DecimalType(p, s) + DECIMAL(p,s) + + + + DateType + DATE + + + + TimestampType + TIMESTAMP + + + + TimestampNTZType + TIMESTAMP + + + + StringType + VARCHAR(255) + + + + BinaryType + BLOB + + + + CharType(n) + CHAR(n) + + + + VarcharType(n) + VARCHAR(n
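Although the second mapping table is cut off above, the read-side behaviour the new documentation describes can be sketched with a small Spark snippet. The JDBC URL, credentials, and table name below are hypothetical placeholders; only the `preferTimestampNTZ` option comes from the documented mapping rules.

```
// Sketch only: reading a Teradata table through the built-in JDBC source,
// where the documented type mappings apply. URL, credentials, and table
// name are hypothetical placeholders.
import org.apache.spark.sql.SparkSession

object TeradataReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("teradata-read-sketch").getOrCreate()

    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:teradata://teradata-host/DBS_PORT=1025")
      .option("driver", "com.teradata.jdbc.TeraDriver")
      .option("dbtable", "sales.orders")
      .option("user", "spark_user")
      .option("password", "***")
      // With preferTimestampNTZ=true, TIMESTAMP / TIME columns map to TimestampNTZType
      // instead of TimestampType, per the mapping table added in this commit.
      .option("preferTimestampNTZ", "true")
      .load()

    df.printSchema()
    spark.stop()
  }
}
```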
(spark) branch master updated: [SPARK-48325][CORE] Always specify messages in ExecutorRunner.killProcess
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 7d96334902f2 [SPARK-48325][CORE] Always specify messages in ExecutorRunner.killProcess 7d96334902f2 is described below commit 7d96334902f22a80af63ce1253d5abda63178c4e Author: Bo Zhang AuthorDate: Fri May 24 15:54:21 2024 -0700 [SPARK-48325][CORE] Always specify messages in ExecutorRunner.killProcess ### What changes were proposed in this pull request? This change is to always specify the message in `ExecutorRunner.killProcess`. ### Why are the changes needed? This is to get the occurrence rate for different cases when killing the executor process, in order to analyze executor running stability. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A ### Was this patch authored or co-authored using generative AI tooling? No Closes #46641 from bozhang2820/spark-48325. Authored-by: Bo Zhang Signed-off-by: Dongjoon Hyun --- .../scala/org/apache/spark/deploy/worker/ExecutorRunner.scala | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala b/core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala index 7bb8b74eb021..bd98f19cdb60 100644 --- a/core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala +++ b/core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala @@ -88,7 +88,7 @@ private[deploy] class ExecutorRunner( if (state == ExecutorState.LAUNCHING || state == ExecutorState.RUNNING) { state = ExecutorState.FAILED } - killProcess(Some("Worker shutting down")) } + killProcess("Worker shutting down") } } /** @@ -96,7 +96,7 @@ private[deploy] class ExecutorRunner( * * @param message the exception message which caused the executor's death */ - private def killProcess(message: Option[String]): Unit = { + private def killProcess(message: String): Unit = { var exitCode: Option[Int] = None if (process != null) { logInfo("Killing process!") @@ -113,7 +113,7 @@ private[deploy] class ExecutorRunner( } } try { - worker.send(ExecutorStateChanged(appId, execId, state, message, exitCode)) + worker.send(ExecutorStateChanged(appId, execId, state, Some(message), exitCode)) } catch { case e: IllegalStateException => logWarning(log"${MDC(ERROR, e.getMessage())}", e) } @@ -206,11 +206,11 @@ private[deploy] class ExecutorRunner( case interrupted: InterruptedException => logInfo("Runner thread for executor " + fullId + " interrupted") state = ExecutorState.KILLED -killProcess(None) +killProcess(s"Runner thread for executor $fullId interrupted") case e: Exception => logError("Error running executor", e) state = ExecutorState.FAILED -killProcess(Some(e.toString)) +killProcess(s"Error running executor: $e") } } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (febdbf56fb22 -> 80c0f1165417)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from febdbf56fb22 [SPARK-48031] Grandfather legacy views to SCHEMA BINDING add 80c0f1165417 [SPARK-48381][K8S][DOCS] Update `YuniKorn` docs with v1.5.1 No new revisions were added by this update. Summary of changes: docs/running-on-kubernetes.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48329][SQL] Enable `spark.sql.sources.v2.bucketing.pushPartValues.enabled` by default
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 6b3a88195e30 [SPARK-48329][SQL] Enable `spark.sql.sources.v2.bucketing.pushPartValues.enabled` by default 6b3a88195e30 is described below commit 6b3a88195e30027b74166d7729c232cd7ddba83b Author: Szehon Ho AuthorDate: Tue May 21 10:00:14 2024 -0700 [SPARK-48329][SQL] Enable `spark.sql.sources.v2.bucketing.pushPartValues.enabled` by default ### What changes were proposed in this pull request? This PR aims to enable `spark.sql.sources.v2.bucketing.pushPartValues.enabled` by default for Apache Spark 4.0.0 while keeping `spark.sql.sources.v2.bucketing.enabled` is `false`. ### Why are the changes needed? `spark.sql.sources.v2.bucketing.pushPartValues.enabled` was added at Apache Spark 3.4.0 and has been used as one of the datasource v2 bucketing feature. This PR will help the datasource v2 bucketing users use this feature more easily. Note that this change is technically no-op for the default users because `spark.sql.sources.v2.bucketing.enabled` is `false` still. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No Closes #46673 from szehon-ho/default_pushpart. Lead-authored-by: Szehon Ho Co-authored-by: chesterxu Signed-off-by: Dongjoon Hyun --- docs/sql-migration-guide.md | 1 + sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md index 98075d019585..6e400ab93711 100644 --- a/docs/sql-migration-guide.md +++ b/docs/sql-migration-guide.md @@ -57,6 +57,7 @@ license: | - Since Spark 4.0, A bug falsely allowing `!` instead of `NOT` when `!` is not a prefix operator has been fixed. Clauses such as `expr ! IN (...)`, `expr ! BETWEEN ...`, or `col ! NULL` now raise syntax errors. To restore the previous behavior, set `spark.sql.legacy.bangEqualsNot` to `true`. - Since Spark 4.0, By default views tolerate column type changes in the query and compensate with casts. To restore the previous behavior, allowing up-casts only, set `spark.sql.legacy.viewSchemaCompensation` to `false`. - Since Spark 4.0, Views allow control over how they react to underlying query changes. By default views tolerate column type changes in the query and compensate with casts. To disable thsi feature set `spark.sql.legacy.viewSchemaBindingMode` to `false`. This also removes the clause from `DESCRIBE EXTENDED` and `SHOW CREATE TABLE`. +- Since Spark 4.0, The Storage-Partitioned Join feature flag `spark.sql.sources.v2.bucketing.pushPartValues.enabled` is set to `true`. To restore the previous behavior, set `spark.sql.sources.v2.bucketing.pushPartValues.enabled` to `false`. ## Upgrading from Spark SQL 3.5.1 to 3.5.2 diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala index 87b32ca0b9b5..9c4236679f3a 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala @@ -1569,7 +1569,7 @@ object SQLConf { "side. 
This could help to eliminate unnecessary shuffles")
       .version("3.4.0")
       .booleanConf
-      .createWithDefault(false)
+      .createWithDefault(true)
 
   val V2_BUCKETING_PARTIALLY_CLUSTERED_DISTRIBUTION_ENABLED =
     buildConf("spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled")

- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
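To see how the two flags fit together from a user's point of view, here is a minimal sketch. Only the configuration names come from the commit above; the catalog and table names are hypothetical.

```
// Sketch only: spark.sql.sources.v2.bucketing.enabled still defaults to false,
// so the newly flipped pushPartValues flag is a no-op until bucketing itself
// is switched on. Catalog and table names below are hypothetical.
import org.apache.spark.sql.SparkSession

object StoragePartitionedJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spj-config-sketch")
      // Opt in to storage-partitioned joins for V2 sources that report partitioning.
      .config("spark.sql.sources.v2.bucketing.enabled", "true")
      // Already true by default on master after this commit; spelled out for clarity.
      .config("spark.sql.sources.v2.bucketing.pushPartValues.enabled", "true")
      .getOrCreate()

    // A join between two compatibly partitioned V2 tables can now avoid a shuffle.
    spark.sql(
      """SELECT *
        |FROM catalog.db.orders o
        |JOIN catalog.db.customers c
        |ON o.customer_id = c.customer_id""".stripMargin).explain()

    spark.stop()
  }
}
```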
(spark) branch master updated (4fc2910f92d1 -> f5ffb74f170e)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 4fc2910f92d1 [SPARK-48238][BUILD][YARN] Replace YARN AmIpFilter with a forked implementation add f5ffb74f170e [SPARK-48328][BUILD] Upgrade `Arrow` to 16.1.0 No new revisions were added by this update. Summary of changes: dev/deps/spark-deps-hadoop-3-hive-2.3 | 10 +- pom.xml | 2 +- 2 files changed, 6 insertions(+), 6 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark-kubernetes-operator) branch main updated: [SPARK-48017] Add Spark application submission worker for operator
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/spark-kubernetes-operator.git The following commit(s) were added to refs/heads/main by this push: new e747bcf [SPARK-48017] Add Spark application submission worker for operator e747bcf is described below commit e747bcfab106b828bbd9f2d44968698e5dce3c33 Author: zhou-jiang AuthorDate: Mon May 20 10:41:48 2024 -0700 [SPARK-48017] Add Spark application submission worker for operator ### What changes were proposed in this pull request? This is a breakdown PR of #2 - adding a submission worker implementation for SparkApplication. ### Why are the changes needed? Spark Operator needs a submission worker to convert its abstraction (the SparkApplication API) into k8s resource spec. This is a light-weight implementation based on native k8s integration. As of now, it's based off Spark 4.0.0-preview1 - but it's assumed to serve all Spark LTS versions. This is feasible because as it aims to cover only the spec generation, Spark core jars are still brought-in by application images. E2Es would set up with operator later to ensure that. Per SPIP doc, in future operator version(s) we may add more implementations for submission worker based on different Spark versions to achieve 100% version agnostic, at the cost of having multiple workers stand-by. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit test coverage. ### Was this patch authored or co-authored using generative AI tooling? no Closes #10 from jiangzho/worker. Authored-by: zhou-jiang Signed-off-by: Dongjoon Hyun --- build.gradle | 4 + gradle.properties | 7 + settings.gradle| 1 + spark-operator-api/build.gradle| 1 + .../spark/k8s/operator/utils/ModelUtils.java | 9 + spark-submission-worker/build.gradle | 18 ++ .../spark/k8s/operator/SparkAppDriverConf.java | 73 +++ .../spark/k8s/operator/SparkAppResourceSpec.java | 129 .../k8s/operator/SparkAppSubmissionWorker.java | 175 + .../spark/k8s/operator/SparkAppDriverConfTest.java | 75 +++ .../k8s/operator/SparkAppResourceSpecTest.java | 137 + .../k8s/operator/SparkAppSubmissionWorkerTest.java | 218 + 12 files changed, 847 insertions(+) diff --git a/build.gradle b/build.gradle index a6c1701..c0c75d0 100644 --- a/build.gradle +++ b/build.gradle @@ -25,6 +25,10 @@ subprojects { repositories { mavenCentral() +// TODO(SPARK-48326) Upgrade submission worker base Spark version to 4.0.0-preview2 +maven { + url "https://repository.apache.org/content/repositories/orgapachespark-1454/; +} } apply plugin: 'checkstyle' diff --git a/gradle.properties b/gradle.properties index 2606179..ffa8302 100644 --- a/gradle.properties +++ b/gradle.properties @@ -18,17 +18,24 @@ group=org.apache.spark.k8s.operator version=0.1.0 +# Caution: fabric8 version should be aligned with Spark dependency fabric8Version=6.12.1 commonsLang3Version=3.14.0 commonsIOVersion=2.16.1 lombokVersion=1.18.32 +# Spark +scalaVersion=2.13 +# TODO(SPARK-48326) Upgrade submission worker base Spark version to 4.0.0-preview2 +sparkVersion=4.0.0-preview1 + # Logging log4jVersion=2.22.1 # Test junitVersion=5.10.2 jacocoVersion=0.8.12 +mockitoVersion=5.11.0 # Build Analysis checkstyleVersion=10.15.0 diff --git a/settings.gradle b/settings.gradle index 69e7827..6808ec7 100644 --- a/settings.gradle +++ b/settings.gradle @@ -1,2 +1,3 @@ rootProject.name = 'apache-spark-kubernetes-operator' include 'spark-operator-api' +include 'spark-submission-worker' diff --git 
a/spark-operator-api/build.gradle b/spark-operator-api/build.gradle index b57beca..696415f 100644 --- a/spark-operator-api/build.gradle +++ b/spark-operator-api/build.gradle @@ -18,6 +18,7 @@ dependencies { testImplementation platform("org.junit:junit-bom:$junitVersion") testImplementation 'org.junit.jupiter:junit-jupiter' + testRuntimeOnly "org.junit.platform:junit-platform-launcher" } test { diff --git a/spark-operator-api/src/main/java/org/apache/spark/k8s/operator/utils/ModelUtils.java b/spark-operator-api/src/main/java/org/apache/spark/k8s/operator/utils/ModelUtils.java index 454e706..03d84be 100644 --- a/spark-operator-api/src/main/java/org/apache/spark/k8s/operator/utils/ModelUtils.java +++ b/spark-operator-api/src/main/java/org/apache/spark/k8s/operator/utils/ModelUtils.java @@ -36,6 +36,7 @@ import io.fabric8.kubernetes.api.model.PodBuilder; import io.fabric8.kubernetes.api.model
(spark) branch master updated (6767053dacd9 -> a2d93d104a6c)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 6767053dacd9 [SPARK-48218][CORE] TransportClientFactory.createClient may NPE cause FetchFailedException add a2d93d104a6c [SPARK-48256][BUILD] Add a rule to check file headers for the java side, and fix inconsistent files No new revisions were added by this update. Summary of changes: .../protocol/EncryptedMessageWithHeader.java | 2 +- .../spark/unsafe/types/CalendarIntervalSuite.java | 30 +++--- .../apache/spark/unsafe/types/UTF8StringSuite.java | 30 +++--- .../spark/io/NioBufferedFileInputStream.java | 11 +--- .../org/apache/spark/io/ReadAheadInputStream.java | 11 +--- dev/checkstyle-suppressions.xml| 4 +++ dev/checkstyle.xml | 6 + .../hive/package-info.java => dev/java-file-header | 4 +-- .../spark/sql/connector/catalog/Identifier.java| 2 +- .../sql/connector/catalog/IdentifierImpl.java | 2 +- .../spark/sql/connector/catalog/CatalogPlugin.java | 2 +- .../sql/connector/catalog/MetadataColumn.java | 26 +-- .../connector/catalog/SupportsMetadataColumns.java | 26 +-- .../sql/connector/catalog/index/SupportsIndex.java | 2 +- .../sql/connector/catalog/index/TableIndex.java| 2 +- .../sql/connector/catalog/CatalogLoadingSuite.java | 2 +- .../parquet/filter2/predicate/SparkFilterApi.java | 26 +-- .../spark/sql/JavaDataFrameReaderWriterSuite.java | 30 +++--- .../execution/datasources/orc/FakeKeyProvider.java | 15 +-- 19 files changed, 120 insertions(+), 113 deletions(-) copy sql/hive/src/main/scala/org/apache/spark/sql/hive/package-info.java => dev/java-file-header (95%) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48218][CORE] TransportClientFactory.createClient may NPE cause FetchFailedException
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 6767053dacd9 [SPARK-48218][CORE] TransportClientFactory.createClient may NPE cause FetchFailedException 6767053dacd9 is described below commit 6767053dacd9df623336e1f5faabf1eb16b7a7dd Author: sychen AuthorDate: Wed May 15 09:33:39 2024 -0700 [SPARK-48218][CORE] TransportClientFactory.createClient may NPE cause FetchFailedException ### What changes were proposed in this pull request? This PR aims to add a check for `TransportChannelHandler` to be non-null in the `TransportClientFactory.createClient` method. ### Why are the changes needed? Line 178 synchronized (handler) , handler == null org.apache.spark.network.client.TransportClientFactory#createClient(java.lang.String, int, boolean) ```java TransportChannelHandler handler = cachedClient.getChannel().pipeline() .get(TransportChannelHandler.class); synchronized (handler) { handler.getResponseHandler().updateTimeOfLastRequest(); } ``` ```java org.apache.spark.shuffle.FetchFailedException at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:1180) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:913) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:84) at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29) Caused by: java.lang.NullPointerException at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:178) at org.apache.spark.network.shuffle.ExternalBlockStoreClient.lambda$fetchBlocks$0(ExternalBlockStoreClient.java:128) at org.apache.spark.network.shuffle.RetryingBlockTransferor.transferAllOutstanding(RetryingBlockTransferor.java:154) at org.apache.spark.network.shuffle.RetryingBlockTransferor.start(RetryingBlockTransferor.java:133) at org.apache.spark.network.shuffle.ExternalBlockStoreClient.fetchBlocks(ExternalBlockStoreClient.java:139) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? No Closes #46506 from cxzl25/SPARK-48218. Authored-by: sychen Signed-off-by: Dongjoon Hyun --- .../org/apache/spark/network/client/TransportClientFactory.java | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java b/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java index ddf1b3cce349..f2dbfd92b854 100644 --- a/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java +++ b/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java @@ -171,8 +171,10 @@ public class TransportClientFactory implements Closeable { // this code was able to update things. 
TransportChannelHandler handler = cachedClient.getChannel().pipeline()
         .get(TransportChannelHandler.class);
-      synchronized (handler) {
-        handler.getResponseHandler().updateTimeOfLastRequest();
+      if (handler != null) {
+        synchronized (handler) {
+          handler.getResponseHandler().updateTimeOfLastRequest();
+        }
       }
 
       if (cachedClient.isActive()) {

- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (973328cd376b -> 12820e11b094)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 973328cd376b [SPARK-48285][SQL][DOCS] Update docs for size function and sizeOfNull configuration add 12820e11b094 [SPARK-48049][BUILD] Upgrade Scala to 2.13.14 No new revisions were added by this update. Summary of changes: dev/deps/spark-deps-hadoop-3-hive-2.3 | 10 +- docs/_config.yml | 2 +- pom.xml | 12 ++-- 3 files changed, 16 insertions(+), 8 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (9e386b472981 -> 973328cd376b)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 9e386b472981 [SPARK-48172][SQL] Fix escaping issues in JDBCDialects add 973328cd376b [SPARK-48285][SQL][DOCS] Update docs for size function and sizeOfNull configuration No new revisions were added by this update. Summary of changes: .../jvm/src/main/scala/org/apache/spark/sql/functions.scala | 12 ++-- .../sql/catalyst/expressions/collectionOperations.scala | 6 +++--- sql/core/src/main/scala/org/apache/spark/sql/functions.scala | 12 ++-- 3 files changed, 15 insertions(+), 15 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48279][BUILD] Upgrade ORC to 2.0.1
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 76329f9bb60e [SPARK-48279][BUILD] Upgrade ORC to 2.0.1 76329f9bb60e is described below commit 76329f9bb60e3c61e16e8a285fe00cf4f185efd5 Author: William Hyun AuthorDate: Wed May 15 01:17:54 2024 -0700 [SPARK-48279][BUILD] Upgrade ORC to 2.0.1 ### What changes were proposed in this pull request? This PR aims to upgrade ORC to 2.0.1 ### Why are the changes needed? Apache ORC 2.0.1 is the first maintenance release of 2.0.x line. - https://orc.apache.org/news/2024/05/14/ORC-2.0.1/ ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46587 from williamhyun/SPARK-48279. Authored-by: William Hyun Signed-off-by: Dongjoon Hyun --- dev/deps/spark-deps-hadoop-3-hive-2.3 | 6 +++--- pom.xml | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3 index 2b444dddcbe9..598be34e5e0f 100644 --- a/dev/deps/spark-deps-hadoop-3-hive-2.3 +++ b/dev/deps/spark-deps-hadoop-3-hive-2.3 @@ -231,10 +231,10 @@ opencsv/2.3//opencsv-2.3.jar opentracing-api/0.33.0//opentracing-api-0.33.0.jar opentracing-noop/0.33.0//opentracing-noop-0.33.0.jar opentracing-util/0.33.0//opentracing-util-0.33.0.jar -orc-core/2.0.0/shaded-protobuf/orc-core-2.0.0-shaded-protobuf.jar +orc-core/2.0.1/shaded-protobuf/orc-core-2.0.1-shaded-protobuf.jar orc-format/1.0.0/shaded-protobuf/orc-format-1.0.0-shaded-protobuf.jar -orc-mapreduce/2.0.0/shaded-protobuf/orc-mapreduce-2.0.0-shaded-protobuf.jar -orc-shims/2.0.0//orc-shims-2.0.0.jar +orc-mapreduce/2.0.1/shaded-protobuf/orc-mapreduce-2.0.1-shaded-protobuf.jar +orc-shims/2.0.1//orc-shims-2.0.1.jar oro/2.0.8//oro-2.0.8.jar osgi-resource-locator/1.0.3//osgi-resource-locator-1.0.3.jar paranamer/2.8//paranamer-2.8.jar diff --git a/pom.xml b/pom.xml index 12d20f4f0736..ce7d2546e7c2 100644 --- a/pom.xml +++ b/pom.xml @@ -138,7 +138,7 @@ 10.16.1.1 1.13.1 -2.0.0 +2.0.1 shaded-protobuf 11.0.20 5.0.0 - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of `test-dependencies.sh` script
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f699f556d8a0 [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of `test-dependencies.sh` script f699f556d8a0 is described below commit f699f556d8a09bb755e9c8558661a36fbdb42e73 Author: panbingkun AuthorDate: Fri May 10 19:54:29 2024 -0700 [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of `test-dependencies.sh` script ### What changes were proposed in this pull request? The pr aims to delete the dir `dev/pr-deps` after executing `test-dependencies.sh`. ### Why are the changes needed? We'd better clean the `temporary files` generated at the end. Before: ``` sh dev/test-dependencies.sh ``` https://github.com/apache/spark/assets/15246973/39a56983-774c-4c2d-897d-26a7d0999456;> After: ``` sh dev/test-dependencies.sh ``` https://github.com/apache/spark/assets/15246973/f7e76e22-63cf-4411-99d0-5e844f8d5a7a;> ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46531 from panbingkun/minor_test-dependencies. Authored-by: panbingkun Signed-off-by: Dongjoon Hyun --- dev/test-dependencies.sh | 4 1 file changed, 4 insertions(+) diff --git a/dev/test-dependencies.sh b/dev/test-dependencies.sh index 048c59f4cec9..e645a66165a2 100755 --- a/dev/test-dependencies.sh +++ b/dev/test-dependencies.sh @@ -140,4 +140,8 @@ for HADOOP_HIVE_PROFILE in "${HADOOP_HIVE_PROFILES[@]}"; do fi done +if [[ -d "$FWDIR/dev/pr-deps" ]]; then + rm -rf "$FWDIR/dev/pr-deps" +fi + exit 0 - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.4 updated: [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of `test-dependencies.sh` script
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 1e0fc1ef96aa [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of `test-dependencies.sh` script 1e0fc1ef96aa is described below commit 1e0fc1ef96aa6f541134224f1ba626f234442e74 Author: panbingkun AuthorDate: Fri May 10 19:54:29 2024 -0700 [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of `test-dependencies.sh` script ### What changes were proposed in this pull request? The pr aims to delete the dir `dev/pr-deps` after executing `test-dependencies.sh`. ### Why are the changes needed? We'd better clean the `temporary files` generated at the end. Before: ``` sh dev/test-dependencies.sh ``` https://github.com/apache/spark/assets/15246973/39a56983-774c-4c2d-897d-26a7d0999456;> After: ``` sh dev/test-dependencies.sh ``` https://github.com/apache/spark/assets/15246973/f7e76e22-63cf-4411-99d0-5e844f8d5a7a;> ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46531 from panbingkun/minor_test-dependencies. Authored-by: panbingkun Signed-off-by: Dongjoon Hyun (cherry picked from commit f699f556d8a09bb755e9c8558661a36fbdb42e73) Signed-off-by: Dongjoon Hyun --- dev/test-dependencies.sh | 4 1 file changed, 4 insertions(+) diff --git a/dev/test-dependencies.sh b/dev/test-dependencies.sh index 2268a262d5f8..2907ef27189c 100755 --- a/dev/test-dependencies.sh +++ b/dev/test-dependencies.sh @@ -144,4 +144,8 @@ for HADOOP_HIVE_PROFILE in "${HADOOP_HIVE_PROFILES[@]}"; do fi done +if [[ -d "$FWDIR/dev/pr-deps" ]]; then + rm -rf "$FWDIR/dev/pr-deps" +fi + exit 0 - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.5 updated: [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of `test-dependencies.sh` script
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new e9a1b4254419 [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of `test-dependencies.sh` script e9a1b4254419 is described below commit e9a1b4254419c751e612cd5e5c56f111b41399e7 Author: panbingkun AuthorDate: Fri May 10 19:54:29 2024 -0700 [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of `test-dependencies.sh` script ### What changes were proposed in this pull request? The pr aims to delete the dir `dev/pr-deps` after executing `test-dependencies.sh`. ### Why are the changes needed? We'd better clean the `temporary files` generated at the end. Before: ``` sh dev/test-dependencies.sh ``` https://github.com/apache/spark/assets/15246973/39a56983-774c-4c2d-897d-26a7d0999456;> After: ``` sh dev/test-dependencies.sh ``` https://github.com/apache/spark/assets/15246973/f7e76e22-63cf-4411-99d0-5e844f8d5a7a;> ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46531 from panbingkun/minor_test-dependencies. Authored-by: panbingkun Signed-off-by: Dongjoon Hyun (cherry picked from commit f699f556d8a09bb755e9c8558661a36fbdb42e73) Signed-off-by: Dongjoon Hyun --- dev/test-dependencies.sh | 4 1 file changed, 4 insertions(+) diff --git a/dev/test-dependencies.sh b/dev/test-dependencies.sh index d7967ac3afa9..36cc7a4f994d 100755 --- a/dev/test-dependencies.sh +++ b/dev/test-dependencies.sh @@ -140,4 +140,8 @@ for HADOOP_HIVE_PROFILE in "${HADOOP_HIVE_PROFILES[@]}"; do fi done +if [[ -d "$FWDIR/dev/pr-deps" ]]; then + rm -rf "$FWDIR/dev/pr-deps" +fi + exit 0 - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48205][SQL][FOLLOWUP] Add missing tags for the dataSource API
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d82458f15539 [SPARK-48205][SQL][FOLLOWUP] Add missing tags for the dataSource API d82458f15539 is described below commit d82458f15539eef8df320345a7c2382ca4d5be8a Author: allisonwang-db AuthorDate: Fri May 10 16:31:47 2024 -0700 [SPARK-48205][SQL][FOLLOWUP] Add missing tags for the dataSource API ### What changes were proposed in this pull request? This is a follow-up PR for https://github.com/apache/spark/pull/46487 to add missing tags for the `dataSource` API. ### Why are the changes needed? To address comments from a previous PR. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test ### Was this patch authored or co-authored using generative AI tooling? No Closes #46530 from allisonwang-db/spark-48205-followup. Authored-by: allisonwang-db Signed-off-by: Dongjoon Hyun --- sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala | 4 1 file changed, 4 insertions(+) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala b/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala index d5de74455dce..466e4cf81318 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala @@ -233,7 +233,11 @@ class SparkSession private( /** * A collection of methods for registering user-defined data sources. + * + * @since 4.0.0 */ + @Experimental + @Unstable def dataSource: DataSourceRegistration = sessionState.dataSourceRegistration /** - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48236][BUILD] Add `commons-lang:commons-lang:2.6` back to support legacy Hive UDF jars
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 5b3b8a90638c [SPARK-48236][BUILD] Add `commons-lang:commons-lang:2.6` back to support legacy Hive UDF jars 5b3b8a90638c is described below commit 5b3b8a90638c49fc7ddcace69a85989c1053f1ab Author: Dongjoon Hyun AuthorDate: Fri May 10 15:48:08 2024 -0700 [SPARK-48236][BUILD] Add `commons-lang:commons-lang:2.6` back to support legacy Hive UDF jars ### What changes were proposed in this pull request? This PR aims to add `commons-lang:commons-lang:2.6` back to support legacy Hive UDF jars . This is a partial revert of SPARK-47018 . ### Why are the changes needed? Recently, we dropped `commons-lang:commons-lang` during Hive upgrade. - #46468 However, only Apache Hive 2.3.10 or 4.0.0 dropped it. In other words, Hive 2.0.0 ~ 2.3.9 and Hive 3.0.0 ~ 3.1.3 requires it. As a result, all existing UDF jars built against those versions requires `commons-lang:commons-lang` still. - https://github.com/apache/hive/pull/4892 For example, Apache Hive 3.1.3 code: - https://github.com/apache/hive/blob/af7059e2bdc8b18af42e0b7f7163b923a0bfd424/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFTrim.java#L21 ``` import org.apache.commons.lang.StringUtils; ``` - https://github.com/apache/hive/blob/af7059e2bdc8b18af42e0b7f7163b923a0bfd424/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFTrim.java#L42 ``` return StringUtils.strip(val, " "); ``` As a result, Maven CIs are broken. - https://github.com/apache/spark/actions/runs/9032639456/job/24825599546 (Maven / Java 17) - https://github.com/apache/spark/actions/runs/9033374547/job/24835284769 (Maven / Java 21) The root cause is that the existing test UDF jar `hive-test-udfs.jar` was built from old Hive (before 2.3.10) libraries which requires `commons-lang:commons-lang:2.6`. ``` HiveUDFDynamicLoadSuite: - Spark should be able to run Hive UDF using jar regardless of current thread context classloader (UDF 20:21:25.129 WARN org.apache.spark.SparkContext: The JAR file:///home/runner/work/spark/spark/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:33327/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version. *** RUN ABORTED *** A needed class was not found. This could be due to an error in your runpath. 
Missing class: org/apache/commons/lang/StringUtils java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils at org.apache.hadoop.hive.contrib.udf.example.GenericUDFTrim2.performOp(GenericUDFTrim2.java:43) at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBaseTrim.evaluate(GenericUDFBaseTrim.java:75) at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:170) at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118) at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117) at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132) at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132) at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeHiveFunctionExpression(HiveSessionStateBuilder.scala:184) at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.$anonfun$makeExpression$1(HiveSessionStateBuilder.scala:164) at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:185) ... Cause: java.lang.ClassNotFoundException: org.apache.commons.lang.StringUtils at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445) at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:593) at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:526) at org.apache.hadoop.hive.contrib.udf.example.GenericUDFTrim2.performOp(GenericUDFTrim2.java:43) at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBaseTrim.evaluate(GenericUDFBaseTrim.java:75) at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:170) at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118) at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117) at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132) at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132) ... ``` ### Does this PR introduce _any_ user-facing change? To support the existin
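The scenario this dependency keeps working can be sketched as follows. `ADD JAR` and `CREATE TEMPORARY FUNCTION` are the standard Spark SQL statements involved; the jar path and UDF class name below are hypothetical placeholders, standing in for any UDF compiled against Hive 2.x/3.x.

```
// Sketch only: registering a legacy Hive UDF jar. The jar path and UDF class
// are hypothetical; UDFs built against Hive 2.x/3.x may reference
// org.apache.commons.lang.StringUtils (commons-lang 2.x) at runtime.
import org.apache.spark.sql.SparkSession

object LegacyHiveUdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("legacy-hive-udf-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // The jar was built against an older Hive, so its classes expect
    // commons-lang 2.x (org.apache.commons.lang.*) on the classpath.
    spark.sql("ADD JAR /path/to/legacy-hive-udfs.jar")
    spark.sql(
      "CREATE TEMPORARY FUNCTION legacy_trim AS " +
        "'com.example.hive.udf.LegacyTrimUDF'")

    // Without commons-lang:commons-lang:2.6 on the classpath this call fails with
    // NoClassDefFoundError: org/apache/commons/lang/StringUtils.
    spark.sql("SELECT legacy_trim('  padded  ')").show()

    spark.stop()
  }
}
```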
(spark) branch master updated: Revert "[SPARK-48230][BUILD] Remove unused `jodd-core`"
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 726ef8aa66ea Revert "[SPARK-48230][BUILD] Remove unused `jodd-core`" 726ef8aa66ea is described below commit 726ef8aa66ea6e56b739f3b16f99e457a0febb81 Author: Dongjoon Hyun AuthorDate: Fri May 10 15:34:12 2024 -0700 Revert "[SPARK-48230][BUILD] Remove unused `jodd-core`" This reverts commit d8151186d79459fbde27a01bd97328e73548c55a. --- LICENSE-binary| 1 + dev/deps/spark-deps-hadoop-3-hive-2.3 | 1 + licenses-binary/LICENSE-jodd.txt | 24 pom.xml | 6 ++ sql/hive/pom.xml | 4 5 files changed, 36 insertions(+) diff --git a/LICENSE-binary b/LICENSE-binary index 034215f0ab15..40271c9924bc 100644 --- a/LICENSE-binary +++ b/LICENSE-binary @@ -436,6 +436,7 @@ com.esotericsoftware:reflectasm org.codehaus.janino:commons-compiler org.codehaus.janino:janino jline:jline +org.jodd:jodd-core com.github.wendykierp:JTransforms pl.edu.icm:JLargeArrays diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3 index 29997815e5bc..392bacd73277 100644 --- a/dev/deps/spark-deps-hadoop-3-hive-2.3 +++ b/dev/deps/spark-deps-hadoop-3-hive-2.3 @@ -143,6 +143,7 @@ jline/2.14.6//jline-2.14.6.jar jline/3.24.1//jline-3.24.1.jar jna/5.13.0//jna-5.13.0.jar joda-time/2.12.7//joda-time-2.12.7.jar +jodd-core/3.5.2//jodd-core-3.5.2.jar jpam/1.1//jpam-1.1.jar json/1.8//json-1.8.jar json4s-ast_2.13/4.0.7//json4s-ast_2.13-4.0.7.jar diff --git a/licenses-binary/LICENSE-jodd.txt b/licenses-binary/LICENSE-jodd.txt new file mode 100644 index ..cc6b458adb38 --- /dev/null +++ b/licenses-binary/LICENSE-jodd.txt @@ -0,0 +1,24 @@ +Copyright (c) 2003-present, Jodd Team (https://jodd.org) +All rights reserved. + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions are met: + +1. Redistributions of source code must retain the above copyright notice, +this list of conditions and the following disclaimer. + +2. Redistributions in binary form must reproduce the above copyright +notice, this list of conditions and the following disclaimer in the +documentation and/or other materials provided with the distribution. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" +AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE +LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR +CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF +SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS +INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN +CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) +ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE +POSSIBILITY OF SUCH DAMAGE. 
\ No newline at end of file diff --git a/pom.xml b/pom.xml index a98efe8aed1e..56a34cedde51 100644 --- a/pom.xml +++ b/pom.xml @@ -201,6 +201,7 @@ 3.1.9 3.0.12 2.12.7 +3.5.2 3.0.0 2.2.11 0.16.0 @@ -2782,6 +2783,11 @@ joda-time ${joda.version} + +org.jodd +jodd-core +${jodd.version} + org.datanucleus datanucleus-core diff --git a/sql/hive/pom.xml b/sql/hive/pom.xml index 5e9fc256e7e6..3895d9dc5a63 100644 --- a/sql/hive/pom.xml +++ b/sql/hive/pom.xml @@ -152,6 +152,10 @@ joda-time joda-time + + org.jodd + jodd-core + com.google.code.findbugs jsr305 - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (a6632ffa16f6 -> 2225aa1dab0f)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from a6632ffa16f6 [SPARK-48143][SQL] Use lightweight exceptions for control-flow between UnivocityParser and FailureSafeParser add 2225aa1dab0f [SPARK-48144][SQL] Fix `canPlanAsBroadcastHashJoin` to respect shuffle join hints No new revisions were added by this update. Summary of changes: .../spark/sql/catalyst/optimizer/joins.scala | 38 ++ .../spark/sql/execution/SparkStrategies.scala | 17 -- .../scala/org/apache/spark/sql/JoinSuite.scala | 26 +-- 3 files changed, 55 insertions(+), 26 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
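The hints whose precedence this fix preserves look like the following minimal sketch. The data is synthetic, and the plan actually chosen still depends on table sizes and configuration; the point is that an explicit shuffle-join hint should not be silently overridden by a broadcast plan.

```
// Sketch only: user-facing join hints involved in this fix, on synthetic data.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object JoinHintSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("join-hint-sketch").getOrCreate()
    import spark.implicits._

    val small = Seq((1, "a"), (2, "b")).toDF("id", "tag")
    val large = spark.range(0, 1000000L).toDF("id")

    // Ask for a shuffled hash join explicitly, even though `small` would
    // otherwise qualify for a broadcast hash join by size.
    small.hint("SHUFFLE_HASH").join(large, "id").explain()

    // The broadcast hint still works as before.
    broadcast(small).join(large, "id").explain()

    spark.stop()
  }
}
```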
(spark) branch master updated: [SPARK-47793][TEST][FOLLOWUP] Fix flaky test for Python data source exactly once
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 5beaf85cd5ef [SPARK-47793][TEST][FOLLOWUP] Fix flaky test for Python data source exactly once 5beaf85cd5ef is described below commit 5beaf85cd5ef2b84a67ebce712e8d73d1e7d41ff Author: Chaoqin Li AuthorDate: Fri May 10 08:24:42 2024 -0700 [SPARK-47793][TEST][FOLLOWUP] Fix flaky test for Python data source exactly once ### What changes were proposed in this pull request? Fix the flakiness in python streaming source exactly once test. The last executed batch may not be recorded in query progress, which cause the expected rows doesn't match. This fix takes the uncompleted batch into account and relax the condition ### Why are the changes needed? Fix flaky test. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Test change. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46481 from chaoqin-li1123/fix_python_ds_test. Authored-by: Chaoqin Li Signed-off-by: Dongjoon Hyun --- .../execution/python/PythonStreamingDataSourceSuite.scala| 12 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonStreamingDataSourceSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonStreamingDataSourceSuite.scala index 97e6467c3eaf..d1f7c597b308 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonStreamingDataSourceSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonStreamingDataSourceSuite.scala @@ -299,7 +299,7 @@ class PythonStreamingDataSourceSuite extends PythonDataSourceSuiteBase { val checkpointDir = new File(path, "checkpoint") val outputDir = new File(path, "output") val df = spark.readStream.format(dataSourceName).load() - var lastBatch = 0 + var lastBatchId = 0 // Restart streaming query multiple times to verify exactly once guarantee. for (i <- 1 to 5) { @@ -323,11 +323,15 @@ class PythonStreamingDataSourceSuite extends PythonDataSourceSuiteBase { } q.stop() q.awaitTermination() -lastBatch = q.lastProgress.batchId.toInt +lastBatchId = q.lastProgress.batchId.toInt } - assert(lastBatch > 20) + assert(lastBatchId > 20) + val rowCount = spark.read.format("json").load(outputDir.getAbsolutePath).count() + // There may be one uncommitted batch that is not recorded in query progress. + // The number of batch can be lastBatchId + 1 or lastBatchId + 2. + assert(rowCount == 2 * (lastBatchId + 1) || rowCount == 2 * (lastBatchId + 2)) checkAnswer(spark.read.format("json").load(outputDir.getAbsolutePath), -(0 to 2 * lastBatch + 1).map(Row(_))) +(0 until rowCount.toInt).map(Row(_))) } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-47441][YARN] Do not add log link for unmanaged AM in Spark UI
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c5b6ec734bd0 [SPARK-47441][YARN] Do not add log link for unmanaged AM in Spark UI c5b6ec734bd0 is described below commit c5b6ec734bd0c47551b59f9de13c6323b80974b2 Author: Yuming Wang AuthorDate: Fri May 10 08:22:03 2024 -0700 [SPARK-47441][YARN] Do not add log link for unmanaged AM in Spark UI ### What changes were proposed in this pull request? This PR makes it do not add log link for unmanaged AM in Spark UI. ### Why are the changes needed? Avoid start driver error messages: ``` 24/03/18 04:58:25,022 ERROR [spark-listener-group-appStatus] scheduler.AsyncEventQueue:97 : Listener AppStatusListener threw an exception java.lang.NumberFormatException: For input string: "null" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:67) ~[?:?] at java.lang.Integer.parseInt(Integer.java:668) ~[?:?] at java.lang.Integer.parseInt(Integer.java:786) ~[?:?] at scala.collection.immutable.StringLike.toInt(StringLike.scala:310) ~[scala-library-2.12.18.jar:?] at scala.collection.immutable.StringLike.toInt$(StringLike.scala:310) ~[scala-library-2.12.18.jar:?] at scala.collection.immutable.StringOps.toInt(StringOps.scala:33) ~[scala-library-2.12.18.jar:?] at org.apache.spark.util.Utils$.parseHostPort(Utils.scala:1105) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.status.ProcessSummaryWrapper.(storeTypes.scala:609) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.status.LiveMiscellaneousProcess.doUpdate(LiveEntity.scala:1045) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:50) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1233) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.status.AppStatusListener.onMiscellaneousProcessAdded(AppStatusListener.scala:1445) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.status.AppStatusListener.onOtherEvent(AppStatusListener.scala:113) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:100) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) ~[spark-core_2.12-3.5.1.jar:3.5.1] at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) ~[scala-library-2.12.18.jar:?] at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) ~[scala-library-2.12.18.jar:?] 
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1356) [spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96) [spark-core_2.12-3.5.1.jar:3.5.1] ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual testing: ```shell bin/spark-sql --master yarn --conf spark.yarn.unmanagedAM.enabled=true ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45565 from wangyum/SPARK-47441. Authored-by: Yuming Wang Signed-off-by: Dongjoon Hyun --- .../main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/resource-managers/yarn/src/main/scala/org/apache/spark/d
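The patch itself is truncated above, but the failure mode is visible in the stack trace: for an unmanaged AM there is no YARN container, so the listener ends up parsing a host/port string that is literally "null". A hypothetical sketch of the guard the title describes (the parameter and attribute names are placeholders, not the real `ApplicationMaster` members):

```scala
// Hypothetical guard: only expose a driver log link when the AM runs inside a container.
def amLogAttributes(containerId: Option[String], nodeHttpAddress: Option[String]): Map[String, String] =
  (containerId, nodeHttpAddress) match {
    case (Some(id), Some(addr)) =>
      Map("CONTAINER_ID" -> id, "NODE_HTTP_ADDRESS" -> addr, "LOG_FILES" -> "stderr,stdout")
    case _ =>
      // Unmanaged AM: no container, so publish no log link rather than "null" values
      // that would later crash Utils.parseHostPort in the AppStatusListener.
      Map.empty
  }
```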
(spark) branch master updated: [SPARK-48235][SQL] Directly pass join instead of all arguments to getBroadcastBuildSide and getShuffleHashJoinBuildSide
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 73bb619d45b2 [SPARK-48235][SQL] Directly pass join instead of all arguments to getBroadcastBuildSide and getShuffleHashJoinBuildSide 73bb619d45b2 is described below commit 73bb619d45b2d0699ca4a9d251eea57c359f275b Author: fred-db AuthorDate: Fri May 10 07:45:28 2024 -0700 [SPARK-48235][SQL] Directly pass join instead of all arguments to getBroadcastBuildSide and getShuffleHashJoinBuildSide ### What changes were proposed in this pull request? * Refactor getBroadcastBuildSide and getShuffleHashJoinBuildSide to pass the join as argument instead of all member variables of the join separately. ### Why are the changes needed? * Makes to code easier to read. ### Does this PR introduce _any_ user-facing change? * no ### How was this patch tested? * Existing UTs ### Was this patch authored or co-authored using generative AI tooling? * No Closes #46525 from fred-db/parameter-change. Authored-by: fred-db Signed-off-by: Dongjoon Hyun --- .../spark/sql/catalyst/optimizer/joins.scala | 56 +--- .../optimizer/JoinSelectionHelperSuite.scala | 59 +- .../spark/sql/execution/SparkStrategies.scala | 6 +-- 3 files changed, 40 insertions(+), 81 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala index 2b4ee033b088..5571178832db 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala @@ -289,58 +289,52 @@ case object BuildLeft extends BuildSide trait JoinSelectionHelper { def getBroadcastBuildSide( - left: LogicalPlan, - right: LogicalPlan, - joinType: JoinType, - hint: JoinHint, + join: Join, hintOnly: Boolean, conf: SQLConf): Option[BuildSide] = { val buildLeft = if (hintOnly) { - hintToBroadcastLeft(hint) + hintToBroadcastLeft(join.hint) } else { - canBroadcastBySize(left, conf) && !hintToNotBroadcastLeft(hint) + canBroadcastBySize(join.left, conf) && !hintToNotBroadcastLeft(join.hint) } val buildRight = if (hintOnly) { - hintToBroadcastRight(hint) + hintToBroadcastRight(join.hint) } else { - canBroadcastBySize(right, conf) && !hintToNotBroadcastRight(hint) + canBroadcastBySize(join.right, conf) && !hintToNotBroadcastRight(join.hint) } getBuildSide( - canBuildBroadcastLeft(joinType) && buildLeft, - canBuildBroadcastRight(joinType) && buildRight, - left, - right + canBuildBroadcastLeft(join.joinType) && buildLeft, + canBuildBroadcastRight(join.joinType) && buildRight, + join.left, + join.right ) } def getShuffleHashJoinBuildSide( - left: LogicalPlan, - right: LogicalPlan, - joinType: JoinType, - hint: JoinHint, + join: Join, hintOnly: Boolean, conf: SQLConf): Option[BuildSide] = { val buildLeft = if (hintOnly) { - hintToShuffleHashJoinLeft(hint) + hintToShuffleHashJoinLeft(join.hint) } else { - hintToPreferShuffleHashJoinLeft(hint) || -(!conf.preferSortMergeJoin && canBuildLocalHashMapBySize(left, conf) && - muchSmaller(left, right, conf)) || + hintToPreferShuffleHashJoinLeft(join.hint) || +(!conf.preferSortMergeJoin && canBuildLocalHashMapBySize(join.left, conf) && + muchSmaller(join.left, join.right, conf)) || forceApplyShuffledHashJoin(conf) } val buildRight = if (hintOnly) { - hintToShuffleHashJoinRight(hint) + 
hintToShuffleHashJoinRight(join.hint) } else { - hintToPreferShuffleHashJoinRight(hint) || -(!conf.preferSortMergeJoin && canBuildLocalHashMapBySize(right, conf) && - muchSmaller(right, left, conf)) || + hintToPreferShuffleHashJoinRight(join.hint) || +(!conf.preferSortMergeJoin && canBuildLocalHashMapBySize(join.right, conf) && + muchSmaller(join.right, join.left, conf)) || forceApplyShuffledHashJoin(conf) } getBuildSide( - canBuildShuffledHashJoinLeft(joinType) && buildLeft, - canBuildShuffledHashJoinRight(joinType) && buildRight, - left, - right + canBuildShuffledHashJoinLeft(join.joinType) && buildLeft, + canBuildShuffledHashJoinRight(join.joinType) && buildRight, +
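The refactor above is purely mechanical, but its motivation is easier to see in a reduced example: once the whole join node is passed, adding or renaming a member no longer ripples through every helper signature. The case classes below are simplified stand-ins for Catalyst's `Join` and `JoinHint`, not the real ones:

```scala
// Simplified stand-ins for Catalyst's Join / JoinHint.
case class Hint(broadcastLeft: Boolean, broadcastRight: Boolean)
case class JoinNode(leftRows: Long, rightRows: Long, joinType: String, hint: Hint)

// Before: every member of the join travels as a separate parameter.
def buildSideBefore(leftRows: Long, rightRows: Long, joinType: String, hint: Hint): String =
  if (hint.broadcastLeft || leftRows <= rightRows) "BuildLeft" else "BuildRight"

// After: callers pass the node itself, and the helper reads what it needs.
def buildSideAfter(join: JoinNode): String =
  if (join.hint.broadcastLeft || join.leftRows <= join.rightRows) "BuildLeft" else "BuildRight"
```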
(spark) branch master updated: [SPARK-48230][BUILD] Remove unused `jodd-core`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d8151186d794 [SPARK-48230][BUILD] Remove unused `jodd-core` d8151186d794 is described below commit d8151186d79459fbde27a01bd97328e73548c55a Author: Cheng Pan AuthorDate: Fri May 10 01:09:01 2024 -0700 [SPARK-48230][BUILD] Remove unused `jodd-core` ### What changes were proposed in this pull request? Remove a jar that has CVE https://github.com/advisories/GHSA-jrg3-qq99-35g7 ### Why are the changes needed? Previously, `jodd-core` came from Hive transitive deps, while https://github.com/apache/hive/pull/5151 (Hive 2.3.10) cut it out, so we can remove it from Spark now. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46520 from pan3793/SPARK-48230. Authored-by: Cheng Pan Signed-off-by: Dongjoon Hyun --- LICENSE-binary| 1 - dev/deps/spark-deps-hadoop-3-hive-2.3 | 1 - licenses-binary/LICENSE-jodd.txt | 24 pom.xml | 6 -- sql/hive/pom.xml | 4 5 files changed, 36 deletions(-) diff --git a/LICENSE-binary b/LICENSE-binary index 40271c9924bc..034215f0ab15 100644 --- a/LICENSE-binary +++ b/LICENSE-binary @@ -436,7 +436,6 @@ com.esotericsoftware:reflectasm org.codehaus.janino:commons-compiler org.codehaus.janino:janino jline:jline -org.jodd:jodd-core com.github.wendykierp:JTransforms pl.edu.icm:JLargeArrays diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3 index 392bacd73277..29997815e5bc 100644 --- a/dev/deps/spark-deps-hadoop-3-hive-2.3 +++ b/dev/deps/spark-deps-hadoop-3-hive-2.3 @@ -143,7 +143,6 @@ jline/2.14.6//jline-2.14.6.jar jline/3.24.1//jline-3.24.1.jar jna/5.13.0//jna-5.13.0.jar joda-time/2.12.7//joda-time-2.12.7.jar -jodd-core/3.5.2//jodd-core-3.5.2.jar jpam/1.1//jpam-1.1.jar json/1.8//json-1.8.jar json4s-ast_2.13/4.0.7//json4s-ast_2.13-4.0.7.jar diff --git a/licenses-binary/LICENSE-jodd.txt b/licenses-binary/LICENSE-jodd.txt deleted file mode 100644 index cc6b458adb38.. --- a/licenses-binary/LICENSE-jodd.txt +++ /dev/null @@ -1,24 +0,0 @@ -Copyright (c) 2003-present, Jodd Team (https://jodd.org) -All rights reserved. - -Redistribution and use in source and binary forms, with or without -modification, are permitted provided that the following conditions are met: - -1. Redistributions of source code must retain the above copyright notice, -this list of conditions and the following disclaimer. - -2. Redistributions in binary form must reproduce the above copyright -notice, this list of conditions and the following disclaimer in the -documentation and/or other materials provided with the distribution. - -THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" -AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE -IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE -ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE -LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR -CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF -SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS -INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN -CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) -ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE -POSSIBILITY OF SUCH DAMAGE. \ No newline at end of file diff --git a/pom.xml b/pom.xml index 56a34cedde51..a98efe8aed1e 100644 --- a/pom.xml +++ b/pom.xml @@ -201,7 +201,6 @@ 3.1.9 3.0.12 2.12.7 -3.5.2 3.0.0 2.2.11 0.16.0 @@ -2783,11 +2782,6 @@ joda-time ${joda.version} - -org.jodd -jodd-core -${jodd.version} - org.datanucleus datanucleus-core diff --git a/sql/hive/pom.xml b/sql/hive/pom.xml index 3895d9dc5a63..5e9fc256e7e6 100644 --- a/sql/hive/pom.xml +++ b/sql/hive/pom.xml @@ -152,10 +152,6 @@ joda-time joda-time - - org.jodd - jodd-core - com.google.code.findbugs jsr305 - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.5 updated: [SPARK-47847][CORE] Deprecate `spark.network.remoteReadNioBufferConversion`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new c048653435f9 [SPARK-47847][CORE] Deprecate `spark.network.remoteReadNioBufferConversion` c048653435f9 is described below commit c048653435f9b7c832f79d38a504a145a17654c0 Author: Cheng Pan AuthorDate: Thu May 9 22:55:07 2024 -0700 [SPARK-47847][CORE] Deprecate `spark.network.remoteReadNioBufferConversion` ### What changes were proposed in this pull request? `spark.network.remoteReadNioBufferConversion` was introduced in https://github.com/apache/spark/commit/2c82745686f4456c4d5c84040a431dcb5b6cb60b to allow disabling [SPARK-24307](https://issues.apache.org/jira/browse/SPARK-24307) for safety. Since no negative reports have surfaced during the whole Spark 3 period, [SPARK-24307](https://issues.apache.org/jira/browse/SPARK-24307) has proven solid enough, so this PR proposes marking the configuration deprecated in 3.5.2 and removing it in 4.1.0 or later. ### Why are the changes needed? Code cleanup. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46047 from pan3793/SPARK-47847. Authored-by: Cheng Pan Signed-off-by: Dongjoon Hyun (cherry picked from commit 33cac4436e593c9c501c5ff0eedf923d3a21899c) Signed-off-by: Dongjoon Hyun --- core/src/main/scala/org/apache/spark/SparkConf.scala | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/core/src/main/scala/org/apache/spark/SparkConf.scala b/core/src/main/scala/org/apache/spark/SparkConf.scala index 813a14acd19e..f49e9e357c84 100644 --- a/core/src/main/scala/org/apache/spark/SparkConf.scala +++ b/core/src/main/scala/org/apache/spark/SparkConf.scala @@ -638,7 +638,9 @@ private[spark] object SparkConf extends Logging { DeprecatedConfig("spark.blacklist.killBlacklistedExecutors", "3.1.0", "Please use spark.excludeOnFailure.killExcludedExecutors"), DeprecatedConfig("spark.yarn.blacklist.executor.launch.blacklisting.enabled", "3.1.0", -"Please use spark.yarn.executor.launch.excludeOnFailure.enabled") +"Please use spark.yarn.executor.launch.excludeOnFailure.enabled"), + DeprecatedConfig("spark.network.remoteReadNioBufferConversion", "3.5.2", +"Please open a JIRA ticket to report it if you need to use this configuration.") ) Map(configs.map { cfg => (cfg.key -> cfg) } : _*) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
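As a rough illustration of what the deprecation means for users on 3.5.2 and later: the key keeps working, but Spark is expected to log the message registered above when it is set. A small spark-shell style sketch (the exact warning text is Spark's, not guaranteed here):

```scala
import org.apache.spark.SparkConf

// The deprecated key is still honored; setting it should emit a deprecation warning
// steering users toward filing a JIRA ticket if they still depend on it.
val conf = new SparkConf()
  .setAppName("deprecation-demo")
  .set("spark.network.remoteReadNioBufferConversion", "true") // deprecated since 3.5.2

println(conf.get("spark.network.remoteReadNioBufferConversion")) // "true"
```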
(spark) branch master updated (8ccc8b92be50 -> 33cac4436e59)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 8ccc8b92be50 [SPARK-48201][DOCS][PYTHON] Make some corrections in the docstring of pyspark DataStreamReader methods add 33cac4436e59 [SPARK-47847][CORE] Deprecate `spark.network.remoteReadNioBufferConversion` No new revisions were added by this update. Summary of changes: core/src/main/scala/org/apache/spark/SparkConf.scala | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48201][DOCS][PYTHON] Make some corrections in the docstring of pyspark DataStreamReader methods
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8ccc8b92be50 [SPARK-48201][DOCS][PYTHON] Make some corrections in the docstring of pyspark DataStreamReader methods 8ccc8b92be50 is described below commit 8ccc8b92be50b1d5ef932873403e62e28c478781 Author: Chloe He AuthorDate: Thu May 9 22:07:04 2024 -0700 [SPARK-48201][DOCS][PYTHON] Make some corrections in the docstring of pyspark DataStreamReader methods ### What changes were proposed in this pull request? The docstrings of the pyspark DataStreamReader methods `csv()` and `text()` say that the `path` parameter can be a list, but passing a list actually raises an error. ### Why are the changes needed? The documentation is wrong. ### Does this PR introduce _any_ user-facing change? Yes. Fixes documentation. ### How was this patch tested? N/A ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46416 from chloeh13q/fix/streamread-docstring. Authored-by: Chloe He Signed-off-by: Dongjoon Hyun --- python/pyspark/sql/streaming/readwriter.py | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/python/pyspark/sql/streaming/readwriter.py b/python/pyspark/sql/streaming/readwriter.py index c2b75dd8f167..b202a499e8b0 100644 --- a/python/pyspark/sql/streaming/readwriter.py +++ b/python/pyspark/sql/streaming/readwriter.py @@ -553,8 +553,8 @@ class DataStreamReader(OptionUtils): Parameters -- -path : str or list -string, or list of strings, for input path(s). +path : str +string for input path. Other Parameters @@ -641,8 +641,8 @@ class DataStreamReader(OptionUtils): Parameters -- -path : str or list -string, or list of strings, for input path(s). +path : str +string for input path. schema : :class:`pyspark.sql.types.StructType` or str, optional an optional :class:`pyspark.sql.types.StructType` for the input schema or a DDL-formatted string (For example ``col0 INT, col1 DOUBLE``). - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
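To make the corrected contract concrete, a small spark-shell style sketch of the same distinction in Scala (the paths are hypothetical; the streaming reader monitors a single directory, while the batch reader accepts several):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("single-path-demo").master("local[*]").getOrCreate()

// Batch readers accept multiple input paths...
val batchDf = spark.read.format("csv").load("/data/day1", "/data/day2")

// ...but a streaming reader takes exactly one path, which is what the corrected
// docstrings now say. File sources also need an explicit schema when streaming.
val streamDf = spark.readStream
  .format("csv")
  .schema("value STRING")
  .load("/data/incoming")
```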
(spark) branch master updated: [SPARK-48228][PYTHON][CONNECT] Implement the missing function validation in ApplyInXXX
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 9bb15db85e53 [SPARK-48228][PYTHON][CONNECT] Implement the missing function validation in ApplyInXXX 9bb15db85e53 is described below commit 9bb15db85e53b69b9c0ba112cd1dd93d8213eea4 Author: Ruifeng Zheng AuthorDate: Thu May 9 22:01:13 2024 -0700 [SPARK-48228][PYTHON][CONNECT] Implement the missing function validation in ApplyInXXX ### What changes were proposed in this pull request? Implement the missing function validation in ApplyInXXX https://github.com/apache/spark/pull/46397 fixed this issue for `Cogrouped.ApplyInPandas`, this PR fix remaining methods. ### Why are the changes needed? for better error message: ``` In [12]: df1 = spark.range(11) In [13]: df2 = df1.groupby("id").applyInPandas(lambda: 1, StructType([StructField("d", DoubleType())])) In [14]: df2.show() ``` before this PR, an invalid function causes weird execution errors: ``` 24/05/10 11:37:36 ERROR Executor: Exception in task 0.0 in stage 10.0 (TID 36) org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1834, in main process() File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1826, in process serializer.dump_stream(out_iter, outfile) File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 531, in dump_stream return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream) File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 104, in dump_stream for batch in iterator: File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 524, in init_stream_yield_batches for series in iterator: File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1610, in mapper return f(keys, vals) ^ File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/worker.py", line 488, in return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))] ^ File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/worker.py", line 483, in wrapped result, return_type, _assign_cols_by_name, truncate_return_schema=False ^^ UnboundLocalError: cannot access local variable 'result' where it is not associated with a value at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:523) at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:117) at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:479) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:601) at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50) at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:896) ... ``` After this PR, the error happens before execution, which is consistent with Spark Classic, and much clear ``` PySparkValueError: [INVALID_PANDAS_UDF] Invalid function: pandas_udf with function type GROUPED_MAP or the function in groupby.applyInPandas must take either one argument (data) or two arguments (key, data). ``` ### Does this PR introduce _any_ user-facing change? yes, error message changes ### How was this patch tested? added tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #46519 from zhengruifeng/missing_check_in_group.
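The shape of the added validation is simple even though it lives in the Python Connect client: check the user function's arity when the plan is built and fail with a targeted error, instead of letting every executor hit an `UnboundLocalError` mid-query. A language-agnostic sketch in Scala (all names are illustrative only, not the real client API):

```scala
// Illustrative fail-fast check: a grouped-map function must accept (data) or (key, data).
def validateGroupedMapArity(functionName: String, numArgs: Int): Unit =
  require(numArgs == 1 || numArgs == 2,
    s"Invalid function '$functionName': grouped-map functions must take either one " +
      s"argument (data) or two arguments (key, data), but $numArgs argument(s) were declared.")

// validateGroupedMapArity("my_udf", 0) throws IllegalArgumentException before any task runs.
```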
(spark) branch master updated: [SPARK-48224][SQL] Disallow map keys from being of variant type
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new b371e7dd8800 [SPARK-48224][SQL] Disallow map keys from being of variant type b371e7dd8800 is described below commit b371e7dd88009195740f8f5b591447441ea43d0b Author: Harsh Motwani AuthorDate: Thu May 9 21:47:05 2024 -0700 [SPARK-48224][SQL] Disallow map keys from being of variant type ### What changes were proposed in this pull request? This PR disallows map keys from being of variant type. Therefore, SQL statements like `select map(parse_json('{"a": 1}'), 1)`, which would work earlier, will throw an exception now. ### Why are the changes needed? Allowing variant to be the key type of a map can result in undefined behavior as this has not been tested. ### Does this PR introduce _any_ user-facing change? Yes, users could use variants as keys in maps earlier. However, this PR disallows this possibility. ### How was this patch tested? Unit tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #46516 from harshmotw-db/map_variant_key. Authored-by: Harsh Motwani Signed-off-by: Dongjoon Hyun --- .../apache/spark/sql/catalyst/util/TypeUtils.scala | 2 +- .../catalyst/expressions/ComplexTypeSuite.scala| 34 +- 2 files changed, 34 insertions(+), 2 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala index d2c708b380cf..a0d578c66e73 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala @@ -58,7 +58,7 @@ object TypeUtils extends QueryErrorsBase { } def checkForMapKeyType(keyType: DataType): TypeCheckResult = { -if (keyType.existsRecursively(_.isInstanceOf[MapType])) { +if (keyType.existsRecursively(dt => dt.isInstanceOf[MapType] || dt.isInstanceOf[VariantType])) { DataTypeMismatch( errorSubClass = "INVALID_MAP_KEY_TYPE", messageParameters = Map( diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala index 5f135e46a377..497b335289b1 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala @@ -28,7 +28,7 @@ import org.apache.spark.sql.catalyst.util._ import org.apache.spark.sql.catalyst.util.TypeUtils.ordinalNumber import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types._ -import org.apache.spark.unsafe.types.UTF8String +import org.apache.spark.unsafe.types.{UTF8String, VariantVal} class ComplexTypeSuite extends SparkFunSuite with ExpressionEvalHelper { @@ -359,6 +359,38 @@ class ComplexTypeSuite extends SparkFunSuite with ExpressionEvalHelper { ) } + // map key can't be variant + val map6 = CreateMap(Seq( +Literal.create(new VariantVal(Array[Byte](), Array[Byte]())), +Literal.create(1) + )) + map6.checkInputDataTypes() match { +case TypeCheckResult.TypeCheckSuccess => fail("should not allow variant as a part of map key") +case TypeCheckResult.DataTypeMismatch(errorSubClass, messageParameters) => + assert(errorSubClass == "INVALID_MAP_KEY_TYPE") + 
assert(messageParameters === Map("keyType" -> "\"VARIANT\"")) + } + + // map key can't contain variant + val map7 = CreateMap( +Seq( + CreateStruct( +Seq(Literal.create(1), Literal.create(new VariantVal(Array[Byte](), Array[Byte]( + ), + Literal.create(1) +) + ) + map7.checkInputDataTypes() match { +case TypeCheckResult.TypeCheckSuccess => fail("should not allow variant as a part of map key") +case TypeCheckResult.DataTypeMismatch(errorSubClass, messageParameters) => + assert(errorSubClass == "INVALID_MAP_KEY_TYPE") + assert( +messageParameters === Map( + "keyType" -> "\"STRUCT\"" +) + ) + } + test("MapFromArrays") { val intSeq = Seq(5, 10, 15, 20, 25) val longSeq = intSeq.map(_.toLong) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
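The check itself is a recursive walk over the candidate key type. A standalone sketch that mirrors the updated `TypeUtils.checkForMapKeyType` logic without relying on Spark-internal helpers (assumes a Spark 4.0+ build, where `VariantType` exists):

```scala
import org.apache.spark.sql.types._

// Reject a candidate map key type if any nested type is a map or a variant.
def containsMapOrVariant(dt: DataType): Boolean =
  if (dt.isInstanceOf[MapType] || dt.isInstanceOf[VariantType]) {
    true
  } else dt match {
    case s: StructType => s.fields.exists(f => containsMapOrVariant(f.dataType))
    case a: ArrayType  => containsMapOrVariant(a.elementType)
    case _             => false
  }

// containsMapOrVariant(StructType(Seq(StructField("v", VariantType)))) == true
// containsMapOrVariant(StringType)                                     == false
```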
(spark) branch master updated: [SPARK-47018][BUILD][SQL] Bump built-in Hive to 2.3.10
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 2d609bfd37ae [SPARK-47018][BUILD][SQL] Bump built-in Hive to 2.3.10 2d609bfd37ae is described below commit 2d609bfd37ae9a0877fb72d1ba0479bb04a2dad6 Author: Cheng Pan AuthorDate: Thu May 9 21:31:50 2024 -0700 [SPARK-47018][BUILD][SQL] Bump built-in Hive to 2.3.10 ### What changes were proposed in this pull request? This PR aims to bump Spark's built-in Hive from 2.3.9 to Hive 2.3.10, with two additional changes: - due to API breaking changes of Thrift, `libthrift` is upgraded from `0.12` to `0.16`. - remove version management of `commons-lang:2.6`, it comes from Hive transitive deps, Hive 2.3.10 drops it in https://github.com/apache/hive/pull/4892 This is the first part of https://github.com/apache/spark/pull/45372 ### Why are the changes needed? Bump Hive to the latest version of 2.3, prepare for upgrading Guava, and dropping vulnerable dependencies like Jackson 1.x / Jodd ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. (wait for sunchao to complete the 2.3.10 release to make jars visible on Maven Central) ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45372 Closes #46468 from pan3793/SPARK-47018. Lead-authored-by: Cheng Pan Co-authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- connector/kafka-0-10-assembly/pom.xml | 5 connector/kinesis-asl-assembly/pom.xml | 5 dev/deps/spark-deps-hadoop-3-hive-2.3 | 27 +-- docs/building-spark.md | 4 +-- docs/sql-data-sources-hive-tables.md | 8 +++--- docs/sql-migration-guide.md| 2 +- pom.xml| 31 +- .../hive/service/auth/KerberosSaslHelper.java | 5 ++-- .../apache/hive/service/auth/PlainSaslHelper.java | 3 ++- .../hive/service/auth/TSetIpAddressProcessor.java | 5 ++-- .../service/cli/thrift/ThriftBinaryCLIService.java | 6 - .../hive/service/cli/thrift/ThriftCLIService.java | 10 +++ .../org/apache/spark/sql/hive/HiveUtils.scala | 2 +- .../org/apache/spark/sql/hive/client/package.scala | 5 ++-- .../hive/HiveExternalCatalogVersionsSuite.scala| 1 - .../spark/sql/hive/HiveSparkSubmitSuite.scala | 10 +++ .../spark/sql/hive/execution/HiveQuerySuite.scala | 6 ++--- 17 files changed, 61 insertions(+), 74 deletions(-) diff --git a/connector/kafka-0-10-assembly/pom.xml b/connector/kafka-0-10-assembly/pom.xml index b2fcbdf8eca7..bd311b3a9804 100644 --- a/connector/kafka-0-10-assembly/pom.xml +++ b/connector/kafka-0-10-assembly/pom.xml @@ -54,11 +54,6 @@ commons-codec provided - - commons-lang - commons-lang - provided - com.google.protobuf protobuf-java diff --git a/connector/kinesis-asl-assembly/pom.xml b/connector/kinesis-asl-assembly/pom.xml index 577ec2153083..0e93526fce72 100644 --- a/connector/kinesis-asl-assembly/pom.xml +++ b/connector/kinesis-asl-assembly/pom.xml @@ -54,11 +54,6 @@ jackson-databind provided - - commons-lang - commons-lang - provided - org.glassfish.jersey.core jersey-client diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3 index 73d41e9eeb33..392bacd73277 100644 --- a/dev/deps/spark-deps-hadoop-3-hive-2.3 +++ b/dev/deps/spark-deps-hadoop-3-hive-2.3 @@ -46,7 +46,6 @@ commons-compress/1.26.1//commons-compress-1.26.1.jar commons-crypto/1.1.0//commons-crypto-1.1.0.jar commons-dbcp/1.4//commons-dbcp-1.4.jar commons-io/2.16.1//commons-io-2.16.1.jar 
-commons-lang/2.6//commons-lang-2.6.jar commons-lang3/3.14.0//commons-lang3-3.14.0.jar commons-math3/3.6.1//commons-math3-3.6.1.jar commons-pool/1.5.4//commons-pool-1.5.4.jar @@ -81,19 +80,19 @@ hadoop-cloud-storage/3.4.0//hadoop-cloud-storage-3.4.0.jar hadoop-huaweicloud/3.4.0//hadoop-huaweicloud-3.4.0.jar hadoop-shaded-guava/1.2.0//hadoop-shaded-guava-1.2.0.jar hadoop-yarn-server-web-proxy/3.4.0//hadoop-yarn-server-web-proxy-3.4.0.jar -hive-beeline/2.3.9//hive-beeline-2.3.9.jar -hive-cli/2.3.9//hive-cli-2.3.9.jar -hive-common/2.3.9//hive-common-2.3.9.jar -hive-exec/2.3.9/core/hive-exec-2.3.9-core.jar -hive-jdbc/2.3.9//hive-jdbc-2.3.9.jar -hive-llap-common/2.3.9//hive-llap-common-2.3.9.jar -hive-metastore/2.3.9//hive-metastore-2.3.9.jar -hive-serde/2.3.9//hive-serde-2.3.9.jar +hive-beeline/2.3.10//hive-beeline-2.3.10.jar +hive-cli/2.3.10//hive-cli-2.3.10.jar +hive-common/2.3.10//hive
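Note that the bump only changes the built-in Hive client; applications that talk to an older external metastore can still pin a client version explicitly. A hedged spark-shell style sketch using the documented configuration keys (the version and jar source shown are just examples):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-metastore-pinning-demo")
  .enableHiveSupport()
  // Keep talking to an older metastore even though the built-in client is now 2.3.10.
  .config("spark.sql.hive.metastore.version", "2.3.9")
  .config("spark.sql.hive.metastore.jars", "maven") // download a matching client
  .getOrCreate()
```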
(spark) branch master updated: [MINOR][BUILD] Remove duplicate configuration of maven-compiler-plugin
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 1138b2a68b54 [MINOR][BUILD] Remove duplicate configuration of maven-compiler-plugin 1138b2a68b54 is described below commit 1138b2a68b5408e6d079bdbce8026323694628e5 Author: zml1206 AuthorDate: Thu May 9 20:51:32 2024 -0700 [MINOR][BUILD] Remove duplicate configuration of maven-compiler-plugin ### What changes were proposed in this pull request? `${java.version}` and `${java.version}` (https://github.com/apache/spark/pull/46024/files#diff-9c5fb3d1b7e3b0f54bc5c4182965c4fe1f9023d449017cece3005d3f90e8e4d8R117) are equivalent duplicate configuration, so remove `${java.version}`. https://maven.apache.org/plugins/maven-compiler-plugin/examples/set-compiler-release.html ### Why are the changes needed? Simplify the code and facilitates subsequent configuration iterations. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46024 from zml1206/remove_duplicate_configuration. Authored-by: zml1206 Signed-off-by: Dongjoon Hyun --- pom.xml | 1 - 1 file changed, 1 deletion(-) diff --git a/pom.xml b/pom.xml index c3ff5d101c22..678455e6e248 100644 --- a/pom.xml +++ b/pom.xml @@ -3127,7 +3127,6 @@ maven-compiler-plugin 3.13.0 -${java.version} true true - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-47834][SQL][CONNECT] Mark deprecated functions with `@deprecated` in `SQLImplicits`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 32b2827b964b [SPARK-47834][SQL][CONNECT] Mark deprecated functions with `@deprecated` in `SQLImplicits` 32b2827b964b is described below commit 32b2827b964bd4a4accb60b47ddd6929f41d4a89 Author: YangJie AuthorDate: Thu May 9 20:47:34 2024 -0700 [SPARK-47834][SQL][CONNECT] Mark deprecated functions with `@deprecated` in `SQLImplicits` ### What changes were proposed in this pull request? In the `sql` module, some functions in `SQLImplicits` have already been marked as `deprecated` in the function comments after SPARK-19089. This pr adds `deprecated` type annotation marks to them. Since SPARK-19089 occurred in Spark 2.2.0, the `since` field of `deprecated` is filled in as `2.2.0`. At the same time, these `deprecated` marks have also been synchronized to the corresponding functions in `SQLImplicits` in the `connect` module. ### Why are the changes needed? Mark deprecated functions with `deprecated` in `SQLImplicits` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass Github Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #46029 from LuciferYang/deprecated-SQLImplicits. Lead-authored-by: YangJie Co-authored-by: yangjie01 Signed-off-by: Dongjoon Hyun --- .../jvm/src/main/scala/org/apache/spark/sql/SQLImplicits.scala | 9 + sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala | 9 + 2 files changed, 18 insertions(+) diff --git a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SQLImplicits.scala b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SQLImplicits.scala index 6c626fd716d5..7799d395d5c6 100644 --- a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SQLImplicits.scala +++ b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SQLImplicits.scala @@ -149,6 +149,7 @@ abstract class SQLImplicits private[sql] (session: SparkSession) extends LowPrio * @deprecated * use [[newSequenceEncoder]] */ + @deprecated("Use newSequenceEncoder instead", "2.2.0") val newIntSeqEncoder: Encoder[Seq[Int]] = newSeqEncoder(PrimitiveIntEncoder) /** @@ -156,6 +157,7 @@ abstract class SQLImplicits private[sql] (session: SparkSession) extends LowPrio * @deprecated * use [[newSequenceEncoder]] */ + @deprecated("Use newSequenceEncoder instead", "2.2.0") val newLongSeqEncoder: Encoder[Seq[Long]] = newSeqEncoder(PrimitiveLongEncoder) /** @@ -163,6 +165,7 @@ abstract class SQLImplicits private[sql] (session: SparkSession) extends LowPrio * @deprecated * use [[newSequenceEncoder]] */ + @deprecated("Use newSequenceEncoder instead", "2.2.0") val newDoubleSeqEncoder: Encoder[Seq[Double]] = newSeqEncoder(PrimitiveDoubleEncoder) /** @@ -170,6 +173,7 @@ abstract class SQLImplicits private[sql] (session: SparkSession) extends LowPrio * @deprecated * use [[newSequenceEncoder]] */ + @deprecated("Use newSequenceEncoder instead", "2.2.0") val newFloatSeqEncoder: Encoder[Seq[Float]] = newSeqEncoder(PrimitiveFloatEncoder) /** @@ -177,6 +181,7 @@ abstract class SQLImplicits private[sql] (session: SparkSession) extends LowPrio * @deprecated * use [[newSequenceEncoder]] */ + @deprecated("Use newSequenceEncoder instead", "2.2.0") val newByteSeqEncoder: Encoder[Seq[Byte]] = newSeqEncoder(PrimitiveByteEncoder) /** @@ -184,6 +189,7 @@ abstract class 
SQLImplicits private[sql] (session: SparkSession) extends LowPrio * @deprecated * use [[newSequenceEncoder]] */ + @deprecated("Use newSequenceEncoder instead", "2.2.0") val newShortSeqEncoder: Encoder[Seq[Short]] = newSeqEncoder(PrimitiveShortEncoder) /** @@ -191,6 +197,7 @@ abstract class SQLImplicits private[sql] (session: SparkSession) extends LowPrio * @deprecated * use [[newSequenceEncoder]] */ + @deprecated("Use newSequenceEncoder instead", "2.2.0") val newBooleanSeqEncoder: Encoder[Seq[Boolean]] = newSeqEncoder(PrimitiveBooleanEncoder) /** @@ -198,6 +205,7 @@ abstract class SQLImplicits private[sql] (session: SparkSession) extends LowPrio * @deprecated * use [[newSequenceEncoder]] */ + @deprecated("Use newSequenceEncoder instead", "2.2.0") val newStringSeqEncoder: Encoder[Seq[String]] = newSeqEncoder(StringEncoder) /** @@ -205,6 +213,7 @@ abstract class SQLImplicits private[sql] (session: SparkSession) e
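A minimal illustration of what the annotation adds on top of the doc comment: scalac now warns at every call site of the old encoders. The object and members below are stand-ins, not the real `SQLImplicits` API:

```scala
object EncoderShim {
  @deprecated("Use newSequenceEncoder instead", "2.2.0")
  val newIntSeqEncoder: String = "Encoder[Seq[Int]]" // placeholder for the real encoder

  val newSequenceEncoder: String = "Encoder[Seq[T]]"
}

// Referencing the deprecated member now triggers a compiler warning such as:
//   warning: value newIntSeqEncoder in object EncoderShim is deprecated (since 2.2.0):
//   Use newSequenceEncoder instead
// val e = EncoderShim.newIntSeqEncoder
```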
(spark) branch master updated: [SPARK-48227][PYTHON][DOC] Document the requirement of seed in protos
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 012d19d8e9b2 [SPARK-48227][PYTHON][DOC] Document the requirement of seed in protos 012d19d8e9b2 is described below commit 012d19d8e9b28f7ce266753bcfff4a76c9510245 Author: Ruifeng Zheng AuthorDate: Thu May 9 16:58:44 2024 -0700 [SPARK-48227][PYTHON][DOC] Document the requirement of seed in protos ### What changes were proposed in this pull request? Document the requirement of seed in protos ### Why are the changes needed? the seed should be set at client side document it to avoid cases like https://github.com/apache/spark/pull/46456 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #46518 from zhengruifeng/doc_random. Authored-by: Ruifeng Zheng Signed-off-by: Dongjoon Hyun --- .../common/src/main/protobuf/spark/connect/relations.proto | 8 ++-- python/pyspark/sql/connect/plan.py | 10 -- python/pyspark/sql/connect/proto/relations_pb2.pyi | 10 -- 3 files changed, 18 insertions(+), 10 deletions(-) diff --git a/connector/connect/common/src/main/protobuf/spark/connect/relations.proto b/connector/connect/common/src/main/protobuf/spark/connect/relations.proto index 3882b2e85396..0b3c9d4253e8 100644 --- a/connector/connect/common/src/main/protobuf/spark/connect/relations.proto +++ b/connector/connect/common/src/main/protobuf/spark/connect/relations.proto @@ -467,7 +467,9 @@ message Sample { // (Optional) Whether to sample with replacement. optional bool with_replacement = 4; - // (Optional) The random seed. + // (Required) The random seed. + // This filed is required to avoid generate mutable dataframes (see SPARK-48184 for details), + // however, still keep it 'optional' here for backward compatibility. optional int64 seed = 5; // (Required) Explicitly sort the underlying plan to make the ordering deterministic or cache it. @@ -687,7 +689,9 @@ message StatSampleBy { // If a stratum is not specified, we treat its fraction as zero. repeated Fraction fractions = 3; - // (Optional) The random seed. + // (Required) The random seed. + // This filed is required to avoid generate mutable dataframes (see SPARK-48184 for details), + // however, still keep it 'optional' here for backward compatibility. 
optional int64 seed = 5; message Fraction { diff --git a/python/pyspark/sql/connect/plan.py b/python/pyspark/sql/connect/plan.py index 4ac4946745f5..3d3303fb15c5 100644 --- a/python/pyspark/sql/connect/plan.py +++ b/python/pyspark/sql/connect/plan.py @@ -717,7 +717,7 @@ class Sample(LogicalPlan): lower_bound: float, upper_bound: float, with_replacement: bool, -seed: Optional[int], +seed: int, deterministic_order: bool = False, ) -> None: super().__init__(child) @@ -734,8 +734,7 @@ class Sample(LogicalPlan): plan.sample.lower_bound = self.lower_bound plan.sample.upper_bound = self.upper_bound plan.sample.with_replacement = self.with_replacement -if self.seed is not None: -plan.sample.seed = self.seed +plan.sample.seed = self.seed plan.sample.deterministic_order = self.deterministic_order return plan @@ -1526,7 +1525,7 @@ class StatSampleBy(LogicalPlan): child: Optional["LogicalPlan"], col: Column, fractions: Sequence[Tuple[Column, float]], -seed: Optional[int], +seed: int, ) -> None: super().__init__(child) @@ -1554,8 +1553,7 @@ class StatSampleBy(LogicalPlan): fraction.stratum.CopyFrom(k.to_plan(session).literal) fraction.fraction = float(v) plan.sample_by.fractions.append(fraction) -if self._seed is not None: -plan.sample_by.seed = self._seed +plan.sample_by.seed = self._seed return plan diff --git a/python/pyspark/sql/connect/proto/relations_pb2.pyi b/python/pyspark/sql/connect/proto/relations_pb2.pyi index 5dfb47da67a9..9b6f4b43544f 100644 --- a/python/pyspark/sql/connect/proto/relations_pb2.pyi +++ b/python/pyspark/sql/connect/proto/relations_pb2.pyi @@ -1865,7 +1865,10 @@ class Sample(google.protobuf.message.Message): with_replacement: builtins.bool """(Optional) Whether to sample with replacement.""" seed: builtins.int -"""(Optional) The random seed.""" +"""(Required) The random seed. +This filed is required to avoid generate mut
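The practical effect of fixing the seed on the client is that a sampled DataFrame stops being "mutable" across actions: re-evaluating it yields the same rows. A small spark-shell style sketch (the dataset and fraction are arbitrary):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sample-seed-demo").master("local[*]").getOrCreate()
val df = spark.range(0, 1000)

// The seed is fixed once, on the client, so every action sees the same sample.
val sampled = df.sample(withReplacement = false, fraction = 0.1, seed = 42L)
assert(sampled.count() == sampled.count())
```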
(spark) branch master updated (b47d7853d92f -> e704b9e56b0c)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from b47d7853d92f [SPARK-48148][CORE] JSON objects should not be modified when read as STRING add e704b9e56b0c [SPARK-48226][BUILD] Add `spark-ganglia-lgpl` to `lint-java` & `spark-ganglia-lgpl` and `jvm-profiler` to `sbt-checkstyle` No new revisions were added by this update. Summary of changes: dev/lint-java | 2 +- dev/sbt-checkstyle | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.5 updated: [SPARK-48197][SQL][TESTS][FOLLOWUP][3.5] Regenerate golden files
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new da4c808be7d6 [SPARK-48197][SQL][TESTS][FOLLOWUP][3.5] Regenerate golden files da4c808be7d6 is described below commit da4c808be7d66dc61fdcb3b41254eef77298a72c Author: Dongjoon Hyun AuthorDate: Thu May 9 14:46:01 2024 -0700 [SPARK-48197][SQL][TESTS][FOLLOWUP][3.5] Regenerate golden files ### What changes were proposed in this pull request? This PR is a follow-up to regenerate golden files for branch-3.5 - #46475 ### Why are the changes needed? To recover branch-3.5 CI. - https://github.com/apache/spark/actions/runs/9011670853/job/24786397001 ``` [info] *** 4 TESTS FAILED *** [error] Failed: Total 3036, Failed 4, Errors 0, Passed 3032, Ignored 3 [error] Failed tests: [error] org.apache.spark.sql.SQLQueryTestSuite ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46514 from dongjoon-hyun/SPARK-48197. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../sql-tests/analyzer-results/ansi/higher-order-functions.sql.out | 1 - .../resources/sql-tests/analyzer-results/higher-order-functions.sql.out | 1 - .../test/resources/sql-tests/results/ansi/higher-order-functions.sql.out | 1 - .../src/test/resources/sql-tests/results/higher-order-functions.sql.out | 1 - 4 files changed, 4 deletions(-) diff --git a/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/higher-order-functions.sql.out b/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/higher-order-functions.sql.out index 3fafb9858e5a..8fe6e7097e67 100644 --- a/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/higher-order-functions.sql.out +++ b/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/higher-order-functions.sql.out @@ -40,7 +40,6 @@ select ceil(x -> x) as v org.apache.spark.sql.AnalysisException { "errorClass" : "INVALID_LAMBDA_FUNCTION_CALL.NON_HIGHER_ORDER_FUNCTION", - "sqlState" : "42K0D", "messageParameters" : { "class" : "org.apache.spark.sql.catalyst.expressions.CeilExpressionBuilder$" }, diff --git a/sql/core/src/test/resources/sql-tests/analyzer-results/higher-order-functions.sql.out b/sql/core/src/test/resources/sql-tests/analyzer-results/higher-order-functions.sql.out index d9e88ac618aa..d85101986078 100644 --- a/sql/core/src/test/resources/sql-tests/analyzer-results/higher-order-functions.sql.out +++ b/sql/core/src/test/resources/sql-tests/analyzer-results/higher-order-functions.sql.out @@ -40,7 +40,6 @@ select ceil(x -> x) as v org.apache.spark.sql.AnalysisException { "errorClass" : "INVALID_LAMBDA_FUNCTION_CALL.NON_HIGHER_ORDER_FUNCTION", - "sqlState" : "42K0D", "messageParameters" : { "class" : "org.apache.spark.sql.catalyst.expressions.CeilExpressionBuilder$" }, diff --git a/sql/core/src/test/resources/sql-tests/results/ansi/higher-order-functions.sql.out b/sql/core/src/test/resources/sql-tests/results/ansi/higher-order-functions.sql.out index eb9c454109f0..dceb370c8388 100644 --- a/sql/core/src/test/resources/sql-tests/results/ansi/higher-order-functions.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/ansi/higher-order-functions.sql.out @@ -40,7 +40,6 @@ struct<> org.apache.spark.sql.AnalysisException { "errorClass" : "INVALID_LAMBDA_FUNCTION_CALL.NON_HIGHER_ORDER_FUNCTION", - 
"sqlState" : "42K0D", "messageParameters" : { "class" : "org.apache.spark.sql.catalyst.expressions.CeilExpressionBuilder$" }, diff --git a/sql/core/src/test/resources/sql-tests/results/higher-order-functions.sql.out b/sql/core/src/test/resources/sql-tests/results/higher-order-functions.sql.out index eb9c454109f0..dceb370c8388 100644 --- a/sql/core/src/test/resources/sql-tests/results/higher-order-functions.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/higher-order-functions.sql.out @@ -40,7 +40,6 @@ struct<> org.apache.spark.sql.AnalysisException { "errorClass" : "INVALID_LAMBDA_FUNCTION_CALL.NON_HIGHER_ORDER_FUNCTION", - "sqlState" : "42K0D", "messageParameters" : { "class" : "org.apache.spark.sql.catalyst.expressions.CeilExpressionBuilder$" }, - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48216][TESTS] Remove overrides DockerJDBCIntegrationSuite.connectionTimeout to make related tests configurable
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new e1fb1d7e063a [SPARK-48216][TESTS] Remove overrides DockerJDBCIntegrationSuite.connectionTimeout to make related tests configurable e1fb1d7e063a is described below commit e1fb1d7e063af7e8eb6e992c800902aff6e19e15 Author: Kent Yao AuthorDate: Thu May 9 08:37:07 2024 -0700 [SPARK-48216][TESTS] Remove overrides DockerJDBCIntegrationSuite.connectionTimeout to make related tests configurable ### What changes were proposed in this pull request? This PR removes overrides DockerJDBCIntegrationSuite.connectionTimeout to make related tests configurable. ### Why are the changes needed? The db dockers might require more time to bootstrap sometimes. It shall be configurable to avoid failure like: ```scala [info] org.apache.spark.sql.jdbc.DB2IntegrationSuite *** ABORTED *** (3 minutes, 11 seconds) [info] The code passed to eventually never returned normally. Attempted 96 times over 3.00399815763 minutes. Last failure message: [jcc][t4][2030][11211][4.33.31] A communication error occurred during operations on the connection's underlying socket, socket input stream, [info] or socket output stream. Error location: Reply.fill() - insufficient data (-1). Message: Insufficient data. ERRORCODE=-4499, SQLSTATE=08001. (DockerJDBCIntegrationSuite.scala:215) [info] org.scalatest.exceptions.TestFailedDueToTimeoutException: [info] at org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:219) [info] at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:226) [info] at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:313) [info] at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:312) ``` ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Passing GA ### Was this patch authored or co-authored using generative AI tooling? no Closes #46505 from yaooqinn/SPARK-48216. 
Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun --- .../test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala| 4 .../test/scala/org/apache/spark/sql/jdbc/DB2KrbIntegrationSuite.scala | 3 --- .../test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala | 4 .../test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala | 3 --- .../org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala| 4 .../scala/org/apache/spark/sql/jdbc/v2/MySQLIntegrationSuite.scala| 4 .../scala/org/apache/spark/sql/jdbc/v2/OracleIntegrationSuite.scala | 4 7 files changed, 26 deletions(-) diff --git a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala index aca174cce194..4ece4d2088f4 100644 --- a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala +++ b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala @@ -21,8 +21,6 @@ import java.math.BigDecimal import java.sql.{Connection, Date, Timestamp} import java.util.Properties -import org.scalatest.time.SpanSugar._ - import org.apache.spark.sql.{Row, SaveMode} import org.apache.spark.sql.catalyst.util.DateTimeTestUtils._ import org.apache.spark.sql.internal.SQLConf @@ -41,8 +39,6 @@ import org.apache.spark.tags.DockerTest class DB2IntegrationSuite extends DockerJDBCIntegrationSuite { override val db = new DB2DatabaseOnDocker - override val connectionTimeout = timeout(3.minutes) - override def dataPreparation(conn: Connection): Unit = { conn.prepareStatement("CREATE TABLE tbl (x INTEGER, y VARCHAR(8))").executeUpdate() conn.prepareStatement("INSERT INTO tbl VALUES (42,'fred')").executeUpdate() diff --git a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2KrbIntegrationSuite.scala b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2KrbIntegrationSuite.scala index abb683c06495..4899de2b2a14 100644 --- a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2KrbIntegrationSuite.scala +++ b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2KrbIntegrationSuite.scala @@ -24,7 +24,6 @@ import javax.security.auth.login.Configuration import com.github.dockerjava.api.model.{AccessMode, Bind, ContainerConfig, HostConfig, Volume} import org.apache.hadoop.security.{SecurityUtil, UserGroup
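With the overrides gone, the suites all inherit one configurable timeout. A simplified sketch of that pattern (the property name and minutes-only parsing here are stand-ins for the real helper, which accepts strings like `5min`):

```scala
import org.scalatest.concurrent.{Eventually, PatienceConfiguration}
import org.scalatest.time.SpanSugar._

trait ConfigurableDockerTimeout extends Eventually {
  // Read the timeout from a system property with a 5-minute default; subclasses stop
  // overriding it and instead pass -Dspark.test.docker.connectionTimeoutMinutes=N.
  protected def connectionTimeout: PatienceConfiguration.Timeout = {
    val mins = sys.props.getOrElse("spark.test.docker.connectionTimeoutMinutes", "5").toInt
    timeout(mins.minutes)
  }
}
```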
(spark) branch master updated: [SPARK-47186][TESTS][FOLLOWUP] Correct the name of spark.test.docker.connectionTimeout
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 5891b20ef492 [SPARK-47186][TESTS][FOLLOWUP] Correct the name of spark.test.docker.connectionTimeout 5891b20ef492 is described below commit 5891b20ef492e3dad31ff851770d9c4f9c7c4de4 Author: Kent Yao AuthorDate: Wed May 8 21:56:55 2024 -0700 [SPARK-47186][TESTS][FOLLOWUP] Correct the name of spark.test.docker.connectionTimeout ### What changes were proposed in this pull request? This PR adds a followup of SPARK-47186 to correct the name of spark.test.docker.connectionTimeout ### Why are the changes needed? test bugfix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #46495 from yaooqinn/SPARK-47186-FF. Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun --- .../scala/org/apache/spark/sql/jdbc/DockerJDBCIntegrationSuite.scala| 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DockerJDBCIntegrationSuite.scala b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DockerJDBCIntegrationSuite.scala index ded7bb3a6bf6..8d17e0b4e36e 100644 --- a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DockerJDBCIntegrationSuite.scala +++ b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DockerJDBCIntegrationSuite.scala @@ -115,7 +115,7 @@ abstract class DockerJDBCIntegrationSuite protected val startContainerTimeout: Long = timeStringAsSeconds(sys.props.getOrElse("spark.test.docker.startContainerTimeout", "5min")) protected val connectionTimeout: PatienceConfiguration.Timeout = { -val timeoutStr = sys.props.getOrElse("spark.test.docker.conn", "5min") +val timeoutStr = sys.props.getOrElse("spark.test.docker.connectionTimeout", "5min") timeout(timeStringAsSeconds(timeoutStr).seconds) } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
svn commit: r69044 - /dev/spark/v3.4.3-rc2-docs/
Author: dongjoon Date: Thu May 9 02:31:50 2024 New Revision: 69044 Log: Remove Apache Spark 3.4.3 RC2 docs after releasing 3.4.3 Removed: dev/spark/v3.4.3-rc2-docs/ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (4fb6624bd2ce -> 337f980f0073)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 4fb6624bd2ce [SPARK-48205][PYTHON] Remove the private[sql] modifier for Python data sources add 337f980f0073 [SPARK-48204][INFRA] Fix release script for Spark 4.0+ No new revisions were added by this update. Summary of changes: dev/create-release/release-build.sh | 16 +++- 1 file changed, 11 insertions(+), 5 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.4 updated: [SPARK-48207][INFRA][3.4] Run `build/scala-213/java-11-17` jobs of `branch-3.4` only if needed
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new d16a4f4c98d5 [SPARK-48207][INFRA][3.4] Run `build/scala-213/java-11-17` jobs of `branch-3.4` only if needed d16a4f4c98d5 is described below commit d16a4f4c98d5e6a44ff783e20a9f2f2f80c009f3 Author: Dongjoon Hyun AuthorDate: Wed May 8 16:19:40 2024 -0700 [SPARK-48207][INFRA][3.4] Run `build/scala-213/java-11-17` jobs of `branch-3.4` only if needed ### What changes were proposed in this pull request? This PR aims to run `build`, `scala-213`, and `java-11-17` job of `branch-3.4` only if needed to reduce the maximum concurrency of Apache Spark GitHub Action usage. ### Why are the changes needed? To meet ASF Infra GitHub Action policy, we need to reduce the maximum concurrency. - https://infra.apache.org/github-actions-policy.html ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46489 from dongjoon-hyun/SPARK-48207. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml | 9 - 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 64f18b5163b1..3e44d6cfd179 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -98,18 +98,17 @@ jobs: tpcds=false docker=false fi - # 'build', 'scala-213', and 'java-11-17' are always true for now. - # It does not save significant time and most of PRs trigger the build. + build=`./dev/is-changed.py -m "core,unsafe,kvstore,avro,network-common,network-shuffle,repl,launcher,examples,sketch,graphx,catalyst,hive-thriftserver,streaming,sql-kafka-0-10,streaming-kafka-0-10,mllib-local,mllib,yarn,mesos,kubernetes,hadoop-cloud,spark-ganglia-lgpl,connect,protobuf"` precondition=" { - \"build\": \"true\", + \"build\": \"$build\", \"pyspark\": \"$pyspark\", \"pyspark-pandas\": \"$pandas\", \"sparkr\": \"$sparkr\", \"tpcds-1g\": \"$tpcds\", \"docker-integration-tests\": \"$docker\", - \"scala-213\": \"true\", - \"java-11-17\": \"true\", + \"scala-213\": \"$build\", + \"java-11-17\": \"$build\", \"lint\" : \"true\", \"k8s-integration-tests\" : \"$kubernetes\", }" - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.4 updated: [SPARK-48192][INFRA] Enable TPC-DS tests in forked repository
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new bd54e633121c [SPARK-48192][INFRA] Enable TPC-DS tests in forked repository bd54e633121c is described below commit bd54e633121c77293bbb0cd343eeebb167ca5edf Author: Hyukjin Kwon AuthorDate: Wed May 8 17:13:11 2024 +0900 [SPARK-48192][INFRA] Enable TPC-DS tests in forked repository This PR is a sort of a followup of https://github.com/apache/spark/pull/46361. It proposes to run TPC-DS and Docker integration tests in PRs (that does not consume ASF resources). TPC-DS and Docker integration stuff at least have to be tested in the PR if the PR touches the codes related to that. No, test-only. Manually No. Closes #46470 from HyukjinKwon/SPARK-48192. Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon (cherry picked from commit f693abc8de949b1fd5f77b9e74037b0cc2298aef) Signed-off-by: Dongjoon Hyun (cherry picked from commit 82779217b1fa1dea2b18772795969c04c1f34532) Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 0166395ceb4a..64f18b5163b1 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -84,17 +84,19 @@ jobs: if [ -f "./dev/is-changed.py" ]; then pyspark_modules=`cd dev && python -c "import sparktestsupport.modules as m; print(','.join(m.name for m in m.all_modules if m.name.startswith('pyspark')))"` pyspark=`./dev/is-changed.py -m $pyspark_modules` -tpcds=`./dev/is-changed.py -m sql` -docker=`./dev/is-changed.py -m docker-integration-tests` fi if [[ "${{ github.repository }}" != 'apache/spark' ]]; then pandas=$pyspark kubernetes=`./dev/is-changed.py -m kubernetes` sparkr=`./dev/is-changed.py -m sparkr` +tpcds=`./dev/is-changed.py -m sql` +docker=`./dev/is-changed.py -m docker-integration-tests` else pandas=false kubernetes=false sparkr=false +tpcds=false +docker=false fi # 'build', 'scala-213', and 'java-11-17' are always true for now. # It does not save significant time and most of PRs trigger the build. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.4 updated: [SPARK-48133][INFRA] Run `sparkr` only in PR builders and Daily CIs
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new b6de16317abd [SPARK-48133][INFRA] Run `sparkr` only in PR builders and Daily CIs b6de16317abd is described below commit b6de16317abdead63fe12a686573c20172959437 Author: Dongjoon Hyun AuthorDate: Sun May 5 13:19:23 2024 -0700 [SPARK-48133][INFRA] Run `sparkr` only in PR builders and Daily CIs This PR aims to run `sparkr` only in PR builder and Daily Python CIs. In other words, only the commit builder will skip it by default. To reduce GitHub Action usage to meet ASF INFRA policy. - https://infra.apache.org/github-actions-policy.html > All workflows MUST have a job concurrency level less than or equal to 20. This means a workflow cannot have more than 20 jobs running at the same time across all matrices. No. Manual review. No. Closes #46389 from dongjoon-hyun/SPARK-48133. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun (cherry picked from commit 32ba5c1db62c2674e8acced56f89ed840bf9) Signed-off-by: Dongjoon Hyun (cherry picked from commit 6dbbf081a7d248ddce62b62e979ff06a3c793f22) Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index cf1eb7b4c233..0166395ceb4a 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -84,16 +84,17 @@ jobs: if [ -f "./dev/is-changed.py" ]; then pyspark_modules=`cd dev && python -c "import sparktestsupport.modules as m; print(','.join(m.name for m in m.all_modules if m.name.startswith('pyspark')))"` pyspark=`./dev/is-changed.py -m $pyspark_modules` -sparkr=`./dev/is-changed.py -m sparkr` tpcds=`./dev/is-changed.py -m sql` docker=`./dev/is-changed.py -m docker-integration-tests` fi if [[ "${{ github.repository }}" != 'apache/spark' ]]; then pandas=$pyspark kubernetes=`./dev/is-changed.py -m kubernetes` +sparkr=`./dev/is-changed.py -m sparkr` else pandas=false kubernetes=false +sparkr=false fi # 'build', 'scala-213', and 'java-11-17' are always true for now. # It does not save significant time and most of PRs trigger the build. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.4 updated: [SPARK-48132][INFRA] Run `k8s-integration-tests` only in PR builder and Daily CIs
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 4b032a18924b [SPARK-48132][INFRA] Run `k8s-integration-tests` only in PR builder and Daily CIs 4b032a18924b is described below commit 4b032a18924bd35322570551448c643786fd1a98 Author: Dongjoon Hyun AuthorDate: Sat May 4 22:55:04 2024 -0700 [SPARK-48132][INFRA] Run `k8s-integration-tests` only in PR builder and Daily CIs This PR aims to run `k8s-integration-tests` only in PR builder and Daily Python CIs. In other words, only the commit builder will skip it by default. Please note that - K8s unit tests will be covered by the commit builder still. - All PR builders are not consuming ASF resources and they provide lots of test coverage everyday also. To reduce GitHub Action usage to meet ASF INFRA policy. - https://infra.apache.org/github-actions-policy.html > All workflows MUST have a job concurrency level less than or equal to 20. This means a workflow cannot have more than 20 jobs running at the same time across all matrices. No. Manual review. No. Closes #46388 from dongjoon-hyun/SPARK-48132. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun (cherry picked from commit 9454607944df5e8430642bbe399a35436506be2a) Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 35c1328256c2..cf1eb7b4c233 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -90,8 +90,10 @@ jobs: fi if [[ "${{ github.repository }}" != 'apache/spark' ]]; then pandas=$pyspark +kubernetes=`./dev/is-changed.py -m kubernetes` else pandas=false +kubernetes=false fi # 'build', 'scala-213', and 'java-11-17' are always true for now. # It does not save significant time and most of PRs trigger the build. @@ -106,7 +108,7 @@ jobs: \"scala-213\": \"true\", \"java-11-17\": \"true\", \"lint\" : \"true\", - \"k8s-integration-tests\" : \"true\", + \"k8s-integration-tests\" : \"$kubernetes\", }" echo $precondition # For debugging # Remove `\n` to avoid "Invalid format" error - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.5 updated: [SPARK-48192][INFRA] Enable TPC-DS tests in forked repository
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 82779217b1fa [SPARK-48192][INFRA] Enable TPC-DS tests in forked repository 82779217b1fa is described below commit 82779217b1fa1dea2b18772795969c04c1f34532 Author: Hyukjin Kwon AuthorDate: Wed May 8 17:13:11 2024 +0900 [SPARK-48192][INFRA] Enable TPC-DS tests in forked repository This PR is a followup of https://github.com/apache/spark/pull/46361. It proposes to run the TPC-DS and Docker integration tests in PR builders (which do not consume ASF resources). The TPC-DS and Docker integration tests should at least run in a PR when the PR touches the related code. No, test-only. Manually. No. Closes #46470 from HyukjinKwon/SPARK-48192. Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon (cherry picked from commit f693abc8de949b1fd5f77b9e74037b0cc2298aef) Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 4ad4a243c76d..b016a29a86be 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -85,13 +85,15 @@ jobs: pandas=$pyspark kubernetes=`./dev/is-changed.py -m kubernetes` sparkr=`./dev/is-changed.py -m sparkr` +tpcds=`./dev/is-changed.py -m sql` +docker=`./dev/is-changed.py -m docker-integration-tests` else pandas=false kubernetes=false sparkr=false +tpcds=false +docker=false fi - tpcds=`./dev/is-changed.py -m sql` - docker=`./dev/is-changed.py -m docker-integration-tests` build=`./dev/is-changed.py -m "core,unsafe,kvstore,avro,utils,network-common,network-shuffle,repl,launcher,examples,sketch,graphx,catalyst,hive-thriftserver,streaming,sql-kafka-0-10,streaming-kafka-0-10,mllib-local,mllib,yarn,mesos,kubernetes,hadoop-cloud,spark-ganglia-lgpl,sql,hive"` precondition=" { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.5 updated: [SPARK-48133][INFRA] Run `sparkr` only in PR builders and Daily CIs
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 6dbbf081a7d2 [SPARK-48133][INFRA] Run `sparkr` only in PR builders and Daily CIs 6dbbf081a7d2 is described below commit 6dbbf081a7d248ddce62b62e979ff06a3c793f22 Author: Dongjoon Hyun AuthorDate: Sun May 5 13:19:23 2024 -0700 [SPARK-48133][INFRA] Run `sparkr` only in PR builders and Daily CIs This PR aims to run `sparkr` only in PR builder and Daily Python CIs. In other words, only the commit builder will skip it by default. To reduce GitHub Action usage to meet ASF INFRA policy. - https://infra.apache.org/github-actions-policy.html > All workflows MUST have a job concurrency level less than or equal to 20. This means a workflow cannot have more than 20 jobs running at the same time across all matrices. No. Manual review. No. Closes #46389 from dongjoon-hyun/SPARK-48133. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun (cherry picked from commit 32ba5c1db62c2674e8acced56f89ed840bf9) Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 645054dc2087..4ad4a243c76d 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -79,17 +79,17 @@ jobs: id: set-outputs run: | if [ -z "${{ inputs.jobs }}" ]; then - pyspark=true; sparkr=true; tpcds=true; docker=true; pyspark_modules=`cd dev && python -c "import sparktestsupport.modules as m; print(','.join(m.name for m in m.all_modules if m.name.startswith('pyspark')))"` pyspark=`./dev/is-changed.py -m $pyspark_modules` if [[ "${{ github.repository }}" != 'apache/spark' ]]; then pandas=$pyspark kubernetes=`./dev/is-changed.py -m kubernetes` +sparkr=`./dev/is-changed.py -m sparkr` else pandas=false kubernetes=false +sparkr=false fi - sparkr=`./dev/is-changed.py -m sparkr` tpcds=`./dev/is-changed.py -m sql` docker=`./dev/is-changed.py -m docker-integration-tests` build=`./dev/is-changed.py -m "core,unsafe,kvstore,avro,utils,network-common,network-shuffle,repl,launcher,examples,sketch,graphx,catalyst,hive-thriftserver,streaming,sql-kafka-0-10,streaming-kafka-0-10,mllib-local,mllib,yarn,mesos,kubernetes,hadoop-cloud,spark-ganglia-lgpl,sql,hive"` - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.5 updated: [SPARK-48132][INFRA] Run `k8s-integration-tests` only in PR builder and Daily CIs
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 9454607944df [SPARK-48132][INFRA] Run `k8s-integration-tests` only in PR builder and Daily CIs 9454607944df is described below commit 9454607944df5e8430642bbe399a35436506be2a Author: Dongjoon Hyun AuthorDate: Sat May 4 22:55:04 2024 -0700 [SPARK-48132][INFRA] Run `k8s-integration-tests` only in PR builder and Daily CIs This PR aims to run `k8s-integration-tests` only in PR builder and Daily Python CIs. In other words, only the commit builder will skip it by default. Please note that - K8s unit tests will be covered by the commit builder still. - All PR builders are not consuming ASF resources and they provide lots of test coverage everyday also. To reduce GitHub Action usage to meet ASF INFRA policy. - https://infra.apache.org/github-actions-policy.html > All workflows MUST have a job concurrency level less than or equal to 20. This means a workflow cannot have more than 20 jobs running at the same time across all matrices. No. Manual review. No. Closes #46388 from dongjoon-hyun/SPARK-48132. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index e73dced98238..645054dc2087 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -84,13 +84,14 @@ jobs: pyspark=`./dev/is-changed.py -m $pyspark_modules` if [[ "${{ github.repository }}" != 'apache/spark' ]]; then pandas=$pyspark +kubernetes=`./dev/is-changed.py -m kubernetes` else pandas=false +kubernetes=false fi sparkr=`./dev/is-changed.py -m sparkr` tpcds=`./dev/is-changed.py -m sql` docker=`./dev/is-changed.py -m docker-integration-tests` - kubernetes=`./dev/is-changed.py -m kubernetes` build=`./dev/is-changed.py -m "core,unsafe,kvstore,avro,utils,network-common,network-shuffle,repl,launcher,examples,sketch,graphx,catalyst,hive-thriftserver,streaming,sql-kafka-0-10,streaming-kafka-0-10,mllib-local,mllib,yarn,mesos,kubernetes,hadoop-cloud,spark-ganglia-lgpl,sql,hive"` precondition=" { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.5 updated: [SPARK-48109][INFRA] Enable `k8s-integration-tests` only for `kubernetes` module change
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 26dccf09322f [SPARK-48109][INFRA] Enable `k8s-integration-tests` only for `kubernetes` module change 26dccf09322f is described below commit 26dccf09322fc9945557a6e005a15e14fc6926b0 Author: Dongjoon Hyun AuthorDate: Thu May 2 23:21:59 2024 -0700 [SPARK-48109][INFRA] Enable `k8s-integration-tests` only for `kubernetes` module change This PR aims to enable `k8s-integration-tests` only for `kubernetes` module change. Although there is a chance of missing `core` module change, the daily CI test coverage will reveal that. To reduce GitHub Action usage to meet ASF INFRA policy. - https://infra.apache.org/github-actions-policy.html > The average number of minutes a project uses in any consecutive five-day period MUST NOT exceed the equivalent of 30 full-time runners (216,000 minutes, or 3,600 hours). No. Manual review. No. Closes #46356 from dongjoon-hyun/SPARK-48109. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun (cherry picked from commit 63837020ed29c9e6003f24117ad21f8b97f40f0f) Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 051e8c98908c..e73dced98238 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -90,6 +90,7 @@ jobs: sparkr=`./dev/is-changed.py -m sparkr` tpcds=`./dev/is-changed.py -m sql` docker=`./dev/is-changed.py -m docker-integration-tests` + kubernetes=`./dev/is-changed.py -m kubernetes` build=`./dev/is-changed.py -m "core,unsafe,kvstore,avro,utils,network-common,network-shuffle,repl,launcher,examples,sketch,graphx,catalyst,hive-thriftserver,streaming,sql-kafka-0-10,streaming-kafka-0-10,mllib-local,mllib,yarn,mesos,kubernetes,hadoop-cloud,spark-ganglia-lgpl,sql,hive"` precondition=" { @@ -102,7 +103,7 @@ jobs: \"scala-213\": \"$build\", \"java-11-17\": \"$build\", \"lint\" : \"true\", - \"k8s-integration-tests\" : \"true\", + \"k8s-integration-tests\" : \"$kubernetes\", \"breaking-changes-buf\" : \"true\", }" echo $precondition # For debugging - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.4 updated: [SPARK-48116][INFRA][FOLLOWUP] Fix `if` statement to check repository
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new d4a94c283c66 [SPARK-48116][INFRA][FOLLOWUP] Fix `if` statement to check repository d4a94c283c66 is described below commit d4a94c283c66c20be8a3ba67b75b960ba3c29d6b Author: Dongjoon Hyun AuthorDate: Fri May 3 21:25:41 2024 -0700 [SPARK-48116][INFRA][FOLLOWUP] Fix `if` statement to check repository (cherry picked from commit 81775a083f2339a76f3d1af472baf58e6fdf47d2) Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 825ad064d078..35c1328256c2 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -88,7 +88,7 @@ jobs: tpcds=`./dev/is-changed.py -m sql` docker=`./dev/is-changed.py -m docker-integration-tests` fi - if [ "${{ github.repository != 'apache/spark' }}" ]; then + if [[ "${{ github.repository }}" != 'apache/spark' ]]; then pandas=$pyspark else pandas=false - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.5 updated: [SPARK-48116][INFRA][FOLLOWUP] Fix `if` statement to check repository
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 81775a083f23 [SPARK-48116][INFRA][FOLLOWUP] Fix `if` statement to check repository 81775a083f23 is described below commit 81775a083f2339a76f3d1af472baf58e6fdf47d2 Author: Dongjoon Hyun AuthorDate: Fri May 3 21:25:41 2024 -0700 [SPARK-48116][INFRA][FOLLOWUP] Fix `if` statement to check repository --- .github/workflows/build_and_test.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 679c51bb0941..051e8c98908c 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -82,7 +82,7 @@ jobs: pyspark=true; sparkr=true; tpcds=true; docker=true; pyspark_modules=`cd dev && python -c "import sparktestsupport.modules as m; print(','.join(m.name for m in m.all_modules if m.name.startswith('pyspark')))"` pyspark=`./dev/is-changed.py -m $pyspark_modules` - if [ "${{ github.repository != 'apache/spark' }}" ]; then + if [[ "${{ github.repository }}" != 'apache/spark' ]]; then pandas=$pyspark else pandas=false - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
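The two `[SPARK-48116][INFRA][FOLLOWUP]` commits above fix a subtle GitHub Actions pitfall: the comparison was written inside `${{ ... }}`, so Actions evaluated it first and rendered the literal string `true` or `false` into the shell script, and the single-bracket test `[ "false" ]` still succeeded because a one-argument test only checks for a non-empty string. The fix moves the comparison into the shell itself. A small Python sketch of the same logic (the function names are illustrative):

```python
def old_condition(repository: str) -> bool:
    # Actions expands `${{ github.repository != 'apache/spark' }}` to "true"/"false",
    # then bash runs `[ "<that string>" ]`, which only tests for non-emptiness.
    rendered = "true" if repository != "apache/spark" else "false"
    return len(rendered) > 0  # always True, so the fork-only branch was always taken

def new_condition(repository: str) -> bool:
    # The fix compares inside the shell: [[ "${{ github.repository }}" != 'apache/spark' ]]
    return repository != "apache/spark"

assert old_condition("apache/spark") is True    # bug: pandas=$pyspark even on apache/spark
assert new_condition("apache/spark") is False   # fixed: commit builds keep pandas=false
```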
(spark) branch branch-3.4 updated: [SPARK-48116][INFRA][3.4] Run `pyspark-pandas*` only in PR builder and Daily Python CIs
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 2d5a77bbea4a [SPARK-48116][INFRA][3.4] Run `pyspark-pandas*` only in PR builder and Daily Python CIs 2d5a77bbea4a is described below commit 2d5a77bbea4a96916525299d277f368790ccc602 Author: Dongjoon Hyun AuthorDate: Wed May 8 13:48:12 2024 -0700 [SPARK-48116][INFRA][3.4] Run `pyspark-pandas*` only in PR builder and Daily Python CIs ### What changes were proposed in this pull request? This PR aims to run `pyspark-pandas*` of `branch-3.4` only in PR builders and Daily Python CIs. In other words, only the commit builder will skip it by default. Please note that PR builders do not consume ASF resources, and they provide lots of test coverage every day. The `branch-3.4` Python Daily CI runs all Python tests including `pyspark-pandas`, like the following. https://github.com/apache/spark/blob/21548a8cc5c527d4416a276a852f967b4410bd4b/.github/workflows/build_branch34_python.yml#L43-L44 ### Why are the changes needed? To reduce GitHub Action usage to meet ASF INFRA policy. - https://infra.apache.org/github-actions-policy.html > All workflows MUST have a job concurrency level less than or equal to 20. This means a workflow cannot have more than 20 jobs running at the same time across all matrices. Although `pandas` is an **optional** package in PySpark, it is essential for PySpark users, and we have **6 test pipelines** which require lots of resources. We need to keep the job concurrency level at `less than or equal to 20` while keeping as much test capability as possible. https://github.com/apache/spark/blob/da0c7cc81bb3d69d381dd0683e910eae4c80e9ae/dev/requirements.txt#L4-L7 - pyspark-pandas - pyspark-pandas-slow ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46483 from dongjoon-hyun/SPARK-48116-3.4. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml | 12 1 file changed, 12 insertions(+) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 2d2e8da80d46..825ad064d078 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -88,12 +88,18 @@ jobs: tpcds=`./dev/is-changed.py -m sql` docker=`./dev/is-changed.py -m docker-integration-tests` fi + if [ "${{ github.repository != 'apache/spark' }}" ]; then +pandas=$pyspark + else +pandas=false + fi # 'build', 'scala-213', and 'java-11-17' are always true for now. # It does not save significant time and most of PRs trigger the build. precondition=" { \"build\": \"true\", \"pyspark\": \"$pyspark\", + \"pyspark-pandas\": \"$pandas\", \"sparkr\": \"$sparkr\", \"tpcds-1g\": \"$tpcds\", \"docker-integration-tests\": \"$docker\", @@ -349,6 +355,12 @@ jobs: pyspark-pandas-slow - >- pyspark-connect +exclude: + # Always run if pyspark-pandas == 'true', even infra-image is skip (such as non-master job) + # In practice, the build will run in individual PR, but not against the individual commit + # in Apache Spark repository. 
+ - modules: ${{ fromJson(needs.precondition.outputs.required).pyspark-pandas != 'true' && 'pyspark-pandas' }} + - modules: ${{ fromJson(needs.precondition.outputs.required).pyspark-pandas != 'true' && 'pyspark-pandas-slow' }} env: MODULES_TO_TEST: ${{ matrix.modules }} HADOOP_PROFILE: ${{ inputs.hadoop }} - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.5 updated: [SPARK-48116][INFRA][3.5] Run `pyspark-pandas*` only in PR builder and Daily Python CIs
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new ff691fa611f0 [SPARK-48116][INFRA][3.5] Run `pyspark-pandas*` only in PR builder and Daily Python CIs ff691fa611f0 is described below commit ff691fa611f0c8a7f0ff626179bced2b48ef9b7d Author: Dongjoon Hyun AuthorDate: Wed May 8 13:45:55 2024 -0700 [SPARK-48116][INFRA][3.5] Run `pyspark-pandas*` only in PR builder and Daily Python CIs ### What changes were proposed in this pull request? This PR aims to run `pyspark-pandas*` of `branch-3.5` only in PR builders and Daily Python CIs. In other words, only the commit builder will skip it by default. Please note that PR builders do not consume ASF resources, and they provide lots of test coverage every day. The `branch-3.5` Python Daily CI runs all Python tests including `pyspark-pandas`, like the following. https://github.com/apache/spark/blob/21548a8cc5c527d4416a276a852f967b4410bd4b/.github/workflows/build_branch35_python.yml#L43-L44 ### Why are the changes needed? To reduce GitHub Action usage to meet ASF INFRA policy. - https://infra.apache.org/github-actions-policy.html > All workflows MUST have a job concurrency level less than or equal to 20. This means a workflow cannot have more than 20 jobs running at the same time across all matrices. Although `pandas` is an **optional** package in PySpark, it is essential for PySpark users, and we have **6 test pipelines** which require lots of resources. We need to keep the job concurrency level at `less than or equal to 20` while keeping as much test capability as possible. https://github.com/apache/spark/blob/a762f3175fcdb7b069faa0c2bfce93d295cb1f10/dev/requirements.txt#L4-L7 - pyspark-pandas - pyspark-pandas-slow - pyspark-pandas-connect - pyspark-pandas-slow-connect ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46482 from dongjoon-hyun/SPARK-48116-3.5. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml | 14 ++ 1 file changed, 14 insertions(+) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 9c3dc95d0f66..679c51bb0941 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -82,6 +82,11 @@ jobs: pyspark=true; sparkr=true; tpcds=true; docker=true; pyspark_modules=`cd dev && python -c "import sparktestsupport.modules as m; print(','.join(m.name for m in m.all_modules if m.name.startswith('pyspark')))"` pyspark=`./dev/is-changed.py -m $pyspark_modules` + if [ "${{ github.repository != 'apache/spark' }}" ]; then +pandas=$pyspark + else +pandas=false + fi sparkr=`./dev/is-changed.py -m sparkr` tpcds=`./dev/is-changed.py -m sql` docker=`./dev/is-changed.py -m docker-integration-tests` @@ -90,6 +95,7 @@ jobs: { \"build\": \"$build\", \"pyspark\": \"$pyspark\", + \"pyspark-pandas\": \"$pandas\", \"sparkr\": \"$sparkr\", \"tpcds-1g\": \"$tpcds\", \"docker-integration-tests\": \"$docker\", @@ -361,6 +367,14 @@ jobs: pyspark-pandas-connect - >- pyspark-pandas-slow-connect +exclude: + # Always run if pyspark-pandas == 'true', even infra-image is skip (such as non-master job) + # In practice, the build will run in individual PR, but not against the individual commit + # in Apache Spark repository. + - modules: ${{ fromJson(needs.precondition.outputs.required).pyspark-pandas != 'true' && 'pyspark-pandas' }} + - modules: ${{ fromJson(needs.precondition.outputs.required).pyspark-pandas != 'true' && 'pyspark-pandas-slow' }} + - modules: ${{ fromJson(needs.precondition.outputs.required).pyspark-pandas != 'true' && 'pyspark-pandas-connect' }} + - modules: ${{ fromJson(needs.precondition.outputs.required).pyspark-pandas != 'true' && 'pyspark-pandas-slow-connect' }} env: MODULES_TO_TEST: ${{ matrix.modules }} HADOOP_PROFILE: ${{ inputs.hadoop }} - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
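The `exclude` entries added in both backports above rely on a GitHub Actions expression idiom: `fromJson(needs.precondition.outputs.required).pyspark-pandas != 'true' && 'pyspark-pandas'` evaluates to the module name when the flag is not `'true'` (so that matrix entry is excluded) and to `false` otherwise (which matches no entry, so nothing is excluded). A rough Python analogue of that behavior; the helper and values are illustrative only, not part of the workflow:

```python
def exclude_value(pandas_flag: str, module: str):
    # Mirrors `flag != 'true' && module` in Actions expressions: `A && B` yields B
    # when A is truthy, otherwise it yields A (here the boolean false).
    return module if pandas_flag != "true" else False

matrix = ["pyspark-pandas", "pyspark-pandas-slow", "pyspark-connect"]
pandas_modules = ("pyspark-pandas", "pyspark-pandas-slow")

# Commit builds on apache/spark set the flag to "false": the pandas entries drop out.
excluded = {exclude_value("false", m) for m in pandas_modules}
print([m for m in matrix if m not in excluded])  # ['pyspark-connect']

# PR builders and the Python-only daily jobs set it to "true": every exclude
# resolves to False, no matrix entry matches, and the full list still runs.
excluded = {exclude_value("true", m) for m in pandas_modules}
print([m for m in matrix if m not in excluded])  # all three modules
```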
(spark) branch master updated: [SPARK-48203][INFRA] Spin off `pyspark` tests from `build_branch34.yml` Daily CI
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new fbfcd402851e [SPARK-48203][INFRA] Spin off `pyspark` tests from `build_branch34.yml` Daily CI fbfcd402851e is described below commit fbfcd402851ee604789b8ba72a1ee0e67ef5ebe4 Author: Dongjoon Hyun AuthorDate: Wed May 8 12:30:12 2024 -0700 [SPARK-48203][INFRA] Spin off `pyspark` tests from `build_branch34.yml` Daily CI ### What changes were proposed in this pull request? This PR aims to create `build_branch34_python.yml` in order to spin off `pyspark` tests from `build_branch34.yml` Daily CI. ### Why are the changes needed? Currently, `build_branch34.yml` creates more than 15 test pipelines concurrently which is beyond of ASF Infra policy. - https://github.com/apache/spark/actions/workflows/build_branch35.yml We had better offload this to `Python only Daily CI` like `master` branch's `Python Only` Daily CI. - https://github.com/apache/spark/actions/workflows/build_python_3.10.yml ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46480 from dongjoon-hyun/SPARK-48203. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .github/workflows/build_branch34.yml| 1 - .../{build_branch34.yml => build_branch34_python.yml} | 13 +++-- 2 files changed, 3 insertions(+), 11 deletions(-) diff --git a/.github/workflows/build_branch34.yml b/.github/workflows/build_branch34.yml index 68887970d4d8..deb6c4240797 100644 --- a/.github/workflows/build_branch34.yml +++ b/.github/workflows/build_branch34.yml @@ -43,7 +43,6 @@ jobs: jobs: >- { "build": "true", - "pyspark": "true", "sparkr": "true", "tpcds-1g": "true", "docker-integration-tests": "true", diff --git a/.github/workflows/build_branch34.yml b/.github/workflows/build_branch34_python.yml similarity index 74% copy from .github/workflows/build_branch34.yml copy to .github/workflows/build_branch34_python.yml index 68887970d4d8..c109ba2dc792 100644 --- a/.github/workflows/build_branch34.yml +++ b/.github/workflows/build_branch34_python.yml @@ -17,7 +17,7 @@ # under the License. # -name: "Build (branch-3.4, Scala 2.13, Hadoop 3, JDK 8)" +name: "Build / Python-only (branch-3.4)" on: schedule: @@ -36,17 +36,10 @@ jobs: hadoop: hadoop3 envs: >- { - "SCALA_PROFILE": "scala2.13", - "PYTHON_TO_TEST": "", - "ORACLE_DOCKER_IMAGE_NAME": "gvenzl/oracle-xe:21.3.0" + "PYTHON_TO_TEST": "" } jobs: >- { - "build": "true", "pyspark": "true", - "sparkr": "true", - "tpcds-1g": "true", - "docker-integration-tests": "true", - "k8s-integration-tests": "true", - "lint" : "true" + "pyspark-pandas": "true" } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48202][INFRA] Spin off `pyspark` tests from `build_branch35.yml` Daily CI
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 70e5d2aa7a99 [SPARK-48202][INFRA] Spin off `pyspark` tests from `build_branch35.yml` Daily CI 70e5d2aa7a99 is described below commit 70e5d2aa7a992a6f4ff9c7d8e3752ce1d3d488f2 Author: Dongjoon Hyun AuthorDate: Wed May 8 10:47:52 2024 -0700 [SPARK-48202][INFRA] Spin off `pyspark` tests from `build_branch35.yml` Daily CI ### What changes were proposed in this pull request? This PR aims to create `build_branch35_python.yml` in order to spin off `pyspark` tests from `build_branch35.yml` Daily CI. ### Why are the changes needed? Currently, `build_branch35.yml` creates more than 15 test pipelines concurrently which is beyond of ASF Infra policy. - https://github.com/apache/spark/actions/workflows/build_branch35.yml We had better offload this to `Python only Daily CI` like `master` branch's `Python Only` Daily CI. - https://github.com/apache/spark/actions/workflows/build_python_3.10.yml ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46479 from dongjoon-hyun/SPARK-48202. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .github/workflows/build_branch35.yml| 1 - .../{build_branch35.yml => build_branch35_python.yml} | 13 +++-- 2 files changed, 3 insertions(+), 11 deletions(-) diff --git a/.github/workflows/build_branch35.yml b/.github/workflows/build_branch35.yml index 55616c2f1f01..2ec080d5722c 100644 --- a/.github/workflows/build_branch35.yml +++ b/.github/workflows/build_branch35.yml @@ -43,7 +43,6 @@ jobs: jobs: >- { "build": "true", - "pyspark": "true", "sparkr": "true", "tpcds-1g": "true", "docker-integration-tests": "true", diff --git a/.github/workflows/build_branch35.yml b/.github/workflows/build_branch35_python.yml similarity index 74% copy from .github/workflows/build_branch35.yml copy to .github/workflows/build_branch35_python.yml index 55616c2f1f01..1585534d33ba 100644 --- a/.github/workflows/build_branch35.yml +++ b/.github/workflows/build_branch35_python.yml @@ -17,7 +17,7 @@ # under the License. # -name: "Build (branch-3.5, Scala 2.13, Hadoop 3, JDK 8)" +name: "Build / Python-only (branch-3.5)" on: schedule: @@ -36,17 +36,10 @@ jobs: hadoop: hadoop3 envs: >- { - "SCALA_PROFILE": "scala2.13", - "PYTHON_TO_TEST": "", - "ORACLE_DOCKER_IMAGE_NAME": "gvenzl/oracle-xe:21.3.0" + "PYTHON_TO_TEST": "" } jobs: >- { - "build": "true", "pyspark": "true", - "sparkr": "true", - "tpcds-1g": "true", - "docker-integration-tests": "true", - "k8s-integration-tests": "true", - "lint" : "true" + "pyspark-pandas": "true" } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48200][INFRA] Split `build_python.yml` into per-version cron jobs
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 9d79ab42b127 [SPARK-48200][INFRA] Split `build_python.yml` into per-version cron jobs 9d79ab42b127 is described below commit 9d79ab42b127d1a12164cec260bfbd69f6da8b74 Author: Dongjoon Hyun AuthorDate: Wed May 8 09:40:03 2024 -0700 [SPARK-48200][INFRA] Split `build_python.yml` into per-version cron jobs ### What changes were proposed in this pull request? This PR aims to split `build_python.yml` into per-version cron jobs. Technically, this includes a revert of SPARK-48149 and choose [the discussed alternative](https://github.com/apache/spark/pull/46407#discussion_r1591586209). - https://github.com/apache/spark/pull/46407 - https://github.com/apache/spark/pull/46454 ### Why are the changes needed? To recover Python CI successfully in ASF INFRA policy. - https://github.com/apache/spark/actions/workflows/build_python.yml ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46477 from dongjoon-hyun/SPARK-48200. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../{build_python.yml => build_python_3.10.yml} | 16 ++-- .../{build_python.yml => build_python_3.12.yml} | 16 ++-- .../{build_python.yml => build_python_pypy3.9.yml} | 16 ++-- 3 files changed, 6 insertions(+), 42 deletions(-) diff --git a/.github/workflows/build_python.yml b/.github/workflows/build_python_3.10.yml similarity index 63% copy from .github/workflows/build_python.yml copy to .github/workflows/build_python_3.10.yml index efa281d6a279..5ae37fbc9120 100644 --- a/.github/workflows/build_python.yml +++ b/.github/workflows/build_python_3.10.yml @@ -17,26 +17,14 @@ # under the License. # -# According to https://infra.apache.org/github-actions-policy.html, -# all workflows SHOULD have a job concurrency level less than or equal to 15. -# To do that, we run one python version per cron schedule -name: "Build / Python-only (master, PyPy 3.9/Python 3.10/Python 3.12)" +name: "Build / Python-only (master, Python 3.10)" on: schedule: -- cron: '0 15 * * *' - cron: '0 17 * * *' -- cron: '0 19 * * *' jobs: run-build: -strategy: - fail-fast: false - matrix: -include: - - pyversion: ${{ github.event.schedule == '0 15 * * *' && 'pypy3' }} - - pyversion: ${{ github.event.schedule == '0 17 * * *' && 'python3.10' }} - - pyversion: ${{ github.event.schedule == '0 19 * * *' && 'python3.12' }} permissions: packages: write name: Run @@ -48,7 +36,7 @@ jobs: hadoop: hadoop3 envs: >- { - "PYTHON_TO_TEST": "${{ matrix.pyversion }}" + "PYTHON_TO_TEST": "python3.10" } jobs: >- { diff --git a/.github/workflows/build_python.yml b/.github/workflows/build_python_3.12.yml similarity index 63% copy from .github/workflows/build_python.yml copy to .github/workflows/build_python_3.12.yml index efa281d6a279..e1fd45a7d883 100644 --- a/.github/workflows/build_python.yml +++ b/.github/workflows/build_python_3.12.yml @@ -17,26 +17,14 @@ # under the License. # -# According to https://infra.apache.org/github-actions-policy.html, -# all workflows SHOULD have a job concurrency level less than or equal to 15. 
-# To do that, we run one python version per cron schedule -name: "Build / Python-only (master, PyPy 3.9/Python 3.10/Python 3.12)" +name: "Build / Python-only (master, Python 3.12)" on: schedule: -- cron: '0 15 * * *' -- cron: '0 17 * * *' - cron: '0 19 * * *' jobs: run-build: -strategy: - fail-fast: false - matrix: -include: - - pyversion: ${{ github.event.schedule == '0 15 * * *' && 'pypy3' }} - - pyversion: ${{ github.event.schedule == '0 17 * * *' && 'python3.10' }} - - pyversion: ${{ github.event.schedule == '0 19 * * *' && 'python3.12' }} permissions: packages: write name: Run @@ -48,7 +36,7 @@ jobs: hadoop: hadoop3 envs: >- { - "PYTHON_TO_TEST": "${{ matrix.pyversion }}" + "PYTHON_TO_TEST": "python3.12" } jobs: >- { diff --git a/.github/workflows/build_python.yml b/.github/workflows/build_python_pypy3.9.yml similarity index 63% rename from .github/workflows/build_pyth
(spark) branch master updated: [SPARK-48198][BUILD] Upgrade jackson to 2.17.1
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new e0c406eaef36 [SPARK-48198][BUILD] Upgrade jackson to 2.17.1 e0c406eaef36 is described below commit e0c406eaef36d95a106b6ce14086654ace6202af Author: panbingkun AuthorDate: Wed May 8 08:50:02 2024 -0700 [SPARK-48198][BUILD] Upgrade jackson to 2.17.1 ### What changes were proposed in this pull request? The pr aims to upgrade `jackson` from `2.17.0` to `2.17.1`. ### Why are the changes needed? The full release notes: https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.17.1 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46476 from panbingkun/SPARK-48198. Authored-by: panbingkun Signed-off-by: Dongjoon Hyun --- dev/deps/spark-deps-hadoop-3-hive-2.3 | 14 +++--- pom.xml | 4 ++-- 2 files changed, 9 insertions(+), 9 deletions(-) diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3 index 5d933e34e40b..73d41e9eeb33 100644 --- a/dev/deps/spark-deps-hadoop-3-hive-2.3 +++ b/dev/deps/spark-deps-hadoop-3-hive-2.3 @@ -104,15 +104,15 @@ icu4j/72.1//icu4j-72.1.jar ini4j/0.5.4//ini4j-0.5.4.jar istack-commons-runtime/3.0.8//istack-commons-runtime-3.0.8.jar ivy/2.5.2//ivy-2.5.2.jar -jackson-annotations/2.17.0//jackson-annotations-2.17.0.jar +jackson-annotations/2.17.1//jackson-annotations-2.17.1.jar jackson-core-asl/1.9.13//jackson-core-asl-1.9.13.jar -jackson-core/2.17.0//jackson-core-2.17.0.jar -jackson-databind/2.17.0//jackson-databind-2.17.0.jar -jackson-dataformat-cbor/2.17.0//jackson-dataformat-cbor-2.17.0.jar -jackson-dataformat-yaml/2.17.0//jackson-dataformat-yaml-2.17.0.jar -jackson-datatype-jsr310/2.17.0//jackson-datatype-jsr310-2.17.0.jar +jackson-core/2.17.1//jackson-core-2.17.1.jar +jackson-databind/2.17.1//jackson-databind-2.17.1.jar +jackson-dataformat-cbor/2.17.1//jackson-dataformat-cbor-2.17.1.jar +jackson-dataformat-yaml/2.17.1//jackson-dataformat-yaml-2.17.1.jar +jackson-datatype-jsr310/2.17.1//jackson-datatype-jsr310-2.17.1.jar jackson-mapper-asl/1.9.13//jackson-mapper-asl-1.9.13.jar -jackson-module-scala_2.13/2.17.0//jackson-module-scala_2.13-2.17.0.jar +jackson-module-scala_2.13/2.17.1//jackson-module-scala_2.13-2.17.1.jar jakarta.annotation-api/2.0.0//jakarta.annotation-api-2.0.0.jar jakarta.inject-api/2.0.1//jakarta.inject-api-2.0.1.jar jakarta.servlet-api/5.0.0//jakarta.servlet-api-5.0.0.jar diff --git a/pom.xml b/pom.xml index c72482fd6a41..c3ff5d101c22 100644 --- a/pom.xml +++ b/pom.xml @@ -183,8 +183,8 @@ true true 1.9.13 -2.17.0 - 2.17.0 +2.17.1 + 2.17.1 2.3.1 3.0.2 1.1.10.5 - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.5 updated: [SPARK-48184][PYTHON][CONNECT] Always set the seed of `Dataframe.sample` in Client side
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new a762f3175fcd [SPARK-48184][PYTHON][CONNECT] Always set the seed of `Dataframe.sample` in Client side a762f3175fcd is described below commit a762f3175fcdb7b069faa0c2bfce93d295cb1f10 Author: Ruifeng Zheng AuthorDate: Wed May 8 07:44:22 2024 -0700 [SPARK-48184][PYTHON][CONNECT] Always set the seed of `Dataframe.sample` in Client side ### What changes were proposed in this pull request? Always set the seed of `Dataframe.sample` in Client side ### Why are the changes needed? Bug fix If the seed is not set in Client, it will be set in server side with a random int https://github.com/apache/spark/blob/c4df12cc884cddefcfcf8324b4d7b9349fb4f6a0/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala#L386 which cause inconsistent results in multiple executions In Spark Classic: ``` In [1]: df = spark.range(1).sample(0.1) In [2]: [df.count() for i in range(10)] Out[2]: [1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006] ``` In Spark Connect: before: ``` In [1]: df = spark.range(1).sample(0.1) In [2]: [df.count() for i in range(10)] Out[2]: [969, 1005, 958, 996, 987, 1026, 991, 1020, 1012, 979] ``` after: ``` In [1]: df = spark.range(1).sample(0.1) In [2]: [df.count() for i in range(10)] Out[2]: [1032, 1032, 1032, 1032, 1032, 1032, 1032, 1032, 1032, 1032] ``` ### Does this PR introduce _any_ user-facing change? yes, bug fix ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #46456 from zhengruifeng/py_connect_sample_seed. 
Authored-by: Ruifeng Zheng Signed-off-by: Dongjoon Hyun (cherry picked from commit 47afe77242abf639a1d6966ce60cfd170a9d7d20) Signed-off-by: Dongjoon Hyun --- python/pyspark/sql/connect/dataframe.py | 2 +- python/pyspark/sql/tests/connect/test_connect_plan.py | 2 +- python/pyspark/sql/tests/test_dataframe.py| 5 + 3 files changed, 7 insertions(+), 2 deletions(-) diff --git a/python/pyspark/sql/connect/dataframe.py b/python/pyspark/sql/connect/dataframe.py index ff6191642025..6f23a15fb4ad 100644 --- a/python/pyspark/sql/connect/dataframe.py +++ b/python/pyspark/sql/connect/dataframe.py @@ -687,7 +687,7 @@ class DataFrame: if withReplacement is None: withReplacement = False -seed = int(seed) if seed is not None else None +seed = int(seed) if seed is not None else random.randint(0, sys.maxsize) return DataFrame.withPlan( plan.Sample( diff --git a/python/pyspark/sql/tests/connect/test_connect_plan.py b/python/pyspark/sql/tests/connect/test_connect_plan.py index c39fb6be24cd..88ef37511a66 100644 --- a/python/pyspark/sql/tests/connect/test_connect_plan.py +++ b/python/pyspark/sql/tests/connect/test_connect_plan.py @@ -430,7 +430,7 @@ class SparkConnectPlanTests(PlanOnlyTestFixture): self.assertEqual(plan.root.sample.lower_bound, 0.0) self.assertEqual(plan.root.sample.upper_bound, 0.3) self.assertEqual(plan.root.sample.with_replacement, False) -self.assertEqual(plan.root.sample.HasField("seed"), False) +self.assertEqual(plan.root.sample.HasField("seed"), True) self.assertEqual(plan.root.sample.deterministic_order, False) plan = ( diff --git a/python/pyspark/sql/tests/test_dataframe.py b/python/pyspark/sql/tests/test_dataframe.py index 5907c8c09fb4..887648018cf3 100644 --- a/python/pyspark/sql/tests/test_dataframe.py +++ b/python/pyspark/sql/tests/test_dataframe.py @@ -1045,6 +1045,11 @@ class DataFrameTestsMixin: IllegalArgumentException, lambda: self.spark.range(1).sample(-1.0).count() ) +def test_sample_with_random_seed(self): +df = self.spark.range(1).sample(0.1) +cnts = [df.count() for i in range(10)] +self.assertEqual(1, len(set(cnts))) + def test_toDF_with_string(self): df = self.spark.createDataFrame([("John", 30), ("Alice", 25), ("Bob", 28)]) data = [("John", 30), ("Alice", 25), ("Bob", 28)] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48184][PYTHON][CONNECT] Always set the seed of `Dataframe.sample` in Client side
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 47afe77242ab [SPARK-48184][PYTHON][CONNECT] Always set the seed of `Dataframe.sample` in Client side 47afe77242ab is described below commit 47afe77242abf639a1d6966ce60cfd170a9d7d20 Author: Ruifeng Zheng AuthorDate: Wed May 8 07:44:22 2024 -0700 [SPARK-48184][PYTHON][CONNECT] Always set the seed of `Dataframe.sample` in Client side ### What changes were proposed in this pull request? Always set the seed of `Dataframe.sample` in Client side ### Why are the changes needed? Bug fix If the seed is not set in Client, it will be set in server side with a random int https://github.com/apache/spark/blob/c4df12cc884cddefcfcf8324b4d7b9349fb4f6a0/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala#L386 which cause inconsistent results in multiple executions In Spark Classic: ``` In [1]: df = spark.range(1).sample(0.1) In [2]: [df.count() for i in range(10)] Out[2]: [1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006] ``` In Spark Connect: before: ``` In [1]: df = spark.range(1).sample(0.1) In [2]: [df.count() for i in range(10)] Out[2]: [969, 1005, 958, 996, 987, 1026, 991, 1020, 1012, 979] ``` after: ``` In [1]: df = spark.range(1).sample(0.1) In [2]: [df.count() for i in range(10)] Out[2]: [1032, 1032, 1032, 1032, 1032, 1032, 1032, 1032, 1032, 1032] ``` ### Does this PR introduce _any_ user-facing change? yes, bug fix ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #46456 from zhengruifeng/py_connect_sample_seed. 
Authored-by: Ruifeng Zheng Signed-off-by: Dongjoon Hyun --- python/pyspark/sql/connect/dataframe.py | 2 +- python/pyspark/sql/tests/connect/test_connect_plan.py | 2 +- python/pyspark/sql/tests/test_dataframe.py| 5 + 3 files changed, 7 insertions(+), 2 deletions(-) diff --git a/python/pyspark/sql/connect/dataframe.py b/python/pyspark/sql/connect/dataframe.py index f9a209d2bcb3..843c92a9b27d 100644 --- a/python/pyspark/sql/connect/dataframe.py +++ b/python/pyspark/sql/connect/dataframe.py @@ -813,7 +813,7 @@ class DataFrame(ParentDataFrame): if withReplacement is None: withReplacement = False -seed = int(seed) if seed is not None else None +seed = int(seed) if seed is not None else random.randint(0, sys.maxsize) return DataFrame( plan.Sample( diff --git a/python/pyspark/sql/tests/connect/test_connect_plan.py b/python/pyspark/sql/tests/connect/test_connect_plan.py index 09c3171ee11f..e8d04aeada74 100644 --- a/python/pyspark/sql/tests/connect/test_connect_plan.py +++ b/python/pyspark/sql/tests/connect/test_connect_plan.py @@ -443,7 +443,7 @@ class SparkConnectPlanTests(PlanOnlyTestFixture): self.assertEqual(plan.root.sample.lower_bound, 0.0) self.assertEqual(plan.root.sample.upper_bound, 0.3) self.assertEqual(plan.root.sample.with_replacement, False) -self.assertEqual(plan.root.sample.HasField("seed"), False) +self.assertEqual(plan.root.sample.HasField("seed"), True) self.assertEqual(plan.root.sample.deterministic_order, False) plan = ( diff --git a/python/pyspark/sql/tests/test_dataframe.py b/python/pyspark/sql/tests/test_dataframe.py index 16dd0d2a3bf7..f491b496ddae 100644 --- a/python/pyspark/sql/tests/test_dataframe.py +++ b/python/pyspark/sql/tests/test_dataframe.py @@ -430,6 +430,11 @@ class DataFrameTestsMixin: IllegalArgumentException, lambda: self.spark.range(1).sample(-1.0).count() ) +def test_sample_with_random_seed(self): +df = self.spark.range(1).sample(0.1) +cnts = [df.count() for i in range(10)] +self.assertEqual(1, len(set(cnts))) + def test_toDF_with_string(self): df = self.spark.createDataFrame([("John", 30), ("Alice", 25), ("Bob", 28)]) data = [("John", 30), ("Alice", 25), ("Bob", 28)] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
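The one-line change in both commits above amounts to resolving the sampling seed on the client whenever the caller does not provide one, so that re-executing the same lazy plan samples the same rows instead of letting the server draw a fresh seed per execution. A minimal sketch of that pattern follows; the commented usage lines assume a running SparkSession named `spark` and are illustrative rather than quoted from the patch:

```python
import random
import sys

def resolve_sample_seed(seed=None) -> int:
    # Same idea as the patched DataFrame.sample: fix the seed once on the client
    # so every execution of the (lazy) sampled plan sees identical rows.
    return int(seed) if seed is not None else random.randint(0, sys.maxsize)

# Illustrative usage (assumes a SparkSession named `spark`):
# df = spark.range(10_000).sample(0.1)   # the plan now carries a concrete seed
# counts = [df.count() for _ in range(10)]
# assert len(set(counts)) == 1           # consistent counts, as in the new test above
```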
(spark) branch branch-3.4 updated: [SPARK-48037][CORE][3.4] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new da0c7cc81bb3 [SPARK-48037][CORE][3.4] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data da0c7cc81bb3 is described below commit da0c7cc81bb3d69d381dd0683e910eae4c80e9ae Author: sychen AuthorDate: Wed May 8 07:30:21 2024 -0700 [SPARK-48037][CORE][3.4] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data ### What changes were proposed in this pull request? This PR aims to fix the issue that SortShuffleWriter lacks shuffle-write-related metrics, which can result in inaccurate data. ### Why are the changes needed? When the shuffle writer is SortShuffleWriter, it does not use SQLShuffleWriteMetricsReporter to update metrics, which causes AQE to obtain runtime statistics with a rowCount of 0. Some optimization rules rely on rowCount statistics, such as `EliminateLimits`. Because rowCount is 0, the rule removes the limit operator, and the query returns results without the limit applied. https://github.com/apache/spark/blob/59d5946cfd377e9203ccf572deb34f87fab7510c/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala#L168-L172 https://github.com/apache/spark/blob/59d5946cfd377e9203ccf572deb34f87fab7510c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L2067-L2070 ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Production environment verification. **master metrics** https://github.com/apache/spark/assets/3898450/dc9b6e8a-93ec-4f59-a903-71aa5b11962c **PR metrics** https://github.com/apache/spark/assets/3898450/2d73b773-2dcc-4d23-81de-25dcadac86c1 ### Was this patch authored or co-authored using generative AI tooling? No Closes #46464 from cxzl25/SPARK-48037-3.4. 
Authored-by: sychen Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml | 1 + .../spark/shuffle/sort/SortShuffleManager.scala| 2 +- .../spark/shuffle/sort/SortShuffleWriter.scala | 6 ++-- .../spark/util/collection/ExternalSorter.scala | 9 +++--- .../shuffle/sort/SortShuffleWriterSuite.scala | 3 ++ .../sql/execution/UnsafeRowSerializerSuite.scala | 3 +- .../adaptive/AdaptiveQueryExecSuite.scala | 32 -- 7 files changed, 44 insertions(+), 12 deletions(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 8ae303178033..2d2e8da80d46 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -644,6 +644,7 @@ jobs: python3.9 -m pip install 'sphinx<3.1.0' mkdocs pydata_sphinx_theme 'sphinx-copybutton==0.5.2' nbsphinx numpydoc 'jinja2<3.0.0' 'markupsafe==2.0.1' 'pyzmq<24.0.0' 'sphinxcontrib-applehelp==1.0.4' 'sphinxcontrib-devhelp==1.0.2' 'sphinxcontrib-htmlhelp==2.0.1' 'sphinxcontrib-qthelp==1.0.3' 'sphinxcontrib-serializinghtml==1.1.5' 'nest-asyncio==1.5.8' 'rpds-py==0.16.2' 'alabaster==0.7.13' python3.9 -m pip install ipython_genutils # See SPARK-38517 python3.9 -m pip install sphinx_plotly_directive 'numpy>=1.20.0' 'pyarrow==12.0.1' pandas 'plotly>=4.8' +python3.9 -m pip install 'nbsphinx==0.9.3' python3.9 -m pip install 'docutils<0.18.0' # See SPARK-39421 apt-get update -y apt-get install -y ruby ruby-dev diff --git a/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala b/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala index 46aca07ce43f..79dff6f87534 100644 --- a/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala +++ b/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala @@ -176,7 +176,7 @@ private[spark] class SortShuffleManager(conf: SparkConf) extends ShuffleManager metrics, shuffleExecutorComponents) case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] => -new SortShuffleWriter(other, mapId, context, shuffleExecutorComponents) +new SortShuffleWriter(other, mapId, context, metrics, shuffleExecutorComponents) } } diff --git a/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala b/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala index 8613fe11a4c2..3be7d24f7e4e 100644 --- a/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala +++ b/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala @@ -21,6 +21,7 @@ import org.apache.spark._ impor
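To make the failure mode described above concrete: `EliminateLimits` removes a `Limit` node when runtime statistics say the child cannot produce more rows than the limit, so a bogus rowCount of 0 makes every limit look redundant. The following is a simplified, illustrative sketch of that decision in Python, not the actual Catalyst rule:

```python
from typing import Optional

def limit_is_redundant(limit: int, child_row_count: Optional[int]) -> bool:
    # Simplified version of the EliminateLimits idea referenced in the links above:
    # if statistics say the child yields no more rows than the limit, drop the Limit.
    return child_row_count is not None and child_row_count <= limit

assert limit_is_redundant(10, 1_000_000) is False  # healthy statistics keep the limit
assert limit_is_redundant(10, 0) is True           # missing write metrics -> rowCount 0,
                                                   # so the limit is dropped and the query
                                                   # returns unlimited results
```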
(spark) branch master updated: [SPARK-48187][INFRA] Run `docs` only in PR builders and `build_non_ansi` Daily CI
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f3d9b819f3c0 [SPARK-48187][INFRA] Run `docs` only in PR builders and `build_non_ansi` Daily CI f3d9b819f3c0 is described below commit f3d9b819f3c013cd402ed98d01842173c45a5dd6 Author: Dongjoon Hyun AuthorDate: Wed May 8 00:02:44 2024 -0700 [SPARK-48187][INFRA] Run `docs` only in PR builders and `build_non_ansi` Daily CI ### What changes were proposed in this pull request? This PR aims to run `docs` (Documentation Generation) step only in PR builders and `build_non_ansi` Daily CI. To do that, this PR spins off `documentation generation` tasks from `lint` job. ### Why are the changes needed? Currently, Apache Spark CI is running `Documentation Generation` always inside `lint` job. We can take advantage PR Builder and one of Daily CIs. - https://infra.apache.org/github-actions-policy.html ### Does this PR introduce _any_ user-facing change? No because this is an infra update. ### How was this patch tested? Pass the CIs and manual review because PR builders will not be affected by this. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46463 from dongjoon-hyun/SPARK-48187. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml | 94 ++-- .github/workflows/build_non_ansi.yml | 1 + 2 files changed, 90 insertions(+), 5 deletions(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 00ba16265dce..bb9f2f9a9603 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -85,6 +85,7 @@ jobs: sparkr=`./dev/is-changed.py -m sparkr` buf=true ui=true +docs=true else pandas=false yarn=false @@ -92,6 +93,7 @@ jobs: sparkr=false buf=false ui=false +docs=false fi build=`./dev/is-changed.py -m "core,unsafe,kvstore,avro,utils,network-common,network-shuffle,repl,launcher,examples,sketch,variant,api,catalyst,hive-thriftserver,mllib-local,mllib,graphx,streaming,sql-kafka-0-10,streaming-kafka-0-10,streaming-kinesis-asl,kubernetes,hadoop-cloud,spark-ganglia-lgpl,protobuf,yarn,connect,sql,hive"` precondition=" @@ -103,6 +105,7 @@ jobs: \"tpcds-1g\": \"false\", \"docker-integration-tests\": \"false\", \"lint\" : \"true\", + \"docs\" : \"$docs\", \"yarn\" : \"$yarn\", \"k8s-integration-tests\" : \"$kubernetes\", \"buf\" : \"$buf\", @@ -621,12 +624,12 @@ jobs: - name: Python CodeGen check run: ./dev/connect-check-protos.py - # Static analysis, and documentation build + # Static analysis lint: needs: [precondition, infra-image] # always run if lint == 'true', even infra-image is skip (such as non-master job) if: (!cancelled()) && fromJson(needs.precondition.outputs.required).lint == 'true' -name: Linters, licenses, dependencies and documentation generation +name: Linters, licenses, and dependencies runs-on: ubuntu-latest timeout-minutes: 180 env: @@ -764,7 +767,90 @@ jobs: Rscript -e "devtools::install_version('lintr', version='2.0.1', repos='https://cloud.r-project.org')" - name: Install R linter dependencies and SparkR run: ./R/install-dev.sh -# Should delete this section after SPARK 3.5 EOL. 
+- name: R linter + run: ./dev/lint-r + + # Documentation build + docs: +needs: [precondition, infra-image] +# always run if lint == 'true', even infra-image is skip (such as non-master job) +if: (!cancelled()) && fromJson(needs.precondition.outputs.required).docs == 'true' +name: Documentation generation +runs-on: ubuntu-latest +timeout-minutes: 180 +env: + LC_ALL: C.UTF-8 + LANG: C.UTF-8 + NOLINT_ON_COMPILE: false + PYSPARK_DRIVER_PYTHON: python3.9 + PYSPARK_PYTHON: python3.9 + GITHUB_PREV_SHA: ${{ github.event.before }} +container: + image: ${{ needs.precondition.outputs.image_url }} +steps: +- name: Checkout Spark repository + uses: actions/checkout@v4 + with: +fetch-depth: 0 +repository: apache/spark +ref: ${{ inputs.branch }} +- name: Add GITHUB_WORKSPACE to git trust safe.directory + run: | +
(spark) branch branch-3.5 updated: [SPARK-48138][CONNECT][TESTS] Disable a flaky `SparkSessionE2ESuite.interrupt tag` test
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 36da89deccc9 [SPARK-48138][CONNECT][TESTS] Disable a flaky `SparkSessionE2ESuite.interrupt tag` test 36da89deccc9 is described below commit 36da89deccc916a6f32d9bf6d6f2fd8e288da917 Author: Dongjoon Hyun AuthorDate: Mon May 6 13:45:54 2024 +0800 [SPARK-48138][CONNECT][TESTS] Disable a flaky `SparkSessionE2ESuite.interrupt tag` test ### What changes were proposed in this pull request? This PR aims to disable a flaky test, `SparkSessionE2ESuite.interrupt tag`, temporarily. To re-enable this, SPARK-48139 is created as a blocker issue for 4.0.0. ### Why are the changes needed? This test case was added at `Apache Spark 3.5.0` but has been unstable unfortunately until now. - #42009 We tried to stabilize this test case before `Apache Spark 4.0.0-preview`. - #45173 - #46374 However, it's still flaky. - https://github.com/apache/spark/actions/runs/8962353911/job/24611130573 (Master, 2024-05-05) - https://github.com/apache/spark/actions/runs/8948176536/job/24581022674 (Master, 2024-05-04) This PR aims to stablize CI first and to focus this flaky issue as a blocker level before going on `Spark Connect GA` in SPARK-48139 before Apache Spark 4.0.0. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46396 from dongjoon-hyun/SPARK-48138. Authored-by: Dongjoon Hyun Signed-off-by: yangjie01 (cherry picked from commit 8294c5962febe53eebdff79f65f5f293d93a1997) Signed-off-by: Dongjoon Hyun --- .../jvm/src/test/scala/org/apache/spark/sql/SparkSessionE2ESuite.scala | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/SparkSessionE2ESuite.scala b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/SparkSessionE2ESuite.scala index c76dc724828e..e9c2f0c45750 100644 --- a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/SparkSessionE2ESuite.scala +++ b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/SparkSessionE2ESuite.scala @@ -108,7 +108,8 @@ class SparkSessionE2ESuite extends RemoteSparkSession { assert(interrupted.length == 2, s"Interrupted operations: $interrupted.") } - test("interrupt tag") { + // TODO(SPARK-48139): Re-enable `SparkSessionE2ESuite.interrupt tag` + ignore("interrupt tag") { val session = spark import session.implicits._ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
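For context on the mechanism used in the commit above: in ScalaTest's FunSuite-style suites, disabling a test amounts to swapping `test` for `ignore`, which keeps the body compiling while skipping execution. A minimal, self-contained sketch of that pattern (the suite name, JIRA placeholder, and assertion are illustrative only, not Spark's actual suite):

```scala
import org.scalatest.funsuite.AnyFunSuite

class ExampleSuite extends AnyFunSuite {
  // TODO(SPARK-XXXXX): re-enable once the flakiness is understood
  ignore("flaky behavior under investigation") {
    assert(1 + 1 == 2) // never executed while the test is ignored
  }
}
```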
(spark) branch branch-3.5 updated: [SPARK-48037][CORE][3.5] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 58b71307795b [SPARK-48037][CORE][3.5] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data 58b71307795b is described below commit 58b71307795b6060be97431e0c5c8ab95205ea79 Author: sychen AuthorDate: Tue May 7 22:39:02 2024 -0700 [SPARK-48037][CORE][3.5] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data ### What changes were proposed in this pull request? This PR aims to fix SortShuffleWriter's missing shuffle-write-related metrics, which can result in inaccurate data. ### Why are the changes needed? When the shuffle writer is SortShuffleWriter, it does not use SQLShuffleWriteMetricsReporter to update metrics, so the runtime statistics that AQE obtains report a rowCount of 0. Some optimization rules rely on rowCount statistics, such as `EliminateLimits`. Because rowCount is 0, that rule removes the limit operator, and we get results without the limit applied. https://github.com/apache/spark/blob/59d5946cfd377e9203ccf572deb34f87fab7510c/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala#L168-L172 https://github.com/apache/spark/blob/59d5946cfd377e9203ccf572deb34f87fab7510c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L2067-L2070 ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Production environment verification. **master metrics** https://github.com/apache/spark/assets/3898450/dc9b6e8a-93ec-4f59-a903-71aa5b11962c **PR metrics** https://github.com/apache/spark/assets/3898450/2d73b773-2dcc-4d23-81de-25dcadac86c1 ### Was this patch authored or co-authored using generative AI tooling? No Closes #46459 from cxzl25/SPARK-48037-3.5. 
Authored-by: sychen Signed-off-by: Dongjoon Hyun --- .../spark/shuffle/sort/SortShuffleManager.scala| 2 +- .../spark/shuffle/sort/SortShuffleWriter.scala | 6 ++-- .../spark/util/collection/ExternalSorter.scala | 9 +++--- .../shuffle/sort/SortShuffleWriterSuite.scala | 3 ++ .../sql/execution/UnsafeRowSerializerSuite.scala | 3 +- .../adaptive/AdaptiveQueryExecSuite.scala | 32 -- 6 files changed, 43 insertions(+), 12 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala b/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala index 46aca07ce43f..79dff6f87534 100644 --- a/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala +++ b/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala @@ -176,7 +176,7 @@ private[spark] class SortShuffleManager(conf: SparkConf) extends ShuffleManager metrics, shuffleExecutorComponents) case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] => -new SortShuffleWriter(other, mapId, context, shuffleExecutorComponents) +new SortShuffleWriter(other, mapId, context, metrics, shuffleExecutorComponents) } } diff --git a/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala b/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala index 8613fe11a4c2..3be7d24f7e4e 100644 --- a/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala +++ b/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala @@ -21,6 +21,7 @@ import org.apache.spark._ import org.apache.spark.internal.{config, Logging} import org.apache.spark.scheduler.MapStatus import org.apache.spark.shuffle.{BaseShuffleHandle, ShuffleWriter} +import org.apache.spark.shuffle.ShuffleWriteMetricsReporter import org.apache.spark.shuffle.api.ShuffleExecutorComponents import org.apache.spark.util.collection.ExternalSorter @@ -28,6 +29,7 @@ private[spark] class SortShuffleWriter[K, V, C]( handle: BaseShuffleHandle[K, V, C], mapId: Long, context: TaskContext, +writeMetrics: ShuffleWriteMetricsReporter, shuffleExecutorComponents: ShuffleExecutorComponents) extends ShuffleWriter[K, V] with Logging { @@ -46,8 +48,6 @@ private[spark] class SortShuffleWriter[K, V, C]( private var partitionLengths: Array[Long] = _ - private val writeMetrics = context.taskMetrics().shuffleWriteMetrics - /** Write a bunch of records to this task's output */ override def write(records: Iterator[Product2[K, V]]): Unit = { sorter = if (dep.mapSideCombine) { @@ -67,7 +67,7 @@ private[spark] class SortShuffleWriter[K, V, C]( // (see SPARK-3570).
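The shape of the fix above is constructor injection of the metrics reporter: instead of the writer reaching into `TaskContext` for the plain shuffle-write metrics, the caller passes a reporter in, so SQL's wrapping reporter (SQLShuffleWriteMetricsReporter) is actually exercised. A schematic sketch of that pattern with simplified stand-in types (not Spark's actual classes):

```scala
// A wrapping reporter, analogous in spirit to SQLShuffleWriteMetricsReporter,
// only observes writes if the writer uses the instance it was handed.
trait WriteMetricsReporter {
  def incRecordsWritten(n: Long): Unit
}

final class CountingReporter extends WriteMetricsReporter {
  var records = 0L
  override def incRecordsWritten(n: Long): Unit = records += n
}

// The writer takes the reporter as a constructor argument (as the patched
// SortShuffleWriter now does) rather than creating its own internally.
final class ToyShuffleWriter(metrics: WriteMetricsReporter) {
  def write(records: Iterator[String]): Unit =
    records.foreach(_ => metrics.incRecordsWritten(1L))
}

object Demo extends App {
  val reporter = new CountingReporter
  new ToyShuffleWriter(reporter).write(Iterator("a", "b", "c"))
  println(reporter.records) // 3 — the caller-supplied reporter saw every record
}
```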
(spark) branch master updated (5f883117203d -> 52a7f634e913)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 5f883117203d [SPARK-47914][SQL] Do not display the splits parameter in Range add 52a7f634e913 [SPARK-48183][PYTHON][DOCS] Update error contribution guide to respect new error class file No new revisions were added by this update. Summary of changes: python/docs/source/development/contributing.rst | 4 ++-- python/pyspark/errors/utils.py | 8 2 files changed, 6 insertions(+), 6 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (6588554aa4cc -> 3b1ea0fde44e)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 6588554aa4cc [SPARK-48149][INFRA][FOLLOWUP] Use single quotation mark add 3b1ea0fde44e [MINOR][PYTHON][TESTS] Remove the doc in error message tests to allow other PyArrow versions in tests No new revisions were added by this update. Summary of changes: python/pyspark/sql/tests/pandas/test_pandas_cogrouped_map.py | 2 +- python/pyspark/sql/tests/pandas/test_pandas_map.py | 4 ++-- python/pyspark/sql/tests/test_arrow_map.py | 4 ++-- 3 files changed, 5 insertions(+), 5 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48131][CORE][FOLLOWUP] Add a new configuration for the MDC key of Task Name
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 84c5b919d998 [SPARK-48131][CORE][FOLLOWUP] Add a new configuration for the MDC key of Task Name 84c5b919d998 is described below commit 84c5b919d99872858d2f98db21fd3482f27dcbfc Author: Gengliang Wang AuthorDate: Tue May 7 19:18:50 2024 -0700 [SPARK-48131][CORE][FOLLOWUP] Add a new configuration for the MDC key of Task Name ### What changes were proposed in this pull request? Introduce a new Spark config `spark.log.legacyTaskNameMdc.enabled`: When true, the MDC key `mdc.taskName` will be set in the logs, which is consistent with the behavior of Spark 3.1 to Spark 3.5 releases. When false, the logging framework will use `task_name` as the MDC key for consistency with other new MDC keys. ### Why are the changes needed? As discussed in https://github.com/apache/spark/pull/46386#issuecomment-2098985001, we should add a configuration and migration guide about the change in the MDC key of Task Name. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual test ### Was this patch authored or co-authored using generative AI tooling? No Closes #46446 from gengliangwang/addConfig. Authored-by: Gengliang Wang Signed-off-by: Dongjoon Hyun --- core/src/main/scala/org/apache/spark/executor/Executor.scala | 11 +-- .../main/scala/org/apache/spark/internal/config/package.scala | 10 ++ docs/core-migration-guide.md | 2 ++ 3 files changed, 21 insertions(+), 2 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/executor/Executor.scala b/core/src/main/scala/org/apache/spark/executor/Executor.scala index 3edba45ef89f..68c38fb6179f 100644 --- a/core/src/main/scala/org/apache/spark/executor/Executor.scala +++ b/core/src/main/scala/org/apache/spark/executor/Executor.scala @@ -95,6 +95,13 @@ private[spark] class Executor( private[executor] val conf = env.conf + // SPARK-48131: Unify MDC key mdc.taskName and task_name in Spark 4.0 release. + private[executor] val taskNameMDCKey = if (conf.get(LEGACY_TASK_NAME_MDC_ENABLED)) { +"mdc.taskName" + } else { +LogKeys.TASK_NAME.name + } + // SPARK-40235: updateDependencies() uses a ReentrantLock instead of the `synchronized` keyword // so that tasks can exit quickly if they are interrupted while waiting on another task to // finish downloading dependencies. 
@@ -914,7 +921,7 @@ private[spark] class Executor( try { mdc.foreach { case (key, value) => MDC.put(key, value) } // avoid overriding the takName by the user - MDC.put(LogKeys.TASK_NAME.name, taskName) + MDC.put(taskNameMDCKey, taskName) } catch { case _: NoSuchFieldError => logInfo("MDC is not supported.") } @@ -923,7 +930,7 @@ private[spark] class Executor( private def cleanMDCForTask(taskName: String, mdc: Seq[(String, String)]): Unit = { try { mdc.foreach { case (key, _) => MDC.remove(key) } - MDC.remove(LogKeys.TASK_NAME.name) + MDC.remove(taskNameMDCKey) } catch { case _: NoSuchFieldError => logInfo("MDC is not supported.") } diff --git a/core/src/main/scala/org/apache/spark/internal/config/package.scala b/core/src/main/scala/org/apache/spark/internal/config/package.scala index a5be6084de36..87402d2cc17e 100644 --- a/core/src/main/scala/org/apache/spark/internal/config/package.scala +++ b/core/src/main/scala/org/apache/spark/internal/config/package.scala @@ -152,6 +152,16 @@ package object config { .booleanConf .createWithDefault(true) + private[spark] val LEGACY_TASK_NAME_MDC_ENABLED = +ConfigBuilder("spark.log.legacyTaskNameMdc.enabled") + .doc("When true, the MDC (Mapped Diagnostic Context) key `mdc.taskName` will be set in the " + +"log output, which is the behavior of Spark version 3.1 through Spark 3.5 releases. " + +"When false, the logging framework will use `task_name` as the MDC key, " + +"aligning it with the naming convention of newer MDC keys introduced in Spark 4.0 release.") + .version("4.0.0") + .booleanConf + .createWithDefault(false) + private[spark] val DRIVER_LOG_LOCAL_DIR = ConfigBuilder("spark.driver.log.localDir") .doc("Specifies a local directory to write driver logs and enable Driver Log UI Tab.") diff --git a/docs/core-migration-guide.md b/docs/core-migration-guide.md index 95c7929a6241..28a9dd0f4371 100644 --- a/docs/core-migration-guide.md +++ b/docs/core-migration-guide.md @@ -46,6 +46,8 @@ license: | -
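As a rough illustration of how the new flag plays out at logging time (the config lookup below is a hypothetical stand-in; only the key names follow the diff above): downstream, a Log4j pattern layout would typically reference the value as `%X{task_name}`, or `%X{mdc.taskName}` when the legacy flag is on.

```scala
import org.slf4j.MDC

object MdcKeyDemo extends App {
  // hypothetical stand-in for reading spark.log.legacyTaskNameMdc.enabled
  val conf = Map("spark.log.legacyTaskNameMdc.enabled" -> "false")
  val legacy = conf.getOrElse("spark.log.legacyTaskNameMdc.enabled", "false").toBoolean

  val taskNameMdcKey = if (legacy) "mdc.taskName" else "task_name"

  MDC.put(taskNameMdcKey, "task 1.0 in stage 0.0 (TID 0)")
  try {
    // log statements emitted here carry the task name under the chosen MDC key
  } finally {
    MDC.remove(taskNameMdcKey)
  }
}
```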
(spark) branch master updated (5e49665ac39b -> 553e1b85c42a)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 5e49665ac39b [SPARK-47960][SS] Allow chaining other stateful operators after transformWithState operator add 553e1b85c42a [SPARK-48152][BUILD] Make `spark-profiler` as a part of release and publish to maven central repo No new revisions were added by this update. Summary of changes: .github/workflows/maven_test.yml| 10 +- connector/profiler/README.md| 2 +- connector/profiler/pom.xml | 6 +- dev/create-release/release-build.sh | 2 +- dev/test-dependencies.sh| 2 +- docs/building-spark.md | 7 +++ pom.xml | 3 +++ 7 files changed, 23 insertions(+), 9 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.5 updated: [SPARK-48178][INFRA][3.5] Run `build/scala-213/java-11-17` jobs of `branch-3.5` only if needed
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 15b5d2a55837 [SPARK-48178][INFRA][3.5] Run `build/scala-213/java-11-17` jobs of `branch-3.5` only if needed 15b5d2a55837 is described below commit 15b5d2a558371395547461d7b37f20610432dea0 Author: Dongjoon Hyun AuthorDate: Tue May 7 15:54:50 2024 -0700 [SPARK-48178][INFRA][3.5] Run `build/scala-213/java-11-17` jobs of `branch-3.5` only if needed ### What changes were proposed in this pull request? This PR aims to run `build`, `scala-213`, and `java-11-17` job of `branch-3.5` only if needed to reduce the maximum concurrency of Apache Spark GitHub Action usage. ### Why are the changes needed? To meet ASF Infra GitHub Action policy, we need to reduce the maximum concurrency. - https://infra.apache.org/github-actions-policy.html ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46449 from dongjoon-hyun/SPARK-48178. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml | 9 - 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index fa40b2d0a390..9c3dc95d0f66 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -85,17 +85,16 @@ jobs: sparkr=`./dev/is-changed.py -m sparkr` tpcds=`./dev/is-changed.py -m sql` docker=`./dev/is-changed.py -m docker-integration-tests` - # 'build', 'scala-213', and 'java-11-17' are always true for now. - # It does not save significant time and most of PRs trigger the build. + build=`./dev/is-changed.py -m "core,unsafe,kvstore,avro,utils,network-common,network-shuffle,repl,launcher,examples,sketch,graphx,catalyst,hive-thriftserver,streaming,sql-kafka-0-10,streaming-kafka-0-10,mllib-local,mllib,yarn,mesos,kubernetes,hadoop-cloud,spark-ganglia-lgpl,sql,hive"` precondition=" { - \"build\": \"true\", + \"build\": \"$build\", \"pyspark\": \"$pyspark\", \"sparkr\": \"$sparkr\", \"tpcds-1g\": \"$tpcds\", \"docker-integration-tests\": \"$docker\", - \"scala-213\": \"true\", - \"java-11-17\": \"true\", + \"scala-213\": \"$build\", + \"java-11-17\": \"$build\", \"lint\" : \"true\", \"k8s-integration-tests\" : \"true\", \"breaking-changes-buf\" : \"true\", - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.5 updated: [SPARK-48173][SQL][3.5] CheckAnalysis should see the entire query plan
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 2f8e7cbe98df [SPARK-48173][SQL][3.5] CheckAnalysis should see the entire query plan 2f8e7cbe98df is described below commit 2f8e7cbe98df97ee0ae51a20796192c95e750721 Author: Wenchen Fan AuthorDate: Tue May 7 15:25:15 2024 -0700 [SPARK-48173][SQL][3.5] CheckAnalysis should see the entire query plan backport https://github.com/apache/spark/pull/46439 to 3.5 ### What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/38029 . Some custom check rules need to see the entire query plan tree to get some context, but https://github.com/apache/spark/pull/38029 breaks it as it checks the query plan of dangling CTE relations recursively. This PR fixes it by putting back the dangling CTE relation in the main query plan and then check the main query plan. ### Why are the changes needed? Revert the breaking change to custom check rules ### Does this PR introduce _any_ user-facing change? No for most users. This restores the behavior of Spark 3.3 and earlier for custom check rules. ### How was this patch tested? existing tests. ### Was this patch authored or co-authored using generative AI tooling? No Closes #46442 from cloud-fan/check2. Lead-authored-by: Wenchen Fan Co-authored-by: Wenchen Fan Signed-off-by: Dongjoon Hyun --- .../sql/catalyst/analysis/CheckAnalysis.scala | 38 +++--- 1 file changed, 33 insertions(+), 5 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala index 7f10bdbc80ca..485015f2efab 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala @@ -141,17 +141,45 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB errorClass, missingCol, orderedCandidates, a.origin) } + private def checkUnreferencedCTERelations( + cteMap: mutable.Map[Long, (CTERelationDef, Int, mutable.Map[Long, Int])], + visited: mutable.Map[Long, Boolean], + danglingCTERelations: mutable.ArrayBuffer[CTERelationDef], + cteId: Long): Unit = { +if (visited(cteId)) { + return +} +val (cteDef, _, refMap) = cteMap(cteId) +refMap.foreach { case (id, _) => + checkUnreferencedCTERelations(cteMap, visited, danglingCTERelations, id) +} +danglingCTERelations.append(cteDef) +visited(cteId) = true + } + def checkAnalysis(plan: LogicalPlan): Unit = { val inlineCTE = InlineCTE(alwaysInline = true) val cteMap = mutable.HashMap.empty[Long, (CTERelationDef, Int, mutable.Map[Long, Int])] inlineCTE.buildCTEMap(plan, cteMap) -cteMap.values.foreach { case (relation, refCount, _) => - // If a CTE relation is never used, it will disappear after inline. Here we explicitly check - // analysis for it, to make sure the entire query plan is valid. - if (refCount == 0) checkAnalysis0(relation.child) +val danglingCTERelations = mutable.ArrayBuffer.empty[CTERelationDef] +val visited: mutable.Map[Long, Boolean] = mutable.Map.empty.withDefaultValue(false) +// If a CTE relation is never used, it will disappear after inline. 
Here we explicitly collect +// these dangling CTE relations, and put them back in the main query, to make sure the entire +// query plan is valid. +cteMap.foreach { case (cteId, (_, refCount, _)) => + // If a CTE relation ref count is 0, the other CTE relations that reference it should also be + // collected. This code will also guarantee the leaf relations that do not reference + // any others are collected first. + if (refCount == 0) { +checkUnreferencedCTERelations(cteMap, visited, danglingCTERelations, cteId) + } } // Inline all CTEs in the plan to help check query plan structures in subqueries. -checkAnalysis0(inlineCTE(plan)) +var inlinedPlan: LogicalPlan = inlineCTE(plan) +if (danglingCTERelations.nonEmpty) { + inlinedPlan = WithCTE(inlinedPlan, danglingCTERelations.toSeq) +} +checkAnalysis0(inlinedPlan) plan.setAnalyzed() } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.5 updated: [SPARK-48179][INFRA][3.5] Pin `nbsphinx` to `0.9.3`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new a24ec1d8f76c [SPARK-48179][INFRA][3.5] Pin `nbsphinx` to `0.9.3` a24ec1d8f76c is described below commit a24ec1d8f76c7bf47e491086f14ea202b6806cd8 Author: Dongjoon Hyun AuthorDate: Tue May 7 15:23:24 2024 -0700 [SPARK-48179][INFRA][3.5] Pin `nbsphinx` to `0.9.3` ### What changes were proposed in this pull request? This PR aims to pin `nbsphinx` to `0.9.3` to recover `branch-3.5` CI. ### Why are the changes needed? From yesterday, `branch-3.5` commit build is broken. - https://github.com/apache/spark/actions/runs/8978558438/job/24659197282 ``` Exception occurred: File "/usr/local/lib/python3.9/dist-packages/nbsphinx/__init__.py", line 1316, in apply for section in self.document.findall(docutils.nodes.section): AttributeError: 'document' object has no attribute 'findall' The full traceback has been saved in /tmp/sphinx-err-qz4y0bav.log, if you want to report the issue to the developers. ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs on this PR. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46448 from dongjoon-hyun/nbsphinx. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 8488540b415d..fa40b2d0a390 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -682,7 +682,7 @@ jobs: # See also https://issues.apache.org/jira/browse/SPARK-35375. # Pin the MarkupSafe to 2.0.1 to resolve the CI error. # See also https://issues.apache.org/jira/browse/SPARK-38279. -python3.9 -m pip install 'sphinx<3.1.0' mkdocs pydata_sphinx_theme 'sphinx-copybutton==0.5.2' nbsphinx numpydoc 'jinja2<3.0.0' 'markupsafe==2.0.1' 'pyzmq<24.0.0' 'sphinxcontrib-applehelp==1.0.4' 'sphinxcontrib-devhelp==1.0.2' 'sphinxcontrib-htmlhelp==2.0.1' 'sphinxcontrib-qthelp==1.0.3' 'sphinxcontrib-serializinghtml==1.1.5' 'nest-asyncio==1.5.8' 'rpds-py==0.16.2' 'alabaster==0.7.13' +python3.9 -m pip install 'sphinx<3.1.0' mkdocs pydata_sphinx_theme 'sphinx-copybutton==0.5.2' 'nbsphinx==0.9.3' numpydoc 'jinja2<3.0.0' 'markupsafe==2.0.1' 'pyzmq<24.0.0' 'sphinxcontrib-applehelp==1.0.4' 'sphinxcontrib-devhelp==1.0.2' 'sphinxcontrib-htmlhelp==2.0.1' 'sphinxcontrib-qthelp==1.0.3' 'sphinxcontrib-serializinghtml==1.1.5' 'nest-asyncio==1.5.8' 'rpds-py==0.16.2' 'alabaster==0.7.13' python3.9 -m pip install ipython_genutils # See SPARK-38517 python3.9 -m pip install sphinx_plotly_directive 'numpy>=1.20.0' 'pyarrow==12.0.1' pandas 'plotly>=4.8' python3.9 -m pip install 'docutils<0.18.0' # See SPARK-39421 - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.5 updated: [SPARK-48167][PYTHON][TESTS][FOLLOWUP][3.5] Reformat test_readwriter.py to fix Python Linter error
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 03bc2b188d21 [SPARK-48167][PYTHON][TESTS][FOLLOWUP][3.5] Reformat test_readwriter.py to fix Python Linter error 03bc2b188d21 is described below commit 03bc2b188d2111b5c4cc5bc13ebd0455602028a8 Author: Dongjoon Hyun AuthorDate: Tue May 7 13:38:08 2024 -0700 [SPARK-48167][PYTHON][TESTS][FOLLOWUP][3.5] Reformat test_readwriter.py to fix Python Linter error ### What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/46430 to fix Python linter failure. ### Why are the changes needed? To recover `branch-3.5` CI, - https://github.com/apache/spark/actions/runs/8981228745/job/24666400664 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass Python Linter in this PR builder. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46445 from dongjoon-hyun/SPARK-48167. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- python/pyspark/sql/tests/test_readwriter.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/python/pyspark/sql/tests/test_readwriter.py b/python/pyspark/sql/tests/test_readwriter.py index e903d3383b74..7911a82c61fc 100644 --- a/python/pyspark/sql/tests/test_readwriter.py +++ b/python/pyspark/sql/tests/test_readwriter.py @@ -247,7 +247,8 @@ class ReadwriterV2TestsMixin: self.assertEqual(100, self.spark.sql("select * from test_table").count()) @unittest.skipIf( -"SPARK_SKIP_CONNECT_COMPAT_TESTS" in os.environ, "Known behavior change in 4.0") +"SPARK_SKIP_CONNECT_COMPAT_TESTS" in os.environ, "Known behavior change in 4.0" +) def test_create_without_provider(self): df = self.df with self.assertRaisesRegex( - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (26c50369edb2 -> e24f8965e066)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 26c50369edb2 [SPARK-48174][INFRA] Merge `connect` back to the original test pipeline add e24f8965e066 [SPARK-48037][CORE] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data No new revisions were added by this update. Summary of changes: .../spark/shuffle/sort/SortShuffleManager.scala| 2 +- .../spark/shuffle/sort/SortShuffleWriter.scala | 6 +++--- .../spark/util/collection/ExternalSorter.scala | 9 + .../shuffle/sort/SortShuffleWriterSuite.scala | 3 +++ .../sql/execution/UnsafeRowSerializerSuite.scala | 3 ++- .../adaptive/AdaptiveQueryExecSuite.scala | 23 ++ 6 files changed, 37 insertions(+), 9 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48174][INFRA] Merge `connect` back to the original test pipeline
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 26c50369edb2 [SPARK-48174][INFRA] Merge `connect` back to the original test pipeline 26c50369edb2 is described below commit 26c50369edb21d616361a4b22a555ed7b7412a4e Author: Dongjoon Hyun AuthorDate: Tue May 7 09:34:59 2024 -0700 [SPARK-48174][INFRA] Merge `connect` back to the original test pipeline ### What changes were proposed in this pull request? This PR aims to merge `connect` back to the original test pipeline to reduce the maximum concurrency of GitHub Actions by one. - https://infra.apache.org/github-actions-policy.html > All workflows SHOULD have a job concurrency level less than or equal to 15. ### Why are the changes needed? This is a partial recovery from the following. - #45107 We addressed the root cause of #45107 via the following PRs. In addition, we will disable flaky test cases if any remain. - #46395 - #46396 - #46425 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46441 from dongjoon-hyun/SPARK-48174. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 286f8e1193d9..00ba16265dce 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -156,9 +156,8 @@ jobs: mllib-local, mllib, graphx - >- streaming, sql-kafka-0-10, streaming-kafka-0-10, streaming-kinesis-asl, -kubernetes, hadoop-cloud, spark-ganglia-lgpl, protobuf +kubernetes, hadoop-cloud, spark-ganglia-lgpl, protobuf, connect - yarn - - connect # Here, we split Hive and SQL tests into some of slow ones and the rest of them. included-tags: [""] excluded-tags: [""] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48035][SQL][FOLLOWUP] Fix try_add/try_multiply being semantic equal to add/multiply
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 808186835077 [SPARK-48035][SQL][FOLLOWUP] Fix try_add/try_multiply being semantic equal to add/multiply 808186835077 is described below commit 808186835077cf50f10262c633f19de4ccc09d9d Author: Supun Nakandala AuthorDate: Tue May 7 09:17:01 2024 -0700 [SPARK-48035][SQL][FOLLOWUP] Fix try_add/try_multiply being semantic equal to add/multiply ### What changes were proposed in this pull request? - This is a follow-up to the previous PR: https://github.com/apache/spark/pull/46307. - With the new changes we do the evalMode check in the `collectOperands` function instead of introducing a new function. ### Why are the changes needed? - Better code quality and readability. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Existing unit tests. ### Was this patch authored or co-authored using generative AI tooling? - No Closes #46414 from db-scnakandala/db-scnakandala/master. Authored-by: Supun Nakandala Signed-off-by: Dongjoon Hyun --- .../sql/catalyst/expressions/Expression.scala | 14 - .../sql/catalyst/expressions/arithmetic.scala | 23 -- 2 files changed, 8 insertions(+), 29 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala index 2759f5a29c79..de15ec43c4f3 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala @@ -1378,20 +1378,6 @@ trait CommutativeExpression extends Expression { } reorderResult } - - /** - * Helper method to collect the evaluation mode of the commutative expressions. This is - * used by the canonicalized methods of [[Add]] and [[Multiply]] operators to ensure that - * all operands have the same evaluation mode before reordering the operands. 
- */ - protected def collectEvalModes( - e: Expression, - f: PartialFunction[CommutativeExpression, Seq[EvalMode.Value]] - ): Seq[EvalMode.Value] = e match { -case c: CommutativeExpression if f.isDefinedAt(c) => - f(c) ++ c.children.flatMap(collectEvalModes(_, f)) -case _ => Nil - } } /** diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala index 91c10a53af8a..a085a4e3a8a3 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala @@ -452,14 +452,12 @@ case class Add( copy(left = newLeft, right = newRight) override lazy val canonicalized: Expression = { -val evalModes = collectEvalModes(this, {case Add(_, _, evalMode) => Seq(evalMode)}) -lazy val reorderResult = buildCanonicalizedPlan( - { case Add(l, r, _) => Seq(l, r) }, +val reorderResult = buildCanonicalizedPlan( + { case Add(l, r, em) if em == evalMode => Seq(l, r) }, { case (l: Expression, r: Expression) => Add(l, r, evalMode)}, Some(evalMode) ) -if (resolved && evalModes.forall(_ == evalMode) && reorderResult.resolved && - reorderResult.dataType == dataType) { +if (resolved && reorderResult.resolved && reorderResult.dataType == dataType) { reorderResult } else { // SPARK-40903: Avoid reordering decimal Add for canonicalization if the result data type is @@ -609,16 +607,11 @@ case class Multiply( newLeft: Expression, newRight: Expression): Multiply = copy(left = newLeft, right = newRight) override lazy val canonicalized: Expression = { -val evalModes = collectEvalModes(this, {case Multiply(_, _, evalMode) => Seq(evalMode)}) -if (evalModes.forall(_ == evalMode)) { - buildCanonicalizedPlan( -{ case Multiply(l, r, _) => Seq(l, r) }, -{ case (l: Expression, r: Expression) => Multiply(l, r, evalMode)}, -Some(evalMode) - ) -} else { - withCanonicalizedChildren -} +buildCanonicalizedPlan( + { case Multiply(l, r, em) if em == evalMode => Seq(l, r) }, + { case (l: Expression, r: Expression) => Multiply(l, r, evalMode) }, + Some(evalMode) +) } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
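A quick way to see what the follow-up guards against, sketched against Catalyst's internal API (class and method names as they appear in the diff; this is a REPL-style snippet that assumes a Spark dev build on the classpath, not a definitive test): an `Add` in TRY mode must not canonicalize to the same form as an `Add` over the same operands in another eval mode.

```scala
import org.apache.spark.sql.catalyst.expressions.{Add, AttributeReference, EvalMode}
import org.apache.spark.sql.types.IntegerType

val a = AttributeReference("a", IntegerType)()
val b = AttributeReference("b", IntegerType)()

val tryAdd  = Add(a, b, EvalMode.TRY)   // roughly what try_add(a, b) resolves to
val ansiAdd = Add(a, b, EvalMode.ANSI)  // roughly what a + b resolves to under ANSI

// With the evalMode check folded into collectOperands, the canonical forms differ,
// so the two expressions are no longer treated as semantically equal.
assert(!tryAdd.semanticEquals(ansiAdd))
```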
(spark) branch master updated: [SPARK-41547][CONNECT][TESTS] Re-enable Spark Connect function tests with ANSI mode
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8f719adcf556 [SPARK-41547][CONNECT][TESTS] Re-eneable Spark Connect function tests with ANSI mode 8f719adcf556 is described below commit 8f719adcf556f23ba66d3742266f4ca2e4875530 Author: Martin Grund AuthorDate: Tue May 7 09:14:06 2024 -0700 [SPARK-41547][CONNECT][TESTS] Re-eneable Spark Connect function tests with ANSI mode ### What changes were proposed in this pull request? This patch re-enables the previously failing tests after enablement of ANSI SQL. ### Why are the changes needed? Spark 4 / ANSI SQL ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Re-enabled tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #46432 from grundprinzip/grundprinzip/SPARK-41547. Authored-by: Martin Grund Signed-off-by: Dongjoon Hyun --- .../sql/tests/connect/test_connect_function.py | 33 ++ 1 file changed, 21 insertions(+), 12 deletions(-) diff --git a/python/pyspark/sql/tests/connect/test_connect_function.py b/python/pyspark/sql/tests/connect/test_connect_function.py index 2f21dd5a7d3a..9d4db8cf7d15 100644 --- a/python/pyspark/sql/tests/connect/test_connect_function.py +++ b/python/pyspark/sql/tests/connect/test_connect_function.py @@ -2030,7 +2030,6 @@ class SparkConnectFunctionTests(ReusedConnectTestCase, PandasOnSparkTestUtils, S (CF.sentences, SF.sentences), (CF.initcap, SF.initcap), (CF.soundex, SF.soundex), -(CF.bin, SF.bin), (CF.hex, SF.hex), (CF.unhex, SF.unhex), (CF.length, SF.length), @@ -2043,6 +2042,19 @@ class SparkConnectFunctionTests(ReusedConnectTestCase, PandasOnSparkTestUtils, S sdf.select(sfunc("a"), sfunc(sdf.b)).toPandas(), ) +query = """ +SELECT * FROM VALUES +(' 1 ', '2 ', NULL), (' 3', NULL, '4') +AS tab(a, b, c) +""" +cdf = self.connect.sql(query) +sdf = self.spark.sql(query) + +self.assert_eq( +cdf.select(CF.bin(cdf.a), CF.bin(cdf.b)).toPandas(), +sdf.select(SF.bin(sdf.a), SF.bin(sdf.b)).toPandas(), +) + def test_string_functions_multi_args(self): query = """ SELECT * FROM VALUES @@ -2149,15 +2161,15 @@ class SparkConnectFunctionTests(ReusedConnectTestCase, PandasOnSparkTestUtils, S def test_date_ts_functions(self): query = """ SELECT * FROM VALUES -('1997/02/28 10:30:00', '2023/03/01 06:00:00', 'JST', 1428476400, 2020, 12, 6), -('2000/01/01 04:30:05', '2020/05/01 12:15:00', 'PST', 1403892395, 2022, 12, 6) +('1997-02-28 10:30:00', '2023-03-01 06:00:00', 'JST', 1428476400, 2020, 12, 6), +('2000-01-01 04:30:05', '2020-05-01 12:15:00', 'PST', 1403892395, 2022, 12, 6) AS tab(ts1, ts2, tz, seconds, Y, M, D) """ # +---+---+---+--++---+---+ # |ts1|ts2| tz| seconds| Y| M| D| # +---+---+---+--++---+---+ -# |1997/02/28 10:30:00|2023/03/01 06:00:00|JST|1428476400|2020| 12| 6| -# |2000/01/01 04:30:05|2020/05/01 12:15:00|PST|1403892395|2022| 12| 6| +# |1997-02-28 10:30:00|2023-03-01 06:00:00|JST|1428476400|2020| 12| 6| +# |2000-01-01 04:30:05|2020-05-01 12:15:00|PST|1403892395|2022| 12| 6| # +---+---+---+--++---+---+ cdf = self.connect.sql(query) @@ -2213,14 +2225,14 @@ class SparkConnectFunctionTests(ReusedConnectTestCase, PandasOnSparkTestUtils, S (CF.to_date, SF.to_date), ]: self.assert_eq( -cdf.select(cfunc(cdf.ts1, format="-MM-dd")).toPandas(), -sdf.select(sfunc(sdf.ts1, format="-MM-dd")).toPandas(), +cdf.select(cfunc(cdf.ts1, format="-MM-dd HH:mm:ss")).toPandas(), 
+sdf.select(sfunc(sdf.ts1, format="-MM-dd HH:mm:ss")).toPandas(), ) self.compare_by_show( # [left]: datetime64[ns, America/Los_Angeles] # [right]: datetime64[ns] -cdf.select(CF.to_timestamp(cdf.ts1, format="-MM-dd")), -sdf.select(SF.to_timestamp(sdf.ts1, format="-MM-dd")), +cdf.select(CF.to_timestamp(cdf.ts1,
(spark) branch master updated (925457cadd22 -> a3eebcf39687)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 925457cadd22 [SPARK-48169][SQL] Use lazy BadRecordException cause in all parsers and remove the old constructor, which was meant for the migration add a3eebcf39687 [SPARK-48170][PYTHON][CONNECT][TESTS] Enable `ArrowPythonUDFParityTests.test_err_return_type` No new revisions were added by this update. Summary of changes: python/pyspark/sql/tests/connect/test_parity_arrow_python_udf.py | 4 1 file changed, 4 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48169][SQL] Use lazy BadRecordException cause in all parsers and remove the old constructor, which was meant for the migration
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 925457cadd22 [SPARK-48169][SQL] Use lazy BadRecordException cause in all parsers and remove the old constructor, which was meant for the migration 925457cadd22 is described below commit 925457cadd229673323e91a82d0b504145f509e0 Author: Vladimir Golubev AuthorDate: Tue May 7 09:09:00 2024 -0700 [SPARK-48169][SQL] Use lazy BadRecordException cause in all parsers and remove the old constructor, which was meant for the migration ### What changes were proposed in this pull request? Use factory function for the exception cause in `BadRecordException` to avoid constructing heavy exceptions in the underlying parser. Now they are constructed on-demand in `FailureSafeParser`. A follow-up for https://github.com/apache/spark/pull/46400 ### Why are the changes needed? - Speed-up `JacksonParser` and `StaxXmlParser`, since they throw user-facing exceptions to `FailureSafeParser` - Refactoring - leave only one constructor in `BadRecordException` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - `testOnly org.apache.spark.sql.catalyst.json.JacksonParserSuite` - `testOnly org.apache.spark.sql.catalyst.csv.UnivocityParserSuite` ### Was this patch authored or co-authored using generative AI tooling? No Closes #46438 from vladimirg-db/vladimirg-db/use-lazy-exception-cause-in-all-bad-record-exception-invocations. Authored-by: Vladimir Golubev Signed-off-by: Dongjoon Hyun --- .../spark/sql/catalyst/csv/UnivocityParser.scala | 2 +- .../spark/sql/catalyst/json/JacksonParser.scala| 12 ++-- .../sql/catalyst/util/BadRecordException.scala | 10 +- .../spark/sql/catalyst/xml/StaxXmlParser.scala | 22 -- 4 files changed, 20 insertions(+), 26 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala index 37d9143e5b5a..8d06789a7512 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala @@ -359,7 +359,7 @@ class UnivocityParser( } else { if (badRecordException.isDefined) { throw BadRecordException( - () => currentInput, () => Array[InternalRow](requiredRow.get), badRecordException.get) + () => currentInput, () => Array(requiredRow.get), badRecordException.get) } else { requiredRow } diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala index d1093a3b1be1..3c42f72fa6b6 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala @@ -613,7 +613,7 @@ class JacksonParser( // JSON parser currently doesn't support partial results for corrupted records. // For such records, all fields other than the field configured by // `columnNameOfCorruptRecord` are set to `null`. -throw BadRecordException(() => recordLiteral(record), cause = e) +throw BadRecordException(() => recordLiteral(record), cause = () => e) case e: CharConversionException if options.encoding.isEmpty => val msg = """JSON parser cannot handle a character in its input. 
@@ -621,17 +621,17 @@ class JacksonParser( |""".stripMargin + e.getMessage val wrappedCharException = new CharConversionException(msg) wrappedCharException.initCause(e) -throw BadRecordException(() => recordLiteral(record), cause = wrappedCharException) +throw BadRecordException(() => recordLiteral(record), cause = () => wrappedCharException) case PartialResultException(row, cause) => throw BadRecordException( record = () => recordLiteral(record), partialResults = () => Array(row), - convertCauseForPartialResult(cause)) + cause = () => convertCauseForPartialResult(cause)) case PartialResultArrayException(rows, cause) => throw BadRecordException( record = () => recordLiteral(record), partialResults = () => rows, - cause) + cause = () => cause) // These exceptions should never be thrown outside of JacksonParser. // They are used for the control flow in the par
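The core of the change is the `cause: () => Throwable` factory: the user-facing exception is only materialized if the failure-safe parser actually needs it, rather than being built eagerly for every bad record. A simplified stand-in (not Spark's actual classes) showing the deferred-construction pattern:

```scala
// The cause is a thunk, so building the (potentially expensive) wrapper exception
// is deferred until someone actually calls cause().
final case class ToyBadRecordException(
    record: () => String,
    cause: () => Throwable) extends Exception

def parseInt(line: String): Int =
  try line.trim.toInt
  catch {
    case e: NumberFormatException =>
      throw ToyBadRecordException(
        record = () => line,
        cause = () => new IllegalArgumentException(s"Malformed record: $line", e))
  }

// A permissive caller can swallow the bad record without ever paying for cause():
def parseOrZero(line: String): Int =
  try parseInt(line) catch { case _: ToyBadRecordException => 0 }
```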
(spark) branch master updated (493493d6c5bb -> 9e0a87eb4cf2)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 493493d6c5bb [SPARK-48173][SQL] CheckAnalysis should see the entire query plan add 9e0a87eb4cf2 [SPARK-48165][BUILD] Update `ap-loader` to 3.0-9 No new revisions were added by this update. Summary of changes: connector/profiler/pom.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48173][SQL] CheckAnalysis should see the entire query plan
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 493493d6c5bb [SPARK-48173][SQL] CheckAnalysis should see the entire query plan 493493d6c5bb is described below commit 493493d6c5bbbaa0b04f5548ac1ccd9502e8b8fa Author: Wenchen Fan AuthorDate: Tue May 7 08:02:25 2024 -0700 [SPARK-48173][SQL] CheckAnalysis should see the entire query plan ### What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/38029 . Some custom check rules need to see the entire query plan tree to get some context, but https://github.com/apache/spark/pull/38029 breaks it as it checks the query plan of dangling CTE relations recursively. This PR fixes it by putting back the dangling CTE relation in the main query plan and then check the main query plan. ### Why are the changes needed? Revert the breaking change to custom check rules ### Does this PR introduce _any_ user-facing change? No for most users. This restores the behavior of Spark 3.3 and earlier for custom check rules. ### How was this patch tested? existing tests. ### Was this patch authored or co-authored using generative AI tooling? No Closes #46439 from cloud-fan/check. Authored-by: Wenchen Fan Signed-off-by: Dongjoon Hyun --- .../sql/catalyst/analysis/CheckAnalysis.scala | 39 +++--- 1 file changed, 20 insertions(+), 19 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala index d1b336b08955..e55f23b6aa86 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala @@ -145,15 +145,16 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB private def checkUnreferencedCTERelations( cteMap: mutable.Map[Long, (CTERelationDef, Int, mutable.Map[Long, Int])], visited: mutable.Map[Long, Boolean], + danglingCTERelations: mutable.ArrayBuffer[CTERelationDef], cteId: Long): Unit = { if (visited(cteId)) { return } val (cteDef, _, refMap) = cteMap(cteId) refMap.foreach { case (id, _) => - checkUnreferencedCTERelations(cteMap, visited, id) + checkUnreferencedCTERelations(cteMap, visited, danglingCTERelations, id) } -checkAnalysis0(cteDef.child) +danglingCTERelations.append(cteDef) visited(cteId) = true } @@ -161,35 +162,35 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB val inlineCTE = InlineCTE(alwaysInline = true) val cteMap = mutable.HashMap.empty[Long, (CTERelationDef, Int, mutable.Map[Long, Int])] inlineCTE.buildCTEMap(plan, cteMap) +val danglingCTERelations = mutable.ArrayBuffer.empty[CTERelationDef] val visited: mutable.Map[Long, Boolean] = mutable.Map.empty.withDefaultValue(false) -cteMap.foreach { case (cteId, (relation, refCount, _)) => - // If a CTE relation is never used, it will disappear after inline. Here we explicitly check - // analysis for it, to make sure the entire query plan is valid. - try { -// If a CTE relation ref count is 0, the other CTE relations that reference it -// should also be checked by checkAnalysis0. This code will also guarantee the leaf -// relations that do not reference any others are checked first. 
-if (refCount == 0) { - checkUnreferencedCTERelations(cteMap, visited, cteId) -} - } catch { -case e: AnalysisException => - throw new ExtendedAnalysisException(e, relation.child) +// If a CTE relation is never used, it will disappear after inline. Here we explicitly collect +// these dangling CTE relations, and put them back in the main query, to make sure the entire +// query plan is valid. +cteMap.foreach { case (cteId, (_, refCount, _)) => + // If a CTE relation ref count is 0, the other CTE relations that reference it should also be + // collected. This code will also guarantee the leaf relations that do not reference + // any others are collected first. + if (refCount == 0) { +checkUnreferencedCTERelations(cteMap, visited, danglingCTERelations, cteId) } } // Inline all CTEs in the plan to help check query plan structures in subqueries. -var inlinedPlan: Option[LogicalPlan] = None +var inlinedPlan: LogicalPlan = plan try { - inlinedPlan = Some(inlineCTE(plan)) + inlinedPlan = inlineCTE(plan)
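For reference, a "dangling" CTE relation is simply one that is defined but never referenced, so it would vanish once CTEs are inlined; the patch re-attaches such definitions to the main plan (via `WithCTE`) before running `checkAnalysis0`, so custom check rules see one complete tree. A minimal query that produces this situation (assumes an active `SparkSession` named `spark`):

```scala
// `unused` is a dangling CTE: nothing references it, so inlining alone would drop it.
val df = spark.sql(
  """
    |WITH unused AS (SELECT 1 AS x),
    |     used   AS (SELECT 2 AS y)
    |SELECT y FROM used
    |""".stripMargin)

df.show()  // still a valid query; analysis now checks `unused` as part of the main plan
```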
(spark) branch master updated: [SPARK-48171][CORE] Clean up the use of deprecated constructors of `o.rocksdb.Logger`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c326f3c143ff [SPARK-48171][CORE] Clean up the use of deprecated constructors of `o.rocksdb.Logger` c326f3c143ff is described below commit c326f3c143ffdd56954706aeb4e0b82ac819bf03 Author: yangjie01 AuthorDate: Tue May 7 07:33:38 2024 -0700 [SPARK-48171][CORE] Clean up the use of deprecated constructors of `o.rocksdb.Logger` ### What changes were proposed in this pull request? This pr aims to clean up the use of deprecated constructors of `o.rocksdb.Logger`, the change ref to https://github.com/facebook/rocksdb/blob/5c2be544f5509465957706c955b6d623e889ac4e/java/src/main/java/org/rocksdb/Logger.java#L39-L54 ``` /** * AbstractLogger constructor. * * Important: the log level set within * the {link org.rocksdb.Options} instance will be used as * maximum log level of RocksDB. * * param options {link org.rocksdb.Options} instance. * * deprecated Use {link Logger#Logger(InfoLogLevel)} instead, e.g. {code new * Logger(options.infoLogLevel())}. */ Deprecated public Logger(final Options options) { this(options.infoLogLevel()); } ``` ### Why are the changes needed? Clean up deprecated api usage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #46436 from LuciferYang/rocksdb-deprecation. Authored-by: yangjie01 Signed-off-by: Dongjoon Hyun --- .../src/main/java/org/apache/spark/network/util/RocksDBProvider.java| 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/common/network-common/src/main/java/org/apache/spark/network/util/RocksDBProvider.java b/common/network-common/src/main/java/org/apache/spark/network/util/RocksDBProvider.java index f3b7b48355a0..2b5ea01d94c9 100644 --- a/common/network-common/src/main/java/org/apache/spark/network/util/RocksDBProvider.java +++ b/common/network-common/src/main/java/org/apache/spark/network/util/RocksDBProvider.java @@ -136,7 +136,7 @@ public class RocksDBProvider { private static final Logger LOG = LoggerFactory.getLogger(RocksDBLogger.class); RocksDBLogger(Options options) { - super(options); + super(options.infoLogLevel()); } @Override - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48163][CONNECT][TESTS] Disable `SparkConnectServiceSuite.SPARK-43923: commands send events - get_resources_command`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 56fe185c78a2 [SPARK-48163][CONNECT][TESTS] Disable `SparkConnectServiceSuite.SPARK-43923: commands send events - get_resources_command` 56fe185c78a2 is described below commit 56fe185c78a249cf88b1d7e5d1e67444e1b224db Author: Dongjoon Hyun AuthorDate: Mon May 6 21:39:52 2024 -0700 [SPARK-48163][CONNECT][TESTS] Disable `SparkConnectServiceSuite.SPARK-43923: commands send events - get_resources_command` ### What changes were proposed in this pull request? This PR aims to disable a flaky test, `SparkConnectServiceSuite.SPARK-43923: commands send events - get_resources_command`, temporarily. To re-enable this, SPARK-48164 is created as a blocker issue for 4.0.0. ### Why are the changes needed? This test case was added at `Apache Spark 3.5.0`, but it has been flaky and causes many re-tries in our GitHub Action CI environment. - https://github.com/apache/spark/pull/42454 - https://github.com/apache/spark/actions/runs/8979348499/job/24661200052 ``` [info] - SPARK-43923: commands send events ((get_resources_command { [info] } [info] ,None)) *** FAILED *** (35 milliseconds) [info] VerifyEvents.this.listener.executeHolder.isDefined was false (SparkConnectServiceSuite.scala:873) ``` This PR aims to stabilize CI first and to focus this flaky issue as a blocker level before going on `Spark Connect GA` in SPARK-48164 before Apache Spark 4.0.0. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46425 from dongjoon-hyun/SPARK-48163. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../apache/spark/sql/connect/planner/SparkConnectServiceSuite.scala| 3 +++ 1 file changed, 3 insertions(+) diff --git a/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectServiceSuite.scala b/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectServiceSuite.scala index af18fca9dd21..59d9750c0fbf 100644 --- a/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectServiceSuite.scala +++ b/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectServiceSuite.scala @@ -418,11 +418,14 @@ class SparkConnectServiceSuite .setInput( proto.Relation.newBuilder().setSql(proto.SQL.newBuilder().setQuery("select 1", None), + // TODO(SPARK-48164) Reenable `commands send events - get_resources_command` + /* ( proto.Command .newBuilder() .setGetResourcesCommand(proto.GetResourcesCommand.newBuilder()), None), + */ ( proto.Command .newBuilder() - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48141][TEST] Update the Oracle docker image version used for test and integration to use Oracle Database 23ai Free
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 05b22ebb3060 [SPARK-48141][TEST] Update the Oracle docker image version used for test and integration to use Oracle Database 23ai Free 05b22ebb3060 is described below commit 05b22ebb30606a76c50e649a6efa825f03ca97ff Author: Luca Canali AuthorDate: Mon May 6 20:44:51 2024 -0700 [SPARK-48141][TEST] Update the Oracle docker image version used for test and integration to use Oracle Database 23ai Free ### What changes were proposed in this pull request? This proposes to update the Docker image used for integration tests and builds to Oracle Database 23ai Free, version 23.4 (previously we used Oracle Database 23c Free, version 23.3) ### Why are the changes needed? This is to keep the testing infrastructure up-to-date with the latest Oracle Database Free version. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test infrastructure. ### Was this patch authored or co-authored using generative AI tooling? No Closes #46399 from LucaCanali/updateOracleImage. Lead-authored-by: Luca Canali Co-authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml| 1 - connector/docker-integration-tests/README.md| 2 +- .../test/scala/org/apache/spark/sql/jdbc/OracleDatabaseOnDocker.scala | 2 +- 3 files changed, 2 insertions(+), 3 deletions(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index b34456fc3e42..286f8e1193d9 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -928,7 +928,6 @@ jobs: HIVE_PROFILE: hive2.3 GITHUB_PREV_SHA: ${{ github.event.before }} SPARK_LOCAL_IP: localhost - ORACLE_DOCKER_IMAGE_NAME: gvenzl/oracle-free:23.3 SKIP_UNIDOC: true SKIP_MIMA: true SKIP_PACKAGING: true diff --git a/connector/docker-integration-tests/README.md b/connector/docker-integration-tests/README.md index 0192947bdbf9..03d3fe706a60 100644 --- a/connector/docker-integration-tests/README.md +++ b/connector/docker-integration-tests/README.md @@ -45,7 +45,7 @@ the container bootstrapping. To run an individual Docker integration test, use t Besides the default Docker images, the integration tests can be run with custom Docker images. 
For example, -ORACLE_DOCKER_IMAGE_NAME=gvenzl/oracle-free:23.3-slim-faststart ./build/sbt -Pdocker-integration-tests "docker-integration-tests/testOnly *OracleIntegrationSuite" +ORACLE_DOCKER_IMAGE_NAME=gvenzl/oracle-free:23.4-slim-faststart ./build/sbt -Pdocker-integration-tests "docker-integration-tests/testOnly *OracleIntegrationSuite" The following environment variables can be used to specify the custom Docker images for different databases: diff --git a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleDatabaseOnDocker.scala b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleDatabaseOnDocker.scala index bfbcf5b533d7..88bb23f9c653 100644 --- a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleDatabaseOnDocker.scala +++ b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleDatabaseOnDocker.scala @@ -26,7 +26,7 @@ import org.apache.spark.util.Utils class OracleDatabaseOnDocker extends DatabaseOnDocker with Logging { lazy override val imageName = -sys.env.getOrElse("ORACLE_DOCKER_IMAGE_NAME", "gvenzl/oracle-free:23.3-slim") +sys.env.getOrElse("ORACLE_DOCKER_IMAGE_NAME", "gvenzl/oracle-free:23.4-slim") val oracle_password = "Th1s1sThe0racle#Pass" override val env = Map( "ORACLE_PWD" -> oracle_password, // oracle images uses this - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48150][SQL] try_parse_json output should be declared as nullable
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8cf602a3f587 [SPARK-48150][SQL] try_parse_json output should be declared as nullable 8cf602a3f587 is described below commit 8cf602a3f587af4acc15637878437f166db4ed3f Author: Josh Rosen AuthorDate: Mon May 6 20:08:56 2024 -0700 [SPARK-48150][SQL] try_parse_json output should be declared as nullable ### What changes were proposed in this pull request? The `try_parse_json` expression added in https://github.com/apache/spark/pull/46141 declares improper output nullability: the `try_` version's output must be marked as nullable. This PR corrects the nullability and adds a test. ### Why are the changes needed? Incorrectly declaring an expression's output as non-nullable when it is actually nullable may lead to crashes. ### Does this PR introduce _any_ user-facing change? Yes, it affects output nullability and thus may affect query result schemas. ### How was this patch tested? New unit test cases. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46409 from JoshRosen/fix-try-parse-json-nullability. Authored-by: Josh Rosen Signed-off-by: Dongjoon Hyun --- .../query-tests/explain-results/function_try_parse_json.explain | 2 +- .../sql/catalyst/expressions/variant/variantExpressions.scala| 2 +- .../catalyst/expressions/variant/VariantExpressionSuite.scala| 9 + 3 files changed, 11 insertions(+), 2 deletions(-) diff --git a/connector/connect/common/src/test/resources/query-tests/explain-results/function_try_parse_json.explain b/connector/connect/common/src/test/resources/query-tests/explain-results/function_try_parse_json.explain index 1772b5d37623..5c6b21a3ad46 100644 --- a/connector/connect/common/src/test/resources/query-tests/explain-results/function_try_parse_json.explain +++ b/connector/connect/common/src/test/resources/query-tests/explain-results/function_try_parse_json.explain @@ -1,2 +1,2 @@ -Project [staticinvoke(class org.apache.spark.sql.catalyst.expressions.variant.VariantExpressionEvalUtils$, VariantType, parseJson, g#0, false, StringType, BooleanType, true, false, true) AS try_parse_json(g)#0] +Project [staticinvoke(class org.apache.spark.sql.catalyst.expressions.variant.VariantExpressionEvalUtils$, VariantType, parseJson, g#0, false, StringType, BooleanType, true, true, true) AS try_parse_json(g)#0] +- LocalRelation , [id#0L, a#0, b#0, d#0, e#0, f#0, g#0] diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala index 3dbc72415ff0..5026d8e49ef1 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala @@ -59,7 +59,7 @@ case class ParseJson(child: Expression, failOnError: Boolean = true) "parseJson", Seq(child, Literal(failOnError, BooleanType)), inputTypes :+ BooleanType, -returnNullable = false) +returnNullable = !failOnError) override def inputTypes: Seq[AbstractDataType] = StringType :: Nil diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionSuite.scala 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionSuite.scala index f4a6a144c221..73abf8074e8c 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionSuite.scala @@ -810,6 +810,15 @@ class VariantExpressionSuite extends SparkFunSuite with ExpressionEvalHelper { "Hello") } + test("SPARK-48150: ParseJson expression nullability") { +assert(!ParseJson(Literal("["), failOnError = true).replacement.nullable) +assert(ParseJson(Literal("["), failOnError = false).replacement.nullable) +checkEvaluation( + ParseJson(Literal("["), failOnError = false).replacement, + null +) + } + test("cast to variant") { def check[T : TypeTag](input: T, expectedJson: String): Unit = { val cast = Cast(Literal.create(input), VariantType, evalMode = EvalMode.ANSI) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
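A minimal PySpark sketch (not part of the patch) of the behaviour that motivates the nullability fix, assuming a Spark build where `parse_json`/`try_parse_json` are available as SQL functions and an active SparkSession:

```
# Sketch only: try_parse_json returns NULL on malformed input, so its output
# must be declared nullable; parse_json (failOnError = true) stays non-nullable.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.sql("""
    SELECT parse_json('{"a": 1}')     AS parsed,
           try_parse_json('{"a": 1}') AS try_parsed,
           try_parse_json('[')        AS try_bad   -- malformed input: evaluates to NULL
""")
df.printSchema()            # try_parsed / try_bad are reported as nullable
df.show(truncate=False)
```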
(spark) branch master updated (f918d1179642 -> 0907a15b2d15)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from f918d1179642 [SPARK-48151][INFRA] `build_and_test.yml` should use `Volcano` 1.7.0 for `branch-3.4/3.5` add 0907a15b2d15 [SPARK-48153][INFRA] Run `build` job of `build_and_test.yml` only if needed No new revisions were added by this update. Summary of changes: .github/workflows/build_and_test.yml | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.5 updated: [SPARK-48088][PYTHON][CONNECT][TESTS][FOLLOW-UP][3.5] Skips another that that requires JVM access
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new e699a1eee085 [SPARK-48088][PYTHON][CONNECT][TESTS][FOLLOW-UP][3.5] Skips another that that requires JVM access e699a1eee085 is described below commit e699a1eee085eb6025f33284c6369553713794d1 Author: Hyukjin Kwon AuthorDate: Mon May 6 19:06:29 2024 -0700 [SPARK-48088][PYTHON][CONNECT][TESTS][FOLLOW-UP][3.5] Skips another that that requires JVM access ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/46334 that missed one more test case. ### Why are the changes needed? See https://github.com/apache/spark/pull/46334 ### Does this PR introduce _any_ user-facing change? See https://github.com/apache/spark/pull/46334 ### How was this patch tested? Manually ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46411 from HyukjinKwon/SPARK-48088-followup. Authored-by: Hyukjin Kwon Signed-off-by: Dongjoon Hyun --- python/pyspark/ml/tests/connect/test_connect_pipeline.py | 1 + 1 file changed, 1 insertion(+) diff --git a/python/pyspark/ml/tests/connect/test_connect_pipeline.py b/python/pyspark/ml/tests/connect/test_connect_pipeline.py index dc7490bf14b1..eb2bedddbe28 100644 --- a/python/pyspark/ml/tests/connect/test_connect_pipeline.py +++ b/python/pyspark/ml/tests/connect/test_connect_pipeline.py @@ -22,6 +22,7 @@ from pyspark.sql import SparkSession from pyspark.ml.tests.connect.test_legacy_mode_pipeline import PipelineTestsMixin +@unittest.skipIf("SPARK_SKIP_CONNECT_COMPAT_TESTS" in os.environ, "Requires JVM access") class PipelineTestsOnConnect(PipelineTestsMixin, unittest.TestCase): def setUp(self) -> None: self.spark = ( - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (2ef7246b9c5b -> f918d1179642)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 2ef7246b9c5b [SPARK-48142][PYTHON][CONNECT][TESTS] Enable `CogroupedApplyInPandasTests.test_wrong_args` add f918d1179642 [SPARK-48151][INFRA] `build_and_test.yml` should use `Volcano` 1.7.0 for `branch-3.4/3.5` No new revisions were added by this update. Summary of changes: .github/workflows/build_and_test.yml | 5 + 1 file changed, 5 insertions(+) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48149][INFRA] Serialize `build_python.yml` to run a single Python version per cron schedule
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 4c6884291e8b [SPARK-48149][INFRA] Serialize `build_python.yml` to run a single Python version per cron schedule 4c6884291e8b is described below commit 4c6884291e8b97a7d64dd13530f7ecabe2839d16 Author: Dongjoon Hyun AuthorDate: Mon May 6 16:06:57 2024 -0700 [SPARK-48149][INFRA] Serialize `build_python.yml` to run a single Python version per cron schedule ### What changes were proposed in this pull request? This PR aims to serialize `build_python.yml` to run a single Python version per cron schedule to reduce the maximum concurrency per single GitHub Action job. ### Why are the changes needed? Currently, `build_python.yml` triggers 60 jobs. `30` of `60` jobs are running concurrently because 10 test pipelines are required per Python version. - https://github.com/apache/spark/actions/workflows/build_python.yml https://github.com/apache/spark/assets/9700541/e4f4e9d2-2b2e-43b9-a760-6b9943c7b5b7;> According to https://infra.apache.org/github-actions-policy.html, > All workflows SHOULD have a job concurrency level less than or equal to 15. After this PR, the maximum concurrently level will be 10. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review because this is a daily CI. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46407 from dongjoon-hyun/SPARK-48149. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .github/workflows/build_python.yml | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/.github/workflows/build_python.yml b/.github/workflows/build_python.yml index 3354fb726368..9195dc4af518 100644 --- a/.github/workflows/build_python.yml +++ b/.github/workflows/build_python.yml @@ -17,18 +17,26 @@ # under the License. # +# According to https://infra.apache.org/github-actions-policy.html, +# all workflows SHOULD have a job concurrency level less than or equal to 15. +# To do that, we run one python version per cron schedule name: "Build / Python-only (master, PyPy 3.9/Python 3.10/Python 3.12)" on: schedule: - cron: '0 15 * * *' +- cron: '0 17 * * *' +- cron: '0 19 * * *' jobs: run-build: strategy: fail-fast: false matrix: -pyversion: ["pypy3", "python3.10", "python3.12"] +include: + - pyversion: ${{ github.event.schedule == '0 15 * * *' && "pypy3" }} + - pyversion: ${{ github.event.schedule == '0 17 * * *' && "python3.10" }} + - pyversion: ${{ github.event.schedule == '0 19 * * *' && "python3.12" }} permissions: packages: write name: Run - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48145][CORE] Remove logDebug and logTrace with MDC in JAVA structured logging framework
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new de8ba8589c21 [SPARK-48145][CORE] Remove logDebug and logTrace with MDC in JAVA structured logging framework de8ba8589c21 is described below commit de8ba8589c218ffbe57efc581bd921a6aef73fae Author: Gengliang Wang AuthorDate: Mon May 6 13:32:54 2024 -0700 [SPARK-48145][CORE] Remove logDebug and logTrace with MDC in JAVA structured logging framework ### What changes were proposed in this pull request? Since we are targeting on migration INFO/WARN/ERROR level logs to structure logging, this PR removes the logDebug and logTrace methods from the JAVA structured logging framework. ### Why are the changes needed? In the log migration PR https://github.com/apache/spark/pull/46390, there are unnecessary changes such as updating ``` logger.debug("Task {} need to spill {} for {}", taskAttemptId, Utils.bytesToString(required - got), requestingConsumer); ``` to ``` LOGGER.debug("Task {} need to spill {} for {}", String.valueOf(taskAttemptId), Utils.bytesToString(required - got), requestingConsumer.toString()); ``` With this PR, we can avoid such changes during log migrations. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT. ### Was this patch authored or co-authored using generative AI tooling? No Closes #46405 from gengliangwang/updateJavaLog. Authored-by: Gengliang Wang Signed-off-by: Dongjoon Hyun --- .../java/org/apache/spark/internal/Logger.java | 49 ++ .../org/apache/spark/util/LoggerSuiteBase.java | 28 +++-- 2 files changed, 26 insertions(+), 51 deletions(-) diff --git a/common/utils/src/main/java/org/apache/spark/internal/Logger.java b/common/utils/src/main/java/org/apache/spark/internal/Logger.java index f252f44b3b76..2b4dd3bb45bc 100644 --- a/common/utils/src/main/java/org/apache/spark/internal/Logger.java +++ b/common/utils/src/main/java/org/apache/spark/internal/Logger.java @@ -110,50 +110,43 @@ public class Logger { slf4jLogger.debug(msg); } - public void debug(String msg, Throwable throwable) { -slf4jLogger.debug(msg, throwable); + public void debug(String format, Object arg) { +slf4jLogger.debug(format, arg); } - public void debug(String msg, MDC... mdcs) { -if (mdcs == null || mdcs.length == 0) { - slf4jLogger.debug(msg); -} else if (slf4jLogger.isDebugEnabled()) { - withLogContext(msg, mdcs, null, mt -> slf4jLogger.debug(mt.message)); -} + public void debug(String format, Object arg1, Object arg2) { +slf4jLogger.debug(format, arg1, arg2); } - public void debug(String msg, Throwable throwable, MDC... mdcs) { -if (mdcs == null || mdcs.length == 0) { - slf4jLogger.debug(msg); -} else if (slf4jLogger.isDebugEnabled()) { - withLogContext(msg, mdcs, throwable, mt -> slf4jLogger.debug(mt.message, mt.throwable)); -} + public void debug(String format, Object... arguments) { +slf4jLogger.debug(format, arguments); + } + + public void debug(String msg, Throwable throwable) { +slf4jLogger.debug(msg, throwable); } public void trace(String msg) { slf4jLogger.trace(msg); } - public void trace(String msg, Throwable throwable) { -slf4jLogger.trace(msg, throwable); + public void trace(String format, Object arg) { +slf4jLogger.trace(format, arg); } - public void trace(String msg, MDC... 
mdcs) { -if (mdcs == null || mdcs.length == 0) { - slf4jLogger.trace(msg); -} else if (slf4jLogger.isTraceEnabled()) { - withLogContext(msg, mdcs, null, mt -> slf4jLogger.trace(mt.message)); -} + public void trace(String format, Object arg1, Object arg2) { +slf4jLogger.trace(format, arg1, arg2); } - public void trace(String msg, Throwable throwable, MDC... mdcs) { -if (mdcs == null || mdcs.length == 0) { - slf4jLogger.trace(msg); -} else if (slf4jLogger.isTraceEnabled()) { - withLogContext(msg, mdcs, throwable, mt -> slf4jLogger.trace(mt.message, mt.throwable)); -} + public void trace(String format, Object... arguments) { +slf4jLogger.trace(format, arguments); } + public void trace(String msg, Throwable throwable) { +slf4jLogger.trace(msg, throwable); + } + + private void withLogContext( String pattern, MDC[] mdcs, diff --git a/common/utils/src/test/java/org/apache/spark/util/LoggerSuiteBase.java b/common/utils/src/test/java/org/apache/spark/util/LoggerSuiteBase.java index cdc06f6fc261..6c39304bece0 100644 --- a/common/utils/src/test/java/org/apache/spark/util/LoggerSuiteBase.
(spark-website) branch asf-site updated: Update `committers` page (#517)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/spark-website.git The following commit(s) were added to refs/heads/asf-site by this push: new 79e0191af2 Update `committers` page (#517) 79e0191af2 is described below commit 79e0191af219aab6e9aea84458700e33d7013bef Author: Dongjoon Hyun AuthorDate: Tue May 7 04:25:32 2024 +0900 Update `committers` page (#517) This PR aims to update `committers` page because `Apache Spark 4.0.0-preview` is going to be ready this week soon. - https://spark.apache.org/committers.html --- committers.md| 30 +++--- site/committers.html | 30 +++--- 2 files changed, 30 insertions(+), 30 deletions(-) diff --git a/committers.md b/committers.md index 42e06bebd7..f443ff280b 100644 --- a/committers.md +++ b/committers.md @@ -10,20 +10,20 @@ navigation: |Name|Organization| ||| -|Sameer Agarwal|Facebook| +|Sameer Agarwal|Deductive AI| |Michael Armbrust|Databricks| |Dilip Biswal|Adobe| -|Ryan Blue|Netflix| +|Ryan Blue|Tabular| |Joseph Bradley|Databricks| |Matthew Cheah|Palantir| -|Felix Cheung|SafeGraph| +|Felix Cheung|NVIDIA| |Mosharaf Chowdhury|University of Michigan, Ann Arbor| |Bryan Cutler|IBM| |Jason Dai|Intel| |Tathagata Das|Databricks| -|Ankur Dave|UC Berkeley| +|Ankur Dave|Databricks| |Aaron Davidson|Databricks| -|Thomas Dudziak|Facebook| +|Thomas Dudziak|Meta| |Erik Erlandson|Red Hat| |Robert Evans|NVIDIA| |Wenchen Fan|Databricks| @@ -34,7 +34,7 @@ navigation: |Thomas Graves|NVIDIA| |Stephen Haberman|LinkedIn| |Mark Hamstra|ClearStory Data| -|Seth Hendrickson|Cloudera| +|Seth Hendrickson|Stripe| |Herman van Hovell|Databricks| |Liang-Chi Hsieh|Apple| |Yin Huai|Databricks| @@ -43,7 +43,7 @@ navigation: |Kazuaki Ishizaki|IBM| |Xingbo Jiang|Databricks| |Yikun Jiang|Huawei| -|Holden Karau|Apple| +|Holden Karau|Netflix| |Shane Knapp|UC Berkeley| |Cody Koeninger|Nexstar Digital| |Andy Konwinski|Databricks| @@ -61,23 +61,23 @@ navigation: |Xiangrui Meng|Databricks| |Xinrong Meng|Databricks| |Mridul Muralidharan|LinkedIn| -|Andrew Or|Princeton University| +|Andrew Or|Facebook| |Kay Ousterhout|LightStep| |Sean Owen|Databricks| -|Tejas Patil|Facebook| -|Nick Pentreath|IBM| +|Tejas Patil|Meta| +|Nick Pentreath|Automattic| |Attila Zsolt Piros|Cloudera| -|Anirudh Ramanathan|Rockset| +|Anirudh Ramanathan|Signadot| |Imran Rashid|Cloudera| |Charles Reiss|University of Virginia| -|Josh Rosen|Stripe| -|Sandy Ryza|Remix| +|Josh Rosen|Databricks| +|Sandy Ryza|Dagster| |Kousuke Saruta|NTT Data| |Saisai Shao|Datastrato| |Prashant Sharma|IBM| |Gabor Somogyi|Apple| -|Ram Sriharsha|Databricks| -|Chao Sun|Apple| +|Ram Sriharsha|Pinecone| +|Chao Sun|OpenAI| |Maciej Szymkiewicz|| |Jose Torres|Databricks| |Peter Toth|Cloudera| diff --git a/site/committers.html b/site/committers.html index 0153ecf595..15e8bdbe19 100644 --- a/site/committers.html +++ b/site/committers.html @@ -153,7 +153,7 @@ Sameer Agarwal - Facebook + Deductive AI Michael Armbrust @@ -165,7 +165,7 @@ Ryan Blue - Netflix + Tabular Joseph Bradley @@ -177,7 +177,7 @@ Felix Cheung - SafeGraph + NVIDIA Mosharaf Chowdhury @@ -197,7 +197,7 @@ Ankur Dave - UC Berkeley + Databricks Aaron Davidson @@ -205,7 +205,7 @@ Thomas Dudziak - Facebook + Meta Erik Erlandson @@ -249,7 +249,7 @@ Seth Hendrickson - Cloudera + Stripe Herman van Hovell @@ -285,7 +285,7 @@ Holden Karau - Apple + Netflix Shane Knapp @@ -357,7 +357,7 @@ Andrew Or - Princeton University + Facebook Kay Ousterhout @@ -369,11 +369,11 @@ Tejas Patil - Facebook + 
Meta Nick Pentreath - IBM + Automattic Attila Zsolt Piros @@ -381,7 +381,7 @@ Anirudh Ramanathan - Rockset + Signadot Imran Rashid @@ -393,11 +393,11 @@ Josh Rosen - Stripe + Databricks Sandy Ryza - Remix + Dagster Kousuke Saruta @@ -417,11 +417,11 @@ Ram Sriharsha - Databricks + Pinecone Chao Sun - Apple + OpenAI Maciej Szymkiewicz - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (7c728b2c2d6c -> 526e4141457d)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 7c728b2c2d6c [SPARK-48137][INFRA] Run `yarn` test only in PR builders and Daily CIs add 526e4141457d [SPARK-45220][FOLLOWUP][DOCS][TESTS] Make a `dataframe.join` doctest deterministic No new revisions were added by this update. Summary of changes: python/pyspark/sql/dataframe.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (8294c5962feb -> 7c728b2c2d6c)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 8294c5962feb [SPARK-48138][CONNECT][TESTS] Disable a flaky `SparkSessionE2ESuite.interrupt tag` test add 7c728b2c2d6c [SPARK-48137][INFRA] Run `yarn` test only in PR builders and Daily CIs No new revisions were added by this update. Summary of changes: .github/workflows/build_and_test.yml | 12 ++-- .github/workflows/build_java21.yml | 1 + .github/workflows/build_non_ansi.yml | 3 ++- .github/workflows/build_rockdb_as_ui_backend.yml | 3 ++- 4 files changed, 15 insertions(+), 4 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48136][INFRA][CONNECT] Always upload Spark Connect log files in scheduled build for Spark Connect
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d09f174be5e9 [SPARK-48136][INFRA][CONNECT] Always upload Spark Connect log files in scheduled build for Spark Connect d09f174be5e9 is described below commit d09f174be5e9bf7dee12840526ed8bf6aee07052 Author: Hyukjin Kwon AuthorDate: Sun May 5 17:49:15 2024 -0700 [SPARK-48136][INFRA][CONNECT] Always upload Spark Connect log files in scheduled build for Spark Connect ### What changes were proposed in this pull request? This PR proposes to upload Spark Connect log files in scheduled build for Spark Connect ### Why are the changes needed? Difficult to debug, e.g., https://github.com/apache/spark/actions/runs/8960485641/job/24607044822 ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Manually ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46393 from HyukjinKwon/SPARK-48136. Authored-by: Hyukjin Kwon Signed-off-by: Dongjoon Hyun --- .github/workflows/build_python_connect.yml | 2 +- .github/workflows/build_python_connect35.yml | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/.github/workflows/build_python_connect.yml b/.github/workflows/build_python_connect.yml index 3a9ce5115741..639b0d084314 100644 --- a/.github/workflows/build_python_connect.yml +++ b/.github/workflows/build_python_connect.yml @@ -118,7 +118,7 @@ jobs: name: test-results-spark-connect-python-only path: "**/target/test-reports/*.xml" - name: Upload Spark Connect server log file -if: failure() +if: ${{ !success() }} uses: actions/upload-artifact@v4 with: name: unit-tests-log-spark-connect-python-only diff --git a/.github/workflows/build_python_connect35.yml b/.github/workflows/build_python_connect35.yml index 8c9a5fa86996..14edb8bf91ed 100644 --- a/.github/workflows/build_python_connect35.yml +++ b/.github/workflows/build_python_connect35.yml @@ -106,7 +106,7 @@ jobs: name: test-results-spark-connect-python-only path: "**/target/test-reports/*.xml" - name: Upload Spark Connect server log file -if: failure() +if: ${{ !success() }} uses: actions/upload-artifact@v4 with: name: unit-tests-log-spark-connect-python-only - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48135][INFRA] Run `buf` and `ui` only in PR builders and Java 21 Daily CI
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8b2251734519 [SPARK-48135][INFRA] Run `buf` and `ui` only in PR builders and Java 21 Daily CI 8b2251734519 is described below commit 8b22517345190e007ca87c7491116ad590ad46f2 Author: Dongjoon Hyun AuthorDate: Sun May 5 16:40:11 2024 -0700 [SPARK-48135][INFRA] Run `buf` and `ui` only in PR builders and Java 21 Daily CI ### What changes were proposed in this pull request? This PR aims to run `buf` and `ui` tests only in PR builders and Java 21 Daily CI. ### Why are the changes needed? Currently, Apache Spark CI is running `buf` and `ui` tests always because they finish quickly. https://github.com/apache/spark/blob/32ba5c1db62c2674e8acced56f89ed840bf9/.github/workflows/build_and_test.yml#L102-L103 - `buf` job https://github.com/apache/spark/blob/32ba5c1db62c2674e8acced56f89ed840bf9/.github/workflows/build_and_test.yml#L571-L574 - `ui` job https://github.com/apache/spark/blob/32ba5c1db62c2674e8acced56f89ed840bf9/.github/workflows/build_and_test.yml#L1049-L1052 However, ASF Infra team's guideline recommends to maintain the job concurrency level under or equal to `15`. We had better offload `buf` and `ui` from per-commit CI. - https://infra.apache.org/github-actions-policy.html > All workflows SHOULD have a job concurrency level less than or equal to 15. ### Does this PR introduce _any_ user-facing change? No because this is an infra update. ### How was this patch tested? Pass the CIs and manual review because PR builders will not be affected by this. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46392 from dongjoon-hyun/SPARK-48135. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml | 8 ++-- .github/workflows/build_java21.yml | 4 +++- 2 files changed, 9 insertions(+), 3 deletions(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index f626cd72be15..8a85d26c0eca 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -82,10 +82,14 @@ jobs: pandas=$pyspark kubernetes=`./dev/is-changed.py -m kubernetes` sparkr=`./dev/is-changed.py -m sparkr` +buf=true +ui=true else pandas=false kubernetes=false sparkr=false +buf=false +ui=false fi # 'build' is always true for now. # It does not save significant time and most of PRs trigger the build. @@ -99,8 +103,8 @@ jobs: \"docker-integration-tests\": \"false\", \"lint\" : \"true\", \"k8s-integration-tests\" : \"$kubernetes\", - \"buf\" : \"true\", - \"ui\" : \"true\", + \"buf\" : \"$buf\", + \"ui\" : \"$ui\", }" echo $precondition # For debugging # Remove `\n` to avoid "Invalid format" error diff --git a/.github/workflows/build_java21.yml b/.github/workflows/build_java21.yml index bfeedd4174cf..a2fb0e6e2c1d 100644 --- a/.github/workflows/build_java21.yml +++ b/.github/workflows/build_java21.yml @@ -47,5 +47,7 @@ jobs: "sparkr": "true", "tpcds-1g": "true", "docker-integration-tests": "true", - "k8s-integration-tests": "true" + "k8s-integration-tests": "true", + "buf": "true", + "ui": "true" } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48133][INFRA] Run `sparkr` only in PR builders and Daily CIs
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 32ba5c1db62c [SPARK-48133][INFRA] Run `sparkr` only in PR builders and Daily CIs 32ba5c1db62c is described below commit 32ba5c1db62c2674e8acced56f89ed840bf9 Author: Dongjoon Hyun AuthorDate: Sun May 5 13:19:23 2024 -0700 [SPARK-48133][INFRA] Run `sparkr` only in PR builders and Daily CIs ### What changes were proposed in this pull request? This PR aims to run `sparkr` only in PR builder and Daily Python CIs. In other words, only the commit builder will skip it by default. ### Why are the changes needed? To reduce GitHub Action usage to meet ASF INFRA policy. - https://infra.apache.org/github-actions-policy.html > All workflows MUST have a job concurrency level less than or equal to 20. This means a workflow cannot have more than 20 jobs running at the same time across all matrices. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46389 from dongjoon-hyun/SPARK-48133. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index c87e8921b48e..f626cd72be15 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -76,17 +76,17 @@ jobs: id: set-outputs run: | if [ -z "${{ inputs.jobs }}" ]; then - pyspark=true; sparkr=true; pyspark_modules=`cd dev && python -c "import sparktestsupport.modules as m; print(','.join(m.name for m in m.all_modules if m.name.startswith('pyspark')))"` pyspark=`./dev/is-changed.py -m $pyspark_modules` if [[ "${{ github.repository }}" != 'apache/spark' ]]; then pandas=$pyspark kubernetes=`./dev/is-changed.py -m kubernetes` +sparkr=`./dev/is-changed.py -m sparkr` else pandas=false kubernetes=false +sparkr=false fi - sparkr=`./dev/is-changed.py -m sparkr` # 'build' is always true for now. # It does not save significant time and most of PRs trigger the build. precondition=" - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48132][INFRA] Run `k8s-integration-tests` only in PR builder and Daily CIs
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a0f62393d69a [SPARK-48132][INFRA] Run `k8s-integration-tests` only in PR builder and Daily CIs a0f62393d69a is described below commit a0f62393d69a40ddd49b034b3ce332e6fa6bfb13 Author: Dongjoon Hyun AuthorDate: Sat May 4 22:55:04 2024 -0700 [SPARK-48132][INFRA] Run `k8s-integration-tests` only in PR builder and Daily CIs ### What changes were proposed in this pull request? This PR aims to run `k8s-integration-tests` only in PR builder and Daily Python CIs. In other words, only the commit builder will skip it by default. Please note that - K8s unit tests will be covered by the commit builder still. - All PR builders are not consuming ASF resources and they provide lots of test coverage everyday also. ### Why are the changes needed? To reduce GitHub Action usage to meet ASF INFRA policy. - https://infra.apache.org/github-actions-policy.html > All workflows MUST have a job concurrency level less than or equal to 20. This means a workflow cannot have more than 20 jobs running at the same time across all matrices. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46388 from dongjoon-hyun/SPARK-48132. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 6ef971002c54..c87e8921b48e 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -81,11 +81,12 @@ jobs: pyspark=`./dev/is-changed.py -m $pyspark_modules` if [[ "${{ github.repository }}" != 'apache/spark' ]]; then pandas=$pyspark +kubernetes=`./dev/is-changed.py -m kubernetes` else pandas=false +kubernetes=false fi sparkr=`./dev/is-changed.py -m sparkr` - kubernetes=`./dev/is-changed.py -m kubernetes` # 'build' is always true for now. # It does not save significant time and most of PRs trigger the build. precondition=" - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48131][CORE] Unify MDC key `mdc.taskName` and `task_name`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8443672b1ab1 [SPARK-48131][CORE] Unify MDC key `mdc.taskName` and `task_name` 8443672b1ab1 is described below commit 8443672b1ab1195278a73a9ec487af8e02e3a8de Author: Gengliang Wang AuthorDate: Sat May 4 17:33:02 2024 -0700 [SPARK-48131][CORE] Unify MDC key `mdc.taskName` and `task_name` ### What changes were proposed in this pull request? Currently there are two MDC keys for task name: * `mdc.taskName`, which is introduced in https://github.com/apache/spark/pull/28801. Before the change, it was `taskName`. * `task_name`: introduce from the structured logging framework project. To make the MDC keys unified, this PR renames the `mdc.taskName` as `task_name`. This MDC is showing frequently in logs when running Spark application. Before change: ``` "context":{"mdc.taskName":"task 19.0 in stage 0.0 (TID 19)”} ``` after change ``` "context":{“task_name":"task 19.0 in stage 0.0 (TID 19)”} ``` ### Why are the changes needed? 1. Make the MDC names consistent 2. Minor upside: this will allow users to query task names with `SELECT * FROM logs where context.task_name = ...`. Otherwise, querying with `context.mdc.task_name` will result in an analysis exception. Users will have to query with `context['mdc.task_name']` ### Does this PR introduce _any_ user-facing change? No really. The MDC key is used by developers for debugging purpose. ### How was this patch tested? Manual test ### Was this patch authored or co-authored using generative AI tooling? No Closes #46386 from gengliangwang/unify. Authored-by: Gengliang Wang Signed-off-by: Dongjoon Hyun --- core/src/main/scala/org/apache/spark/executor/Executor.scala | 6 +++--- docs/configuration.md| 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/executor/Executor.scala b/core/src/main/scala/org/apache/spark/executor/Executor.scala index fd6c02c07789..3edba45ef89f 100644 --- a/core/src/main/scala/org/apache/spark/executor/Executor.scala +++ b/core/src/main/scala/org/apache/spark/executor/Executor.scala @@ -40,7 +40,7 @@ import org.slf4j.MDC import org.apache.spark._ import org.apache.spark.deploy.SparkHadoopUtil -import org.apache.spark.internal.{Logging, MDC => LogMDC} +import org.apache.spark.internal.{Logging, LogKeys, MDC => LogMDC} import org.apache.spark.internal.LogKeys._ import org.apache.spark.internal.config._ import org.apache.spark.internal.plugin.PluginContainer @@ -914,7 +914,7 @@ private[spark] class Executor( try { mdc.foreach { case (key, value) => MDC.put(key, value) } // avoid overriding the takName by the user - MDC.put("mdc.taskName", taskName) + MDC.put(LogKeys.TASK_NAME.name, taskName) } catch { case _: NoSuchFieldError => logInfo("MDC is not supported.") } @@ -923,7 +923,7 @@ private[spark] class Executor( private def cleanMDCForTask(taskName: String, mdc: Seq[(String, String)]): Unit = { try { mdc.foreach { case (key, _) => MDC.remove(key) } - MDC.remove("mdc.taskName") + MDC.remove(LogKeys.TASK_NAME.name) } catch { case _: NoSuchFieldError => logInfo("MDC is not supported.") } diff --git a/docs/configuration.md b/docs/configuration.md index a55ce89c096b..fb14af6d55b8 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -3693,7 +3693,7 @@ val logDf = spark.read.schema(LOG_SCHEMA).json("path/to/logs") ``` ## Plain Text Logging 
-If you prefer plain text logging, you can use the `log4j2.properties.pattern-layout-template` file as a starting point. This is the default configuration used by Spark before the 4.0.0 release. This configuration uses the [PatternLayout](https://logging.apache.org/log4j/2.x/manual/layouts.html#PatternLayout) to log all the logs in plain text. MDC information is not included by default. In order to print it in the logs, you can update the patternLayout in the file. For example, you can ad [...] +If you prefer plain text logging, you can use the `log4j2.properties.pattern-layout-template` file as a starting point. This is the default configuration used by Spark before the 4.0.0 release. This configuration uses the [PatternLayout](https://logging.apache.org/log4j/2.x/manual/layouts.html#PatternLayout) to log all the logs in plain text. MDC information is not included by default. In order to print it in the logs, you can update the patternLayout in the file. For exam
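As a follow-up to the description above, a hedged sketch of the query pattern the unified key enables, assuming structured JSON logs at a placeholder path and relying on schema inference (which reads `context` as a struct, so the new dot-free key is addressable directly):

```
# Sketch only; the log path and task name below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

log_df = spark.read.json("path/to/structured-logs")
log_df.createOrReplaceTempView("logs")

# With the key unified to `task_name`, plain dot notation works:
spark.sql("""
    SELECT ts, level, msg
    FROM logs
    WHERE context.task_name = 'task 19.0 in stage 0.0 (TID 19)'
""").show(truncate=False)
```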
(spark) branch master updated: [SPARK-48129][PYTHON] Provide a constant table schema in PySpark for querying structured logs
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 9a45da21dd1c [SPARK-48129][PYTHON] Provide a constant table schema in PySpark for querying structured logs 9a45da21dd1c is described below commit 9a45da21dd1c7dd93152f7126c8c611b8ba031e7 Author: Gengliang Wang AuthorDate: Sat May 4 11:54:49 2024 -0700 [SPARK-48129][PYTHON] Provide a constant table schema in PySpark for querying structured logs ### What changes were proposed in this pull request? Similar to https://github.com/apache/spark/pull/46375/, this PR provides a constant table schema in PySpark for querying structured logs. The doc of logging configuration is also updated. ### Why are the changes needed? Provide a convenient way to query Spark logs using PySpark. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual test ### Was this patch authored or co-authored using generative AI tooling? No Closes #46384 from gengliangwang/pythonLog. Authored-by: Gengliang Wang Signed-off-by: Dongjoon Hyun --- docs/configuration.md | 9 - python/pyspark/util.py | 16 2 files changed, 24 insertions(+), 1 deletion(-) diff --git a/docs/configuration.md b/docs/configuration.md index d07decf02505..a55ce89c096b 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -3677,8 +3677,15 @@ Starting from version 4.0.0, `spark-submit` has adopted the [JSON Template Layou To configure the layout of structured logging, start with the `log4j2.properties.template` file. -To query Spark logs using Spark SQL, you can use the following Scala code snippet: +To query Spark logs using Spark SQL, you can use the following Python code snippet: +```python +from pyspark.util import LogUtils + +logDf = spark.read.schema(LogUtils.LOG_SCHEMA).json("path/to/logs") +``` + +Or using the following Scala code snippet: ```scala import org.apache.spark.util.LogUtils.LOG_SCHEMA diff --git a/python/pyspark/util.py b/python/pyspark/util.py index f0fa4a2413ce..4920ba957c19 100644 --- a/python/pyspark/util.py +++ b/python/pyspark/util.py @@ -107,6 +107,22 @@ class VersionUtils: ) +class LogUtils: +""" +Utils for querying structured Spark logs with Spark SQL. +""" + +LOG_SCHEMA = ( +"ts TIMESTAMP, " +"level STRING, " +"msg STRING, " +"context map, " +"exception STRUCT>>," +"logger STRING" +) + + def fail_on_stopiteration(f: Callable) -> Callable: """ Wraps the input function to fail on 'StopIteration' by raising a 'RuntimeError' - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
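A short usage sketch following the documented snippet, assuming log files exist at a placeholder path; `context` is a map in `LOG_SCHEMA`, so map-style access is used for individual keys:

```
# Sketch only: apply the constant schema, then narrow to error-level records.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr
from pyspark.util import LogUtils

spark = SparkSession.builder.getOrCreate()

log_df = spark.read.schema(LogUtils.LOG_SCHEMA).json("path/to/logs")  # placeholder path
(log_df
    .filter(col("level") == "ERROR")
    .select("ts", "msg", expr("context['task_name'] AS task_name"))
    .show(truncate=False))
```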
(spark) branch master updated: [SPARK-46009][SQL][FOLLOWUP] Remove unused golden file
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 356aca5af5b8 [SPARK-46009][SQL][FOLLOWUP] Remove unused golden file 356aca5af5b8 is described below commit 356aca5af5b88570d43d1c0f2b417aa87b86d323 Author: beliefer AuthorDate: Sat May 4 11:51:40 2024 -0700 [SPARK-46009][SQL][FOLLOWUP] Remove unused golden file ### What changes were proposed in this pull request? This PR propose to remove unused golden file. ### Why are the changes needed? https://github.com/apache/spark/pull/46272 removed unused `PERCENTILE_CONT` and `PERCENTILE_DISC` in g4. But I made a mistake and submitted my local test code. ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? GA ### Was this patch authored or co-authored using generative AI tooling? 'No'. Closes #46385 from beliefer/SPARK-46009_followup3. Authored-by: beliefer Signed-off-by: Dongjoon Hyun --- .../sql-tests/analyzer-results/window2.sql.out | 126 - 1 file changed, 126 deletions(-) diff --git a/sql/core/src/test/resources/sql-tests/analyzer-results/window2.sql.out b/sql/core/src/test/resources/sql-tests/analyzer-results/window2.sql.out deleted file mode 100644 index 6fd41286959a.. --- a/sql/core/src/test/resources/sql-tests/analyzer-results/window2.sql.out +++ /dev/null @@ -1,126 +0,0 @@ --- Automatically generated by SQLQueryTestSuite --- !query -CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES -(null, 1L, 1.0D, date("2017-08-01"), timestamp_seconds(1501545600), "a"), -(1, 1L, 1.0D, date("2017-08-01"), timestamp_seconds(1501545600), "a"), -(1, 2L, 2.5D, date("2017-08-02"), timestamp_seconds(150200), "a"), -(2, 2147483650L, 100.001D, date("2020-12-31"), timestamp_seconds(1609372800), "a"), -(1, null, 1.0D, date("2017-08-01"), timestamp_seconds(1501545600), "b"), -(2, 3L, 3.3D, date("2017-08-03"), timestamp_seconds(150300), "b"), -(3, 2147483650L, 100.001D, date("2020-12-31"), timestamp_seconds(1609372800), "b"), -(null, null, null, null, null, null), -(3, 1L, 1.0D, date("2017-08-01"), timestamp_seconds(1501545600), null) -AS testData(val, val_long, val_double, val_date, val_timestamp, cate) --- !query analysis -CreateViewCommand `testData`, SELECT * FROM VALUES -(null, 1L, 1.0D, date("2017-08-01"), timestamp_seconds(1501545600), "a"), -(1, 1L, 1.0D, date("2017-08-01"), timestamp_seconds(1501545600), "a"), -(1, 2L, 2.5D, date("2017-08-02"), timestamp_seconds(150200), "a"), -(2, 2147483650L, 100.001D, date("2020-12-31"), timestamp_seconds(1609372800), "a"), -(1, null, 1.0D, date("2017-08-01"), timestamp_seconds(1501545600), "b"), -(2, 3L, 3.3D, date("2017-08-03"), timestamp_seconds(150300), "b"), -(3, 2147483650L, 100.001D, date("2020-12-31"), timestamp_seconds(1609372800), "b"), -(null, null, null, null, null, null), -(3, 1L, 1.0D, date("2017-08-01"), timestamp_seconds(1501545600), null) -AS testData(val, val_long, val_double, val_date, val_timestamp, cate), false, true, LocalTempView, true - +- Project [val#x, val_long#xL, val_double#x, val_date#x, val_timestamp#x, cate#x] - +- SubqueryAlias testData - +- LocalRelation [val#x, val_long#xL, val_double#x, val_date#x, val_timestamp#x, cate#x] - - --- !query -CREATE OR REPLACE TEMPORARY VIEW basic_pays AS SELECT * FROM VALUES -('Diane Murphy','Accounting',8435), -('Mary Patterson','Accounting',9998), -('Jeff Firrelli','Accounting',8992), -('William 
Patterson','Accounting',8870), -('Gerard Bondur','Accounting',11472), -('Anthony Bow','Accounting',6627), -('Leslie Jennings','IT',8113), -('Leslie Thompson','IT',5186), -('Julie Firrelli','Sales',9181), -('Steve Patterson','Sales',9441), -('Foon Yue Tseng','Sales',6660), -('George Vanauf','Sales',10563), -('Loui Bondur','SCM',10449), -('Gerard Hernandez','SCM',6949), -('Pamela Castillo','SCM',11303), -('Larry Bott','SCM',11798), -('Barry Jones','SCM',10586) -AS basic_pays(employee_name, department, salary) --- !query analysis -CreateViewCommand `basic_pays`, SELECT * FROM VALUES -('Diane Murphy','Accounting',8435), -('Mary Patterson','Accounting',9998), -('Jeff Firrelli','Accounting',8992), -('William Patterson','Accounting',8870), -('Gerard Bondur','Accounting',11472), -('Anthony Bow','Accounting',6627), -('Leslie Jennings','IT',8113), -('Leslie Thompson','IT',5186), -('Julie Firrelli','Sales',9181), -('Steve Patterson','Sales',9441), -('Foon
(spark) branch branch-3.5 updated: [SPARK-48128][SQL] For BitwiseCount / bit_count expression, fix codegen syntax error for boolean type inputs
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 2f2347f3b74f [SPARK-48128][SQL] For BitwiseCount / bit_count expression, fix codegen syntax error for boolean type inputs 2f2347f3b74f is described below commit 2f2347f3b74f1478fb583de9378427b3e45bd980 Author: Josh Rosen AuthorDate: Sat May 4 11:49:20 2024 -0700 [SPARK-48128][SQL] For BitwiseCount / bit_count expression, fix codegen syntax error for boolean type inputs ### What changes were proposed in this pull request? This PR fixes an issue where `BitwiseCount` / `bit_count` of boolean inputs would cause codegen to generate syntactically invalid Java code that does not compile, triggering errors like ``` java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 41, Column 11: Failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 41, Column 11: Unexpected token "if" in primary ``` Even though this code has test cases in `bitwise.sql` via the query test framework, those existing test cases were insufficient to find this problem: I believe that is because the example queries were constant-folded using the interpreted path, leaving the codegen path without test coverage. This PR fixes the codegen issue and adds explicit expression tests to ensure that the same tests run on both the codegen and interpreted paths. ### Why are the changes needed? Fix a rare codegen to interpreted fallback issue, which may harm query performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added new test cases to BitwiseExpressionsSuite.scala, copied from the existing `bitwise.sql` query test case file. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46382 from JoshRosen/SPARK-48128-bit_count_codegen. Authored-by: Josh Rosen Signed-off-by: Dongjoon Hyun (cherry picked from commit 96f65c950064d330245dc53fcd50cf6d9753afc8) Signed-off-by: Dongjoon Hyun --- .../catalyst/expressions/bitwiseExpressions.scala | 2 +- .../expressions/BitwiseExpressionsSuite.scala | 41 ++ 2 files changed, 42 insertions(+), 1 deletion(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/bitwiseExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/bitwiseExpressions.scala index 6061f625ef07..183e5d6697e9 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/bitwiseExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/bitwiseExpressions.scala @@ -229,7 +229,7 @@ case class BitwiseCount(child: Expression) override def prettyName: String = "bit_count" override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = child.dataType match { -case BooleanType => defineCodeGen(ctx, ev, c => s"if ($c) 1 else 0") +case BooleanType => defineCodeGen(ctx, ev, c => s"($c) ? 
1 : 0") case _ => defineCodeGen(ctx, ev, c => s"java.lang.Long.bitCount($c)") } diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/BitwiseExpressionsSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/BitwiseExpressionsSuite.scala index 4cd5f3e861ac..5bd1bc346c02 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/BitwiseExpressionsSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/BitwiseExpressionsSuite.scala @@ -133,6 +133,47 @@ class BitwiseExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper { } } + test("BitCount") { +// null +val nullLongLiteral = Literal.create(null, LongType) +val nullIntLiteral = Literal.create(null, IntegerType) +val nullBooleanLiteral = Literal.create(null, BooleanType) +checkEvaluation(BitwiseCount(nullLongLiteral), null) +checkEvaluation(BitwiseCount(nullIntLiteral), null) +checkEvaluation(BitwiseCount(nullBooleanLiteral), null) + +// boolean +checkEvaluation(BitwiseCount(Literal(true)), 1) +checkEvaluation(BitwiseCount(Literal(false)), 0) + +// byte/tinyint +checkEvaluation(BitwiseCount(Literal(1.toByte)), 1) +checkEvaluation(BitwiseCount(Literal(2.toByte)), 1) +checkEvaluation(BitwiseCount(Literal(3.toByte)), 2) + +// short/smallint +checkEvaluation(BitwiseCount(Literal(1.toShort)), 1) +checkEvaluation(BitwiseCount(Literal(2.toShort)), 1) +checkEval
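To close out, a hedged illustration (assuming an active SparkSession) of the boolean inputs whose generated code the fix repairs; before the change, the invalid `if ... else` snippet made codegen compilation fail and the expression fall back to the interpreted path:

```
# Sketch only: bit_count over boolean inputs, the case covered by the corrected ternary codegen.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql(
    "SELECT bit_count(true) AS t, bit_count(false) AS f, bit_count(CAST(NULL AS BOOLEAN)) AS n"
).show()
# Expected: t = 1, f = 0, n = NULL
```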