(spark) branch master updated (65db87697949 -> 7cba1ab4d6ac)

2024-06-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 65db87697949 [SPARK-48513][SS] Add error class for state schema 
compatibility and minor refactoring
 add 7cba1ab4d6ac [SPARK-48554][INFRA] Use R 4.4.0 in `windows` R GitHub 
Action Window job

No new revisions were added by this update.

Summary of changes:
 .github/workflows/build_sparkr_window.yml | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark-kubernetes-operator) branch main updated: [SPARK-48528] Refine K8s Operator `merge_spark_pr.py` to use `kubernetes-operator-x.y.z` version

2024-06-04 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/spark-kubernetes-operator.git


The following commit(s) were added to refs/heads/main by this push:
 new d7734bb  [SPARK-48528] Refine K8s Operator `merge_spark_pr.py` to use 
`kubernetes-operator-x.y.z` version
d7734bb is described below

commit d7734bbc4413163cf60fe67e23c541929a9a37a8
Author: Dongjoon Hyun 
AuthorDate: Tue Jun 4 12:04:21 2024 -0700

[SPARK-48528] Refine K8s Operator `merge_spark_pr.py` to use 
`kubernetes-operator-x.y.z` version

### What changes were proposed in this pull request?

This PR aims to refine `merge_spark_pr.py` to use 
`kubernetes-operator-x.y.z` versions.

### Why are the changes needed?

Previously, it used Apache Spark's version numbers like 4.0.0.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I manually tested as follows, by printing the versions.
```
Enter number of user, or userid, to assign to (blank to leave unassigned):0
[]
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #14 from dongjoon-hyun/SPARK-48528.

Lead-authored-by: Dongjoon Hyun 
Co-authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 dev/merge_spark_pr.py | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/dev/merge_spark_pr.py b/dev/merge_spark_pr.py
index 24e956d..9a8d39f 100755
--- a/dev/merge_spark_pr.py
+++ b/dev/merge_spark_pr.py
@@ -305,7 +305,9 @@ def resolve_jira_issue(merge_branches, comment, default_jira_id=""):
 versions = [
 x
 for x in versions
-if not x.raw["released"] and not x.raw["archived"] and 
re.match(r"\d+\.\d+\.\d+", x.name)
+if not x.raw["released"]
+and not x.raw["archived"]
+and re.match(r"kubernetes-operator-\d+\.\d+\.\d+", x.name)
 ]
 versions = sorted(versions, key=lambda x: x.name, reverse=True)
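
For context, a minimal standalone sketch of the filtering and sorting behavior this hunk introduces. The `FakeVersion` class and the sample version names are hypothetical stand-ins for the JIRA client objects that `merge_spark_pr.py` actually receives:

```python
import re

# Hypothetical stand-in for the JIRA version objects handled by merge_spark_pr.py.
class FakeVersion:
    def __init__(self, name, released=False, archived=False):
        self.name = name
        self.raw = {"released": released, "archived": archived}

versions = [
    FakeVersion("4.0.0"),                                     # Spark-style version: now filtered out
    FakeVersion("kubernetes-operator-0.1.0"),                 # kept
    FakeVersion("kubernetes-operator-0.2.0", released=True),  # released: filtered out
]

# Mirrors the updated list comprehension in resolve_jira_issue().
versions = [
    x
    for x in versions
    if not x.raw["released"]
    and not x.raw["archived"]
    and re.match(r"kubernetes-operator-\d+\.\d+\.\d+", x.name)
]
versions = sorted(versions, key=lambda x: x.name, reverse=True)
print([x.name for x in versions])  # ['kubernetes-operator-0.1.0']
```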
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48531][INFRA] Fix `Black` target version to Python 3.9

2024-06-04 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 651f68782ab7 [SPARK-48531][INFRA] Fix `Black` target version to Python 
3.9
651f68782ab7 is described below

commit 651f68782ab705f277b2548382900cdf986e017e
Author: Dongjoon Hyun 
AuthorDate: Tue Jun 4 10:28:50 2024 -0700

[SPARK-48531][INFRA] Fix `Black` target version to Python 3.9

### What changes were proposed in this pull request?

This PR aims to fix `Black` target version to `Python 3.9`.

### Why are the changes needed?

Since SPARK-47993 officially dropped Python 3.8 support at Apache Spark 4.0.0, we should update the target version to `Python 3.9`.

- #46228

`py39` is the version for `Python 3.9`.
```
$ black --help  | grep target
  -t, --target-version [py33|py34|py35|py36|py37|py38|py39|py310|py311|py312]
```
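
As a side illustration (not part of this patch), the effect of the target version can also be exercised from Python through Black's formatting API. This assumes the `black` package is installed locally and is not how Spark's dev tooling invokes it:

```python
import black

# Format a snippet while targeting Python 3.9, mirroring
# target-version = ['py39'] and line-length = 100 in dev/pyproject.toml.
mode = black.Mode(target_versions={black.TargetVersion.PY39}, line_length=100)
print(black.format_str("x = {  'a':1 }", mode=mode))  # -> x = {"a": 1}
```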

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs with Python linter.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46867 from dongjoon-hyun/SPARK-48531.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 dev/pyproject.toml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/dev/pyproject.toml b/dev/pyproject.toml
index 4f462d14c783..f19107b3782a 100644
--- a/dev/pyproject.toml
+++ b/dev/pyproject.toml
@@ -29,6 +29,6 @@ testpaths = [
 # GitHub workflow version and dev/reformat-python
 required-version = "23.9.1"
 line-length = 100
-target-version = ['py38']
+target-version = ['py39']
 include = '\.pyi?$'
 extend-exclude = 'cloudpickle|error_classes.py'


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark-kubernetes-operator) branch main updated: [SPARK-48326] Use the official Apache Spark 4.0.0-preview1

2024-06-04 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/spark-kubernetes-operator.git


The following commit(s) were added to refs/heads/main by this push:
 new cd23de3  [SPARK-48326] Use the official Apache Spark 4.0.0-preview1
cd23de3 is described below

commit cd23de3ff5ee4dbc13d55c8552d86acc94cd8411
Author: Dongjoon Hyun 
AuthorDate: Tue Jun 4 09:24:09 2024 -0700

[SPARK-48326] Use the official Apache Spark 4.0.0-preview1

### What changes were proposed in this pull request?

This PR aims to use the official Apache Spark `4.0.0-preview1` artifacts.

### Why are the changes needed?

The currently used artifact is not the latest RC3 and will be removed.

- https://repository.apache.org/content/repositories/orgapachespark-1454/

For the record, the latest RC was the following, and it became the official artifact.
- https://repository.apache.org/content/repositories/orgapachespark-1456/ 
(RC3)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #13 from dongjoon-hyun/SPARK-48326.

Lead-authored-by: Dongjoon Hyun 
Co-authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 build.gradle  | 4 
 gradle.properties | 1 -
 2 files changed, 5 deletions(-)

diff --git a/build.gradle b/build.gradle
index c0c75d0..a6c1701 100644
--- a/build.gradle
+++ b/build.gradle
@@ -25,10 +25,6 @@ subprojects {
 
   repositories {
 mavenCentral()
-// TODO(SPARK-48326) Upgrade submission worker base Spark version to 
4.0.0-preview2
-maven {
-  url "https://repository.apache.org/content/repositories/orgapachespark-1454/"
-}
   }
 
   apply plugin: 'checkstyle'
diff --git a/gradle.properties b/gradle.properties
index ffa8302..31b75dc 100644
--- a/gradle.properties
+++ b/gradle.properties
@@ -26,7 +26,6 @@ lombokVersion=1.18.32
 
 # Spark
 scalaVersion=2.13
-# TODO(SPARK-48326) Upgrade submission worker base Spark version to 
4.0.0-preview2
 sparkVersion=4.0.0-preview1
 
 # Logging


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (1a536f01ead3 -> 6cd1ccc56321)

2024-05-24 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 1a536f01ead3 [SPARK-48407][SQL][DOCS] Teradata: Document Type 
Conversion rules between Spark SQL and teradata
 add 6cd1ccc56321 [SPARK-48394][CORE] Cleanup mapIdToMapIndex on mapoutput 
unregister

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/spark/MapOutputTracker.scala  | 26 ++
 .../org/apache/spark/MapOutputTrackerSuite.scala   | 55 ++
 2 files changed, 72 insertions(+), 9 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48407][SQL][DOCS] Teradata: Document Type Conversion rules between Spark SQL and teradata

2024-05-24 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 1a536f01ead3 [SPARK-48407][SQL][DOCS] Teradata: Document Type 
Conversion rules between Spark SQL and teradata
1a536f01ead3 is described below

commit 1a536f01ead35b770467381c476e093338d81e7c
Author: Kent Yao 
AuthorDate: Fri May 24 15:56:19 2024 -0700

[SPARK-48407][SQL][DOCS] Teradata: Document Type Conversion rules between 
Spark SQL and teradata

### What changes were proposed in this pull request?

This PR adds documentation for the built-in Teradata JDBC dialect's data type conversion rules.

### Why are the changes needed?

doc improvement
### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?


![image](https://github.com/apache/spark/assets/8326978/e1ec0de5-cd83-4339-896a-50c58ad01c4d)

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #46728 from yaooqinn/SPARK-48407.

Authored-by: Kent Yao 
Signed-off-by: Dongjoon Hyun 
---
 docs/sql-data-sources-jdbc.md | 214 ++
 1 file changed, 214 insertions(+)

diff --git a/docs/sql-data-sources-jdbc.md b/docs/sql-data-sources-jdbc.md
index 371dc0595071..9ffd96cd40ee 100644
--- a/docs/sql-data-sources-jdbc.md
+++ b/docs/sql-data-sources-jdbc.md
@@ -1991,3 +1991,217 @@ The Spark Catalyst data types below are not supported 
with suitable DB2 types.
 - NullType
 - ObjectType
 - VariantType
+
+### Mapping Spark SQL Data Types from Teradata
+
+The below table describes the data type conversions from Teradata data types to Spark SQL Data Types,
+when reading data from a Teradata table using the built-in jdbc data source with
+the [Teradata JDBC Driver](https://mvnrepository.com/artifact/com.teradata.jdbc/terajdbc) as the activated JDBC Driver.
+
+| Teradata Data Type | Spark SQL Data Type | Remarks |
+|---|---|---|
+| BYTEINT | ByteType | |
+| SMALLINT | ShortType | |
+| INTEGER, INT | IntegerType | |
+| BIGINT | LongType | |
+| REAL, DOUBLE PRECISION, FLOAT | DoubleType | |
+| DECIMAL, NUMERIC, NUMBER | DecimalType | |
+| DATE | DateType | |
+| TIMESTAMP, TIMESTAMP WITH TIME ZONE | TimestampType | (Default) preferTimestampNTZ=false or spark.sql.timestampType=TIMESTAMP_LTZ |
+| TIMESTAMP, TIMESTAMP WITH TIME ZONE | TimestampNTZType | preferTimestampNTZ=true or spark.sql.timestampType=TIMESTAMP_NTZ |
+| TIME, TIME WITH TIME ZONE | TimestampType | (Default) preferTimestampNTZ=false or spark.sql.timestampType=TIMESTAMP_LTZ |
+| TIME, TIME WITH TIME ZONE | TimestampNTZType | preferTimestampNTZ=true or spark.sql.timestampType=TIMESTAMP_NTZ |
+| CHARACTER(n), CHAR(n), GRAPHIC(n) | CharType(n) | |
+| VARCHAR(n), VARGRAPHIC(n) | VarcharType(n) | |
+| BYTE(n), VARBYTE(n) | BinaryType | |
+| CLOB | StringType | |
+| BLOB | BinaryType | |
+| INTERVAL Data Types | - | The INTERVAL data types are unknown yet |
+| Period Data Types, ARRAY, UDT | - | Not Supported |
+
+### Mapping Spark SQL Data Types to Teradata
+
+The below table describes the data type conversions from Spark SQL Data Types to Teradata data types,
+when creating, altering, or writing data to a Teradata table using the built-in jdbc data source with
+the [Teradata JDBC Driver](https://mvnrepository.com/artifact/com.teradata.jdbc/terajdbc) as the activated JDBC Driver.
+
+| Spark SQL Data Type | Teradata Data Type | Remarks |
+|---|---|---|
+| BooleanType | CHAR(1) | |
+| ByteType | BYTEINT | |
+| ShortType | SMALLINT | |
+| IntegerType | INTEGER | |
+| LongType | BIGINT | |
+| FloatType | REAL | |
+| DoubleType | DOUBLE PRECISION | |
+| DecimalType(p, s) | DECIMAL(p,s) | |
+| DateType | DATE | |
+| TimestampType | TIMESTAMP | |
+| TimestampNTZType | TIMESTAMP | |
+| StringType | VARCHAR(255) | |
+| BinaryType | BLOB | |
+| CharType(n) | CHAR(n) | |
+| VarcharType(n) | VARCHAR(n) | |
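
For illustration, a minimal PySpark read that exercises the `preferTimestampNTZ` option referenced in the mapping tables above. The JDBC URL, table name, and credentials below are placeholders, and the Teradata JDBC driver jar is assumed to be on the classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("teradata-read-example").getOrCreate()

# Placeholder connection details; replace them with a real Teradata host and table.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:teradata://example-host/DATABASE=example_db")
    .option("dbtable", "example_table")
    .option("user", "example_user")
    .option("password", "example_password")
    # With preferTimestampNTZ=true, TIMESTAMP/TIME columns map to TimestampNTZType
    # instead of TimestampType, as described in the first mapping table above.
    .option("preferTimestampNTZ", "true")
    .load()
)
df.printSchema()
```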

(spark) branch master updated: [SPARK-48325][CORE] Always specify messages in ExecutorRunner.killProcess

2024-05-24 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 7d96334902f2 [SPARK-48325][CORE] Always specify messages in 
ExecutorRunner.killProcess
7d96334902f2 is described below

commit 7d96334902f22a80af63ce1253d5abda63178c4e
Author: Bo Zhang 
AuthorDate: Fri May 24 15:54:21 2024 -0700

[SPARK-48325][CORE] Always specify messages in ExecutorRunner.killProcess

### What changes were proposed in this pull request?
This change is to always specify the message in 
`ExecutorRunner.killProcess`.

### Why are the changes needed?
This is to get the occurrence rate for different cases when killing the 
executor process, in order to analyze executor running stability.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
N/A

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46641 from bozhang2820/spark-48325.

Authored-by: Bo Zhang 
Signed-off-by: Dongjoon Hyun 
---
 .../scala/org/apache/spark/deploy/worker/ExecutorRunner.scala  | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git 
a/core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala 
b/core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala
index 7bb8b74eb021..bd98f19cdb60 100644
--- a/core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala
@@ -88,7 +88,7 @@ private[deploy] class ExecutorRunner(
   if (state == ExecutorState.LAUNCHING || state == ExecutorState.RUNNING) {
 state = ExecutorState.FAILED
   }
-  killProcess(Some("Worker shutting down")) }
+  killProcess("Worker shutting down") }
   }
 
   /**
@@ -96,7 +96,7 @@ private[deploy] class ExecutorRunner(
*
* @param message the exception message which caused the executor's death
*/
-  private def killProcess(message: Option[String]): Unit = {
+  private def killProcess(message: String): Unit = {
 var exitCode: Option[Int] = None
 if (process != null) {
   logInfo("Killing process!")
@@ -113,7 +113,7 @@ private[deploy] class ExecutorRunner(
   }
 }
 try {
-  worker.send(ExecutorStateChanged(appId, execId, state, message, 
exitCode))
+  worker.send(ExecutorStateChanged(appId, execId, state, Some(message), 
exitCode))
 } catch {
   case e: IllegalStateException => logWarning(log"${MDC(ERROR, 
e.getMessage())}", e)
 }
@@ -206,11 +206,11 @@ private[deploy] class ExecutorRunner(
   case interrupted: InterruptedException =>
 logInfo("Runner thread for executor " + fullId + " interrupted")
 state = ExecutorState.KILLED
-killProcess(None)
+killProcess(s"Runner thread for executor $fullId interrupted")
   case e: Exception =>
 logError("Error running executor", e)
 state = ExecutorState.FAILED
-killProcess(Some(e.toString))
+killProcess(s"Error running executor: $e")
 }
   }
 }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (febdbf56fb22 -> 80c0f1165417)

2024-05-21 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from febdbf56fb22 [SPARK-48031] Grandfather legacy views to SCHEMA BINDING
 add 80c0f1165417 [SPARK-48381][K8S][DOCS] Update `YuniKorn` docs with 
v1.5.1

No new revisions were added by this update.

Summary of changes:
 docs/running-on-kubernetes.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48329][SQL] Enable `spark.sql.sources.v2.bucketing.pushPartValues.enabled` by default

2024-05-21 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6b3a88195e30 [SPARK-48329][SQL] Enable 
`spark.sql.sources.v2.bucketing.pushPartValues.enabled` by default
6b3a88195e30 is described below

commit 6b3a88195e30027b74166d7729c232cd7ddba83b
Author: Szehon Ho 
AuthorDate: Tue May 21 10:00:14 2024 -0700

[SPARK-48329][SQL] Enable 
`spark.sql.sources.v2.bucketing.pushPartValues.enabled` by default

### What changes were proposed in this pull request?

This PR aims to enable `spark.sql.sources.v2.bucketing.pushPartValues.enabled` by default for Apache Spark 4.0.0, while keeping `spark.sql.sources.v2.bucketing.enabled` as `false`.

### Why are the changes needed?

`spark.sql.sources.v2.bucketing.pushPartValues.enabled` was added in Apache Spark 3.4.0 and has been one of the datasource v2 bucketing features. This PR will help datasource v2 bucketing users adopt this feature more easily.

Note that this change is technically a no-op for default users because `spark.sql.sources.v2.bucketing.enabled` is still `false`.
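
For users who want the previous behavior back, a minimal sketch of flipping the flag at runtime; the session setup and app name are illustrative, and the same value can equally be passed via `--conf` at submit time:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spj-config-example").getOrCreate()

# Restore the pre-4.0.0 default for pushing partition values in storage-partitioned joins.
spark.conf.set("spark.sql.sources.v2.bucketing.pushPartValues.enabled", "false")

# The umbrella flag is untouched by this commit and stays opt-in (false by default).
print(spark.conf.get("spark.sql.sources.v2.bucketing.enabled"))
```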

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46673 from szehon-ho/default_pushpart.

Lead-authored-by: Szehon Ho 
Co-authored-by: chesterxu 
Signed-off-by: Dongjoon Hyun 
---
 docs/sql-migration-guide.md | 1 +
 sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index 98075d019585..6e400ab93711 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -57,6 +57,7 @@ license: |
 - Since Spark 4.0, A bug falsely allowing `!` instead of `NOT` when `!` is not 
a prefix operator has been fixed. Clauses such as `expr ! IN (...)`, `expr ! 
BETWEEN ...`, or `col ! NULL` now raise syntax errors. To restore the previous 
behavior, set `spark.sql.legacy.bangEqualsNot` to `true`. 
 - Since Spark 4.0, By default views tolerate column type changes in the query 
and compensate with casts. To restore the previous behavior, allowing up-casts 
only, set `spark.sql.legacy.viewSchemaCompensation` to `false`.
 - Since Spark 4.0, Views allow control over how they react to underlying query 
changes. By default views tolerate column type changes in the query and 
compensate with casts. To disable this feature set 
`spark.sql.legacy.viewSchemaBindingMode` to `false`. This also removes the 
clause from `DESCRIBE EXTENDED` and `SHOW CREATE TABLE`.
+- Since Spark 4.0, The Storage-Partitioned Join feature flag 
`spark.sql.sources.v2.bucketing.pushPartValues.enabled` is set to `true`. To 
restore the previous behavior, set 
`spark.sql.sources.v2.bucketing.pushPartValues.enabled` to `false`.
 
 ## Upgrading from Spark SQL 3.5.1 to 3.5.2
 
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 87b32ca0b9b5..9c4236679f3a 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -1569,7 +1569,7 @@ object SQLConf {
 "side. This could help to eliminate unnecessary shuffles")
   .version("3.4.0")
   .booleanConf
-  .createWithDefault(false)
+  .createWithDefault(true)
 
   val V2_BUCKETING_PARTIALLY_CLUSTERED_DISTRIBUTION_ENABLED =
 
buildConf("spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled")


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (4fc2910f92d1 -> f5ffb74f170e)

2024-05-20 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 4fc2910f92d1 [SPARK-48238][BUILD][YARN] Replace YARN AmIpFilter with a 
forked implementation
 add f5ffb74f170e [SPARK-48328][BUILD] Upgrade `Arrow` to 16.1.0

No new revisions were added by this update.

Summary of changes:
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 10 +-
 pom.xml   |  2 +-
 2 files changed, 6 insertions(+), 6 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark-kubernetes-operator) branch main updated: [SPARK-48017] Add Spark application submission worker for operator

2024-05-20 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/spark-kubernetes-operator.git


The following commit(s) were added to refs/heads/main by this push:
 new e747bcf  [SPARK-48017] Add Spark application submission worker for 
operator
e747bcf is described below

commit e747bcfab106b828bbd9f2d44968698e5dce3c33
Author: zhou-jiang 
AuthorDate: Mon May 20 10:41:48 2024 -0700

[SPARK-48017] Add Spark application submission worker for operator

### What changes were proposed in this pull request?

This is a breakdown PR of #2, adding a submission worker implementation for SparkApplication.

### Why are the changes needed?

Spark Operator needs a submission worker to convert its abstraction (the SparkApplication API) into a k8s resource spec.
This is a lightweight implementation based on the native k8s integration.

As of now, it is based on Spark 4.0.0-preview1, but it is expected to serve all Spark LTS versions. This is feasible because the worker covers only the spec generation; the Spark core jars are still brought in by the application images. E2E tests will later be set up with the operator to ensure that.

Per the SPIP doc, future operator version(s) may add more submission worker implementations based on different Spark versions to achieve 100% version agnosticism, at the cost of having multiple workers on standby.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit test coverage.

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #10 from jiangzho/worker.

Authored-by: zhou-jiang 
Signed-off-by: Dongjoon Hyun 
---
 build.gradle   |   4 +
 gradle.properties  |   7 +
 settings.gradle|   1 +
 spark-operator-api/build.gradle|   1 +
 .../spark/k8s/operator/utils/ModelUtils.java   |   9 +
 spark-submission-worker/build.gradle   |  18 ++
 .../spark/k8s/operator/SparkAppDriverConf.java |  73 +++
 .../spark/k8s/operator/SparkAppResourceSpec.java   | 129 
 .../k8s/operator/SparkAppSubmissionWorker.java | 175 +
 .../spark/k8s/operator/SparkAppDriverConfTest.java |  75 +++
 .../k8s/operator/SparkAppResourceSpecTest.java | 137 +
 .../k8s/operator/SparkAppSubmissionWorkerTest.java | 218 +
 12 files changed, 847 insertions(+)

diff --git a/build.gradle b/build.gradle
index a6c1701..c0c75d0 100644
--- a/build.gradle
+++ b/build.gradle
@@ -25,6 +25,10 @@ subprojects {
 
   repositories {
 mavenCentral()
+// TODO(SPARK-48326) Upgrade submission worker base Spark version to 
4.0.0-preview2
+maven {
+  url "https://repository.apache.org/content/repositories/orgapachespark-1454/"
+}
   }
 
   apply plugin: 'checkstyle'
diff --git a/gradle.properties b/gradle.properties
index 2606179..ffa8302 100644
--- a/gradle.properties
+++ b/gradle.properties
@@ -18,17 +18,24 @@
 group=org.apache.spark.k8s.operator
 version=0.1.0
 
+# Caution: fabric8 version should be aligned with Spark dependency
 fabric8Version=6.12.1
 commonsLang3Version=3.14.0
 commonsIOVersion=2.16.1
 lombokVersion=1.18.32
 
+# Spark
+scalaVersion=2.13
+# TODO(SPARK-48326) Upgrade submission worker base Spark version to 
4.0.0-preview2
+sparkVersion=4.0.0-preview1
+
 # Logging
 log4jVersion=2.22.1
 
 # Test
 junitVersion=5.10.2
 jacocoVersion=0.8.12
+mockitoVersion=5.11.0
 
 # Build Analysis
 checkstyleVersion=10.15.0
diff --git a/settings.gradle b/settings.gradle
index 69e7827..6808ec7 100644
--- a/settings.gradle
+++ b/settings.gradle
@@ -1,2 +1,3 @@
 rootProject.name = 'apache-spark-kubernetes-operator'
 include 'spark-operator-api'
+include 'spark-submission-worker'
diff --git a/spark-operator-api/build.gradle b/spark-operator-api/build.gradle
index b57beca..696415f 100644
--- a/spark-operator-api/build.gradle
+++ b/spark-operator-api/build.gradle
@@ -18,6 +18,7 @@ dependencies {
 
   testImplementation platform("org.junit:junit-bom:$junitVersion")
   testImplementation 'org.junit.jupiter:junit-jupiter'
+  testRuntimeOnly "org.junit.platform:junit-platform-launcher"
 }
 
 test {
diff --git 
a/spark-operator-api/src/main/java/org/apache/spark/k8s/operator/utils/ModelUtils.java
 
b/spark-operator-api/src/main/java/org/apache/spark/k8s/operator/utils/ModelUtils.java
index 454e706..03d84be 100644
--- 
a/spark-operator-api/src/main/java/org/apache/spark/k8s/operator/utils/ModelUtils.java
+++ 
b/spark-operator-api/src/main/java/org/apache/spark/k8s/operator/utils/ModelUtils.java
@@ -36,6 +36,7 @@ import io.fabric8.kubernetes.api.model.PodBuilder;
 import io.fabric8.kubernetes.api.model

(spark) branch master updated (6767053dacd9 -> a2d93d104a6c)

2024-05-15 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 6767053dacd9 [SPARK-48218][CORE] TransportClientFactory.createClient 
may NPE cause FetchFailedException
 add a2d93d104a6c [SPARK-48256][BUILD] Add a rule to check file headers for 
the java side, and fix inconsistent files

No new revisions were added by this update.

Summary of changes:
 .../protocol/EncryptedMessageWithHeader.java   |  2 +-
 .../spark/unsafe/types/CalendarIntervalSuite.java  | 30 +++---
 .../apache/spark/unsafe/types/UTF8StringSuite.java | 30 +++---
 .../spark/io/NioBufferedFileInputStream.java   | 11 +---
 .../org/apache/spark/io/ReadAheadInputStream.java  | 11 +---
 dev/checkstyle-suppressions.xml|  4 +++
 dev/checkstyle.xml |  6 +
 .../hive/package-info.java => dev/java-file-header |  4 +--
 .../spark/sql/connector/catalog/Identifier.java|  2 +-
 .../sql/connector/catalog/IdentifierImpl.java  |  2 +-
 .../spark/sql/connector/catalog/CatalogPlugin.java |  2 +-
 .../sql/connector/catalog/MetadataColumn.java  | 26 +--
 .../connector/catalog/SupportsMetadataColumns.java | 26 +--
 .../sql/connector/catalog/index/SupportsIndex.java |  2 +-
 .../sql/connector/catalog/index/TableIndex.java|  2 +-
 .../sql/connector/catalog/CatalogLoadingSuite.java |  2 +-
 .../parquet/filter2/predicate/SparkFilterApi.java  | 26 +--
 .../spark/sql/JavaDataFrameReaderWriterSuite.java  | 30 +++---
 .../execution/datasources/orc/FakeKeyProvider.java | 15 +--
 19 files changed, 120 insertions(+), 113 deletions(-)
 copy sql/hive/src/main/scala/org/apache/spark/sql/hive/package-info.java => 
dev/java-file-header (95%)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48218][CORE] TransportClientFactory.createClient may NPE cause FetchFailedException

2024-05-15 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6767053dacd9 [SPARK-48218][CORE] TransportClientFactory.createClient 
may NPE cause FetchFailedException
6767053dacd9 is described below

commit 6767053dacd9df623336e1f5faabf1eb16b7a7dd
Author: sychen 
AuthorDate: Wed May 15 09:33:39 2024 -0700

[SPARK-48218][CORE] TransportClientFactory.createClient may NPE cause 
FetchFailedException

### What changes were proposed in this pull request?
This PR aims to add a check for `TransportChannelHandler` to be non-null in 
the `TransportClientFactory.createClient` method.

### Why are the changes needed?

At line 178, `synchronized (handler)` throws an NPE when `handler == null`.


org.apache.spark.network.client.TransportClientFactory#createClient(java.lang.String, int, boolean)
```java
  TransportChannelHandler handler = cachedClient.getChannel().pipeline()
.get(TransportChannelHandler.class);
  synchronized (handler) {
handler.getResponseHandler().updateTimeOfLastRequest();
  }
```

```java
org.apache.spark.shuffle.FetchFailedException
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:1180)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:913)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:84)
at 
org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)

Caused by: java.lang.NullPointerException
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:178)
at 
org.apache.spark.network.shuffle.ExternalBlockStoreClient.lambda$fetchBlocks$0(ExternalBlockStoreClient.java:128)
at 
org.apache.spark.network.shuffle.RetryingBlockTransferor.transferAllOutstanding(RetryingBlockTransferor.java:154)
at 
org.apache.spark.network.shuffle.RetryingBlockTransferor.start(RetryingBlockTransferor.java:133)
at 
org.apache.spark.network.shuffle.ExternalBlockStoreClient.fetchBlocks(ExternalBlockStoreClient.java:139)
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46506 from cxzl25/SPARK-48218.

Authored-by: sychen 
Signed-off-by: Dongjoon Hyun 
---
 .../org/apache/spark/network/client/TransportClientFactory.java | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git 
a/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java
 
b/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java
index ddf1b3cce349..f2dbfd92b854 100644
--- 
a/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java
+++ 
b/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java
@@ -171,8 +171,10 @@ public class TransportClientFactory implements Closeable {
   // this code was able to update things.
   TransportChannelHandler handler = cachedClient.getChannel().pipeline()
 .get(TransportChannelHandler.class);
-  synchronized (handler) {
-handler.getResponseHandler().updateTimeOfLastRequest();
+  if (handler != null) {
+synchronized (handler) {
+  handler.getResponseHandler().updateTimeOfLastRequest();
+}
   }
 
   if (cachedClient.isActive()) {


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (973328cd376b -> 12820e11b094)

2024-05-15 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 973328cd376b [SPARK-48285][SQL][DOCS] Update docs for size function 
and sizeOfNull configuration
 add 12820e11b094 [SPARK-48049][BUILD] Upgrade Scala to 2.13.14

No new revisions were added by this update.

Summary of changes:
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 10 +-
 docs/_config.yml  |  2 +-
 pom.xml   | 12 ++--
 3 files changed, 16 insertions(+), 8 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (9e386b472981 -> 973328cd376b)

2024-05-15 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 9e386b472981 [SPARK-48172][SQL] Fix escaping issues in JDBCDialects
 add 973328cd376b [SPARK-48285][SQL][DOCS] Update docs for size function 
and sizeOfNull configuration

No new revisions were added by this update.

Summary of changes:
 .../jvm/src/main/scala/org/apache/spark/sql/functions.scala  | 12 ++--
 .../sql/catalyst/expressions/collectionOperations.scala  |  6 +++---
 sql/core/src/main/scala/org/apache/spark/sql/functions.scala | 12 ++--
 3 files changed, 15 insertions(+), 15 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48279][BUILD] Upgrade ORC to 2.0.1

2024-05-15 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 76329f9bb60e [SPARK-48279][BUILD] Upgrade ORC to 2.0.1
76329f9bb60e is described below

commit 76329f9bb60e3c61e16e8a285fe00cf4f185efd5
Author: William Hyun 
AuthorDate: Wed May 15 01:17:54 2024 -0700

[SPARK-48279][BUILD] Upgrade ORC to 2.0.1

### What changes were proposed in this pull request?
This PR aims to upgrade ORC to 2.0.1

### Why are the changes needed?
Apache ORC 2.0.1 is the first maintenance release of the 2.0.x line.

- https://orc.apache.org/news/2024/05/14/ORC-2.0.1/

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46587 from williamhyun/SPARK-48279.

Authored-by: William Hyun 
Signed-off-by: Dongjoon Hyun 
---
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 6 +++---
 pom.xml   | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 
b/dev/deps/spark-deps-hadoop-3-hive-2.3
index 2b444dddcbe9..598be34e5e0f 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -231,10 +231,10 @@ opencsv/2.3//opencsv-2.3.jar
 opentracing-api/0.33.0//opentracing-api-0.33.0.jar
 opentracing-noop/0.33.0//opentracing-noop-0.33.0.jar
 opentracing-util/0.33.0//opentracing-util-0.33.0.jar
-orc-core/2.0.0/shaded-protobuf/orc-core-2.0.0-shaded-protobuf.jar
+orc-core/2.0.1/shaded-protobuf/orc-core-2.0.1-shaded-protobuf.jar
 orc-format/1.0.0/shaded-protobuf/orc-format-1.0.0-shaded-protobuf.jar
-orc-mapreduce/2.0.0/shaded-protobuf/orc-mapreduce-2.0.0-shaded-protobuf.jar
-orc-shims/2.0.0//orc-shims-2.0.0.jar
+orc-mapreduce/2.0.1/shaded-protobuf/orc-mapreduce-2.0.1-shaded-protobuf.jar
+orc-shims/2.0.1//orc-shims-2.0.1.jar
 oro/2.0.8//oro-2.0.8.jar
 osgi-resource-locator/1.0.3//osgi-resource-locator-1.0.3.jar
 paranamer/2.8//paranamer-2.8.jar
diff --git a/pom.xml b/pom.xml
index 12d20f4f0736..ce7d2546e7c2 100644
--- a/pom.xml
+++ b/pom.xml
@@ -138,7 +138,7 @@
 
 10.16.1.1
 1.13.1
-2.0.0
+2.0.1
 shaded-protobuf
 11.0.20
 5.0.0


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of `test-dependencies.sh` script

2024-05-10 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f699f556d8a0 [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of 
`test-dependencies.sh` script
f699f556d8a0 is described below

commit f699f556d8a09bb755e9c8558661a36fbdb42e73
Author: panbingkun 
AuthorDate: Fri May 10 19:54:29 2024 -0700

[SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of 
`test-dependencies.sh` script

### What changes were proposed in this pull request?
This PR aims to delete the `dev/pr-deps` directory after executing `test-dependencies.sh`.

### Why are the changes needed?
We should clean up the generated temporary files at the end of the script.
Before:
```
sh dev/test-dependencies.sh
```
https://github.com/apache/spark/assets/15246973/39a56983-774c-4c2d-897d-26a7d0999456

After:
```
sh dev/test-dependencies.sh
```
https://github.com/apache/spark/assets/15246973/f7e76e22-63cf-4411-99d0-5e844f8d5a7a

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46531 from panbingkun/minor_test-dependencies.

Authored-by: panbingkun 
Signed-off-by: Dongjoon Hyun 
---
 dev/test-dependencies.sh | 4 
 1 file changed, 4 insertions(+)

diff --git a/dev/test-dependencies.sh b/dev/test-dependencies.sh
index 048c59f4cec9..e645a66165a2 100755
--- a/dev/test-dependencies.sh
+++ b/dev/test-dependencies.sh
@@ -140,4 +140,8 @@ for HADOOP_HIVE_PROFILE in "${HADOOP_HIVE_PROFILES[@]}"; do
   fi
 done
 
+if [[ -d "$FWDIR/dev/pr-deps" ]]; then
+  rm -rf "$FWDIR/dev/pr-deps"
+fi
+
 exit 0


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.4 updated: [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of `test-dependencies.sh` script

2024-05-10 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 1e0fc1ef96aa [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of 
`test-dependencies.sh` script
1e0fc1ef96aa is described below

commit 1e0fc1ef96aa6f541134224f1ba626f234442e74
Author: panbingkun 
AuthorDate: Fri May 10 19:54:29 2024 -0700

[SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of 
`test-dependencies.sh` script

### What changes were proposed in this pull request?
This PR aims to delete the `dev/pr-deps` directory after executing `test-dependencies.sh`.

### Why are the changes needed?
We should clean up the generated temporary files at the end of the script.
Before:
```
sh dev/test-dependencies.sh
```
https://github.com/apache/spark/assets/15246973/39a56983-774c-4c2d-897d-26a7d0999456

After:
```
sh dev/test-dependencies.sh
```
https://github.com/apache/spark/assets/15246973/f7e76e22-63cf-4411-99d0-5e844f8d5a7a

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46531 from panbingkun/minor_test-dependencies.

Authored-by: panbingkun 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit f699f556d8a09bb755e9c8558661a36fbdb42e73)
Signed-off-by: Dongjoon Hyun 
---
 dev/test-dependencies.sh | 4 
 1 file changed, 4 insertions(+)

diff --git a/dev/test-dependencies.sh b/dev/test-dependencies.sh
index 2268a262d5f8..2907ef27189c 100755
--- a/dev/test-dependencies.sh
+++ b/dev/test-dependencies.sh
@@ -144,4 +144,8 @@ for HADOOP_HIVE_PROFILE in "${HADOOP_HIVE_PROFILES[@]}"; do
   fi
 done
 
+if [[ -d "$FWDIR/dev/pr-deps" ]]; then
+  rm -rf "$FWDIR/dev/pr-deps"
+fi
+
 exit 0


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of `test-dependencies.sh` script

2024-05-10 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new e9a1b4254419 [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of 
`test-dependencies.sh` script
e9a1b4254419 is described below

commit e9a1b4254419c751e612cd5e5c56f111b41399e7
Author: panbingkun 
AuthorDate: Fri May 10 19:54:29 2024 -0700

[SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of 
`test-dependencies.sh` script

### What changes were proposed in this pull request?
This PR aims to delete the `dev/pr-deps` directory after executing `test-dependencies.sh`.

### Why are the changes needed?
We should clean up the generated temporary files at the end of the script.
Before:
```
sh dev/test-dependencies.sh
```
https://github.com/apache/spark/assets/15246973/39a56983-774c-4c2d-897d-26a7d0999456

After:
```
sh dev/test-dependencies.sh
```
https://github.com/apache/spark/assets/15246973/f7e76e22-63cf-4411-99d0-5e844f8d5a7a

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46531 from panbingkun/minor_test-dependencies.

Authored-by: panbingkun 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit f699f556d8a09bb755e9c8558661a36fbdb42e73)
Signed-off-by: Dongjoon Hyun 
---
 dev/test-dependencies.sh | 4 
 1 file changed, 4 insertions(+)

diff --git a/dev/test-dependencies.sh b/dev/test-dependencies.sh
index d7967ac3afa9..36cc7a4f994d 100755
--- a/dev/test-dependencies.sh
+++ b/dev/test-dependencies.sh
@@ -140,4 +140,8 @@ for HADOOP_HIVE_PROFILE in "${HADOOP_HIVE_PROFILES[@]}"; do
   fi
 done
 
+if [[ -d "$FWDIR/dev/pr-deps" ]]; then
+  rm -rf "$FWDIR/dev/pr-deps"
+fi
+
 exit 0


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48205][SQL][FOLLOWUP] Add missing tags for the dataSource API

2024-05-10 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d82458f15539 [SPARK-48205][SQL][FOLLOWUP] Add missing tags for the 
dataSource API
d82458f15539 is described below

commit d82458f15539eef8df320345a7c2382ca4d5be8a
Author: allisonwang-db 
AuthorDate: Fri May 10 16:31:47 2024 -0700

[SPARK-48205][SQL][FOLLOWUP] Add missing tags for the dataSource API

### What changes were proposed in this pull request?

This is a follow-up PR for https://github.com/apache/spark/pull/46487 to 
add missing tags for the `dataSource` API.

### Why are the changes needed?

To address comments from a previous PR.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46530 from allisonwang-db/spark-48205-followup.

Authored-by: allisonwang-db 
Signed-off-by: Dongjoon Hyun 
---
 sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala | 4 
 1 file changed, 4 insertions(+)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
index d5de74455dce..466e4cf81318 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
@@ -233,7 +233,11 @@ class SparkSession private(
 
   /**
* A collection of methods for registering user-defined data sources.
+   *
+   * @since 4.0.0
*/
+  @Experimental
+  @Unstable
   def dataSource: DataSourceRegistration = sessionState.dataSourceRegistration
 
   /**


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48236][BUILD] Add `commons-lang:commons-lang:2.6` back to support legacy Hive UDF jars

2024-05-10 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 5b3b8a90638c [SPARK-48236][BUILD] Add `commons-lang:commons-lang:2.6` 
back to support legacy Hive UDF jars
5b3b8a90638c is described below

commit 5b3b8a90638c49fc7ddcace69a85989c1053f1ab
Author: Dongjoon Hyun 
AuthorDate: Fri May 10 15:48:08 2024 -0700

[SPARK-48236][BUILD] Add `commons-lang:commons-lang:2.6` back to support 
legacy Hive UDF jars

### What changes were proposed in this pull request?

This PR aims to add `commons-lang:commons-lang:2.6` back to support legacy Hive UDF jars. This is a partial revert of SPARK-47018.

### Why are the changes needed?

Recently, we dropped `commons-lang:commons-lang` during Hive upgrade.
- #46468

However, only Apache Hive 2.3.10 or 4.0.0 dropped it. In other words, Hive 2.0.0 ~ 2.3.9 and Hive 3.0.0 ~ 3.1.3 require it. As a result, all existing UDF jars built against those versions still require `commons-lang:commons-lang`.

- https://github.com/apache/hive/pull/4892

For example, Apache Hive 3.1.3 code:
- 
https://github.com/apache/hive/blob/af7059e2bdc8b18af42e0b7f7163b923a0bfd424/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFTrim.java#L21
```
import org.apache.commons.lang.StringUtils;
```

- 
https://github.com/apache/hive/blob/af7059e2bdc8b18af42e0b7f7163b923a0bfd424/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFTrim.java#L42
```
return StringUtils.strip(val, " ");
```

As a result, Maven CIs are broken.
- https://github.com/apache/spark/actions/runs/9032639456/job/24825599546 
(Maven / Java 17)
- https://github.com/apache/spark/actions/runs/9033374547/job/24835284769 
(Maven / Java 21)

The root cause is that the existing test UDF jar `hive-test-udfs.jar` was built from old Hive (before 2.3.10) libraries, which require `commons-lang:commons-lang:2.6`.
```
HiveUDFDynamicLoadSuite:
- Spark should be able to run Hive UDF using jar regardless of current 
thread context classloader (UDF
20:21:25.129 WARN org.apache.spark.SparkContext: The JAR 
file:///home/runner/work/spark/spark/sql/hive/src/test/noclasspath/hive-test-udfs.jar
 at spark://localhost:33327/jars/hive-test-udfs.jar has been added already. 
Overwriting of added jar is not supported in the current version.

*** RUN ABORTED ***
A needed class was not found. This could be due to an error in your 
runpath. Missing class: org/apache/commons/lang/StringUtils
  java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
  at 
org.apache.hadoop.hive.contrib.udf.example.GenericUDFTrim2.performOp(GenericUDFTrim2.java:43)
  at 
org.apache.hadoop.hive.ql.udf.generic.GenericUDFBaseTrim.evaluate(GenericUDFBaseTrim.java:75)
  at 
org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:170)
  at 
org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
  at 
org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
  at 
org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
  at 
org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeHiveFunctionExpression(HiveSessionStateBuilder.scala:184)
  at 
org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.$anonfun$makeExpression$1(HiveSessionStateBuilder.scala:164)
  at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:185)
  ...
  Cause: java.lang.ClassNotFoundException: 
org.apache.commons.lang.StringUtils
  at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:593)
  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:526)
  at 
org.apache.hadoop.hive.contrib.udf.example.GenericUDFTrim2.performOp(GenericUDFTrim2.java:43)
  at 
org.apache.hadoop.hive.ql.udf.generic.GenericUDFBaseTrim.evaluate(GenericUDFBaseTrim.java:75)
  at 
org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:170)
  at 
org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
  at 
org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
  at 
org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
  ...
```

### Does this PR introduce _any_ user-facing change?

To support the existin

(spark) branch master updated: Revert "[SPARK-48230][BUILD] Remove unused `jodd-core`"

2024-05-10 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 726ef8aa66ea Revert "[SPARK-48230][BUILD] Remove unused `jodd-core`"
726ef8aa66ea is described below

commit 726ef8aa66ea6e56b739f3b16f99e457a0febb81
Author: Dongjoon Hyun 
AuthorDate: Fri May 10 15:34:12 2024 -0700

Revert "[SPARK-48230][BUILD] Remove unused `jodd-core`"

This reverts commit d8151186d79459fbde27a01bd97328e73548c55a.
---
 LICENSE-binary|  1 +
 dev/deps/spark-deps-hadoop-3-hive-2.3 |  1 +
 licenses-binary/LICENSE-jodd.txt  | 24 
 pom.xml   |  6 ++
 sql/hive/pom.xml  |  4 
 5 files changed, 36 insertions(+)

diff --git a/LICENSE-binary b/LICENSE-binary
index 034215f0ab15..40271c9924bc 100644
--- a/LICENSE-binary
+++ b/LICENSE-binary
@@ -436,6 +436,7 @@ com.esotericsoftware:reflectasm
 org.codehaus.janino:commons-compiler
 org.codehaus.janino:janino
 jline:jline
+org.jodd:jodd-core
 com.github.wendykierp:JTransforms
 pl.edu.icm:JLargeArrays
 
diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 
b/dev/deps/spark-deps-hadoop-3-hive-2.3
index 29997815e5bc..392bacd73277 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -143,6 +143,7 @@ jline/2.14.6//jline-2.14.6.jar
 jline/3.24.1//jline-3.24.1.jar
 jna/5.13.0//jna-5.13.0.jar
 joda-time/2.12.7//joda-time-2.12.7.jar
+jodd-core/3.5.2//jodd-core-3.5.2.jar
 jpam/1.1//jpam-1.1.jar
 json/1.8//json-1.8.jar
 json4s-ast_2.13/4.0.7//json4s-ast_2.13-4.0.7.jar
diff --git a/licenses-binary/LICENSE-jodd.txt b/licenses-binary/LICENSE-jodd.txt
new file mode 100644
index ..cc6b458adb38
--- /dev/null
+++ b/licenses-binary/LICENSE-jodd.txt
@@ -0,0 +1,24 @@
+Copyright (c) 2003-present, Jodd Team (https://jodd.org)
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+
+1. Redistributions of source code must retain the above copyright notice,
+this list of conditions and the following disclaimer.
+
+2. Redistributions in binary form must reproduce the above copyright
+notice, this list of conditions and the following disclaimer in the
+documentation and/or other materials provided with the distribution.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
+LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGE.
\ No newline at end of file
diff --git a/pom.xml b/pom.xml
index a98efe8aed1e..56a34cedde51 100644
--- a/pom.xml
+++ b/pom.xml
@@ -201,6 +201,7 @@
 3.1.9
 3.0.12
 2.12.7
+3.5.2
 3.0.0
 2.2.11
 0.16.0
@@ -2782,6 +2783,11 @@
 joda-time
 ${joda.version}
   
+  
+org.jodd
+jodd-core
+${jodd.version}
+  
   
 org.datanucleus
 datanucleus-core
diff --git a/sql/hive/pom.xml b/sql/hive/pom.xml
index 5e9fc256e7e6..3895d9dc5a63 100644
--- a/sql/hive/pom.xml
+++ b/sql/hive/pom.xml
@@ -152,6 +152,10 @@
   joda-time
   joda-time
 
+
+  org.jodd
+  jodd-core
+
 
   com.google.code.findbugs
   jsr305


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (a6632ffa16f6 -> 2225aa1dab0f)

2024-05-10 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from a6632ffa16f6 [SPARK-48143][SQL] Use lightweight exceptions for 
control-flow between UnivocityParser and FailureSafeParser
 add 2225aa1dab0f [SPARK-48144][SQL] Fix `canPlanAsBroadcastHashJoin` to 
respect shuffle join hints

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/optimizer/joins.scala   | 38 ++
 .../spark/sql/execution/SparkStrategies.scala  | 17 --
 .../scala/org/apache/spark/sql/JoinSuite.scala | 26 +--
 3 files changed, 55 insertions(+), 26 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-47793][TEST][FOLLOWUP] Fix flaky test for Python data source exactly once

2024-05-10 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 5beaf85cd5ef [SPARK-47793][TEST][FOLLOWUP] Fix flaky test for Python 
data source exactly once
5beaf85cd5ef is described below

commit 5beaf85cd5ef2b84a67ebce712e8d73d1e7d41ff
Author: Chaoqin Li 
AuthorDate: Fri May 10 08:24:42 2024 -0700

[SPARK-47793][TEST][FOLLOWUP] Fix flaky test for Python data source exactly 
once

### What changes were proposed in this pull request?
Fix the flakiness in the Python streaming source exactly-once test. The last executed batch may not be recorded in the query progress, which causes the expected rows not to match. This fix takes the uncompleted batch into account and relaxes the condition.

### Why are the changes needed?
Fix flaky test.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Test change.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46481 from chaoqin-li1123/fix_python_ds_test.

Authored-by: Chaoqin Li 
Signed-off-by: Dongjoon Hyun 
---
 .../execution/python/PythonStreamingDataSourceSuite.scala| 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonStreamingDataSourceSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonStreamingDataSourceSuite.scala
index 97e6467c3eaf..d1f7c597b308 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonStreamingDataSourceSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonStreamingDataSourceSuite.scala
@@ -299,7 +299,7 @@ class PythonStreamingDataSourceSuite extends 
PythonDataSourceSuiteBase {
   val checkpointDir = new File(path, "checkpoint")
   val outputDir = new File(path, "output")
   val df = spark.readStream.format(dataSourceName).load()
-  var lastBatch = 0
+  var lastBatchId = 0
   // Restart streaming query multiple times to verify exactly once 
guarantee.
   for (i <- 1 to 5) {
 
@@ -323,11 +323,15 @@ class PythonStreamingDataSourceSuite extends 
PythonDataSourceSuiteBase {
 }
 q.stop()
 q.awaitTermination()
-lastBatch = q.lastProgress.batchId.toInt
+lastBatchId = q.lastProgress.batchId.toInt
   }
-  assert(lastBatch > 20)
+  assert(lastBatchId > 20)
+  val rowCount = 
spark.read.format("json").load(outputDir.getAbsolutePath).count()
+  // There may be one uncommitted batch that is not recorded in query 
progress.
+  // The number of batch can be lastBatchId + 1 or lastBatchId + 2.
+  assert(rowCount == 2 * (lastBatchId + 1) || rowCount == 2 * (lastBatchId 
+ 2))
   checkAnswer(spark.read.format("json").load(outputDir.getAbsolutePath),
-(0 to  2 * lastBatch + 1).map(Row(_)))
+(0 until rowCount.toInt).map(Row(_)))
 }
   }
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-47441][YARN] Do not add log link for unmanaged AM in Spark UI

2024-05-10 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c5b6ec734bd0 [SPARK-47441][YARN] Do not add log link for unmanaged AM 
in Spark UI
c5b6ec734bd0 is described below

commit c5b6ec734bd0c47551b59f9de13c6323b80974b2
Author: Yuming Wang 
AuthorDate: Fri May 10 08:22:03 2024 -0700

[SPARK-47441][YARN] Do not add log link for unmanaged AM in Spark UI

### What changes were proposed in this pull request?

This PR makes Spark not add a log link for an unmanaged AM in the Spark UI.

### Why are the changes needed?

Avoid driver startup error messages like the following:
```
24/03/18 04:58:25,022 ERROR [spark-listener-group-appStatus] 
scheduler.AsyncEventQueue:97 : Listener AppStatusListener threw an exception
java.lang.NumberFormatException: For input string: "null"
at 
java.lang.NumberFormatException.forInputString(NumberFormatException.java:67) 
~[?:?]
at java.lang.Integer.parseInt(Integer.java:668) ~[?:?]
at java.lang.Integer.parseInt(Integer.java:786) ~[?:?]
at 
scala.collection.immutable.StringLike.toInt(StringLike.scala:310) 
~[scala-library-2.12.18.jar:?]
at 
scala.collection.immutable.StringLike.toInt$(StringLike.scala:310) 
~[scala-library-2.12.18.jar:?]
at scala.collection.immutable.StringOps.toInt(StringOps.scala:33) 
~[scala-library-2.12.18.jar:?]
at org.apache.spark.util.Utils$.parseHostPort(Utils.scala:1105) 
~[spark-core_2.12-3.5.1.jar:3.5.1]
at 
org.apache.spark.status.ProcessSummaryWrapper.(storeTypes.scala:609) 
~[spark-core_2.12-3.5.1.jar:3.5.1]
at 
org.apache.spark.status.LiveMiscellaneousProcess.doUpdate(LiveEntity.scala:1045)
 ~[spark-core_2.12-3.5.1.jar:3.5.1]
at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:50) 
~[spark-core_2.12-3.5.1.jar:3.5.1]
at 
org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1233) 
~[spark-core_2.12-3.5.1.jar:3.5.1]
at 
org.apache.spark.status.AppStatusListener.onMiscellaneousProcessAdded(AppStatusListener.scala:1445)
 ~[spark-core_2.12-3.5.1.jar:3.5.1]
at 
org.apache.spark.status.AppStatusListener.onOtherEvent(AppStatusListener.scala:113)
 ~[spark-core_2.12-3.5.1.jar:3.5.1]
at 
org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:100)
 ~[spark-core_2.12-3.5.1.jar:3.5.1]
at 
org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
 ~[spark-core_2.12-3.5.1.jar:3.5.1]
at 
org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
 ~[spark-core_2.12-3.5.1.jar:3.5.1]
at 
org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
 ~[spark-core_2.12-3.5.1.jar:3.5.1]
at 
org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117) 
~[spark-core_2.12-3.5.1.jar:3.5.1]
at 
org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101) 
~[spark-core_2.12-3.5.1.jar:3.5.1]
at 
org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
 ~[spark-core_2.12-3.5.1.jar:3.5.1]
at 
org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
 ~[spark-core_2.12-3.5.1.jar:3.5.1]
at 
scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) 
~[scala-library-2.12.18.jar:?]
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) 
~[scala-library-2.12.18.jar:?]
at 
org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
 ~[spark-core_2.12-3.5.1.jar:3.5.1]
at 
org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
 ~[spark-core_2.12-3.5.1.jar:3.5.1]
at 
org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1356) 
[spark-core_2.12-3.5.1.jar:3.5.1]
at 
org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
 [spark-core_2.12-3.5.1.jar:3.5.1]
```
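
A hypothetical one-liner (not part of this patch) reproduces the root cause in
isolation: presumably because the unmanaged AM has no real container, the host/port
string ends up containing the literal "null" and the port parse fails.

```scala
// Same failure mode as the Utils.parseHostPort call in the stack trace above.
java.lang.Integer.parseInt("null")  // throws NumberFormatException: For input string: "null"
```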

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual testing:
```shell
bin/spark-sql --master yarn  --conf spark.yarn.unmanagedAM.enabled=true
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45565 from wangyum/SPARK-47441.

Authored-by: Yuming Wang 
Signed-off-by: Dongjoon Hyun 
---
 .../main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala   | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git 
a/resource-managers/yarn/src/main/scala/org/apache/spark/d

(spark) branch master updated: [SPARK-48235][SQL] Directly pass join instead of all arguments to getBroadcastBuildSide and getShuffleHashJoinBuildSide

2024-05-10 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 73bb619d45b2 [SPARK-48235][SQL] Directly pass join instead of all 
arguments to getBroadcastBuildSide and getShuffleHashJoinBuildSide
73bb619d45b2 is described below

commit 73bb619d45b2d0699ca4a9d251eea57c359f275b
Author: fred-db 
AuthorDate: Fri May 10 07:45:28 2024 -0700

[SPARK-48235][SQL] Directly pass join instead of all arguments to 
getBroadcastBuildSide and getShuffleHashJoinBuildSide

### What changes were proposed in this pull request?

* Refactor getBroadcastBuildSide and getShuffleHashJoinBuildSide to pass 
the join as an argument instead of all member variables of the join separately, as sketched below.
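
For illustration only, a sketch of the call-site simplification implied by the new 
signature (class and package names are taken from the diff below; this is not code 
from the patch itself):

```scala
import org.apache.spark.sql.catalyst.optimizer.{BuildSide, JoinSelectionHelper}
import org.apache.spark.sql.catalyst.plans.logical.Join
import org.apache.spark.sql.internal.SQLConf

object Example extends JoinSelectionHelper {
  // Before: getBroadcastBuildSide(join.left, join.right, join.joinType, join.hint, hintOnly = true, conf)
  // After: the Join node itself carries left/right/joinType/hint.
  def pick(join: Join, conf: SQLConf): Option[BuildSide] =
    getBroadcastBuildSide(join, hintOnly = true, conf = conf)
}
```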

### Why are the changes needed?

* Makes the code easier to read.

### Does this PR introduce _any_ user-facing change?

* no

### How was this patch tested?

* Existing UTs

### Was this patch authored or co-authored using generative AI tooling?

* No

Closes #46525 from fred-db/parameter-change.

Authored-by: fred-db 
Signed-off-by: Dongjoon Hyun 
---
 .../spark/sql/catalyst/optimizer/joins.scala   | 56 +---
 .../optimizer/JoinSelectionHelperSuite.scala   | 59 +-
 .../spark/sql/execution/SparkStrategies.scala  |  6 +--
 3 files changed, 40 insertions(+), 81 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala
index 2b4ee033b088..5571178832db 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala
@@ -289,58 +289,52 @@ case object BuildLeft extends BuildSide
 trait JoinSelectionHelper {
 
   def getBroadcastBuildSide(
-  left: LogicalPlan,
-  right: LogicalPlan,
-  joinType: JoinType,
-  hint: JoinHint,
+  join: Join,
   hintOnly: Boolean,
   conf: SQLConf): Option[BuildSide] = {
 val buildLeft = if (hintOnly) {
-  hintToBroadcastLeft(hint)
+  hintToBroadcastLeft(join.hint)
 } else {
-  canBroadcastBySize(left, conf) && !hintToNotBroadcastLeft(hint)
+  canBroadcastBySize(join.left, conf) && !hintToNotBroadcastLeft(join.hint)
 }
 val buildRight = if (hintOnly) {
-  hintToBroadcastRight(hint)
+  hintToBroadcastRight(join.hint)
 } else {
-  canBroadcastBySize(right, conf) && !hintToNotBroadcastRight(hint)
+  canBroadcastBySize(join.right, conf) && 
!hintToNotBroadcastRight(join.hint)
 }
 getBuildSide(
-  canBuildBroadcastLeft(joinType) && buildLeft,
-  canBuildBroadcastRight(joinType) && buildRight,
-  left,
-  right
+  canBuildBroadcastLeft(join.joinType) && buildLeft,
+  canBuildBroadcastRight(join.joinType) && buildRight,
+  join.left,
+  join.right
 )
   }
 
   def getShuffleHashJoinBuildSide(
-  left: LogicalPlan,
-  right: LogicalPlan,
-  joinType: JoinType,
-  hint: JoinHint,
+  join: Join,
   hintOnly: Boolean,
   conf: SQLConf): Option[BuildSide] = {
 val buildLeft = if (hintOnly) {
-  hintToShuffleHashJoinLeft(hint)
+  hintToShuffleHashJoinLeft(join.hint)
 } else {
-  hintToPreferShuffleHashJoinLeft(hint) ||
-(!conf.preferSortMergeJoin && canBuildLocalHashMapBySize(left, conf) &&
-  muchSmaller(left, right, conf)) ||
+  hintToPreferShuffleHashJoinLeft(join.hint) ||
+(!conf.preferSortMergeJoin && canBuildLocalHashMapBySize(join.left, 
conf) &&
+  muchSmaller(join.left, join.right, conf)) ||
 forceApplyShuffledHashJoin(conf)
 }
 val buildRight = if (hintOnly) {
-  hintToShuffleHashJoinRight(hint)
+  hintToShuffleHashJoinRight(join.hint)
 } else {
-  hintToPreferShuffleHashJoinRight(hint) ||
-(!conf.preferSortMergeJoin && canBuildLocalHashMapBySize(right, conf) 
&&
-  muchSmaller(right, left, conf)) ||
+  hintToPreferShuffleHashJoinRight(join.hint) ||
+(!conf.preferSortMergeJoin && canBuildLocalHashMapBySize(join.right, 
conf) &&
+  muchSmaller(join.right, join.left, conf)) ||
 forceApplyShuffledHashJoin(conf)
 }
 getBuildSide(
-  canBuildShuffledHashJoinLeft(joinType) && buildLeft,
-  canBuildShuffledHashJoinRight(joinType) && buildRight,
-  left,
-  right
+  canBuildShuffledHashJoinLeft(join.joinType) && buildLeft,
+  canBuildShuffledHashJoinRight(join.joinType) && buildRight,
+

(spark) branch master updated: [SPARK-48230][BUILD] Remove unused `jodd-core`

2024-05-10 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d8151186d794 [SPARK-48230][BUILD] Remove unused `jodd-core`
d8151186d794 is described below

commit d8151186d79459fbde27a01bd97328e73548c55a
Author: Cheng Pan 
AuthorDate: Fri May 10 01:09:01 2024 -0700

[SPARK-48230][BUILD] Remove unused `jodd-core`

### What changes were proposed in this pull request?

Remove a jar that has CVE https://github.com/advisories/GHSA-jrg3-qq99-35g7

### Why are the changes needed?

Previously, `jodd-core` came in as a Hive transitive dependency, and 
https://github.com/apache/hive/pull/5151 (Hive 2.3.10) cut it out, so we can 
now remove it from Spark.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46520 from pan3793/SPARK-48230.

Authored-by: Cheng Pan 
Signed-off-by: Dongjoon Hyun 
---
 LICENSE-binary|  1 -
 dev/deps/spark-deps-hadoop-3-hive-2.3 |  1 -
 licenses-binary/LICENSE-jodd.txt  | 24 
 pom.xml   |  6 --
 sql/hive/pom.xml  |  4 
 5 files changed, 36 deletions(-)

diff --git a/LICENSE-binary b/LICENSE-binary
index 40271c9924bc..034215f0ab15 100644
--- a/LICENSE-binary
+++ b/LICENSE-binary
@@ -436,7 +436,6 @@ com.esotericsoftware:reflectasm
 org.codehaus.janino:commons-compiler
 org.codehaus.janino:janino
 jline:jline
-org.jodd:jodd-core
 com.github.wendykierp:JTransforms
 pl.edu.icm:JLargeArrays
 
diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 
b/dev/deps/spark-deps-hadoop-3-hive-2.3
index 392bacd73277..29997815e5bc 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -143,7 +143,6 @@ jline/2.14.6//jline-2.14.6.jar
 jline/3.24.1//jline-3.24.1.jar
 jna/5.13.0//jna-5.13.0.jar
 joda-time/2.12.7//joda-time-2.12.7.jar
-jodd-core/3.5.2//jodd-core-3.5.2.jar
 jpam/1.1//jpam-1.1.jar
 json/1.8//json-1.8.jar
 json4s-ast_2.13/4.0.7//json4s-ast_2.13-4.0.7.jar
diff --git a/licenses-binary/LICENSE-jodd.txt b/licenses-binary/LICENSE-jodd.txt
deleted file mode 100644
index cc6b458adb38..
--- a/licenses-binary/LICENSE-jodd.txt
+++ /dev/null
@@ -1,24 +0,0 @@
-Copyright (c) 2003-present, Jodd Team (https://jodd.org)
-All rights reserved.
-
-Redistribution and use in source and binary forms, with or without
-modification, are permitted provided that the following conditions are met:
-
-1. Redistributions of source code must retain the above copyright notice,
-this list of conditions and the following disclaimer.
-
-2. Redistributions in binary form must reproduce the above copyright
-notice, this list of conditions and the following disclaimer in the
-documentation and/or other materials provided with the distribution.
-
-THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
-AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
-IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
-ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
-LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
-CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
-SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
-INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
-CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
-ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
-POSSIBILITY OF SUCH DAMAGE.
\ No newline at end of file
diff --git a/pom.xml b/pom.xml
index 56a34cedde51..a98efe8aed1e 100644
--- a/pom.xml
+++ b/pom.xml
@@ -201,7 +201,6 @@
 3.1.9
 3.0.12
 2.12.7
-3.5.2
 3.0.0
 2.2.11
 0.16.0
@@ -2783,11 +2782,6 @@
 joda-time
 ${joda.version}
   
-  
-org.jodd
-jodd-core
-${jodd.version}
-  
   
 org.datanucleus
 datanucleus-core
diff --git a/sql/hive/pom.xml b/sql/hive/pom.xml
index 3895d9dc5a63..5e9fc256e7e6 100644
--- a/sql/hive/pom.xml
+++ b/sql/hive/pom.xml
@@ -152,10 +152,6 @@
   joda-time
   joda-time
 
-
-  org.jodd
-  jodd-core
-
 
   com.google.code.findbugs
   jsr305


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-47847][CORE] Deprecate `spark.network.remoteReadNioBufferConversion`

2024-05-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new c048653435f9 [SPARK-47847][CORE] Deprecate 
`spark.network.remoteReadNioBufferConversion`
c048653435f9 is described below

commit c048653435f9b7c832f79d38a504a145a17654c0
Author: Cheng Pan 
AuthorDate: Thu May 9 22:55:07 2024 -0700

[SPARK-47847][CORE] Deprecate `spark.network.remoteReadNioBufferConversion`

### What changes were proposed in this pull request?

`spark.network.remoteReadNioBufferConversion` was introduced in 
https://github.com/apache/spark/commit/2c82745686f4456c4d5c84040a431dcb5b6cb60b 
to allow disabling 
[SPARK-24307](https://issues.apache.org/jira/browse/SPARK-24307) for safety. 
Throughout the whole Spark 3 period there have been no negative reports, which proves 
that [SPARK-24307](https://issues.apache.org/jira/browse/SPARK-24307) is solid 
enough, so I propose to mark it deprecated in 3.5.2 and remove it in 4.1.0 or later.
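
As a rough illustration (hypothetical snippet, not part of this patch):

```scala
import org.apache.spark.SparkConf

// The key still takes effect in 3.5.2, but SparkConf logs a deprecation warning
// for it, pointing users at the message registered in the DeprecatedConfig entry.
val conf = new SparkConf().set("spark.network.remoteReadNioBufferConversion", "true")
```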

### Why are the changes needed?

Code clean up

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46047 from pan3793/SPARK-47847.

Authored-by: Cheng Pan 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 33cac4436e593c9c501c5ff0eedf923d3a21899c)
Signed-off-by: Dongjoon Hyun 
---
 core/src/main/scala/org/apache/spark/SparkConf.scala | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/core/src/main/scala/org/apache/spark/SparkConf.scala 
b/core/src/main/scala/org/apache/spark/SparkConf.scala
index 813a14acd19e..f49e9e357c84 100644
--- a/core/src/main/scala/org/apache/spark/SparkConf.scala
+++ b/core/src/main/scala/org/apache/spark/SparkConf.scala
@@ -638,7 +638,9 @@ private[spark] object SparkConf extends Logging {
   DeprecatedConfig("spark.blacklist.killBlacklistedExecutors", "3.1.0",
 "Please use spark.excludeOnFailure.killExcludedExecutors"),
   
DeprecatedConfig("spark.yarn.blacklist.executor.launch.blacklisting.enabled", 
"3.1.0",
-"Please use spark.yarn.executor.launch.excludeOnFailure.enabled")
+"Please use spark.yarn.executor.launch.excludeOnFailure.enabled"),
+  DeprecatedConfig("spark.network.remoteReadNioBufferConversion", "3.5.2",
+"Please open a JIRA ticket to report it if you need to use this 
configuration.")
 )
 
 Map(configs.map { cfg => (cfg.key -> cfg) } : _*)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (8ccc8b92be50 -> 33cac4436e59)

2024-05-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 8ccc8b92be50 [SPARK-48201][DOCS][PYTHON] Make some corrections in the 
docstring of pyspark DataStreamReader methods
 add 33cac4436e59 [SPARK-47847][CORE] Deprecate 
`spark.network.remoteReadNioBufferConversion`

No new revisions were added by this update.

Summary of changes:
 core/src/main/scala/org/apache/spark/SparkConf.scala | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48201][DOCS][PYTHON] Make some corrections in the docstring of pyspark DataStreamReader methods

2024-05-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8ccc8b92be50 [SPARK-48201][DOCS][PYTHON] Make some corrections in the 
docstring of pyspark DataStreamReader methods
8ccc8b92be50 is described below

commit 8ccc8b92be50b1d5ef932873403e62e28c478781
Author: Chloe He 
AuthorDate: Thu May 9 22:07:04 2024 -0700

[SPARK-48201][DOCS][PYTHON] Make some corrections in the docstring of 
pyspark DataStreamReader methods

### What changes were proposed in this pull request?

The docstrings of the pyspark DataStreamReader methods `csv()` and 
`text()` say that the `path` parameter can be a list, but when a list 
is actually passed, an error is raised.

### Why are the changes needed?

Documentation is wrong.

### Does this PR introduce _any_ user-facing change?

Yes. Fixes documentation.

### How was this patch tested?

N/A

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46416 from chloeh13q/fix/streamread-docstring.

Authored-by: Chloe He 
Signed-off-by: Dongjoon Hyun 
---
 python/pyspark/sql/streaming/readwriter.py | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/python/pyspark/sql/streaming/readwriter.py 
b/python/pyspark/sql/streaming/readwriter.py
index c2b75dd8f167..b202a499e8b0 100644
--- a/python/pyspark/sql/streaming/readwriter.py
+++ b/python/pyspark/sql/streaming/readwriter.py
@@ -553,8 +553,8 @@ class DataStreamReader(OptionUtils):
 
 Parameters
 --
-path : str or list
-string, or list of strings, for input path(s).
+path : str
+string for input path.
 
 Other Parameters
 
@@ -641,8 +641,8 @@ class DataStreamReader(OptionUtils):
 
 Parameters
 --
-path : str or list
-string, or list of strings, for input path(s).
+path : str
+string for input path.
 schema : :class:`pyspark.sql.types.StructType` or str, optional
 an optional :class:`pyspark.sql.types.StructType` for the input 
schema
 or a DDL-formatted string (For example ``col0 INT, col1 DOUBLE``).


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48228][PYTHON][CONNECT] Implement the missing function validation in ApplyInXXX

2024-05-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9bb15db85e53 [SPARK-48228][PYTHON][CONNECT] Implement the missing 
function validation in ApplyInXXX
9bb15db85e53 is described below

commit 9bb15db85e53b69b9c0ba112cd1dd93d8213eea4
Author: Ruifeng Zheng 
AuthorDate: Thu May 9 22:01:13 2024 -0700

[SPARK-48228][PYTHON][CONNECT] Implement the missing function validation in 
ApplyInXXX

### What changes were proposed in this pull request?
Implement the missing function validation in ApplyInXXX

https://github.com/apache/spark/pull/46397 fixed this issue for 
`Cogrouped.ApplyInPandas`; this PR fixes the remaining methods.

### Why are the changes needed?
For better error messages:

```
In [12]: df1 = spark.range(11)

In [13]: df2 = df1.groupby("id").applyInPandas(lambda: 1, 
StructType([StructField("d", DoubleType())]))

In [14]: df2.show()
```

Before this PR, an invalid function causes confusing execution errors:
```
24/05/10 11:37:36 ERROR Executor: Exception in task 0.0 in stage 10.0 (TID 
36)
org.apache.spark.api.python.PythonException: Traceback (most recent call 
last):
  File 
"/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
1834, in main
process()
  File 
"/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
1826, in process
serializer.dump_stream(out_iter, outfile)
  File 
"/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py",
 line 531, in dump_stream
return ArrowStreamSerializer.dump_stream(self, 
init_stream_yield_batches(), stream)
   

  File 
"/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py",
 line 104, in dump_stream
for batch in iterator:
  File 
"/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py",
 line 524, in init_stream_yield_batches
for series in iterator:
  File 
"/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
1610, in mapper
return f(keys, vals)
   ^
  File 
"/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
488, in 
return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
  ^
  File 
"/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
483, in wrapped
result, return_type, _assign_cols_by_name, truncate_return_schema=False
^^
UnboundLocalError: cannot access local variable 'result' where it is not 
associated with a value

at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:523)
at 
org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:117)
at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:479)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:601)
at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:896)

...
```

After this PR, the error happens before execution, which is consistent with 
Spark Classic, and is much clearer:
```
PySparkValueError: [INVALID_PANDAS_UDF] Invalid function: pandas_udf with 
function type GROUPED_MAP or the function in groupby.applyInPandas must take 
either one argument (data) or two arguments (key, data).

```

### Does this PR introduce _any_ user-facing change?
yes, error message changes

### How was this patch tested?
added tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46519 from zhengruifeng/missing_check_in_group.
  

(spark) branch master updated: [SPARK-48224][SQL] Disallow map keys from being of variant type

2024-05-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b371e7dd8800 [SPARK-48224][SQL] Disallow map keys from being of 
variant type
b371e7dd8800 is described below

commit b371e7dd88009195740f8f5b591447441ea43d0b
Author: Harsh Motwani 
AuthorDate: Thu May 9 21:47:05 2024 -0700

[SPARK-48224][SQL] Disallow map keys from being of variant type

### What changes were proposed in this pull request?

This PR disallows map keys from being of variant type. Therefore, SQL 
statements like `select map(parse_json('{"a": 1}'), 1)`, which would have worked 
earlier, will now throw an exception.
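
A minimal sketch of the behavior change, assuming a `SparkSession` named `spark` 
(the error class name comes from the test changes below):

```scala
// Previously accepted, now rejected at analysis time with
// DATATYPE_MISMATCH.INVALID_MAP_KEY_TYPE and keyType = "VARIANT":
spark.sql("""select map(parse_json('{"a": 1}'), 1)""").show()

// Variant values, as opposed to keys, are not affected by this check:
spark.sql("""select map('a', parse_json('{"a": 1}'))""").show()
```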

### Why are the changes needed?

Allowing variant to be the key type of a map can result in undefined 
behavior as this has not been tested.

### Does this PR introduce _any_ user-facing change?

Yes, users could use variants as keys in maps earlier. However, this PR 
disallows this possibility.

### How was this patch tested?

Unit tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46516 from harshmotw-db/map_variant_key.

Authored-by: Harsh Motwani 
Signed-off-by: Dongjoon Hyun 
---
 .../apache/spark/sql/catalyst/util/TypeUtils.scala |  2 +-
 .../catalyst/expressions/ComplexTypeSuite.scala| 34 +-
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala
index d2c708b380cf..a0d578c66e73 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala
@@ -58,7 +58,7 @@ object TypeUtils extends QueryErrorsBase {
   }
 
   def checkForMapKeyType(keyType: DataType): TypeCheckResult = {
-if (keyType.existsRecursively(_.isInstanceOf[MapType])) {
+if (keyType.existsRecursively(dt => dt.isInstanceOf[MapType] || 
dt.isInstanceOf[VariantType])) {
   DataTypeMismatch(
 errorSubClass = "INVALID_MAP_KEY_TYPE",
 messageParameters = Map(
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala
index 5f135e46a377..497b335289b1 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala
@@ -28,7 +28,7 @@ import org.apache.spark.sql.catalyst.util._
 import org.apache.spark.sql.catalyst.util.TypeUtils.ordinalNumber
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types._
-import org.apache.spark.unsafe.types.UTF8String
+import org.apache.spark.unsafe.types.{UTF8String, VariantVal}
 
 class ComplexTypeSuite extends SparkFunSuite with ExpressionEvalHelper {
 
@@ -359,6 +359,38 @@ class ComplexTypeSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 )
   }
 
+  // map key can't be variant
+  val map6 = CreateMap(Seq(
+Literal.create(new VariantVal(Array[Byte](), Array[Byte]())),
+Literal.create(1)
+  ))
+  map6.checkInputDataTypes() match {
+case TypeCheckResult.TypeCheckSuccess => fail("should not allow variant as 
a part of map key")
+case TypeCheckResult.DataTypeMismatch(errorSubClass, messageParameters) =>
+  assert(errorSubClass == "INVALID_MAP_KEY_TYPE")
+  assert(messageParameters === Map("keyType" -> "\"VARIANT\""))
+  }
+
+  // map key can't contain variant
+  val map7 = CreateMap(
+Seq(
+  CreateStruct(
+Seq(Literal.create(1), Literal.create(new VariantVal(Array[Byte](), 
Array[Byte](
+  ),
+  Literal.create(1)
+)
+  )
+  map7.checkInputDataTypes() match {
+case TypeCheckResult.TypeCheckSuccess => fail("should not allow variant as 
a part of map key")
+case TypeCheckResult.DataTypeMismatch(errorSubClass, messageParameters) =>
+  assert(errorSubClass == "INVALID_MAP_KEY_TYPE")
+  assert(
+messageParameters === Map(
+  "keyType" -> "\"STRUCT\""
+)
+  )
+  }
+
   test("MapFromArrays") {
 val intSeq = Seq(5, 10, 15, 20, 25)
 val longSeq = intSeq.map(_.toLong)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-47018][BUILD][SQL] Bump built-in Hive to 2.3.10

2024-05-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2d609bfd37ae [SPARK-47018][BUILD][SQL] Bump built-in Hive to 2.3.10
2d609bfd37ae is described below

commit 2d609bfd37ae9a0877fb72d1ba0479bb04a2dad6
Author: Cheng Pan 
AuthorDate: Thu May 9 21:31:50 2024 -0700

[SPARK-47018][BUILD][SQL] Bump built-in Hive to 2.3.10

### What changes were proposed in this pull request?

This PR aims to bump Spark's built-in Hive from 2.3.9 to Hive 2.3.10, with 
two additional changes:

- due to API-breaking changes in Thrift, `libthrift` is upgraded from 
`0.12` to `0.16`.
- remove version management of `commons-lang:2.6`; it came in as a Hive 
transitive dependency, and Hive 2.3.10 drops it in 
https://github.com/apache/hive/pull/4892

This is the first part of https://github.com/apache/spark/pull/45372

### Why are the changes needed?

Bump Hive to the latest 2.3.x version, preparing for upgrading Guava and 
dropping vulnerable dependencies like Jackson 1.x / Jodd

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA. (wait for sunchao to complete the 2.3.10 release to make jars 
visible on Maven Central)

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45372

Closes #46468 from pan3793/SPARK-47018.

Lead-authored-by: Cheng Pan 
Co-authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 connector/kafka-0-10-assembly/pom.xml  |  5 
 connector/kinesis-asl-assembly/pom.xml |  5 
 dev/deps/spark-deps-hadoop-3-hive-2.3  | 27 +--
 docs/building-spark.md |  4 +--
 docs/sql-data-sources-hive-tables.md   |  8 +++---
 docs/sql-migration-guide.md|  2 +-
 pom.xml| 31 +-
 .../hive/service/auth/KerberosSaslHelper.java  |  5 ++--
 .../apache/hive/service/auth/PlainSaslHelper.java  |  3 ++-
 .../hive/service/auth/TSetIpAddressProcessor.java  |  5 ++--
 .../service/cli/thrift/ThriftBinaryCLIService.java |  6 -
 .../hive/service/cli/thrift/ThriftCLIService.java  | 10 +++
 .../org/apache/spark/sql/hive/HiveUtils.scala  |  2 +-
 .../org/apache/spark/sql/hive/client/package.scala |  5 ++--
 .../hive/HiveExternalCatalogVersionsSuite.scala|  1 -
 .../spark/sql/hive/HiveSparkSubmitSuite.scala  | 10 +++
 .../spark/sql/hive/execution/HiveQuerySuite.scala  |  6 ++---
 17 files changed, 61 insertions(+), 74 deletions(-)

diff --git a/connector/kafka-0-10-assembly/pom.xml 
b/connector/kafka-0-10-assembly/pom.xml
index b2fcbdf8eca7..bd311b3a9804 100644
--- a/connector/kafka-0-10-assembly/pom.xml
+++ b/connector/kafka-0-10-assembly/pom.xml
@@ -54,11 +54,6 @@
   commons-codec
   provided
 
-
-  commons-lang
-  commons-lang
-  provided
-
 
   com.google.protobuf
   protobuf-java
diff --git a/connector/kinesis-asl-assembly/pom.xml 
b/connector/kinesis-asl-assembly/pom.xml
index 577ec2153083..0e93526fce72 100644
--- a/connector/kinesis-asl-assembly/pom.xml
+++ b/connector/kinesis-asl-assembly/pom.xml
@@ -54,11 +54,6 @@
   jackson-databind
   provided
 
-
-  commons-lang
-  commons-lang
-  provided
-
 
   org.glassfish.jersey.core
   jersey-client
diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 
b/dev/deps/spark-deps-hadoop-3-hive-2.3
index 73d41e9eeb33..392bacd73277 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -46,7 +46,6 @@ commons-compress/1.26.1//commons-compress-1.26.1.jar
 commons-crypto/1.1.0//commons-crypto-1.1.0.jar
 commons-dbcp/1.4//commons-dbcp-1.4.jar
 commons-io/2.16.1//commons-io-2.16.1.jar
-commons-lang/2.6//commons-lang-2.6.jar
 commons-lang3/3.14.0//commons-lang3-3.14.0.jar
 commons-math3/3.6.1//commons-math3-3.6.1.jar
 commons-pool/1.5.4//commons-pool-1.5.4.jar
@@ -81,19 +80,19 @@ hadoop-cloud-storage/3.4.0//hadoop-cloud-storage-3.4.0.jar
 hadoop-huaweicloud/3.4.0//hadoop-huaweicloud-3.4.0.jar
 hadoop-shaded-guava/1.2.0//hadoop-shaded-guava-1.2.0.jar
 hadoop-yarn-server-web-proxy/3.4.0//hadoop-yarn-server-web-proxy-3.4.0.jar
-hive-beeline/2.3.9//hive-beeline-2.3.9.jar
-hive-cli/2.3.9//hive-cli-2.3.9.jar
-hive-common/2.3.9//hive-common-2.3.9.jar
-hive-exec/2.3.9/core/hive-exec-2.3.9-core.jar
-hive-jdbc/2.3.9//hive-jdbc-2.3.9.jar
-hive-llap-common/2.3.9//hive-llap-common-2.3.9.jar
-hive-metastore/2.3.9//hive-metastore-2.3.9.jar
-hive-serde/2.3.9//hive-serde-2.3.9.jar
+hive-beeline/2.3.10//hive-beeline-2.3.10.jar
+hive-cli/2.3.10//hive-cli-2.3.10.jar
+hive-common/2.3.10//hive

(spark) branch master updated: [MINOR][BUILD] Remove duplicate configuration of maven-compiler-plugin

2024-05-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 1138b2a68b54 [MINOR][BUILD] Remove duplicate configuration of 
maven-compiler-plugin
1138b2a68b54 is described below

commit 1138b2a68b5408e6d079bdbce8026323694628e5
Author: zml1206 
AuthorDate: Thu May 9 20:51:32 2024 -0700

[MINOR][BUILD] Remove duplicate configuration of maven-compiler-plugin

### What changes were proposed in this pull request?
The two `${java.version}` compiler settings in the maven-compiler-plugin configuration 
(https://github.com/apache/spark/pull/46024/files#diff-9c5fb3d1b7e3b0f54bc5c4182965c4fe1f9023d449017cece3005d3f90e8e4d8R117) 
are equivalent, duplicate configuration, so the redundant 
`${java.version}` entry is removed.

https://maven.apache.org/plugins/maven-compiler-plugin/examples/set-compiler-release.html

### Why are the changes needed?
Simplifies the code and facilitates subsequent configuration iterations.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46024 from zml1206/remove_duplicate_configuration.

Authored-by: zml1206 
Signed-off-by: Dongjoon Hyun 
---
 pom.xml | 1 -
 1 file changed, 1 deletion(-)

diff --git a/pom.xml b/pom.xml
index c3ff5d101c22..678455e6e248 100644
--- a/pom.xml
+++ b/pom.xml
@@ -3127,7 +3127,6 @@
   maven-compiler-plugin
   3.13.0
   
-${java.version}
 true 
 true 
   


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-47834][SQL][CONNECT] Mark deprecated functions with `@deprecated` in `SQLImplicits`

2024-05-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 32b2827b964b [SPARK-47834][SQL][CONNECT] Mark deprecated functions 
with `@deprecated` in `SQLImplicits`
32b2827b964b is described below

commit 32b2827b964bd4a4accb60b47ddd6929f41d4a89
Author: YangJie 
AuthorDate: Thu May 9 20:47:34 2024 -0700

[SPARK-47834][SQL][CONNECT] Mark deprecated functions with `@deprecated` in 
`SQLImplicits`

### What changes were proposed in this pull request?
In the `sql` module, some functions in `SQLImplicits` have already been 
marked as `deprecated` in the function comments after SPARK-19089.

This PR adds the `@deprecated` annotation to them. Since SPARK-19089 
landed in Spark 2.2.0, the `since` field of `@deprecated` is filled in as 
`2.2.0`.

At the same time, these `deprecated` marks have also been synchronized to 
the corresponding functions in `SQLImplicits` in the `connect` module.
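
For example (hypothetical usage; the exact warning wording may differ by compiler 
version), referring to one of these encoders now produces a compile-time deprecation 
warning:

```scala
import org.apache.spark.sql.{Encoder, SparkSession}

val spark = SparkSession.builder().master("local").appName("deprecation-demo").getOrCreate()

// Compiling this now emits something like:
//   value newIntSeqEncoder is deprecated (since 2.2.0): Use newSequenceEncoder instead
val enc: Encoder[Seq[Int]] = spark.implicits.newIntSeqEncoder
```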

### Why are the changes needed?
Mark deprecated functions with `deprecated` in `SQLImplicits`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass Github Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46029 from LuciferYang/deprecated-SQLImplicits.

Lead-authored-by: YangJie 
Co-authored-by: yangjie01 
Signed-off-by: Dongjoon Hyun 
---
 .../jvm/src/main/scala/org/apache/spark/sql/SQLImplicits.scala   | 9 +
 sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala  | 9 +
 2 files changed, 18 insertions(+)

diff --git 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SQLImplicits.scala
 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SQLImplicits.scala
index 6c626fd716d5..7799d395d5c6 100644
--- 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SQLImplicits.scala
+++ 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SQLImplicits.scala
@@ -149,6 +149,7 @@ abstract class SQLImplicits private[sql] (session: 
SparkSession) extends LowPrio
* @deprecated
*   use [[newSequenceEncoder]]
*/
+  @deprecated("Use newSequenceEncoder instead", "2.2.0")
   val newIntSeqEncoder: Encoder[Seq[Int]] = newSeqEncoder(PrimitiveIntEncoder)
 
   /**
@@ -156,6 +157,7 @@ abstract class SQLImplicits private[sql] (session: 
SparkSession) extends LowPrio
* @deprecated
*   use [[newSequenceEncoder]]
*/
+  @deprecated("Use newSequenceEncoder instead", "2.2.0")
   val newLongSeqEncoder: Encoder[Seq[Long]] = 
newSeqEncoder(PrimitiveLongEncoder)
 
   /**
@@ -163,6 +165,7 @@ abstract class SQLImplicits private[sql] (session: 
SparkSession) extends LowPrio
* @deprecated
*   use [[newSequenceEncoder]]
*/
+  @deprecated("Use newSequenceEncoder instead", "2.2.0")
   val newDoubleSeqEncoder: Encoder[Seq[Double]] = 
newSeqEncoder(PrimitiveDoubleEncoder)
 
   /**
@@ -170,6 +173,7 @@ abstract class SQLImplicits private[sql] (session: 
SparkSession) extends LowPrio
* @deprecated
*   use [[newSequenceEncoder]]
*/
+  @deprecated("Use newSequenceEncoder instead", "2.2.0")
   val newFloatSeqEncoder: Encoder[Seq[Float]] = 
newSeqEncoder(PrimitiveFloatEncoder)
 
   /**
@@ -177,6 +181,7 @@ abstract class SQLImplicits private[sql] (session: 
SparkSession) extends LowPrio
* @deprecated
*   use [[newSequenceEncoder]]
*/
+  @deprecated("Use newSequenceEncoder instead", "2.2.0")
   val newByteSeqEncoder: Encoder[Seq[Byte]] = 
newSeqEncoder(PrimitiveByteEncoder)
 
   /**
@@ -184,6 +189,7 @@ abstract class SQLImplicits private[sql] (session: 
SparkSession) extends LowPrio
* @deprecated
*   use [[newSequenceEncoder]]
*/
+  @deprecated("Use newSequenceEncoder instead", "2.2.0")
   val newShortSeqEncoder: Encoder[Seq[Short]] = 
newSeqEncoder(PrimitiveShortEncoder)
 
   /**
@@ -191,6 +197,7 @@ abstract class SQLImplicits private[sql] (session: 
SparkSession) extends LowPrio
* @deprecated
*   use [[newSequenceEncoder]]
*/
+  @deprecated("Use newSequenceEncoder instead", "2.2.0")
   val newBooleanSeqEncoder: Encoder[Seq[Boolean]] = 
newSeqEncoder(PrimitiveBooleanEncoder)
 
   /**
@@ -198,6 +205,7 @@ abstract class SQLImplicits private[sql] (session: 
SparkSession) extends LowPrio
* @deprecated
*   use [[newSequenceEncoder]]
*/
+  @deprecated("Use newSequenceEncoder instead", "2.2.0")
   val newStringSeqEncoder: Encoder[Seq[String]] = newSeqEncoder(StringEncoder)
 
   /**
@@ -205,6 +213,7 @@ abstract class SQLImplicits private[sql] (session: 
SparkSession) e

(spark) branch master updated: [SPARK-48227][PYTHON][DOC] Document the requirement of seed in protos

2024-05-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 012d19d8e9b2 [SPARK-48227][PYTHON][DOC] Document the requirement of 
seed in protos
012d19d8e9b2 is described below

commit 012d19d8e9b28f7ce266753bcfff4a76c9510245
Author: Ruifeng Zheng 
AuthorDate: Thu May 9 16:58:44 2024 -0700

[SPARK-48227][PYTHON][DOC] Document the requirement of seed in protos

### What changes were proposed in this pull request?
Document the requirement of seed in protos

### Why are the changes needed?
The seed should be set at the client side.

Document it to avoid cases like https://github.com/apache/spark/pull/46456

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46518 from zhengruifeng/doc_random.

Authored-by: Ruifeng Zheng 
Signed-off-by: Dongjoon Hyun 
---
 .../common/src/main/protobuf/spark/connect/relations.proto |  8 ++--
 python/pyspark/sql/connect/plan.py | 10 --
 python/pyspark/sql/connect/proto/relations_pb2.pyi | 10 --
 3 files changed, 18 insertions(+), 10 deletions(-)

diff --git 
a/connector/connect/common/src/main/protobuf/spark/connect/relations.proto 
b/connector/connect/common/src/main/protobuf/spark/connect/relations.proto
index 3882b2e85396..0b3c9d4253e8 100644
--- a/connector/connect/common/src/main/protobuf/spark/connect/relations.proto
+++ b/connector/connect/common/src/main/protobuf/spark/connect/relations.proto
@@ -467,7 +467,9 @@ message Sample {
   // (Optional) Whether to sample with replacement.
   optional bool with_replacement = 4;
 
-  // (Optional) The random seed.
+  // (Required) The random seed.
+  // This filed is required to avoid generate mutable dataframes (see 
SPARK-48184 for details),
+  // however, still keep it 'optional' here for backward compatibility.
   optional int64 seed = 5;
 
   // (Required) Explicitly sort the underlying plan to make the ordering 
deterministic or cache it.
@@ -687,7 +689,9 @@ message StatSampleBy {
   // If a stratum is not specified, we treat its fraction as zero.
   repeated Fraction fractions = 3;
 
-  // (Optional) The random seed.
+  // (Required) The random seed.
+  // This filed is required to avoid generate mutable dataframes (see 
SPARK-48184 for details),
+  // however, still keep it 'optional' here for backward compatibility.
   optional int64 seed = 5;
 
   message Fraction {
diff --git a/python/pyspark/sql/connect/plan.py 
b/python/pyspark/sql/connect/plan.py
index 4ac4946745f5..3d3303fb15c5 100644
--- a/python/pyspark/sql/connect/plan.py
+++ b/python/pyspark/sql/connect/plan.py
@@ -717,7 +717,7 @@ class Sample(LogicalPlan):
 lower_bound: float,
 upper_bound: float,
 with_replacement: bool,
-seed: Optional[int],
+seed: int,
 deterministic_order: bool = False,
 ) -> None:
 super().__init__(child)
@@ -734,8 +734,7 @@ class Sample(LogicalPlan):
 plan.sample.lower_bound = self.lower_bound
 plan.sample.upper_bound = self.upper_bound
 plan.sample.with_replacement = self.with_replacement
-if self.seed is not None:
-plan.sample.seed = self.seed
+plan.sample.seed = self.seed
 plan.sample.deterministic_order = self.deterministic_order
 return plan
 
@@ -1526,7 +1525,7 @@ class StatSampleBy(LogicalPlan):
 child: Optional["LogicalPlan"],
 col: Column,
 fractions: Sequence[Tuple[Column, float]],
-seed: Optional[int],
+seed: int,
 ) -> None:
 super().__init__(child)
 
@@ -1554,8 +1553,7 @@ class StatSampleBy(LogicalPlan):
 fraction.stratum.CopyFrom(k.to_plan(session).literal)
 fraction.fraction = float(v)
 plan.sample_by.fractions.append(fraction)
-if self._seed is not None:
-plan.sample_by.seed = self._seed
+plan.sample_by.seed = self._seed
 return plan
 
 
diff --git a/python/pyspark/sql/connect/proto/relations_pb2.pyi 
b/python/pyspark/sql/connect/proto/relations_pb2.pyi
index 5dfb47da67a9..9b6f4b43544f 100644
--- a/python/pyspark/sql/connect/proto/relations_pb2.pyi
+++ b/python/pyspark/sql/connect/proto/relations_pb2.pyi
@@ -1865,7 +1865,10 @@ class Sample(google.protobuf.message.Message):
 with_replacement: builtins.bool
 """(Optional) Whether to sample with replacement."""
 seed: builtins.int
-"""(Optional) The random seed."""
+"""(Required) The random seed.
+This filed is required to avoid generate mut

(spark) branch master updated (b47d7853d92f -> e704b9e56b0c)

2024-05-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from b47d7853d92f [SPARK-48148][CORE] JSON objects should not be modified 
when read as STRING
 add e704b9e56b0c [SPARK-48226][BUILD] Add `spark-ganglia-lgpl` to 
`lint-java` & `spark-ganglia-lgpl` and `jvm-profiler` to `sbt-checkstyle`

No new revisions were added by this update.

Summary of changes:
 dev/lint-java  | 2 +-
 dev/sbt-checkstyle | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-48197][SQL][TESTS][FOLLOWUP][3.5] Regenerate golden files

2024-05-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new da4c808be7d6 [SPARK-48197][SQL][TESTS][FOLLOWUP][3.5] Regenerate 
golden files
da4c808be7d6 is described below

commit da4c808be7d66dc61fdcb3b41254eef77298a72c
Author: Dongjoon Hyun 
AuthorDate: Thu May 9 14:46:01 2024 -0700

[SPARK-48197][SQL][TESTS][FOLLOWUP][3.5] Regenerate golden files

### What changes were proposed in this pull request?

This PR is a follow-up to regenerate golden files for branch-3.5
- #46475

### Why are the changes needed?

To recover branch-3.5 CI.
- https://github.com/apache/spark/actions/runs/9011670853/job/24786397001
```
[info] *** 4 TESTS FAILED ***
[error] Failed: Total 3036, Failed 4, Errors 0, Passed 3032, Ignored 3
[error] Failed tests:
[error] org.apache.spark.sql.SQLQueryTestSuite
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46514 from dongjoon-hyun/SPARK-48197.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../sql-tests/analyzer-results/ansi/higher-order-functions.sql.out   | 1 -
 .../resources/sql-tests/analyzer-results/higher-order-functions.sql.out  | 1 -
 .../test/resources/sql-tests/results/ansi/higher-order-functions.sql.out | 1 -
 .../src/test/resources/sql-tests/results/higher-order-functions.sql.out  | 1 -
 4 files changed, 4 deletions(-)

diff --git 
a/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/higher-order-functions.sql.out
 
b/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/higher-order-functions.sql.out
index 3fafb9858e5a..8fe6e7097e67 100644
--- 
a/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/higher-order-functions.sql.out
+++ 
b/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/higher-order-functions.sql.out
@@ -40,7 +40,6 @@ select ceil(x -> x) as v
 org.apache.spark.sql.AnalysisException
 {
   "errorClass" : "INVALID_LAMBDA_FUNCTION_CALL.NON_HIGHER_ORDER_FUNCTION",
-  "sqlState" : "42K0D",
   "messageParameters" : {
 "class" : 
"org.apache.spark.sql.catalyst.expressions.CeilExpressionBuilder$"
   },
diff --git 
a/sql/core/src/test/resources/sql-tests/analyzer-results/higher-order-functions.sql.out
 
b/sql/core/src/test/resources/sql-tests/analyzer-results/higher-order-functions.sql.out
index d9e88ac618aa..d85101986078 100644
--- 
a/sql/core/src/test/resources/sql-tests/analyzer-results/higher-order-functions.sql.out
+++ 
b/sql/core/src/test/resources/sql-tests/analyzer-results/higher-order-functions.sql.out
@@ -40,7 +40,6 @@ select ceil(x -> x) as v
 org.apache.spark.sql.AnalysisException
 {
   "errorClass" : "INVALID_LAMBDA_FUNCTION_CALL.NON_HIGHER_ORDER_FUNCTION",
-  "sqlState" : "42K0D",
   "messageParameters" : {
 "class" : 
"org.apache.spark.sql.catalyst.expressions.CeilExpressionBuilder$"
   },
diff --git 
a/sql/core/src/test/resources/sql-tests/results/ansi/higher-order-functions.sql.out
 
b/sql/core/src/test/resources/sql-tests/results/ansi/higher-order-functions.sql.out
index eb9c454109f0..dceb370c8388 100644
--- 
a/sql/core/src/test/resources/sql-tests/results/ansi/higher-order-functions.sql.out
+++ 
b/sql/core/src/test/resources/sql-tests/results/ansi/higher-order-functions.sql.out
@@ -40,7 +40,6 @@ struct<>
 org.apache.spark.sql.AnalysisException
 {
   "errorClass" : "INVALID_LAMBDA_FUNCTION_CALL.NON_HIGHER_ORDER_FUNCTION",
-  "sqlState" : "42K0D",
   "messageParameters" : {
 "class" : 
"org.apache.spark.sql.catalyst.expressions.CeilExpressionBuilder$"
   },
diff --git 
a/sql/core/src/test/resources/sql-tests/results/higher-order-functions.sql.out 
b/sql/core/src/test/resources/sql-tests/results/higher-order-functions.sql.out
index eb9c454109f0..dceb370c8388 100644
--- 
a/sql/core/src/test/resources/sql-tests/results/higher-order-functions.sql.out
+++ 
b/sql/core/src/test/resources/sql-tests/results/higher-order-functions.sql.out
@@ -40,7 +40,6 @@ struct<>
 org.apache.spark.sql.AnalysisException
 {
   "errorClass" : "INVALID_LAMBDA_FUNCTION_CALL.NON_HIGHER_ORDER_FUNCTION",
-  "sqlState" : "42K0D",
   "messageParameters" : {
 "class" : 
"org.apache.spark.sql.catalyst.expressions.CeilExpressionBuilder$"
   },


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48216][TESTS] Remove overrides DockerJDBCIntegrationSuite.connectionTimeout to make related tests configurable

2024-05-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e1fb1d7e063a [SPARK-48216][TESTS] Remove overrides 
DockerJDBCIntegrationSuite.connectionTimeout to make related tests configurable
e1fb1d7e063a is described below

commit e1fb1d7e063af7e8eb6e992c800902aff6e19e15
Author: Kent Yao 
AuthorDate: Thu May 9 08:37:07 2024 -0700

[SPARK-48216][TESTS] Remove overrides 
DockerJDBCIntegrationSuite.connectionTimeout to make related tests configurable

### What changes were proposed in this pull request?

This PR removes the overrides of DockerJDBCIntegrationSuite.connectionTimeout to 
make the related tests configurable.

### Why are the changes needed?

The DB dockers might sometimes require more time to bootstrap. The timeout should be 
configurable to avoid failures like:

```scala
[info] org.apache.spark.sql.jdbc.DB2IntegrationSuite *** ABORTED *** (3 
minutes, 11 seconds)
[info]   The code passed to eventually never returned normally. Attempted 
96 times over 3.00399815763 minutes. Last failure message: 
[jcc][t4][2030][11211][4.33.31] A communication error occurred during 
operations on the connection's underlying socket, socket input stream,
[info]   or socket output stream.  Error location: Reply.fill() - 
insufficient data (-1).  Message: Insufficient data. ERRORCODE=-4499, 
SQLSTATE=08001. (DockerJDBCIntegrationSuite.scala:215)
[info]   org.scalatest.exceptions.TestFailedDueToTimeoutException:
[info]   at 
org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:219)
[info]   at 
org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:226)
[info]   at 
org.scalatest.concurrent.Eventually.eventually(Eventually.scala:313)
[info]   at 
org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:312)
```
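
With the hard-coded overrides gone, the timeout can be raised for every suite through 
the `spark.test.docker.connectionTimeout` system property; a hypothetical example:

```scala
// Equivalent to passing -Dspark.test.docker.connectionTimeout=10min to the test JVM;
// DockerJDBCIntegrationSuite reads the property via sys.props with a "5min" default.
sys.props("spark.test.docker.connectionTimeout") = "10min"
```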

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

Passing GA

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46505 from yaooqinn/SPARK-48216.

Authored-by: Kent Yao 
Signed-off-by: Dongjoon Hyun 
---
 .../test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala| 4 
 .../test/scala/org/apache/spark/sql/jdbc/DB2KrbIntegrationSuite.scala | 3 ---
 .../test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala | 4 
 .../test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala | 3 ---
 .../org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala| 4 
 .../scala/org/apache/spark/sql/jdbc/v2/MySQLIntegrationSuite.scala| 4 
 .../scala/org/apache/spark/sql/jdbc/v2/OracleIntegrationSuite.scala   | 4 
 7 files changed, 26 deletions(-)

diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala
index aca174cce194..4ece4d2088f4 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala
@@ -21,8 +21,6 @@ import java.math.BigDecimal
 import java.sql.{Connection, Date, Timestamp}
 import java.util.Properties
 
-import org.scalatest.time.SpanSugar._
-
 import org.apache.spark.sql.{Row, SaveMode}
 import org.apache.spark.sql.catalyst.util.DateTimeTestUtils._
 import org.apache.spark.sql.internal.SQLConf
@@ -41,8 +39,6 @@ import org.apache.spark.tags.DockerTest
 class DB2IntegrationSuite extends DockerJDBCIntegrationSuite {
   override val db = new DB2DatabaseOnDocker
 
-  override val connectionTimeout = timeout(3.minutes)
-
   override def dataPreparation(conn: Connection): Unit = {
 conn.prepareStatement("CREATE TABLE tbl (x INTEGER, y 
VARCHAR(8))").executeUpdate()
 conn.prepareStatement("INSERT INTO tbl VALUES (42,'fred')").executeUpdate()
diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2KrbIntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2KrbIntegrationSuite.scala
index abb683c06495..4899de2b2a14 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2KrbIntegrationSuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2KrbIntegrationSuite.scala
@@ -24,7 +24,6 @@ import javax.security.auth.login.Configuration
 import com.github.dockerjava.api.model.{AccessMode, Bind, ContainerConfig, 
HostConfig, Volume}
 import org.apache.hadoop.security.{SecurityUtil, UserGroup

(spark) branch master updated: [SPARK-47186][TESTS][FOLLOWUP] Correct the name of spark.test.docker.connectionTimeout

2024-05-08 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 5891b20ef492 [SPARK-47186][TESTS][FOLLOWUP] Correct the name of 
spark.test.docker.connectionTimeout
5891b20ef492 is described below

commit 5891b20ef492e3dad31ff851770d9c4f9c7c4de4
Author: Kent Yao 
AuthorDate: Wed May 8 21:56:55 2024 -0700

[SPARK-47186][TESTS][FOLLOWUP] Correct the name of 
spark.test.docker.connectionTimeout

### What changes were proposed in this pull request?

This PR is a follow-up of SPARK-47186 to correct the name of 
spark.test.docker.connectionTimeout.

### Why are the changes needed?

test bugfix

### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

existing tests

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #46495 from yaooqinn/SPARK-47186-FF.

Authored-by: Kent Yao 
Signed-off-by: Dongjoon Hyun 
---
 .../scala/org/apache/spark/sql/jdbc/DockerJDBCIntegrationSuite.scala| 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DockerJDBCIntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DockerJDBCIntegrationSuite.scala
index ded7bb3a6bf6..8d17e0b4e36e 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DockerJDBCIntegrationSuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DockerJDBCIntegrationSuite.scala
@@ -115,7 +115,7 @@ abstract class DockerJDBCIntegrationSuite
   protected val startContainerTimeout: Long =
 
timeStringAsSeconds(sys.props.getOrElse("spark.test.docker.startContainerTimeout",
 "5min"))
   protected val connectionTimeout: PatienceConfiguration.Timeout = {
-val timeoutStr = sys.props.getOrElse("spark.test.docker.conn", "5min")
+val timeoutStr = 
sys.props.getOrElse("spark.test.docker.connectionTimeout", "5min")
 timeout(timeStringAsSeconds(timeoutStr).seconds)
   }
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



svn commit: r69044 - /dev/spark/v3.4.3-rc2-docs/

2024-05-08 Thread dongjoon
Author: dongjoon
Date: Thu May  9 02:31:50 2024
New Revision: 69044

Log:
Remove Apache Spark 3.4.3 RC2 docs after releasing 3.4.3

Removed:
dev/spark/v3.4.3-rc2-docs/


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (4fb6624bd2ce -> 337f980f0073)

2024-05-08 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 4fb6624bd2ce [SPARK-48205][PYTHON] Remove the private[sql] modifier 
for Python data sources
 add 337f980f0073 [SPARK-48204][INFRA] Fix release script for Spark 4.0+

No new revisions were added by this update.

Summary of changes:
 dev/create-release/release-build.sh | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.4 updated: [SPARK-48207][INFRA][3.4] Run `build/scala-213/java-11-17` jobs of `branch-3.4` only if needed

2024-05-08 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new d16a4f4c98d5 [SPARK-48207][INFRA][3.4] Run 
`build/scala-213/java-11-17` jobs of `branch-3.4` only if needed
d16a4f4c98d5 is described below

commit d16a4f4c98d5e6a44ff783e20a9f2f2f80c009f3
Author: Dongjoon Hyun 
AuthorDate: Wed May 8 16:19:40 2024 -0700

[SPARK-48207][INFRA][3.4] Run `build/scala-213/java-11-17` jobs of 
`branch-3.4` only if needed

### What changes were proposed in this pull request?

This PR aims to run the `build`, `scala-213`, and `java-11-17` jobs of 
`branch-3.4` only if needed, to reduce the maximum concurrency of Apache Spark 
GitHub Action usage.

### Why are the changes needed?

To meet ASF Infra GitHub Action policy, we need to reduce the maximum 
concurrency.
- https://infra.apache.org/github-actions-policy.html
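
The diff below wires the output of `dev/is-changed.py` into the workflow's precondition JSON, so a job only runs when its modules changed. A minimal Python sketch of that gating idea, assuming a hypothetical module-to-path map and a pre-computed list of changed files (the real logic lives in `dev/is-changed.py` and `build_and_test.yml`, not in this snippet):

```
# Illustrative only: change-based job gating, not the actual dev/is-changed.py.
import fnmatch
import json

# Hypothetical mapping from CI job to the paths that should trigger it.
JOB_PATHS = {
    "build": ["core/*", "sql/*", "launcher/*"],
    "docker-integration-tests": ["connector/docker-integration-tests/*"],
    "k8s-integration-tests": ["resource-managers/kubernetes/*"],
}

def job_needed(job, changed_files):
    # Return "true"/"false" strings so the result can be spliced into the
    # precondition JSON consumed by downstream jobs.
    patterns = JOB_PATHS.get(job, [])
    hit = any(fnmatch.fnmatch(f, p) for f in changed_files for p in patterns)
    return "true" if hit else "false"

changed = ["connector/docker-integration-tests/src/test/scala/Foo.scala"]
precondition = {job: job_needed(job, changed) for job in JOB_PATHS}
print(json.dumps(precondition))
# {"build": "false", "docker-integration-tests": "true", "k8s-integration-tests": "false"}
```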

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46489 from dongjoon-hyun/SPARK-48207.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 64f18b5163b1..3e44d6cfd179 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -98,18 +98,17 @@ jobs:
 tpcds=false
 docker=false
   fi
-  # 'build', 'scala-213', and 'java-11-17' are always true for now.
-  # It does not save significant time and most of PRs trigger the 
build.
+  build=`./dev/is-changed.py -m 
"core,unsafe,kvstore,avro,network-common,network-shuffle,repl,launcher,examples,sketch,graphx,catalyst,hive-thriftserver,streaming,sql-kafka-0-10,streaming-kafka-0-10,mllib-local,mllib,yarn,mesos,kubernetes,hadoop-cloud,spark-ganglia-lgpl,connect,protobuf"`
   precondition="
 {
-  \"build\": \"true\",
+  \"build\": \"$build\",
   \"pyspark\": \"$pyspark\",
   \"pyspark-pandas\": \"$pandas\",
   \"sparkr\": \"$sparkr\",
   \"tpcds-1g\": \"$tpcds\",
   \"docker-integration-tests\": \"$docker\",
-  \"scala-213\": \"true\",
-  \"java-11-17\": \"true\",
+  \"scala-213\": \"$build\",
+  \"java-11-17\": \"$build\",
   \"lint\" : \"true\",
   \"k8s-integration-tests\" : \"$kubernetes\",
 }"


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.4 updated: [SPARK-48192][INFRA] Enable TPC-DS tests in forked repository

2024-05-08 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new bd54e633121c [SPARK-48192][INFRA] Enable TPC-DS tests in forked 
repository
bd54e633121c is described below

commit bd54e633121c77293bbb0cd343eeebb167ca5edf
Author: Hyukjin Kwon 
AuthorDate: Wed May 8 17:13:11 2024 +0900

[SPARK-48192][INFRA] Enable TPC-DS tests in forked repository

This PR is a sort of followup of 
https://github.com/apache/spark/pull/46361. It proposes to run TPC-DS and 
Docker integration tests in PRs (which do not consume ASF resources).

TPC-DS and Docker integration tests should at least run in a PR when 
the PR touches related code.

No, test-only.

Manually

No.

Closes #46470 from HyukjinKwon/SPARK-48192.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit f693abc8de949b1fd5f77b9e74037b0cc2298aef)
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 82779217b1fa1dea2b18772795969c04c1f34532)
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 0166395ceb4a..64f18b5163b1 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -84,17 +84,19 @@ jobs:
   if [ -f "./dev/is-changed.py" ]; then
 pyspark_modules=`cd dev && python -c "import 
sparktestsupport.modules as m; print(','.join(m.name for m in m.all_modules if 
m.name.startswith('pyspark')))"`
 pyspark=`./dev/is-changed.py -m $pyspark_modules`
-tpcds=`./dev/is-changed.py -m sql`
-docker=`./dev/is-changed.py -m docker-integration-tests`
   fi
   if [[ "${{ github.repository }}" != 'apache/spark' ]]; then
 pandas=$pyspark
 kubernetes=`./dev/is-changed.py -m kubernetes`
 sparkr=`./dev/is-changed.py -m sparkr`
+tpcds=`./dev/is-changed.py -m sql`
+docker=`./dev/is-changed.py -m docker-integration-tests`
   else
 pandas=false
 kubernetes=false
 sparkr=false
+tpcds=false
+docker=false
   fi
   # 'build', 'scala-213', and 'java-11-17' are always true for now.
   # It does not save significant time and most of PRs trigger the 
build.


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.4 updated: [SPARK-48133][INFRA] Run `sparkr` only in PR builders and Daily CIs

2024-05-08 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new b6de16317abd [SPARK-48133][INFRA] Run `sparkr` only in PR builders and 
Daily CIs
b6de16317abd is described below

commit b6de16317abdead63fe12a686573c20172959437
Author: Dongjoon Hyun 
AuthorDate: Sun May 5 13:19:23 2024 -0700

[SPARK-48133][INFRA] Run `sparkr` only in PR builders and Daily CIs

This PR aims to run `sparkr` only in PR builders and Daily CIs. In 
other words, only the commit builder will skip it by default.

To reduce GitHub Action usage to meet ASF INFRA policy.
- https://infra.apache.org/github-actions-policy.html

> All workflows MUST have a job concurrency level less than or equal to 
20. This means a workflow cannot have more than 20 jobs running at the same 
time across all matrices.

No.

Manual review.

No.

Closes #46389 from dongjoon-hyun/SPARK-48133.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 32ba5c1db62c2674e8acced56f89ed840bf9)
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 6dbbf081a7d248ddce62b62e979ff06a3c793f22)
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index cf1eb7b4c233..0166395ceb4a 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -84,16 +84,17 @@ jobs:
   if [ -f "./dev/is-changed.py" ]; then
 pyspark_modules=`cd dev && python -c "import 
sparktestsupport.modules as m; print(','.join(m.name for m in m.all_modules if 
m.name.startswith('pyspark')))"`
 pyspark=`./dev/is-changed.py -m $pyspark_modules`
-sparkr=`./dev/is-changed.py -m sparkr`
 tpcds=`./dev/is-changed.py -m sql`
 docker=`./dev/is-changed.py -m docker-integration-tests`
   fi
   if [[ "${{ github.repository }}" != 'apache/spark' ]]; then
 pandas=$pyspark
 kubernetes=`./dev/is-changed.py -m kubernetes`
+sparkr=`./dev/is-changed.py -m sparkr`
   else
 pandas=false
 kubernetes=false
+sparkr=false
   fi
   # 'build', 'scala-213', and 'java-11-17' are always true for now.
   # It does not save significant time and most of PRs trigger the 
build.


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.4 updated: [SPARK-48132][INFRA] Run `k8s-integration-tests` only in PR builder and Daily CIs

2024-05-08 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 4b032a18924b [SPARK-48132][INFRA] Run `k8s-integration-tests` only in 
PR builder and Daily CIs
4b032a18924b is described below

commit 4b032a18924bd35322570551448c643786fd1a98
Author: Dongjoon Hyun 
AuthorDate: Sat May 4 22:55:04 2024 -0700

[SPARK-48132][INFRA] Run `k8s-integration-tests` only in PR builder and 
Daily CIs

This PR aims to run `k8s-integration-tests` only in PR builders and Daily 
CIs. In other words, only the commit builder will skip it by default.

Please note that
- K8s unit tests will still be covered by the commit builder.
- PR builders do not consume ASF resources, and they also provide lots of 
test coverage every day.

To reduce GitHub Action usage to meet ASF INFRA policy.
- https://infra.apache.org/github-actions-policy.html

> All workflows MUST have a job concurrency level less than or equal to 
20. This means a workflow cannot have more than 20 jobs running at the same 
time across all matrices.

No.

Manual review.

No.

Closes #46388 from dongjoon-hyun/SPARK-48132.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 9454607944df5e8430642bbe399a35436506be2a)
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 35c1328256c2..cf1eb7b4c233 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -90,8 +90,10 @@ jobs:
   fi
   if [[ "${{ github.repository }}" != 'apache/spark' ]]; then
 pandas=$pyspark
+kubernetes=`./dev/is-changed.py -m kubernetes`
   else
 pandas=false
+kubernetes=false
   fi
   # 'build', 'scala-213', and 'java-11-17' are always true for now.
   # It does not save significant time and most of PRs trigger the 
build.
@@ -106,7 +108,7 @@ jobs:
   \"scala-213\": \"true\",
   \"java-11-17\": \"true\",
   \"lint\" : \"true\",
-  \"k8s-integration-tests\" : \"true\",
+  \"k8s-integration-tests\" : \"$kubernetes\",
 }"
   echo $precondition # For debugging
   # Remove `\n` to avoid "Invalid format" error


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-48192][INFRA] Enable TPC-DS tests in forked repository

2024-05-08 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 82779217b1fa [SPARK-48192][INFRA] Enable TPC-DS tests in forked 
repository
82779217b1fa is described below

commit 82779217b1fa1dea2b18772795969c04c1f34532
Author: Hyukjin Kwon 
AuthorDate: Wed May 8 17:13:11 2024 +0900

[SPARK-48192][INFRA] Enable TPC-DS tests in forked repository

This PR is a sort of followup of 
https://github.com/apache/spark/pull/46361. It proposes to run TPC-DS and 
Docker integration tests in PRs (which do not consume ASF resources).

TPC-DS and Docker integration tests should at least run in a PR when 
the PR touches related code.

No, test-only.

Manually

No.

Closes #46470 from HyukjinKwon/SPARK-48192.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit f693abc8de949b1fd5f77b9e74037b0cc2298aef)
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 4ad4a243c76d..b016a29a86be 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -85,13 +85,15 @@ jobs:
 pandas=$pyspark
 kubernetes=`./dev/is-changed.py -m kubernetes`
 sparkr=`./dev/is-changed.py -m sparkr`
+tpcds=`./dev/is-changed.py -m sql`
+docker=`./dev/is-changed.py -m docker-integration-tests`
   else
 pandas=false
 kubernetes=false
 sparkr=false
+tpcds=false
+docker=false
   fi
-  tpcds=`./dev/is-changed.py -m sql`
-  docker=`./dev/is-changed.py -m docker-integration-tests`
   build=`./dev/is-changed.py -m 
"core,unsafe,kvstore,avro,utils,network-common,network-shuffle,repl,launcher,examples,sketch,graphx,catalyst,hive-thriftserver,streaming,sql-kafka-0-10,streaming-kafka-0-10,mllib-local,mllib,yarn,mesos,kubernetes,hadoop-cloud,spark-ganglia-lgpl,sql,hive"`
   precondition="
 {


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-48133][INFRA] Run `sparkr` only in PR builders and Daily CIs

2024-05-08 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 6dbbf081a7d2 [SPARK-48133][INFRA] Run `sparkr` only in PR builders and 
Daily CIs
6dbbf081a7d2 is described below

commit 6dbbf081a7d248ddce62b62e979ff06a3c793f22
Author: Dongjoon Hyun 
AuthorDate: Sun May 5 13:19:23 2024 -0700

[SPARK-48133][INFRA] Run `sparkr` only in PR builders and Daily CIs

This PR aims to run `sparkr` only in PR builders and Daily CIs. In 
other words, only the commit builder will skip it by default.

To reduce GitHub Action usage to meet ASF INFRA policy.
- https://infra.apache.org/github-actions-policy.html

> All workflows MUST have a job concurrency level less than or equal to 
20. This means a workflow cannot have more than 20 jobs running at the same 
time across all matrices.

No.

Manual review.

No.

Closes #46389 from dongjoon-hyun/SPARK-48133.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 32ba5c1db62c2674e8acced56f89ed840bf9)
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 645054dc2087..4ad4a243c76d 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -79,17 +79,17 @@ jobs:
   id: set-outputs
   run: |
 if [ -z "${{ inputs.jobs }}" ]; then
-  pyspark=true; sparkr=true; tpcds=true; docker=true;
   pyspark_modules=`cd dev && python -c "import 
sparktestsupport.modules as m; print(','.join(m.name for m in m.all_modules if 
m.name.startswith('pyspark')))"`
   pyspark=`./dev/is-changed.py -m $pyspark_modules`
   if [[ "${{ github.repository }}" != 'apache/spark' ]]; then
 pandas=$pyspark
 kubernetes=`./dev/is-changed.py -m kubernetes`
+sparkr=`./dev/is-changed.py -m sparkr`
   else
 pandas=false
 kubernetes=false
+sparkr=false
   fi
-  sparkr=`./dev/is-changed.py -m sparkr`
   tpcds=`./dev/is-changed.py -m sql`
   docker=`./dev/is-changed.py -m docker-integration-tests`
   build=`./dev/is-changed.py -m 
"core,unsafe,kvstore,avro,utils,network-common,network-shuffle,repl,launcher,examples,sketch,graphx,catalyst,hive-thriftserver,streaming,sql-kafka-0-10,streaming-kafka-0-10,mllib-local,mllib,yarn,mesos,kubernetes,hadoop-cloud,spark-ganglia-lgpl,sql,hive"`


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-48132][INFRA] Run `k8s-integration-tests` only in PR builder and Daily CIs

2024-05-08 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 9454607944df [SPARK-48132][INFRA] Run `k8s-integration-tests` only in 
PR builder and Daily CIs
9454607944df is described below

commit 9454607944df5e8430642bbe399a35436506be2a
Author: Dongjoon Hyun 
AuthorDate: Sat May 4 22:55:04 2024 -0700

[SPARK-48132][INFRA] Run `k8s-integration-tests` only in PR builder and 
Daily CIs

This PR aims to run `k8s-integration-tests` only in PR builders and Daily 
CIs. In other words, only the commit builder will skip it by default.

Please note that
- K8s unit tests will still be covered by the commit builder.
- PR builders do not consume ASF resources, and they also provide lots of 
test coverage every day.

To reduce GitHub Action usage to meet ASF INFRA policy.
- https://infra.apache.org/github-actions-policy.html

> All workflows MUST have a job concurrency level less than or equal to 
20. This means a workflow cannot have more than 20 jobs running at the same 
time across all matrices.

No.

Manual review.

No.

Closes #46388 from dongjoon-hyun/SPARK-48132.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index e73dced98238..645054dc2087 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -84,13 +84,14 @@ jobs:
   pyspark=`./dev/is-changed.py -m $pyspark_modules`
   if [[ "${{ github.repository }}" != 'apache/spark' ]]; then
 pandas=$pyspark
+kubernetes=`./dev/is-changed.py -m kubernetes`
   else
 pandas=false
+kubernetes=false
   fi
   sparkr=`./dev/is-changed.py -m sparkr`
   tpcds=`./dev/is-changed.py -m sql`
   docker=`./dev/is-changed.py -m docker-integration-tests`
-  kubernetes=`./dev/is-changed.py -m kubernetes`
   build=`./dev/is-changed.py -m 
"core,unsafe,kvstore,avro,utils,network-common,network-shuffle,repl,launcher,examples,sketch,graphx,catalyst,hive-thriftserver,streaming,sql-kafka-0-10,streaming-kafka-0-10,mllib-local,mllib,yarn,mesos,kubernetes,hadoop-cloud,spark-ganglia-lgpl,sql,hive"`
   precondition="
 {


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-48109][INFRA] Enable `k8s-integration-tests` only for `kubernetes` module change

2024-05-08 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 26dccf09322f [SPARK-48109][INFRA] Enable `k8s-integration-tests` only 
for `kubernetes` module change
26dccf09322f is described below

commit 26dccf09322fc9945557a6e005a15e14fc6926b0
Author: Dongjoon Hyun 
AuthorDate: Thu May 2 23:21:59 2024 -0700

[SPARK-48109][INFRA] Enable `k8s-integration-tests` only for `kubernetes` 
module change

This PR aims to enable `k8s-integration-tests` only for `kubernetes` module 
change.

Although there is a chance of missing a `core` module change, the daily CI 
test coverage will catch it.

To reduce GitHub Action usage to meet ASF INFRA policy.
- https://infra.apache.org/github-actions-policy.html

> The average number of minutes a project uses in any consecutive 
five-day period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
minutes, or 3,600 hours).

No.

Manual review.

No.

Closes #46356 from dongjoon-hyun/SPARK-48109.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 63837020ed29c9e6003f24117ad21f8b97f40f0f)
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 051e8c98908c..e73dced98238 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -90,6 +90,7 @@ jobs:
   sparkr=`./dev/is-changed.py -m sparkr`
   tpcds=`./dev/is-changed.py -m sql`
   docker=`./dev/is-changed.py -m docker-integration-tests`
+  kubernetes=`./dev/is-changed.py -m kubernetes`
   build=`./dev/is-changed.py -m 
"core,unsafe,kvstore,avro,utils,network-common,network-shuffle,repl,launcher,examples,sketch,graphx,catalyst,hive-thriftserver,streaming,sql-kafka-0-10,streaming-kafka-0-10,mllib-local,mllib,yarn,mesos,kubernetes,hadoop-cloud,spark-ganglia-lgpl,sql,hive"`
   precondition="
 {
@@ -102,7 +103,7 @@ jobs:
   \"scala-213\": \"$build\",
   \"java-11-17\": \"$build\",
   \"lint\" : \"true\",
-  \"k8s-integration-tests\" : \"true\",
+  \"k8s-integration-tests\" : \"$kubernetes\",
   \"breaking-changes-buf\" : \"true\",
 }"
   echo $precondition # For debugging


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.4 updated: [SPARK-48116][INFRA][FOLLOWUP] Fix `if` statement to check repository

2024-05-08 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new d4a94c283c66 [SPARK-48116][INFRA][FOLLOWUP] Fix `if` statement to 
check repository
d4a94c283c66 is described below

commit d4a94c283c66c20be8a3ba67b75b960ba3c29d6b
Author: Dongjoon Hyun 
AuthorDate: Fri May 3 21:25:41 2024 -0700

[SPARK-48116][INFRA][FOLLOWUP] Fix `if` statement to check repository

(cherry picked from commit 81775a083f2339a76f3d1af472baf58e6fdf47d2)
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 825ad064d078..35c1328256c2 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -88,7 +88,7 @@ jobs:
 tpcds=`./dev/is-changed.py -m sql`
 docker=`./dev/is-changed.py -m docker-integration-tests`
   fi
-  if [ "${{ github.repository != 'apache/spark' }}" ]; then
+  if [[ "${{ github.repository }}" != 'apache/spark' ]]; then
 pandas=$pyspark
   else
 pandas=false


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-48116][INFRA][FOLLOWUP] Fix `if` statement to check repository

2024-05-08 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 81775a083f23 [SPARK-48116][INFRA][FOLLOWUP] Fix `if` statement to 
check repository
81775a083f23 is described below

commit 81775a083f2339a76f3d1af472baf58e6fdf47d2
Author: Dongjoon Hyun 
AuthorDate: Fri May 3 21:25:41 2024 -0700

[SPARK-48116][INFRA][FOLLOWUP] Fix `if` statement to check repository
---
 .github/workflows/build_and_test.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 679c51bb0941..051e8c98908c 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -82,7 +82,7 @@ jobs:
   pyspark=true; sparkr=true; tpcds=true; docker=true;
   pyspark_modules=`cd dev && python -c "import 
sparktestsupport.modules as m; print(','.join(m.name for m in m.all_modules if 
m.name.startswith('pyspark')))"`
   pyspark=`./dev/is-changed.py -m $pyspark_modules`
-  if [ "${{ github.repository != 'apache/spark' }}" ]; then
+  if [[ "${{ github.repository }}" != 'apache/spark' ]]; then
 pandas=$pyspark
   else
 pandas=false


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.4 updated: [SPARK-48116][INFRA][3.4] Run `pyspark-pandas*` only in PR builder and Daily Python CIs

2024-05-08 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 2d5a77bbea4a [SPARK-48116][INFRA][3.4] Run `pyspark-pandas*` only in 
PR builder and Daily Python CIs
2d5a77bbea4a is described below

commit 2d5a77bbea4a96916525299d277f368790ccc602
Author: Dongjoon Hyun 
AuthorDate: Wed May 8 13:48:12 2024 -0700

[SPARK-48116][INFRA][3.4] Run `pyspark-pandas*` only in PR builder and 
Daily Python CIs

### What changes were proposed in this pull request?

This PR aims to run `pyspark-pandas*` of `branch-3.4` only in PR builders 
and Daily Python CIs. In other words, only the commit builder will skip it by 
default. Please note that PR builders do not consume ASF resources and 
they provide lots of test coverage every day.

`branch-3.4` Python Daily CI runs all Python tests including 
`pyspark-pandas` like the following.


https://github.com/apache/spark/blob/21548a8cc5c527d4416a276a852f967b4410bd4b/.github/workflows/build_branch34_python.yml#L43-L44

### Why are the changes needed?

To reduce GitHub Action usage to meet ASF INFRA policy.
- https://infra.apache.org/github-actions-policy.html

> All workflows MUST have a job concurrency level less than or equal to 
20. This means a workflow cannot have more than 20 jobs running at the same 
time across all matrices.

Although `pandas` is an **optional** package in PySpark, it is essential 
for PySpark users, and we have **6 test pipelines** which require lots of 
resources. We need to keep the job concurrency level at `less than or 
equal to 20` while retaining as much test coverage as possible.


https://github.com/apache/spark/blob/da0c7cc81bb3d69d381dd0683e910eae4c80e9ae/dev/requirements.txt#L4-L7

- pyspark-pandas
- pyspark-pandas-slow

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46483 from dongjoon-hyun/SPARK-48116-3.4.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml | 12 
 1 file changed, 12 insertions(+)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 2d2e8da80d46..825ad064d078 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -88,12 +88,18 @@ jobs:
 tpcds=`./dev/is-changed.py -m sql`
 docker=`./dev/is-changed.py -m docker-integration-tests`
   fi
+  if [ "${{ github.repository != 'apache/spark' }}" ]; then
+pandas=$pyspark
+  else
+pandas=false
+  fi
   # 'build', 'scala-213', and 'java-11-17' are always true for now.
   # It does not save significant time and most of PRs trigger the 
build.
   precondition="
 {
   \"build\": \"true\",
   \"pyspark\": \"$pyspark\",
+  \"pyspark-pandas\": \"$pandas\",
   \"sparkr\": \"$sparkr\",
   \"tpcds-1g\": \"$tpcds\",
   \"docker-integration-tests\": \"$docker\",
@@ -349,6 +355,12 @@ jobs:
 pyspark-pandas-slow
   - >-
 pyspark-connect
+exclude:
+  # Always run if pyspark-pandas == 'true', even infra-image is skip 
(such as non-master job)
+  # In practice, the build will run in individual PR, but not against 
the individual commit
+  # in Apache Spark repository.
+  - modules: ${{ 
fromJson(needs.precondition.outputs.required).pyspark-pandas != 'true' && 
'pyspark-pandas' }}
+  - modules: ${{ 
fromJson(needs.precondition.outputs.required).pyspark-pandas != 'true' && 
'pyspark-pandas-slow' }}
 env:
   MODULES_TO_TEST: ${{ matrix.modules }}
   HADOOP_PROFILE: ${{ inputs.hadoop }}


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-48116][INFRA][3.5] Run `pyspark-pandas*` only in PR builder and Daily Python CIs

2024-05-08 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new ff691fa611f0 [SPARK-48116][INFRA][3.5] Run `pyspark-pandas*` only in 
PR builder and Daily Python CIs
ff691fa611f0 is described below

commit ff691fa611f0c8a7f0ff626179bced2b48ef9b7d
Author: Dongjoon Hyun 
AuthorDate: Wed May 8 13:45:55 2024 -0700

[SPARK-48116][INFRA][3.5] Run `pyspark-pandas*` only in PR builder and 
Daily Python CIs

### What changes were proposed in this pull request?

This PR aims to run `pyspark-pandas*` of `branch-3.5` only in PR builders 
and Daily Python CIs. In other words, only the commit builder will skip it by 
default. Please note that PR builders do not consume ASF resources and 
they provide lots of test coverage every day.

`branch-3.5` Python Daily CI runs all Python tests including 
`pyspark-pandas` like the following.


https://github.com/apache/spark/blob/21548a8cc5c527d4416a276a852f967b4410bd4b/.github/workflows/build_branch35_python.yml#L43-L44

### Why are the changes needed?

To reduce GitHub Action usage to meet ASF INFRA policy.
- https://infra.apache.org/github-actions-policy.html

> All workflows MUST have a job concurrency level less than or equal to 
20. This means a workflow cannot have more than 20 jobs running at the same 
time across all matrices.

Although `pandas` is an **optional** package in PySpark, it is essential 
for PySpark users, and we have **6 test pipelines** which require lots of 
resources. We need to keep the job concurrency level at `less than or 
equal to 20` while retaining as much test coverage as possible.


https://github.com/apache/spark/blob/a762f3175fcdb7b069faa0c2bfce93d295cb1f10/dev/requirements.txt#L4-L7

- pyspark-pandas
- pyspark-pandas-slow
- pyspark-pandas-connect
- pyspark-pandas-slow-connect
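
The matrix `exclude` entries in the diff below drop the four pipelines listed above unless the precondition job reported `"pyspark-pandas": "true"`. A rough Python sketch of that filtering with made-up module names; the real filtering is done by GitHub Actions' `exclude` keyword, not by a script:

```
import json

PANDAS_PIPELINES = {
    "pyspark-pandas",
    "pyspark-pandas-slow",
    "pyspark-pandas-connect",
    "pyspark-pandas-slow-connect",
}

def effective_matrix(precondition_json, modules):
    # Keep the pandas pipelines only when the precondition marked them required.
    required = json.loads(precondition_json)
    if required.get("pyspark-pandas") == "true":
        return list(modules)
    return [m for m in modules if m not in PANDAS_PIPELINES]

matrix = ["pyspark-sql", "pyspark-core", "pyspark-pandas", "pyspark-pandas-slow-connect"]
print(effective_matrix('{"pyspark": "true", "pyspark-pandas": "false"}', matrix))
# ['pyspark-sql', 'pyspark-core']
```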

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46482 from dongjoon-hyun/SPARK-48116-3.5.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 9c3dc95d0f66..679c51bb0941 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -82,6 +82,11 @@ jobs:
   pyspark=true; sparkr=true; tpcds=true; docker=true;
   pyspark_modules=`cd dev && python -c "import 
sparktestsupport.modules as m; print(','.join(m.name for m in m.all_modules if 
m.name.startswith('pyspark')))"`
   pyspark=`./dev/is-changed.py -m $pyspark_modules`
+  if [ "${{ github.repository != 'apache/spark' }}" ]; then
+pandas=$pyspark
+  else
+pandas=false
+  fi
   sparkr=`./dev/is-changed.py -m sparkr`
   tpcds=`./dev/is-changed.py -m sql`
   docker=`./dev/is-changed.py -m docker-integration-tests`
@@ -90,6 +95,7 @@ jobs:
 {
   \"build\": \"$build\",
   \"pyspark\": \"$pyspark\",
+  \"pyspark-pandas\": \"$pandas\",
   \"sparkr\": \"$sparkr\",
   \"tpcds-1g\": \"$tpcds\",
   \"docker-integration-tests\": \"$docker\",
@@ -361,6 +367,14 @@ jobs:
 pyspark-pandas-connect
   - >-
 pyspark-pandas-slow-connect
+exclude:
+  # Always run if pyspark-pandas == 'true', even infra-image is skip 
(such as non-master job)
+  # In practice, the build will run in individual PR, but not against 
the individual commit
+  # in Apache Spark repository.
+  - modules: ${{ 
fromJson(needs.precondition.outputs.required).pyspark-pandas != 'true' && 
'pyspark-pandas' }}
+  - modules: ${{ 
fromJson(needs.precondition.outputs.required).pyspark-pandas != 'true' && 
'pyspark-pandas-slow' }}
+  - modules: ${{ 
fromJson(needs.precondition.outputs.required).pyspark-pandas != 'true' && 
'pyspark-pandas-connect' }}
+  - modules: ${{ 
fromJson(needs.precondition.outputs.required).pyspark-pandas != 'true' && 
'pyspark-pandas-slow-connect' }}
 env:
   MODULES_TO_TEST: ${{ matrix.modules }}
   HADOOP_PROFILE: ${{ inputs.hadoop }}


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48203][INFRA] Spin off `pyspark` tests from `build_branch34.yml` Daily CI

2024-05-08 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new fbfcd402851e [SPARK-48203][INFRA] Spin off `pyspark` tests from 
`build_branch34.yml` Daily CI
fbfcd402851e is described below

commit fbfcd402851ee604789b8ba72a1ee0e67ef5ebe4
Author: Dongjoon Hyun 
AuthorDate: Wed May 8 12:30:12 2024 -0700

[SPARK-48203][INFRA] Spin off `pyspark` tests from `build_branch34.yml` 
Daily CI

### What changes were proposed in this pull request?

This PR aims to create `build_branch34_python.yml` in order to spin off 
`pyspark` tests from `build_branch34.yml` Daily CI.

### Why are the changes needed?

Currently, `build_branch34.yml` creates more than 15 test pipelines 
concurrently, which exceeds the ASF Infra policy limit.
- https://github.com/apache/spark/actions/workflows/build_branch35.yml

We should offload this to a Python-only Daily CI, like the `master` branch's 
`Python Only` Daily CI.
- https://github.com/apache/spark/actions/workflows/build_python_3.10.yml

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46480 from dongjoon-hyun/SPARK-48203.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_branch34.yml|  1 -
 .../{build_branch34.yml => build_branch34_python.yml}   | 13 +++--
 2 files changed, 3 insertions(+), 11 deletions(-)

diff --git a/.github/workflows/build_branch34.yml 
b/.github/workflows/build_branch34.yml
index 68887970d4d8..deb6c4240797 100644
--- a/.github/workflows/build_branch34.yml
+++ b/.github/workflows/build_branch34.yml
@@ -43,7 +43,6 @@ jobs:
   jobs: >-
 {
   "build": "true",
-  "pyspark": "true",
   "sparkr": "true",
   "tpcds-1g": "true",
   "docker-integration-tests": "true",
diff --git a/.github/workflows/build_branch34.yml 
b/.github/workflows/build_branch34_python.yml
similarity index 74%
copy from .github/workflows/build_branch34.yml
copy to .github/workflows/build_branch34_python.yml
index 68887970d4d8..c109ba2dc792 100644
--- a/.github/workflows/build_branch34.yml
+++ b/.github/workflows/build_branch34_python.yml
@@ -17,7 +17,7 @@
 # under the License.
 #
 
-name: "Build (branch-3.4, Scala 2.13, Hadoop 3, JDK 8)"
+name: "Build / Python-only (branch-3.4)"
 
 on:
   schedule:
@@ -36,17 +36,10 @@ jobs:
   hadoop: hadoop3
   envs: >-
 {
-  "SCALA_PROFILE": "scala2.13",
-  "PYTHON_TO_TEST": "",
-  "ORACLE_DOCKER_IMAGE_NAME": "gvenzl/oracle-xe:21.3.0"
+  "PYTHON_TO_TEST": ""
 }
   jobs: >-
 {
-  "build": "true",
   "pyspark": "true",
-  "sparkr": "true",
-  "tpcds-1g": "true",
-  "docker-integration-tests": "true",
-  "k8s-integration-tests": "true",
-  "lint" : "true"
+  "pyspark-pandas": "true"
 }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48202][INFRA] Spin off `pyspark` tests from `build_branch35.yml` Daily CI

2024-05-08 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 70e5d2aa7a99 [SPARK-48202][INFRA] Spin off `pyspark` tests from 
`build_branch35.yml` Daily CI
70e5d2aa7a99 is described below

commit 70e5d2aa7a992a6f4ff9c7d8e3752ce1d3d488f2
Author: Dongjoon Hyun 
AuthorDate: Wed May 8 10:47:52 2024 -0700

[SPARK-48202][INFRA] Spin off `pyspark` tests from `build_branch35.yml` 
Daily CI

### What changes were proposed in this pull request?

This PR aims to create `build_branch35_python.yml` in order to spin off 
`pyspark` tests from `build_branch35.yml` Daily CI.

### Why are the changes needed?

Currently, `build_branch35.yml` creates more than 15 test pipelines 
concurrently, which exceeds the ASF Infra policy limit.
- https://github.com/apache/spark/actions/workflows/build_branch35.yml

We should offload this to a Python-only Daily CI, like the `master` branch's 
`Python Only` Daily CI.
- https://github.com/apache/spark/actions/workflows/build_python_3.10.yml

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46479 from dongjoon-hyun/SPARK-48202.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_branch35.yml|  1 -
 .../{build_branch35.yml => build_branch35_python.yml}   | 13 +++--
 2 files changed, 3 insertions(+), 11 deletions(-)

diff --git a/.github/workflows/build_branch35.yml 
b/.github/workflows/build_branch35.yml
index 55616c2f1f01..2ec080d5722c 100644
--- a/.github/workflows/build_branch35.yml
+++ b/.github/workflows/build_branch35.yml
@@ -43,7 +43,6 @@ jobs:
   jobs: >-
 {
   "build": "true",
-  "pyspark": "true",
   "sparkr": "true",
   "tpcds-1g": "true",
   "docker-integration-tests": "true",
diff --git a/.github/workflows/build_branch35.yml 
b/.github/workflows/build_branch35_python.yml
similarity index 74%
copy from .github/workflows/build_branch35.yml
copy to .github/workflows/build_branch35_python.yml
index 55616c2f1f01..1585534d33ba 100644
--- a/.github/workflows/build_branch35.yml
+++ b/.github/workflows/build_branch35_python.yml
@@ -17,7 +17,7 @@
 # under the License.
 #
 
-name: "Build (branch-3.5, Scala 2.13, Hadoop 3, JDK 8)"
+name: "Build / Python-only (branch-3.5)"
 
 on:
   schedule:
@@ -36,17 +36,10 @@ jobs:
   hadoop: hadoop3
   envs: >-
 {
-  "SCALA_PROFILE": "scala2.13",
-  "PYTHON_TO_TEST": "",
-  "ORACLE_DOCKER_IMAGE_NAME": "gvenzl/oracle-xe:21.3.0"
+  "PYTHON_TO_TEST": ""
 }
   jobs: >-
 {
-  "build": "true",
   "pyspark": "true",
-  "sparkr": "true",
-  "tpcds-1g": "true",
-  "docker-integration-tests": "true",
-  "k8s-integration-tests": "true",
-  "lint" : "true"
+  "pyspark-pandas": "true"
 }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48200][INFRA] Split `build_python.yml` into per-version cron jobs

2024-05-08 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9d79ab42b127 [SPARK-48200][INFRA] Split `build_python.yml` into 
per-version cron jobs
9d79ab42b127 is described below

commit 9d79ab42b127d1a12164cec260bfbd69f6da8b74
Author: Dongjoon Hyun 
AuthorDate: Wed May 8 09:40:03 2024 -0700

[SPARK-48200][INFRA] Split `build_python.yml` into per-version cron jobs

### What changes were proposed in this pull request?

This PR aims to split `build_python.yml` into per-version cron jobs.

Technically, this includes a revert of SPARK-48149 and chooses [the 
discussed 
alternative](https://github.com/apache/spark/pull/46407#discussion_r1591586209).

- https://github.com/apache/spark/pull/46407
- https://github.com/apache/spark/pull/46454

### Why are the changes needed?

To recover the Python CI while staying within the ASF INFRA policy.
- https://github.com/apache/spark/actions/workflows/build_python.yml

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46477 from dongjoon-hyun/SPARK-48200.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../{build_python.yml => build_python_3.10.yml}  | 16 ++--
 .../{build_python.yml => build_python_3.12.yml}  | 16 ++--
 .../{build_python.yml => build_python_pypy3.9.yml}   | 16 ++--
 3 files changed, 6 insertions(+), 42 deletions(-)

diff --git a/.github/workflows/build_python.yml 
b/.github/workflows/build_python_3.10.yml
similarity index 63%
copy from .github/workflows/build_python.yml
copy to .github/workflows/build_python_3.10.yml
index efa281d6a279..5ae37fbc9120 100644
--- a/.github/workflows/build_python.yml
+++ b/.github/workflows/build_python_3.10.yml
@@ -17,26 +17,14 @@
 # under the License.
 #
 
-# According to https://infra.apache.org/github-actions-policy.html,
-# all workflows SHOULD have a job concurrency level less than or equal to 15.
-# To do that, we run one python version per cron schedule
-name: "Build / Python-only (master, PyPy 3.9/Python 3.10/Python 3.12)"
+name: "Build / Python-only (master, Python 3.10)"
 
 on:
   schedule:
-- cron: '0 15 * * *'
 - cron: '0 17 * * *'
-- cron: '0 19 * * *'
 
 jobs:
   run-build:
-strategy:
-  fail-fast: false
-  matrix:
-include:
-  - pyversion: ${{ github.event.schedule == '0 15 * * *' && 'pypy3' }}
-  - pyversion: ${{ github.event.schedule == '0 17 * * *' && 
'python3.10' }}
-  - pyversion: ${{ github.event.schedule == '0 19 * * *' && 
'python3.12' }}
 permissions:
   packages: write
 name: Run
@@ -48,7 +36,7 @@ jobs:
   hadoop: hadoop3
   envs: >-
 {
-  "PYTHON_TO_TEST": "${{ matrix.pyversion }}"
+  "PYTHON_TO_TEST": "python3.10"
 }
   jobs: >-
 {
diff --git a/.github/workflows/build_python.yml 
b/.github/workflows/build_python_3.12.yml
similarity index 63%
copy from .github/workflows/build_python.yml
copy to .github/workflows/build_python_3.12.yml
index efa281d6a279..e1fd45a7d883 100644
--- a/.github/workflows/build_python.yml
+++ b/.github/workflows/build_python_3.12.yml
@@ -17,26 +17,14 @@
 # under the License.
 #
 
-# According to https://infra.apache.org/github-actions-policy.html,
-# all workflows SHOULD have a job concurrency level less than or equal to 15.
-# To do that, we run one python version per cron schedule
-name: "Build / Python-only (master, PyPy 3.9/Python 3.10/Python 3.12)"
+name: "Build / Python-only (master, Python 3.12)"
 
 on:
   schedule:
-- cron: '0 15 * * *'
-- cron: '0 17 * * *'
 - cron: '0 19 * * *'
 
 jobs:
   run-build:
-strategy:
-  fail-fast: false
-  matrix:
-include:
-  - pyversion: ${{ github.event.schedule == '0 15 * * *' && 'pypy3' }}
-  - pyversion: ${{ github.event.schedule == '0 17 * * *' && 
'python3.10' }}
-  - pyversion: ${{ github.event.schedule == '0 19 * * *' && 
'python3.12' }}
 permissions:
   packages: write
 name: Run
@@ -48,7 +36,7 @@ jobs:
   hadoop: hadoop3
   envs: >-
 {
-  "PYTHON_TO_TEST": "${{ matrix.pyversion }}"
+  "PYTHON_TO_TEST": "python3.12"
 }
   jobs: >-
 {
diff --git a/.github/workflows/build_python.yml 
b/.github/workflows/build_python_pypy3.9.yml
similarity index 63%
rename from .github/workflows/build_pyth

(spark) branch master updated: [SPARK-48198][BUILD] Upgrade jackson to 2.17.1

2024-05-08 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e0c406eaef36 [SPARK-48198][BUILD] Upgrade jackson to 2.17.1
e0c406eaef36 is described below

commit e0c406eaef36d95a106b6ce14086654ace6202af
Author: panbingkun 
AuthorDate: Wed May 8 08:50:02 2024 -0700

[SPARK-48198][BUILD] Upgrade jackson to 2.17.1

### What changes were proposed in this pull request?
The PR aims to upgrade `jackson` from `2.17.0` to `2.17.1`.

### Why are the changes needed?
The full release notes:
https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.17.1

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46476 from panbingkun/SPARK-48198.

Authored-by: panbingkun 
Signed-off-by: Dongjoon Hyun 
---
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 14 +++---
 pom.xml   |  4 ++--
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 
b/dev/deps/spark-deps-hadoop-3-hive-2.3
index 5d933e34e40b..73d41e9eeb33 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -104,15 +104,15 @@ icu4j/72.1//icu4j-72.1.jar
 ini4j/0.5.4//ini4j-0.5.4.jar
 istack-commons-runtime/3.0.8//istack-commons-runtime-3.0.8.jar
 ivy/2.5.2//ivy-2.5.2.jar
-jackson-annotations/2.17.0//jackson-annotations-2.17.0.jar
+jackson-annotations/2.17.1//jackson-annotations-2.17.1.jar
 jackson-core-asl/1.9.13//jackson-core-asl-1.9.13.jar
-jackson-core/2.17.0//jackson-core-2.17.0.jar
-jackson-databind/2.17.0//jackson-databind-2.17.0.jar
-jackson-dataformat-cbor/2.17.0//jackson-dataformat-cbor-2.17.0.jar
-jackson-dataformat-yaml/2.17.0//jackson-dataformat-yaml-2.17.0.jar
-jackson-datatype-jsr310/2.17.0//jackson-datatype-jsr310-2.17.0.jar
+jackson-core/2.17.1//jackson-core-2.17.1.jar
+jackson-databind/2.17.1//jackson-databind-2.17.1.jar
+jackson-dataformat-cbor/2.17.1//jackson-dataformat-cbor-2.17.1.jar
+jackson-dataformat-yaml/2.17.1//jackson-dataformat-yaml-2.17.1.jar
+jackson-datatype-jsr310/2.17.1//jackson-datatype-jsr310-2.17.1.jar
 jackson-mapper-asl/1.9.13//jackson-mapper-asl-1.9.13.jar
-jackson-module-scala_2.13/2.17.0//jackson-module-scala_2.13-2.17.0.jar
+jackson-module-scala_2.13/2.17.1//jackson-module-scala_2.13-2.17.1.jar
 jakarta.annotation-api/2.0.0//jakarta.annotation-api-2.0.0.jar
 jakarta.inject-api/2.0.1//jakarta.inject-api-2.0.1.jar
 jakarta.servlet-api/5.0.0//jakarta.servlet-api-5.0.0.jar
diff --git a/pom.xml b/pom.xml
index c72482fd6a41..c3ff5d101c22 100644
--- a/pom.xml
+++ b/pom.xml
@@ -183,8 +183,8 @@
 true
 true
 1.9.13
-2.17.0
-
2.17.0
+2.17.1
+
2.17.1
 2.3.1
 3.0.2
 1.1.10.5


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-48184][PYTHON][CONNECT] Always set the seed of `Dataframe.sample` in Client side

2024-05-08 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new a762f3175fcd [SPARK-48184][PYTHON][CONNECT] Always set the seed of 
`Dataframe.sample` in Client side
a762f3175fcd is described below

commit a762f3175fcdb7b069faa0c2bfce93d295cb1f10
Author: Ruifeng Zheng 
AuthorDate: Wed May 8 07:44:22 2024 -0700

[SPARK-48184][PYTHON][CONNECT] Always set the seed of `Dataframe.sample` in 
Client side

### What changes were proposed in this pull request?
Always set the seed of `Dataframe.sample` in Client side

### Why are the changes needed?
Bug fix

If the seed is not set on the client, it will be set on the server side with a 
random int


https://github.com/apache/spark/blob/c4df12cc884cddefcfcf8324b4d7b9349fb4f6a0/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala#L386

which causes inconsistent results across multiple executions

In Spark Classic:
```
In [1]: df = spark.range(1).sample(0.1)

In [2]: [df.count() for i in range(10)]
Out[2]: [1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006]
```

In Spark Connect:

before:
```
In [1]: df = spark.range(1).sample(0.1)

In [2]: [df.count() for i in range(10)]
Out[2]: [969, 1005, 958, 996, 987, 1026, 991, 1020, 1012, 979]
```

after:
```
In [1]: df = spark.range(1).sample(0.1)

In [2]: [df.count() for i in range(10)]
Out[2]: [1032, 1032, 1032, 1032, 1032, 1032, 1032, 1032, 1032, 1032]
```
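
A standalone sketch of the client-side pattern the diff below applies: resolve the seed once, when the plan is built, so every execution of the same sampled DataFrame reuses it. The helper name here is made up for illustration.

```
import random
import sys

def resolve_sample_seed(seed=None):
    # Pick the seed at plan-construction time; an explicit seed passes through,
    # otherwise a random one is fixed here instead of on the server per execution.
    return int(seed) if seed is not None else random.randint(0, sys.maxsize)

plan_seed = resolve_sample_seed()                    # chosen once for this sample plan
print(plan_seed == resolve_sample_seed(plan_seed))   # True: explicit seeds are kept
```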

### Does this PR introduce _any_ user-facing change?
yes, bug fix

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46456 from zhengruifeng/py_connect_sample_seed.

Authored-by: Ruifeng Zheng 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 47afe77242abf639a1d6966ce60cfd170a9d7d20)
Signed-off-by: Dongjoon Hyun 
---
 python/pyspark/sql/connect/dataframe.py   | 2 +-
 python/pyspark/sql/tests/connect/test_connect_plan.py | 2 +-
 python/pyspark/sql/tests/test_dataframe.py| 5 +
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/sql/connect/dataframe.py 
b/python/pyspark/sql/connect/dataframe.py
index ff6191642025..6f23a15fb4ad 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -687,7 +687,7 @@ class DataFrame:
 if withReplacement is None:
 withReplacement = False
 
-seed = int(seed) if seed is not None else None
+seed = int(seed) if seed is not None else random.randint(0, 
sys.maxsize)
 
 return DataFrame.withPlan(
 plan.Sample(
diff --git a/python/pyspark/sql/tests/connect/test_connect_plan.py 
b/python/pyspark/sql/tests/connect/test_connect_plan.py
index c39fb6be24cd..88ef37511a66 100644
--- a/python/pyspark/sql/tests/connect/test_connect_plan.py
+++ b/python/pyspark/sql/tests/connect/test_connect_plan.py
@@ -430,7 +430,7 @@ class SparkConnectPlanTests(PlanOnlyTestFixture):
 self.assertEqual(plan.root.sample.lower_bound, 0.0)
 self.assertEqual(plan.root.sample.upper_bound, 0.3)
 self.assertEqual(plan.root.sample.with_replacement, False)
-self.assertEqual(plan.root.sample.HasField("seed"), False)
+self.assertEqual(plan.root.sample.HasField("seed"), True)
 self.assertEqual(plan.root.sample.deterministic_order, False)
 
 plan = (
diff --git a/python/pyspark/sql/tests/test_dataframe.py 
b/python/pyspark/sql/tests/test_dataframe.py
index 5907c8c09fb4..887648018cf3 100644
--- a/python/pyspark/sql/tests/test_dataframe.py
+++ b/python/pyspark/sql/tests/test_dataframe.py
@@ -1045,6 +1045,11 @@ class DataFrameTestsMixin:
 IllegalArgumentException, lambda: 
self.spark.range(1).sample(-1.0).count()
 )
 
+def test_sample_with_random_seed(self):
+df = self.spark.range(1).sample(0.1)
+cnts = [df.count() for i in range(10)]
+self.assertEqual(1, len(set(cnts)))
+
 def test_toDF_with_string(self):
 df = self.spark.createDataFrame([("John", 30), ("Alice", 25), ("Bob", 
28)])
 data = [("John", 30), ("Alice", 25), ("Bob", 28)]


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48184][PYTHON][CONNECT] Always set the seed of `Dataframe.sample` in Client side

2024-05-08 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 47afe77242ab [SPARK-48184][PYTHON][CONNECT] Always set the seed of 
`Dataframe.sample` in Client side
47afe77242ab is described below

commit 47afe77242abf639a1d6966ce60cfd170a9d7d20
Author: Ruifeng Zheng 
AuthorDate: Wed May 8 07:44:22 2024 -0700

[SPARK-48184][PYTHON][CONNECT] Always set the seed of `Dataframe.sample` in 
Client side

### What changes were proposed in this pull request?
Always set the seed of `Dataframe.sample` in Client side

### Why are the changes needed?
Bug fix

If the seed is not set on the client, it will be set on the server side with a 
random int


https://github.com/apache/spark/blob/c4df12cc884cddefcfcf8324b4d7b9349fb4f6a0/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala#L386

which causes inconsistent results across multiple executions

In Spark Classic:
```
In [1]: df = spark.range(1).sample(0.1)

In [2]: [df.count() for i in range(10)]
Out[2]: [1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006]
```

In Spark Connect:

before:
```
In [1]: df = spark.range(1).sample(0.1)

In [2]: [df.count() for i in range(10)]
Out[2]: [969, 1005, 958, 996, 987, 1026, 991, 1020, 1012, 979]
```

after:
```
In [1]: df = spark.range(1).sample(0.1)

In [2]: [df.count() for i in range(10)]
Out[2]: [1032, 1032, 1032, 1032, 1032, 1032, 1032, 1032, 1032, 1032]
```

### Does this PR introduce _any_ user-facing change?
yes, bug fix

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46456 from zhengruifeng/py_connect_sample_seed.

Authored-by: Ruifeng Zheng 
Signed-off-by: Dongjoon Hyun 
---
 python/pyspark/sql/connect/dataframe.py   | 2 +-
 python/pyspark/sql/tests/connect/test_connect_plan.py | 2 +-
 python/pyspark/sql/tests/test_dataframe.py| 5 +
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/sql/connect/dataframe.py 
b/python/pyspark/sql/connect/dataframe.py
index f9a209d2bcb3..843c92a9b27d 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -813,7 +813,7 @@ class DataFrame(ParentDataFrame):
 if withReplacement is None:
 withReplacement = False
 
-seed = int(seed) if seed is not None else None
+seed = int(seed) if seed is not None else random.randint(0, 
sys.maxsize)
 
 return DataFrame(
 plan.Sample(
diff --git a/python/pyspark/sql/tests/connect/test_connect_plan.py 
b/python/pyspark/sql/tests/connect/test_connect_plan.py
index 09c3171ee11f..e8d04aeada74 100644
--- a/python/pyspark/sql/tests/connect/test_connect_plan.py
+++ b/python/pyspark/sql/tests/connect/test_connect_plan.py
@@ -443,7 +443,7 @@ class SparkConnectPlanTests(PlanOnlyTestFixture):
 self.assertEqual(plan.root.sample.lower_bound, 0.0)
 self.assertEqual(plan.root.sample.upper_bound, 0.3)
 self.assertEqual(plan.root.sample.with_replacement, False)
-self.assertEqual(plan.root.sample.HasField("seed"), False)
+self.assertEqual(plan.root.sample.HasField("seed"), True)
 self.assertEqual(plan.root.sample.deterministic_order, False)
 
 plan = (
diff --git a/python/pyspark/sql/tests/test_dataframe.py 
b/python/pyspark/sql/tests/test_dataframe.py
index 16dd0d2a3bf7..f491b496ddae 100644
--- a/python/pyspark/sql/tests/test_dataframe.py
+++ b/python/pyspark/sql/tests/test_dataframe.py
@@ -430,6 +430,11 @@ class DataFrameTestsMixin:
 IllegalArgumentException, lambda: 
self.spark.range(1).sample(-1.0).count()
 )
 
+def test_sample_with_random_seed(self):
+df = self.spark.range(1).sample(0.1)
+cnts = [df.count() for i in range(10)]
+self.assertEqual(1, len(set(cnts)))
+
 def test_toDF_with_string(self):
 df = self.spark.createDataFrame([("John", 30), ("Alice", 25), ("Bob", 
28)])
 data = [("John", 30), ("Alice", 25), ("Bob", 28)]


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.4 updated: [SPARK-48037][CORE][3.4] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data

2024-05-08 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new da0c7cc81bb3 [SPARK-48037][CORE][3.4] Fix SortShuffleWriter lacks 
shuffle write related metrics resulting in potentially inaccurate data
da0c7cc81bb3 is described below

commit da0c7cc81bb3d69d381dd0683e910eae4c80e9ae
Author: sychen 
AuthorDate: Wed May 8 07:30:21 2024 -0700

[SPARK-48037][CORE][3.4] Fix SortShuffleWriter lacks shuffle write related 
metrics resulting in potentially inaccurate data

### What changes were proposed in this pull request?
This PR aims to fix SortShuffleWriter's lack of shuffle-write-related metrics, which can result in potentially inaccurate data.

### Why are the changes needed?
When the shuffle writer is SortShuffleWriter, it does not use SQLShuffleWriteMetricsReporter to update metrics, so the runtime statistics that AQE obtains report a rowCount of 0.

Some optimization rules rely on rowCount statistics, such as `EliminateLimits`. Because rowCount is 0, the rule removes the limit operator, and the query returns results without the limit applied.


https://github.com/apache/spark/blob/59d5946cfd377e9203ccf572deb34f87fab7510c/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala#L168-L172


https://github.com/apache/spark/blob/59d5946cfd377e9203ccf572deb34f87fab7510c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L2067-L2070
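
As a rough illustration (a hypothetical sketch assuming a running `SparkSession` named `spark` with AQE enabled; not a test from this patch), a LIMIT sitting above a shuffle is the kind of query that depends on accurate shuffle-write row counts:

```python
from pyspark.sql import functions as F

# Hypothetical repro sketch: the groupBy forces a shuffle, so AQE reads the
# shuffle-write statistics. If rowCount were wrongly reported as 0,
# EliminateLimits could drop the LIMIT below and return all 10 buckets
# instead of 5.
spark.conf.set("spark.sql.adaptive.enabled", "true")
limited = (
    spark.range(100000)
    .groupBy((F.col("id") % 10).alias("bucket"))
    .count()
    .limit(5)
)
print(limited.count())  # expected: 5
```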

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Production environment verification.

**master metrics**
https://github.com/apache/spark/assets/3898450/dc9b6e8a-93ec-4f59-a903-71aa5b11962c

**PR metrics**

https://github.com/apache/spark/assets/3898450/2d73b773-2dcc-4d23-81de-25dcadac86c1

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46464 from cxzl25/SPARK-48037-3.4.

Authored-by: sychen 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml   |  1 +
 .../spark/shuffle/sort/SortShuffleManager.scala|  2 +-
 .../spark/shuffle/sort/SortShuffleWriter.scala |  6 ++--
 .../spark/util/collection/ExternalSorter.scala |  9 +++---
 .../shuffle/sort/SortShuffleWriterSuite.scala  |  3 ++
 .../sql/execution/UnsafeRowSerializerSuite.scala   |  3 +-
 .../adaptive/AdaptiveQueryExecSuite.scala  | 32 --
 7 files changed, 44 insertions(+), 12 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 8ae303178033..2d2e8da80d46 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -644,6 +644,7 @@ jobs:
 python3.9 -m pip install 'sphinx<3.1.0' mkdocs pydata_sphinx_theme 
'sphinx-copybutton==0.5.2' nbsphinx numpydoc 'jinja2<3.0.0' 'markupsafe==2.0.1' 
'pyzmq<24.0.0' 'sphinxcontrib-applehelp==1.0.4' 'sphinxcontrib-devhelp==1.0.2' 
'sphinxcontrib-htmlhelp==2.0.1' 'sphinxcontrib-qthelp==1.0.3' 
'sphinxcontrib-serializinghtml==1.1.5' 'nest-asyncio==1.5.8' 'rpds-py==0.16.2' 
'alabaster==0.7.13'
 python3.9 -m pip install ipython_genutils # See SPARK-38517
 python3.9 -m pip install sphinx_plotly_directive 'numpy>=1.20.0' 
'pyarrow==12.0.1' pandas 'plotly>=4.8'
+python3.9 -m pip install 'nbsphinx==0.9.3'
 python3.9 -m pip install 'docutils<0.18.0' # See SPARK-39421
 apt-get update -y
 apt-get install -y ruby ruby-dev
diff --git 
a/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala 
b/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala
index 46aca07ce43f..79dff6f87534 100644
--- a/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala
+++ b/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala
@@ -176,7 +176,7 @@ private[spark] class SortShuffleManager(conf: SparkConf) 
extends ShuffleManager
   metrics,
   shuffleExecutorComponents)
   case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] =>
-new SortShuffleWriter(other, mapId, context, shuffleExecutorComponents)
+new SortShuffleWriter(other, mapId, context, metrics, 
shuffleExecutorComponents)
 }
   }
 
diff --git 
a/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala 
b/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala
index 8613fe11a4c2..3be7d24f7e4e 100644
--- a/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala
+++ b/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala
@@ -21,6 +21,7 @@ import org.apache.spark._
 impor

(spark) branch master updated: [SPARK-48187][INFRA] Run `docs` only in PR builders and `build_non_ansi` Daily CI

2024-05-08 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f3d9b819f3c0 [SPARK-48187][INFRA] Run `docs` only in PR builders and 
`build_non_ansi` Daily CI
f3d9b819f3c0 is described below

commit f3d9b819f3c013cd402ed98d01842173c45a5dd6
Author: Dongjoon Hyun 
AuthorDate: Wed May 8 00:02:44 2024 -0700

[SPARK-48187][INFRA] Run `docs` only in PR builders and `build_non_ansi` 
Daily CI

### What changes were proposed in this pull request?

This PR aims to run `docs` (Documentation Generation) step only in PR 
builders and `build_non_ansi` Daily CI.

To do that, this PR spins off `documentation generation` tasks from `lint` 
job.

### Why are the changes needed?

Currently, Apache Spark CI is running `Documentation Generation` always 
inside `lint` job. We can take advantage PR Builder and one of Daily CIs.

- https://infra.apache.org/github-actions-policy.html

### Does this PR introduce _any_ user-facing change?

No because this is an infra update.

### How was this patch tested?

Pass the CIs and manual review because PR builders will not be affected by 
this.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46463 from dongjoon-hyun/SPARK-48187.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml | 94 ++--
 .github/workflows/build_non_ansi.yml |  1 +
 2 files changed, 90 insertions(+), 5 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 00ba16265dce..bb9f2f9a9603 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -85,6 +85,7 @@ jobs:
 sparkr=`./dev/is-changed.py -m sparkr`
 buf=true
 ui=true
+docs=true
   else
 pandas=false
 yarn=false
@@ -92,6 +93,7 @@ jobs:
 sparkr=false
 buf=false
 ui=false
+docs=false
   fi
   build=`./dev/is-changed.py -m 
"core,unsafe,kvstore,avro,utils,network-common,network-shuffle,repl,launcher,examples,sketch,variant,api,catalyst,hive-thriftserver,mllib-local,mllib,graphx,streaming,sql-kafka-0-10,streaming-kafka-0-10,streaming-kinesis-asl,kubernetes,hadoop-cloud,spark-ganglia-lgpl,protobuf,yarn,connect,sql,hive"`
   precondition="
@@ -103,6 +105,7 @@ jobs:
   \"tpcds-1g\": \"false\",
   \"docker-integration-tests\": \"false\",
   \"lint\" : \"true\",
+  \"docs\" : \"$docs\",
   \"yarn\" : \"$yarn\",
   \"k8s-integration-tests\" : \"$kubernetes\",
   \"buf\" : \"$buf\",
@@ -621,12 +624,12 @@ jobs:
 - name: Python CodeGen check
   run: ./dev/connect-check-protos.py
 
-  # Static analysis, and documentation build
+  # Static analysis
   lint:
 needs: [precondition, infra-image]
 # always run if lint == 'true', even infra-image is skip (such as 
non-master job)
 if: (!cancelled()) && fromJson(needs.precondition.outputs.required).lint 
== 'true'
-name: Linters, licenses, dependencies and documentation generation
+name: Linters, licenses, and dependencies
 runs-on: ubuntu-latest
 timeout-minutes: 180
 env:
@@ -764,7 +767,90 @@ jobs:
 Rscript -e "devtools::install_version('lintr', version='2.0.1', 
repos='https://cloud.r-project.org')"
 - name: Install R linter dependencies and SparkR
   run: ./R/install-dev.sh
-# Should delete this section after SPARK 3.5 EOL.
+- name: R linter
+  run: ./dev/lint-r
+
+  # Documentation build
+  docs:
+needs: [precondition, infra-image]
+# always run if lint == 'true', even infra-image is skip (such as 
non-master job)
+if: (!cancelled()) && fromJson(needs.precondition.outputs.required).docs 
== 'true'
+name: Documentation generation
+runs-on: ubuntu-latest
+timeout-minutes: 180
+env:
+  LC_ALL: C.UTF-8
+  LANG: C.UTF-8
+  NOLINT_ON_COMPILE: false
+  PYSPARK_DRIVER_PYTHON: python3.9
+  PYSPARK_PYTHON: python3.9
+  GITHUB_PREV_SHA: ${{ github.event.before }}
+container:
+  image: ${{ needs.precondition.outputs.image_url }}
+steps:
+- name: Checkout Spark repository
+  uses: actions/checkout@v4
+  with:
+fetch-depth: 0
+repository: apache/spark
+ref: ${{ inputs.branch }}
+- name: Add GITHUB_WORKSPACE to git trust safe.directory
+  run: |
+  

(spark) branch branch-3.5 updated: [SPARK-48138][CONNECT][TESTS] Disable a flaky `SparkSessionE2ESuite.interrupt tag` test

2024-05-07 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 36da89deccc9 [SPARK-48138][CONNECT][TESTS] Disable a flaky 
`SparkSessionE2ESuite.interrupt tag` test
36da89deccc9 is described below

commit 36da89deccc916a6f32d9bf6d6f2fd8e288da917
Author: Dongjoon Hyun 
AuthorDate: Mon May 6 13:45:54 2024 +0800

[SPARK-48138][CONNECT][TESTS] Disable a flaky 
`SparkSessionE2ESuite.interrupt tag` test

### What changes were proposed in this pull request?

This PR aims to temporarily disable a flaky test, `SparkSessionE2ESuite.interrupt tag`.

To re-enable this, SPARK-48139 is created as a blocker issue for 4.0.0.

### Why are the changes needed?

This test case was added in `Apache Spark 3.5.0` but has unfortunately been unstable ever since.
- #42009

We tried to stabilize this test case before `Apache Spark 4.0.0-preview`.
- #45173
- #46374

However, it's still flaky.

- https://github.com/apache/spark/actions/runs/8962353911/job/24611130573 
(Master, 2024-05-05)
- https://github.com/apache/spark/actions/runs/8948176536/job/24581022674 
(Master, 2024-05-04)

This PR aims to stabilize CI first; the flaky issue itself is tracked as a blocker in SPARK-48139, to be resolved before `Spark Connect GA` in Apache Spark 4.0.0.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46396 from dongjoon-hyun/SPARK-48138.

Authored-by: Dongjoon Hyun 
Signed-off-by: yangjie01 
(cherry picked from commit 8294c5962febe53eebdff79f65f5f293d93a1997)
Signed-off-by: Dongjoon Hyun 
---
 .../jvm/src/test/scala/org/apache/spark/sql/SparkSessionE2ESuite.scala | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/SparkSessionE2ESuite.scala
 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/SparkSessionE2ESuite.scala
index c76dc724828e..e9c2f0c45750 100644
--- 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/SparkSessionE2ESuite.scala
+++ 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/SparkSessionE2ESuite.scala
@@ -108,7 +108,8 @@ class SparkSessionE2ESuite extends RemoteSparkSession {
 assert(interrupted.length == 2, s"Interrupted operations: $interrupted.")
   }
 
-  test("interrupt tag") {
+  // TODO(SPARK-48139): Re-enable `SparkSessionE2ESuite.interrupt tag`
+  ignore("interrupt tag") {
 val session = spark
 import session.implicits._
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-48037][CORE][3.5] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data

2024-05-07 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 58b71307795b [SPARK-48037][CORE][3.5] Fix SortShuffleWriter lacks 
shuffle write related metrics resulting in potentially inaccurate data
58b71307795b is described below

commit 58b71307795b6060be97431e0c5c8ab95205ea79
Author: sychen 
AuthorDate: Tue May 7 22:39:02 2024 -0700

[SPARK-48037][CORE][3.5] Fix SortShuffleWriter lacks shuffle write related 
metrics resulting in potentially inaccurate data

### What changes were proposed in this pull request?
This PR aims to fix SortShuffleWriter's lack of shuffle-write-related metrics, which can result in potentially inaccurate data.

### Why are the changes needed?
When the shuffle writer is SortShuffleWriter, it does not use SQLShuffleWriteMetricsReporter to update metrics, so the runtime statistics that AQE obtains report a rowCount of 0.

Some optimization rules rely on rowCount statistics, such as `EliminateLimits`. Because rowCount is 0, the rule removes the limit operator, and the query returns results without the limit applied.


https://github.com/apache/spark/blob/59d5946cfd377e9203ccf572deb34f87fab7510c/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala#L168-L172


https://github.com/apache/spark/blob/59d5946cfd377e9203ccf572deb34f87fab7510c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L2067-L2070

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Production environment verification.

**master metrics**
https://github.com/apache/spark/assets/3898450/dc9b6e8a-93ec-4f59-a903-71aa5b11962c

**PR metrics**

https://github.com/apache/spark/assets/3898450/2d73b773-2dcc-4d23-81de-25dcadac86c1

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46459 from cxzl25/SPARK-48037-3.5.

Authored-by: sychen 
Signed-off-by: Dongjoon Hyun 
---
 .../spark/shuffle/sort/SortShuffleManager.scala|  2 +-
 .../spark/shuffle/sort/SortShuffleWriter.scala |  6 ++--
 .../spark/util/collection/ExternalSorter.scala |  9 +++---
 .../shuffle/sort/SortShuffleWriterSuite.scala  |  3 ++
 .../sql/execution/UnsafeRowSerializerSuite.scala   |  3 +-
 .../adaptive/AdaptiveQueryExecSuite.scala  | 32 --
 6 files changed, 43 insertions(+), 12 deletions(-)

diff --git 
a/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala 
b/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala
index 46aca07ce43f..79dff6f87534 100644
--- a/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala
+++ b/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala
@@ -176,7 +176,7 @@ private[spark] class SortShuffleManager(conf: SparkConf) 
extends ShuffleManager
   metrics,
   shuffleExecutorComponents)
   case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] =>
-new SortShuffleWriter(other, mapId, context, shuffleExecutorComponents)
+new SortShuffleWriter(other, mapId, context, metrics, 
shuffleExecutorComponents)
 }
   }
 
diff --git 
a/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala 
b/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala
index 8613fe11a4c2..3be7d24f7e4e 100644
--- a/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala
+++ b/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala
@@ -21,6 +21,7 @@ import org.apache.spark._
 import org.apache.spark.internal.{config, Logging}
 import org.apache.spark.scheduler.MapStatus
 import org.apache.spark.shuffle.{BaseShuffleHandle, ShuffleWriter}
+import org.apache.spark.shuffle.ShuffleWriteMetricsReporter
 import org.apache.spark.shuffle.api.ShuffleExecutorComponents
 import org.apache.spark.util.collection.ExternalSorter
 
@@ -28,6 +29,7 @@ private[spark] class SortShuffleWriter[K, V, C](
 handle: BaseShuffleHandle[K, V, C],
 mapId: Long,
 context: TaskContext,
+writeMetrics: ShuffleWriteMetricsReporter,
 shuffleExecutorComponents: ShuffleExecutorComponents)
   extends ShuffleWriter[K, V] with Logging {
 
@@ -46,8 +48,6 @@ private[spark] class SortShuffleWriter[K, V, C](
 
   private var partitionLengths: Array[Long] = _
 
-  private val writeMetrics = context.taskMetrics().shuffleWriteMetrics
-
   /** Write a bunch of records to this task's output */
   override def write(records: Iterator[Product2[K, V]]): Unit = {
 sorter = if (dep.mapSideCombine) {
@@ -67,7 +67,7 @@ private[spark] class SortShuffleWriter[K, V, C](
 // (see SPARK-3570).

(spark) branch master updated (5f883117203d -> 52a7f634e913)

2024-05-07 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 5f883117203d [SPARK-47914][SQL] Do not display the splits parameter in 
Range
 add 52a7f634e913 [SPARK-48183][PYTHON][DOCS] Update error contribution 
guide to respect new error class file

No new revisions were added by this update.

Summary of changes:
 python/docs/source/development/contributing.rst | 4 ++--
 python/pyspark/errors/utils.py  | 8 
 2 files changed, 6 insertions(+), 6 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (6588554aa4cc -> 3b1ea0fde44e)

2024-05-07 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 6588554aa4cc [SPARK-48149][INFRA][FOLLOWUP] Use single quotation mark
 add 3b1ea0fde44e [MINOR][PYTHON][TESTS] Remove the doc in error message 
tests to allow other PyArrow versions in tests

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/tests/pandas/test_pandas_cogrouped_map.py | 2 +-
 python/pyspark/sql/tests/pandas/test_pandas_map.py   | 4 ++--
 python/pyspark/sql/tests/test_arrow_map.py   | 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48131][CORE][FOLLOWUP] Add a new configuration for the MDC key of Task Name

2024-05-07 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 84c5b919d998 [SPARK-48131][CORE][FOLLOWUP] Add a new configuration for 
the MDC key of Task Name
84c5b919d998 is described below

commit 84c5b919d99872858d2f98db21fd3482f27dcbfc
Author: Gengliang Wang 
AuthorDate: Tue May 7 19:18:50 2024 -0700

[SPARK-48131][CORE][FOLLOWUP] Add a new configuration for the MDC key of 
Task Name

### What changes were proposed in this pull request?

Introduce a new Spark config `spark.log.legacyTaskNameMdc.enabled`:
When true, the MDC key `mdc.taskName` will be set in the logs, which is 
consistent with the behavior of Spark 3.1 to Spark 3.5 releases. When false, 
the logging framework will use `task_name` as the MDC key for consistency with 
other new MDC keys.
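
A minimal usage sketch (hypothetical, not part of this patch) of opting back into the legacy key, e.g. when an existing log4j2 pattern still references `%X{mdc.taskName}`:

```python
from pyspark.sql import SparkSession

# Hypothetical usage sketch: keep the pre-4.0 MDC key so existing log4j2
# patterns that reference %X{mdc.taskName} keep working. With the default
# (false), patterns should reference %X{task_name} instead.
spark = (
    SparkSession.builder
    .config("spark.log.legacyTaskNameMdc.enabled", "true")
    .getOrCreate()
)
```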

### Why are the changes needed?

As discussed in 
https://github.com/apache/spark/pull/46386#issuecomment-2098985001, we should 
add a configuration and migration guide about the change in the MDC key of Task 
Name.
### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manual test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46446 from gengliangwang/addConfig.

Authored-by: Gengliang Wang 
Signed-off-by: Dongjoon Hyun 
---
 core/src/main/scala/org/apache/spark/executor/Executor.scala  | 11 +--
 .../main/scala/org/apache/spark/internal/config/package.scala | 10 ++
 docs/core-migration-guide.md  |  2 ++
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/executor/Executor.scala 
b/core/src/main/scala/org/apache/spark/executor/Executor.scala
index 3edba45ef89f..68c38fb6179f 100644
--- a/core/src/main/scala/org/apache/spark/executor/Executor.scala
+++ b/core/src/main/scala/org/apache/spark/executor/Executor.scala
@@ -95,6 +95,13 @@ private[spark] class Executor(
 
   private[executor] val conf = env.conf
 
+  // SPARK-48131: Unify MDC key mdc.taskName and task_name in Spark 4.0 
release.
+  private[executor] val taskNameMDCKey = if 
(conf.get(LEGACY_TASK_NAME_MDC_ENABLED)) {
+"mdc.taskName"
+  } else {
+LogKeys.TASK_NAME.name
+  }
+
   // SPARK-40235: updateDependencies() uses a ReentrantLock instead of the 
`synchronized` keyword
   // so that tasks can exit quickly if they are interrupted while waiting on 
another task to
   // finish downloading dependencies.
@@ -914,7 +921,7 @@ private[spark] class Executor(
 try {
   mdc.foreach { case (key, value) => MDC.put(key, value) }
  // avoid overriding the taskName by the user
-  MDC.put(LogKeys.TASK_NAME.name, taskName)
+  MDC.put(taskNameMDCKey, taskName)
 } catch {
   case _: NoSuchFieldError => logInfo("MDC is not supported.")
 }
@@ -923,7 +930,7 @@ private[spark] class Executor(
   private def cleanMDCForTask(taskName: String, mdc: Seq[(String, String)]): 
Unit = {
 try {
   mdc.foreach { case (key, _) => MDC.remove(key) }
-  MDC.remove(LogKeys.TASK_NAME.name)
+  MDC.remove(taskNameMDCKey)
 } catch {
   case _: NoSuchFieldError => logInfo("MDC is not supported.")
 }
diff --git a/core/src/main/scala/org/apache/spark/internal/config/package.scala 
b/core/src/main/scala/org/apache/spark/internal/config/package.scala
index a5be6084de36..87402d2cc17e 100644
--- a/core/src/main/scala/org/apache/spark/internal/config/package.scala
+++ b/core/src/main/scala/org/apache/spark/internal/config/package.scala
@@ -152,6 +152,16 @@ package object config {
   .booleanConf
   .createWithDefault(true)
 
+  private[spark] val LEGACY_TASK_NAME_MDC_ENABLED =
+ConfigBuilder("spark.log.legacyTaskNameMdc.enabled")
+  .doc("When true, the MDC (Mapped Diagnostic Context) key `mdc.taskName` 
will be set in the " +
+"log output, which is the behavior of Spark version 3.1 through Spark 
3.5 releases. " +
+"When false, the logging framework will use `task_name` as the MDC 
key, " +
+"aligning it with the naming convention of newer MDC keys introduced 
in Spark 4.0 release.")
+  .version("4.0.0")
+  .booleanConf
+  .createWithDefault(false)
+
   private[spark] val DRIVER_LOG_LOCAL_DIR =
 ConfigBuilder("spark.driver.log.localDir")
   .doc("Specifies a local directory to write driver logs and enable Driver 
Log UI Tab.")
diff --git a/docs/core-migration-guide.md b/docs/core-migration-guide.md
index 95c7929a6241..28a9dd0f4371 100644
--- a/docs/core-migration-guide.md
+++ b/docs/core-migration-guide.md
@@ -46,6 +46,8 @@ license: |
   - 

(spark) branch master updated (5e49665ac39b -> 553e1b85c42a)

2024-05-07 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 5e49665ac39b [SPARK-47960][SS] Allow chaining other stateful operators 
after transformWithState operator
 add 553e1b85c42a [SPARK-48152][BUILD] Make `spark-profiler` as a part of 
release and publish to maven central repo

No new revisions were added by this update.

Summary of changes:
 .github/workflows/maven_test.yml| 10 +-
 connector/profiler/README.md|  2 +-
 connector/profiler/pom.xml  |  6 +-
 dev/create-release/release-build.sh |  2 +-
 dev/test-dependencies.sh|  2 +-
 docs/building-spark.md  |  7 +++
 pom.xml |  3 +++
 7 files changed, 23 insertions(+), 9 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-48178][INFRA][3.5] Run `build/scala-213/java-11-17` jobs of `branch-3.5` only if needed

2024-05-07 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 15b5d2a55837 [SPARK-48178][INFRA][3.5] Run 
`build/scala-213/java-11-17` jobs of `branch-3.5` only if needed
15b5d2a55837 is described below

commit 15b5d2a558371395547461d7b37f20610432dea0
Author: Dongjoon Hyun 
AuthorDate: Tue May 7 15:54:50 2024 -0700

[SPARK-48178][INFRA][3.5] Run `build/scala-213/java-11-17` jobs of 
`branch-3.5` only if needed

### What changes were proposed in this pull request?

This PR aims to run the `build`, `scala-213`, and `java-11-17` jobs of `branch-3.5` only if needed, to reduce the maximum concurrency of Apache Spark GitHub Actions usage.

### Why are the changes needed?

To meet ASF Infra GitHub Action policy, we need to reduce the maximum 
concurrency.
- https://infra.apache.org/github-actions-policy.html

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46449 from dongjoon-hyun/SPARK-48178.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index fa40b2d0a390..9c3dc95d0f66 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -85,17 +85,16 @@ jobs:
   sparkr=`./dev/is-changed.py -m sparkr`
   tpcds=`./dev/is-changed.py -m sql`
   docker=`./dev/is-changed.py -m docker-integration-tests`
-  # 'build', 'scala-213', and 'java-11-17' are always true for now.
-  # It does not save significant time and most of PRs trigger the 
build.
+  build=`./dev/is-changed.py -m 
"core,unsafe,kvstore,avro,utils,network-common,network-shuffle,repl,launcher,examples,sketch,graphx,catalyst,hive-thriftserver,streaming,sql-kafka-0-10,streaming-kafka-0-10,mllib-local,mllib,yarn,mesos,kubernetes,hadoop-cloud,spark-ganglia-lgpl,sql,hive"`
   precondition="
 {
-  \"build\": \"true\",
+  \"build\": \"$build\",
   \"pyspark\": \"$pyspark\",
   \"sparkr\": \"$sparkr\",
   \"tpcds-1g\": \"$tpcds\",
   \"docker-integration-tests\": \"$docker\",
-  \"scala-213\": \"true\",
-  \"java-11-17\": \"true\",
+  \"scala-213\": \"$build\",
+  \"java-11-17\": \"$build\",
   \"lint\" : \"true\",
   \"k8s-integration-tests\" : \"true\",
   \"breaking-changes-buf\" : \"true\",


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-48173][SQL][3.5] CheckAnalysis should see the entire query plan

2024-05-07 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 2f8e7cbe98df [SPARK-48173][SQL][3.5] CheckAnalysis should see the 
entire query plan
2f8e7cbe98df is described below

commit 2f8e7cbe98df97ee0ae51a20796192c95e750721
Author: Wenchen Fan 
AuthorDate: Tue May 7 15:25:15 2024 -0700

[SPARK-48173][SQL][3.5] CheckAnalysis should see the entire query plan

backport https://github.com/apache/spark/pull/46439 to 3.5

### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/38029 . Some 
custom check rules need to see the entire query plan tree to get some context, 
but https://github.com/apache/spark/pull/38029 breaks it as it checks the query 
plan of dangling CTE relations recursively.

This PR fixes it by putting the dangling CTE relations back in the main query plan and then checking the main query plan.
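
For intuition, a minimal sketch (hypothetical, assuming a `SparkSession` named `spark`) of a dangling CTE, i.e. one that is never referenced and would disappear after inlining:

```python
# Hypothetical illustration: `unused` is a dangling CTE relation; it is now
# re-attached to the main plan before analysis checks run, so custom check
# rules see the entire query tree instead of the CTE child in isolation.
spark.sql("WITH unused AS (SELECT 1 AS a) SELECT 2 AS b").show()
```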

### Why are the changes needed?

Revert the breaking change to custom check rules

### Does this PR introduce _any_ user-facing change?

No for most users. This restores the behavior of Spark 3.3 and earlier for 
custom check rules.

### How was this patch tested?

existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46442 from cloud-fan/check2.

Lead-authored-by: Wenchen Fan 
Co-authored-by: Wenchen Fan 
Signed-off-by: Dongjoon Hyun 
---
 .../sql/catalyst/analysis/CheckAnalysis.scala  | 38 +++---
 1 file changed, 33 insertions(+), 5 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
index 7f10bdbc80ca..485015f2efab 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
@@ -141,17 +141,45 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog with QueryErrorsB
   errorClass, missingCol, orderedCandidates, a.origin)
   }
 
+  private def checkUnreferencedCTERelations(
+  cteMap: mutable.Map[Long, (CTERelationDef, Int, mutable.Map[Long, Int])],
+  visited: mutable.Map[Long, Boolean],
+  danglingCTERelations: mutable.ArrayBuffer[CTERelationDef],
+  cteId: Long): Unit = {
+if (visited(cteId)) {
+  return
+}
+val (cteDef, _, refMap) = cteMap(cteId)
+refMap.foreach { case (id, _) =>
+  checkUnreferencedCTERelations(cteMap, visited, danglingCTERelations, id)
+}
+danglingCTERelations.append(cteDef)
+visited(cteId) = true
+  }
+
   def checkAnalysis(plan: LogicalPlan): Unit = {
 val inlineCTE = InlineCTE(alwaysInline = true)
 val cteMap = mutable.HashMap.empty[Long, (CTERelationDef, Int, 
mutable.Map[Long, Int])]
 inlineCTE.buildCTEMap(plan, cteMap)
-cteMap.values.foreach { case (relation, refCount, _) =>
-  // If a CTE relation is never used, it will disappear after inline. Here 
we explicitly check
-  // analysis for it, to make sure the entire query plan is valid.
-  if (refCount == 0) checkAnalysis0(relation.child)
+val danglingCTERelations = mutable.ArrayBuffer.empty[CTERelationDef]
+val visited: mutable.Map[Long, Boolean] = 
mutable.Map.empty.withDefaultValue(false)
+// If a CTE relation is never used, it will disappear after inline. Here 
we explicitly collect
+// these dangling CTE relations, and put them back in the main query, to 
make sure the entire
+// query plan is valid.
+cteMap.foreach { case (cteId, (_, refCount, _)) =>
+  // If a CTE relation ref count is 0, the other CTE relations that 
reference it should also be
+  // collected. This code will also guarantee the leaf relations that do 
not reference
+  // any others are collected first.
+  if (refCount == 0) {
+checkUnreferencedCTERelations(cteMap, visited, danglingCTERelations, 
cteId)
+  }
 }
 // Inline all CTEs in the plan to help check query plan structures in 
subqueries.
-checkAnalysis0(inlineCTE(plan))
+var inlinedPlan: LogicalPlan = inlineCTE(plan)
+if (danglingCTERelations.nonEmpty) {
+  inlinedPlan = WithCTE(inlinedPlan, danglingCTERelations.toSeq)
+}
+checkAnalysis0(inlinedPlan)
 plan.setAnalyzed()
   }
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-48179][INFRA][3.5] Pin `nbsphinx` to `0.9.3`

2024-05-07 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new a24ec1d8f76c [SPARK-48179][INFRA][3.5] Pin `nbsphinx` to `0.9.3`
a24ec1d8f76c is described below

commit a24ec1d8f76c7bf47e491086f14ea202b6806cd8
Author: Dongjoon Hyun 
AuthorDate: Tue May 7 15:23:24 2024 -0700

[SPARK-48179][INFRA][3.5] Pin `nbsphinx` to `0.9.3`

### What changes were proposed in this pull request?

This PR aims to pin `nbsphinx` to `0.9.3` to recover `branch-3.5` CI.

### Why are the changes needed?

From yesterday, `branch-3.5` commit build is broken.
- https://github.com/apache/spark/actions/runs/8978558438/job/24659197282
```
Exception occurred:
  File "/usr/local/lib/python3.9/dist-packages/nbsphinx/__init__.py", line 
1316, in apply
for section in self.document.findall(docutils.nodes.section):
AttributeError: 'document' object has no attribute 'findall'
The full traceback has been saved in /tmp/sphinx-err-qz4y0bav.log, if you 
want to report the issue to the developers.
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs on this PR.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46448 from dongjoon-hyun/nbsphinx.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 8488540b415d..fa40b2d0a390 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -682,7 +682,7 @@ jobs:
 #   See also https://issues.apache.org/jira/browse/SPARK-35375.
 # Pin the MarkupSafe to 2.0.1 to resolve the CI error.
 #   See also https://issues.apache.org/jira/browse/SPARK-38279.
-python3.9 -m pip install 'sphinx<3.1.0' mkdocs pydata_sphinx_theme 
'sphinx-copybutton==0.5.2' nbsphinx numpydoc 'jinja2<3.0.0' 'markupsafe==2.0.1' 
'pyzmq<24.0.0' 'sphinxcontrib-applehelp==1.0.4' 'sphinxcontrib-devhelp==1.0.2' 
'sphinxcontrib-htmlhelp==2.0.1' 'sphinxcontrib-qthelp==1.0.3' 
'sphinxcontrib-serializinghtml==1.1.5' 'nest-asyncio==1.5.8' 'rpds-py==0.16.2' 
'alabaster==0.7.13'
+python3.9 -m pip install 'sphinx<3.1.0' mkdocs pydata_sphinx_theme 
'sphinx-copybutton==0.5.2' 'nbsphinx==0.9.3' numpydoc 'jinja2<3.0.0' 
'markupsafe==2.0.1' 'pyzmq<24.0.0' 'sphinxcontrib-applehelp==1.0.4' 
'sphinxcontrib-devhelp==1.0.2' 'sphinxcontrib-htmlhelp==2.0.1' 
'sphinxcontrib-qthelp==1.0.3' 'sphinxcontrib-serializinghtml==1.1.5' 
'nest-asyncio==1.5.8' 'rpds-py==0.16.2' 'alabaster==0.7.13'
 python3.9 -m pip install ipython_genutils # See SPARK-38517
 python3.9 -m pip install sphinx_plotly_directive 'numpy>=1.20.0' 
'pyarrow==12.0.1' pandas 'plotly>=4.8'
 python3.9 -m pip install 'docutils<0.18.0' # See SPARK-39421


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-48167][PYTHON][TESTS][FOLLOWUP][3.5] Reformat test_readwriter.py to fix Python Linter error

2024-05-07 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 03bc2b188d21 [SPARK-48167][PYTHON][TESTS][FOLLOWUP][3.5] Reformat 
test_readwriter.py to fix Python Linter error
03bc2b188d21 is described below

commit 03bc2b188d2111b5c4cc5bc13ebd0455602028a8
Author: Dongjoon Hyun 
AuthorDate: Tue May 7 13:38:08 2024 -0700

[SPARK-48167][PYTHON][TESTS][FOLLOWUP][3.5] Reformat test_readwriter.py to 
fix Python Linter error

### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/46430 to fix 
Python linter failure.

### Why are the changes needed?

To recover `branch-3.5` CI,
- https://github.com/apache/spark/actions/runs/8981228745/job/24666400664

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass Python Linter in this PR builder.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46445 from dongjoon-hyun/SPARK-48167.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 python/pyspark/sql/tests/test_readwriter.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/sql/tests/test_readwriter.py 
b/python/pyspark/sql/tests/test_readwriter.py
index e903d3383b74..7911a82c61fc 100644
--- a/python/pyspark/sql/tests/test_readwriter.py
+++ b/python/pyspark/sql/tests/test_readwriter.py
@@ -247,7 +247,8 @@ class ReadwriterV2TestsMixin:
 self.assertEqual(100, self.spark.sql("select * from 
test_table").count())
 
 @unittest.skipIf(
-"SPARK_SKIP_CONNECT_COMPAT_TESTS" in os.environ, "Known behavior 
change in 4.0")
+"SPARK_SKIP_CONNECT_COMPAT_TESTS" in os.environ, "Known behavior 
change in 4.0"
+)
 def test_create_without_provider(self):
 df = self.df
 with self.assertRaisesRegex(


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (26c50369edb2 -> e24f8965e066)

2024-05-07 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 26c50369edb2 [SPARK-48174][INFRA] Merge `connect` back to the original 
test pipeline
 add e24f8965e066 [SPARK-48037][CORE] Fix SortShuffleWriter lacks shuffle 
write related metrics resulting in potentially inaccurate data

No new revisions were added by this update.

Summary of changes:
 .../spark/shuffle/sort/SortShuffleManager.scala|  2 +-
 .../spark/shuffle/sort/SortShuffleWriter.scala |  6 +++---
 .../spark/util/collection/ExternalSorter.scala |  9 +
 .../shuffle/sort/SortShuffleWriterSuite.scala  |  3 +++
 .../sql/execution/UnsafeRowSerializerSuite.scala   |  3 ++-
 .../adaptive/AdaptiveQueryExecSuite.scala  | 23 ++
 6 files changed, 37 insertions(+), 9 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48174][INFRA] Merge `connect` back to the original test pipeline

2024-05-07 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 26c50369edb2 [SPARK-48174][INFRA] Merge `connect` back to the original 
test pipeline
26c50369edb2 is described below

commit 26c50369edb21d616361a4b22a555ed7b7412a4e
Author: Dongjoon Hyun 
AuthorDate: Tue May 7 09:34:59 2024 -0700

[SPARK-48174][INFRA] Merge `connect` back to the original test pipeline

### What changes were proposed in this pull request?

This PR aims to merge `connect` back into the original test pipeline to reduce the maximum concurrency of GitHub Actions by one.
- https://infra.apache.org/github-actions-policy.html
  > All workflows SHOULD have a job concurrency level less than or equal to 
15.

### Why are the changes needed?

This is a partial recovery of the following.
- #45107

We addressed the root cause of #45107 via the following PRs. In addition, we will disable flaky test cases if any exist.

- #46395
- #46396
- #46425

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46441 from dongjoon-hyun/SPARK-48174.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 286f8e1193d9..00ba16265dce 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -156,9 +156,8 @@ jobs:
 mllib-local, mllib, graphx
   - >-
 streaming, sql-kafka-0-10, streaming-kafka-0-10, 
streaming-kinesis-asl,
-kubernetes, hadoop-cloud, spark-ganglia-lgpl, protobuf
+kubernetes, hadoop-cloud, spark-ganglia-lgpl, protobuf, connect
   - yarn
-  - connect
 # Here, we split Hive and SQL tests into some of slow ones and the 
rest of them.
 included-tags: [""]
 excluded-tags: [""]


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48035][SQL][FOLLOWUP] Fix try_add/try_multiply being semantic equal to add/multiply

2024-05-07 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 808186835077 [SPARK-48035][SQL][FOLLOWUP] Fix try_add/try_multiply 
being semantic equal to add/multiply
808186835077 is described below

commit 808186835077cf50f10262c633f19de4ccc09d9d
Author: Supun Nakandala 
AuthorDate: Tue May 7 09:17:01 2024 -0700

[SPARK-48035][SQL][FOLLOWUP] Fix try_add/try_multiply being semantic equal 
to add/multiply

### What changes were proposed in this pull request?
- This is a follow-up to the previous PR: 
https://github.com/apache/spark/pull/46307.
- With the new changes we do the evalMode check in the `collectOperands` 
function instead of introducing a new function.

### Why are the changes needed?
- Better code quality and readability.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Existing unit tests.

### Was this patch authored or co-authored using generative AI tooling?
- No

Closes #46414 from db-scnakandala/db-scnakandala/master.

Authored-by: Supun Nakandala 
Signed-off-by: Dongjoon Hyun 
---
 .../sql/catalyst/expressions/Expression.scala  | 14 -
 .../sql/catalyst/expressions/arithmetic.scala  | 23 --
 2 files changed, 8 insertions(+), 29 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
index 2759f5a29c79..de15ec43c4f3 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
@@ -1378,20 +1378,6 @@ trait CommutativeExpression extends Expression {
   }
 reorderResult
   }
-
-  /**
-   * Helper method to collect the evaluation mode of the commutative 
expressions. This is
-   * used by the canonicalized methods of [[Add]] and [[Multiply]] operators 
to ensure that
-   * all operands have the same evaluation mode before reordering the operands.
-   */
-  protected def collectEvalModes(
-  e: Expression,
-  f: PartialFunction[CommutativeExpression, Seq[EvalMode.Value]]
-  ): Seq[EvalMode.Value] = e match {
-case c: CommutativeExpression if f.isDefinedAt(c) =>
-  f(c) ++ c.children.flatMap(collectEvalModes(_, f))
-case _ => Nil
-  }
 }
 
 /**
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala
index 91c10a53af8a..a085a4e3a8a3 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala
@@ -452,14 +452,12 @@ case class Add(
 copy(left = newLeft, right = newRight)
 
   override lazy val canonicalized: Expression = {
-val evalModes = collectEvalModes(this, {case Add(_, _, evalMode) => 
Seq(evalMode)})
-lazy val reorderResult = buildCanonicalizedPlan(
-  { case Add(l, r, _) => Seq(l, r) },
+val reorderResult = buildCanonicalizedPlan(
+  { case Add(l, r, em) if em == evalMode => Seq(l, r) },
   { case (l: Expression, r: Expression) => Add(l, r, evalMode)},
   Some(evalMode)
 )
-if (resolved && evalModes.forall(_ == evalMode) && reorderResult.resolved 
&&
-  reorderResult.dataType == dataType) {
+if (resolved && reorderResult.resolved && reorderResult.dataType == 
dataType) {
   reorderResult
 } else {
   // SPARK-40903: Avoid reordering decimal Add for canonicalization if the 
result data type is
@@ -609,16 +607,11 @@ case class Multiply(
 newLeft: Expression, newRight: Expression): Multiply = copy(left = 
newLeft, right = newRight)
 
   override lazy val canonicalized: Expression = {
-val evalModes = collectEvalModes(this, {case Multiply(_, _, evalMode) => 
Seq(evalMode)})
-if (evalModes.forall(_ == evalMode)) {
-  buildCanonicalizedPlan(
-{ case Multiply(l, r, _) => Seq(l, r) },
-{ case (l: Expression, r: Expression) => Multiply(l, r, evalMode)},
-Some(evalMode)
-  )
-} else {
-  withCanonicalizedChildren
-}
+buildCanonicalizedPlan(
+  { case Multiply(l, r, em) if em == evalMode => Seq(l, r) },
+  { case (l: Expression, r: Expression) => Multiply(l, r, evalMode) },
+  Some(evalMode)
+)
   }
 }
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-41547][CONNECT][TESTS] Re-enable Spark Connect function tests with ANSI mode

2024-05-07 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8f719adcf556 [SPARK-41547][CONNECT][TESTS] Re-enable Spark Connect function tests with ANSI mode
8f719adcf556 is described below

commit 8f719adcf556f23ba66d3742266f4ca2e4875530
Author: Martin Grund 
AuthorDate: Tue May 7 09:14:06 2024 -0700

[SPARK-41547][CONNECT][TESTS] Re-enable Spark Connect function tests with ANSI mode

### What changes were proposed in this pull request?
This patch re-enables the previously failing tests after enablement of ANSI 
SQL.

### Why are the changes needed?
Spark 4 / ANSI SQL

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Re-enabled tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46432 from grundprinzip/grundprinzip/SPARK-41547.

Authored-by: Martin Grund 
Signed-off-by: Dongjoon Hyun 
---
 .../sql/tests/connect/test_connect_function.py | 33 ++
 1 file changed, 21 insertions(+), 12 deletions(-)

diff --git a/python/pyspark/sql/tests/connect/test_connect_function.py 
b/python/pyspark/sql/tests/connect/test_connect_function.py
index 2f21dd5a7d3a..9d4db8cf7d15 100644
--- a/python/pyspark/sql/tests/connect/test_connect_function.py
+++ b/python/pyspark/sql/tests/connect/test_connect_function.py
@@ -2030,7 +2030,6 @@ class SparkConnectFunctionTests(ReusedConnectTestCase, 
PandasOnSparkTestUtils, S
 (CF.sentences, SF.sentences),
 (CF.initcap, SF.initcap),
 (CF.soundex, SF.soundex),
-(CF.bin, SF.bin),
 (CF.hex, SF.hex),
 (CF.unhex, SF.unhex),
 (CF.length, SF.length),
@@ -2043,6 +2042,19 @@ class SparkConnectFunctionTests(ReusedConnectTestCase, 
PandasOnSparkTestUtils, S
 sdf.select(sfunc("a"), sfunc(sdf.b)).toPandas(),
 )
 
+query = """
+SELECT * FROM VALUES
+('   1   ', '2   ', NULL), ('   3', NULL, '4')
+AS tab(a, b, c)
+"""
+cdf = self.connect.sql(query)
+sdf = self.spark.sql(query)
+
+self.assert_eq(
+cdf.select(CF.bin(cdf.a), CF.bin(cdf.b)).toPandas(),
+sdf.select(SF.bin(sdf.a), SF.bin(sdf.b)).toPandas(),
+)
+
 def test_string_functions_multi_args(self):
 query = """
 SELECT * FROM VALUES
@@ -2149,15 +2161,15 @@ class SparkConnectFunctionTests(ReusedConnectTestCase, 
PandasOnSparkTestUtils, S
 def test_date_ts_functions(self):
 query = """
 SELECT * FROM VALUES
-('1997/02/28 10:30:00', '2023/03/01 06:00:00', 'JST', 1428476400, 
2020, 12, 6),
-('2000/01/01 04:30:05', '2020/05/01 12:15:00', 'PST', 1403892395, 
2022, 12, 6)
+('1997-02-28 10:30:00', '2023-03-01 06:00:00', 'JST', 1428476400, 
2020, 12, 6),
+('2000-01-01 04:30:05', '2020-05-01 12:15:00', 'PST', 1403892395, 
2022, 12, 6)
 AS tab(ts1, ts2, tz, seconds, Y, M, D)
 """
 # +---+---+---+--++---+---+
 # |ts1|ts2| tz|   seconds|   Y|  M|  D|
 # +---+---+---+--++---+---+
-# |1997/02/28 10:30:00|2023/03/01 06:00:00|JST|1428476400|2020| 12|  6|
-# |2000/01/01 04:30:05|2020/05/01 12:15:00|PST|1403892395|2022| 12|  6|
+# |1997-02-28 10:30:00|2023-03-01 06:00:00|JST|1428476400|2020| 12|  6|
+# |2000-01-01 04:30:05|2020-05-01 12:15:00|PST|1403892395|2022| 12|  6|
 # +---+---+---+--++---+---+
 
 cdf = self.connect.sql(query)
@@ -2213,14 +2225,14 @@ class SparkConnectFunctionTests(ReusedConnectTestCase, 
PandasOnSparkTestUtils, S
 (CF.to_date, SF.to_date),
 ]:
 self.assert_eq(
-cdf.select(cfunc(cdf.ts1, format="-MM-dd")).toPandas(),
-sdf.select(sfunc(sdf.ts1, format="-MM-dd")).toPandas(),
+cdf.select(cfunc(cdf.ts1, format="-MM-dd 
HH:mm:ss")).toPandas(),
+sdf.select(sfunc(sdf.ts1, format="-MM-dd 
HH:mm:ss")).toPandas(),
 )
 self.compare_by_show(
 # [left]:  datetime64[ns, America/Los_Angeles]
 # [right]: datetime64[ns]
-cdf.select(CF.to_timestamp(cdf.ts1, format="-MM-dd")),
-sdf.select(SF.to_timestamp(sdf.ts1, format="-MM-dd")),
+cdf.select(CF.to_timestamp(cdf.ts1,

(spark) branch master updated (925457cadd22 -> a3eebcf39687)

2024-05-07 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 925457cadd22 [SPARK-48169][SQL] Use lazy BadRecordException cause in 
all parsers and remove the old constructor, which was meant for the migration
 add a3eebcf39687 [SPARK-48170][PYTHON][CONNECT][TESTS] Enable 
`ArrowPythonUDFParityTests.test_err_return_type`

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/tests/connect/test_parity_arrow_python_udf.py | 4 
 1 file changed, 4 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48169][SQL] Use lazy BadRecordException cause in all parsers and remove the old constructor, which was meant for the migration

2024-05-07 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 925457cadd22 [SPARK-48169][SQL] Use lazy BadRecordException cause in 
all parsers and remove the old constructor, which was meant for the migration
925457cadd22 is described below

commit 925457cadd229673323e91a82d0b504145f509e0
Author: Vladimir Golubev 
AuthorDate: Tue May 7 09:09:00 2024 -0700

[SPARK-48169][SQL] Use lazy BadRecordException cause in all parsers and 
remove the old constructor, which was meant for the migration

### What changes were proposed in this pull request?
Use a factory function for the exception cause in `BadRecordException` to avoid constructing heavy exceptions in the underlying parser; they are now constructed on demand in `FailureSafeParser`. This is a follow-up to https://github.com/apache/spark/pull/46400
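
A plain-Python sketch of the pattern (hypothetical, not Spark code): the cause is passed as a zero-argument factory and only materialized when the failure-handling path actually needs it.

```python
# Hypothetical stand-in for the lazy-cause idea used by BadRecordException.
class LazyCauseError(Exception):
    def __init__(self, record_factory, cause_factory):
        super().__init__("bad record")
        self._record_factory = record_factory
        self._cause_factory = cause_factory  # not invoked here

    @property
    def record(self):
        return self._record_factory()

    @property
    def cause(self):
        # The potentially heavy exception is only built when a caller,
        # e.g. an error-handling path, actually asks for it.
        return self._cause_factory()
```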

### Why are the changes needed?
- Speed-up `JacksonParser` and `StaxXmlParser`, since they throw 
user-facing exceptions to `FailureSafeParser`
- Refactoring - leave only one constructor in `BadRecordException`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- `testOnly org.apache.spark.sql.catalyst.json.JacksonParserSuite`
- `testOnly org.apache.spark.sql.catalyst.csv.UnivocityParserSuite`

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46438 from 
vladimirg-db/vladimirg-db/use-lazy-exception-cause-in-all-bad-record-exception-invocations.

Authored-by: Vladimir Golubev 
Signed-off-by: Dongjoon Hyun 
---
 .../spark/sql/catalyst/csv/UnivocityParser.scala   |  2 +-
 .../spark/sql/catalyst/json/JacksonParser.scala| 12 ++--
 .../sql/catalyst/util/BadRecordException.scala | 10 +-
 .../spark/sql/catalyst/xml/StaxXmlParser.scala | 22 --
 4 files changed, 20 insertions(+), 26 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
index 37d9143e5b5a..8d06789a7512 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
@@ -359,7 +359,7 @@ class UnivocityParser(
 } else {
   if (badRecordException.isDefined) {
 throw BadRecordException(
-  () => currentInput, () => Array[InternalRow](requiredRow.get), 
badRecordException.get)
+  () => currentInput, () => Array(requiredRow.get), 
badRecordException.get)
   } else {
 requiredRow
   }
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
index d1093a3b1be1..3c42f72fa6b6 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
@@ -613,7 +613,7 @@ class JacksonParser(
 // JSON parser currently doesn't support partial results for corrupted 
records.
 // For such records, all fields other than the field configured by
 // `columnNameOfCorruptRecord` are set to `null`.
-throw BadRecordException(() => recordLiteral(record), cause = e)
+throw BadRecordException(() => recordLiteral(record), cause = () => e)
   case e: CharConversionException if options.encoding.isEmpty =>
 val msg =
   """JSON parser cannot handle a character in its input.
@@ -621,17 +621,17 @@ class JacksonParser(
 |""".stripMargin + e.getMessage
 val wrappedCharException = new CharConversionException(msg)
 wrappedCharException.initCause(e)
-throw BadRecordException(() => recordLiteral(record), cause = 
wrappedCharException)
+throw BadRecordException(() => recordLiteral(record), cause = () => 
wrappedCharException)
   case PartialResultException(row, cause) =>
 throw BadRecordException(
   record = () => recordLiteral(record),
   partialResults = () => Array(row),
-  convertCauseForPartialResult(cause))
+  cause = () => convertCauseForPartialResult(cause))
   case PartialResultArrayException(rows, cause) =>
 throw BadRecordException(
   record = () => recordLiteral(record),
   partialResults = () => rows,
-  cause)
+  cause = () => cause)
   // These exceptions should never be thrown outside of JacksonParser.
   // They are used for the control flow in the par

(spark) branch master updated (493493d6c5bb -> 9e0a87eb4cf2)

2024-05-07 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 493493d6c5bb [SPARK-48173][SQL] CheckAnalysis should see the entire 
query plan
 add 9e0a87eb4cf2 [SPARK-48165][BUILD] Update `ap-loader` to 3.0-9

No new revisions were added by this update.

Summary of changes:
 connector/profiler/pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48173][SQL] CheckAnalysis should see the entire query plan

2024-05-07 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 493493d6c5bb [SPARK-48173][SQL] CheckAnalysis should see the entire 
query plan
493493d6c5bb is described below

commit 493493d6c5bbbaa0b04f5548ac1ccd9502e8b8fa
Author: Wenchen Fan 
AuthorDate: Tue May 7 08:02:25 2024 -0700

[SPARK-48173][SQL] CheckAnalysis should see the entire query plan

### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/38029 . Some 
custom check rules need to see the entire query plan tree to get some context, 
but https://github.com/apache/spark/pull/38029 breaks it as it checks the query 
plan of dangling CTE relations recursively.

This PR fixes it by putting back the dangling CTE relation in the main query plan and then checking the main query plan.
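
As a hedged illustration (assuming a running `SparkSession` named `spark`), the query below contains a dangling CTE relation: `unused` is never referenced by the main query, so it would disappear after inlining, yet it still has to pass analysis checks.

```scala
// Hedged sketch: `unused` is a dangling CTE relation. Inlining would drop it,
// so the fix collects such relations and puts them back into the main plan
// before checking the whole plan.
spark.sql(
  """WITH unused AS (SELECT 1 AS id),
    |     used   AS (SELECT 2 AS id)
    |SELECT * FROM used
    |""".stripMargin).show()
```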

### Why are the changes needed?

Revert the breaking change to custom check rules

### Does this PR introduce _any_ user-facing change?

No for most users. This restores the behavior of Spark 3.3 and earlier for 
custom check rules.

### How was this patch tested?

existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46439 from cloud-fan/check.

Authored-by: Wenchen Fan 
Signed-off-by: Dongjoon Hyun 
---
 .../sql/catalyst/analysis/CheckAnalysis.scala  | 39 +++---
 1 file changed, 20 insertions(+), 19 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
index d1b336b08955..e55f23b6aa86 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
@@ -145,15 +145,16 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog with QueryErrorsB
   private def checkUnreferencedCTERelations(
   cteMap: mutable.Map[Long, (CTERelationDef, Int, mutable.Map[Long, Int])],
   visited: mutable.Map[Long, Boolean],
+  danglingCTERelations: mutable.ArrayBuffer[CTERelationDef],
   cteId: Long): Unit = {
 if (visited(cteId)) {
   return
 }
 val (cteDef, _, refMap) = cteMap(cteId)
 refMap.foreach { case (id, _) =>
-  checkUnreferencedCTERelations(cteMap, visited, id)
+  checkUnreferencedCTERelations(cteMap, visited, danglingCTERelations, id)
 }
-checkAnalysis0(cteDef.child)
+danglingCTERelations.append(cteDef)
 visited(cteId) = true
   }
 
@@ -161,35 +162,35 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog with QueryErrorsB
 val inlineCTE = InlineCTE(alwaysInline = true)
 val cteMap = mutable.HashMap.empty[Long, (CTERelationDef, Int, 
mutable.Map[Long, Int])]
 inlineCTE.buildCTEMap(plan, cteMap)
+val danglingCTERelations = mutable.ArrayBuffer.empty[CTERelationDef]
 val visited: mutable.Map[Long, Boolean] = 
mutable.Map.empty.withDefaultValue(false)
-cteMap.foreach { case (cteId, (relation, refCount, _)) =>
-  // If a CTE relation is never used, it will disappear after inline. Here 
we explicitly check
-  // analysis for it, to make sure the entire query plan is valid.
-  try {
-// If a CTE relation ref count is 0, the other CTE relations that 
reference it
-// should also be checked by checkAnalysis0. This code will also 
guarantee the leaf
-// relations that do not reference any others are checked first.
-if (refCount == 0) {
-  checkUnreferencedCTERelations(cteMap, visited, cteId)
-}
-  } catch {
-case e: AnalysisException =>
-  throw new ExtendedAnalysisException(e, relation.child)
+// If a CTE relation is never used, it will disappear after inline. Here 
we explicitly collect
+// these dangling CTE relations, and put them back in the main query, to 
make sure the entire
+// query plan is valid.
+cteMap.foreach { case (cteId, (_, refCount, _)) =>
+  // If a CTE relation ref count is 0, the other CTE relations that 
reference it should also be
+  // collected. This code will also guarantee the leaf relations that do 
not reference
+  // any others are collected first.
+  if (refCount == 0) {
+checkUnreferencedCTERelations(cteMap, visited, danglingCTERelations, 
cteId)
   }
 }
 // Inline all CTEs in the plan to help check query plan structures in 
subqueries.
-var inlinedPlan: Option[LogicalPlan] = None
+var inlinedPlan: LogicalPlan = plan
 try {
-  inlinedPlan = Some(inlineCTE(plan))
+  inlinedPlan = inlineCTE(plan)

(spark) branch master updated: [SPARK-48171][CORE] Clean up the use of deprecated constructors of `o.rocksdb.Logger`

2024-05-07 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c326f3c143ff [SPARK-48171][CORE] Clean up the use of deprecated 
constructors of `o.rocksdb.Logger`
c326f3c143ff is described below

commit c326f3c143ffdd56954706aeb4e0b82ac819bf03
Author: yangjie01 
AuthorDate: Tue May 7 07:33:38 2024 -0700

[SPARK-48171][CORE] Clean up the use of deprecated constructors of 
`o.rocksdb.Logger`

### What changes were proposed in this pull request?
This PR aims to clean up the use of deprecated constructors of `o.rocksdb.Logger`; the change refers to


https://github.com/facebook/rocksdb/blob/5c2be544f5509465957706c955b6d623e889ac4e/java/src/main/java/org/rocksdb/Logger.java#L39-L54

```
/**
   * AbstractLogger constructor.
   *
   * Important: the log level set within
   * the {@link org.rocksdb.Options} instance will be used as
   * maximum log level of RocksDB.
   *
   * @param options {@link org.rocksdb.Options} instance.
   *
   * @deprecated Use {@link Logger#Logger(InfoLogLevel)} instead, e.g. {@code new
   * Logger(options.infoLogLevel())}.
   */
  @Deprecated
  public Logger(final Options options) {
    this(options.infoLogLevel());
  }
```
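
A hedged Scala sketch of the recommended pattern (assuming `rocksdbjni` is on the classpath; the subclass name and the `log(InfoLogLevel, String)` callback signature are illustrative assumptions, mirroring Spark's own `RocksDBLogger` in the diff below):

```scala
import org.rocksdb.{InfoLogLevel, Logger, Options}

// Pass the log level explicitly instead of using the deprecated
// Options-based constructor, i.e. the `super(options.infoLogLevel())`
// pattern applied in the diff below.
class ConsoleRocksDbLogger(options: Options)
  extends Logger(options.infoLogLevel()) {
  override protected def log(level: InfoLogLevel, message: String): Unit =
    println(s"[$level] $message")
}
```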

### Why are the changes needed?
Clean up deprecated api usage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46436 from LuciferYang/rocksdb-deprecation.

Authored-by: yangjie01 
Signed-off-by: Dongjoon Hyun 
---
 .../src/main/java/org/apache/spark/network/util/RocksDBProvider.java| 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/common/network-common/src/main/java/org/apache/spark/network/util/RocksDBProvider.java
 
b/common/network-common/src/main/java/org/apache/spark/network/util/RocksDBProvider.java
index f3b7b48355a0..2b5ea01d94c9 100644
--- 
a/common/network-common/src/main/java/org/apache/spark/network/util/RocksDBProvider.java
+++ 
b/common/network-common/src/main/java/org/apache/spark/network/util/RocksDBProvider.java
@@ -136,7 +136,7 @@ public class RocksDBProvider {
 private static final Logger LOG = 
LoggerFactory.getLogger(RocksDBLogger.class);
 
 RocksDBLogger(Options options) {
-  super(options);
+  super(options.infoLogLevel());
 }
 
 @Override


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48163][CONNECT][TESTS] Disable `SparkConnectServiceSuite.SPARK-43923: commands send events - get_resources_command`

2024-05-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 56fe185c78a2 [SPARK-48163][CONNECT][TESTS] Disable 
`SparkConnectServiceSuite.SPARK-43923: commands send events - 
get_resources_command`
56fe185c78a2 is described below

commit 56fe185c78a249cf88b1d7e5d1e67444e1b224db
Author: Dongjoon Hyun 
AuthorDate: Mon May 6 21:39:52 2024 -0700

[SPARK-48163][CONNECT][TESTS] Disable 
`SparkConnectServiceSuite.SPARK-43923: commands send events - 
get_resources_command`

### What changes were proposed in this pull request?

This PR aims to disable a flaky test, 
`SparkConnectServiceSuite.SPARK-43923: commands send events - 
get_resources_command`, temporarily.

To re-enable this, SPARK-48164 is created as a blocker issue for 4.0.0.

### Why are the changes needed?

This test case was added in `Apache Spark 3.5.0`, but it has been flaky and causes many retries in our GitHub Action CI environment.

- https://github.com/apache/spark/pull/42454

- https://github.com/apache/spark/actions/runs/8979348499/job/24661200052
```
[info] - SPARK-43923: commands send events ((get_resources_command {
[info] }
[info] ,None)) *** FAILED *** (35 milliseconds)
[info]   VerifyEvents.this.listener.executeHolder.isDefined was false 
(SparkConnectServiceSuite.scala:873)
```

This PR aims to stabilize CI first; the flaky test is tracked in SPARK-48164 as a blocker-level issue to resolve before `Spark Connect GA` in Apache Spark 4.0.0.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46425 from dongjoon-hyun/SPARK-48163.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../apache/spark/sql/connect/planner/SparkConnectServiceSuite.scala| 3 +++
 1 file changed, 3 insertions(+)

diff --git 
a/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectServiceSuite.scala
 
b/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectServiceSuite.scala
index af18fca9dd21..59d9750c0fbf 100644
--- 
a/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectServiceSuite.scala
+++ 
b/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectServiceSuite.scala
@@ -418,11 +418,14 @@ class SparkConnectServiceSuite
   .setInput(
 
proto.Relation.newBuilder().setSql(proto.SQL.newBuilder().setQuery("select 
1",
 None),
+  // TODO(SPARK-48164) Reenable `commands send events - 
get_resources_command`
+  /*
   (
 proto.Command
   .newBuilder()
   .setGetResourcesCommand(proto.GetResourcesCommand.newBuilder()),
 None),
+  */
   (
 proto.Command
   .newBuilder()


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48141][TEST] Update the Oracle docker image version used for test and integration to use Oracle Database 23ai Free

2024-05-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 05b22ebb3060 [SPARK-48141][TEST] Update the Oracle docker image 
version used for test and integration to use Oracle Database 23ai Free
05b22ebb3060 is described below

commit 05b22ebb30606a76c50e649a6efa825f03ca97ff
Author: Luca Canali 
AuthorDate: Mon May 6 20:44:51 2024 -0700

[SPARK-48141][TEST] Update the Oracle docker image version used for test 
and integration to use Oracle Database 23ai Free

### What changes were proposed in this pull request?
This proposes to update the Docker image used for integration tests and 
builds to Oracle Database 23ai Free, version 23.4 (previously we used Oracle 
Database 23c Free, version 23.3)

### Why are the changes needed?
This is to keep the testing infrastructure up-to-date with the latest 
Oracle Database Free version.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test infrastructure.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46399 from LucaCanali/updateOracleImage.

Lead-authored-by: Luca Canali 
Co-authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml| 1 -
 connector/docker-integration-tests/README.md| 2 +-
 .../test/scala/org/apache/spark/sql/jdbc/OracleDatabaseOnDocker.scala   | 2 +-
 3 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index b34456fc3e42..286f8e1193d9 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -928,7 +928,6 @@ jobs:
   HIVE_PROFILE: hive2.3
   GITHUB_PREV_SHA: ${{ github.event.before }}
   SPARK_LOCAL_IP: localhost
-  ORACLE_DOCKER_IMAGE_NAME: gvenzl/oracle-free:23.3
   SKIP_UNIDOC: true
   SKIP_MIMA: true
   SKIP_PACKAGING: true
diff --git a/connector/docker-integration-tests/README.md 
b/connector/docker-integration-tests/README.md
index 0192947bdbf9..03d3fe706a60 100644
--- a/connector/docker-integration-tests/README.md
+++ b/connector/docker-integration-tests/README.md
@@ -45,7 +45,7 @@ the container bootstrapping. To run an individual Docker 
integration test, use t
 
 Besides the default Docker images, the integration tests can be run with 
custom Docker images. For example,
 
-ORACLE_DOCKER_IMAGE_NAME=gvenzl/oracle-free:23.3-slim-faststart 
./build/sbt -Pdocker-integration-tests "docker-integration-tests/testOnly 
*OracleIntegrationSuite"
+ORACLE_DOCKER_IMAGE_NAME=gvenzl/oracle-free:23.4-slim-faststart 
./build/sbt -Pdocker-integration-tests "docker-integration-tests/testOnly 
*OracleIntegrationSuite"
 
 The following environment variables can be used to specify the custom Docker 
images for different databases:
 
diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleDatabaseOnDocker.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleDatabaseOnDocker.scala
index bfbcf5b533d7..88bb23f9c653 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleDatabaseOnDocker.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleDatabaseOnDocker.scala
@@ -26,7 +26,7 @@ import org.apache.spark.util.Utils
 
 class OracleDatabaseOnDocker extends DatabaseOnDocker with Logging {
   lazy override val imageName =
-sys.env.getOrElse("ORACLE_DOCKER_IMAGE_NAME", 
"gvenzl/oracle-free:23.3-slim")
+sys.env.getOrElse("ORACLE_DOCKER_IMAGE_NAME", 
"gvenzl/oracle-free:23.4-slim")
   val oracle_password = "Th1s1sThe0racle#Pass"
   override val env = Map(
 "ORACLE_PWD" -> oracle_password, // oracle images uses this


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48150][SQL] try_parse_json output should be declared as nullable

2024-05-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8cf602a3f587 [SPARK-48150][SQL] try_parse_json output should be 
declared as nullable
8cf602a3f587 is described below

commit 8cf602a3f587af4acc15637878437f166db4ed3f
Author: Josh Rosen 
AuthorDate: Mon May 6 20:08:56 2024 -0700

[SPARK-48150][SQL] try_parse_json output should be declared as nullable

### What changes were proposed in this pull request?

The `try_parse_json` expression added in 
https://github.com/apache/spark/pull/46141 declares improper output 
nullability: the `try_` version's output must be marked as nullable. This PR 
corrects the nullability and adds a test.
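
A hedged example of the behavior that forces nullability (assuming a `SparkSession` named `spark` on a build that includes `try_parse_json`): malformed input yields NULL instead of an error.

```scala
// The second input row is not valid JSON, so try_parse_json returns NULL for
// it; the output column therefore has to be declared nullable.
val df = spark.sql(
  """SELECT try_parse_json(s) AS v
    |FROM VALUES ('{"a": 1}'), ('not json') AS t(s)
    |""".stripMargin)
df.printSchema()  // v should be reported as nullable
df.show()         // the second row is NULL
```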

### Why are the changes needed?

Incorrectly declaring an expression's output as non-nullable when it is 
actually nullable may lead to crashes.

### Does this PR introduce _any_ user-facing change?

Yes, it affects output nullability and thus may affect query result schemas.

### How was this patch tested?

New unit test cases.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46409 from JoshRosen/fix-try-parse-json-nullability.

Authored-by: Josh Rosen 
Signed-off-by: Dongjoon Hyun 
---
 .../query-tests/explain-results/function_try_parse_json.explain  | 2 +-
 .../sql/catalyst/expressions/variant/variantExpressions.scala| 2 +-
 .../catalyst/expressions/variant/VariantExpressionSuite.scala| 9 +
 3 files changed, 11 insertions(+), 2 deletions(-)

diff --git 
a/connector/connect/common/src/test/resources/query-tests/explain-results/function_try_parse_json.explain
 
b/connector/connect/common/src/test/resources/query-tests/explain-results/function_try_parse_json.explain
index 1772b5d37623..5c6b21a3ad46 100644
--- 
a/connector/connect/common/src/test/resources/query-tests/explain-results/function_try_parse_json.explain
+++ 
b/connector/connect/common/src/test/resources/query-tests/explain-results/function_try_parse_json.explain
@@ -1,2 +1,2 @@
-Project [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.variant.VariantExpressionEvalUtils$, 
VariantType, parseJson, g#0, false, StringType, BooleanType, true, false, true) 
AS try_parse_json(g)#0]
+Project [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.variant.VariantExpressionEvalUtils$, 
VariantType, parseJson, g#0, false, StringType, BooleanType, true, true, true) 
AS try_parse_json(g)#0]
 +- LocalRelation , [id#0L, a#0, b#0, d#0, e#0, f#0, g#0]
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala
index 3dbc72415ff0..5026d8e49ef1 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala
@@ -59,7 +59,7 @@ case class ParseJson(child: Expression, failOnError: Boolean 
= true)
 "parseJson",
 Seq(child, Literal(failOnError, BooleanType)),
 inputTypes :+ BooleanType,
-returnNullable = false)
+returnNullable = !failOnError)
 
   override def inputTypes: Seq[AbstractDataType] = StringType :: Nil
 
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionSuite.scala
index f4a6a144c221..73abf8074e8c 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionSuite.scala
@@ -810,6 +810,15 @@ class VariantExpressionSuite extends SparkFunSuite with 
ExpressionEvalHelper {
   "Hello")
   }
 
+  test("SPARK-48150: ParseJson expression nullability") {
+assert(!ParseJson(Literal("["), failOnError = true).replacement.nullable)
+assert(ParseJson(Literal("["), failOnError = false).replacement.nullable)
+checkEvaluation(
+  ParseJson(Literal("["), failOnError = false).replacement,
+  null
+)
+  }
+
   test("cast to variant") {
 def check[T : TypeTag](input: T, expectedJson: String): Unit = {
   val cast = Cast(Literal.create(input), VariantType, evalMode = 
EvalMode.ANSI)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (f918d1179642 -> 0907a15b2d15)

2024-05-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from f918d1179642 [SPARK-48151][INFRA] `build_and_test.yml` should use 
`Volcano` 1.7.0 for `branch-3.4/3.5`
 add 0907a15b2d15 [SPARK-48153][INFRA] Run `build` job of 
`build_and_test.yml` only if needed

No new revisions were added by this update.

Summary of changes:
 .github/workflows/build_and_test.yml | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-48088][PYTHON][CONNECT][TESTS][FOLLOW-UP][3.5] Skips another test that requires JVM access

2024-05-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new e699a1eee085 [SPARK-48088][PYTHON][CONNECT][TESTS][FOLLOW-UP][3.5] Skips another test that requires JVM access
e699a1eee085 is described below

commit e699a1eee085eb6025f33284c6369553713794d1
Author: Hyukjin Kwon 
AuthorDate: Mon May 6 19:06:29 2024 -0700

[SPARK-48088][PYTHON][CONNECT][TESTS][FOLLOW-UP][3.5] Skips another test that requires JVM access

### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/46334 that 
missed one more test case.

### Why are the changes needed?

See https://github.com/apache/spark/pull/46334

### Does this PR introduce _any_ user-facing change?

See https://github.com/apache/spark/pull/46334

### How was this patch tested?

Manually

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46411 from HyukjinKwon/SPARK-48088-followup.

Authored-by: Hyukjin Kwon 
Signed-off-by: Dongjoon Hyun 
---
 python/pyspark/ml/tests/connect/test_connect_pipeline.py | 1 +
 1 file changed, 1 insertion(+)

diff --git a/python/pyspark/ml/tests/connect/test_connect_pipeline.py 
b/python/pyspark/ml/tests/connect/test_connect_pipeline.py
index dc7490bf14b1..eb2bedddbe28 100644
--- a/python/pyspark/ml/tests/connect/test_connect_pipeline.py
+++ b/python/pyspark/ml/tests/connect/test_connect_pipeline.py
@@ -22,6 +22,7 @@ from pyspark.sql import SparkSession
 from pyspark.ml.tests.connect.test_legacy_mode_pipeline import 
PipelineTestsMixin
 
 
+@unittest.skipIf("SPARK_SKIP_CONNECT_COMPAT_TESTS" in os.environ, "Requires 
JVM access")
 class PipelineTestsOnConnect(PipelineTestsMixin, unittest.TestCase):
 def setUp(self) -> None:
 self.spark = (


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (2ef7246b9c5b -> f918d1179642)

2024-05-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 2ef7246b9c5b [SPARK-48142][PYTHON][CONNECT][TESTS] Enable 
`CogroupedApplyInPandasTests.test_wrong_args`
 add f918d1179642 [SPARK-48151][INFRA] `build_and_test.yml` should use 
`Volcano` 1.7.0 for `branch-3.4/3.5`

No new revisions were added by this update.

Summary of changes:
 .github/workflows/build_and_test.yml | 5 +
 1 file changed, 5 insertions(+)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48149][INFRA] Serialize `build_python.yml` to run a single Python version per cron schedule

2024-05-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 4c6884291e8b [SPARK-48149][INFRA] Serialize `build_python.yml` to run 
a single Python version per cron schedule
4c6884291e8b is described below

commit 4c6884291e8b97a7d64dd13530f7ecabe2839d16
Author: Dongjoon Hyun 
AuthorDate: Mon May 6 16:06:57 2024 -0700

[SPARK-48149][INFRA] Serialize `build_python.yml` to run a single Python 
version per cron schedule

### What changes were proposed in this pull request?

This PR aims to serialize `build_python.yml` to run a single Python version 
per cron schedule to reduce the maximum concurrency per single GitHub Action 
job.

### Why are the changes needed?

Currently, `build_python.yml` triggers 60 jobs. `30` of `60` jobs are 
running concurrently because 10 test pipelines are required per Python version.

- https://github.com/apache/spark/actions/workflows/build_python.yml

https://github.com/apache/spark/assets/9700541/e4f4e9d2-2b2e-43b9-a760-6b9943c7b5b7

According to https://infra.apache.org/github-actions-policy.html,
> All workflows SHOULD have a job concurrency level less than or equal to 
15.

After this PR, the maximum concurrently level will be 10.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review because this is a daily CI.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46407 from dongjoon-hyun/SPARK-48149.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_python.yml | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/.github/workflows/build_python.yml 
b/.github/workflows/build_python.yml
index 3354fb726368..9195dc4af518 100644
--- a/.github/workflows/build_python.yml
+++ b/.github/workflows/build_python.yml
@@ -17,18 +17,26 @@
 # under the License.
 #
 
+# According to https://infra.apache.org/github-actions-policy.html,
+# all workflows SHOULD have a job concurrency level less than or equal to 15.
+# To do that, we run one python version per cron schedule
 name: "Build / Python-only (master, PyPy 3.9/Python 3.10/Python 3.12)"
 
 on:
   schedule:
 - cron: '0 15 * * *'
+- cron: '0 17 * * *'
+- cron: '0 19 * * *'
 
 jobs:
   run-build:
 strategy:
   fail-fast: false
   matrix:
-pyversion: ["pypy3", "python3.10", "python3.12"]
+include:
+  - pyversion: ${{ github.event.schedule == '0 15 * * *' && "pypy3" }}
+  - pyversion: ${{ github.event.schedule == '0 17 * * *' && 
"python3.10" }}
+  - pyversion: ${{ github.event.schedule == '0 19 * * *' && 
"python3.12" }}
 permissions:
   packages: write
 name: Run


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48145][CORE] Remove logDebug and logTrace with MDC in JAVA structured logging framework

2024-05-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new de8ba8589c21 [SPARK-48145][CORE] Remove logDebug and logTrace with MDC 
in JAVA structured logging framework
de8ba8589c21 is described below

commit de8ba8589c218ffbe57efc581bd921a6aef73fae
Author: Gengliang Wang 
AuthorDate: Mon May 6 13:32:54 2024 -0700

[SPARK-48145][CORE] Remove logDebug and logTrace with MDC in JAVA 
structured logging framework

### What changes were proposed in this pull request?

Since we are targeting the migration of INFO/WARN/ERROR level logs to structured logging, this PR removes the MDC variants of the logDebug and logTrace methods from the Java structured logging framework.

### Why are the changes needed?

In the log migration PR https://github.com/apache/spark/pull/46390, there 
are unnecessary changes such as updating
```
logger.debug("Task {} need to spill {} for {}", taskAttemptId,
Utils.bytesToString(required - got), requestingConsumer);
```
to
```
LOGGER.debug("Task {} need to spill {} for {}", 
String.valueOf(taskAttemptId),
Utils.bytesToString(required - got), 
requestingConsumer.toString());
```

With this PR, we can avoid such changes during log migrations.
### Does this PR introduce _any_ user-facing change?

No
### How was this patch tested?

Existing UT.
### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46405 from gengliangwang/updateJavaLog.

Authored-by: Gengliang Wang 
Signed-off-by: Dongjoon Hyun 
---
 .../java/org/apache/spark/internal/Logger.java | 49 ++
 .../org/apache/spark/util/LoggerSuiteBase.java | 28 +++--
 2 files changed, 26 insertions(+), 51 deletions(-)

diff --git a/common/utils/src/main/java/org/apache/spark/internal/Logger.java 
b/common/utils/src/main/java/org/apache/spark/internal/Logger.java
index f252f44b3b76..2b4dd3bb45bc 100644
--- a/common/utils/src/main/java/org/apache/spark/internal/Logger.java
+++ b/common/utils/src/main/java/org/apache/spark/internal/Logger.java
@@ -110,50 +110,43 @@ public class Logger {
 slf4jLogger.debug(msg);
   }
 
-  public void debug(String msg, Throwable throwable) {
-slf4jLogger.debug(msg, throwable);
+  public void debug(String format, Object arg) {
+slf4jLogger.debug(format, arg);
   }
 
-  public void debug(String msg, MDC... mdcs) {
-if (mdcs == null || mdcs.length == 0) {
-  slf4jLogger.debug(msg);
-} else if (slf4jLogger.isDebugEnabled()) {
-  withLogContext(msg, mdcs, null, mt -> slf4jLogger.debug(mt.message));
-}
+  public void debug(String format, Object arg1, Object arg2) {
+slf4jLogger.debug(format, arg1, arg2);
   }
 
-  public void debug(String msg, Throwable throwable, MDC... mdcs) {
-if (mdcs == null || mdcs.length == 0) {
-  slf4jLogger.debug(msg);
-} else if (slf4jLogger.isDebugEnabled()) {
-  withLogContext(msg, mdcs, throwable, mt -> slf4jLogger.debug(mt.message, 
mt.throwable));
-}
+  public void debug(String format, Object... arguments) {
+slf4jLogger.debug(format, arguments);
+  }
+
+  public void debug(String msg, Throwable throwable) {
+slf4jLogger.debug(msg, throwable);
   }
 
   public void trace(String msg) {
 slf4jLogger.trace(msg);
   }
 
-  public void trace(String msg, Throwable throwable) {
-slf4jLogger.trace(msg, throwable);
+  public void trace(String format, Object arg) {
+slf4jLogger.trace(format, arg);
   }
 
-  public void trace(String msg, MDC... mdcs) {
-if (mdcs == null || mdcs.length == 0) {
-  slf4jLogger.trace(msg);
-} else if (slf4jLogger.isTraceEnabled()) {
-  withLogContext(msg, mdcs, null, mt -> slf4jLogger.trace(mt.message));
-}
+  public void trace(String format, Object arg1, Object arg2) {
+slf4jLogger.trace(format, arg1, arg2);
   }
 
-  public void trace(String msg, Throwable throwable, MDC... mdcs) {
-if (mdcs == null || mdcs.length == 0) {
-  slf4jLogger.trace(msg);
-} else if (slf4jLogger.isTraceEnabled()) {
-  withLogContext(msg, mdcs, throwable, mt -> slf4jLogger.trace(mt.message, 
mt.throwable));
-}
+  public void trace(String format, Object... arguments) {
+slf4jLogger.trace(format, arguments);
   }
 
+  public void trace(String msg, Throwable throwable) {
+slf4jLogger.trace(msg, throwable);
+  }
+
+
   private void withLogContext(
   String pattern,
   MDC[] mdcs,
diff --git 
a/common/utils/src/test/java/org/apache/spark/util/LoggerSuiteBase.java 
b/common/utils/src/test/java/org/apache/spark/util/LoggerSuiteBase.java
index cdc06f6fc261..6c39304bece0 100644
--- a/common/utils/src/test/java/org/apache/spark/util/LoggerSuiteBase.

(spark-website) branch asf-site updated: Update `committers` page (#517)

2024-05-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/spark-website.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 79e0191af2 Update `committers` page (#517)
79e0191af2 is described below

commit 79e0191af219aab6e9aea84458700e33d7013bef
Author: Dongjoon Hyun 
AuthorDate: Tue May 7 04:25:32 2024 +0900

Update `committers` page (#517)

This PR aims to update the `committers` page because `Apache Spark 4.0.0-preview` is going to be ready this week.
- https://spark.apache.org/committers.html
---
 committers.md| 30 +++---
 site/committers.html | 30 +++---
 2 files changed, 30 insertions(+), 30 deletions(-)

diff --git a/committers.md b/committers.md
index 42e06bebd7..f443ff280b 100644
--- a/committers.md
+++ b/committers.md
@@ -10,20 +10,20 @@ navigation:
 
 |Name|Organization|
 |||
-|Sameer Agarwal|Facebook|
+|Sameer Agarwal|Deductive AI|
 |Michael Armbrust|Databricks|
 |Dilip Biswal|Adobe|
-|Ryan Blue|Netflix|
+|Ryan Blue|Tabular|
 |Joseph Bradley|Databricks|
 |Matthew Cheah|Palantir|
-|Felix Cheung|SafeGraph|
+|Felix Cheung|NVIDIA|
 |Mosharaf Chowdhury|University of Michigan, Ann Arbor|
 |Bryan Cutler|IBM|
 |Jason Dai|Intel|
 |Tathagata Das|Databricks|
-|Ankur Dave|UC Berkeley|
+|Ankur Dave|Databricks|
 |Aaron Davidson|Databricks|
-|Thomas Dudziak|Facebook|
+|Thomas Dudziak|Meta|
 |Erik Erlandson|Red Hat|
 |Robert Evans|NVIDIA|
 |Wenchen Fan|Databricks|
@@ -34,7 +34,7 @@ navigation:
 |Thomas Graves|NVIDIA|
 |Stephen Haberman|LinkedIn|
 |Mark Hamstra|ClearStory Data|
-|Seth Hendrickson|Cloudera|
+|Seth Hendrickson|Stripe|
 |Herman van Hovell|Databricks|
 |Liang-Chi Hsieh|Apple|
 |Yin Huai|Databricks|
@@ -43,7 +43,7 @@ navigation:
 |Kazuaki Ishizaki|IBM|
 |Xingbo Jiang|Databricks|
 |Yikun Jiang|Huawei|
-|Holden Karau|Apple|
+|Holden Karau|Netflix|
 |Shane Knapp|UC Berkeley|
 |Cody Koeninger|Nexstar Digital|
 |Andy Konwinski|Databricks|
@@ -61,23 +61,23 @@ navigation:
 |Xiangrui Meng|Databricks|
 |Xinrong Meng|Databricks|
 |Mridul Muralidharan|LinkedIn|
-|Andrew Or|Princeton University|
+|Andrew Or|Facebook|
 |Kay Ousterhout|LightStep|
 |Sean Owen|Databricks|
-|Tejas Patil|Facebook|
-|Nick Pentreath|IBM|
+|Tejas Patil|Meta|
+|Nick Pentreath|Automattic|
 |Attila Zsolt Piros|Cloudera|
-|Anirudh Ramanathan|Rockset|
+|Anirudh Ramanathan|Signadot|
 |Imran Rashid|Cloudera|
 |Charles Reiss|University of Virginia|
-|Josh Rosen|Stripe|
-|Sandy Ryza|Remix|
+|Josh Rosen|Databricks|
+|Sandy Ryza|Dagster|
 |Kousuke Saruta|NTT Data|
 |Saisai Shao|Datastrato|
 |Prashant Sharma|IBM|
 |Gabor Somogyi|Apple|
-|Ram Sriharsha|Databricks|
-|Chao Sun|Apple|
+|Ram Sriharsha|Pinecone|
+|Chao Sun|OpenAI|
 |Maciej Szymkiewicz||
 |Jose Torres|Databricks|
 |Peter Toth|Cloudera|
diff --git a/site/committers.html b/site/committers.html
index 0153ecf595..15e8bdbe19 100644
--- a/site/committers.html
+++ b/site/committers.html
@@ -153,7 +153,7 @@
   
 
   Sameer Agarwal
-  Facebook
+  Deductive AI
 
 
   Michael Armbrust
@@ -165,7 +165,7 @@
 
 
   Ryan Blue
-  Netflix
+  Tabular
 
 
   Joseph Bradley
@@ -177,7 +177,7 @@
 
 
   Felix Cheung
-  SafeGraph
+  NVIDIA
 
 
   Mosharaf Chowdhury
@@ -197,7 +197,7 @@
 
 
   Ankur Dave
-  UC Berkeley
+  Databricks
 
 
   Aaron Davidson
@@ -205,7 +205,7 @@
 
 
   Thomas Dudziak
-  Facebook
+  Meta
 
 
   Erik Erlandson
@@ -249,7 +249,7 @@
 
 
   Seth Hendrickson
-  Cloudera
+  Stripe
 
 
   Herman van Hovell
@@ -285,7 +285,7 @@
 
 
   Holden Karau
-  Apple
+  Netflix
 
 
   Shane Knapp
@@ -357,7 +357,7 @@
 
 
   Andrew Or
-  Princeton University
+  Facebook
 
 
   Kay Ousterhout
@@ -369,11 +369,11 @@
 
 
   Tejas Patil
-  Facebook
+  Meta
 
 
   Nick Pentreath
-  IBM
+  Automattic
 
 
   Attila Zsolt Piros
@@ -381,7 +381,7 @@
 
 
   Anirudh Ramanathan
-  Rockset
+  Signadot
 
 
   Imran Rashid
@@ -393,11 +393,11 @@
 
 
   Josh Rosen
-  Stripe
+  Databricks
 
 
   Sandy Ryza
-  Remix
+  Dagster
 
 
   Kousuke Saruta
@@ -417,11 +417,11 @@
 
 
   Ram Sriharsha
-  Databricks
+  Pinecone
 
 
   Chao Sun
-  Apple
+  OpenAI
 
 
   Maciej Szymkiewicz


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (7c728b2c2d6c -> 526e4141457d)

2024-05-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 7c728b2c2d6c [SPARK-48137][INFRA] Run `yarn` test only in PR builders 
and Daily CIs
 add 526e4141457d [SPARK-45220][FOLLOWUP][DOCS][TESTS] Make a 
`dataframe.join` doctest deterministic

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/dataframe.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (8294c5962feb -> 7c728b2c2d6c)

2024-05-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 8294c5962feb [SPARK-48138][CONNECT][TESTS] Disable a flaky 
`SparkSessionE2ESuite.interrupt tag` test
 add 7c728b2c2d6c [SPARK-48137][INFRA] Run `yarn` test only in PR builders 
and Daily CIs

No new revisions were added by this update.

Summary of changes:
 .github/workflows/build_and_test.yml | 12 ++--
 .github/workflows/build_java21.yml   |  1 +
 .github/workflows/build_non_ansi.yml |  3 ++-
 .github/workflows/build_rockdb_as_ui_backend.yml |  3 ++-
 4 files changed, 15 insertions(+), 4 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48136][INFRA][CONNECT] Always upload Spark Connect log files in scheduled build for Spark Connect

2024-05-05 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d09f174be5e9 [SPARK-48136][INFRA][CONNECT] Always upload Spark Connect 
log files in scheduled build for Spark Connect
d09f174be5e9 is described below

commit d09f174be5e9bf7dee12840526ed8bf6aee07052
Author: Hyukjin Kwon 
AuthorDate: Sun May 5 17:49:15 2024 -0700

[SPARK-48136][INFRA][CONNECT] Always upload Spark Connect log files in 
scheduled build for Spark Connect

### What changes were proposed in this pull request?

This PR proposes to upload Spark Connect log files in scheduled build for 
Spark Connect

### Why are the changes needed?

Difficult to debug, e.g., 
https://github.com/apache/spark/actions/runs/8960485641/job/24607044822

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46393 from HyukjinKwon/SPARK-48136.

Authored-by: Hyukjin Kwon 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_python_connect.yml   | 2 +-
 .github/workflows/build_python_connect35.yml | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/build_python_connect.yml 
b/.github/workflows/build_python_connect.yml
index 3a9ce5115741..639b0d084314 100644
--- a/.github/workflows/build_python_connect.yml
+++ b/.github/workflows/build_python_connect.yml
@@ -118,7 +118,7 @@ jobs:
   name: test-results-spark-connect-python-only
   path: "**/target/test-reports/*.xml"
   - name: Upload Spark Connect server log file
-if: failure()
+if: ${{ !success() }}
 uses: actions/upload-artifact@v4
 with:
   name: unit-tests-log-spark-connect-python-only
diff --git a/.github/workflows/build_python_connect35.yml 
b/.github/workflows/build_python_connect35.yml
index 8c9a5fa86996..14edb8bf91ed 100644
--- a/.github/workflows/build_python_connect35.yml
+++ b/.github/workflows/build_python_connect35.yml
@@ -106,7 +106,7 @@ jobs:
   name: test-results-spark-connect-python-only
   path: "**/target/test-reports/*.xml"
   - name: Upload Spark Connect server log file
-if: failure()
+if: ${{ !success() }}
 uses: actions/upload-artifact@v4
 with:
   name: unit-tests-log-spark-connect-python-only


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48135][INFRA] Run `buf` and `ui` only in PR builders and Java 21 Daily CI

2024-05-05 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8b2251734519 [SPARK-48135][INFRA] Run `buf` and `ui` only in PR 
builders and Java 21 Daily CI
8b2251734519 is described below

commit 8b22517345190e007ca87c7491116ad590ad46f2
Author: Dongjoon Hyun 
AuthorDate: Sun May 5 16:40:11 2024 -0700

[SPARK-48135][INFRA] Run `buf` and `ui` only in PR builders and Java 21 
Daily CI

### What changes were proposed in this pull request?

This PR aims to run `buf` and `ui` tests only in PR builders and Java 21 
Daily CI.

### Why are the changes needed?

Currently, Apache Spark CI always runs the `buf` and `ui` tests because they finish quickly.


https://github.com/apache/spark/blob/32ba5c1db62c2674e8acced56f89ed840bf9/.github/workflows/build_and_test.yml#L102-L103

- `buf` job

https://github.com/apache/spark/blob/32ba5c1db62c2674e8acced56f89ed840bf9/.github/workflows/build_and_test.yml#L571-L574

- `ui` job

https://github.com/apache/spark/blob/32ba5c1db62c2674e8acced56f89ed840bf9/.github/workflows/build_and_test.yml#L1049-L1052

However, the ASF Infra team's guideline recommends keeping the job concurrency level at or below `15`, so we had better offload `buf` and `ui` from the per-commit CI.

- https://infra.apache.org/github-actions-policy.html

> All workflows SHOULD have a job concurrency level less than or equal to 
15.

### Does this PR introduce _any_ user-facing change?

No because this is an infra update.

### How was this patch tested?

Pass the CIs and manual review because PR builders will not be affected by 
this.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46392 from dongjoon-hyun/SPARK-48135.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml | 8 ++--
 .github/workflows/build_java21.yml   | 4 +++-
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index f626cd72be15..8a85d26c0eca 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -82,10 +82,14 @@ jobs:
 pandas=$pyspark
 kubernetes=`./dev/is-changed.py -m kubernetes`
 sparkr=`./dev/is-changed.py -m sparkr`
+buf=true
+ui=true
   else
 pandas=false
 kubernetes=false
 sparkr=false
+buf=false
+ui=false
   fi
   # 'build' is always true for now.
   # It does not save significant time and most of PRs trigger the 
build.
@@ -99,8 +103,8 @@ jobs:
   \"docker-integration-tests\": \"false\",
   \"lint\" : \"true\",
   \"k8s-integration-tests\" : \"$kubernetes\",
-  \"buf\" : \"true\",
-  \"ui\" : \"true\",
+  \"buf\" : \"$buf\",
+  \"ui\" : \"$ui\",
 }"
   echo $precondition # For debugging
   # Remove `\n` to avoid "Invalid format" error
diff --git a/.github/workflows/build_java21.yml 
b/.github/workflows/build_java21.yml
index bfeedd4174cf..a2fb0e6e2c1d 100644
--- a/.github/workflows/build_java21.yml
+++ b/.github/workflows/build_java21.yml
@@ -47,5 +47,7 @@ jobs:
   "sparkr": "true",
   "tpcds-1g": "true",
   "docker-integration-tests": "true",
-  "k8s-integration-tests": "true"
+  "k8s-integration-tests": "true",
+  "buf": "true",
+  "ui": "true"
 }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48133][INFRA] Run `sparkr` only in PR builders and Daily CIs

2024-05-05 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 32ba5c1db62c [SPARK-48133][INFRA] Run `sparkr` only in PR builders and 
Daily CIs
32ba5c1db62c is described below

commit 32ba5c1db62c2674e8acced56f89ed840bf9
Author: Dongjoon Hyun 
AuthorDate: Sun May 5 13:19:23 2024 -0700

[SPARK-48133][INFRA] Run `sparkr` only in PR builders and Daily CIs

### What changes were proposed in this pull request?

This PR aims to run `sparkr` only in PR builders and Daily CIs. In other words, only the commit builder will skip it by default.

### Why are the changes needed?

To reduce GitHub Action usage to meet ASF INFRA policy.
- https://infra.apache.org/github-actions-policy.html

> All workflows MUST have a job concurrency level less than or equal to 
20. This means a workflow cannot have more than 20 jobs running at the same 
time across all matrices.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46389 from dongjoon-hyun/SPARK-48133.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index c87e8921b48e..f626cd72be15 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -76,17 +76,17 @@ jobs:
   id: set-outputs
   run: |
 if [ -z "${{ inputs.jobs }}" ]; then
-  pyspark=true; sparkr=true;
   pyspark_modules=`cd dev && python -c "import 
sparktestsupport.modules as m; print(','.join(m.name for m in m.all_modules if 
m.name.startswith('pyspark')))"`
   pyspark=`./dev/is-changed.py -m $pyspark_modules`
   if [[ "${{ github.repository }}" != 'apache/spark' ]]; then
 pandas=$pyspark
 kubernetes=`./dev/is-changed.py -m kubernetes`
+sparkr=`./dev/is-changed.py -m sparkr`
   else
 pandas=false
 kubernetes=false
+sparkr=false
   fi
-  sparkr=`./dev/is-changed.py -m sparkr`
   # 'build' is always true for now.
   # It does not save significant time and most of PRs trigger the 
build.
   precondition="


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48132][INFRA] Run `k8s-integration-tests` only in PR builder and Daily CIs

2024-05-04 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a0f62393d69a [SPARK-48132][INFRA] Run `k8s-integration-tests` only in 
PR builder and Daily CIs
a0f62393d69a is described below

commit a0f62393d69a40ddd49b034b3ce332e6fa6bfb13
Author: Dongjoon Hyun 
AuthorDate: Sat May 4 22:55:04 2024 -0700

[SPARK-48132][INFRA] Run `k8s-integration-tests` only in PR builder and 
Daily CIs

### What changes were proposed in this pull request?

This PR aims to run `k8s-integration-tests` only in PR builders and Daily CIs. In other words, only the commit builder will skip it by default.

Please note that
- K8s unit tests will be covered by the commit builder still.
- All PR builders do not consume ASF resources, and they also provide lots of test coverage every day.

### Why are the changes needed?

To reduce GitHub Action usage to meet ASF INFRA policy.
- https://infra.apache.org/github-actions-policy.html

> All workflows MUST have a job concurrency level less than or equal to 
20. This means a workflow cannot have more than 20 jobs running at the same 
time across all matrices.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46388 from dongjoon-hyun/SPARK-48132.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 6ef971002c54..c87e8921b48e 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -81,11 +81,12 @@ jobs:
   pyspark=`./dev/is-changed.py -m $pyspark_modules`
   if [[ "${{ github.repository }}" != 'apache/spark' ]]; then
 pandas=$pyspark
+kubernetes=`./dev/is-changed.py -m kubernetes`
   else
 pandas=false
+kubernetes=false
   fi
   sparkr=`./dev/is-changed.py -m sparkr`
-  kubernetes=`./dev/is-changed.py -m kubernetes`
   # 'build' is always true for now.
   # It does not save significant time and most of PRs trigger the 
build.
   precondition="


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48131][CORE] Unify MDC key `mdc.taskName` and `task_name`

2024-05-04 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8443672b1ab1 [SPARK-48131][CORE] Unify MDC key `mdc.taskName` and 
`task_name`
8443672b1ab1 is described below

commit 8443672b1ab1195278a73a9ec487af8e02e3a8de
Author: Gengliang Wang 
AuthorDate: Sat May 4 17:33:02 2024 -0700

[SPARK-48131][CORE] Unify MDC key `mdc.taskName` and `task_name`

### What changes were proposed in this pull request?

Currently there are two MDC keys for task name:
* `mdc.taskName`, which was introduced in https://github.com/apache/spark/pull/28801. Before that change, it was `taskName`.
* `task_name`: introduced by the structured logging framework project.

To make the MDC keys unified, this PR renames `mdc.taskName` to `task_name`. This MDC shows up frequently in logs when running Spark applications.
Before change:
```
"context":{"mdc.taskName":"task 19.0 in stage 0.0 (TID 19)”}
```
after change
```
"context":{“task_name":"task 19.0 in stage 0.0 (TID 19)”}
```

### Why are the changes needed?

1. Make the MDC names consistent
2. Minor upside: this will allow users to query task names with `SELECT * FROM logs where context.task_name = ...` (as sketched below). Otherwise, querying with `context.mdc.task_name` will result in an analysis exception, and users will have to query with `context['mdc.task_name']`.
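
A hedged sketch of that query pattern (assuming the structured logs are already registered as a temporary view `logs` with a `context MAP<STRING, STRING>` column):

```scala
// With the unified key, a plain dotted path reaches the map entry directly.
spark.sql("SELECT msg FROM logs WHERE context.task_name LIKE 'task 19.0%'").show()

// A key that itself contains a dot (the old `mdc.taskName`) cannot be reached
// with dot notation and needs an explicit map lookup instead.
spark.sql("SELECT msg FROM logs WHERE context['mdc.taskName'] LIKE 'task 19.0%'").show()
```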

### Does this PR introduce _any_ user-facing change?

Not really. The MDC key is used by developers for debugging purposes.

### How was this patch tested?

Manual test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46386 from gengliangwang/unify.

Authored-by: Gengliang Wang 
Signed-off-by: Dongjoon Hyun 
---
 core/src/main/scala/org/apache/spark/executor/Executor.scala | 6 +++---
 docs/configuration.md| 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/executor/Executor.scala 
b/core/src/main/scala/org/apache/spark/executor/Executor.scala
index fd6c02c07789..3edba45ef89f 100644
--- a/core/src/main/scala/org/apache/spark/executor/Executor.scala
+++ b/core/src/main/scala/org/apache/spark/executor/Executor.scala
@@ -40,7 +40,7 @@ import org.slf4j.MDC
 
 import org.apache.spark._
 import org.apache.spark.deploy.SparkHadoopUtil
-import org.apache.spark.internal.{Logging, MDC => LogMDC}
+import org.apache.spark.internal.{Logging, LogKeys, MDC => LogMDC}
 import org.apache.spark.internal.LogKeys._
 import org.apache.spark.internal.config._
 import org.apache.spark.internal.plugin.PluginContainer
@@ -914,7 +914,7 @@ private[spark] class Executor(
 try {
   mdc.foreach { case (key, value) => MDC.put(key, value) }
  // avoid overriding the taskName by the user
-  MDC.put("mdc.taskName", taskName)
+  MDC.put(LogKeys.TASK_NAME.name, taskName)
 } catch {
   case _: NoSuchFieldError => logInfo("MDC is not supported.")
 }
@@ -923,7 +923,7 @@ private[spark] class Executor(
   private def cleanMDCForTask(taskName: String, mdc: Seq[(String, String)]): 
Unit = {
 try {
   mdc.foreach { case (key, _) => MDC.remove(key) }
-  MDC.remove("mdc.taskName")
+  MDC.remove(LogKeys.TASK_NAME.name)
 } catch {
   case _: NoSuchFieldError => logInfo("MDC is not supported.")
 }
diff --git a/docs/configuration.md b/docs/configuration.md
index a55ce89c096b..fb14af6d55b8 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -3693,7 +3693,7 @@ val logDf = 
spark.read.schema(LOG_SCHEMA).json("path/to/logs")
 ```
 
 ## Plain Text Logging
-If you prefer plain text logging, you can use the 
`log4j2.properties.pattern-layout-template` file as a starting point. This is 
the default configuration used by Spark before the 4.0.0 release. This 
configuration uses the 
[PatternLayout](https://logging.apache.org/log4j/2.x/manual/layouts.html#PatternLayout)
 to log all the logs in plain text. MDC information is not included by default. 
In order to print it in the logs, you can update the patternLayout in the file. 
For example, you can ad [...]
+If you prefer plain text logging, you can use the 
`log4j2.properties.pattern-layout-template` file as a starting point. This is 
the default configuration used by Spark before the 4.0.0 release. This 
configuration uses the 
[PatternLayout](https://logging.apache.org/log4j/2.x/manual/layouts.html#PatternLayout)
 to log all the logs in plain text. MDC information is not included by default. 
In order to print it in the logs, you can update the patternLayout in the file. 
For exam

(spark) branch master updated: [SPARK-48129][PYTHON] Provide a constant table schema in PySpark for querying structured logs

2024-05-04 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9a45da21dd1c [SPARK-48129][PYTHON] Provide a constant table schema in 
PySpark for querying structured logs
9a45da21dd1c is described below

commit 9a45da21dd1c7dd93152f7126c8c611b8ba031e7
Author: Gengliang Wang 
AuthorDate: Sat May 4 11:54:49 2024 -0700

[SPARK-48129][PYTHON] Provide a constant table schema in PySpark for 
querying structured logs

### What changes were proposed in this pull request?

Similar to https://github.com/apache/spark/pull/46375/, this PR provides a 
constant table schema in PySpark for querying structured logs.
The doc of logging configuration is also updated.

### Why are the changes needed?

Provide a convenient way to query Spark logs using PySpark.
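
On the Scala side, a hedged sketch of what the constant schema enables (the log path and the ERROR-level slicing are illustrative; assumes a `SparkSession` named `spark`):

```scala
import org.apache.spark.util.LogUtils.LOG_SCHEMA

// Load JSON-formatted structured logs with the constant schema, then slice
// them with ordinary DataFrame operations.
val logDf = spark.read.schema(LOG_SCHEMA).json("/tmp/spark-logs/*.json")

import spark.implicits._
logDf.filter($"level" === "ERROR")
  .groupBy($"logger")
  .count()
  .orderBy($"count".desc)
  .show(truncate = false)
```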

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manual test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46384 from gengliangwang/pythonLog.

Authored-by: Gengliang Wang 
Signed-off-by: Dongjoon Hyun 
---
 docs/configuration.md  |  9 -
 python/pyspark/util.py | 16 
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/docs/configuration.md b/docs/configuration.md
index d07decf02505..a55ce89c096b 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -3677,8 +3677,15 @@ Starting from version 4.0.0, `spark-submit` has adopted 
the [JSON Template Layou
 
 To configure the layout of structured logging, start with the 
`log4j2.properties.template` file.
 
-To query Spark logs using Spark SQL, you can use the following Scala code 
snippet:
+To query Spark logs using Spark SQL, you can use the following Python code 
snippet:
 
+```python
+from pyspark.util import LogUtils
+
+logDf = spark.read.schema(LogUtils.LOG_SCHEMA).json("path/to/logs")
+```
+
+Or using the following Scala code snippet:
 ```scala
 import org.apache.spark.util.LogUtils.LOG_SCHEMA
 
diff --git a/python/pyspark/util.py b/python/pyspark/util.py
index f0fa4a2413ce..4920ba957c19 100644
--- a/python/pyspark/util.py
+++ b/python/pyspark/util.py
@@ -107,6 +107,22 @@ class VersionUtils:
 )
 
 
+class LogUtils:
+"""
+Utils for querying structured Spark logs with Spark SQL.
+"""
+
+LOG_SCHEMA = (
+"ts TIMESTAMP, "
+"level STRING, "
+"msg STRING, "
+"context map, "
+"exception STRUCT>>,"
+"logger STRING"
+)
+
+
 def fail_on_stopiteration(f: Callable) -> Callable:
 """
 Wraps the input function to fail on 'StopIteration' by raising a 
'RuntimeError'


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-46009][SQL][FOLLOWUP] Remove unused golden file

2024-05-04 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 356aca5af5b8 [SPARK-46009][SQL][FOLLOWUP] Remove unused golden file
356aca5af5b8 is described below

commit 356aca5af5b88570d43d1c0f2b417aa87b86d323
Author: beliefer 
AuthorDate: Sat May 4 11:51:40 2024 -0700

[SPARK-46009][SQL][FOLLOWUP] Remove unused golden file

### What changes were proposed in this pull request?
This PR proposes to remove an unused golden file.

### Why are the changes needed?
https://github.com/apache/spark/pull/46272 removed unused `PERCENTILE_CONT` 
and `PERCENTILE_DISC` in g4.
But I made a mistake and submitted my local test code.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
GA

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #46385 from beliefer/SPARK-46009_followup3.

Authored-by: beliefer 
Signed-off-by: Dongjoon Hyun 
---
 .../sql-tests/analyzer-results/window2.sql.out | 126 -
 1 file changed, 126 deletions(-)

diff --git 
a/sql/core/src/test/resources/sql-tests/analyzer-results/window2.sql.out 
b/sql/core/src/test/resources/sql-tests/analyzer-results/window2.sql.out
deleted file mode 100644
index 6fd41286959a..
--- a/sql/core/src/test/resources/sql-tests/analyzer-results/window2.sql.out
+++ /dev/null
@@ -1,126 +0,0 @@
--- Automatically generated by SQLQueryTestSuite
--- !query
-CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES
-(null, 1L, 1.0D, date("2017-08-01"), timestamp_seconds(1501545600), "a"),
-(1, 1L, 1.0D, date("2017-08-01"), timestamp_seconds(1501545600), "a"),
-(1, 2L, 2.5D, date("2017-08-02"), timestamp_seconds(150200), "a"),
-(2, 2147483650L, 100.001D, date("2020-12-31"), timestamp_seconds(1609372800), 
"a"),
-(1, null, 1.0D, date("2017-08-01"), timestamp_seconds(1501545600), "b"),
-(2, 3L, 3.3D, date("2017-08-03"), timestamp_seconds(150300), "b"),
-(3, 2147483650L, 100.001D, date("2020-12-31"), timestamp_seconds(1609372800), 
"b"),
-(null, null, null, null, null, null),
-(3, 1L, 1.0D, date("2017-08-01"), timestamp_seconds(1501545600), null)
-AS testData(val, val_long, val_double, val_date, val_timestamp, cate)
--- !query analysis
-CreateViewCommand `testData`, SELECT * FROM VALUES
-(null, 1L, 1.0D, date("2017-08-01"), timestamp_seconds(1501545600), "a"),
-(1, 1L, 1.0D, date("2017-08-01"), timestamp_seconds(1501545600), "a"),
-(1, 2L, 2.5D, date("2017-08-02"), timestamp_seconds(150200), "a"),
-(2, 2147483650L, 100.001D, date("2020-12-31"), timestamp_seconds(1609372800), 
"a"),
-(1, null, 1.0D, date("2017-08-01"), timestamp_seconds(1501545600), "b"),
-(2, 3L, 3.3D, date("2017-08-03"), timestamp_seconds(150300), "b"),
-(3, 2147483650L, 100.001D, date("2020-12-31"), timestamp_seconds(1609372800), 
"b"),
-(null, null, null, null, null, null),
-(3, 1L, 1.0D, date("2017-08-01"), timestamp_seconds(1501545600), null)
-AS testData(val, val_long, val_double, val_date, val_timestamp, cate), false, 
true, LocalTempView, true
-   +- Project [val#x, val_long#xL, val_double#x, val_date#x, val_timestamp#x, 
cate#x]
-  +- SubqueryAlias testData
- +- LocalRelation [val#x, val_long#xL, val_double#x, val_date#x, 
val_timestamp#x, cate#x]
-
-
--- !query
-CREATE OR REPLACE TEMPORARY VIEW basic_pays AS SELECT * FROM VALUES
-('Diane Murphy','Accounting',8435),
-('Mary Patterson','Accounting',9998),
-('Jeff Firrelli','Accounting',8992),
-('William Patterson','Accounting',8870),
-('Gerard Bondur','Accounting',11472),
-('Anthony Bow','Accounting',6627),
-('Leslie Jennings','IT',8113),
-('Leslie Thompson','IT',5186),
-('Julie Firrelli','Sales',9181),
-('Steve Patterson','Sales',9441),
-('Foon Yue Tseng','Sales',6660),
-('George Vanauf','Sales',10563),
-('Loui Bondur','SCM',10449),
-('Gerard Hernandez','SCM',6949),
-('Pamela Castillo','SCM',11303),
-('Larry Bott','SCM',11798),
-('Barry Jones','SCM',10586)
-AS basic_pays(employee_name, department, salary)
--- !query analysis
-CreateViewCommand `basic_pays`, SELECT * FROM VALUES
-('Diane Murphy','Accounting',8435),
-('Mary Patterson','Accounting',9998),
-('Jeff Firrelli','Accounting',8992),
-('William Patterson','Accounting',8870),
-('Gerard Bondur','Accounting',11472),
-('Anthony Bow','Accounting',6627),
-('Leslie Jennings','IT',8113),
-('Leslie Thompson','IT',5186),
-('Julie Firrelli','Sales',9181),
-('Steve Patterson','Sales',9441),
-('Foon 

(spark) branch branch-3.5 updated: [SPARK-48128][SQL] For BitwiseCount / bit_count expression, fix codegen syntax error for boolean type inputs

2024-05-04 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 2f2347f3b74f [SPARK-48128][SQL] For BitwiseCount / bit_count 
expression, fix codegen syntax error for boolean type inputs
2f2347f3b74f is described below

commit 2f2347f3b74f1478fb583de9378427b3e45bd980
Author: Josh Rosen 
AuthorDate: Sat May 4 11:49:20 2024 -0700

[SPARK-48128][SQL] For BitwiseCount / bit_count expression, fix codegen 
syntax error for boolean type inputs

### What changes were proposed in this pull request?

This PR fixes an issue where `BitwiseCount` / `bit_count` of boolean inputs 
would cause codegen to generate syntactically invalid Java code that does not 
compile, triggering errors like

```
 java.util.concurrent.ExecutionException: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 41, 
Column 11: Failed to compile: org.codehaus.commons.compiler.CompileException: 
File 'generated.java', Line 41, Column 11: Unexpected token "if" in primary
```

Even though this code has test cases in `bitwise.sql` via the query test 
framework, those existing test cases were insufficient to find this problem: I 
believe that is because the example queries were constant-folded using the 
interpreted path, leaving the codegen path without test coverage.

This PR fixes the codegen issue and adds explicit expression tests to 
ensure that the same tests run on both the codegen and interpreted paths.
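
For context, a minimal PySpark sketch of the kind of query that exercises the codegen path (the column name and session setup are illustrative, not part of this patch): because the input is a non-foldable column rather than a literal, `bit_count` cannot be constant-folded and has to run through generated Java code.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A non-literal boolean column: bit_count(flag) is evaluated per row,
# so the generated-Java (codegen) path is used instead of constant folding.
df = spark.range(4).selectExpr("id % 2 = 0 AS flag")
df.selectExpr("flag", "bit_count(flag) AS bits").show()
```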

### Why are the changes needed?

Fix a rare codegen to interpreted fallback issue, which may harm query 
performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added new test cases to BitwiseExpressionsSuite.scala, copied from the 
existing `bitwise.sql` query test case file.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46382 from JoshRosen/SPARK-48128-bit_count_codegen.

Authored-by: Josh Rosen 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 96f65c950064d330245dc53fcd50cf6d9753afc8)
Signed-off-by: Dongjoon Hyun 
---
 .../catalyst/expressions/bitwiseExpressions.scala  |  2 +-
 .../expressions/BitwiseExpressionsSuite.scala  | 41 ++
 2 files changed, 42 insertions(+), 1 deletion(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/bitwiseExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/bitwiseExpressions.scala
index 6061f625ef07..183e5d6697e9 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/bitwiseExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/bitwiseExpressions.scala
@@ -229,7 +229,7 @@ case class BitwiseCount(child: Expression)
   override def prettyName: String = "bit_count"
 
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = 
child.dataType match {
-case BooleanType => defineCodeGen(ctx, ev, c => s"if ($c) 1 else 0")
+case BooleanType => defineCodeGen(ctx, ev, c => s"($c) ? 1 : 0")
 case _ => defineCodeGen(ctx, ev, c => s"java.lang.Long.bitCount($c)")
   }
 
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/BitwiseExpressionsSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/BitwiseExpressionsSuite.scala
index 4cd5f3e861ac..5bd1bc346c02 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/BitwiseExpressionsSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/BitwiseExpressionsSuite.scala
@@ -133,6 +133,47 @@ class BitwiseExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 }
   }
 
+  test("BitCount") {
+// null
+val nullLongLiteral = Literal.create(null, LongType)
+val nullIntLiteral = Literal.create(null, IntegerType)
+val nullBooleanLiteral = Literal.create(null, BooleanType)
+checkEvaluation(BitwiseCount(nullLongLiteral), null)
+checkEvaluation(BitwiseCount(nullIntLiteral), null)
+checkEvaluation(BitwiseCount(nullBooleanLiteral), null)
+
+// boolean
+checkEvaluation(BitwiseCount(Literal(true)), 1)
+checkEvaluation(BitwiseCount(Literal(false)), 0)
+
+// byte/tinyint
+checkEvaluation(BitwiseCount(Literal(1.toByte)), 1)
+checkEvaluation(BitwiseCount(Literal(2.toByte)), 1)
+checkEvaluation(BitwiseCount(Literal(3.toByte)), 2)
+
+// short/smallint
+checkEvaluation(BitwiseCount(Literal(1.toShort)), 1)
+checkEvaluation(BitwiseCount(Literal(2.toShort)), 1)
+checkEval
