(spark) branch master updated: [SPARK-47969][PYTHON][TESTS][FOLLOWUP] Make Test `test_creation_index` deterministic

2024-05-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 5d1f976f85fe [SPARK-47969][PYTHON][TESTS][FOLLOWUP] Make Test 
`test_creation_index` deterministic
5d1f976f85fe is described below

commit 5d1f976f85fe1ee39ca3cc4f0f2e6afa8b43e5ea
Author: Ruifeng Zheng 
AuthorDate: Fri May 3 20:42:30 2024 -0700

[SPARK-47969][PYTHON][TESTS][FOLLOWUP] Make Test `test_creation_index` 
deterministic

### What changes were proposed in this pull request?
A follow-up of https://github.com/apache/spark/pull/46200.

### Why are the changes needed?
there is still non-deterministic code in this test:
```
Traceback (most recent call last):
  File "/home/jenkins/python/pyspark/testing/pandasutils.py", line 91, in _assert_pandas_equal
    assert_frame_equal(
  File "/databricks/python3/lib/python3.11/site-packages/pandas/_testing/asserters.py", line 1257, in assert_frame_equal
    assert_index_equal(
  File "/databricks/python3/lib/python3.11/site-packages/pandas/_testing/asserters.py", line 407, in assert_index_equal
    raise_assert_detail(obj, msg, left, right)
  File "/databricks/python3/lib/python3.11/site-packages/pandas/_testing/asserters.py", line 665, in raise_assert_detail
    raise AssertionError(msg)
AssertionError: DataFrame.index are different
DataFrame.index values are different (75.0 %)
[left]:  DatetimeIndex(['2022-09-02', '2022-09-03', '2022-08-31', '2022-09-05'], dtype='datetime64[ns]', freq=None)
[right]: DatetimeIndex(['2022-08-31', '2022-09-02', '2022-09-03', '2022-09-05'], dtype='datetime64[ns]', freq=None)

```
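For context, the fix below sorts both sides by index before comparing, since rows of a pandas-on-Spark frame can be returned in any order. A minimal standalone sketch of that pattern (`pdf` here is a stand-in for the test's pandas DataFrame, which is only partially visible in this digest):

```python
import pandas as pd
import pyspark.pandas as ps

# Stand-in data; the real test builds `pdf` elsewhere in test_constructor.py.
pdf = pd.DataFrame({"a": [1, 2, 3, 4]})
idx = pd.DatetimeIndex(["2022-08-31", "2022-09-02", "2022-09-03", "2022-09-05"])

# Rows of the distributed frame may come back in a nondeterministic order,
# so sort both sides by index before asserting equality.
left = ps.DataFrame(data=pdf, index=idx).sort_index().to_pandas()
right = pd.DataFrame(data=pdf, index=idx).sort_index()
pd.testing.assert_frame_equal(left, right)
```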

### Does this PR introduce _any_ user-facing change?
no, test only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46378 from zhengruifeng/ps_test_create_index.

Authored-by: Ruifeng Zheng 
Signed-off-by: Dongjoon Hyun 
---
 python/pyspark/pandas/tests/frame/test_constructor.py | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/python/pyspark/pandas/tests/frame/test_constructor.py 
b/python/pyspark/pandas/tests/frame/test_constructor.py
index d7581895c6c9..e093adfa7ba3 100644
--- a/python/pyspark/pandas/tests/frame/test_constructor.py
+++ b/python/pyspark/pandas/tests/frame/test_constructor.py
@@ -269,11 +269,11 @@ class FrameConstructorMixin:
 ps.DataFrame(
 data=pdf,
 index=pd.DatetimeIndex(["2022-08-31", "2022-09-02", 
"2022-09-03", "2022-09-05"]),
-),
+).sort_index(),
 pd.DataFrame(
 data=pdf,
 index=pd.DatetimeIndex(["2022-08-31", "2022-09-02", 
"2022-09-03", "2022-09-05"]),
-),
+).sort_index(),
 )
 
 # test with pd.DataFrame and ps.DatetimeIndex
@@ -281,11 +281,11 @@ class FrameConstructorMixin:
 ps.DataFrame(
 data=pdf,
 index=ps.DatetimeIndex(["2022-08-31", "2022-09-02", 
"2022-09-03", "2022-09-05"]),
-),
+).sort_index(),
 pd.DataFrame(
 data=pdf,
 index=pd.DatetimeIndex(["2022-08-31", "2022-09-02", 
"2022-09-03", "2022-09-05"]),
-),
+).sort_index(),
 )
 
 with ps.option_context("compute.ops_on_diff_frames", True):
@@ -296,13 +296,13 @@ class FrameConstructorMixin:
 index=pd.DatetimeIndex(
 ["2022-08-31", "2022-09-02", "2022-09-03", 
"2022-09-05"]
 ),
-),
+).sort_index(),
 pd.DataFrame(
 data=pdf,
 index=pd.DatetimeIndex(
 ["2022-08-31", "2022-09-02", "2022-09-03", 
"2022-09-05"]
 ),
-),
+).sort_index(),
 )
 
 # test with ps.DataFrame and ps.DatetimeIndex
@@ -312,13 +312,13 @@ class FrameConstructorMixin:
 index=ps.DatetimeIndex(
 ["2022-08-31", "2022-09-02", "2022-09-03", 
"2022-09-05"]
 ),
-),
+).sort_index(),
 pd.DataFrame(
 data

(spark) branch master updated: [SPARK-47097][CONNECT][TESTS][FOLLOWUP] Increase timeout to `1 minute` for `interrupt tag` test

2024-05-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 7f08df4af95d [SPARK-47097][CONNECT][TESTS][FOLLOWUP] Increase timeout 
to `1 minute` for `interrupt tag` test
7f08df4af95d is described below

commit 7f08df4af95d20f3fd056588b5a3cfa5f5c57654
Author: Dongjoon Hyun 
AuthorDate: Fri May 3 16:54:24 2024 -0700

[SPARK-47097][CONNECT][TESTS][FOLLOWUP] Increase timeout to `1 minute` for 
`interrupt tag` test

### What changes were proposed in this pull request?

This is a follow-up to increase `timeout` from `30s` to `1 minute` like the 
other timeouts of the same test case.
- #45173
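For readers less familiar with ScalaTest, `eventually(timeout(...), interval(...))` retries the enclosed block until it succeeds or the deadline passes, so a larger timeout only affects the slow or flaky path. A rough Python analogue of that polling pattern (illustrative only, not Spark code):

```python
import time

def eventually(condition, timeout=60.0, interval=1.0):
    """Poll `condition` until it returns truthy or the deadline passes."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise AssertionError(f"condition not met within {timeout}s")
        # Only the failing/slow case ever waits out the full timeout.
        time.sleep(interval)
```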

### Why are the changes needed?

To reduce the flakiness further. The following are recent failures on the
`master` branch.
- https://github.com/apache/spark/actions/runs/8944948827/job/24572965877
- https://github.com/apache/spark/actions/runs/8945375279/job/24574263993

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46374 from dongjoon-hyun/SPARK-47097.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../src/test/scala/org/apache/spark/sql/SparkSessionE2ESuite.scala| 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/SparkSessionE2ESuite.scala
 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/SparkSessionE2ESuite.scala
index b967245d90c2..d1015d55b1df 100644
--- 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/SparkSessionE2ESuite.scala
+++ 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/SparkSessionE2ESuite.scala
@@ -196,7 +196,7 @@ class SparkSessionE2ESuite extends RemoteSparkSession {
 
 // q2 and q3 should be cancelled
 interrupted.clear()
-eventually(timeout(30.seconds), interval(1.seconds)) {
+eventually(timeout(1.minute), interval(1.seconds)) {
   val ids = spark.interruptTag("two")
   interrupted ++= ids
   assert(interrupted.length == 2, s"Interrupted operations: $interrupted.")
@@ -213,7 +213,7 @@ class SparkSessionE2ESuite extends RemoteSparkSession {
 
 // q1 and q4 should be cancelled
 interrupted.clear()
-eventually(timeout(30.seconds), interval(1.seconds)) {
+eventually(timeout(1.minute), interval(1.seconds)) {
   val ids = spark.interruptTag("one")
   interrupted ++= ids
   assert(interrupted.length == 2, s"Interrupted operations: $interrupted.")


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48121][K8S] Promote `KubernetesDriverConf` to `DeveloperApi`

2024-05-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c3a462ce2966 [SPARK-48121][K8S] Promote `KubernetesDriverConf` to 
`DeveloperApi`
c3a462ce2966 is described below

commit c3a462ce2966d42a3cebf238b809e2c2e2631c08
Author: zhou-jiang 
AuthorDate: Fri May 3 16:25:38 2024 -0700

[SPARK-48121][K8S] Promote `KubernetesDriverConf` to `DeveloperApi`

### What changes were proposed in this pull request?

This PR aims to promote `KubernetesDriverConf` to `DeveloperApi`

### Why are the changes needed?

Since the Apache Spark Kubernetes Operator requires this class, it should be
maintained as an official developer API starting with Apache Spark 4.0.0.

https://github.com/apache/spark-kubernetes-operator/pull/10

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass the CIs

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46373 from jiangzho/driver_conf.

Authored-by: zhou-jiang 
Signed-off-by: Dongjoon Hyun 
---
 .../scala/org/apache/spark/deploy/k8s/KubernetesConf.scala| 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala
 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala
index fda772b737fe..f62204a8a9c0 100644
--- 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala
+++ 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala
@@ -22,6 +22,7 @@ import io.fabric8.kubernetes.api.model.{LocalObjectReference, 
LocalObjectReferen
 import org.apache.commons.lang3.StringUtils
 
 import org.apache.spark.{SPARK_VERSION, SparkConf}
+import org.apache.spark.annotation.{DeveloperApi, Since, Unstable}
 import org.apache.spark.deploy.k8s.Config._
 import org.apache.spark.deploy.k8s.Constants._
 import org.apache.spark.deploy.k8s.features.DriverServiceFeatureStep._
@@ -78,7 +79,15 @@ private[spark] abstract class KubernetesConf(val sparkConf: 
SparkConf) {
   def getOption(key: String): Option[String] = sparkConf.getOption(key)
 }
 
-private[spark] class KubernetesDriverConf(
+/**
+ * :: DeveloperApi ::
+ *
+ * Used for K8s operations internally and Spark K8s operator.
+ */
+@Unstable
+@DeveloperApi
+@Since("4.0.0")
+class KubernetesDriverConf(
 sparkConf: SparkConf,
 val appId: String,
 val mainAppResource: MainAppResource,


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark-kubernetes-operator) branch main updated: [SPARK-48120] Enable autolink to SPARK jira issue

2024-05-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/spark-kubernetes-operator.git


The following commit(s) were added to refs/heads/main by this push:
 new 91ecc93  [SPARK-48120] Enable autolink to SPARK jira issue
91ecc93 is described below

commit 91ecc932096f0f41f395d2b6e935daa075c7d47a
Author: Dongjoon Hyun 
AuthorDate: Fri May 3 15:52:27 2024 -0700

[SPARK-48120] Enable autolink to SPARK jira issue

### What changes were proposed in this pull request?

This PR aims to enable the `autolink` feature for `SPARK` JIRA issues, like the
`Apache Spark` repository.

### Why are the changes needed?

Since we share the same JIRA project name, we need to link it.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #11 from dongjoon-hyun/SPARK-48120.

Lead-authored-by: Dongjoon Hyun 
Co-authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .asf.yaml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.asf.yaml b/.asf.yaml
index c7e6ae7..c1409a7 100644
--- a/.asf.yaml
+++ b/.asf.yaml
@@ -26,6 +26,7 @@ github:
 merge: false
 squash: true
 rebase: true
+  autolink_jira: SPARK
 
 notifications:
   pullrequests: revi...@spark.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (85902880d709 -> b42d235c2930)

2024-05-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 85902880d709 [SPARK-48119][K8S] Promote `KubernetesDriverSpec` to 
`DeveloperApi`
 add b42d235c2930 [SPARK-48114][CORE] Precompile template regex to avoid 
unnecessary work

No new revisions were added by this update.

Summary of changes:
 .../src/main/scala/org/apache/spark/ErrorClassesJSONReader.scala  | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)
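The diff itself is not reproduced in this digest; the change precompiles the message-template regex in the Scala `ErrorClassesJSONReader` so it is not rebuilt on every lookup. A loose illustration of the same technique in Python (the `<name>` placeholder syntax below is an assumption, not Spark's actual template format):

```python
import re

# Compiled once at module load instead of on every formatting call.
_PARAM_PATTERN = re.compile(r"<(\w+)>")

def format_template(template: str, params: dict) -> str:
    # Substitute <name> placeholders using the precompiled pattern.
    return _PARAM_PATTERN.sub(lambda m: str(params.get(m.group(1), m.group(0))), template)

print(format_template("Cannot cast <from> to <to>.", {"from": "STRING", "to": "INT"}))
```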


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48119][K8S] Promote `KubernetesDriverSpec` to `DeveloperApi`

2024-05-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 85902880d709 [SPARK-48119][K8S] Promote `KubernetesDriverSpec` to 
`DeveloperApi`
85902880d709 is described below

commit 85902880d709a66ef89bd6a5e0e7f1233f4d4fec
Author: zhou-jiang 
AuthorDate: Fri May 3 15:02:56 2024 -0700

[SPARK-48119][K8S] Promote `KubernetesDriverSpec` to `DeveloperApi`

### What changes were proposed in this pull request?

This PR aims to promote `KubernetesDriverSpec` to `DeveloperApi`

### Why are the changes needed?

Since the Apache Spark Kubernetes Operator requires this class, it should be
maintained as an official developer API starting with Apache Spark 4.0.0.

https://github.com/apache/spark-kubernetes-operator/pull/10

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass the CIs

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46371 from jiangzho/k8s_dev_apis.

Authored-by: zhou-jiang 
Signed-off-by: Dongjoon Hyun 
---
 .../org/apache/spark/deploy/k8s/KubernetesDriverSpec.scala  | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesDriverSpec.scala
 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesDriverSpec.scala
index a603cb08ba9a..0fd2cf16e74e 100644
--- 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesDriverSpec.scala
+++ 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesDriverSpec.scala
@@ -18,7 +18,18 @@ package org.apache.spark.deploy.k8s
 
 import io.fabric8.kubernetes.api.model.HasMetadata
 
-private[spark] case class KubernetesDriverSpec(
+import org.apache.spark.annotation.{DeveloperApi, Since, Unstable}
+
+/**
+ * :: DeveloperApi ::
+ *
+ * Spec for driver pod and resources, used for K8s operations internally
+ * and Spark K8s operator.
+ */
+@Unstable
+@DeveloperApi
+@Since("3.3.0")
+case class KubernetesDriverSpec(
 pod: SparkPod,
 driverPreKubernetesResources: Seq[HasMetadata],
 driverKubernetesResources: Seq[HasMetadata],


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (aa00b00c18e6 -> d6ca2c5c3c4b)

2024-05-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from aa00b00c18e6 [SPARK-48115][INFRA] Remove `Python 3.11` from 
`build_python.yml`
 add d6ca2c5c3c4b [SPARK-48118][SQL] Support 
`SPARK_SQL_LEGACY_CREATE_HIVE_TABLE` env variable

No new revisions were added by this update.

Summary of changes:
 docs/sql-migration-guide.md | 2 +-
 sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
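The diff is likewise only summarized here; as a hedged, generic sketch of the pattern of letting an environment variable drive a legacy default (the function name and wiring below are hypothetical, not the actual `SQLConf` code):

```python
import os

def legacy_create_hive_table_enabled() -> bool:
    # Hypothetical sketch: treat the env variable as an opt-in switch for the
    # legacy CREATE TABLE behavior when no explicit SQL conf value is given.
    return os.environ.get("SPARK_SQL_LEGACY_CREATE_HIVE_TABLE", "false").lower() == "true"
```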


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48115][INFRA] Remove `Python 3.11` from `build_python.yml`

2024-05-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new aa00b00c18e6 [SPARK-48115][INFRA] Remove `Python 3.11` from 
`build_python.yml`
aa00b00c18e6 is described below

commit aa00b00c18e6a714dc02e9444576e063c8e49db7
Author: Dongjoon Hyun 
AuthorDate: Fri May 3 14:10:39 2024 -0700

[SPARK-48115][INFRA] Remove `Python 3.11` from `build_python.yml`

### What changes were proposed in this pull request?

This PR aims to remove `Python 3.11` from the `build_python.yml` daily CI
because `Python 3.11` is already the main Python version in the PR and commit builds.
- https://github.com/apache/spark/actions/workflows/build_python.yml

### Why are the changes needed?

To reduce GitHub Action usage to meet ASF INFRA policy.
- https://infra.apache.org/github-actions-policy.html

> The average number of minutes a project uses in any consecutive 
five-day period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
minutes, or 3,600 hours).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46366 from dongjoon-hyun/SPARK-48115.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_python.yml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/build_python.yml 
b/.github/workflows/build_python.yml
index 2249dd230265..761fd20f0c79 100644
--- a/.github/workflows/build_python.yml
+++ b/.github/workflows/build_python.yml
@@ -17,7 +17,7 @@
 # under the License.
 #
 
-name: "Build / Python-only (master, PyPy 3.9/Python 3.10/Python 3.11/Python 
3.12)"
+name: "Build / Python-only (master, PyPy 3.9/Python 3.10/Python 3.12)"
 
 on:
   schedule:
@@ -28,7 +28,7 @@ jobs:
 strategy:
   fail-fast: false
   matrix:
-pyversion: ["pypy3", "python3.10", "python3.11", "python3.12"]
+pyversion: ["pypy3", "python3.10", "python3.12"]
 permissions:
   packages: write
 name: Run


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48088][PYTHON][CONNECT][TESTS] Prepare backward compatibility test 4.0 <> above

2024-05-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new cd789acb5e51 [SPARK-48088][PYTHON][CONNECT][TESTS] Prepare backward 
compatibility test 4.0 <> above
cd789acb5e51 is described below

commit cd789acb5e51172e43052b59c4b610e64f380a16
Author: Hyukjin Kwon 
AuthorDate: Fri May 3 01:08:05 2024 -0700

[SPARK-48088][PYTHON][CONNECT][TESTS] Prepare backward compatibility test 
4.0 <> above

### What changes were proposed in this pull request?

This PR forward ports https://github.com/apache/spark/pull/46334 to reduce 
conflicts.

### Why are the changes needed?

To reduce conflicts against branch-3.5 and to prepare the 4.0 <> above compatibility test.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

CI in this PR should verify them.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46358 from HyukjinKwon/SPARK-48088-40.

Authored-by: Hyukjin Kwon 
    Signed-off-by: Dongjoon Hyun 
---
 python/pyspark/util.py |  3 +++
 python/run-tests.py| 18 +++---
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/python/pyspark/util.py b/python/pyspark/util.py
index bf1cf5b59553..f0fa4a2413ce 100644
--- a/python/pyspark/util.py
+++ b/python/pyspark/util.py
@@ -747,6 +747,9 @@ def is_remote_only() -> bool:
     """
     global _is_remote_only
 
+    if "SPARK_SKIP_CONNECT_COMPAT_TESTS" in os.environ:
+        return True
+
     if _is_remote_only is not None:
         return _is_remote_only
     try:
diff --git a/python/run-tests.py b/python/run-tests.py
index ebdd4a9a2179..64ac48e210db 100755
--- a/python/run-tests.py
+++ b/python/run-tests.py
@@ -62,13 +62,15 @@ LOGGER = logging.getLogger()
 
 # Find out where the assembly jars are located.
 # TODO: revisit for Scala 2.13
-for scala in ["2.13"]:
-    build_dir = os.path.join(SPARK_HOME, "assembly", "target", "scala-" + scala)
-    if os.path.isdir(build_dir):
-        SPARK_DIST_CLASSPATH = os.path.join(build_dir, "jars", "*")
-        break
-else:
-    raise RuntimeError("Cannot find assembly build directory, please build Spark first.")
+SPARK_DIST_CLASSPATH = ""
+if "SPARK_SKIP_CONNECT_COMPAT_TESTS" not in os.environ:
+    for scala in ["2.13"]:
+        build_dir = os.path.join(SPARK_HOME, "assembly", "target", "scala-" + scala)
+        if os.path.isdir(build_dir):
+            SPARK_DIST_CLASSPATH = os.path.join(build_dir, "jars", "*")
+            break
+    else:
+        raise RuntimeError("Cannot find assembly build directory, please build Spark first.")
 
 
 def run_individual_python_test(target_dir, test_name, pyspark_python, keep_test_output):
@@ -100,6 +102,8 @@ def run_individual_python_test(target_dir, test_name, pyspark_python, keep_test_
 
     if "SPARK_CONNECT_TESTING_REMOTE" in os.environ:
         env.update({"SPARK_CONNECT_TESTING_REMOTE": os.environ["SPARK_CONNECT_TESTING_REMOTE"]})
+    if "SPARK_SKIP_CONNECT_COMPAT_TESTS" in os.environ:
+        env.update({"SPARK_SKIP_JVM_REQUIRED_TESTS": os.environ["SPARK_SKIP_CONNECT_COMPAT_TESTS"]})
 
     # Create a unique temp directory under 'target/' for each run. The TMPDIR variable is
     # recognized by the tempfile module to override the default system temp directory.


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48111][INFRA] Disable Docker integration test and TPC-DS in commit builder

2024-05-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2d346fbb9c5c [SPARK-48111][INFRA] Disable Docker integration test and 
TPC-DS in commit builder
2d346fbb9c5c is described below

commit 2d346fbb9c5c5e58f8fba076fc7f2348565bea91
Author: Hyukjin Kwon 
AuthorDate: Fri May 3 00:16:09 2024 -0700

[SPARK-48111][INFRA] Disable Docker integration test and TPC-DS in commit 
builder

### What changes were proposed in this pull request?

This PR proposes to disable Docker integration test and TPC-DS in commit 
builder

### Why are the changes needed?

This is already being tested in the daily scheduled build:
https://github.com/apache/spark/blob/master/.github/workflows/build_java21.yml#L48-L49

Both are pretty unlikely to break, in my experience.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

CI in this PR should verify them

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46361 from HyukjinKwon/SPARK-48111.

Authored-by: Hyukjin Kwon 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index f7e83854c1f7..0dc217570ba0 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -76,12 +76,10 @@ jobs:
   id: set-outputs
   run: |
 if [ -z "${{ inputs.jobs }}" ]; then
-  pyspark=true; sparkr=true; tpcds=true; docker=true;
+  pyspark=true; sparkr=true;
   pyspark_modules=`cd dev && python -c "import 
sparktestsupport.modules as m; print(','.join(m.name for m in m.all_modules if 
m.name.startswith('pyspark')))"`
   pyspark=`./dev/is-changed.py -m $pyspark_modules`
   sparkr=`./dev/is-changed.py -m sparkr`
-  tpcds=`./dev/is-changed.py -m sql`
-  docker=`./dev/is-changed.py -m docker-integration-tests`
   kubernetes=`./dev/is-changed.py -m kubernetes`
   # 'build' is always true for now.
   # It does not save significant time and most of PRs trigger the 
build.
@@ -90,8 +88,8 @@ jobs:
   \"build\": \"true\",
   \"pyspark\": \"$pyspark\",
   \"sparkr\": \"$sparkr\",
-  \"tpcds-1g\": \"$tpcds\",
-  \"docker-integration-tests\": \"$docker\",
+  \"tpcds-1g\": \"false\",
+  \"docker-integration-tests\": \"false\",
   \"lint\" : \"true\",
   \"k8s-integration-tests\" : \"$kubernetes\",
   \"buf\" : \"true\",


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48110][INFRA] Remove all Maven compilation build

2024-05-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new defda8663c05 [SPARK-48110][INFRA] Remove all Maven compilation build
defda8663c05 is described below

commit defda8663c05fdba122325b36c45ef8f2da6624e
Author: Hyukjin Kwon 
AuthorDate: Fri May 3 00:13:28 2024 -0700

[SPARK-48110][INFRA] Remove all Maven compilation build

### What changes were proposed in this pull request?

This PR proposes to reduce the concurrency of GitHub Actions jobs by removing all
Maven-only builds, because they are already covered by the daily build
(https://github.com/apache/spark/actions/workflows/build_maven_java21_macos14.yml)

### Why are the changes needed?

Same as https://github.com/apache/spark/pull/46347

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI in this PR

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46360 from HyukjinKwon/SPARK-48110.

Authored-by: Hyukjin Kwon 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml | 59 +---
 1 file changed, 1 insertion(+), 58 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 3bb37e74805f..f7e83854c1f7 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -83,7 +83,7 @@ jobs:
   tpcds=`./dev/is-changed.py -m sql`
   docker=`./dev/is-changed.py -m docker-integration-tests`
   kubernetes=`./dev/is-changed.py -m kubernetes`
-  # 'build' and 'maven-build' are always true for now.
+  # 'build' is always true for now.
   # It does not save significant time and most of PRs trigger the 
build.
   precondition="
 {
@@ -92,7 +92,6 @@ jobs:
   \"sparkr\": \"$sparkr\",
   \"tpcds-1g\": \"$tpcds\",
   \"docker-integration-tests\": \"$docker\",
-  \"maven-build\": \"true\",
   \"lint\" : \"true\",
   \"k8s-integration-tests\" : \"$kubernetes\",
   \"buf\" : \"true\",
@@ -789,62 +788,6 @@ jobs:
 path: site.tar.bz2
 retention-days: 1
 
-  maven-build:
-needs: precondition
-if: fromJson(needs.precondition.outputs.required).maven-build == 'true'
-name: Java ${{ matrix.java }} build with Maven (${{ matrix.os }})
-strategy:
-  fail-fast: false
-  matrix:
-include:
-  - java: 21
-os: macos-14 
-runs-on: ${{ matrix.os }}
-timeout-minutes: 180
-steps:
-- name: Checkout Spark repository
-  uses: actions/checkout@v4
-  with:
-fetch-depth: 0
-repository: apache/spark
-ref: ${{ inputs.branch }}
-- name: Sync the current branch with the latest in Apache Spark
-  if: github.repository != 'apache/spark'
-  run: |
-git fetch https://github.com/$GITHUB_REPOSITORY.git 
${GITHUB_REF#refs/heads/}
-git -c user.name='Apache Spark Test Account' -c 
user.email='sparktest...@gmail.com' merge --no-commit --progress --squash 
FETCH_HEAD
-git -c user.name='Apache Spark Test Account' -c 
user.email='sparktest...@gmail.com' commit -m "Merged commit" --allow-empty
-- name: Cache SBT and Maven
-  uses: actions/cache@v4
-  with:
-path: |
-  build/apache-maven-*
-  build/*.jar
-  ~/.sbt
-key: build-${{ hashFiles('**/pom.xml', 'project/build.properties', 
'build/mvn', 'build/sbt', 'build/sbt-launch-lib.bash', 
'build/spark-build-info') }}
-restore-keys: |
-  build-
-- name: Cache Maven local repository
-  uses: actions/cache@v4
-  with:
-path: ~/.m2/repository
-key: java${{ matrix.java }}-maven-${{ hashFiles('**/pom.xml') }}
-restore-keys: |
-  java${{ matrix.java }}-maven-
-- name: Install Java ${{ matrix.java }}
-  uses: actions/setup-java@v4
-  with:
-distribution: zulu
-java-version: ${{ matrix.java }}
-- name: Build with Maven
-  run: |
-export MAVEN_OPTS="-Xss64m -Xmx2g -XX:ReservedCodeCacheSize=1g 
-Dorg.slf4j.simpleLogger.defaultLogLevel=WARN"
-export MAVEN_CLI_OPTS="--no-transfer-progress"
-export JAVA_VERSION=${{ matrix.java }}
-# It uses Maven's 'install' intentionally, see

(spark) branch master updated: [SPARK-48107][PYTHON] Exclude tests from Python distribution

2024-05-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8d70f4ba5396 [SPARK-48107][PYTHON] Exclude tests from Python 
distribution
8d70f4ba5396 is described below

commit 8d70f4ba53962de540fb3dc5bdedd32754be974d
Author: Nicholas Chammas 
AuthorDate: Thu May 2 23:52:43 2024 -0700

[SPARK-48107][PYTHON] Exclude tests from Python distribution

### What changes were proposed in this pull request?

Change the Python manifest so that tests are excluded from the packages 
that are built for distribution.

### Why are the changes needed?

Tests were unintentionally included in the distributions as part of #44920. 
See [this 
comment](https://github.com/apache/spark/pull/44920/files#r1586979834).

### Does this PR introduce _any_ user-facing change?

No, since #44920 hasn't been released to any users yet.

### How was this patch tested?

I built Python packages and inspected `SOURCES.txt` to confirm that tests 
were excluded:

```sh
cd python
rm -rf pyspark.egg-info || echo "No existing egg info file, skipping 
deletion"
python3 packaging/classic/setup.py sdist
python3 packaging/connect/setup.py sdist
find dist -name '*.tar.gz' | xargs -I _ tar xf _ --directory=dist
cd ..
open python/dist
find python/dist -name SOURCES.txt | xargs code
```
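As a programmatic alternative to eyeballing `SOURCES.txt`, the built sdists could also be scanned directly for test modules (a sketch assuming the archives land in `python/dist/`):

```python
import glob
import tarfile

# Placeholder location; the actual sdist names depend on the version being built.
for sdist in glob.glob("python/dist/*.tar.gz"):
    with tarfile.open(sdist) as tar:
        leaked = [m.name for m in tar.getmembers() if "/tests/" in m.name]
        print(sdist, "->", "no tests bundled" if not leaked else leaked[:5])
```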

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46354 from nchammas/SPARK-48107-package-json.

Authored-by: Nicholas Chammas 
    Signed-off-by: Dongjoon Hyun 
---
 python/MANIFEST.in | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/MANIFEST.in b/python/MANIFEST.in
index 0374b3096d47..45c9dca8b474 100644
--- a/python/MANIFEST.in
+++ b/python/MANIFEST.in
@@ -16,7 +16,7 @@
 
 # Reference: https://setuptools.pypa.io/en/latest/userguide/miscellaneous.html
 
-graft pyspark
+recursive-include pyspark *.pyi py.typed *.json
 recursive-include deps/jars *.jar
 graft deps/bin
 recursive-include deps/sbin spark-config.sh spark-daemon.sh 
start-history-server.sh stop-history-server.sh


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48106][INFRA] Use `Python 3.11` in `pyspark` tests of `build_and_test.yml`

2024-05-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ddc1f6b2a466 [SPARK-48106][INFRA] Use `Python 3.11` in `pyspark` tests 
of `build_and_test.yml`
ddc1f6b2a466 is described below

commit ddc1f6b2a466892110ea0010c36f83847b9dc36e
Author: Dongjoon Hyun 
AuthorDate: Thu May 2 23:34:47 2024 -0700

[SPARK-48106][INFRA] Use `Python 3.11` in `pyspark` tests of 
`build_and_test.yml`

### What changes were proposed in this pull request?

This PR aims to use Python `3.11` instead of `3.9` in `pyspark` tests of 
`build_and_test.yml`.

### Why are the changes needed?

To reduce GitHub Action usage to meet ASF INFRA policy.
- https://infra.apache.org/github-actions-policy.html

> The average number of minutes a project uses in any consecutive 
five-day period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
minutes, or 3,600 hours).

`Python 3.11` is faster in general.
- https://docs.python.org/3/whatsnew/3.11.html#summary-release-highlights

> Python 3.11 is between 10-60% faster than Python 3.10. On average, we 
measured a 1.25x speedup on the standard benchmark suite.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46353 from dongjoon-hyun/SPARK-48106.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 56516c95dcb8..3bb37e74805f 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -366,7 +366,7 @@ jobs:
 pyspark-pandas-connect-part3
 env:
   MODULES_TO_TEST: ${{ matrix.modules }}
-  PYTHON_TO_TEST: 'python3.9'
+  PYTHON_TO_TEST: 'python3.11'
   HADOOP_PROFILE: ${{ inputs.hadoop }}
   HIVE_PROFILE: hive2.3
   GITHUB_PREV_SHA: ${{ github.event.before }}


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (63837020ed29 -> f044748efeac)

2024-05-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 63837020ed29 [SPARK-48109][INFRA] Enable `k8s-integration-tests` only 
for `kubernetes` module change
 add f044748efeac [SPARK-48103][K8S] Promote `KubernetesDriverBuilder` to 
`DeveloperApi`

No new revisions were added by this update.

Summary of changes:
 .../spark/deploy/k8s/submit/KubernetesDriverBuilder.scala| 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48109][INFRA] Enable `k8s-integration-tests` only for `kubernetes` module change

2024-05-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 63837020ed29 [SPARK-48109][INFRA] Enable `k8s-integration-tests` only 
for `kubernetes` module change
63837020ed29 is described below

commit 63837020ed29c9e6003f24117ad21f8b97f40f0f
Author: Dongjoon Hyun 
AuthorDate: Thu May 2 23:21:59 2024 -0700

[SPARK-48109][INFRA] Enable `k8s-integration-tests` only for `kubernetes` 
module change

### What changes were proposed in this pull request?

This PR aims to enable `k8s-integration-tests` only for `kubernetes` module 
change.

Although there is a chance of missing a `core` module change, the daily CI
test coverage will reveal that.

### Why are the changes needed?

To reduce GitHub Action usage to meet ASF INFRA policy.
- https://infra.apache.org/github-actions-policy.html

> The average number of minutes a project uses in any consecutive 
five-day period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
minutes, or 3,600 hours).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46356 from dongjoon-hyun/SPARK-48109.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml | 3 ++-
 .github/workflows/build_branch34.yml | 1 +
 .github/workflows/build_branch35.yml | 1 +
 .github/workflows/build_java21.yml   | 3 ++-
 4 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 13a05e824f6a..56516c95dcb8 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -82,6 +82,7 @@ jobs:
   sparkr=`./dev/is-changed.py -m sparkr`
   tpcds=`./dev/is-changed.py -m sql`
   docker=`./dev/is-changed.py -m docker-integration-tests`
+  kubernetes=`./dev/is-changed.py -m kubernetes`
   # 'build' and 'maven-build' are always true for now.
   # It does not save significant time and most of PRs trigger the 
build.
   precondition="
@@ -93,7 +94,7 @@ jobs:
   \"docker-integration-tests\": \"$docker\",
   \"maven-build\": \"true\",
   \"lint\" : \"true\",
-  \"k8s-integration-tests\" : \"true\",
+  \"k8s-integration-tests\" : \"$kubernetes\",
   \"buf\" : \"true\",
   \"ui\" : \"true\",
 }"
diff --git a/.github/workflows/build_branch34.yml 
b/.github/workflows/build_branch34.yml
index deb43d82c979..68887970d4d8 100644
--- a/.github/workflows/build_branch34.yml
+++ b/.github/workflows/build_branch34.yml
@@ -47,5 +47,6 @@ jobs:
   "sparkr": "true",
   "tpcds-1g": "true",
   "docker-integration-tests": "true",
+  "k8s-integration-tests": "true",
   "lint" : "true"
 }
diff --git a/.github/workflows/build_branch35.yml 
b/.github/workflows/build_branch35.yml
index 9e6fe13c020e..55616c2f1f01 100644
--- a/.github/workflows/build_branch35.yml
+++ b/.github/workflows/build_branch35.yml
@@ -47,5 +47,6 @@ jobs:
   "sparkr": "true",
   "tpcds-1g": "true",
   "docker-integration-tests": "true",
+  "k8s-integration-tests": "true",
   "lint" : "true"
 }
diff --git a/.github/workflows/build_java21.yml 
b/.github/workflows/build_java21.yml
index b1ef5a321835..bfeedd4174cf 100644
--- a/.github/workflows/build_java21.yml
+++ b/.github/workflows/build_java21.yml
@@ -46,5 +46,6 @@ jobs:
   "pyspark": "true",
   "sparkr": "true",
   "tpcds-1g": "true",
-  "docker-integration-tests": "true"
+  "docker-integration-tests": "true",
+  "k8s-integration-tests": "true"
 }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48108][INFRA] Skip `tpcds-1g` and `docker-integration-tests` tests from `RocksDB UI-Backend` job

2024-05-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 111df27d21ee [SPARK-48108][INFRA] Skip `tpcds-1g` and 
`docker-integration-tests` tests from `RocksDB UI-Backend` job
111df27d21ee is described below

commit 111df27d21ee4b9353d053628d76ae26c7f8f8f0
Author: Dongjoon Hyun 
AuthorDate: Thu May 2 21:51:34 2024 -0700

[SPARK-48108][INFRA] Skip `tpcds-1g` and `docker-integration-tests` tests 
from `RocksDB UI-Backend` job

### What changes were proposed in this pull request?

This PR aims to skip `tpcds-1g` and `docker-integration-tests` tests from 
`RocksDB UI-Backend` job, `build_rockdb_as_ui_backend.yml`.

### Why are the changes needed?

To reduce GitHub Action usage to meet ASF INFRA policy.
- https://infra.apache.org/github-actions-policy.html

> The average number of minutes a project uses in any consecutive 
five-day period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
minutes, or 3,600 hours).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review because this is a daily CI update.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46355 from dongjoon-hyun/SPARK-48108.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_rockdb_as_ui_backend.yml | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/.github/workflows/build_rockdb_as_ui_backend.yml 
b/.github/workflows/build_rockdb_as_ui_backend.yml
index e11ec85b8b17..a1cc34f7b54f 100644
--- a/.github/workflows/build_rockdb_as_ui_backend.yml
+++ b/.github/workflows/build_rockdb_as_ui_backend.yml
@@ -42,7 +42,5 @@ jobs:
 {
   "build": "true",
   "pyspark": "true",
-  "sparkr": "true",
-  "tpcds-1g": "true",
-  "docker-integration-tests": "true"
+  "sparkr": "true"
 }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48104][INFRA] Run `publish_snapshot.yml` once per day

2024-05-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 7b472e30db99 [SPARK-48104][INFRA] Run `publish_snapshot.yml` once per 
day
7b472e30db99 is described below

commit 7b472e30db99fe935b22a748d3f2adbce474ea37
Author: Dongjoon Hyun 
AuthorDate: Thu May 2 20:15:03 2024 -0700

[SPARK-48104][INFRA] Run `publish_snapshot.yml` once per day

### What changes were proposed in this pull request?

This PR aims to reduce `publish_snapshot.yml` frequency from twice per day 
to once per day.

Technically, this is a revert of
- #45686

### Why are the changes needed?

To reduce GitHub Action usage to meet ASF INFRA policy.
- https://infra.apache.org/github-actions-policy.html

> The average number of minutes a project uses in any consecutive 
five-day period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
minutes, or 3,600 hours).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46352 from dongjoon-hyun/SPARK-48104.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/publish_snapshot.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.github/workflows/publish_snapshot.yml 
b/.github/workflows/publish_snapshot.yml
index d09babd37240..006ccf239e6f 100644
--- a/.github/workflows/publish_snapshot.yml
+++ b/.github/workflows/publish_snapshot.yml
@@ -21,7 +21,7 @@ name: Publish Snapshot
 
 on:
   schedule:
-  - cron: '0 0,12 * * *'
+  - cron: '0 0 * * *'
   workflow_dispatch:
 inputs:
   branch:


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-47671][CORE] Enable structured logging in log4j2.properties.template and update docs

2024-05-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c6696cdcd611 [SPARK-47671][CORE] Enable structured logging in 
log4j2.properties.template and update docs
c6696cdcd611 is described below

commit c6696cdcd611a682ebf5b7a183e2970ecea3b58c
Author: Gengliang Wang 
AuthorDate: Thu May 2 19:45:48 2024 -0700

[SPARK-47671][CORE] Enable structured logging in log4j2.properties.template 
and update docs

### What changes were proposed in this pull request?

- Rename the current log4j2.properties.template as 
log4j2.properties.pattern-layout-template
- Enable structured logging in log4j2.properties.template
- Update `configuration.md` on how to configure logging

### Why are the changes needed?

Provide a structured logging template and document how to configure logging in Spark 4.0.0.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manual test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46349 from gengliangwang/logTemplate.

Authored-by: Gengliang Wang 
Signed-off-by: Dongjoon Hyun 
---
 ...template => log4j2.properties.pattern-layout-template} |  0
 conf/log4j2.properties.template   | 10 ++
 docs/configuration.md | 15 +--
 3 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/conf/log4j2.properties.template 
b/conf/log4j2.properties.pattern-layout-template
similarity index 100%
copy from conf/log4j2.properties.template
copy to conf/log4j2.properties.pattern-layout-template
diff --git a/conf/log4j2.properties.template b/conf/log4j2.properties.template
index ab96e03baed2..876724531444 100644
--- a/conf/log4j2.properties.template
+++ b/conf/log4j2.properties.template
@@ -19,17 +19,11 @@
 rootLogger.level = info
 rootLogger.appenderRef.stdout.ref = console
 
-# In the pattern layout configuration below, we specify an explicit `%ex` 
conversion
-# pattern for logging Throwables. If this was omitted, then (by default) Log4J 
would
-# implicitly add an `%xEx` conversion pattern which logs stacktraces with 
additional
-# class packaging information. That extra information can sometimes add a 
substantial
-# performance overhead, so we disable it in our default logging config.
-# For more information, see SPARK-39361.
 appender.console.type = Console
 appender.console.name = console
 appender.console.target = SYSTEM_ERR
-appender.console.layout.type = PatternLayout
-appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n%ex
+appender.console.layout.type = JsonTemplateLayout
+appender.console.layout.eventTemplateUri = 
classpath:org/apache/spark/SparkLayout.json
 
 # Set the default spark-shell/spark-sql log level to WARN. When running the
 # spark-shell/spark-sql, the log level for these classes is used to overwrite
diff --git a/docs/configuration.md b/docs/configuration.md
index 2e612ffd9ab9..a3b4e731f057 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -3670,14 +3670,17 @@ Note: When running Spark on YARN in `cluster` mode, 
environment variables need t
 # Configuring Logging
 
 Spark uses [log4j](http://logging.apache.org/log4j/) for logging. You can 
configure it by adding a
-`log4j2.properties` file in the `conf` directory. One way to start is to copy 
the existing
-`log4j2.properties.template` located there.
+`log4j2.properties` file in the `conf` directory. One way to start is to copy 
the existing templates `log4j2.properties.template` or 
`log4j2.properties.pattern-layout-template` located there.
 
-By default, Spark adds 1 record to the MDC (Mapped Diagnostic Context): 
`mdc.taskName`, which shows something
-like `task 1.0 in stage 0.0`. You can add `%X{mdc.taskName}` to your 
patternLayout in
-order to print it in the logs.
+## Structured Logging
+Starting from version 4.0.0, Spark has adopted the [JSON Template 
Layout](https://logging.apache.org/log4j/2.x/manual/json-template-layout.html) 
for logging, which outputs logs in JSON format. This format facilitates 
querying logs using Spark SQL with the JSON data source. Additionally, the logs 
include all Mapped Diagnostic Context (MDC) information for search and 
debugging purposes.
+
+To implement structured logging, start with the `log4j2.properties.template` 
file.
+
+## Plain Text Logging
+If you prefer plain text logging, you can use the 
`log4j2.properties.pattern-layout-template` file as a starting point. This is 
the default configuration used by Spark before the 4.0.0 release. This 
configuration uses the 
[PatternLayout](https://logging.apache.org/log4j/2.x/manual/layouts.html#PatternLayout)
 to log all the logs in plain text. 
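Since the new default layout emits one JSON object per log event, the documentation's point about querying logs with Spark SQL can be sketched as follows (the log path and field names such as `level`, `ts`, and `msg` are assumptions about the JSON template, not verified against `SparkLayout.json`):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Path and field names are illustrative assumptions.
logs = spark.read.json("/tmp/spark-logs/*.json")
logs.printSchema()
logs.where("level = 'ERROR'").select("ts", "msg").show(truncate=False)
```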

(spark) branch master updated (7f6a1399a56b -> 8d9e7c9c6623)

2024-05-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 7f6a1399a56b [SPARK-48098][INFRA] Enable `NOLINT_ON_COMPILE` for all 
except `lint` job
 add 8d9e7c9c6623 [SPARK-48099][INFRA] Run `maven-build` test only on `Java 
21 on MacOS14 (Apple Silicon)`

No new revisions were added by this update.

Summary of changes:
 .github/workflows/build_and_test.yml | 4 
 1 file changed, 4 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48098][INFRA] Enable `NOLINT_ON_COMPILE` for all except `lint` job

2024-05-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 7f6a1399a56b [SPARK-48098][INFRA] Enable `NOLINT_ON_COMPILE` for all 
except `lint` job
7f6a1399a56b is described below

commit 7f6a1399a56b07fa253a85dac757fdd788285274
Author: Dongjoon Hyun 
AuthorDate: Thu May 2 19:26:13 2024 -0700

[SPARK-48098][INFRA] Enable `NOLINT_ON_COMPILE` for all except `lint` job

### What changes were proposed in this pull request?

This PR aims to enable `NOLINT_ON_COMPILE` for all except `lint` job.

### Why are the changes needed?

This will reduce redundant CPU cycles and GitHub Actions usage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46346 from dongjoon-hyun/SPARK-48098.

Lead-authored-by: Dongjoon Hyun 
Co-authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml | 2 ++
 project/SparkBuild.scala | 4 +++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 92fda7adeb33..3f5a8087885e 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -193,6 +193,7 @@ jobs:
   HIVE_PROFILE: ${{ matrix.hive }}
   GITHUB_PREV_SHA: ${{ github.event.before }}
   SPARK_LOCAL_IP: localhost
+  NOLINT_ON_COMPILE: true
   SKIP_UNIDOC: true
   SKIP_MIMA: true
   SKIP_PACKAGING: true
@@ -606,6 +607,7 @@ jobs:
 env:
   LC_ALL: C.UTF-8
   LANG: C.UTF-8
+  NOLINT_ON_COMPILE: false
   PYSPARK_DRIVER_PYTHON: python3.9
   PYSPARK_PYTHON: python3.9
   GITHUB_PREV_SHA: ${{ github.event.before }}
diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala
index 9d2ee6077d11..5bb7745d77bf 100644
--- a/project/SparkBuild.scala
+++ b/project/SparkBuild.scala
@@ -255,9 +255,11 @@ object SparkBuild extends PomBuild {
 }
   )
 
+  val noLintOnCompile = sys.env.contains("NOLINT_ON_COMPILE") &&
+  !sys.env.get("NOLINT_ON_COMPILE").contains("false")
   lazy val sharedSettings = sparkGenjavadocSettings ++
 compilerWarningSettings ++
-  (if (sys.env.contains("NOLINT_ON_COMPILE")) Nil else enableScalaStyle) 
++ Seq(
+  (if (noLintOnCompile) Nil else enableScalaStyle) ++ Seq(
 (Compile / exportJars) := true,
 (Test / exportJars) := false,
 javaHome := sys.env.get("JAVA_HOME")
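The subtle part of the new `noLintOnCompile` condition is that the variable must be both set and not the literal string `"false"`, which is why the lint job above can explicitly set `NOLINT_ON_COMPILE: false` and keep ScalaStyle enabled. A small Python restatement of the same predicate (illustrative only):

```python
import os

def no_lint_on_compile(env=os.environ) -> bool:
    # Mirrors the Scala condition: the variable is set AND is not exactly "false".
    value = env.get("NOLINT_ON_COMPILE")
    return value is not None and value != "false"

assert no_lint_on_compile({"NOLINT_ON_COMPILE": "true"}) is True
assert no_lint_on_compile({"NOLINT_ON_COMPILE": "false"}) is False
assert no_lint_on_compile({}) is False
```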


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48097][INFRA] Limit GHA job execution time to up to 3 hours in `build_and_test.yml`

2024-05-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 1321dd604480 [SPARK-48097][INFRA] Limit GHA job execution time to up 
to 3 hours in `build_and_test.yml`
1321dd604480 is described below

commit 1321dd6044809dbbdd8c1887b8345b0f8d76797d
Author: Dongjoon Hyun 
AuthorDate: Thu May 2 15:10:33 2024 -0700

[SPARK-48097][INFRA] Limit GHA job execution time to up to 3 hours in 
`build_and_test.yml`

### What changes were proposed in this pull request?

This PR aims to limit GHA job execution time to at most 3 hours in
`build_and_test.yml` in order to avoid idle hang time.
The new limit is applied to all jobs except three (`precondition`, `infra-image`,
and `breaking-changes-buf`) that have not hung before.

### Why are the changes needed?

Since SPARK-45010, Apache Spark has used 5 hours.
- #42727

This is already shorter than GitHub Actions' default value of 6 hours.

- 
https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idtimeout-minutes
  > The maximum number of minutes to let a job run before GitHub 
automatically cancels it. Default: 360

This PR reduces it to `3 hours` to follow the new ASF INFRA policy, which has been
applied since April 20, 2024.
- https://infra.apache.org/github-actions-policy.html

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46344 from dongjoon-hyun/SPARK-48097.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 7e59f7b792b4..92fda7adeb33 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -123,7 +123,7 @@ jobs:
 needs: precondition
 if: fromJson(needs.precondition.outputs.required).build == 'true'
 runs-on: ubuntu-latest
-timeout-minutes: 300
+timeout-minutes: 180
 strategy:
   fail-fast: false
   matrix:
@@ -333,7 +333,7 @@ jobs:
 if: (!cancelled()) && 
fromJson(needs.precondition.outputs.required).pyspark == 'true'
 name: "Build modules: ${{ matrix.modules }}"
 runs-on: ubuntu-latest
-timeout-minutes: 300
+timeout-minutes: 180
 container:
   image: ${{ needs.precondition.outputs.image_url }}
 strategy:
@@ -480,7 +480,7 @@ jobs:
 if: (!cancelled()) && fromJson(needs.precondition.outputs.required).sparkr 
== 'true'
 name: "Build modules: sparkr"
 runs-on: ubuntu-latest
-timeout-minutes: 300
+timeout-minutes: 180
 container:
   image: ${{ needs.precondition.outputs.image_url }}
 env:
@@ -602,7 +602,7 @@ jobs:
 if: (!cancelled()) && fromJson(needs.precondition.outputs.required).lint 
== 'true'
 name: Linters, licenses, dependencies and documentation generation
 runs-on: ubuntu-latest
-timeout-minutes: 300
+timeout-minutes: 180
 env:
   LC_ALL: C.UTF-8
   LANG: C.UTF-8
@@ -801,7 +801,7 @@ jobs:
   - java: 21
 os: macos-14 
 runs-on: ${{ matrix.os }}
-timeout-minutes: 300
+timeout-minutes: 180
 steps:
 - name: Checkout Spark repository
   uses: actions/checkout@v4
@@ -853,7 +853,7 @@ jobs:
 name: Run TPC-DS queries with SF=1
 # Pin to 'Ubuntu 20.04' due to 'databricks/tpcds-kit' compilation
 runs-on: ubuntu-20.04
-timeout-minutes: 300
+timeout-minutes: 180
 env:
   SPARK_LOCAL_IP: localhost
 steps:
@@ -954,7 +954,7 @@ jobs:
 if: fromJson(needs.precondition.outputs.required).docker-integration-tests 
== 'true'
 name: Run Docker integration tests
 runs-on: ubuntu-latest
-timeout-minutes: 300
+timeout-minutes: 180
 env:
   HADOOP_PROFILE: ${{ inputs.hadoop }}
   HIVE_PROFILE: hive2.3
@@ -1022,7 +1022,7 @@ jobs:
 if: fromJson(needs.precondition.outputs.required).k8s-integration-tests == 
'true'
 name: Run Spark on Kubernetes Integration test
 runs-on: ubuntu-latest
-timeout-minutes: 300
+timeout-minutes: 180
 steps:
   - name: Checkout Spark repository
 uses: actions/checkout@v4
@@ -1094,7 +1094,7 @@ jobs:
 if: fromJson(needs.precondition.outputs.required).ui == 'true'
 name: Run Spark UI tests
 runs-on: ubuntu-latest
-timeout-minutes: 300
+timeout-minutes: 180
 

(spark) branch master updated: [SPARK-48096][INFRA] Run `build_maven_java21_macos14.yml` every two days

2024-05-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 53d4cdb4eefa [SPARK-48096][INFRA] Run `build_maven_java21_macos14.yml` 
every two days
53d4cdb4eefa is described below

commit 53d4cdb4eefa66161315f04d58d2742f52bfbcce
Author: Dongjoon Hyun 
AuthorDate: Thu May 2 14:12:16 2024 -0700

[SPARK-48096][INFRA] Run `build_maven_java21_macos14.yml` every two days

### What changes were proposed in this pull request?

This PR aims to reduce `build_maven_java21_macos14.yml` frequency from once 
per day to every two days.

### Why are the changes needed?

To reduce GitHub Action usage to meet ASF INFRA policy.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46343 from dongjoon-hyun/SPARK-48096.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_maven_java21_macos14.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.github/workflows/build_maven_java21_macos14.yml 
b/.github/workflows/build_maven_java21_macos14.yml
index 70b47fcecb26..fb5e609f4eae 100644
--- a/.github/workflows/build_maven_java21_macos14.yml
+++ b/.github/workflows/build_maven_java21_macos14.yml
@@ -21,7 +21,7 @@ name: "Build / Maven (master, Scala 2.13, Hadoop 3, JDK 21, 
macos-14)"
 
 on:
   schedule:
-- cron: '0 20 * * *'
+- cron: '0 20 */2 * *'
 
 jobs:
   run-build:





(spark) branch master updated: [SPARK-48095][INFRA] Run `build_non_ansi.yml` once per day

2024-05-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 48df28f6b311 [SPARK-48095][INFRA] Run `build_non_ansi.yml` once per day
48df28f6b311 is described below

commit 48df28f6b3112b949c0057f0c4ecb1d334f3662c
Author: Dongjoon Hyun 
AuthorDate: Thu May 2 14:00:36 2024 -0700

[SPARK-48095][INFRA] Run `build_non_ansi.yml` once per day

### What changes were proposed in this pull request?

This PR aims to reduce `build_non_ansi.yml` frequency from twice per day to 
once per day.

### Why are the changes needed?

To reduce GitHub Action usage to meet ASF INFRA policy.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46342 from dongjoon-hyun/SPARK-48095.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_non_ansi.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.github/workflows/build_non_ansi.yml 
b/.github/workflows/build_non_ansi.yml
index cf97cdd4bfa1..ff3fda4625cc 100644
--- a/.github/workflows/build_non_ansi.yml
+++ b/.github/workflows/build_non_ansi.yml
@@ -21,7 +21,7 @@ name: "Build / NON-ANSI (master, Hadoop 3, JDK 17, Scala 
2.13)"
 
 on:
   schedule:
-- cron: '0 1,13 * * *'
+- cron: '0 1 * * *'
 
 jobs:
   run-build:





(spark) branch branch-3.4 updated: [SPARK-48081][SQL][3.4] Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type

2024-05-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 681a1de72bdf [SPARK-48081][SQL][3.4] Fix ClassCastException in 
NTile.checkInputDataTypes() when argument is non-foldable or of wrong type
681a1de72bdf is described below

commit 681a1de72bdf749e0a0782dde9bddfcbb3248d99
Author: Josh Rosen 
AuthorDate: Thu May 2 12:50:54 2024 -0700

[SPARK-48081][SQL][3.4] Fix ClassCastException in 
NTile.checkInputDataTypes() when argument is non-foldable or of wrong type

branch-3.4 pick of PR https://github.com/apache/spark/pull/46333, fixing a test issue caused by a difference in expected error message parameter formatting across branches; the original description follows below:

---

### What changes were proposed in this pull request?

While migrating the `NTile` expression's type check failures to the new error class framework, PR https://github.com/apache/spark/pull/38457 removed a pair of necessary `return` statements and thus caused certain branches' values to be discarded rather than returned.
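
For readers tracing the mechanics, here is a minimal, self-contained sketch of that Scala behaviour (the `classifyBuggy`/`classifyFixed` helpers are hypothetical, not Spark code): a value produced by a non-final expression in a method body is silently dropped unless it is explicitly `return`ed or is the last expression.

```
object ReturnDiscardExample extends App {
  // Buggy variant, mirroring the NTile issue: the value produced in the
  // `if` branch is discarded because it is neither `return`ed nor the last
  // expression of the method body.
  def classifyBuggy(x: Int): String = {
    if (x < 0) {
      "negative"      // discarded; the compiler even warns "a pure expression does nothing"
    }
    "non-negative"    // always the result, even when x < 0
  }

  // Fixed variant, analogous to re-adding the `return` statements.
  def classifyFixed(x: Int): String = {
    if (x < 0) {
      return "negative"
    }
    "non-negative"
  }

  assert(classifyBuggy(-1) == "non-negative")  // surprising, like the lost DataTypeMismatch
  assert(classifyFixed(-1) == "negative")
}
```

In `NTile.checkInputDataTypes()` the dropped value was the `DataTypeMismatch` result, so execution continued past the guard instead of reporting the mismatch.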

As a result, invalid usages like

```
select ntile(99.9) OVER (order by id) from range(10)
```

trigger internal errors like

```
 java.lang.ClassCastException: class org.apache.spark.sql.types.Decimal 
cannot be cast to class java.lang.Integer (org.apache.spark.sql.types.Decimal 
is in unnamed module of loader 'app'; java.lang.Integer is in module java.base 
of loader 'bootstrap')
  at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:99)
  at 
org.apache.spark.sql.catalyst.expressions.NTile.checkInputDataTypes(windowExpressions.scala:877)
```

instead of clear error framework errors like

```
org.apache.spark.sql.catalyst.ExtendedAnalysisException: 
[DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "ntile(99.9)" due to 
data type mismatch: The first parameter requires the "INT" type, however "99.9" 
has the type "DECIMAL(3,1)". SQLSTATE: 42K09; line 1 pos 7;
'Project [unresolvedalias(ntile(99.9) windowspecdefinition(id#0L ASC NULLS 
FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$(]
+- Range (0, 10, step=1, splits=None)

  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$7(CheckAnalysis.scala:315)
```

### Why are the changes needed?

Improve error messages.

### Does this PR introduce _any_ user-facing change?

Yes, it improves an error message.

### How was this patch tested?

Added a new test case to AnalysisErrorSuite.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46337 from JoshRosen/SPARK-48081-branch-3.4.

Authored-by: Josh Rosen 
Signed-off-by: Dongjoon Hyun 
---
 .../catalyst/expressions/windowExpressions.scala   |  4 +--
 .../sql/catalyst/analysis/AnalysisErrorSuite.scala | 34 ++
 2 files changed, 36 insertions(+), 2 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
index 2d11b581ee4c..adc32866f58d 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
@@ -848,7 +848,7 @@ case class NTile(buckets: Expression) extends RowNumberLike 
with SizeBasedWindow
   // for each partition.
   override def checkInputDataTypes(): TypeCheckResult = {
 if (!buckets.foldable) {
-  DataTypeMismatch(
+  return DataTypeMismatch(
 errorSubClass = "NON_FOLDABLE_INPUT",
 messageParameters = Map(
   "inputName" -> "buckets",
@@ -859,7 +859,7 @@ case class NTile(buckets: Expression) extends RowNumberLike 
with SizeBasedWindow
 }
 
 if (buckets.dataType != IntegerType) {
-  DataTypeMismatch(
+  return DataTypeMismatch(
 errorSubClass = "UNEXPECTED_INPUT_TYPE",
 messageParameters = Map(
   "paramIndex" -> "1",
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
index cbd6749807f7..ebc133719238 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisE

(spark) branch branch-3.5 updated: [SPARK-48081][SQL][3.5] Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type

2024-05-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 9cd312574e97 [SPARK-48081][SQL][3.5] Fix ClassCastException in 
NTile.checkInputDataTypes() when argument is non-foldable or of wrong type
9cd312574e97 is described below

commit 9cd312574e9706e9a1784c18ef1c1bccb957bcba
Author: Josh Rosen 
AuthorDate: Thu May 2 12:49:54 2024 -0700

[SPARK-48081][SQL][3.5] Fix ClassCastException in 
NTile.checkInputDataTypes() when argument is non-foldable or of wrong type

branch-3.5 pick of PR https://github.com/apache/spark/pull/46333, fixing a test issue caused by a difference in expected error message parameter formatting across branches; the original description follows below:

---

### What changes were proposed in this pull request?

While migrating the `NTile` expression's type check failures to the new error class framework, PR https://github.com/apache/spark/pull/38457 removed a pair of necessary `return` statements and thus caused certain branches' values to be discarded rather than returned.

As a result, invalid usages like

```
select ntile(99.9) OVER (order by id) from range(10)
```

trigger internal errors like

```
 java.lang.ClassCastException: class org.apache.spark.sql.types.Decimal 
cannot be cast to class java.lang.Integer (org.apache.spark.sql.types.Decimal 
is in unnamed module of loader 'app'; java.lang.Integer is in module java.base 
of loader 'bootstrap')
  at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:99)
  at 
org.apache.spark.sql.catalyst.expressions.NTile.checkInputDataTypes(windowExpressions.scala:877)
```

instead of clear error framework errors like

```
org.apache.spark.sql.catalyst.ExtendedAnalysisException: 
[DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "ntile(99.9)" due to 
data type mismatch: The first parameter requires the "INT" type, however "99.9" 
has the type "DECIMAL(3,1)". SQLSTATE: 42K09; line 1 pos 7;
'Project [unresolvedalias(ntile(99.9) windowspecdefinition(id#0L ASC NULLS 
FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$(]
+- Range (0, 10, step=1, splits=None)

  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$7(CheckAnalysis.scala:315)
```

### Why are the changes needed?

Improve error messages.

### Does this PR introduce _any_ user-facing change?

Yes, it improves an error message.

### How was this patch tested?

Added a new test case to AnalysisErrorSuite.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46336 from JoshRosen/SPARK-48081-branch-3.5.

Authored-by: Josh Rosen 
Signed-off-by: Dongjoon Hyun 
---
 .../catalyst/expressions/windowExpressions.scala   |  4 +--
 .../sql/catalyst/analysis/AnalysisErrorSuite.scala | 34 ++
 2 files changed, 36 insertions(+), 2 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
index 50c98c01645d..a4ce78d1bb6d 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
@@ -850,7 +850,7 @@ case class NTile(buckets: Expression) extends RowNumberLike 
with SizeBasedWindow
   // for each partition.
   override def checkInputDataTypes(): TypeCheckResult = {
 if (!buckets.foldable) {
-  DataTypeMismatch(
+  return DataTypeMismatch(
 errorSubClass = "NON_FOLDABLE_INPUT",
 messageParameters = Map(
   "inputName" -> "buckets",
@@ -861,7 +861,7 @@ case class NTile(buckets: Expression) extends RowNumberLike 
with SizeBasedWindow
 }
 
 if (buckets.dataType != IntegerType) {
-  DataTypeMismatch(
+  return DataTypeMismatch(
 errorSubClass = "UNEXPECTED_INPUT_TYPE",
 messageParameters = Map(
   "paramIndex" -> "1",
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
index e8dc9061199c..a7df53db936f 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisE

(spark) branch branch-3.4 updated: [SPARK-45988][SPARK-45989][PYTHON] Fix typehints to handle `list` GenericAlias in Python 3.11+

2024-05-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new a75c93be9c0a [SPARK-45988][SPARK-45989][PYTHON] Fix typehints to 
handle `list` GenericAlias in Python 3.11+
a75c93be9c0a is described below

commit a75c93be9c0a9c96de788db9fc74125590d2d26f
Author: Dongjoon Hyun 
AuthorDate: Mon Nov 20 08:30:42 2023 +0900

[SPARK-45988][SPARK-45989][PYTHON] Fix typehints to handle `list` 
GenericAlias in Python 3.11+

### What changes were proposed in this pull request?

This PR aims to fix `type hints` to handle `list` GenericAlias in Python 
3.11+ for Apache Spark 4.0.0 and 3.5.1.
- https://github.com/apache/spark/actions/workflows/build_python.yml

### Why are the changes needed?

PEP 646 changes `GenericAlias` instances into `Iterable` ones in Python 3.11.
- https://peps.python.org/pep-0646/

This behavior change introduces the following failure on Python 3.11.

- **Python 3.11.6**

```python
Python 3.11.6 (main, Nov  1 2023, 07:46:30) [Clang 14.0.0 
(clang-1400.0.28.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
23/11/18 16:34:09 WARN NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.0
  /_/

Using Python version 3.11.6 (main, Nov  1 2023 07:46:30)
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1700354049391).
SparkSession available as 'spark'.
>>> from pyspark import pandas as ps
>>> from typing import List
>>> ps.DataFrame[float, [int, List[int]]]
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/Users/dongjoon/APACHE/spark-release/spark-3.5.0-bin-hadoop3/python/pyspark/pandas/frame.py",
 line 13647, in __class_getitem__
return create_tuple_for_frame_type(params)
   ^^^
  File 
"/Users/dongjoon/APACHE/spark-release/spark-3.5.0-bin-hadoop3/python/pyspark/pandas/typedef/typehints.py",
 line 717, in create_tuple_for_frame_type
return Tuple[_to_type_holders(params)]
 
  File 
"/Users/dongjoon/APACHE/spark-release/spark-3.5.0-bin-hadoop3/python/pyspark/pandas/typedef/typehints.py",
 line 762, in _to_type_holders
data_types = _new_type_holders(data_types, NameTypeHolder)
 ^
  File 
"/Users/dongjoon/APACHE/spark-release/spark-3.5.0-bin-hadoop3/python/pyspark/pandas/typedef/typehints.py",
 line 828, in _new_type_holders
raise TypeError(
TypeError: Type hints should be specified as one of:
  - DataFrame[type, type, ...]
  - DataFrame[name: type, name: type, ...]
  - DataFrame[dtypes instance]
  - DataFrame[zip(names, types)]
  - DataFrame[index_type, [type, ...]]
  - DataFrame[(index_name, index_type), [(name, type), ...]]
  - DataFrame[dtype instance, dtypes instance]
  - DataFrame[(index_name, index_type), zip(names, types)]
  - DataFrame[[index_type, ...], [type, ...]]
  - DataFrame[[(index_name, index_type), ...], [(name, type), ...]]
  - DataFrame[dtypes instance, dtypes instance]
  - DataFrame[zip(index_names, index_types), zip(names, types)]
However, got (, typing.List[int]).
```

- **Python 3.10.13**

```python
Python 3.10.13 (main, Sep 29 2023, 16:03:45) [Clang 14.0.0 
(clang-1400.0.28.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
23/11/18 16:33:21 WARN NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.0
  /_/

Using Python version 3.10.13 (main, Sep 29 2023 16:03:45)
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master =

(spark) branch branch-3.4 updated: Revert "[SPARK-48081] Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type"

2024-05-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 4baf5ee19ba4 Revert "[SPARK-48081] Fix ClassCastException in 
NTile.checkInputDataTypes() when argument is non-foldable or of wrong type"
4baf5ee19ba4 is described below

commit 4baf5ee19ba410ea39d784380b8e5ae434cf8601
Author: Dongjoon Hyun 
AuthorDate: Thu May 2 08:42:48 2024 -0700

Revert "[SPARK-48081] Fix ClassCastException in NTile.checkInputDataTypes() 
when argument is non-foldable or of wrong type"

This reverts commit 32789ba3bbaa98dd14537d80204ed4aab8f77d9b.
---
 .../catalyst/expressions/windowExpressions.scala   |  4 +--
 .../sql/catalyst/analysis/AnalysisErrorSuite.scala | 34 --
 2 files changed, 2 insertions(+), 36 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
index adc32866f58d..2d11b581ee4c 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
@@ -848,7 +848,7 @@ case class NTile(buckets: Expression) extends RowNumberLike 
with SizeBasedWindow
   // for each partition.
   override def checkInputDataTypes(): TypeCheckResult = {
 if (!buckets.foldable) {
-  return DataTypeMismatch(
+  DataTypeMismatch(
 errorSubClass = "NON_FOLDABLE_INPUT",
 messageParameters = Map(
   "inputName" -> "buckets",
@@ -859,7 +859,7 @@ case class NTile(buckets: Expression) extends RowNumberLike 
with SizeBasedWindow
 }
 
 if (buckets.dataType != IntegerType) {
-  return DataTypeMismatch(
+  DataTypeMismatch(
 errorSubClass = "UNEXPECTED_INPUT_TYPE",
 messageParameters = Map(
   "paramIndex" -> "1",
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
index 5a2aa87d7a83..cbd6749807f7 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
@@ -316,40 +316,6 @@ class AnalysisErrorSuite extends AnalysisTest {
 listRelation.select(Explode($"list").as("a"), Explode($"list").as("b")),
 "only one generator" :: "explode" :: Nil)
 
-  errorClassTest(
-"the buckets of ntile window function is not foldable",
-testRelation2.select(
-  WindowExpression(
-NTile(Literal(99.9f)),
-WindowSpecDefinition(
-  UnresolvedAttribute("a") :: Nil,
-  SortOrder(UnresolvedAttribute("b"), Ascending) :: Nil,
-  UnspecifiedFrame)).as("window")),
-errorClass = "DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE",
-messageParameters = Map(
-  "sqlExpr" -> "\"ntile(99.9)\"",
-  "paramIndex" -> "first",
-  "inputSql" -> "\"99.9\"",
-  "inputType" -> "\"FLOAT\"",
-  "requiredType" -> "\"INT\""))
-
-
-  errorClassTest(
-"the buckets of ntile window function is not int literal",
-testRelation2.select(
-  WindowExpression(
-NTile(AttributeReference("b", IntegerType)()),
-WindowSpecDefinition(
-  UnresolvedAttribute("a") :: Nil,
-  SortOrder(UnresolvedAttribute("b"), Ascending) :: Nil,
-  UnspecifiedFrame)).as("window")),
-errorClass = "DATATYPE_MISMATCH.NON_FOLDABLE_INPUT",
-messageParameters = Map(
-  "sqlExpr" -> "\"ntile(b)\"",
-  "inputName" -> "`buckets`",
-  "inputExpr" -> "\"b\"",
-  "inputType" -> "\"INT\""))
-
   errorClassTest(
 "unresolved attributes",
 testRelation.select($"abcd"),





(spark) branch branch-3.5 updated: Revert "[SPARK-48081] Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type"

2024-05-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new d82403f98033 Revert "[SPARK-48081] Fix ClassCastException in 
NTile.checkInputDataTypes() when argument is non-foldable or of wrong type"
d82403f98033 is described below

commit d82403f980334cd40b1f24518c9c766827710c8c
Author: Dongjoon Hyun 
AuthorDate: Thu May 2 08:42:24 2024 -0700

Revert "[SPARK-48081] Fix ClassCastException in NTile.checkInputDataTypes() 
when argument is non-foldable or of wrong type"

This reverts commit 3d72063ccec6167bd3fe92e24a0ebd11bec8637b.
---
 .../catalyst/expressions/windowExpressions.scala   |  4 +--
 .../sql/catalyst/analysis/AnalysisErrorSuite.scala | 34 --
 2 files changed, 2 insertions(+), 36 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
index a4ce78d1bb6d..50c98c01645d 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
@@ -850,7 +850,7 @@ case class NTile(buckets: Expression) extends RowNumberLike 
with SizeBasedWindow
   // for each partition.
   override def checkInputDataTypes(): TypeCheckResult = {
 if (!buckets.foldable) {
-  return DataTypeMismatch(
+  DataTypeMismatch(
 errorSubClass = "NON_FOLDABLE_INPUT",
 messageParameters = Map(
   "inputName" -> "buckets",
@@ -861,7 +861,7 @@ case class NTile(buckets: Expression) extends RowNumberLike 
with SizeBasedWindow
 }
 
 if (buckets.dataType != IntegerType) {
-  return DataTypeMismatch(
+  DataTypeMismatch(
 errorSubClass = "UNEXPECTED_INPUT_TYPE",
 messageParameters = Map(
   "paramIndex" -> "1",
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
index 48d9266542f1..e8dc9061199c 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
@@ -344,40 +344,6 @@ class AnalysisErrorSuite extends AnalysisTest {
   "inputType" -> "\"BOOLEAN\"",
   "requiredType" -> "\"INT\""))
 
-  errorClassTest(
-"the buckets of ntile window function is not foldable",
-testRelation2.select(
-  WindowExpression(
-NTile(Literal(99.9f)),
-WindowSpecDefinition(
-  UnresolvedAttribute("a") :: Nil,
-  SortOrder(UnresolvedAttribute("b"), Ascending) :: Nil,
-  UnspecifiedFrame)).as("window")),
-errorClass = "DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE",
-messageParameters = Map(
-  "sqlExpr" -> "\"ntile(99.9)\"",
-  "paramIndex" -> "first",
-  "inputSql" -> "\"99.9\"",
-  "inputType" -> "\"FLOAT\"",
-  "requiredType" -> "\"INT\""))
-
-
-  errorClassTest(
-"the buckets of ntile window function is not int literal",
-testRelation2.select(
-  WindowExpression(
-NTile(AttributeReference("b", IntegerType)()),
-WindowSpecDefinition(
-  UnresolvedAttribute("a") :: Nil,
-  SortOrder(UnresolvedAttribute("b"), Ascending) :: Nil,
-  UnspecifiedFrame)).as("window")),
-errorClass = "DATATYPE_MISMATCH.NON_FOLDABLE_INPUT",
-messageParameters = Map(
-  "sqlExpr" -> "\"ntile(b)\"",
-  "inputName" -> "`buckets`",
-  "inputExpr" -> "\"b\"",
-  "inputType" -> "\"INT\""))
-
   errorClassTest(
 "unresolved attributes",
 testRelation.select($"abcd"),





(spark) branch master updated (b99a64b0fd1c -> bf1300835503)

2024-05-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from b99a64b0fd1c [SPARK-48081] Fix ClassCastException in 
NTile.checkInputDataTypes() when argument is non-foldable or of wrong type
 add bf1300835503 [SPARK-48079][BUILD] Upgrade maven-install/deploy-plugin 
to 3.1.2

No new revisions were added by this update.

Summary of changes:
 pom.xml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)





(spark) branch branch-3.4 updated: [SPARK-48081] Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type

2024-05-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 32789ba3bbaa [SPARK-48081] Fix ClassCastException in 
NTile.checkInputDataTypes() when argument is non-foldable or of wrong type
32789ba3bbaa is described below

commit 32789ba3bbaa98dd14537d80204ed4aab8f77d9b
Author: Josh Rosen 
AuthorDate: Thu May 2 07:22:44 2024 -0700

[SPARK-48081] Fix ClassCastException in NTile.checkInputDataTypes() when 
argument is non-foldable or of wrong type

### What changes were proposed in this pull request?

While migrating the `NTile` expression's type check failures to the new error class framework, PR https://github.com/apache/spark/pull/38457 removed a pair of necessary `return` statements and thus caused certain branches' values to be discarded rather than returned.

As a result, invalid usages like

```
select ntile(99.9) OVER (order by id) from range(10)
```

trigger internal errors like

```
 java.lang.ClassCastException: class org.apache.spark.sql.types.Decimal 
cannot be cast to class java.lang.Integer (org.apache.spark.sql.types.Decimal 
is in unnamed module of loader 'app'; java.lang.Integer is in module java.base 
of loader 'bootstrap')
  at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:99)
  at 
org.apache.spark.sql.catalyst.expressions.NTile.checkInputDataTypes(windowExpressions.scala:877)
```

instead of clear error framework errors like

```
org.apache.spark.sql.catalyst.ExtendedAnalysisException: 
[DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "ntile(99.9)" due to 
data type mismatch: The first parameter requires the "INT" type, however "99.9" 
has the type "DECIMAL(3,1)". SQLSTATE: 42K09; line 1 pos 7;
'Project [unresolvedalias(ntile(99.9) windowspecdefinition(id#0L ASC NULLS 
FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$(]
+- Range (0, 10, step=1, splits=None)

  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$7(CheckAnalysis.scala:315)
```

### Why are the changes needed?

Improve error messages.

### Does this PR introduce _any_ user-facing change?

Yes, it improves an error message.

### How was this patch tested?

Added a new test case to AnalysisErrorSuite.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46333 from JoshRosen/SPARK-48081.

Authored-by: Josh Rosen 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit b99a64b0fd1cf4b32dd2f17423775db87bae20a6)
Signed-off-by: Dongjoon Hyun 
---
 .../catalyst/expressions/windowExpressions.scala   |  4 +--
 .../sql/catalyst/analysis/AnalysisErrorSuite.scala | 34 ++
 2 files changed, 36 insertions(+), 2 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
index 2d11b581ee4c..adc32866f58d 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
@@ -848,7 +848,7 @@ case class NTile(buckets: Expression) extends RowNumberLike 
with SizeBasedWindow
   // for each partition.
   override def checkInputDataTypes(): TypeCheckResult = {
 if (!buckets.foldable) {
-  DataTypeMismatch(
+  return DataTypeMismatch(
 errorSubClass = "NON_FOLDABLE_INPUT",
 messageParameters = Map(
   "inputName" -> "buckets",
@@ -859,7 +859,7 @@ case class NTile(buckets: Expression) extends RowNumberLike 
with SizeBasedWindow
 }
 
 if (buckets.dataType != IntegerType) {
-  DataTypeMismatch(
+  return DataTypeMismatch(
 errorSubClass = "UNEXPECTED_INPUT_TYPE",
 messageParameters = Map(
   "paramIndex" -> "1",
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
index cbd6749807f7..5a2aa87d7a83 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
@@ -316,6 +316,40 @@ class AnalysisErrorSui

(spark) branch branch-3.5 updated: [SPARK-48081] Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type

2024-05-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 3d72063ccec6 [SPARK-48081] Fix ClassCastException in 
NTile.checkInputDataTypes() when argument is non-foldable or of wrong type
3d72063ccec6 is described below

commit 3d72063ccec6167bd3fe92e24a0ebd11bec8637b
Author: Josh Rosen 
AuthorDate: Thu May 2 07:22:44 2024 -0700

[SPARK-48081] Fix ClassCastException in NTile.checkInputDataTypes() when 
argument is non-foldable or of wrong type

### What changes were proposed in this pull request?

While migrating the `NTile` expression's type check failures to the new error class framework, PR https://github.com/apache/spark/pull/38457 removed a pair of necessary `return` statements and thus caused certain branches' values to be discarded rather than returned.

As a result, invalid usages like

```
select ntile(99.9) OVER (order by id) from range(10)
```

trigger internal errors like

```
 java.lang.ClassCastException: class org.apache.spark.sql.types.Decimal 
cannot be cast to class java.lang.Integer (org.apache.spark.sql.types.Decimal 
is in unnamed module of loader 'app'; java.lang.Integer is in module java.base 
of loader 'bootstrap')
  at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:99)
  at 
org.apache.spark.sql.catalyst.expressions.NTile.checkInputDataTypes(windowExpressions.scala:877)
```

instead of clear error framework errors like

```
org.apache.spark.sql.catalyst.ExtendedAnalysisException: 
[DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "ntile(99.9)" due to 
data type mismatch: The first parameter requires the "INT" type, however "99.9" 
has the type "DECIMAL(3,1)". SQLSTATE: 42K09; line 1 pos 7;
'Project [unresolvedalias(ntile(99.9) windowspecdefinition(id#0L ASC NULLS 
FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$(]
+- Range (0, 10, step=1, splits=None)

  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$7(CheckAnalysis.scala:315)
```

### Why are the changes needed?

Improve error messages.

### Does this PR introduce _any_ user-facing change?

Yes, it improves an error message.

### How was this patch tested?

Added a new test case to AnalysisErrorSuite.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46333 from JoshRosen/SPARK-48081.

Authored-by: Josh Rosen 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit b99a64b0fd1cf4b32dd2f17423775db87bae20a6)
Signed-off-by: Dongjoon Hyun 
---
 .../catalyst/expressions/windowExpressions.scala   |  4 +--
 .../sql/catalyst/analysis/AnalysisErrorSuite.scala | 34 ++
 2 files changed, 36 insertions(+), 2 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
index 50c98c01645d..a4ce78d1bb6d 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
@@ -850,7 +850,7 @@ case class NTile(buckets: Expression) extends RowNumberLike 
with SizeBasedWindow
   // for each partition.
   override def checkInputDataTypes(): TypeCheckResult = {
 if (!buckets.foldable) {
-  DataTypeMismatch(
+  return DataTypeMismatch(
 errorSubClass = "NON_FOLDABLE_INPUT",
 messageParameters = Map(
   "inputName" -> "buckets",
@@ -861,7 +861,7 @@ case class NTile(buckets: Expression) extends RowNumberLike 
with SizeBasedWindow
 }
 
 if (buckets.dataType != IntegerType) {
-  DataTypeMismatch(
+  return DataTypeMismatch(
 errorSubClass = "UNEXPECTED_INPUT_TYPE",
 messageParameters = Map(
   "paramIndex" -> "1",
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
index e8dc9061199c..48d9266542f1 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
@@ -344,6 +344,40 @@ class AnalysisErrorSui

(spark) branch master updated: [SPARK-48081] Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type

2024-05-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b99a64b0fd1c [SPARK-48081] Fix ClassCastException in 
NTile.checkInputDataTypes() when argument is non-foldable or of wrong type
b99a64b0fd1c is described below

commit b99a64b0fd1cf4b32dd2f17423775db87bae20a6
Author: Josh Rosen 
AuthorDate: Thu May 2 07:22:44 2024 -0700

[SPARK-48081] Fix ClassCastException in NTile.checkInputDataTypes() when 
argument is non-foldable or of wrong type

### What changes were proposed in this pull request?

While migrating the `NTile` expression's type check failures to the new error class framework, PR https://github.com/apache/spark/pull/38457 removed a pair of necessary `return` statements and thus caused certain branches' values to be discarded rather than returned.

As a result, invalid usages like

```
select ntile(99.9) OVER (order by id) from range(10)
```

trigger internal errors like

```
 java.lang.ClassCastException: class org.apache.spark.sql.types.Decimal 
cannot be cast to class java.lang.Integer (org.apache.spark.sql.types.Decimal 
is in unnamed module of loader 'app'; java.lang.Integer is in module java.base 
of loader 'bootstrap')
  at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:99)
  at 
org.apache.spark.sql.catalyst.expressions.NTile.checkInputDataTypes(windowExpressions.scala:877)
```

instead of clear error framework errors like

```
org.apache.spark.sql.catalyst.ExtendedAnalysisException: 
[DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "ntile(99.9)" due to 
data type mismatch: The first parameter requires the "INT" type, however "99.9" 
has the type "DECIMAL(3,1)". SQLSTATE: 42K09; line 1 pos 7;
'Project [unresolvedalias(ntile(99.9) windowspecdefinition(id#0L ASC NULLS 
FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$(]
+- Range (0, 10, step=1, splits=None)

  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$7(CheckAnalysis.scala:315)
```

### Why are the changes needed?

Improve error messages.

### Does this PR introduce _any_ user-facing change?

Yes, it improves an error message.

### How was this patch tested?

Added a new test case to AnalysisErrorSuite.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46333 from JoshRosen/SPARK-48081.

Authored-by: Josh Rosen 
Signed-off-by: Dongjoon Hyun 
---
 .../catalyst/expressions/windowExpressions.scala   |  4 +--
 .../sql/catalyst/analysis/AnalysisErrorSuite.scala | 34 ++
 2 files changed, 36 insertions(+), 2 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
index 00711332350c..5881c456f6e8 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
@@ -853,7 +853,7 @@ case class NTile(buckets: Expression) extends RowNumberLike 
with SizeBasedWindow
   // for each partition.
   override def checkInputDataTypes(): TypeCheckResult = {
 if (!buckets.foldable) {
-  DataTypeMismatch(
+  return DataTypeMismatch(
 errorSubClass = "NON_FOLDABLE_INPUT",
 messageParameters = Map(
   "inputName" -> toSQLId("buckets"),
@@ -864,7 +864,7 @@ case class NTile(buckets: Expression) extends RowNumberLike 
with SizeBasedWindow
 }
 
 if (buckets.dataType != IntegerType) {
-  DataTypeMismatch(
+  return DataTypeMismatch(
 errorSubClass = "UNEXPECTED_INPUT_TYPE",
 messageParameters = Map(
   "paramIndex" -> ordinalNumber(0),
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
index f12d22409691..19eb3a418543 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
@@ -360,6 +360,40 @@ class AnalysisErrorSuite extends AnalysisTest with 
DataTypeErrorsBase {
   "inputType" -> "\"BO

(spark) branch master updated: [SPARK-48072][SQL][TESTS] Improve SQLQuerySuite test output - use `===` instead of `sameElements` for Arrays

2024-05-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 5bbbc6c25bb7 [SPARK-48072][SQL][TESTS] Improve SQLQuerySuite test 
output - use `===` instead of `sameElements` for Arrays
5bbbc6c25bb7 is described below

commit 5bbbc6c25bb7cb7cf24330a384c67bc3e8b3a5e4
Author: Vladimir Golubev 
AuthorDate: Thu May 2 07:18:11 2024 -0700

[SPARK-48072][SQL][TESTS] Improve SQLQuerySuite test output - use `===` 
instead of `sameElements` for Arrays

### What changes were proposed in this pull request?
Improve the test output so that the actual value is printed alongside the expected one
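
As a rough illustration of the difference (a minimal sketch assuming ScalaTest with Scalactic's `===`, not the actual Spark suite; the class and value names are hypothetical): `sameElements` collapses to a bare `Boolean` before the assertion sees it, while `===` lets the framework report both arrays on failure.

```
import org.scalatest.funsuite.AnyFunSuite

class ArrayAssertionOutputExample extends AnyFunSuite {
  private val actual   = Array("max(t)", "a b", "{")
  private val expected = Array("max(t)", "a b", ".")

  test("=== reports both sides") {
    // On failure this prints both arrays, e.g.
    //   Array("max(t)", "a b", "{") did not equal Array("max(t)", "a b", ".")
    assert(actual === expected)
  }

  test("sameElements reports only a Boolean") {
    // On failure the message boils down to the Boolean result, without
    // printing the two arrays -- the behaviour this patch moves away from.
    assert(actual.sameElements(expected))
  }
}
```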

### Why are the changes needed?
To reduce confusion later

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
`testOnly org.apache.spark.sql.SQLQuerySuite -- -z SPARK-47939`
`testOnly org.apache.spark.sql.SQLQuerySuite -- -z SPARK-37965`
`testOnly org.apache.spark.sql.SQLQuerySuite -- -z SPARK-27442`

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46318 from 
vladimirg-db/vladimirg-db/improve-test-output-for-sql-query-suite.

Authored-by: Vladimir Golubev 
Signed-off-by: Dongjoon Hyun 
---
 .../src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala| 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
index 470f8ff4cd85..56c364e20846 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
@@ -4399,8 +4399,8 @@ class SQLQuerySuite extends QueryTest with 
SharedSparkSession with AdaptiveSpark
   checkAnswer(df,
 Row(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) ::
   Row(2, 4, 6, 8, 10, 12, 14, 16, 18, 20) :: Nil)
-  assert(df.schema.names.sameElements(
-Array("max(t)", "max(t", "=", "\n", ";", "a b", "{", ".", "a.b", "a")))
+  assert(df.schema.names ===
+Array("max(t)", "max(t", "=", "\n", ";", "a b", "{", ".", "a.b", "a"))
   checkAnswer(df.select("`max(t)`", "`a b`", "`{`", "`.`", "`a.b`"),
 Row(1, 6, 7, 8, 9) :: Row(2, 12, 14, 16, 18) :: Nil)
   checkAnswer(df.where("`a.b` > 10"),
@@ -4418,8 +4418,8 @@ class SQLQuerySuite extends QueryTest with 
SharedSparkSession with AdaptiveSpark
   checkAnswer(df,
 Row(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11) ::
   Row(2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22) :: Nil)
-  assert(df.schema.names.sameElements(
-Array("max(t)", "max(t", "=", "\n", ";", "a b", "{", ".", "a.b", "a", 
",")))
+  assert(df.schema.names ===
+Array("max(t)", "max(t", "=", "\n", ";", "a b", "{", ".", "a.b", "a", 
","))
   checkAnswer(df.select("`max(t)`", "`a b`", "`{`", "`.`", "`a.b`"),
 Row(1, 6, 7, 8, 9) :: Row(2, 12, 14, 16, 18) :: Nil)
   checkAnswer(df.where("`a.b` > 10"),
@@ -4754,7 +4754,7 @@ class SQLQuerySuite extends QueryTest with 
SharedSparkSession with AdaptiveSpark
   df.collect()
 .map(_.getString(0))
 .map(_.replaceAll("#[0-9]+", "#N"))
-.sameElements(Array(plan.stripMargin))
+=== Array(plan.stripMargin)
 )
 
 checkQueryPlan(





(spark) branch master updated: [SPARK-48080][K8S] Promote `*MainAppResource` and `NonJVMResource` to `DeveloperApi`

2024-05-01 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 66e2a350fc55 [SPARK-48080][K8S] Promote `*MainAppResource` and 
`NonJVMResource` to `DeveloperApi`
66e2a350fc55 is described below

commit 66e2a350fc55946315b52557a41d276d52124938
Author: Dongjoon Hyun 
AuthorDate: Wed May 1 20:41:48 2024 -0700

[SPARK-48080][K8S] Promote `*MainAppResource` and `NonJVMResource` to 
`DeveloperApi`

### What changes were proposed in this pull request?

This PR aims to promote `*MainAppResource` and `NonJVMResource` to 
`DeveloperApi`.

### Why are the changes needed?

Since `Apache Spark Kubernetes Operator` depends on these traits and classes, we had better maintain them as developer APIs officially from `Apache Spark 4.0.0`.
- https://github.com/apache/spark-kubernetes-operator/pull/10

Since there are no changes after `3.0.0`, these are defined as `Stable`.
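
A minimal sketch of the kind of external usage this promotion enables (the `describeApp` helper is hypothetical, not operator code), based on the case classes shown in the diff below:

```
import org.apache.spark.deploy.k8s.submit.{
  JavaMainAppResource, MainAppResource, PythonMainAppResource, RMainAppResource}

object MainAppResourceExample {
  // Hypothetical helper: an exhaustive match is possible because the trait is
  // sealed and, after this change, visible outside the Spark K8s module.
  def describeApp(resource: MainAppResource): String = resource match {
    case JavaMainAppResource(primary)   => s"JVM app, primary resource: ${primary.getOrElse("none")}"
    case PythonMainAppResource(primary) => s"Python app, primary resource: $primary"
    case RMainAppResource(primary)      => s"R app, primary resource: $primary"
  }

  def main(args: Array[String]): Unit =
    println(describeApp(JavaMainAppResource(Some("local:///opt/spark/examples/jars/spark-examples.jar"))))
}
```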

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46332 from dongjoon-hyun/SPARK-48080.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../spark/deploy/k8s/submit/MainAppResource.scala  | 33 ++
 1 file changed, 28 insertions(+), 5 deletions(-)

diff --git 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/MainAppResource.scala
 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/MainAppResource.scala
index a2e01fa2d9a0..398bb76376cf 100644
--- 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/MainAppResource.scala
+++ 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/MainAppResource.scala
@@ -16,15 +16,38 @@
  */
 package org.apache.spark.deploy.k8s.submit
 
-private[spark] sealed trait MainAppResource
+import org.apache.spark.annotation.{DeveloperApi, Since, Stable}
 
-private[spark] sealed trait NonJVMResource
+/**
+ * :: DeveloperApi ::
+ *
+ * All traits and classes in this file are used by K8s module and Spark K8s 
operator.
+ */
+
+@Stable
+@DeveloperApi
+@Since("2.3.0")
+sealed trait MainAppResource
+
+@Stable
+@DeveloperApi
+@Since("2.4.0")
+sealed trait NonJVMResource
 
-private[spark] case class JavaMainAppResource(primaryResource: Option[String])
+@Stable
+@DeveloperApi
+@Since("3.0.0")
+case class JavaMainAppResource(primaryResource: Option[String])
   extends MainAppResource
 
-private[spark] case class PythonMainAppResource(primaryResource: String)
+@Stable
+@DeveloperApi
+@Since("2.4.0")
+case class PythonMainAppResource(primaryResource: String)
   extends MainAppResource with NonJVMResource
 
-private[spark] case class RMainAppResource(primaryResource: String)
+@Stable
+@DeveloperApi
+@Since("2.4.0")
+case class RMainAppResource(primaryResource: String)
   extends MainAppResource with NonJVMResource





(spark) branch master updated: [SPARK-48078][K8S] Promote `o.a.s.d.k8s.Constants` to `DeveloperApi`

2024-05-01 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 4b16238784e0 [SPARK-48078][K8S] Promote `o.a.s.d.k8s.Constants` to 
`DeveloperApi`
4b16238784e0 is described below

commit 4b16238784e0a3bb1a6555c90a913b54f2aec2b1
Author: Dongjoon Hyun 
AuthorDate: Wed May 1 19:30:52 2024 -0700

[SPARK-48078][K8S] Promote `o.a.s.d.k8s.Constants` to `DeveloperApi`

### What changes were proposed in this pull request?

This PR aims to promote `org.apache.spark.deploy.k8s.Constants` to 
`DeveloperApi`

### Why are the changes needed?

Since `Apache Spark Kubernetes Operator` depends on this, we had better 
maintain it as a developer API officially from `Apache Spark 4.0.0`.
- https://github.com/apache/spark-kubernetes-operator/pull/10
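
A minimal sketch of external usage (the label map and version value are hypothetical), relying on the `SPARK_VERSION_LABEL` constant visible in the diff below:

```
import org.apache.spark.deploy.k8s.Constants

object ConstantsExample {
  def main(args: Array[String]): Unit = {
    // Hypothetical pod labels an external tool might attach, reusing Spark's
    // own label key instead of hard-coding the "spark-version" string.
    val labels: Map[String, String] = Map(
      Constants.SPARK_VERSION_LABEL -> "4.0.0"
    )
    labels.foreach { case (k, v) => println(s"$k=$v") }
  }
}
```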

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46329 from dongjoon-hyun/SPARK-48078.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../main/scala/org/apache/spark/deploy/k8s/Constants.scala| 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala
 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala
index 385734c557a3..ead3188aa649 100644
--- 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala
+++ 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala
@@ -16,7 +16,16 @@
  */
 package org.apache.spark.deploy.k8s
 
-private[spark] object Constants {
+import org.apache.spark.annotation.{DeveloperApi, Stable}
+
+/**
+ * :: DeveloperApi ::
+ *
+ * This is used in both K8s module and Spark K8s Operator.
+ */
+@Stable
+@DeveloperApi
+object Constants {
 
   // Labels
   val SPARK_VERSION_LABEL = "spark-version"





(spark) branch master updated: [SPARK-48077][K8S] Promote `KubernetesClientUtils` to `DeveloperApi`

2024-05-01 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a42eef9e029a [SPARK-48077][K8S] Promote `KubernetesClientUtils` to 
`DeveloperApi`
a42eef9e029a is described below

commit a42eef9e029a388559e461f856af435457406a6d
Author: Dongjoon Hyun 
AuthorDate: Wed May 1 18:10:53 2024 -0700

[SPARK-48077][K8S] Promote `KubernetesClientUtils` to `DeveloperApi`

### What changes were proposed in this pull request?

This PR aims to promote `KubernetesClientUtils` to `DeveloperApi`.

### Why are the changes needed?

Since `Apache Spark Kubernetes Operator` requires this, we had better 
maintain it as a developer API officially from `Apache Spark 4.0.0`.
- https://github.com/apache/spark-kubernetes-operator/pull/10
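
A minimal sketch of what consuming this as a public API could look like (the prefix value is hypothetical), using the `configMapName` helper shown in the diff below:

```
import org.apache.spark.deploy.k8s.submit.KubernetesClientUtils

object ConfigMapNameExample {
  def main(args: Array[String]): Unit = {
    // Appends "-conf-map" and truncates the prefix so the resulting name stays
    // within the Kubernetes DNS subdomain length limit.
    val name = KubernetesClientUtils.configMapName("spark-drv-my-app")
    println(name)  // spark-drv-my-app-conf-map
  }
}
```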

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46327 from dongjoon-hyun/SPARK-48077.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../spark/deploy/k8s/submit/KubernetesClientUtils.scala  | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientUtils.scala
 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientUtils.scala
index 930588fb0077..d6b1da39bcbb 100644
--- 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientUtils.scala
+++ 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientUtils.scala
@@ -28,6 +28,7 @@ import scala.jdk.CollectionConverters._
 import io.fabric8.kubernetes.api.model.{ConfigMap, ConfigMapBuilder, KeyToPath}
 
 import org.apache.spark.SparkConf
+import org.apache.spark.annotation.{DeveloperApi, Since, Unstable}
 import org.apache.spark.deploy.k8s.{Config, Constants, KubernetesUtils}
 import 
org.apache.spark.deploy.k8s.Config.{KUBERNETES_DNS_SUBDOMAIN_NAME_MAX_LENGTH, 
KUBERNETES_NAMESPACE}
 import org.apache.spark.deploy.k8s.Constants.ENV_SPARK_CONF_DIR
@@ -35,16 +36,26 @@ import org.apache.spark.internal.{Logging, MDC}
 import org.apache.spark.internal.LogKeys.{CONFIG, PATH, PATHS}
 import org.apache.spark.util.ArrayImplicits._
 
-private[spark] object KubernetesClientUtils extends Logging {
+/**
+ * :: DeveloperApi ::
+ *
+ * A utility class used for K8s operations internally and Spark K8s operator.
+ */
+@Unstable
+@DeveloperApi
+object KubernetesClientUtils extends Logging {
 
   // Config map name can be KUBERNETES_DNS_SUBDOMAIN_NAME_MAX_LENGTH chars at 
max.
+  @Since("3.3.0")
   def configMapName(prefix: String): String = {
 val suffix = "-conf-map"
 s"${prefix.take(KUBERNETES_DNS_SUBDOMAIN_NAME_MAX_LENGTH - 
suffix.length)}$suffix"
   }
 
+  @Since("3.1.0")
   val configMapNameExecutor: String = 
configMapName(s"spark-exec-${KubernetesUtils.uniqueID()}")
 
+  @Since("3.1.0")
   val configMapNameDriver: String = 
configMapName(s"spark-drv-${KubernetesUtils.uniqueID()}")
 
   private def buildStringFromPropertiesMap(configMapName: String,
@@ -62,6 +73,7 @@ private[spark] object KubernetesClientUtils extends Logging {
   /**
* Build, file -> 'file's content' map of all the selected files in 
SPARK_CONF_DIR.
*/
+  @Since("3.1.1")
   def buildSparkConfDirFilesMap(
   configMapName: String,
   sparkConf: SparkConf,
@@ -77,6 +89,7 @@ private[spark] object KubernetesClientUtils extends Logging {
 }
   }
 
+  @Since("3.1.0")
   def buildKeyToPathObjects(confFilesMap: Map[String, String]): Seq[KeyToPath] 
= {
 confFilesMap.map {
   case (fileName: String, _: String) =>
@@ -89,6 +102,7 @@ private[spark] object KubernetesClientUtils extends Logging {
* Build a Config Map that will hold the content for environment variable 
SPARK_CONF_DIR
* on remote pods.
*/
+  @Since("3.1.0")
   def buildConfigMap(configMapName: String, confFileMap: Map[String, String],
   withLabels: Map[String, String] = Map()): ConfigMap = {
 val configMapNameSpace =





(spark) branch master updated (04f3a938895c -> 0fc7c4a29c46)

2024-05-01 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 04f3a938895c [SPARK-48076][K8S] Promote `KubernetesVolumeUtils` to 
`DeveloperApi`
 add 0fc7c4a29c46 [SPARK-45891][SQL][FOLLOW-UP] Added length check to the 
is_variant_null expression

No new revisions were added by this update.

Summary of changes:
 .../expressions/variant/VariantExpressionEvalUtils.scala   | 10 +++---
 .../org/apache/spark/sql/errors/QueryExecutionErrors.scala |  5 +
 .../catalyst/expressions/variant/VariantExpressionSuite.scala  |  7 +++
 3 files changed, 19 insertions(+), 3 deletions(-)





(spark) branch master updated (e521d3c1f357 -> 04f3a938895c)

2024-05-01 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from e521d3c1f357 [MINOR] Fix the grammar of some comments on renaming 
error classes
 add 04f3a938895c [SPARK-48076][K8S] Promote `KubernetesVolumeUtils` to 
`DeveloperApi`

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/deploy/k8s/KubernetesVolumeUtils.scala   | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)





(spark) branch master updated (69ea082fc69a -> fd57c3493af7)

2024-05-01 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 69ea082fc69a [SPARK-47934][CORE] Ensure trailing slashes in 
`HistoryServer` URL redirections
 add fd57c3493af7 [SPARK-47911][SQL] Introduces a universal BinaryFormatter 
to make binary output consistent

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/expressions/Cast.scala  |  2 -
 .../sql/catalyst/expressions/ToPrettyString.scala  |  2 +-
 .../sql/catalyst/expressions/ToStringBase.scala| 40 ++-
 .../org/apache/spark/sql/internal/SQLConf.scala| 42 
 .../apache/spark/sql/execution/HiveResult.scala| 34 +++-
 .../sql-tests/analyzer-results/binary.sql.out  | 27 +
 .../analyzer-results/binary_base64.sql.out | 27 +
 .../analyzer-results/binary_basic.sql.out  | 27 +
 .../sql-tests/analyzer-results/binary_hex.sql.out  | 27 +
 .../src/test/resources/sql-tests/inputs/binary.sql |  6 +++
 .../resources/sql-tests/inputs/binary_base64.sql   |  3 ++
 .../resources/sql-tests/inputs/binary_basic.sql|  4 ++
 .../test/resources/sql-tests/inputs/binary_hex.sql |  3 ++
 .../resources/sql-tests/results/binary.sql.out | 31 +++
 .../sql-tests/results/binary_base64.sql.out| 31 +++
 .../sql-tests/results/binary_basic.sql.out | 31 +++
 .../resources/sql-tests/results/binary_hex.sql.out | 31 +++
 .../org/apache/spark/sql/DataFrameShowSuite.scala  |  8 +++-
 .../org/apache/spark/sql/DataFrameSuite.scala  | 45 +++---
 .../spark/sql/execution/HiveResultSuite.scala  |  3 +-
 .../spark/sql/hive/thriftserver/RowSetUtils.scala  | 33 +---
 .../SparkExecuteStatementOperation.scala   |  3 +-
 .../thriftserver/ThriftServerQueryTestSuite.scala  | 24 +++-
 23 files changed, 429 insertions(+), 55 deletions(-)
 create mode 100644 
sql/core/src/test/resources/sql-tests/analyzer-results/binary.sql.out
 create mode 100644 
sql/core/src/test/resources/sql-tests/analyzer-results/binary_base64.sql.out
 create mode 100644 
sql/core/src/test/resources/sql-tests/analyzer-results/binary_basic.sql.out
 create mode 100644 
sql/core/src/test/resources/sql-tests/analyzer-results/binary_hex.sql.out
 create mode 100644 sql/core/src/test/resources/sql-tests/inputs/binary.sql
 create mode 100644 
sql/core/src/test/resources/sql-tests/inputs/binary_base64.sql
 create mode 100644 
sql/core/src/test/resources/sql-tests/inputs/binary_basic.sql
 create mode 100644 sql/core/src/test/resources/sql-tests/inputs/binary_hex.sql
 create mode 100644 sql/core/src/test/resources/sql-tests/results/binary.sql.out
 create mode 100644 
sql/core/src/test/resources/sql-tests/results/binary_base64.sql.out
 create mode 100644 
sql/core/src/test/resources/sql-tests/results/binary_basic.sql.out
 create mode 100644 
sql/core/src/test/resources/sql-tests/results/binary_hex.sql.out





(spark) branch master updated (5ac803079b30 -> 69ea082fc69a)

2024-05-01 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 5ac803079b30 [SPARK-48074][CORE] Improve the readability of JSON 
loggings
 add 69ea082fc69a [SPARK-47934][CORE] Ensure trailing slashes in 
`HistoryServer` URL redirections

No new revisions were added by this update.

Summary of changes:
 .../spark/deploy/history/HistoryServer.scala   |  4 +-
 .../spark/deploy/history/HistoryServerSuite.scala  | 58 ++
 2 files changed, 38 insertions(+), 24 deletions(-)





(spark) branch master updated (35767bb09fe1 -> 5ac803079b30)

2024-05-01 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 35767bb09fe1 [SPARK-48070][SQL][TESTS] Support 
`AdaptiveQueryExecSuite.runAdaptiveAndVerifyResult` to skip check results
 add 5ac803079b30 [SPARK-48074][CORE] Improve the readability of JSON 
loggings

No new revisions were added by this update.

Summary of changes:
 .../resources/org/apache/spark/SparkLayout.json| 31 +++---
 .../apache/spark/util/StructuredLoggingSuite.scala |  2 +-
 2 files changed, 28 insertions(+), 5 deletions(-)





(spark) branch master updated: [SPARK-48070][SQL][TESTS] Support `AdaptiveQueryExecSuite.runAdaptiveAndVerifyResult` to skip check results

2024-05-01 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 35767bb09fe1 [SPARK-48070][SQL][TESTS] Support 
`AdaptiveQueryExecSuite.runAdaptiveAndVerifyResult` to skip check results
35767bb09fe1 is described below

commit 35767bb09fe13468c03ffbb3a45e106e8b8eb179
Author: sychen 
AuthorDate: Wed May 1 12:37:24 2024 -0700

[SPARK-48070][SQL][TESTS] Support 
`AdaptiveQueryExecSuite.runAdaptiveAndVerifyResult` to skip check results

### What changes were proposed in this pull request?
This PR aims to let `AdaptiveQueryExecSuite.runAdaptiveAndVerifyResult` optionally skip checking query results.

### Why are the changes needed?
https://github.com/apache/spark/pull/46273#discussion_r1585445992

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46316 from cxzl25/SPARK-48070.

Authored-by: sychen 
Signed-off-by: Dongjoon Hyun 
---
 .../spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala| 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala
index f6ca7ff3cdcc..d74ecb32971c 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala
@@ -64,7 +64,8 @@ class AdaptiveQueryExecSuite
 
   setupTestData()
 
-  private def runAdaptiveAndVerifyResult(query: String): (SparkPlan, 
SparkPlan) = {
+  private def runAdaptiveAndVerifyResult(query: String,
+  skipCheckAnswer: Boolean = false): (SparkPlan, SparkPlan) = {
 var finalPlanCnt = 0
 var hasMetricsEvent = false
 val listener = new SparkListener {
@@ -88,8 +89,10 @@ class AdaptiveQueryExecSuite
 assert(planBefore.toString.startsWith("AdaptiveSparkPlan 
isFinalPlan=false"))
 val result = dfAdaptive.collect()
 withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") {
-  val df = sql(query)
-  checkAnswer(df, result.toImmutableArraySeq)
+  if (!skipCheckAnswer) {
+val df = sql(query)
+checkAnswer(df, result.toImmutableArraySeq)
+  }
 }
 val planAfter = dfAdaptive.queryExecution.executedPlan
 assert(planAfter.toString.startsWith("AdaptiveSparkPlan isFinalPlan=true"))
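
Below is a minimal usage sketch of the new parameter, based only on the signature shown in this diff; the test body and query are illustrative and not part of the patch:
```scala
// Hypothetical caller (test body sketch): run a query whose rows may not be stable
// across executions through AQE, skipping the row-level answer comparison while the
// helper still verifies the final adaptive plan.
val (plan, adaptivePlan) = runAdaptiveAndVerifyResult(
  "SELECT * FROM testData TABLESAMPLE (10 PERCENT)",
  skipCheckAnswer = true)
```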





(spark) branch master updated: [SPARK-46009][SQL][FOLLOWUP] Remove unused PERCENTILE_CONT and PERCENTILE_DISC in g4

2024-05-01 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ad63eef20617 [SPARK-46009][SQL][FOLLOWUP] Remove unused 
PERCENTILE_CONT and PERCENTILE_DISC in g4
ad63eef20617 is described below

commit ad63eef20617db7cdecce465af54e4787d0deeac
Author: beliefer 
AuthorDate: Wed May 1 11:25:54 2024 -0700

[SPARK-46009][SQL][FOLLOWUP] Remove unused PERCENTILE_CONT and 
PERCENTILE_DISC in g4

### What changes were proposed in this pull request?
This PR proposes to remove the unused `PERCENTILE_CONT` and `PERCENTILE_DISC`
tokens in g4.

### Why are the changes needed?
https://github.com/apache/spark/pull/43910 merged the parse rule of 
`PercentileCont` and `PercentileDisc` into `functionCall`, but forgot to remove 
unused `PERCENTILE_CONT` and `PERCENTILE_DISC` in g4.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
GA.

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #46272 from beliefer/SPARK-46009_followup2.

Authored-by: beliefer 
Signed-off-by: Dongjoon Hyun 
---
 docs/sql-ref-ansi-compliance.md|   2 -
 .../spark/sql/catalyst/parser/SqlBaseLexer.g4  |   2 -
 .../spark/sql/catalyst/parser/SqlBaseParser.g4 |   2 -
 .../sql-tests/analyzer-results/window2.sql.out | 126 +
 .../sql-tests/results/ansi/keywords.sql.out|   4 -
 .../resources/sql-tests/results/keywords.sql.out   |   2 -
 .../ThriftServerWithSparkContextSuite.scala|   2 +-
 7 files changed, 127 insertions(+), 13 deletions(-)

diff --git a/docs/sql-ref-ansi-compliance.md b/docs/sql-ref-ansi-compliance.md
index 011bd671ca1f..84416ffd5f83 100644
--- a/docs/sql-ref-ansi-compliance.md
+++ b/docs/sql-ref-ansi-compliance.md
@@ -608,8 +608,6 @@ Below is a list of all the keywords in Spark SQL.
 |PARTITIONED|non-reserved|non-reserved|non-reserved|
 |PARTITIONS|non-reserved|non-reserved|non-reserved|
 |PERCENT|non-reserved|non-reserved|non-reserved|
-|PERCENTILE_CONT|reserved|non-reserved|non-reserved|
-|PERCENTILE_DISC|reserved|non-reserved|non-reserved|
 |PIVOT|non-reserved|non-reserved|non-reserved|
 |PLACING|non-reserved|non-reserved|non-reserved|
 |POSITION|non-reserved|non-reserved|reserved|
diff --git 
a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4 
b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4
index 83e40c4a20a2..86e16af7ff10 100644
--- 
a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4
+++ 
b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4
@@ -298,8 +298,6 @@ OVERWRITE: 'OVERWRITE';
 PARTITION: 'PARTITION';
 PARTITIONED: 'PARTITIONED';
 PARTITIONS: 'PARTITIONS';
-PERCENTILE_CONT: 'PERCENTILE_CONT';
-PERCENTILE_DISC: 'PERCENTILE_DISC';
 PERCENTLIT: 'PERCENT';
 PIVOT: 'PIVOT';
 PLACING: 'PLACING';
diff --git 
a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4 
b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
index 71bd75f934ca..653224c5475f 100644
--- 
a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
+++ 
b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
@@ -1829,8 +1829,6 @@ nonReserved
 | PARTITION
 | PARTITIONED
 | PARTITIONS
-| PERCENTILE_CONT
-| PERCENTILE_DISC
 | PERCENTLIT
 | PIVOT
 | PLACING
diff --git 
a/sql/core/src/test/resources/sql-tests/analyzer-results/window2.sql.out 
b/sql/core/src/test/resources/sql-tests/analyzer-results/window2.sql.out
new file mode 100644
index ..6fd41286959a
--- /dev/null
+++ b/sql/core/src/test/resources/sql-tests/analyzer-results/window2.sql.out
@@ -0,0 +1,126 @@
+-- Automatically generated by SQLQueryTestSuite
+-- !query
+CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES
+(null, 1L, 1.0D, date("2017-08-01"), timestamp_seconds(1501545600), "a"),
+(1, 1L, 1.0D, date("2017-08-01"), timestamp_seconds(1501545600), "a"),
+(1, 2L, 2.5D, date("2017-08-02"), timestamp_seconds(150200), "a"),
+(2, 2147483650L, 100.001D, date("2020-12-31"), timestamp_seconds(1609372800), 
"a"),
+(1, null, 1.0D, date("2017-08-01"), timestamp_seconds(1501545600), "b"),
+(2, 3L, 3.3D, date("2017-08-03"), timestamp_seconds(150300), "b"),
+(3, 2147483650L, 100.001D, date("2020-12-31"), timestamp_seconds(1609372800), 
"b"),
+(null, null, null, null, null, null),
+(3, 1L, 1.0D, date("2017-
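
For reference, the functions themselves are unaffected by this grammar cleanup: `percentile_cont` and `percentile_disc` keep parsing through the generic function-call rule. A minimal sketch, assuming an `employees` table with a `salary` column (both names are illustrative):
```scala
// Both calls go through the ordinary functionCall parse path; no dedicated
// PERCENTILE_CONT / PERCENTILE_DISC tokens are needed in the grammar anymore.
spark.sql(
  """SELECT
    |  percentile_cont(0.5) WITHIN GROUP (ORDER BY salary) AS median_cont,
    |  percentile_disc(0.5) WITHIN GROUP (ORDER BY salary) AS median_disc
    |FROM employees""".stripMargin).show()
```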

(spark) branch branch-3.5 updated: Revert "[SPARK-48016][SQL] Fix a bug in try_divide function when with decimals"

2024-05-01 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new fc0ef07f2949 Revert "[SPARK-48016][SQL] Fix a bug in try_divide 
function when with decimals"
fc0ef07f2949 is described below

commit fc0ef07f2949c399537c6d9b5fb7b81f546de212
Author: Dongjoon Hyun 
AuthorDate: Wed May 1 11:18:29 2024 -0700

Revert "[SPARK-48016][SQL] Fix a bug in try_divide function when with 
decimals"

This reverts commit e78ee2c5770218a521340cb84f57a02dd00f7f3a.
---
 .../sql/catalyst/analysis/DecimalPrecision.scala   | 14 ++---
 .../spark/sql/catalyst/analysis/TypeCoercion.scala | 10 ++--
 sql/core/src/test/resources/log4j2.properties  |  2 +-
 .../analyzer-results/ansi/try_arithmetic.sql.out   | 56 ---
 .../analyzer-results/try_arithmetic.sql.out| 56 ---
 .../resources/sql-tests/inputs/try_arithmetic.sql  |  8 ---
 .../sql-tests/results/ansi/try_arithmetic.sql.out  | 64 --
 .../sql-tests/results/try_arithmetic.sql.out   | 64 --
 8 files changed, 13 insertions(+), 261 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala
index f51127f53b38..09cf61a77955 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala
@@ -83,7 +83,7 @@ object DecimalPrecision extends TypeCoercionRule {
   val resultType = widerDecimalType(p1, s1, p2, s2)
   val newE1 = if (e1.dataType == resultType) e1 else Cast(e1, resultType)
   val newE2 = if (e2.dataType == resultType) e2 else Cast(e2, resultType)
-  b.withNewChildren(Seq(newE1, newE2))
+  b.makeCopy(Array(newE1, newE2))
   }
 
   /**
@@ -202,21 +202,21 @@ object DecimalPrecision extends TypeCoercionRule {
 case (l: Literal, r) if r.dataType.isInstanceOf[DecimalType] &&
 l.dataType.isInstanceOf[IntegralType] &&
 literalPickMinimumPrecision =>
-  b.withNewChildren(Seq(Cast(l, DataTypeUtils.fromLiteral(l)), r))
+  b.makeCopy(Array(Cast(l, DataTypeUtils.fromLiteral(l)), r))
 case (l, r: Literal) if l.dataType.isInstanceOf[DecimalType] &&
 r.dataType.isInstanceOf[IntegralType] &&
 literalPickMinimumPrecision =>
-  b.withNewChildren(Seq(l, Cast(r, DataTypeUtils.fromLiteral(r
+  b.makeCopy(Array(l, Cast(r, DataTypeUtils.fromLiteral(r
 // Promote integers inside a binary expression with fixed-precision 
decimals to decimals,
 // and fixed-precision decimals in an expression with floats / doubles 
to doubles
 case (l @ IntegralTypeExpression(), r @ DecimalExpression(_, _)) =>
-  b.withNewChildren(Seq(Cast(l, DecimalType.forType(l.dataType)), r))
+  b.makeCopy(Array(Cast(l, DecimalType.forType(l.dataType)), r))
 case (l @ DecimalExpression(_, _), r @ IntegralTypeExpression()) =>
-  b.withNewChildren(Seq(l, Cast(r, DecimalType.forType(r.dataType
+  b.makeCopy(Array(l, Cast(r, DecimalType.forType(r.dataType
 case (l, r @ DecimalExpression(_, _)) if isFloat(l.dataType) =>
-  b.withNewChildren(Seq(l, Cast(r, DoubleType)))
+  b.makeCopy(Array(l, Cast(r, DoubleType)))
 case (l @ DecimalExpression(_, _), r) if isFloat(r.dataType) =>
-  b.withNewChildren(Seq(Cast(l, DoubleType), r))
+  b.makeCopy(Array(Cast(l, DoubleType), r))
 case _ => b
   }
   }
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
index c9a4a2d40246..190e72a8e669 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
@@ -1102,22 +1102,22 @@ object TypeCoercion extends TypeCoercionBase {
 
   case a @ BinaryArithmetic(left @ StringTypeExpression(), right)
 if right.dataType != CalendarIntervalType =>
-a.withNewChildren(Seq(Cast(left, DoubleType), right))
+a.makeCopy(Array(Cast(left, DoubleType), right))
   case a @ BinaryArithmetic(left, right @ StringTypeExpression())
 if left.dataType != CalendarIntervalType =>
-a.withNewChildren(Seq(left, Cast(right, DoubleType)))
+a.makeCopy(Array(left, Cast(right, DoubleType)))
 
   // For equality between string and timestamp we cast the string to a 
timestam

(spark) branch branch-3.4 updated: [SPARK-48068][PYTHON] `mypy` should have `--python-executable` parameter

2024-05-01 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 70ce67cc77cc [SPARK-48068][PYTHON] `mypy` should have 
`--python-executable` parameter
70ce67cc77cc is described below

commit 70ce67cc77ccce3a4509bba608dbab69b45cc2b9
Author: Dongjoon Hyun 
AuthorDate: Wed May 1 10:42:26 2024 -0700

[SPARK-48068][PYTHON] `mypy` should have `--python-executable` parameter

### What changes were proposed in this pull request?

This PR aims to fix `mypy` failure by propagating `lint-python`'s 
`PYTHON_EXECUTABLE` to `mypy`'s parameter correctly.

### Why are the changes needed?

We assumed that `PYTHON_EXECUTABLE` is used for `dev/lint-python` like the
following. That's not always guaranteed. We need to use `mypy`'s own parameter
to make sure of it.

https://github.com/apache/spark/blob/ff401dde50343c9bbc1c49a0294272f2da7d01e2/.github/workflows/build_and_test.yml#L705

This patch is useful when `python3` resolves to one of multiple Python
installations, as in our CI environment.
```
$ docker run -it --rm ghcr.io/apache/apache-spark-ci-image:master-8905641334 bash
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
root@2ef6ce08d2c4:/# python3 --version
Python 3.10.12
root@2ef6ce08d2c4:/# python3.9 --version
Python 3.9.19
```

For example, the following shows that `PYTHON_EXECUTABLE` is not considered
by `mypy`.
```
root@18c8eae5791e:/spark# PYTHON_EXECUTABLE=python3.9 mypy --python-executable=python3.11 --namespace-packages --config-file python/mypy.ini python/pyspark | wc -l
3428
root@18c8eae5791e:/spark# PYTHON_EXECUTABLE=python3.9 mypy --namespace-packages --config-file python/mypy.ini python/pyspark | wc -l
1
root@18c8eae5791e:/spark# PYTHON_EXECUTABLE=python3.11 mypy --namespace-packages --config-file python/mypy.ini python/pyspark | wc -l
1
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.
    
Closes #46314 from dongjoon-hyun/SPARK-48068.

Authored-by: Dongjoon Hyun 
    Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 26c871f180306fbf86ce65f14f8e7a71f89885ed)
    Signed-off-by: Dongjoon Hyun 
---
 dev/lint-python | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/dev/lint-python b/dev/lint-python
index b5ee63e38690..9b60ca75eb9b 100755
--- a/dev/lint-python
+++ b/dev/lint-python
@@ -69,6 +69,7 @@ function mypy_annotation_test {
 
 echo "starting mypy annotations test..."
 MYPY_REPORT=$( ($MYPY_BUILD \
+  --python-executable $PYTHON_EXECUTABLE \
   --namespace-packages \
   --config-file python/mypy.ini \
   --cache-dir /tmp/.mypy_cache/ \
@@ -128,6 +129,7 @@ function mypy_examples_test {
 echo "starting mypy examples test..."
 
 MYPY_REPORT=$( (MYPYPATH=python $MYPY_BUILD \
+  --python-executable $PYTHON_EXECUTABLE \
   --namespace-packages \
   --config-file python/mypy.ini \
   --exclude "mllib/*" \





(spark) branch branch-3.5 updated: [SPARK-48068][PYTHON] `mypy` should have `--python-executable` parameter

2024-05-01 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 953d7f90c6db [SPARK-48068][PYTHON] `mypy` should have 
`--python-executable` parameter
953d7f90c6db is described below

commit 953d7f90c6dbee597b0360c551dfac2a1d87d961
Author: Dongjoon Hyun 
AuthorDate: Wed May 1 10:42:26 2024 -0700

[SPARK-48068][PYTHON] `mypy` should have `--python-executable` parameter

### What changes were proposed in this pull request?

This PR aims to fix `mypy` failure by propagating `lint-python`'s 
`PYTHON_EXECUTABLE` to `mypy`'s parameter correctly.

### Why are the changes needed?

We assumed that `PYTHON_EXECUTABLE` is used for `dev/lint-python` like the
following. That's not always guaranteed. We need to use `mypy`'s own parameter
to make sure of it.

https://github.com/apache/spark/blob/ff401dde50343c9bbc1c49a0294272f2da7d01e2/.github/workflows/build_and_test.yml#L705

This patch is useful when `python3` resolves to one of multiple Python
installations, as in our CI environment.
```
$ docker run -it --rm ghcr.io/apache/apache-spark-ci-image:master-8905641334 bash
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
root@2ef6ce08d2c4:/# python3 --version
Python 3.10.12
root@2ef6ce08d2c4:/# python3.9 --version
Python 3.9.19
```

For example, the following shows that `PYTHON_EXECUTABLE` is not considered
by `mypy`.
```
root@18c8eae5791e:/spark# PYTHON_EXECUTABLE=python3.9 mypy --python-executable=python3.11 --namespace-packages --config-file python/mypy.ini python/pyspark | wc -l
3428
root@18c8eae5791e:/spark# PYTHON_EXECUTABLE=python3.9 mypy --namespace-packages --config-file python/mypy.ini python/pyspark | wc -l
1
root@18c8eae5791e:/spark# PYTHON_EXECUTABLE=python3.11 mypy --namespace-packages --config-file python/mypy.ini python/pyspark | wc -l
1
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.
    
Closes #46314 from dongjoon-hyun/SPARK-48068.

Authored-by: Dongjoon Hyun 
    Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 26c871f180306fbf86ce65f14f8e7a71f89885ed)
    Signed-off-by: Dongjoon Hyun 
---
 dev/lint-python | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/dev/lint-python b/dev/lint-python
index d040493c86c4..7ccd32451acc 100755
--- a/dev/lint-python
+++ b/dev/lint-python
@@ -118,6 +118,7 @@ function mypy_annotation_test {
 
 echo "starting mypy annotations test..."
 MYPY_REPORT=$( ($MYPY_BUILD \
+  --python-executable $PYTHON_EXECUTABLE \
   --namespace-packages \
   --config-file python/mypy.ini \
   --cache-dir /tmp/.mypy_cache/ \
@@ -177,6 +178,7 @@ function mypy_examples_test {
 echo "starting mypy examples test..."
 
 MYPY_REPORT=$( (MYPYPATH=python $MYPY_BUILD \
+  --python-executable $PYTHON_EXECUTABLE \
   --namespace-packages \
   --config-file python/mypy.ini \
   --exclude "mllib/*" \





(spark) branch master updated: [SPARK-48068][PYTHON] `mypy` should have `--python-executable` parameter

2024-05-01 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 26c871f18030 [SPARK-48068][PYTHON] `mypy` should have 
`--python-executable` parameter
26c871f18030 is described below

commit 26c871f180306fbf86ce65f14f8e7a71f89885ed
Author: Dongjoon Hyun 
AuthorDate: Wed May 1 10:42:26 2024 -0700

[SPARK-48068][PYTHON] `mypy` should have `--python-executable` parameter

### What changes were proposed in this pull request?

This PR aims to fix `mypy` failure by propagating `lint-python`'s 
`PYTHON_EXECUTABLE` to `mypy`'s parameter correctly.

### Why are the changes needed?

We assumed that `PYTHON_EXECUTABLE` is used for `dev/lint-python` like the
following. That's not always guaranteed. We need to use `mypy`'s own parameter
to make sure of it.

https://github.com/apache/spark/blob/ff401dde50343c9bbc1c49a0294272f2da7d01e2/.github/workflows/build_and_test.yml#L705

This patch is useful when `python3` resolves to one of multiple Python
installations, as in our CI environment.
```
$ docker run -it --rm ghcr.io/apache/apache-spark-ci-image:master-8905641334 bash
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
root@2ef6ce08d2c4:/# python3 --version
Python 3.10.12
root@2ef6ce08d2c4:/# python3.9 --version
Python 3.9.19
```

For example, the following shows that `PYTHON_EXECUTABLE` is not considered
by `mypy`.
```
root@18c8eae5791e:/spark# PYTHON_EXECUTABLE=python3.9 mypy --python-executable=python3.11 --namespace-packages --config-file python/mypy.ini python/pyspark | wc -l
3428
root@18c8eae5791e:/spark# PYTHON_EXECUTABLE=python3.9 mypy --namespace-packages --config-file python/mypy.ini python/pyspark | wc -l
1
root@18c8eae5791e:/spark# PYTHON_EXECUTABLE=python3.11 mypy --namespace-packages --config-file python/mypy.ini python/pyspark | wc -l
1
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.
    
Closes #46314 from dongjoon-hyun/SPARK-48068.

Authored-by: Dongjoon Hyun 
    Signed-off-by: Dongjoon Hyun 
---
 dev/lint-python | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/dev/lint-python b/dev/lint-python
index 6bd843103bd7..b8703310bc4b 100755
--- a/dev/lint-python
+++ b/dev/lint-python
@@ -125,6 +125,7 @@ function mypy_annotation_test {
 
 echo "starting mypy annotations test..."
 MYPY_REPORT=$( ($MYPY_BUILD \
+  --python-executable $PYTHON_EXECUTABLE \
   --namespace-packages \
   --config-file python/mypy.ini \
   --cache-dir /tmp/.mypy_cache/ \
@@ -184,6 +185,7 @@ function mypy_examples_test {
 echo "starting mypy examples test..."
 
 MYPY_REPORT=$( (MYPYPATH=python $MYPY_BUILD \
+  --python-executable $PYTHON_EXECUTABLE \
   --namespace-packages \
   --config-file python/mypy.ini \
   --exclude "mllib/*" \





(spark) branch master updated: [SPARK-48069][INFRA] Handle `PEP-632` by checking `ModuleNotFoundError` on `setuptools` in Python 3.12

2024-04-30 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ff401dde5034 [SPARK-48069][INFRA] Handle `PEP-632` by checking 
`ModuleNotFoundError` on `setuptools` in Python 3.12
ff401dde5034 is described below

commit ff401dde50343c9bbc1c49a0294272f2da7d01e2
Author: Dongjoon Hyun 
AuthorDate: Tue Apr 30 23:54:06 2024 -0700

[SPARK-48069][INFRA] Handle `PEP-632` by checking `ModuleNotFoundError` on 
`setuptools` in Python 3.12

### What changes were proposed in this pull request?

This PR aims to handle `PEP-632` by checking `ModuleNotFoundError` on 
`setuptools`.
- [PEP 632 – Deprecate distutils module](https://peps.python.org/pep-0632/)

### Why are the changes needed?

Use `Python 3.12`.
```
$ python3 --version
Python 3.12.2
```

**BEFORE**
```
$ dev/lint-python --mypy | grep ModuleNotFoundError
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'setuptools'
```

**AFTER**
```
$ dev/lint-python --mypy | grep ModuleNotFoundError
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs and manual test.

### Was this patch authored or co-authored using generative AI tooling?

No.

    Closes #46315 from dongjoon-hyun/SPARK-48069.
    
Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 dev/lint-python | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/dev/lint-python b/dev/lint-python
index 8d587bd52aca..6bd843103bd7 100755
--- a/dev/lint-python
+++ b/dev/lint-python
@@ -84,7 +84,10 @@ function satisfies_min_version {
 local expected_version="$2"
 echo "$(
 "$PYTHON_EXECUTABLE" << EOM
-from setuptools.extern.packaging import version
+try:
+from setuptools.extern.packaging import version
+except ModuleNotFoundError:
+from packaging import version
 print(version.parse('$provided_version') >= version.parse('$expected_version'))
 EOM
 )"





(spark) branch master updated: [SPARK-48016][SQL][TESTS][FOLLOWUP] Update Java 21 golden file

2024-04-30 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 65cf5b18648a [SPARK-48016][SQL][TESTS][FOLLOWUP] Update Java 21 golden 
file
65cf5b18648a is described below

commit 65cf5b18648a81fc9b0787d03f23f7465c20f3ec
Author: Dongjoon Hyun 
AuthorDate: Tue Apr 30 22:42:02 2024 -0700

[SPARK-48016][SQL][TESTS][FOLLOWUP] Update Java 21 golden file

### What changes were proposed in this pull request?

This is a follow-up of SPARK-48016 to update the missed Java 21 golden file.
- #46286

### Why are the changes needed?

To recover Java 21 CIs:
- https://github.com/apache/spark/actions/workflows/build_java21.yml
- https://github.com/apache/spark/actions/workflows/build_maven_java21.yml
- 
https://github.com/apache/spark/actions/workflows/build_maven_java21_macos14.yml

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual tests. I regenerated all in Java 21 and this was the only one 
affected.
```
$ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly 
org.apache.spark.sql.SQLQueryTestSuite"
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46313 from dongjoon-hyun/SPARK-48016.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../results/try_arithmetic.sql.out.java21  | 64 ++
 1 file changed, 64 insertions(+)

diff --git 
a/sql/core/src/test/resources/sql-tests/results/try_arithmetic.sql.out.java21 
b/sql/core/src/test/resources/sql-tests/results/try_arithmetic.sql.out.java21
index dcdb9d0dcb19..002a0dfcf37e 100644
--- 
a/sql/core/src/test/resources/sql-tests/results/try_arithmetic.sql.out.java21
+++ 
b/sql/core/src/test/resources/sql-tests/results/try_arithmetic.sql.out.java21
@@ -15,6 +15,22 @@ struct
 NULL
 
 
+-- !query
+SELECT try_add(2147483647, decimal(1))
+-- !query schema
+struct
+-- !query output
+2147483648
+
+
+-- !query
+SELECT try_add(2147483647, "1")
+-- !query schema
+struct
+-- !query output
+2.147483648E9
+
+
 -- !query
 SELECT try_add(-2147483648, -1)
 -- !query schema
@@ -249,6 +265,22 @@ struct
 NULL
 
 
+-- !query
+SELECT try_divide(1, decimal(0))
+-- !query schema
+struct
+-- !query output
+NULL
+
+
+-- !query
+SELECT try_divide(1, "0")
+-- !query schema
+struct
+-- !query output
+NULL
+
+
 -- !query
 SELECT try_divide(interval 2 year, 2)
 -- !query schema
@@ -313,6 +345,22 @@ struct
 NULL
 
 
+-- !query
+SELECT try_subtract(2147483647, decimal(-1))
+-- !query schema
+struct
+-- !query output
+2147483648
+
+
+-- !query
+SELECT try_subtract(2147483647, "-1")
+-- !query schema
+struct
+-- !query output
+2.147483648E9
+
+
 -- !query
 SELECT try_subtract(-2147483648, 1)
 -- !query schema
@@ -409,6 +457,22 @@ struct
 NULL
 
 
+-- !query
+SELECT try_multiply(2147483647, decimal(-2))
+-- !query schema
+struct
+-- !query output
+-4294967294
+
+
+-- !query
+SELECT try_multiply(2147483647, "-2")
+-- !query schema
+struct
+-- !query output
+-4.294967294E9
+
+
 -- !query
 SELECT try_multiply(-2147483648, 2)
 -- !query schema





(spark) branch master updated: [SPARK-48047][SQL] Reduce memory pressure of empty TreeNode tags

2024-04-30 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 02206cd66dbf [SPARK-48047][SQL] Reduce memory pressure of empty 
TreeNode tags
02206cd66dbf is described below

commit 02206cd66dbfc8de602a685b032f1805bcf8e36f
Author: Nick Young 
AuthorDate: Tue Apr 30 22:07:20 2024 -0700

[SPARK-48047][SQL] Reduce memory pressure of empty TreeNode tags

### What changes were proposed in this pull request?

- Changed the `tags` variable of the `TreeNode` class to initialize lazily. 
This will reduce unnecessary driver memory pressure.

### Why are the changes needed?

- Plans with large expression or operator trees are known to cause driver 
memory pressure; this is one step in alleviating that issue.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing UT covers behavior. Outwards facing behavior does not change.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46285 from n-young-db/treenode-tags.

Authored-by: Nick Young 
Signed-off-by: Dongjoon Hyun 
---
 .../apache/spark/sql/catalyst/trees/TreeNode.scala | 24 ++
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala
index 94e893d468b3..dd39f3182bfb 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala
@@ -78,8 +78,16 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]]
   /**
* A mutable map for holding auxiliary information of this tree node. It 
will be carried over
* when this node is copied via `makeCopy`, or transformed via 
`transformUp`/`transformDown`.
+   * We lazily evaluate the `tags` since the default size of a `mutable.Map` 
is nonzero. This
+   * will reduce unnecessary memory pressure.
*/
-  private val tags: mutable.Map[TreeNodeTag[_], Any] = mutable.Map.empty
+  private[this] var _tags: mutable.Map[TreeNodeTag[_], Any] = null
+  private def tags: mutable.Map[TreeNodeTag[_], Any] = {
+if (_tags eq null) {
+  _tags = mutable.Map.empty
+}
+_tags
+  }
 
   /**
* Default tree pattern [[BitSet] for a [[TreeNode]].
@@ -147,11 +155,13 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]]
 ineffectiveRules.get(ruleId.id)
   }
 
+  def isTagsEmpty: Boolean = (_tags eq null) || _tags.isEmpty
+
   def copyTagsFrom(other: BaseType): Unit = {
 // SPARK-32753: it only makes sense to copy tags to a new node
 // but it's too expensive to detect other cases likes node removal
 // so we make a compromise here to copy tags to node with no tags
-if (tags.isEmpty) {
+if (isTagsEmpty && !other.isTagsEmpty) {
   tags ++= other.tags
 }
   }
@@ -161,11 +171,17 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]]
   }
 
   def getTagValue[T](tag: TreeNodeTag[T]): Option[T] = {
-tags.get(tag).map(_.asInstanceOf[T])
+if (isTagsEmpty) {
+  None
+} else {
+  tags.get(tag).map(_.asInstanceOf[T])
+}
   }
 
   def unsetTagValue[T](tag: TreeNodeTag[T]): Unit = {
-tags -= tag
+if (!isTagsEmpty) {
+  tags -= tag
+}
   }
 
   /**
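
The pattern, as a standalone sketch (not Spark's `TreeNode` itself): keep the map `null` until the first tag is written, so the many nodes that never carry a tag skip the map allocation entirely. Class and method names below are illustrative.
```scala
import scala.collection.mutable

class Node {
  private[this] var _tags: mutable.Map[String, Any] = null   // no map allocated up front

  private def tags: mutable.Map[String, Any] = {
    if (_tags eq null) _tags = mutable.Map.empty              // allocate only on first use
    _tags
  }

  def isTagsEmpty: Boolean = (_tags eq null) || _tags.isEmpty

  def setTag(key: String, value: Any): Unit = tags(key) = value  // write path forces allocation
  def getTag(key: String): Option[Any] =
    if (isTagsEmpty) None else tags.get(key)                     // read path never allocates
}
```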





(spark) branch master updated: [SPARK-48063][CORE] Enable `spark.stage.ignoreDecommissionFetchFailure` by default

2024-04-30 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f3cc8f930383 [SPARK-48063][CORE] Enable 
`spark.stage.ignoreDecommissionFetchFailure` by default
f3cc8f930383 is described below

commit f3cc8f930383659b9f99e56b38de4b97d588e20b
Author: Dongjoon Hyun 
AuthorDate: Tue Apr 30 15:19:00 2024 -0700

[SPARK-48063][CORE] Enable `spark.stage.ignoreDecommissionFetchFailure` by 
default

### What changes were proposed in this pull request?

This PR aims to **enable `spark.stage.ignoreDecommissionFetchFailure` by
default** for Apache Spark 4.0.0 while leaving
`spark.scheduler.maxRetainedRemovedDecommissionExecutors=0` unchanged, so that
a user can turn this feature on by setting only one configuration,
`spark.scheduler.maxRetainedRemovedDecommissionExecutors`.

### Why are the changes needed?

This feature was added in Apache Spark 3.4.0 via SPARK-40481 and SPARK-40979,
and has been used for two years in production to support executor
decommissioning.
- #37924
- #38441

### Does this PR introduce _any_ user-facing change?

No because `spark.scheduler.maxRetainedRemovedDecommissionExecutors` is 
still `0`.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46308 from dongjoon-hyun/SPARK-48063.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 core/src/main/scala/org/apache/spark/internal/config/package.scala | 2 +-
 docs/configuration.md  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/internal/config/package.scala 
b/core/src/main/scala/org/apache/spark/internal/config/package.scala
index b2cbb6f6deb6..2e207422ae06 100644
--- a/core/src/main/scala/org/apache/spark/internal/config/package.scala
+++ b/core/src/main/scala/org/apache/spark/internal/config/package.scala
@@ -2403,7 +2403,7 @@ package object config {
 s"count ${STAGE_MAX_CONSECUTIVE_ATTEMPTS.key}")
   .version("3.4.0")
   .booleanConf
-  .createWithDefault(false)
+  .createWithDefault(true)
 
   private[spark] val SCHEDULER_MAX_RETAINED_REMOVED_EXECUTORS =
 ConfigBuilder("spark.scheduler.maxRetainedRemovedDecommissionExecutors")
diff --git a/docs/configuration.md b/docs/configuration.md
index d5e2a569fdea..2e612ffd9ab9 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -3072,7 +3072,7 @@ Apart from these, the following properties are also 
available, and may be useful
 
 
   spark.stage.ignoreDecommissionFetchFailure
-  false
+  true
   
 Whether ignore stage fetch failure caused by executor decommission when
 count spark.stage.maxConsecutiveAttempts
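
A minimal sketch of opting in after this change; with the new default, only the retained-executor count still needs an explicit value. The app name and the value `3` are illustrative, not recommendations:
```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// spark.stage.ignoreDecommissionFetchFailure is now true by default, so a single
// setting activates decommission-aware fetch-failure handling.
val conf = new SparkConf()
  .setAppName("decommission-demo")
  .set("spark.scheduler.maxRetainedRemovedDecommissionExecutors", "3")

val spark = SparkSession.builder().config(conf).getOrCreate()
```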





(spark) branch master updated: [SPARK-48060][SS][TESTS] Fix `StreamingQueryHashPartitionVerifySuite` to update golden files correctly

2024-04-30 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new faab553cac70 [SPARK-48060][SS][TESTS] Fix 
`StreamingQueryHashPartitionVerifySuite` to update golden files correctly
faab553cac70 is described below

commit faab553cac70eefeec286b1823b70ad62bed87f8
Author: Dongjoon Hyun 
AuthorDate: Tue Apr 30 12:50:07 2024 -0700

[SPARK-48060][SS][TESTS] Fix `StreamingQueryHashPartitionVerifySuite` to 
update golden files correctly

### What changes were proposed in this pull request?

This PR aims to fix `StreamingQueryHashPartitionVerifySuite` to update 
golden files correctly.
- The documentation is added.
- Newly generated files are updated.

### Why are the changes needed?

Previously, `SPARK_GENERATE_GOLDEN_FILES` didn't work as expected because it
updated the golden files under the `target` directory, while the files under
`src/test` are the ones that need to be updated.

**BEFORE**
```
$ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly 
*StreamingQueryHashPartitionVerifySuite"

$ git status
On branch master
Your branch is up to date with 'apache/master'.

nothing to commit, working tree clean
```

**AFTER**
```
$ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly 
*StreamingQueryHashPartitionVerifySuite" \
-Dspark.sql.test.randomDataGenerator.maxStrLen=100 \
-Dspark.sql.test.randomDataGenerator.maxArraySize=4

$ git status
On branch SPARK-48060
    Your branch is up to date with 'dongjoon/SPARK-48060'.

Changes not staged for commit:
  (use "git add ..." to update what will be committed)
  (use "git restore ..." to discard changes in working directory)
modified:   
sql/core/src/test/resources/structured-streaming/partition-tests/randomSchemas
modified:   
sql/core/src/test/resources/structured-streaming/partition-tests/rowsAndPartIds

no changes added to commit (use "git add" and/or "git commit -a")
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs. I regenerate the data like the following.

```
$ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly 
*StreamingQueryHashPartitionVerifySuite" \
-Dspark.sql.test.randomDataGenerator.maxStrLen=100 \
-Dspark.sql.test.randomDataGenerator.maxArraySize=4
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46304 from dongjoon-hyun/SPARK-48060.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../partition-tests/randomSchemas  |   2 +-
 .../partition-tests/rowsAndPartIds | Bin 4862115 -> 13341426 
bytes
 .../StreamingQueryHashPartitionVerifySuite.scala   |  22 +++--
 3 files changed, 17 insertions(+), 7 deletions(-)

diff --git 
a/sql/core/src/test/resources/structured-streaming/partition-tests/randomSchemas
 
b/sql/core/src/test/resources/structured-streaming/partition-tests/randomSchemas
index 8d6ff942610c..f6eadd776cc6 100644
--- 
a/sql/core/src/test/resources/structured-streaming/partition-tests/randomSchemas
+++ 
b/sql/core/src/test/resources/structured-streaming/partition-tests/randomSchemas
@@ -1 +1 @@
-col_0 STRUCT NOT 
NULL, col_3: FLOAT NOT NULL, col_4: INT NOT NULL>,col_1 STRUCT, col_3: 
ARRAY NOT NULL, col_4: ARRAY, col_5: TIMESTAMP NOT NULL, col_6: 
STRUCT, col_1: BIGINT NOT NULL> NOT NULL, col_7: 
ARRAY NOT NULL, col_8: ARRAY, col_9: BIGINT NOT NULL> NOT 
NULL,col_2 BIGINT NOT NULL,col_3 STRUCT,col_1 STRUCT NOT NULL,col_2 STRING NOT 
NULL,col_3 STRUCT, col_2: ARRAY NOT 
NULL> NOT NULL,col_4 BINARY NOT NULL,col_5 ARRAY NOT NULL,col_6 
ARRAY,col_7 DOUBLE NOT NULL,col_8 ARRAY NOT NULL,col_9 
ARRAY,col_10 FLOAT NOT NULL,col_11 STRUCT NOT NULL>, col_1: STRUCT NOT NULL, col_1: 
INT, col_2: STRUCT

(spark) branch master updated: [SPARK-48057][PYTHON][CONNECT][TESTS] Enable `GroupedApplyInPandasTests.test_grouped_with_empty_partition`

2024-04-30 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new dab20b31388b [SPARK-48057][PYTHON][CONNECT][TESTS] Enable 
`GroupedApplyInPandasTests.test_grouped_with_empty_partition`
dab20b31388b is described below

commit dab20b31388ba7bcd2ab4d4424cbbd072bf84c30
Author: Ruifeng Zheng 
AuthorDate: Tue Apr 30 12:19:18 2024 -0700

[SPARK-48057][PYTHON][CONNECT][TESTS] Enable 
`GroupedApplyInPandasTests.test_grouped_with_empty_partition`

### What changes were proposed in this pull request?
Enable `GroupedApplyInPandasTests.test_grouped_with_empty_partition`

### Why are the changes needed?
test coverage

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46299 from zhengruifeng/fix_test_grouped_with_empty_partition.

Authored-by: Ruifeng Zheng 
Signed-off-by: Dongjoon Hyun 
---
 python/pyspark/sql/tests/connect/test_parity_pandas_grouped_map.py | 4 
 python/pyspark/sql/tests/pandas/test_pandas_grouped_map.py | 4 ++--
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/python/pyspark/sql/tests/connect/test_parity_pandas_grouped_map.py 
b/python/pyspark/sql/tests/connect/test_parity_pandas_grouped_map.py
index 1cc4ce012623..8a1da440c799 100644
--- a/python/pyspark/sql/tests/connect/test_parity_pandas_grouped_map.py
+++ b/python/pyspark/sql/tests/connect/test_parity_pandas_grouped_map.py
@@ -38,10 +38,6 @@ class 
GroupedApplyInPandasTests(GroupedApplyInPandasTestsMixin, ReusedConnectTes
 def test_apply_in_pandas_returning_incompatible_type(self):
 super().test_apply_in_pandas_returning_incompatible_type()
 
-@unittest.skip("Spark Connect doesn't support RDD but the test depends on 
it.")
-def test_grouped_with_empty_partition(self):
-super().test_grouped_with_empty_partition()
-
 
 if __name__ == "__main__":
 from pyspark.sql.tests.connect.test_parity_pandas_grouped_map import *  # 
noqa: F401
diff --git a/python/pyspark/sql/tests/pandas/test_pandas_grouped_map.py 
b/python/pyspark/sql/tests/pandas/test_pandas_grouped_map.py
index f43dafc0a4a1..1e86e12eb74f 100644
--- a/python/pyspark/sql/tests/pandas/test_pandas_grouped_map.py
+++ b/python/pyspark/sql/tests/pandas/test_pandas_grouped_map.py
@@ -680,13 +680,13 @@ class GroupedApplyInPandasTestsMixin:
 data = [Row(id=1, x=2), Row(id=1, x=3), Row(id=2, x=4)]
 expected = [Row(id=1, x=5), Row(id=1, x=5), Row(id=2, x=4)]
 num_parts = len(data) + 1
-df = self.spark.createDataFrame(self.sc.parallelize(data, 
numSlices=num_parts))
+df = self.spark.createDataFrame(data).repartition(num_parts)
 
 f = pandas_udf(
 lambda pdf: pdf.assign(x=pdf["x"].sum()), "id long, x int", 
PandasUDFType.GROUPED_MAP
 )
 
-result = df.groupBy("id").apply(f).collect()
+result = df.groupBy("id").apply(f).sort("id").collect()
 self.assertEqual(result, expected)
 
 def test_grouped_over_window(self):





(spark) branch master updated (0329479acb67 -> 9caa6f7f8b8e)

2024-04-30 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 0329479acb67 [SPARK-47359][SQL] Support TRANSLATE function to work 
with collated strings
 add 9caa6f7f8b8e [SPARK-48061][SQL][TESTS] Parameterize max limits of 
`spark.sql.test.randomDataGenerator`

No new revisions were added by this update.

Summary of changes:
 .../test/scala/org/apache/spark/sql/RandomDataGenerator.scala| 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)





(spark) branch master updated: [SPARK-46122][SQL] Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default

2024-04-30 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9e8c4aa3f43a [SPARK-46122][SQL] Set 
`spark.sql.legacy.createHiveTableByDefault` to `false` by default
9e8c4aa3f43a is described below

commit 9e8c4aa3f43a3d99bff56cca319db623abc473ee
Author: Dongjoon Hyun 
AuthorDate: Tue Apr 30 01:44:37 2024 -0700

[SPARK-46122][SQL] Set `spark.sql.legacy.createHiveTableByDefault` to 
`false` by default

### What changes were proposed in this pull request?

This PR aims to switch `spark.sql.legacy.createHiveTableByDefault` to `false`
by default in order to move away from this legacy behavior starting with
`Apache Spark 4.0.0`, while the legacy functionality remains available
throughout the Apache Spark 4.x period by setting
`spark.sql.legacy.createHiveTableByDefault=true`.

### Why are the changes needed?

Historically, this behavior change was first merged during the `Apache Spark
3.0.0` development in SPARK-30098 and was officially reverted during the
`3.0.0 RC` period.

- 2019-12-06: #26736 (58be82a)
- 2019-12-06: 
https://lists.apache.org/thread/g90dz1og1zt4rr5h091rn1zqo50y759j
- 2020-05-16: #28517

At `Apache Spark 3.1.0`, we had another discussion and defined it as 
`Legacy` behavior via a new configuration by reusing the JIRA ID, SPARK-30098.
- 2020-12-01: 
https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
- 2020-12-03: #30554

Last year, this was proposed again twice and `Apache Spark 4.0.0` is a good 
time to make a decision for Apache Spark future direction.
- SPARK-42603 on 2023-02-27 as an independent idea.
- SPARK-46122 on 2023-11-27 as a part of Apache Spark 4.0.0 idea

### Does this PR introduce _any_ user-facing change?

Yes, the migration document is updated.

### How was this patch tested?

Pass the CIs with the adjusted test cases.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46207 from dongjoon-hyun/SPARK-46122.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 docs/sql-migration-guide.md   | 1 +
 python/pyspark/sql/tests/test_readwriter.py   | 5 ++---
 .../src/main/scala/org/apache/spark/sql/internal/SQLConf.scala| 2 +-
 .../apache/spark/sql/execution/command/PlanResolutionSuite.scala  | 8 +++-
 4 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index 1e0fdadde1e3..07562babc87d 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -25,6 +25,7 @@ license: |
 ## Upgrading from Spark SQL 3.5 to 4.0
 
 - Since Spark 4.0, `spark.sql.ansi.enabled` is on by default. To restore the 
previous behavior, set `spark.sql.ansi.enabled` to `false` or 
`SPARK_ANSI_SQL_MODE` to `false`.
+- Since Spark 4.0, `CREATE TABLE` syntax without `USING` and `STORED AS` will 
use the value of `spark.sql.sources.default` as the table provider instead of 
`Hive`. To restore the previous behavior, set 
`spark.sql.legacy.createHiveTableByDefault` to `true`.
 - Since Spark 4.0, the default behaviour when inserting elements in a map is 
changed to first normalize keys -0.0 to 0.0. The affected SQL functions are 
`create_map`, `map_from_arrays`, `map_from_entries`, and `map_concat`. To 
restore the previous behaviour, set 
`spark.sql.legacy.disableMapKeyNormalization` to `true`.
 - Since Spark 4.0, the default value of `spark.sql.maxSinglePartitionBytes` is 
changed from `Long.MaxValue` to `128m`. To restore the previous behavior, set 
`spark.sql.maxSinglePartitionBytes` to `9223372036854775807`(`Long.MaxValue`).
 - Since Spark 4.0, any read of SQL tables takes into consideration the SQL 
configs 
`spark.sql.files.ignoreCorruptFiles`/`spark.sql.files.ignoreMissingFiles` 
instead of the core config 
`spark.files.ignoreCorruptFiles`/`spark.files.ignoreMissingFiles`.
diff --git a/python/pyspark/sql/tests/test_readwriter.py 
b/python/pyspark/sql/tests/test_readwriter.py
index 5784d2c72973..e752856d0316 100644
--- a/python/pyspark/sql/tests/test_readwriter.py
+++ b/python/pyspark/sql/tests/test_readwriter.py
@@ -247,10 +247,9 @@ class ReadwriterV2TestsMixin:
 
 def test_create_without_provider(self):
 df = self.df
-with self.assertRaisesRegex(
-AnalysisException, "NOT_SUPPORTED_COMMAND_WITHOUT_HIVE_SUPPORT"
-):
+with self.table("test_table"):
 df.writeTo("test_table").create()
+self.assertEqual(100, self.spark.sql("select * from 
test_table").count())
 
 def test_table_overwrite(self):
 df = self.df
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.
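
A minimal sketch of the behavior change described in the migration-guide entry above; the table names are illustrative, and the session-level toggle mirrors what the guide documents:
```scala
// Spark 4.0 default: a CREATE TABLE without USING / STORED AS now picks up
// spark.sql.sources.default (e.g. parquet) instead of creating a Hive table.
spark.sql("CREATE TABLE t_native (id INT, name STRING)")

// Restore the legacy behavior for the session, as described in the migration guide:
spark.sql("SET spark.sql.legacy.createHiveTableByDefault=true")
spark.sql("CREATE TABLE t_hive (id INT, name STRING)")  // created as a Hive table again
```
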

(spark) branch master updated: [SPARK-48042][SQL] Use a timestamp formatter with timezone at class level instead of making copies at method level

2024-04-29 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c9ed9dfccb72 [SPARK-48042][SQL] Use a timestamp formatter with 
timezone at class level instead of making copies at method level
c9ed9dfccb72 is described below

commit c9ed9dfccb72bc8d30557dcd2809c298a75c3f69
Author: Kent Yao 
AuthorDate: Mon Apr 29 11:13:39 2024 -0700

[SPARK-48042][SQL] Use a timestamp formatter with timezone at class level 
instead of making copies at method level

### What changes were proposed in this pull request?

This PR creates a timestamp formatter with the timezone directly for 
formatting. Previously, we called `withZone` for every value in the `format` 
function. Because the original `zoneId` in the formatter is null and never 
equals the one we pass in, it creates new copies of the formatter over and over.

```java
...
 *
 * @param zone  the new override zone, null if no override
 * @return a formatter based on this formatter with the requested override zone, not null
 */
public DateTimeFormatter withZone(ZoneId zone) {
if (Objects.equals(this.zone, zone)) {
return this;
}
return new DateTimeFormatter(printerParser, locale, decimalStyle, 
resolverStyle, resolverFields, chrono, zone);
}
```

### Why are the changes needed?

improvement
### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

- Existing tests
- I also ran the DateTimeBenchmark result locally, there's no performance 
gain at least for these cases.

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46282 from yaooqinn/SPARK-48042.

Authored-by: Kent Yao 
Signed-off-by: Dongjoon Hyun 
---
 .../org/apache/spark/sql/catalyst/util/TimestampFormatter.scala  | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git 
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
 
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
index d59b52a3818a..9f57f8375c54 100644
--- 
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
+++ 
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
@@ -162,6 +162,9 @@ class Iso8601TimestampFormatter(
   protected lazy val formatter: DateTimeFormatter =
 getOrCreateFormatter(pattern, locale, isParsing)
 
+  @transient
+  private lazy val zonedFormatter: DateTimeFormatter = 
formatter.withZone(zoneId)
+
   @transient
   protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
 pattern, zoneId, locale, legacyFormat)
@@ -231,7 +234,7 @@ class Iso8601TimestampFormatter(
 
   override def format(instant: Instant): String = {
 try {
-  formatter.withZone(zoneId).format(instant)
+  zonedFormatter.format(instant)
 } catch checkFormattedDiff(toJavaTimestamp(instantToMicros(instant)),
   (t: Timestamp) => format(t))
   }
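
For context, a standalone java.time sketch of why caching helps: `withZone` returns a new `DateTimeFormatter` whenever the zones differ, and the base formatter's zone is null, so hoisting the zoned copy out of the per-value path removes one allocation per formatted timestamp. Pattern and zone below are arbitrary examples:
```scala
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter
import java.util.Locale

// Build the zoned formatter once instead of calling withZone for every value.
val base = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss", Locale.US)  // zone is null here
val zoned = base.withZone(ZoneId.of("UTC"))

val rendered = (0 until 3).map(h => zoned.format(Instant.ofEpochSecond(h * 3600L)))
rendered.foreach(println)
```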





(spark) branch master updated (f781d153a5e4 -> c35a21e5984f)

2024-04-29 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from f781d153a5e4 [SPARK-48046][K8S] Remove `clock` parameter from 
`DriverServiceFeatureStep`
 add c35a21e5984f [SPARK-48044][PYTHON][CONNECT] Cache 
`DataFrame.isStreaming`

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/dataframe.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)





(spark) branch master updated (d42c10d9411d -> f781d153a5e4)

2024-04-29 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from d42c10d9411d [SPARK-47693][TESTS][FOLLOWUP] Reduce CollationBenchmarks 
time
 add f781d153a5e4 [SPARK-48046][K8S] Remove `clock` parameter from 
`DriverServiceFeatureStep`

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/deploy/k8s/features/DriverServiceFeatureStep.scala   | 4 +---
 .../spark/deploy/k8s/features/DriverServiceFeatureStepSuite.scala | 2 +-
 2 files changed, 2 insertions(+), 4 deletions(-)





(spark) branch master updated (ccb0eb699f7c -> d42c10d9411d)

2024-04-29 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from ccb0eb699f7c [SPARK-48038][K8S] Promote driverServiceName to 
KubernetesDriverConf
 add d42c10d9411d [SPARK-47693][TESTS][FOLLOWUP] Reduce CollationBenchmarks 
time

No new revisions were added by this update.

Summary of changes:
 .../execution/benchmark/CollationBenchmark.scala   | 38 --
 1 file changed, 20 insertions(+), 18 deletions(-)





(spark) branch master updated: [SPARK-48038][K8S] Promote driverServiceName to KubernetesDriverConf

2024-04-29 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ccb0eb699f7c [SPARK-48038][K8S] Promote driverServiceName to 
KubernetesDriverConf
ccb0eb699f7c is described below

commit ccb0eb699f7c54aa3902d1ebbb34684693b563de
Author: Cheng Pan 
AuthorDate: Mon Apr 29 08:35:13 2024 -0700

[SPARK-48038][K8S] Promote driverServiceName to KubernetesDriverConf

### What changes were proposed in this pull request?

Promote `driverServiceName` from `DriverServiceFeatureStep` to 
`KubernetesDriverConf`.

### Why are the changes needed?

To allow other feature steps, e.g. ingress (proposed in SPARK-47954), to
access `driverServiceName`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

UT has been updated.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46276 from pan3793/SPARK-48038.

Authored-by: Cheng Pan 
Signed-off-by: Dongjoon Hyun 
---
 .../apache/spark/deploy/k8s/KubernetesConf.scala   | 22 +++---
 .../k8s/features/DriverServiceFeatureStep.scala| 14 ++
 .../spark/deploy/k8s/KubernetesTestConf.scala  |  6 --
 .../features/DriverServiceFeatureStepSuite.scala   | 17 +
 4 files changed, 34 insertions(+), 25 deletions(-)

diff --git 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala
 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala
index b55f9317d10b..fda772b737fe 100644
--- 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala
+++ 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala
@@ -24,12 +24,13 @@ import org.apache.commons.lang3.StringUtils
 import org.apache.spark.{SPARK_VERSION, SparkConf}
 import org.apache.spark.deploy.k8s.Config._
 import org.apache.spark.deploy.k8s.Constants._
+import org.apache.spark.deploy.k8s.features.DriverServiceFeatureStep._
 import org.apache.spark.deploy.k8s.submit._
 import org.apache.spark.internal.{Logging, MDC}
 import org.apache.spark.internal.LogKeys.{CONFIG, EXECUTOR_ENV_REGEX}
 import org.apache.spark.internal.config.ConfigEntry
 import org.apache.spark.resource.ResourceProfile.DEFAULT_RESOURCE_PROFILE_ID
-import org.apache.spark.util.Utils
+import org.apache.spark.util.{Clock, SystemClock, Utils}
 
 /**
  * Structure containing metadata for Kubernetes logic to build Spark pods.
@@ -83,12 +84,27 @@ private[spark] class KubernetesDriverConf(
 val mainAppResource: MainAppResource,
 val mainClass: String,
 val appArgs: Array[String],
-val proxyUser: Option[String])
-  extends KubernetesConf(sparkConf) {
+val proxyUser: Option[String],
+clock: Clock = new SystemClock())
+  extends KubernetesConf(sparkConf) with Logging {
 
   def driverNodeSelector: Map[String, String] =
 KubernetesUtils.parsePrefixedKeyValuePairs(sparkConf, 
KUBERNETES_DRIVER_NODE_SELECTOR_PREFIX)
 
+  lazy val driverServiceName: String = {
+val preferredServiceName = s"$resourceNamePrefix$DRIVER_SVC_POSTFIX"
+if (preferredServiceName.length <= MAX_SERVICE_NAME_LENGTH) {
+  preferredServiceName
+} else {
+  val randomServiceId = KubernetesUtils.uniqueID(clock)
+  val shorterServiceName = s"spark-$randomServiceId$DRIVER_SVC_POSTFIX"
+  logWarning(s"Driver's hostname would preferably be 
$preferredServiceName, but this is " +
+s"too long (must be <= $MAX_SERVICE_NAME_LENGTH characters). Falling 
back to use " +
+s"$shorterServiceName as the driver service's name.")
+  shorterServiceName
+}
+  }
+
   override val resourceNamePrefix: String = {
 val custom = if (Utils.isTesting) get(KUBERNETES_DRIVER_POD_NAME_PREFIX) 
else None
 custom.getOrElse(KubernetesConf.getResourceNamePrefix(appName))
diff --git 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverServiceFeatureStep.scala
 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverServiceFeatureStep.scala
index cba4f442371c..9adfb2b8de49 100644
--- 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverServiceFeatureStep.scala
+++ 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverServiceFeatureStep.scala
@@ -20,7 +20,7 @@ import scala.jdk.CollectionConverters._
 
 import io.fabric8.kubernetes.api.model.{HasMetadata, ServiceBuilder}
 
-import org.apache.spark.deploy.k8s.{KubernetesDriverConf, KubernetesUtils, 
SparkPod}
+

(spark) branch master updated: [MINOR][DOCS] Remove space in the middle of configuration name in Arrow-optimized Python UDF page

2024-04-28 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ff0751a56f01 [MINOR][DOCS] Remove space in the middle of configuration 
name in Arrow-optimized Python UDF page
ff0751a56f01 is described below

commit ff0751a56f010a6bf8a9ae86ddf0868bee615848
Author: Hyukjin Kwon 
AuthorDate: Sun Apr 28 22:34:30 2024 -0700

[MINOR][DOCS] Remove space in the middle of configuration name in 
Arrow-optimized Python UDF page

### What changes were proposed in this pull request?

This PR removes a space in the middle of configuration name in 
Arrow-optimized Python UDF page.

![Screenshot 2024-04-29 at 1 53 42 
PM](https://github.com/apache/spark/assets/6477701/46b7c448-fb30-4838-a5ba-c8f1c23398fd)


https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#arrow-python-udfs

### Why are the changes needed?

So users can copy and paste the configuration names properly.

### Does this PR introduce _any_ user-facing change?

Yes it fixes the doc.

### How was this patch tested?

Manually built the docs, and checked.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46274 from HyukjinKwon/fix-minor-typo.

Authored-by: Hyukjin Kwon 
Signed-off-by: Dongjoon Hyun 
---
 python/docs/source/user_guide/sql/arrow_pandas.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/python/docs/source/user_guide/sql/arrow_pandas.rst 
b/python/docs/source/user_guide/sql/arrow_pandas.rst
index a5dfb9aa4e52..1d6a4df60690 100644
--- a/python/docs/source/user_guide/sql/arrow_pandas.rst
+++ b/python/docs/source/user_guide/sql/arrow_pandas.rst
@@ -339,9 +339,9 @@ Arrow Python UDFs
 Arrow Python UDFs are user defined functions that are executed row-by-row, 
utilizing Arrow for efficient batch data
 transfer and serialization. To define an Arrow Python UDF, you can use the 
:meth:`udf` decorator or wrap the function
 with the :meth:`udf` method, ensuring the ``useArrow`` parameter is set to 
True. Additionally, you can enable Arrow
-optimization for Python UDFs throughout the entire SparkSession by setting the 
Spark configuration ``spark.sql
-.execution.pythonUDF.arrow.enabled`` to true. It's important to note that the 
Spark configuration takes effect only
-when ``useArrow`` is either not set or set to None.
+optimization for Python UDFs throughout the entire SparkSession by setting the 
Spark configuration
+``spark.sql.execution.pythonUDF.arrow.enabled`` to true. It's important to 
note that the Spark configuration takes
+effect only when ``useArrow`` is either not set or set to None.
 
 The type hints for Arrow Python UDFs should be specified in the same way as 
for default, pickled Python UDFs.
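
For illustration, a minimal sketch of the two ways described above to enable Arrow
optimization. It assumes an active `spark` session; the function name and return
type are made up for the example.

```python
from pyspark.sql.functions import udf

# Per-function opt-in: `useArrow=True` requests Arrow-optimized execution for this UDF.
@udf(returnType="int", useArrow=True)
def add_one(x):
    return x + 1

# Session-wide opt-in: applies to Python UDFs whose `useArrow` is not set (or None).
spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", "true")

spark.range(3).select(add_one("id")).show()
```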
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (9a42610d5ad8 -> e1445e3f1cf5)

2024-04-28 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 9a42610d5ad8 [SPARK-48029][INFRA] Update the packages name removed in 
building the spark docker image
 add e1445e3f1cf5 [SPARK-48036][DOCS] Update `sql-ref-ansi-compliance.md` 
and `sql-ref-identifier.md`

No new revisions were added by this update.

Summary of changes:
 docs/sql-ref-ansi-compliance.md | 14 ++
 docs/sql-ref-identifier.md  |  2 +-
 2 files changed, 7 insertions(+), 9 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48029][INFRA] Update the packages name removed in building the spark docker image

2024-04-28 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9a42610d5ad8 [SPARK-48029][INFRA] Update the packages name removed in 
building the spark docker image
9a42610d5ad8 is described below

commit 9a42610d5ad8ae0ded92fb68c7617861cfe975e1
Author: panbingkun 
AuthorDate: Sun Apr 28 21:43:47 2024 -0700

[SPARK-48029][INFRA] Update the packages name removed in building the spark 
docker image

### What changes were proposed in this pull request?
The PR aims to update the names of the packages removed when building the Spark
docker image.

### Why are the changes needed?
When our default image base was switched from `ubuntu 20.04` to `ubuntu
22.04`, the set of unused installed packages in the base image changed. In order
to eliminate some warnings when building images and to free disk space more
accurately, we need to correct it.

Before:
```
#35 [29/31] RUN apt-get remove --purge -y '^aspnet.*' '^dotnet-.*' 
'^llvm-.*' 'php.*' '^mongodb-.*' snapd google-chrome-stable 
microsoft-edge-stable firefox azure-cli google-cloud-sdk mono-devel 
powershell libgl1-mesa-dri || true
#35 0.489 Reading package lists...
#35 0.505 Building dependency tree...
#35 0.507 Reading state information...
#35 0.511 E: Unable to locate package ^aspnet.*
#35 0.511 E: Couldn't find any package by glob '^aspnet.*'
#35 0.511 E: Couldn't find any package by regex '^aspnet.*'
#35 0.511 E: Unable to locate package ^dotnet-.*
#35 0.511 E: Couldn't find any package by glob '^dotnet-.*'
#35 0.511 E: Couldn't find any package by regex '^dotnet-.*'
#35 0.511 E: Unable to locate package ^llvm-.*
#35 0.511 E: Couldn't find any package by glob '^llvm-.*'
#35 0.511 E: Couldn't find any package by regex '^llvm-.*'
#35 0.511 E: Unable to locate package ^mongodb-.*
#35 0.511 E: Couldn't find any package by glob '^mongodb-.*'
#35 0.511 EPackage 'php-crypt-gpg' is not installed, so not removed
#35 0.511 Package 'php' is not installed, so not removed
#35 0.511 : Couldn't find any package by regex '^mongodb-.*'
#35 0.511 E: Unable to locate package snapd
#35 0.511 E: Unable to locate package google-chrome-stable
#35 0.511 E: Unable to locate package microsoft-edge-stable
#35 0.511 E: Unable to locate package firefox
#35 0.511 E: Unable to locate package azure-cli
#35 0.511 E: Unable to locate package google-cloud-sdk
#35 0.511 E: Unable to locate package mono-devel
#35 0.511 E: Unable to locate package powershell
#35 DONE 0.5s

#36 [30/31] RUN apt-get autoremove --purge -y
#36 0.063 Reading package lists...
#36 0.079 Building dependency tree...
#36 0.082 Reading state information...
#36 0.088 0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
#36 DONE 0.4s
```

After:
```
#38 [32/36] RUN apt-get remove --purge -y 'gfortran-11' 
'humanity-icon-theme' 'nodejs-doc' || true
#38 0.066 Reading package lists...
#38 0.087 Building dependency tree...
#38 0.089 Reading state information...
#38 0.094 The following packages were automatically installed and are no 
longer required:
#38 0.094   at-spi2-core bzip2-doc dbus-user-session dconf-gsettings-backend
#38 0.095   dconf-service gsettings-desktop-schemas gtk-update-icon-cache
#38 0.095   hicolor-icon-theme libatk-bridge2.0-0 libatk1.0-0 libatk1.0-data
#38 0.095   libatspi2.0-0 libbz2-dev libcairo-gobject2 libcolord2 libdconf1 
libepoxy0
#38 0.095   libgfortran-11-dev libgtk-3-common libjs-highlight.js libllvm11
#38 0.095   libncurses-dev libncurses5-dev libphobos2-ldc-shared98 
libreadline-dev
#38 0.095   librsvg2-2 librsvg2-common libvte-2.91-common libwayland-client0
#38 0.095   libwayland-cursor0 libwayland-egl1 libxdamage1 libxkbcommon0
#38 0.095   session-migration tilix-common xkb-data
#38 0.095 Use 'apt autoremove' to remove them.
#38 0.096 The following packages will be REMOVED:
#38 0.096   adwaita-icon-theme* gfortran* gfortran-11* humanity-icon-theme* 
libgtk-3-0*
#38 0.096   libgtk-3-bin* libgtkd-3-0* libvte-2.91-0* libvted-3-0* 
nodejs-doc*
#38 0.096   r-base-dev* tilix* ubuntu-mono*
#38 0.248 0 upgraded, 0 newly installed, 13 to remove and 0 not upgraded.
#38 0.248 After this operation, 99.6 MB disk space will be freed.
...
(Reading database ... 70597 files and directories currently installed.)
#38 0.304 Removing r-base-dev (4.1.2-1ubuntu2) ...
#38 0.319 Removing gfortran (4:11

(spark) branch master updated (3d62dd72a58f -> 8f1634e833ce)

2024-04-28 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 3d62dd72a58f [SPARK-47730][K8S] Support `APP_ID` and `EXECUTOR_ID` 
placeholders in labels
 add 8f1634e833ce [SPARK-48032][BUILD] Upgrade `commons-codec` to 1.17.0

No new revisions were added by this update.

Summary of changes:
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 +-
 pom.xml   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-47730][K8S] Support `APP_ID` and `EXECUTOR_ID` placeholders in labels

2024-04-28 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 3d62dd72a58f [SPARK-47730][K8S] Support `APP_ID` and `EXECUTOR_ID` 
placeholders in labels
3d62dd72a58f is described below

commit 3d62dd72a58f5a19e9a371acc09604ab9ceb9e68
Author: Xi Chen 
AuthorDate: Sun Apr 28 18:30:06 2024 -0700

[SPARK-47730][K8S] Support `APP_ID` and `EXECUTOR_ID` placeholders in labels

### What changes were proposed in this pull request?

Currently, only pod annotations support the `APP_ID` and `EXECUTOR_ID`
placeholders. This commit aims to add the same capability to pod labels.

### Why are the changes needed?

The use case is to support customized labels for availability-zone-based
topology pod affinity. We want to use the Spark application ID as the customized
label value, to allow Spark executor pods to run in the same availability zone
as the Spark driver pod.

Although we could use the internal Spark label `spark-app-selector` directly,
that is not a good practice when it is used along with YuniKorn Gang Scheduling.
When Gang Scheduling is enabled, the YuniKorn placeholder pods should use the
same affinity as the real Spark pods, so we would have to add the internal
`spark-app-selector` label to the placeholder pods. That is undesirable because
the placeholder pods could then be recognized as Spark pods in the monitoring system.

Thus we propose supporting the `APP_ID` and `EXECUTOR_ID` placeholders in 
Spark pod labels as well for flexibility.
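
As a sketch of the intended usage (the label name `topology-group` is
illustrative, and the `{{APP_ID}}` placeholder form follows the one already
documented for pod annotations):

```python
from pyspark.sql import SparkSession

# Both driver and executor pods get a custom label whose value is substituted with
# the Spark application ID when the pods are built, so affinity rules can match on it.
spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.example.com:6443")  # illustrative API server URL
    .config("spark.kubernetes.driver.label.topology-group", "{{APP_ID}}")
    .config("spark.kubernetes.executor.label.topology-group", "{{APP_ID}}")
    .getOrCreate()
)
```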

### Does this PR introduce _any_ user-facing change?

No because the pattern strings are very specific.

### How was this patch tested?

Unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46149 from jshmchenxi/SPARK-47730/support-app-placeholder-in-labels.

Authored-by: Xi Chen 
Signed-off-by: Dongjoon Hyun 
---
 .../scala/org/apache/spark/deploy/k8s/KubernetesConf.scala  | 10 ++
 .../org/apache/spark/deploy/k8s/KubernetesConfSuite.scala   | 13 ++---
 .../deploy/k8s/features/BasicDriverFeatureStepSuite.scala   | 11 +++
 .../spark/deploy/k8s/integrationtest/BasicTestsSuite.scala  |  6 --
 .../spark/deploy/k8s/integrationtest/KubernetesSuite.scala  |  6 --
 5 files changed, 31 insertions(+), 15 deletions(-)

diff --git 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala
 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala
index a1ef04f4e311..b55f9317d10b 100644
--- 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala
+++ 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala
@@ -100,8 +100,9 @@ private[spark] class KubernetesDriverConf(
   SPARK_APP_ID_LABEL -> appId,
   SPARK_APP_NAME_LABEL -> KubernetesConf.getAppNameLabel(appName),
   SPARK_ROLE_LABEL -> SPARK_POD_DRIVER_ROLE)
-val driverCustomLabels = KubernetesUtils.parsePrefixedKeyValuePairs(
-  sparkConf, KUBERNETES_DRIVER_LABEL_PREFIX)
+val driverCustomLabels =
+  KubernetesUtils.parsePrefixedKeyValuePairs(sparkConf, 
KUBERNETES_DRIVER_LABEL_PREFIX)
+.map { case(k, v) => (k, Utils.substituteAppNExecIds(v, appId, "")) }
 
 presetLabels.keys.foreach { key =>
   require(
@@ -173,8 +174,9 @@ private[spark] class KubernetesExecutorConf(
   SPARK_ROLE_LABEL -> SPARK_POD_EXECUTOR_ROLE,
   SPARK_RESOURCE_PROFILE_ID_LABEL -> resourceProfileId.toString)
 
-val executorCustomLabels = KubernetesUtils.parsePrefixedKeyValuePairs(
-  sparkConf, KUBERNETES_EXECUTOR_LABEL_PREFIX)
+val executorCustomLabels =
+  KubernetesUtils.parsePrefixedKeyValuePairs(sparkConf, 
KUBERNETES_EXECUTOR_LABEL_PREFIX)
+.map { case(k, v) => (k, Utils.substituteAppNExecIds(v, appId, 
executorId)) }
 
 presetLabels.keys.foreach { key =>
   require(
diff --git 
a/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/KubernetesConfSuite.scala
 
b/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/KubernetesConfSuite.scala
index 9963db016ad9..3c53e9b74f92 100644
--- 
a/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/KubernetesConfSuite.scala
+++ 
b/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/KubernetesConfSuite.scala
@@ -40,7 +40,9 @@ class KubernetesConfSuite extends SparkFunSuite {
 "execNodeSelectorKey2" -> "execNodeSelectorValue2")
   private val CUSTOM_LABELS = Map(
 "customLabel1Key" -> "customLabe

(spark) branch master updated: [SPARK-48021][ML][BUILD] Add `--add-modules=jdk.incubator.vector` to `JavaModuleOptions`

2024-04-27 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 64d321926bbc [SPARK-48021][ML][BUILD] Add 
`--add-modules=jdk.incubator.vector` to `JavaModuleOptions`
64d321926bbc is described below

commit 64d321926bbcede05d1c145405d503b3431f185b
Author: panbingkun 
AuthorDate: Sat Apr 27 17:38:55 2024 -0700

[SPARK-48021][ML][BUILD] Add `--add-modules=jdk.incubator.vector` to 
`JavaModuleOptions`

### What changes were proposed in this pull request?
The PR aims to:
- add `--add-modules=jdk.incubator.vector` to `JavaModuleOptions`
- remove `jdk.incubator.foreign` and `-Dforeign.restricted=warn` from 
`SparkBuild.scala`

### Why are the changes needed?
1.`jdk.incubator.vector`
First introduction: https://github.com/apache/spark/pull/30810

https://github.com/apache/spark/pull/30810/files#diff-6f545c33f2fcc975200bf208c900a600a593ce6b170180f81e2f93b3efb6cb3e
https://github.com/apache/spark/assets/15246973/6ac7919a-5d82-475c-b8a2-7d9de71acacc

Why should we add `--add-modules=jdk.incubator.vector` to `JavaModuleOptions`?
Because when we only add `--add-modules=jdk.incubator.vector` to
`SparkBuild.scala`, it only takes effect at compile time, as follows:
```
build/sbt "mllib-local/Test/runMain 
org.apache.spark.ml.linalg.BLASBenchmark"
...
```
https://github.com/apache/spark/assets/15246973/54d5f55f-cefe-4126-b255-69488f8699a6

However, when we use `spark-submit`, it is as follows:
```
./bin/spark-submit --class org.apache.spark.ml.linalg.BLASBenchmark 
/Users/panbingkun/Developer/spark/spark-community/mllib-local/target/scala-2.13/spark-mllib-local_2.13-4.0.0-SNAPSHOT-tests.jar
```
https://github.com/apache/spark/assets/15246973/8e02fa93-fef4-4cdc-96bd-908b3e9baea1

Obviously, `--add-modules=jdk.incubator.vector` does not take effect at Spark
runtime, so I propose adding `--add-modules=jdk.incubator.vector` to
`JavaModuleOptions` (the Spark runtime options) so that we can improve
performance by using hardware-accelerated BLAS operations by default.

After this patch (adding `--add-modules=jdk.incubator.vector` to
`JavaModuleOptions`), it is as follows:
https://github.com/apache/spark/assets/15246973/da7aa494-0d3c-4c60-9991-e7cd29a1cec5

2.`jdk.incubator.foreign` and `-Dforeign.restricted=warn`
A.First introduction: https://github.com/apache/spark/pull/32253

https://github.com/apache/spark/pull/32253/files#diff-6f545c33f2fcc975200bf208c900a600a593ce6b170180f81e2f93b3efb6cb3e
https://github.com/apache/spark/assets/15246973/3f526019-c389-4e60-ab2a-f8e99cfb
Spark used `dev.ludovic.netlib:blas:1.3.2`; in that version the class
`ForeignLinkerBLAS` uses `jdk.incubator.foreign.*`, so we needed to add
`jdk.incubator.foreign` and `-Dforeign.restricted=warn` to `SparkBuild.scala`.

https://github.com/apache/spark/pull/32253/files#diff-9c5fb3d1b7e3b0f54bc5c4182965c4fe1f9023d449017cece3005d3f90e8e4d8
https://github.com/apache/spark/assets/15246973/4fd35e96-0da2-4456-a3f6-6b57ad2e9b64

https://github.com/luhenry/netlib/blob/v1.3.2/blas/src/main/java/dev/ludovic/netlib/blas/ForeignLinkerBLAS.java#L36
https://github.com/apache/spark/assets/15246973/4b7e3bd1-4650-4c7d-bdb4-c1761d48d478

However, with the iterative development of `dev.ludovic.netlib`,
`ForeignLinkerBLAS` has undergone one major change, as follows:

https://github.com/luhenry/netlib/commit/48e923c3e5e84560139eb25b3c9df9873c05e41d
https://github.com/apache/spark/assets/15246973/7ba30b19-00c7-4cc4-bea7-a6ab4b326ad8
As of v3.0.0, `jdk.incubator.foreign.*` is no longer used in
`dev.ludovic.netlib`.

Currently, Spark uses `dev.ludovic.netlib` version `v3.0.3`. In this version,
`ForeignLinkerBLAS` has been removed.
https://github.com/apache/spark/blob/master/pom.xml#L191

Double check (`jdk.incubator.foreign` cannot be found in the `netlib` 
source code):
https://github.com/apache/spark/assets/15246973/5c6c6d73-6a5d-427a-9fb4-f626f02335ca

So we can completely remove options `jdk.incubator.foreign` and 
`-Dforeign.restricted=warn`.

B.For JDK 21
(PS: This is to explain the historical reasons for the differences between 
the current code logic and the initial ones)
(Just because `Spark` made changes to support `JDK 21`)
https://issues.apache.org/jira/browse/SPARK-44088
https://github.com/apache/spark/assets/15246973/34e7e7e8-4e72-470e-abc0-d79406ad25e5

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Manually test
- Pass G

(spark) branch master updated: [SPARK-47408][SQL] Fix mathExpressions that use StringType

2024-04-26 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b623601910a3 [SPARK-47408][SQL] Fix mathExpressions that use StringType
b623601910a3 is described below

commit b623601910a37c863edac56d18e79a44b93c5b36
Author: Mihailo Milosevic 
AuthorDate: Fri Apr 26 19:48:27 2024 -0700

[SPARK-47408][SQL] Fix mathExpressions that use StringType

### What changes were proposed in this pull request?
Support more functions that use strings with collations.

### Why are the changes needed?
Hex, Unhex, Conv are widely used and need to be enabled with collations.
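
For illustration, a sketch of what this enables (assuming an active `spark`
session and Spark 4.0 collation support; the collation names are examples):

```python
# hex/unhex/conv now accept string arguments with a non-default collation,
# and the result keeps the input's string type instead of plain StringType.
spark.sql("SELECT hex(collate('Spark', 'UNICODE'))").show()
spark.sql("SELECT conv(collate('100', 'UTF8_BINARY'), 2, 10)").show()
```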

### Does this PR introduce _any_ user-facing change?
Yes, enabled more functions.

### How was this patch tested?
With new tests in `CollationSQLExpressionsSuite.scala`.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46227 from mihailom-db/SPARK-47408.

Lead-authored-by: Mihailo Milosevic 
Co-authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun 
---
 .../sql/catalyst/expressions/mathExpressions.scala |  21 ++--
 .../catalyst/expressions/stringExpressions.scala   |   2 +-
 .../spark/sql/CollationSQLExpressionsSuite.scala   | 124 +
 3 files changed, 138 insertions(+), 9 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala
index 0c09e9be12e9..dc50c18f2ebb 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala
@@ -30,6 +30,7 @@ import 
org.apache.spark.sql.catalyst.expressions.codegen.Block._
 import org.apache.spark.sql.catalyst.util.{MathUtils, NumberConverter, 
TypeUtils}
 import org.apache.spark.sql.errors.{QueryCompilationErrors, 
QueryExecutionErrors}
 import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.internal.types.StringTypeAnyCollation
 import org.apache.spark.sql.types._
 import org.apache.spark.unsafe.types.UTF8String
 
@@ -450,8 +451,9 @@ case class Conv(
   override def first: Expression = numExpr
   override def second: Expression = fromBaseExpr
   override def third: Expression = toBaseExpr
-  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, 
IntegerType, IntegerType)
-  override def dataType: DataType = StringType
+  override def inputTypes: Seq[AbstractDataType] =
+Seq(StringTypeAnyCollation, IntegerType, IntegerType)
+  override def dataType: DataType = first.dataType
   override def nullable: Boolean = true
 
   override def nullSafeEval(num: Any, fromBase: Any, toBase: Any): Any = {
@@ -1002,7 +1004,7 @@ case class Bin(child: Expression)
   extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant with 
Serializable {
 
   override def inputTypes: Seq[DataType] = Seq(LongType)
-  override def dataType: DataType = StringType
+  override def dataType: DataType = SQLConf.get.defaultStringType
 
   protected override def nullSafeEval(input: Any): Any =
 UTF8String.fromString(jl.Long.toBinaryString(input.asInstanceOf[Long]))
@@ -1108,21 +1110,24 @@ case class Hex(child: Expression)
   extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant {
 
   override def inputTypes: Seq[AbstractDataType] =
-Seq(TypeCollection(LongType, BinaryType, StringType))
+Seq(TypeCollection(LongType, BinaryType, StringTypeAnyCollation))
 
-  override def dataType: DataType = StringType
+  override def dataType: DataType = child.dataType match {
+case st: StringType => st
+case _ => SQLConf.get.defaultStringType
+  }
 
   protected override def nullSafeEval(num: Any): Any = child.dataType match {
 case LongType => Hex.hex(num.asInstanceOf[Long])
 case BinaryType => Hex.hex(num.asInstanceOf[Array[Byte]])
-case StringType => Hex.hex(num.asInstanceOf[UTF8String].getBytes)
+case _: StringType => Hex.hex(num.asInstanceOf[UTF8String].getBytes)
   }
 
   override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): 
ExprCode = {
 nullSafeCodeGen(ctx, ev, (c) => {
   val hex = Hex.getClass.getName.stripSuffix("$")
   s"${ev.value} = " + (child.dataType match {
-case StringType => s"""$hex.hex($c.getBytes());"""
+case _: StringType => s"""$hex.hex($c.getBytes());"""
 case _ => s"""$hex.hex($c);"""
   })
 })
@@ -1149,7 +1154,7 @@ case class Unhex(child: Expression, failOnError: Boolean 

(spark-kubernetes-operator) branch main updated: [SPARK-48015] Update `build.gradle` to fix deprecation warnings

2024-04-26 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/spark-kubernetes-operator.git


The following commit(s) were added to refs/heads/main by this push:
 new 167047a  [SPARK-48015] Update `build.gradle` to fix deprecation 
warnings
167047a is described below

commit 167047abed12ea8e6d709dbb3c6c326330d5787e
Author: Dongjoon Hyun 
AuthorDate: Fri Apr 26 14:58:08 2024 -0700

[SPARK-48015] Update `build.gradle` to fix deprecation warnings

### What changes were proposed in this pull request?

This PR aims to update `build.gradle` to fix deprecation warnings.

### Why are the changes needed?

**AFTER**
```
$ ./gradlew build --warning-mode all

> Configure project :spark-operator-api
Updating PrinterColumns for generated CRD

BUILD SUCCESSFUL in 331ms
16 actionable tasks: 16 up-to-date
```

**BEFORE**
```
$ ./gradlew build --warning-mode all

> Configure project :
Build file '/Users/dongjoon/APACHE/spark-kubernetes-operator/build.gradle': 
line 20
The org.gradle.api.plugins.JavaPluginConvention type has been deprecated. 
This is scheduled to be removed in Gradle 9.0. Consult the upgrading guide for 
further information: 
https://docs.gradle.org/8.7/userguide/upgrading_version_8.html#java_convention_deprecation
at 
build_1ab30mf3g41rlj3ezxkowdftr$_run_closure1.doCall$original(/Users/dongjoon/APACHE/spark-kubernetes-operator/build.gradle:20)
(Run with --stacktrace to get the full stack trace of this 
deprecation warning.)
at 
build_1ab30mf3g41rlj3ezxkowdftr.run(/Users/dongjoon/APACHE/spark-kubernetes-operator/build.gradle:16)
(Run with --stacktrace to get the full stack trace of this 
deprecation warning.)
Build file '/Users/dongjoon/APACHE/spark-kubernetes-operator/build.gradle': 
line 21
The org.gradle.api.plugins.JavaPluginConvention type has been deprecated. 
This is scheduled to be removed in Gradle 9.0. Consult the upgrading guide for 
further information: 
https://docs.gradle.org/8.7/userguide/upgrading_version_8.html#java_convention_deprecation
at 
build_1ab30mf3g41rlj3ezxkowdftr$_run_closure1.doCall$original(/Users/dongjoon/APACHE/spark-kubernetes-operator/build.gradle:21)
(Run with --stacktrace to get the full stack trace of this 
deprecation warning.)
at 
build_1ab30mf3g41rlj3ezxkowdftr.run(/Users/dongjoon/APACHE/spark-kubernetes-operator/build.gradle:16)
(Run with --stacktrace to get the full stack trace of this 
deprecation warning.)
Build file '/Users/dongjoon/APACHE/spark-kubernetes-operator/build.gradle': 
line 25
The RepositoryHandler.jcenter() method has been deprecated. This is 
scheduled to be removed in Gradle 9.0. JFrog announced JCenter's sunset in 
February 2021. Use mavenCentral() instead. Consult the upgrading guide for 
further information: 
https://docs.gradle.org/8.7/userguide/upgrading_version_6.html#jcenter_deprecation
at 
build_1ab30mf3g41rlj3ezxkowdftr$_run_closure1$_closure2.doCall$original(/Users/dongjoon/APACHE/spark-kubernetes-operator/build.gradle:25)
(Run with --stacktrace to get the full stack trace of this 
deprecation warning.)
at 
build_1ab30mf3g41rlj3ezxkowdftr$_run_closure1.doCall$original(/Users/dongjoon/APACHE/spark-kubernetes-operator/build.gradle:23)
(Run with --stacktrace to get the full stack trace of this 
deprecation warning.)

> Configure project :spark-operator-api
Updating PrinterColumns for generated CRD

BUILD SUCCESSFUL in 353ms
16 actionable tasks: 16 up-to-date
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually build with `--warning-mode all`.
```
$ ./gradlew build --warning-mode all

> Configure project :spark-operator-api
Updating PrinterColumns for generated CRD

BUILD SUCCESSFUL in 331ms
16 actionable tasks: 16 up-to-date
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #9 from dongjoon-hyun/SPARK-48015.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 build.gradle | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/build.gradle b/build.gradle
index ed54f7b..a6c1701 100644
--- a/build.gradle
+++ b/build.gradle
@@ -17,12 +17,14 @@ subprojects {
   apply plugin: 'idea'
   apply plugin: 'eclipse'
   apply plugin: 'java'
-  sourceCompatibility = 17
-  targetCompatibility = 17
+
+  java {
+sourceCompatibility = 17
+targetCompatibility = 17
+  }
 
   repositories {
 mavenCentral()
-jcenter()
   }
 
   apply plugin: 'chec

(spark-kubernetes-operator) branch main updated: [SPARK-47950] Add Java API Module for Spark Operator

2024-04-26 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/spark-kubernetes-operator.git


The following commit(s) were added to refs/heads/main by this push:
 new 28ff3e0  [SPARK-47950] Add Java API Module for Spark Operator
28ff3e0 is described below

commit 28ff3e069e80bffa2a3be69fc4905ad3a0f76fd5
Author: zhou-jiang 
AuthorDate: Fri Apr 26 14:18:09 2024 -0700

[SPARK-47950] Add Java API Module for Spark Operator

### What changes were proposed in this pull request?

This PR adds the Java API library for the Spark Operator, with the ability to
generate a YAML spec.

### Why are the changes needed?

The Spark Operator API refers to the CustomResourceDefinition
(https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/)
that represents the spec for a Spark application in k8s.

This module will be used by the operator controller and reconciler. It can
also serve external services that access the k8s API server with a Java library.

### Does this PR introduce _any_ user-facing change?

No API changes in Apache Spark core API. Spark Operator API is proposed.

To view the generated SparkApplication spec YAML, use

```
./gradlew :spark-operator-api:finalizeGeneratedCRD
```

(this requires yq to be installed for patching additional printer columns)

The generated YAML file will be located at

```

spark-operator-api/build/classes/java/main/META-INF/fabric8/sparkapplications.org.apache.spark-v1.yml
```

For more details, please also refer to
`spark-operator-docs/spark_application.md`.

### How was this patch tested?

This is tested locally.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #8 from jiangzho/api.

Authored-by: zhou-jiang 
Signed-off-by: Dongjoon Hyun 
---
 .github/.licenserc.yaml|   1 +
 build.gradle   |   2 +
 dev/.rat-excludes  |   2 +
 gradle.properties  |  16 ++
 settings.gradle|   2 +
 spark-operator-api/build.gradle|  32 
 .../apache/spark/k8s/operator/BaseResource.java|  36 +
 .../org/apache/spark/k8s/operator/Constants.java   |  82 ++
 .../spark/k8s/operator/SparkApplication.java   |  57 +++
 .../spark/k8s/operator/SparkApplicationList.java   |  26 +++
 .../k8s/operator/decorators/ResourceDecorator.java |  26 +++
 .../apache/spark/k8s/operator/diff/Diffable.java   |  22 +++
 .../spark/k8s/operator/spec/ApplicationSpec.java   |  57 +++
 .../operator/spec/ApplicationTimeoutConfig.java|  66 
 .../k8s/operator/spec/ApplicationTolerations.java  |  45 ++
 .../operator/spec/BaseApplicationTemplateSpec.java |  38 +
 .../apache/spark/k8s/operator/spec/BaseSpec.java   |  36 +
 .../spark/k8s/operator/spec/DeploymentMode.java|  25 +++
 .../spark/k8s/operator/spec/InstanceConfig.java|  68 
 .../k8s/operator/spec/ResourceRetainPolicy.java|  39 +
 .../spark/k8s/operator/spec/RestartConfig.java |  39 +
 .../spark/k8s/operator/spec/RestartPolicy.java |  39 +
 .../spark/k8s/operator/spec/RuntimeVersions.java   |  40 +
 .../operator/status/ApplicationAttemptSummary.java |  53 ++
 .../k8s/operator/status/ApplicationState.java  |  50 ++
 .../operator/status/ApplicationStateSummary.java   | 151 +
 .../k8s/operator/status/ApplicationStatus.java | 170 
 .../spark/k8s/operator/status/AttemptInfo.java |  44 +
 .../k8s/operator/status/BaseAttemptSummary.java|  37 +
 .../spark/k8s/operator/status/BaseState.java   |  37 +
 .../k8s/operator/status/BaseStateSummary.java  |  29 
 .../spark/k8s/operator/status/BaseStatus.java  |  64 
 .../spark/k8s/operator/utils/ModelUtils.java   | 110 +
 .../src/main/resources/printer-columns.sh  |  14 +-
 .../k8s/operator/spec/ApplicationSpecTest.java |  42 +
 .../spark/k8s/operator/spec/RestartPolicyTest.java |  62 +++
 .../k8s/operator/status/ApplicationStatusTest.java | 178 +
 .../spark/k8s/operator/utils/ModelUtilsTest.java   | 124 ++
 38 files changed, 1956 insertions(+), 5 deletions(-)

diff --git a/.github/.licenserc.yaml b/.github/.licenserc.yaml
index 26ac0c1..d1d65e2 100644
--- a/.github/.licenserc.yaml
+++ b/.github/.licenserc.yaml
@@ -16,5 +16,6 @@ header:
 - '.asf.yaml'
 - '**/*.gradle'
 - gradlew
+- 'build/**'
 
   comment: on-failure
diff --git a/build.gradle b/build.gradle
index f64212b..ed54f7b 100644
--- a/build.gradle
+++ b/build.gradle

(spark) branch master updated: [SPARK-48011][CORE] Store LogKey name as a value to avoid generating new string instances

2024-04-26 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2b2a33cc35a8 [SPARK-48011][CORE] Store LogKey name as a value to avoid 
generating new string instances
2b2a33cc35a8 is described below

commit 2b2a33cc35a880fafc569c707674313a56c15811
Author: Gengliang Wang 
AuthorDate: Fri Apr 26 13:25:15 2024 -0700

[SPARK-48011][CORE] Store LogKey name as a value to avoid generating new 
string instances

### What changes were proposed in this pull request?

Store LogKey name as a value to avoid generating new string instances
### Why are the changes needed?

To reduce memory usage when getting the names of `LogKey`s.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46249 from gengliangwang/addKeyName.

Authored-by: Gengliang Wang 
Signed-off-by: Dongjoon Hyun 
---
 common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala  | 6 +-
 common/utils/src/main/scala/org/apache/spark/internal/Logging.scala | 4 +---
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala 
b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala
index 04990ddc4c9d..2ca80a496ccb 100644
--- a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala
+++ b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala
@@ -16,10 +16,14 @@
  */
 package org.apache.spark.internal
 
+import java.util.Locale
+
 /**
  * All structured logging `keys` used in `MDC` must be extends `LogKey`
  */
-trait LogKey
+trait LogKey {
+  val name: String = this.toString.toLowerCase(Locale.ROOT)
+}
 
 /**
  * Various keys used for mapped diagnostic contexts(MDC) in logging.
diff --git 
a/common/utils/src/main/scala/org/apache/spark/internal/Logging.scala 
b/common/utils/src/main/scala/org/apache/spark/internal/Logging.scala
index 085b22bee5f3..24a60f88c24a 100644
--- a/common/utils/src/main/scala/org/apache/spark/internal/Logging.scala
+++ b/common/utils/src/main/scala/org/apache/spark/internal/Logging.scala
@@ -17,8 +17,6 @@
 
 package org.apache.spark.internal
 
-import java.util.Locale
-
 import scala.jdk.CollectionConverters._
 
 import org.apache.logging.log4j.{CloseableThreadContext, Level, LogManager}
@@ -110,7 +108,7 @@ trait Logging {
 val value = if (mdc.value != null) mdc.value.toString else null
 sb.append(value)
 if (Logging.isStructuredLoggingEnabled) {
-  context.put(mdc.key.toString.toLowerCase(Locale.ROOT), value)
+  context.put(mdc.key.name, value)
 }
 
 if (processedParts.hasNext) {


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48010][SQL] Avoid repeated calls to conf.resolver in resolveExpression

2024-04-26 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6098bd944f66 [SPARK-48010][SQL] Avoid repeated calls to conf.resolver 
in resolveExpression
6098bd944f66 is described below

commit 6098bd944f6603546601a9d5b5da5f756ce2257c
Author: Nikhil Sheoran <125331115+nikhilsheoran...@users.noreply.github.com>
AuthorDate: Fri Apr 26 11:23:12 2024 -0700

[SPARK-48010][SQL] Avoid repeated calls to conf.resolver in 
resolveExpression

### What changes were proposed in this pull request?
- Instead of calling `conf.resolver` for each call in `resolveExpression`,
this PR reuses the `resolver` obtained once.

### Why are the changes needed?
- Consider a view with a large number of columns (~1000s). When looking at
the RuleExecutor metrics and flamegraph for a query that only does `DESCRIBE
SELECT * FROM large_view`, we observed that a large fraction of time is spent in
`ResolveReferences` and `ResolveRelations`. Of these, the majority of the
driver time went into initializing the `conf` to obtain `conf.resolver` for each
of the columns in the view.
- Since the same `conf` is used in each of these calls, calling `conf.resolver`
again and again can be avoided by initializing it once and reusing the same
resolver (see the schematic below).
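
A plain-Python schematic of the pattern (not the actual Scala change; all
names are stand-ins):

```python
class Conf:
    @property
    def resolver(self):
        # stand-in for reading the SQL conf and building a resolver function
        return lambda a, b: a.lower() == b.lower()

def resolve_columns_before(columns, target, conf):
    # re-reads conf.resolver once per column
    return [c for c in columns if conf.resolver(c, target)]

def resolve_columns_after(columns, target, conf):
    resolver = conf.resolver  # obtained once, reused for every column
    return [c for c in columns if resolver(c, target)]

print(resolve_columns_after(["Id", "Name", "ID"], "id", Conf()))  # ['Id', 'ID']
```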

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Created a dummy view with 3000 columns.
- Observed the `RuleExecutor` metrics using `RuleExecutor.dumpTimeSpent()`.
- `RuleExecutor` metrics before this change (after multiple runs)
```
=== Metrics of Analyzer/Optimizer Rules ===
Total number of runs: 1483
Total time: 8.026801698 seconds

Rule
Effective Time / Total Time Effective Runs / 
Total Runs

org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations
4060159342 / 4062186814 1 / 6
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences   
3789405037 / 3809203288 2 / 6

org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$CombinedTypeCoercionRule
0 / 207411640 / 6
org.apache.spark.sql.catalyst.analysis.ResolveTimeZone  
17800584 / 19431350 1 / 6
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast   
15036018 / 15060440 1 / 6
org.apache.spark.sql.catalyst.analysis.UpdateAttributeNullability   
0 / 149298100 / 7
```
- `RuleExecutor` metrics after this change (after multiple runs)
```
=== Metrics of Analyzer/Optimizer Rules ===
Total number of runs: 1483
Total time: 2.892630859 seconds

Rule
Effective Time / Total Time Effective Runs / 
Total Runs

org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations
1490357745 / 1492398446 1 / 6
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences   
1212205822 / 1241729981 2 / 6

org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$CombinedTypeCoercionRule
0 / 238571610 / 6
org.apache.spark.sql.catalyst.analysis.ResolveTimeZone  
16603250 / 18806065 1 / 6
org.apache.spark.sql.catalyst.analysis.UpdateAttributeNullability   
0 / 167493060 / 7
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast   
11158299 / 11183593 1 / 6
```

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46248 from nikhilsheoran-db/SPARK-48010.

Authored-by: Nikhil Sheoran 
<125331115+nikhilsheoran...@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun 
---
 .../spark/sql/catalyst/analysis/ColumnResolutionHelper.scala   | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala
index 6e27192ead32..c10e000a098c 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala
+++ 
b/sql/

(spark) branch master updated: [SPARK-48005][PS][CONNECT][TESTS] Enable `DefaultIndexParityTests.test_index_distributed_sequence_cleanup`

2024-04-26 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 78b19d5af08e [SPARK-48005][PS][CONNECT][TESTS] Enable 
`DefaultIndexParityTests.test_index_distributed_sequence_cleanup`
78b19d5af08e is described below

commit 78b19d5af08ea772eaea9c13b7b984a13294
Author: Ruifeng Zheng 
AuthorDate: Fri Apr 26 09:58:54 2024 -0700

[SPARK-48005][PS][CONNECT][TESTS] Enable 
`DefaultIndexParityTests.test_index_distributed_sequence_cleanup`

### What changes were proposed in this pull request?
Enable `DefaultIndexParityTests.test_index_distributed_sequence_cleanup`

### Why are the changes needed?
This test requires `sc` access; it can be enabled in `Spark Connect with JVM`
mode.

### Does this PR introduce _any_ user-facing change?
no, test only

### How was this patch tested?
ci, also manually test:
```
python/run-tests -k --python-executables python3 --testnames 
'pyspark.pandas.tests.connect.indexes.test_parity_default 
DefaultIndexParityTests.test_index_distributed_sequence_cleanup'
Running PySpark tests. Output is in 
/Users/ruifeng.zheng/Dev/spark/python/unit-tests.log
Will test against the following Python executables: ['python3']
Will test the following Python tests: 
['pyspark.pandas.tests.connect.indexes.test_parity_default 
DefaultIndexParityTests.test_index_distributed_sequence_cleanup']
python3 python_implementation is CPython
python3 version is: Python 3.12.2
Starting test(python3): 
pyspark.pandas.tests.connect.indexes.test_parity_default 
DefaultIndexParityTests.test_index_distributed_sequence_cleanup (temp output: 
/Users/ruifeng.zheng/Dev/spark/python/target/ccd3da45-f774-4f5f-8283-a91a8ee12212/python3__pyspark.pandas.tests.connect.indexes.test_parity_default_DefaultIndexParityTests.test_index_distributed_sequence_cleanup__p9yved3e.log)
Finished test(python3): 
pyspark.pandas.tests.connect.indexes.test_parity_default 
DefaultIndexParityTests.test_index_distributed_sequence_cleanup (16s)
Tests passed in 16 seconds
```

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46242 from 
zhengruifeng/enable_test_index_distributed_sequence_cleanup.

Authored-by: Ruifeng Zheng 
Signed-off-by: Dongjoon Hyun 
---
 .../pyspark/pandas/tests/connect/indexes/test_parity_default.py   | 3 ++-
 python/pyspark/pandas/tests/indexes/test_default.py   | 8 
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/python/pyspark/pandas/tests/connect/indexes/test_parity_default.py 
b/python/pyspark/pandas/tests/connect/indexes/test_parity_default.py
index d6f0cadbf0cd..4240eb8fdbc8 100644
--- a/python/pyspark/pandas/tests/connect/indexes/test_parity_default.py
+++ b/python/pyspark/pandas/tests/connect/indexes/test_parity_default.py
@@ -19,6 +19,7 @@ import unittest
 from pyspark.pandas.tests.indexes.test_default import DefaultIndexTestsMixin
 from pyspark.testing.connectutils import ReusedConnectTestCase
 from pyspark.testing.pandasutils import PandasOnSparkTestUtils
+from pyspark.util import is_remote_only
 
 
 class DefaultIndexParityTests(
@@ -26,7 +27,7 @@ class DefaultIndexParityTests(
 PandasOnSparkTestUtils,
 ReusedConnectTestCase,
 ):
-@unittest.skip("Test depends on SparkContext which is not supported from 
Spark Connect.")
+@unittest.skipIf(is_remote_only(), "Requires JVM access")
 def test_index_distributed_sequence_cleanup(self):
 super().test_index_distributed_sequence_cleanup()
 
diff --git a/python/pyspark/pandas/tests/indexes/test_default.py 
b/python/pyspark/pandas/tests/indexes/test_default.py
index 3d19eb407b42..5cd9fae76dfb 100644
--- a/python/pyspark/pandas/tests/indexes/test_default.py
+++ b/python/pyspark/pandas/tests/indexes/test_default.py
@@ -44,7 +44,7 @@ class DefaultIndexTestsMixin:
 "compute.default_index_type", "distributed-sequence"
 ), ps.option_context("compute.ops_on_diff_frames", True):
 with ps.option_context("compute.default_index_cache", 
"LOCAL_CHECKPOINT"):
-cached_rdd_ids = [rdd_id for rdd_id in 
self.spark._jsc.getPersistentRDDs()]
+cached_rdd_ids = [rdd_id for rdd_id in 
self._legacy_sc._jsc.getPersistentRDDs()]
 
 psdf1 = (
 self.spark.range(0, 100, 1, 10).withColumn("Key", 
F.col("id") % 33).pandas_api()
@@ -61,13 +61,13 @@ class DefaultIndexTestsMixin:
 self.assertTrue(
 any(
 rdd_id not in cached_rdd_ids
-for rdd_id in self.spark._jsc.getPersistentRDDs()
+   

(spark) branch master updated: [SPARK-48007][BUILD][TESTS] Upgrade `mssql.jdbc` to `12.6.1.jre11`

2024-04-26 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 4ee528f9b29f [SPARK-48007][BUILD][TESTS] Upgrade `mssql.jdbc` to 
`12.6.1.jre11`
4ee528f9b29f is described below

commit 4ee528f9b29f5cd52b70b27a4b8c250c8ca1a17c
Author: Kent Yao 
AuthorDate: Fri Apr 26 08:08:57 2024 -0700

[SPARK-48007][BUILD][TESTS] Upgrade `mssql.jdbc` to `12.6.1.jre11`

### What changes were proposed in this pull request?

This PR upgrades mssql.jdbc.version to 12.6.1.jre11, 
https://mvnrepository.com/artifact/com.microsoft.sqlserver/mssql-jdbc.

### Why are the changes needed?

test dependency management

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46244 from yaooqinn/SPARK-48007.

Authored-by: Kent Yao 
Signed-off-by: Dongjoon Hyun 
---
 .../scala/org/apache/spark/sql/jdbc/MsSQLServerDatabaseOnDocker.scala  | 3 ++-
 pom.xml| 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSQLServerDatabaseOnDocker.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSQLServerDatabaseOnDocker.scala
index b351b2ad1ec7..61530f713eb8 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSQLServerDatabaseOnDocker.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSQLServerDatabaseOnDocker.scala
@@ -28,5 +28,6 @@ class MsSQLServerDatabaseOnDocker extends DatabaseOnDocker {
   override val jdbcPort: Int = 1433
 
   override def getJdbcUrl(ip: String, port: Int): String =
-s"jdbc:sqlserver://$ip:$port;user=sa;password=Sapass123;"
+s"jdbc:sqlserver://$ip:$port;user=sa;password=Sapass123;" +
+  "encrypt=true;trustServerCertificate=true"
 }
diff --git a/pom.xml b/pom.xml
index 9c8f8fbb2ab0..b916659fdbfa 100644
--- a/pom.xml
+++ b/pom.xml
@@ -325,7 +325,7 @@
 8.3.0
 42.7.3
 11.5.9.0
-9.4.1.jre8
+12.6.1.jre11
 23.3.0.23.09
   
   


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-47991][SQL][TEST] Arrange the test cases for window frames and window functions

2024-04-25 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ea4b7a242910 [SPARK-47991][SQL][TEST] Arrange the test cases for 
window frames and window functions
ea4b7a242910 is described below

commit ea4b7a2429106067eb30b6b47bf7c42059053d31
Author: beliefer 
AuthorDate: Thu Apr 25 20:54:27 2024 -0700

[SPARK-47991][SQL][TEST] Arrange the test cases for window frames and 
window functions

### What changes were proposed in this pull request?
This PR proposes to rearrange the test cases for window frames and window
functions.

### Why are the changes needed?
Currently, `DataFrameWindowFramesSuite` and `DataFrameWindowFunctionsSuite` 
have different testing objectives.
The comments for the above two classes are as follows:
`DataFrameWindowFramesSuite` is `Window frame testing for DataFrame API.`
`DataFrameWindowFunctionsSuite` is `Window function testing for DataFrame 
API.`

But some test cases for window frames are placed in
`DataFrameWindowFunctionsSuite`.

### Does this PR introduce _any_ user-facing change?
'No'.
It just rearranges the test cases for window frames and window functions.

### How was this patch tested?
GA

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #46226 from beliefer/SPARK-47991.

Authored-by: beliefer 
Signed-off-by: Dongjoon Hyun 
---
 .../spark/sql/DataFrameWindowFramesSuite.scala | 48 ++
 .../spark/sql/DataFrameWindowFunctionsSuite.scala  | 48 --
 2 files changed, 48 insertions(+), 48 deletions(-)

diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameWindowFramesSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameWindowFramesSuite.scala
index fe1393af8174..95f4cc78d156 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameWindowFramesSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameWindowFramesSuite.scala
@@ -32,6 +32,28 @@ import org.apache.spark.sql.types.CalendarIntervalType
 class DataFrameWindowFramesSuite extends QueryTest with SharedSparkSession {
   import testImplicits._
 
+  test("reuse window partitionBy") {
+val df = Seq((1, "1"), (2, "2"), (1, "1"), (2, "2")).toDF("key", "value")
+val w = Window.partitionBy("key").orderBy("value")
+
+checkAnswer(
+  df.select(
+lead("key", 1).over(w),
+lead("value", 1).over(w)),
+  Row(1, "1") :: Row(2, "2") :: Row(null, null) :: Row(null, null) :: Nil)
+  }
+
+  test("reuse window orderBy") {
+val df = Seq((1, "1"), (2, "2"), (1, "1"), (2, "2")).toDF("key", "value")
+val w = Window.orderBy("value").partitionBy("key")
+
+checkAnswer(
+  df.select(
+lead("key", 1).over(w),
+lead("value", 1).over(w)),
+  Row(1, "1") :: Row(2, "2") :: Row(null, null) :: Row(null, null) :: Nil)
+  }
+
   test("lead/lag with empty data frame") {
 val df = Seq.empty[(Int, String)].toDF("key", "value")
 val window = Window.partitionBy($"key").orderBy($"value")
@@ -570,4 +592,30 @@ class DataFrameWindowFramesSuite extends QueryTest with 
SharedSparkSession {
   }
 }
   }
+
+  test("SPARK-34227: WindowFunctionFrame should clear its states during 
preparation") {
+// This creates a single partition dataframe with 3 records:
+//   "a", 0, null
+//   "a", 1, "x"
+//   "b", 0, null
+val df = spark.range(0, 3, 1, 1).select(
+  when($"id" < 2, lit("a")).otherwise(lit("b")).as("key"),
+  ($"id" % 2).cast("int").as("order"),
+  when($"id" % 2 === 0, lit(null)).otherwise(lit("x")).as("value"))
+
+val window1 = Window.partitionBy($"key").orderBy($"order")
+  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
+val window2 = Window.partitionBy($"key").orderBy($"order")
+  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
+checkAnswer(
+  df.select(
+$"key",
+$"order",
+nth_value($"value", 1, ignoreNulls = true).over(window1),
+nth_value($"value", 1, ignoreNulls = true).over(window2)),
+  Seq(
+Row("a", 0, "x", null),
+Row("a&quo

(spark) branch master updated: [SPARK-47933][CONNECT][PYTHON][FOLLOW-UP] Avoid referencing _to_seq in `pyspark-connect`

2024-04-25 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 79357c8ccd22 [SPARK-47933][CONNECT][PYTHON][FOLLOW-UP] Avoid 
referencing _to_seq in `pyspark-connect`
79357c8ccd22 is described below

commit 79357c8ccd22729a074c42f700544e7e3f023a8d
Author: Hyukjin Kwon 
AuthorDate: Thu Apr 25 14:49:21 2024 -0700

[SPARK-47933][CONNECT][PYTHON][FOLLOW-UP] Avoid referencing _to_seq in 
`pyspark-connect`

### What changes were proposed in this pull request?

This PR is a follow-up of https://github.com/apache/spark/pull/46155 that
removes the reference to `_to_seq`, which the `pyspark-connect` package does not have.

### Why are the changes needed?

To recover the CI 
https://github.com/apache/spark/actions/runs/8821919392/job/24218893631

### Does this PR introduce _any_ user-facing change?

No, the main change has not been released out yet.

### How was this patch tested?

Manually tested.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46229 from HyukjinKwon/SPARK-47933-followuptmp.

Authored-by: Hyukjin Kwon 
Signed-off-by: Dongjoon Hyun 
---
 python/pyspark/sql/group.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/sql/group.py b/python/pyspark/sql/group.py
index d26e23bc7160..34c3531c8302 100644
--- a/python/pyspark/sql/group.py
+++ b/python/pyspark/sql/group.py
@@ -43,9 +43,9 @@ def dfapi(f: Callable[..., DataFrame]) -> Callable[..., 
DataFrame]:
 
 
 def df_varargs_api(f: Callable[..., DataFrame]) -> Callable[..., DataFrame]:
-from pyspark.sql.classic.column import _to_seq
-
 def _api(self: "GroupedData", *cols: str) -> DataFrame:
+from pyspark.sql.classic.column import _to_seq
+
 name = f.__name__
 jdf = getattr(self._jgd, name)(_to_seq(self.session._sc, cols))
 return DataFrame(jdf, self.session)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-45425][DOCS][FOLLOWUP] Add a migration guide for TINYINT type mapping change

2024-04-25 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e1d021214c61 [SPARK-45425][DOCS][FOLLOWUP] Add a migration guide for 
TINYINT type mapping change
e1d021214c61 is described below

commit e1d021214c6130588e69dfa05e0391d89b463f9d
Author: Kent Yao 
AuthorDate: Thu Apr 25 08:19:40 2024 -0700

[SPARK-45425][DOCS][FOLLOWUP] Add a migration guide for TINYINT type 
mapping change

### What changes were proposed in this pull request?

Followup of SPARK-45425, adding migration guide.

### Why are the changes needed?

migration guide

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing build

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #46224 from yaooqinn/SPARK-45425.

Authored-by: Kent Yao 
Signed-off-by: Dongjoon Hyun 
---
 docs/sql-migration-guide.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index 9b189eee6ad1..024423fb145a 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -47,6 +47,7 @@ license: |
 - Since Spark 4.0, MySQL JDBC datasource will read BIT(n > 1) as BinaryType, 
while in Spark 3.5 and previous, read as LongType. To restore the previous 
behavior, set `spark.sql.legacy.mysql.bitArrayMapping.enabled` to `true`.
 - Since Spark 4.0, MySQL JDBC datasource will write ShortType as SMALLINT, 
while in Spark 3.5 and previous, write as INTEGER. To restore the previous 
behavior, you can replace the column with IntegerType before writing.
 - Since Spark 4.0, Oracle JDBC datasource will write TimestampType as 
TIMESTAMP WITH LOCAL TIME ZONE, while in Spark 3.5 and previous, write as 
TIMESTAMP. To restore the previous behavior, set 
`spark.sql.legacy.oracle.timestampMapping.enabled` to `true`.
+- Since Spark 4.0, MsSQL Server JDBC datasource will read TINYINT as 
ShortType, while in Spark 3.5 and previous, read as IntegerType. To restore the 
previous behavior, set `spark.sql.legacy.mssqlserver.numericMapping.enabled` to 
`true`.
 - Since Spark 4.0, The default value for 
`spark.sql.legacy.ctePrecedencePolicy` has been changed from `EXCEPTION` to 
`CORRECTED`. Instead of raising an error, inner CTE definitions take precedence 
over outer definitions.
 - Since Spark 4.0, The default value for `spark.sql.legacy.timeParserPolicy` 
has been changed from `EXCEPTION` to `CORRECTED`. Instead of raising an 
`INCONSISTENT_BEHAVIOR_CROSS_VERSION` error, `CANNOT_PARSE_TIMESTAMP` will be 
raised if ANSI mode is enabled. `NULL` will be returned if ANSI mode is 
disabled. See [Datetime Patterns for Formatting and 
Parsing](sql-ref-datetime-pattern.html).
 - Since Spark 4.0, A bug falsely allowing `!` instead of `NOT` when `!` is not 
a prefix operator has been fixed. Clauses such as `expr ! IN (...)`, `expr ! 
BETWEEN ...`, or `col ! NULL` now raise syntax errors. To restore the previous 
behavior, set `spark.sql.legacy.bangEqualsNot` to `true`. 
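
As an aside, a hedged PySpark sketch of opting back into the pre-4.0 TINYINT mapping described in the new note above (the JDBC URL, credentials, and table name are placeholders; a reachable SQL Server with the mssql-jdbc driver is assumed):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("legacy-tinyint-mapping").getOrCreate()

# Restore the Spark 3.5 behavior of reading MsSQL Server TINYINT as IntegerType.
spark.conf.set("spark.sql.legacy.mssqlserver.numericMapping.enabled", "true")

# Illustrative read; the connection details are placeholders.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://host:1433;databaseName=db;user=sa;password=***")
      .option("dbtable", "dbo.some_table")
      .load())
df.printSchema()
```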


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (de5c512e0179 -> 287d02073929)

2024-04-25 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from de5c512e0179 [SPARK-47987][PYTHON][CONNECT][TESTS] Enable 
`ArrowParityTests.test_createDataFrame_empty_partition`
 add 287d02073929 [SPARK-47989][SQL] MsSQLServer: Fix the scope of 
spark.sql.legacy.mssqlserver.numericMapping.enabled

No new revisions were added by this update.

Summary of changes:
 .../sql/jdbc/MsSqlServerIntegrationSuite.scala | 177 +++--
 .../org/apache/spark/sql/internal/SQLConf.scala|   2 +-
 .../apache/spark/sql/jdbc/MsSqlServerDialect.scala |  29 ++--
 3 files changed, 104 insertions(+), 104 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-47987][PYTHON][CONNECT][TESTS] Enable `ArrowParityTests.test_createDataFrame_empty_partition`

2024-04-25 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new de5c512e0179 [SPARK-47987][PYTHON][CONNECT][TESTS] Enable 
`ArrowParityTests.test_createDataFrame_empty_partition`
de5c512e0179 is described below

commit de5c512e017965b5c726e254f8969fb17d5c17ea
Author: Ruifeng Zheng 
AuthorDate: Thu Apr 25 08:16:56 2024 -0700

[SPARK-47987][PYTHON][CONNECT][TESTS] Enable 
`ArrowParityTests.test_createDataFrame_empty_partition`

### What changes were proposed in this pull request?
Reenable `ArrowParityTests.test_createDataFrame_empty_partition`

We had already set up the Classic SparkContext `_legacy_sc` for the Spark 
Connect tests, so we only need to add `_legacy_sc` to the Classic PySpark tests.
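
A small sketch of the guard this relies on; `is_remote_only` comes from the diff below, while the wrapper function is illustrative:

```python
from pyspark.util import is_remote_only


def default_parallelism_or_none(sc):
    # pyspark-connect (remote-only) builds have no JVM-backed SparkContext,
    # so skip anything that needs one.
    if is_remote_only():
        return None
    return sc.defaultParallelism
```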

### Why are the changes needed?
to improve test coverage

### Does this PR introduce _any_ user-facing change?
no, test only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46220 from zhengruifeng/enable_test_createDataFrame_empty_partition.

Authored-by: Ruifeng Zheng 
Signed-off-by: Dongjoon Hyun 
---
 python/pyspark/sql/tests/connect/test_parity_arrow.py | 4 
 python/pyspark/sql/tests/test_arrow.py| 4 +++-
 python/pyspark/testing/sqlutils.py| 1 +
 3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/python/pyspark/sql/tests/connect/test_parity_arrow.py 
b/python/pyspark/sql/tests/connect/test_parity_arrow.py
index 93d0b6cf0f5f..8727cc279641 100644
--- a/python/pyspark/sql/tests/connect/test_parity_arrow.py
+++ b/python/pyspark/sql/tests/connect/test_parity_arrow.py
@@ -24,10 +24,6 @@ from pyspark.testing.pandasutils import 
PandasOnSparkTestUtils
 
 
 class ArrowParityTests(ArrowTestsMixin, ReusedConnectTestCase, 
PandasOnSparkTestUtils):
-@unittest.skip("Spark Connect does not support Spark Context but the test 
depends on that.")
-def test_createDataFrame_empty_partition(self):
-super().test_createDataFrame_empty_partition()
-
 @unittest.skip("Spark Connect does not support fallback.")
 def test_createDataFrame_fallback_disabled(self):
 super().test_createDataFrame_fallback_disabled()
diff --git a/python/pyspark/sql/tests/test_arrow.py 
b/python/pyspark/sql/tests/test_arrow.py
index 5235e021bae9..03cb35feb994 100644
--- a/python/pyspark/sql/tests/test_arrow.py
+++ b/python/pyspark/sql/tests/test_arrow.py
@@ -56,6 +56,7 @@ from pyspark.testing.sqlutils import (
 ExamplePointUDT,
 )
 from pyspark.errors import ArithmeticException, PySparkTypeError, 
UnsupportedOperationException
+from pyspark.util import is_remote_only
 
 if have_pandas:
 import pandas as pd
@@ -830,7 +831,8 @@ class ArrowTestsMixin:
 pdf = pd.DataFrame({"c1": [1], "c2": ["string"]})
 df = self.spark.createDataFrame(pdf)
 self.assertEqual([Row(c1=1, c2="string")], df.collect())
-self.assertGreater(self.spark.sparkContext.defaultParallelism, 
len(pdf))
+if not is_remote_only():
+self.assertGreater(self._legacy_sc.defaultParallelism, len(pdf))
 
 def test_toPandas_error(self):
 for arrow_enabled in [True, False]:
diff --git a/python/pyspark/testing/sqlutils.py 
b/python/pyspark/testing/sqlutils.py
index 690d5c37b22e..a0fdada72972 100644
--- a/python/pyspark/testing/sqlutils.py
+++ b/python/pyspark/testing/sqlutils.py
@@ -258,6 +258,7 @@ class ReusedSQLTestCase(ReusedPySparkTestCase, 
SQLTestUtils, PySparkErrorTestUti
 @classmethod
 def setUpClass(cls):
 super(ReusedSQLTestCase, cls).setUpClass()
+cls._legacy_sc = cls.sc
 cls.spark = SparkSession(cls.sc)
 cls.tempdir = tempfile.NamedTemporaryFile(delete=False)
 os.unlink(cls.tempdir.name)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-47990][BUILD] Upgrade `zstd-jni` to 1.5.6-3

2024-04-25 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 5810554ce0fa [SPARK-47990][BUILD] Upgrade `zstd-jni` to 1.5.6-3
5810554ce0fa is described below

commit 5810554ce0faba4cb8e7f3ca3dd5812bd2cf179f
Author: panbingkun 
AuthorDate: Thu Apr 25 08:10:04 2024 -0700

[SPARK-47990][BUILD] Upgrade `zstd-jni` to 1.5.6-3

### What changes were proposed in this pull request?
The pr aims to upgrade `zstd-jni` from `1.5.6-2` to `1.5.6-3`.

### Why are the changes needed?
1. This version fixes a potential memory leak, as shown in the screenshot linked below:
https://github.com/apache/spark/assets/15246973/eeae3e7f-0c44-443d-838b-fa39b9e45d64

2. https://github.com/luben/zstd-jni/compare/v1.5.6-2...v1.5.6-3

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46225 from panbingkun/SPARK-47990.

Authored-by: panbingkun 
Signed-off-by: Dongjoon Hyun 
---
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 +-
 pom.xml   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 
b/dev/deps/spark-deps-hadoop-3-hive-2.3
index f6adb6d18b85..005cc7bfb435 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -278,4 +278,4 @@ xz/1.9//xz-1.9.jar
 zjsonpatch/0.3.0//zjsonpatch-0.3.0.jar
 zookeeper-jute/3.9.2//zookeeper-jute-3.9.2.jar
 zookeeper/3.9.2//zookeeper-3.9.2.jar
-zstd-jni/1.5.6-2//zstd-jni-1.5.6-2.jar
+zstd-jni/1.5.6-3//zstd-jni-1.5.6-3.jar
diff --git a/pom.xml b/pom.xml
index c98514efa356..9c8f8fbb2ab0 100644
--- a/pom.xml
+++ b/pom.xml
@@ -800,7 +800,7 @@
   
 com.github.luben
 zstd-jni
-1.5.6-2
+1.5.6-3
   
   
 com.clearspring.analytics


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-47979][SQL][TESTS] Use Hive tables explicitly for Hive table capability tests

2024-04-24 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0fcced63be99 [SPARK-47979][SQL][TESTS] Use Hive tables explicitly for 
Hive table capability tests
0fcced63be99 is described below

commit 0fcced63be99302593591d29370c00e7c0d73cec
Author: Dongjoon Hyun 
AuthorDate: Wed Apr 24 18:57:29 2024 -0700

[SPARK-47979][SQL][TESTS] Use Hive tables explicitly for Hive table 
capability tests

### What changes were proposed in this pull request?

This PR aims to use `Hive` tables explicitly for Hive table capability 
tests in `hive` and `hive-thriftserver` module.

### Why are the changes needed?

To make the Hive test coverage robust by making it independent of Apache Spark configuration changes.
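
For illustration, a hedged sketch of what pinning the provider looks like from PySpark (assumes a Spark build with Hive support and a configured metastore; the table name is made up):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("explicit-hive-provider")
         .enableHiveSupport()
         .getOrCreate())

# Spelling out USING HIVE keeps the table a Hive table even if the default
# table provider configuration changes in a future Spark version.
spark.sql("CREATE TABLE IF NOT EXISTS demo_hive_tbl (key INT, value STRING) USING HIVE")
spark.sql("DESCRIBE TABLE EXTENDED demo_hive_tbl").show(truncate=False)
```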

### Does this PR introduce _any_ user-facing change?

No, this is a test only change.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46211 from dongjoon-hyun/SPARK-47979.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../org/apache/spark/sql/hive/thriftserver/UISeleniumSuite.scala | 2 +-
 .../scala/org/apache/spark/sql/hive/client/HiveClientSuite.scala | 1 +
 .../scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala | 9 +++--
 .../org/apache/spark/sql/hive/execution/HiveQuerySuite.scala | 6 +++---
 .../spark/sql/hive/execution/command/ShowCreateTableSuite.scala  | 4 
 5 files changed, 12 insertions(+), 10 deletions(-)

diff --git 
a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/UISeleniumSuite.scala
 
b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/UISeleniumSuite.scala
index b552611b75d1..2b2cbec41d64 100644
--- 
a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/UISeleniumSuite.scala
+++ 
b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/UISeleniumSuite.scala
@@ -108,7 +108,7 @@ class UISeleniumSuite
   val baseURL = s"http://$localhost:$uiPort";
 
   val queries = Seq(
-"CREATE TABLE test_map(key INT, value STRING)",
+"CREATE TABLE test_map (key INT, value STRING) USING HIVE",
 s"LOAD DATA LOCAL INPATH '${TestData.smallKv}' OVERWRITE INTO TABLE 
test_map")
 
   queries.foreach(statement.execute)
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientSuite.scala
 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientSuite.scala
index 0bc288501a01..b60adfb6f4cf 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientSuite.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientSuite.scala
@@ -686,6 +686,7 @@ class HiveClientSuite(version: String) extends 
HiveVersionSuite(version) {
 versionSpark.sql(
   s"""
  |CREATE TABLE tab(c1 string)
+ |USING HIVE
  |location '${tmpDir.toURI.toString}'
  """.stripMargin)
 
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
index 241fdd4b9ec5..965db22b78f1 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
@@ -216,7 +216,7 @@ class HiveDDLSuite
 
   test("SPARK-22431: alter table tests with nested types") {
 withTable("t1", "t2", "t3") {
-  spark.sql("CREATE TABLE t1 (q STRUCT<...>, i1 INT)")
+  spark.sql("CREATE TABLE t1 (q STRUCT<...>, i1 INT) USING HIVE")
   spark.sql("ALTER TABLE t1 ADD COLUMNS (newcol1 STRUCT<`col1`:STRING, 
col2:Int>)")
   val newcol = spark.sql("SELECT * FROM t1").schema.fields(2).name
   assert("newcol1".equals(newcol))
@@ -2614,7 +2614,7 @@ class HiveDDLSuite
   "msg" -> "java.lang.UnsupportedOperationException: Unknown field 
type: void")
   )
 
-  sql("CREATE TABLE t3 AS SELECT NULL AS null_col")
+  sql("CREATE TABLE t3 USING HIVE AS SELECT NULL AS null_col")
   checkAnswer(sql("SELECT * FROM t3"), Row(null))
 }
 
@@ -2642,9 +2642,6 @@ class HiveDDLSuite
 
   sql("CREATE TABLE t3 (v VOID) USING hive")
   checkAnswer(sql("SELECT * FROM t3"), Seq.empty)
-
-  sql("CREATE TABLE t4 (v VOID)")
-  checkAnswer(sql("SELECT * FROM t4"), Seq.empty)
 }
 
 //

(spark) branch branch-3.5 updated: [SPARK-47633][SQL][3.5] Include right-side plan output in `LateralJoin#allAttributes` for more consistent canonicalization

2024-04-24 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new ce19bfc10682 [SPARK-47633][SQL][3.5] Include right-side plan output in 
`LateralJoin#allAttributes` for more consistent canonicalization
ce19bfc10682 is described below

commit ce19bfc1068229897454c5f5cb78aeb435821bd2
Author: Bruce Robbins 
AuthorDate: Wed Apr 24 09:48:21 2024 -0700

[SPARK-47633][SQL][3.5] Include right-side plan output in 
`LateralJoin#allAttributes` for more consistent canonicalization

This is a backport of #45763 to branch-3.5.

### What changes were proposed in this pull request?

Modify `LateralJoin` to include right-side plan output in `allAttributes`.

### Why are the changes needed?

In the following example, the view v1 is cached, but a query of v1 does not 
use the cache:
```
CREATE or REPLACE TEMP VIEW t1(c1, c2) AS VALUES (0, 1), (1, 2);
CREATE or REPLACE TEMP VIEW t2(c1, c2) AS VALUES (0, 1), (1, 2);

create or replace temp view v1 as
select *
from t1
join lateral (
  select c1 as a, c2 as b
  from t2)
on c1 = a;

cache table v1;

explain select * from v1;
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [c1#180], [a#173], Inner, BuildRight, false
   :- LocalTableScan [c1#180, c2#181]
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, 
false] as bigint)),false), [plan_id=113]
  +- LocalTableScan [a#173, b#174]
```

The canonicalized version of the `LateralJoin` node is not consistent when 
there is a join condition. For example, for the above query, the join condition 
is canonicalized as follows:
```
Before canonicalization: Some((c1#174 = a#167))
After canonicalization:  Some((none#0 = none#167))
```
You can see that the `exprId` for the second operand of `EqualTo` is not 
normalized (it remains 167). That's because the attribute `a` from the 
right-side plan is not included in `allAttributes`.

This PR adds right-side attributes to `allAttributes` so that references to 
right-side attributes in the join condition are normalized during 
canonicalization.
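
A PySpark sketch mirroring the SQL example above, for eyeballing whether the cached view is reused (view names follow the example; with the fix the physical plan should contain an in-memory table scan):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lateral-join-cache-check").getOrCreate()

spark.sql("CREATE OR REPLACE TEMP VIEW t1(c1, c2) AS VALUES (0, 1), (1, 2)")
spark.sql("CREATE OR REPLACE TEMP VIEW t2(c1, c2) AS VALUES (0, 1), (1, 2)")
spark.sql("""
    CREATE OR REPLACE TEMP VIEW v1 AS
    SELECT * FROM t1
    JOIN LATERAL (SELECT c1 AS a, c2 AS b FROM t2) ON c1 = a
""")
spark.sql("CACHE TABLE v1")

# After the fix, this should reuse the cache instead of re-running the join.
spark.sql("SELECT * FROM v1").explain()
```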

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46190 from bersprockets/lj_canonical_issue_35.

Authored-by: Bruce Robbins 
Signed-off-by: Dongjoon Hyun 
---
 .../plans/logical/basicLogicalOperators.scala |  2 ++
 .../scala/org/apache/spark/sql/CachedTableSuite.scala | 19 +++
 2 files changed, 21 insertions(+)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
index 58c03ee72d6d..ca2c6a850561 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
@@ -2017,6 +2017,8 @@ case class LateralJoin(
 joinType: JoinType,
 condition: Option[Expression]) extends UnaryNode {
 
+  override lazy val allAttributes: AttributeSeq = left.output ++ 
right.plan.output
+
   require(Seq(Inner, LeftOuter, Cross).contains(joinType),
 s"Unsupported lateral join type $joinType")
 
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala
index 8331a3c10fc9..9815cb816c99 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala
@@ -1710,4 +1710,23 @@ class CachedTableSuite extends QueryTest with 
SQLTestUtils
   }
 }
   }
+
+  test("SPARK-47633: Cache hit for lateral join with join condition") {
+withTempView("t", "q1") {
+  sql("create or replace temp view t(c1, c2) as values (0, 1), (1, 2)")
+  val query = """select *
+|from t
+|join lateral (
+|  select c1 as a, c2 as b
+|  from t)
+|on c1 = a;
+|""".stripMargin
+  sql(s"cache table q1 as $query")
+  val df = sql(query)
+  checkAnswer(df,
+Row(0, 1, 0, 1) :: Row(1, 2, 1, 2) :: Nil)
+  assert(getNumInMemoryRelations(df) == 1)
+}
+
+  }
 }


---

(spark) branch master updated (09ed09cb18e7 -> 03d4ea6a707c)

2024-04-24 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 09ed09cb18e7 [SPARK-47958][TESTS] Change LocalSchedulerBackend to 
notify scheduler of executor on start
 add 03d4ea6a707c [SPARK-47974][BUILD] Remove `install_scala` from 
`build/mvn`

No new revisions were added by this update.

Summary of changes:
 .github/workflows/benchmark.yml|  6 ++
 .github/workflows/build_and_test.yml   | 24 
 .github/workflows/build_python_connect.yml |  3 +--
 .github/workflows/maven_test.yml   |  3 +--
 build/mvn  | 24 
 5 files changed, 12 insertions(+), 48 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-47969][PYTHON][TESTS] Make `test_creation_index` deterministic

2024-04-23 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new cb1e1f5cd49a [SPARK-47969][PYTHON][TESTS] Make `test_creation_index` 
deterministic
cb1e1f5cd49a is described below

commit cb1e1f5cd49a612c0c081949759c1f931883c263
Author: Ruifeng Zheng 
AuthorDate: Tue Apr 23 23:09:10 2024 -0700

[SPARK-47969][PYTHON][TESTS] Make `test_creation_index` deterministic

### What changes were proposed in this pull request?
Make `test_creation_index` deterministic

### Why are the changes needed?
it may fail in some environments
```
FAIL [16.261s]: test_creation_index 
(pyspark.pandas.tests.frame.test_constructor.FrameConstructorTests.test_creation_index)
--
Traceback (most recent call last):
  File "/home/jenkins/python/pyspark/testing/pandasutils.py", line 91, in 
_assert_pandas_equal
assert_frame_equal(
  File 
"/databricks/python3/lib/python3.11/site-packages/pandas/_testing/asserters.py",
 line 1257, in assert_frame_equal
assert_index_equal(
  File 
"/databricks/python3/lib/python3.11/site-packages/pandas/_testing/asserters.py",
 line 407, in assert_index_equal
raise_assert_detail(obj, msg, left, right)
  File 
"/databricks/python3/lib/python3.11/site-packages/pandas/_testing/asserters.py",
 line 665, in raise_assert_detail
raise AssertionError(msg)
AssertionError: DataFrame.index are different
DataFrame.index values are different (40.0 %)
[left]:  Int64Index([2, 3, 4, 6, 5], dtype='int64')
[right]: Int64Index([2, 3, 4, 5, 6], dtype='int64')
```
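
The gist of the fix as a small standalone sketch: pandas-on-Spark does not guarantee row order, so both sides are sorted on the index before comparing (assumes `pyspark.pandas` and its pandas/pyarrow dependencies are installed):

```python
import pandas as pd
import pyspark.pandas as ps
from pandas.testing import assert_frame_equal

pdf = pd.DataFrame({"a": [1, 2, 3]}, index=[4, 2, 3])
psdf = ps.from_pandas(pdf)

# Rows can come back from Spark in any order, so sort both sides by index
# before asserting equality.
assert_frame_equal(psdf.sort_index().to_pandas(), pdf.sort_index())
```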

### Does this PR introduce _any_ user-facing change?
no. test only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46200 from zhengruifeng/fix_test_creation_index.

Authored-by: Ruifeng Zheng 
Signed-off-by: Dongjoon Hyun 
---
 python/pyspark/pandas/tests/frame/test_constructor.py | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/python/pyspark/pandas/tests/frame/test_constructor.py 
b/python/pyspark/pandas/tests/frame/test_constructor.py
index ee010d8f023d..d7581895c6c9 100644
--- a/python/pyspark/pandas/tests/frame/test_constructor.py
+++ b/python/pyspark/pandas/tests/frame/test_constructor.py
@@ -195,14 +195,14 @@ class FrameConstructorMixin:
 with ps.option_context("compute.ops_on_diff_frames", True):
 # test with ps.DataFrame and pd.Index
 self.assert_eq(
-ps.DataFrame(data=psdf, index=pd.Index([2, 3, 4, 5, 6])),
-pd.DataFrame(data=pdf, index=pd.Index([2, 3, 4, 5, 6])),
+ps.DataFrame(data=psdf, index=pd.Index([2, 3, 4, 5, 
6])).sort_index(),
+pd.DataFrame(data=pdf, index=pd.Index([2, 3, 4, 5, 
6])).sort_index(),
 )
 
 # test with ps.DataFrame and ps.Index
 self.assert_eq(
-ps.DataFrame(data=psdf, index=ps.Index([2, 3, 4, 5, 6])),
-pd.DataFrame(data=pdf, index=pd.Index([2, 3, 4, 5, 6])),
+ps.DataFrame(data=psdf, index=ps.Index([2, 3, 4, 5, 
6])).sort_index(),
+pd.DataFrame(data=pdf, index=pd.Index([2, 3, 4, 5, 
6])).sort_index(),
 )
 
 # test String Index


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-47956][SQL] Sanity check for unresolved LCA reference

2024-04-23 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 66613ba042c4 [SPARK-47956][SQL] Sanity check for unresolved LCA 
reference
66613ba042c4 is described below

commit 66613ba042c4b73b45b3c71e79ce05c225f527e7
Author: Wenchen Fan 
AuthorDate: Tue Apr 23 08:44:48 2024 -0700

[SPARK-47956][SQL] Sanity check for unresolved LCA reference

### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/40558. The sanity check should apply to all plan nodes, not only Project/Aggregate/Window, since we don't know which bugs can happen; a bug might move LCA references to other plan nodes.
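
For context, a lateral column alias lets a later item in a SELECT list refer to an earlier alias, and in a resolved plan every such reference must already be resolved away; that is what the broadened check asserts. A minimal sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lca-example").getOrCreate()
spark.range(3).createOrReplaceTempView("t")

# `doubled` is defined and then referenced within the same SELECT list.
# No LateralColumnAliasReference may survive anywhere in the resolved plan.
spark.sql("SELECT id * 2 AS doubled, doubled + 1 AS plus_one FROM t").show()
```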

### Why are the changes needed?

Better error messages when a bug happens.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46185 from cloud-fan/small.

Authored-by: Wenchen Fan 
Signed-off-by: Dongjoon Hyun 
---
 .../spark/sql/catalyst/analysis/CheckAnalysis.scala  | 20 ++--
 1 file changed, 6 insertions(+), 14 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
index 10bff5e6e59a..d1b336b08955 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
@@ -110,9 +110,8 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog with QueryErrorsB
   }
 
   /** Check and throw exception when a given resolved plan contains 
LateralColumnAliasReference. */
-  private def checkNotContainingLCA(exprSeq: Seq[NamedExpression], plan: 
LogicalPlan): Unit = {
-if (!plan.resolved) return
-
exprSeq.foreach(_.transformDownWithPruning(_.containsPattern(LATERAL_COLUMN_ALIAS_REFERENCE))
 {
+  private def checkNotContainingLCA(exprs: Seq[Expression], plan: 
LogicalPlan): Unit = {
+
exprs.foreach(_.transformDownWithPruning(_.containsPattern(LATERAL_COLUMN_ALIAS_REFERENCE))
 {
   case lcaRef: LateralColumnAliasReference =>
 throw SparkException.internalError("Resolved plan should not contain 
any " +
   s"LateralColumnAliasReference.\nDebugging information: plan:\n$plan",
@@ -789,17 +788,10 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog with QueryErrorsB
   msg = s"Found the unresolved operator: 
${o.simpleString(SQLConf.get.maxToStringFields)}",
   context = o.origin.getQueryContext,
   summary = o.origin.context.summary)
-  // If the plan is resolved, the resolved Project, Aggregate or Window 
should have restored or
-  // resolved all lateral column alias references. Add check for extra 
safe.
-  case p @ Project(pList, _)
-if pList.exists(_.containsPattern(LATERAL_COLUMN_ALIAS_REFERENCE)) =>
-checkNotContainingLCA(pList, p)
-  case agg @ Aggregate(_, aggList, _)
-if aggList.exists(_.containsPattern(LATERAL_COLUMN_ALIAS_REFERENCE)) =>
-checkNotContainingLCA(aggList, agg)
-  case w @ Window(pList, _, _, _)
-if pList.exists(_.containsPattern(LATERAL_COLUMN_ALIAS_REFERENCE)) =>
-checkNotContainingLCA(pList, w)
+  // If the plan is resolved, all lateral column alias references should 
have been either
+  // restored or resolved. Add check for extra safe.
+  case o if 
o.expressions.exists(_.containsPattern(LATERAL_COLUMN_ALIAS_REFERENCE)) =>
+checkNotContainingLCA(o.expressions, o)
   case _ =>
 }
   }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-47948][PYTHON] Upgrade the minimum `Pandas` version to 2.0.0

2024-04-23 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2b01755f2791 [SPARK-47948][PYTHON] Upgrade the minimum `Pandas` 
version to 2.0.0
2b01755f2791 is described below

commit 2b01755f27917b1d391835e6f8b1b2f9a34cc832
Author: Haejoon Lee 
AuthorDate: Tue Apr 23 07:49:15 2024 -0700

[SPARK-47948][PYTHON] Upgrade the minimum `Pandas` version to 2.0.0

### What changes were proposed in this pull request?

This PR proposes to bump Pandas version up to 2.0.0.

### Why are the changes needed?

From Apache Spark 4.0.0, Pandas API on Spark supports Pandas 2.0.0 and above, and some features are broken with Pandas 1.x, so installing Pandas 2.x is required.

See the full list of breaking changes from [Upgrading from PySpark 3.5 to 
4.0](https://github.com/apache/spark/blob/master/python/docs/source/migration_guide/pyspark_upgrade.rst#upgrading-from-pyspark-35-to-40).
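
A quick way to see the gate in action; `require_minimum_pandas_version` lives in the `python/pyspark/sql/pandas/utils.py` file touched below, while the surrounding script is illustrative:

```python
from pyspark.sql.pandas.utils import require_minimum_pandas_version

# Raises an import error with a descriptive message if the installed pandas
# is older than the minimum PySpark supports (2.0.0 after this change).
require_minimum_pandas_version()
print("pandas is new enough for PySpark")
```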

### Does this PR introduce _any_ user-facing change?

No API changes, but the minimum Pandas version in the user-facing documentation will be changed.

### How was this patch tested?

The existing CI should pass.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46175 from itholic/bump_pandas_2.

Authored-by: Haejoon Lee 
Signed-off-by: Dongjoon Hyun 
---
 dev/create-release/spark-rm/Dockerfile | 2 +-
 python/docs/source/getting_started/install.rst | 6 +++---
 python/docs/source/migration_guide/pyspark_upgrade.rst | 3 +--
 python/docs/source/user_guide/sql/arrow_pandas.rst | 2 +-
 python/packaging/classic/setup.py  | 2 +-
 python/packaging/connect/setup.py  | 2 +-
 python/pyspark/sql/pandas/utils.py | 2 +-
 7 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/dev/create-release/spark-rm/Dockerfile 
b/dev/create-release/spark-rm/Dockerfile
index f51b24d58394..8d5ca38ba88e 100644
--- a/dev/create-release/spark-rm/Dockerfile
+++ b/dev/create-release/spark-rm/Dockerfile
@@ -37,7 +37,7 @@ ENV DEBCONF_NONINTERACTIVE_SEEN true
 # These arguments are just for reuse and not really meant to be customized.
 ARG APT_INSTALL="apt-get install --no-install-recommends -y"
 
-ARG PIP_PKGS="sphinx==4.5.0 mkdocs==1.1.2 numpy==1.20.3 
pydata_sphinx_theme==0.13.3 ipython==7.19.0 nbsphinx==0.8.0 numpydoc==1.1.0 
jinja2==3.1.2 twine==3.4.1 sphinx-plotly-directive==0.1.3 
sphinx-copybutton==0.5.2 pandas==1.5.3 pyarrow==10.0.1 plotly==5.4.0 
markupsafe==2.0.1 docutils<0.17 grpcio==1.62.0 protobuf==4.21.6 
grpcio-status==1.62.0 googleapis-common-protos==1.56.4"
+ARG PIP_PKGS="sphinx==4.5.0 mkdocs==1.1.2 numpy==1.20.3 
pydata_sphinx_theme==0.13.3 ipython==7.19.0 nbsphinx==0.8.0 numpydoc==1.1.0 
jinja2==3.1.2 twine==3.4.1 sphinx-plotly-directive==0.1.3 
sphinx-copybutton==0.5.2 pandas==2.0.3 pyarrow==10.0.1 plotly==5.4.0 
markupsafe==2.0.1 docutils<0.17 grpcio==1.62.0 protobuf==4.21.6 
grpcio-status==1.62.0 googleapis-common-protos==1.56.4"
 ARG GEM_PKGS="bundler:2.3.8"
 
 # Install extra needed repos and refresh.
diff --git a/python/docs/source/getting_started/install.rst 
b/python/docs/source/getting_started/install.rst
index 08b6cc813cba..33a0560764df 100644
--- a/python/docs/source/getting_started/install.rst
+++ b/python/docs/source/getting_started/install.rst
@@ -205,7 +205,7 @@ Installable with ``pip install "pyspark[connect]"``.
 == = ==
 PackageSupported version Note
 == = ==
-`pandas`   >=1.4.4   Required for Spark Connect
+`pandas`   >=2.0.0   Required for Spark Connect
 `pyarrow`  >=10.0.0  Required for Spark Connect
 `grpcio`   >=1.62.0  Required for Spark Connect
 `grpcio-status`>=1.62.0  Required for Spark Connect
@@ -220,7 +220,7 @@ Installable with ``pip install "pyspark[sql]"``.
 = = ==
 Package   Supported version Note
 = = ==
-`pandas`  >=1.4.4   Required for Spark SQL
+`pandas`  >=2.0.0   Required for Spark SQL
 `pyarrow` >=10.0.0  Required for Spark SQL
 = = ==
 
@@ -233,7 +233,7 @@ Installable with ``pip install "pyspark[pandas_on_spark]"``.
 = = 
 Package   Supported version Note
 = = 
-`p

(spark) branch master updated (cf5fc0c720ee -> 9c4f12ca04ac)

2024-04-23 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from cf5fc0c720ee [MINOR][DOCS] Fix type hint of 3 functions
 add 9c4f12ca04ac [SPARK-47949][SQL][DOCKER][TESTS] MsSQLServer: Bump up 
mssql docker image version to 2022-CU12-GDR1-ubuntu-22.04

No new revisions were added by this update.

Summary of changes:
 ...OnDocker.scala => MsSQLServerDatabaseOnDocker.scala} | 13 +++--
 .../spark/sql/jdbc/MsSqlServerIntegrationSuite.scala| 14 +-
 .../spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala | 16 ++--
 .../spark/sql/jdbc/v2/MsSqlServerNamespaceSuite.scala   | 17 ++---
 4 files changed, 12 insertions(+), 48 deletions(-)
 copy 
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/{MySQLDatabaseOnDocker.scala
 => MsSQLServerDatabaseOnDocker.scala} (72%)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [MINOR][DOCS] Fix type hint of 3 functions

2024-04-23 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new cf5fc0c720ee [MINOR][DOCS] Fix type hint of 3 functions
cf5fc0c720ee is described below

commit cf5fc0c720eef01c5fe86a6ce05160adbdbf4678
Author: Ruifeng Zheng 
AuthorDate: Tue Apr 23 07:42:44 2024 -0700

[MINOR][DOCS] Fix type hint of 3 functions

### What changes were proposed in this pull request?
Fix type hint of 3 functions

I did a quick scan of the functions and didn't find other similar places.

### Why are the changes needed?
a string input will be treated as a literal instead of a column name
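
The distinction, shown concretely (a short sketch; `spark` is a plain local session):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("schema-of-csv-hint").getOrCreate()
df = spark.range(1)

# A plain str argument is a literal CSV sample, not a column name ...
df.select(F.schema_of_csv("1,abc")).show(truncate=False)
# ... and the equivalent explicit form passes a Column literal.
df.select(F.schema_of_csv(F.lit("1,abc"))).show(truncate=False)
```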

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46179 from zhengruifeng/correct_con.

Authored-by: Ruifeng Zheng 
Signed-off-by: Dongjoon Hyun 
---
 python/pyspark/sql/connect/functions/builtin.py | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/python/pyspark/sql/connect/functions/builtin.py 
b/python/pyspark/sql/connect/functions/builtin.py
index 519e53c3a13f..8fffb1831466 100644
--- a/python/pyspark/sql/connect/functions/builtin.py
+++ b/python/pyspark/sql/connect/functions/builtin.py
@@ -2141,7 +2141,7 @@ def sequence(
 sequence.__doc__ = pysparkfuncs.sequence.__doc__
 
 
-def schema_of_csv(csv: "ColumnOrName", options: Optional[Dict[str, str]] = 
None) -> Column:
+def schema_of_csv(csv: Union[str, Column], options: Optional[Dict[str, str]] = 
None) -> Column:
 if isinstance(csv, Column):
 _csv = csv
 elif isinstance(csv, str):
@@ -2161,7 +2161,7 @@ def schema_of_csv(csv: "ColumnOrName", options: 
Optional[Dict[str, str]] = None)
 schema_of_csv.__doc__ = pysparkfuncs.schema_of_csv.__doc__
 
 
-def schema_of_json(json: "ColumnOrName", options: Optional[Dict[str, str]] = 
None) -> Column:
+def schema_of_json(json: Union[str, Column], options: Optional[Dict[str, str]] 
= None) -> Column:
 if isinstance(json, Column):
 _json = json
 elif isinstance(json, str):
@@ -2181,7 +2181,7 @@ def schema_of_json(json: "ColumnOrName", options: 
Optional[Dict[str, str]] = Non
 schema_of_json.__doc__ = pysparkfuncs.schema_of_json.__doc__
 
 
-def schema_of_xml(xml: "ColumnOrName", options: Optional[Dict[str, str]] = 
None) -> Column:
+def schema_of_xml(xml: Union[str, Column], options: Optional[Dict[str, str]] = 
None) -> Column:
 if isinstance(xml, Column):
 _xml = xml
 elif isinstance(xml, str):


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (ca916258b991 -> 33fa77cb4868)

2024-04-23 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from ca916258b991 [SPARK-47953][DOCS] MsSQLServer: Document Mapping Spark 
SQL Data Types to Microsoft SQL Server
 add 33fa77cb4868 [MINOR][DOCS] Add `docs/_generated/` to .gitignore

No new revisions were added by this update.

Summary of changes:
 .gitignore | 1 +
 1 file changed, 1 insertion(+)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-47953][DOCS] MsSQLServer: Document Mapping Spark SQL Data Types to Microsoft SQL Server

2024-04-23 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ca916258b991 [SPARK-47953][DOCS] MsSQLServer: Document Mapping Spark 
SQL Data Types to Microsoft SQL Server
ca916258b991 is described below

commit ca916258b9916452aa2f377608e6be8df65550e5
Author: Kent Yao 
AuthorDate: Tue Apr 23 07:41:04 2024 -0700

[SPARK-47953][DOCS] MsSQLServer: Document Mapping Spark SQL Data Types to 
Microsoft SQL Server

### What changes were proposed in this pull request?

This PR adds documentation for mapping Spark SQL data types to Microsoft SQL Server.

### Why are the changes needed?

doc improvement
### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

doc build

![image](https://github.com/apache/spark/assets/8326978/7220d96a-c5ca-4780-9fc5-f93c99f91c10)

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #46177 from yaooqinn/SPARK-47953.

Authored-by: Kent Yao 
Signed-off-by: Dongjoon Hyun 
---
 docs/sql-data-sources-jdbc.md | 106 ++
 1 file changed, 106 insertions(+)

diff --git a/docs/sql-data-sources-jdbc.md b/docs/sql-data-sources-jdbc.md
index 51c0886430a3..734ed43f912a 100644
--- a/docs/sql-data-sources-jdbc.md
+++ b/docs/sql-data-sources-jdbc.md
@@ -1630,3 +1630,109 @@ as the activated JDBC Driver.
 
   
 
+
+### Mapping Spark SQL Data Types to Microsoft SQL Server
+
+The below table describes the data type conversions from Spark SQL Data Types 
to Microsoft SQL Server data types,
+when creating, altering, or writing data to a Microsoft SQL Server table using 
the built-in jdbc data source with
+the mssql-jdbc as the activated JDBC Driver.
+
+| Spark SQL Data Type | SQL Server Data Type | Remarks |
+|---------------------|----------------------|---------|
+| BooleanType | bit | |
+| ByteType | smallint | Supported since Spark 4.0.0, previous versions throw errors |
+| ShortType | smallint | |
+| IntegerType | int | |
+| LongType | bigint | |
+| FloatType | real | |
+| DoubleType | double precision | |
+| DecimalType(p, s) | number(p,s) | |
+| DateType | date | |
+| TimestampType | datetime | |
+| TimestampNTZType | datetime | |
+| StringType | nvarchar(max) | |
+| BinaryType | varbinary(max) | |
+| CharType(n) | char(n) | |
+| VarcharType(n) | varchar(n) | |
+
+The Spark Catalyst data types below are not supported with suitable SQL Server 
types.
+
+- DayTimeIntervalType
+- YearMonthIntervalType
+- CalendarIntervalType
+- ArrayType
+- MapType
+- StructType
+- UserDefinedType
+- NullType
+- ObjectType
+- VariantType
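
A hedged PySpark sketch of the write path this mapping section documents (connection details are placeholders; a reachable SQL Server and the mssql-jdbc driver on the classpath are assumed):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mssql-type-mapping-demo").getOrCreate()

df = spark.sql("SELECT true AS flag, CAST(1 AS SMALLINT) AS small, 'text' AS label")

(df.write.format("jdbc")
   .option("url", "jdbc:sqlserver://host:1433;databaseName=db;user=sa;password=***")
   .option("dbtable", "dbo.type_mapping_demo")
   .mode("overwrite")
   .save())
```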


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark-kubernetes-operator) branch main updated: [SPARK-47943] Add `GitHub Action` CI for Java Build and Test

2024-04-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/spark-kubernetes-operator.git


The following commit(s) were added to refs/heads/main by this push:
 new 4a5febd  [SPARK-47943] Add `GitHub Action` CI for Java Build and Test
4a5febd is described below

commit 4a5febd8f48716c0506738fc6a5fd58afb95779f
Author: zhou-jiang 
AuthorDate: Mon Apr 22 22:44:17 2024 -0700

[SPARK-47943] Add `GitHub Action` CI for Java Build and Test

### What changes were proposed in this pull request?

This PR adds an additional CI build task for operator.

### Why are the changes needed?

The additional CI task is needed in order to build and test Java code for 
upcoming operator pull requests.

When the Java plugin is enabled and Java sources are checked in, the `./gradlew build` [task](https://docs.gradle.org/3.3/userguide/java_plugin.html#sec:java_tasks) by default includes a set of tasks that compile the code and run tests. This can serve as the pull request build.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

tested locally.

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #7 from jiangzho/ci.

Authored-by: zhou-jiang 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 6a5a147..887119f 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -26,4 +26,20 @@ jobs:
   GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
 with:
   config: .github/.licenserc.yaml
-
+  build-test:
+name: "Build Test CI"
+runs-on: ubuntu-latest
+strategy:
+  matrix:
+java-version: [ 17, 21 ]
+steps:
+  - name: Checkout repository
+uses: actions/checkout@v3
+  - name: Set up JDK ${{ matrix.java-version }}
+uses: actions/setup-java@v2
+with:
+  java-version: ${{ matrix.java-version }}
+  distribution: 'adopt'
+  - name: Build with Gradle
+run: |
+  set -o pipefail; ./gradlew build; set +o pipefail


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark-kubernetes-operator) branch main updated: [SPARK-47929] Setup Static Analysis for Operator

2024-04-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/spark-kubernetes-operator.git


The following commit(s) were added to refs/heads/main by this push:
 new 798ca15  [SPARK-47929] Setup Static Analysis for Operator
798ca15 is described below

commit 798ca15844c71baf5d7f1f8842e461a73c1009a9
Author: zhou-jiang 
AuthorDate: Mon Apr 22 22:42:23 2024 -0700

[SPARK-47929] Setup Static Analysis for Operator

### What changes were proposed in this pull request?

This is a breakdown PR from #2, setting up common Java build tasks and corresponding plugins.

### Why are the changes needed?

This PR includes Checkstyle, PMD, and SpotBugs. It also includes JaCoCo for coverage analysis and Spotless for formatting. These tasks help enhance the quality of future Java contributions and can also be referenced in CI tasks for automation.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Tested manually.

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #6 from jiangzho/builder_task.

Authored-by: zhou-jiang 
Signed-off-by: Dongjoon Hyun 
---
 build.gradle |  76 -
 config/checkstyle/checkstyle.xml | 208 +++
 config/pmd/ruleset.xml   |  33 ++
 config/spotbugs/spotbugs_exclude.xml |  25 +
 gradle.properties|  22 
 5 files changed, 362 insertions(+), 2 deletions(-)

diff --git a/build.gradle b/build.gradle
index 6732f5a..f64212b 100644
--- a/build.gradle
+++ b/build.gradle
@@ -1,3 +1,18 @@
+buildscript {
+  repositories {
+maven {
+  url = uri("https://plugins.gradle.org/m2/";)
+}
+  }
+  dependencies {
+classpath 
"com.github.spotbugs.snom:spotbugs-gradle-plugin:${spotBugsGradlePluginVersion}"
+classpath 
"com.diffplug.spotless:spotless-plugin-gradle:${spotlessPluginVersion}"
+  }
+}
+
+assert JavaVersion.current().isCompatibleWith(JavaVersion.VERSION_17): "Java 
17 or newer is " +
+"required"
+
 subprojects {
   apply plugin: 'idea'
   apply plugin: 'eclipse'
@@ -6,7 +21,64 @@ subprojects {
   targetCompatibility = 17
 
   repositories {
-  mavenCentral()
-  jcenter()
+mavenCentral()
+jcenter()
+  }
+
+  apply plugin: 'checkstyle'
+  checkstyle {
+toolVersion = checkstyleVersion
+configFile = file("$rootDir/config/checkstyle/checkstyle.xml")
+ignoreFailures = false
+showViolations = true
+  }
+
+  apply plugin: 'pmd'
+  pmd {
+ruleSets = ["java-basic", "java-braces"]
+ruleSetFiles = files("$rootDir/config/pmd/ruleset.xml")
+toolVersion = pmdVersion
+consoleOutput = true
+ignoreFailures = false
+  }
+
+  apply plugin: 'com.github.spotbugs'
+  spotbugs {
+toolVersion = spotBugsVersion
+afterEvaluate {
+  reportsDir = file("${project.reporting.baseDir}/findbugs")
+}
+excludeFilter = file("$rootDir/config/spotbugs/spotbugs_exclude.xml")
+ignoreFailures = false
+  }
+
+  apply plugin: 'jacoco'
+  jacoco {
+toolVersion = jacocoVersion
+  }
+  jacocoTestReport {
+dependsOn test
+  }
+
+  apply plugin: 'com.diffplug.spotless'
+  spotless {
+java {
+  endWithNewline()
+  googleJavaFormat('1.17.0')
+  importOrder(
+'java',
+'javax',
+'scala',
+'',
+'org.apache.spark',
+  )
+  trimTrailingWhitespace()
+  removeUnusedImports()
+}
+format 'misc', {
+  target '*.md', '*.gradle', '**/*.properties', '**/*.xml', '**/*.yaml', 
'**/*.yml'
+  endWithNewline()
+  trimTrailingWhitespace()
+}
   }
 }
diff --git a/config/checkstyle/checkstyle.xml b/config/checkstyle/checkstyle.xml
new file mode 100644
index 000..90161fe
--- /dev/null
+++ b/config/checkstyle/checkstyle.xml
@@ -0,0 +1,208 @@
+  [checkstyle.xml rule definitions not recoverable here: the XML elements were stripped in this archive rendering; only the Checkstyle configuration DTD reference (https://checkstyle.org/dtds/configuration_1_3.dtd) survives]

(spark) branch master updated (9d715ba49171 -> 876c2cf34a35)

2024-04-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 9d715ba49171 [SPARK-47938][SQL] MsSQLServer: Cannot find data type 
BYTE error
 add 876c2cf34a35 [SPARK-44170][BUILD][FOLLOWUP] Align JUnit5 dependency's 
version and clean up exclusions

No new revisions were added by this update.

Summary of changes:
 pom.xml | 69 +++--
 1 file changed, 41 insertions(+), 28 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-47938][SQL] MsSQLServer: Cannot find data type BYTE error

2024-04-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9d715ba49171 [SPARK-47938][SQL] MsSQLServer: Cannot find data type 
BYTE error
9d715ba49171 is described below

commit 9d715ba491710969340d9e8a49a21d11f51ef7d3
Author: Kent Yao 
AuthorDate: Mon Apr 22 22:31:13 2024 -0700

[SPARK-47938][SQL] MsSQLServer: Cannot find data type BYTE error

### What changes were proposed in this pull request?

This PR uses SMALLINT instead of BYTE (SQL Server's TINYINT covers only [0, 255], so it cannot hold Spark's signed ByteType) to fix the ByteType mapping for the MsSQLServer JDBC dialect:

```java
[info]   com.microsoft.sqlserver.jdbc.SQLServerException: Column, 
parameter, or variable #1: Cannot find data type BYTE.
[info]   at 
com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:265)
[info]   at 
com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1662)
[info]   at 
com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:898)
[info]   at 
com.microsoft.sqlserver.jdbc.SQLServerStatement$StmtExecCmd.doExecute(SQLServerStatement.java:793)
[info]   at 
com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7417)
[info]   at 
com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:3488)
[info]   at 
com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:262)
[info]   at 
com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:237)
[info]   at 
com.microsoft.sqlserver.jdbc.SQLServerStatement.executeUpdate(SQLServerStatement.java:733)
[info]   at 
org.apache.spark.sql.jdbc.JdbcDialect.createTable(JdbcDialects.scala:267)
```
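
A PySpark sketch of the round trip the new test exercises (the URL is a placeholder; a reachable SQL Server and the mssql-jdbc driver are assumed):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("byte-to-smallint").getOrCreate()
url = "jdbc:sqlserver://host:1433;databaseName=db;user=sa;password=***"

df = spark.sql("SELECT CAST(1 AS BYTE) AS c0")

# With the fix, ByteType is created as SMALLINT on the SQL Server side,
# so the write no longer fails with "Cannot find data type BYTE".
df.write.jdbc(url, "test_byte", mode="overwrite")
spark.read.jdbc(url, "test_byte").show()
```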

### Why are the changes needed?

bugfix

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests
### Was this patch authored or co-authored using generative AI tooling?

no

Closes #46164 from yaooqinn/SPARK-47938.

Lead-authored-by: Kent Yao 
Co-authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../org/apache/spark/sql/jdbc/MsSqlServerIntegrationSuite.scala   | 8 
 .../main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala | 1 +
 2 files changed, 9 insertions(+)

diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSqlServerIntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSqlServerIntegrationSuite.scala
index 8bceb9506e85..273e8c35dd07 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSqlServerIntegrationSuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSqlServerIntegrationSuite.scala
@@ -437,4 +437,12 @@ class MsSqlServerIntegrationSuite extends 
DockerJDBCIntegrationSuite {
   .load()
 assert(df.collect().toSet === expectedResult)
   }
+
+  test("SPARK-47938: Fix 'Cannot find data type BYTE' in SQL Server") {
+spark.sql("select cast(1 as byte) as c0")
+  .write
+  .jdbc(jdbcUrl, "test_byte", new Properties)
+val df = spark.read.jdbc(jdbcUrl, "test_byte", new Properties)
+checkAnswer(df, Row(1.toShort))
+  }
 }
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala
index 862e99adc3b0..1d05c0d7c24e 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala
@@ -136,6 +136,7 @@ private case class MsSqlServerDialect() extends JdbcDialect 
{
 case BinaryType => Some(JdbcType("VARBINARY(MAX)", 
java.sql.Types.VARBINARY))
 case ShortType if !SQLConf.get.legacyMsSqlServerNumericMappingEnabled =>
   Some(JdbcType("SMALLINT", java.sql.Types.SMALLINT))
+case ByteType => Some(JdbcType("SMALLINT", java.sql.Types.TINYINT))
 case _ => None
   }
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (e4fb7dd98219 -> a97e72cfa7d4)

2024-04-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from e4fb7dd98219 [MINOR] Remove unnecessary `imports`
 add a97e72cfa7d4 [SPARK-47937][PYTHON][DOCS] Fix docstring of 
`hll_sketch_agg`

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/functions/builtin.py |  8 +---
 python/pyspark/sql/functions/builtin.py | 12 +++-
 2 files changed, 12 insertions(+), 8 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (b335dd366fb1 -> e4fb7dd98219)

2024-04-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from b335dd366fb1 [SPARK-47909][CONNECT][PYTHON][TESTS][FOLLOW-UP] Move 
`pyspark.classic` references
 add e4fb7dd98219 [MINOR] Remove unnecessary `imports`

No new revisions were added by this update.

Summary of changes:
 core/src/main/scala/org/apache/spark/util/Distribution.scala| 2 --
 .../scala/org/apache/spark/input/WholeTextFileInputFormatSuite.scala| 2 --
 .../scala/org/apache/spark/input/WholeTextFileRecordReaderSuite.scala   | 2 --
 sql/api/src/main/scala/org/apache/spark/sql/types/UpCastRule.scala  | 2 --
 .../src/main/scala/org/apache/spark/sql/execution/CacheManager.scala| 2 --
 .../scala/org/apache/spark/sql/CollationRegexpExpressionsSuite.scala| 2 --
 .../scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala| 2 --
 sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala   | 1 -
 .../test/scala/org/apache/spark/sql/hive/client/HiveClientSuites.scala  | 2 --
 .../org/apache/spark/sql/hive/client/HiveClientUserNameSuites.scala | 2 --
 .../scala/org/apache/spark/sql/hive/client/HiveClientVersions.scala | 2 --
 .../org/apache/spark/sql/hive/client/HivePartitionFilteringSuites.scala | 2 --
 12 files changed, 23 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-47904][SQL][3.5] Preserve case in Avro schema when using enableStableIdentifiersForUnionType

2024-04-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new d7c3794a0c56 [SPARK-47904][SQL][3.5] Preserve case in Avro schema when 
using enableStableIdentifiersForUnionType
d7c3794a0c56 is described below

commit d7c3794a0c567b12e8c8e18132aa362f11acdf5f
Author: Ivan Sadikov 
AuthorDate: Mon Apr 22 15:36:13 2024 -0700

[SPARK-47904][SQL][3.5] Preserve case in Avro schema when using 
enableStableIdentifiersForUnionType

### What changes were proposed in this pull request?

Backport of https://github.com/apache/spark/pull/46126 to branch-3.5.

When `enableStableIdentifiersForUnionType` is enabled, all of the type names are lowercased, which creates a problem when they are case-sensitive:

Union type with fields:
```
Schema.createEnum("myENUM", "", null, List[String]("E1", "e2").asJava),
Schema.createRecord("myRecord2", "", null, false, List[Schema.Field](new 
Schema.Field("F", Schema.create(Type.FLOAT))).asJava)
```

would become

```
struct<member_myenum: ..., member_myrecord2: struct<...>>
```

but instead should be
```
struct<member_myENUM: ..., member_myRecord2: struct<...>>
```
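
A hedged PySpark sketch of reading with the option in question (the path is a placeholder; the spark-avro package must be on the classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-stable-union-ids").getOrCreate()

df = (spark.read.format("avro")
      .option("enableStableIdentifiersForUnionType", "true")
      .load("/tmp/union_data.avro"))

# With the fix, union member columns keep the original case of the Avro type
# names, e.g. member_myENUM / member_myRecord2 instead of lowercased names.
df.printSchema()
```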

### Why are the changes needed?

Fixes a bug of lowercasing the field name (the type portion).

### Does this PR introduce _any_ user-facing change?

Yes, if a user enables `enableStableIdentifiersForUnionType` and has Union 
types, all fields will preserve the case. Previously, the field names would be 
all in lowercase.

### How was this patch tested?

I added a test case to verify the new field names.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46169 from sadikovi/SPARK-47904-3.5.

Authored-by: Ivan Sadikov 
Signed-off-by: Dongjoon Hyun 
---
 .../apache/spark/sql/avro/SchemaConverters.scala   | 10 +++
 .../org/apache/spark/sql/avro/AvroSuite.scala  | 31 --
 2 files changed, 34 insertions(+), 7 deletions(-)

diff --git 
a/connector/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
 
b/connector/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
index 06abe977e3b0..af358a8d1c96 100644
--- 
a/connector/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
+++ 
b/connector/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
@@ -183,14 +183,14 @@ object SchemaConverters {
   // Avro's field name may be case sensitive, so field names for two named type
   // could be "a" and "A" and we need to distinguish them. In this case, we throw
   // an exception.
-  val temp_name = s"member_${s.getName.toLowerCase(Locale.ROOT)}"
-  if (fieldNameSet.contains(temp_name)) {
+  // Stable id prefix can be empty so the name of the field can be just the type.
+  val tempFieldName = s"member_${s.getName}"
+  if (!fieldNameSet.add(tempFieldName.toLowerCase(Locale.ROOT))) {
     throw new IncompatibleSchemaException(
-      "Cannot generate stable indentifier for Avro union type due to name " +
+      "Cannot generate stable identifier for Avro union type due to name " +
         s"conflict of type name ${s.getName}")
   }
-  fieldNameSet.add(temp_name)
-  temp_name
+  tempFieldName
 } else {
   s"member$i"
 }
diff --git 
a/connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala 
b/connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
index 1df99210a55a..01c9dfb57a19 100644
--- a/connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
+++ b/connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
@@ -370,7 +370,7 @@ abstract class AvroSuite
   "",
   Seq())
   }
-  assert(e.getMessage.contains("Cannot generate stable indentifier"))
+  assert(e.getMessage.contains("Cannot generate stable identifier"))
 }
 {
   val e = intercept[Exception] {
@@ -381,7 +381,7 @@ abstract class AvroSuite
   "",
   Seq())
   }
-  assert(e.getMessage.contains("Cannot generate stable indentifier"))
+  assert(e.getMessage.contains("Cannot generate stable identifier"))
 }
 // Two array types or two map types are not allowed in union.
 {
@@ -434,6 +434,33 @@ abstract class AvroSuite
 }
   }
 
+

(spark) branch master updated: [SPARK-47942][K8S][DOCS] Drop K8s v1.26 Support

2024-04-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ac9a12ef6e06 [SPARK-47942][K8S][DOCS] Drop K8s v1.26 Support
ac9a12ef6e06 is described below

commit ac9a12ef6e062ae07e878e202521b22de9979a17
Author: Dongjoon Hyun 
AuthorDate: Mon Apr 22 14:46:03 2024 -0700

[SPARK-47942][K8S][DOCS] Drop K8s v1.26 Support

### What changes were proposed in this pull request?

This PR aims to update K8s docs to recommend K8s v1.27+ for Apache Spark 
4.0.0.

This is a follow-up of the previous PR below, since the Apache Spark 4.0.0 schedule has slipped slightly.
- #43069

### Why are the changes needed?

**1. The K8s community started releasing v1.30.0 on 2024-04-17.**
- https://kubernetes.io/releases/#release-v1-30

**2. Default K8s Version in Public Cloud environments**

The default K8s versions of public cloud providers are already K8s 1.27+.

- EKS: v1.29 (Default)
- GKE: v1.29 (Rapid),  v1.28 (Regular), v1.27 (Stable)
- AKS: v1.27

**3. End Of Support**

In addition, K8s 1.26 will already have reached its end of support when Apache Spark 4.0.0 arrives: it reaches EOL in June 2024 on GKE and EKS, and reached EOL in March 2024 on AKS.

| K8s  |   AKS   |   GKE   |   EKS   |
| ---- | ------- | ------- | ------- |
| 1.26 | 2024-03 | 2024-06 | 2024-06 |

- [AKS EOL Schedule](https://docs.microsoft.com/en-us/azure/aks/supported-kubernetes-versions?tabs=azure-cli#aks-kubernetes-release-calendar)
- [GKE EOL Schedule](https://cloud.google.com/kubernetes-engine/docs/release-schedule)
- [EKS EOL Schedule](https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html#kubernetes-release-calendar)

### Does this PR introduce _any_ user-facing change?

- No, this is a documentation-only change about K8s versions.
- The Apache Spark K8s Integration Test already uses K8s v1.30.0 on Minikube.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46168 from dongjoon-hyun/SPARK-47942.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 docs/running-on-kubernetes.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/running-on-kubernetes.md b/docs/running-on-kubernetes.md
index 778af5f0751a..606b5eb6f900 100644
--- a/docs/running-on-kubernetes.md
+++ b/docs/running-on-kubernetes.md
@@ -44,7 +44,7 @@ Cluster administrators should use [Pod Security 
Policies](https://kubernetes.io/
 
 # Prerequisites
 
-* A running Kubernetes cluster at version >= 1.26 with access configured to it using
+* A running Kubernetes cluster at version >= 1.27 with access configured to it using
 [kubectl](https://kubernetes.io/docs/reference/kubectl/).  If you do not already have a working Kubernetes cluster,
 you may set up a test cluster on your local machine using
 [minikube](https://kubernetes.io/docs/getting-started-guides/minikube/).


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (f2d0cf23018f -> fc0c8553ea05)

2024-04-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from f2d0cf23018f [SPARK-47907][SQL] Put bang under a config
 add fc0c8553ea05 [SPARK-47904][SQL] Preserve case in Avro schema when 
using enableStableIdentifiersForUnionType

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/avro/SchemaConverters.scala   |  8 +++---
 .../org/apache/spark/sql/avro/AvroSuite.scala  | 31 --
 2 files changed, 32 insertions(+), 7 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-47940][BUILD][TESTS] Upgrade `guava` dependency to `33.1.0-jre` in Docker IT

2024-04-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 86563169eef8 [SPARK-47940][BUILD][TESTS] Upgrade `guava` dependency to 
`33.1.0-jre` in Docker IT
86563169eef8 is described below

commit 86563169eef899040e1ec70dd9963c64311dbaa1
Author: Cheng Pan 
AuthorDate: Mon Apr 22 13:34:20 2024 -0700

[SPARK-47940][BUILD][TESTS] Upgrade `guava` dependency to `33.1.0-jre` in 
Docker IT

### What changes were proposed in this pull request?

This PR aims to upgrade the `guava` dependency to `33.1.0-jre` in the Docker integration tests.

### Why are the changes needed?

This is preparation for the following PR:
- #45372

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46167 from dongjoon-hyun/SPARK-47940.

Authored-by: Cheng Pan 
Signed-off-by: Dongjoon Hyun 
---
 connector/docker-integration-tests/pom.xml | 2 +-
 project/SparkBuild.scala   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
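
The `project/SparkBuild.scala` hunk below relies on sbt's `dependencyOverrides` setting, which forces any transitive request for the module to resolve to the pinned version. A minimal, hypothetical `build.sbt` sketch (illustration only, not Spark's actual build definition):

```
// Hypothetical standalone build.sbt, for illustration only.
// Even if some other dependency transitively asks for an older Guava,
// the override below forces resolution to 33.1.0-jre.
libraryDependencies += "com.google.guava" % "guava" % "33.1.0-jre" % Test
dependencyOverrides += "com.google.guava" % "guava" % "33.1.0-jre"
```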

diff --git a/connector/docker-integration-tests/pom.xml 
b/connector/docker-integration-tests/pom.xml
index bb7647c72491..9003c2190be2 100644
--- a/connector/docker-integration-tests/pom.xml
+++ b/connector/docker-integration-tests/pom.xml
@@ -39,7 +39,7 @@
 
     <dependency>
       <groupId>com.google.guava</groupId>
       <artifactId>guava</artifactId>
-      <version>33.0.0-jre</version>
+      <version>33.1.0-jre</version>
       <scope>test</scope>
     </dependency>
 
diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala
index bcaa51ec30ff..1bcc9c893393 100644
--- a/project/SparkBuild.scala
+++ b/project/SparkBuild.scala
@@ -952,7 +952,7 @@ object Unsafe {
 object DockerIntegrationTests {
   // This serves to override the override specified in DependencyOverrides:
   lazy val settings = Seq(
-dependencyOverrides += "com.google.guava" % "guava" % "33.0.0-jre"
+dependencyOverrides += "com.google.guava" % "guava" % "33.1.0-jre"
   )
 }
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (256fc51508e4 -> 676d47ffe091)

2024-04-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 256fc51508e4 [SPARK-47411][SQL] Support StringInstr & FindInSet 
functions to work with collated strings
 add 676d47ffe091 [SPARK-47935][INFRA][PYTHON] Pin `pandas==2.0.3` for 
`pypy3.8`

No new revisions were added by this update.

Summary of changes:
 dev/infra/Dockerfile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org


