[spark] branch master updated: [SPARK-44939][R] Support Java 21 in SparkR SystemRequirements

2023-08-23 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0c1342ea069 [SPARK-44939][R] Support Java 21 in SparkR 
SystemRequirements
0c1342ea069 is described below

commit 0c1342ea06941ce4a31e944f6d3bb7f52a7455d9
Author: Dongjoon Hyun 
AuthorDate: Wed Aug 23 22:57:16 2023 -0700

[SPARK-44939][R] Support Java 21 in SparkR SystemRequirements

### What changes were proposed in this pull request?

This PR aims to update `SystemRequirements` to support Java 21 in SparkR.

### Why are the changes needed?

To support Java 21 officially in SparkR. We've been running SparkR CI on the
master branch.
- https://github.com/apache/spark/actions/runs/5946839640/job/16128043220

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

On Java 21, do the following.
```
$ build/sbt test:package -Psparkr -Phive

$ R/install-dev.sh; R/run-tests.sh
...
Status: 2 NOTEs
See
  ‘/Users/dongjoon/APACHE/spark-merge/R/SparkR.Rcheck/00check.log’
for details.

+ popd
Tests passed.
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42645 from dongjoon-hyun/SPARK-44939.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 R/pkg/DESCRIPTION | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/R/pkg/DESCRIPTION b/R/pkg/DESCRIPTION
index cb9c5f1be07..cb982e00bdd 100644
--- a/R/pkg/DESCRIPTION
+++ b/R/pkg/DESCRIPTION
@@ -10,7 +10,7 @@ Authors@R:
 License: Apache License (== 2.0)
 URL: https://www.apache.org https://spark.apache.org
 BugReports: https://spark.apache.org/contributing.html
-SystemRequirements: Java (>= 8, < 18)
+SystemRequirements: Java (>= 8, < 22)
 Depends:
 R (>= 3.5),
 methods





[GitHub] [spark-website] panbingkun opened a new pull request, #474: [SPARK-44820][DOCS] Switch languages consistently across docs for all code snippets

2023-08-23 Thread via GitHub


panbingkun opened a new pull request, #474:
URL: https://github.com/apache/spark-website/pull/474

   When a user chooses a different language for a code snippet, all code 
snippets on that page should switch to the chosen language. This was the 
behavior in, for example, the Spark 2.0 docs: 
https://spark.apache.org/docs/2.0.0/structured-streaming-programming-guide.html
   
   But it was broken for later docs, for example the Spark 3.4.1 doc: 
https://spark.apache.org/docs/latest/quick-start.html
   
   We should fix this behavior change and possibly add test cases to prevent 
future regressions.
   
   Jira: https://issues.apache.org/jira/browse/SPARK-44820





[spark] branch branch-3.4 updated: [SPARK-44929][TESTS] Standardize log output for console appender in tests

2023-08-23 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 21a86b6fba4 [SPARK-44929][TESTS] Standardize log output for console 
appender in tests
21a86b6fba4 is described below

commit 21a86b6fba47a9e15cf97fbad0c277aef7c5ef5f
Author: Kent Yao 
AuthorDate: Thu Aug 24 13:51:26 2023 +0800

[SPARK-44929][TESTS] Standardize log output for console appender in tests

### What changes were proposed in this pull request?

This PR sets a character length limit for error messages and a stack depth 
limit for stack traces in the console appender used in tests.

The original patterns are

- %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n%ex
- %t: %m%n%ex

And they're adjusted to the new consistent pattern

- `%d{HH:mm:ss.SSS} %p %c: %maxLen{%m}{512}%n%ex{8}%n`

### Why are the changes needed?

In testing, intentional and unintentional failures are created to generate 
extensive log volumes. For instance, a single FileNotFound error may be logged 
multiple times in the writer, task runner, task set manager, and other areas, 
resulting in thousands of lines per failure.

For example, tests in ParquetRebaseDatetimeSuite are run with the V1 and V2 
data sources, two or more specific values, and multiple configuration pairs, 
and I have seen the SparkUpgradeException all over the CI logs.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

```
build/sbt "sql/testOnly *ParquetRebaseDatetimeV1Suite"
```

```
15:59:55.446 ERROR 
org.apache.spark.sql.execution.datasources.FileFormatWriter: Job 
job_202308230059551630377040190578321_1301 aborted.
15:59:55.446 ERROR org.apache.spark.executor.Executor: Exception in task 
0.0 in stage 1301.0 (TID 1595)
org.apache.spark.SparkException: [TASK_WRITE_FAILED] Task failed while 
writing rows to 
file:/Users/hzyaoqin/spark/target/tmp/spark-67cce58e-dfb2-4811-a9c0-50ec4c90d1f1.
at 
org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:765)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:420)
at 
org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:100)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
15:59:55.446 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 
in stage 1301.0 (TID 1595) (10.221.97.38 executor driver): 
org.apache.spark.SparkException: [TASK_WRITE_FAILED] Task failed while writing 
rows to 
file:/Users/hzyaoqin/spark/target/tmp/spark-67cce58e-dfb2-4811-a9c0-50ec4c90d1f1.
at 
org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:765)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:420)
at org.apache.spark.sql.execution.datasources
15:59:55.446 ERROR org.apache.spark.scheduler.TaskSetManager: Task 0 in 
stage 1301.0 failed 1 times; aborting job
15:59:55.447 ERROR 
org.apache.spark.sql.execution.datasources.FileFormatWriter: Aborting job 
0ead031e-c9dd-446b-b20b-c76ec54978b1.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
in stage 1301.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
1301.0 (TID 1595) (10.221.97.38 executor driver): 
org.apache.spark.SparkException: [TASK_WRITE_FAILED] Task failed while writing 
rows to 
file:/Users/hzyaoqin/spark/target/tmp/spark-67cce58e-dfb2-4811-a9c0-50ec4c90d1f1.
at 
org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:765)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:420)
at 
org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:100)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
15:59:55.579 ERROR org.apache.spark.executor.Executor: Exception in task 
0.0 in stage 1303.0 (TID 1597)

[spark] branch branch-3.5 updated: [SPARK-44929][TESTS] Standardize log output for console appender in tests

2023-08-23 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new e07e291cc7a [SPARK-44929][TESTS] Standardize log output for console 
appender in tests
e07e291cc7a is described below

commit e07e291cc7a325e7b698d1a974ff68423e1781e5
Author: Kent Yao 
AuthorDate: Thu Aug 24 13:51:26 2023 +0800

[SPARK-44929][TESTS] Standardize log output for console appender in tests

### What changes were proposed in this pull request?

This PR sets a character length limit for error messages and a stack depth 
limit for stack traces in the console appender used in tests.

The original patterns are

- %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n%ex
- %t: %m%n%ex

And they're adjusted to the new consistent pattern

- `%d{HH:mm:ss.SSS} %p %c: %maxLen{%m}{512}%n%ex{8}%n`

### Why are the changes needed?

In testing, intentional and unintentional failures are created to generate 
extensive log volumes. For instance, a single FileNotFound error may be logged 
multiple times in the writer, task runner, task set manager, and other areas, 
resulting in thousands of lines per failure.

For example, tests in ParquetRebaseDatetimeSuite are run with the V1 and V2 
data sources, two or more specific values, and multiple configuration pairs, 
and I have seen the SparkUpgradeException all over the CI logs.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

```
build/sbt "sql/testOnly *ParquetRebaseDatetimeV1Suite"
```

```
15:59:55.446 ERROR 
org.apache.spark.sql.execution.datasources.FileFormatWriter: Job 
job_202308230059551630377040190578321_1301 aborted.
15:59:55.446 ERROR org.apache.spark.executor.Executor: Exception in task 
0.0 in stage 1301.0 (TID 1595)
org.apache.spark.SparkException: [TASK_WRITE_FAILED] Task failed while 
writing rows to 
file:/Users/hzyaoqin/spark/target/tmp/spark-67cce58e-dfb2-4811-a9c0-50ec4c90d1f1.
at 
org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:765)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:420)
at 
org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:100)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
15:59:55.446 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 
in stage 1301.0 (TID 1595) (10.221.97.38 executor driver): 
org.apache.spark.SparkException: [TASK_WRITE_FAILED] Task failed while writing 
rows to 
file:/Users/hzyaoqin/spark/target/tmp/spark-67cce58e-dfb2-4811-a9c0-50ec4c90d1f1.
at 
org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:765)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:420)
at org.apache.spark.sql.execution.datasources
15:59:55.446 ERROR org.apache.spark.scheduler.TaskSetManager: Task 0 in 
stage 1301.0 failed 1 times; aborting job
15:59:55.447 ERROR 
org.apache.spark.sql.execution.datasources.FileFormatWriter: Aborting job 
0ead031e-c9dd-446b-b20b-c76ec54978b1.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
in stage 1301.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
1301.0 (TID 1595) (10.221.97.38 executor driver): 
org.apache.spark.SparkException: [TASK_WRITE_FAILED] Task failed while writing 
rows to 
file:/Users/hzyaoqin/spark/target/tmp/spark-67cce58e-dfb2-4811-a9c0-50ec4c90d1f1.
at 
org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:765)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:420)
at 
org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:100)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
15:59:55.579 ERROR org.apache.spark.executor.Executor: Exception in task 
0.0 in stage 1303.0 (TID 1597)

[spark] branch master updated: [SPARK-44929][TESTS] Standardize log output for console appender in tests

2023-08-23 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 830500150f7 [SPARK-44929][TESTS] Standardize log output for console 
appender in tests
830500150f7 is described below

commit 830500150f7e3972d1fa5b47d0ab564bfa7e4b12
Author: Kent Yao 
AuthorDate: Thu Aug 24 13:51:26 2023 +0800

[SPARK-44929][TESTS] Standardize log output for console appender in tests

### What changes were proposed in this pull request?

This PR sets a character length limit for error messages and a stack depth 
limit for stack traces in the console appender used in tests.

The original patterns are

- %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n%ex
- %t: %m%n%ex

And they're adjusted to the new consistent pattern

- `%d{HH:mm:ss.SSS} %p %c: %maxLen{%m}{512}%n%ex{8}%n`

### Why are the changes needed?

In testing, intentional and unintentional failures are created to generate 
extensive log volumes. For instance, a single FileNotFound error may be logged 
multiple times in the writer, task runner, task set manager, and other areas, 
resulting in thousands of lines per failure.

For example, tests in ParquetRebaseDatetimeSuite are run with the V1 and V2 
data sources, two or more specific values, and multiple configuration pairs, 
and I have seen the SparkUpgradeException all over the CI logs.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

```
build/sbt "sql/testOnly *ParquetRebaseDatetimeV1Suite"
```

```
15:59:55.446 ERROR 
org.apache.spark.sql.execution.datasources.FileFormatWriter: Job 
job_202308230059551630377040190578321_1301 aborted.
15:59:55.446 ERROR org.apache.spark.executor.Executor: Exception in task 
0.0 in stage 1301.0 (TID 1595)
org.apache.spark.SparkException: [TASK_WRITE_FAILED] Task failed while 
writing rows to 
file:/Users/hzyaoqin/spark/target/tmp/spark-67cce58e-dfb2-4811-a9c0-50ec4c90d1f1.
at 
org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:765)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:420)
at 
org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:100)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
15:59:55.446 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 
in stage 1301.0 (TID 1595) (10.221.97.38 executor driver): 
org.apache.spark.SparkException: [TASK_WRITE_FAILED] Task failed while writing 
rows to 
file:/Users/hzyaoqin/spark/target/tmp/spark-67cce58e-dfb2-4811-a9c0-50ec4c90d1f1.
at 
org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:765)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:420)
at org.apache.spark.sql.execution.datasources
15:59:55.446 ERROR org.apache.spark.scheduler.TaskSetManager: Task 0 in 
stage 1301.0 failed 1 times; aborting job
15:59:55.447 ERROR 
org.apache.spark.sql.execution.datasources.FileFormatWriter: Aborting job 
0ead031e-c9dd-446b-b20b-c76ec54978b1.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
in stage 1301.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
1301.0 (TID 1595) (10.221.97.38 executor driver): 
org.apache.spark.SparkException: [TASK_WRITE_FAILED] Task failed while writing 
rows to 
file:/Users/hzyaoqin/spark/target/tmp/spark-67cce58e-dfb2-4811-a9c0-50ec4c90d1f1.
at 
org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:765)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:420)
at 
org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:100)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
15:59:55.579 ERROR org.apache.spark.executor.Executor: Exception in task 
0.0 in stage 1303.0 (TID 1597)
```

[spark] branch master updated: [SPARK-44928][PYTHON][DOCS] Replace the module alias 'sf' instead of 'F' in pyspark.sql import functions

2023-08-23 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 4f8c5ede6fa [SPARK-44928][PYTHON][DOCS] Replace the module alias 'sf' 
instead of 'F' in pyspark.sql import functions
4f8c5ede6fa is described below

commit 4f8c5ede6fae3ce86b92c2649544d8f73f6a16a8
Author: Hyukjin Kwon 
AuthorDate: Thu Aug 24 11:35:13 2023 +0800

[SPARK-44928][PYTHON][DOCS] Replace the module alias 'sf' instead of 'F' in 
pyspark.sql import functions

### What changes were proposed in this pull request?

This PR proposes using the alias `sf` instead of `F` for 
`pyspark.sql.functions` in public documentation:

```python
from pyspark.sql import functions as sf
```

This PR does not change internal or test code, as doing so would be too 
invasive and might easily cause conflicts.
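
For illustration, a minimal usage sketch with the recommended alias (the
DataFrame and column names are hypothetical; a running SparkSession named
`spark` is assumed, as in the pyspark shell):

```python
from pyspark.sql import functions as sf

# Hypothetical example data; any DataFrame works the same way.
df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])

# Every call into pyspark.sql.functions uses the lowercase, PEP 8-friendly alias.
df.select(sf.col("key"), (sf.col("value") * 2).alias("doubled")).show()
```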

### Why are the changes needed?

```python
from pyspark.sql import functions as F
```

isn’t very Pythonic - it does not follow PEP 8, see [Package and Module 
Names](https://peps.python.org/pep-0008/#package-and-module-names).

> Modules should have short, all-lowercase names. Underscores can be used
> in the module name if it improves readability. Python packages should also
> have short, all-lowercase names, although the use of underscores is
> discouraged.

Therefore, the module’s alias should follow this convention. In practice, 
uppercase names are only used for module/package-level constants in my 
experience; see also [Constants](https://peps.python.org/pep-0008/#constants).

See also [this stackoverflow 
comment](https://stackoverflow.com/questions/70458086/how-to-correctly-import-pyspark-sql-functions#comment129714058_70458115).

### Does this PR introduce _any_ user-facing change?

Yes, it changes the user-facing documentation.

### How was this patch tested?

Manually checked.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42628 from HyukjinKwon/SPARK-44928.

Authored-by: Hyukjin Kwon 
Signed-off-by: Ruifeng Zheng 
---
 docs/quick-start.md|  6 ++---
 docs/structured-streaming-programming-guide.md |  4 +--
 python/docs/source/development/debugging.rst   |  4 +--
 .../source/user_guide/pandas_on_spark/options.rst  | 10 
 python/pyspark/pandas/namespace.py |  5 ++--
 python/pyspark/pandas/utils.py |  7 ++---
 python/pyspark/sql/column.py   |  8 +++---
 python/pyspark/sql/dataframe.py| 10 
 python/pyspark/sql/functions.py| 30 +++---
 python/pyspark/sql/group.py|  4 +--
 10 files changed, 45 insertions(+), 43 deletions(-)

diff --git a/docs/quick-start.md b/docs/quick-start.md
index f6a5c61008f..cab541a0351 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -130,8 +130,8 @@ Dataset actions and transformations can be used for more 
complex computations. L
 
 
 {% highlight python %}
->>> from pyspark.sql import functions as F
->>> textFile.select(F.size(F.split(textFile.value, 
"\s+")).name("numWords")).agg(F.max(F.col("numWords"))).collect()
+>>> from pyspark.sql import functions as sf
+>>> textFile.select(sf.size(sf.split(textFile.value, 
"\s+")).name("numWords")).agg(sf.max(sf.col("numWords"))).collect()
 [Row(max(numWords)=15)]
 {% endhighlight %}
 
@@ -140,7 +140,7 @@ This first maps a line to an integer value and aliases it 
as "numWords", creatin
 One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can 
implement MapReduce flows easily:
 
 {% highlight python %}
->>> wordCounts = textFile.select(F.explode(F.split(textFile.value, 
"\s+")).alias("word")).groupBy("word").count()
+>>> wordCounts = textFile.select(sf.explode(sf.split(textFile.value, 
"\s+")).alias("word")).groupBy("word").count()
 {% endhighlight %}
 
 Here, we use the `explode` function in `select`, to transform a Dataset of 
lines to a Dataset of words, and then combine `groupBy` and `count` to compute 
the per-word counts in the file as a DataFrame of 2 columns: "word" and 
"count". To collect the word counts in our shell, we can call `collect`:
diff --git a/docs/structured-streaming-programming-guide.md 
b/docs/structured-streaming-programming-guide.md
index 8ec4d620052..b5a63d4d6aa 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -1215,12 +1215,12 @@ event start time and evaluated gap duration during the 
query execution.
 
 
 {% highlight python %}
-from pyspark.sql import functions as F
+from pyspark.sql import functions as sf
 
 events = ...  # streaming DataFrame of schema { timestamp: Timestamp, 

[spark] branch master updated: [SPARK-44936][CORE] Simplify the log when Spark HybridStore hits the memory limit

2023-08-23 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6a604a4e31a [SPARK-44936][CORE] Simplify the log when Spark 
HybridStore hits the memory limit
6a604a4e31a is described below

commit 6a604a4e31afa8af619a451c1b6b033b3b0eed19
Author: Dongjoon Hyun 
AuthorDate: Wed Aug 23 20:12:58 2023 -0700

[SPARK-44936][CORE] Simplify the log when Spark HybridStore hits the memory 
limit

### What changes were proposed in this pull request?

This PR aims to simplify the log when Spark HybridStore hits the memory 
limit.

### Why are the changes needed?

`HistoryServerMemoryManager.lease` throws `RuntimeException`s frequently 
when the current usage is high.


https://github.com/apache/spark/blob/d382c6b3aef28bde6adcdf62b7be565ff1152942/core/src/main/scala/org/apache/spark/deploy/history/HistoryServerMemoryManager.scala#L52-L55

In this case, although Apache Spark logs the `RuntimeException` at the `INFO` 
level, HybridStore works fine by falling back to the disk store. So, there is 
no need to surprise users with a `RuntimeException` in the log. After this PR, 
we provide a simpler message that keeps all the information but omits the 
stack trace and the `RuntimeException`.

**BEFORE**
```
23/08/23 22:40:34 INFO FsHistoryProvider: Failed to create HybridStore for 
spark-xxx/None. Using ROCKSDB.
java.lang.RuntimeException: Not enough memory to create hybrid store for 
app spark-xxx / None.
at 
org.apache.spark.deploy.history.HistoryServerMemoryManager.lease(HistoryServerMemoryManager.scala:54)
at 
org.apache.spark.deploy.history.FsHistoryProvider.createHybridStore(FsHistoryProvider.scala:1256)
at 
org.apache.spark.deploy.history.FsHistoryProvider.loadDiskStore(FsHistoryProvider.scala:1231)
at 
org.apache.spark.deploy.history.FsHistoryProvider.getAppUI(FsHistoryProvider.scala:342)
at 
org.apache.spark.deploy.history.HistoryServer.getAppUI(HistoryServer.scala:199)
at 
org.apache.spark.deploy.history.ApplicationCache.$anonfun$loadApplicationEntry$2(ApplicationCache.scala:163)
at 
org.apache.spark.deploy.history.ApplicationCache.time(ApplicationCache.scala:134)
at 
org.apache.spark.deploy.history.ApplicationCache.org$apache$spark$deploy$history$ApplicationCache$$loadApplicationEntry(ApplicationCache.scala:161)
at 
org.apache.spark.deploy.history.ApplicationCache$$anon$1.load(ApplicationCache.scala:55)
at 
org.apache.spark.deploy.history.ApplicationCache$$anon$1.load(ApplicationCache.scala:51)
at 
org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at 
org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
at 
org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at 
org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
at 
org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at 
org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at 
org.apache.spark.deploy.history.ApplicationCache.get(ApplicationCache.scala:88)
at 
org.apache.spark.deploy.history.ApplicationCache.withSparkUI(ApplicationCache.scala:100)
at 
org.apache.spark.deploy.history.HistoryServer.org$apache$spark$deploy$history$HistoryServer$$loadAppUi(HistoryServer.scala:256)
at 
org.apache.spark.deploy.history.HistoryServer$$anon$1.doGet(HistoryServer.scala:104)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:503)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:590)
at 
org.sparkproject.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)
at 
org.sparkproject.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1656)
at 
org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95)
at 
org.sparkproject.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at 
org.sparkproject.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
at 
org.sparkproject.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:552)
at 
org.sparkproject.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
at 
org.sparkproject.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
at 
org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
at 

[spark] branch master updated (6c3e5346d4e -> b93b31f633f)

2023-08-23 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 6c3e5346d4e [SPARK-44903][PYTHON][DOCS] Refine docstring of 
`approx_count_distinct`
 add b93b31f633f [MINOR][DOCS] Fix typos in `pyspark_upgrade.rst`

No new revisions were added by this update.

Summary of changes:
 python/docs/source/migration_guide/pyspark_upgrade.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)





[spark] branch master updated: [SPARK-44903][PYTHON][DOCS] Refine docstring of `approx_count_distinct`

2023-08-23 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6c3e5346d4e [SPARK-44903][PYTHON][DOCS] Refine docstring of 
`approx_count_distinct`
6c3e5346d4e is described below

commit 6c3e5346d4eefdbad9cc8d7bca87889319cdd22a
Author: yangjie01 
AuthorDate: Thu Aug 24 09:06:24 2023 +0800

[SPARK-44903][PYTHON][DOCS] Refine docstring of `approx_count_distinct`

### What changes were proposed in this pull request?
This PR refines the docstring of `approx_count_distinct` and adds some new 
examples.

### Why are the changes needed?
To improve PySpark documentation

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass Github Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #42596 from LuciferYang/approx-pydoc.

Authored-by: yangjie01 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/functions.py | 59 -
 1 file changed, 53 insertions(+), 6 deletions(-)

diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 3115b0199ec..0a00777b42c 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -3672,8 +3672,9 @@ def approxCountDistinct(col: "ColumnOrName", rsd: 
Optional[float] = None) -> Col
 
 @try_remote_functions
 def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> 
Column:
-"""Aggregate function: returns a new :class:`~pyspark.sql.Column` for 
approximate distinct count
-of column `col`.
+"""
+This aggregate function returns a new :class:`~pyspark.sql.Column`, which 
estimates
+the approximate distinct count of elements in a specified column or a 
group of columns.
 
 .. versionadded:: 2.1.0
 
@@ -3683,24 +3684,70 @@ def approx_count_distinct(col: "ColumnOrName", rsd: 
Optional[float] = None) -> C
 Parameters
 --
 col : :class:`~pyspark.sql.Column` or str
+The label of the column to count distinct values in.
 rsd : float, optional
-maximum relative standard deviation allowed (default = 0.05).
-For rsd < 0.01, it is more efficient to use :func:`count_distinct`
+The maximum allowed relative standard deviation (default = 0.05).
+If rsd < 0.01, it would be more efficient to use 
:func:`count_distinct`.
 
 Returns
 ---
 :class:`~pyspark.sql.Column`
-the column of computed results.
+A new Column object representing the approximate unique count.
+
+See Also
+--
+:meth:`pyspark.sql.functions.count_distinct`
 
 Examples
 
->>> df = spark.createDataFrame([1,2,2,3], "INT")
+Example 1: Counting distinct values in a single column DataFrame 
representing integers
+
+>>> from pyspark.sql.functions import approx_count_distinct
+>>> df = spark.createDataFrame([1,2,2,3], "int")
 >>> df.agg(approx_count_distinct("value").alias('distinct_values')).show()
 +---+
 |distinct_values|
 +---+
 |  3|
 +---+
+
+Example 2: Counting distinct values in a single column DataFrame 
representing strings
+
+>>> from pyspark.sql.functions import approx_count_distinct
+>>> df = spark.createDataFrame([("apple",), ("orange",), ("apple",), 
("banana",)], ['fruit'])
+>>> df.agg(approx_count_distinct("fruit").alias('distinct_fruits')).show()
++---+
+|distinct_fruits|
++---+
+|  3|
++---+
+
+Example 3: Counting distinct values in a DataFrame with multiple columns
+
+>>> from pyspark.sql.functions import approx_count_distinct, struct
+>>> df = spark.createDataFrame([("Alice", 1),
+... ("Alice", 2),
+... ("Bob", 3),
+... ("Bob", 3)], ["name", "value"])
+>>> df = df.withColumn("combined", struct("name", "value"))
+>>> 
df.agg(approx_count_distinct("combined").alias('distinct_pairs')).show()
++--+
+|distinct_pairs|
++--+
+| 3|
++--+
+
+Example 4: Counting distinct values with a specified relative standard 
deviation
+
+>>> from pyspark.sql.functions import approx_count_distinct
+>>> df = spark.range(10)
+>>> df.agg(approx_count_distinct("id").alias('with_default_rsd'),
+...approx_count_distinct("id", 0.1).alias('with_rsd_0.1')).show()
++++
+|with_default_rsd|with_rsd_0.1|
++++
+|   95546|  102065|
++++
 """
 if rsd is None:
  

[spark] branch master updated: [SPARK-42017][PYTHON][CONNECT] df['col_name']` should validate the column name

2023-08-23 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9c68d03c300 [SPARK-42017][PYTHON][CONNECT] df['col_name']` should 
validate the column name
9c68d03c300 is described below

commit 9c68d03c300305a4628123835040d8163e64f8de
Author: Ruifeng Zheng 
AuthorDate: Thu Aug 24 08:47:31 2023 +0800

[SPARK-42017][PYTHON][CONNECT] df['col_name']` should validate the column 
name

### What changes were proposed in this pull request?
Make `df['col_name']` validate the column name.

### Why are the changes needed?
For parity with regular (non-Connect) PySpark.

### Does this PR introduce _any_ user-facing change?
Yes.

before

```
In [1]: df = spark.range(0, 10)

In [2]: df["bad_key"]
Out[2]: Column<'bad_key'>

```

after

```
In [1]: df = spark.range(0, 10)

In [2]: df["bad_key"]
23/08/23 17:23:35 ERROR ErrorUtils: Spark Connect RPC error during: 
analyze. UserId: ruifeng.zheng. SessionId: 59de3f10-14b6-4239-be85-4156da43d495.
org.apache.spark.sql.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] 
A column, variable, or function parameter with name `bad_key` cannot be 
resolved. Did you mean one of the following? [`id`].;
'Project ['bad_key]
+- Range (0, 10, step=1, splits=Some(12))

...

AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, 
or function parameter with name `bad_key` cannot be resolved. Did you mean one 
of the following? [`id`].;
'Project ['bad_key]
+- Range (0, 10, step=1, splits=Some(12))
```
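
As a usage sketch (assuming a Spark Connect session named `spark`; the exact
exception class hierarchy is an assumption, but `AnalysisException` is
importable from `pyspark.errors`), client code can now fail fast on bad
column names:

```python
from pyspark.errors import AnalysisException

df = spark.range(0, 10)

try:
    df["bad_key"]  # now triggers analysis and raises instead of returning a Column
except AnalysisException as e:
    # e.g. [UNRESOLVED_COLUMN.WITH_SUGGESTION] ... Did you mean one of the following? [`id`].
    print("invalid column:", e)

id_col = df["id"]  # valid column names still return a Column, as before
```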

### How was this patch tested?
Enabled the previously skipped unit test.

### Was this patch authored or co-authored using generative AI tooling?
NO

Closes #42608 from zhengruifeng/tests_enable_test_access_column.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/connect/dataframe.py| 5 +
 python/pyspark/sql/tests/connect/test_parity_column.py | 5 -
 python/pyspark/testing/connectutils.py | 1 +
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/python/pyspark/sql/connect/dataframe.py 
b/python/pyspark/sql/connect/dataframe.py
index d004ad48be6..0b19b979930 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -1622,6 +1622,11 @@ class DataFrame:
 alias = self._get_alias()
 if self._plan is None:
 raise SparkConnectException("Cannot analyze on empty plan.")
+
+# validate the column name
+if not hasattr(self._session, "is_mock_session"):
+self.select(item).isLocal()
+
 return _to_col_with_plan_id(
 col=alias if alias is not None else item,
 plan_id=self._plan._plan_id,
diff --git a/python/pyspark/sql/tests/connect/test_parity_column.py 
b/python/pyspark/sql/tests/connect/test_parity_column.py
index 5cce063871a..d02fb289b7d 100644
--- a/python/pyspark/sql/tests/connect/test_parity_column.py
+++ b/python/pyspark/sql/tests/connect/test_parity_column.py
@@ -32,11 +32,6 @@ from pyspark.testing.connectutils import 
ReusedConnectTestCase
 
 
 class ColumnParityTests(ColumnTestsMixin, ReusedConnectTestCase):
-# TODO(SPARK-42017): df["bad_key"] does not raise AnalysisException
-@unittest.skip("Fails in Spark Connect, should enable.")
-def test_access_column(self):
-super().test_access_column()
-
 @unittest.skip("Requires JVM access.")
 def test_validate_column_types(self):
 super().test_validate_column_types()
diff --git a/python/pyspark/testing/connectutils.py 
b/python/pyspark/testing/connectutils.py
index ba81c783672..33c920eff86 100644
--- a/python/pyspark/testing/connectutils.py
+++ b/python/pyspark/testing/connectutils.py
@@ -76,6 +76,7 @@ class MockRemoteSession:
 def __init__(self):
 self.hooks = {}
 self.session_id = str(uuid.uuid4())
+self.is_mock_session = True
 
 def set_hook(self, name, hook):
 self.hooks[name] = hook





[spark] branch branch-3.5 updated: [SPARK-44750][PYTHON][CONNECT] Apply configuration to sparksession during creation

2023-08-23 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new e1b7a26d2f4 [SPARK-44750][PYTHON][CONNECT] Apply configuration to 
sparksession during creation
e1b7a26d2f4 is described below

commit e1b7a26d2f48f9f149498a9204db6944c7d5bca3
Author: Michael Zhang 
AuthorDate: Thu Aug 24 08:36:53 2023 +0800

[SPARK-44750][PYTHON][CONNECT] Apply configuration to sparksession during 
creation

### What changes were proposed in this pull request?

`SparkSession.Builder` now applies configuration options to the created 
`SparkSession`.

### Why are the changes needed?

It is reasonable to expect the PySpark Connect `SparkSession.Builder` to behave 
in the same way as other `SparkSession.Builder`s in Spark Connect. The 
`SparkSession.Builder` should apply the provided configuration options to the 
created `SparkSession`.
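
For illustration, a minimal sketch of the expected behavior (the config keys
mirror the placeholder keys in the added test; a local Spark Connect endpoint
via `local[4]` is assumed):

```python
from pyspark.sql import SparkSession

# Options passed to the builder should now be applied to the created session.
spark = (
    SparkSession.builder
    .remote("local[4]")        # Spark Connect endpoint (assumed local mode)
    .config("string", "foo")   # placeholder keys, as in the added test
    .config("integer", 1)
    .getOrCreate()
)

print(spark.conf.get("string"))   # 'foo'
print(spark.conf.get("integer"))  # '1' (values are stored as strings)

spark.stop()
```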

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tests were added to verify that configuration options were applied to the 
`SparkSession`.

Closes #42548 from michaelzhan-db/SPARK-44750.

Lead-authored-by: Michael Zhang 
Co-authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
(cherry picked from commit c2e3171f3d3887302227edc39ee124bd61561b7d)
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/connect/session.py   | 10 ++
 .../pyspark/sql/tests/connect/test_connect_basic.py | 21 +
 2 files changed, 31 insertions(+)

diff --git a/python/pyspark/sql/connect/session.py 
b/python/pyspark/sql/connect/session.py
index d75a30c561f..2905f7e4269 100644
--- a/python/pyspark/sql/connect/session.py
+++ b/python/pyspark/sql/connect/session.py
@@ -176,6 +176,14 @@ class SparkSession:
 error_class="NOT_IMPLEMENTED", message_parameters={"feature": 
"enableHiveSupport"}
 )
 
+def _apply_options(self, session: "SparkSession") -> None:
+with self._lock:
+for k, v in self._options.items():
+try:
+session.conf.set(k, v)
+except Exception as e:
+warnings.warn(str(e))
+
 def create(self) -> "SparkSession":
 has_channel_builder = self._channel_builder is not None
 has_spark_remote = "spark.remote" in self._options
@@ -200,6 +208,7 @@ class SparkSession:
 session = SparkSession(connection=spark_remote)
 
 SparkSession._set_default_and_active_session(session)
+self._apply_options(session)
 return session
 
 def getOrCreate(self) -> "SparkSession":
@@ -209,6 +218,7 @@ class SparkSession:
 session = SparkSession._default_session
 if session is None:
 session = self.create()
+self._apply_options(session)
 return session
 
 _client: SparkConnectClient
diff --git a/python/pyspark/sql/tests/connect/test_connect_basic.py 
b/python/pyspark/sql/tests/connect/test_connect_basic.py
index 9e8f5623971..54911c09b6f 100644
--- a/python/pyspark/sql/tests/connect/test_connect_basic.py
+++ b/python/pyspark/sql/tests/connect/test_connect_basic.py
@@ -3347,6 +3347,27 @@ class SparkConnectSessionTests(ReusedConnectTestCase):
 self.assertIn("Create a new SparkSession is only supported with 
SparkConnect.", str(e))
 
 
+class SparkConnectSessionWithOptionsTest(ReusedConnectTestCase):
+def setUp(self) -> None:
+self.spark = (
+PySparkSession.builder.config("string", "foo")
+.config("integer", 1)
+.config("boolean", False)
+.appName(self.__class__.__name__)
+.remote("local[4]")
+.getOrCreate()
+)
+
+def tearDown(self):
+self.spark.stop()
+
+def test_config(self):
+# Config
+self.assertEqual(self.spark.conf.get("string"), "foo")
+self.assertEqual(self.spark.conf.get("boolean"), "false")
+self.assertEqual(self.spark.conf.get("integer"), "1")
+
+
 @unittest.skipIf(not should_test_connect, connect_requirement_message)
 class ClientTests(unittest.TestCase):
 def test_retry_error_handling(self):





[spark] branch master updated (d382c6b3aef -> c2e3171f3d3)

2023-08-23 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from d382c6b3aef [SPARK-44935][K8S] Fix `RELEASE` file to have the correct 
information in Docker images if exists
 add c2e3171f3d3 [SPARK-44750][PYTHON][CONNECT] Apply configuration to 
sparksession during creation

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/session.py   | 10 ++
 .../pyspark/sql/tests/connect/test_connect_basic.py | 21 +
 2 files changed, 31 insertions(+)





[spark] branch branch-3.3 updated: [SPARK-44935][K8S] Fix `RELEASE` file to have the correct information in Docker images if exists

2023-08-23 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 392b12ee7d0 [SPARK-44935][K8S] Fix `RELEASE` file to have the correct 
information in Docker images if exists
392b12ee7d0 is described below

commit 392b12ee7d0ee794c717e1b2e98a84cd4d02a4d1
Author: Dongjoon Hyun 
AuthorDate: Wed Aug 23 16:00:55 2023 -0700

[SPARK-44935][K8S] Fix `RELEASE` file to have the correct information in 
Docker images if exists

### What changes were proposed in this pull request?

This PR aims to fix the `RELEASE` file so that it contains the correct 
information in Docker images when the `RELEASE` file exists.

Please note that the `RELEASE` file doesn't exist in the SPARK_HOME directory 
when we run the K8s integration test from the Spark Git repository. So, we 
keep the following empty `RELEASE` file generation and use `COPY` 
conditionally via glob syntax.


https://github.com/apache/spark/blob/2a3aec1f9040e08999a2df88f92340cd2710e552/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile#L37

### Why are the changes needed?

Currently, it's an empty file in the official Apache Spark Docker images.

```
$ docker run -it --rm apache/spark:latest ls -al /opt/spark/RELEASE
-rw-r--r-- 1 spark spark 0 Jun 25 03:13 /opt/spark/RELEASE

$ docker run -it --rm apache/spark:v3.1.3 ls -al /opt/spark/RELEASE | tail 
-n1
-rw-r--r-- 1 root root 0 Feb 21  2022 /opt/spark/RELEASE
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually build image and check it with `docker run -it --rm NEW_IMAGE ls 
-al /opt/spark/RELEASE`

I copied this `Dockerfile` into Apache Spark 3.5.0 RC2 binary distribution 
and tested in the following way.
```
$ cd spark-3.5.0-rc2-bin-hadoop3

$ cp /tmp/Dockerfile kubernetes/dockerfiles/spark/Dockerfile

$ bin/docker-image-tool.sh -t SPARK-44935 build

$ docker run -it --rm docker.io/library/spark:SPARK-44935 ls -al 
/opt/spark/RELEASE | tail -n1
-rw-r--r-- 1 root root 165 Aug 18 21:10 /opt/spark/RELEASE

$ docker run -it --rm docker.io/library/spark:SPARK-44935 cat 
/opt/spark/RELEASE | tail -n2
Spark 3.5.0 (git revision 010c4a6a05) built for Hadoop 3.3.4
Build flags: -B -Pmesos -Pyarn -Pkubernetes -Psparkr -Pscala-2.12 
-Phadoop-3 -Phive -Phive-thriftserver
```
### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42636 from dongjoon-hyun/SPARK-44935.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit d382c6b3aef28bde6adcdf62b7be565ff1152942)
Signed-off-by: Dongjoon Hyun 
---
 .../kubernetes/docker/src/main/dockerfiles/spark/Dockerfile | 2 ++
 1 file changed, 2 insertions(+)

diff --git 
a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile 
b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile
index 6ed03624e59..51c485140a4 100644
--- a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile
+++ b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile
@@ -43,6 +43,8 @@ RUN set -ex && \
 rm -rf /var/cache/apt/*
 
 COPY jars /opt/spark/jars
+# Copy RELEASE file if exists
+COPY RELEAS[E] /opt/spark/RELEASE
 COPY bin /opt/spark/bin
 COPY sbin /opt/spark/sbin
 COPY kubernetes/dockerfiles/spark/entrypoint.sh /opt/





[spark] branch branch-3.4 updated: [SPARK-44935][K8S] Fix `RELEASE` file to have the correct information in Docker images if exists

2023-08-23 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 91d85c63489 [SPARK-44935][K8S] Fix `RELEASE` file to have the correct 
information in Docker images if exists
91d85c63489 is described below

commit 91d85c6348995d8a288aa527d705ffe8107e01f5
Author: Dongjoon Hyun 
AuthorDate: Wed Aug 23 16:00:55 2023 -0700

[SPARK-44935][K8S] Fix `RELEASE` file to have the correct information in 
Docker images if exists

### What changes were proposed in this pull request?

This PR aims to fix the `RELEASE` file so that it contains the correct 
information in Docker images when the `RELEASE` file exists.

Please note that the `RELEASE` file doesn't exist in the SPARK_HOME directory 
when we run the K8s integration test from the Spark Git repository. So, we 
keep the following empty `RELEASE` file generation and use `COPY` 
conditionally via glob syntax.


https://github.com/apache/spark/blob/2a3aec1f9040e08999a2df88f92340cd2710e552/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile#L37

### Why are the changes needed?

Currently, it's an empty file in the official Apache Spark Docker images.

```
$ docker run -it --rm apache/spark:latest ls -al /opt/spark/RELEASE
-rw-r--r-- 1 spark spark 0 Jun 25 03:13 /opt/spark/RELEASE

$ docker run -it --rm apache/spark:v3.1.3 ls -al /opt/spark/RELEASE | tail 
-n1
-rw-r--r-- 1 root root 0 Feb 21  2022 /opt/spark/RELEASE
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually build image and check it with `docker run -it --rm NEW_IMAGE ls 
-al /opt/spark/RELEASE`

I copied this `Dockerfile` into Apache Spark 3.5.0 RC2 binary distribution 
and tested in the following way.
```
$ cd spark-3.5.0-rc2-bin-hadoop3

$ cp /tmp/Dockerfile kubernetes/dockerfiles/spark/Dockerfile

$ bin/docker-image-tool.sh -t SPARK-44935 build

$ docker run -it --rm docker.io/library/spark:SPARK-44935 ls -al 
/opt/spark/RELEASE | tail -n1
-rw-r--r-- 1 root root 165 Aug 18 21:10 /opt/spark/RELEASE

$ docker run -it --rm docker.io/library/spark:SPARK-44935 cat 
/opt/spark/RELEASE | tail -n2
Spark 3.5.0 (git revision 010c4a6a05) built for Hadoop 3.3.4
Build flags: -B -Pmesos -Pyarn -Pkubernetes -Psparkr -Pscala-2.12 
-Phadoop-3 -Phive -Phive-thriftserver
```
### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42636 from dongjoon-hyun/SPARK-44935.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit d382c6b3aef28bde6adcdf62b7be565ff1152942)
Signed-off-by: Dongjoon Hyun 
---
 .../kubernetes/docker/src/main/dockerfiles/spark/Dockerfile | 2 ++
 1 file changed, 2 insertions(+)

diff --git 
a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile 
b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile
index 53026016ee2..88304c87a79 100644
--- a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile
+++ b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile
@@ -42,6 +42,8 @@ RUN set -ex && \
 rm -rf /var/cache/apt/* && rm -rf /var/lib/apt/lists/*
 
 COPY jars /opt/spark/jars
+# Copy RELEASE file if exists
+COPY RELEAS[E] /opt/spark/RELEASE
 COPY bin /opt/spark/bin
 COPY sbin /opt/spark/sbin
 COPY kubernetes/dockerfiles/spark/entrypoint.sh /opt/





[spark] branch branch-3.5 updated: [SPARK-44935][K8S] Fix `RELEASE` file to have the correct information in Docker images if exists

2023-08-23 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 27852078a43 [SPARK-44935][K8S] Fix `RELEASE` file to have the correct 
information in Docker images if exists
27852078a43 is described below

commit 27852078a43b2ff22cb81228c4982cf808f22070
Author: Dongjoon Hyun 
AuthorDate: Wed Aug 23 16:00:55 2023 -0700

[SPARK-44935][K8S] Fix `RELEASE` file to have the correct information in 
Docker images if exists

### What changes were proposed in this pull request?

This PR aims to fix the `RELEASE` file so that it contains the correct 
information in Docker images when the `RELEASE` file exists.

Please note that the `RELEASE` file doesn't exist in the SPARK_HOME directory 
when we run the K8s integration test from the Spark Git repository. So, we 
keep the following empty `RELEASE` file generation and use `COPY` 
conditionally via glob syntax.


https://github.com/apache/spark/blob/2a3aec1f9040e08999a2df88f92340cd2710e552/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile#L37

### Why are the changes needed?

Currently, it's an empty file in the official Apache Spark Docker images.

```
$ docker run -it --rm apache/spark:latest ls -al /opt/spark/RELEASE
-rw-r--r-- 1 spark spark 0 Jun 25 03:13 /opt/spark/RELEASE

$ docker run -it --rm apache/spark:v3.1.3 ls -al /opt/spark/RELEASE | tail 
-n1
-rw-r--r-- 1 root root 0 Feb 21  2022 /opt/spark/RELEASE
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually build image and check it with `docker run -it --rm NEW_IMAGE ls 
-al /opt/spark/RELEASE`

I copied this `Dockerfile` into Apache Spark 3.5.0 RC2 binary distribution 
and tested in the following way.
```
$ cd spark-3.5.0-rc2-bin-hadoop3

$ cp /tmp/Dockerfile kubernetes/dockerfiles/spark/Dockerfile

$ bin/docker-image-tool.sh -t SPARK-44935 build

$ docker run -it --rm docker.io/library/spark:SPARK-44935 ls -al 
/opt/spark/RELEASE | tail -n1
-rw-r--r-- 1 root root 165 Aug 18 21:10 /opt/spark/RELEASE

$ docker run -it --rm docker.io/library/spark:SPARK-44935 cat 
/opt/spark/RELEASE | tail -n2
Spark 3.5.0 (git revision 010c4a6a05) built for Hadoop 3.3.4
Build flags: -B -Pmesos -Pyarn -Pkubernetes -Psparkr -Pscala-2.12 
-Phadoop-3 -Phive -Phive-thriftserver
```
### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42636 from dongjoon-hyun/SPARK-44935.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit d382c6b3aef28bde6adcdf62b7be565ff1152942)
Signed-off-by: Dongjoon Hyun 
---
 .../kubernetes/docker/src/main/dockerfiles/spark/Dockerfile | 2 ++
 1 file changed, 2 insertions(+)

diff --git 
a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile 
b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile
index 53026016ee2..88304c87a79 100644
--- a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile
+++ b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile
@@ -42,6 +42,8 @@ RUN set -ex && \
 rm -rf /var/cache/apt/* && rm -rf /var/lib/apt/lists/*
 
 COPY jars /opt/spark/jars
+# Copy RELEASE file if exists
+COPY RELEAS[E] /opt/spark/RELEASE
 COPY bin /opt/spark/bin
 COPY sbin /opt/spark/sbin
 COPY kubernetes/dockerfiles/spark/entrypoint.sh /opt/





[spark] branch master updated (2a3aec1f904 -> d382c6b3aef)

2023-08-23 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 2a3aec1f904 [SPARK-44906][K8S] Make 
`Kubernetes[Driver|Executor]Conf.annotations` substitute annotations instead of 
feature steps
 add d382c6b3aef [SPARK-44935][K8S] Fix `RELEASE` file to have the correct 
information in Docker images if exists

No new revisions were added by this update.

Summary of changes:
 .../kubernetes/docker/src/main/dockerfiles/spark/Dockerfile | 2 ++
 1 file changed, 2 insertions(+)





[spark] branch master updated: [SPARK-44906][K8S] Make `Kubernetes[Driver|Executor]Conf.annotations` substitute annotations instead of feature steps

2023-08-23 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2a3aec1f904 [SPARK-44906][K8S] Make 
`Kubernetes[Driver|Executor]Conf.annotations` substitute annotations instead of 
feature steps
2a3aec1f904 is described below

commit 2a3aec1f9040e08999a2df88f92340cd2710e552
Author: zwangsheng <2213335...@qq.com>
AuthorDate: Wed Aug 23 10:42:47 2023 -0700

[SPARK-44906][K8S] Make `Kubernetes[Driver|Executor]Conf.annotations` 
substitute annotations instead of feature steps

### What changes were proposed in this pull request?

Move the `Utils.substituteAppNExecIds` logic into 
`KubernetesConf.annotations` as the default behavior.

### Why are the changes needed?

This makes the logic easy for users to reuse, rather than having to rewrite it.

Before this PR, when users wrote a custom feature step that used annotations, 
they had to call `Utils.substituteAppNExecIds` themselves.

### Does this PR introduce _any_ user-facing change?

Yes, but it makes no practical difference for users who use annotations.

### How was this patch tested?

Add unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #42600 from zwangsheng/SPARK-44906.

Lead-authored-by: zwangsheng <2213335...@qq.com>
Co-authored-by: Kent Yao 
Signed-off-by: Dongjoon Hyun 
---
 .../scala/org/apache/spark/deploy/k8s/KubernetesConf.scala  |  2 ++
 .../spark/deploy/k8s/features/BasicDriverFeatureStep.scala  |  3 +--
 .../deploy/k8s/features/BasicExecutorFeatureStep.scala  |  6 ++
 .../org/apache/spark/deploy/k8s/KubernetesConfSuite.scala   | 13 ++---
 4 files changed, 15 insertions(+), 9 deletions(-)

diff --git 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala
 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala
index d8cb881bf08..4ebf31ae44e 100644
--- 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala
+++ 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala
@@ -117,6 +117,7 @@ private[spark] class KubernetesDriverConf(
 
   override def annotations: Map[String, String] = {
 KubernetesUtils.parsePrefixedKeyValuePairs(sparkConf, 
KUBERNETES_DRIVER_ANNOTATION_PREFIX)
+  .map { case (k, v) => (k, Utils.substituteAppNExecIds(v, appId, "")) }
   }
 
   def serviceLabels: Map[String, String] = {
@@ -188,6 +189,7 @@ private[spark] class KubernetesExecutorConf(
 
   override def annotations: Map[String, String] = {
 KubernetesUtils.parsePrefixedKeyValuePairs(sparkConf, 
KUBERNETES_EXECUTOR_ANNOTATION_PREFIX)
+  .map { case(k, v) => (k, Utils.substituteAppNExecIds(v, appId, 
executorId)) }
   }
 
   override def secretNamesToMountPaths: Map[String, String] = {
diff --git 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala
 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala
index 2b287ea8560..11a21bb68a6 100644
--- 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala
+++ 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala
@@ -143,8 +143,7 @@ private[spark] class BasicDriverFeatureStep(conf: 
KubernetesDriverConf)
   .editOrNewMetadata()
 .withName(driverPodName)
 .addToLabels(conf.labels.asJava)
-.addToAnnotations(conf.annotations.map { case (k, v) =>
-  (k, Utils.substituteAppNExecIds(v, conf.appId, "")) }.asJava)
+.addToAnnotations(conf.annotations.asJava)
 .endMetadata()
   .editOrNewSpec()
 .withRestartPolicy("Never")
diff --git 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala
 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala
index 0b0bbc30ba4..f3e5cad8c9e 100644
--- 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala
+++ 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala
@@ -255,14 +255,12 @@ private[spark] class BasicExecutorFeatureStep(
   case "statefulset" => "Always"
   case _ => "Never"
 }
-val annotations = kubernetesConf.annotations.map { case (k, v) =>
-  (k, Utils.substituteAppNExecIds(v, kubernetesConf.appId, 
kubernetesConf.executorId))
-}
+
 val executorPodBuilder = new PodBuilder(pod.pod)
  

[spark] branch branch-3.5 updated: [SPARK-44816][CONNECT] Improve error message when UDF class is not found

2023-08-23 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 2d7e05ddd98 [SPARK-44816][CONNECT] Improve error message when UDF 
class is not found
2d7e05ddd98 is described below

commit 2d7e05ddd980ad7a934e5c8fde77335ef6591c06
Author: Niranjan Jayakar 
AuthorDate: Wed Aug 23 18:42:04 2023 +0200

[SPARK-44816][CONNECT] Improve error message when UDF class is not found

### What changes were proposed in this pull request?

Improve the error messaging on the connect client when using
a UDF whose corresponding class has not been sync'ed with the
spark connect service.

Prior to this change, the client receives a cryptic error:

```
Exception in thread "main" org.apache.spark.SparkException: Main$
```

With this change, the message is improved to be:

```
Exception in thread "main" org.apache.spark.SparkException: Failed to load 
class: Main$. Make sure the artifact where the class is defined is installed by 
calling session.addArtifact.
```

### Why are the changes needed?

This change makes it clear to the user what the error is.

### Does this PR introduce _any_ user-facing change?

Yes. The error message is improved. See details above.

### How was this patch tested?

Manually by running a connect server and client.

Closes #42500 from nija-at/improve-error.

Authored-by: Niranjan Jayakar 
Signed-off-by: Herman van Hovell 
(cherry picked from commit 2d0a0a00cb5dde6bcb8e561278357b6bb8b76dcc)
Signed-off-by: Herman van Hovell 
---
 .../org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala   | 5 +
 1 file changed, 5 insertions(+)

diff --git 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
index e81e9bb51cb..46c465e4deb 100644
--- 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
+++ 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
@@ -1527,6 +1527,11 @@ class SparkConnectPlanner(val sessionHolder: 
SessionHolder) extends Logging {
   s"Failed to load class correctly due to $nsm. " +
 "Make sure the artifact where the class is defined is 
installed by calling" +
 " session.addArtifact.")
+  case cnf: ClassNotFoundException =>
+throw new ClassNotFoundException(
+  s"Failed to load class: ${cnf.getMessage}. " +
+"Make sure the artifact where the class is defined is 
installed by calling" +
+" session.addArtifact.")
   case _ => throw t
 }
 }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-44816][CONNECT] Improve error message when UDF class is not found

2023-08-23 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2d0a0a00cb5 [SPARK-44816][CONNECT] Improve error message when UDF 
class is not found
2d0a0a00cb5 is described below

commit 2d0a0a00cb5dde6bcb8e561278357b6bb8b76dcc
Author: Niranjan Jayakar 
AuthorDate: Wed Aug 23 18:42:04 2023 +0200

[SPARK-44816][CONNECT] Improve error message when UDF class is not found

### What changes were proposed in this pull request?

Improve the error messaging on the connect client when using
a UDF whose corresponding class has not been sync'ed with the
spark connect service.

Prior to this change, the client receives a cryptic error:

```
Exception in thread "main" org.apache.spark.SparkException: Main$
```

With this change, the message is improved to be:

```
Exception in thread "main" org.apache.spark.SparkException: Failed to load 
class: Main$. Make sure the artifact where the class is defined is installed by 
calling session.addArtifact.
```

### Why are the changes needed?

This change makes it clear to the user what the error is.

### Does this PR introduce _any_ user-facing change?

Yes. The error message is improved. See details above.

### How was this patch tested?

Manually by running a connect server and client.

Closes #42500 from nija-at/improve-error.

Authored-by: Niranjan Jayakar 
Signed-off-by: Herman van Hovell 
---
 .../org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala   | 5 +
 1 file changed, 5 insertions(+)

diff --git 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
index 70e6e926613..5b5018a1668 100644
--- 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
+++ 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
@@ -1523,6 +1523,11 @@ class SparkConnectPlanner(val sessionHolder: 
SessionHolder) extends Logging {
   s"Failed to load class correctly due to $nsm. " +
 "Make sure the artifact where the class is defined is 
installed by calling" +
 " session.addArtifact.")
+  case cnf: ClassNotFoundException =>
+throw new ClassNotFoundException(
+  s"Failed to load class: ${cnf.getMessage}. " +
+"Make sure the artifact where the class is defined is 
installed by calling" +
+" session.addArtifact.")
   case _ => throw t
 }
 }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.5 updated: [SPARK-44861][CONNECT] jsonignore SparkListenerConnectOperationStarted.planRequest

2023-08-23 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new a7941f15a0c [SPARK-44861][CONNECT] jsonignore 
SparkListenerConnectOperationStarted.planRequest
a7941f15a0c is described below

commit a7941f15a0c3034888b1adbd5affce2a9e12788e
Author: jdesjean 
AuthorDate: Wed Aug 23 18:39:49 2023 +0200

[SPARK-44861][CONNECT] jsonignore 
SparkListenerConnectOperationStarted.planRequest

### What changes were proposed in this pull request?
Add `JsonIgnore` to `SparkListenerConnectOperationStarted.planRequest`

### Why are the changes needed?
`SparkListenerConnectOperationStarted` was added as part of 
[SPARK-43923](https://issues.apache.org/jira/browse/SPARK-43923).
`SparkListenerConnectOperationStarted.planRequest` cannot be serialized to or deserialized from JSON because it contains recursive objects, which causes failures when attempting these operations.
```
com.fasterxml.jackson.databind.exc.InvalidDefinitionException: Direct 
self-reference leading to cycle (through reference chain: 
org.apache.spark.sql.connect.service.SparkListenerConnectOperationStarted["planRequest"]->org.apache.spark.connect.proto.ExecutePlanRequest["unknownFields"]->grpc_shaded.com.google.protobuf.UnknownFieldSet["defaultInstanceForType"])
at 
com.fasterxml.jackson.databind.exc.InvalidDefinitionException.from(InvalidDefinitionException.java:77)
at 
com.fasterxml.jackson.databind.SerializerProvider.reportBadDefinition(SerializerProvider.java:1308)
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit

Closes #42550 from jdesjean/SPARK-44861.

Authored-by: jdesjean 
Signed-off-by: Herman van Hovell 
(cherry picked from commit dd6cda5b614b4ede418afb4c5b1fdeea9613d32c)
Signed-off-by: Herman van Hovell 
---
 .../sql/connect/service/ExecuteEventsManager.scala |  37 ---
 .../service/ExecuteEventsManagerSuite.scala| 114 ++---
 .../ui/SparkConnectServerListenerSuite.scala   |   3 -
 .../connect/ui/SparkConnectServerPageSuite.scala   |   1 -
 4 files changed, 95 insertions(+), 60 deletions(-)

diff --git 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/ExecuteEventsManager.scala
 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/ExecuteEventsManager.scala
index 5b9267a9679..23a67b7292b 100644
--- 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/ExecuteEventsManager.scala
+++ 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/ExecuteEventsManager.scala
@@ -119,19 +119,19 @@ case class ExecuteEventsManager(executeHolder: 
ExecuteHolder, clock: Clock) {
 s"${request.getPlan.getOpTypeCase} not supported.")
   }
 
-listenerBus.post(
-  SparkListenerConnectOperationStarted(
-jobTag,
-operationId,
-clock.getTimeMillis(),
-sessionId,
-request.getUserContext.getUserId,
-request.getUserContext.getUserName,
-Utils.redact(
-  sessionHolder.session.sessionState.conf.stringRedactionPattern,
-  ProtoUtils.abbreviate(plan, 
ExecuteEventsManager.MAX_STATEMENT_TEXT_SIZE).toString),
-Some(request),
-sparkSessionTags))
+val event = SparkListenerConnectOperationStarted(
+  jobTag,
+  operationId,
+  clock.getTimeMillis(),
+  sessionId,
+  request.getUserContext.getUserId,
+  request.getUserContext.getUserName,
+  Utils.redact(
+sessionHolder.session.sessionState.conf.stringRedactionPattern,
+ProtoUtils.abbreviate(plan, 
ExecuteEventsManager.MAX_STATEMENT_TEXT_SIZE).toString),
+  sparkSessionTags)
+event.planRequest = Some(request)
+listenerBus.post(event)
   }
 
   /**
@@ -290,8 +290,6 @@ case class ExecuteEventsManager(executeHolder: 
ExecuteHolder, clock: Clock) {
  *   Opaque userName set in the Connect request.
  * @param statementText:
  *   The connect request plan converted to text.
- * @param planRequest:
- *   The Connect request. None if the operation is not of type @link 
proto.ExecutePlanRequest
  * @param sparkSessionTags:
  *   Extra tags set by the user (via SparkSession.addTag).
  * @param extraTags:
@@ -305,10 +303,15 @@ case class SparkListenerConnectOperationStarted(
 userId: String,
 userName: String,
 statementText: String,
-planRequest: Option[proto.ExecutePlanRequest],
 sparkSessionTags: Set[String],
 extraTags: Map[String, String] = Map.empty)
-extends SparkListenerEvent
+extends SparkListenerEvent {
+
+  /**
+   * The Connect request. None if the operation is not of type @link 
proto.ExecutePlanRequest
+   */
+  @JsonIgnore var planRequest: 

[spark] branch master updated: [SPARK-44861][CONNECT] jsonignore SparkListenerConnectOperationStarted.planRequest

2023-08-23 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new dd6cda5b614 [SPARK-44861][CONNECT] jsonignore 
SparkListenerConnectOperationStarted.planRequest
dd6cda5b614 is described below

commit dd6cda5b614b4ede418afb4c5b1fdeea9613d32c
Author: jdesjean 
AuthorDate: Wed Aug 23 18:39:49 2023 +0200

[SPARK-44861][CONNECT] jsonignore 
SparkListenerConnectOperationStarted.planRequest

### What changes were proposed in this pull request?
Add `JsonIgnore` to `SparkListenerConnectOperationStarted.planRequest`

### Why are the changes needed?
`SparkListenerConnectOperationStarted` was added as part of 
[SPARK-43923](https://issues.apache.org/jira/browse/SPARK-43923).
`SparkListenerConnectOperationStarted.planRequest` cannot be serialized to or deserialized from JSON because it contains recursive objects, which causes failures when attempting these operations.
```
com.fasterxml.jackson.databind.exc.InvalidDefinitionException: Direct 
self-reference leading to cycle (through reference chain: 
org.apache.spark.sql.connect.service.SparkListenerConnectOperationStarted["planRequest"]->org.apache.spark.connect.proto.ExecutePlanRequest["unknownFields"]->grpc_shaded.com.google.protobuf.UnknownFieldSet["defaultInstanceForType"])
at 
com.fasterxml.jackson.databind.exc.InvalidDefinitionException.from(InvalidDefinitionException.java:77)
at 
com.fasterxml.jackson.databind.SerializerProvider.reportBadDefinition(SerializerProvider.java:1308)
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit

Closes #42550 from jdesjean/SPARK-44861.

Authored-by: jdesjean 
Signed-off-by: Herman van Hovell 
---
 .../sql/connect/service/ExecuteEventsManager.scala |  37 ---
 .../service/ExecuteEventsManagerSuite.scala| 114 ++---
 .../ui/SparkConnectServerListenerSuite.scala   |   3 -
 .../connect/ui/SparkConnectServerPageSuite.scala   |   1 -
 4 files changed, 95 insertions(+), 60 deletions(-)

diff --git 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/ExecuteEventsManager.scala
 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/ExecuteEventsManager.scala
index 5b9267a9679..23a67b7292b 100644
--- 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/ExecuteEventsManager.scala
+++ 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/ExecuteEventsManager.scala
@@ -119,19 +119,19 @@ case class ExecuteEventsManager(executeHolder: 
ExecuteHolder, clock: Clock) {
 s"${request.getPlan.getOpTypeCase} not supported.")
   }
 
-listenerBus.post(
-  SparkListenerConnectOperationStarted(
-jobTag,
-operationId,
-clock.getTimeMillis(),
-sessionId,
-request.getUserContext.getUserId,
-request.getUserContext.getUserName,
-Utils.redact(
-  sessionHolder.session.sessionState.conf.stringRedactionPattern,
-  ProtoUtils.abbreviate(plan, 
ExecuteEventsManager.MAX_STATEMENT_TEXT_SIZE).toString),
-Some(request),
-sparkSessionTags))
+val event = SparkListenerConnectOperationStarted(
+  jobTag,
+  operationId,
+  clock.getTimeMillis(),
+  sessionId,
+  request.getUserContext.getUserId,
+  request.getUserContext.getUserName,
+  Utils.redact(
+sessionHolder.session.sessionState.conf.stringRedactionPattern,
+ProtoUtils.abbreviate(plan, 
ExecuteEventsManager.MAX_STATEMENT_TEXT_SIZE).toString),
+  sparkSessionTags)
+event.planRequest = Some(request)
+listenerBus.post(event)
   }
 
   /**
@@ -290,8 +290,6 @@ case class ExecuteEventsManager(executeHolder: 
ExecuteHolder, clock: Clock) {
  *   Opaque userName set in the Connect request.
  * @param statementText:
  *   The connect request plan converted to text.
- * @param planRequest:
- *   The Connect request. None if the operation is not of type @link 
proto.ExecutePlanRequest
  * @param sparkSessionTags:
  *   Extra tags set by the user (via SparkSession.addTag).
  * @param extraTags:
@@ -305,10 +303,15 @@ case class SparkListenerConnectOperationStarted(
 userId: String,
 userName: String,
 statementText: String,
-planRequest: Option[proto.ExecutePlanRequest],
 sparkSessionTags: Set[String],
 extraTags: Map[String, String] = Map.empty)
-extends SparkListenerEvent
+extends SparkListenerEvent {
+
+  /**
+   * The Connect request. None if the operation is not of type @link 
proto.ExecutePlanRequest
+   */
+  @JsonIgnore var planRequest: Option[proto.ExecutePlanRequest] = None
+}
 
 /**
  * The event is sent after a Connect request has been analyzed (@link

[spark] branch master updated (75e6b731cce -> 378b76e7978)

2023-08-23 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 75e6b731cce [SPARK-44549][SQL] Support window functions in correlated 
scalar subqueries
 add 378b76e7978 [MINOR][PYTHON] Code cleanup: remove resolved todo items

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/dataframe.py | 1 -
 python/pyspark/sql/connect/functions.py | 4 
 python/pyspark/sql/group.py | 1 -
 3 files changed, 6 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-44549][SQL] Support window functions in correlated scalar subqueries

2023-08-23 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 75e6b731cce [SPARK-44549][SQL] Support window functions in correlated 
scalar subqueries
75e6b731cce is described below

commit 75e6b731cced459cd4a7eaed75b0bb84334950a3
Author: Andrey Gubichev 
AuthorDate: Wed Aug 23 22:49:41 2023 +0800

[SPARK-44549][SQL] Support window functions in correlated scalar subqueries

### What changes were proposed in this pull request?

Support window functions in correlated scalar subqueries.
Support in IN/EXISTS subqueries will come after we migrate them into 
DecorrelateInnerQuery framework. In addition, correlation is not yet supported 
inside the window function itself.
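
As a hedged illustration of the newly supported shape (the `dept` and `emp` tables and their columns are made up, and an active SparkSession named `spark` is assumed): the correlated scalar subquery below contains a window function, while the correlation itself stays outside the window specification.

```
# Hypothetical tables dept(dept_id, budget) and emp(dept_id, salary, hire_date);
# the correlation e.dept_id = d.dept_id sits below the window, not inside its spec.
spark.sql("""
    SELECT d.dept_id, d.budget
    FROM   dept d
    WHERE  d.budget >
           (SELECT MAX(running_total)
            FROM  (SELECT SUM(e.salary) OVER (ORDER BY e.hire_date) AS running_total
                   FROM   emp e
                   WHERE  e.dept_id = d.dept_id) win)
""").show()
```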

### Why are the changes needed?

Supports more subqueries.

### Does this PR introduce _any_ user-facing change?

Yes, users can run more subqueries now.

### How was this patch tested?

Unit and query tests. The results of the query tests were checked against PostgreSQL.

Closes #42383 from agubichev/SPARK-44549-corr-window.

Authored-by: Andrey Gubichev 
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/analysis/CheckAnalysis.scala  |   5 +
 .../catalyst/optimizer/DecorrelateInnerQuery.scala |  22 ++-
 .../optimizer/DecorrelateInnerQuerySuite.scala |  44 ++
 .../analyzer-results/join-lateral.sql.out  |  69 +
 .../exists-subquery/exists-aggregate.sql.out   |  24 
 .../subquery/in-subquery/in-group-by.sql.out   |  23 +++
 .../scalar-subquery-predicate.sql.out  | 159 +
 .../resources/sql-tests/inputs/join-lateral.sql|   2 +-
 .../subquery/exists-subquery/exists-aggregate.sql  |   7 +
 .../inputs/subquery/in-subquery/in-group-by.sql|   7 +-
 .../scalar-subquery/scalar-subquery-predicate.sql  |  38 +
 .../sql-tests/results/join-lateral.sql.out |  46 ++
 .../exists-subquery/exists-aggregate.sql.out   |  26 
 .../subquery/in-subquery/in-group-by.sql.out   |  25 
 .../scalar-subquery-predicate.sql.out  |  79 ++
 .../scala/org/apache/spark/sql/SubquerySuite.scala |   2 +-
 16 files changed, 511 insertions(+), 67 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
index a86a6052708..0d663dfb5f9 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
@@ -1351,6 +1351,11 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog with QueryErrorsB
   failOnInvalidOuterReference(a)
   checkPlan(a.child, aggregated = true, canContainOuter)
 
+// Same as Aggregate above.
+case w: Window =>
+  failOnInvalidOuterReference(w)
+  checkPlan(w.child, aggregated = true, canContainOuter)
+
 // Distinct does not host any correlated expressions, but during the 
optimization phase
 // it will be rewritten as Aggregate, which can only be on a 
correlation path if the
 // correlation contains only the supported correlated equality 
predicates.
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala
index a3e264579f4..a07177f6e8a 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala
@@ -478,7 +478,8 @@ object DecorrelateInnerQuery extends PredicateHelper {
 // parentOuterReferences: a set of parent outer references. As we recurse 
down we collect the
 // set of outer references that are part of the Domain, and use it to 
construct the DomainJoins
 // and join conditions.
-// aggregated: a boolean flag indicating whether the result of the plan 
will be aggregated.
+// aggregated: a boolean flag indicating whether the result of the plan 
will be aggregated
+// (or used as an input for a window function)
 // underSetOp: a boolean flag indicating whether a set operator (e.g. 
UNION) is a parent of the
 // inner plan.
 //
@@ -654,6 +655,25 @@ object DecorrelateInnerQuery extends PredicateHelper {
 val newProject = Project(newProjectList ++ referencesToAdd, 
newChild)
 (newProject, joinCond, outerReferenceMap)
 
+  case w @ Window(projectList, partitionSpec, orderSpec, child) =>
+val outerReferences = 

[spark] branch branch-3.5 updated: [SPARK-44914][BUILD] Upgrade `Apache ivy` from 2.5.1 to 2.5.2

2023-08-23 Thread yangjie01
This is an automated email from the ASF dual-hosted git repository.

yangjie01 pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new aee36e0c242 [SPARK-44914][BUILD] Upgrade `Apache ivy` from 2.5.1 to 
2.5.2
aee36e0c242 is described below

commit aee36e0c242935ab85aa4022d8b1ebf2b6b628e7
Author: Bjørn Jørgensen 
AuthorDate: Wed Aug 23 20:58:12 2023 +0800

[SPARK-44914][BUILD] Upgrade `Apache ivy` from 2.5.1 to 2.5.2

### What changes were proposed in this pull request?
Upgrade Apache ivy from 2.5.1 to 2.5.2

[Release 
notes](https://lists.apache.org/thread/9gcz4xrsn8c7o9gb377xfzvkb8jltffr)

### Why are the changes needed?
[CVE-2022-46751](https://www.cve.org/CVERecord?id=CVE-2022-46751)

The fix 
https://github.com/apache/ant-ivy/commit/2be17bc18b0e1d4123007d579e43ba1a4b6fab3d
### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #42613 from bjornjorgensen/ivy-2.5.2.

Authored-by: Bjørn Jørgensen 
Signed-off-by: yangjie01 
(cherry picked from commit 611e17e89260cd8d2b12edfc060f31a73773fa02)
Signed-off-by: yangjie01 
---
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 +-
 pom.xml   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 
b/dev/deps/spark-deps-hadoop-3-hive-2.3
index b6aba589d5f..8f898fc1ef5 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -98,7 +98,7 @@ httpclient/4.5.14//httpclient-4.5.14.jar
 httpcore/4.4.16//httpcore-4.4.16.jar
 ini4j/0.5.4//ini4j-0.5.4.jar
 istack-commons-runtime/3.0.8//istack-commons-runtime-3.0.8.jar
-ivy/2.5.1//ivy-2.5.1.jar
+ivy/2.5.2//ivy-2.5.2.jar
 jackson-annotations/2.15.2//jackson-annotations-2.15.2.jar
 jackson-core-asl/1.9.13//jackson-core-asl-1.9.13.jar
 jackson-core/2.15.2//jackson-core-2.15.2.jar
diff --git a/pom.xml b/pom.xml
index 2c340f847bf..e550e16553d 100644
--- a/pom.xml
+++ b/pom.xml
@@ -146,7 +146,7 @@
 9.4.51.v20230217
 4.0.3
 0.10.0
-    <ivy.version>2.5.1</ivy.version>
+    <ivy.version>2.5.2</ivy.version>
 2.0.8
 

[spark] branch master updated: [SPARK-44914][BUILD] Upgrade `Apache ivy` from 2.5.1 to 2.5.2

2023-08-23 Thread yangjie01
This is an automated email from the ASF dual-hosted git repository.

yangjie01 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 611e17e8926 [SPARK-44914][BUILD] Upgrade `Apache ivy` from 2.5.1 to 
2.5.2
611e17e8926 is described below

commit 611e17e89260cd8d2b12edfc060f31a73773fa02
Author: Bjørn Jørgensen 
AuthorDate: Wed Aug 23 20:58:12 2023 +0800

[SPARK-44914][BUILD] Upgrade `Apache ivy` from 2.5.1 to 2.5.2

### What changes were proposed in this pull request?
Upgrade Apache ivy from 2.5.1 to 2.5.2

[Release 
notes](https://lists.apache.org/thread/9gcz4xrsn8c7o9gb377xfzvkb8jltffr)

### Why are the changes needed?
[CVE-2022-46751](https://www.cve.org/CVERecord?id=CVE-2022-46751)

The fix 
https://github.com/apache/ant-ivy/commit/2be17bc18b0e1d4123007d579e43ba1a4b6fab3d
### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #42613 from bjornjorgensen/ivy-2.5.2.

Authored-by: Bjørn Jørgensen 
Signed-off-by: yangjie01 
---
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 +-
 pom.xml   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 
b/dev/deps/spark-deps-hadoop-3-hive-2.3
index bf1c568669e..2c63641931e 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -97,7 +97,7 @@ httpclient/4.5.14//httpclient-4.5.14.jar
 httpcore/4.4.16//httpcore-4.4.16.jar
 ini4j/0.5.4//ini4j-0.5.4.jar
 istack-commons-runtime/3.0.8//istack-commons-runtime-3.0.8.jar
-ivy/2.5.1//ivy-2.5.1.jar
+ivy/2.5.2//ivy-2.5.2.jar
 jackson-annotations/2.15.2//jackson-annotations-2.15.2.jar
 jackson-core-asl/1.9.13//jackson-core-asl-1.9.13.jar
 jackson-core/2.15.2//jackson-core-2.15.2.jar
diff --git a/pom.xml b/pom.xml
index 804c2f8fb4b..4d302949fc1 100644
--- a/pom.xml
+++ b/pom.xml
@@ -146,7 +146,7 @@
 9.4.51.v20230217
 4.0.3
 0.10.0
-    <ivy.version>2.5.1</ivy.version>
+    <ivy.version>2.5.2</ivy.version>
 2.0.8
 

[spark] branch branch-3.5 updated: [SPARK-44908][ML][CONNECT] Fix cross validator foldCol param functionality

2023-08-23 Thread weichenxu123
This is an automated email from the ASF dual-hosted git repository.

weichenxu123 pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 40ccabfd681 [SPARK-44908][ML][CONNECT] Fix cross validator foldCol 
param functionality
40ccabfd681 is described below

commit 40ccabfd68141eabe8f9b9bf15acad9fc6b7dff1
Author: Weichen Xu 
AuthorDate: Wed Aug 23 18:19:15 2023 +0800

[SPARK-44908][ML][CONNECT] Fix cross validator foldCol param functionality

### What changes were proposed in this pull request?

Fix cross validator foldCol param functionality.
In the main branch the code calls `df.rdd` APIs, which are not supported in Spark Connect.

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UT.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42605 from WeichenXu123/fix-tuning-connect-foldCol.

Authored-by: Weichen Xu 
Signed-off-by: Weichen Xu 
(cherry picked from commit 0d1b5975b2d308c616312d53b9f7ad754348a266)
Signed-off-by: Weichen Xu 
---
 python/pyspark/ml/connect/tuning.py| 24 ++---
 .../ml/tests/connect/test_legacy_mode_tuning.py| 25 ++
 2 files changed, 32 insertions(+), 17 deletions(-)

diff --git a/python/pyspark/ml/connect/tuning.py 
b/python/pyspark/ml/connect/tuning.py
index c22c31e84e8..871e448966c 100644
--- a/python/pyspark/ml/connect/tuning.py
+++ b/python/pyspark/ml/connect/tuning.py
@@ -42,8 +42,7 @@ from pyspark.ml.connect.io_utils import (
 )
 from pyspark.ml.param import Params, Param, TypeConverters
 from pyspark.ml.param.shared import HasParallelism, HasSeed
-from pyspark.sql.functions import col, lit, rand, UserDefinedFunction
-from pyspark.sql.types import BooleanType
+from pyspark.sql.functions import col, lit, rand
 from pyspark.sql.dataframe import DataFrame
 from pyspark.sql import SparkSession
 
@@ -477,23 +476,14 @@ class CrossValidator(
 train = df.filter(~condition)
 datasets.append((train, validation))
 else:
-# Use user-specified fold numbers.
-def checker(foldNum: int) -> bool:
-if foldNum < 0 or foldNum >= nFolds:
-raise ValueError(
-"Fold number must be in range [0, %s), but got %s." % 
(nFolds, foldNum)
-)
-return True
-
-checker_udf = UserDefinedFunction(checker, BooleanType())
+# TODO:
+#  Add verification that foldCol column values are in range [0, 
nFolds)
 for i in range(nFolds):
-training = dataset.filter(checker_udf(dataset[foldCol]) & 
(col(foldCol) != lit(i)))
-validation = dataset.filter(
-checker_udf(dataset[foldCol]) & (col(foldCol) == lit(i))
-)
-if training.rdd.getNumPartitions() == 0 or 
len(training.take(1)) == 0:
+training = dataset.filter(col(foldCol) != lit(i))
+validation = dataset.filter(col(foldCol) == lit(i))
+if training.isEmpty():
 raise ValueError("The training data at fold %s is empty." 
% i)
-if validation.rdd.getNumPartitions() == 0 or 
len(validation.take(1)) == 0:
+if validation.isEmpty():
 raise ValueError("The validation data at fold %s is 
empty." % i)
 datasets.append((training, validation))
 
diff --git a/python/pyspark/ml/tests/connect/test_legacy_mode_tuning.py 
b/python/pyspark/ml/tests/connect/test_legacy_mode_tuning.py
index d6c813533d6..0ade227540c 100644
--- a/python/pyspark/ml/tests/connect/test_legacy_mode_tuning.py
+++ b/python/pyspark/ml/tests/connect/test_legacy_mode_tuning.py
@@ -246,6 +246,31 @@ class CrossValidatorTestsMixin:
 np.testing.assert_allclose(cv_model.avgMetrics, 
loaded_cv_model.avgMetrics)
 np.testing.assert_allclose(cv_model.stdMetrics, 
loaded_cv_model.stdMetrics)
 
+def test_crossvalidator_with_fold_col(self):
+sk_dataset = load_breast_cancer()
+
+train_dataset = self.spark.createDataFrame(
+zip(
+sk_dataset.data.tolist(),
+[int(t) for t in sk_dataset.target],
+[int(i % 3) for i in range(len(sk_dataset.target))],
+),
+schema="features: array<double>, label: long, fold: long",
+)
+
+lorv2 = LORV2(numTrainWorkers=2)
+
+grid2 = ParamGridBuilder().addGrid(lorv2.maxIter, [2, 200]).build()
+cv = CrossValidator(
+estimator=lorv2,
+estimatorParamMaps=grid2,
+parallelism=2,
+

[spark] branch master updated: [SPARK-44908][ML][CONNECT] Fix cross validator foldCol param functionality

2023-08-23 Thread weichenxu123
This is an automated email from the ASF dual-hosted git repository.

weichenxu123 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0d1b5975b2d [SPARK-44908][ML][CONNECT] Fix cross validator foldCol 
param functionality
0d1b5975b2d is described below

commit 0d1b5975b2d308c616312d53b9f7ad754348a266
Author: Weichen Xu 
AuthorDate: Wed Aug 23 18:19:15 2023 +0800

[SPARK-44908][ML][CONNECT] Fix cross validator foldCol param functionality

### What changes were proposed in this pull request?

Fix cross validator foldCol param functionality.
In the main branch the code calls `df.rdd` APIs, which are not supported in Spark Connect.
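
As a hedged usage sketch loosely mirroring the new unit test (the toy data and estimator settings are illustrative, an active SparkSession named `spark` is assumed, and the torch dependencies required by the connect estimator are assumed to be installed): the `foldCol` column assigns each row a fold id in `[0, numFolds)`, and the per-fold splits are now plain DataFrame filters rather than `df.rdd`-based checks.

```
from pyspark.ml.connect.classification import LogisticRegression
from pyspark.ml.connect.evaluation import BinaryClassificationEvaluator
from pyspark.ml.connect.tuning import CrossValidator
from pyspark.ml.tuning import ParamGridBuilder

# Toy dataset; the `fold` column holds a precomputed fold id in [0, numFolds).
df = spark.createDataFrame(
    [([0.0, 1.0], 0, 0), ([1.0, 2.0], 1, 1), ([2.0, 1.5], 0, 2),
     ([3.0, 3.0], 1, 0), ([0.5, 0.5], 0, 1), ([2.5, 2.0], 1, 2)],
    schema="features: array<double>, label: long, fold: long",
)

lr = LogisticRegression(maxIter=10, numTrainWorkers=1)
grid = ParamGridBuilder().addGrid(lr.maxIter, [5, 10]).build()
cv = CrossValidator(
    estimator=lr,
    estimatorParamMaps=grid,
    evaluator=BinaryClassificationEvaluator(),
    foldCol="fold",  # user-specified folds; splits are DataFrame filters now
    numFolds=3,
)
cv_model = cv.fit(df)
```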

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UT.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42605 from WeichenXu123/fix-tuning-connect-foldCol.

Authored-by: Weichen Xu 
Signed-off-by: Weichen Xu 
---
 python/pyspark/ml/connect/tuning.py| 24 ++---
 .../ml/tests/connect/test_legacy_mode_tuning.py| 25 ++
 2 files changed, 32 insertions(+), 17 deletions(-)

diff --git a/python/pyspark/ml/connect/tuning.py 
b/python/pyspark/ml/connect/tuning.py
index c22c31e84e8..871e448966c 100644
--- a/python/pyspark/ml/connect/tuning.py
+++ b/python/pyspark/ml/connect/tuning.py
@@ -42,8 +42,7 @@ from pyspark.ml.connect.io_utils import (
 )
 from pyspark.ml.param import Params, Param, TypeConverters
 from pyspark.ml.param.shared import HasParallelism, HasSeed
-from pyspark.sql.functions import col, lit, rand, UserDefinedFunction
-from pyspark.sql.types import BooleanType
+from pyspark.sql.functions import col, lit, rand
 from pyspark.sql.dataframe import DataFrame
 from pyspark.sql import SparkSession
 
@@ -477,23 +476,14 @@ class CrossValidator(
 train = df.filter(~condition)
 datasets.append((train, validation))
 else:
-# Use user-specified fold numbers.
-def checker(foldNum: int) -> bool:
-if foldNum < 0 or foldNum >= nFolds:
-raise ValueError(
-"Fold number must be in range [0, %s), but got %s." % 
(nFolds, foldNum)
-)
-return True
-
-checker_udf = UserDefinedFunction(checker, BooleanType())
+# TODO:
+#  Add verification that foldCol column values are in range [0, 
nFolds)
 for i in range(nFolds):
-training = dataset.filter(checker_udf(dataset[foldCol]) & 
(col(foldCol) != lit(i)))
-validation = dataset.filter(
-checker_udf(dataset[foldCol]) & (col(foldCol) == lit(i))
-)
-if training.rdd.getNumPartitions() == 0 or 
len(training.take(1)) == 0:
+training = dataset.filter(col(foldCol) != lit(i))
+validation = dataset.filter(col(foldCol) == lit(i))
+if training.isEmpty():
 raise ValueError("The training data at fold %s is empty." 
% i)
-if validation.rdd.getNumPartitions() == 0 or 
len(validation.take(1)) == 0:
+if validation.isEmpty():
 raise ValueError("The validation data at fold %s is 
empty." % i)
 datasets.append((training, validation))
 
diff --git a/python/pyspark/ml/tests/connect/test_legacy_mode_tuning.py 
b/python/pyspark/ml/tests/connect/test_legacy_mode_tuning.py
index d6c813533d6..0ade227540c 100644
--- a/python/pyspark/ml/tests/connect/test_legacy_mode_tuning.py
+++ b/python/pyspark/ml/tests/connect/test_legacy_mode_tuning.py
@@ -246,6 +246,31 @@ class CrossValidatorTestsMixin:
 np.testing.assert_allclose(cv_model.avgMetrics, 
loaded_cv_model.avgMetrics)
 np.testing.assert_allclose(cv_model.stdMetrics, 
loaded_cv_model.stdMetrics)
 
+def test_crossvalidator_with_fold_col(self):
+sk_dataset = load_breast_cancer()
+
+train_dataset = self.spark.createDataFrame(
+zip(
+sk_dataset.data.tolist(),
+[int(t) for t in sk_dataset.target],
+[int(i % 3) for i in range(len(sk_dataset.target))],
+),
+schema="features: array<double>, label: long, fold: long",
+)
+
+lorv2 = LORV2(numTrainWorkers=2)
+
+grid2 = ParamGridBuilder().addGrid(lorv2.maxIter, [2, 200]).build()
+cv = CrossValidator(
+estimator=lorv2,
+estimatorParamMaps=grid2,
+parallelism=2,
+evaluator=BinaryClassificationEvaluator(),
+foldCol="fold",
+numFolds=3,
+)
+

[spark] branch master updated: [SPARK-44923][PYTHON][BUILD] Some directories should be cleared when regenerating files

2023-08-23 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 4d90c597ec9 [SPARK-44923][PYTHON][BUILD] Some directories should be 
cleared when regenerating files
4d90c597ec9 is described below

commit 4d90c597ec9089394af7e580cb3ad46cf1179098
Author: panbingkun 
AuthorDate: Wed Aug 23 17:12:14 2023 +0800

[SPARK-44923][PYTHON][BUILD] Some directories should be cleared when 
regenerating files

### What changes were proposed in this pull request?
The PR aims to fix a bug in regenerating the PySpark docs in certain scenarios.

### Why are the changes needed?
- The following error occurred while I was regenerating the PySpark docs
  (screenshot: https://github.com/apache/spark/assets/15246973/548abd63-4349-4267-b1fe-a293bd1e7f3e).

- We can reproduce this problem as follows:
  1. git reset --hard 3f380b9ecc8b27f6965b554061572e0990f0513
     (screenshot: https://github.com/apache/spark/assets/15246973/5ab9c8fc-5835-4ced-8d92-9d5e020b262a)
  2. make clean html (at this point, it succeeds)
     (screenshot: https://github.com/apache/spark/assets/15246973/5c3ce07f-cbe8-4177-ae22-b16c3fc62e01)
  3. git pull (at this point, the function `chr` has been deleted, but the previously generated file `pyspark.sql.functions.chr.rst` is not deleted)
  4. make clean html (at this point, it fails)
     (screenshot: https://github.com/apache/spark/assets/15246973/548abd63-4349-4267-b1fe-a293bd1e7f3e)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
1. Pass GA.
2. Manually test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #42622 from panbingkun/SPARK-44923.

Authored-by: panbingkun 
Signed-off-by: Ruifeng Zheng 
---
 python/docs/source/conf.py | 26 ++
 1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/python/docs/source/conf.py b/python/docs/source/conf.py
index 0f57cb37cee..e4ca39d4f0c 100644
--- a/python/docs/source/conf.py
+++ b/python/docs/source/conf.py
@@ -33,22 +33,16 @@ generate_supported_api(output_rst_file_path)
 
 # Remove previously generated rst files. Ignore errors just in case it stops
 # generating whole docs.
-shutil.rmtree(
-"%s/reference/api" % os.path.dirname(os.path.abspath(__file__)), 
ignore_errors=True)
-shutil.rmtree(
-"%s/reference/pyspark.pandas/api" % 
os.path.dirname(os.path.abspath(__file__)),
-ignore_errors=True)
-try:
-os.mkdir("%s/reference/api" % os.path.dirname(os.path.abspath(__file__)))
-except OSError as e:
-if e.errno != errno.EEXIST:
-raise
-try:
-os.mkdir("%s/reference/pyspark.pandas/api" % os.path.dirname(
-os.path.abspath(__file__)))
-except OSError as e:
-if e.errno != errno.EEXIST:
-raise
+gen_rst_dirs = ["reference/api", "reference/pyspark.pandas/api",
+"reference/pyspark.sql/api", "reference/pyspark.ss/api"]
+for gen_rst_dir in gen_rst_dirs:
+absolute_gen_rst_dir = "%s/%s" % 
(os.path.dirname(os.path.abspath(__file__)), gen_rst_dir)
+shutil.rmtree(absolute_gen_rst_dir, ignore_errors=True)
+try:
+os.mkdir(absolute_gen_rst_dir)
+except OSError as e:
+if e.errno != errno.EEXIST:
+raise
 
 # -- General configuration 
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.3 updated (352810b2b45 -> aa6f6f74dc9)

2023-08-23 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


from 352810b2b45 [SPARK-44920][CORE] Use await() instead of 
awaitUninterruptibly() in TransportClientFactory.createClient()
 add aa6f6f74dc9 [SPARK-44871][SQL][3.3] Fix percentile_disc behaviour

No new revisions were added by this update.

Summary of changes:
 .../expressions/aggregate/percentiles.scala|  39 +--
 .../org/apache/spark/sql/internal/SQLConf.scala|  10 ++
 .../resources/sql-tests/inputs/percentiles.sql |  74 +
 .../sql-tests/results/percentiles.sql.out  | 118 +
 4 files changed, 234 insertions(+), 7 deletions(-)
 create mode 100644 sql/core/src/test/resources/sql-tests/inputs/percentiles.sql
 create mode 100644 
sql/core/src/test/resources/sql-tests/results/percentiles.sql.out


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.5 updated: [SPARK-44909][ML] Skip starting torch distributor log streaming server when it is not available

2023-08-23 Thread weichenxu123
This is an automated email from the ASF dual-hosted git repository.

weichenxu123 pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 4f61662e91a [SPARK-44909][ML] Skip starting torch distributor log 
streaming server when it is not available
4f61662e91a is described below

commit 4f61662e91a77315a9d3d7454884f47c8bb9a6b1
Author: Weichen Xu 
AuthorDate: Wed Aug 23 15:30:50 2023 +0800

[SPARK-44909][ML] Skip starting torch distributor log streaming server when 
it is not available

### What changes were proposed in this pull request?

Skip starting torch distributor log streaming server when it is not 
available.

In some cases, e.g. in a Databricks Connect cluster, a network limitation causes the log streaming server to fail to start, but this should not break the torch distributor training routine.

This PR catches the exception raised from the log server's `start` method and sets the server port to -1 if `start` fails.
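
For orientation, a hedged sketch of the caller-side API this guards (the training function and settings are made up, and `torch` is assumed to be installed); if the log streaming server cannot start, the run below still executes and only the forwarded worker logs are lost.

```
from pyspark.ml.torch.distributor import TorchDistributor

def train_fn(learning_rate):
    # Placeholder body; a real training function would build and fit a torch model.
    return learning_rate

# Even if the log streaming server fails to start, a warning is emitted and the
# distributed run still proceeds.
distributor = TorchDistributor(num_processes=2, local_mode=True, use_gpu=False)
result = distributor.run(train_fn, 0.01)
```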

### Why are the changes needed?

In some cases, e.g. in a Databricks Connect cluster, a network limitation causes the log streaming server to fail to start, but this should not break the torch distributor training routine.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

UT.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42606 from WeichenXu123/fix-torch-log-server-in-connect-mode.

Authored-by: Weichen Xu 
Signed-off-by: Weichen Xu 
(cherry picked from commit 80668dc1a36ac0def80f3c18f981fbdacfb2904d)
Signed-off-by: Weichen Xu 
---
 python/pyspark/ml/torch/distributor.py   | 16 +---
 python/pyspark/ml/torch/log_communication.py |  3 +++
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/python/pyspark/ml/torch/distributor.py 
b/python/pyspark/ml/torch/distributor.py
index b407672ac48..d0979f53d41 100644
--- a/python/pyspark/ml/torch/distributor.py
+++ b/python/pyspark/ml/torch/distributor.py
@@ -765,9 +765,19 @@ class TorchDistributor(Distributor):
 log_streaming_server = LogStreamingServer()
 self.driver_address = _get_conf(self.spark, "spark.driver.host", "")
 assert self.driver_address != ""
-log_streaming_server.start(spark_host_address=self.driver_address)
-time.sleep(1)  # wait for the server to start
-self.log_streaming_server_port = log_streaming_server.port
+try:
+log_streaming_server.start(spark_host_address=self.driver_address)
+time.sleep(1)  # wait for the server to start
+self.log_streaming_server_port = log_streaming_server.port
+except Exception as e:
+# If starting log streaming server failed, we don't need to break
+# the distributor training but emit a warning instead.
+self.log_streaming_server_port = -1
+self.logger.warning(
+"Start torch distributor log streaming server failed, "
+"You cannot receive logs sent from distributor workers, ",
+f"error: {repr(e)}.",
+)
 
 try:
 spark_task_function = self._get_spark_task_function(
diff --git a/python/pyspark/ml/torch/log_communication.py 
b/python/pyspark/ml/torch/log_communication.py
index ca91121d3e3..8efa83e62c3 100644
--- a/python/pyspark/ml/torch/log_communication.py
+++ b/python/pyspark/ml/torch/log_communication.py
@@ -156,6 +156,9 @@ class LogStreamingClient(LogStreamingClientBase):
 warnings.warn(f"{error_msg}: {traceback.format_exc()}\n")
 
 def _connect(self) -> None:
+if self.port == -1:
+self._fail("Log streaming server is not available.")
+return
 try:
 sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
 sock.settimeout(self.timeout)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (00cd5e846b6 -> 80668dc1a36)

2023-08-23 Thread weichenxu123
This is an automated email from the ASF dual-hosted git repository.

weichenxu123 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 00cd5e846b6 [SPARK-44899][PYTHON][DOCS] Refine the docstring of 
DataFrame.collect
 add 80668dc1a36 [SPARK-44909][ML] Skip starting torch distributor log 
streaming server when it is not available

No new revisions were added by this update.

Summary of changes:
 python/pyspark/ml/torch/distributor.py   | 16 +---
 python/pyspark/ml/torch/log_communication.py |  3 +++
 2 files changed, 16 insertions(+), 3 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-44899][PYTHON][DOCS] Refine the docstring of DataFrame.collect

2023-08-23 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 00cd5e846b6 [SPARK-44899][PYTHON][DOCS] Refine the docstring of 
DataFrame.collect
00cd5e846b6 is described below

commit 00cd5e846b6bc859f802ce8e76887b8b62b14186
Author: allisonwang-db 
AuthorDate: Wed Aug 23 14:16:14 2023 +0800

[SPARK-44899][PYTHON][DOCS] Refine the docstring of DataFrame.collect

### What changes were proposed in this pull request?

This PR refines the docstring of `DataFrame.collect()` function.

### Why are the changes needed?

To improve the documentation of PySpark.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

doc test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #42592 from allisonwang-db/spark-44899-collect-docs.

Authored-by: allisonwang-db 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/dataframe.py | 53 +
 1 file changed, 49 insertions(+), 4 deletions(-)

diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index 3722e6dc9f3..10004015c6f 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -1268,7 +1268,7 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
 return int(self._jdf.count())
 
 def collect(self) -> List[Row]:
-"""Returns all the records as a list of :class:`Row`.
+"""Returns all the records in the DataFrame as a list of :class:`Row`.
 
 .. versionadded:: 1.3.0
 
@@ -1278,14 +1278,59 @@ class DataFrame(PandasMapOpsMixin, 
PandasConversionMixin):
 Returns
 ---
 list
-List of rows.
+A list of :class:`Row` objects, each representing a row in the 
DataFrame.
+
+See Also
+
+DataFrame.take : Returns the first `n` rows.
+DataFrame.head : Returns the first `n` rows.
+DataFrame.toPandas : Returns the data as a pandas DataFrame.
+
+Notes
+-
+This method should only be used if the resulting list is expected to 
be small,
+as all the data is loaded into the driver's memory.
 
 Examples
 
->>> df = spark.createDataFrame(
-... [(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
+Example: Collecting all rows of a DataFrame
+
+>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, 
"Bob")], ["age", "name"])
 >>> df.collect()
 [Row(age=14, name='Tom'), Row(age=23, name='Alice'), Row(age=16, 
name='Bob')]
+
+Example: Collecting all rows after filtering
+
+>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, 
"Bob")], ["age", "name"])
+>>> df.filter(df.age > 15).collect()
+[Row(age=23, name='Alice'), Row(age=16, name='Bob')]
+
+Example: Collecting all rows after selecting specific columns
+
+>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, 
"Bob")], ["age", "name"])
+>>> df.select("name").collect()
+[Row(name='Tom'), Row(name='Alice'), Row(name='Bob')]
+
+Example: Collecting all rows after applying a function to a column
+
+>>> from pyspark.sql.functions import upper
+>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, 
"Bob")], ["age", "name"])
+>>> df.select(upper(df.name)).collect()
+[Row(upper(name)='TOM'), Row(upper(name)='ALICE'), 
Row(upper(name)='BOB')]
+
+Example: Collecting all rows from a DataFrame and converting a 
specific column to a list
+
+>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, 
"Bob")], ["age", "name"])
+>>> rows = df.collect()
+>>> [row["name"] for row in rows]
+['Tom', 'Alice', 'Bob']
+
+Example: Collecting all rows from a DataFrame and converting to a list 
of dictionaries
+
+>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, 
"Bob")], ["age", "name"])
+>>> rows = df.collect()
+>>> [row.asDict() for row in rows]
+[{'age': 14, 'name': 'Tom'}, {'age': 23, 'name': 'Alice'}, {'age': 16, 
'name': 'Bob'}]
 """
 with SCCallSiteSync(self._sc):
 sock_info = self._jdf.collectToPython()


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org