[spark] branch master updated: [SPARK-43914][SQL] Assign names to the error class _LEGACY_ERROR_TEMP_[2433-2437]

2023-06-27 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 1c8c47cb55d [SPARK-43914][SQL] Assign names to the error class 
_LEGACY_ERROR_TEMP_[2433-2437]
1c8c47cb55d is described below

commit 1c8c47cb55da75526fef4dd41ed0734b01e71814
Author: Jiaan Geng 
AuthorDate: Wed Jun 28 08:22:01 2023 +0300

[SPARK-43914][SQL] Assign names to the error class 
_LEGACY_ERROR_TEMP_[2433-2437]

### What changes were proposed in this pull request?
This PR aims to assign names to the error classes _LEGACY_ERROR_TEMP_[2433-2437].
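For example, _LEGACY_ERROR_TEMP_2433 (more than one table-generating function in a
SELECT clause) is now raised through `QueryCompilationErrors.moreThanOneGeneratorError`,
while the conflicting-reference cases for Join/Intersect/Except/AsOfJoin become internal
errors. A minimal reproduction sketch for the generator case, assuming a local
SparkSession named `spark` (the printed error class name is an assumption, not taken
from this commit):

```scala
// Two generators in one SELECT list now surface a named error class instead of
// the legacy _LEGACY_ERROR_TEMP_2433 template.
import org.apache.spark.sql.AnalysisException

try {
  spark.sql("SELECT explode(array(1, 2)), explode(array(3, 4))").collect()
} catch {
  // Expected: a named class such as UNSUPPORTED_GENERATOR.MULTI_GENERATOR
  // (assumed name, not shown in this diff).
  case e: AnalysisException => println(e.getErrorClass)
}
```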

### Why are the changes needed?
Improve the error framework.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Existing test cases were updated.

Closes #41476 from beliefer/SPARK-43914.

Authored-by: Jiaan Geng 
Signed-off-by: Max Gekk 
---
 core/src/main/resources/error/error-classes.json   | 34 --
 .../sql/catalyst/analysis/CheckAnalysis.scala  | 65 +++
 .../sql/catalyst/analysis/AnalysisErrorSuite.scala | 74 --
 .../org/apache/spark/sql/DataFrameSuite.scala  | 14 
 4 files changed, 120 insertions(+), 67 deletions(-)

diff --git a/core/src/main/resources/error/error-classes.json 
b/core/src/main/resources/error/error-classes.json
index 342af0ffa6c..e441686432a 100644
--- a/core/src/main/resources/error/error-classes.json
+++ b/core/src/main/resources/error/error-classes.json
@@ -5637,40 +5637,6 @@
   "Cannot change nullable column to non-nullable: ."
 ]
   },
-  "_LEGACY_ERROR_TEMP_2433" : {
-"message" : [
-  "Only a single table generating function is allowed in a SELECT clause, 
found:",
-  "."
-]
-  },
-  "_LEGACY_ERROR_TEMP_2434" : {
-"message" : [
-  "Failure when resolving conflicting references in Join:",
-  "",
-  "Conflicting attributes: ."
-]
-  },
-  "_LEGACY_ERROR_TEMP_2435" : {
-"message" : [
-  "Failure when resolving conflicting references in Intersect:",
-  "",
-  "Conflicting attributes: ."
-]
-  },
-  "_LEGACY_ERROR_TEMP_2436" : {
-"message" : [
-  "Failure when resolving conflicting references in Except:",
-  "",
-  "Conflicting attributes: ."
-]
-  },
-  "_LEGACY_ERROR_TEMP_2437" : {
-"message" : [
-  "Failure when resolving conflicting references in AsOfJoin:",
-  "",
-  "Conflicting attributes: ."
-]
-  },
   "_LEGACY_ERROR_TEMP_2446" : {
 "message" : [
   "Operation not allowed:  only works on table with location 
provided: "
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
index 7c0e8f1490d..a0296d27361 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
@@ -674,9 +674,8 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog with QueryErrorsB
 }
 
   case p @ Project(exprs, _) if containsMultipleGenerators(exprs) =>
-p.failAnalysis(
-  errorClass = "_LEGACY_ERROR_TEMP_2433",
-  messageParameters = Map("sqlExprs" -> 
exprs.map(_.sql).mkString(",")))
+val generators = exprs.filter(expr => 
expr.exists(_.isInstanceOf[Generator]))
+throw QueryCompilationErrors.moreThanOneGeneratorError(generators, 
"SELECT")
 
   case p @ Project(projectList, _) =>
 projectList.foreach(_.transformDownWithPruning(
@@ -686,36 +685,48 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog with QueryErrorsB
 })
 
   case j: Join if !j.duplicateResolved =>
-val conflictingAttributes = 
j.left.outputSet.intersect(j.right.outputSet)
-j.failAnalysis(
-  errorClass = "_LEGACY_ERROR_TEMP_2434",
-  messageParameters = Map(
-"plan" -> plan.toString,
-"conflictingAttributes" -> 
conflictingAttributes.mkString(",")))
+val conflictingAttributes =
+  
j.left.outputSet.intersect(j.right.outputSet).map(toSQLExpr(_)).mkString(", ")
+throw SparkException.internalError(
+  msg = s"""
+   |Failure when resolving conflicting references in 
${j.nodeName}:
+   |${planToString(plan)}
+   |Conflicting attributes: 
$conflictingAttributes.""".stripMargin,
+  context = j.origin.getQueryContext,
+  summary = j.origin.context.summary)
 
   case i: Intersect if !i.duplicateResolved =>
-val conflictingAttributes = 

[spark] branch master updated (f00de6f77a8 -> 3cd486070bf)

2023-06-27 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from f00de6f77a8 [SPARK-44182][DOCS] Use Spark version variables in Python 
and Spark Connect installation docs
 add 3cd486070bf [SPARK-44221][BUILD] Upgrade RoaringBitmap from 0.9.44 to 
0.9.45

No new revisions were added by this update.

Summary of changes:
 core/benchmarks/MapStatusesConvertBenchmark-jdk11-results.txt | 10 +-
 core/benchmarks/MapStatusesConvertBenchmark-jdk17-results.txt | 10 +-
 core/benchmarks/MapStatusesConvertBenchmark-results.txt   | 10 +-
 dev/deps/spark-deps-hadoop-3-hive-2.3 |  4 ++--
 pom.xml   |  2 +-
 5 files changed, 18 insertions(+), 18 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-44182][DOCS] Use Spark version variables in Python and Spark Connect installation docs

2023-06-27 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f00de6f77a8 [SPARK-44182][DOCS] Use Spark version variables in Python 
and Spark Connect installation docs
f00de6f77a8 is described below

commit f00de6f77a80182215d0f3c07441849f2654b210
Author: Dongjoon Hyun 
AuthorDate: Wed Jun 28 11:16:42 2023 +0800

[SPARK-44182][DOCS] Use Spark version variables in Python and Spark Connect 
installation docs

### What changes were proposed in this pull request?

This PR aims to use Spark version placeholders in Python and Spark Connect 
installation docs
- `site.SPARK_VERSION_SHORT` in `md` files
- `|release|` in `rst` files

### Why are the changes needed?

To keep the Apache Spark documentation always up to date.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

![Screenshot 2023-06-26 at 1 51 42 
PM](https://github.com/apache/spark/assets/9700541/d4bc8166-e5cf-4c61-a1ab-0aa65810dc51)

![Screenshot 2023-06-27 at 9 21 23 
AM](https://github.com/apache/spark/assets/9700541/a5a5ed98-c37e-47c4-ba14-69923c50dfd7)

Closes #41728 from dongjoon-hyun/SPARK-44182.

Authored-by: Dongjoon Hyun 
Signed-off-by: Kent Yao 
---
 docs/spark-connect-overview.md | 10 +-
 python/docs/source/getting_started/install.rst |  8 
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/docs/spark-connect-overview.md b/docs/spark-connect-overview.md
index 55cc825a148..1e1464cfba0 100644
--- a/docs/spark-connect-overview.md
+++ b/docs/spark-connect-overview.md
@@ -93,7 +93,7 @@ the release drop down at the top of the page. Then choose 
your package type, typ
 Now extract the Spark package you just downloaded on your computer, for 
example:
 
 {% highlight bash %}
-tar -xvf spark-3.4.0-bin-hadoop3.tgz
+tar -xvf spark-{{site.SPARK_VERSION_SHORT}}-bin-hadoop3.tgz
 {% endhighlight %}
 
 In a terminal window, go to the `spark` folder in the location where you 
extracted
@@ -101,13 +101,13 @@ Spark before and run the `start-connect-server.sh` script 
to start Spark server
 Spark Connect, like in this example:
 
 {% highlight bash %}
-./sbin/start-connect-server.sh --packages 
org.apache.spark:spark-connect_2.12:3.4.0
+./sbin/start-connect-server.sh --packages 
org.apache.spark:spark-connect_2.12:{{site.SPARK_VERSION_SHORT}}
 {% endhighlight %}
 
-Note that we include a Spark Connect package (`spark-connect_2.12:3.4.0`), 
when starting
+Note that we include a Spark Connect package 
(`spark-connect_2.12:{{site.SPARK_VERSION_SHORT}}`), when starting
 Spark server. This is required to use Spark Connect. Make sure to use the same 
version
 of the package as the Spark version you downloaded previously. In this example,
-Spark 3.4.0 with Scala 2.12.
+Spark {{site.SPARK_VERSION_SHORT}} with Scala 2.12.
 
 Now Spark server is running and ready to accept Spark Connect sessions from 
client
 applications. In the next section we will walk through how to use Spark Connect
@@ -270,4 +270,4 @@ APIs you are using are available before migrating existing 
code to Spark Connect
 [functions](api/scala/org/apache/spark/sql/functions$.html), and
 [Column](api/scala/org/apache/spark/sql/Column.html).
 
-Support for more APIs is planned for upcoming Spark releases.
\ No newline at end of file
+Support for more APIs is planned for upcoming Spark releases.
diff --git a/python/docs/source/getting_started/install.rst 
b/python/docs/source/getting_started/install.rst
index b5256f2f2cb..eb296dc16d6 100644
--- a/python/docs/source/getting_started/install.rst
+++ b/python/docs/source/getting_started/install.rst
@@ -129,17 +129,17 @@ PySpark is included in the distributions available at the 
`Apache Spark website
 You can download a distribution you want from the site. After that, uncompress 
the tar file into the directory where you want
 to install Spark, for example, as below:
 
-.. code-block:: bash
+.. parsed-literal::
 
-tar xzvf spark-3.4.0-bin-hadoop3.tgz
+tar xzvf spark-\ |release|\-bin-hadoop3.tgz
 
 Ensure the ``SPARK_HOME`` environment variable points to the directory where 
the tar file has been extracted.
 Update ``PYTHONPATH`` environment variable such that it can find the PySpark 
and Py4J under ``SPARK_HOME/python/lib``.
 One example of doing this is shown below:
 
-.. code-block:: bash
+.. parsed-literal::
 
-cd spark-3.4.0-bin-hadoop3
+cd spark-\ |release|\-bin-hadoop3
 export SPARK_HOME=`pwd`
 export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo 
"${ZIPS[*]}"):$PYTHONPATH
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

[spark] branch master updated: [SPARK-44206][SQL] DataSet.selectExpr scope Session.active

2023-06-27 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 1a1ae1f8cf6 [SPARK-44206][SQL] DataSet.selectExpr scope Session.active
1a1ae1f8cf6 is described below

commit 1a1ae1f8cf6c1ba6f8fc40081db4a0782a54dd66
Author: zml1206 
AuthorDate: Wed Jun 28 11:02:09 2023 +0800

[SPARK-44206][SQL] DataSet.selectExpr scope Session.active

### What changes were proposed in this pull request?
`Dataset.selectExpr` is now covered by withActive, to scope Session.active.
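A minimal, runnable sketch of the withActive save/set/restore pattern that the fix
applies to `selectExpr` (the helper below is illustrative, written against the public
`SparkSession` API; it is not Spark's internal implementation):

```scala
// Illustrative sketch only: mirrors what the internal withActive helper does,
// using the public SparkSession.setActiveSession/getActiveSession API.
import org.apache.spark.sql.SparkSession

object WithActiveSketch {
  def withActive[T](session: SparkSession)(body: => T): T = {
    val previous = SparkSession.getActiveSession      // remember what was active
    SparkSession.setActiveSession(session)            // scope Session.active to `session`
    try body
    finally previous match {                          // restore the previous state
      case Some(s) => SparkSession.setActiveSession(s)
      case None => SparkSession.clearActiveSession()
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("sketch").getOrCreate()
    // Inside the block, SparkSession.active (and session-scoped confs) resolve
    // against `spark`, which is what selectExpr needs while parsing expressions.
    println(withActive(spark) { SparkSession.active eq spark })  // true
    spark.stop()
  }
}
```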

### Why are the changes needed?
[SPARK-30798](https://issues.apache.org/jira/browse/SPARK-30798) mentioned that all
SparkSession Dataset methods should be covered by withActive, but `selectExpr` was not.
For example:
```
val clone = spark.cloneSession()
clone.conf.set("spark.sql.legacy.interval.enabled", "true")
// sql1
clone.sql("select '2023-01-01' + INTERVAL 1 YEAR as b").show()
// sql2
clone.sql("select '2023-01-01' as a").selectExpr("a + INTERVAL 1 YEAR as b").show()
```
sql1 executes successfully, but sql2 fails.
Error message:
```
[DATATYPE_MISMATCH.BINARY_OP_DIFF_TYPES] Cannot resolve "(a + INTERVAL '1' 
YEAR)" due to data type mismatch: the left and right operands of the binary 
operator have incompatible types ("DOUBLE" and "INTERVAL YEAR").; line 1 pos 0;
'Project [(cast(a#2 as double) + INTERVAL '1' YEAR) AS b#4]
+- Project [2023-01-01 AS a#2]
   +- OneRowRelation

org.apache.spark.sql.AnalysisException: 
[DATATYPE_MISMATCH.BINARY_OP_DIFF_TYPES] Cannot resolve "(a + INTERVAL '1' 
YEAR)" due to data type mismatch: the left and right operands of the binary 
operator have incompatible types ("DOUBLE" and "INTERVAL YEAR").; line 1 pos 0;
'Project [(cast(a#2 as double) + INTERVAL '1' YEAR) AS b#4]
+- Project [2023-01-01 AS a#2]
   +- OneRowRelation

at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6(CheckAnalysis.scala:280)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6$adapted(CheckAnalysis.scala:267)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:294)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5(CheckAnalysis.scala:267)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5$adapted(CheckAnalysis.scala:267)
at scala.collection.immutable.Stream.foreach(Stream.scala:533)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2(CheckAnalysis.scala:267)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2$adapted(CheckAnalysis.scala:182)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0(CheckAnalysis.scala:182)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0$(CheckAnalysis.scala:164)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis0(Analyzer.scala:187)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:160)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:150)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:187)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:210)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:207)
at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:76)
at 

[spark] branch branch-3.4 updated: [SPARK-44206][SQL] DataSet.selectExpr scope Session.active

2023-06-27 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 03043823c3e [SPARK-44206][SQL] DataSet.selectExpr scope Session.active
03043823c3e is described below

commit 03043823c3ebf824d9423d5a1dd0c57d0ba8147b
Author: zml1206 
AuthorDate: Wed Jun 28 11:02:09 2023 +0800

[SPARK-44206][SQL] DataSet.selectExpr scope Session.active

### What changes were proposed in this pull request?
`Dataset.selectExpr` is now covered by withActive, to scope Session.active.

### Why are the changes needed?
[SPARK-30798](https://issues.apache.org/jira/browse/SPARK-30798) mentioned that all
SparkSession Dataset methods should be covered by withActive, but `selectExpr` was not.
For example:
```
val clone = spark.cloneSession()
clone.conf.set("spark.sql.legacy.interval.enabled", "true")
// sql1
clone.sql("select '2023-01-01' + INTERVAL 1 YEAR as b").show()
// sql2
clone.sql("select '2023-01-01' as a").selectExpr("a + INTERVAL 1 YEAR as b").show()
```
sql1 executes successfully, but sql2 fails.
Error message:
```
[DATATYPE_MISMATCH.BINARY_OP_DIFF_TYPES] Cannot resolve "(a + INTERVAL '1' 
YEAR)" due to data type mismatch: the left and right operands of the binary 
operator have incompatible types ("DOUBLE" and "INTERVAL YEAR").; line 1 pos 0;
'Project [(cast(a#2 as double) + INTERVAL '1' YEAR) AS b#4]
+- Project [2023-01-01 AS a#2]
   +- OneRowRelation

org.apache.spark.sql.AnalysisException: 
[DATATYPE_MISMATCH.BINARY_OP_DIFF_TYPES] Cannot resolve "(a + INTERVAL '1' 
YEAR)" due to data type mismatch: the left and right operands of the binary 
operator have incompatible types ("DOUBLE" and "INTERVAL YEAR").; line 1 pos 0;
'Project [(cast(a#2 as double) + INTERVAL '1' YEAR) AS b#4]
+- Project [2023-01-01 AS a#2]
   +- OneRowRelation

at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6(CheckAnalysis.scala:280)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6$adapted(CheckAnalysis.scala:267)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:294)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5(CheckAnalysis.scala:267)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5$adapted(CheckAnalysis.scala:267)
at scala.collection.immutable.Stream.foreach(Stream.scala:533)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2(CheckAnalysis.scala:267)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2$adapted(CheckAnalysis.scala:182)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0(CheckAnalysis.scala:182)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0$(CheckAnalysis.scala:164)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis0(Analyzer.scala:187)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:160)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:150)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:187)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:210)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:207)
at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:76)
at 

[spark] branch master updated: [SPARK-44039][CONNECT][TESTS] Improve for PlanGenerationTestSuite & ProtoToParsedPlanTestSuite

2023-06-27 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c8c51d04741 [SPARK-44039][CONNECT][TESTS] Improve for 
PlanGenerationTestSuite & ProtoToParsedPlanTestSuite
c8c51d04741 is described below

commit c8c51d047411959ad6c648246a5bd6ea4ae13ce8
Author: panbingkun 
AuthorDate: Tue Jun 27 21:37:33 2023 -0400

[SPARK-44039][CONNECT][TESTS] Improve for PlanGenerationTestSuite & 
ProtoToParsedPlanTestSuite

### What changes were proposed in this pull request?
This PR aims to improve PlanGenerationTestSuite & ProtoToParsedPlanTestSuite,
including:
- When generating `GOLDEN` files, we should first delete the corresponding
directories and generate new ones, to avoid submitting redundant files during
the review process (an illustrative sketch of such a cleanup follows this list).
For example: we write a test named `make_timestamp_ltz` for an overloaded
method, and during review the reviewer asks for more tests for that method, so
the test name changes in the next submission, e.g. to
`make_timestamp_ltz without timezone`. At this point, if the
`queries/function_make_timestamp_ltz.json`,
`queries/function_make_timestamp_ltz.proto.bin` and
`explain-results/function_make_timestamp_ltz.explain` files of
`function_make_timestamp_ltz` [...]

- Clean up and update some redundant files that were submitted incorrectly.
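An illustrative sketch of such an orphaned-golden-file cleanup (the helper name,
`goldenDir`, and `activeTestNames` are assumptions for illustration, not the PR's
exact code):

```scala
// Illustrative only: delete golden files whose base name no longer matches any
// test in the suite; the real suite derives the directory and test names from
// queryFilePath and its registered tests.
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

def cleanOrphanedGoldenFiles(goldenDir: Path, activeTestNames: Set[String]): Unit = {
  val stream = Files.list(goldenDir)
  try {
    stream.iterator().asScala.filter(Files.isRegularFile(_)).foreach { file =>
      // strip the golden-file suffixes to recover the owning test name
      val base = file.getFileName.toString
        .replaceAll("""\.(json|proto\.bin|explain)$""", "")
      if (!activeTestNames.contains(base)) Files.delete(file)
    }
  } finally stream.close()
}
```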

### Why are the changes needed?
Make code clear.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

Closes #41572 from panbingkun/SPARK-44039.

Authored-by: panbingkun 
Signed-off-by: Herman van Hovell 
---
 .../apache/spark/sql/PlanGenerationTestSuite.scala |  28 ++
 .../explain-results/function_percentile.explain|   2 -
 .../function_regexp_extract_all.explain|   2 -
 .../explain-results/function_regexp_instr.explain  |   2 -
 .../query-tests/explain-results/read_path.explain  |   1 -
 .../query-tests/queries/function_lit_array.json|  58 ++---
 .../query-tests/queries/function_percentile.json   |  29 ---
 .../queries/function_percentile.proto.bin  | Bin 192 -> 0 bytes
 .../queries/function_regexp_extract_all.json   |  33 
 .../queries/function_regexp_extract_all.proto.bin  | Bin 212 -> 0 bytes
 .../query-tests/queries/function_regexp_instr.json |  33 
 .../queries/function_regexp_instr.proto.bin| Bin 203 -> 0 bytes
 .../resources/query-tests/queries/read_path.json   |  11 
 .../query-tests/queries/read_path.proto.bin|   3 --
 .../queries/streaming_table_API_with_options.json  |   3 +-
 .../sql/connect/ProtoToParsedPlanTestSuite.scala   |  23 
 16 files changed, 82 insertions(+), 146 deletions(-)

diff --git 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala
 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala
index e8d04f37d7f..ecb7092b8d9 100644
--- 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala
+++ 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala
@@ -44,6 +44,7 @@ import org.apache.spark.sql.functions.lit
 import org.apache.spark.sql.protobuf.{functions => pbFn}
 import org.apache.spark.sql.types._
 import org.apache.spark.unsafe.types.CalendarInterval
+import org.apache.spark.util.Utils
 
 // scalastyle:off
 /**
@@ -61,6 +62,14 @@ import org.apache.spark.unsafe.types.CalendarInterval
  *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "connect-client-jvm/testOnly 
org.apache.spark.sql.PlanGenerationTestSuite"
  * }}}
  *
+ * If you need to clean the orphaned golden files, you need to set the
+ * SPARK_CLEAN_ORPHANED_GOLDEN_FILES=1 environment variable before running 
this test, e.g.:
+ * {{{
+ *   SPARK_CLEAN_ORPHANED_GOLDEN_FILES=1 build/sbt 
"connect-client-jvm/testOnly org.apache.spark.sql.PlanGenerationTestSuite"
+ * }}}
+ * Note: not all orphaned golden files should be cleaned, some are reserved 
for testing backups
+ * compatibility.
+ *
  * Note that the plan protos are used as the input for the 
`ProtoToParsedPlanTestSuite` in the
  * `connector/connect/server` module
  */
@@ -74,6 +83,9 @@ class PlanGenerationTestSuite
   // Borrowed from SparkFunSuite
   private val regenerateGoldenFiles: Boolean = 
System.getenv("SPARK_GENERATE_GOLDEN_FILES") == "1"
 
+  private val cleanOrphanedGoldenFiles: Boolean =
+System.getenv("SPARK_CLEAN_ORPHANED_GOLDEN_FILES") == "1"
+
   protected val queryFilePath: Path = 
commonResourcePath.resolve("query-tests/queries")
 
   // A relative path to /connector/connect/server, used by 
`ProtoToParsedPlanTestSuite` to run
@@ -111,9 +123,25 @@ class 

[spark] branch master updated: [SPARK-44161][CONNECT] Handle Row input for UDFs

2023-06-27 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 05fc3497f00 [SPARK-44161][CONNECT] Handle Row input for UDFs
05fc3497f00 is described below

commit 05fc3497f00a0aad9240f14637ea21d271b2bbe4
Author: Zhen Li 
AuthorDate: Tue Jun 27 21:05:33 2023 -0400

[SPARK-44161][CONNECT] Handle Row input for UDFs

### What changes were proposed in this pull request?
If the client passes Rows as inputs to UDFs, the Spark connect planner will 
fail to create the RowEncoder for the Row input.

The Row encoder sent by the client contains no field or schema information. 
The real input schema should be obtained from the plan's output.

This PR ensures that if the server planner fails to create the encoder for the
UDF input using reflection, it will fall back to RowEncoders created from the
plan.output schema.

This PR fixed 
[SPARK-43761](https://issues.apache.org/jira/browse/SPARK-43761) using the same 
logic.
This PR resolved 
[SPARK-43796](https://issues.apache.org/jira/browse/SPARK-43796). The error is 
just caused by the case class defined in the test.

### Why are the changes needed?
Fixes the bug where a Row cannot be used as a UDF input.
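For example, a typed operation over an untyped `Dataset[Row]` on the Spark Connect
Scala client exercises this path. A minimal sketch, assuming a Spark Connect
`SparkSession` named `spark`:

```scala
// Sketch only: the lambda's input type is Row, so the client sends an unbound
// row encoder and the server must recover the real input schema from the
// plan's output (the fallback added by this PR).
import org.apache.spark.sql.Row
import spark.implicits._   // supplies the Long output encoder

val df = spark.sql("SELECT id, id * 2 AS doubled FROM range(3)")
val sums = df.map((row: Row) => row.getLong(0) + row.getLong(1))
sums.show()
```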

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
E2E tests.

Closes #41704 from zhenlineo/rowEncoder.

Authored-by: Zhen Li 
Signed-off-by: Herman van Hovell 
---
 .../sql/expressions/UserDefinedFunction.scala  |  5 +-
 .../spark/sql/streaming/DataStreamWriter.scala | 11 +--
 .../sql/KeyValueGroupedDatasetE2ETestSuite.scala   | 45 +
 .../sql/UserDefinedFunctionE2ETestSuite.scala  | 36 +-
 .../spark/sql/streaming/StreamingQuerySuite.scala  | 55 ---
 .../sql/connect/planner/SparkConnectPlanner.scala  | 78 ++
 .../spark/sql/catalyst/ScalaReflection.scala   | 38 ---
 .../spark/sql/catalyst/ScalaReflectionSuite.scala  | 21 +-
 .../spark/sql/streaming/DataStreamWriter.scala |  6 +-
 9 files changed, 210 insertions(+), 85 deletions(-)

diff --git 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
index bfcd4572e03..14dfc0c6a86 100644
--- 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
+++ 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
@@ -142,7 +142,10 @@ object ScalarUserDefinedFunction {
 
 ScalarUserDefinedFunction(
   function = function,
-  inputEncoders = parameterTypes.map(tag => 
ScalaReflection.encoderFor(tag)),
+  // Input can be a row because the input data schema can be found from 
the plan.
+  inputEncoders =
+parameterTypes.map(tag => 
ScalaReflection.encoderForWithRowEncoderSupport(tag)),
+  // Output cannot be a row as there is no good way to get the return data 
type.
   outputEncoder = ScalaReflection.encoderFor(returnType))
   }
 
diff --git 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala
 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala
index 263e1e372c8..ed3d2bb8558 100644
--- 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala
+++ 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala
@@ -30,7 +30,6 @@ import org.apache.spark.connect.proto.Command
 import org.apache.spark.connect.proto.WriteStreamOperationStart
 import org.apache.spark.internal.Logging
 import org.apache.spark.sql.{Dataset, ForeachWriter}
-import 
org.apache.spark.sql.catalyst.encoders.AgnosticEncoders.UnboundRowEncoder
 import org.apache.spark.sql.connect.common.ForeachWriterPacket
 import org.apache.spark.sql.execution.streaming.AvailableNowTrigger
 import org.apache.spark.sql.execution.streaming.ContinuousTrigger
@@ -215,15 +214,7 @@ final class DataStreamWriter[T] private[sql] (ds: 
Dataset[T]) extends Logging {
* @since 3.5.0
*/
   def foreach(writer: ForeachWriter[T]): DataStreamWriter[T] = {
-// TODO [SPARK-43761] Update this once resolved UnboundRowEncoder 
serialization issue.
-// ds.encoder equal to UnboundRowEncoder means type parameter T is Row,
-// which is not able to be serialized. Server will detect this and use 
default encoder.
-val rowEncoder = if (ds.encoder != UnboundRowEncoder) {
-  ds.encoder
-} else {
-  null
-}
-val serialized = Utils.serialize(ForeachWriterPacket(writer, rowEncoder))
+val 

[spark] branch master updated: [SPARK-43631][CONNECT][PS] Enable Series.interpolate with Spark Connect

2023-06-27 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 20a8fc87d67 [SPARK-43631][CONNECT][PS] Enable Series.interpolate with 
Spark Connect
20a8fc87d67 is described below

commit 20a8fc87d67c842ac3386dc6ae0c53a9533900c2
Author: itholic 
AuthorDate: Tue Jun 27 14:05:42 2023 -0700

[SPARK-43631][CONNECT][PS] Enable Series.interpolate with Spark Connect

### What changes were proposed in this pull request?

This PR proposes to add `LastNonNull` and `NullIndex` to 
SparkConnectPlanner to enable `Series.interpolate`.

### Why are the changes needed?

To increase pandas API coverage

### Does this PR introduce _any_ user-facing change?

Yes, `Series.interpolate` will be available from this fix.

### How was this patch tested?

Reusing the existing UT.

Closes #41670 from itholic/interpolate.

Authored-by: itholic 
Signed-off-by: Ruifeng Zheng 
---
 .../sql/connect/planner/SparkConnectPlanner.scala  |  8 +++
 python/pyspark/pandas/series.py|  9 ---
 python/pyspark/pandas/spark/functions.py   | 28 ++
 .../tests/connect/test_parity_generic_functions.py |  4 +++-
 python/pyspark/sql/utils.py| 14 ++-
 5 files changed, 56 insertions(+), 7 deletions(-)

diff --git 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
index c19fc5fe90e..ff158990560 100644
--- 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
+++ 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
@@ -1768,6 +1768,14 @@ class SparkConnectPlanner(val sessionHolder: 
SessionHolder) extends Logging {
 val ignoreNA = extractBoolean(children(2), "ignoreNA")
 Some(EWM(children(0), alpha, ignoreNA))
 
+  case "last_non_null" if fun.getArgumentsCount == 1 =>
+val children = fun.getArgumentsList.asScala.map(transformExpression)
+Some(LastNonNull(children(0)))
+
+  case "null_index" if fun.getArgumentsCount == 1 =>
+val children = fun.getArgumentsList.asScala.map(transformExpression)
+Some(NullIndex(children(0)))
+
   // ML-specific functions
   case "vector_to_array" if fun.getArgumentsCount == 2 =>
 val expr = transformExpression(fun.getArguments(0))
diff --git a/python/pyspark/pandas/series.py b/python/pyspark/pandas/series.py
index 0f1e814946a..95ca92e7878 100644
--- a/python/pyspark/pandas/series.py
+++ b/python/pyspark/pandas/series.py
@@ -53,7 +53,6 @@ from pandas.api.types import (  # type: ignore[attr-defined]
 CategoricalDtype,
 )
 from pandas.tseries.frequencies import DateOffset
-from pyspark import SparkContext
 from pyspark.sql import functions as F, Column as PySparkColumn, DataFrame as 
SparkDataFrame
 from pyspark.sql.types import (
 ArrayType,
@@ -70,7 +69,7 @@ from pyspark.sql.types import (
 TimestampType,
 )
 from pyspark.sql.window import Window
-from pyspark.sql.utils import get_column_class
+from pyspark.sql.utils import get_column_class, get_window_class
 
 from pyspark import pandas as ps  # For running doctests and reference 
resolution in PyCharm.
 from pyspark.pandas._typing import Axis, Dtype, Label, Name, Scalar, T
@@ -2257,10 +2256,10 @@ class Series(Frame, IndexOpsMixin, Generic[T]):
 return self._psdf.copy()._psser_for(self._column_label)
 
 scol = self.spark.column
-sql_utils = SparkContext._active_spark_context._jvm.PythonSQLUtils
-last_non_null = PySparkColumn(sql_utils.lastNonNull(scol._jc))
-null_index = PySparkColumn(sql_utils.nullIndex(scol._jc))
+last_non_null = SF.last_non_null(scol)
+null_index = SF.null_index(scol)
 
+Window = get_window_class()
 window_forward = Window.orderBy(NATURAL_ORDER_COLUMN_NAME).rowsBetween(
 Window.unboundedPreceding, Window.currentRow
 )
diff --git a/python/pyspark/pandas/spark/functions.py 
b/python/pyspark/pandas/spark/functions.py
index 06d5692238d..44650fd4d20 100644
--- a/python/pyspark/pandas/spark/functions.py
+++ b/python/pyspark/pandas/spark/functions.py
@@ -157,3 +157,31 @@ def ewm(col: Column, alpha: float, ignore_na: bool) -> 
Column:
 else:
 sc = SparkContext._active_spark_context
 return Column(sc._jvm.PythonSQLUtils.ewm(col._jc, alpha, ignore_na))
+
+
+def last_non_null(col: Column) -> Column:
+if is_remote():
+from pyspark.sql.connect.functions import _invoke_function_over_columns
+
+return _invoke_function_over_columns(  # type: 

[spark] branch master updated: [SPARK-43979][SQL][FOLLOW-UP] CollectedMetrics should be treated as the same one for self-join

2023-06-27 Thread gengliang
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 954987f19dc [SPARK-43979][SQL][FOLLOW-UP] CollectedMetrics should be 
treated as the same one for self-join
954987f19dc is described below

commit 954987f19dca67064268cde023d489eb22d81439
Author: Rui Wang 
AuthorDate: Tue Jun 27 10:15:02 2023 -0700

[SPARK-43979][SQL][FOLLOW-UP] CollectedMetrics should be treated as the 
same one for self-join

### What changes were proposed in this pull request?

Use `transformUpWithNewOutput` rather than `resolveOperatorsUpWithNewOutput` to
simplify the metrics plan. This handles the case where one plan is analyzed and
the other one is not.

### Why are the changes needed?

To fix the case where we compare two CollectedMetrics plans and one is analyzed
while the other one is not.
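A hedged sketch of the kind of query that produces two CollectedMetrics nodes to
compare, assuming a local SparkSession named `spark`: a self-join of an observed
DataFrame.

```scala
// Sketch only: self-joining an observed DataFrame duplicates the CollectMetrics
// node, and CheckAnalysis has to recognize the two copies as the same metric
// definition rather than flagging a duplicate.
import org.apache.spark.sql.functions.{count, lit}

val observed = spark.range(10).observe("my_metric", count(lit(1)).as("rows"))
observed.join(observed, "id").collect()
```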

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests

Closes #41745 from amaliujia/fix_metrics_path.

Authored-by: Rui Wang 
Signed-off-by: Gengliang Wang 
---
 .../scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala| 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
index 649140e466a..7c0e8f1490d 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
@@ -1080,7 +1080,7 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog with QueryErrorsB
* duplicates metric definition.
*/
   private def simplifyPlanForCollectedMetrics(plan: LogicalPlan): LogicalPlan 
= {
-plan.resolveOperatorsUpWithNewOutput {
+plan.transformUpWithNewOutput {
   case p: Project if p.projectList.size == p.child.output.size =>
 val assignExprIdOnly = p.projectList.zip(p.child.output).forall {
   case (left: Alias, right: Attribute) =>


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark-connect-go] branch master updated: [SPARK-43351] Support more data types when reading from spark connect arrow dataset to data frame; Also implement CreateTempView

2023-06-27 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark-connect-go.git


The following commit(s) were added to refs/heads/master by this push:
 new e9001d2  [SPARK-43351] Support more data types when reading from spark 
connect arrow dataset to data frame; Also implement CreateTempView
e9001d2 is described below

commit e9001d2edbc2dd9ba83b7e721d79103bbc3bc598
Author: hiboyang <14280154+hiboy...@users.noreply.github.com>
AuthorDate: Tue Jun 27 09:46:20 2023 -0700

[SPARK-43351] Support more data types when reading from spark connect arrow 
dataset to data frame; Also implement CreateTempView

### What changes were proposed in this pull request?

Support more data types when reading from a Spark Connect Arrow dataset into a
data frame; also implement CreateTempView.

### Why are the changes needed?

Support more data types when reading from a Spark Connect Arrow dataset into a
data frame; also implement CreateTempView.

### Does this PR introduce _any_ user-facing change?

Yes, users are able to create a temp view, e.g.
```
dataframe.CreateTempView(...)
```

### How was this patch tested?

Unit tests, and also manual testing by running the example code.

Closes #11 from hiboyang/bo-dev-03.

Authored-by: hiboyang <14280154+hiboy...@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon 
---
 client/sql/dataframe.go | 184 ++---
 client/sql/dataframe_test.go| 249 
 client/sql/datatype.go  |  77 
 cmd/spark-connect-example-spark-session/main.go |  15 ++
 4 files changed, 497 insertions(+), 28 deletions(-)

diff --git a/client/sql/dataframe.go b/client/sql/dataframe.go
index eb1718a..f2a0747 100644
--- a/client/sql/dataframe.go
+++ b/client/sql/dataframe.go
@@ -37,6 +37,8 @@ type DataFrame interface {
Collect() ([]Row, error)
// Write returns a data frame writer, which could be used to save data 
frame to supported storage.
Write() DataFrameWriter
+   // CreateTempView creates or replaces a temporary view.
+   CreateTempView(viewName string, replace bool, global bool) error
 }
 
 // dataFrameImpl is an implementation of DataFrame interface.
@@ -157,6 +159,30 @@ func (df *dataFrameImpl) Write() DataFrameWriter {
return 
 }
 
+func (df *dataFrameImpl) CreateTempView(viewName string, replace bool, global 
bool) error {
+   plan := {
+   OpType: _Command{
+   Command: {
+   CommandType: _CreateDataframeView{
+   CreateDataframeView: 
{
+   Input:df.relation,
+   Name: viewName,
+   Replace:  replace,
+   IsGlobal: global,
+   },
+   },
+   },
+   },
+   }
+
+   responseClient, err := df.sparkSession.executePlan(plan)
+   if err != nil {
+   return fmt.Errorf("failed to create temp view %s: %w", 
viewName, err)
+   }
+
+   return consumeExecutePlanClient(responseClient)
+}
+
 func (df *dataFrameImpl) createPlan() *proto.Plan {
return {
OpType: _Root{
@@ -208,38 +234,16 @@ func readArrowBatchData(data []byte, schema *StructType) 
([]Row, error) {
return nil, fmt.Errorf("failed to read arrow: 
%w", err)
}
}
-   numColumns := len(arrowReader.Schema().Fields())
+
+   values, err := readArrowRecord(record)
+   if err != nil {
+   return nil, err
+   }
+
numRows := int(record.NumRows())
if rows == nil {
rows = make([]Row, 0, numRows)
}
-   values := make([][]any, numRows)
-   for i := range values {
-   values[i] = make([]any, numColumns)
-   }
-   for columnIndex := 0; columnIndex < numColumns; columnIndex++ {
-   columnData := record.Column(columnIndex).Data()
-   dataTypeId := columnData.DataType().ID()
-   switch dataTypeId {
-   case arrow.STRING:
-   vector := array.NewStringData(columnData)
-   for rowIndex := 0; rowIndex < numRows; 
rowIndex++ {
-   values[rowIndex][columnIndex] = 
vector.Value(rowIndex)
-   }
-   case arrow.INT32:
-   

[spark] branch master updated: [SPARK-44171][SQL] Assign names to the error class _LEGACY_ERROR_TEMP_[2279-2282] & delete some unused error classes

2023-06-27 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new be8b07a1534 [SPARK-44171][SQL] Assign names to the error class 
_LEGACY_ERROR_TEMP_[2279-2282] & delete some unused error classes
be8b07a1534 is described below

commit be8b07a15348d8fea15c33d35a75969ca1693ff6
Author: panbingkun 
AuthorDate: Tue Jun 27 19:31:30 2023 +0300

[SPARK-44171][SQL] Assign names to the error class 
_LEGACY_ERROR_TEMP_[2279-2282] & delete some unused error classes

### What changes were proposed in this pull request?
This PR aims to assign names to the error classes _LEGACY_ERROR_TEMP_[2279-2282]
and to delete some unused error classes, as detailed below:
_LEGACY_ERROR_TEMP_0036 -> `Delete`
_LEGACY_ERROR_TEMP_1341 -> `Delete`
_LEGACY_ERROR_TEMP_1342 -> `Delete`
_LEGACY_ERROR_TEMP_1304 -> `Delete`
_LEGACY_ERROR_TEMP_2072 -> `Delete`
_LEGACY_ERROR_TEMP_2279 -> `Delete`
_LEGACY_ERROR_TEMP_2280 -> UNSUPPORTED_FEATURE.COMMENT_NAMESPACE
_LEGACY_ERROR_TEMP_2281 -> UNSUPPORTED_FEATURE.REMOVE_NAMESPACE_COMMENT
_LEGACY_ERROR_TEMP_2282 -> UNSUPPORTED_FEATURE.DROP_NAMESPACE_RESTRICT
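For instance, _LEGACY_ERROR_TEMP_2280 now surfaces as
UNSUPPORTED_FEATURE.COMMENT_NAMESPACE when a dialect such as MySQL cannot attach a
comment to a schema. A hedged reproduction sketch (the catalog name `mysql` and its
JDBC configuration are assumptions, not part of this patch):

```scala
// Hypothetical setup: a JDBC v2 catalog named `mysql` backed by a MySQL server.
spark.sql("CREATE NAMESPACE IF NOT EXISTS mysql.foo")
spark.sql("COMMENT ON NAMESPACE mysql.foo IS 'test comment'")
// Expected to fail with SparkSQLFeatureNotSupportedException and the named
// error class, e.g. [UNSUPPORTED_FEATURE.COMMENT_NAMESPACE] ... namespace `foo` ...
```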

### Why are the changes needed?
The changes improve the error framework.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

Closes #41721 from panbingkun/SPARK-44171.

Lead-authored-by: panbingkun 
Co-authored-by: panbingkun <84731...@qq.com>
Signed-off-by: Max Gekk 
---
 .../spark/sql/jdbc/v2/MySQLNamespaceSuite.scala| 19 +--
 core/src/main/resources/error/error-classes.json   | 60 ++
 .../spark/sql/errors/QueryCompilationErrors.scala  | 16 --
 .../spark/sql/errors/QueryExecutionErrors.scala| 30 +--
 .../org/apache/spark/sql/jdbc/MySQLDialect.scala   |  6 +--
 5 files changed, 47 insertions(+), 84 deletions(-)

diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MySQLNamespaceSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MySQLNamespaceSuite.scala
index a7ef8d4e104..d58146fecdf 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MySQLNamespaceSuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MySQLNamespaceSuite.scala
@@ -73,7 +73,8 @@ class MySQLNamespaceSuite extends DockerJDBCIntegrationSuite 
with V2JDBCNamespac
   exception = intercept[SparkSQLFeatureNotSupportedException] {
 catalog.createNamespace(Array("foo"), Map("comment" -> "test 
comment").asJava)
   },
-  errorClass = "_LEGACY_ERROR_TEMP_2280"
+  errorClass = "UNSUPPORTED_FEATURE.COMMENT_NAMESPACE",
+  parameters = Map("namespace" -> "`foo`")
 )
 assert(catalog.namespaceExists(Array("foo")) === false)
 catalog.createNamespace(Array("foo"), Map.empty[String, String].asJava)
@@ -84,13 +85,25 @@ class MySQLNamespaceSuite extends 
DockerJDBCIntegrationSuite with V2JDBCNamespac
   Array("foo"),
   NamespaceChange.setProperty("comment", "comment for foo"))
   },
-  errorClass = "_LEGACY_ERROR_TEMP_2280")
+  errorClass = "UNSUPPORTED_FEATURE.COMMENT_NAMESPACE",
+  parameters = Map("namespace" -> "`foo`")
+)
 
 checkError(
   exception = intercept[SparkSQLFeatureNotSupportedException] {
 catalog.alterNamespace(Array("foo"), 
NamespaceChange.removeProperty("comment"))
   },
-  errorClass = "_LEGACY_ERROR_TEMP_2281")
+  errorClass = "UNSUPPORTED_FEATURE.REMOVE_NAMESPACE_COMMENT",
+  parameters = Map("namespace" -> "`foo`")
+)
+
+checkError(
+  exception = intercept[SparkSQLFeatureNotSupportedException] {
+catalog.dropNamespace(Array("foo"), cascade = false)
+  },
+  errorClass = "UNSUPPORTED_FEATURE.DROP_NAMESPACE",
+  parameters = Map("namespace" -> "`foo`")
+)
 catalog.dropNamespace(Array("foo"), cascade = true)
 assert(catalog.namespaceExists(Array("foo")) === false)
   }
diff --git a/core/src/main/resources/error/error-classes.json 
b/core/src/main/resources/error/error-classes.json
index 78b54d5230d..342af0ffa6c 100644
--- a/core/src/main/resources/error/error-classes.json
+++ b/core/src/main/resources/error/error-classes.json
@@ -2383,11 +2383,21 @@
   "Combination of ORDER BY/SORT BY/DISTRIBUTE BY/CLUSTER BY."
 ]
   },
+  "COMMENT_NAMESPACE" : {
+"message" : [
+  "Attach a comment to the namespace ."
+]
+  },
   "DESC_TABLE_COLUMN_PARTITION" : {
 "message" : [
   "DESC TABLE COLUMN for a specific partition."
 ]
   },
+  "DROP_NAMESPACE" : {
+"message" : [
+  "Drop the namespace ."
+ 

[spark] branch master updated: [SPARK-44197][BUILD][FOLLOWUP] Update `IsolatedClientLoader` hadoop version

2023-06-27 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 24960d84a9d [SPARK-44197][BUILD][FOLLOWUP] Update 
`IsolatedClientLoader` hadoop version
24960d84a9d is described below

commit 24960d84a9dac17728822f3e783335f221c49da3
Author: panbingkun 
AuthorDate: Tue Jun 27 07:58:44 2023 -0700

[SPARK-44197][BUILD][FOLLOWUP] Update `IsolatedClientLoader` hadoop version

### What changes were proposed in this pull request?
This PR is a follow-up to SPARK-44197.

### Why are the changes needed?
When the Hadoop version that Spark relies on is upgraded from `3.3.5` to 
`3.3.6`, the corresponding versions in `IsolatedClientLoader` should also be 
upgraded synchronously.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

Closes #41758 from panbingkun/SPARK-44197_FOLLOWUP.

Authored-by: panbingkun 
Signed-off-by: Dongjoon Hyun 
---
 assembly/README | 2 +-
 resource-managers/kubernetes/integration-tests/README.md| 2 +-
 .../scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala   | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/assembly/README b/assembly/README
index a380d8cb330..3dde243d3e6 100644
--- a/assembly/README
+++ b/assembly/README
@@ -9,4 +9,4 @@ This module is off by default. To activate it specify the 
profile in the command
 
 If you need to build an assembly for a different version of Hadoop the
 hadoop-version system property needs to be set as in this example:
-  -Dhadoop.version=3.3.5
+  -Dhadoop.version=3.3.6
diff --git a/resource-managers/kubernetes/integration-tests/README.md 
b/resource-managers/kubernetes/integration-tests/README.md
index 2944c189ed4..909e5b652d4 100644
--- a/resource-managers/kubernetes/integration-tests/README.md
+++ b/resource-managers/kubernetes/integration-tests/README.md
@@ -129,7 +129,7 @@ properties to Maven.  For example:
 
 mvn integration-test -am -pl :spark-kubernetes-integration-tests_2.12 \
 -Pkubernetes -Pkubernetes-integration-tests \
--Phadoop-3 -Dhadoop.version=3.3.5 \
+-Phadoop-3 -Dhadoop.version=3.3.6 \
 
-Dspark.kubernetes.test.sparkTgz=spark-3.0.0-SNAPSHOT-bin-example.tgz \
 -Dspark.kubernetes.test.imageTag=sometag \
 
-Dspark.kubernetes.test.imageRepo=docker.io/somerepo \
diff --git 
a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala
 
b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala
index 64718a9d35c..2765e6af521 100644
--- 
a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala
+++ 
b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala
@@ -66,7 +66,7 @@ private[hive] object IsolatedClientLoader extends Logging {
   case e: RuntimeException if e.getMessage.contains("hadoop") =>
 // If the error message contains hadoop, it is probably because 
the hadoop
 // version cannot be resolved.
-val fallbackVersion = "3.3.5"
+val fallbackVersion = "3.3.6"
 logWarning(s"Failed to resolve Hadoop artifacts for the version 
$hadoopVersion. We " +
   s"will change the hadoop version from $hadoopVersion to 
$fallbackVersion and try " +
   "again. It is recommended to set jars used by Hive metastore 
client through " +


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-43081][FOLLOW-UP][ML][CONNECT] Make torch dataloader support torch 1.x

2023-06-27 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a758d6a0f9d [SPARK-43081][FOLLOW-UP][ML][CONNECT] Make torch 
dataloader support torch 1.x
a758d6a0f9d is described below

commit a758d6a0f9dfa32881cfcec263da0ab0c02f5c1d
Author: Weichen Xu 
AuthorDate: Tue Jun 27 07:57:21 2023 -0700

[SPARK-43081][FOLLOW-UP][ML][CONNECT] Make torch dataloader support torch 
1.x

### What changes were proposed in this pull request?

Make torch dataloader support torch 1.x.
Currently, when running with torch 1.x with num_workers > 0, an error is 
raised like:
```
ValueError: prefetch_factor option could only be specified in 
multiprocessing.let num_workers > 0 to enable multiprocessing.
```

### Why are the changes needed?

Compatibility fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually run unit tests with torch 1.x

Closes #41751 from WeichenXu123/support-torch-1.x.

Authored-by: Weichen Xu 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/ml/torch/distributor.py | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/ml/torch/distributor.py 
b/python/pyspark/ml/torch/distributor.py
index 9f9636e6b10..8b34acd959e 100644
--- a/python/pyspark/ml/torch/distributor.py
+++ b/python/pyspark/ml/torch/distributor.py
@@ -995,4 +995,11 @@ def _get_spark_partition_data_loader(
 
 dataset = _SparkPartitionTorchDataset(arrow_file, schema, num_samples)
 
-return DataLoader(dataset, batch_size, num_workers=num_workers, 
prefetch_factor=prefetch_factor)
+if num_workers > 0:
+return DataLoader(
+dataset, batch_size, num_workers=num_workers, 
prefetch_factor=prefetch_factor
+)
+else:
+# if num_workers is zero, we cannot set `prefetch_factor` otherwise
+# torch will raise error.
+return DataLoader(dataset, batch_size, num_workers=num_workers)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-44204][SQL][HIVE] Add missing recordHiveCall for getPartitionNames

2023-06-27 Thread yangjie01
This is an automated email from the ASF dual-hosted git repository.

yangjie01 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 43f7a86a05a [SPARK-44204][SQL][HIVE] Add missing recordHiveCall for 
getPartitionNames
43f7a86a05a is described below

commit 43f7a86a05ad8c7ec7060607e43d9ca4d0fe4166
Author: Cheng Pan 
AuthorDate: Tue Jun 27 16:42:14 2023 +0800

[SPARK-44204][SQL][HIVE] Add missing recordHiveCall for getPartitionNames

### What changes were proposed in this pull request?

The code was added by SPARK-35437; it looks like it forgot to call
`recordHiveCall` before `hive.getPartitionNames`.

### Why are the changes needed?

Correctly call `recordHiveCall` on each HMS invocation.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing CI and Review.

Closes #41756 from pan3793/SPARK-44204.

Authored-by: Cheng Pan 
Signed-off-by: yangjie01 
---
 sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala | 1 +
 1 file changed, 1 insertion(+)

diff --git 
a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala 
b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala
index 08615b90d80..63f672b22ba 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala
@@ -1180,6 +1180,7 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
   })
 }
 
+recordHiveCall()
 val allPartitionNames = hive.getPartitionNames(
   table.getDbName, table.getTableName, -1).asScala
 val partNames = allPartitionNames.filter { p =>


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-44192][BUILD][R] Support R 4.3.1

2023-06-27 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8229feab979 [SPARK-44192][BUILD][R] Support R 4.3.1
8229feab979 is described below

commit 8229feab97959c5e90bd45be5d0979b0ae41d6e2
Author: yangjie01 
AuthorDate: Mon Jun 26 23:55:41 2023 -0700

[SPARK-44192][BUILD][R] Support R 4.3.1

### What changes were proposed in this pull request?
This PR aims to support R 4.3.1 officially in Apache Spark 3.5.0 by 
upgrading AppVeyor to 4.3.1.

### Why are the changes needed?
R 4.3.1 was released on Jun 16, 2023.
- https://stat.ethz.ch/pipermail/r-announce/2023/000694.html

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass GitHub Actions

Closes #41754 from LuciferYang/SPARK-44192.

Authored-by: yangjie01 
Signed-off-by: Dongjoon Hyun 
---
 dev/appveyor-install-dependencies.ps1 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/dev/appveyor-install-dependencies.ps1 
b/dev/appveyor-install-dependencies.ps1
index 6abcc116346..6848d3af43d 100644
--- a/dev/appveyor-install-dependencies.ps1
+++ b/dev/appveyor-install-dependencies.ps1
@@ -129,7 +129,7 @@ $env:PATH = "$env:HADOOP_HOME\bin;" + $env:PATH
 Pop-Location
 
 # == R
-$rVer = "4.3.0"
+$rVer = "4.3.1"
 $rToolsVer = "4.0.2"
 
 InstallR


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark-docker] branch master updated: [SPARK-40513][DOCS] Add apache/spark docker image overview

2023-06-27 Thread yikun
This is an automated email from the ASF dual-hosted git repository.

yikun pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark-docker.git


The following commit(s) were added to refs/heads/master by this push:
 new d02ff60  [SPARK-40513][DOCS] Add apache/spark docker image overview
d02ff60 is described below

commit d02ff6091835311a32c7ccc73d8ebae1d5817ecc
Author: Yikun Jiang 
AuthorDate: Tue Jun 27 14:28:21 2023 +0800

[SPARK-40513][DOCS] Add apache/spark docker image overview

### What changes were proposed in this pull request?
This PR adds `OVERVIEW.md`.

### Why are the changes needed?

This will be used on the https://hub.docker.com/r/apache/spark page to
introduce the Spark docker image and its tag info.

### Does this PR introduce _any_ user-facing change?
Yes, doc only

### How was this patch tested?
Doc only, review.

Closes #34 from Yikun/overview.

Authored-by: Yikun Jiang 
Signed-off-by: Yikun Jiang 
---
 OVERVIEW.md | 83 +
 1 file changed, 83 insertions(+)

diff --git a/OVERVIEW.md b/OVERVIEW.md
new file mode 100644
index 000..046
--- /dev/null
+++ b/OVERVIEW.md
@@ -0,0 +1,83 @@
+# What is Apache Spark™?
+
+Apache Spark™ is a multi-language engine for executing data engineering, data 
science, and machine learning on single-node machines or clusters. It provides 
high-level APIs in Scala, Java, Python, and R, and an optimized engine that 
supports general computation graphs for data analysis. It also supports a rich 
set of higher-level tools including Spark SQL for SQL and DataFrames, pandas 
API on Spark for pandas workloads, MLlib for machine learning, GraphX for graph 
processing, and Structu [...]
+
+https://spark.apache.org/
+
+## Online Documentation
+
+You can find the latest Spark documentation, including a programming guide, on 
the [project web page](https://spark.apache.org/documentation.html). This 
README file only contains basic setup instructions.
+
+## Interactive Scala Shell
+
+The easiest way to start using Spark is through the Scala shell:
+
+```
+docker run -it apache/spark /opt/spark/bin/spark-shell
+```
+
+Try the following command, which should return 1,000,000,000:
+
+```
+scala> spark.range(1000 * 1000 * 1000).count()
+```
+
+## Interactive Python Shell
+
+The easiest way to start using PySpark is through the Python shell:
+
+```
+docker run -it apache/spark /opt/spark/bin/pyspark
+```
+
+And run the following command, which should also return 1,000,000,000:
+
+```
+>>> spark.range(1000 * 1000 * 1000).count()
+```
+
+## Interactive R Shell
+
+The easiest way to start using R on Spark is through the R shell:
+
+```
+docker run -it apache/spark:r /opt/spark/bin/sparkR
+```
+
+## Running Spark on Kubernetes
+
+https://spark.apache.org/docs/latest/running-on-kubernetes.html
+
+## Supported tags and respective Dockerfile links
+
+Currently, the `apache/spark` docker image supports 4 types for each version:
+
+Such as for v3.4.0:
+- [3.4.0-scala2.12-java11-python3-ubuntu, 3.4.0-python3, 3.4.0, python3, 
latest](https://github.com/apache/spark-docker/tree/fe05e38f0ffad271edccd6ae40a77d5f14f3eef7/3.4.0/scala2.12-java11-python3-ubuntu)
+- [3.4.0-scala2.12-java11-r-ubuntu, 3.4.0-r, 
r](https://github.com/apache/spark-docker/tree/fe05e38f0ffad271edccd6ae40a77d5f14f3eef7/3.4.0/scala2.12-java11-r-ubuntu)
+- [3.4.0-scala2.12-java11-ubuntu, 3.4.0-scala, 
scala](https://github.com/apache/spark-docker/tree/fe05e38f0ffad271edccd6ae40a77d5f14f3eef7/3.4.0/scala2.12-java11-ubuntu)
+- 
[3.4.0-scala2.12-java11-python3-r-ubuntu](https://github.com/apache/spark-docker/tree/fe05e38f0ffad271edccd6ae40a77d5f14f3eef7/3.4.0/scala2.12-java11-python3-r-ubuntu)
+
+## Environment Variable
+
+The environment variables of entrypoint.sh are listed below:
+
+| Environment Variable | Meaning |
+|--|---|
+| SPARK_EXTRA_CLASSPATH | The extra path to be added to the classpath, see 
also in 
https://spark.apache.org/docs/latest/running-on-kubernetes.html#dependency-management
 |
+| PYSPARK_PYTHON | Python binary executable to use for PySpark in both driver 
and workers (default is python3 if available, otherwise python). Property 
spark.pyspark.python take precedence if it is set |
+| PYSPARK_DRIVER_PYTHON | Python binary executable to use for PySpark in 
driver only (default is PYSPARK_PYTHON). Property spark.pyspark.driver.python 
take precedence if it is set |
+| SPARK_DIST_CLASSPATH | Distribution-defined classpath to add to processes |
+| SPARK_DRIVER_BIND_ADDRESS | Hostname or IP address where to bind listening 
sockets. See also `spark.driver.bindAddress` |
+| SPARK_EXECUTOR_JAVA_OPTS | The Java opts of Spark Executor |
+| SPARK_APPLICATION_ID | A unique identifier for the Spark application |
+| SPARK_EXECUTOR_POD_IP | The Pod IP address of spark executor |
+| SPARK_RESOURCE_PROFILE_ID | 

[spark-docker] branch master updated: [SPARK-44175] Remove useless lib64 path link in dockerfile

2023-06-27 Thread yikun
This is an automated email from the ASF dual-hosted git repository.

yikun pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark-docker.git


The following commit(s) were added to refs/heads/master by this push:
 new 5405b49  [SPARK-44175] Remove useless lib64 path link in dockerfile
5405b49 is described below

commit 5405b49b52aa1661d31ac80cdb8c9aad530d6847
Author: Yikun Jiang 
AuthorDate: Tue Jun 27 14:09:34 2023 +0800

[SPARK-44175] Remove useless lib64 path link in dockerfile

### What changes were proposed in this pull request?
Remove useless lib64 path

### Why are the changes needed?
Address comments: 
https://github.com/docker-library/official-images/pull/13089#issuecomment-1601813499

It was introduced by
https://github.com/apache/spark/commit/f13ea15d79fb4752a0a75a05a4a89bd8625ea3d5
to address an issue with snappy on Alpine OS, but we have already switched the OS
to Ubuntu, so the `/lib64` hack can be cleaned up.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Closes #48 from Yikun/rm-lib64-hack.

Authored-by: Yikun Jiang 
Signed-off-by: Yikun Jiang 
---
 3.4.0/scala2.12-java11-ubuntu/Dockerfile | 1 -
 3.4.1/scala2.12-java11-ubuntu/Dockerfile | 1 -
 Dockerfile.template  | 1 -
 3 files changed, 3 deletions(-)

diff --git a/3.4.0/scala2.12-java11-ubuntu/Dockerfile 
b/3.4.0/scala2.12-java11-ubuntu/Dockerfile
index 77ace47..854f86c 100644
--- a/3.4.0/scala2.12-java11-ubuntu/Dockerfile
+++ b/3.4.0/scala2.12-java11-ubuntu/Dockerfile
@@ -23,7 +23,6 @@ RUN groupadd --system --gid=${spark_uid} spark && \
 
 RUN set -ex; \
 apt-get update; \
-ln -s /lib /lib64; \
 apt-get install -y gnupg2 wget bash tini libc6 libpam-modules krb5-user 
libnss3 procps net-tools gosu libnss-wrapper; \
 mkdir -p /opt/spark; \
 mkdir /opt/spark/python; \
diff --git a/3.4.1/scala2.12-java11-ubuntu/Dockerfile 
b/3.4.1/scala2.12-java11-ubuntu/Dockerfile
index e782686..bf106a6 100644
--- a/3.4.1/scala2.12-java11-ubuntu/Dockerfile
+++ b/3.4.1/scala2.12-java11-ubuntu/Dockerfile
@@ -23,7 +23,6 @@ RUN groupadd --system --gid=${spark_uid} spark && \
 
 RUN set -ex; \
 apt-get update; \
-ln -s /lib /lib64; \
 apt-get install -y gnupg2 wget bash tini libc6 libpam-modules krb5-user 
libnss3 procps net-tools gosu libnss-wrapper; \
 mkdir -p /opt/spark; \
 mkdir /opt/spark/python; \
diff --git a/Dockerfile.template b/Dockerfile.template
index 6fedce9..80b57e2 100644
--- a/Dockerfile.template
+++ b/Dockerfile.template
@@ -23,7 +23,6 @@ RUN groupadd --system --gid=${spark_uid} spark && \
 
 RUN set -ex; \
 apt-get update; \
-ln -s /lib /lib64; \
 apt-get install -y gnupg2 wget bash tini libc6 libpam-modules krb5-user 
libnss3 procps net-tools gosu libnss-wrapper; \
 mkdir -p /opt/spark; \
 mkdir /opt/spark/python; \


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org