[spark] branch master updated (d2f72a21677 -> 0ed48feab65)

2023-06-06 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from d2f72a21677 [SPARK-43901][SQL] Avro to Support custom decimal type 
backed by Long
 add 0ed48feab65 [SPARK-43985][PROTOBUF] spark protobuf: fix enums as ints 
bug

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/protobuf/ProtobufDeserializer.scala  |   2 +-
 .../test/resources/protobuf/functions_suite.desc   | Bin 11087 -> 11190 bytes
 .../test/resources/protobuf/functions_suite.proto  |   1 +
 .../sql/protobuf/ProtobufFunctionsSuite.scala  |  26 ++---
 4 files changed, 19 insertions(+), 10 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-43901][SQL] Avro to Support custom decimal type backed by Long

2023-06-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d2f72a21677 [SPARK-43901][SQL] Avro to Support custom decimal type 
backed by Long
d2f72a21677 is described below

commit d2f72a21677748793f0bb329630d72bc91449587
Author: Siying Dong 
AuthorDate: Tue Jun 6 19:41:09 2023 -0700

[SPARK-43901][SQL] Avro to Support custom decimal type backed by Long

### What changes were proposed in this pull request?
Add a logical type "custom-decimal" in Avro, which can only be backed by the 
physical type long and will be converted into Spark's decimal type.
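
For illustration, a minimal sketch (not part of the patch) of an Avro schema that 
uses the new logical type; the record and field names are hypothetical, while the 
`custom-decimal` name and the `precision`/`scale` properties come from the change:

```scala
// Hypothetical record: a money amount stored as an Avro long, annotated with the
// new "custom-decimal" logical type so Spark can read it as DecimalType(38, 2).
import org.apache.avro.Schema

object CustomDecimalSchemaExample {
  def main(args: Array[String]): Unit = {
    val avroSchema =
      """{
        |  "type": "record",
        |  "name": "Transaction",
        |  "fields": [
        |    {"name": "amount",
        |     "type": {"type": "long", "logicalType": "custom-decimal",
        |              "precision": 38, "scale": 2}}
        |  ]
        |}""".stripMargin
    // Plain Avro parses this fine (unknown logical types keep their properties);
    // the mapping to Spark's DecimalType happens in the Avro reader once the
    // logical type is registered by this change.
    val parsed = new Schema.Parser().parse(avroSchema)
    println(parsed.getField("amount").schema())
  }
}
```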

### Why are the changes needed?
A user would like to represent currency (money) values after loading Avro data 
into SQL types. However, there isn't a good way to represent this in Avro. This 
custom type will allow them to do that.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added several unit test cases verifying that the new "custom-decimal" type is 
loaded successfully, as well as the exception cases.

Closes #41409 from siying/customdecimal.

Authored-by: Siying Dong 
Signed-off-by: Dongjoon Hyun 
---
 .../org/apache/spark/sql/avro/CustomDecimal.scala  | 79 ++
 .../apache/spark/sql/avro/AvroDeserializer.scala   |  4 +
 .../org/apache/spark/sql/avro/AvroFileFormat.scala |  9 +++
 .../apache/spark/sql/avro/SchemaConverters.scala   |  2 +
 .../spark/sql/avro/AvroLogicalTypeSuite.scala  | 94 +-
 5 files changed, 187 insertions(+), 1 deletion(-)

diff --git 
a/connector/avro/src/main/java/org/apache/spark/sql/avro/CustomDecimal.scala 
b/connector/avro/src/main/java/org/apache/spark/sql/avro/CustomDecimal.scala
new file mode 100644
index 000..d76f40c7635
--- /dev/null
+++ b/connector/avro/src/main/java/org/apache/spark/sql/avro/CustomDecimal.scala
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.avro
+
+import org.apache.avro.LogicalType
+import org.apache.avro.Schema
+
+import org.apache.spark.sql.types.DecimalType
+
+object CustomDecimal {
+  val TYPE_NAME = "custom-decimal"
+}
+
+// A customized logical type, which will be registered to Avro. This logical 
type is similar to
+// Avro's builtin Decimal type, but is meant to be registered for long type. 
It indicates that
+// the long type should be converted to Spark's Decimal type, with provided 
precision and scale.
+private class CustomDecimal(schema: Schema) extends 
LogicalType(CustomDecimal.TYPE_NAME) {
+  val scale : Int = {
+val obj = schema.getObjectProp("scale")
+obj match {
+  case null =>
+throw new IllegalArgumentException(s"Invalid 
${CustomDecimal.TYPE_NAME}: missing scale");
+  case i : Integer =>
+i
+  case other =>
+throw new IllegalArgumentException(s"Expected int 
${CustomDecimal.TYPE_NAME}:scale")
+}
+  }
+  val precision : Int = {
+val obj = schema.getObjectProp("precision")
+obj match {
+  case null =>
+throw new IllegalArgumentException(
+  s"Invalid ${CustomDecimal.TYPE_NAME}: missing precision");
+  case i: Integer =>
+i
+  case other =>
+throw new IllegalArgumentException(s"Expected int 
${CustomDecimal.TYPE_NAME}:precision")
+}
+  }
+  val className : String = schema.getProp("className")
+
+  override def validate(schema: Schema): Unit = {
+super.validate(schema)
+if (schema.getType != Schema.Type.LONG) {
+  throw new IllegalArgumentException(
+s"${CustomDecimal.TYPE_NAME} can only be used with an underlying long 
type")
+}
+if (precision <= 0) {
+  throw new IllegalArgumentException(s"Invalid decimal precision: 
$precision" +
+" (must be positive)");
+} else if (precision > DecimalType.MAX_PRECISION) {
+  throw new IllegalArgumentException(
+s"cannot store $precision digits (max ${DecimalType.MAX_PRECISION})")
+}
+if (scale < 0) {
+  throw new IllegalArgumentException(s"Invalid decimal scale: 

[spark] branch master updated: [SPARK-42750][SQL] Support Insert By Name statement

2023-06-06 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e6adc67d43d [SPARK-42750][SQL] Support Insert By Name statement
e6adc67d43d is described below

commit e6adc67d43d6beccf21013ee00aa274bed13107c
Author: Jia Fan 
AuthorDate: Wed Jun 7 10:30:59 2023 +0800

[SPARK-42750][SQL] Support Insert By Name statement

### What changes were proposed in this pull request?

In some use cases, users have incoming dataframes with fixed column names 
which might differ from the canonical order. Currently there's no way to handle 
this easily through the INSERT INTO API - the user has to make sure the columns 
are in the right order as they would when inserting a tuple. We should add an 
optional BY NAME clause, such that:

`INSERT INTO tgt BY NAME <query>`

takes each column of `<query>` and inserts it into the column in `tgt` which 
has the same name according to the configured `resolver` logic.

Some definitions need to be clarified:
1. `BY NAME` and a specified column list (`INSERT INTO t1 (a, b) ...`) are 
mutually exclusive.
2. A partition spec can still be defined while using `BY NAME`: `INSERT INTO 
t PARTITION(a=1) BY NAME <query>`

`INSERT OVERWRITE BY NAME` is not supported yet (it will be added in a 
follow-up).
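
For illustration, a hedged sketch of the new clause against a hypothetical table 
(only the `BY NAME` syntax itself comes from this PR; the table, column names, and 
session setup are made up, and it needs a Spark build that contains this change):

```scala
import org.apache.spark.sql.SparkSession

object InsertByNameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("insert-by-name").master("local[*]").getOrCreate()
    // Hypothetical target table.
    spark.sql("CREATE TABLE tgt (a INT, b STRING) USING parquet")
    // The source columns arrive in a different order than the table definition;
    // BY NAME matches them by column name instead of by position.
    spark.sql("INSERT INTO tgt BY NAME SELECT 'x' AS b, 1 AS a")
    spark.sql("SELECT * FROM tgt").show()
    spark.stop()
  }
}
```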

### Why are the changes needed?
Adds the new `INSERT INTO ... BY NAME` feature.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add new test.

Closes #40908 from Hisoka-X/SPARK-42750_insert_into_by_name.

Lead-authored-by: Jia Fan 
Co-authored-by: Hisoka 
Signed-off-by: Wenchen Fan 
---
 docs/sql-ref-ansi-compliance.md|  3 +-
 .../spark/sql/catalyst/parser/SqlBaseLexer.g4  |  1 +
 .../spark/sql/catalyst/parser/SqlBaseParser.g4 |  4 +-
 .../spark/sql/catalyst/analysis/Analyzer.scala |  7 ++--
 .../sql/catalyst/analysis/CheckAnalysis.scala  |  2 +-
 .../spark/sql/catalyst/parser/AstBuilder.scala | 19 ++
 .../sql/catalyst/plans/logical/statements.scala|  7 +++-
 .../spark/sql/catalyst/parser/DDLParserSuite.scala | 34 +
 .../execution/datasources/DataSourceStrategy.scala |  9 +++--
 .../datasources/FallBackFileSourceV2.scala |  2 +-
 .../spark/sql/execution/datasources/rules.scala| 14 ---
 .../sql-tests/analyzer-results/explain-aqe.sql.out |  2 +-
 .../sql-tests/analyzer-results/explain.sql.out |  2 +-
 .../sql-tests/results/ansi/keywords.sql.out|  1 +
 .../sql-tests/results/explain-aqe.sql.out  |  2 +-
 .../resources/sql-tests/results/explain.sql.out|  2 +-
 .../resources/sql-tests/results/keywords.sql.out   |  1 +
 .../org/apache/spark/sql/SQLInsertTestSuite.scala  | 43 --
 .../execution/command/PlanResolutionSuite.scala|  4 +-
 .../ThriftServerWithSparkContextSuite.scala|  2 +-
 .../org/apache/spark/sql/hive/HiveStrategies.scala |  8 ++--
 21 files changed, 129 insertions(+), 40 deletions(-)

diff --git a/docs/sql-ref-ansi-compliance.md b/docs/sql-ref-ansi-compliance.md
index 76b5d5aef73..f9c6f5ea6aa 100644
--- a/docs/sql-ref-ansi-compliance.md
+++ b/docs/sql-ref-ansi-compliance.md
@@ -350,7 +350,7 @@ By default, both `spark.sql.ansi.enabled` and 
`spark.sql.ansi.enforceReservedKey
 Below is a list of all the keywords in Spark SQL.
 
 |Keyword|Spark SQLANSI Mode|Spark SQLDefault Mode|SQL-2016|
-|---|--|-||
+|--|--|-||
 |ADD|non-reserved|non-reserved|non-reserved|
 |AFTER|non-reserved|non-reserved|non-reserved|
 |ALL|reserved|non-reserved|reserved|
@@ -527,6 +527,7 @@ Below is a list of all the keywords in Spark SQL.
 |MONTH|non-reserved|non-reserved|non-reserved|
 |MONTHS|non-reserved|non-reserved|non-reserved|
 |MSCK|non-reserved|non-reserved|non-reserved|
+|NAME|non-reserved|non-reserved|non-reserved|
 |NAMESPACE|non-reserved|non-reserved|non-reserved|
 |NAMESPACES|non-reserved|non-reserved|non-reserved|
 |NANOSECOND|non-reserved|non-reserved|non-reserved|
diff --git 
a/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4
 
b/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4
index 6300221b542..ecd5f5912fd 100644
--- 
a/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4
+++ 
b/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4
@@ -264,6 +264,7 @@ MINUTES: 'MINUTES';
 MONTH: 'MONTH';
 MONTHS: 'MONTHS';
 MSCK: 'MSCK';
+NAME: 'NAME';
 NAMESPACE: 'NAMESPACE';
 NAMESPACES: 'NAMESPACES';
 NANOSECOND: 'NANOSECOND';
diff --git 
a/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
 

[spark] branch master updated (747db6675da -> 54304493e0a)

2023-06-06 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 747db6675da [SPARK-43930][SQL][PYTHON][CONNECT] Add unix_* functions 
to Scala and Python
 add 54304493e0a [SPARK-43615][TESTS][PS][CONNECT] Enable unit test 
`test_eval`

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/tests/connect/computation/test_parity_eval.py | 4 
 1 file changed, 4 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (1e197b5bc22 -> 747db6675da)

2023-06-06 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 1e197b5bc22 [SPARK-43356][K8S] Migrate deprecated createOrReplace to 
serverSideApply
 add 747db6675da [SPARK-43930][SQL][PYTHON][CONNECT] Add unix_* functions 
to Scala and Python

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/spark/sql/functions.scala |  34 +++
 .../apache/spark/sql/PlanGenerationTestSuite.scala |  16 ++
 .../explain-results/function_unix_date.explain |   2 +
 .../explain-results/function_unix_micros.explain   |   2 +
 .../explain-results/function_unix_millis.explain   |   2 +
 .../explain-results/function_unix_seconds.explain  |   2 +
 .../query-tests/queries/function_unix_date.json|  34 +++
 .../queries/function_unix_date.proto.bin   | Bin 0 -> 153 bytes
 .../query-tests/queries/function_unix_micros.json  |  34 +++
 .../queries/function_unix_micros.proto.bin | Bin 0 -> 174 bytes
 .../query-tests/queries/function_unix_millis.json  |  34 +++
 .../queries/function_unix_millis.proto.bin | Bin 0 -> 174 bytes
 .../query-tests/queries/function_unix_seconds.json |  34 +++
 .../queries/function_unix_seconds.proto.bin| Bin 0 -> 175 bytes
 .../source/reference/pyspark.sql/functions.rst |   4 ++
 python/pyspark/sql/connect/functions.py|  28 ++
 python/pyspark/sql/functions.py|  62 +
 .../scala/org/apache/spark/sql/functions.scala |  42 ++
 .../org/apache/spark/sql/DateFunctionsSuite.scala  |  44 +++
 19 files changed, 374 insertions(+)
 create mode 100644 
connector/connect/common/src/test/resources/query-tests/explain-results/function_unix_date.explain
 create mode 100644 
connector/connect/common/src/test/resources/query-tests/explain-results/function_unix_micros.explain
 create mode 100644 
connector/connect/common/src/test/resources/query-tests/explain-results/function_unix_millis.explain
 create mode 100644 
connector/connect/common/src/test/resources/query-tests/explain-results/function_unix_seconds.explain
 create mode 100644 
connector/connect/common/src/test/resources/query-tests/queries/function_unix_date.json
 create mode 100644 
connector/connect/common/src/test/resources/query-tests/queries/function_unix_date.proto.bin
 create mode 100644 
connector/connect/common/src/test/resources/query-tests/queries/function_unix_micros.json
 create mode 100644 
connector/connect/common/src/test/resources/query-tests/queries/function_unix_micros.proto.bin
 create mode 100644 
connector/connect/common/src/test/resources/query-tests/queries/function_unix_millis.json
 create mode 100644 
connector/connect/common/src/test/resources/query-tests/queries/function_unix_millis.proto.bin
 create mode 100644 
connector/connect/common/src/test/resources/query-tests/queries/function_unix_seconds.json
 create mode 100644 
connector/connect/common/src/test/resources/query-tests/queries/function_unix_seconds.proto.bin


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-43356][K8S] Migrate deprecated createOrReplace to serverSideApply

2023-06-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 1e197b5bc22 [SPARK-43356][K8S] Migrate deprecated createOrReplace to 
serverSideApply
1e197b5bc22 is described below

commit 1e197b5bc2258f9c0657cf9d792005c540ccb7f4
Author: Cheng Pan 
AuthorDate: Tue Jun 6 19:06:49 2023 -0700

[SPARK-43356][K8S] Migrate deprecated createOrReplace to serverSideApply

### What changes were proposed in this pull request?

The deprecation message of `createOrReplace` indicates that we should 
change `createOrReplace` to `serverSideApply` instead.

```
deprecated please use {@link ServerSideApplicable#serverSideApply()} or 
attempt a create and edit/patch operation.
```

The change is not fully equivalent, but I believe it's reasonable.

> With the caveat that the user may choose not to use forcing if they want 
to know when there are conflicting changes.
>
> Also unlike createOrReplace if the resourceVersion is set on the resource 
and a replace is attempted, it will be optimistically locked.

See more details at https://github.com/fabric8io/kubernetes-client/pull/5073
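
For illustration, a minimal sketch of the replacement call chain (the same one shown 
in the diff below); the client construction and the resource list here are 
placeholders, not code from the PR:

```scala
import io.fabric8.kubernetes.api.model.HasMetadata
import io.fabric8.kubernetes.client.{KubernetesClient, KubernetesClientBuilder}

object ServerSideApplySketch {
  // Apply a list of resources with server-side apply instead of createOrReplace.
  def applyAll(resources: Seq[HasMetadata]): Unit = {
    val client: KubernetesClient = new KubernetesClientBuilder().build()
    try {
      // Previously (deprecated): client.resourceList(resources: _*).createOrReplace()
      client.resourceList(resources: _*).forceConflicts().serverSideApply()
    } finally {
      client.close()
    }
  }
}
```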

### Why are the changes needed?

Remove usage of deprecated API.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #41136 from pan3793/SPARK-43356.

Authored-by: Cheng Pan 
Signed-off-by: Dongjoon Hyun 
---
 .../spark/deploy/k8s/submit/KubernetesClientApplication.scala   | 6 +++---
 .../test/scala/org/apache/spark/deploy/k8s/submit/ClientSuite.scala | 5 -
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala
 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala
index 9f9b5655e26..2b2dad1cf13 100644
--- 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala
+++ 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala
@@ -137,7 +137,7 @@ private[spark] class Client(
 // setup resources before pod creation
 val preKubernetesResources = 
resolvedDriverSpec.driverPreKubernetesResources
 try {
-  kubernetesClient.resourceList(preKubernetesResources: 
_*).createOrReplace()
+  kubernetesClient.resourceList(preKubernetesResources: 
_*).forceConflicts().serverSideApply()
 } catch {
   case NonFatal(e) =>
 logError("Please check \"kubectl auth can-i create [resource]\" 
first." +
@@ -161,7 +161,7 @@ private[spark] class Client(
 // Refresh all pre-resources' owner references
 try {
   addOwnerReference(createdDriverPod, preKubernetesResources)
-  kubernetesClient.resourceList(preKubernetesResources: 
_*).createOrReplace()
+  kubernetesClient.resourceList(preKubernetesResources: 
_*).forceConflicts().serverSideApply()
 } catch {
   case NonFatal(e) =>
 kubernetesClient.pods().resource(createdDriverPod).delete()
@@ -173,7 +173,7 @@ private[spark] class Client(
 try {
   val otherKubernetesResources = 
resolvedDriverSpec.driverKubernetesResources ++ Seq(configMap)
   addOwnerReference(createdDriverPod, otherKubernetesResources)
-  kubernetesClient.resourceList(otherKubernetesResources: 
_*).createOrReplace()
+  kubernetesClient.resourceList(otherKubernetesResources: 
_*).forceConflicts().serverSideApply()
 } catch {
   case NonFatal(e) =>
 kubernetesClient.pods().resource(createdDriverPod).delete()
diff --git 
a/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/submit/ClientSuite.scala
 
b/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/submit/ClientSuite.scala
index 8c2be6c142d..a813b3a876f 100644
--- 
a/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/submit/ClientSuite.scala
+++ 
b/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/submit/ClientSuite.scala
@@ -181,6 +181,8 @@ class ClientSuite extends SparkFunSuite with BeforeAndAfter 
{
 createdPodArgumentCaptor = ArgumentCaptor.forClass(classOf[Pod])
 createdResourcesArgumentCaptor = 
ArgumentCaptor.forClass(classOf[HasMetadata])
 when(podsWithNamespace.resource(fullExpectedPod())).thenReturn(namedPods)
+when(resourceList.forceConflicts()).thenReturn(resourceList)
+when(namedPods.serverSideApply()).thenReturn(podWithOwnerReference())
 when(namedPods.create()).thenReturn(podWithOwnerReference())
 

[spark] branch master updated: [SPARK-43906][PYTHON][CONNECT] Implement the file support in SparkSession.addArtifacts

2023-06-06 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6fd1c649c72 [SPARK-43906][PYTHON][CONNECT] Implement the file support 
in SparkSession.addArtifacts
6fd1c649c72 is described below

commit 6fd1c649c72d4b53ecf83c1643d38002d80c9288
Author: Hyukjin Kwon 
AuthorDate: Wed Jun 7 10:54:24 2023 +0900

[SPARK-43906][PYTHON][CONNECT] Implement the file support in 
SparkSession.addArtifacts

### What changes were proposed in this pull request?

This PR proposes to add support for regular files in 
`SparkSession.addArtifacts`.

### Why are the changes needed?

So that users can distribute regular files to the worker nodes.

### Does this PR introduce _any_ user-facing change?

Yes, it adds support for arbitrary regular files in 
`SparkSession.addArtifacts`.

### How was this patch tested?

Added a couple of unit tests.

Also manually tested in `local-cluster`:

```bash
./sbin/start-connect-server.sh --jars `ls connector/connect/server/target/**/spark-connect*SNAPSHOT.jar` --master "local-cluster[2,2,1024]"
./bin/pyspark --remote "sc://localhost:15002"
```

```python
import os
import tempfile
from pyspark.sql.functions import udf
from pyspark import SparkFiles

with tempfile.TemporaryDirectory() as d:
    file_path = os.path.join(d, "my_file.txt")
    with open(file_path, "w") as f:
        f.write("Hello world!!")

    @udf("string")
    def func(x):
        with open(
            os.path.join(SparkFiles.getRootDirectory(), "my_file.txt"), "r"
        ) as my_file:
            return my_file.read().strip()

    spark.addArtifacts(file_path, file=True)
    spark.range(1).select(func("id")).show()
```

Closes #41415 from HyukjinKwon/addFile.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 .../artifact/SparkConnectArtifactManager.scala |  6 +++--
 python/pyspark/sql/connect/client/artifact.py  | 21 +++
 python/pyspark/sql/connect/client/core.py  |  4 +--
 python/pyspark/sql/connect/session.py  | 13 ++---
 .../sql/tests/connect/client/test_artifact.py  | 31 +++---
 5 files changed, 58 insertions(+), 17 deletions(-)

diff --git 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/artifact/SparkConnectArtifactManager.scala
 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/artifact/SparkConnectArtifactManager.scala
index 604108f68d2..47c48d8e083 100644
--- 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/artifact/SparkConnectArtifactManager.scala
+++ 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/artifact/SparkConnectArtifactManager.scala
@@ -97,6 +97,7 @@ class SparkConnectArtifactManager private[connect] {
* @param session
* @param remoteRelativePath
* @param serverLocalStagingPath
+   * @param fragment
*/
   private[connect] def addArtifact(
   sessionHolder: SessionHolder,
@@ -135,8 +136,7 @@ class SparkConnectArtifactManager private[connect] {
   // previously added,
   if (Files.exists(target)) {
 throw new RuntimeException(
-  s"Duplicate Jar: $remoteRelativePath. " +
-s"Jars cannot be overwritten.")
+  s"Duplicate file: $remoteRelativePath. Files cannot be overwritten.")
   }
   Files.move(serverLocalStagingPath, target)
   if (remoteRelativePath.startsWith(s"jars${File.separator}")) {
@@ -154,6 +154,8 @@ class SparkConnectArtifactManager private[connect] {
 val canonicalUri =
   
fragment.map(UriBuilder.fromUri(target.toUri).fragment).getOrElse(target.toUri)
 sessionHolder.session.sparkContext.addArchive(canonicalUri.toString)
+  } else if (remoteRelativePath.startsWith(s"files${File.separator}")) {
+sessionHolder.session.sparkContext.addFile(target.toString)
   }
 }
   }
diff --git a/python/pyspark/sql/connect/client/artifact.py 
b/python/pyspark/sql/connect/client/artifact.py
index 64f89119e4f..9a848bd96b8 100644
--- a/python/pyspark/sql/connect/client/artifact.py
+++ b/python/pyspark/sql/connect/client/artifact.py
@@ -39,6 +39,7 @@ import pyspark.sql.connect.proto.base_pb2_grpc as grpc_lib
 JAR_PREFIX: str = "jars"
 PYFILE_PREFIX: str = "pyfiles"
 ARCHIVE_PREFIX: str = "archives"
+FILE_PREFIX: str = "files"
 
 
 class LocalData(metaclass=abc.ABCMeta):
@@ -107,6 +108,10 @@ def new_archive_artifact(file_name: str, storage: 
LocalData) -> Artifact:
 return _new_artifact(ARCHIVE_PREFIX, "", file_name, storage)
 
 
+def new_file_artifact(file_name: str, storage: LocalData) -> Artifact:
+return _new_artifact(FILE_PREFIX, "", file_name, 

[spark] branch master updated: [SPARK-43970][PYTHON][CONNECT] Hide unsupported dataframe methods from auto-completion

2023-06-06 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e3957ce8718 [SPARK-43970][PYTHON][CONNECT] Hide unsupported dataframe 
methods from auto-completion
e3957ce8718 is described below

commit e3957ce8718697be2fcb2a95ede439bb49ceadad
Author: Ruifeng Zheng 
AuthorDate: Wed Jun 7 09:39:27 2023 +0800

[SPARK-43970][PYTHON][CONNECT] Hide unsupported dataframe methods from 
auto-completion

### What changes were proposed in this pull request?
Hide unsupported dataframe methods from auto-completion

### Why are the changes needed?

for better UX

before (screenshot): https://github.com/apache/spark/assets/7322292/8f4f228c-e30e-4027-8d52-768f1657f19e

after (screenshot): https://github.com/apache/spark/assets/7322292/1d308937-dd57-4ca6-b37a-29b09348bda5

### Does this PR introduce _any_ user-facing change?
yes

### How was this patch tested?
Manually checked in IPython.

Closes #41462 from zhengruifeng/connect_hide_unsupported_df_functions.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/connect/dataframe.py | 48 +
 1 file changed, 12 insertions(+), 36 deletions(-)

diff --git a/python/pyspark/sql/connect/dataframe.py 
b/python/pyspark/sql/connect/dataframe.py
index a8a4612aec7..6429645f0e0 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -1572,6 +1572,18 @@ class DataFrame:
 raise PySparkAttributeError(
 error_class="JVM_ATTRIBUTE_NOT_SUPPORTED", 
message_parameters={"attr_name": name}
 )
+elif name in [
+"rdd",
+"toJSON",
+"foreach",
+"foreachPartition",
+"checkpoint",
+"localCheckpoint",
+]:
+raise PySparkNotImplementedError(
+error_class="NOT_IMPLEMENTED",
+message_parameters={"feature": f"{name}()"},
+)
 return self[name]
 
 @overload
@@ -1817,12 +1829,6 @@ class DataFrame:
 
 createOrReplaceGlobalTempView.__doc__ = 
PySparkDataFrame.createOrReplaceGlobalTempView.__doc__
 
-def rdd(self, *args: Any, **kwargs: Any) -> None:
-raise PySparkNotImplementedError(
-error_class="NOT_IMPLEMENTED",
-message_parameters={"feature": "RDD Support for Spark Connect"},
-)
-
 def cache(self) -> "DataFrame":
 if self._plan is None:
 raise Exception("Cannot cache on empty plan.")
@@ -1870,18 +1876,6 @@ class DataFrame:
 def is_cached(self) -> bool:
 return self.storageLevel != StorageLevel.NONE
 
-def foreach(self, *args: Any, **kwargs: Any) -> None:
-raise PySparkNotImplementedError(
-error_class="NOT_IMPLEMENTED",
-message_parameters={"feature": "foreach()"},
-)
-
-def foreachPartition(self, *args: Any, **kwargs: Any) -> None:
-raise PySparkNotImplementedError(
-error_class="NOT_IMPLEMENTED",
-message_parameters={"feature": "foreachPartition()"},
-)
-
 def toLocalIterator(self, prefetchPartitions: bool = False) -> 
Iterator[Row]:
 from pyspark.sql.connect.conversion import ArrowTableToRowsConversion
 
@@ -1905,18 +1899,6 @@ class DataFrame:
 
 toLocalIterator.__doc__ = PySparkDataFrame.toLocalIterator.__doc__
 
-def checkpoint(self, *args: Any, **kwargs: Any) -> None:
-raise PySparkNotImplementedError(
-error_class="NOT_IMPLEMENTED",
-message_parameters={"feature": "checkpoint()"},
-)
-
-def localCheckpoint(self, *args: Any, **kwargs: Any) -> None:
-raise PySparkNotImplementedError(
-error_class="NOT_IMPLEMENTED",
-message_parameters={"feature": "localCheckpoint()"},
-)
-
 def to_pandas_on_spark(
 self, index_col: Optional[Union[str, List[str]]] = None
 ) -> "PandasOnSparkDataFrame":
@@ -2001,12 +1983,6 @@ class DataFrame:
 
 writeStream.__doc__ = PySparkDataFrame.writeStream.__doc__
 
-def toJSON(self, *args: Any, **kwargs: Any) -> None:
-raise PySparkNotImplementedError(
-error_class="NOT_IMPLEMENTED",
-message_parameters={"feature": "toJSON()"},
-)
-
 def sameSemantics(self, other: "DataFrame") -> bool:
 assert self._plan is not None
 assert other._plan is not None


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-43893][PYTHON][CONNECT] Non-atomic data type support in Arrow-optimized Python UDF

2023-06-06 Thread xinrong
This is an automated email from the ASF dual-hosted git repository.

xinrong pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 94098853592 [SPARK-43893][PYTHON][CONNECT] Non-atomic data type 
support in Arrow-optimized Python UDF
94098853592 is described below

commit 94098853592b524f52e9a340166b96ddeda4e898
Author: Xinrong Meng 
AuthorDate: Tue Jun 6 15:48:14 2023 -0700

[SPARK-43893][PYTHON][CONNECT] Non-atomic data type support in 
Arrow-optimized Python UDF

### What changes were proposed in this pull request?
Support non-atomic data types in input and output of Arrow-optimized Python 
UDF.

Non-atomic data types refer to: ArrayType, MapType, and StructType.

### Why are the changes needed?
Parity with pickled Python UDFs.

### Does this PR introduce _any_ user-facing change?
Non-atomic data types are accepted as both input and output of 
Arrow-optimized Python UDF.

For example,
```py
>>> df = spark.range(1).selectExpr("struct(1, struct('John', 30, ('value', 10))) as nested_struct")
>>> df.select(udf(lambda x: str(x))("nested_struct")).first()
Row(<lambda>(nested_struct)="Row(col1=1, col2=Row(col1='John', col2=30, col3=Row(col1='value', col2=10)))")
```

### How was this patch tested?
Unit tests.

Closes #41321 from xinrong-meng/arrow_udf_struct.

Authored-by: Xinrong Meng 
Signed-off-by: Xinrong Meng 
---
 python/pyspark/sql/pandas/serializers.py  | 22 ---
 python/pyspark/sql/tests/test_arrow_python_udf.py | 17 -
 python/pyspark/sql/tests/test_udf.py  | 45 +++
 python/pyspark/sql/udf.py | 15 +---
 python/pyspark/worker.py  | 13 +--
 5 files changed, 79 insertions(+), 33 deletions(-)

diff --git a/python/pyspark/sql/pandas/serializers.py 
b/python/pyspark/sql/pandas/serializers.py
index 84471143367..12d0bee88ad 100644
--- a/python/pyspark/sql/pandas/serializers.py
+++ b/python/pyspark/sql/pandas/serializers.py
@@ -172,7 +172,7 @@ class ArrowStreamPandasSerializer(ArrowStreamSerializer):
 self._timezone = timezone
 self._safecheck = safecheck
 
-def arrow_to_pandas(self, arrow_column):
+def arrow_to_pandas(self, arrow_column, struct_in_pandas="dict"):
 # If the given column is a date type column, creates a series of 
datetime.date directly
 # instead of creating datetime64[ns] as intermediate data to avoid 
overflow caused by
 # datetime64[ns] type handling.
@@ -184,7 +184,7 @@ class ArrowStreamPandasSerializer(ArrowStreamSerializer):
 data_type=from_arrow_type(arrow_column.type, 
prefer_timestamp_ntz=True),
 nullable=True,
 timezone=self._timezone,
-struct_in_pandas="dict",
+struct_in_pandas=struct_in_pandas,
 error_on_duplicated_field_names=True,
 )
 return converter(s)
@@ -310,10 +310,18 @@ class 
ArrowStreamPandasUDFSerializer(ArrowStreamPandasSerializer):
 Serializer used by Python worker to evaluate Pandas UDFs
 """
 
-def __init__(self, timezone, safecheck, assign_cols_by_name, 
df_for_struct=False):
+def __init__(
+self,
+timezone,
+safecheck,
+assign_cols_by_name,
+df_for_struct=False,
+struct_in_pandas="dict",
+):
 super(ArrowStreamPandasUDFSerializer, self).__init__(timezone, 
safecheck)
 self._assign_cols_by_name = assign_cols_by_name
 self._df_for_struct = df_for_struct
+self._struct_in_pandas = struct_in_pandas
 
 def arrow_to_pandas(self, arrow_column):
 import pyarrow.types as types
@@ -323,13 +331,15 @@ class 
ArrowStreamPandasUDFSerializer(ArrowStreamPandasSerializer):
 
 series = [
 super(ArrowStreamPandasUDFSerializer, self)
-.arrow_to_pandas(column)
+.arrow_to_pandas(column, self._struct_in_pandas)
 .rename(field.name)
 for column, field in zip(arrow_column.flatten(), 
arrow_column.type)
 ]
 s = pd.concat(series, axis=1)
 else:
-s = super(ArrowStreamPandasUDFSerializer, 
self).arrow_to_pandas(arrow_column)
+s = super(ArrowStreamPandasUDFSerializer, self).arrow_to_pandas(
+arrow_column, self._struct_in_pandas
+)
 return s
 
 def _create_batch(self, series):
@@ -360,7 +370,7 @@ class 
ArrowStreamPandasUDFSerializer(ArrowStreamPandasSerializer):
 
 arrs = []
 for s, t in series:
-if t is not None and pa.types.is_struct(t):
+if self._struct_in_pandas == "dict" and t is not None and 
pa.types.is_struct(t):
 if not isinstance(s, pd.DataFrame):
 raise 

[spark] branch branch-3.4 updated: [SPARK-43973][SS][UI][TESTS][FOLLOWUP][3.4] Fix compilation by switching QueryTerminatedEvent constructor

2023-06-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 77e077a96f7 [SPARK-43973][SS][UI][TESTS][FOLLOWUP][3.4] Fix 
compilation by switching QueryTerminatedEvent constructor
77e077a96f7 is described below

commit 77e077a96f7dafb1f43af7e9f7b495dce6b1e18a
Author: Dongjoon Hyun 
AuthorDate: Tue Jun 6 11:30:32 2023 -0700

[SPARK-43973][SS][UI][TESTS][FOLLOWUP][3.4] Fix compilation by switching 
QueryTerminatedEvent constructor

### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/41468 to fix 
`branch-3.4`'s compilation issue.

### Why are the changes needed?

To recover the compilation.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
Pass the CIs.

Closes #41484 from dongjoon-hyun/SPARK-43973.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../spark/sql/streaming/ui/StreamingQueryStatusListenerSuite.scala| 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/ui/StreamingQueryStatusListenerSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/ui/StreamingQueryStatusListenerSuite.scala
index 4b6d2dc4562..b01abd8032b 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/ui/StreamingQueryStatusListenerSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/ui/StreamingQueryStatusListenerSuite.scala
@@ -279,7 +279,7 @@ class StreamingQueryStatusListenerSuite extends StreamTest {
 val startEvent1 = new StreamingQueryListener.QueryStartedEvent(
   id1, runId1, "test1", "2023-01-01T20:50:00.800Z")
 listener.onQueryStarted(startEvent1)
-val terminateEvent1 = new StreamingQueryListener.QueryTerminatedEvent(id1, 
runId1, None, None)
+val terminateEvent1 = new StreamingQueryListener.QueryTerminatedEvent(id1, 
runId1, None)
 listener.onQueryTerminated(terminateEvent1)
 
 // failure (has exception) case
@@ -289,7 +289,7 @@ class StreamingQueryStatusListenerSuite extends StreamTest {
   id2, runId2, "test2", "2023-01-02T20:54:20.827Z")
 listener.onQueryStarted(startEvent2)
 val terminateEvent2 = new StreamingQueryListener.QueryTerminatedEvent(
-  id2, runId2, Option("ExampleException"), Option("EXAMPLE_ERROR_CLASS"))
+  id2, runId2, Option("ExampleException"))
 listener.onQueryTerminated(terminateEvent2)
 
 // check results


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (b4ab34bf9b2 -> 957f0e4c9f2)

2023-06-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from b4ab34bf9b2 [SPARK-43976][CORE] Handle the case where modifiedConfigs 
doesn't exist in event logs
 add 957f0e4c9f2 [SPARK-43959][SQL][TESTS] Make RowLevelOperationSuiteBase 
and AlignAssignmentsSuite abstract

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/sql/connector/RowLevelOperationSuiteBase.scala | 2 +-
 .../{AlignAssignmentsSuite.scala => AlignAssignmentsSuiteBase.scala}| 2 +-
 .../apache/spark/sql/execution/command/AlignMergeAssignmentsSuite.scala | 2 +-
 .../spark/sql/execution/command/AlignUpdateAssignmentsSuite.scala   | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)
 rename 
sql/core/src/test/scala/org/apache/spark/sql/execution/command/{AlignAssignmentsSuite.scala
 => AlignAssignmentsSuiteBase.scala} (99%)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.3 updated: [SPARK-43976][CORE] Handle the case where modifiedConfigs doesn't exist in event logs

2023-06-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new ababc572b66 [SPARK-43976][CORE] Handle the case where modifiedConfigs 
doesn't exist in event logs
ababc572b66 is described below

commit ababc572b66e0ccffa5fefed3c87b18e0f60cc50
Author: Dongjoon Hyun 
AuthorDate: Tue Jun 6 09:34:40 2023 -0700

[SPARK-43976][CORE] Handle the case where modifiedConfigs doesn't exist in 
event logs

### What changes were proposed in this pull request?

This prevents NPE by handling the case where `modifiedConfigs` doesn't 
exist in event logs.

### Why are the changes needed?

Basically, this is the same solution for that case.
- https://github.com/apache/spark/pull/34907

The new code was added here, but we missed the corner case.

- https://github.com/apache/spark/pull/35972

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #41472 from dongjoon-hyun/SPARK-43976.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit b4ab34bf9b22d0f0ca4ab13f9b6106f38ccfaebe)
Signed-off-by: Dongjoon Hyun 
---
 .../scala/org/apache/spark/sql/execution/ui/ExecutionPage.scala| 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/ExecutionPage.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/ExecutionPage.scala
index 498bb2a6c1c..c0e6b65d634 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/ExecutionPage.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/ExecutionPage.scala
@@ -84,15 +84,14 @@ class ExecutionPage(parent: SQLTab) extends 
WebUIPage("execution") with Logging
 
   val metrics = sqlStore.executionMetrics(executionId)
   val graph = sqlStore.planGraph(executionId)
+  val configs = 
Option(executionUIData.modifiedConfigs).getOrElse(Map.empty)
 
   summary ++
 planVisualization(request, metrics, graph) ++
 physicalPlanDescription(executionUIData.physicalPlanDescription) ++
-modifiedConfigs(
-  executionUIData.modifiedConfigs.filterKeys(
-!_.startsWith(pandasOnSparkConfPrefix)).toMap) ++
+
modifiedConfigs(configs.filterKeys(!_.startsWith(pandasOnSparkConfPrefix)).toMap)
 ++
 modifiedPandasOnSparkConfigs(
-  
executionUIData.modifiedConfigs.filterKeys(_.startsWith(pandasOnSparkConfPrefix)).toMap)
+  configs.filterKeys(_.startsWith(pandasOnSparkConfPrefix)).toMap)
 }.getOrElse {
   No information to display for query {executionId}
 }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (8c6a54d70a7 -> b4ab34bf9b2)

2023-06-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 8c6a54d70a7 [SPARK-43919][SQL] Extract JSON functionality out of Row
 add b4ab34bf9b2 [SPARK-43976][CORE] Handle the case where modifiedConfigs 
doesn't exist in event logs

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/spark/sql/execution/ui/ExecutionPage.scala| 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.4 updated: [SPARK-43976][CORE] Handle the case where modifiedConfigs doesn't exist in event logs

2023-06-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 778beb3bc67 [SPARK-43976][CORE] Handle the case where modifiedConfigs 
doesn't exist in event logs
778beb3bc67 is described below

commit 778beb3bc67ea784ef0c4314a1f998482eea2e5f
Author: Dongjoon Hyun 
AuthorDate: Tue Jun 6 09:34:40 2023 -0700

[SPARK-43976][CORE] Handle the case where modifiedConfigs doesn't exist in 
event logs

### What changes were proposed in this pull request?

This prevents NPE by handling the case where `modifiedConfigs` doesn't 
exist in event logs.

### Why are the changes needed?

Basically, this is the same solution for that case.
- https://github.com/apache/spark/pull/34907

The new code was added here, but we missed the corner case.

- https://github.com/apache/spark/pull/35972

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #41472 from dongjoon-hyun/SPARK-43976.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit b4ab34bf9b22d0f0ca4ab13f9b6106f38ccfaebe)
Signed-off-by: Dongjoon Hyun 
---
 .../scala/org/apache/spark/sql/execution/ui/ExecutionPage.scala| 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/ExecutionPage.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/ExecutionPage.scala
index 498bb2a6c1c..c0e6b65d634 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/ExecutionPage.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/ExecutionPage.scala
@@ -84,15 +84,14 @@ class ExecutionPage(parent: SQLTab) extends 
WebUIPage("execution") with Logging
 
   val metrics = sqlStore.executionMetrics(executionId)
   val graph = sqlStore.planGraph(executionId)
+  val configs = 
Option(executionUIData.modifiedConfigs).getOrElse(Map.empty)
 
   summary ++
 planVisualization(request, metrics, graph) ++
 physicalPlanDescription(executionUIData.physicalPlanDescription) ++
-modifiedConfigs(
-  executionUIData.modifiedConfigs.filterKeys(
-!_.startsWith(pandasOnSparkConfPrefix)).toMap) ++
+
modifiedConfigs(configs.filterKeys(!_.startsWith(pandasOnSparkConfPrefix)).toMap)
 ++
 modifiedPandasOnSparkConfigs(
-  
executionUIData.modifiedConfigs.filterKeys(_.startsWith(pandasOnSparkConfPrefix)).toMap)
+  configs.filterKeys(_.startsWith(pandasOnSparkConfPrefix)).toMap)
 }.getOrElse {
   No information to display for query {executionId}
 }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (ae8aaec9cde -> 8c6a54d70a7)

2023-06-06 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from ae8aaec9cde [SPARK-43977][CONNECT] Fix unexpected check result of 
`dev/connect-jvm-client-mima-check`
 add 8c6a54d70a7 [SPARK-43919][SQL] Extract JSON functionality out of Row

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/sql/streaming/progress.scala  |   5 +-
 docs/sql-migration-guide.md|   1 +
 project/MimaExcludes.scala |   6 +
 .../src/main/scala/org/apache/spark/sql/Row.scala  | 106 +
 .../org/apache/spark/sql/util/ToJsonUtil.scala | 129 +
 .../scala/org/apache/spark/sql/RowJsonSuite.scala  |   7 +-
 .../test/scala/org/apache/spark/sql/RowTest.scala  |   3 +-
 .../org/apache/spark/sql/streaming/progress.scala  |   3 +-
 8 files changed, 149 insertions(+), 111 deletions(-)
 create mode 100644 
sql/catalyst/src/main/scala/org/apache/spark/sql/util/ToJsonUtil.scala


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (2a82b42bcb1 -> ae8aaec9cde)

2023-06-06 Thread yangjie01
This is an automated email from the ASF dual-hosted git repository.

yangjie01 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 2a82b42bcb1 [SPARK-43097][ML] New pyspark ML logistic regression 
estimator implemented on top of distributor
 add ae8aaec9cde [SPARK-43977][CONNECT] Fix unexpected check result of 
`dev/connect-jvm-client-mima-check`

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/connect/client/CheckConnectJvmClientCompatibility.scala   | 2 +-
 dev/connect-jvm-client-mima-check   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-43097][ML] New pyspark ML logistic regression estimator implemented on top of distributor

2023-06-06 Thread weichenxu123
This is an automated email from the ASF dual-hosted git repository.

weichenxu123 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2a82b42bcb1 [SPARK-43097][ML] New pyspark ML logistic regression 
estimator implemented on top of distributor
2a82b42bcb1 is described below

commit 2a82b42bcb100fade622d3d08ef5d316425d3e5a
Author: Weichen Xu 
AuthorDate: Tue Jun 6 21:53:46 2023 +0800

[SPARK-43097][ML] New pyspark ML logistic regression estimator implemented 
on top of distributor

### What changes were proposed in this pull request?

This PR takes over https://github.com/apache/spark/pull/40748

Closes https://github.com/apache/spark/pull/40748

### Why are the changes needed?

Part of the distributed ML on Spark Connect project.

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

Unit tests.

Manual testing code:

Start local pyspark shell by:

```
bin/pyspark --remote "local[2]"   # spark connect mode
```
or
```
bin/pyspark --master "local[2]"   # legacy mode
```
to launch the PySpark shell.

Then, paste the following code into the shell:
```
from pyspark.mlv2.classification import LogisticRegression as LORV2
lorv2 = LORV2(maxIter=2, numTrainWorkers=2)

from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [
        (1.0, Vectors.dense(0.0, 5.0)),
        (0.0, Vectors.dense(1.0, 2.0)),
        (1.0, Vectors.dense(2.0, 1.0)),
        (0.0, Vectors.dense(3.0, 3.0)),
    ] * 100,
    ["label", "features"],
)

# Fit the estimator to obtain the model used below.
model = lorv2.fit(df)

model.transform(df).show(truncate=False)
model.transform(df.toPandas())

model.set(model.probabilityCol, "")
model.transform(df).show(truncate=False)
model.transform(df.toPandas())
```

Closes #41383 from WeichenXu123/lor-torch-2.

Authored-by: Weichen Xu 
Signed-off-by: Weichen Xu 
---
 dev/sparktestsupport/modules.py|   2 +
 python/pyspark/ml/param/_shared_params_code_gen.py |  24 ++
 python/pyspark/ml/param/shared.py  |  89 ++
 python/pyspark/ml/torch/data.py|  13 +-
 python/pyspark/ml/torch/distributor.py |  13 +-
 python/pyspark/mlv2/base.py|  75 -
 python/pyspark/mlv2/classification.py  | 306 +
 .../tests/connect/test_parity_classification.py|  41 +++
 python/pyspark/mlv2/tests/test_classification.py   | 139 ++
 9 files changed, 696 insertions(+), 6 deletions(-)

diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py
index aa755033aa4..ecc471fd700 100644
--- a/dev/sparktestsupport/modules.py
+++ b/dev/sparktestsupport/modules.py
@@ -609,6 +609,7 @@ pyspark_ml = Module(
 "pyspark.mlv2.tests.test_summarizer",
 "pyspark.mlv2.tests.test_evaluation",
 "pyspark.mlv2.tests.test_feature",
+"pyspark.mlv2.tests.test_classification",
 ],
 excluded_python_implementations=[
 "PyPy"  # Skip these tests under PyPy since they require numpy and it 
isn't available there
@@ -823,6 +824,7 @@ pyspark_connect = Module(
 "pyspark.mlv2.tests.connect.test_parity_summarizer",
 "pyspark.mlv2.tests.connect.test_parity_evaluation",
 "pyspark.mlv2.tests.connect.test_parity_feature",
+"pyspark.mlv2.tests.connect.test_parity_classification",
 ],
 excluded_python_implementations=[
 "PyPy"  # Skip these tests under PyPy since they require numpy, 
pandas, and pyarrow and
diff --git a/python/pyspark/ml/param/_shared_params_code_gen.py 
b/python/pyspark/ml/param/_shared_params_code_gen.py
index 5df1782084a..2bec3a5053f 100644
--- a/python/pyspark/ml/param/_shared_params_code_gen.py
+++ b/python/pyspark/ml/param/_shared_params_code_gen.py
@@ -332,6 +332,30 @@ if __name__ == "__main__":
 "0.0",
 "TypeConverters.toFloat",
 ),
+(
+"numTrainWorkers",
+"number of training workers",
+"1",
+"TypeConverters.toInt",
+),
+(
+"batchSize",
+"number of training batch size",
+None,
+"TypeConverters.toInt",
+),
+(
+"learningRate",
+"learning rate for training",
+None,
+"TypeConverters.toFloat",
+),
+(
+"momentum",
+"momentum for training optimizer",
+None,
+"TypeConverters.toFloat",
+),
 ]
 
 code = []
diff --git a/python/pyspark/ml/param/shared.py 
b/python/pyspark/ml/param/shared.py
index fcfced2e566..d61d206d219 100644
--- a/python/pyspark/ml/param/shared.py
+++ 

[spark] branch master updated (89d44d092af -> 0a4f226280c)

2023-06-06 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 89d44d092af [SPARK-43510][YARN] Fix YarnAllocator internal state when 
adding running executor after processing completed containers
 add 0a4f226280c [SPARK-43205][SQL][FOLLOWUP] add 
ExpressionWithUnresolvedIdentifier to simplify code

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/analysis/Analyzer.scala |  20 +--
 ...useUtil.scala => ResolveIdentifierClause.scala} |  37 ++--
 .../spark/sql/catalyst/analysis/unresolved.scala   |  83 +++--
 .../spark/sql/catalyst/parser/AstBuilder.scala | 192 ++---
 .../spark/sql/catalyst/trees/TreePatterns.scala|   2 +-
 .../double-quoted-identifiers-disabled.sql.out |   4 +-
 .../ansi/double-quoted-identifiers-enabled.sql.out |   8 +-
 .../double-quoted-identifiers.sql.out  |   4 +-
 .../analyzer-results/identifier-clause.sql.out |   6 +-
 .../postgreSQL/create_view.sql.out |   4 +-
 .../double-quoted-identifiers-disabled.sql.out |   4 +-
 .../ansi/double-quoted-identifiers-enabled.sql.out |   8 +-
 .../results/double-quoted-identifiers.sql.out  |   4 +-
 .../sql-tests/results/identifier-clause.sql.out|   6 +-
 .../results/postgreSQL/create_view.sql.out |   4 +-
 .../org/apache/spark/sql/hive/InsertSuite.scala|   2 +-
 16 files changed, 178 insertions(+), 210 deletions(-)
 rename 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/{IdentifierClauseUtil.scala
 => ResolveIdentifierClause.scala} (56%)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.4 updated: [SPARK-43510][YARN] Fix YarnAllocator internal state when adding running executor after processing completed containers

2023-06-06 Thread tgraves
This is an automated email from the ASF dual-hosted git repository.

tgraves pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 63d59956024 [SPARK-43510][YARN] Fix YarnAllocator internal state when 
adding running executor after processing completed containers
63d59956024 is described below

commit 63d59956024781b062791dda9990a6043b6a10c1
Author: manuzhang 
AuthorDate: Tue Jun 6 08:28:52 2023 -0500

[SPARK-43510][YARN] Fix YarnAllocator internal state when adding running 
executor after processing completed containers

### What changes were proposed in this pull request?
Keep track of completed container ids in YarnAllocator and don't update 
internal state of a container if it's already completed.

### Why are the changes needed?
YarnAllocator updates its internal state, adding running executors, after the 
executor launch finishes in a separate thread. That can happen after the containers 
have already completed (e.g. been preempted) and been processed by YarnAllocator. 
YarnAllocator then mistakenly thinks there are still running executors even though 
they are already lost. As a result, the application hangs without any running executors.
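
For illustration, a simplified sketch of the guard this patch introduces (names and 
structure are simplified; this is not the actual YarnAllocator code):

```scala
import scala.collection.mutable

// Track which containers are still launching; a completion event removes the id,
// so a launch that finishes afterwards no longer updates the "running" state.
class AllocatorStateSketch {
  private val launching = mutable.HashSet[String]()
  private val running = mutable.HashSet[String]()

  def onLaunchRequested(containerId: String): Unit = synchronized {
    launching.add(containerId)
  }

  // Called from the launcher thread once the executor process has been started.
  def onLaunchFinished(containerId: String): Unit = synchronized {
    if (launching.remove(containerId)) {
      running.add(containerId) // only count it if it has not completed already
    }
  }

  // Called when YARN reports the container as completed (e.g. preempted).
  def onContainerCompleted(containerId: String): Unit = synchronized {
    launching.remove(containerId)
    running.remove(containerId)
  }
}
```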

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added UT.

Closes #41173 from manuzhang/spark-43510.

Authored-by: manuzhang 
Signed-off-by: Thomas Graves 
(cherry picked from commit 89d44d092af4ae53fec296ca6569e240ad4c2bc5)
Signed-off-by: Thomas Graves 
---
 .../apache/spark/deploy/yarn/YarnAllocator.scala   | 42 ++
 .../spark/deploy/yarn/YarnAllocatorSuite.scala | 28 ++-
 2 files changed, 54 insertions(+), 16 deletions(-)

diff --git 
a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
 
b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
index 313b19f919d..dede5501a39 100644
--- 
a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
+++ 
b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
@@ -90,6 +90,9 @@ private[yarn] class YarnAllocator(
   @GuardedBy("this")
   private val releasedContainers = collection.mutable.HashSet[ContainerId]()
 
+  @GuardedBy("this")
+  private val launchingExecutorContainerIds = 
collection.mutable.HashSet[ContainerId]()
+
   @GuardedBy("this")
   private val runningExecutorsPerResourceProfileId = new HashMap[Int, 
mutable.Set[String]]()
 
@@ -742,19 +745,6 @@ private[yarn] class YarnAllocator(
   logInfo(s"Launching container $containerId on host $executorHostname " +
 s"for executor with ID $executorId for ResourceProfile Id $rpId")
 
-  def updateInternalState(): Unit = synchronized {
-getOrUpdateRunningExecutorForRPId(rpId).add(executorId)
-getOrUpdateNumExecutorsStartingForRPId(rpId).decrementAndGet()
-executorIdToContainer(executorId) = container
-containerIdToExecutorIdAndResourceProfileId(container.getId) = 
(executorId, rpId)
-
-val localallocatedHostToContainersMap = 
getOrUpdateAllocatedHostToContainersMapForRPId(rpId)
-val containerSet = 
localallocatedHostToContainersMap.getOrElseUpdate(executorHostname,
-  new HashSet[ContainerId])
-containerSet += containerId
-allocatedContainerToHostMap.put(containerId, executorHostname)
-  }
-
   val rp = rpIdToResourceProfile(rpId)
   val defaultResources = 
ResourceProfile.getDefaultProfileExecutorResources(sparkConf)
   val containerMem = rp.executorResources.get(ResourceProfile.MEMORY).
@@ -767,6 +757,7 @@ private[yarn] class YarnAllocator(
   val rpRunningExecs = getOrUpdateRunningExecutorForRPId(rpId).size
   if (rpRunningExecs < getOrUpdateTargetNumExecutorsForRPId(rpId)) {
 getOrUpdateNumExecutorsStartingForRPId(rpId).incrementAndGet()
+launchingExecutorContainerIds.add(containerId)
 if (launchContainers) {
   launcherPool.execute(() => {
 try {
@@ -784,10 +775,11 @@ private[yarn] class YarnAllocator(
 localResources,
 rp.id
   ).run()
-  updateInternalState()
+  updateInternalState(rpId, executorId, container)
 } catch {
   case e: Throwable =>
 getOrUpdateNumExecutorsStartingForRPId(rpId).decrementAndGet()
+launchingExecutorContainerIds.remove(containerId)
 if (NonFatal(e)) {
   logError(s"Failed to launch executor $executorId on 
container $containerId", e)
   // Assigned container should be released immediately
@@ -800,7 +792,7 @@ private[yarn] class YarnAllocator(
   })
 } else {
   // For test only
-  updateInternalState()
+  

[spark] branch master updated: [SPARK-43510][YARN] Fix YarnAllocator internal state when adding running executor after processing completed containers

2023-06-06 Thread tgraves
This is an automated email from the ASF dual-hosted git repository.

tgraves pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 89d44d092af [SPARK-43510][YARN] Fix YarnAllocator internal state when 
adding running executor after processing completed containers
89d44d092af is described below

commit 89d44d092af4ae53fec296ca6569e240ad4c2bc5
Author: manuzhang 
AuthorDate: Tue Jun 6 08:28:52 2023 -0500

[SPARK-43510][YARN] Fix YarnAllocator internal state when adding running 
executor after processing completed containers

### What changes were proposed in this pull request?
Keep track of completed container IDs in YarnAllocator and don't update the 
internal state for a container that has already completed.

### Why are the changes needed?
YarnAllocator updates its internal state to add running executors after 
executor launch in a separate thread. That update can happen after the 
containers have already completed (e.g. preempted) and been processed by 
YarnAllocator. YarnAllocator then mistakenly thinks there are still running 
executors, even though they are already lost. As a result, the application 
hangs without any running executors.
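
Below is a minimal, hedged Scala sketch of the guard pattern this fix relies 
on: the allocator records which container IDs are still launching, and the 
launcher thread promotes a container to the running set only if it has not 
already been completed. The names, the thread pool, and the plain-string 
container IDs are illustrative stand-ins, not the actual YarnAllocator API.

import java.util.concurrent.Executors
import scala.collection.mutable

// Hedged sketch only: the real YarnAllocator uses ContainerId, resource
// profiles and YARN callbacks; plain strings and a local pool stand in here.
object LaunchGuardSketch {
  private val launching = mutable.HashSet[String]()   // containers still being launched
  private val running   = mutable.HashSet[String]()   // containers counted as running
  private val pool = Executors.newFixedThreadPool(2)

  def launch(containerId: String): Unit = synchronized {
    launching.add(containerId)
    pool.execute(() => updateInternalState(containerId))
  }

  // Invoked when YARN reports the container as completed (e.g. preempted).
  def onContainerCompleted(containerId: String): Unit = synchronized {
    launching.remove(containerId)
    running.remove(containerId)
  }

  // The launcher thread only marks the container running if it is still in
  // the launching set, so a completion processed first cannot be overwritten.
  private def updateInternalState(containerId: String): Unit = synchronized {
    if (launching.remove(containerId)) {
      running.add(containerId)
    }
  }

  def main(args: Array[String]): Unit = {
    launch("container_1")
    onContainerCompleted("container_1") // completion races with the launch thread
    Thread.sleep(100)
    // With the guard, running stays empty in either interleaving; without it,
    // a stale promotion could re-add the lost container to the running set.
    println(s"running = $running")
    pool.shutdown()
  }
}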

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added UT.

Closes #41173 from manuzhang/spark-43510.

Authored-by: manuzhang 
Signed-off-by: Thomas Graves 
---
 .../apache/spark/deploy/yarn/YarnAllocator.scala   | 42 ++
 .../spark/deploy/yarn/YarnAllocatorSuite.scala | 28 ++-
 2 files changed, 54 insertions(+), 16 deletions(-)

diff --git 
a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
 
b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
index b6ee21ed817..19c06f95731 100644
--- 
a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
+++ 
b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
@@ -91,6 +91,9 @@ private[yarn] class YarnAllocator(
   @GuardedBy("this")
   private val releasedContainers = collection.mutable.HashSet[ContainerId]()
 
+  @GuardedBy("this")
+  private val launchingExecutorContainerIds = 
collection.mutable.HashSet[ContainerId]()
+
   @GuardedBy("this")
   private val runningExecutorsPerResourceProfileId = new HashMap[Int, 
mutable.Set[String]]()
 
@@ -738,19 +741,6 @@ private[yarn] class YarnAllocator(
   logInfo(s"Launching container $containerId on host $executorHostname " +
 s"for executor with ID $executorId for ResourceProfile Id $rpId")
 
-  def updateInternalState(): Unit = synchronized {
-getOrUpdateRunningExecutorForRPId(rpId).add(executorId)
-getOrUpdateNumExecutorsStartingForRPId(rpId).decrementAndGet()
-executorIdToContainer(executorId) = container
-containerIdToExecutorIdAndResourceProfileId(container.getId) = 
(executorId, rpId)
-
-val localallocatedHostToContainersMap = 
getOrUpdateAllocatedHostToContainersMapForRPId(rpId)
-val containerSet = 
localallocatedHostToContainersMap.getOrElseUpdate(executorHostname,
-  new HashSet[ContainerId])
-containerSet += containerId
-allocatedContainerToHostMap.put(containerId, executorHostname)
-  }
-
   val rp = rpIdToResourceProfile(rpId)
   val defaultResources = 
ResourceProfile.getDefaultProfileExecutorResources(sparkConf)
   val containerMem = rp.executorResources.get(ResourceProfile.MEMORY).
@@ -763,6 +753,7 @@ private[yarn] class YarnAllocator(
   val rpRunningExecs = getOrUpdateRunningExecutorForRPId(rpId).size
   if (rpRunningExecs < getOrUpdateTargetNumExecutorsForRPId(rpId)) {
 getOrUpdateNumExecutorsStartingForRPId(rpId).incrementAndGet()
+launchingExecutorContainerIds.add(containerId)
 if (launchContainers) {
   launcherPool.execute(() => {
 try {
@@ -780,10 +771,11 @@ private[yarn] class YarnAllocator(
 localResources,
 rp.id
   ).run()
-  updateInternalState()
+  updateInternalState(rpId, executorId, container)
 } catch {
   case e: Throwable =>
 getOrUpdateNumExecutorsStartingForRPId(rpId).decrementAndGet()
+launchingExecutorContainerIds.remove(containerId)
 if (NonFatal(e)) {
   logError(s"Failed to launch executor $executorId on 
container $containerId", e)
   // Assigned container should be released immediately
@@ -796,7 +788,7 @@ private[yarn] class YarnAllocator(
   })
 } else {
   // For test only
-  updateInternalState()
+  updateInternalState(rpId, executorId, container)
 }
   } else {
 logInfo(("Skip launching 

[spark] branch master updated: [SPARK-43913][SQL] Assign names to the error class _LEGACY_ERROR_TEMP_[2426-2432]

2023-06-06 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0cd5ca5a7b3 [SPARK-43913][SQL] Assign names to the error class 
_LEGACY_ERROR_TEMP_[2426-2432]
0cd5ca5a7b3 is described below

commit 0cd5ca5a7b31f65a005c8ee2e90a6b4a29623ba7
Author: Jiaan Geng 
AuthorDate: Tue Jun 6 10:28:48 2023 +0300

[SPARK-43913][SQL] Assign names to the error class 
_LEGACY_ERROR_TEMP_[2426-2432]

### What changes were proposed in this pull request?
This PR aims to assign names to the error classes 
`_LEGACY_ERROR_TEMP_[2426-2432]`.

### Why are the changes needed?
Improve the error framework.
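
As one concrete illustration, the following hedged Scala sketch shows a query 
that should now be reported under the newly named 
`EXPRESSION_TYPE_IS_NOT_ORDERABLE` class (formerly `_LEGACY_ERROR_TEMP_2427`), 
since MAP columns are not orderable in Spark SQL; the object name and session 
setup are illustrative, and the exact message text depends on this commit's 
CheckAnalysis mapping.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Hedged repro sketch: sorting by a MAP column fails analysis, which should
// now surface the named error class instead of a _LEGACY_ERROR_TEMP_* name.
object NotOrderableRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("repro").getOrCreate()
    val df = spark.range(3).select(map(lit("k"), col("id")).as("m"))
    try {
      df.orderBy("m").collect()
    } catch {
      case e: Exception => println(e.getMessage) // expect the not-orderable error
    }
    spark.stop()
  }
}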

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing test cases.

Closes #41424 from beliefer/SPARK-43913.

Authored-by: Jiaan Geng 
Signed-off-by: Max Gekk 
---
 core/src/main/resources/error/error-classes.json   | 58 --
 .../sql/catalyst/analysis/CheckAnalysis.scala  | 51 +++
 .../sql/catalyst/analysis/AnalysisErrorSuite.scala | 20 
 .../CreateTablePartitioningValidationSuite.scala   | 22 
 .../negative-cases/invalid-correlation.sql.out |  6 ++-
 .../negative-cases/invalid-correlation.sql.out |  6 ++-
 6 files changed, 93 insertions(+), 70 deletions(-)

diff --git a/core/src/main/resources/error/error-classes.json 
b/core/src/main/resources/error/error-classes.json
index de80415d85b..8c3c076ce74 100644
--- a/core/src/main/resources/error/error-classes.json
+++ b/core/src/main/resources/error/error-classes.json
@@ -660,6 +660,11 @@
   "The event time  has the invalid type , but 
expected \"TIMESTAMP\"."
 ]
   },
+  "EXPRESSION_TYPE_IS_NOT_ORDERABLE" : {
+"message" : [
+  "Column expression  cannot be sorted because its type  
is not orderable."
+]
+  },
   "FAILED_EXECUTE_UDF" : {
 "message" : [
   "Failed to execute user defined function (: () 
=> )."
@@ -1541,6 +1546,24 @@
 ],
 "sqlState" : "42803"
   },
+  "MISSING_ATTRIBUTES" : {
+"message" : [
+  "Resolved attribute(s)  missing from  in 
operator ."
+],
+"subClass" : {
+  "RESOLVED_ATTRIBUTE_APPEAR_IN_OPERATION" : {
+"message" : [
+  "Attribute(s) with the same name appear in the operation: 
.",
+  "Please check if the right attribute(s) are used."
+]
+  },
+  "RESOLVED_ATTRIBUTE_MISSING_FROM_INPUT" : {
+"message" : [
+  ""
+]
+  }
+}
+  },
   "MISSING_GROUP_BY" : {
 "message" : [
   "The query does not include a GROUP BY clause. Add GROUP BY or turn it 
into the window functions using OVER clauses."
@@ -1945,6 +1968,11 @@
   "Query [id = , runId = ] terminated with exception: "
 ]
   },
+  "SUM_OF_LIMIT_AND_OFFSET_EXCEEDS_MAX_INT" : {
+"message" : [
+  "The sum of the LIMIT clause and the OFFSET clause must not be greater 
than the maximum 32-bit integer value (2,147,483,647) but found limit = 
, offset = ."
+]
+  },
   "TABLE_OR_VIEW_ALREADY_EXISTS" : {
 "message" : [
   "Cannot create table or view  because it already exists.",
@@ -2310,6 +2338,11 @@
   "Parameter markers in unexpected statement: . Parameter 
markers must only be used in a query, or DML statement."
 ]
   },
+  "PARTITION_WITH_NESTED_COLUMN_IS_UNSUPPORTED" : {
+"message" : [
+  "Invalid partitioning:  is missing or is in a map or array."
+]
+  },
   "PIVOT_AFTER_GROUP_BY" : {
 "message" : [
   "PIVOT clause following a GROUP BY clause. Consider pushing the 
GROUP BY into a subquery."
@@ -5525,31 +5558,6 @@
   "failed to evaluate expression : "
 ]
   },
-  "_LEGACY_ERROR_TEMP_2426" : {
-"message" : [
-  "nondeterministic expression  should not appear in grouping 
expression."
-]
-  },
-  "_LEGACY_ERROR_TEMP_2427" : {
-"message" : [
-  "sorting is not supported for columns of type ."
-]
-  },
-  "_LEGACY_ERROR_TEMP_2428" : {
-"message" : [
-  "The sum of the LIMIT clause and the OFFSET clause must not be greater 
than the maximum 32-bit integer value (2,147,483,647) but found limit = 
, offset = ."
-]
-  },
-  "_LEGACY_ERROR_TEMP_2431" : {
-"message" : [
-  "Invalid partitioning:  is missing or is in a map or array."
-]
-  },
-  "_LEGACY_ERROR_TEMP_2432" : {
-"message" : [
-  ""
-]
-  },
   "_LEGACY_ERROR_TEMP_2433" : {
 "message" : [
   "Only a single table generating function is allowed in a SELECT clause, 
found:",
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
index 594c0b666e8..9124890d4af 100644

[spark] branch master updated: [SPARK-43962][SQL] Improve error messages: `CANNOT_DECODE_URL`, `CANNOT_MERGE_INCOMPATIBLE_DATA_TYPE`, `CANNOT_PARSE_DECIMAL`, `CANNOT_READ_FILE_FOOTER`, `CANNOT_RECOGNIZE_HIVE_TYPE`

2023-06-06 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 61e6227fb62 [SPARK-43962][SQL] Improve error messages: 
`CANNOT_DECODE_URL`, `CANNOT_MERGE_INCOMPATIBLE_DATA_TYPE`, 
`CANNOT_PARSE_DECIMAL`, `CANNOT_READ_FILE_FOOTER`, `CANNOT_RECOGNIZE_HIVE_TYPE`
61e6227fb62 is described below

commit 61e6227fb62c2452b01ac595c2bc43d4492686a0
Author: itholic 
AuthorDate: Tue Jun 6 10:25:24 2023 +0300

[SPARK-43962][SQL] Improve error messages: `CANNOT_DECODE_URL`, 
`CANNOT_MERGE_INCOMPATIBLE_DATA_TYPE`, `CANNOT_PARSE_DECIMAL`, 
`CANNOT_READ_FILE_FOOTER`, `CANNOT_RECOGNIZE_HIVE_TYPE`

### What changes were proposed in this pull request?

This PR proposes to improve error messages for `CANNOT_DECODE_URL`, 
`CANNOT_MERGE_INCOMPATIBLE_DATA_TYPE`, `CANNOT_PARSE_DECIMAL`, 
`CANNOT_READ_FILE_FOOTER`, `CANNOT_RECOGNIZE_HIVE_TYPE`.

**NOTE:** This PR is experimental work that utilizes an LLM to enhance 
error messages. The script was created using the `openai` Python library from 
OpenAI, and the author conducted only minimal review after executing it. 
The five improved error messages were selected by the author.

### Why are the changes needed?

To make the errors more actionable and usable.
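
For context, the following hedged Scala sketch shows one common way the 
improved `CANNOT_MERGE_INCOMPATIBLE_DATA_TYPE` message can surface: Parquet 
schema merging across files whose column types conflict. The paths and the 
object name are illustrative, not taken from this patch.

import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hedged repro sketch: two Parquet writes give column "c" incompatible types
// (bigint vs. string), so reading with mergeSchema=true fails schema merging.
object MergeIncompatibleRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("repro").getOrCreate()
    import spark.implicits._
    val dir = Files.createTempDirectory("merge-repro").toString
    Seq(1L, 2L).toDF("c").write.parquet(s"$dir/part=1")
    Seq("a", "b").toDF("c").write.parquet(s"$dir/part=2")
    try {
      spark.read.option("mergeSchema", "true").parquet(dir).printSchema()
    } catch {
      case e: Exception => println(e.getMessage) // expect the incompatible-merge error
    }
    spark.stop()
  }
}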

### Does this PR introduce _any_ user-facing change?

No API changes, only error message improvement.

### How was this patch tested?

The existing CI should pass.

Closes #41455 from itholic/emi_1-5.

Authored-by: itholic 
Signed-off-by: Max Gekk 
---
 core/src/main/resources/error/error-classes.json | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/core/src/main/resources/error/error-classes.json 
b/core/src/main/resources/error/error-classes.json
index bceea072e92..de80415d85b 100644
--- a/core/src/main/resources/error/error-classes.json
+++ b/core/src/main/resources/error/error-classes.json
@@ -114,7 +114,7 @@
   },
   "CANNOT_DECODE_URL" : {
 "message" : [
-  "Cannot decode url : ."
+  "The provided URL cannot be decoded: . Please ensure that the URL 
is properly formatted and try again."
 ],
 "sqlState" : "22546"
   },
@@ -130,7 +130,7 @@
   },
   "CANNOT_MERGE_INCOMPATIBLE_DATA_TYPE" : {
 "message" : [
-  "Failed to merge incompatible data types  and ."
+  "Failed to merge incompatible data types  and . Please 
check the data types of the columns being merged and ensure that they are 
compatible. If necessary, consider casting the columns to compatible data types 
before attempting the merge."
 ],
 "sqlState" : "42825"
   },
@@ -153,7 +153,7 @@
   },
   "CANNOT_PARSE_DECIMAL" : {
 "message" : [
-  "Cannot parse decimal."
+  "Cannot parse decimal. Please ensure that the input is a valid number 
with optional decimal point or comma separators."
 ],
 "sqlState" : "22018"
   },
@@ -176,12 +176,12 @@
   },
   "CANNOT_READ_FILE_FOOTER" : {
 "message" : [
-  "Could not read footer for file: ."
+  "Could not read footer for file: . Please ensure that the file is 
in either ORC or Parquet format. If not, please convert it to a valid format. 
If the file is in the valid format, please check if it is corrupt. If it is, 
you can choose to either ignore it or fix the corruption."
 ]
   },
   "CANNOT_RECOGNIZE_HIVE_TYPE" : {
 "message" : [
-  "Cannot recognize hive type string: , column: ."
+  "Cannot recognize hive type string: , column: . 
The specified data type for the field cannot be recognized by Spark SQL. Please 
check the data type of the specified field and ensure that it is a valid Spark 
SQL data type. Refer to the Spark SQL documentation for a list of valid data 
types and their format. If the data type is correct, please ensure that you are 
using a supported version of Spark SQL."
 ],
 "sqlState" : "429BB"
   },


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (1df1d7661a3 -> d0fe6d4b796)

2023-06-06 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 1df1d7661a3 [SPARK-43516][ML][PYTHON] Update MLv2 Transformer 
interfaces
 add d0fe6d4b796 [SPARK-43948][SQL] Assign names to the error class 
_LEGACY_ERROR_TEMP_[0050|0057|0058|0059]

No new revisions were added by this update.

Summary of changes:
 core/src/main/resources/error/error-classes.json   | 47 +-
 .../spark/sql/errors/QueryParsingErrors.scala  | 15 ---
 .../spark/sql/catalyst/parser/DDLParserSuite.scala |  2 +-
 .../spark/sql/execution/SparkSqlParser.scala   |  2 +-
 .../command/v2/AlterTableReplaceColumnsSuite.scala | 17 +++-
 .../org/apache/spark/sql/sources/InsertSuite.scala | 12 +++---
 6 files changed, 61 insertions(+), 34 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (51a919ea8d6 -> 1df1d7661a3)

2023-06-06 Thread weichenxu123
This is an automated email from the ASF dual-hosted git repository.

weichenxu123 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 51a919ea8d6 [SPARK-43973][SS][UI] Structured Streaming UI should 
display failed queries correctly
 add 1df1d7661a3 [SPARK-43516][ML][PYTHON] Update MLv2 Transformer 
interfaces

No new revisions were added by this update.

Summary of changes:
 python/pyspark/mlv2/base.py   | 24 -
 python/pyspark/mlv2/feature.py| 16 +++---
 python/pyspark/mlv2/tests/test_feature.py |  6 ++
 python/pyspark/mlv2/util.py   | 35 ---
 4 files changed, 51 insertions(+), 30 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org