[spark] branch master updated: [SPARK-44802][INFRA] Token based ASF JIRA authentication

2023-08-14 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c36d54a569e [SPARK-44802][INFRA] Token based ASF JIRA authentication
c36d54a569e is described below

commit c36d54a569e59e823aab83f47b97ae286fd7f4c4
Author: Kent Yao 
AuthorDate: Tue Aug 15 13:39:16 2023 +0800

[SPARK-44802][INFRA] Token based ASF JIRA authentication

### What changes were proposed in this pull request?

This PR adds an environment variable, JIRA_ACCESS_TOKEN, to the merge script to enable token-based authentication.

### Why are the changes needed?

Tokens are more secure and can easily be revoked or expired.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

1. locally tested

```python
>>> JIRA_ACCESS_TOKEN is not None
True
>>> asf_jira.issue("SPARK-44801")

>>> if JIRA_ACCESS_TOKEN is not None:
...     asf_jira = jira.client.JIRA(jira_server, token_auth=JIRA_ACCESS_TOKEN)
... else:
...     asf_jira = jira.client.JIRA(
...         jira_server, basic_auth=(JIRA_USERNAME, JIRA_PASSWORD)
...     )
...
>>> asf_jira.issue("SPARK-44801")

```

2. merged https://github.com/apache/spark/pull/42470 with this

Closes #42484 from yaooqinn/SPARK-44802.

Authored-by: Kent Yao 
Signed-off-by: Kent Yao 
---
 dev/merge_spark_pr.py | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/dev/merge_spark_pr.py b/dev/merge_spark_pr.py
index c96415c7aeb..0fd42e237a4 100755
--- a/dev/merge_spark_pr.py
+++ b/dev/merge_spark_pr.py
@@ -52,6 +52,11 @@ PUSH_REMOTE_NAME = os.environ.get("PUSH_REMOTE_NAME", "apache")
 JIRA_USERNAME = os.environ.get("JIRA_USERNAME", "")
 # ASF JIRA password
 JIRA_PASSWORD = os.environ.get("JIRA_PASSWORD", "")
+# ASF JIRA access token
+# If it is configured, username and password are dismissed
+# Go to https://issues.apache.org/jira/secure/ViewProfile.jspa -> Personal Access Tokens for
+# your own token management.
+JIRA_ACCESS_TOKEN = os.environ.get("JIRA_ACCESS_TOKEN")
 # OAuth key used for issuing requests against the GitHub API. If this is not defined, then requests
 # will be unauthenticated. You should only need to configure this if you find yourself regularly
 # exceeding your IP's unauthenticated request rate limit. You can create an OAuth key at
@@ -238,9 +243,12 @@ def cherry_pick(pr_num, merge_hash, default_branch):
 
 
 def resolve_jira_issue(merge_branches, comment, default_jira_id=""):
-    asf_jira = jira.client.JIRA(
-        {"server": JIRA_API_BASE}, basic_auth=(JIRA_USERNAME, JIRA_PASSWORD)
-    )
+    jira_server = {"server": JIRA_API_BASE}
+
+    if JIRA_ACCESS_TOKEN is not None:
+        asf_jira = jira.client.JIRA(jira_server, token_auth=JIRA_ACCESS_TOKEN)
+    else:
+        asf_jira = jira.client.JIRA(jira_server, basic_auth=(JIRA_USERNAME, JIRA_PASSWORD))
 
     jira_id = input("Enter a JIRA id [%s]: " % default_jira_id)
     if jira_id == "":





[spark] branch master updated: [SPARK-44798][BUILD] Fix Scala 2.13 mima check after SPARK-44705 merged

2023-08-14 Thread yangjie01
This is an automated email from the ASF dual-hosted git repository.

yangjie01 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9f897190bdb [SPARK-44798][BUILD] Fix Scala 2.13 mima check after 
SPARK-44705 merged
9f897190bdb is described below

commit 9f897190bdb29dd80e85bc0926ebacb70825b4ed
Author: yangjie01 
AuthorDate: Tue Aug 15 11:44:17 2023 +0800

[SPARK-44798][BUILD] Fix Scala 2.13 mima check after SPARK-44705 merged

### What changes were proposed in this pull request?
This PR adds a new `ProblemFilters` entry to `MimaExcludes.scala` to fix the MiMa check for Scala 2.13 after SPARK-44705 was merged.
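
A minimal sketch of the kind of entry this adds, assuming the rule suggested by the MiMa error output shown later in this message (the object and list names below are placeholders, not the actual layout of Spark's `MimaExcludes.scala`):

```scala
object MimaExcludesSketch {
  import com.typesafe.tools.mima.core._

  // SPARK-44705 changed the constructor of the internal class BasePythonRunner#ReaderIterator,
  // so the reported binary incompatibility is a false positive that can be filtered out.
  lazy val excludes = Seq(
    ProblemFilters.exclude[IncompatibleMethTypeProblem](
      "org.apache.spark.api.python.BasePythonRunner#ReaderIterator.this")
  )
}
```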

### Why are the changes needed?
Scala 2.13's daily tests have been failing the mima check for several days:

- https://github.com/apache/spark/actions/runs/5849050388/job/15856979487

(screenshot: https://github.com/apache/spark/assets/1475305/fa4f859a-293b-429b-a030-a332e6a587d3)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass GitHub Actions
- Manual verification:

1.  The mima check was passing before SPARK-44705
```
// [SPARK-44765][CONNECT] Simplify retries of ReleaseExecute
git reset --hard 9bde882fcb39e9fedced0df9702df2a36c1a84e6
dev/change-scala-version.sh 2.13
dev/mima -Pscala-2.13
```

```
[success] Total time: 48 s, completed 2023-8-14 13:02:11
```

2. The mima check failed after SPARK-44705 was merged

```
// [SPARK-44705][PYTHON] Make PythonRunner single-threaded
git reset --hard 8aaff55839493e80e3ce376f928c04aa8f31d18c
dev/change-scala-version.sh 2.13
dev/mima -Pscala-2.13
```
```
[error] spark-core: Failed binary compatibility check against org.apache.spark:spark-core_2.13:3.4.0! Found 1 potential problems (filtered 4019)
[error]  * method this(org.apache.spark.api.python.BasePythonRunner,java.io.DataInputStream,org.apache.spark.api.python.BasePythonRunner#WriterThread,Long,org.apache.spark.SparkEnv,java.net.Socket,scala.Option,java.util.concurrent.atomic.AtomicBoolean,org.apache.spark.TaskContext)Unit in class org.apache.spark.api.python.BasePythonRunner#ReaderIterator's type is different in current version, where it is (org.apache.spark.api.python.BasePythonRunner,java.io.DataInputStream,org.apache.s [...]
[error]    filter with: ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.api.python.BasePythonRunner#ReaderIterator.this")
[error] java.lang.RuntimeException: Failed binary compatibility check against org.apache.spark:spark-core_2.13:3.4.0! Found 1 potential problems (filtered 4019)
[error] at scala.sys.package$.error(package.scala:30)
[error] at com.typesafe.tools.mima.plugin.SbtMima$.reportModuleErrors(SbtMima.scala:89)
[error] at com.typesafe.tools.mima.plugin.MimaPlugin$.$anonfun$projectSettings$2(MimaPlugin.scala:36)
[error] at com.typesafe.tools.mima.plugin.MimaPlugin$.$anonfun$projectSettings$2$adapted(MimaPlugin.scala:26)
[error] at scala.collection.Iterator.foreach(Iterator.scala:943)
[error] at scala.collection.Iterator.foreach$(Iterator.scala:943)
[error] at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
[error] at com.typesafe.tools.mima.plugin.MimaPlugin$.$anonfun$projectSettings$1(MimaPlugin.scala:26)
[error] at com.typesafe.tools.mima.plugin.MimaPlugin$.$anonfun$projectSettings$1$adapted(MimaPlugin.scala:25)
[error] at scala.Function1.$anonfun$compose$1(Function1.scala:49)
[error] at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:63)
[error] at sbt.std.Transform$$anon$4.work(Transform.scala:69)
[error] at sbt.Execute.$anonfun$submit$2(Execute.scala:283)
[error] at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:24)
[error] at sbt.Execute.work(Execute.scala:292)
[error] at sbt.Execute.$anonfun$submit$1(Execute.scala:283)
[error] at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:265)
[error] at sbt.CompletionService$$anon$2.call(CompletionService.scala:65)
[error] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[error] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[error] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[error] at java.lang.Thread.run(Thread.java:750)
[error] (core / 

[spark] branch master updated (c9ff7025399 -> dc5fdb4aa1c)

2023-08-14 Thread yangjie01
This is an automated email from the ASF dual-hosted git repository.

yangjie01 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from c9ff7025399 [SPARK-43885][SQL][FOLLOWUP] Instruction#dataType should 
not fail
 add dc5fdb4aa1c [SPARK-44796][BUILD][CONNECT] Remove `grpc-java` plugin 
related configuration from the `connect/connect-client-jvm` module

No new revisions were added by this update.

Summary of changes:
 project/SparkBuild.scala | 19 ++-
 1 file changed, 2 insertions(+), 17 deletions(-)





[spark] branch branch-3.5 updated: [SPARK-44719][SQL][3.5] Fix NoClassDefFoundError when using Hive UDF

2023-08-14 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new e12a79b6ebe [SPARK-44719][SQL][3.5] Fix NoClassDefFoundError when 
using Hive UDF
e12a79b6ebe is described below

commit e12a79b6ebe0bd5ef9220e0e0979d6579638f3ae
Author: Yuming Wang 
AuthorDate: Tue Aug 15 11:42:23 2023 +0800

[SPARK-44719][SQL][3.5] Fix NoClassDefFoundError when using Hive UDF

### What changes were proposed in this pull request?

This PR partially reverts SPARK-43225. It adds jackson-core-asl and jackson-mapper-asl back to the pre-built distributions.

### Why are the changes needed?

Fix `NoClassDefFoundError` when using Hive UDF:
```
spark-sql (default)> add jar /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar;
Time taken: 0.413 seconds
spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as 'net.petrabarus.hiveudfs.LongToIP';
Time taken: 0.038 seconds
spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10);
23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT long_to_ip(2130706433L) FROM range(10)]
java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory
at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
...
```
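
As an illustrative check only (not part of the patch), the class named in the stack trace can be resolved from `spark-shell` once the two codehaus Jackson jars are back in the distribution:

```scala
// Hedged sketch: verifies that org.codehaus.jackson.map.type.TypeFactory (shipped in
// jackson-mapper-asl, added back together with jackson-core-asl) is on the classpath,
// and reports which jar provided it.
val clazz = Class.forName("org.codehaus.jackson.map.type.TypeFactory")
println(s"Loaded ${clazz.getName} from ${clazz.getProtectionDomain.getCodeSource.getLocation}")
```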

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

manual test.

Closes #42447 from wangyum/SPARK-44719-branch-3.5.

Authored-by: Yuming Wang 
Signed-off-by: Kent Yao 
---
 core/pom.xml  |  8 
 dev/deps/spark-deps-hadoop-3-hive-2.3 |  2 ++
 pom.xml   | 13 +++--
 sql/hive-thriftserver/pom.xml |  9 -
 4 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/core/pom.xml b/core/pom.xml
index 674005c96ee..eb4a563f1f3 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -481,6 +481,14 @@
       <groupId>commons-logging</groupId>
       <artifactId>commons-logging</artifactId>
     </dependency>
+    <dependency>
+      <groupId>org.codehaus.jackson</groupId>
+      <artifactId>jackson-mapper-asl</artifactId>
+    </dependency>
+    <dependency>
+      <groupId>org.codehaus.jackson</groupId>
+      <artifactId>jackson-core-asl</artifactId>
+    </dependency>
     <dependency>
       <groupId>com.fasterxml.jackson.core</groupId>
       <artifactId>jackson-core</artifactId>
diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 
b/dev/deps/spark-deps-hadoop-3-hive-2.3
index d182f39f882..14af8ea340f 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -100,11 +100,13 @@ ini4j/0.5.4//ini4j-0.5.4.jar
 istack-commons-runtime/3.0.8//istack-commons-runtime-3.0.8.jar
 ivy/2.5.1//ivy-2.5.1.jar
 jackson-annotations/2.15.2//jackson-annotations-2.15.2.jar
+jackson-core-asl/1.9.13//jackson-core-asl-1.9.13.jar
 jackson-core/2.15.2//jackson-core-2.15.2.jar
 jackson-databind/2.15.2//jackson-databind-2.15.2.jar
 jackson-dataformat-cbor/2.15.2//jackson-dataformat-cbor-2.15.2.jar
 jackson-dataformat-yaml/2.15.2//jackson-dataformat-yaml-2.15.2.jar
 jackson-datatype-jsr310/2.15.2//jackson-datatype-jsr310-2.15.2.jar
+jackson-mapper-asl/1.9.13//jackson-mapper-asl-1.9.13.jar
 jackson-module-scala_2.12/2.15.2//jackson-module-scala_2.12-2.15.2.jar
 jakarta.annotation-api/1.3.5//jakarta.annotation-api-1.3.5.jar
 jakarta.inject/2.6.1//jakarta.inject-2.6.1.jar
diff --git a/pom.xml b/pom.xml
index 9438a112c24..d0c3bfc361a 100644
--- a/pom.xml
+++ b/pom.xml
@@ -1321,6 +1321,10 @@
         <groupId>asm</groupId>
         <artifactId>asm</artifactId>
       </exclusion>
+      <exclusion>
+        <groupId>org.codehaus.jackson</groupId>
+        <artifactId>jackson-mapper-asl</artifactId>
+      </exclusion>
       <exclusion>
         <groupId>org.ow2.asm</groupId>
         <artifactId>asm</artifactId>
@@ -1821,12 +1825,17 @@
       </exclusion>
     </exclusions>
   </dependency>
-
+  <dependency>
+    <groupId>org.codehaus.jackson</groupId>
+    <artifactId>jackson-core-asl</artifactId>
+    <version>${codehaus.jackson.version}</version>
+    <scope>${hadoop.deps.scope}</scope>
+  </dependency>
   <dependency>
     <groupId>org.codehaus.jackson</groupId>
     <artifactId>jackson-mapper-asl</artifactId>
     <version>${codehaus.jackson.version}</version>
-    <scope>test</scope>
+    <scope>${hadoop.deps.scope}</scope>
   </dependency>
   <dependency>
     <groupId>${hive.group}</groupId>
diff --git a/sql/hive-thriftserver/pom.xml b/sql/hive-thriftserver/pom.xml
index 7bcfb8e1908..3659a0f846a 100644
--- a/sql/hive-thriftserver/pom.xml
+++ b/sql/hive-thriftserver/pom.xml
@@ -147,15 +147,6 @@
   org.apache.httpcomponents
   httpcore
 
-
-
-  org.codehaus.jackson
-  jackson-mapper-asl
-  test
-
   
   
 
target/scala-${scala.binary.version}/classes





[spark] branch branch-3.5 updated: [SPARK-44796][BUILD][CONNECT] Remove `grpc-java` plugin related configuration from the `connect/connect-client-jvm` module

2023-08-14 Thread yangjie01
This is an automated email from the ASF dual-hosted git repository.

yangjie01 pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 9336a155e0a [SPARK-44796][BUILD][CONNECT] Remove `grpc-java` plugin 
related configuration from the `connect/connect-client-jvm` module
9336a155e0a is described below

commit 9336a155e0a7321be50580d5fb5351c291209166
Author: yangjie01 
AuthorDate: Tue Aug 15 11:42:25 2023 +0800

[SPARK-44796][BUILD][CONNECT] Remove `grpc-java` plugin related 
configuration from the `connect/connect-client-jvm` module

### What changes were proposed in this pull request?
This PR removes the `grpc-java` plugin related configuration from the `connect/connect-client-jvm` module in the SBT build, because the code generation related to `grpc-java` is handled by the `connect-common` module. This PR does not touch the Maven `pom.xml` files because the corresponding Maven modules never had that configuration.

### Why are the changes needed?
Clean up unnecessary sbt configurations.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass GitHub Actions

Closes #42477 from LuciferYang/SPARK-44796.

Authored-by: yangjie01 
Signed-off-by: yangjie01 
(cherry picked from commit dc5fdb4aa1cd8ae727fc6d9dce434cbd1f0ddafc)
Signed-off-by: yangjie01 
---
 project/SparkBuild.scala | 19 ++-
 1 file changed, 2 insertions(+), 17 deletions(-)

diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala
index bd65d3c4bd4..331ebe0c05a 100644
--- a/project/SparkBuild.scala
+++ b/project/SparkBuild.scala
@@ -774,7 +774,6 @@ object SparkConnect {
 SbtPomKeys.effectivePom.value.getProperties.get(
   "guava.failureaccess.version").asInstanceOf[String]
   Seq(
-"io.grpc" % "protoc-gen-grpc-java" % BuildCommons.gprcVersion 
asProtocPlugin(),
 "com.google.guava" % "guava" % guavaVersion,
 "com.google.guava" % "failureaccess" % guavaFailureaccessVersion,
 "com.google.protobuf" % "protobuf-java" % protoVersion % "protobuf"
@@ -840,14 +839,7 @@ object SparkConnect {
   case m if m.toLowerCase(Locale.ROOT).endsWith(".proto") => MergeStrategy.discard
   case _ => MergeStrategy.first
 }
-  ) ++ {
-    Seq(
-      (Compile / PB.targets) := Seq(
-        PB.gens.java -> (Compile / sourceManaged).value,
-        PB.gens.plugin("grpc-java") -> (Compile / sourceManaged).value
-      )
-    )
-  }
+  )
 }
 
 object SparkConnectClient {
@@ -924,14 +916,7 @@ object SparkConnectClient {
   case m if m.toLowerCase(Locale.ROOT).endsWith(".proto") => MergeStrategy.discard
   case _ => MergeStrategy.first
 }
-  ) ++ {
-    Seq(
-      (Compile / PB.targets) := Seq(
-        PB.gens.java -> (Compile / sourceManaged).value,
-        PB.gens.plugin("grpc-java") -> (Compile / sourceManaged).value
-      )
-    )
-  }
+  )
 }
 
 object SparkProtobuf {





[spark] branch branch-3.5 updated: [SPARK-43885][SQL][FOLLOWUP] Instruction#dataType should not fail

2023-08-14 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 4fd5e770209 [SPARK-43885][SQL][FOLLOWUP] Instruction#dataType should 
not fail
4fd5e770209 is described below

commit 4fd5e770209ee4eb1c3eb5c0210588362cb53966
Author: Wenchen Fan 
AuthorDate: Tue Aug 15 10:55:57 2023 +0800

[SPARK-43885][SQL][FOLLOWUP] Instruction#dataType should not fail

### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/41448. As an optimizer rule, the produced plan should be resolved, and resolved expressions should be able to report their data type. The `Instruction` expression fails to report a data type and may break external optimizer rules. This PR makes it return a dummy NullType instead.

### Why are the changes needed?

To avoid breaking external optimizer rules.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #42482 from cloud-fan/merge.

Authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
(cherry picked from commit c9ff70253999bf396b07eec84a3a86ca3191efa3)
Signed-off-by: Wenchen Fan 
---
 .../org/apache/spark/sql/catalyst/plans/logical/MergeRows.scala   | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/MergeRows.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/MergeRows.scala
index ecca758..9b1c8bc733a 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/MergeRows.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/MergeRows.scala
@@ -21,7 +21,7 @@ import org.apache.spark.sql.catalyst.expressions.{Attribute, 
AttributeSet, Expre
 import org.apache.spark.sql.catalyst.plans.logical.MergeRows.{Instruction, 
ROW_ID}
 import org.apache.spark.sql.catalyst.trees.UnaryLike
 import org.apache.spark.sql.catalyst.util.truncatedString
-import org.apache.spark.sql.types.DataType
+import org.apache.spark.sql.types.{DataType, NullType}
 
 case class MergeRows(
 isSourceRowPresent: Expression,
@@ -74,7 +74,11 @@ object MergeRows {
 def condition: Expression
 def outputs: Seq[Seq[Expression]]
 override def nullable: Boolean = false
-    override def dataType: DataType = throw new UnsupportedOperationException("dataType")
+    // We return NullType here as only the `MergeRows` operator can contain `Instruction`
+    // expressions and it doesn't care about the data type. Some external optimizer rules may
+    // assume optimized plan is always resolved and Expression#dataType is always available, so
+    // we can't just fail here.
+    override def dataType: DataType = NullType
   }
 
   case class Keep(condition: Expression, output: Seq[Expression]) extends 
Instruction {





[spark] branch master updated (46580ab4cb0 -> c9ff7025399)

2023-08-14 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 46580ab4cb0 [SPARK-43380][SQL] Revert `Fix Avro data type conversion 
issues`
 add c9ff7025399 [SPARK-43885][SQL][FOLLOWUP] Instruction#dataType should 
not fail

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/sql/catalyst/plans/logical/MergeRows.scala   | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)





[spark] branch branch-3.5 updated: [SPARK-43380][SQL] Revert `Fix Avro data type conversion issues`

2023-08-14 Thread gengliang
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 24bf4297f82 [SPARK-43380][SQL] Revert `Fix Avro data type conversion 
issues`
24bf4297f82 is described below

commit 24bf4297f8280b1db42026197282542863420b98
Author: zeruibao 
AuthorDate: Mon Aug 14 19:36:40 2023 -0700

[SPARK-43380][SQL] Revert `Fix Avro data type conversion issues`

### What changes were proposed in this pull request?
Revert my last PR https://github.com/apache/spark/pull/41052, which caused an AVRO read performance regression because it changed the code structure.

### Why are the changes needed?
Remove performance regression

### How was this patch tested?
Unit test

Closes #42458 from zeruibao/revert-avro-change.

Authored-by: zeruibao 
Signed-off-by: Gengliang Wang 
(cherry picked from commit 46580ab4cb02390ba71dace1235015749f048fff)
Signed-off-by: Gengliang Wang 
---
 .../src/main/resources/error/error-classes.json|  10 -
 .../apache/spark/sql/avro/AvroDeserializer.scala   | 456 +
 .../org/apache/spark/sql/avro/AvroSuite.scala  | 161 
 docs/sql-error-conditions.md   |  14 -
 docs/sql-migration-guide.md|   1 -
 .../spark/sql/errors/QueryCompilationErrors.scala  |  30 --
 .../org/apache/spark/sql/internal/SQLConf.scala|  12 -
 7 files changed, 189 insertions(+), 495 deletions(-)

diff --git a/common/utils/src/main/resources/error/error-classes.json 
b/common/utils/src/main/resources/error/error-classes.json
index 74542f2b914..5fdb2fbe77e 100644
--- a/common/utils/src/main/resources/error/error-classes.json
+++ b/common/utils/src/main/resources/error/error-classes.json
@@ -69,16 +69,6 @@
   }
 }
   },
-  "AVRO_INCORRECT_TYPE" : {
-"message" : [
-  "Cannot convert Avro  to SQL  because the original 
encoded data type is , however you're trying to read the field as 
, which would lead to an incorrect answer. To allow reading this 
field, enable the SQL configuration: ."
-]
-  },
-  "AVRO_LOWER_PRECISION" : {
-"message" : [
-  "Cannot convert Avro  to SQL  because the original 
encoded data type is , however you're trying to read the field as 
, which leads to data being read as null. Please provide a wider 
decimal type to get the correct result. To allow reading null to this field, 
enable the SQL configuration: ."
-]
-  },
   "BATCH_METADATA_NOT_FOUND" : {
 "message" : [
   "Unable to find batch ."
diff --git 
a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
 
b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
index d4d34a891e9..a78ee89a3e9 100644
--- 
a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
+++ 
b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
@@ -35,9 +35,8 @@ import 
org.apache.spark.sql.catalyst.expressions.{SpecificInternalRow, UnsafeArr
 import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, ArrayData, 
DateTimeUtils, GenericArrayData}
 import org.apache.spark.sql.catalyst.util.DateTimeConstants.MILLIS_PER_DAY
 import org.apache.spark.sql.catalyst.util.RebaseDateTime.RebaseSpec
-import org.apache.spark.sql.errors.QueryCompilationErrors
 import org.apache.spark.sql.execution.datasources.DataSourceUtils
-import org.apache.spark.sql.internal.{LegacyBehaviorPolicy, SQLConf}
+import org.apache.spark.sql.internal.LegacyBehaviorPolicy
 import org.apache.spark.sql.types._
 import org.apache.spark.unsafe.types.UTF8String
 
@@ -118,268 +117,178 @@ private[sql] class AvroDeserializer(
 val incompatibleMsg = errorPrefix +
 s"schema is incompatible (avroType = $avroType, sqlType = 
${catalystType.sql})"
 
-val confKey = SQLConf.LEGACY_AVRO_ALLOW_INCOMPATIBLE_SCHEMA
-val preventReadingIncorrectType = !SQLConf.get.getConf(confKey)
+(avroType.getType, catalystType) match {
+  case (NULL, NullType) => (updater, ordinal, _) =>
+updater.setNullAt(ordinal)
 
-val logicalDataType = SchemaConverters.toSqlType(avroType).dataType
-avroType.getType match {
-  case NULL =>
-(logicalDataType, catalystType) match {
-  case (_, NullType) => (updater, ordinal, _) =>
-updater.setNullAt(ordinal)
-  case _ => throw new IncompatibleSchemaException(incompatibleMsg)
-}
   // TODO: we can avoid boxing if future version of avro provide primitive 
accessors.
-  case BOOLEAN =>
-(logicalDataType, catalystType) match {
-  case (_, BooleanType) => (updater, ordinal, value) =>
-updater.setBoolean(ordinal, value.asInstanceOf[Boolean])
-  case _ => throw new IncompatibleSchemaException(incompatibleMsg)
-}

[spark] branch master updated: [SPARK-43380][SQL] Revert `Fix Avro data type conversion issues`

2023-08-14 Thread gengliang
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 46580ab4cb0 [SPARK-43380][SQL] Revert `Fix Avro data type conversion 
issues`
46580ab4cb0 is described below

commit 46580ab4cb02390ba71dace1235015749f048fff
Author: zeruibao 
AuthorDate: Mon Aug 14 19:36:40 2023 -0700

[SPARK-43380][SQL] Revert `Fix Avro data type conversion issues`

### What changes were proposed in this pull request?
Revert my last PR https://github.com/apache/spark/pull/41052, which caused an AVRO read performance regression because it changed the code structure.

### Why are the changes needed?
Remove performance regression

### How was this patch tested?
Unit test

Closes #42458 from zeruibao/revert-avro-change.

Authored-by: zeruibao 
Signed-off-by: Gengliang Wang 
---
 .../src/main/resources/error/error-classes.json|  10 -
 .../apache/spark/sql/avro/AvroDeserializer.scala   | 456 +
 .../org/apache/spark/sql/avro/AvroSuite.scala  | 161 
 docs/sql-error-conditions.md   |  14 -
 docs/sql-migration-guide.md|   1 -
 .../spark/sql/errors/QueryCompilationErrors.scala  |  30 --
 .../org/apache/spark/sql/internal/SQLConf.scala|  12 -
 7 files changed, 189 insertions(+), 495 deletions(-)

diff --git a/common/utils/src/main/resources/error/error-classes.json 
b/common/utils/src/main/resources/error/error-classes.json
index 08f79bcecbb..6ce24bc3b90 100644
--- a/common/utils/src/main/resources/error/error-classes.json
+++ b/common/utils/src/main/resources/error/error-classes.json
@@ -75,16 +75,6 @@
   }
 }
   },
-  "AVRO_INCORRECT_TYPE" : {
-"message" : [
-  "Cannot convert Avro  to SQL  because the original 
encoded data type is , however you're trying to read the field as 
, which would lead to an incorrect answer. To allow reading this 
field, enable the SQL configuration: ."
-]
-  },
-  "AVRO_LOWER_PRECISION" : {
-"message" : [
-  "Cannot convert Avro  to SQL  because the original 
encoded data type is , however you're trying to read the field as 
, which leads to data being read as null. Please provide a wider 
decimal type to get the correct result. To allow reading null to this field, 
enable the SQL configuration: ."
-]
-  },
   "BATCH_METADATA_NOT_FOUND" : {
 "message" : [
   "Unable to find batch ."
diff --git 
a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
 
b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
index d4d34a891e9..a78ee89a3e9 100644
--- 
a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
+++ 
b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
@@ -35,9 +35,8 @@ import 
org.apache.spark.sql.catalyst.expressions.{SpecificInternalRow, UnsafeArr
 import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, ArrayData, 
DateTimeUtils, GenericArrayData}
 import org.apache.spark.sql.catalyst.util.DateTimeConstants.MILLIS_PER_DAY
 import org.apache.spark.sql.catalyst.util.RebaseDateTime.RebaseSpec
-import org.apache.spark.sql.errors.QueryCompilationErrors
 import org.apache.spark.sql.execution.datasources.DataSourceUtils
-import org.apache.spark.sql.internal.{LegacyBehaviorPolicy, SQLConf}
+import org.apache.spark.sql.internal.LegacyBehaviorPolicy
 import org.apache.spark.sql.types._
 import org.apache.spark.unsafe.types.UTF8String
 
@@ -118,268 +117,178 @@ private[sql] class AvroDeserializer(
 val incompatibleMsg = errorPrefix +
 s"schema is incompatible (avroType = $avroType, sqlType = 
${catalystType.sql})"
 
-val confKey = SQLConf.LEGACY_AVRO_ALLOW_INCOMPATIBLE_SCHEMA
-val preventReadingIncorrectType = !SQLConf.get.getConf(confKey)
+(avroType.getType, catalystType) match {
+  case (NULL, NullType) => (updater, ordinal, _) =>
+updater.setNullAt(ordinal)
 
-val logicalDataType = SchemaConverters.toSqlType(avroType).dataType
-avroType.getType match {
-  case NULL =>
-(logicalDataType, catalystType) match {
-  case (_, NullType) => (updater, ordinal, _) =>
-updater.setNullAt(ordinal)
-  case _ => throw new IncompatibleSchemaException(incompatibleMsg)
-}
   // TODO: we can avoid boxing if future version of avro provide primitive 
accessors.
-  case BOOLEAN =>
-(logicalDataType, catalystType) match {
-  case (_, BooleanType) => (updater, ordinal, value) =>
-updater.setBoolean(ordinal, value.asInstanceOf[Boolean])
-  case _ => throw new IncompatibleSchemaException(incompatibleMsg)
-}
+  case (BOOLEAN, BooleanType) => (updater, ordinal, value) =>
+updater.setBoolean(ordinal, 

[spark] branch branch-3.5 updated: [SPARK-43063][SQL][FOLLOWUP] Add a space between -> and value when first value is null

2023-08-14 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 247948194a0 [SPARK-43063][SQL][FOLLOWUP] Add a space between -> and 
value when first value is null
247948194a0 is described below

commit 247948194a08c6d090bfd26daf6b63a68374e6f2
Author: Andy Grove 
AuthorDate: Tue Aug 15 10:12:55 2023 +0800

[SPARK-43063][SQL][FOLLOWUP] Add a space between -> and value when first 
value is null

As noted by cloud-fan 
https://github.com/apache/spark/pull/41432#discussion_r1242593592, 
https://github.com/apache/spark/pull/41432 fixed a formatting issue when 
casting map to string but did not fix it in the codegen case.

This PR fixes the issue and updates a unit test to test the codegen path. 
Without the fix, the test fails with:

```
- SPARK-22973 Cast map to string *** FAILED ***
  Incorrect evaluation (fallback mode = CODEGEN_ONLY): cast(map(keys: 
[1,2,3], values: [null,[B5c3fd9f3,[B70f84210]) as string),
  actual: {1 ->null, 2 -> a, 3 -> c},
expected: {1 -> null, 2 -> a, 3 -> c} (ExpressionEvalHelper.scala:270)
```
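
For reference, a minimal sketch (assuming a plain `spark-shell` with the default, non-legacy cast behavior) of what the updated test pins down, namely that the codegen path also emits a space after `->` when the first value is null:

```scala
// Sketch only: illustrative values, not the exact literals used in CastSuiteBase.
val df = spark.sql("SELECT CAST(map(1, CAST(NULL AS STRING), 2, 'a') AS STRING) AS s")
df.show(truncate = false)
// Expected after the fix: {1 -> null, 2 -> a}  (note the space between "->" and "null")
```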

Closes #42434 from andygrove/space-before-null-cast-codegen.

Authored-by: Andy Grove 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 7e52169433575a7df164368106fa3f6c73e4233e)
Signed-off-by: Wenchen Fan 
---
 .../org/apache/spark/sql/catalyst/expressions/ToStringBase.scala | 2 +-
 .../org/apache/spark/sql/catalyst/expressions/CastSuiteBase.scala| 5 +++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ToStringBase.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ToStringBase.scala
index f903863bec6..1eac386d873 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ToStringBase.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ToStringBase.scala
@@ -352,7 +352,7 @@ trait ToStringBase { self: UnaryExpression with TimeZoneAwareExpression =>
|  $buffer.append($keyToStringFunc($getMapFirstKey));
|  $buffer.append(" ->");
|  if ($map.valueArray().isNullAt(0)) {
-   |${appendNull(buffer, isFirstElement = true)}
+   |${appendNull(buffer, isFirstElement = false)}
|  } else {
|$buffer.append(" ");
|$buffer.append($valueToStringFunc($getMapFirstValue));
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuiteBase.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuiteBase.scala
index 34f87f940a7..0172fd9b3e4 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuiteBase.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuiteBase.scala
@@ -826,9 +826,10 @@ abstract class CastSuiteBase extends SparkFunSuite with ExpressionEvalHelper {
     val ret1 = cast(Literal.create(Map(1 -> "a", 2 -> "b", 3 -> "c")), StringType)
     checkEvaluation(ret1, s"${lb}1 -> a, 2 -> b, 3 -> c$rb")
     val ret2 = cast(
-      Literal.create(Map("1" -> "a".getBytes, "2" -> null, "3" -> "c".getBytes)),
+      Literal.create(Map("1" -> null, "2" -> "a".getBytes, "3" -> null, "4" -> "c".getBytes)),
       StringType)
-    checkEvaluation(ret2, s"${lb}1 -> a, 2 ->${if (legacyCast) "" else " null"}, 3 -> c$rb")
+    val nullStr = if (legacyCast) "" else " null"
+    checkEvaluation(ret2, s"${lb}1 ->$nullStr, 2 -> a, 3 ->$nullStr, 4 -> c$rb")
     val ret3 = cast(
       Literal.create(Map(
         1 -> Date.valueOf("2014-12-03"),





[spark] branch master updated (420e6878c68 -> 7e521694335)

2023-08-14 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 420e6878c68 [SPARK-43780][SQL] Support correlated references in join 
predicates for scalar and lateral subqueries
 add 7e521694335 [SPARK-43063][SQL][FOLLOWUP] Add a space between -> and 
value when first value is null

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/sql/catalyst/expressions/ToStringBase.scala | 2 +-
 .../org/apache/spark/sql/catalyst/expressions/CastSuiteBase.scala| 5 +++--
 2 files changed, 4 insertions(+), 3 deletions(-)





[spark-website] branch asf-site updated: Add note on generative tooling to developer tools

2023-08-14 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/spark-website.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new fc89ca1ed2 Add note on generative tooling to developer tools
fc89ca1ed2 is described below

commit fc89ca1ed20551c66dc31ebbe28664d12689bd13
Author: zero323 
AuthorDate: Mon Aug 14 21:04:28 2023 -0500

Add note on generative tooling to developer tools

This PR adds a note on generative tooling and a link to the relevant ASF policy.

As requested in comments to https://github.com/apache/spark/pull/42469

Author: zero323 

Closes #472 from zero323/SPARK-44782-generative-tooling-notes.
---
 developer-tools.md| 9 +
 site/developer-tools.html | 9 +
 2 files changed, 18 insertions(+)

diff --git a/developer-tools.md b/developer-tools.md
index e0a1844ae7..73e708116e 100644
--- a/developer-tools.md
+++ b/developer-tools.md
@@ -549,3 +549,12 @@ When running Spark tests through SBT, add `javaOptions in 
Test += "-agentpath:/p
 to `SparkBuild.scala` to launch the tests with the YourKit profiler agent 
enabled.  
 The platform-specific paths to the profiler agents are listed in the
 <a href="https://www.yourkit.com/docs/java/help/agent.jsp">YourKit documentation</a>.
+
+Generative tooling usage
+
+In general, the ASF allows contributions co-authored using generative AI tools. However, there are several considerations when you submit a patch containing generated content.
+
+Foremost, you are required to disclose usage of such tool. Furthermore, you are responsible for ensuring that the terms and conditions of the tool in question are
+compatible with usage in an Open Source project and inclusion of the generated content doesn't pose a risk of copyright violation.
+
+Please refer to <a href="https://www.apache.org/legal/generative-tooling.html">The ASF Generative Tooling Guidance</a> for details and developments.
diff --git a/site/developer-tools.html b/site/developer-tools.html
index a43786ff91..de94619481 100644
--- a/site/developer-tools.html
+++ b/site/developer-tools.html
@@ -657,6 +657,15 @@ to SparkBuild.scala to
 The platform-specific paths to the profiler agents are listed in the
 <a href="https://www.yourkit.com/docs/java/help/agent.jsp">YourKit documentation</a>.
 
+Generative tooling usage
+
+In general, the ASF allows contributions co-authored using generative AI tools. However, there are several considerations when you submit a patch containing generated content.
+
+Foremost, you are required to disclose usage of such tool. Furthermore, you are responsible for ensuring that the terms and conditions of the tool in question are
+compatible with usage in an Open Source project and inclusion of the generated content doesn’t pose a risk of copyright violation.
+
+Please refer to <a href="https://www.apache.org/legal/generative-tooling.html">The ASF Generative Tooling Guidance</a> for details and developments.
+
 
 
   





[GitHub] [spark-website] srowen closed pull request #472: Add note on generative tooling to developer tools

2023-08-14 Thread via GitHub


srowen closed pull request #472: Add note on generative tooling to developer 
tools
URL: https://github.com/apache/spark-website/pull/472





[spark] branch master updated: [SPARK-43780][SQL] Support correlated references in join predicates for scalar and lateral subqueries

2023-08-14 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 420e6878c68 [SPARK-43780][SQL] Support correlated references in join 
predicates for scalar and lateral subqueries
420e6878c68 is described below

commit 420e6878c68f9ff68171cc6d43ca95d69c49eac4
Author: Andrey Gubichev 
AuthorDate: Tue Aug 15 09:52:12 2023 +0800

[SPARK-43780][SQL] Support correlated references in join predicates for 
scalar and lateral subqueries

### What changes were proposed in this pull request?

This PR adds support to subqueries that involve joins with correlated 
references in join predicates, e.g.

```
select * from t0 join lateral (select * from t1 join t2 on t1a = t2a and 
t1a = t0a);
```

(full example in https://issues.apache.org/jira/browse/SPARK-43780)

Currently we only handle scalar and lateral subqueries.

### Why are the changes needed?

This is valid SQL that is not yet supported by Spark SQL.

### Does this PR introduce _any_ user-facing change?

Yes, previously unsupported queries become supported.

### How was this patch tested?

Query and unit tests

Closes #41301 from agubichev/spark-43780-corr-predicate.

Authored-by: Andrey Gubichev 
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/analysis/CheckAnalysis.scala  |   1 +
 .../catalyst/optimizer/DecorrelateInnerQuery.scala |  92 --
 .../org/apache/spark/sql/internal/SQLConf.scala|  10 ++
 .../optimizer/DecorrelateInnerQuerySuite.scala | 133 -
 .../analyzer-results/join-lateral.sql.out  |  66 ++
 .../scalar-subquery-predicate.sql.out  |  81 +
 .../scalar-subquery/scalar-subquery-set-op.sql.out |  80 +
 .../resources/sql-tests/inputs/join-lateral.sql|   5 +
 .../scalar-subquery/scalar-subquery-predicate.sql  |  20 
 .../scalar-subquery/scalar-subquery-set-op.sql |  20 
 .../sql-tests/results/join-lateral.sql.out |  39 ++
 .../scalar-subquery-predicate.sql.out  |  37 ++
 .../scalar-subquery/scalar-subquery-set-op.sql.out |  32 +
 13 files changed, 605 insertions(+), 11 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
index 48c38a9bd4c..c7346809f3f 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
@@ -1173,6 +1173,7 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog with QueryErrorsB
 def canHostOuter(plan: LogicalPlan): Boolean = plan match {
   case _: Filter => true
   case _: Project => usingDecorrelateInnerQueryFramework
+  case _: Join => usingDecorrelateInnerQueryFramework
   case _ => false
 }
 
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala
index 86fa78e96a5..a3e264579f4 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala
@@ -804,11 +804,73 @@ object DecorrelateInnerQuery extends PredicateHelper {
 (d.copy(child = newChild), joinCond, outerReferenceMap)
 
   case j @ Join(left, right, joinType, condition, _) =>
-val outerReferences = collectOuterReferences(j.expressions)
-// Join condition containing outer references is not supported.
-assert(outerReferences.isEmpty, s"Correlated column is not allowed 
in join: $j")
-val newOuterReferences = parentOuterReferences ++ outerReferences
-val shouldPushToLeft = joinType match {
+// Given 'condition', computes the tuple of
+// (correlated, uncorrelated, equalityCond, predicates, 
equivalences).
+// 'correlated' and 'uncorrelated' are the conjuncts with (resp. 
without)
+// outer (correlated) references. Furthermore, correlated 
conjuncts are split
+// into 'equalityCond' (those that are equalities) and all rest 
('predicates').
+// 'equivalences' track equivalent attributes given 'equalityCond'.
+// The split is only performed if 'shouldDecorrelatePredicates' is 
true.
+// The input parameter 'isInnerJoin' is set to true for INNER 
joins and helps
+// determine whether some predicates can be lifted up from the 
join (this 

[spark] branch master updated (1d562904e4e -> f7002fb25ca)

2023-08-14 Thread ueshin
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 1d562904e4e [SPARK-44795][CONNECT] CodeGenerator Cache should be 
classloader specific
 add f7002fb25ca [SPARK-44749][PYTHON][FOLLOWUP][TESTS] Add more tests for 
named arguments in Python UDTF

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/functions.py   |  2 +-
 python/pyspark/sql/tests/test_udtf.py | 88 +++
 2 files changed, 80 insertions(+), 10 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.5 updated: [SPARK-44795][CONNECT] CodeGenerator Cache should be classloader specific

2023-08-14 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 76e61c44aa2 [SPARK-44795][CONNECT] CodeGenerator Cache should be 
classloader specific
76e61c44aa2 is described below

commit 76e61c44aa28e12d363d86a401fd581bf3b054c1
Author: Herman van Hovell 
AuthorDate: Tue Aug 15 03:10:42 2023 +0200

[SPARK-44795][CONNECT] CodeGenerator Cache should be classloader specific

### What changes were proposed in this pull request?
When you currently use a REPL-generated class in a UDF, you can get an error saying that that class is not equal to that class. This error is thrown in a code-generated class. The problem is that the classes have been loaded by different classloaders. We cache generated code and use the textual code as the cache key. The problem with this is that in Spark Connect users are free to supply user classes that can have arbitrary names, a name can point to an entirely different class, or it  [...]

There are roughly two ways how this problem can arise:
1. Two sessions use the same class names. This is particularly easy when you use the REPL, because it always generates the same names.
2. You run in single-process mode. In this case whole-stage codegen will test-compile the class using a different classloader than the 'executor', while sharing the same code generator cache (see the sketch after this list).
### Why are the changes needed?
We want to be able to use REPL (and other) generated classes in UDFs.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
I added a test to the `ReplE2ESuite`.

Closes #42478 from hvanhovell/SPARK-44795.

Authored-by: Herman van Hovell 
Signed-off-by: Herman van Hovell 
(cherry picked from commit 1d562904e4e75aec3ea8d4999ede0183fda326c7)
Signed-off-by: Herman van Hovell 
---
 .../spark/sql/application/ReplE2ESuite.scala   |  9 
 .../spark/sql/catalyst/encoders/OuterScopes.scala  | 49 +-
 .../expressions/codegen/CodeGenerator.scala| 30 ++---
 3 files changed, 54 insertions(+), 34 deletions(-)

diff --git 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/application/ReplE2ESuite.scala
 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/application/ReplE2ESuite.scala
index 0e69b5afa45..0cab66eef3d 100644
--- 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/application/ReplE2ESuite.scala
+++ 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/application/ReplE2ESuite.scala
@@ -276,4 +276,13 @@ class ReplE2ESuite extends RemoteSparkSession with BeforeAndAfterEach {
     val output = runCommandsInShell(input)
     assertContains("Array[MyTestClass] = Array(MyTestClass(1), MyTestClass(3))", output)
   }
+
+  test("REPL class in UDF") {
+    val input = """
+        |case class MyTestClass(value: Int)
+        |spark.range(2).map(i => MyTestClass(i.toInt)).collect()
+      """.stripMargin
+    val output = runCommandsInShell(input)
+    assertContains("Array[MyTestClass] = Array(MyTestClass(0), MyTestClass(1))", output)
+  }
 }
diff --git 
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/encoders/OuterScopes.scala
 
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/encoders/OuterScopes.scala
index c2ac504c846..6c10e8ece80 100644
--- 
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/encoders/OuterScopes.scala
+++ 
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/encoders/OuterScopes.scala
@@ -26,28 +26,9 @@ import org.apache.spark.util.SparkClassUtils
 
 object OuterScopes {
   private[this] val queue = new ReferenceQueue[AnyRef]
-  private class HashableWeakReference(v: AnyRef) extends 
WeakReference[AnyRef](v, queue) {
-private[this] val hash = v.hashCode()
-override def hashCode(): Int = hash
-override def equals(obj: Any): Boolean = {
-  obj match {
-case other: HashableWeakReference =>
-  // Note that referential equality is used to identify & purge
-  // references from the map whose' referent went out of scope.
-  if (this eq other) {
-true
-  } else {
-val referent = get()
-val otherReferent = other.get()
-referent != null && otherReferent != null && 
Objects.equals(referent, otherReferent)
-  }
-case _ => false
-  }
-}
-  }
 
   private def classLoaderRef(c: Class[_]): HashableWeakReference = {
-new HashableWeakReference(c.getClassLoader)
+new HashableWeakReference(c.getClassLoader, queue)
   }
 
   private[this] val outerScopes = {
@@ -154,3 +135,31 @@ object OuterScopes {
   // e.g. `ammonite.$sess.cmd8$Helper$Foo` -> 
`ammonite.$sess.cmd8.instance.Foo`
   private[this] val AmmoniteREPLClass = 

[spark] branch master updated: [SPARK-44795][CONNECT] CodeGenerator Cache should be classloader specific

2023-08-14 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 1d562904e4e [SPARK-44795][CONNECT] CodeGenerator Cache should be 
classloader specific
1d562904e4e is described below

commit 1d562904e4e75aec3ea8d4999ede0183fda326c7
Author: Herman van Hovell 
AuthorDate: Tue Aug 15 03:10:42 2023 +0200

[SPARK-44795][CONNECT] CodeGenerator Cache should be classloader specific

### What changes were proposed in this pull request?
When you currently use a REPL-generated class in a UDF, you can get an error saying that that class is not equal to that class. This error is thrown in a code-generated class. The problem is that the classes have been loaded by different classloaders. We cache generated code and use the textual code as the cache key. The problem with this is that in Spark Connect users are free to supply user classes that can have arbitrary names, a name can point to an entirely different class, or it  [...]

There are roughly two ways how this problem can arise:
1. Two sessions use the same class names. This is particularly easy when you use the REPL, because it always generates the same names.
2. You run in single-process mode. In this case whole-stage codegen will test-compile the class using a different classloader than the 'executor', while sharing the same code generator cache.

### Why are the changes needed?
We want to be able to use REPL (and other) generated classes in UDFs.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
I added a test to the `ReplE2ESuite`.

Closes #42478 from hvanhovell/SPARK-44795.

Authored-by: Herman van Hovell 
Signed-off-by: Herman van Hovell 
---
 .../spark/sql/application/ReplE2ESuite.scala   |  9 
 .../spark/sql/catalyst/encoders/OuterScopes.scala  | 49 +-
 .../expressions/codegen/CodeGenerator.scala| 30 ++---
 3 files changed, 54 insertions(+), 34 deletions(-)

diff --git 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/application/ReplE2ESuite.scala
 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/application/ReplE2ESuite.scala
index 0e69b5afa45..0cab66eef3d 100644
--- 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/application/ReplE2ESuite.scala
+++ 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/application/ReplE2ESuite.scala
@@ -276,4 +276,13 @@ class ReplE2ESuite extends RemoteSparkSession with 
BeforeAndAfterEach {
 val output = runCommandsInShell(input)
 assertContains("Array[MyTestClass] = Array(MyTestClass(1), 
MyTestClass(3))", output)
   }
+
+  test("REPL class in UDF") {
+val input = """
+|case class MyTestClass(value: Int)
+|spark.range(2).map(i => MyTestClass(i.toInt)).collect()
+  """.stripMargin
+val output = runCommandsInShell(input)
+assertContains("Array[MyTestClass] = Array(MyTestClass(0), 
MyTestClass(1))", output)
+  }
 }
diff --git 
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/encoders/OuterScopes.scala
 
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/encoders/OuterScopes.scala
index c2ac504c846..6c10e8ece80 100644
--- 
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/encoders/OuterScopes.scala
+++ 
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/encoders/OuterScopes.scala
@@ -26,28 +26,9 @@ import org.apache.spark.util.SparkClassUtils
 
 object OuterScopes {
   private[this] val queue = new ReferenceQueue[AnyRef]
-  private class HashableWeakReference(v: AnyRef) extends 
WeakReference[AnyRef](v, queue) {
-private[this] val hash = v.hashCode()
-override def hashCode(): Int = hash
-override def equals(obj: Any): Boolean = {
-  obj match {
-case other: HashableWeakReference =>
-  // Note that referential equality is used to identify & purge
-  // references from the map whose' referent went out of scope.
-  if (this eq other) {
-true
-  } else {
-val referent = get()
-val otherReferent = other.get()
-referent != null && otherReferent != null && 
Objects.equals(referent, otherReferent)
-  }
-case _ => false
-  }
-}
-  }
 
   private def classLoaderRef(c: Class[_]): HashableWeakReference = {
-new HashableWeakReference(c.getClassLoader)
+new HashableWeakReference(c.getClassLoader, queue)
   }
 
   private[this] val outerScopes = {
@@ -154,3 +135,31 @@ object OuterScopes {
   // e.g. `ammonite.$sess.cmd8$Helper$Foo` -> 
`ammonite.$sess.cmd8.instance.Foo`
   private[this] val AmmoniteREPLClass = 
"""^(ammonite\.\$sess\.cmd(?:\d+)\$).*""".r
 }
+
+/**
+ * A [[WeakReference]] that has a stable hash-key. When the referent is still 

[GitHub] [spark-website] zero323 commented on pull request #472: Add note on generative tooling to developer tools

2023-08-14 Thread via GitHub


zero323 commented on PR #472:
URL: https://github.com/apache/spark-website/pull/472#issuecomment-1677700165

   cc @srowen @zhengruifeng 





[GitHub] [spark-website] zero323 opened a new pull request, #472: Add note on generative tooling to developer tools

2023-08-14 Thread via GitHub


zero323 opened a new pull request, #472:
URL: https://github.com/apache/spark-website/pull/472

   This PR adds notes on generative tooling and link to the relevant ASF policy.
   
   As requested in comments to https://github.com/apache/spark/pull/42469





[spark] branch master updated: [SPARK-44749][SQL][PYTHON] Support named arguments in Python UDTF

2023-08-14 Thread ueshin
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d4629563492 [SPARK-44749][SQL][PYTHON] Support named arguments in 
Python UDTF
d4629563492 is described below

commit d4629563492ec3090b5bd5b924790507c42f4e86
Author: Takuya UESHIN 
AuthorDate: Mon Aug 14 08:57:55 2023 -0700

[SPARK-44749][SQL][PYTHON] Support named arguments in Python UDTF

### What changes were proposed in this pull request?

Supports named arguments in Python UDTF.

For example:

```py
>>> @udtf(returnType="a: int")
... class TestUDTF:
...     def eval(self, a, b):
...         yield a,
...
>>> spark.udtf.register("test_udtf", TestUDTF)

>>> TestUDTF(a=lit(10), b=lit("x")).show()
+---+
|  a|
+---+
| 10|
+---+

>>> TestUDTF(b=lit("x"), a=lit(10)).show()
+---+
|  a|
+---+
| 10|
+---+

>>> spark.sql("SELECT * FROM test_udtf(a=>10, b=>'x')").show()
+---+
|  a|
+---+
| 10|
+---+

>>> spark.sql("SELECT * FROM test_udtf(b=>'x', a=>10)").show()
+---+
|  a|
+---+
| 10|
+---+
```

or:

```py
>>> @udtf
... class TestUDTF:
...     @staticmethod
...     def analyze(**kwargs: AnalyzeArgument) -> AnalyzeResult:
...         return AnalyzeResult(
...             StructType(
...                 [StructField(key, arg.data_type) for key, arg in sorted(kwargs.items())]
...             )
...         )
...     def eval(self, **kwargs):
...         yield tuple(value for _, value in sorted(kwargs.items()))
...
>>> spark.udtf.register("test_udtf", TestUDTF)

>>> spark.sql("SELECT * FROM test_udtf(a=>10, b=>'x', x=>100.0)").show()
+---+---+-+
|  a|  b|x|
+---+---+-+
| 10|  x|100.0|
+---+---+-+

>>> spark.sql("SELECT * FROM test_udtf(x=>10, a=>'x', z=>100.0)").show()
+---+---+-+
|  a|  x|z|
+---+---+-+
|  x| 10|100.0|
+---+---+-+
```

### Why are the changes needed?

Named arguments are now supported (https://github.com/apache/spark/pull/41796, https://github.com/apache/spark/pull/42020).

They should be supported in Python UDTF as well.

### Does this PR introduce _any_ user-facing change?

Yes, named arguments will be available for Python UDTF.

### How was this patch tested?

Added related tests.

Closes #42422 from ueshin/issues/SPARK-44749/kwargs.

Authored-by: Takuya UESHIN 
Signed-off-by: Takuya UESHIN 
---
 .../main/protobuf/spark/connect/expressions.proto  |   9 ++
 .../sql/connect/planner/SparkConnectPlanner.scala  |   7 ++
 python/pyspark/sql/column.py   |  14 +++
 python/pyspark/sql/connect/expressions.py  |  20 
 .../pyspark/sql/connect/proto/expressions_pb2.py   | 122 +++--
 .../pyspark/sql/connect/proto/expressions_pb2.pyi  |  34 ++
 python/pyspark/sql/connect/udtf.py |  18 +--
 python/pyspark/sql/functions.py|  38 +++
 python/pyspark/sql/tests/test_udtf.py  |  88 +++
 python/pyspark/sql/udtf.py |  36 --
 python/pyspark/sql/worker/analyze_udtf.py  |  20 +++-
 python/pyspark/worker.py   |  29 -
 .../plans/logical/FunctionBuilderBase.scala|  67 +++
 .../execution/python/ArrowEvalPythonUDTFExec.scala |   5 +-
 .../execution/python/ArrowPythonUDTFRunner.scala   |   8 +-
 .../execution/python/BatchEvalPythonUDTFExec.scala |  30 +++--
 .../sql/execution/python/EvalPythonUDTFExec.scala  |  33 --
 .../python/UserDefinedPythonFunction.scala |  42 +--
 18 files changed, 472 insertions(+), 148 deletions(-)

diff --git 
a/connector/connect/common/src/main/protobuf/spark/connect/expressions.proto 
b/connector/connect/common/src/main/protobuf/spark/connect/expressions.proto
index 557b9db9123..b222f663cd0 100644
--- a/connector/connect/common/src/main/protobuf/spark/connect/expressions.proto
+++ b/connector/connect/common/src/main/protobuf/spark/connect/expressions.proto
@@ -47,6 +47,7 @@ message Expression {
 UnresolvedNamedLambdaVariable unresolved_named_lambda_variable = 14;
 CommonInlineUserDefinedFunction common_inline_user_defined_function = 15;
 CallFunction call_function = 16;
+NamedArgumentExpression named_argument_expression = 17;
 
 // This field is used to mark extensions to the protocol. When plugins 
generate arbitrary
 // relations they can add them here. During the planning the correct 
resolution is done.
@@ -380,3 +381,11 @@ message CallFunction {
   // (Optional) Function arguments. Empty arguments are allowed.
   

[spark] branch master updated: [SPARK-44783][SQL][TESTS] Checks arrays as named and positional parameters

2023-08-14 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0fc85a24a15 [SPARK-44783][SQL][TESTS] Checks arrays as named and 
positional parameters
0fc85a24a15 is described below

commit 0fc85a24a1531c74e7bc5b2d0a36635580a395a6
Author: Max Gekk 
AuthorDate: Mon Aug 14 18:32:58 2023 +0800

[SPARK-44783][SQL][TESTS] Checks arrays as named and positional parameters

### What changes were proposed in this pull request?
In the PR, I propose to add a new test which checks arrays as named and positional parameters.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By running the modified test suite:
```
$ build/sbt "test:testOnly *ParametersSuite"
```

Closes #42470 from MaxGekk/sql-parameterized-by-array.

Authored-by: Max Gekk 
Signed-off-by: Kent Yao 
---
 .../org/apache/spark/sql/ParametersSuite.scala | 27 ++
 1 file changed, 27 insertions(+)

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/ParametersSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/ParametersSuite.scala
index a72c9a600ad..6310a5a50e0 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/ParametersSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/ParametersSuite.scala
@@ -502,4 +502,31 @@ class ParametersSuite extends QueryTest with SharedSparkSession {
         start = 24,
         stop = 36))
   }
+
+  test("SPARK-44783: arrays as parameters") {
+    checkAnswer(
+      spark.sql("SELECT array_position(:arrParam, 'abc')", Map("arrParam" -> Array.empty[String])),
+      Row(0))
+    checkAnswer(
+      spark.sql("SELECT array_position(?, 0.1D)", Array(Array.empty[Double])),
+      Row(0))
+    checkAnswer(
+      spark.sql("SELECT array_contains(:arrParam, 10)", Map("arrParam" -> Array(10, 20, 30))),
+      Row(true))
+    checkAnswer(
+      spark.sql("SELECT array_contains(?, ?)", Array(Array("a", "b", "c"), "b")),
+      Row(true))
+    checkAnswer(
+      spark.sql("SELECT :arr[1]", Map("arr" -> Array(10, 20, 30))),
+      Row(20))
+    checkAnswer(
+      spark.sql("SELECT ?[?]", Array(Array(1f, 2f, 3f), 0)),
+      Row(1f))
+    checkAnswer(
+      spark.sql("SELECT :arr[0][1]", Map("arr" -> Array(Array(1, 2), Array(20), Array.empty[Int]))),
+      Row(2))
+    checkAnswer(
+      spark.sql("SELECT ?[?][?]", Array(Array(Array(1f, 2f), Array.empty[Float], Array(3f)), 0, 1)),
+      Row(2f))
+  }
 }

