(spark) tag v4.0.0-preview1-rc2 created (now 7cfe5a6e44e8)

2024-05-24 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to tag v4.0.0-preview1-rc2
in repository https://gitbox.apache.org/repos/asf/spark.git


  at 7cfe5a6e44e8 (commit)
This tag includes the following new commits:

 new 7cfe5a6e44e8 Preparing Spark release v4.0.0-preview1-rc2

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.



-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) 01/01: Preparing Spark release v4.0.0-preview1-rc2

2024-05-24 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to tag v4.0.0-preview1-rc2
in repository https://gitbox.apache.org/repos/asf/spark.git

commit 7cfe5a6e44e8d7079ae29ad3e2cee7231cd3dc66
Author: Wenchen Fan 
AuthorDate: Fri May 24 18:53:15 2024 +

Preparing Spark release v4.0.0-preview1-rc2
---
 R/pkg/R/sparkR.R   | 4 ++--
 assembly/pom.xml   | 2 +-
 common/kvstore/pom.xml | 2 +-
 common/network-common/pom.xml  | 2 +-
 common/network-shuffle/pom.xml | 2 +-
 common/network-yarn/pom.xml| 2 +-
 common/sketch/pom.xml  | 2 +-
 common/tags/pom.xml| 2 +-
 common/unsafe/pom.xml  | 2 +-
 common/utils/pom.xml   | 2 +-
 common/variant/pom.xml | 2 +-
 connector/avro/pom.xml | 2 +-
 connector/connect/client/jvm/pom.xml   | 2 +-
 connector/connect/common/pom.xml   | 2 +-
 connector/connect/server/pom.xml   | 2 +-
 connector/docker-integration-tests/pom.xml | 2 +-
 connector/kafka-0-10-assembly/pom.xml  | 2 +-
 connector/kafka-0-10-sql/pom.xml   | 2 +-
 connector/kafka-0-10-token-provider/pom.xml| 2 +-
 connector/kafka-0-10/pom.xml   | 2 +-
 connector/kinesis-asl-assembly/pom.xml | 2 +-
 connector/kinesis-asl/pom.xml  | 2 +-
 connector/profiler/pom.xml | 2 +-
 connector/protobuf/pom.xml | 2 +-
 connector/spark-ganglia-lgpl/pom.xml   | 2 +-
 core/pom.xml   | 2 +-
 docs/_config.yml   | 6 +++---
 examples/pom.xml   | 2 +-
 graphx/pom.xml | 2 +-
 hadoop-cloud/pom.xml   | 2 +-
 launcher/pom.xml   | 2 +-
 mllib-local/pom.xml| 2 +-
 mllib/pom.xml  | 2 +-
 pom.xml| 2 +-
 python/pyspark/version.py  | 2 +-
 repl/pom.xml   | 2 +-
 resource-managers/kubernetes/core/pom.xml  | 2 +-
 resource-managers/kubernetes/integration-tests/pom.xml | 2 +-
 resource-managers/yarn/pom.xml | 2 +-
 sql/api/pom.xml| 2 +-
 sql/catalyst/pom.xml   | 2 +-
 sql/core/pom.xml   | 2 +-
 sql/hive-thriftserver/pom.xml  | 2 +-
 sql/hive/pom.xml   | 2 +-
 streaming/pom.xml  | 2 +-
 tools/pom.xml  | 2 +-
 46 files changed, 49 insertions(+), 49 deletions(-)

diff --git a/R/pkg/R/sparkR.R b/R/pkg/R/sparkR.R
index 0be7e5da24d2..478acf514ef3 100644
--- a/R/pkg/R/sparkR.R
+++ b/R/pkg/R/sparkR.R
@@ -456,8 +456,8 @@ sparkR.session <- function(
 
   # Check if version number of SparkSession matches version number of SparkR 
package
   jvmVersion <- callJMethod(sparkSession, "version")
-  # Remove -SNAPSHOT from jvm versions
-  jvmVersionStrip <- gsub("-SNAPSHOT", "", jvmVersion, fixed = TRUE)
+  # Remove -preview1 from jvm versions
+  jvmVersionStrip <- gsub("-preview1", "", jvmVersion, fixed = TRUE)
   rPackageVersion <- paste0(packageVersion("SparkR"))
 
   if (jvmVersionStrip != rPackageVersion) {
diff --git a/assembly/pom.xml b/assembly/pom.xml
index 58e7ae5bb0c7..417e7c23ca9f 100644
--- a/assembly/pom.xml
+++ b/assembly/pom.xml
@@ -21,7 +21,7 @@
   
 org.apache.spark
 spark-parent_2.13
-4.0.0-SNAPSHOT
+4.0.0-preview1
 ../pom.xml
   
 
diff --git a/common/kvstore/pom.xml b/common/kvstore/pom.xml
index 046648e9c2ae..e1a4497387a2 100644
--- a/common/kvstore/pom.xml
+++ b/common/kvstore/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.13
-4.0.0-SNAPSHOT
+4.0.0-preview1
 ../../pom.xml
   
 
diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml
index cdb5bd72158a..d8dff6996cec 100644
--- a/common/network-common/pom.xml
+++ b/common/network-common/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.13
-4.0.0-SNAPSHOT
+4.0.0-preview1
 ../../pom.xml
   
 
diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml
index 0f7036ef

(spark) tag v4.0.0-preview-rc1 deleted (was 9fec87d16a04)

2024-05-24 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to tag v4.0.0-preview-rc1
in repository https://gitbox.apache.org/repos/asf/spark.git


*** WARNING: tag v4.0.0-preview-rc1 was deleted! ***

 was 9fec87d16a04 Preparing Spark release v4.0.0-preview-rc1

This change permanently discards the following revisions:

 discard 9fec87d16a04 Preparing Spark release v4.0.0-preview-rc1


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[jira] [Assigned] (SPARK-48364) Type casting for AbstractMapType

2024-05-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48364:
---

Assignee: Uroš Bojanić

> Type casting for AbstractMapType
> 
>
> Key: SPARK-48364
> URL: https://issues.apache.org/jira/browse/SPARK-48364
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48364) Type casting for AbstractMapType

2024-05-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48364.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46661
[https://github.com/apache/spark/pull/46661]

> Type casting for AbstractMapType
> 
>
> Key: SPARK-48364
> URL: https://issues.apache.org/jira/browse/SPARK-48364
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-48364][SQL] Add AbstractMapType type casting and fix RaiseError parameter map to work with collated strings

2024-05-22 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6be3560f3c89 [SPARK-48364][SQL] Add AbstractMapType type casting and 
fix RaiseError parameter map to work with collated strings
6be3560f3c89 is described below

commit 6be3560f3c89e212e850a0788d24a7c0755ea35b
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Wed May 22 05:21:23 2024 -0700

[SPARK-48364][SQL] Add AbstractMapType type casting and fix RaiseError 
parameter map to work with collated strings

### What changes were proposed in this pull request?
Following up on the introduction of AbstractMapType 
(https://github.com/apache/spark/pull/46458) and changes that introduce 
collation awareness for RaiseError expression 
(https://github.com/apache/spark/pull/46461), this PR should add the 
appropriate type casting rules for AbstractMapType.

### Why are the changes needed?
Fix the CI failure for the `Support RaiseError misc expression with 
collation` test when ANSI is off.

### Does this PR introduce _any_ user-facing change?
Yes, type casting is now allowed for map types with collated strings.

### How was this patch tested?
Extended suite `CollationSQLExpressionsANSIOffSuite` with ANSI disabled.

### Was this patch authored or co-authored using generative AI tooling?
No.
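
As a minimal, self-contained illustration of the casting rule this PR adds (a simplified sketch with made-up types, not the actual Spark classes): string types nested inside map keys and values are now rewritten to the target collation, mirroring the existing ArrayType handling in the castStringType diff below.

```scala
// Simplified stand-ins for Spark's DataType hierarchy, for illustration only.
sealed trait DT
case class Str(collationId: Int) extends DT
case class Arr(elem: DT) extends DT
case class M(key: DT, value: DT) extends DT

// Rewrite nested string types to the target collation; None means "no change needed".
def castStringType(dt: DT, target: Str): Option[DT] = dt match {
  case Str(id) if id != target.collationId => Some(target)
  case Arr(e) => castStringType(e, target).map(Arr)
  case M(k, v) =>
    val nk = castStringType(k, target).getOrElse(k)
    val nv = castStringType(v, target).getOrElse(v)
    if (nk != k || nv != v) Some(M(nk, nv)) else None
  case _ => None
}

// A map<string(0), string(0)> cast to collation 1 becomes map<string(1), string(1)>.
assert(castStringType(M(Str(0), Str(0)), Str(1)) == Some(M(Str(1), Str(1))))
```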

Closes #46661 from uros-db/fix-abstract-map.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/analysis/CollationTypeCasts.scala | 15 -
 .../spark/sql/catalyst/analysis/TypeCoercion.scala | 13 +--
 .../spark/sql/catalyst/expressions/misc.scala  |  4 ++--
 .../spark/sql/CollationSQLExpressionsSuite.scala   | 10 +++--
 .../org/apache/spark/sql/CollationSuite.scala  | 25 ++
 5 files changed, 37 insertions(+), 30 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
index a50dad7c8cdb..00abdf4ee19d 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
@@ -25,7 +25,7 @@ import 
org.apache.spark.sql.catalyst.analysis.TypeCoercion.{hasStringType, haveS
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.errors.QueryCompilationErrors
 import org.apache.spark.sql.internal.SQLConf
-import org.apache.spark.sql.types.{ArrayType, DataType, StringType}
+import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StringType}
 
 object CollationTypeCasts extends TypeCoercionRule {
   override val transform: PartialFunction[Expression, Expression] = {
@@ -85,6 +85,11 @@ object CollationTypeCasts extends TypeCoercionRule {
   private def extractStringType(dt: DataType): StringType = dt match {
 case st: StringType => st
 case ArrayType(et, _) => extractStringType(et)
+case MapType(kt, vt, _) => if (hasStringType(kt)) {
+extractStringType(kt)
+  } else {
+extractStringType(vt)
+  }
   }
 
   /**
@@ -102,6 +107,14 @@ object CollationTypeCasts extends TypeCoercionRule {
   case st: StringType if st.collationId != castType.collationId => castType
   case ArrayType(arrType, nullable) =>
 castStringType(arrType, castType).map(ArrayType(_, nullable)).orNull
+  case MapType(keyType, valueType, nullable) =>
+val newKeyType = castStringType(keyType, castType).getOrElse(keyType)
+val newValueType = castStringType(valueType, 
castType).getOrElse(valueType)
+if (newKeyType != keyType || newValueType != valueType) {
+  MapType(newKeyType, newValueType, nullable)
+} else {
+  null
+}
   case _ => null
 }
 Option(ret)
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
index 936bb22baa46..7866f47c28b1 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
@@ -31,7 +31,7 @@ import org.apache.spark.sql.catalyst.trees.AlwaysProcess
 import org.apache.spark.sql.catalyst.types.DataTypeUtils
 import org.apache.spark.sql.errors.QueryCompilationErrors
 import org.apache.spark.sql.internal.SQLConf
-import org.apache.spark.sql.internal.types.{AbstractArrayType, 
AbstractStringType, StringTypeAnyCollation}
+import or

[jira] [Resolved] (SPARK-48215) DateFormatClass (all collations)

2024-05-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48215.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46561
[https://github.com/apache/spark/pull/46561]

> DateFormatClass (all collations)
> 
>
> Key: SPARK-48215
> URL: https://issues.apache.org/jira/browse/SPARK-48215
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Nebojsa Savic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Enable collation support for the *DateFormatClass* built-in function in 
> Spark. First confirm what is the expected behaviour for this expression when 
> given collated strings, and then move on to implementation and testing. You 
> will find this expression in the *datetimeExpressions.scala* file, and it 
> should be considered a pass-through function with respect to collation 
> awareness. Implement the corresponding E2E SQL tests 
> (CollationSQLExpressionsSuite) to reflect how this function should be used 
> with collation in SparkSQL, and feel free to use your chosen Spark SQL Editor 
> to experiment with the existing functions to learn more about how they work. 
> In addition, look into the possible use-cases and implementation of similar 
> functions within other open-source DBMSs, such as 
> [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *DateFormatClass* 
> expression so that it supports all collation types currently supported in 
> Spark. To understand what changes were introduced in order to enable full 
> collation support for other existing functions in Spark, take a look at the 
> Spark PRs and Jira tickets for completed tasks in this parent (for example: 
> Ascii, Chr, Base64, UnBase64, Decode, StringDecode, Encode, ToBinary, 
> FormatNumber, Sentences).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for string 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48215) DateFormatClass (all collations)

2024-05-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48215:
---

Assignee: Nebojsa Savic

> DateFormatClass (all collations)
> 
>
> Key: SPARK-48215
> URL: https://issues.apache.org/jira/browse/SPARK-48215
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Nebojsa Savic
>Priority: Major
>  Labels: pull-request-available
>
> Enable collation support for the *DateFormatClass* built-in function in 
> Spark. First confirm what is the expected behaviour for this expression when 
> given collated strings, and then move on to implementation and testing. You 
> will find this expression in the *datetimeExpressions.scala* file, and it 
> should be considered a pass-through function with respect to collation 
> awareness. Implement the corresponding E2E SQL tests 
> (CollationSQLExpressionsSuite) to reflect how this function should be used 
> with collation in SparkSQL, and feel free to use your chosen Spark SQL Editor 
> to experiment with the existing functions to learn more about how they work. 
> In addition, look into the possible use-cases and implementation of similar 
> functions within other open-source DBMSs, such as 
> [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *DateFormatClass* 
> expression so that it supports all collation types currently supported in 
> Spark. To understand what changes were introduced in order to enable full 
> collation support for other existing functions in Spark, take a look at the 
> Spark PRs and Jira tickets for completed tasks in this parent (for example: 
> Ascii, Chr, Base64, UnBase64, Decode, StringDecode, Encode, ToBinary, 
> FormatNumber, Sentences).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for string 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-48215][SQL] Extending support for collated strings on date_format expression

2024-05-22 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e04d3d7c430a [SPARK-48215][SQL] Extending support for collated strings 
on date_format expression
e04d3d7c430a is described below

commit e04d3d7c430a1fa446f0379680f619b8b14b5eb5
Author: Nebojsa Savic 
AuthorDate: Wed May 22 04:28:06 2024 -0700

[SPARK-48215][SQL] Extending support for collated strings on date_format 
expression

### What changes were proposed in this pull request?
We are extending support for collated strings on date_format function, 
since currently it throws a DATATYPE_MISMATCH exception when collated strings 
are passed as "format" parameter. 
https://docs.databricks.com/en/sql/language-manual/functions/date_format.html

### Why are the changes needed?
Exception is thrown on invocation when collated strings are passed as 
arguments to date_format.

### Does this PR introduce _any_ user-facing change?
No user facing changes, extending support.

### How was this patch tested?
Tests are added with this PR.

### Was this patch authored or co-authored using generative AI tooling?
No.
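
A hedged usage sketch (assuming a running SparkSession named `spark`); with this change a collated "format" argument no longer fails analysis:

```scala
// Both the date and the format string may now carry an explicit collation.
spark.sql("SELECT date_format('2021-01-01', collate('yyyy-MM-dd', 'UNICODE_CI'))").show()
```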

Closes #46561 from nebojsa-db/SPARK-48215.

Authored-by: Nebojsa Savic 
Signed-off-by: Wenchen Fan 
---
 .../catalyst/expressions/datetimeExpressions.scala |  5 ++--
 .../spark/sql/CollationSQLExpressionsSuite.scala   | 32 ++
 2 files changed, 35 insertions(+), 2 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
index 081a42f5608e..8caf8c5d48c2 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
@@ -36,6 +36,7 @@ import org.apache.spark.sql.catalyst.util.DateTimeUtils._
 import org.apache.spark.sql.catalyst.util.LegacyDateFormats.SIMPLE_DATE_FORMAT
 import org.apache.spark.sql.errors.{QueryCompilationErrors, 
QueryExecutionErrors}
 import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.internal.types.StringTypeAnyCollation
 import org.apache.spark.sql.types._
 import org.apache.spark.sql.types.DayTimeIntervalType.DAY
 import org.apache.spark.unsafe.types.{CalendarInterval, UTF8String}
@@ -951,9 +952,9 @@ case class DateFormatClass(left: Expression, right: 
Expression, timeZoneId: Opti
 
   def this(left: Expression, right: Expression) = this(left, right, None)
 
-  override def dataType: DataType = StringType
+  override def dataType: DataType = SQLConf.get.defaultStringType
 
-  override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType, 
StringType)
+  override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType, 
StringTypeAnyCollation)
 
   override def withTimeZone(timeZoneId: String): TimeZoneAwareExpression =
 copy(timeZoneId = Option(timeZoneId))
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
index 0d48f9f0a88d..828245bb3fdd 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
@@ -1600,6 +1600,38 @@ class CollationSQLExpressionsSuite
 })
   }
 
+  test("DateFormat expression with collation") {
+case class DateFormatTestCase[R](date: String, format: String, collation: 
String, result: R)
+val testCases = Seq(
+  DateFormatTestCase("2021-01-01", "-MM-dd", "UTF8_BINARY", 
"2021-01-01"),
+  DateFormatTestCase("2021-01-01", "-dd", "UTF8_BINARY_LCASE", 
"2021-01"),
+  DateFormatTestCase("2021-01-01", "-MM-dd", "UNICODE", "2021-01-01"),
+  DateFormatTestCase("2021-01-01", "", "UNICODE_CI", "2021")
+)
+
+for {
+  collateDate <- Seq(true, false)
+  collateFormat <- Seq(true, false)
+} {
+  testCases.foreach(t => {
+val dateArg = if (collateDate) s"collate('${t.date}', 
'${t.collation}')" else s"'${t.date}'"
+val formatArg =
+  if (collateFormat) {
+s"collate('${t.format}', '${t.collation}')"
+  } else {
+s"'${t.format}'"
+  }
+
+withSQLConf(SqlApiConf.DEFAULT_COLLATION -> t.collation) {
+  val query = s"SELECT date_format(${dateArg}, ${formatArg})"
+

(spark) branch master updated: [SPARK-48031] Decompose viewSchemaMode config, add SHOW CREATE TABLE support

2024-05-20 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 617ac1aec748 [SPARK-48031] Decompose viewSchemaMode config, add SHOW 
CREATE TABLE support
617ac1aec748 is described below

commit 617ac1aec7481d6063af539b02980692e98beb70
Author: Serge Rielau 
AuthorDate: Mon May 20 16:01:24 2024 +0800

[SPARK-48031] Decompose viewSchemaMode config, add SHOW CREATE TABLE support

### What changes were proposed in this pull request?

We separate enablement of the WITH SCHEMA ... clause from the change in default 
from SCHEMA BINDING to SCHEMA COMPENSATION.
This allows users to upgrade in three steps:
1. Enable the feature, and deal with DESCRIBE EXTENDED.
2. Get their affairs in order by running ALTER VIEW ... WITH SCHEMA BINDING for those 
views they aim to keep in that mode (see the sketch after this list)
3. Switch the default.
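
A minimal sketch of step 2 (the view name is hypothetical and a running SparkSession `spark` is assumed):

```scala
// Pin a view to the old behavior, then verify the clause in the new SHOW CREATE TABLE output.
spark.sql("ALTER VIEW my_view WITH SCHEMA BINDING")
spark.sql("SHOW CREATE TABLE my_view").show(truncate = false)
```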

### Why are the changes needed?

It allows customers to upgrade more safely.

### Does this PR introduce _any_ user-facing change?

Yes

### How was this patch tested?

Added more tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46652 from srielau/SPARK-48031-view-evolutiion-part2.

Lead-authored-by: Serge Rielau 
Co-authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
---
 docs/sql-migration-guide.md|   3 +-
 .../sql/catalyst/catalog/SessionCatalog.scala  |   6 +-
 .../spark/sql/catalyst/catalog/interface.scala |   6 +-
 .../spark/sql/catalyst/parser/AstBuilder.scala |  14 +-
 .../org/apache/spark/sql/internal/SQLConf.scala|  26 ++-
 .../spark/sql/execution/command/tables.scala   |   7 +
 .../view-schema-binding-config.sql.out | 166 +--
 .../analyzer-results/view-schema-binding.sql.out   |  24 +--
 .../inputs/view-schema-binding-config.sql  |  52 +++--
 .../sql-tests/inputs/view-schema-binding.sql   |   2 +-
 .../sql-tests/results/charvarchar.sql.out  |   1 +
 .../sql-tests/results/show-create-table.sql.out|   6 +
 .../results/view-schema-binding-config.sql.out | 231 ++---
 .../sql-tests/results/view-schema-binding.sql.out  |  25 +--
 .../apache/spark/sql/execution/SQLViewSuite.scala  |   2 +-
 .../spark/sql/execution/SQLViewTestSuite.scala |   7 +-
 16 files changed, 453 insertions(+), 125 deletions(-)

diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index 15205e9284cd..02a4fae5d262 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -54,7 +54,8 @@ license: |
 - Since Spark 4.0, The default value for 
`spark.sql.legacy.ctePrecedencePolicy` has been changed from `EXCEPTION` to 
`CORRECTED`. Instead of raising an error, inner CTE definitions take precedence 
over outer definitions.
 - Since Spark 4.0, The default value for `spark.sql.legacy.timeParserPolicy` 
has been changed from `EXCEPTION` to `CORRECTED`. Instead of raising an 
`INCONSISTENT_BEHAVIOR_CROSS_VERSION` error, `CANNOT_PARSE_TIMESTAMP` will be 
raised if ANSI mode is enable. `NULL` will be returned if ANSI mode is 
disabled. See [Datetime Patterns for Formatting and 
Parsing](sql-ref-datetime-pattern.html).
 - Since Spark 4.0, A bug falsely allowing `!` instead of `NOT` when `!` is not 
a prefix operator has been fixed. Clauses such as `expr ! IN (...)`, `expr ! 
BETWEEN ...`, or `col ! NULL` now raise syntax errors. To restore the previous 
behavior, set `spark.sql.legacy.bangEqualsNot` to `true`. 
-- Since Spark 4.0, Views allow control over how they react to underlying query 
changes. By default views tolerate column type changes in the query and 
compensate with casts. To restore the previous behavior, allowing up-casts 
only, set `spark.sql.viewSchemaBindingMode` to `DISABLED`. This disables the 
feature and also disallows the `WITH SCHEMA` clause.
+- Since Spark 4.0, By default views tolerate column type changes in the query 
and compensate with casts. To restore the previous behavior, allowing up-casts 
only, set `spark.sql.legacy.viewSchemaCompensation` to `false`.
+- Since Spark 4.0, Views allow control over how they react to underlying query 
changes. By default views tolerate column type changes in the query and 
compensate with casts. To disable this feature, set 
`spark.sql.legacy.viewSchemaBindingMode` to `false`. This also removes the 
clause from `DESCRIBE EXTENDED` and `SHOW CREATE TABLE`.
 
 ## Upgrading from Spark SQL 3.5.1 to 3.5.2
 
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
index 96883afcfc5c..dbf2102a183a 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog

[jira] [Resolved] (SPARK-48305) CurrentLike - Database/Schema, Catalog, User (all collations)

2024-05-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48305.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46613
[https://github.com/apache/spark/pull/46613]

> CurrentLike - Database/Schema, Catalog, User (all collations)
> -
>
> Key: SPARK-48305
> URL: https://issues.apache.org/jira/browse/SPARK-48305
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-48305][SQL] Add collation support for CurrentLike expressions

2024-05-20 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6a17b794338b [SPARK-48305][SQL] Add collation support for CurrentLike 
expressions
6a17b794338b is described below

commit 6a17b794338b0473c11ae17e5c8f1450c0b3f358
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Mon May 20 15:51:23 2024 +0800

[SPARK-48305][SQL] Add collation support for CurrentLike expressions

### What changes were proposed in this pull request?
Introduce collation awareness for CurrentLike expressions: 
current_database/current_schema, current_catalog, 
user/current_user/session_user.

### Why are the changes needed?
Add collation support for CurrentLike expressions in Spark.

### Does this PR introduce _any_ user-facing change?
Yes, users should now be able to use collated strings within arguments for 
CurrentLike functions: current_database/current_schema, current_catalog, 
user/current_user/session_user.

### How was this patch tested?
E2e sql tests.

### Was this patch authored or co-authored using generative AI tooling?
No.
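
A hedged usage sketch (assuming a running SparkSession `spark`): these expressions now return the session default string type instead of a hard-coded UTF8_BINARY string, so they compose with collated comparisons without extra casts:

```scala
// current_schema() yields a string in the session's default collation.
spark.sql("SELECT current_schema() = collate('default', 'UNICODE_CI')").show()
```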

Closes #46613 from uros-db/current-like-expressions.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../org/apache/spark/sql/catalyst/expressions/misc.scala |  6 +++---
 .../spark/sql/catalyst/optimizer/finishAnalysis.scala|  7 ---
 .../apache/spark/sql/CollationSQLExpressionsSuite.scala  | 16 
 3 files changed, 23 insertions(+), 6 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
index eda65ae48f00..e9fa362de14c 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
@@ -200,7 +200,7 @@ object AssertTrue {
   since = "1.6.0",
   group = "misc_funcs")
 case class CurrentDatabase() extends LeafExpression with Unevaluable {
-  override def dataType: DataType = StringType
+  override def dataType: DataType = SQLConf.get.defaultStringType
   override def nullable: Boolean = false
   override def prettyName: String = "current_schema"
   final override val nodePatterns: Seq[TreePattern] = Seq(CURRENT_LIKE)
@@ -219,7 +219,7 @@ case class CurrentDatabase() extends LeafExpression with 
Unevaluable {
   since = "3.1.0",
   group = "misc_funcs")
 case class CurrentCatalog() extends LeafExpression with Unevaluable {
-  override def dataType: DataType = StringType
+  override def dataType: DataType = SQLConf.get.defaultStringType
   override def nullable: Boolean = false
   override def prettyName: String = "current_catalog"
   final override val nodePatterns: Seq[TreePattern] = Seq(CURRENT_LIKE)
@@ -335,7 +335,7 @@ case class TypeOf(child: Expression) extends 
UnaryExpression {
 // scalastyle:on line.size.limit
 case class CurrentUser() extends LeafExpression with Unevaluable {
   override def nullable: Boolean = false
-  override def dataType: DataType = StringType
+  override def dataType: DataType = SQLConf.get.defaultStringType
   override def prettyName: String =
 getTagValue(FunctionRegistry.FUNC_ALIAS).getOrElse("current_user")
   final override val nodePatterns: Seq[TreePattern] = Seq(CURRENT_LIKE)
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala
index 92ac7599a8ff..48753fbfe326 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala
@@ -33,6 +33,7 @@ import org.apache.spark.sql.catalyst.util.DateTimeUtils
 import org.apache.spark.sql.catalyst.util.DateTimeUtils.{convertSpecialDate, 
convertSpecialTimestamp, convertSpecialTimestampNTZ, instantToMicros, 
localDateTimeToMicros}
 import org.apache.spark.sql.catalyst.util.TypeUtils.toSQLExpr
 import org.apache.spark.sql.connector.catalog.CatalogManager
+import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types._
 
 
@@ -151,11 +152,11 @@ case class ReplaceCurrentLike(catalogManager: 
CatalogManager) extends Rule[Logic
 
 plan.transformAllExpressionsWithPruning(_.containsPattern(CURRENT_LIKE)) {
   case CurrentDatabase() =>
-Literal.create(currentNamespace, StringType)
+Literal.create(currentNamespace, SQLConf.get.defaultStringType)
   case CurrentCatalog() =>
- 

[jira] [Assigned] (SPARK-48175) Store collation information in metadata and not in type for SER/DE

2024-05-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48175:
---

Assignee: Stefan Kandic

> Store collation information in metadata and not in type for SER/DE
> --
>
> Key: SPARK-48175
> URL: https://issues.apache.org/jira/browse/SPARK-48175
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Assignee: Stefan Kandic
>Priority: Major
>  Labels: pull-request-available
>
> Changing serialization and deserialization of collated strings so that the 
> collation information is put in the metadata of the enclosing struct field - 
> and then read back from there during parsing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48175) Store collation information in metadata and not in type for SER/DE

2024-05-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48175.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46280
[https://github.com/apache/spark/pull/46280]

> Store collation information in metadata and not in type for SER/DE
> --
>
> Key: SPARK-48175
> URL: https://issues.apache.org/jira/browse/SPARK-48175
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Assignee: Stefan Kandic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Changing serialization and deserialization of collated strings so that the 
> collation information is put in the metadata of the enclosing struct field - 
> and then read back from there during parsing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE

2024-05-18 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6f6b4860268d [SPARK-48175][SQL][PYTHON] Store collation information in 
metadata and not in type for SER/DE
6f6b4860268d is described below

commit 6f6b4860268dc250d8e31a251d740733798aa512
Author: Stefan Kandic 
AuthorDate: Sat May 18 15:17:56 2024 +0800

[SPARK-48175][SQL][PYTHON] Store collation information in metadata and not 
in type for SER/DE

### What changes were proposed in this pull request?
Changing serialization and deserialization of collated strings so that the 
collation information is put in the metadata of the enclosing struct field - 
and then read back from there during parsing.

Format of serialization will look something like this:
```json
{
  "type": "struct",
  "fields": [
"name": "colName",
"type": "string",
"nullable": true,
"metadata": {
  "__COLLATIONS": {
"colName": "UNICODE"
  }
}
  ]
}
```

If we have a map we will add suffixes `.key` and `.value` in the metadata:
```json
{
  "type": "struct",
  "fields": [
{
  "name": "mapField",
  "type": {
"type": "map",
"keyType": "string",
"valueType": "string",
"valueContainsNull": true
  },
  "nullable": true,
  "metadata": {
"__COLLATIONS": {
  "mapField.key": "UNICODE",
  "mapField.value": "UNICODE"
}
  }
}
  ]
}
```
It will be a similar story for arrays (we will add `.element` suffix). We 
could have multiple suffixes when working with deeply nested data types 
(Map[String, Array[Array[String]]] - see tests for this example)

### Why are the changes needed?
Putting collation info in field metadata is the only way to not break old 
clients reading new tables with collations. `CharVarcharUtils` does a similar 
thing but this is much less hacky, and more friendly for all 3p clients - which 
is especially important since delta also uses spark for schema ser/de.

It will also remove the need for additional logic introduced in #46083 to 
remove collations before writing to HMS as this way the tables will be fully 
HMS compatible.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
With unit tests

### Was this patch authored or co-authored using generative AI tooling?
No
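
A hedged round-trip sketch using the public DataType JSON API (the printed result is an expectation under this change, not verified output): the collation lives only in the enclosing field's metadata, while the declared type stays a plain "string" for old readers.

```scala
import org.apache.spark.sql.types.{DataType, StructType}

val json =
  """{"type":"struct","fields":[{"name":"colName","type":"string","nullable":true,"metadata":{"__COLLATIONS":{"colName":"UNICODE"}}}]}"""
val schema = DataType.fromJson(json).asInstanceOf[StructType]
// Expected to rebuild the UNICODE-collated string type from the __COLLATIONS entry.
println(schema("colName").dataType)
```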

Closes #46280 from stefankandic/newDeltaSchema.

Lead-authored-by: Stefan Kandic 
Co-authored-by: Stefan Kandic 
<154237371+stefankan...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/util/CollationFactory.java  |  99 +++-
 .../src/main/resources/error/error-conditions.json |  12 +
 python/pyspark/errors/error-conditions.json|  10 +
 .../pyspark/sql/tests/connect/test_parity_types.py |   4 +
 python/pyspark/sql/tests/test_types.py | 249 +++--
 python/pyspark/sql/types.py| 178 +--
 .../org/apache/spark/sql/types/DataType.scala  |  74 +-
 .../org/apache/spark/sql/types/StringType.scala|   7 +
 .../org/apache/spark/sql/types/StructField.scala   |  62 -
 .../org/apache/spark/sql/types/DataTypeSuite.scala | 181 ++-
 .../apache/spark/sql/types/StructTypeSuite.scala   | 183 +++
 .../streaming/StreamingDeduplicationSuite.scala|   2 +-
 .../spark/sql/streaming/StreamingQuerySuite.scala  |   2 +-
 13 files changed, 1004 insertions(+), 59 deletions(-)

diff --git 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
index 863445b6..0133c3feb611 100644
--- 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
+++ 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
@@ -36,11 +36,62 @@ import org.apache.spark.unsafe.types.UTF8String;
  * Provides functionality to the UTF8String object which respects defined 
collation settings.
  */
 public final class CollationFactory {
+
+  /**
+   * Identifier for a single collation.
+   */
+  public static class CollationIdentifier {
+private final String provider;
+private final String name

(spark) branch master updated (15fb4787354a -> 3edd6c7e1d50)

2024-05-17 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 15fb4787354a [SPARK-48321][CONNECT][TESTS] Avoid using deprecated 
methods in dsl
 add 3edd6c7e1d50 [SPARK-48312][SQL] Improve 
Alias.removeNonInheritableMetadata performance

No new revisions were added by this update.

Summary of changes:
 .../main/scala/org/apache/spark/sql/types/Metadata.scala   |  7 +++
 .../spark/sql/catalyst/expressions/namedExpressions.scala  | 14 +++---
 2 files changed, 18 insertions(+), 3 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[jira] [Resolved] (SPARK-48308) Unify getting data schema without partition columns in FileSourceStrategy

2024-05-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48308.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46619
[https://github.com/apache/spark/pull/46619]

> Unify getting data schema without partition columns in FileSourceStrategy
> -
>
> Key: SPARK-48308
> URL: https://issues.apache.org/jira/browse/SPARK-48308
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
>Reporter: Johan Lasperas
>Assignee: Johan Lasperas
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> In 
> [FileSourceStrategy,|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L191]
>  the schema of the data excluding partition columns is computed 2 times in a 
> slightly different way:
>  
> {code:java}
> val dataColumnsWithoutPartitionCols = 
> dataColumns.filterNot(partitionSet.contains) {code}
>  
> vs 
> {code:java}
> val readDataColumns = dataColumns
>   .filterNot(partitionColumns.contains) {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48308) Unify getting data schema without partition columns in FileSourceStrategy

2024-05-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48308:
---

Assignee: Johan Lasperas

> Unify getting data schema without partition columns in FileSourceStrategy
> -
>
> Key: SPARK-48308
> URL: https://issues.apache.org/jira/browse/SPARK-48308
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
>Reporter: Johan Lasperas
>Assignee: Johan Lasperas
>Priority: Trivial
>  Labels: pull-request-available
>
> In 
> [FileSourceStrategy,|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L191]
>  the schema of the data excluding partition columns is computed 2 times in a 
> slightly different way:
>  
> {code:java}
> val dataColumnsWithoutPartitionCols = 
> dataColumns.filterNot(partitionSet.contains) {code}
>  
> vs 
> {code:java}
> val readDataColumns = dataColumns
>   .filterNot(partitionColumns.contains) {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-48308][CORE] Unify getting data schema without partition columns in FileSourceStrategy

2024-05-16 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 57948c865e06 [SPARK-48308][CORE] Unify getting data schema without 
partition columns in FileSourceStrategy
57948c865e06 is described below

commit 57948c865e064469a75c92f8b58c632b9b40fdd3
Author: Johan Lasperas 
AuthorDate: Thu May 16 22:38:02 2024 +0800

[SPARK-48308][CORE] Unify getting data schema without partition columns in 
FileSourceStrategy

### What changes were proposed in this pull request?
Compute the schema of the data without partition columns only once in 
FileSourceStrategy.

### Why are the changes needed?
In FileSourceStrategy, the schema of the data excluding partition columns 
is computed 2 times in a slightly different way, using an AttributeSet 
(`partitionSet`) and using the attributes directly (`partitionColumns`)
These don't have exactly the same semantics: an AttributeSet compares attributes only by 
expression id, while comparing against the actual attributes also checks the name, type, 
nullability and metadata. We want to use the former here.
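
A hedged, simplified illustration of that difference (standalone snippet, not the planner code itself):

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, AttributeSet}
import org.apache.spark.sql.types.{IntegerType, MetadataBuilder}

val a  = AttributeReference("a", IntegerType)()
// Same attribute (same exprId) but with different metadata attached.
val a2 = a.withMetadata(new MetadataBuilder().putString("comment", "changed").build())

AttributeSet(a).contains(a2)  // true  - membership is decided by exprId alone
Seq(a).contains(a2)           // false - full attribute equality also checks metadata
```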

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46619 from johanl-db/reuse-schema-without-partition-columns.

Authored-by: Johan Lasperas 
Signed-off-by: Wenchen Fan 
---
 .../apache/spark/sql/execution/datasources/FileSourceStrategy.scala| 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
index 8333c276cdd8..d31cb111924b 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
@@ -216,9 +216,8 @@ object FileSourceStrategy extends Strategy with 
PredicateHelper with Logging {
   val requiredExpressions: Seq[NamedExpression] = filterAttributes.toSeq 
++ projects
   val requiredAttributes = AttributeSet(requiredExpressions)
 
-  val readDataColumns = dataColumns
+  val readDataColumns = dataColumnsWithoutPartitionCols
 .filter(requiredAttributes.contains)
-.filterNot(partitionColumns.contains)
 
   // Metadata attributes are part of a column of type struct up to this 
point. Here we extract
   // this column from the schema and specify a matcher for that.


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[jira] [Resolved] (SPARK-48288) Add source data type to connector.Cast expression

2024-05-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48288.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46596
[https://github.com/apache/spark/pull/46596]

> Add source data type to connector.Cast expression
> -
>
> Key: SPARK-48288
> URL: https://issues.apache.org/jira/browse/SPARK-48288
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Uros Stankovic
>Assignee: Uros Stankovic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently, 
> V2ExpressionBuilder will build connector.Cast expression from catalyst.Cast 
> expression.
> The catalyst Cast has the expression's data type, but the connector Cast does not.
> Since some casts are not allowed on the external engine, we need to know both the 
> source and target data types, so we can block unsupported casts at finer granularity.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated (fa83d0f8fce7 -> 4be0828e6e6a)

2024-05-16 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from fa83d0f8fce7 [SPARK-48296][SQL] Codegen Support for `to_xml`
 add 4be0828e6e6a [SPARK-48288] Add source data type for connector cast 
expression

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/connector/expressions/Cast.java   | 18 +-
 .../sql/connector/util/V2ExpressionSQLBuilder.java |  6 +++---
 .../spark/sql/catalyst/util/V2ExpressionBuilder.scala  |  2 +-
 .../scala/org/apache/spark/sql/jdbc/JdbcDialects.scala |  4 ++--
 4 files changed, 23 insertions(+), 7 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[jira] [Resolved] (SPARK-48252) Update CommonExpressionRef when necessary

2024-05-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48252.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46552
[https://github.com/apache/spark/pull/46552]

> Update CommonExpressionRef when necessary
> -
>
> Key: SPARK-48252
> URL: https://issues.apache.org/jira/browse/SPARK-48252
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48252) Update CommonExpressionRef when necessary

2024-05-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48252:
---

Assignee: Wenchen Fan

> Update CommonExpressionRef when necessary
> -
>
> Key: SPARK-48252
> URL: https://issues.apache.org/jira/browse/SPARK-48252
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-48252][SQL] Update CommonExpressionRef when necessary

2024-05-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ca3593288d57 [SPARK-48252][SQL] Update CommonExpressionRef when 
necessary
ca3593288d57 is described below

commit ca3593288d577435a193f356b5214cf6f4bd534a
Author: Wenchen Fan 
AuthorDate: Thu May 16 09:42:36 2024 +0800

[SPARK-48252][SQL] Update CommonExpressionRef when necessary

### What changes were proposed in this pull request?

The `With` expression assumes that it is created after all input expressions are 
fully resolved. This is mostly true (function lookup happens after function input 
expressions are resolved), but there is a special case of column resolution in 
HAVING: we use `TempResolvedColumn` to try one column resolution option, and if it 
doesn't work the column is re-resolved, possibly to a different data type. The 
`With` expression should update its refs when this happens.

### Why are the changes needed?

bug fix, otherwise the query will fail

### Does this PR introduce _any_ user-facing change?

This feature is not released yet.

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #46552 from cloud-fan/with.

Lead-authored-by: Wenchen Fan 
Co-authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
---
 .../apache/spark/sql/catalyst/expressions/With.scala   | 18 +-
 .../optimizer/RewriteWithExpressionSuite.scala | 14 ++
 2 files changed, 31 insertions(+), 1 deletion(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/With.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/With.scala
index 29794b33641c..5f6f9afa5797 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/With.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/With.scala
@@ -40,7 +40,23 @@ case class With(child: Expression, defs: 
Seq[CommonExpressionDef])
   override def children: Seq[Expression] = child +: defs
   override protected def withNewChildrenInternal(
   newChildren: IndexedSeq[Expression]): Expression = {
-copy(child = newChildren.head, defs = 
newChildren.tail.map(_.asInstanceOf[CommonExpressionDef]))
+val newDefs = newChildren.tail.map(_.asInstanceOf[CommonExpressionDef])
+// If any `CommonExpressionDef` has been updated (data type or 
nullability), also update its
+// `CommonExpressionRef` in the `child`.
+val newChild = newDefs.filter(_.resolved).foldLeft(newChildren.head) { 
(result, newDef) =>
+  defs.find(_.id == newDef.id).map { oldDef =>
+if (newDef.dataType != oldDef.dataType || newDef.nullable != 
oldDef.nullable) {
+  val newRef = new CommonExpressionRef(newDef)
+  result.transform {
+case oldRef: CommonExpressionRef if oldRef.id == newRef.id =>
+  newRef
+  }
+} else {
+  result
+}
+  }.getOrElse(result)
+}
+copy(child = newChild, defs = newDefs)
   }
 
   /**
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala
index aa8ffb2b0454..0aeca961aa51 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala
@@ -18,6 +18,7 @@
 package org.apache.spark.sql.catalyst.optimizer
 
 import org.apache.spark.SparkException
+import org.apache.spark.sql.catalyst.analysis.TempResolvedColumn
 import org.apache.spark.sql.catalyst.dsl.expressions._
 import org.apache.spark.sql.catalyst.dsl.plans._
 import org.apache.spark.sql.catalyst.expressions._
@@ -438,4 +439,17 @@ class RewriteWithExpressionSuite extends PlanTest {
   Optimizer.execute(plan)
 }
   }
+
+  test("SPARK-48252: TempResolvedColumn in common expression") {
+val a = testRelation.output.head
+val tempResolved = TempResolvedColumn(a, Seq("a"))
+val expr = With(tempResolved) { case Seq(ref) =>
+  ref === 1
+}
+val plan = testRelation.having($"b")(avg("a").as("a"))(expr).analyze
+comparePlans(
+  Optimizer.execute(plan),
+  testRelation.groupBy($"b")(avg("a").as("a")).where($"a" === 1).analyze
+)
+  }
 }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.4 updated: [SPARK-48172][SQL] Fix escaping issues in JDBCDialects

2024-05-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 0e7156d2d801 [SPARK-48172][SQL] Fix escaping issues in JDBCDialects
0e7156d2d801 is described below

commit 0e7156d2d80171876c7a5e674349c53ee013be38
Author: Mihailo Milosevic 
AuthorDate: Wed May 15 22:15:52 2024 +0800

[SPARK-48172][SQL] Fix escaping issues in JDBCDialects

This PR is a fix of https://github.com/apache/spark/pull/46437. The 
previous PR was reverted as `LONGTEXT` is not supported by all dialects.

Special-case escaping for MySQL and fix issues with redundant escaping of the 
' character.
New changes introduced in the fix include replacing `LONGTEXT` with 
`VARCHAR(50)`, as well as a fix for table naming in the tests.

When pushing down startsWith, endsWith and contains, they are converted to 
LIKE. This requires adding escape characters for these expressions. 
Unfortunately, MySQL uses the ESCAPE '\' syntax instead of ESCAPE '', which would 
cause errors when trying to push down.
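
A self-contained sketch of the escaping problem (the helper below is made up for illustration, not the dialect code): LIKE treats '%' and '_' as wildcards, so a literal prefix has to be escaped before it is turned into a pattern.

```scala
// Illustrative helper only; real dialects build this inside their SQL builders.
def escapeLikeLiteral(s: String, esc: Char = '\\'): String =
  s.flatMap {
    case c @ ('%' | '_') => s"$esc$c"
    case c if c == esc   => s"$esc$c"
    case c               => c.toString
  }

// startsWith("special_character_percent%") pushed down as a LIKE prefix pattern:
val pushed = s"pattern_testing_col LIKE '${escapeLikeLiteral("special_character_percent%")}%' ESCAPE '\\'"
// pattern_testing_col LIKE 'special\_character\_percent\%%' ESCAPE '\'
```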

Yes

Tests for each existing dialect.

No.

Closes #46588 from mihailom-db/SPARK-48172.

Authored-by: Mihailo Milosevic 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 9e386b472981979e368a5921c58da5bfefe3acfe)
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/jdbc/v2/DB2IntegrationSuite.scala|   6 +
 .../sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala |  11 +
 .../sql/jdbc/v2/MsSqlServerIntegrationSuite.scala  |   6 +
 .../spark/sql/jdbc/v2/MySQLIntegrationSuite.scala  |   6 +
 .../spark/sql/jdbc/v2/OracleIntegrationSuite.scala |   6 +
 .../sql/jdbc/v2/PostgresIntegrationSuite.scala |   6 +
 .../org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala  | 229 +
 .../sql/connector/util/V2ExpressionSQLBuilder.java |   3 -
 .../sql/connector/expressions/expressions.scala|   4 +-
 .../org/apache/spark/sql/jdbc/H2Dialect.scala  |   7 -
 .../org/apache/spark/sql/jdbc/MySQLDialect.scala   |  15 ++
 .../org/apache/spark/sql/jdbc/JDBCV2Suite.scala|   6 +-
 12 files changed, 291 insertions(+), 14 deletions(-)

diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
index 1a25cd2802dd..fd99bb2a3bc5 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
@@ -67,6 +67,12 @@ class DB2IntegrationSuite extends 
DockerJDBCIntegrationV2Suite with V2JDBCTest {
 connection.prepareStatement(
   "CREATE TABLE employee (dept INTEGER, name VARCHAR(10), salary 
DECIMAL(20, 2), bonus DOUBLE)")
   .executeUpdate()
+connection.prepareStatement(
+  s"""CREATE TABLE pattern_testing_table (
+ |pattern_testing_col VARCHAR(50)
+ |)
+   """.stripMargin
+).executeUpdate()
   }
 
   override def testUpdateColumnType(tbl: String): Unit = {
diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
index 72edfc9f1bf1..5f4f0b7a3afb 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
@@ -38,6 +38,17 @@ abstract class DockerJDBCIntegrationV2Suite extends 
DockerJDBCIntegrationSuite {
   .executeUpdate()
 connection.prepareStatement("INSERT INTO employee VALUES (6, 'jen', 12000, 
1200)")
   .executeUpdate()
+
+connection.prepareStatement(
+  s"""
+ |INSERT INTO pattern_testing_table VALUES
+ |('special_character_quote''_present'),
+ |('special_character_quote_not_present'),
+ |('special_character_percent%_present'),
+ |('special_character_percent_not_present'),
+ |('special_character_underscore_present'),
+ |('special_character_underscorenot_present')
+ """.stripMargin).executeUpdate()
   }
 
   def tablePreparation(connection: Connection): Unit
diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
index a527c6f8cb5b..51f31220d9a5 100644
--

[jira] [Resolved] (SPARK-48172) Fix escaping issues in JDBCDialects

2024-05-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48172.
-
Fix Version/s: 3.4.4
   3.5.2
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 46588
[https://github.com/apache/spark/pull/46588]

> Fix escaping issues in JDBCDialects
> ---
>
> Key: SPARK-48172
> URL: https://issues.apache.org/jira/browse/SPARK-48172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Assignee: Mihailo Milosevic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.4, 3.5.2, 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-48172][SQL] Fix escaping issues in JDBCDialects

2024-05-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 210ed2521d3d [SPARK-48172][SQL] Fix escaping issues in JDBCDialects
210ed2521d3d is described below

commit 210ed2521d3dc1202cd1ba855ed5e729a5d940d0
Author: Mihailo Milosevic 
AuthorDate: Wed May 15 22:15:52 2024 +0800

[SPARK-48172][SQL] Fix escaping issues in JDBCDialects

This PR is a fix of https://github.com/apache/spark/pull/46437. The 
previous PR was reverted as `LONGTEXT` is not supported by all dialects.

### What changes were proposed in this pull request?
Special case escaping for MySQL and fix issues with redundant escaping for 
' character.
New changes introduced in the fix include change `LONGTEXT` -> 
`VARCHAR(50)`, as well as fix for table naming in the tests.

### Why are the changes needed?
When pushing down startsWith, endsWith and contains they are converted to 
LIKE. This requires addition of escape characters for these expressions. 
Unfortunately, MySQL uses ESCAPE '\\' syntax instead of ESCAPE '\' which would 
cause errors when trying to push down.

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
Tests for each existing dialect.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46588 from mihailom-db/SPARK-48172.

Authored-by: Mihailo Milosevic 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 9e386b472981979e368a5921c58da5bfefe3acfe)
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/jdbc/v2/DB2IntegrationSuite.scala|   6 +
 .../sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala |  11 +
 .../sql/jdbc/v2/MsSqlServerIntegrationSuite.scala  |   6 +
 .../spark/sql/jdbc/v2/MySQLIntegrationSuite.scala  |   6 +
 .../spark/sql/jdbc/v2/OracleIntegrationSuite.scala |   6 +
 .../sql/jdbc/v2/PostgresIntegrationSuite.scala |   6 +
 .../org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala  | 229 +
 .../sql/connector/util/V2ExpressionSQLBuilder.java |   3 -
 .../sql/connector/expressions/expressions.scala|   4 +-
 .../org/apache/spark/sql/jdbc/H2Dialect.scala  |   7 -
 .../org/apache/spark/sql/jdbc/MySQLDialect.scala   |  15 ++
 .../org/apache/spark/sql/jdbc/JDBCV2Suite.scala|   6 +-
 12 files changed, 291 insertions(+), 14 deletions(-)

diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
index 9a78244f5326..5bcc8afefb1d 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
@@ -80,6 +80,12 @@ class DB2IntegrationSuite extends 
DockerJDBCIntegrationV2Suite with V2JDBCTest {
 connection.prepareStatement(
   "CREATE TABLE employee (dept INTEGER, name VARCHAR(10), salary 
DECIMAL(20, 2), bonus DOUBLE)")
   .executeUpdate()
+connection.prepareStatement(
+  s"""CREATE TABLE pattern_testing_table (
+ |pattern_testing_col VARCHAR(50)
+ |)
+   """.stripMargin
+).executeUpdate()
   }
 
   override def testUpdateColumnType(tbl: String): Unit = {
diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
index 72edfc9f1bf1..5f4f0b7a3afb 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
@@ -38,6 +38,17 @@ abstract class DockerJDBCIntegrationV2Suite extends 
DockerJDBCIntegrationSuite {
   .executeUpdate()
 connection.prepareStatement("INSERT INTO employee VALUES (6, 'jen', 12000, 
1200)")
   .executeUpdate()
+
+connection.prepareStatement(
+  s"""
+ |INSERT INTO pattern_testing_table VALUES
+ |('special_character_quote''_present'),
+ |('special_character_quote_not_present'),
+ |('special_character_percent%_present'),
+ |('special_character_percent_not_present'),
+ |('special_character_underscore_present'),
+ |('special_character_underscorenot_present')
+ """.stripMargin).executeUpdate()
   }
 
   def tablePreparation(connection: Connection): Unit
diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
index 0dc3a39f4db5..0bb2ea8249b3 100644
--

(spark) branch master updated: [SPARK-48172][SQL] Fix escaping issues in JDBCDialects

2024-05-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9e386b472981 [SPARK-48172][SQL] Fix escaping issues in JDBCDialects
9e386b472981 is described below

commit 9e386b472981979e368a5921c58da5bfefe3acfe
Author: Mihailo Milosevic 
AuthorDate: Wed May 15 22:15:52 2024 +0800

[SPARK-48172][SQL] Fix escaping issues in JDBCDialects

This PR is a fix of https://github.com/apache/spark/pull/46437. The 
previous PR was reverted as `LONGTEXT` is not supported by all dialects.

### What changes were proposed in this pull request?
Special case escaping for MySQL and fix issues with redundant escaping for 
' character.
New changes introduced in the fix include change `LONGTEXT` -> 
`VARCHAR(50)`, as well as fix for table naming in the tests.

### Why are the changes needed?
When pushing down startsWith, endsWith and contains they are converted to 
LIKE. This requires addition of escape characters for these expressions. 
Unfortunately, MySQL uses ESCAPE '\\' syntax instead of ESCAPE '\' which would 
cause errors when trying to push down.
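
To illustrate the kind of escaping involved (a minimal sketch, not the code added by this PR; the helper name and the exact quoting rules are assumptions for illustration):

    // Turn a startsWith/contains/endsWith pushdown value into a LIKE pattern.
    // '%' and '_' are LIKE wildcards and must be escaped; ' must be doubled inside
    // the SQL string literal; MySQL additionally needs the backslash in the ESCAPE
    // clause doubled, which is what the dialect-specific handling is about.
    def escapeForLike(value: String, isMySql: Boolean): String = {
      val escaped = value.flatMap {
        case c @ ('%' | '_') => s"\\$c"    // escape LIKE wildcards with '\'
        case '\''            => "''"       // double single quotes for the literal
        case '\\'            => "\\\\"     // keep a literal backslash literal
        case c               => c.toString
      }
      val escapeClause = if (isMySql) "ESCAPE '\\\\'" else "ESCAPE '\\'"
      s"'$escaped%' $escapeClause"          // e.g. the pattern for a startsWith filter
    }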

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
Tests for each existing dialect.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46588 from mihailom-db/SPARK-48172.

Authored-by: Mihailo Milosevic 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/jdbc/v2/DB2IntegrationSuite.scala|   6 +
 .../sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala |  11 +
 .../sql/jdbc/v2/MsSqlServerIntegrationSuite.scala  |   6 +
 .../spark/sql/jdbc/v2/MySQLIntegrationSuite.scala  |   6 +
 .../spark/sql/jdbc/v2/OracleIntegrationSuite.scala |   6 +
 .../sql/jdbc/v2/PostgresIntegrationSuite.scala |   6 +
 .../org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala  | 229 +
 .../sql/connector/util/V2ExpressionSQLBuilder.java |   1 -
 .../sql/connector/expressions/expressions.scala|   4 +-
 .../org/apache/spark/sql/jdbc/H2Dialect.scala  |   7 -
 .../org/apache/spark/sql/jdbc/MySQLDialect.scala   |  15 ++
 .../org/apache/spark/sql/jdbc/JDBCV2Suite.scala|   6 +-
 12 files changed, 291 insertions(+), 12 deletions(-)

diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
index 3642094d11b2..57129e9d846f 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
@@ -62,6 +62,12 @@ class DB2IntegrationSuite extends 
DockerJDBCIntegrationV2Suite with V2JDBCTest {
 connection.prepareStatement(
   "CREATE TABLE employee (dept INTEGER, name VARCHAR(10), salary 
DECIMAL(20, 2), bonus DOUBLE)")
   .executeUpdate()
+connection.prepareStatement(
+  s"""CREATE TABLE pattern_testing_table (
+ |pattern_testing_col VARCHAR(50)
+ |)
+   """.stripMargin
+).executeUpdate()
   }
 
   override def testUpdateColumnType(tbl: String): Unit = {
diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
index 72edfc9f1bf1..5f4f0b7a3afb 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
@@ -38,6 +38,17 @@ abstract class DockerJDBCIntegrationV2Suite extends 
DockerJDBCIntegrationSuite {
   .executeUpdate()
 connection.prepareStatement("INSERT INTO employee VALUES (6, 'jen', 12000, 
1200)")
   .executeUpdate()
+
+connection.prepareStatement(
+  s"""
+ |INSERT INTO pattern_testing_table VALUES
+ |('special_character_quote''_present'),
+ |('special_character_quote_not_present'),
+ |('special_character_percent%_present'),
+ |('special_character_percent_not_present'),
+ |('special_character_underscore_present'),
+ |('special_character_underscorenot_present')
+ """.stripMargin).executeUpdate()
   }
 
   def tablePreparation(connection: Connection): Unit
diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
 
b/connector/docker-integration-tests/

Re: [DISCUSS] clarify the definition of behavior changes

2024-05-15 Thread Wenchen Fan
Thanks all for the feedback here! Let me put up a new version, which
clarifies the definition of "users":

Behavior changes mean user-visible functional changes in a new release via
public APIs. The "user" here is not only the user who writes queries and/or
develops Spark plugins, but also the user who deploys and/or manages Spark
clusters. New features, and even bug fixes that eliminate NPEs or correct
query results, are behavior changes. Things like performance improvements,
code refactoring, and changes to unreleased APIs/features are not. All
behavior changes should be called out in the PR description. We need to
write an item in the migration guide (and probably a legacy config) for those
that may break users when upgrading:

   - Bug fixes that change query results. Users may need to do backfill to
   correct the existing data and must know about these correctness fixes.
   - Bug fixes that change query schema. Users may need to update the
   schema of the tables in their data pipelines and must know about these
   changes.
   - Removing configs
   - Renaming an error class/condition
   - Any non-additive change to the public Python/SQL/Scala/Java/R APIs
   (including developer APIs): renaming functions, removing parameters, adding
   parameters, renaming parameters, changing parameter default values, etc. These
   changes should be avoided in general, or done in a binary-compatible
   way, like deprecating and adding a new function instead of renaming (see the
   sketch after this list).
   - Any non-additive change to the way Spark should be deployed and
   managed.
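
To make the deprecation recommendation in the list above concrete, here is a minimal sketch (a hypothetical API, not an existing Spark interface) of evolving a method in a binary-compatible way instead of renaming it or changing its signature:

    // The old method stays and is deprecated; the new behaviour is an overload whose
    // extra parameter defaults to the old behaviour, so existing callers keep working.
    object ReportingApi {
      @deprecated("Use report(name, value, tags) instead", since = "4.0.0")
      def report(name: String, value: Long): Unit =
        report(name, value, tags = Map.empty)

      def report(name: String, value: Long, tags: Map[String, String]): Unit = {
        // real implementation elided; println keeps the sketch self-contained
        println(s"$name=$value $tags")
      }
    }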

The list above is not supposed to be comprehensive. Anyone can raise your
concern when reviewing PRs and ask the PR author to add migration guide if
you believe the change is risky and may break users.

On Thu, May 2, 2024 at 10:25 PM Will Raschkowski 
wrote:

> To add some user perspective, I wanted to share our experience from
> automatically upgrading tens of thousands of jobs from Spark 2 to 3 at
> Palantir:
>
>
>
> We didn't mind "loud" changes that threw exceptions. We have some infra to
> try running jobs with Spark 3 and fall back to Spark 2 if there's an exception.
> E.g., the datetime parsing and rebasing migration in Spark 3 was great:
> Spark threw a helpful exception but never silently changed results.
> Similarly, for things listed in the migration guide as silent changes
> (e.g., add_months's handling of last-day-of-month), we wrote custom check
> rules to throw unless users acknowledged the change through config.
>
>
>
> Silent changes *not* in the migration guide were really bad for us:
> Trusting the migration guide to be exhaustive, we automatically upgraded
> jobs which then “succeeded” but wrote incorrect results. For example, some
> expression increased timestamp precision in Spark 3; a query implicitly
> relied on the reduced precision, and then produced bad results on upgrade.
> It’s a silly query but a note in the migration guide would have helped.
>
>
>
> To summarize: the migration guide was invaluable, we appreciated every
> entry, and we'd appreciate Wenchen's stricter definition of "behavior
> changes" (especially for silent ones).
>
>
>
> *From: *Nimrod Ofek 
> *Date: *Thursday, 2 May 2024 at 11:57
> *To: *Wenchen Fan 
> *Cc: *Erik Krogen , Spark dev list <
> dev@spark.apache.org>
> *Subject: *Re: [DISCUSS] clarify the definition of behavior changes
>
>
> Hi Erik and Wenchen,
>
>
>
> I think that a good practice for public APIs, and for internal APIs with big
> impact and a lot of usage, is to ease in changes: keep the previous signature
> with a deprecation notice, give new parameters defaults that preserve the
> former behaviour, and delete the deprecated function in the next release. That
> way the actual break happens one release later, after all libraries have had
> the chance to align with the new API, and upgrades can be done while already
> using the new version.
>
>
>
> Another thing is that we should probably examine which private APIs are used
> externally, to provide a better experience, and provide proper public APIs to
> meet those needs (for instance, applicative metrics and some way of creating
> custom behaviour columns).
>
>
>
> Thanks,
>
> Nimrod
>
>
>
On Thursday, May 2, 2024, 03:51, Wenchen Fan wrote:
>
> Hi Erik,
>
>
>
> Thanks for sharing your thoughts! Note: developer APIs are also public
> APIs (such as Data Source V2 API, Spark Listener API, etc.), so breaking
> changes should be avoided as much as we can and new APIs should be
> mentioned

Re: [VOTE] SPARK 4.0.0-preview1 (RC1)

2024-05-15 Thread Wenchen Fan
RC1 failed because of this issue. I'll cut RC2 after we downgrade Jetty to
9.x.

On Sat, May 11, 2024 at 3:37 PM Cheng Pan  wrote:

> -1 (non-binding)
>
> A small question: the tag is an orphan, but I suppose it should belong to the
> master branch.
>
> It seems YARN integration is broken due to the javax => jakarta namespace
> migration. I filed SPARK-48238 and left some comments on
> https://github.com/apache/spark/pull/45154
>
> Caused by: java.lang.IllegalStateException: class
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a
> jakarta.servlet.Filter
> at
> org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
> ~[spark-core_2.13-4.0.0-preview1.jar:4.0.0-preview1]
> at
> org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:93)
> ~[spark-core_2.13-4.0.0-preview1.jar:4.0.0-preview1]
> at
> org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$2(ServletHandler.java:724)
> ~[spark-core_2.13-4.0.0-preview1.jar:4.0.0-preview1]
> at
> java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
> ~[?:?]
> at
> java.base/java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:734)
> ~[?:?]
> at
> java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:762)
> ~[?:?]
> at
> org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:749)
> ~[spark-core_2.13-4.0.0-preview1.jar:4.0.0-preview1]
> ... 38 more
>
> Thanks,
> Cheng Pan
>
>
> > On May 11, 2024, at 13:55, Wenchen Fan  wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark version
> 4.0.0-preview1.
> >
> > The vote is open until May 16 PST and passes if a majority +1 PMC votes
> are cast, with
> > a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 4.0.0-preview1
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v4.0.0-preview1-rc1 (commit
> 7dcf77c739c3854260464d732dbfb9a0f54706e7):
> > https://github.com/apache/spark/tree/v4.0.0-preview1-rc1
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc1-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1454/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc1-docs/
> >
> > The list of bug fixes going into 4.0.0 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12353359
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks, in the Java/Scala
> > you can add the staging repository to your projects resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out of date RC going forward).
>
>
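
As a self-contained illustration of the javax => jakarta failure mode in the stack trace quoted above (the types below are stand-ins, not the real servlet APIs; a REPL-style sketch):

    // Two structurally identical interfaces in different namespaces are unrelated
    // types, so a filter compiled against the old javax API fails the runtime type
    // check that a jakarta-based Jetty performs when it starts the filter.
    trait JavaxFilter   { def doFilter(): Unit }   // stand-in for javax.servlet.Filter
    trait JakartaFilter { def doFilter(): Unit }   // stand-in for jakarta.servlet.Filter

    class AmIpFilterLike extends JavaxFilter { def doFilter(): Unit = () }

    val f: AnyRef = new AmIpFilterLike
    if (!f.isInstanceOf[JakartaFilter]) {          // false: same shape, different type
      throw new IllegalStateException(s"${f.getClass.getName} is not a jakarta.servlet.Filter")
    }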


[jira] [Resolved] (SPARK-48277) Improve error message for ErrorClassesJsonReader.getErrorMessage

2024-05-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48277.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46584
[https://github.com/apache/spark/pull/46584]

> Improve error message for ErrorClassesJsonReader.getErrorMessage
> 
>
> Key: SPARK-48277
> URL: https://issues.apache.org/jira/browse/SPARK-48277
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated (8c0a7ba82c98 -> 5e87e9fbd6e6)

2024-05-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 8c0a7ba82c98 [SPARK-48160][SQL] Add collation support for XPATH 
expressions
 add 5e87e9fbd6e6 [SPARK-48277] Improve error message for 
ErrorClassesJsonReader.getErrorMessage

No new revisions were added by this update.

Summary of changes:
 .../src/main/scala/org/apache/spark/ErrorClassesJSONReader.scala   | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[jira] [Assigned] (SPARK-48160) XPath expressions (all collations)

2024-05-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48160:
---

Assignee: Uroš Bojanić

> XPath expressions (all collations)
> --
>
> Key: SPARK-48160
> URL: https://issues.apache.org/jira/browse/SPARK-48160
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48160) XPath expressions (all collations)

2024-05-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48160.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46508
[https://github.com/apache/spark/pull/46508]

> XPath expressions (all collations)
> --
>
> Key: SPARK-48160
> URL: https://issues.apache.org/jira/browse/SPARK-48160
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-48160][SQL] Add collation support for XPATH expressions

2024-05-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8c0a7ba82c98 [SPARK-48160][SQL] Add collation support for XPATH 
expressions
8c0a7ba82c98 is described below

commit 8c0a7ba82c98c7f7e686c4ee81d2aad49cc7a6e0
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Wed May 15 14:24:46 2024 +0800

[SPARK-48160][SQL] Add collation support for XPATH expressions

### What changes were proposed in this pull request?
Introduce collation awareness for XPath expressions: xpath_boolean, 
xpath_short, xpath_int, xpath_long, xpath_float, xpath_double, xpath_string, 
xpath.

### Why are the changes needed?
Add collation support for XPath expressions in Spark.

### Does this PR introduce _any_ user-facing change?
Yes, users should now be able to use collated strings within arguments for 
XPath functions: xpath_boolean, xpath_short, xpath_int, xpath_long, 
xpath_float, xpath_double, xpath_string, xpath.
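
For example, a usage sketch consistent with the description above (assuming a running SparkSession named spark; this is not a test taken from the PR):

    // With collation support, a collated string is accepted as the XML argument;
    // previously only the default StringType passed input type checking.
    val df = spark.sql(
      """SELECT xpath_string(
        |  collate('<a><b>hello</b></a>', 'UTF8_BINARY_LCASE'),
        |  '/a/b') AS value""".stripMargin)
    df.show()   // expected: a single row containing "hello"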

### How was this patch tested?
E2e sql tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46508 from uros-db/xpath-expressions.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/expressions/xml/xpath.scala | 11 --
 .../spark/sql/CollationSQLExpressionsSuite.scala   | 44 ++
 2 files changed, 51 insertions(+), 4 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/xml/xpath.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/xml/xpath.scala
index c3a285178c11..f65061e8d0ea 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/xml/xpath.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/xml/xpath.scala
@@ -23,6 +23,8 @@ import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.expressions.Cast._
 import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
 import org.apache.spark.sql.catalyst.util.GenericArrayData
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.internal.types.StringTypeAnyCollation
 import org.apache.spark.sql.types._
 import org.apache.spark.unsafe.types.UTF8String
 
@@ -39,7 +41,8 @@ abstract class XPathExtract
   /** XPath expressions are always nullable, e.g. if the xml string is empty. 
*/
   override def nullable: Boolean = true
 
-  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType)
+  override def inputTypes: Seq[AbstractDataType] =
+Seq(StringTypeAnyCollation, StringTypeAnyCollation)
 
   override def checkInputDataTypes(): TypeCheckResult = {
 if (!path.foldable) {
@@ -47,7 +50,7 @@ abstract class XPathExtract
 errorSubClass = "NON_FOLDABLE_INPUT",
 messageParameters = Map(
   "inputName" -> toSQLId("path"),
-  "inputType" -> toSQLType(StringType),
+  "inputType" -> toSQLType(StringTypeAnyCollation),
   "inputExpr" -> toSQLExpr(path)
 )
   )
@@ -221,7 +224,7 @@ case class XPathDouble(xml: Expression, path: Expression) 
extends XPathExtract {
 // scalastyle:on line.size.limit
 case class XPathString(xml: Expression, path: Expression) extends XPathExtract 
{
   override def prettyName: String = "xpath_string"
-  override def dataType: DataType = StringType
+  override def dataType: DataType = SQLConf.get.defaultStringType
 
   override def nullSafeEval(xml: Any, path: Any): Any = {
 val ret = xpathUtil.evalString(xml.asInstanceOf[UTF8String].toString, 
pathString)
@@ -245,7 +248,7 @@ case class XPathString(xml: Expression, path: Expression) 
extends XPathExtract {
 // scalastyle:on line.size.limit
 case class XPathList(xml: Expression, path: Expression) extends XPathExtract {
   override def prettyName: String = "xpath"
-  override def dataType: DataType = ArrayType(StringType, containsNull = false)
+  override def dataType: DataType = ArrayType(SQLConf.get.defaultStringType, 
containsNull = false)
 
   override def nullSafeEval(xml: Any, path: Any): Any = {
 val nodeList = 
xpathUtil.evalNodeList(xml.asInstanceOf[UTF8String].toString, pathString)
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
index 48c3853bb5cf..37dcdf9bd721 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
@@ -548,6 +548,5

[jira] [Assigned] (SPARK-48162) Miscellaneous expressions (all collations)

2024-05-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48162:
---

Assignee: Uroš Bojanić

> Miscellaneous expressions (all collations)
> --
>
> Key: SPARK-48162
> URL: https://issues.apache.org/jira/browse/SPARK-48162
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48162) Miscellaneous expressions (all collations)

2024-05-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48162.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46461
[https://github.com/apache/spark/pull/46461]

> Miscellaneous expressions (all collations)
> --
>
> Key: SPARK-48162
> URL: https://issues.apache.org/jira/browse/SPARK-48162
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-48162][SQL] Add collation support for MISC expressions

2024-05-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 723354039f1d [SPARK-48162][SQL] Add collation support for MISC 
expressions
723354039f1d is described below

commit 723354039f1de587cacdf4ba48c076a896fdffd1
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Wed May 15 14:23:31 2024 +0800

[SPARK-48162][SQL] Add collation support for MISC expressions

### What changes were proposed in this pull request?
Introduce collation awareness for misc expressions: raise_error, uuid, 
version, typeof, aes_encrypt, aes_decrypt.

### Why are the changes needed?
Add collation support for misc expressions in Spark.

### Does this PR introduce _any_ user-facing change?
Yes, users should now be able to use collated strings within arguments for 
misc functions: raise_error, uuid, version, typeof, aes_encrypt, aes_decrypt.
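
For example, a usage sketch based on the description above (assuming a running SparkSession named spark; not a test taken from the PR):

    // A collated string is now accepted as the mode argument of aes_encrypt/aes_decrypt.
    val key = "0000111122223333"   // 16-character key, AES-128
    spark.sql(
      s"""SELECT cast(
         |  aes_decrypt(
         |    aes_encrypt('Spark', '$key', collate('GCM', 'UTF8_BINARY_LCASE')),
         |    '$key',
         |    collate('GCM', 'UTF8_BINARY_LCASE')) AS STRING) AS roundtrip""".stripMargin
    ).show()   // expected: a single row containing "Spark"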

### How was this patch tested?
E2e sql tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46461 from uros-db/misc-expressions.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../explain-results/function_aes_decrypt.explain   |   2 +-
 .../function_aes_decrypt_with_mode.explain |   2 +-
 .../function_aes_decrypt_with_mode_padding.explain |   2 +-
 ...ction_aes_decrypt_with_mode_padding_aad.explain |   2 +-
 .../explain-results/function_aes_encrypt.explain   |   2 +-
 .../function_aes_encrypt_with_mode.explain |   2 +-
 .../function_aes_encrypt_with_mode_padding.explain |   2 +-
 ...nction_aes_encrypt_with_mode_padding_iv.explain |   2 +-
 ...on_aes_encrypt_with_mode_padding_iv_aad.explain |   2 +-
 .../function_try_aes_decrypt.explain   |   2 +-
 .../function_try_aes_decrypt_with_mode.explain |   2 +-
 ...ction_try_aes_decrypt_with_mode_padding.explain |   2 +-
 ...n_try_aes_decrypt_with_mode_padding_aad.explain |   2 +-
 .../spark/sql/catalyst/expressions/misc.scala  |  14 ++-
 .../spark/sql/CollationSQLExpressionsSuite.scala   | 136 +
 15 files changed, 157 insertions(+), 19 deletions(-)

diff --git 
a/connector/connect/common/src/test/resources/query-tests/explain-results/function_aes_decrypt.explain
 
b/connector/connect/common/src/test/resources/query-tests/explain-results/function_aes_decrypt.explain
index 31e03b79eb98..55f1c314671a 100644
--- 
a/connector/connect/common/src/test/resources/query-tests/explain-results/function_aes_decrypt.explain
+++ 
b/connector/connect/common/src/test/resources/query-tests/explain-results/function_aes_decrypt.explain
@@ -1,2 +1,2 @@
-Project [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.ExpressionImplUtils, BinaryType, 
aesDecrypt, cast(g#0 as binary), cast(g#0 as binary), GCM, DEFAULT, cast( as 
binary), BinaryType, BinaryType, StringType, StringType, BinaryType, true, 
true, true) AS aes_decrypt(g, g, GCM, DEFAULT, )#0]
+Project [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.ExpressionImplUtils, BinaryType, 
aesDecrypt, cast(g#0 as binary), cast(g#0 as binary), GCM, DEFAULT, cast( as 
binary), BinaryType, BinaryType, StringTypeAnyCollation, 
StringTypeAnyCollation, BinaryType, true, true, true) AS aes_decrypt(g, g, GCM, 
DEFAULT, )#0]
 +- LocalRelation , [id#0L, a#0, b#0, d#0, e#0, f#0, g#0]
diff --git 
a/connector/connect/common/src/test/resources/query-tests/explain-results/function_aes_decrypt_with_mode.explain
 
b/connector/connect/common/src/test/resources/query-tests/explain-results/function_aes_decrypt_with_mode.explain
index fc572e8fe7c6..762a4f47a058 100644
--- 
a/connector/connect/common/src/test/resources/query-tests/explain-results/function_aes_decrypt_with_mode.explain
+++ 
b/connector/connect/common/src/test/resources/query-tests/explain-results/function_aes_decrypt_with_mode.explain
@@ -1,2 +1,2 @@
-Project [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.ExpressionImplUtils, BinaryType, 
aesDecrypt, cast(g#0 as binary), cast(g#0 as binary), g#0, DEFAULT, cast( as 
binary), BinaryType, BinaryType, StringType, StringType, BinaryType, true, 
true, true) AS aes_decrypt(g, g, g, DEFAULT, )#0]
+Project [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.ExpressionImplUtils, BinaryType, 
aesDecrypt, cast(g#0 as binary), cast(g#0 as binary), g#0, DEFAULT, cast( as 
binary), BinaryType, BinaryType, StringTypeAnyCollation, 
StringTypeAnyCollation, BinaryType, true, true, true) AS aes_decrypt(g, g, g, 
DEFAULT, )#0]
 +- LocalRelation , [id#0L, a#0, b#0, d#0, e#0, f#0, g#0]
diff --git 
a/connector/connect/common/src/test/resources/query-tests/explain-results/function_aes_decrypt_with_mode_padding.explain
 
b/connector/connect/com

[jira] [Updated] (SPARK-48271) Turn match error in RowEncoder into UNSUPPORTED_DATA_TYPE_FOR_ENCODER

2024-05-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-48271:

Summary: Turn match error in RowEncoder into 
UNSUPPORTED_DATA_TYPE_FOR_ENCODER  (was: support char/varchar in RowEncoder)

> Turn match error in RowEncoder into UNSUPPORTED_DATA_TYPE_FOR_ENCODER
> -
>
> Key: SPARK-48271
> URL: https://issues.apache.org/jira/browse/SPARK-48271
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48263) Collate function support for non UTF8_BINARY strings

2024-05-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48263.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46574
[https://github.com/apache/spark/pull/46574]

> Collate function support for non UTF8_BINARY strings
> 
>
> Key: SPARK-48263
> URL: https://issues.apache.org/jira/browse/SPARK-48263
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nebojsa Savic
>Assignee: Nebojsa Savic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> When default collation level config is set to some collation other than 
> UTF8_BINARY (i.e. UTF8_BINARY_LCASE) and when we try to execute COLLATE (or 
> collation) expression, this will fail because it is only accepting 
> StringType(0) as argument for collation name.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-48263] Collate function support for non UTF8_BINARY strings

2024-05-14 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 91da2caa409c [SPARK-48263] Collate function support for non 
UTF8_BINARY strings
91da2caa409c is described below

commit 91da2caa409cb156a970fea0fc8355fcd8c6a2e6
Author: Nebojsa Savic 
AuthorDate: Tue May 14 23:39:26 2024 +0800

[SPARK-48263] Collate function support for non UTF8_BINARY strings

### What changes were proposed in this pull request?
collate("xx", "") does not work when there is a config for 
default collation set which configures non UTF8_BINARY collation as default.
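
The underlying cause is visible in the one-line pattern-match change in the diff below; as a self-contained sketch (stand-in classes, not Spark's real types):

    // `case StringType` matches only the default UTF8_BINARY singleton, while
    // `case _: StringType` matches a string type with any collation id.
    class StringType(val collationId: Int) {
      override def equals(other: Any): Boolean = other match {
        case s: StringType => s.collationId == collationId
        case _             => false
      }
      override def hashCode(): Int = collationId
    }
    object StringType extends StringType(0)        // the default UTF8_BINARY type

    def acceptsCollationName(dt: Any): Boolean = dt match {
      // case StringType    => true                // before: only the default matched
      case _: StringType => true                   // after: any collation is accepted
      case _             => false
    }

    acceptsCollationName(new StringType(1))        // true after the fix, false before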

### Why are the changes needed?
Fixing the compatibility issue with default collation config and collate 
function.

### Does this PR introduce _any_ user-facing change?
Customers will be able to execute the collate(<expr>, <collationName>) function 
even when the default collation config is configured to some collation other than 
UTF8_BINARY. We are expanding the surface area for customers.

### How was this patch tested?
Added tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46574 from nebojsa-db/SPARK-48263.

Authored-by: Nebojsa Savic 
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/expressions/collationExpressions.scala|  4 ++--
 .../test/scala/org/apache/spark/sql/CollationSuite.scala   | 14 --
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala
index 6af00e193d94..7c02475a60ad 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala
@@ -57,14 +57,14 @@ object CollateExpressionBuilder extends ExpressionBuilder {
 expressions match {
   case Seq(e: Expression, collationExpr: Expression) =>
 (collationExpr.dataType, collationExpr.foldable) match {
-  case (StringType, true) =>
+  case (_: StringType, true) =>
 val evalCollation = collationExpr.eval()
 if (evalCollation == null) {
   throw QueryCompilationErrors.unexpectedNullError("collation", 
collationExpr)
 } else {
   Collate(e, evalCollation.toString)
 }
-  case (StringType, false) => throw 
QueryCompilationErrors.nonFoldableArgumentError(
+  case (_: StringType, false) => throw 
QueryCompilationErrors.nonFoldableArgumentError(
 funcName, "collationName", StringType)
   case (_, _) => throw 
QueryCompilationErrors.unexpectedInputDataTypeError(
 funcName, 1, StringType, collationExpr)
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala
index fce9ad3cc184..b22a762a2954 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala
@@ -67,8 +67,18 @@ class CollationSuite extends DatasourceV2SQLBase with 
AdaptiveSparkPlanHelper {
   }
 
   test("collate function syntax") {
-assert(sql(s"select collate('aaa', 'utf8_binary')").schema(0).dataType == 
StringType(0))
-assert(sql(s"select collate('aaa', 
'utf8_binary_lcase')").schema(0).dataType == StringType(1))
+assert(sql(s"select collate('aaa', 'utf8_binary')").schema(0).dataType ==
+  StringType("UTF8_BINARY"))
+assert(sql(s"select collate('aaa', 
'utf8_binary_lcase')").schema(0).dataType ==
+  StringType("UTF8_BINARY_LCASE"))
+  }
+
+  test("collate function syntax with default collation set") {
+withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UTF8_BINARY_LCASE") {
+  assert(sql(s"select collate('aaa', 
'utf8_binary_lcase')").schema(0).dataType ==
+StringType("UTF8_BINARY_LCASE"))
+  assert(sql(s"select collate('aaa', 'UNICODE')").schema(0).dataType == 
StringType("UNICODE"))
+}
   }
 
   test("collate function syntax invalid arg count") {


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[jira] [Assigned] (SPARK-48263) Collate function support for non UTF8_BINARY strings

2024-05-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48263:
---

Assignee: Nebojsa Savic

> Collate function support for non UTF8_BINARY strings
> 
>
> Key: SPARK-48263
> URL: https://issues.apache.org/jira/browse/SPARK-48263
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nebojsa Savic
>Assignee: Nebojsa Savic
>Priority: Major
>  Labels: pull-request-available
>
> When default collation level config is set to some collation other than 
> UTF8_BINARY (i.e. UTF8_BINARY_LCASE) and when we try to execute COLLATE (or 
> collation) expression, this will fail because it is only accepting 
> StringType(0) as argument for collation name.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-47301][SQL][TESTS][FOLLOWUP] Remove workaround for ParquetIOSuite

2024-05-14 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 97bf1ee9f6f7 [SPARK-47301][SQL][TESTS][FOLLOWUP] Remove workaround for 
ParquetIOSuite
97bf1ee9f6f7 is described below

commit 97bf1ee9f6f76d49df50560bf792135308f289a9
Author: panbingkun 
AuthorDate: Tue May 14 23:37:47 2024 +0800

[SPARK-47301][SQL][TESTS][FOLLOWUP] Remove workaround for ParquetIOSuite

### What changes were proposed in this pull request?
This PR aims to remove the workaround for ParquetIOSuite.

### Why are the changes needed?
After https://github.com/apache/spark/pull/46562 was completed, the reason the 
UT `SPARK-7837 Do not close output writer twice when commitTask() fails` failed 
(a different event processing order) no longer exists, so we remove the previous 
workaround here.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Manually test.
- Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46577 from panbingkun/SPARK-47301_FOLLOWUP.

Authored-by: panbingkun 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/execution/datasources/parquet/ParquetIOSuite.scala  | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala
index ba8fef0b3a8d..4fb8faa43a39 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala
@@ -1589,12 +1589,8 @@ class ParquetIOWithoutOutputCommitCoordinationSuite
 .coalesce(1)
   
df.write.partitionBy("a").options(extraOptions).parquet(dir.getCanonicalPath)
 }
-if (m2.getErrorClass != null) {
-  assert(m2.getErrorClass == "TASK_WRITE_FAILED")
-  assert(m2.getCause.getMessage.contains("Intentional exception for 
testing purposes"))
-} else {
-  assert(m2.getMessage.contains("TASK_WRITE_FAILED"))
-}
+assert(m2.getErrorClass == "TASK_WRITE_FAILED")
+assert(m2.getCause.getMessage.contains("Intentional exception for 
testing purposes"))
   }
 }
   }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[jira] [Resolved] (SPARK-48172) Fix escaping issues in JDBCDialects

2024-05-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48172.
-
Fix Version/s: 3.4.4
   3.5.2
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 46437
[https://github.com/apache/spark/pull/46437]

> Fix escaping issues in JDBCDialects
> ---
>
> Key: SPARK-48172
> URL: https://issues.apache.org/jira/browse/SPARK-48172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Assignee: Mihailo Milosevic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.4, 3.5.2, 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch branch-3.4 updated: [SPARK-48172][SQL] Fix escaping issues in JDBC Dialects

2024-05-14 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new a848e2790cba [SPARK-48172][SQL] Fix escaping issues in JDBC Dialects
a848e2790cba is described below

commit a848e2790cba0b7ee77d391dc534146bd35ee50a
Author: Mihailo Milosevic 
AuthorDate: Tue May 14 23:31:46 2024 +0800

[SPARK-48172][SQL] Fix escaping issues in JDBC Dialects

### What changes were proposed in this pull request?
Special case escaping for MySQL and fix issues with redundant escaping for 
' character.

### Why are the changes needed?
When pushing down startsWith, endsWith and contains they are converted to 
LIKE. This requires addition of escape characters for these expressions. 
Unfortunately, MySQL uses ESCAPE '\\' syntax instead of ESCAPE '\' which would 
cause errors when trying to push down.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Tests for each existing dialect.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46437 from mihailom-db/SPARK-48172.

Authored-by: Mihailo Milosevic 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 47006a493f98ca85196194d16d58b5847177b1a3)
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/jdbc/v2/DB2IntegrationSuite.scala|   6 +
 .../sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala |  11 +
 .../sql/jdbc/v2/MsSqlServerIntegrationSuite.scala  |   6 +
 .../spark/sql/jdbc/v2/MySQLIntegrationSuite.scala  |   6 +
 .../spark/sql/jdbc/v2/OracleIntegrationSuite.scala |   6 +
 .../sql/jdbc/v2/PostgresIntegrationSuite.scala |   6 +
 .../org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala  | 229 +
 .../sql/connector/util/V2ExpressionSQLBuilder.java |   3 -
 .../sql/connector/expressions/expressions.scala|   4 +-
 .../org/apache/spark/sql/jdbc/H2Dialect.scala  |   7 -
 .../org/apache/spark/sql/jdbc/MySQLDialect.scala   |  15 ++
 .../org/apache/spark/sql/jdbc/JDBCV2Suite.scala|   6 +-
 12 files changed, 291 insertions(+), 14 deletions(-)

diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
index 1a25cd2802dd..11ddce68aecd 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
@@ -67,6 +67,12 @@ class DB2IntegrationSuite extends 
DockerJDBCIntegrationV2Suite with V2JDBCTest {
 connection.prepareStatement(
   "CREATE TABLE employee (dept INTEGER, name VARCHAR(10), salary 
DECIMAL(20, 2), bonus DOUBLE)")
   .executeUpdate()
+connection.prepareStatement(
+  s"""CREATE TABLE pattern_testing_table (
+ |pattern_testing_col LONGTEXT
+ |)
+   """.stripMargin
+).executeUpdate()
   }
 
   override def testUpdateColumnType(tbl: String): Unit = {
diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
index 72edfc9f1bf1..a42caeafe6fe 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
@@ -38,6 +38,17 @@ abstract class DockerJDBCIntegrationV2Suite extends 
DockerJDBCIntegrationSuite {
   .executeUpdate()
 connection.prepareStatement("INSERT INTO employee VALUES (6, 'jen', 12000, 
1200)")
   .executeUpdate()
+
+connection.prepareStatement(
+  s"""
+ |INSERT INTO pattern_testing_table VALUES
+ |('special_character_quote\\'_present'),
+ |('special_character_quote_not_present'),
+ |('special_character_percent%_present'),
+ |('special_character_percent_not_present'),
+ |('special_character_underscore_present'),
+ |('special_character_underscorenot_present')
+ """.stripMargin).executeUpdate()
   }
 
   def tablePreparation(connection: Connection): Unit
diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
index a527c6f8cb5b..6658b5ed6c77 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
@@ -66,6 +66,12 @@ class MsSqlServerIntegrationSuite extends 
DockerJ

(spark) branch branch-3.5 updated: [SPARK-48172][SQL] Fix escaping issues in JDBC Dialects

2024-05-14 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new f37fa436cd4e [SPARK-48172][SQL] Fix escaping issues in JDBC Dialects
f37fa436cd4e is described below

commit f37fa436cd4e0ef9f486a60f9af91a3ce0195df9
Author: Mihailo Milosevic 
AuthorDate: Tue May 14 23:31:46 2024 +0800

[SPARK-48172][SQL] Fix escaping issues in JDBC Dialects

### What changes were proposed in this pull request?
Special case escaping for MySQL and fix issues with redundant escaping for 
' character.

### Why are the changes needed?
When pushing down startsWith, endsWith and contains they are converted to 
LIKE. This requires addition of escape characters for these expressions. 
Unfortunately, MySQL uses ESCAPE '\\' syntax instead of ESCAPE '\' which would 
cause errors when trying to push down.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Tests for each existing dialect.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46437 from mihailom-db/SPARK-48172.

Authored-by: Mihailo Milosevic 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 47006a493f98ca85196194d16d58b5847177b1a3)
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/jdbc/v2/DB2IntegrationSuite.scala|   6 +
 .../sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala |  11 +
 .../sql/jdbc/v2/MsSqlServerIntegrationSuite.scala  |   6 +
 .../spark/sql/jdbc/v2/MySQLIntegrationSuite.scala  |   6 +
 .../spark/sql/jdbc/v2/OracleIntegrationSuite.scala |   6 +
 .../sql/jdbc/v2/PostgresIntegrationSuite.scala |   6 +
 .../org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala  | 229 +
 .../sql/connector/util/V2ExpressionSQLBuilder.java |   3 -
 .../sql/connector/expressions/expressions.scala|   4 +-
 .../org/apache/spark/sql/jdbc/H2Dialect.scala  |   7 -
 .../org/apache/spark/sql/jdbc/MySQLDialect.scala   |  15 ++
 .../org/apache/spark/sql/jdbc/JDBCV2Suite.scala|   6 +-
 12 files changed, 291 insertions(+), 14 deletions(-)

diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
index 9a78244f5326..9b4916ddd36b 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
@@ -80,6 +80,12 @@ class DB2IntegrationSuite extends 
DockerJDBCIntegrationV2Suite with V2JDBCTest {
 connection.prepareStatement(
   "CREATE TABLE employee (dept INTEGER, name VARCHAR(10), salary 
DECIMAL(20, 2), bonus DOUBLE)")
   .executeUpdate()
+connection.prepareStatement(
+  s"""CREATE TABLE pattern_testing_table (
+ |pattern_testing_col LONGTEXT
+ |)
+   """.stripMargin
+).executeUpdate()
   }
 
   override def testUpdateColumnType(tbl: String): Unit = {
diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
index 72edfc9f1bf1..a42caeafe6fe 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
@@ -38,6 +38,17 @@ abstract class DockerJDBCIntegrationV2Suite extends 
DockerJDBCIntegrationSuite {
   .executeUpdate()
 connection.prepareStatement("INSERT INTO employee VALUES (6, 'jen', 12000, 
1200)")
   .executeUpdate()
+
+connection.prepareStatement(
+  s"""
+ |INSERT INTO pattern_testing_table VALUES
+ |('special_character_quote\\'_present'),
+ |('special_character_quote_not_present'),
+ |('special_character_percent%_present'),
+ |('special_character_percent_not_present'),
+ |('special_character_underscore_present'),
+ |('special_character_underscorenot_present')
+ """.stripMargin).executeUpdate()
   }
 
   def tablePreparation(connection: Connection): Unit
diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
index 0dc3a39f4db5..57a2667557fa 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
@@ -86,6 +86,12 @@ class MsSqlServerIntegrationSuite extends 
DockerJ

(spark) branch master updated: [SPARK-48172][SQL] Fix escaping issues in JDBC Dialects

2024-05-14 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 47006a493f98 [SPARK-48172][SQL] Fix escaping issues in JDBC Dialects
47006a493f98 is described below

commit 47006a493f98ca85196194d16d58b5847177b1a3
Author: Mihailo Milosevic 
AuthorDate: Tue May 14 23:31:46 2024 +0800

[SPARK-48172][SQL] Fix escaping issues in JDBC Dialects

### What changes were proposed in this pull request?
Special case escaping for MySQL and fix issues with redundant escaping for 
' character.

### Why are the changes needed?
When startsWith, endsWith and contains are pushed down, they are converted to
LIKE, which requires adding escape characters to the pattern. Unfortunately,
MySQL uses the ESCAPE '\\' syntax instead of ESCAPE '\', which would cause
errors when trying to push down.
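
For illustration only, here is a minimal Scala sketch of the idea, assuming a
hypothetical `LikeSqlBuilder` helper rather than the actual dialect classes
touched by this patch: the literal fragment is escaped the same way everywhere,
but the text of the ESCAPE clause has to be dialect-specific.
```scala
// Hedged sketch, not the real Spark dialect code: it only shows why the ESCAPE
// clause text must differ per dialect when startsWith is compiled to LIKE.
object LikeSqlBuilder {
  // Escape LIKE wildcards and the escape character itself in a literal fragment.
  private def escapePattern(literal: String, escapeChar: Char): String =
    literal.flatMap {
      case c @ ('_' | '%')      => s"$escapeChar$c"
      case c if c == escapeChar => s"$escapeChar$c"
      case c                    => c.toString
    }

  // Most dialects accept ESCAPE '\'; MySQL expects the backslash doubled: ESCAPE '\\'.
  def startsWith(col: String, prefix: String, isMySql: Boolean): String = {
    val escaped = escapePattern(prefix, '\\')
    val escapeClause = if (isMySql) "ESCAPE '\\\\'" else "ESCAPE '\\'"
    s"$col LIKE '$escaped%' $escapeClause"
  }
}

// LikeSqlBuilder.startsWith("name", "a_b", isMySql = true)
//   returns: name LIKE 'a\_b%' ESCAPE '\\'
// LikeSqlBuilder.startsWith("name", "a_b", isMySql = false)
//   returns: name LIKE 'a\_b%' ESCAPE '\'
```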

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Tests for each existing dialect.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46437 from mihailom-db/SPARK-48172.

Authored-by: Mihailo Milosevic 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/jdbc/v2/DB2IntegrationSuite.scala|   6 +
 .../sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala |  11 +
 .../sql/jdbc/v2/MsSqlServerIntegrationSuite.scala  |   6 +
 .../spark/sql/jdbc/v2/MySQLIntegrationSuite.scala  |   6 +
 .../spark/sql/jdbc/v2/OracleIntegrationSuite.scala |   6 +
 .../sql/jdbc/v2/PostgresIntegrationSuite.scala |   6 +
 .../org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala  | 229 +
 .../sql/connector/util/V2ExpressionSQLBuilder.java |   1 -
 .../sql/connector/expressions/expressions.scala|   4 +-
 .../org/apache/spark/sql/jdbc/H2Dialect.scala  |   7 -
 .../org/apache/spark/sql/jdbc/MySQLDialect.scala   |  15 ++
 .../org/apache/spark/sql/jdbc/JDBCV2Suite.scala|   6 +-
 12 files changed, 291 insertions(+), 12 deletions(-)

diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
index 3642094d11b2..36795747319d 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
@@ -62,6 +62,12 @@ class DB2IntegrationSuite extends 
DockerJDBCIntegrationV2Suite with V2JDBCTest {
 connection.prepareStatement(
   "CREATE TABLE employee (dept INTEGER, name VARCHAR(10), salary 
DECIMAL(20, 2), bonus DOUBLE)")
   .executeUpdate()
+connection.prepareStatement(
+  s"""CREATE TABLE pattern_testing_table (
+ |pattern_testing_col LONGTEXT
+ |)
+   """.stripMargin
+).executeUpdate()
   }
 
   override def testUpdateColumnType(tbl: String): Unit = {
diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
index 72edfc9f1bf1..a42caeafe6fe 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
@@ -38,6 +38,17 @@ abstract class DockerJDBCIntegrationV2Suite extends 
DockerJDBCIntegrationSuite {
   .executeUpdate()
 connection.prepareStatement("INSERT INTO employee VALUES (6, 'jen', 12000, 
1200)")
   .executeUpdate()
+
+connection.prepareStatement(
+  s"""
+ |INSERT INTO pattern_testing_table VALUES
+ |('special_character_quote\\'_present'),
+ |('special_character_quote_not_present'),
+ |('special_character_percent%_present'),
+ |('special_character_percent_not_present'),
+ |('special_character_underscore_present'),
+ |('special_character_underscorenot_present')
+ """.stripMargin).executeUpdate()
   }
 
   def tablePreparation(connection: Connection): Unit
diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
index b1b8aec5ad33..46530fe5419a 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
+++ 
b/connector/docker-integration-tes

[jira] [Resolved] (SPARK-48155) PropagateEmpty relation cause LogicalQueryStage only with broadcast without join then execute failed

2024-05-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48155.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46523
[https://github.com/apache/spark/pull/46523]

> PropagateEmpty relation cause LogicalQueryStage only with broadcast without 
> join then execute failed
> 
>
> Key: SPARK-48155
> URL: https://issues.apache.org/jira/browse/SPARK-48155
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.5.1, 3.3.4
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> {code:java}
> 24/05/07 09:48:55 ERROR [main] PlanChangeLogger:
> === Applying Rule 
> org.apache.spark.sql.execution.adaptive.AQEPropagateEmptyRelation ===
>  Project [date#124, station_name#0, shipment_id#14]
>  +- Filter (status#2L INSET 1, 149, 2, 36, 400, 417, 418, 419, 49, 5, 50, 581 
> AND station_type#1 IN (3,12))
>     +- Aggregate [date#124, shipment_id#14], [date#124, shipment_id#14, ... 3 
> more fields] 
> !      +- Join LeftOuter, ((cast(date#124 as timestamp) >= 
> cast(from_unixtime((ctime#27L - 0), -MM-dd HH:mm:ss, 
> Some(Asia/Singapore)) as timestamp)) AND (cast(date#124 as timestamp) + 
> INTERVAL '-4' DAY <= cast(from_unixtime((ctime#27L - 0), -MM-dd HH:mm:ss, 
> Some(Asia/Singapore)) as timestamp)))
> !         :- LogicalQueryStage Generate 
> explode(org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@3a191e40), 
> false, [date#124], BroadcastQueryStage 0
> !         +- LocalRelation , [shipment_id#14, station_name#5, ... 3 
> more fields]24/05/07 09:48:55 ERROR [main] 
> Project [date#124, station_name#0, shipment_id#14]
>  +- Filter (status#2L INSET 1, 149, 2, 36, 400, 417, 418, 419, 49, 5, 50, 581 
> AND station_type#1 IN (3,12))
>     +- Aggregate [date#124, shipment_id#14], [date#124, shipment_id#14, ... 3 
> more fields]
> !      +- Project [date#124, cast(null as string) AS shipment_id#14, ... 4 
> more fields]
> !         +- LogicalQueryStage Generate 
> explode(org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@3a191e40), 
> false, [date#124], BroadcastQueryStage 0 {code}
> {code:java}
> java.util.concurrent.ExecutionException: java.lang.RuntimeException: 
> java.lang.UnsupportedOperationException: BroadcastExchange does not support 
> the execute() code path.at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.executeCodePathUnsupportedError(QueryExecutionErrors.scala:1652)
> at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecute(BroadcastExchangeExec.scala:203)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:184)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:222)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:219)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:180)
> at 
> org.apache.spark.sql.execution.adaptive.QueryStageExec.doExecute(QueryStageExec.scala:119)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:184)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:222)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:219)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:180)
> at 
> org.apache.spark.sql.execution.InputAdapter.inputRDD(WholeStageCodegenExec.scala:526)
> at 
> org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs(WholeStageCodegenExec.scala:454)
> at 
> org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs$(WholeStageCodegenExec.scala:453)
> at 
> org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:497)
> at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:50)
> at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:132)  
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:750)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:184)
>

[jira] [Assigned] (SPARK-48155) PropagateEmpty relation cause LogicalQueryStage only with broadcast without join then execute failed

2024-05-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48155:
---

Assignee: angerszhu

> PropagateEmpty relation cause LogicalQueryStage only with broadcast without 
> join then execute failed
> 
>
> Key: SPARK-48155
> URL: https://issues.apache.org/jira/browse/SPARK-48155
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.5.1, 3.3.4
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> 24/05/07 09:48:55 ERROR [main] PlanChangeLogger:
> === Applying Rule 
> org.apache.spark.sql.execution.adaptive.AQEPropagateEmptyRelation ===
>  Project [date#124, station_name#0, shipment_id#14]
>  +- Filter (status#2L INSET 1, 149, 2, 36, 400, 417, 418, 419, 49, 5, 50, 581 
> AND station_type#1 IN (3,12))
>     +- Aggregate [date#124, shipment_id#14], [date#124, shipment_id#14, ... 3 
> more fields] 
> !      +- Join LeftOuter, ((cast(date#124 as timestamp) >= 
> cast(from_unixtime((ctime#27L - 0), -MM-dd HH:mm:ss, 
> Some(Asia/Singapore)) as timestamp)) AND (cast(date#124 as timestamp) + 
> INTERVAL '-4' DAY <= cast(from_unixtime((ctime#27L - 0), -MM-dd HH:mm:ss, 
> Some(Asia/Singapore)) as timestamp)))
> !         :- LogicalQueryStage Generate 
> explode(org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@3a191e40), 
> false, [date#124], BroadcastQueryStage 0
> !         +- LocalRelation , [shipment_id#14, station_name#5, ... 3 
> more fields]24/05/07 09:48:55 ERROR [main] 
> Project [date#124, station_name#0, shipment_id#14]
>  +- Filter (status#2L INSET 1, 149, 2, 36, 400, 417, 418, 419, 49, 5, 50, 581 
> AND station_type#1 IN (3,12))
>     +- Aggregate [date#124, shipment_id#14], [date#124, shipment_id#14, ... 3 
> more fields]
> !      +- Project [date#124, cast(null as string) AS shipment_id#14, ... 4 
> more fields]
> !         +- LogicalQueryStage Generate 
> explode(org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@3a191e40), 
> false, [date#124], BroadcastQueryStage 0 {code}
> {code:java}
> java.util.concurrent.ExecutionException: java.lang.RuntimeException: 
> java.lang.UnsupportedOperationException: BroadcastExchange does not support 
> the execute() code path.at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.executeCodePathUnsupportedError(QueryExecutionErrors.scala:1652)
> at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecute(BroadcastExchangeExec.scala:203)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:184)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:222)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:219)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:180)
> at 
> org.apache.spark.sql.execution.adaptive.QueryStageExec.doExecute(QueryStageExec.scala:119)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:184)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:222)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:219)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:180)
> at 
> org.apache.spark.sql.execution.InputAdapter.inputRDD(WholeStageCodegenExec.scala:526)
> at 
> org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs(WholeStageCodegenExec.scala:454)
> at 
> org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs$(WholeStageCodegenExec.scala:453)
> at 
> org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:497)
> at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:50)
> at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:132)  
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:750)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:184)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkP

(spark) branch master updated: [SPARK-48155][SQL] AQEPropagateEmptyRelation for join should check if remain child is just BroadcastQueryStageExec

2024-05-14 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e5ad5e94a8c8 [SPARK-48155][SQL] AQEPropagateEmptyRelation for join 
should check if remain child is just BroadcastQueryStageExec
e5ad5e94a8c8 is described below

commit e5ad5e94a8c891210637084a69359c1364201653
Author: Angerszh 
AuthorDate: Tue May 14 17:32:56 2024 +0800

[SPARK-48155][SQL] AQEPropagateEmptyRelation for join should check if 
remain child is just BroadcastQueryStageExec

### What changes were proposed in this pull request?
It's a new approach to fixing
[SPARK-39551](https://issues.apache.org/jira/browse/SPARK-39551). The situation
happens in AQEPropagateEmptyRelation when one join side is empty and the other
side is a BroadcastQueryStageExec. This PR avoids doing the propagation instead
of reverting all of the queryStagePreparationRules' results.
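
A minimal sketch of that guard, assuming a hypothetical
`EmptyRelationPropagationGuard` helper (this is not the code in the PR):
```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.adaptive.{BroadcastQueryStageExec, LogicalQueryStage}

object EmptyRelationPropagationGuard {
  // Dropping the empty join side is only safe if the surviving side can be
  // executed on its own. A LogicalQueryStage wrapping a BroadcastQueryStageExec
  // cannot, because BroadcastExchange has no execute() code path.
  def isBroadcastStage(plan: LogicalPlan): Boolean = plan match {
    case LogicalQueryStage(_, _: BroadcastQueryStageExec) => true
    case _ => false
  }
  // The rule would then skip propagation when the remaining child is such a stage.
}
```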

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually tested `SPARK-39551: Invalid plan check - invalid broadcast query
stage`; it still passes without the original fix once this PR is applied.

For added UT,
```
  test("SPARK-48155: AQEPropagateEmptyRelation check remained child for 
join") {
withSQLConf(
  SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
  val (_, adaptivePlan) = runAdaptiveAndVerifyResult(
"""
  |SELECT /*+ BROADCAST(t3) */ t3.b, count(t3.a) FROM testData2 t1
  |INNER JOIN (
  |  SELECT * FROM testData2
  |  WHERE b = 0
  |  UNION ALL
  |  SELECT * FROM testData2
  |  WHErE b != 0
  |) t2
  |ON t1.b = t2.b AND t1.a = 0
  |RIGHT OUTER JOIN testData2 t3
  |ON t1.a > t3.a
  |GROUP BY t3.b
""".stripMargin
  )
  assert(findTopLevelBroadcastNestedLoopJoin(adaptivePlan).size == 1)
  assert(findTopLevelUnion(adaptivePlan).size == 0)
}
  }
```

before this pr the adaptive plan is
```
*(9) HashAggregate(keys=[b#226], functions=[count(1)], output=[b#226, 
count(a)#228L])
+- AQEShuffleRead coalesced
   +- ShuffleQueryStage 3
  +- Exchange hashpartitioning(b#226, 5), ENSURE_REQUIREMENTS, 
[plan_id=356]
 +- *(8) HashAggregate(keys=[b#226], functions=[partial_count(1)], 
output=[b#226, count#232L])
+- *(8) Project [b#226]
   +- BroadcastNestedLoopJoin BuildRight, RightOuter, (a#23 > 
a#225)
  :- *(7) Project [a#23]
  :  +- *(7) SortMergeJoin [b#24], [b#220], Inner
  : :- *(5) Sort [b#24 ASC NULLS FIRST], false, 0
  : :  +- AQEShuffleRead coalesced
  : : +- ShuffleQueryStage 0
  : :+- Exchange hashpartitioning(b#24, 5), 
ENSURE_REQUIREMENTS, [plan_id=211]
  : :   +- *(1) Filter (a#23 = 0)
  : :  +- *(1) SerializeFromObject 
[knownnotnull(assertnotnull(input[0, 
org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#23, 
knownnotnull(assertnotnull(input[0, 
org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#24]
  : : +- Scan[obj#22]
  : +- *(6) Sort [b#220 ASC NULLS FIRST], false, 0
  :+- AQEShuffleRead coalesced
  :   +- ShuffleQueryStage 1
  :  +- Exchange hashpartitioning(b#220, 5), 
ENSURE_REQUIREMENTS, [plan_id=233]
  : +- Union
  ::- *(2) Project [b#220]
  ::  +- *(2) Filter (b#220 = 0)
  :: +- *(2) SerializeFromObject 
[knownnotnull(assertnotnull(input[0, 
org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#219, 
knownnotnull(assertnotnull(input[0, 
org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#220]
  ::+- Scan[obj#218]
  :+- *(3) Project [b#223]
  :   +- *(3) Filter NOT (b#223 = 0)
  :  +- *(3) SerializeFromObject 
[knownnotnull(assertnotnull(input[0, 
org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#222, 
knownnotnull(assertnotnull(input[0, 
org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#223]
  : +-

[jira] [Created] (SPARK-48271) support char/varchar in RowEncoder

2024-05-14 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-48271:
---

 Summary: support char/varchar in RowEncoder
 Key: SPARK-48271
 URL: https://issues.apache.org/jira/browse/SPARK-48271
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-46707][SQL][FOLLOWUP] Push down throwable predicate through aggregates

2024-05-14 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6766c39b458a [SPARK-46707][SQL][FOLLOWUP] Push down throwable 
predicate through aggregates
6766c39b458a is described below

commit 6766c39b458ad7abacd1a5b11c896efabf36f95c
Author: zml1206 
AuthorDate: Tue May 14 15:53:43 2024 +0800

[SPARK-46707][SQL][FOLLOWUP] Push down throwable predicate through 
aggregates

### What changes were proposed in this pull request?
Push down throwable predicates through aggregates and add a UT for "can't push
down nondeterministic filter through aggregate".

### Why are the changes needed?
If we can push down a filter through an Aggregate, the filter only references
the grouping keys. The Aggregate operator can't reduce the distinct values of
the grouping keys, so the filter won't see any new data after the pushdown.
Therefore, pushing a throwable filter through an aggregate does not change
which exceptions are thrown.
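
As a self-contained illustration of that reasoning (not part of the patch), the
HAVING predicate below references only the grouping key, so it is safe to push
below the aggregate even though it could throw under ANSI mode; the table and
column names are made up:
```scala
import org.apache.spark.sql.SparkSession

object ThrowablePredicatePushdownExample extends App {
  val spark = SparkSession.builder().master("local[*]").appName("pushdown-demo").getOrCreate()

  spark.range(0, 10).selectExpr("id % 3 AS a", "id AS b").createOrReplaceTempView("t")

  // element_at can throw on an out-of-range index, but `a` only takes values
  // produced by the aggregate's child, so pushing the filter down exposes it to
  // the same inputs and therefore the same potential exceptions.
  val df = spark.sql(
    """
      |SELECT a, COUNT(b) AS c
      |FROM t
      |GROUP BY a
      |HAVING element_at(array(10, 20, 30), CAST(a + 1 AS INT)) > 15
      |""".stripMargin)

  df.explain(true) // the Filter is expected below the Aggregate in the optimized plan
  df.show()
  spark.stop()
}
```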

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
UT

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44975 from zml1206/SPARK-46707-FOLLOWUP.

Authored-by: zml1206 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/optimizer/Optimizer.scala  |  8 ++--
 .../sql/catalyst/optimizer/FilterPushdownSuite.scala  | 19 ---
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
index dfc1e17c2a29..4ee6d9027a9c 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
@@ -1768,6 +1768,10 @@ object PushPredicateThroughNonJoin extends 
Rule[LogicalPlan] with PredicateHelpe
   val aliasMap = getAliasMap(project)
   project.copy(child = Filter(replaceAlias(condition, aliasMap), 
grandChild))
 
+// We can push down deterministic predicate through Aggregate, including 
throwable predicate.
+// If we can push down a filter through Aggregate, it means the filter 
only references the
+// grouping keys or constants. The Aggregate operator can't reduce 
distinct values of grouping
+// keys so the filter won't see any new data after push down.
 case filter @ Filter(condition, aggregate: Aggregate)
   if aggregate.aggregateExpressions.forall(_.deterministic)
 && aggregate.groupingExpressions.nonEmpty =>
@@ -1777,8 +1781,8 @@ object PushPredicateThroughNonJoin extends 
Rule[LogicalPlan] with PredicateHelpe
   // attributes produced by the aggregate operator's child operator.
   val (pushDown, stayUp) = splitConjunctivePredicates(condition).partition 
{ cond =>
 val replaced = replaceAlias(cond, aliasMap)
-cond.deterministic && !cond.throwable &&
-  cond.references.nonEmpty && 
replaced.references.subsetOf(aggregate.child.outputSet)
+cond.deterministic && cond.references.nonEmpty &&
+  replaced.references.subsetOf(aggregate.child.outputSet)
   }
 
   if (pushDown.nonEmpty) {
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala
index 03e65412d166..5027222be6b8 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala
@@ -219,6 +219,17 @@ class FilterPushdownSuite extends PlanTest {
 comparePlans(optimized, correctAnswer)
   }
 
+  test("Can't push down nondeterministic filter through aggregate") {
+val originalQuery = testRelation
+  .groupBy($"a")($"a", count($"b") as "c")
+  .where(Rand(10) > $"a")
+  .analyze
+
+val optimized = Optimize.execute(originalQuery)
+
+comparePlans(optimized, originalQuery)
+  }
+
   test("filters: combines filters") {
 val originalQuery = testRelation
   .select($"a")
@@ -1483,14 +1494,16 @@ class FilterPushdownSuite extends PlanTest {
   test("SPARK-46707: push down predicate with sequence (without step) through 
aggregates") {
 val x = testRelation.subquery("x")
 
-// do not push down when sequence has step param
+// Always push down sequence as it's deterministic
 val queryWithStep = x.groupBy($"x.a", $"x.b"

[jira] [Resolved] (SPARK-48157) CSV expressions (all collations)

2024-05-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48157.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46504
[https://github.com/apache/spark/pull/46504]

> CSV expressions (all collations)
> 
>
> Key: SPARK-48157
> URL: https://issues.apache.org/jira/browse/SPARK-48157
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Enable collation support for *CSV* built-in string functions in Spark 
> ({*}CsvToStructs{*}, {*}SchemaOfCsv{*}, {*}StructsToCsv{*}). First confirm 
> what is the expected behaviour for these functions when given collated 
> strings, and then move on to implementation and testing. You will find these 
> expressions in the *csvExpressions.scala* file, and they should mostly be 
> pass-through functions. Implement the corresponding E2E SQL tests 
> (CollationSQLExpressionsSuite) to reflect how this function should be used 
> with collation in SparkSQL, and feel free to use your chosen Spark SQL Editor 
> to experiment with the existing functions to learn more about how they work. 
> In addition, look into the possible use-cases and implementation of similar 
> functions within other open-source DBMS, such as 
> [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *CSV* expressions so that 
> they support all collation types currently supported in Spark. To understand 
> what changes were introduced in order to enable full collation support for 
> other existing functions in Spark, take a look at the Spark PRs and Jira 
> tickets for completed tasks in this parent (for example: Ascii, Chr, Base64, 
> UnBase64, Decode, StringDecode, Encode, ToBinary, FormatNumber, Sentences).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for string 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48157) CSV expressions (all collations)

2024-05-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48157:
---

Assignee: Uroš Bojanić

> CSV expressions (all collations)
> 
>
> Key: SPARK-48157
> URL: https://issues.apache.org/jira/browse/SPARK-48157
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>
> Enable collation support for *CSV* built-in string functions in Spark 
> ({*}CsvToStructs{*}, {*}SchemaOfCsv{*}, {*}StructsToCsv{*}). First confirm 
> what is the expected behaviour for these functions when given collated 
> strings, and then move on to implementation and testing. You will find these 
> expressions in the *csvExpressions.scala* file, and they should mostly be 
> pass-through functions. Implement the corresponding E2E SQL tests 
> (CollationSQLExpressionsSuite) to reflect how this function should be used 
> with collation in SparkSQL, and feel free to use your chosen Spark SQL Editor 
> to experiment with the existing functions to learn more about how they work. 
> In addition, look into the possible use-cases and implementation of similar 
> functions within other open-source DBMS, such as 
> [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *CSV* expressions so that 
> they support all collation types currently supported in Spark. To understand 
> what changes were introduced in order to enable full collation support for 
> other existing functions in Spark, take a look at the Spark PRs and Jira 
> tickets for completed tasks in this parent (for example: Ascii, Chr, Base64, 
> UnBase64, Decode, StringDecode, Encode, ToBinary, FormatNumber, Sentences).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for string 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-48157][SQL] Add collation support for CSV expressions

2024-05-14 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e6c914f63079 [SPARK-48157][SQL] Add collation support for CSV 
expressions
e6c914f63079 is described below

commit e6c914f630793992eba7a409ec2cd061f385ce02
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Tue May 14 14:17:45 2024 +0800

[SPARK-48157][SQL] Add collation support for CSV expressions

### What changes were proposed in this pull request?
Introduce collation awareness for CSV expressions: from_csv, schema_of_csv, 
to_csv.

### Why are the changes needed?
Add collation support for CSV expressions in Spark.

### Does this PR introduce _any_ user-facing change?
Yes, users should now be able to use collated strings within arguments for 
CSV functions: from_csv, schema_of_csv, to_csv.
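
A hedged usage sketch (not taken from the patch), assuming the collate()
function from the Spark 4.0 collation work; the values and aliases below are
made up:
```scala
import org.apache.spark.sql.SparkSession

object CsvCollationExample extends App {
  val spark = SparkSession.builder().master("local[*]").appName("csv-collation").getOrCreate()

  // A collated string should now be accepted as the CSV input argument...
  spark.sql("SELECT from_csv(collate('1,true', 'UNICODE_CI'), 'a INT, b BOOLEAN') AS parsed").show()

  // ...and schema_of_csv / to_csv produce strings in the session's default collation.
  spark.sql("SELECT schema_of_csv(collate('1,abc', 'UNICODE_CI')) AS csv_schema").show()
  spark.sql("SELECT to_csv(named_struct('a', 1, 'b', 'x')) AS csv").show()

  spark.stop()
}
```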

### How was this patch tested?
E2E SQL tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46504 from uros-db/csv-expressions.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/expressions/csvExpressions.scala  |   7 +-
 .../spark/sql/CollationSQLExpressionsSuite.scala   | 112 +
 2 files changed, 116 insertions(+), 3 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala
index 4714fc1ded9c..cb10440c4832 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala
@@ -31,6 +31,7 @@ import org.apache.spark.sql.catalyst.util._
 import org.apache.spark.sql.catalyst.util.TypeUtils._
 import org.apache.spark.sql.errors.{QueryCompilationErrors, QueryErrorsBase}
 import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.internal.types.StringTypeAnyCollation
 import org.apache.spark.sql.types._
 import org.apache.spark.unsafe.types.UTF8String
 
@@ -146,7 +147,7 @@ case class CsvToStructs(
 converter(parser.parse(csv))
   }
 
-  override def inputTypes: Seq[AbstractDataType] = StringType :: Nil
+  override def inputTypes: Seq[AbstractDataType] = StringTypeAnyCollation :: 
Nil
 
   override def prettyName: String = "from_csv"
 
@@ -177,7 +178,7 @@ case class SchemaOfCsv(
 child = child,
 options = ExprUtils.convertToMapData(options))
 
-  override def dataType: DataType = StringType
+  override def dataType: DataType = SQLConf.get.defaultStringType
 
   override def nullable: Boolean = false
 
@@ -300,7 +301,7 @@ case class StructsToCsv(
 (row: Any) => 
UTF8String.fromString(gen.writeToString(row.asInstanceOf[InternalRow]))
   }
 
-  override def dataType: DataType = StringType
+  override def dataType: DataType = SQLConf.get.defaultStringType
 
   override def withTimeZone(timeZoneId: String): TimeZoneAwareExpression =
 copy(timeZoneId = Option(timeZoneId))
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
index 22b29154cd78..f8b3548b956c 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
@@ -313,6 +313,118 @@ class CollationSQLExpressionsSuite
 })
   }
 
+  test("Support CsvToStructs csv expression with collation") {
+case class CsvToStructsTestCase(
+ input: String,
+ collationName: String,
+ schema: String,
+ options: String,
+ result: Row,
+ structFields: Seq[StructField]
+)
+
+val testCases = Seq(
+  CsvToStructsTestCase("1", "UTF8_BINARY", "'a INT'", "",
+Row(1), Seq(
+  StructField("a", IntegerType, nullable = true)
+)),
+  CsvToStructsTestCase("true, 0.8", "UTF8_BINARY_LCASE", "'A BOOLEAN, B 
DOUBLE'", "",
+Row(true, 0.8), Seq(
+  StructField("A", BooleanType, nullable = true),
+  StructField("B", DoubleType, nullable = true)
+)),
+  CsvToStructsTestCase("\"Spark\"", "UNICODE", "'a STRING'", "",
+Row("Spark"), Seq(
+  StructField("a", StringType("UNICODE"), nullable = true)
+)),
+  CsvToStructsTestCase("26/08/2015", "UTF8_BINARY", "'time Timestamp'",
+   

[jira] [Resolved] (SPARK-48229) inputFile expressions (all collations)

2024-05-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48229.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46503
[https://github.com/apache/spark/pull/46503]

> inputFile expressions (all collations)
> --
>
> Key: SPARK-48229
> URL: https://issues.apache.org/jira/browse/SPARK-48229
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-48229][SQL] Add collation support for inputFile expressions

2024-05-14 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9241b8e8c0df [SPARK-48229][SQL] Add collation support for inputFile 
expressions
9241b8e8c0df is described below

commit 9241b8e8c0dfe35fbe1631fd440527eb72d88de8
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Tue May 14 14:08:30 2024 +0800

[SPARK-48229][SQL] Add collation support for inputFile expressions

### What changes were proposed in this pull request?
Introduce collation awareness for inputFile expressions: input_file_name.

### Why are the changes needed?
Add collation support for inputFile expressions in Spark.

### Does this PR introduce _any_ user-facing change?
Yes, users should now be able to use collated strings within arguments for 
inputFile functions: input_file_name.
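
A hedged sketch (not part of the patch) of what the change means in practice;
the expectation in the comment mirrors the added test only loosely:
```scala
import org.apache.spark.sql.SparkSession

object InputFileNameCollationExample extends App {
  val spark = SparkSession.builder().master("local[*]").appName("input-file-name").getOrCreate()

  val df = spark.sql("SELECT input_file_name() AS f")
  df.show() // empty string when no file-based source is being scanned

  // After this change the column should report the session's default string type
  // (e.g. StringType("UNICODE_CI") when a collated default is configured) instead
  // of always being the plain, non-collated StringType.
  println(df.schema("f").dataType)

  spark.stop()
}
```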

### How was this patch tested?
E2E SQL tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46503 from uros-db/input-file-block.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/expressions/inputFileBlock.scala |  5 +++--
 .../apache/spark/sql/CollationSQLExpressionsSuite.scala | 17 +
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/inputFileBlock.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/inputFileBlock.scala
index 6cd88367aa9a..65eb995ff32f 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/inputFileBlock.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/inputFileBlock.scala
@@ -21,7 +21,8 @@ import org.apache.spark.rdd.InputFileBlockHolder
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, 
CodeGenerator, ExprCode, FalseLiteral}
 import org.apache.spark.sql.catalyst.expressions.codegen.Block._
-import org.apache.spark.sql.types.{DataType, LongType, StringType}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{DataType, LongType}
 import org.apache.spark.unsafe.types.UTF8String
 
 // scalastyle:off whitespace.end.of.line
@@ -39,7 +40,7 @@ case class InputFileName() extends LeafExpression with 
Nondeterministic {
 
   override def nullable: Boolean = false
 
-  override def dataType: DataType = StringType
+  override def dataType: DataType = SQLConf.get.defaultStringType
 
   override def prettyName: String = "input_file_name"
 
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
index dd5703d1284a..22b29154cd78 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
@@ -1275,6 +1275,23 @@ class CollationSQLExpressionsSuite
 })
   }
 
+  test("Support InputFileName expression with collation") {
+// Supported collations
+Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", 
"UNICODE_CI").foreach(collationName => {
+  val query =
+s"""
+   |select input_file_name()
+   |""".stripMargin
+  // Result
+  withSQLConf(SqlApiConf.DEFAULT_COLLATION -> collationName) {
+val testQuery = sql(query)
+checkAnswer(testQuery, Row(""))
+val dataType = StringType(collationName)
+assert(testQuery.schema.fields.head.dataType.sameType(dataType))
+  }
+})
+  }
+
   // TODO: Add more tests for other SQL expressions
 
 }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[jira] [Assigned] (SPARK-48265) Infer window group limit batch should do constant folding

2024-05-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48265:
---

Assignee: angerszhu

> Infer window group limit batch should do constant folding
> -
>
> Key: SPARK-48265
> URL: https://issues.apache.org/jira/browse/SPARK-48265
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.1
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> 24/05/13 17:39:25 ERROR [main] PlanChangeLogger:
> === Result of Batch LocalRelation ===
>  GlobalLimit 21                                                               
>                                                               GlobalLimit 21
>  +- LocalLimit 21                                                             
>                                                               +- LocalLimit 21
> !   +- Union false, false                                                     
>                                                                  +- 
> LocalLimit 21
> !      :- LocalLimit 21                                                       
>                                                                     +- 
> Project [item_id#647L]
> !      :  +- Project [item_id#647L]                                           
>                                                                        +- 
> Filter (((isnotnull(tz_type#734) AND (tz_type#734 = local)) AND 
> (grass_region#735 = BR)) AND isnotnull(grass_region#735))
> !      :     +- Filter (((isnotnull(tz_type#734) AND (tz_type#734 = local)) 
> AND (grass_region#735 = BR)) AND isnotnull(grass_region#735))               
> +- Relation db.table[,... 91 more fields] parquet
> !      :        +- Relation db.table[,... 91 more fields] parquet
> !      +- LocalLimit 21
> !         +- Project [item_id#738L]
> !            +- LocalRelation , [, ... 91 more fields]
> 24/05/13 17:39:25 ERROR [main] PlanChangeLogger: Batch Check Cartesian 
> Products has no effect.
> 24/05/13 17:39:25 ERROR [main] PlanChangeLogger: Batch RewriteSubquery has no 
> effect.
> 24/05/13 17:39:25 ERROR [main] PlanChangeLogger: Batch 
> NormalizeFloatingNumbers has no effect.
> 24/05/13 17:39:25 ERROR [main] PlanChangeLogger: Batch 
> ReplaceUpdateFieldsExpression has no effect.
> 24/05/13 17:39:25 ERROR [main] PlanChangeLogger: Batch Optimize Metadata Only 
> Query has no effect.
> 24/05/13 17:39:25 ERROR [main] PlanChangeLogger: Batch PartitionPruning has 
> no effect.
> 24/05/13 17:39:25 ERROR [main] PlanChangeLogger: Batch InjectRuntimeFilter 
> has no effect.
> 24/05/13 17:39:25 ERROR [main] PlanChangeLogger: Batch Pushdown Filters from 
> PartitionPruning has no effect.
> 24/05/13 17:39:25 ERROR [main] PlanChangeLogger: Batch Cleanup filters that 
> cannot be pushed down has no effect.
> 24/05/13 17:39:25 ERROR [main] PlanChangeLogger: Batch Extract Python UDFs 
> has no effect.
> 24/05/13 17:39:25 ERROR [main] PlanChangeLogger:
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.EliminateLimits ===
>  GlobalLimit 21                                                               
>                                                            GlobalLimit 21
> !+- LocalLimit 21                                                             
>                                                            +- LocalLimit 
> least(, ... 2 more fields)
> !   +- LocalLimit 21                                                          
>                                                               +- Project 
> [item_id#647L]
> !      +- Project [item_id#647L]                                              
>                                                                  +- Filter 
> (((isnotnull(tz_type#734) AND (tz_type#734 = local)) AND (grass_region#735 = 
> BR)) AND isnotnull(grass_region#735))
> !         +- Filter (((isnotnull(tz_type#734) AND (tz_type#734 = local)) AND 
> (grass_region#735 = BR)) AND isnotnull(grass_region#735))            +- 
> Relation db.table[,... 91 more fields] parquet
> !            +- Relation db.table[,... 91 more fields] parquet
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48265) Infer window group limit batch should do constant folding

2024-05-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48265.
-
Fix Version/s: 3.5.2
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 46568
[https://github.com/apache/spark/pull/46568]

> Infer window group limit batch should do constant folding
> -
>
> Key: SPARK-48265
> URL: https://issues.apache.org/jira/browse/SPARK-48265
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.1
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.2, 4.0.0
>
>
> {code:java}
> 24/05/13 17:39:25 ERROR [main] PlanChangeLogger:
> === Result of Batch LocalRelation ===
>  GlobalLimit 21                                                               
>                                                               GlobalLimit 21
>  +- LocalLimit 21                                                             
>                                                               +- LocalLimit 21
> !   +- Union false, false                                                     
>                                                                  +- 
> LocalLimit 21
> !      :- LocalLimit 21                                                       
>                                                                     +- 
> Project [item_id#647L]
> !      :  +- Project [item_id#647L]                                           
>                                                                        +- 
> Filter (((isnotnull(tz_type#734) AND (tz_type#734 = local)) AND 
> (grass_region#735 = BR)) AND isnotnull(grass_region#735))
> !      :     +- Filter (((isnotnull(tz_type#734) AND (tz_type#734 = local)) 
> AND (grass_region#735 = BR)) AND isnotnull(grass_region#735))               
> +- Relation db.table[,... 91 more fields] parquet
> !      :        +- Relation db.table[,... 91 more fields] parquet
> !      +- LocalLimit 21
> !         +- Project [item_id#738L]
> !            +- LocalRelation , [, ... 91 more fields]
> 24/05/13 17:39:25 ERROR [main] PlanChangeLogger: Batch Check Cartesian 
> Products has no effect.
> 24/05/13 17:39:25 ERROR [main] PlanChangeLogger: Batch RewriteSubquery has no 
> effect.
> 24/05/13 17:39:25 ERROR [main] PlanChangeLogger: Batch 
> NormalizeFloatingNumbers has no effect.
> 24/05/13 17:39:25 ERROR [main] PlanChangeLogger: Batch 
> ReplaceUpdateFieldsExpression has no effect.
> 24/05/13 17:39:25 ERROR [main] PlanChangeLogger: Batch Optimize Metadata Only 
> Query has no effect.
> 24/05/13 17:39:25 ERROR [main] PlanChangeLogger: Batch PartitionPruning has 
> no effect.
> 24/05/13 17:39:25 ERROR [main] PlanChangeLogger: Batch InjectRuntimeFilter 
> has no effect.
> 24/05/13 17:39:25 ERROR [main] PlanChangeLogger: Batch Pushdown Filters from 
> PartitionPruning has no effect.
> 24/05/13 17:39:25 ERROR [main] PlanChangeLogger: Batch Cleanup filters that 
> cannot be pushed down has no effect.
> 24/05/13 17:39:25 ERROR [main] PlanChangeLogger: Batch Extract Python UDFs 
> has no effect.
> 24/05/13 17:39:25 ERROR [main] PlanChangeLogger:
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.EliminateLimits ===
>  GlobalLimit 21                                                               
>                                                            GlobalLimit 21
> !+- LocalLimit 21                                                             
>                                                            +- LocalLimit 
> least(, ... 2 more fields)
> !   +- LocalLimit 21                                                          
>                                                               +- Project 
> [item_id#647L]
> !      +- Project [item_id#647L]                                              
>                                                                  +- Filter 
> (((isnotnull(tz_type#734) AND (tz_type#734 = local)) AND (grass_region#735 = 
> BR)) AND isnotnull(grass_region#735))
> !         +- Filter (((isnotnull(tz_type#734) AND (tz_type#734 = local)) AND 
> (grass_region#735 = BR)) AND isnotnull(grass_region#735))            +- 
> Relation db.table[,... 91 more fields] parquet
> !            +- Relation db.table[,... 91 more fields] parquet
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-48265][SQL] Infer window group limit batch should do constant folding

2024-05-13 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 34588a82239a [SPARK-48265][SQL] Infer window group limit batch should 
do constant folding
34588a82239a is described below

commit 34588a82239a5c12fefed13e271edd963b821b1c
Author: Angerszh 
AuthorDate: Tue May 14 13:44:47 2024 +0800

[SPARK-48265][SQL] Infer window group limit batch should do constant folding

### What changes were proposed in this pull request?
The plan after PropagateEmptyRelation may end up with a doubled local limit:
```
 GlobalLimit 21
 +- LocalLimit 21
!   +- Union false, false
!  :- LocalLimit 21
!  :  +- Project [item_id#647L]
!  : +- Filter ()
!  :+- Relation db.table[,... 91 more fields] parquet
!  +- LocalLimit 21
! +- Project [item_id#738L]
!+- LocalRelation , [, ... 91 more fields]
```
to
```
 GlobalLimit 21
+- LocalLimit 21
   - LocalLimit 21
  +- Project [item_id#647L]
+- Filter ()
   +- Relation db.table[,... 91 more fields] parquet
```
after the `Infer window group limit` batch's `EliminateLimits`, the plan becomes
```
 GlobalLimit 21
+- LocalLimit least(21, 21)
  +- Project [item_id#647L]
+- Filter ()
   +- Relation db.table[,... 91 more fields] parquet
```
This can't work; a `ConstantFolding` step is missing here.
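
A minimal sketch of the missing step, using Catalyst nodes directly
(illustration only, not the patch itself):
```scala
import org.apache.spark.sql.catalyst.expressions.{Least, Literal}
import org.apache.spark.sql.catalyst.optimizer.ConstantFolding
import org.apache.spark.sql.catalyst.plans.logical.{LocalLimit, LocalRelation}

object LeastLimitFoldingExample extends App {
  // EliminateLimits merges LocalLimit(21, LocalLimit(21, child)) into
  // LocalLimit(least(21, 21), child), so the limit is no longer a Literal.
  val combined = LocalLimit(Least(Seq(Literal(21), Literal(21))), LocalRelation())

  // ConstantFolding collapses least(21, 21) back into Literal(21), which is what
  // the rest of the planner expects a limit expression to be.
  val folded = ConstantFolding(combined)
  println(folded) // expected: a LocalLimit whose limit is a plain Literal(21)
}
```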

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46568 from AngersZh/SPARK-48265.

Authored-by: Angerszh 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 7974811218c9fb52ac9d07f8983475a885ada81b)
Signed-off-by: Wenchen Fan 
---
 .../src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala
index 70a35ea91153..6173703ef3cd 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala
@@ -89,7 +89,8 @@ class SparkOptimizer(
   InferWindowGroupLimit,
   LimitPushDown,
   LimitPushDownThroughWindow,
-  EliminateLimits) :+
+  EliminateLimits,
+  ConstantFolding) :+
 Batch("User Provided Optimizers", fixedPoint, 
experimentalMethods.extraOptimizations: _*) :+
 Batch("Replace CTE with Repartition", Once, ReplaceCTERefWithRepartition)
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48265][SQL] Infer window group limit batch should do constant folding

2024-05-13 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 7974811218c9 [SPARK-48265][SQL] Infer window group limit batch should 
do constant folding
7974811218c9 is described below

commit 7974811218c9fb52ac9d07f8983475a885ada81b
Author: Angerszh 
AuthorDate: Tue May 14 13:44:47 2024 +0800

[SPARK-48265][SQL] Infer window group limit batch should do constant folding

### What changes were proposed in this pull request?
The plan after PropagateEmptyRelation may end up with a doubled local limit:
```
 GlobalLimit 21
 +- LocalLimit 21
!   +- Union false, false
!  :- LocalLimit 21
!  :  +- Project [item_id#647L]
!  : +- Filter ()
!  :+- Relation db.table[,... 91 more fields] parquet
!  +- LocalLimit 21
! +- Project [item_id#738L]
!+- LocalRelation , [, ... 91 more fields]
```
to
```
 GlobalLimit 21
+- LocalLimit 21
   - LocalLimit 21
  +- Project [item_id#647L]
+- Filter ()
   +- Relation db.table[,... 91 more fields] parquet
```
after the `Infer window group limit` batch's `EliminateLimits`, the plan becomes
```
 GlobalLimit 21
+- LocalLimit least(21, 21)
  +- Project [item_id#647L]
+- Filter ()
   +- Relation db.table[,... 91 more fields] parquet
```
This can't work; a `ConstantFolding` step is missing here.

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46568 from AngersZh/SPARK-48265.

Authored-by: Angerszh 
Signed-off-by: Wenchen Fan 
---
 .../src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala
index 70a35ea91153..6173703ef3cd 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala
@@ -89,7 +89,8 @@ class SparkOptimizer(
   InferWindowGroupLimit,
   LimitPushDown,
   LimitPushDownThroughWindow,
-  EliminateLimits) :+
+  EliminateLimits,
+  ConstantFolding) :+
 Batch("User Provided Optimizers", fixedPoint, 
experimentalMethods.extraOptimizations: _*) :+
 Batch("Replace CTE with Repartition", Once, ReplaceCTERefWithRepartition)
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48027][SQL][FOLLOWUP] Add comments for the other code branch

2024-05-13 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0ea808880e22 [SPARK-48027][SQL][FOLLOWUP] Add comments for the other 
code branch
0ea808880e22 is described below

commit 0ea808880e22e2b6cc97a3e946123bec035ade93
Author: beliefer 
AuthorDate: Tue May 14 13:26:17 2024 +0800

[SPARK-48027][SQL][FOLLOWUP] Add comments for the other code branch

### What changes were proposed in this pull request?
This PR proposes to add comments for the other code branch.

### Why are the changes needed?
https://github.com/apache/spark/pull/46263 missed the comments for the other
code branch.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
N/A

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #46536 from beliefer/SPARK-48027_followup.

Authored-by: beliefer 
Signed-off-by: Wenchen Fan 
---
 .../catalyst/optimizer/InjectRuntimeFilter.scala| 21 -
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala
index 3bb7c4d1ceca..176e927b2d21 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala
@@ -123,21 +123,20 @@ object InjectRuntimeFilter extends Rule[LogicalPlan] with 
PredicateHelper with J
   case ExtractEquiJoinKeys(joinType, lkeys, rkeys, _, _, left, right, _) =>
 // Runtime filters use one side of the [[Join]] to build a set of join 
key values and prune
 // the other side of the [[Join]]. It's also OK to use a superset of 
the join key values
-// (ignore null values) to do the pruning.
+// (ignore null values) to do the pruning. We can also extract from 
the other side if the
+// join keys are transitive, and the other side always produces a 
superset output of join
+// key values. Any join side always produce a superset output of its 
corresponding
+// join keys, but for transitive join keys we need to check the join 
type.
 // We assume other rules have already pushed predicates through join 
if possible.
 // So the predicate references won't pass on anymore.
 if (left.output.exists(_.semanticEquals(targetKey))) {
   extract(left, AttributeSet.empty, hasHitFilter = false, 
hasHitSelectiveFilter = false,
 currentPlan = left, targetKey = targetKey).orElse {
-// We can also extract from the right side if the join keys are 
transitive, and
-// the right side always produces a superset output of join left 
keys.
-// Let's look at an example
+// An example that extract from the right side if the join keys 
are transitive.
 // left table: 1, 2, 3
 // right table, 3, 4
-// left outer join output: (1, null), (2, null), (3, 3)
-// left key output: 1, 2, 3
-// Any join side always produce a superset output of its 
corresponding
-// join keys, but for transitive join keys we need to check the 
join type.
+// right outer join output: (3, 3), (null, 4)
+// right key output: 3, 4
 if (canPruneLeft(joinType)) {
   lkeys.zip(rkeys).find(_._1.semanticEquals(targetKey)).map(_._2)
 .flatMap { newTargetKey =>
@@ -152,7 +151,11 @@ object InjectRuntimeFilter extends Rule[LogicalPlan] with 
PredicateHelper with J
 } else if (right.output.exists(_.semanticEquals(targetKey))) {
   extract(right, AttributeSet.empty, hasHitFilter = false, 
hasHitSelectiveFilter = false,
 currentPlan = right, targetKey = targetKey).orElse {
-// We can also extract from the left side if the join keys are 
transitive.
+// An example that extract from the left side if the join keys are 
transitive.
+// left table: 1, 2, 3
+// right table, 3, 4
+// left outer join output: (1, null), (2, null), (3, 3)
+// left key output: 1, 2, 3
 if (canPruneRight(joinType)) {
   rkeys.zip(lkeys).find(_._1.semanticEquals(targetKey)).map(_._2)
 .flatMap { newTargetKey =>


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[jira] [Resolved] (SPARK-48241) CSV parsing failure with char/varchar type columns

2024-05-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48241.
-
Fix Version/s: 3.5.2
   Resolution: Fixed

Issue resolved by pull request 46565
[https://github.com/apache/spark/pull/46565]

> CSV parsing failure with char/varchar type columns
> --
>
> Key: SPARK-48241
> URL: https://issues.apache.org/jira/browse/SPARK-48241
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Jiayi Liu
>Assignee: Jiayi Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.2, 4.0.0
>
>
> A CSV table containing char and varchar columns will result in the following 
> error when selecting from the CSV table:
> {code:java}
> java.lang.IllegalArgumentException: requirement failed: requiredSchema 
> (struct) should be the subset of dataSchema 
> (struct).
>     at scala.Predef$.require(Predef.scala:281)
>     at 
> org.apache.spark.sql.catalyst.csv.UnivocityParser.(UnivocityParser.scala:56)
>     at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.$anonfun$buildReader$2(CSVFileFormat.scala:127)
>     at 
> org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:155)
>     at 
> org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:140)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125){code}
> The reason for the error is that the StringType columns in the dataSchema and 
> requiredSchema of UnivocityParser are not consistent. It is due to the 
> metadata contained in the StringType StructField of the dataSchema, which is 
> missing in the requiredSchema. We need to retain the metadata when resolving 
> schema.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48241) CSV parsing failure with char/varchar type columns

2024-05-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48241:
---

Assignee: Jiayi Liu

> CSV parsing failure with char/varchar type columns
> --
>
> Key: SPARK-48241
> URL: https://issues.apache.org/jira/browse/SPARK-48241
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Jiayi Liu
>Assignee: Jiayi Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> CSV table containing char and varchar columns will result in the following 
> error when selecting from the CSV table:
> {code:java}
> java.lang.IllegalArgumentException: requirement failed: requiredSchema 
> (struct) should be the subset of dataSchema 
> (struct).
>     at scala.Predef$.require(Predef.scala:281)
>     at 
> org.apache.spark.sql.catalyst.csv.UnivocityParser.(UnivocityParser.scala:56)
>     at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.$anonfun$buildReader$2(CSVFileFormat.scala:127)
>     at 
> org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:155)
>     at 
> org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:140)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125){code}
> The reason for the error is that the StringType columns in the dataSchema and 
> requiredSchema of UnivocityParser are not consistent. It is due to the 
> metadata contained in the StringType StructField of the dataSchema, which is 
> missing in the requiredSchema. We need to retain the metadata when resolving 
> schema.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-48241][SQL][3.5] CSV parsing failure with char/varchar type columns

2024-05-13 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 19d12b249f0f [SPARK-48241][SQL][3.5] CSV parsing failure with 
char/varchar type columns
19d12b249f0f is described below

commit 19d12b249f0fe4cb5b20b9722188c5a850147cec
Author: joey.ljy 
AuthorDate: Tue May 14 13:06:57 2024 +0800

[SPARK-48241][SQL][3.5] CSV parsing failure with char/varchar type columns

### What changes were proposed in this pull request?
CSV table containing char and varchar columns will result in the following 
error when selecting from the CSV table:
```
spark-sql (default)> show create table test_csv;
CREATE TABLE default.test_csv (
  id INT,
  name CHAR(10))
USING csv
```
```
java.lang.IllegalArgumentException: requirement failed: requiredSchema 
(struct) should be the subset of dataSchema 
(struct).
at scala.Predef$.require(Predef.scala:281)
at 
org.apache.spark.sql.catalyst.csv.UnivocityParser.(UnivocityParser.scala:56)
at 
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.$anonfun$buildReader$2(CSVFileFormat.scala:127)
at 
org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:155)
at 
org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:140)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
```

### Why are the changes needed?
For char and varchar types, Spark will convert them to `StringType` in 
`CharVarcharUtils.replaceCharVarcharWithStringInSchema` and record 
`__CHAR_VARCHAR_TYPE_STRING` in the metadata.

The reason for the above error is that the `StringType` columns in the 
`dataSchema` and `requiredSchema` of `UnivocityParser` are not consistent. The 
`StringType` in the `dataSchema` has metadata, while the metadata in the 
`requiredSchema` is empty. We need to retain the metadata when resolving schema.
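
A minimal reproduction sketch (assuming a local SparkSession with the default in-memory catalog; the table name and data are illustrative): before this fix, the final SELECT failed with the IllegalArgumentException shown above.

```scala
import org.apache.spark.sql.SparkSession

object CharVarcharCsvRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("csv-char-varchar").getOrCreate()

    // A CSV table with a CHAR column; the char/varchar type is recorded as
    // metadata on a StringType field of the table schema.
    spark.sql("CREATE TABLE test_csv (id INT, name CHAR(10)) USING csv")
    spark.sql("INSERT INTO test_csv VALUES (1, 'Bob'), (2, 'Mike')")

    // Selecting from the table exercises the requiredSchema/dataSchema check in UnivocityParser.
    spark.sql("SELECT * FROM test_csv").show()

    spark.stop()
  }
}
```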

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Add a new test case in `CSVSuite`.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46565 from liujiayi771/branch-3.5-SPARK-48241.

Authored-by: joey.ljy 
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/plans/logical/LogicalPlan.scala   |  4 +++-
 sql/core/src/test/resources/test-data/char.csv |  4 
 .../sql/execution/datasources/csv/CSVSuite.scala   | 24 ++
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
index 374eb070db1c..7fe8bd356ea9 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
@@ -116,7 +116,9 @@ abstract class LogicalPlan
   def resolve(schema: StructType, resolver: Resolver): Seq[Attribute] = {
 schema.map { field =>
   resolve(field.name :: Nil, resolver).map {
-case a: AttributeReference => a
+case a: AttributeReference =>
+  // Keep the metadata in given schema.
+  a.withMetadata(field.metadata)
 case _ => throw 
QueryExecutionErrors.resolveCannotHandleNestedSchema(this)
   }.getOrElse {
 throw QueryCompilationErrors.cannotResolveAttributeError(
diff --git a/sql/core/src/test/resources/test-data/char.csv 
b/sql/core/src/test/resources/test-data/char.csv
new file mode 100644
index ..d2be68a15fc1
--- /dev/null
+++ b/sql/core/src/test/resources/test-data/char.csv
@@ -0,0 +1,4 @@
+color,name
+pink,Bob
+blue,Mike
+grey,Tom
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
index a91adb787838..3762c00ff1a1 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
@@ -80,6 +80,7 @@ abstract class CSVSuite
   private val valueMalformedFile = "test-data/value-malformed.csv"
   private val badAfterGoodFile = "test-data/bad_after_good.csv"
   privat

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Wenchen Fan
+1

On Tue, May 14, 2024 at 8:19 AM Zhou Jiang  wrote:

> +1 (non-binding)
>
> On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:
>
>> Hi all,
>>
>> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
>>
>> Please also refer to:
>>
>>- Discussion thread:
>> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
>>- SPIP doc:
>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>>
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>>
>> Thank you!
>>
>> Liang-Chi Hsieh
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> *Zhou JIANG*
>
>


(spark) branch master updated: [SPARK-48241][SQL] CSV parsing failure with char/varchar type columns

2024-05-13 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b14abb3a2ed0 [SPARK-48241][SQL] CSV parsing failure with char/varchar 
type columns
b14abb3a2ed0 is described below

commit b14abb3a2ed086d2ff8f340f60c0dc1e460c7a59
Author: joey.ljy 
AuthorDate: Mon May 13 22:42:31 2024 +0800

[SPARK-48241][SQL] CSV parsing failure with char/varchar type columns

### What changes were proposed in this pull request?
CSV table containing char and varchar columns will result in the following 
error when selecting from the CSV table:
```
spark-sql (default)> show create table test_csv;
CREATE TABLE default.test_csv (
  id INT,
  name CHAR(10))
USING csv
```
```
java.lang.IllegalArgumentException: requirement failed: requiredSchema 
(struct) should be the subset of dataSchema 
(struct).
at scala.Predef$.require(Predef.scala:281)
at 
org.apache.spark.sql.catalyst.csv.UnivocityParser.(UnivocityParser.scala:56)
at 
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.$anonfun$buildReader$2(CSVFileFormat.scala:127)
at 
org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:155)
at 
org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:140)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
```

### Why are the changes needed?
For char and varchar types, Spark will convert them to `StringType` in 
`CharVarcharUtils.replaceCharVarcharWithStringInSchema` and record 
`__CHAR_VARCHAR_TYPE_STRING` in the metadata.

The reason for the above error is that the `StringType` columns in the 
`dataSchema` and `requiredSchema` of `UnivocityParser` are not consistent. The 
`StringType` in the `dataSchema` has metadata, while the metadata in the 
`requiredSchema` is empty. We need to retain the metadata when resolving schema.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Add a new test case in `CSVSuite`.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46537 from liujiayi771/csv-char.

Authored-by: joey.ljy 
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/plans/logical/LogicalPlan.scala   |  4 +++-
 sql/core/src/test/resources/test-data/char.csv |  4 
 .../sql/execution/datasources/csv/CSVSuite.scala   | 24 ++
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
index b989233da674..98e91585c2a0 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
@@ -118,7 +118,9 @@ abstract class LogicalPlan
   def resolve(schema: StructType, resolver: Resolver): Seq[Attribute] = {
 schema.map { field =>
   resolve(field.name :: Nil, resolver).map {
-case a: AttributeReference => a
+case a: AttributeReference =>
+  // Keep the metadata in given schema.
+  a.withMetadata(field.metadata)
 case _ => throw 
QueryExecutionErrors.resolveCannotHandleNestedSchema(this)
   }.getOrElse {
 throw QueryCompilationErrors.cannotResolveAttributeError(
diff --git a/sql/core/src/test/resources/test-data/char.csv 
b/sql/core/src/test/resources/test-data/char.csv
new file mode 100644
index ..d2be68a15fc1
--- /dev/null
+++ b/sql/core/src/test/resources/test-data/char.csv
@@ -0,0 +1,4 @@
+color,name
+pink,Bob
+blue,Mike
+grey,Tom
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
index 22ea133ee19a..0e58b96531da 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
@@ -80,6 +80,7 @@ abstract class CSVSuite
   private val valueMalformedFile = "test-data/value-malformed.csv"
   private val badAfterGoodFile = "test-data/bad_after_good.csv"
   private val malformedRowFile = "test-data/m

(spark) branch master updated (42f2132d1fc9 -> 3456d4f69a86)

2024-05-13 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 42f2132d1fc9 [SPARK-48206][SQL][TESTS] Add tests for window rewrites 
with RewriteWithExpression
 add 3456d4f69a86 [SPARK-47681][FOLLOWUP] Fix schema_of_variant(decimal)

No new revisions were added by this update.

Summary of changes:
 .../sql/catalyst/expressions/variant/variantExpressions.scala  |  7 +++
 .../test/scala/org/apache/spark/sql/VariantEndToEndSuite.scala | 10 ++
 2 files changed, 13 insertions(+), 4 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[jira] [Resolved] (SPARK-48206) Add tests for window expression rewrites in RewriteWithExpression

2024-05-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48206.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46492
[https://github.com/apache/spark/pull/46492]

> Add tests for window expression rewrites in RewriteWithExpression
> -
>
> Key: SPARK-48206
> URL: https://issues.apache.org/jira/browse/SPARK-48206
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kelvin Jiang
>Assignee: Kelvin Jiang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Window expressions can be potentially problematic if we pull out a window 
> expression outside a `Window` operator. Right now this shouldn't happen but 
> we should add some tests to make sure it doesn't break.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48206) Add tests for window expression rewrites in RewriteWithExpression

2024-05-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48206:
---

Assignee: Kelvin Jiang

> Add tests for window expression rewrites in RewriteWithExpression
> -
>
> Key: SPARK-48206
> URL: https://issues.apache.org/jira/browse/SPARK-48206
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kelvin Jiang
>Assignee: Kelvin Jiang
>Priority: Major
>  Labels: pull-request-available
>
> Window expressions can be potentially problematic if we pull out a window 
> expression outside a `Window` operator. Right now this shouldn't happen but 
> we should add some tests to make sure it doesn't break.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-48206][SQL][TESTS] Add tests for window rewrites with RewriteWithExpression

2024-05-13 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 42f2132d1fc9 [SPARK-48206][SQL][TESTS] Add tests for window rewrites 
with RewriteWithExpression
42f2132d1fc9 is described below

commit 42f2132d1fc99bf2ec5bd398d21dcbdbd5cbde47
Author: Kelvin Jiang 
AuthorDate: Mon May 13 22:28:27 2024 +0800

[SPARK-48206][SQL][TESTS] Add tests for window rewrites with 
RewriteWithExpression

### What changes were proposed in this pull request?

This PR adds more testing for `RewriteWithExpression` around `Window` 
operators.

### Why are the changes needed?

Adds more testing for `RewriteWithExpression`, which can be fragile around 
`WindowExpressions`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46492 from kelvinjian-db/SPARK-48206-window.

Authored-by: Kelvin Jiang 
Signed-off-by: Wenchen Fan 
---
 .../optimizer/RewriteWithExpressionSuite.scala | 223 +
 1 file changed, 135 insertions(+), 88 deletions(-)

diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala
index 8f023fa4156b..aa8ffb2b0454 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala
@@ -24,7 +24,6 @@ import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.plans.PlanTest
 import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan}
 import org.apache.spark.sql.catalyst.rules.RuleExecutor
-import org.apache.spark.sql.types.IntegerType
 
 class RewriteWithExpressionSuite extends PlanTest {
 
@@ -37,6 +36,20 @@ class RewriteWithExpressionSuite extends PlanTest {
   private val testRelation = LocalRelation($"a".int, $"b".int)
   private val testRelation2 = LocalRelation($"x".int, $"y".int)
 
+  private def normalizeCommonExpressionIds(plan: LogicalPlan): LogicalPlan = {
+plan.transformAllExpressions {
+  case a: Alias if a.name.startsWith("_common_expr") =>
+a.withName("_common_expr_0")
+  case a: AttributeReference if a.name.startsWith("_common_expr") =>
+a.withName("_common_expr_0")
+}
+  }
+
+  override def comparePlans(
+plan1: LogicalPlan, plan2: LogicalPlan, checkAnalysis: Boolean = true): 
Unit = {
+super.comparePlans(normalizeCommonExpressionIds(plan1), 
normalizeCommonExpressionIds(plan2))
+  }
+
   test("simple common expression") {
 val a = testRelation.output.head
 val expr = With(a) { case Seq(ref) =>
@@ -52,65 +65,48 @@ class RewriteWithExpressionSuite extends PlanTest {
   ref * ref
 }
 val plan = testRelation.select(expr.as("col"))
-val commonExprId = expr.defs.head.id.id
-val commonExprName = s"_common_expr_$commonExprId"
 comparePlans(
   Optimizer.execute(plan),
   testRelation
-.select((testRelation.output :+ (a + a).as(commonExprName)): _*)
-.select(($"$commonExprName" * $"$commonExprName").as("col"))
+.select((testRelation.output :+ (a + a).as("_common_expr_0")): _*)
+.select(($"_common_expr_0" * $"_common_expr_0").as("col"))
 .analyze
 )
   }
 
   test("nested WITH expression in the definition expression") {
-val a = testRelation.output.head
+val Seq(a, b) = testRelation.output
 val innerExpr = With(a + a) { case Seq(ref) =>
   ref + ref
 }
-val innerCommonExprId = innerExpr.defs.head.id.id
-val innerCommonExprName = s"_common_expr_$innerCommonExprId"
-
-val b = testRelation.output.last
 val outerExpr = With(innerExpr + b) { case Seq(ref) =>
   ref * ref
 }
-val outerCommonExprId = outerExpr.defs.head.id.id
-val outerCommonExprName = s"_common_expr_$outerCommonExprId"
 
 val plan = testRelation.select(outerExpr.as("col"))
-val rewrittenOuterExpr = ($"$innerCommonExprName" + 
$"$innerCommonExprName" + b)
-  .as(outerCommonExprName)
-val outerExprAttr = AttributeReference(outerCommonExprName, IntegerType)(
-  exprId = rewrittenOuterExpr.exprId)
 comparePlans(
   Optimizer.execute(plan),
   testRelation
-.selec

[jira] [Assigned] (SPARK-48031) Add schema evolution options to views

2024-05-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48031:
---

Assignee: Serge Rielau

> Add schema evolution options to views 
> --
>
> Key: SPARK-48031
> URL: https://issues.apache.org/jira/browse/SPARK-48031
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Serge Rielau
>Assignee: Serge Rielau
>Priority: Major
>  Labels: pull-request-available
>
> We want to provide the ability for views to react to changes in query 
> resolution in ways other than simply failing the view.
> For example, we want the view to be able to compensate for type changes by 
> casting the query result to the view column types,
> or to absorb any column arity changes into the view.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48031) Add schema evolution options to views

2024-05-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48031.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46267
[https://github.com/apache/spark/pull/46267]

> Add schema evolution options to views 
> --
>
> Key: SPARK-48031
> URL: https://issues.apache.org/jira/browse/SPARK-48031
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Serge Rielau
>Assignee: Serge Rielau
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> We want to provide the ability for views to react to changes in query 
> resolution in ways other than simply failing the view.
> For example, we want the view to be able to compensate for type changes by 
> casting the query result to the view column types,
> or to absorb any column arity changes into the view.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [DISCUSS] Spark - How to improve our release processes

2024-05-13 Thread Wenchen Fan
Hi Nicholas,

Thanks for your help! I'm definitely interested in participating in this
unification work. Let me know how I can help.

Wenchen

On Mon, May 13, 2024 at 1:41 PM Nicholas Chammas 
wrote:

> Re: unification
>
> We also have a long-standing problem with how we manage Python
> dependencies, something I’ve tried (unsuccessfully
> <https://github.com/apache/spark/pull/27928>) to fix in the past.
>
> Consider, for example, how many separate places this numpy dependency is
> installed:
>
> 1.
> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L277
> 2.
> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L733
> 3.
> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L853
> 4.
> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L871
> 5.
> https://github.com/apache/spark/blob/8094535973f19e9f0543535a97254e8ebffc1b23/.github/workflows/build_python_connect35.yml#L70
> 6.
> https://github.com/apache/spark/blob/553e1b85c42a60c082d33f7b9df53b0495893286/.github/workflows/maven_test.yml#L181
> 7.
> https://github.com/apache/spark/blob/6e5d1db9058de62a45f35d3f41e028a72f688b70/dev/requirements.txt#L5
> 8.
> https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L90
> 9.
> https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L99
> 10.
> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/dev/create-release/spark-rm/Dockerfile#L40
> 11.
> https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L89
> 12.
> https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L92
>
> None of those installations reference a unified version requirement, so
> naturally they are inconsistent across all these different lines. Some say
> `>=1.21`, others say `>=1.20.0`, and still others say `==1.20.3`. In
> several cases there is no version requirement specified at all.
>
> I’m interested in trying again to fix this problem, but it needs to be in
> collaboration with a committer since I cannot fully test the release
> scripts. (This testing gap is what doomed my last attempt at fixing this
> problem.)
>
> Nick
>
>
> On May 13, 2024, at 12:18 AM, Wenchen Fan  wrote:
>
> After finishing the 4.0.0-preview1 RC1, I have more experience with this
> topic now.
>
> In fact, the main job of the release process: building packages and
> documents, is tested in Github Action jobs. However, the way we test them
> is different from what we do in the release scripts.
>
> 1. the execution environment is different:
> The release scripts define the execution environment with this Dockerfile:
> https://github.com/apache/spark/blob/master/dev/create-release/spark-rm/Dockerfile
> However, Github Action jobs use a different Dockerfile:
> https://github.com/apache/spark/blob/master/dev/infra/Dockerfile
> We should figure out a way to unify it. The docker image for the release
> process needs to set up more things so it may not be viable to use a single
> Dockerfile for both.
>
> 2. the execution code is different. Use building documents as an example:
> The release scripts:
> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L404-L411
> The Github Action job:
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L883-L895
> I don't know which one is more correct, but we should definitely unify
> them.
>
> It's better if we can run the release scripts as Github Action jobs, but I
> think it's more important to do the unification now.
>
> Thanks,
> Wenchen
>
>
> On Fri, May 10, 2024 at 12:34 AM Hussein Awala  wrote:
>
>> Hello,
>>
>> I can answer some of your common questions with other Apache projects.
>>
>> > Who currently has permissions for Github actions? Is there a specific
>> owner for that today or a different volunteer each time?
>>
>> The Apache organization owns Github Actions, and committers (contributors
>> with write permissions) can retrigger/cancel a Github Actions workflow, but
>> Github Actions runners are managed by the Apache infra team.
>>
>> > What are the current limits of GitHub Actions, who set them - and what
>> is the process to change those (if possible at all, but I presume not all
>> Apache projects have the same limits)?
>>
>> For limits, I don't think there is

[jira] [Created] (SPARK-48260) disable output committer coordination in one test of ParquetIOSuite

2024-05-13 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-48260:
---

 Summary: disable output committer coordination in one test of 
ParquetIOSuite
 Key: SPARK-48260
 URL: https://issues.apache.org/jira/browse/SPARK-48260
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48252) Update CommonExpressionRef when necessary

2024-05-13 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-48252:
---

 Summary: Update CommonExpressionRef when necessary
 Key: SPARK-48252
 URL: https://issues.apache.org/jira/browse/SPARK-48252
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-12 Thread Wenchen Fan
+1

On Mon, May 13, 2024 at 10:30 AM Kent Yao  wrote:

> +1
>
> Dongjoon Hyun  于2024年5月13日周一 08:39写道:
> >
> > +1
> >
> > On Sun, May 12, 2024 at 3:50 PM huaxin gao 
> wrote:
> >>
> >> +1
> >>
> >> On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh  wrote:
> >>>
> >>> +1
> >>>
> >>> On Sat, May 11, 2024 at 3:11 PM Chao Sun  wrote:
> >>> >
> >>> > +1
> >>> >
> >>> > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh 
> wrote:
> >>> >>
> >>> >> Hi all,
> >>> >>
> >>> >> I’d like to start a vote for SPIP: Stored Procedures API for
> Catalogs.
> >>> >>
> >>> >> Please also refer to:
> >>> >>
> >>> >>- Discussion thread:
> >>> >> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
> >>> >>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
> >>> >>- SPIP doc:
> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
> >>> >>
> >>> >>
> >>> >> Please vote on the SPIP for the next 72 hours:
> >>> >>
> >>> >> [ ] +1: Accept the proposal as an official SPIP
> >>> >> [ ] +0
> >>> >> [ ] -1: I don’t think this is a good idea because …
> >>> >>
> >>> >>
> >>> >> Thank you!
> >>> >>
> >>> >> Liang-Chi Hsieh
> >>> >>
> >>> >>
> -
> >>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>> >>
> >>>
> >>> -
> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] Spark - How to improve our release processes

2024-05-12 Thread Wenchen Fan
After finishing the 4.0.0-preview1 RC1, I have more experience with this
topic now.

In fact, the main job of the release process: building packages and
documents, is tested in Github Action jobs. However, the way we test them
is different from what we do in the release scripts.

1. the execution environment is different:
The release scripts define the execution environment with this Dockerfile:
https://github.com/apache/spark/blob/master/dev/create-release/spark-rm/Dockerfile
However, Github Action jobs use a different Dockerfile:
https://github.com/apache/spark/blob/master/dev/infra/Dockerfile
We should figure out a way to unify it. The docker image for the release
process needs to set up more things so it may not be viable to use a single
Dockerfile for both.

2. the execution code is different. Use building documents as an example:
The release scripts:
https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L404-L411
The Github Action job:
https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L883-L895
I don't know which one is more correct, but we should definitely unify them.

It's better if we can run the release scripts as Github Action jobs, but I
think it's more important to do the unification now.

Thanks,
Wenchen


On Fri, May 10, 2024 at 12:34 AM Hussein Awala  wrote:

> Hello,
>
> I can answer some of your common questions with other Apache projects.
>
> > Who currently has permissions for Github actions? Is there a specific
> owner for that today or a different volunteer each time?
>
> The Apache organization owns Github Actions, and committers (contributors
> with write permissions) can retrigger/cancel a Github Actions workflow, but
> Github Actions runners are managed by the Apache infra team.
>
> > What are the current limits of GitHub Actions, who set them - and what
> is the process to change those (if possible at all, but I presume not all
> Apache projects have the same limits)?
>
> For limits, I don't think there is any significant limit, especially since
> the Apache organization has 900 donated runners used by its projects, and
> there is an initiative from the Infra team to add self-hosted runners
> running on Kubernetes (document
> <https://cwiki.apache.org/confluence/display/INFRA/ASF+Infra+provided+self-hosted+runners>
> ).
>
> > Where should the artifacts be stored?
>
> Usually, we use Maven for jars, DockerHub for Docker images, and Github
> cache for workflow cache. But we can use Github artifacts to store any kind
> of package (even Docker images in the ghcr), which is fully accepted by
> Apache policies. Also if the project has a cloud account (AWS, GCP, Azure,
> ...), a bucket can be used to store some of the packages.
>
>
>  > Who should be permitted to sign a version - and what is the process for
> that?
>
> The Apache documentation is clear about this, by default only PMC members
> can be release managers, but we can contact the infra team to add one of
> the committers as a release manager (document
> <https://infra.apache.org/release-publishing.html#releasemanager>). The
> process of creating a new version is described in this document
> <https://www.apache.org/legal/release-policy.html#policy>.
>
>
> On Thu, May 9, 2024 at 10:45 AM Nimrod Ofek  wrote:
>
>> Following the conversation started with Spark 4.0.0 release, this is a
>> thread to discuss improvements to our release processes.
>>
>> I'll start by raising some questions that probably should have answers to
>> start the discussion:
>>
>>
>>1. What is currently running in GitHub Actions?
>>2. Who currently has permissions for Github actions? Is there a
>>specific owner for that today or a different volunteer each time?
>>3. What are the current limits of GitHub Actions, who set them - and
>>what is the process to change those (if possible at all, but I presume not
>>all Apache projects have the same limits)?
>>4. What versions should we support as an output for the build?
>>5. Where should the artifacts be stored?
>>6. What should be the output? only tar or also a docker image
>>published somewhere?
>>7. Do we want to have a release on fixed dates or a manual release
>>upon request?
>>8. Who should be permitted to sign a version - and what is the
>>process for that?
>>
>>
>> Thanks!
>> Nimrod
>>
>


[VOTE] SPARK 4.0.0-preview1 (RC1)

2024-05-10 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version
4.0.0-preview1.

The vote is open until May 16 PST and passes if a majority +1 PMC votes are
cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 4.0.0-preview1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v4.0.0-preview1-rc1 (commit
7dcf77c739c3854260464d732dbfb9a0f54706e7):
https://github.com/apache/spark/tree/v4.0.0-preview1-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1454/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc1-docs/

The list of bug fixes going into 4.0.0 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12353359

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running it on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env, install
the current RC, and see if anything important breaks. In Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
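
For the Java/Scala route, a minimal build.sbt sketch (the staging repository URL and the 4.0.0-preview1 version string are taken from this e-mail; the Scala version is an assumption):

```scala
// build.sbt -- resolve the RC artifacts from the staging repository for testing.
scalaVersion := "2.13.14" // assumption; use whichever 2.13.x your project is on

resolvers += "Spark 4.0.0-preview1 RC1 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1454/"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "4.0.0-preview1"
```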


svn commit: r69098 - /dev/spark/v4.0.0-preview1-rc1-bin/

2024-05-10 Thread wenchen
Author: wenchen
Date: Sat May 11 04:28:26 2024
New Revision: 69098

Log:
Apache Spark v4.0.0-preview1-rc1

Added:
dev/spark/v4.0.0-preview1-rc1-bin/
dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz   (with 
props)
dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.asc
dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.sha512
dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz   (with props)
dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc
dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.sha512
dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz.sha512
dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-hadoop3.tgz   
(with props)
dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-hadoop3.tgz.asc

dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-hadoop3.tgz.sha512

dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz   
(with props)

dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz.asc

dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz.sha512
dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1.tgz   (with props)
dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1.tgz.asc
dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1.tgz.sha512

Added: dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz
==
Binary file - no diff available.

Propchange: dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz
--
svn:mime-type = application/octet-stream

Added: dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.asc
==
--- dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.asc (added)
+++ dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.asc Sat May 
11 04:28:26 2024
@@ -0,0 +1,17 @@
+-BEGIN PGP SIGNATURE-
+
+iQJHBAABCgAxFiEETclnbO+ag+mPygJ4TWYghDzYf1oFAmY+8UQTHHdlbmNoZW5A
+YXBhY2hlLm9yZwAKCRBNZiCEPNh/WkV1D/44BoMRwBQPQybc9ldlemMhKNQ/1OLB
+mUwhLpeUryOpUjO8AXa60YBajHqg9hivRxAUiuoaBSn7HjWY+3+nwkbcA7ZyMaV2
+Hgvfu4orB2kYXx4JgiE+dd2Zbuq+HFTv32dDUe+FyiHvhFw/bL0TIYUNJfKNcBtq
+KZDl9K5wemNjmpUSQAfEh3/vkikv5xOGxV+yEohgpB3t5Wg3hTETISXLfx/mHDu5
+GPjdCZ1omcqxZsV16CFZHV/uzK5aEDXfPdo2OO5V94xyQL0EQaMnzzMUdHkxPJ3p
+747tTf/q5rXHOb7S67MtNoBZ8myR23mQGJTwlV6E8CJWcbH7R6SEHekG9kIPGd3i
+UHoBAmroi+KfAdRej2Nqvz7SfeDeAmFw2kBRIm42FYWIqalAqbKU9LlXSpjyvYkO
+82df+5mwOzJf5VSU9D3krmjqWMFdjlLbDI1O1hLMNHyZkCYzPf+pmFhABsfGMXZH
+D8vURqF5aL9BmEuwi1SF0zSa9bI0otQj0DBvCbZnUeULSHB+P/eFqHoXjtNX2ArB
+43zmyaDywfqPXoMItvb+sGGUvatbLTCjjl6yfwgZEKOHs5noCygmL1WoLVQV+UYe
+UXb/hOJrP4FdUARpnMmz6R0NYSgQ7RZ7lOjQqs3VB7W1ashh0EWDD1hbeqMpvdx/
++fBbOLMrdzxifw==
+=2il7
+-END PGP SIGNATURE-

Added: dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.sha512
==
--- dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.sha512 
(added)
+++ dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.sha512 Sat 
May 11 04:28:26 2024
@@ -0,0 +1 @@
+60c0f5348da36d3399b596648e104202b2e9925a5b52694bf83cec9be1b4e78db6b4aa7f2f9257bca74dd514dca176ab6b51ab4c0abad2b31fb3fc5b5c14
  SparkR_4.0.0-preview1.tar.gz

Added: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz
==
Binary file - no diff available.

Propchange: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz
--
svn:mime-type = application/octet-stream

Added: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc
==
--- dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc (added)
+++ dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc Sat May 11 
04:28:26 2024
@@ -0,0 +1,17 @@
+-BEGIN PGP SIGNATURE-
+
+iQJHBAABCgAxFiEETclnbO+ag+mPygJ4TWYghDzYf1oFAmY+8UYTHHdlbmNoZW5A
+YXBhY2hlLm9yZwAKCRBNZiCEPNh/WsnjD/4m0Dyb8ZcxS/JScvFxl3eg7KRWi8d8
+bGHs/pHZxdwS/HUkBRtv0w6HXJV6ZtQW1CPtbZ0VKOqElUfGPS/VaxE91I7c2Vmb
++/P2/buVX6fBlF+vIUPECyVgblnhBeZKbBb5Wcz3xpL1Jfj/6qi3o9uLnFFfy55S
+N6FWIJ5xrjl9mlo6+s4qqL/06u982NaEyUsu51eNgapTQcNUAjFKme13WC3W7n0S
+i6ixtW1oXmfY74CzSfn6KNC+5QvxKwJznS7ZxrG3g/chcaR8rApUZ526v4XL7LP0
+BDNeqCI+blAjVYFUzBIkvZp8SR/BbJv2HSySq5hbf0S6l0O+iuj8tZ/oa8Z0hCNf
+lXUw2ORG7RJKUZePdC+F+vYrmISyDRiWb4ddSUAjkzXy8KEWw6y55VULCq4vHbDc
+1Zwmf2izaujavcSJMjBnMhoZZ1PBlxgVQwHYu0Pi3qLCxyIn4oTd1wW7h6u5IGMr
++1LjMaGCrKbWSafp+cXGtzfJGjzPjCdIN2HqX6l53Vli4jn8I8yGJZs7hp+SZ281
+QBmzgiDLWUdQf+72bGNNlvy1FliPg0k7

svn commit: r69097 - /dev/spark/v4.0.0-preview1-rc1-bin/

2024-05-10 Thread wenchen
Author: wenchen
Date: Sat May 11 03:59:33 2024
New Revision: 69097

Log:
prepare for re-uploading

Removed:
dev/spark/v4.0.0-preview1-rc1-bin/


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



svn commit: r69092 - in /dev/spark/v4.0.0-preview1-rc1-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/R/articles/ _site/api/R/articles/sparkr-vignettes_files/ _site/api/R/articles/sparkr-vignettes_

2024-05-10 Thread wenchen
Author: wenchen
Date: Fri May 10 16:44:08 2024
New Revision: 69092

Log:
Apache Spark v4.0.0-preview1-rc1 docs


[This commit notification would consist of 4810 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48143][SQL] Use lightweight exceptions for control-flow between UnivocityParser and FailureSafeParser

2024-05-10 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a6632ffa16f6 [SPARK-48143][SQL] Use lightweight exceptions for 
control-flow between UnivocityParser and FailureSafeParser
a6632ffa16f6 is described below

commit a6632ffa16f6907eba96e745920d571924bf4b63
Author: Vladimir Golubev 
AuthorDate: Sat May 11 00:37:54 2024 +0800

[SPARK-48143][SQL] Use lightweight exceptions for control-flow between 
UnivocityParser and FailureSafeParser

# What changes were proposed in this pull request?
A new lightweight exception for control-flow between UnivocityParser and 
FailureSafeParser to speed up malformed CSV parsing.

This is a different way to implement these reverted changes: 
https://github.com/apache/spark/pull/46478

The previous implementation was more invasive: removing `cause` from 
`BadRecordException` could break code higher up the call stack, which unwraps errors and checks 
the types of the causes. This implementation only touches `FailureSafeParser` 
and `UnivocityParser` since in the codebase they are always used together, 
unlike `JacksonParser` and `StaxXmlParser`. Removing stacktrace from 
`BadRecordException` is safe, since the cause itself has an adequate stacktrace 
(except pure control-flow cases).

### Why are the changes needed?
Parsing in `PermissiveMode` is slow due to heavy exception construction 
(stacktrace filling + string template substitution in `SparkRuntimeException`)
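
For intuition, a minimal sketch of the underlying technique (illustrative class and object names, not the actual Spark classes): constructing an exception with the writable stacktrace disabled skips the stacktrace fill, which is the expensive part when exceptions are used purely for control flow.

```scala
// Sketch only: the 4-argument RuntimeException constructor disables suppression and
// stacktrace capture, making the exception cheap to throw and catch for control flow.
final class ControlFlowException(message: String)
  extends RuntimeException(message, null, false, false)

object ControlFlowExceptionDemo extends App {
  def time(label: String)(body: => Unit): Unit = {
    val start = System.nanoTime()
    body
    println(f"$label%-22s ${(System.nanoTime() - start) / 1e6}%.1f ms")
  }

  val n = 200000
  time("with stacktraces") {
    var i = 0
    while (i < n) { try throw new RuntimeException("bad record") catch { case _: RuntimeException => }; i += 1 }
  }
  time("without stacktraces") {
    var i = 0
    while (i < n) { try throw new ControlFlowException("bad record") catch { case _: ControlFlowException => }; i += 1 }
  }
}
```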

### Does this PR introduce _any_ user-facing change?
No, since `FailureSafeParser` unwraps `BadRecordException` and correctly 
rethrows user-facing exceptions in `FailFastMode`

### How was this patch tested?
- `testOnly org.apache.spark.sql.catalyst.csv.UnivocityParserSuite`
- Manually run csv benchmark
- Manually checked correct and malformed csv in spark-shell 
(org.apache.spark.SparkException is thrown with the stacktrace)

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46500 from 
vladimirg-db/vladimirg-db/use-special-lighweight-exception-for-control-flow-between-univocity-parser-and-failure-safe-parser.

Authored-by: Vladimir Golubev 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/csv/UnivocityParser.scala   |  5 +++--
 .../sql/catalyst/util/BadRecordException.scala | 22 +++---
 .../sql/catalyst/util/FailureSafeParser.scala  | 11 +--
 3 files changed, 31 insertions(+), 7 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
index a5158d8a22c6..4d95097e1681 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
@@ -316,7 +316,7 @@ class UnivocityParser(
   throw BadRecordException(
 () => getCurrentInput,
 () => Array.empty,
-QueryExecutionErrors.malformedCSVRecordError(""))
+LazyBadRecordCauseWrapper(() => 
QueryExecutionErrors.malformedCSVRecordError("")))
 }
 
 val currentInput = getCurrentInput
@@ -326,7 +326,8 @@ class UnivocityParser(
   // However, we still have chance to parse some of the tokens. It 
continues to parses the
   // tokens normally and sets null when `ArrayIndexOutOfBoundsException` 
occurs for missing
   // tokens.
-  Some(QueryExecutionErrors.malformedCSVRecordError(currentInput.toString))
+  Some(LazyBadRecordCauseWrapper(
+() => 
QueryExecutionErrors.malformedCSVRecordError(currentInput.toString)))
 } else None
 // When the length of the returned tokens is identical to the length of 
the parsed schema,
 // we just need to:
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BadRecordException.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BadRecordException.scala
index 65a56c1064e4..654b0b8c73e5 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BadRecordException.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BadRecordException.scala
@@ -67,16 +67,32 @@ case class PartialResultArrayException(
   extends Exception(cause)
 
 /**
- * Exception thrown when the underlying parser meet a bad record and can't parse it.
+ * Exception thrown when the underlying parser met a bad record and can't parse it.
+ * The stacktrace is not collected for better performance, and thus, this exception should
+ * not be used in a user-facing context.
  * @param record a function to return the record that cause the parser to fail
  * @param partialResults a fu

[jira] [Assigned] (SPARK-48146) Fix error with aggregate function in With child

2024-05-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48146:
---

Assignee: Kelvin Jiang

> Fix error with aggregate function in With child
> ---
>
> Key: SPARK-48146
> URL: https://issues.apache.org/jira/browse/SPARK-48146
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kelvin Jiang
>Assignee: Kelvin Jiang
>Priority: Major
>  Labels: pull-request-available
>
> Right now, if we have an aggregate function in the child of a With 
> expression, we fail an assertion. However, queries like this used to work:
> {code:sql}
> select
> id between cast(max(id between 1 and 2) as int) and id
> from range(10)
> group by id
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48146) Fix error with aggregate function in With child

2024-05-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48146.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46443
[https://github.com/apache/spark/pull/46443]

> Fix error with aggregate function in With child
> ---
>
> Key: SPARK-48146
> URL: https://issues.apache.org/jira/browse/SPARK-48146
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kelvin Jiang
>Assignee: Kelvin Jiang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Right now, if we have an aggregate function in the child of a With 
> expression, we fail an assertion. However, queries like this used to work:
> {code:sql}
> select
> id between cast(max(id between 1 and 2) as int) and id
> from range(10)
> group by id
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-48146][SQL] Fix aggregate function in With expression child assertion

2024-05-10 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 7ef0440ef221 [SPARK-48146][SQL] Fix aggregate function in With 
expression child assertion
7ef0440ef221 is described below

commit 7ef0440ef22161a6160f7b9000c70b26c84eecf7
Author: Kelvin Jiang 
AuthorDate: Fri May 10 22:39:15 2024 +0800

[SPARK-48146][SQL] Fix aggregate function in With expression child assertion

### What changes were proposed in this pull request?

In https://github.com/apache/spark/pull/46034, there was a complicated edge 
case where common expression references in aggregate functions in the child of 
a `With` expression could become dangling. An assertion was added to prevent that 
case from happening, but the assertion wasn't fully accurate, as a query like:
```
select
  id between max(if(id between 1 and 2, 2, 1)) over () and id
from range(10)
```
would fail the assertion.

This PR fixes the assertion to be more accurate.
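
A quick way to sanity-check both query shapes mentioned in this PR and in SPARK-48146 (a sketch assuming a local SparkSession; with this fix both are expected to plan and run):

```scala
import org.apache.spark.sql.SparkSession

object WithAggAssertionCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("with-agg-assertion").getOrCreate()

    // The window-aggregate query from the PR description.
    spark.sql(
      """select
        |  id between max(if(id between 1 and 2, 2, 1)) over () and id
        |from range(10)""".stripMargin).show()

    // The grouped-aggregate query from the SPARK-48146 description.
    spark.sql(
      """select
        |  id between cast(max(id between 1 and 2) as int) and id
        |from range(10)
        |group by id""".stripMargin).show()

    spark.stop()
  }
}
```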

### Why are the changes needed?

This addresses a regression in https://github.com/apache/spark/pull/46034.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46443 from kelvinjian-db/SPARK-48146-agg.

Authored-by: Kelvin Jiang 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/expressions/With.scala  | 26 +
 .../optimizer/RewriteWithExpressionSuite.scala | 27 +-
 2 files changed, 48 insertions(+), 5 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/With.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/With.scala
index 14deedd9c70f..29794b33641c 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/With.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/With.scala
@@ -17,7 +17,8 @@
 
 package org.apache.spark.sql.catalyst.expressions
 
-import org.apache.spark.sql.catalyst.trees.TreePattern.{AGGREGATE_EXPRESSION, 
COMMON_EXPR_REF, TreePattern, WITH_EXPRESSION}
+import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.trees.TreePattern.{COMMON_EXPR_REF, 
TreePattern, WITH_EXPRESSION}
 import org.apache.spark.sql.types.DataType
 
 /**
@@ -27,9 +28,11 @@ import org.apache.spark.sql.types.DataType
  */
 case class With(child: Expression, defs: Seq[CommonExpressionDef])
   extends Expression with Unevaluable {
-  // We do not allow With to be created with an AggregateExpression in the 
child, as this would
-  // create a dangling CommonExpressionRef after rewriting it in 
RewriteWithExpression.
-  assert(!child.containsPattern(AGGREGATE_EXPRESSION))
+  // We do not allow creating a With expression with an AggregateExpression 
that contains a
+  // reference to a common expression defined in that scope (note that it can 
contain another With
+  // expression with a common expression ref of the inner With). This is to 
prevent the creation of
+  // a dangling CommonExpressionRef after rewriting it in 
RewriteWithExpression.
+  assert(!With.childContainsUnsupportedAggExpr(this))
 
   override val nodePatterns: Seq[TreePattern] = Seq(WITH_EXPRESSION)
   override def dataType: DataType = child.dataType
@@ -92,6 +95,21 @@ object With {
 val commonExprRefs = commonExprDefs.map(new CommonExpressionRef(_))
 With(replaced(commonExprRefs), commonExprDefs)
   }
+
+  private[sql] def childContainsUnsupportedAggExpr(withExpr: With): Boolean = {
+lazy val commonExprIds = withExpr.defs.map(_.id).toSet
+withExpr.child.exists {
+  case agg: AggregateExpression =>
+// Check that the aggregate expression does not contain a reference to 
a common expression
+// in the outer With expression (it is ok if it contains a reference 
to a common expression
+// for a nested With expression).
+agg.exists {
+  case r: CommonExpressionRef => commonExprIds.contains(r.id)
+  case _ => false
+}
+  case _ => false
+}
+  }
 }
 
 case class CommonExpressionId(id: Long = CommonExpressionId.newId, 
canonicalized: Boolean = false) {
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala
index d482b18d9331..8f023fa4156b 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/

[jira] [Resolved] (SPARK-48158) XML expressions (all collations)

2024-05-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48158.
-
Fix Version/s: 4.0.0
 Assignee: Uroš Bojanić
   Resolution: Fixed

> XML expressions (all collations)
> 
>
> Key: SPARK-48158
> URL: https://issues.apache.org/jira/browse/SPARK-48158
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Enable collation support for *XML* built-in string functions in Spark 
> ({*}XmlToStructs{*}, {*}SchemaOfXml{*}, {*}StructsToXml{*}). First confirm 
> what is the expected behaviour for these functions when given collated 
> strings, and then move on to implementation and testing. You will find these 
> expressions in the *xmlExpressions.scala* file, and they should mostly be 
> pass-through functions. Implement the corresponding E2E SQL tests 
> (CollationSQLExpressionsSuite) to reflect how this function should be used 
> with collation in SparkSQL, and feel free to use your chosen Spark SQL Editor 
> to experiment with the existing functions to learn more about how they work. 
> In addition, look into the possible use-cases and implementation of similar 
> functions within other open-source DBMSs, such as 
> [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *XML* expressions so that 
> they support all collation types currently supported in Spark. To understand 
> what changes were introduced in order to enable full collation support for 
> other existing functions in Spark, take a look at the Spark PRs and Jira 
> tickets for completed tasks in this parent (for example: Ascii, Chr, Base64, 
> UnBase64, Decode, StringDecode, Encode, ToBinary, FormatNumber, Sentences).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for string 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
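For illustration only (not the merged CollationSQLExpressionsSuite test), a sketch of the kind of end-to-end check the ticket asks for; the local session setup, the COLLATE syntax, and the UNICODE_CI collation name are assumptions about the preview behaviour:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
// Pass-through behaviour: schema_of_xml should give the same result whether the
// input string literal uses the default collation or a case-insensitive one.
val plain = spark.sql("SELECT schema_of_xml('<p><a>1</a></p>') AS s").head.getString(0)
val collated = spark.sql("SELECT schema_of_xml('<p><a>1</a></p>' COLLATE UNICODE_CI) AS s").head.getString(0)
assert(plain == collated)
spark.stop()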



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated (33cac4436e59 -> 2df494fd4e4e)

2024-05-10 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 33cac4436e59 [SPARK-47847][CORE] Deprecate `spark.network.remoteReadNioBufferConversion`
 add 2df494fd4e4e [SPARK-48158][SQL] Add collation support for XML expressions

No new revisions were added by this update.

Summary of changes:
 .../sql/catalyst/expressions/xmlExpressions.scala  |   9 +-
 .../spark/sql/CollationSQLExpressionsSuite.scala   | 124 +
 2 files changed, 129 insertions(+), 4 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-09 Thread Wenchen Fan
Thanks for leading this project! Let's move forward.

On Fri, May 10, 2024 at 10:31 AM L. C. Hsieh  wrote:

> Thanks Anton. Thank you, Wenchen, Dongjoon, Ryan, Serge, Allison and
> others if I miss those who are participating in the discussion.
>
> I suppose we have reached a consensus or close to being in the design.
>
> If you have some more comments, please let us know.
>
> If not, I will go to start a vote soon after a few days.
>
> Thank you.
>
> On Thu, May 9, 2024 at 6:12 PM Anton Okolnychyi 
> wrote:
> >
> > Thanks to everyone who commented on the design doc. I updated the
> proposal and it is ready for another look. I hope we can converge and move
> forward with this effort!
> >
> > - Anton
> >
> > пт, 19 квіт. 2024 р. о 15:54 Anton Okolnychyi 
> пише:
> >>
> >> Hi folks,
> >>
> >> I'd like to start a discussion on SPARK-44167 that aims to enable
> catalogs to expose custom routines as stored procedures. I believe this
> functionality will enhance Spark’s ability to interact with external
> connectors and allow users to perform more operations in plain SQL.
> >>
> >> SPIP [1] contains proposed API changes and parser extensions. Any
> feedback is more than welcome!
> >>
> >> Unlike the initial proposal for stored procedures with Python [2], this
> one focuses on exposing pre-defined stored procedures via the catalog API.
> This approach is inspired by a similar functionality in Trino and avoids
> the challenges of supporting user-defined routines discussed earlier [3].
> >>
> >> Liang-Chi was kind enough to shepherd this effort. Thanks!
> >>
> >> - Anton
> >>
> >> [1] -
> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
> >> [2] -
> https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/
> >> [3] - https://lists.apache.org/thread/lkjm9r7rx7358xxn2z8yof4wdknpzg3l
> >>
> >>
> >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
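For readers following along, a purely hypothetical sketch of the general shape being discussed (a catalog exposing a pre-defined routine as a stored procedure); the real interfaces live in the linked SPIP document, and none of the names below are taken from it:

// Hypothetical API shape only; not the SPIP's actual interfaces.
trait Procedure {
  def name: String
  def call(args: Map[String, Any]): Seq[Any]
}

trait ProcedureCatalog {
  def loadProcedure(name: String): Procedure
}

class DemoCatalog extends ProcedureCatalog {
  override def loadProcedure(name: String): Procedure = name match {
    case "rollback_to_snapshot" => new Procedure {
      val name = "rollback_to_snapshot"
      def call(args: Map[String, Any]): Seq[Any] =
        Seq(s"rolled back to snapshot ${args("snapshot_id")}")  // placeholder behaviour
    }
    case other => throw new NoSuchElementException(s"no such procedure: $other")
  }
}

// A statement like CALL demo.rollback_to_snapshot(snapshot_id => 42) would then resolve
// the procedure through the catalog and invoke call() with the named arguments.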


[jira] [Assigned] (SPARK-48222) Sync Ruby Bundler to 2.4.22 and refresh Gem lock file

2024-05-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48222:
---

Assignee: Nicholas Chammas

> Sync Ruby Bundler to 2.4.22 and refresh Gem lock file
> -
>
> Key: SPARK-48222
> URL: https://issues.apache.org/jira/browse/SPARK-48222
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Documentation
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48222) Sync Ruby Bundler to 2.4.22 and refresh Gem lock file

2024-05-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48222.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46512
[https://github.com/apache/spark/pull/46512]

> Sync Ruby Bundler to 2.4.22 and refresh Gem lock file
> -
>
> Key: SPARK-48222
> URL: https://issues.apache.org/jira/browse/SPARK-48222
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Documentation
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-48222][INFRA][DOCS] Sync Ruby Bundler to 2.4.22 and refresh Gem lock file

2024-05-09 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9a2818820f11 [SPARK-48222][INFRA][DOCS] Sync Ruby Bundler to 2.4.22 
and refresh Gem lock file
9a2818820f11 is described below

commit 9a2818820f11f9bdcc042f4ab80850918911c68c
Author: Nicholas Chammas 
AuthorDate: Fri May 10 09:58:16 2024 +0800

[SPARK-48222][INFRA][DOCS] Sync Ruby Bundler to 2.4.22 and refresh Gem lock 
file

### What changes were proposed in this pull request?

Sync the version of Bundler that we are using across various scripts and 
documentation. Also refresh the Gem lock file.

### Why are the changes needed?

We are seeing inconsistent build behavior, likely due to the inconsistent 
Bundler versions.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI + the preview release process.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46512 from nchammas/bundler-sync.

Authored-by: Nicholas Chammas 
Signed-off-by: Wenchen Fan 
---
 .github/workflows/build_and_test.yml   |  3 +++
 dev/create-release/spark-rm/Dockerfile |  2 +-
 docs/Gemfile.lock  | 16 
 docs/README.md |  2 +-
 4 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 4a11823aee60..881fb8cb0674 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -872,6 +872,9 @@ jobs:
 python3.9 -m pip install 'docutils<0.18.0' # See SPARK-39421
 - name: Install dependencies for documentation generation
   run: |
+# Keep the version of Bundler here in sync with the following 
locations:
+#   - dev/create-release/spark-rm/Dockerfile
+#   - docs/README.md
 gem install bundler -v 2.4.22
 cd docs
 bundle install
diff --git a/dev/create-release/spark-rm/Dockerfile 
b/dev/create-release/spark-rm/Dockerfile
index 8d5ca38ba88e..13f4112ca03d 100644
--- a/dev/create-release/spark-rm/Dockerfile
+++ b/dev/create-release/spark-rm/Dockerfile
@@ -38,7 +38,7 @@ ENV DEBCONF_NONINTERACTIVE_SEEN true
 ARG APT_INSTALL="apt-get install --no-install-recommends -y"
 
 ARG PIP_PKGS="sphinx==4.5.0 mkdocs==1.1.2 numpy==1.20.3 
pydata_sphinx_theme==0.13.3 ipython==7.19.0 nbsphinx==0.8.0 numpydoc==1.1.0 
jinja2==3.1.2 twine==3.4.1 sphinx-plotly-directive==0.1.3 
sphinx-copybutton==0.5.2 pandas==2.0.3 pyarrow==10.0.1 plotly==5.4.0 
markupsafe==2.0.1 docutils<0.17 grpcio==1.62.0 protobuf==4.21.6 
grpcio-status==1.62.0 googleapis-common-protos==1.56.4"
-ARG GEM_PKGS="bundler:2.3.8"
+ARG GEM_PKGS="bundler:2.4.22"
 
 # Install extra needed repos and refresh.
 # - CRAN repo
diff --git a/docs/Gemfile.lock b/docs/Gemfile.lock
index 4e38f18703f3..e137f0f039b9 100644
--- a/docs/Gemfile.lock
+++ b/docs/Gemfile.lock
@@ -4,16 +4,16 @@ GEM
 addressable (2.8.6)
   public_suffix (>= 2.0.2, < 6.0)
 colorator (1.1.0)
-concurrent-ruby (1.2.2)
+concurrent-ruby (1.2.3)
 em-websocket (0.5.3)
   eventmachine (>= 0.12.9)
   http_parser.rb (~> 0)
 eventmachine (1.2.7)
 ffi (1.16.3)
 forwardable-extended (2.6.0)
-google-protobuf (3.25.2)
+google-protobuf (3.25.3)
 http_parser.rb (0.8.0)
-i18n (1.14.1)
+i18n (1.14.5)
   concurrent-ruby (~> 1.0)
 jekyll (4.3.3)
   addressable (~> 2.4)
@@ -42,22 +42,22 @@ GEM
 kramdown-parser-gfm (1.1.0)
   kramdown (~> 2.0)
 liquid (4.0.4)
-listen (3.8.0)
+listen (3.9.0)
   rb-fsevent (~> 0.10, >= 0.10.3)
   rb-inotify (~> 0.9, >= 0.9.10)
 mercenary (0.4.0)
 pathutil (0.16.2)
   forwardable-extended (~> 2.6)
-public_suffix (5.0.4)
-rake (13.1.0)
+public_suffix (5.0.5)
+rake (13.2.1)
 rb-fsevent (0.11.2)
 rb-inotify (0.10.1)
   ffi (~> 1.0)
 rexml (3.2.6)
 rouge (3.30.0)
 safe_yaml (1.0.5)
-sass-embedded (1.69.7)
-  google-protobuf (~> 3.25)
+sass-embedded (1.63.6)
+  google-protobuf (~> 3.23)
   rake (>= 13.0.0)
 terminal-table (3.0.2)
   unicode-display_width (>= 1.1.1, < 3)
diff --git a/docs/README.md b/docs/README.md
index 414c8dbd8303..363f1c207636 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -36,7 +36,7 @@ You need to have [Ruby 3][ruby] and [Python 3][python] 
installed. Make sure the
 [python]: https://www.python.org/downloads/
 
 ```sh
-$ gem install bundler
+$ gem install bundler -v 2.4.22
 ```
 
 After this all the required Ruby dependencies can be installed from the 
`docs/` directory

Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Wenchen Fan
UPDATE:

I've successfully uploaded the release packages:
https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc1-bin/
(I skipped SparkR as I was not able to fix the errors, I'll get back to it
later)

However, there is a new issue with doc building:
https://github.com/apache/spark/pull/44628#discussion_r1595718574

I'll continue after the issue is fixed.

On Fri, May 10, 2024 at 12:29 AM Dongjoon Hyun 
wrote:

> Please re-try to upload, Wenchen. ASF Infra team bumped up our upload
> limit based on our request.
>
> > Your upload limit has been increased to 650MB
>
> Dongjoon.
>
>
>
> On Thu, May 9, 2024 at 8:12 AM Wenchen Fan  wrote:
>
>> I've created a ticket: https://issues.apache.org/jira/browse/INFRA-25776
>>
>> On Thu, May 9, 2024 at 11:06 PM Dongjoon Hyun 
>> wrote:
>>
>>> In addition, FYI, I was the latest release manager with Apache Spark
>>> 3.4.3 (2024-04-15 Vote)
>>>
>>> According to my work log, I uploaded the following binaries to SVN from
>>> EC2 (us-west-2) without any issues.
>>>
>>> -rw-r--r--.  1 centos centos 311384003 Apr 15 01:29 pyspark-3.4.3.tar.gz
>>> -rw-r--r--.  1 centos centos 397870995 Apr 15 00:44
>>> spark-3.4.3-bin-hadoop3-scala2.13.tgz
>>> -rw-r--r--.  1 centos centos 388930980 Apr 15 01:29
>>> spark-3.4.3-bin-hadoop3.tgz
>>> -rw-r--r--.  1 centos centos 300786123 Apr 15 01:04
>>> spark-3.4.3-bin-without-hadoop.tgz
>>> -rw-r--r--.  1 centos centos  32219044 Apr 15 00:23 spark-3.4.3.tgz
>>> -rw-r--r--.  1 centos centos356749 Apr 15 01:29 SparkR_3.4.3.tar.gz
>>>
>>> Since Apache Spark 4.0.0-preview doesn't have a Scala 2.12 combination,
>>> the total size should be smaller than the 3.4.3 binaries.
>>>
>>> Given that, if there is any INFRA change, that could happen after 4/15.
>>>
>>> Dongjoon.
>>>
>>> On Thu, May 9, 2024 at 7:57 AM Dongjoon Hyun 
>>> wrote:
>>>
>>>> Could you file an INFRA JIRA issue with the error message and context
>>>> first, Wenchen?
>>>>
>>>> As you know, if we see something, we had better file a JIRA issue
>>>> because it could be not only an Apache Spark project issue but also all ASF
>>>> project issues.
>>>>
>>>> Dongjoon.
>>>>
>>>>
>>>> On Thu, May 9, 2024 at 12:28 AM Wenchen Fan 
>>>> wrote:
>>>>
>>>>> UPDATE:
>>>>>
>>>>> After resolving a few issues in the release scripts, I can finally
>>>>> build the release packages. However, I can't upload them to the staging 
>>>>> SVN
>>>>> repo due to a transmission error, and it seems like a limitation from the
>>>>> server side. I tried it on both my local laptop and remote AWS instance,
>>>>> but neither works. These package binaries are like 300-400 MBs, and we 
>>>>> just
>>>>> did a release last month. Not sure if this is a new limitation due to cost
>>>>> saving.
>>>>>
>>>>> While I'm looking for help to get unblocked, I'm wondering if we can
>>>>> upload release packages to a public git repo instead, under the Apache
>>>>> account?
>>>>>
>>>>>>
>>>>>>>>>>>>>>


svn commit: r69065 - /dev/spark/v4.0.0-preview1-rc1-bin/

2024-05-09 Thread wenchen
Author: wenchen
Date: Thu May  9 16:31:11 2024
New Revision: 69065

Log:
Apache Spark v4.0.0-preview1-rc1

Added:
dev/spark/v4.0.0-preview1-rc1-bin/
dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz   (with props)
dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc
dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.sha512
dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz   (with 
props)
dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz.asc
dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz.sha512
dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-hadoop3.tgz   
(with props)
dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-hadoop3.tgz.asc

dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-hadoop3.tgz.sha512

dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz   
(with props)

dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz.asc

dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz.sha512
dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1.tgz   (with props)
dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1.tgz.asc
dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1.tgz.sha512

Added: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz
==
Binary file - no diff available.

Propchange: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz
--
svn:mime-type = application/octet-stream

Added: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc
==
--- dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc (added)
+++ dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc Thu May  9 
16:31:11 2024
@@ -0,0 +1,17 @@
+-BEGIN PGP SIGNATURE-
+
+iQJHBAABCgAxFiEETclnbO+ag+mPygJ4TWYghDzYf1oFAmY8+e4THHdlbmNoZW5A
+YXBhY2hlLm9yZwAKCRBNZiCEPNh/Wv78D/9aNsBANuVpIjYr+XkWYaimRLJ5IT0Z
+qKehjJBuMBDaBMMN3iWconDHBiASQT0FTYGDBeYI72fLFSMKBna5+Lu22+KD/K6h
+V8SZxPSQsAHQABYq9ha++XXyo1Vo+msPQ0pQAblmTrSpsvSWZmC8spzb5GbKYvK5
+kxr4Qt1XnHeGNJNToqGlbl/Hc2Etg5PkPBxMPBWMh7kLknMEscMNUf87JqCIa8LG
+hMid/0lrrevEm8gkuu0ol9Vgz4P+dreKE9eCfmWOXCod04y8tJnVPs83wUOZfmKV
+dHkELaMVwz3fa40QP77gK38K5i22aUgYk6dvhB+OgtatZ5tk0Dxp3AI2OObngEUm
+4cGmQLwcses53vApwkExq427gS8td4sTE2G1D4+hSdEcm8Fj69w4Ado/DlIAHZob
+KLV15qtNOyaIapT4GxBqoeqsw7tnRmxiP8K8UxFcPV/vZC1yQKIIULigPjttZKoW
++REE2N7ZyPvbvgItwjAL8hpCeYEkd7RDa7ofHAv6icC1qSsJZ9gxFM4rJvriI4g2
+tnYEvZduGpBunhlwVb0R3kAF5XoLIZQ5qm6kyWAzioc0gxzYVc3Rd+bXjm+vmopt
+bXHOM6N2lLQwqnWlHsyjGVFugrkkRXZbQbIV6FynXpKaz5YtkUhUMkofz7mOYhBi
++1Z8nZ04B6YLbw==
+=85FX
+-END PGP SIGNATURE-

Added: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.sha512
==
--- dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.sha512 (added)
+++ dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.sha512 Thu May  
9 16:31:11 2024
@@ -0,0 +1 @@
+2509cf6473495b0cd5c132d87f5e1c33593fa7375ca01bcab1483093cea92bdb6ad7afc7c72095376b28fc5acdc71bb323935d17513f33ee5276c6991ff668d1
  pyspark-4.0.0.dev1.tar.gz
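As an aside for anyone checking the staged artifacts: the published .sha512 values above can be verified with any SHA-512 tool (for example `shasum -a 512`); a rough Scala equivalent, with the local file path being an assumption:

import java.io.FileInputStream
import java.security.MessageDigest

// Recompute the SHA-512 of a downloaded artifact and compare it with the published .sha512 value.
def sha512Hex(path: String): String = {
  val md = MessageDigest.getInstance("SHA-512")
  val in = new FileInputStream(path)
  try {
    val buf = new Array[Byte](8192)
    Iterator.continually(in.read(buf)).takeWhile(_ != -1).foreach(n => md.update(buf, 0, n))
  } finally in.close()
  md.digest().map("%02x".format(_)).mkString
}

println(sha512Hex("pyspark-4.0.0.dev1.tar.gz"))  // compare against the .sha512 file above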

Added: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz
==
Binary file - no diff available.

Propchange: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz
--
svn:mime-type = application/octet-stream

Added: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz.asc
==
--- dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz.asc 
(added)
+++ dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz.asc Thu 
May  9 16:31:11 2024
@@ -0,0 +1,17 @@
+-BEGIN PGP SIGNATURE-
+
+iQJGBAABCgAxFiEETclnbO+ag+mPygJ4TWYghDzYf1oFAmY8+fATHHdlbmNoZW5A
+YXBhY2hlLm9yZwAKCRBNZiCEPNh/WoCMD/iZjkaGTUqt3jkIjWIUzpQo+kLn8//m
+f+hwUtAguXvbMJXwBOz/Q/f+KvGk0tutsbd6rmBB6cHjH4GoZPp1x6iBitFAO47r
+kHy/0xYkb70SPQCWIGQQpRv3g0uxTmpqL9H4YcIvexkV2wXG5VSwGvbSI4596n7l
+x7M3rRmFzrxhcNIYLQdhNuat0mwuJFWe6R7Zk7UYFFishn9dNt8EOYx8vsGAuMP8
+Uy3+7oZQOAGqdQGSL7Ev4Pqve7MrrPgGXaixGukXibi707NCURnHTDcenPfoEEiQ
+Hj83I3G+JrRhtsue/103a/GnHheUgwE8oEkefnUX7qC5tSn4T8lI2KpDBv9AL1pm
+Bv0eXf5X5xEM4wvO7DCgbeEDPLg72jjt9X8zjAYx05HddvTuPjeKEL+Ga6G0ueTz
+HRXHrgd1EFZ1znPZhWiSTmeqZTXdrb6wKTYt8Y6mk1oEGL3b0qE2LNkSED+4l40u
+41MlV3pmZyjRGYZl29XZKf4isKYyjec7UbJSM5ok4zCRF0p8Gvj0EihGS4X6rYpW
+9XxwjViKMIp7DCEcWjWpO6pJ8Ygb2Snh1UTFFgtzSVAoMqUgHnBHejJ4RA4ncHu6

Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Wenchen Fan
I've created a ticket: https://issues.apache.org/jira/browse/INFRA-25776

On Thu, May 9, 2024 at 11:06 PM Dongjoon Hyun 
wrote:

> In addition, FYI, I was the latest release manager with Apache Spark 3.4.3
> (2024-04-15 Vote)
>
> According to my work log, I uploaded the following binaries to SVN from
> EC2 (us-west-2) without any issues.
>
> -rw-r--r--.  1 centos centos 311384003 Apr 15 01:29 pyspark-3.4.3.tar.gz
> -rw-r--r--.  1 centos centos 397870995 Apr 15 00:44
> spark-3.4.3-bin-hadoop3-scala2.13.tgz
> -rw-r--r--.  1 centos centos 388930980 Apr 15 01:29
> spark-3.4.3-bin-hadoop3.tgz
> -rw-r--r--.  1 centos centos 300786123 Apr 15 01:04
> spark-3.4.3-bin-without-hadoop.tgz
> -rw-r--r--.  1 centos centos  32219044 Apr 15 00:23 spark-3.4.3.tgz
> -rw-r--r--.  1 centos centos356749 Apr 15 01:29 SparkR_3.4.3.tar.gz
>
> Since Apache Spark 4.0.0-preview doesn't have a Scala 2.12 combination, the
> total size should be smaller than the 3.4.3 binaries.
>
> Given that, if there is any INFRA change, that could happen after 4/15.
>
> Dongjoon.
>
> On Thu, May 9, 2024 at 7:57 AM Dongjoon Hyun 
> wrote:
>
>> Could you file an INFRA JIRA issue with the error message and context
>> first, Wenchen?
>>
>> As you know, if we see something, we had better file a JIRA issue because
>> it could be not only an Apache Spark project issue but also all ASF project
>> issues.
>>
>> Dongjoon.
>>
>>
>> On Thu, May 9, 2024 at 12:28 AM Wenchen Fan  wrote:
>>
>>> UPDATE:
>>>
>>> After resolving a few issues in the release scripts, I can finally build
>>> the release packages. However, I can't upload them to the staging SVN repo
>>> due to a transmission error, and it seems like a limitation from the server
>>> side. I tried it on both my local laptop and remote AWS instance, but
>>> neither works. These package binaries are like 300-400 MBs, and we just did
>>> a release last month. Not sure if this is a new limitation due to cost
>>> saving.
>>>
>>> While I'm looking for help to get unblocked, I'm wondering if we can
>>> upload release packages to a public git repo instead, under the Apache
>>> account?
>>>
>>>>
>>>>>>>>>>>>


[jira] [Assigned] (SPARK-47409) StringTrim & StringTrimLeft/Right/Both (binary & lowercase collation only)

2024-05-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-47409:
---

Assignee: David Milicevic

> StringTrim & StringTrimLeft/Right/Both (binary & lowercase collation only)
> --
>
> Key: SPARK-47409
> URL: https://issues.apache.org/jira/browse/SPARK-47409
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: David Milicevic
>Priority: Major
>  Labels: pull-request-available
>
> Enable collation support for the *StringTrim* built-in string function in 
> Spark (including {*}StringTrimBoth{*}, {*}StringTrimLeft{*}, 
> {*}StringTrimRight{*}). First confirm what is the expected behaviour for 
> these functions when given collated strings, and then move on to 
> implementation and testing. One way to go about this is to consider using 
> {_}StringSearch{_}, an efficient ICU service for string matching. Implement 
> the corresponding unit tests (CollationStringExpressionsSuite) and E2E tests 
> (CollationSuite) to reflect how this function should be used with collation 
> in SparkSQL, and feel free to use your chosen Spark SQL Editor to experiment 
> with the existing functions to learn more about how they work. In addition, 
> look into the possible use-cases and implementation of similar functions 
> within other open-source DBMS, such as 
> [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *StringTrim* function so it 
> supports binary & lowercase collation types currently supported in Spark. To 
> understand what changes were introduced in order to enable full collation 
> support for other existing functions in Spark, take a look at the Spark PRs 
> and Jira tickets for completed tasks in this parent (for example: Contains, 
> StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class, as well as _StringSearch_ using the 
> [ICU user 
> guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html]
>  and [ICU 
> docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html].
>  Also, refer to the Unicode Technical Standard for string 
> [searching|https://www.unicode.org/reports/tr10/#Searching] and 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
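A small sketch of the ICU StringSearch usage the ticket points at, assuming icu4j is on the classpath; it only shows raw ICU matching, not how Spark's CollationSupport wires it in:

import java.text.StringCharacterIterator
import com.ibm.icu.text.{Collator, RuleBasedCollator, SearchIterator, StringSearch}
import com.ibm.icu.util.ULocale

// Case-insensitive match of "spark" inside "Apache SPARK" via a secondary-strength collator.
val collator = Collator.getInstance(ULocale.ROOT).asInstanceOf[RuleBasedCollator]
collator.setStrength(Collator.SECONDARY)
val search = new StringSearch("spark", new StringCharacterIterator("Apache SPARK"), collator)
val first = search.first()
println(first != SearchIterator.DONE && first == 7)  // true: "SPARK" matches at offset 7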



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47409) StringTrim & StringTrimLeft/Right/Both (binary & lowercase collation only)

2024-05-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47409.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46206
[https://github.com/apache/spark/pull/46206]

> StringTrim & StringTrimLeft/Right/Both (binary & lowercase collation only)
> --
>
> Key: SPARK-47409
> URL: https://issues.apache.org/jira/browse/SPARK-47409
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: David Milicevic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Enable collation support for the *StringTrim* built-in string function in 
> Spark (including {*}StringTrimBoth{*}, {*}StringTrimLeft{*}, 
> {*}StringTrimRight{*}). First confirm what is the expected behaviour for 
> these functions when given collated strings, and then move on to 
> implementation and testing. One way to go about this is to consider using 
> {_}StringSearch{_}, an efficient ICU service for string matching. Implement 
> the corresponding unit tests (CollationStringExpressionsSuite) and E2E tests 
> (CollationSuite) to reflect how this function should be used with collation 
> in SparkSQL, and feel free to use your chosen Spark SQL Editor to experiment 
> with the existing functions to learn more about how they work. In addition, 
> look into the possible use-cases and implementation of similar functions 
> within other open-source DBMS, such as 
> [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *StringTrim* function so it 
> supports binary & lowercase collation types currently supported in Spark. To 
> understand what changes were introduced in order to enable full collation 
> support for other existing functions in Spark, take a look at the Spark PRs 
> and Jira tickets for completed tasks in this parent (for example: Contains, 
> StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class, as well as _StringSearch_ using the 
> [ICU user 
> guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html]
>  and [ICU 
> docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html].
>  Also, refer to the Unicode Technical Standard for string 
> [searching|https://www.unicode.org/reports/tr10/#Searching] and 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions (for UTF8_BINARY & LCASE)

2024-05-09 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 21333f8c1fc0 [SPARK-47409][SQL] Add support for collation for 
StringTrim type of functions/expressions (for UTF8_BINARY & LCASE)
21333f8c1fc0 is described below

commit 21333f8c1fc01756e6708ad6ccf21f585fcb881d
Author: David Milicevic 
AuthorDate: Thu May 9 23:05:20 2024 +0800

[SPARK-47409][SQL] Add support for collation for StringTrim type of 
functions/expressions (for UTF8_BINARY & LCASE)

Recreating [original PR](https://github.com/apache/spark/pull/45749) 
because code has been reorganized in [this 
PR](https://github.com/apache/spark/pull/45978).

### What changes were proposed in this pull request?
This PR is created to add support for collations to the StringTrim family of
functions/expressions, specifically:
- `StringTrim`
- `StringTrimBoth`
- `StringTrimLeft`
- `StringTrimRight`

Changes:
- `CollationSupport.java`
  - Add new `StringTrim`, `StringTrimLeft` and `StringTrimRight` classes 
with corresponding logic.
  - `CollationAwareUTF8String` - add new `trim`, `trimLeft` and `trimRight` 
methods that actually implement trim logic.
- `UTF8String.java` - expose some of the methods publicly.
- `stringExpressions.scala`
  - Change input types.
  - Change eval and code gen logic.
- `CollationTypeCasts.scala` - add `StringTrim*` expressions to 
`CollationTypeCasts` rules.

### Why are the changes needed?
We are incrementally adding collation support to built-in string functions in Spark.

### Does this PR introduce _any_ user-facing change?
Yes:
- Users should now be able to use non-default collations in string trim functions.

### How was this patch tested?
Already existing tests + new unit/e2e tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46206 from davidm-db/string-trim-functions.

Authored-by: David Milicevic 
Signed-off-by: Wenchen Fan 
---
 .../catalyst/util/CollationAwareUTF8String.java| 470 ++
 .../spark/sql/catalyst/util/CollationSupport.java  | 534 -
 .../org/apache/spark/unsafe/types/UTF8String.java  |   2 +-
 .../spark/unsafe/types/CollationSupportSuite.java  | 193 
 .../sql/catalyst/analysis/CollationTypeCasts.scala |   2 +-
 .../catalyst/expressions/stringExpressions.scala   |  53 +-
 .../sql/CollationStringExpressionsSuite.scala  | 161 ++-
 7 files changed, 1054 insertions(+), 361 deletions(-)
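As a rough illustration of what trimming under the lowercase collation means, here is a simplified character-level sketch; the merged CollationAwareUTF8String implementation works on UTF8String bytes and handles cases this toy version ignores:

// Toy model only: trim characters are matched ignoring case; not the real implementation.
def lcaseTrimLeft(src: String, trim: String): String = {
  val trimSet = trim.toLowerCase.toSet
  src.dropWhile(c => trimSet.contains(c.toLower))
}
def lcaseTrimRight(src: String, trim: String): String =
  lcaseTrimLeft(src.reverse, trim).reverse
def lcaseTrim(src: String, trim: String): String =
  lcaseTrimRight(lcaseTrimLeft(src, trim), trim)

println(lcaseTrim("xXhelloXx", "x"))  // "hello"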

diff --git 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
new file mode 100644
index ..ee0d611d7e65
--- /dev/null
+++ 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
@@ -0,0 +1,470 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import com.ibm.icu.lang.UCharacter;
+import com.ibm.icu.text.BreakIterator;
+import com.ibm.icu.text.StringSearch;
+import com.ibm.icu.util.ULocale;
+
+import org.apache.spark.unsafe.UTF8StringBuilder;
+import org.apache.spark.unsafe.types.UTF8String;
+
+import static org.apache.spark.unsafe.Platform.BYTE_ARRAY_OFFSET;
+import static org.apache.spark.unsafe.Platform.copyMemory;
+
+import java.util.HashMap;
+import java.util.Map;
+
+/**
+ * Utility class for collation-aware UTF8String operations.
+ */
+public class CollationAwareUTF8String {
+  public static UTF8String replace(final UTF8String src, final UTF8String 
search,
+  final UTF8String replace, final int collationId) {
+// This collation aware implementation is based on existing implementation 
on UTF8String
+if (src.numBytes() == 0 || search.numBytes() == 0) {
+  return src;
+}
+
+StringSearch stringSearch = CollationFactory.getStringSearch(src, search, 

Re: [DISCUSS] Spark - How to improve our release processes

2024-05-09 Thread Wenchen Fan
Thanks for starting the discussion! To add a bit more color, we should at
least add a test job to make sure the release script can produce the
packages correctly. Today it's kind of being manually tested by the
release manager each time, which slows down the release process. It's
better if we can automate it entirely, so that making a release is a simple
click by authorized people.

On Thu, May 9, 2024 at 9:48 PM Nimrod Ofek  wrote:

> Following the conversation started with Spark 4.0.0 release, this is a
> thread to discuss improvements to our release processes.
>
> I'll Start by raising some questions that probably should have answers to
> start the discussion:
>
>
>1. What is currently running in GitHub Actions?
>2. Who currently has permissions for Github actions? Is there a
>specific owner for that today or a different volunteer each time?
>3. What are the current limits of GitHub Actions, who set them - and
>what is the process to change those (if possible at all, but I presume not
>all Apache projects have the same limits)?
>4. What versions should we support as an output for the build?
>5. Where should the artifacts be stored?
>6. What should be the output? only tar or also a docker image
>published somewhere?
>7. Do we want to have a release on fixed dates or a manual release
>upon request?
>8. Who should be permitted to sign a version - and what is the process
>for that?
>
>
> Thanks!
> Nimrod
>


(spark) branch master updated: [SPARK-47803][FOLLOWUP] Check nulls when casting nested type to variant

2024-05-09 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 3fd38d4c07f6 [SPARK-47803][FOLLOWUP] Check nulls when casting nested 
type to variant
3fd38d4c07f6 is described below

commit 3fd38d4c07f6c998ec8bb234796f83a6aecfc0d2
Author: Chenhao Li 
AuthorDate: Thu May 9 22:45:10 2024 +0800

[SPARK-47803][FOLLOWUP] Check nulls when casting nested type to variant

### What changes were proposed in this pull request?

It adds null checks when accessing a nested element while casting a nested type to
variant. This is necessary because the `get` API is not guaranteed to return null when
the slot is null. For example, `ColumnarArray.get` may return the default value of a
primitive type if the slot is null.

### Why are the changes needed?

This bug fix is necessary for the cast-to-variant expression to work correctly.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Two new unit tests. One directly uses `ColumnarArray` as the input of the 
cast. The other creates a real-world situation where `ColumnarArray` is the 
input of the cast (scan). Both of them would fail without the code change in 
this PR.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46486 from chenhao-db/fix_cast_nested_to_variant.

Authored-by: Chenhao Li 
Signed-off-by: Wenchen Fan 
---
 .../variant/VariantExpressionEvalUtils.scala   |  9 --
 .../apache/spark/sql/VariantEndToEndSuite.scala| 33 --
 2 files changed, 37 insertions(+), 5 deletions(-)
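To make the pitfall concrete, a self-contained sketch with a stand-in for the columnar accessor (the real class is org.apache.spark.sql.vectorized.ColumnarArray): get() on a null slot can return the primitive default, so isNullAt must be consulted first.

// Stand-in accessor: returns 0 for a null int slot, just like a columnar vector can.
class FakeColumnarArray(values: Array[Int], nulls: Array[Boolean]) {
  def numElements(): Int = values.length
  def isNullAt(i: Int): Boolean = nulls(i)
  def getInt(i: Int): Int = values(i)
}

val arr = new FakeColumnarArray(Array(1, 0, 3), Array(false, true, false))
val withoutNullCheck = (0 until arr.numElements()).map(arr.getInt)   // Vector(1, 0, 3): the null is lost
val withNullCheck = (0 until arr.numElements()).map { i =>
  if (arr.isNullAt(i)) None else Some(arr.getInt(i))                 // Vector(Some(1), None, Some(3))
}
println(withoutNullCheck)
println(withNullCheck)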

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtils.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtils.scala
index eb235eb854e0..f7f7097173bb 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtils.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtils.scala
@@ -103,7 +103,8 @@ object VariantExpressionEvalUtils {
 val offsets = new 
java.util.ArrayList[java.lang.Integer](data.numElements())
 for (i <- 0 until data.numElements()) {
   offsets.add(builder.getWritePos - start)
-  buildVariant(builder, data.get(i, elementType), elementType)
+  val element = if (data.isNullAt(i)) null else data.get(i, 
elementType)
+  buildVariant(builder, element, elementType)
 }
 builder.finishWritingArray(start, offsets)
   case MapType(StringType, valueType, _) =>
@@ -116,7 +117,8 @@ object VariantExpressionEvalUtils {
   val key = keys.getUTF8String(i).toString
   val id = builder.addKey(key)
   fields.add(new VariantBuilder.FieldEntry(key, id, 
builder.getWritePos - start))
-  buildVariant(builder, values.get(i, valueType), valueType)
+  val value = if (values.isNullAt(i)) null else values.get(i, 
valueType)
+  buildVariant(builder, value, valueType)
 }
 builder.finishWritingObject(start, fields)
   case StructType(structFields) =>
@@ -127,7 +129,8 @@ object VariantExpressionEvalUtils {
   val key = structFields(i).name
   val id = builder.addKey(key)
   fields.add(new VariantBuilder.FieldEntry(key, id, 
builder.getWritePos - start))
-  buildVariant(builder, data.get(i, structFields(i).dataType), 
structFields(i).dataType)
+  val value = if (data.isNullAt(i)) null else data.get(i, 
structFields(i).dataType)
+  buildVariant(builder, value, structFields(i).dataType)
 }
 builder.finishWritingObject(start, fields)
 }
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/VariantEndToEndSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/VariantEndToEndSuite.scala
index 3964bf3aedec..53be9d50d351 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/VariantEndToEndSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/VariantEndToEndSuite.scala
@@ -16,11 +16,13 @@
  */
 package org.apache.spark.sql
 
-import org.apache.spark.sql.catalyst.expressions.{CreateArray, 
CreateNamedStruct, JsonToStructs, Literal, StructsToJson}
+import org.apache.spark.sql.catalyst.expressions.{Cast, CreateArray, 
CreateNamedStruct, JsonToStructs, Literal, StructsToJson}
 import org.apache.spark.sql.catalyst.expressions.variant.ParseJson
 import org.apache.spark.sql.execution.WholeStageCodegenExec
+import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector
 import org.apache.spark.sql.test.SharedSparkSession
-import org.apache.spark.sql.types.Va
