(spark-website) branch asf-site updated: add a behavior change guideline (#518)

2024-06-07 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/spark-website.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 93d08b7bff add a behavior change guideline (#518)
93d08b7bff is described below

commit 93d08b7bff0875917e59a0158ed1daf794ddff99
Author: Wenchen Fan 
AuthorDate: Sat Jun 8 07:12:35 2024 +0800

add a behavior change guideline (#518)

* behavior change guide

* Apply suggestions from code review

Co-authored-by: Niranjan 

* address comments

* address comments

-

Co-authored-by: Niranjan 
---
 contributing.md| 31 +++
 site/contributing.html | 33 +
 2 files changed, 64 insertions(+)

diff --git a/contributing.md b/contributing.md
index 8f0ec49869..06f5fdf5b9 100644
--- a/contributing.md
+++ b/contributing.md
@@ -209,6 +209,37 @@ When writing error messages, you should:
 
 See the error message 
guidelines for more details.
 
+Behavior changes
+
+Behavior changes are user-visible functional changes in a new release via 
public APIs. The term 'user' here refers
+not only to those who write queries and/or develop Spark plugins, but also to 
those who deploy and/or manage Spark
+clusters. New features and bug fixes, such as correcting query results or 
schemas and failing unsupported queries
+that previously returned incorrect results, are considered behavior changes. 
However, performance improvements,
+code refactoring, and changes to unreleased APIs/features are not.
+
+Everyone makes mistakes, including Spark developers. We will continue to fix 
defects in Spark as they arise.
+However, it is important to communicate these behavior changes so that Spark 
users can be prepared for version
+upgrades. If a PR introduces behavior changes, it should be explicitly 
mentioned in the PR description. If the
+behavior change may require additional user actions, this should be 
highlighted in the migration guide
+(docs/sql-migration-guide.md for the SQL component and similar files for other 
components). Where possible,
+provide options to restore the previous behavior and mention these options in 
the error message. Some examples include:
+
+- Bug fixes that change query results. Users may need to backfill to correct 
existing data and must be informed about
+these correctness fixes.
+- Bug fixes that change the query schema. Users may need to update the schema 
of tables in their data pipelines and must
+be informed about these changes.
+- Removing or renaming Spark configurations.
+- Renaming error classes or conditions.
+- Any non-additive changes to the public Python/SQL/Scala/Java/R APIs 
(including developer APIs), such as renaming
+functions, removing parameters, adding parameters, renaming parameters, or 
changing parameter default values. These
+changes should generally be avoided, or if necessary, done in a 
binary-compatible manner by deprecating the old function
+and introducing a new one instead.
+- Any non-additive changes to the way Spark should be deployed and managed: 
renaming argument names in deployment scripts,
+updates to the REST API, changes to the method of loading configuration files, 
etc.
+
+This list is not meant to be comprehensive. Anyone reviewing a PR can ask the 
PR author to add to the migration guide
+if they believe the change is risky and may disrupt users during an upgrade.
+
 Code review criteria
 
 Before considering how to contribute code, it's useful to understand how code 
is reviewed, 
diff --git a/site/contributing.html b/site/contributing.html
index 47d4d8d662..aeaeeceb82 100644
--- a/site/contributing.html
+++ b/site/contributing.html
@@ -362,6 +362,39 @@ error messages.
 
 See the error message 
guidelines for more details.
 
+Behavior changes
+
+Behavior changes are user-visible functional changes in a new release via 
public APIs. The term user here refers
+not only to those who write queries and/or develop Spark plugins, but also to 
those who deploy and/or manage Spark
+clusters. New features and bug fixes, such as correcting query results or 
schemas and failing unsupported queries
+that previously returned incorrect results, are considered behavior changes. 
However, performance improvements,
+code refactoring, and changes to unreleased APIs/features are not.
+
+Everyone makes mistakes, including Spark developers. We will continue to 
fix defects in Spark as they arise.
+However, it is important to communicate these behavior changes so that Spark 
users can be prepared for version
+upgrades. If a PR introduces behavior changes, it should be explicitly 
mentioned in the PR description. If the
+behavior change may require additional user actions, this should be 
highlighted in the migration guide
+(docs/sql-migration-guide.md for the SQL component and similar files
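For illustration only, and not part of the quoted commit: the guideline's advice to evolve public APIs by deprecating the old function and adding a new one, rather than changing a signature in place, might look like the minimal Scala sketch below. The `Reader` object and its methods are hypothetical.

```scala
object Reader {
  // Existing public API: keep it working and deprecate it instead of changing
  // its signature in place.
  @deprecated("Use load(path, options) instead", "4.0.0")
  def load(path: String): List[String] =
    load(path, Map.empty)

  // New API introduced alongside the old one; existing callers keep compiling
  // and linking against the old overload, new callers can pass options.
  def load(path: String, options: Map[String, String]): List[String] = {
    val source = scala.io.Source.fromFile(path)
    try source.getLines().toList finally source.close()
  }
}
```

Existing callers migrate at their own pace, and the old overload can be removed in a later major release together with a migration-guide entry.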

(spark) branch master updated (d81b1e3d358c -> 8911d59005e8)

2024-06-07 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from d81b1e3d358c [SPARK-48559][SQL] Fetch globalTempDatabase name directly 
without invoking initialization of GlobalTempViewManager
 add 8911d59005e8 [SPARK-46393][SQL][FOLLOWUP] Classify exceptions in 
JDBCTableCatalog.loadTable and Fix UT

No new revisions were added by this update.

Summary of changes:
 .../src/main/resources/error/error-conditions.json |  5 +++
 .../org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala  | 36 ++
 .../datasources/v2/jdbc/JDBCTableCatalog.scala | 13 +---
 3 files changed, 30 insertions(+), 24 deletions(-)





(spark) branch master updated (87b0f5995383 -> d81b1e3d358c)

2024-06-07 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 87b0f5995383 [SPARK-48561][PS][CONNECT] Throw 
`PandasNotImplementedError` for unsupported plotting functions
 add d81b1e3d358c [SPARK-48559][SQL] Fetch globalTempDatabase name directly 
without invoking initialization of GlobalTempViewManager

No new revisions were added by this update.

Summary of changes:
 .../catalyst/catalog/GlobalTempViewManager.scala   |  2 +-
 .../sql/catalyst/catalog/SessionCatalog.scala  | 35 +++---
 .../org/apache/spark/sql/internal/SQLConf.scala|  2 ++
 .../sql/catalyst/catalog/SessionCatalogSuite.scala | 20 ++---
 .../execution/command/AnalyzeColumnCommand.scala   |  2 +-
 .../apache/spark/sql/internal/SharedState.scala|  3 +-
 .../org/apache/spark/sql/CachedTableSuite.scala|  8 ++---
 .../spark/sql/StatisticsCollectionSuite.scala  |  2 +-
 .../spark/sql/execution/GlobalTempViewSuite.scala  |  2 +-
 .../apache/spark/sql/execution/SQLViewSuite.scala  |  6 ++--
 .../spark/sql/execution/SQLViewTestSuite.scala |  2 +-
 .../command/AlterTableDropPartitionSuiteBase.scala |  2 +-
 .../AlterTableRenamePartitionSuiteBase.scala   |  2 +-
 .../spark/sql/execution/command/DDLSuite.scala |  2 +-
 .../execution/command/TruncateTableSuiteBase.scala |  4 +--
 .../command/v1/AlterTableAddPartitionSuite.scala   |  2 +-
 .../command/v2/AlterTableAddPartitionSuite.scala   |  2 +-
 .../thriftserver/SparkGetColumnsOperation.scala|  2 +-
 .../thriftserver/SparkGetSchemasOperation.scala|  2 +-
 .../thriftserver/SparkGetTablesOperation.scala |  2 +-
 .../ThriftServerWithSparkContextSuite.scala|  2 +-
 .../spark/sql/hive/HiveSharedStateSuite.scala  |  2 +-
 22 files changed, 56 insertions(+), 52 deletions(-)





(spark) branch branch-3.5 updated: [SPARK-48286] Fix analysis of column with exists default expression - Add user facing error

2024-06-06 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new a00c11546273 [SPARK-48286] Fix analysis of column with exists default 
expression - Add user facing error
a00c11546273 is described below

commit a00c11546273089dbfa993fa4c170eb70beecbc3
Author: Uros Stankovic 
AuthorDate: Thu Jun 6 13:08:48 2024 -0700

[SPARK-48286] Fix analysis of column with exists default expression - Add 
user facing error

FIRST CHANGE

Pass correct parameter list to 
`org.apache.spark.sql.catalyst.util.ResolveDefaultColumns#analyze` when it is 
invoked from 
`org.apache.spark.sql.connector.catalog.CatalogV2Util#structFieldToV2Column`.

The `org.apache.spark.sql.catalyst.util.ResolveDefaultColumns#analyze` method 
accepts 3 parameters:

1) Field to analyze
2) Statement type - String
3) Metadata key - CURRENT_DEFAULT or EXISTS_DEFAULT

The method 
`org.apache.spark.sql.connector.catalog.CatalogV2Util#structFieldToV2Column`
passes `fieldToAnalyze` and `EXISTS_DEFAULT` as the second parameter, so 
`EXISTS_DEFAULT` is treated as the statement type rather than the metadata key, 
and a different expression is analyzed as a result.

Pull requests where original change was introduced
https://github.com/apache/spark/pull/40049 - Initial commit
https://github.com/apache/spark/pull/44876 - Refactor that did not touch 
the issue
https://github.com/apache/spark/pull/44935 - Another refactor that did not 
touch the issue

SECOND CHANGE
Add a user-facing exception when the default value is not foldable or resolved. 
Otherwise, the user would see the message "You hit a bug in Spark ...".
It is needed to pass the correct value to the `Column` object.

Yes, this is a bug fix: the existence default value now has the proper expression, 
whereas before this change the existence default value was actually the current 
default value of the column.

Unit test

No

Closes #46594 from 
urosstan-db/SPARK-48286-Analyze-exists-default-expression-instead-of-current-default-expression.

Lead-authored-by: Uros Stankovic 
Co-authored-by: Uros Stankovic 
<155642965+urosstan...@users.noreply.github.com>
    Signed-off-by: Wenchen Fan 
(cherry picked from commit 0f21df0b29cc18f0e0c7b12543f3a037e4032e65)
    Signed-off-by: Wenchen Fan 
---
 .../catalyst/util/ResolveDefaultColumnsUtil.scala  | 16 +++
 .../sql/connector/catalog/CatalogV2Util.scala  |  7 ++-
 .../DataSourceV2DataFrameSessionCatalogSuite.scala |  9 +++-
 .../spark/sql/connector/DataSourceV2SQLSuite.scala | 24 ++
 4 files changed, 54 insertions(+), 2 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala
index 50ff3eeab0c1..f55fa2d8f5e8 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala
@@ -279,6 +279,7 @@ object ResolveDefaultColumns extends QueryErrorsBase with 
ResolveDefaultColumnsU
   throw 
QueryCompilationErrors.defaultValuesMayNotContainSubQueryExpressions(
 statementType, colName, defaultSQL)
 }
+
 // Analyze the parse result.
 val plan = try {
   val analyzer: Analyzer = DefaultColumnAnalyzer
@@ -293,6 +294,21 @@ object ResolveDefaultColumns extends QueryErrorsBase with 
ResolveDefaultColumnsU
 val analyzed: Expression = plan.collectFirst {
   case Project(Seq(a: Alias), OneRowRelation()) => a.child
 }.get
+
+if (!analyzed.foldable) {
+  throw QueryCompilationErrors.defaultValueNotConstantError(statementType, 
colName, defaultSQL)
+}
+
+// Another extra check, expressions should already be resolved if 
AnalysisException is not
+// thrown in the code block above
+if (!analyzed.resolved) {
+  throw QueryCompilationErrors.defaultValuesUnresolvedExprError(
+statementType,
+colName,
+defaultSQL,
+cause = null)
+}
+
 // Perform implicit coercion from the provided expression type to the 
required column type.
 if (dataType == analyzed.dataType) {
   analyzed
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala
index be569b1de9db..47c438f154ab 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala
@@ -512,10 +512,15 @@ private[sql] object CatalogV2Util {
 }
 
 if (isDefaultCol

(spark) branch master updated: [SPARK-48286] Fix analysis of column with exists default expression - Add user facing error

2024-06-06 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0f21df0b29cc [SPARK-48286] Fix analysis of column with exists default 
expression - Add user facing error
0f21df0b29cc is described below

commit 0f21df0b29cc18f0e0c7b12543f3a037e4032e65
Author: Uros Stankovic 
AuthorDate: Thu Jun 6 13:08:48 2024 -0700

[SPARK-48286] Fix analysis of column with exists default expression - Add 
user facing error

### What changes were proposed in this pull request?

FIRST CHANGE

Pass correct parameter list to 
`org.apache.spark.sql.catalyst.util.ResolveDefaultColumns#analyze` when it is 
invoked from 
`org.apache.spark.sql.connector.catalog.CatalogV2Util#structFieldToV2Column`.

The `org.apache.spark.sql.catalyst.util.ResolveDefaultColumns#analyze` method 
accepts 3 parameters:

1) Field to analyze
2) Statement type - String
3) Metadata key - CURRENT_DEFAULT or EXISTS_DEFAULT

The method 
`org.apache.spark.sql.connector.catalog.CatalogV2Util#structFieldToV2Column`
passes `fieldToAnalyze` and `EXISTS_DEFAULT` as the second parameter, so 
`EXISTS_DEFAULT` is treated as the statement type rather than the metadata key, 
and a different expression is analyzed as a result.

Pull requests where original change was introduced
https://github.com/apache/spark/pull/40049 - Initial commit
https://github.com/apache/spark/pull/44876 - Refactor that did not touch 
the issue
https://github.com/apache/spark/pull/44935 - Another refactor that did not 
touch the issue

SECOND CHANGE
Add a user-facing exception when the default value is not foldable or resolved. 
Otherwise, the user would see the message "You hit a bug in Spark ...".
### Why are the changes needed?
It is needed to pass the correct value to the `Column` object.

### Does this PR introduce _any_ user-facing change?
Yes, this is a bug fix: the existence default value now has the proper expression, 
whereas before this change the existence default value was actually the current 
default value of the column.

### How was this patch tested?
Unit test

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46594 from 
urosstan-db/SPARK-48286-Analyze-exists-default-expression-instead-of-current-default-expression.

Lead-authored-by: Uros Stankovic 
Co-authored-by: Uros Stankovic 
<155642965+urosstan...@users.noreply.github.com>
    Signed-off-by: Wenchen Fan 
---
 .../catalyst/util/ResolveDefaultColumnsUtil.scala  | 16 +++
 .../sql/connector/catalog/CatalogV2Util.scala  |  7 ++-
 .../DataSourceV2DataFrameSessionCatalogSuite.scala |  9 +++-
 .../spark/sql/connector/DataSourceV2SQLSuite.scala | 24 ++
 4 files changed, 54 insertions(+), 2 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala
index d73e2ca6bd9d..ad104b6e0c76 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala
@@ -284,6 +284,7 @@ object ResolveDefaultColumns extends QueryErrorsBase
   throw 
QueryCompilationErrors.defaultValuesMayNotContainSubQueryExpressions(
 statementType, colName, defaultSQL)
 }
+
 // Analyze the parse result.
 val plan = try {
   val analyzer: Analyzer = DefaultColumnAnalyzer
@@ -298,6 +299,21 @@ object ResolveDefaultColumns extends QueryErrorsBase
 val analyzed: Expression = plan.collectFirst {
   case Project(Seq(a: Alias), OneRowRelation()) => a.child
 }.get
+
+if (!analyzed.foldable) {
+  throw QueryCompilationErrors.defaultValueNotConstantError(statementType, 
colName, defaultSQL)
+}
+
+// Another extra check, expressions should already be resolved if 
AnalysisException is not
+// thrown in the code block above
+if (!analyzed.resolved) {
+  throw QueryCompilationErrors.defaultValuesUnresolvedExprError(
+statementType,
+colName,
+defaultSQL,
+cause = null)
+}
+
 // Perform implicit coercion from the provided expression type to the 
required column type.
 coerceDefaultValue(analyzed, dataType, statementType, colName, defaultSQL)
   }
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala
index 5485f5255b6e..f36310e8ad89 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sq
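As a rough editorial sketch of the user-visible effect of this fix, not taken from the commit itself: a column DEFAULT value that does not analyze to a constant should now fail with a proper user-facing error rather than an internal one. The table names and local session setup below are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

object DefaultValueDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("default-value-demo").getOrCreate()

    // A constant default is foldable and should be accepted.
    spark.sql("CREATE TABLE t_ok (id INT, status STRING DEFAULT 'active') USING parquet")

    // A non-deterministic default such as rand() is not foldable; with this fix
    // the analyzer should raise a user-facing error instead of an internal one.
    try {
      spark.sql("CREATE TABLE t_bad (id INT, score DOUBLE DEFAULT rand()) USING parquet")
    } catch {
      case e: Exception => println(s"Rejected as expected: ${e.getMessage}")
    }

    spark.stop()
  }
}
```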

(spark) branch master updated: [SPARK-48283][SQL] Modify string comparison for UTF8_BINARY_LCASE

2024-06-06 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 84fa0527834b [SPARK-48283][SQL] Modify string comparison for 
UTF8_BINARY_LCASE
84fa0527834b is described below

commit 84fa0527834b947ad12e4a6398512c75929cc99b
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Thu Jun 6 12:05:28 2024 -0700

[SPARK-48283][SQL] Modify string comparison for UTF8_BINARY_LCASE

### What changes were proposed in this pull request?
String comparison and hashing in UTF8_BINARY_LCASE are now context-unaware, 
and use ICU root locale rules to convert strings to lowercase at the code point 
level, taking into consideration special cases for one-to-many case mapping. 
For example: comparing "ΘΑΛΑΣΣΙΝΟΣ" and "θαλασσινοσ" under UTF8_BINARY_LCASE 
now returns true, because Greek final sigma is special-cased in the new 
comparison implementation.

### Why are the changes needed?
1. UTF8_BINARY_LCASE should use ICU root locale rules (instead of JVM)
2. comparing strings under UTF8_BINARY_LCASE should be context-insensitive

### Does this PR introduce _any_ user-facing change?
Yes, comparing strings under UTF8_BINARY_LCASE will now give different 
results in two kinds of special cases (Turkish dotted letter "i" and Greek 
final letter "sigma").

### How was this patch tested?
Unit tests in `CollationSupportSuite`.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46700 from uros-db/lcase-casing.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../catalyst/util/CollationAwareUTF8String.java|  90 
 .../spark/sql/catalyst/util/CollationFactory.java  |   4 +-
 .../org/apache/spark/unsafe/types/UTF8String.java  |  30 +---
 .../spark/unsafe/types/CollationSupportSuite.java  | 151 +
 .../apache/spark/unsafe/types/UTF8StringSuite.java |  23 
 5 files changed, 244 insertions(+), 54 deletions(-)

diff --git 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
index cf3b5c86dcf6..056b202bc398 100644
--- 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
+++ 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
@@ -183,6 +183,54 @@ public class CollationAwareUTF8String {
 return MATCH_NOT_FOUND;
   }
 
+  /**
+   * Lowercase UTF8String comparison used for UTF8_BINARY_LCASE collation. 
While the default
+   * UTF8String comparison is equivalent to 
a.toLowerCase().binaryCompare(b.toLowerCase()), this
+   * method uses code points to compare the strings in a case-insensitive 
manner using ICU rules,
+   * as well as handling special rules for one-to-many case mappings (see: 
lowerCaseCodePoints).
+   *
+   * @param left The first UTF8String to compare.
+   * @param right The second UTF8String to compare.
+   * @return An integer representing the comparison result.
+   */
+  public static int compareLowerCase(final UTF8String left, final UTF8String 
right) {
+// Only if both strings are ASCII, we can use faster comparison (no string 
allocations).
+if (left.isFullAscii() && right.isFullAscii()) {
+  return compareLowerCaseAscii(left, right);
+}
+return compareLowerCaseSlow(left, right);
+  }
+
+  /**
+   * Fast version of the `compareLowerCase` method, used when both arguments 
are ASCII strings.
+   *
+   * @param left The first ASCII UTF8String to compare.
+   * @param right The second ASCII UTF8String to compare.
+   * @return An integer representing the comparison result.
+   */
+  private static int compareLowerCaseAscii(final UTF8String left, final 
UTF8String right) {
+int leftBytes = left.numBytes(), rightBytes = right.numBytes();
+for (int curr = 0; curr < leftBytes && curr < rightBytes; curr++) {
+  int lowerLeftByte = Character.toLowerCase(left.getByte(curr));
+  int lowerRightByte = Character.toLowerCase(right.getByte(curr));
+  if (lowerLeftByte != lowerRightByte) {
+return lowerLeftByte - lowerRightByte;
+  }
+}
+return leftBytes - rightBytes;
+  }
+
+  /**
+   * Slow version of the `compareLowerCase` method, used when both arguments 
are non-ASCII strings.
+   *
+   * @param left The first non-ASCII UTF8String to compare.
+   * @param right The second non-ASCII UTF8String to compare.
+   * @return An integer representing the comparison result.
+   */
+  private static int compareLowerCaseSlow(final UTF8String left, final 
UTF8S
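A small editorial sketch of the user-visible behavior described above, assuming a Spark build with collation support: the query compares the two Greek spellings from the commit message under UTF8_BINARY_LCASE, which should now evaluate to true.

```scala
import org.apache.spark.sql.SparkSession

object LcaseCollationDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("lcase-collation-demo").getOrCreate()

    // Under UTF8_BINARY_LCASE the two spellings compare equal, because the
    // comparison lowercases at the code point level, including the final sigma.
    spark.sql(
      "SELECT 'ΘΑΛΑΣΣΙΝΟΣ' COLLATE UTF8_BINARY_LCASE = 'θαλασσινοσ' COLLATE UTF8_BINARY_LCASE AS equal"
    ).show()

    spark.stop()
  }
}
```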

(spark) branch master updated (3878b57e6e88 -> b5a4b3200362)

2024-06-06 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 3878b57e6e88 [SPARK-48526][SS] Allow passing custom sink to 
testStream()
 add b5a4b3200362 [SPARK-48435][SQL] UNICODE collation should not support 
binary equality

No new revisions were added by this update.

Summary of changes:
 .../catalyst/util/CollationAwareUTF8String.java|  5 +-
 .../spark/sql/catalyst/util/CollationFactory.java  |  2 +-
 .../spark/unsafe/types/CollationSupportSuite.java  | 36 +--
 .../spark/unsafe/types/CollationFactorySuite.scala | 10 ++-
 .../expressions/CollationExpressionSuite.scala |  8 +--
 .../CollationRegexpExpressionsSuite.scala  | 71 +++---
 .../apache/spark/sql/CollationSQLRegexpSuite.scala | 31 +-
 .../sql/CollationStringExpressionsSuite.scala  | 32 +-
 .../streaming/StreamingDeduplicationSuite.scala|  2 +-
 9 files changed, 86 insertions(+), 111 deletions(-)





(spark) branch master updated: [SPARK-48526][SS] Allow passing custom sink to testStream()

2024-06-06 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 3878b57e6e88 [SPARK-48526][SS] Allow passing custom sink to 
testStream()
3878b57e6e88 is described below

commit 3878b57e6e88631826c1c8690eb9052e5efa5aa1
Author: Johan Lasperas 
AuthorDate: Thu Jun 6 11:19:53 2024 -0700

[SPARK-48526][SS] Allow passing custom sink to testStream()

### What changes were proposed in this pull request?
Update `StreamTest:testStream()` to allow passing a custom sink. This 
allows writing better tests covering streaming sinks, in particular:
- reusing a sink across calls to testStream.
- passing a custom sink implementation.

### Why are the changes needed?
Better testing infrastructure.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
N/A

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46866 from johanl-db/allow-passing-custom-sink-stream-test.

Authored-by: Johan Lasperas 
Signed-off-by: Wenchen Fan 
---
 .../src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala| 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala
index d7401897ff6a..7439c7ab6d6e 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala
@@ -346,7 +346,8 @@ trait StreamTest extends QueryTest with SharedSparkSession 
with TimeLimits with
   def testStream(
   _stream: Dataset[_],
   outputMode: OutputMode = OutputMode.Append,
-  extraOptions: Map[String, String] = Map.empty)(actions: StreamAction*): 
Unit = synchronized {
+  extraOptions: Map[String, String] = Map.empty,
+  sink: MemorySink = new MemorySink())(actions: StreamAction*): Unit = 
synchronized {
 import org.apache.spark.sql.streaming.util.StreamManualClock
 
 // `synchronized` is added to prevent the user from calling multiple 
`testStream`s concurrently
@@ -359,7 +360,6 @@ trait StreamTest extends QueryTest with SharedSparkSession 
with TimeLimits with
 var currentStream: StreamExecution = null
 var lastStream: StreamExecution = null
 val awaiting = new mutable.HashMap[Int, OffsetV2]() // source index -> 
offset to wait for
-val sink = new MemorySink
 val resetConfValues = mutable.Map[String, Option[String]]()
 val defaultCheckpointLocation =
   Utils.createTempDir(namePrefix = "streaming.metadata").getCanonicalPath
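A hypothetical usage sketch of the new `sink` parameter, not taken from the patch: a suite extending `StreamTest` can now hand its own `MemorySink` to `testStream` and inspect it afterwards. The suite and test names are made up, and the sketch leans on Spark's internal test utilities.

```scala
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.execution.streaming.sources.MemorySink
import org.apache.spark.sql.streaming.{OutputMode, StreamTest}

class CustomSinkSuite extends StreamTest {
  import testImplicits._

  test("testStream accepts a caller-provided sink") {
    val inputData = MemoryStream[Int]
    val mySink = new MemorySink()

    // Pass the sink explicitly instead of letting testStream create a fresh one.
    testStream(inputData.toDS(), OutputMode.Append, sink = mySink)(
      AddData(inputData, 1, 2, 3),
      CheckAnswer(1, 2, 3)
    )

    // The caller still holds a reference and can inspect what the query wrote.
    assert(mySink.allData.nonEmpty)
  }
}
```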





(spark) branch master updated (7cba1ab4d6ac -> 9f4007f3d89e)

2024-06-06 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 7cba1ab4d6ac [SPARK-48554][INFRA] Use R 4.4.0 in `windows` R GitHub 
Action Window job
 add 9f4007f3d89e [SPARK-48546][SQL] Fix ExpressionEncoder after replacing 
NullPointerExceptions with proper error classes in AssertNotNull expression

No new revisions were added by this update.

Summary of changes:
 .../sql/catalyst/encoders/ExpressionEncoder.scala  |  5 +++
 .../catalyst/encoders/EncoderResolutionSuite.scala | 15 +++-
 .../sql/catalyst/encoders/RowEncoderSuite.scala|  2 +-
 .../scala/org/apache/spark/sql/DatasetSuite.scala  | 40 ++
 4 files changed, 29 insertions(+), 33 deletions(-)





(spark) branch master updated: [SPARK-47552][CORE][FOLLOWUP] Set spark.hadoop.fs.s3a.connection.establish.timeout to numeric

2024-06-05 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 966c3d9ef1ed [SPARK-47552][CORE][FOLLOWUP] Set 
spark.hadoop.fs.s3a.connection.establish.timeout to numeric
966c3d9ef1ed is described below

commit 966c3d9ef1edc8b2f7d53b8a592ff4e2a2f9b80b
Author: Wenchen Fan 
AuthorDate: Wed Jun 5 20:49:03 2024 -0700

[SPARK-47552][CORE][FOLLOWUP] Set 
spark.hadoop.fs.s3a.connection.establish.timeout to numeric

### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/45710 . Some 
custom `FileSystem` implementations read the 
`hadoop.fs.s3a.connection.establish.timeout` config as numeric, and do not 
support the `30s` syntax. To make it safe, this PR proposes to set this conf to 
`30000` instead of `30s`. I checked the doc page and this config is in 
milliseconds.

### Why are the changes needed?

more compatible with custom `FileSystem` implementations.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

manual

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #46874 from cloud-fan/follow.

Authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
---
 core/src/main/scala/org/apache/spark/SparkContext.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala 
b/core/src/main/scala/org/apache/spark/SparkContext.scala
index 90d8cef00ef8..6eb2bea40bdb 100644
--- a/core/src/main/scala/org/apache/spark/SparkContext.scala
+++ b/core/src/main/scala/org/apache/spark/SparkContext.scala
@@ -421,7 +421,7 @@ class SparkContext(config: SparkConf) extends Logging {
 }
 // HADOOP-19097 Set fs.s3a.connection.establish.timeout to 30s
 // We can remove this after Apache Hadoop 3.4.1 releases
-conf.setIfMissing("spark.hadoop.fs.s3a.connection.establish.timeout", 
"30s")
+conf.setIfMissing("spark.hadoop.fs.s3a.connection.establish.timeout", 
"30000")
 // This should be set as early as possible.
 SparkContext.fillMissingMagicCommitterConfsIfNeeded(_conf)
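As an editorial aside: because the value is applied with `setIfMissing`, a deployment that sets the config explicitly is unaffected. A minimal sketch using the numeric milliseconds form:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object S3aTimeoutDemo {
  def main(args: Array[String]): Unit = {
    // An explicit setting wins over the setIfMissing default applied by SparkContext.
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("s3a-timeout-demo")
      .set("spark.hadoop.fs.s3a.connection.establish.timeout", "30000")

    val spark = SparkSession.builder().config(conf).getOrCreate()
    println(spark.sparkContext.getConf.get("spark.hadoop.fs.s3a.connection.establish.timeout"))
    spark.stop()
  }
}
```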
 





(spark) branch master updated: [SPARK-48307][SQL][FOLLOWUP] Allow outer references in un-referenced CTE relations

2024-06-05 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d5c33c6bfb57 [SPARK-48307][SQL][FOLLOWUP] Allow outer references in 
un-referenced CTE relations
d5c33c6bfb57 is described below

commit d5c33c6bfb5757b243fc8e1734daeaa4fe3b9b32
Author: Wenchen Fan 
AuthorDate: Wed Jun 5 14:38:44 2024 -0700

[SPARK-48307][SQL][FOLLOWUP] Allow outer references in un-referenced CTE 
relations

### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/46617 .  
Subquery expression has a bunch of correlation checks which need to match 
certain plan shapes. We broke this by leaving `WithCTE` in the plan for 
un-referenced CTE relations. This PR fixes the issue by skipping CTE plan nodes 
in correlated subquery expression checks.

### Why are the changes needed?

bug fix
### Does this PR introduce _any_ user-facing change?

no, the bug is not released yet.

### How was this patch tested?

new tests

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #46869 from cloud-fan/check.

Authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/analysis/CheckAnalysis.scala  |  7 +
 .../plans/logical/basicLogicalOperators.scala  |  4 +++
 .../sql-tests/analyzer-results/cte-legacy.sql.out  | 24 +++
 .../sql-tests/analyzer-results/cte-nested.sql.out  | 34 ++
 .../analyzer-results/cte-nonlegacy.sql.out | 34 ++
 .../test/resources/sql-tests/inputs/cte-nested.sql | 12 
 .../resources/sql-tests/results/cte-legacy.sql.out | 22 ++
 .../resources/sql-tests/results/cte-nested.sql.out | 22 ++
 .../sql-tests/results/cte-nonlegacy.sql.out| 22 ++
 9 files changed, 181 insertions(+)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
index 8c380a7228c6..f4408220ac93 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
@@ -1371,6 +1371,13 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog with QueryErrorsB
 aggregated,
 canContainOuter && 
SQLConf.get.getConf(SQLConf.DECORRELATE_OFFSET_ENABLED))
 
+// We always inline CTE relations before analysis check, and only 
un-referenced CTE
+// relations will be kept in the plan. Here we should simply skip them 
and check the
+// children, as un-referenced CTE relations won't be executed anyway 
and doesn't need to
+// be restricted by the current subquery correlation limitations.
+case _: WithCTE | _: CTERelationDef =>
+  plan.children.foreach(p => checkPlan(p, aggregated, canContainOuter))
+
 // Category 4: Any other operators not in the above 3 categories
 // cannot be on a correlation path, that is they are allowed only
 // under a correlation point but they and their descendant operators
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
index 9242a06cf1d6..0135fcfb3cc8 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
@@ -911,6 +911,10 @@ case class WithCTE(plan: LogicalPlan, cteDefs: 
Seq[CTERelationDef]) extends Logi
   def withNewPlan(newPlan: LogicalPlan): WithCTE = {
 withNewChildren(children.init :+ newPlan).asInstanceOf[WithCTE]
   }
+
+  override def maxRows: Option[Long] = plan.maxRows
+
+  override def maxRowsPerPartition: Option[Long] = plan.maxRowsPerPartition
 }
 
 /**
diff --git 
a/sql/core/src/test/resources/sql-tests/analyzer-results/cte-legacy.sql.out 
b/sql/core/src/test/resources/sql-tests/analyzer-results/cte-legacy.sql.out
index 594a30b054ed..f9b78e94236f 100644
--- a/sql/core/src/test/resources/sql-tests/analyzer-results/cte-legacy.sql.out
+++ b/sql/core/src/test/resources/sql-tests/analyzer-results/cte-legacy.sql.out
@@ -43,6 +43,30 @@ Project [scalar-subquery#x [] AS scalarsubquery()#x]
 +- OneRowRelation
 
 
+-- !query
+SELECT (
+  WITH unreferenced AS (SELECT id)
+  SELECT 1
+) FROM range(1)
+-- !query analysis
+Project [scalar-subquery#x [] AS scalarsubquery()#x]
+:  +- Project [1 AS 1#x]
+: +- OneRowRelation
++- Range

(spark) branch master updated (34ac7de89711 -> 490a4b3b1fdf)

2024-06-05 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 34ac7de89711 [SPARK-48536][PYTHON][CONNECT] Cache user specified 
schema in applyInPandas and applyInArrow
 add 490a4b3b1fdf [SPARK-48498][SQL] Always do char padding in predicates

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/sql/internal/SQLConf.scala|  8 +
 .../datasources/ApplyCharTypePadding.scala | 39 --
 .../apache/spark/sql/CharVarcharTestSuite.scala| 28 
 .../org/apache/spark/sql/PlanStabilitySuite.scala  |  8 +++--
 4 files changed, 70 insertions(+), 13 deletions(-)
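For context, an editorial sketch of the padding behavior at stake, with illustrative table names and setup: a CHAR(5) value is treated as padded to length 5, so a predicate against a shorter literal only matches when padding is applied consistently on both sides.

```scala
import org.apache.spark.sql.SparkSession

object CharPaddingDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("char-padding-demo").getOrCreate()

    spark.sql("CREATE TABLE chars(c CHAR(5)) USING parquet")
    spark.sql("INSERT INTO chars VALUES ('abc')") // treated as 'abc  ', padded to length 5

    // The comparison of a CHAR(5) column with a shorter literal relies on
    // consistent char padding in the predicate to return the row.
    spark.sql("SELECT count(*) FROM chars WHERE c = 'abc'").show()

    spark.stop()
  }
}
```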





(spark) branch master updated: [SPARK-48307][SQL] InlineCTE should keep not-inlined relations in the original WithCTE node

2024-06-04 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8a0927c07a14 [SPARK-48307][SQL] InlineCTE should keep not-inlined 
relations in the original WithCTE node
8a0927c07a14 is described below

commit 8a0927c07a1483bcd9125bdc2062a63759b0a337
Author: Wenchen Fan 
AuthorDate: Tue Jun 4 15:04:22 2024 -0700

[SPARK-48307][SQL] InlineCTE should keep not-inlined relations in the 
original WithCTE node

### What changes were proposed in this pull request?

I noticed an outdated comment in the rule `InlineCTE`
```
  // CTEs in SQL Commands have been inlined by `CTESubstitution` 
already, so it is safe to add
  // WithCTE as top node here.
```

This is not true anymore after https://github.com/apache/spark/pull/42036 . 
It's not a big deal as we replace not-inlined CTE relations with `Repartition` 
during optimization, so it doesn't matter where we put the `WithCTE` node with 
not-inlined CTE relations, as it will disappear eventually. But it's still 
better to keep it at its original place, as third-party rules may be sensitive 
about the plan shape.

### Why are the changes needed?

to keep the plan shape as much as possible after inlining CTE relations.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #46617 from cloud-fan/cte.

Lead-authored-by: Wenchen Fan 
Co-authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/analysis/CheckAnalysis.scala  |  45 +--
 .../spark/sql/catalyst/optimizer/InlineCTE.scala   | 133 +
 .../sql/catalyst/optimizer/InlineCTESuite.scala|  42 +++
 3 files changed, 132 insertions(+), 88 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
index 1c2baa78be1b..8c380a7228c6 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
@@ -143,50 +143,17 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog with QueryErrorsB
   errorClass, missingCol, orderedCandidates, a.origin)
   }
 
-  private def checkUnreferencedCTERelations(
-  cteMap: mutable.Map[Long, (CTERelationDef, Int, mutable.Map[Long, Int])],
-  visited: mutable.Map[Long, Boolean],
-  danglingCTERelations: mutable.ArrayBuffer[CTERelationDef],
-  cteId: Long): Unit = {
-if (visited(cteId)) {
-  return
-}
-val (cteDef, _, refMap) = cteMap(cteId)
-refMap.foreach { case (id, _) =>
-  checkUnreferencedCTERelations(cteMap, visited, danglingCTERelations, id)
-}
-danglingCTERelations.append(cteDef)
-visited(cteId) = true
-  }
-
   def checkAnalysis(plan: LogicalPlan): Unit = {
-val inlineCTE = InlineCTE(alwaysInline = true)
-val cteMap = mutable.HashMap.empty[Long, (CTERelationDef, Int, 
mutable.Map[Long, Int])]
-inlineCTE.buildCTEMap(plan, cteMap)
-val danglingCTERelations = mutable.ArrayBuffer.empty[CTERelationDef]
-val visited: mutable.Map[Long, Boolean] = 
mutable.Map.empty.withDefaultValue(false)
-// If a CTE relation is never used, it will disappear after inline. Here 
we explicitly collect
-// these dangling CTE relations, and put them back in the main query, to 
make sure the entire
-// query plan is valid.
-cteMap.foreach { case (cteId, (_, refCount, _)) =>
-  // If a CTE relation ref count is 0, the other CTE relations that 
reference it should also be
-  // collected. This code will also guarantee the leaf relations that do 
not reference
-  // any others are collected first.
-  if (refCount == 0) {
-checkUnreferencedCTERelations(cteMap, visited, danglingCTERelations, 
cteId)
-  }
-}
-// Inline all CTEs in the plan to help check query plan structures in 
subqueries.
-var inlinedPlan: LogicalPlan = plan
-try {
-  inlinedPlan = inlineCTE(plan)
+// We should inline all CTE relations to restore the original plan shape, 
as the analysis check
+// may need to match certain plan shapes. For dangling CTE relations, they 
will still be kept
+// in the original `WithCTE` node, as we need to perform analysis check 
for them as well.
+val inlineCTE = InlineCTE(alwaysInline = true, keepDanglingRelations = 
true)
+val inlinedPlan: LogicalPlan = try {
+  inlineCTE(plan)
 } catch {
   case e: AnalysisException =>
 throw new ExtendedAnalysisEx
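An editorial sketch of a relation that is typically not inlined, assuming the usual optimizer behavior: a non-deterministic CTE referenced more than once is kept as a shared relation so its value is computed only once, which is the kind of not-inlined relation this change keeps in the original WithCTE node.

```scala
import org.apache.spark.sql.SparkSession

object NotInlinedCteDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("cte-demo").getOrCreate()

    // The CTE is non-deterministic and referenced twice, so inlining it would
    // change semantics; the optimizer keeps it as a shared relation instead.
    val df = spark.sql(
      """WITH r AS (SELECT rand() AS v)
        |SELECT a.v, b.v FROM r a JOIN r b ON a.v = b.v
        |""".stripMargin)

    df.explain(true)
    spark.stop()
  }
}
```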

(spark) branch master updated (651f68782ab7 -> c7caac9b10ca)

2024-06-04 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 651f68782ab7 [SPARK-48531][INFRA] Fix `Black` target version to Python 
3.9
 add c7caac9b10ca [SPARK-47972][SQL][FOLLOWUP] Restrict CAST expression for 
collations

No new revisions were added by this update.

Summary of changes:
 .../src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala | 1 -
 1 file changed, 1 deletion(-)





(spark) branch master updated: [SPARK-48318][SQL] Enable hash join support for all collations (complex types)

2024-06-04 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c852c4f72acb [SPARK-48318][SQL] Enable hash join support for all 
collations (complex types)
c852c4f72acb is described below

commit c852c4f72acb658ff0193f16b526c8f653188a4e
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Tue Jun 4 00:10:50 2024 -0700

[SPARK-48318][SQL] Enable hash join support for all collations (complex 
types)

### What changes were proposed in this pull request?
Enable collation support for hash join on complex types.

- Logical plan is rewritten in analysis to (recursively) replace all 
non-binary strings with CollationKey
- CollationKey is a unary expression that transforms StringType to 
BinaryType
- Collation keys allow correct & efficient string comparison under specific 
collation rules

### Why are the changes needed?
Improve JOIN performance for complex types containing collated strings.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

- Unit tests for `CollationKey` in `CollationExpressionSuite`
- E2e SQL tests for `RewriteCollationJoin` in `CollationSuite`
- Various queries with JOIN in existing TPCDS collation test suite

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46722 from uros-db/hash-join-cmx.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
    Signed-off-by: Wenchen Fan 
---
 .../catalyst/analysis/RewriteCollationJoin.scala   |  72 ++-
 .../org/apache/spark/sql/CollationSuite.scala  | 228 -
 2 files changed, 289 insertions(+), 11 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteCollationJoin.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteCollationJoin.scala
index fd443fd19a1f..ae29d21c7a71 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteCollationJoin.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteCollationJoin.scala
@@ -17,24 +17,27 @@
 
 package org.apache.spark.sql.catalyst.analysis
 
-import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
CollationKey, Equality}
+import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}
 import org.apache.spark.sql.catalyst.rules.Rule
-import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.catalyst.util.UnsafeRowUtils
+import org.apache.spark.sql.types._
 import org.apache.spark.sql.types.StringType
+import org.apache.spark.util.ArrayImplicits.SparkArrayOps
 
+/**
+ * This rule rewrites Join conditions to ensure that all types containing 
non-binary collated
+ * strings are compared correctly. This is necessary because join conditions 
are evaluated using
+ * binary equality, which does not work correctly for non-binary collated 
strings. However, by
+ * injecting CollationKey expressions into the join condition, we can ensure 
that the comparison
+ * is done correctly, which then allows HashJoin to work properly on this type 
of data.
+ */
 object RewriteCollationJoin extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
 case j @ Join(_, _, _, Some(condition), _) =>
   val newCondition = condition transform {
 case e @ Equality(l: AttributeReference, r: AttributeReference) =>
-  (l.dataType, r.dataType) match {
-case (st: StringType, _: StringType)
-  if 
!CollationFactory.fetchCollation(st.collationId).supportsBinaryEquality =>
-e.withNewChildren(Seq(CollationKey(l), CollationKey(r)))
-case _ =>
-  e
-  }
+  e.withNewChildren(Seq(processExpression(l, l.dataType), 
processExpression(r, r.dataType)))
   }
   if (!newCondition.fastEquals(condition)) {
 j.copy(condition = Some(newCondition))
@@ -42,4 +45,55 @@ object RewriteCollationJoin extends Rule[LogicalPlan] {
 j
   }
   }
+
+  /**
+   * Recursively process the expression in order to replace non-binary 
collated strings with their
+   * associated collation keys. This is necessary to ensure that the join 
condition is evaluated
+   * correctly for all types containing non-binary collated strings, including 
structs and arrays.
+   */
+  private def processExpression(expr: Expression, dt: DataType): Expression = {
+dt match {
+  // For binary stable expressions, no special handling is needed.
+  case _ if UnsafeRowUtils.isBinaryStable(dt) =>
+expr
+
+  // Inj
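A short editorial sketch of the kind of query this rewrite targets, assuming a build with collation support; the tables are illustrative, and the PR extends the same treatment to structs and arrays containing such strings.

```scala
import org.apache.spark.sql.SparkSession

object CollatedJoinDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("collated-join-demo").getOrCreate()

    // Two tables keyed by a case-insensitive collated string column.
    spark.sql("CREATE TABLE users(name STRING COLLATE UTF8_BINARY_LCASE) USING parquet")
    spark.sql("CREATE TABLE logins(name STRING COLLATE UTF8_BINARY_LCASE) USING parquet")
    spark.sql("INSERT INTO users VALUES ('Alice')")
    spark.sql("INSERT INTO logins VALUES ('ALICE')")

    // The join key equality is not a plain binary comparison; rewriting it with
    // collation keys is what lets a hash join still find the matching row.
    spark.sql("SELECT * FROM users u JOIN logins l ON u.name = l.name").show()

    spark.stop()
  }
}
```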

(spark) branch master updated: [SPARK-47972][SQL] Restrict CAST expression for collations

2024-06-03 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e4e8bb5936d3 [SPARK-47972][SQL] Restrict CAST expression for collations
e4e8bb5936d3 is described below

commit e4e8bb5936d305d27961c3a9c04d06ee1901977f
Author: Mihailo Milosevic 
AuthorDate: Mon Jun 3 16:16:48 2024 -0700

[SPARK-47972][SQL] Restrict CAST expression for collations

### What changes were proposed in this pull request?
Block the syntax CAST(value AS STRING COLLATE collation_name).

### Why are the changes needed?
The current state of the code allows calls like CAST(1 AS STRING COLLATE 
UNICODE). We want to restrict the CAST expression so that it can only cast to the 
default-collation string type, and to only allow the COLLATE expression to produce 
explicitly collated strings.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Test in CollationSuite.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46474 from mihailom-db/SPARK-47972.

Authored-by: Mihailo Milosevic 
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/analysis/CollationTypeCasts.scala |  2 --
 .../spark/sql/catalyst/parser/AstBuilder.scala | 29 
 .../org/apache/spark/sql/CollationSuite.scala  | 40 ++
 3 files changed, 69 insertions(+), 2 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
index 16f8ec78e03e..b832cd4416a9 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
@@ -132,8 +132,6 @@ object CollationTypeCasts extends TypeCoercionRule {
   def getOutputCollation(expr: Seq[Expression]): StringType = {
 val explicitTypes = expr.filter {
 case _: Collate => true
-case cast: Cast if 
cast.getTagValue(Cast.USER_SPECIFIED_CAST).isDefined =>
-  cast.dataType.isInstanceOf[StringType]
 case _ => false
   }
   .map(_.dataType.asInstanceOf[StringType].collationId)
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
index e2c975433ebd..86490a2eea97 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
@@ -20,6 +20,7 @@ package org.apache.spark.sql.catalyst.parser
 import java.util.Locale
 import java.util.concurrent.TimeUnit
 
+import scala.collection.immutable.Seq
 import scala.collection.mutable.{ArrayBuffer, Set}
 import scala.jdk.CollectionConverters._
 import scala.util.{Left, Right}
@@ -2265,6 +2266,20 @@ class AstBuilder extends DataTypeAstBuilder with 
SQLConfHelper with Logging {
*/
   override def visitCast(ctx: CastContext): Expression = withOrigin(ctx) {
 val rawDataType = typedVisit[DataType](ctx.dataType())
+ctx.dataType() match {
+  case context: PrimitiveDataTypeContext =>
+val typeCtx = context.`type`()
+if (typeCtx.start.getType == STRING) {
+  typeCtx.children.asScala.toSeq match {
+case Seq(_, cctx: CollateClauseContext) =>
+  throw QueryParsingErrors.dataTypeUnsupportedError(
+rawDataType.typeName,
+ctx.dataType().asInstanceOf[PrimitiveDataTypeContext])
+case _ =>
+  }
+}
+  case _ =>
+}
 val dataType = 
CharVarcharUtils.replaceCharVarcharWithStringForCast(rawDataType)
 ctx.name.getType match {
   case SqlBaseParser.CAST =>
@@ -2284,6 +2299,20 @@ class AstBuilder extends DataTypeAstBuilder with 
SQLConfHelper with Logging {
*/
   override def visitCastByColon(ctx: CastByColonContext): Expression = 
withOrigin(ctx) {
 val rawDataType = typedVisit[DataType](ctx.dataType())
+ctx.dataType() match {
+  case context: PrimitiveDataTypeContext =>
+val typeCtx = context.`type`()
+if (typeCtx.start.getType == STRING) {
+  typeCtx.children.asScala.toSeq match {
+case Seq(_, cctx: CollateClauseContext) =>
+  throw QueryParsingErrors.dataTypeUnsupportedError(
+rawDataType.typeName,
+ctx.dataType().asInstanceOf[PrimitiveDataTypeContext])
+case _ =>
+  }
+}
+  case _ =>
+}
 val dataType = 
CharVarcharUtils.replaceCharVarcharWithStringForCast(rawDataType)
 val cast = C
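An editorial sketch contrasting the now-rejected and still-allowed forms, assuming a build with collation support; the error handling is illustrative.

```scala
import org.apache.spark.sql.SparkSession

object RestrictedCastDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("restricted-cast-demo").getOrCreate()

    // After this change, casting directly to a collated string type is rejected...
    try {
      spark.sql("SELECT CAST(1 AS STRING COLLATE UNICODE)").show()
    } catch {
      case e: Exception => println(s"Rejected as expected: ${e.getMessage}")
    }

    // ...while casting to a plain string and applying COLLATE explicitly is allowed.
    spark.sql("SELECT CAST(1 AS STRING) COLLATE UNICODE").show()

    spark.stop()
  }
}
```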

(spark) branch master updated: [SPARK-48413][SQL] ALTER COLUMN with collation

2024-06-03 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f9542d008402 [SPARK-48413][SQL] ALTER COLUMN with collation
f9542d008402 is described below

commit f9542d008402f8cef96d5ec347583c7c1d30d840
Author: Nikola Mandic 
AuthorDate: Mon Jun 3 13:00:34 2024 -0700

[SPARK-48413][SQL] ALTER COLUMN with collation

### What changes were proposed in this pull request?

Add support for changing the collation of a column with the `ALTER COLUMN` command. 
Use the existing support for `ALTER COLUMN` with a type to enable changing the 
collation of a column. Syntax example:
```
ALTER TABLE t1 ALTER COLUMN col TYPE STRING COLLATE UTF8_BINARY_LCASE
```

### Why are the changes needed?

Enable changing collation on column.

### Does this PR introduce _any_ user-facing change?

Yes, it adds support for changing collation of column.

### How was this patch tested?

Added tests to `DDLSuite` and `DataTypeSuite`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46734 from nikolamand-db/SPARK-48413.

Authored-by: Nikola Mandic 
Signed-off-by: Wenchen Fan 
---
 .../src/main/resources/error/error-conditions.json |   6 ++
 .../org/apache/spark/sql/types/DataType.scala  |  35 +++
 .../spark/sql/errors/QueryCompilationErrors.scala  |   9 ++
 .../org/apache/spark/sql/types/DataTypeSuite.scala | 109 +
 .../apache/spark/sql/execution/command/ddl.scala   |  50 +++---
 .../spark/sql/execution/command/DDLSuite.scala |  94 ++
 6 files changed, 290 insertions(+), 13 deletions(-)

diff --git a/common/utils/src/main/resources/error/error-conditions.json 
b/common/utils/src/main/resources/error/error-conditions.json
index 69965e58fb79..5bab14e3eebf 100644
--- a/common/utils/src/main/resources/error/error-conditions.json
+++ b/common/utils/src/main/resources/error/error-conditions.json
@@ -119,6 +119,12 @@
 ],
 "sqlState" : "42KDE"
   },
+  "CANNOT_ALTER_COLLATION_BUCKET_COLUMN" : {
+"message" : [
+  "ALTER TABLE (ALTER|CHANGE) COLUMN cannot change collation of 
type/subtypes of bucket columns, but found the bucket column  in 
the table ."
+],
+"sqlState" : "428FR"
+  },
   "CANNOT_ALTER_PARTITION_COLUMN" : {
 "message" : [
   "ALTER TABLE (ALTER|CHANGE) COLUMN is not supported for partition 
columns, but found the partition column  in the table ."
diff --git a/sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala 
b/sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala
index ea90aa2ca397..12c7905f62d1 100644
--- a/sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala
+++ b/sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala
@@ -408,6 +408,41 @@ object DataType {
 }
   }
 
+  /**
+   * Check if `from` is equal to `to` type except for collations, which are 
checked to be
+   * compatible so that data of type `from` can be interpreted as of type `to`.
+   */
+  private[sql] def equalsIgnoreCompatibleCollation(
+  from: DataType,
+  to: DataType): Boolean = {
+(from, to) match {
+  // String types with possibly different collations are compatible.
+  case (_: StringType, _: StringType) => true
+
+  case (ArrayType(fromElement, fromContainsNull), ArrayType(toElement, 
toContainsNull)) =>
+(fromContainsNull == toContainsNull) &&
+  equalsIgnoreCompatibleCollation(fromElement, toElement)
+
+  case (MapType(fromKey, fromValue, fromContainsNull),
+  MapType(toKey, toValue, toContainsNull)) =>
+fromContainsNull == toContainsNull &&
+  // Map keys cannot change collation.
+  fromKey == toKey &&
+  equalsIgnoreCompatibleCollation(fromValue, toValue)
+
+  case (StructType(fromFields), StructType(toFields)) =>
+fromFields.length == toFields.length &&
+  fromFields.zip(toFields).forall { case (fromField, toField) =>
+fromField.name == toField.name &&
+  fromField.nullable == toField.nullable &&
+  fromField.metadata == toField.metadata &&
+  equalsIgnoreCompatibleCollation(fromField.dataType, 
toField.dataType)
+  }
+
+  case (fromDataType, toDataType) => fromDataType == toDataType
+}
+  }
+
   /**
* Returns true if the two data types share the same "shape", i.e. the types
* are the same, but the field names don't need to be the same.
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
 
b/sql
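A rough end-to-end editorial sketch of the new command, assuming a catalog that supports `ALTER COLUMN ... TYPE`; the table name and data are illustrative.

```scala
import org.apache.spark.sql.SparkSession

object AlterCollationDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("alter-collation-demo").getOrCreate()

    spark.sql("CREATE TABLE t1(col STRING) USING parquet")
    spark.sql("INSERT INTO t1 VALUES ('Spark'), ('SPARK')")

    // Only the collation of the column changes; the stored data stays the same,
    // but comparisons on the column become case-insensitive.
    spark.sql("ALTER TABLE t1 ALTER COLUMN col TYPE STRING COLLATE UTF8_BINARY_LCASE")
    spark.sql("SELECT count(*) FROM t1 WHERE col = 'spark'").show()

    spark.stop()
  }
}
```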

(spark) branch master updated: [SPARK-48503][SQL] Fix invalid scalar subqueries with group-by on non-equivalent columns that were incorrectly allowed

2024-06-03 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 5d71ef0716f7 [SPARK-48503][SQL] Fix invalid scalar subqueries with 
group-by on non-equivalent columns that were incorrectly allowed
5d71ef0716f7 is described below

commit 5d71ef0716f7a2d470d05bf3c04441382cd80138
Author: Jack Chen 
AuthorDate: Mon Jun 3 10:51:11 2024 -0700

[SPARK-48503][SQL] Fix invalid scalar subqueries with group-by on 
non-equivalent columns that were incorrectly allowed

### What changes were proposed in this pull request?

Fixes CheckAnalysis to reject invalid scalar subquery group-bys that were 
previously allowed and returned wrong results.

For example, this query is not legal and should give an error, but instead it 
was incorrectly allowed and returned wrong results prior to this PR (full 
repro with table data in the jira):

```
select *, (select count(*) from y where y1 > x1 group by y1) from x;
```

It returns two rows, even though there's only one row of x. The correct 
result is an error, because there is more than one row returned by the scalar 
subquery.

Another problem case is if the correlation condition is an equality but 
it's under another operator like an OUTER JOIN or UNION. Various other 
expressions that are not equi-joins between the inner and outer fields hit this 
too, e.g. `where y1 + y2 = x1 group by y1`. See the comments in the code and 
the tests for more examples.

This PR fixes the logic which checks for valid vs invalid group-bys. 
However, note that this new logic may block some queries that are actually 
valid, for example `a + 1 = outer(b)` is valid but would be rejected. 
Therefore, we add a conf flag that can be used to restore the legacy behavior, 
as well as logging for when the legacy behavior is used and differs from the 
new behavior. (In general, to accurately run valid queries and reject invalid 
queries, the check must be moved from com [...]

This is a longstanding bug in CheckAnalysis, specifically in 
checkAggregateInScalarSubquery. It allows grouping columns that are present in 
correlation predicates, but doesn't check whether those predicates are 
equalities, because when that code was written non-equality correlation 
wasn't allowed. Therefore, this bug has existed since non-equality correlation 
was added (~2 years ago).
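
A hedged sketch of the distinction the stricter check draws, using the `x`/`y` tables from the example above (assumes a SparkSession `spark`; schemas and results are illustrative, and the exact error and legacy-behavior config are defined in the patch, not reproduced here):

```scala
// Accepted: the grouping column is bound to the outer query by an equality,
// so each outer row can match at most one group and the subquery stays scalar.
spark.sql("SELECT *, (SELECT count(*) FROM y WHERE y1 = x1 GROUP BY y1) FROM x")

// Rejected after this fix: `y1 > x1` is not an equality, so one outer row can
// match several groups and the subquery may return more than one row.
spark.sql("SELECT *, (SELECT count(*) FROM y WHERE y1 > x1 GROUP BY y1) FROM x")
```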

### Why are the changes needed?
Fix invalid queries returning wrong results

### Does this PR introduce _any_ user-facing change?
Yes, block subqueries with invalid group-bys.

### How was this patch tested?
Add tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46839 from jchen5/scalar-subq-gby.

Authored-by: Jack Chen 
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/analysis/CheckAnalysis.scala  |  38 +++-
 .../spark/sql/catalyst/expressions/subquery.scala  |  72 ++-
 .../org/apache/spark/sql/internal/SQLConf.scala|   9 +
 .../scalar-subquery-group-by.sql.out   | 206 
 .../scalar-subquery/scalar-subquery-group-by.sql   |  28 +++
 .../scalar-subquery-group-by.sql.out   | 211 +
 6 files changed, 555 insertions(+), 9 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
index e18f4d1b36e1..1c2baa78be1b 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
@@ -19,6 +19,7 @@ package org.apache.spark.sql.catalyst.analysis
 import scala.collection.mutable
 
 import org.apache.spark.SparkException
+import org.apache.spark.internal.Logging
 import org.apache.spark.sql.AnalysisException
 import org.apache.spark.sql.catalyst.ExtendedAnalysisException
 import org.apache.spark.sql.catalyst.expressions._
@@ -41,7 +42,7 @@ import org.apache.spark.util.Utils
 /**
  * Throws user facing errors when passed invalid queries that fail to analyze.
  */
-trait CheckAnalysis extends PredicateHelper with LookupCatalog with 
QueryErrorsBase {
+trait CheckAnalysis extends PredicateHelper with LookupCatalog with 
QueryErrorsBase with Logging {
 
   protected def isView(nameParts: Seq[String]): Boolean
 
@@ -912,13 +913,36 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog with QueryErrorsB
 
   // SPARK-18504/SPARK-18814: Block cases where GROUP BY columns
   // are not part of the correlated columns.
+
+  // Note: groupByCols does not contain outer refs - grouping by an outer 
ref is always ok
   val groupByC

svn commit: r69509 - in /dev/spark: v4.0.0-preview1-rc1-bin/ v4.0.0-preview1-rc1-docs/ v4.0.0-preview1-rc2-bin/ v4.0.0-preview1-rc2-docs/

2024-06-02 Thread wenchen
Author: wenchen
Date: Mon Jun  3 01:12:24 2024
New Revision: 69509

Log:
Removing RC artifacts.

Removed:
dev/spark/v4.0.0-preview1-rc1-bin/
dev/spark/v4.0.0-preview1-rc1-docs/
dev/spark/v4.0.0-preview1-rc2-bin/
dev/spark/v4.0.0-preview1-rc2-docs/





svn commit: r69508 - /release/spark/KEYS

2024-06-02 Thread wenchen
Author: wenchen
Date: Mon Jun  3 00:59:52 2024
New Revision: 69508

Log:
Update KEYS

Modified:
release/spark/KEYS

Modified: release/spark/KEYS
==
--- release/spark/KEYS (original)
+++ release/spark/KEYS Mon Jun  3 00:59:52 2024
@@ -2079,3 +2079,61 @@ ThVo7dEVoknhannfoULNv5ekjZ/LsFNGHRUZ
 =9cvL
 -END PGP PUBLIC KEY BLOCK-
 
+pub   rsa4096 2024-05-07 [SC]
+  4DC9676CEF9A83E98FCA02784D6620843CD87F5A
+uid  Wenchen Fan (CODE SIGNING KEY) 
+sub   rsa4096 2024-05-07 [E]
+
+-BEGIN PGP PUBLIC KEY BLOCK-
+
+mQINBGY6XpcBEADBeNz3IBYriwrPzMYJJO5u1DaWAJ4Sryx6PUZgvssrcqojYVTh
+MjtlBkWRcNquAyDrVlU1vtq1yMq5KopQoAEi/l3xaEDZZ0IFAob6+GlGXEon2Jvf
+0FXQsx+Df4nMVl7KPqh68T++Z4GkvK5wyyN9uaUTWL2deGeinVxTh6qWQT8YiCd5
+wof+Dk5IIzKQ5VIBhU/U9S0jo/pqhH4okcZGTyT2Q7sfg4eXl5+Y2OR334RkvTcX
+uJjcnJ8BUbBSm1UhNg4OGBEJgi+lE1GEgw4juOfTAPh9fx8SCLhuX0m6Qc/y9bAK
+Q4zejbF5F2Um9dqrZqg6Egp+nlzydn59hq9owSnQ6JdoA/PLcgoign0sghu9xGCR
+GpgI2kS7Q8bu6dy7T0BfUerLZ1FHu7nCT2ZNSIh/Y2eOhuBhUr3llg8xa3PZZob/
+2sZE2dJ3g/qp2Nbo+s5Q5kELtuo6cZD0EISQwt68hGWIgxs0vtci2c2kQYFS0oqw
+fGynEeDFZRHV3ET5rioYaoPi70Cnibght5ocL0t6sl0RQQVp6k2i1aofJbZA480N
+ivuJ5agGaSRxmIDk6JlDsHJGxO9oC066ZLJiR6i0JUinGP7sw/nNmgup/AB+y4hW
+9WdeAFyYmuYysDRRyE6z1MPDp1R00MyGxHNFDF64/JPY/nKKFdXp+aCazwARAQAB
+tDNXZW5jaGVuIEZhbiAoQ09ERSBTSUdOSU5HIEtFWSkgPHdlbmNoZW5AYXBhY2hl
+Lm9yZz6JAlEEEwEIADsWIQRNyWds75qD6Y/KAnhNZiCEPNh/WgUCZjpelwIbAwUL
+CQgHAgIiAgYVCgkICwIEFgIDAQIeBwIXgAAKCRBNZiCEPNh/WkofD/9sI7J3i9Ck
+NOlHpVnjAaHjyGX5cVA2dZGniJdLf5yOKOI6pu7dMW+NThsXO1Iv+BRYo7una6/Q
+vUquKKxCXIN3vNmKIB1e9lj4MaIhCRmXUSQxjkVa9JW3P/F520Ct3VjiCZ5IjPv+
+g1hF/wrkuuoAFlcC/bfGWafkaZgszavSpCdp27mUXUNbvLW0dPJ3+ay4cDPuT1DI
+6DhB8qpqN7gInDFACW2qtQ2KZh1JFGy5ZccQ9dB3t/B4BYiUie6a3eQWgKqLF1hw
+8yHY3DkCVGfnXJk4+LMWqgazQxoB6oZjBvoQYtGOPXr1ZbmtiRHCDM5KmZ+QmIXB
+ZGBXkLaqt2QGxlwUGlvn+nKuTsp8VL1APIlKdMpvMW59uz1ycZHMeTJGAMtZw8Qm
+kxG62kqnDYeZ6oWwinY3wYP4UmqFSWIfcHMfBwED4uOC//r9H1bO+JRFMwOxqSN7
+kGfFJoV5eOvMOwRnXPJiPpnQEHPEkp/TAl2ANHWzdXy9TifiHOvTln3NXQVpznnW
+H6f9+W36J1IE9EWktciptKUtvwY1np+G71Swa0Q4mNgb8OGf6UNJGv4vPbSlhzlr
+1a5oYP59eHO3XqANcuKyTFxfja+rgrMldufZFCk1hSnBdAic/jaHrhIQSLcTGFiJ
+QVyiC2VlO2eZCkCTfoSlolwgzzoY4wNumLkCDQRmOl6XARAAt+N+djFZOuJdLcSz
+pz6nG88gxLmPwf+Xlhv2+xDS3wyM1OWmDAkeMDNq8OuZMes6ZXwRxDvSj7w7dlE6
+dQ1BlDz4RP4GoYG++dnPlHp/NWQ8I/eW8XC5uxkvl56YG/0DudoTLb5nxHtv+kpm
+p+eVCqWRYI5RQPdcxEZzXEije+aEj2aMRQ8cO7RAgTamRWXr+XsRkSypZ8ttTISr
+u+UuQPKT6XRMtkB2i8ekwO+jIK/mMrAteIF/cK0jv2JTlYmWrBtmGgYjHZHlzZak
+/MzWN4tU5VbJMMXa9wHicZS0/cPV9Fz3dnR0sBVgaIDsK+/vRGxHd/LGFtXH+Wrp
+pPMaR4FHCx3r44aL17B5lJocwf7Xma2gavOl80NR+a8iOW6biKdlALRZKX4G4cJj
+1vnWHDJceZOuFWMVIs7zfJymvQpROCRED3q1el+zCICnLtBue6ikqv7nfyBNCaR2
+qZhw4TPMzzGTRIdKIalcSTi+bGfSYTsU2kVDBbH+0nD5I7Tx62H4shsJtgmwyP4R
+q2dxJPpC4i+L09crjyl7rYvwHu4QU8vxcQXN4cH4O5pKOr2GoGnV8Y7kpZaRUo6w
+/Q/Rx3I3UKAyYJv0R1mK4AifM0JzMkqxAUvUdUbs2obRT04sxtr1bA+9dLEv4b8c
+YGKmRgt96GCNx1XZ8Q+FPdmsaO0AEQEAAYkCNgQYAQgAIBYhBE3JZ2zvmoPpj8oC
+eE1mIIQ82H9aBQJmOl6XAhsMAAoJEE1mIIQ82H9aBfAQAKf6xHNuKibXcRMwqmcx
+rx18d0dbeMEjrPqSe5vGOylLQZRpwZmKwflU9kZgOU2WRuqZsaPE0w5wxhsNDe8s
+UqxW08xB6v8BVj6BT9umJQNyQF5CrsjkZe2EtmYlbdNmt4t8DMNEmhhasEglWUui
+0se3I0wIwDaYAW+KppwzweO8SrUZVaB6QhOckRFhz/1wCNyc2Yp90OjWjuATffOE
+ZWSeGPn9GCbtJ+SPtLtMUlxy/BoRA6OWv6H5VAt6pJVw3XPP/o450i7lYxbmbv8W
+qm5/8nWx1XBvTvOxGoT9h+45bWjLTXtJJ2RhEftGHZ9439VSgssXBl+S/yjpnHOa
+14tRCVABP8bgAQ7HEKZ9YyII6MOAEzNa2gNVKr7+gwB1ddrGdzx6TrIUwRlgilDJ
+XORdEON4Ssx31Y1+Dt+d4lkkGu5Ymkj8iFIeH6FNOnFWM/stTmL0fE4IGpWbUHc+
+nqz7zEgili8TanLQRUmz9ClVJTG4G9t31FYF8nNzDPxug9oSMJXBfVlzhRMRZH3z
+t/XdxNFHyu7rzXidiXTJSmujeqS++mKcXxx02m+V2qfwkAwnt6OS9NDLPVrzuuMN
+NDfY3Gr4dTCbd+JQxtC0w4GuUV1V3lcOwyEjPKJVYuZwUl0UspRbNmtsaybRbzVs
++q68az33WU5++zSuqrU3fIRp
+=1zLb
+-END PGP PUBLIC KEY BLOCK-
+






svn commit: r69507 - /dev/spark/v4.0.0-preview1-rc3-bin/ /release/spark/spark-4.0.0-preview1/

2024-06-02 Thread wenchen
Author: wenchen
Date: Mon Jun  3 00:59:50 2024
New Revision: 69507

Log:
Apache Spark 4.0.0-preview1

Added:
release/spark/spark-4.0.0-preview1/
  - copied from r69506, dev/spark/v4.0.0-preview1-rc3-bin/
Removed:
dev/spark/v4.0.0-preview1-rc3-bin/





svn commit: r69506 - /dev/spark/v4.0.0-preview1-rc3-docs/

2024-06-02 Thread wenchen
Author: wenchen
Date: Mon Jun  3 00:59:48 2024
New Revision: 69506

Log:
Remove RC artifacts

Removed:
dev/spark/v4.0.0-preview1-rc3-docs/





(spark) tag v4.0.0-preview1 created (now 7a7a8bc4bab5)

2024-06-02 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to tag v4.0.0-preview1
in repository https://gitbox.apache.org/repos/asf/spark.git


  at 7a7a8bc4bab5 (commit)
No new revisions were added by this update.





(spark) branch master updated (96365c86962b -> 3cd35f8cb646)

2024-05-31 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 96365c86962b [SPARK-48465][SQL] Avoid no-op empty relation propagation
 add 3cd35f8cb646 [SPARK-48391][CORE] Using addAll instead of add function 
in fromAccumulatorInfos method of TaskMetrics Class

No new revisions were added by this update.

Summary of changes:
 core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)





(spark) branch branch-3.5 updated: [SPARK-48391][CORE] Using addAll instead of add function in fromAccumulatorInfos method of TaskMetrics Class

2024-05-31 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 744b070fa964 [SPARK-48391][CORE] Using addAll instead of add function 
in fromAccumulatorInfos method of TaskMetrics Class
744b070fa964 is described below

commit 744b070fa964dee9e5460a24f88f22c3af8170dc
Author: Dereck Li 
AuthorDate: Fri May 31 15:56:05 2024 -0700

[SPARK-48391][CORE] Using addAll instead of add function in 
fromAccumulatorInfos method of TaskMetrics Class

### What changes were proposed in this pull request?

Using addAll instead of add function in fromAccumulators method of 
TaskMetrics.

### Why are the changes needed?

To improve performance. In the fromAccumulators method of TaskMetrics, we 
should use `tm._externalAccums.addAll` instead of `tm._externalAccums.add`, as 
`_externalAccums` is an instance of CopyOnWriteArrayList.
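
A minimal sketch (not from the patch) of why batching matters here: `CopyOnWriteArrayList.add` copies the whole backing array on every call, while `addAll` copies it once per batch:

```scala
import java.util.concurrent.CopyOnWriteArrayList

val target = new CopyOnWriteArrayList[Int]()
val batch  = new java.util.ArrayList[Int]()
(1 to 1000).foreach(i => batch.add(i))

// Element-by-element insertion copies the backing array on every call:
//   for (i <- 0 until batch.size()) target.add(batch.get(i))   // O(n^2) copying
// A single addAll copies the backing array once for the entire batch,
// which is what the patch does with _externalAccums:
target.addAll(batch)
```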

### Does this PR introduce _any_ user-facing change?

yes.

### How was this patch tested?

No Tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46705 from monkeyboy123/fromAccumulators-accelerate.

Authored-by: Dereck Li 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 3cd35f8cb6462051c621cf49de54b9c5692aae1d)
Signed-off-by: Wenchen Fan 
---
 core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala 
b/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala
index 78b39b0cbda6..d446104cb642 100644
--- a/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala
+++ b/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala
@@ -328,16 +328,19 @@ private[spark] object TaskMetrics extends Logging {
*/
   def fromAccumulators(accums: Seq[AccumulatorV2[_, _]]): TaskMetrics = {
 val tm = new TaskMetrics
+val externalAccums = new java.util.ArrayList[AccumulatorV2[Any, Any]]()
 for (acc <- accums) {
   val name = acc.name
+  val tmpAcc = acc.asInstanceOf[AccumulatorV2[Any, Any]]
   if (name.isDefined && tm.nameToAccums.contains(name.get)) {
 val tmAcc = tm.nameToAccums(name.get).asInstanceOf[AccumulatorV2[Any, 
Any]]
 tmAcc.metadata = acc.metadata
-tmAcc.merge(acc.asInstanceOf[AccumulatorV2[Any, Any]])
+tmAcc.merge(tmpAcc)
   } else {
-tm._externalAccums.add(acc)
+externalAccums.add(tmpAcc)
   }
 }
+tm._externalAccums.addAll(externalAccums)
 tm
   }
 }





(spark) branch master updated (844821c82da5 -> 96365c86962b)

2024-05-31 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 844821c82da5 [SPARK-47578][R] Migrate RPackageUtils with variables to 
structured logging framework
 add 96365c86962b [SPARK-48465][SQL] Avoid no-op empty relation propagation

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/optimizer/PropagateEmptyRelation.scala   | 6 --
 .../spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala   | 2 +-
 2 files changed, 5 insertions(+), 3 deletions(-)





(spark) branch master updated (747437c80aa8 -> f083e61925e9)

2024-05-31 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 747437c80aa8 [SPARK-48476][SQL] fix NPE error message for null 
delmiter csv
 add f083e61925e9 [SPARK-48430][SQL] Fix map value extraction when map 
contains collated strings

No new revisions were added by this update.

Summary of changes:
 .../sql/catalyst/analysis/CollationTypeCasts.scala | 20 ++---
 .../spark/sql/catalyst/analysis/TypeCoercion.scala |  2 +-
 .../spark/sql/catalyst/expressions/misc.scala  |  4 ++--
 .../spark/sql/CollationSQLExpressionsSuite.scala   | 23 
 .../org/apache/spark/sql/CollationSuite.scala  | 25 --
 5 files changed, 56 insertions(+), 18 deletions(-)





(spark) branch master updated: [SPARK-48476][SQL] fix NPE error message for null delmiter csv

2024-05-31 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 747437c80aa8 [SPARK-48476][SQL] fix NPE error message for null 
delmiter csv
747437c80aa8 is described below

commit 747437c80aa875844f41ac61a419443af9f3b4b2
Author: milastdbx 
AuthorDate: Fri May 31 09:10:38 2024 -0700

[SPARK-48476][SQL] fix NPE error message for null delmiter csv

### What changes were proposed in this pull request?

In this pull request I propose we throw a proper error class when the customer 
specifies null as a delimiter for CSV. Currently we throw an NPE.

### Why are the changes needed?

To make Spark more user-friendly.

### Does this PR introduce _any_ user-facing change?

Yes, customers will now get the INVALID_DELIMITER_VALUE.NULL_VALUE error class 
when they specify null as the CSV delimiter.

### How was this patch tested?

unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46810 from milastdbx/dev/milast/fixNPEForDelimiterCSV.

Authored-by: milastdbx 
Signed-off-by: Wenchen Fan 
---
 common/utils/src/main/resources/error/error-conditions.json  | 5 +
 .../scala/org/apache/spark/sql/catalyst/csv/CSVExprUtils.scala   | 5 +
 .../org/apache/spark/sql/catalyst/csv/CSVExprUtilsSuite.scala| 9 +
 3 files changed, 19 insertions(+)

diff --git a/common/utils/src/main/resources/error/error-conditions.json 
b/common/utils/src/main/resources/error/error-conditions.json
index 3914c0f177dc..3dd7a6d65d7f 100644
--- a/common/utils/src/main/resources/error/error-conditions.json
+++ b/common/utils/src/main/resources/error/error-conditions.json
@@ -2021,6 +2021,11 @@
   "Delimiter cannot be empty string."
 ]
   },
+  "NULL_VALUE" : {
+"message" : [
+  "Delimiter cannot be null."
+]
+  },
   "SINGLE_BACKSLASH" : {
 "message" : [
   "Single backslash is prohibited. It has special meaning as beginning 
of an escape sequence. To get the backslash character, pass a string with two 
backslashes as the delimiter."
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtils.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtils.scala
index 62638d70dd90..7b6664a4117a 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtils.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtils.scala
@@ -120,6 +120,11 @@ object CSVExprUtils {
* @throws SparkIllegalArgumentException if any of the individual input 
chunks are illegal
*/
   def toDelimiterStr(str: String): String = {
+if (str == null) {
+  throw new SparkIllegalArgumentException(
+errorClass = "INVALID_DELIMITER_VALUE.NULL_VALUE")
+}
+
 var idx = 0
 
 var delimiter = ""
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtilsSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtilsSuite.scala
index 2e94c723a6f2..d4b68500e078 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtilsSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtilsSuite.scala
@@ -33,6 +33,15 @@ class CSVExprUtilsSuite extends SparkFunSuite {
 assert(CSVExprUtils.toChar("""\\""") === '\\')
   }
 
+  test("Does not accept null delimiter") {
+checkError(
+  exception = intercept[SparkIllegalArgumentException]{
+CSVExprUtils.toDelimiterStr(null)
+  },
+  errorClass = "INVALID_DELIMITER_VALUE.NULL_VALUE",
+  parameters = Map.empty)
+  }
+
   test("Does not accept delimiter larger than one character") {
 checkError(
   exception = intercept[SparkIllegalArgumentException]{





(spark) branch master updated: [SPARK-48419][SQL] Foldable propagation replace foldable column shoul…

2024-05-30 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 3e27543128c8 [SPARK-48419][SQL] Foldable propagation replace foldable 
column shoul…
3e27543128c8 is described below

commit 3e27543128c84bb4b6642589bb1c6da21c38b957
Author: KnightChess <981159...@qq.com>
AuthorDate: Thu May 30 17:53:38 2024 -0700

[SPARK-48419][SQL] Foldable propagation replace foldable column shoul…

…d use origin column name

### What changes were proposed in this pull request?
Fix the optimizer rule `FoldablePropagation`, which could change a column name; 
use the original column name instead.

### Why are the changes needed?
Fix a bug that changes the query plan schema.

### Does this PR introduce _any_ user-facing change?
`before fix`
before the optimizer:
```shell
'Project ['x, 'y, 'z]
+- 'Project ['a AS x, str AS Y, 'b AS z]
   +- LocalRelation  , [a, b]
```

after optimizer:

```shell
Project [x, str AS Y, z]
+- Project [a#0 AS x#112, str AS Y#113, b#1 AS z#114]
   +- LocalRelation , [a, b]
```
The column name `y` will be replaced with `Y`, which changes the plan schema.

`after fix`
The query plan schema is still `y`:
```shell
Project [x, str AS y, z]
+- Project [a#0 AS x#112, str AS Y#113, b#1 AS z#114]
   +- LocalRelation , [a, b]
```
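
A rough DataFrame-level repro of the scenario (assumes a SparkSession `spark` with default case-insensitive resolution; names mirror the plans above and the exact output is illustrative):

```scala
import org.apache.spark.sql.functions.lit
import spark.implicits._

val base = Seq((1, 2)).toDF("a", "b")
val out = base
  .select($"a".as("x"), lit("str").as("Y"), $"b".as("z"))
  .select("x", "y", "z")   // `y` resolves case-insensitively against x, Y, z

// Before the fix, FoldablePropagation could substitute `str AS Y` for the
// reference to `y` and surface the upper-case alias in the result schema;
// with the fix the propagated literal keeps the original attribute name.
out.printSchema()
```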

### How was this patch tested?
Added UT

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46742 from KnightChess/fix-foldable-propagation.

Authored-by: KnightChess <981159...@qq.com>
Signed-off-by: Wenchen Fan 
---
 .../org/apache/spark/sql/catalyst/optimizer/expressions.scala |  2 +-
 .../sql/catalyst/optimizer/FoldablePropagationSuite.scala | 11 +++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
index 32700f176f25..2c55e4c8fd37 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
@@ -1023,7 +1023,7 @@ object FoldablePropagation extends Rule[LogicalPlan] {
   plan
 } else {
   plan transformExpressions {
-case a: AttributeReference if foldableMap.contains(a) => foldableMap(a)
+case a: AttributeReference if foldableMap.contains(a) => 
foldableMap(a).withName(a.name)
   }
 }
   }
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FoldablePropagationSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FoldablePropagationSuite.scala
index 767ef38ea7f7..5866f29e4e86 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FoldablePropagationSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FoldablePropagationSuite.scala
@@ -214,4 +214,15 @@ class FoldablePropagationSuite extends PlanTest {
 val expected = testRelation.select(foldableAttr, 
$"a").rebalance(foldableAttr, $"a").analyze
 comparePlans(optimized, expected)
   }
+
+  test("SPARK-48419: Foldable propagation replace foldable column should use 
origin column name") {
+val query = testRelation
+  .select($"a".as("x"), "str".as("Y"), $"b".as("z"))
+  .select($"x", $"y", $"z")
+val optimized = Optimize.execute(query.analyze)
+val correctAnswer = testRelation
+  .select($"a".as("x"), "str".as("Y"), $"b".as("z"))
+  .select($"x", "str".as("y"), $"z").analyze
+comparePlans(optimized, correctAnswer)
+  }
 }





(spark) branch master updated: [SPARK-48468] Add LogicalQueryStage interface in catalyst

2024-05-30 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6b4f97e1411c [SPARK-48468] Add LogicalQueryStage interface in catalyst
6b4f97e1411c is described below

commit 6b4f97e1411c223b77e7bbc4b46a5f399c39823e
Author: Ziqi Liu 
AuthorDate: Thu May 30 14:10:18 2024 -0700

[SPARK-48468] Add LogicalQueryStage interface in catalyst

### What changes were proposed in this pull request?

Adding a `LogicalQueryStage` interface in catalyst; 
`org.apache.spark.sql.execution.adaptive.LogicalQueryStage` now inherits from 
`logical.LogicalQueryStage`.

### Why are the changes needed?

Make LogicalQueryStage visible in logical rewrites.
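
A hedged sketch of what the new visibility enables: a purely logical rule in catalyst can now pattern-match AQE query stages without depending on sql/core. The rule itself is hypothetical; only the members shown in the diff below are used.

```scala
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, LogicalQueryStage}
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule: inspect only stages whose physical plan has already run.
object InspectMaterializedStages extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
    case stage: LogicalQueryStage if stage.isMaterialized =>
      // A real rule could use stage.logicalPlan / stage.physicalPlan here;
      // this sketch leaves the node unchanged.
      stage
  }
}
```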

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #46799 from liuzqt/SPARK-48468.

Authored-by: Ziqi Liu 
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/plans/logical/LogicalPlan.scala   | 28 ++
 .../sql/execution/adaptive/LogicalQueryStage.scala | 17 ++---
 2 files changed, 42 insertions(+), 3 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
index 98e91585c2a0..a2ede8ac735c 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
@@ -25,6 +25,7 @@ import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.plans.{AliasAwareQueryOutputOrdering, 
QueryPlan}
 import 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats
 import org.apache.spark.sql.catalyst.trees.{BinaryLike, LeafLike, TreeNodeTag, 
UnaryLike}
+import org.apache.spark.sql.catalyst.trees.TreePattern.{LOGICAL_QUERY_STAGE, 
TreePattern}
 import org.apache.spark.sql.catalyst.types.DataTypeUtils
 import org.apache.spark.sql.catalyst.util.MetadataColumnHelper
 import org.apache.spark.sql.errors.{QueryCompilationErrors, 
QueryExecutionErrors}
@@ -214,6 +215,33 @@ trait LeafNode extends LogicalPlan with 
LeafLike[LogicalPlan] {
 throw new SparkUnsupportedOperationException("_LEGACY_ERROR_TEMP_3114")
 }
 
+/**
+ * An abstract class for LogicalQueryStage that is visible in logical rewrites.
+ */
+abstract class LogicalQueryStage extends LeafNode {
+  override protected val nodePatterns: Seq[TreePattern] = 
Seq(LOGICAL_QUERY_STAGE)
+
+  /**
+   * Returns the logical plan that is included in this query stage
+   */
+  def logicalPlan: LogicalPlan
+
+  /**
+   * Returns the physical plan.
+   */
+  def physicalPlan: QueryPlan[_]
+
+  /**
+   * Return true if the physical stage is materialized
+   */
+  def isMaterialized: Boolean
+
+  /**
+   * Return true if the physical plan corresponds directly to a stage
+   */
+  def isDirectStage: Boolean
+}
+
 /**
  * A logical plan node with single child.
  */
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/LogicalQueryStage.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/LogicalQueryStage.scala
index 8ce2452cc141..506f52fd9072 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/LogicalQueryStage.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/LogicalQueryStage.scala
@@ -18,7 +18,8 @@
 package org.apache.spark.sql.execution.adaptive
 
 import org.apache.spark.sql.catalyst.expressions.{Attribute, SortOrder}
-import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, 
RepartitionOperation, Statistics}
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, 
RepartitionOperation, Statistics}
+import org.apache.spark.sql.catalyst.plans.logical
 import org.apache.spark.sql.catalyst.trees.TreePattern.{LOGICAL_QUERY_STAGE, 
REPARTITION_OPERATION, TreePattern}
 import org.apache.spark.sql.execution.SparkPlan
 import org.apache.spark.sql.execution.aggregate.BaseAggregateExec
@@ -35,8 +36,8 @@ import 
org.apache.spark.sql.execution.aggregate.BaseAggregateExec
 // TODO we can potentially include only [[QueryStageExec]] in this class if we 
make the aggregation
 // planning aware of partitioning.
 case class LogicalQueryStage(
-logicalPlan: LogicalPlan,
-physicalPlan: SparkPlan) extends LeafNode {
+override val logicalPlan: LogicalPlan,
+override val physicalPlan: SparkPlan) extends logical.LogicalQueryStage {
 
   override def output: Seq[Attribute] = logicalPlan.output
   override val isStreaming: Boolean = logicalPlan.isStreamin

(spark) branch master updated: [SPARK-48477][SQL][TESTS] Use withSQLConf in tests: Refactor CollationSuite, CoalesceShufflePartitionsSuite, SQLExecutionSuite

2024-05-30 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e1f5a7c856ab [SPARK-48477][SQL][TESTS] Use withSQLConf in tests: 
Refactor CollationSuite, CoalesceShufflePartitionsSuite, SQLExecutionSuite
e1f5a7c856ab is described below

commit e1f5a7c856ab7ed4bf03e490ee7c1307775a
Author: Rui Wang 
AuthorDate: Thu May 30 14:07:10 2024 -0700

[SPARK-48477][SQL][TESTS] Use withSQLConf in tests: Refactor 
CollationSuite, CoalesceShufflePartitionsSuite, SQLExecutionSuite

### What changes were proposed in this pull request?

Use withSQLConf in tests when it is appropriate.

### Why are the changes needed?

Enforce good practice for setting config in test cases.

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

Existing UT

### Was this patch authored or co-authored using generative AI tooling?

NO

Closes #46812 from amaliujia/sql_config_4.

Authored-by: Rui Wang 
Signed-off-by: Wenchen Fan 
---
 .../org/apache/spark/sql/CollationSuite.scala  |  16 +--
 .../execution/CoalesceShufflePartitionsSuite.scala | 128 +++--
 .../spark/sql/execution/SQLExecutionSuite.scala|   9 +-
 3 files changed, 78 insertions(+), 75 deletions(-)

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala
index 9b3bfe1c77b3..42da779b84ad 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala
@@ -677,14 +677,14 @@ class CollationSuite extends DatasourceV2SQLBase with 
AdaptiveSparkPlanHelper {
   sql(s"INSERT INTO $tableName VALUES ('bbb', 'bbb')")
   sql(s"INSERT INTO $tableName VALUES ('BBB', 'BBB')")
 
-  sql(s"SET spark.sql.legacy.createHiveTableByDefault=false")
-
-  withTable(newTableName) {
-checkError(
-  exception = intercept[AnalysisException] {
-sql(s"CREATE TABLE $newTableName AS SELECT c1 || c2 FROM 
$tableName")
-  },
-  errorClass = "COLLATION_MISMATCH.IMPLICIT")
+  withSQLConf("spark.sql.legacy.createHiveTableByDefault" -> "false") {
+withTable(newTableName) {
+  checkError(
+exception = intercept[AnalysisException] {
+  sql(s"CREATE TABLE $newTableName AS SELECT c1 || c2 FROM 
$tableName")
+},
+errorClass = "COLLATION_MISMATCH.IMPLICIT")
+}
   }
 }
   }
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/CoalesceShufflePartitionsSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/CoalesceShufflePartitionsSuite.scala
index e87b90dfdd84..dc72b4a092ae 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/CoalesceShufflePartitionsSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/CoalesceShufflePartitionsSuite.scala
@@ -21,6 +21,7 @@ import org.apache.spark.{SparkConf, SparkFunSuite}
 import org.apache.spark.internal.config.IO_ENCRYPTION_ENABLED
 import org.apache.spark.internal.config.UI.UI_ENABLED
 import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.SQLConfHelper
 import org.apache.spark.sql.execution.adaptive._
 import org.apache.spark.sql.execution.adaptive.AQEShuffleReadExec
 import org.apache.spark.sql.execution.exchange.ReusedExchangeExec
@@ -28,7 +29,7 @@ import org.apache.spark.sql.functions._
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.util.ArrayImplicits._
 
-class CoalesceShufflePartitionsSuite extends SparkFunSuite {
+class CoalesceShufflePartitionsSuite extends SparkFunSuite with SQLConfHelper {
 
   private var originalActiveSparkSession: Option[SparkSession] = _
   private var originalInstantiatedSparkSession: Option[SparkSession] = _
@@ -374,72 +375,73 @@ class CoalesceShufflePartitionsSuite extends 
SparkFunSuite {
 
   test("SPARK-24705 adaptive query execution works correctly when exchange 
reuse enabled") {
 val test: SparkSession => Unit = { spark: SparkSession =>
-  spark.sql("SET spark.sql.exchange.reuse=true")
-  val df = spark.range(0, 6, 1).selectExpr("id AS key", "id AS value")
-
-  // test case 1: a query stage has 3 child stages but they are the same 
stage.
-  // Final Stage 1
-  //   ShuffleQueryStage 0
-  //   ReusedQueryStage 0
-  //   ReusedQueryStage 0
-  val resultDf = df.join(df, "key").join(df, "key")
-  QueryTest.checkAnswer(resultDf, (0 to 5).map(i => Row(i, i, i, i)))
-   

(spark) branch master updated (69afd4be9c93 -> f68d761c9b21)

2024-05-30 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 69afd4be9c93 [SPARK-47361][SQL] Derby: Calculate suitable precision 
and scale for DECIMAL type
 add f68d761c9b21 [SPARK-48292][CORE] Revert [SPARK-39195][SQL] Spark 
OutputCommitCoordinator should abort stage when committed file not consistent 
with task status

No new revisions were added by this update.

Summary of changes:
 .../main/scala/org/apache/spark/SparkContext.scala |  7 +-
 .../src/main/scala/org/apache/spark/SparkEnv.scala | 12 +--
 .../spark/scheduler/OutputCommitCoordinator.scala  | 12 +--
 .../OutputCommitCoordinatorIntegrationSuite.scala  | 11 ++-
 .../scheduler/OutputCommitCoordinatorSuite.scala   | 19 +++--
 .../datasources/parquet/ParquetIOSuite.scala   | 85 ++
 6 files changed, 58 insertions(+), 88 deletions(-)





(spark) branch branch-3.5 updated: [SPARK-41049][SQL][FOLLOW-UP][3.5] stateful expressions test uses different pretty name

2024-05-29 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new c87b6483a3e0 [SPARK-41049][SQL][FOLLOW-UP][3.5] stateful expressions 
test uses different pretty name
c87b6483a3e0 is described below

commit c87b6483a3e0690be2b267e6dcf93a3edd63b030
Author: Rui Wang 
AuthorDate: Wed May 29 17:15:17 2024 -0700

[SPARK-41049][SQL][FOLLOW-UP][3.5] stateful expressions test uses different 
pretty name

### What changes were proposed in this pull request?

We need to use a different pretty string for the stateful expression test case 
in branch-3.5.

### Why are the changes needed?

Fix the failing test case.

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

Existing UT

### Was this patch authored or co-authored using generative AI tooling?

NO

Closes #46795 from amaliujia/branch-3.5.

Authored-by: Rui Wang 
Signed-off-by: Wenchen Fan 
---
 sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
index 260ecaa5ece1..7ee18df37561 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
@@ -3641,8 +3641,9 @@ class DataFrameSuite extends QueryTest
 val v4 = to_csv(struct(v3.as("a"))) // to_csv is CodegenFallback
 df.select(v3, v3, v4, v4).collect().foreach { row =>
   assert(row.getMap(0).toString() == row.getMap(1).toString())
-  assert(row.getString(2) == s"{key -> ${row.getMap(0).get("key").get}}")
-  assert(row.getString(3) == s"{key -> ${row.getMap(0).get("key").get}}")
+  val expectedString = s"keys: [key], values: 
[${row.getMap(0).get("key").get}]"
+  assert(row.getString(2) == s"""\"$expectedString\"""")
+  assert(row.getString(3) == s"""\"$expectedString\"""")
 }
   }
 





(spark) branch master updated: [SPARK-48431][SQL] Do not forward predicates on collated columns to file readers

2024-05-29 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a3b8420e5eec [SPARK-48431][SQL] Do not forward predicates on collated 
columns to file readers
a3b8420e5eec is described below

commit a3b8420e5eecc3ce33528bc7c73967a64b1f670e
Author: Ole Sasse 
AuthorDate: Wed May 29 13:52:33 2024 -0700

[SPARK-48431][SQL] Do not forward predicates on collated columns to file 
readers

### What changes were proposed in this pull request?

[SPARK-47657](https://issues.apache.org/jira/browse/SPARK-47657) allows to 
push filters on collated columns to file sources that support it. If such 
filters are pushed to file sources, those file sources must not push those 
filters to the actual file readers (i.e. parquet or csv readers), because there 
is no guarantee that those support collations.

In this PR we widen filters on collated columns to AlwaysTrue when we 
translate filters for file sources.
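
A conceptual sketch of the widening, not the actual Spark code (it reuses `SchemaUtils.hasNonUTF8BinaryCollation` from the diff below; the helper name `widenForReader` is made up for illustration):

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression, IsNotNull, IsNull}
import org.apache.spark.sql.sources
import org.apache.spark.sql.util.SchemaUtils

def widenForReader(pred: Expression): Option[sources.Filter] = pred match {
  // Null checks do not compare string values, so they are always safe to push.
  case IsNotNull(a: AttributeReference) => Some(sources.IsNotNull(a.name))
  case IsNull(a: AttributeReference)    => Some(sources.IsNull(a.name))
  // Any other predicate touching a collated column is widened to AlwaysTrue:
  // the reader prunes nothing, and Spark re-evaluates the original predicate.
  case p if p.references.exists(r => SchemaUtils.hasNonUTF8BinaryCollation(r.dataType)) =>
    Some(sources.AlwaysTrue)
  case _ => None   // fall through to the normal filter translation
}
```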

### Why are the changes needed?

Without this, no file source can implement filter pushdown

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit tests. No component tests are possible because there is no file 
source with filter pushdown yet.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46760 from olaky/filter-translation-for-collations.

Authored-by: Ole Sasse 
Signed-off-by: Wenchen Fan 
---
 .../execution/datasources/DataSourceStrategy.scala | 31 +---
 .../datasources/DataSourceStrategySuite.scala  | 55 +-
 2 files changed, 78 insertions(+), 8 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
index 22b60caf2669..7cda347ce581 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
@@ -54,7 +54,7 @@ import 
org.apache.spark.sql.execution.streaming.StreamingRelation
 import org.apache.spark.sql.sources._
 import org.apache.spark.sql.types._
 import org.apache.spark.sql.util.{PartitioningUtils => 
CatalystPartitioningUtils}
-import org.apache.spark.sql.util.CaseInsensitiveStringMap
+import org.apache.spark.sql.util.{CaseInsensitiveStringMap, SchemaUtils}
 import org.apache.spark.unsafe.types.UTF8String
 
 /**
@@ -595,6 +595,16 @@ object DataSourceStrategy
   translatedFilterToExpr: Option[mutable.HashMap[sources.Filter, 
Expression]],
   nestedPredicatePushdownEnabled: Boolean)
 : Option[Filter] = {
+
+def translateAndRecordLeafNodeFilter(filter: Expression): Option[Filter] = 
{
+  val translatedFilter =
+translateLeafNodeFilter(filter, 
PushableColumn(nestedPredicatePushdownEnabled))
+  if (translatedFilter.isDefined && translatedFilterToExpr.isDefined) {
+translatedFilterToExpr.get(translatedFilter.get) = predicate
+  }
+  translatedFilter
+}
+
 predicate match {
   case expressions.And(left, right) =>
 // See SPARK-12218 for detailed discussion
@@ -621,16 +631,25 @@ object DataSourceStrategy
 right, translatedFilterToExpr, nestedPredicatePushdownEnabled)
 } yield sources.Or(leftFilter, rightFilter)
 
+  case notNull @ expressions.IsNotNull(_: AttributeReference) =>
+// Not null filters on attribute references can always be pushed, also 
for collated columns.
+translateAndRecordLeafNodeFilter(notNull)
+
+  case isNull @ expressions.IsNull(_: AttributeReference) =>
+// Is null filters on attribute references can always be pushed, also 
for collated columns.
+translateAndRecordLeafNodeFilter(isNull)
+
+  case p if p.references.exists(ref => 
SchemaUtils.hasNonUTF8BinaryCollation(ref.dataType)) =>
+// The filter cannot be pushed and we widen it to be AlwaysTrue(). 
This is only valid if
+// the result of the filter is not negated by a Not expression it is 
wrapped in.
+translateAndRecordLeafNodeFilter(Literal.TrueLiteral)
+
   case expressions.Not(child) =>
 translateFilterWithMapping(child, translatedFilterToExpr, 
nestedPredicatePushdownEnabled)
   .map(sources.Not)
 
   case other =>
-val filter = translateLeafNodeFilter(other, 
PushableColumn(nestedPredicatePushdownEnabled))
-if (filter.isDefined && translatedFilterToExpr.isDefined) {
-  translatedFilterToExpr.get(filter.get) = predicate
-}
-filter
+translateAndRecordLeafNo

(spark) branch branch-3.5 updated: [SPARK-48273][SQL][FOLLOWUP] Explicitly create non-Hive table in identifier-clause.sql

2024-05-29 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 043944e1b549 [SPARK-48273][SQL][FOLLOWUP] Explicitly create non-Hive 
table in identifier-clause.sql
043944e1b549 is described below

commit 043944e1b54902f6d8204a5610e8eb780f1fe753
Author: Wenchen Fan 
AuthorDate: Wed May 29 13:35:01 2024 -0700

[SPARK-48273][SQL][FOLLOWUP] Explicitly create non-Hive table in 
identifier-clause.sql

### What changes were proposed in this pull request?

A follow-up of https://github.com/apache/spark/pull/46580. It's better to 
create non-Hive tables in the tests, so that it's backport-safe, as old 
branches create Hive tables by default.

### Why are the changes needed?

fix branch-3.5 CI

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #46794 from cloud-fan/test.

Authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
(cherry picked from commit cf47293b5fc7c80d19e50fda44a01f91d5e34530)
Signed-off-by: Wenchen Fan 
---
 .../sql-tests/analyzer-results/identifier-clause.sql.out  | 8 
 .../src/test/resources/sql-tests/inputs/identifier-clause.sql | 6 +++---
 .../test/resources/sql-tests/results/identifier-clause.sql.out| 6 +++---
 3 files changed, 10 insertions(+), 10 deletions(-)

diff --git 
a/sql/core/src/test/resources/sql-tests/analyzer-results/identifier-clause.sql.out
 
b/sql/core/src/test/resources/sql-tests/analyzer-results/identifier-clause.sql.out
index 823ce43247a7..9b56a172e59d 100644
--- 
a/sql/core/src/test/resources/sql-tests/analyzer-results/identifier-clause.sql.out
+++ 
b/sql/core/src/test/resources/sql-tests/analyzer-results/identifier-clause.sql.out
@@ -687,7 +687,7 @@ org.apache.spark.sql.AnalysisException
 
 
 -- !query
-CREATE TABLE IDENTIFIER(1)(c1 INT)
+CREATE TABLE IDENTIFIER(1)(c1 INT) USING csv
 -- !query analysis
 org.apache.spark.sql.AnalysisException
 {
@@ -709,7 +709,7 @@ org.apache.spark.sql.AnalysisException
 
 
 -- !query
-CREATE TABLE IDENTIFIER('a.b.c')(c1 INT)
+CREATE TABLE IDENTIFIER('a.b.c')(c1 INT) USING csv
 -- !query analysis
 org.apache.spark.sql.AnalysisException
 {
@@ -902,7 +902,7 @@ CacheTableAsSelect t1, (select my_col from (values (1), 
(2), (1) as (my_col)) gr
 
 
 -- !query
-create table identifier('t2') as (select my_col from (values (1), (2), (1) as 
(my_col)) group by 1)
+create table identifier('t2') using csv as (select my_col from (values (1), 
(2), (1) as (my_col)) group by 1)
 -- !query analysis
 CreateDataSourceTableAsSelectCommand `spark_catalog`.`default`.`t2`, 
ErrorIfExists, [my_col]
+- Aggregate [my_col#x], [my_col#x]
@@ -914,7 +914,7 @@ CreateDataSourceTableAsSelectCommand 
`spark_catalog`.`default`.`t2`, ErrorIfExis
 -- !query
 insert into identifier('t2') select my_col from (values (3) as (my_col)) group 
by 1
 -- !query analysis
-InsertIntoHadoopFsRelationCommand file:[not included in 
comparison]/{warehouse_dir}/t2, false, Parquet, [path=file:[not included in 
comparison]/{warehouse_dir}/t2], Append, `spark_catalog`.`default`.`t2`, 
org.apache.spark.sql.execution.datasources.InMemoryFileIndex(file:[not included 
in comparison]/{warehouse_dir}/t2), [my_col]
+InsertIntoHadoopFsRelationCommand file:[not included in 
comparison]/{warehouse_dir}/t2, false, CSV, [path=file:[not included in 
comparison]/{warehouse_dir}/t2], Append, `spark_catalog`.`default`.`t2`, 
org.apache.spark.sql.execution.datasources.InMemoryFileIndex(file:[not included 
in comparison]/{warehouse_dir}/t2), [my_col]
 +- Aggregate [my_col#x], [my_col#x]
+- SubqueryAlias __auto_generated_subquery_name
   +- SubqueryAlias as
diff --git a/sql/core/src/test/resources/sql-tests/inputs/identifier-clause.sql 
b/sql/core/src/test/resources/sql-tests/inputs/identifier-clause.sql
index 9e6314202b5f..e85fdf7b5da3 100644
--- a/sql/core/src/test/resources/sql-tests/inputs/identifier-clause.sql
+++ b/sql/core/src/test/resources/sql-tests/inputs/identifier-clause.sql
@@ -109,8 +109,8 @@ VALUES(IDENTIFIER(1));
 VALUES(IDENTIFIER(SUBSTR('HELLO', 1, RAND() + 1)));
 SELECT `IDENTIFIER`('abs')(c1) FROM VALUES(-1) AS T(c1);
 
-CREATE TABLE IDENTIFIER(1)(c1 INT);
-CREATE TABLE IDENTIFIER('a.b.c')(c1 INT);
+CREATE TABLE IDENTIFIER(1)(c1 INT) USING csv;
+CREATE TABLE IDENTIFIER('a.b.c')(c1 INT) USING csv;
 CREATE VIEW IDENTIFIER('a.b.c')(c1) AS VALUES(1);
 DROP TABLE IDENTIFIER('a.b.c');
 DROP VIEW IDENTIFIER('a.b.c');
@@ -125,7 +125,7 @@ CREATE TEMPORARY VIEW IDENTIFIER('default.v')(c1) AS 
VALUES(1);
 -- SPARK-48273: Aggregation operation in statements using identifier clause 
for table name
 create temporary view identifier('v1

(spark) branch master updated: [SPARK-48273][SQL][FOLLOWUP] Explicitly create non-Hive table in identifier-clause.sql

2024-05-29 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new cf47293b5fc7 [SPARK-48273][SQL][FOLLOWUP] Explicitly create non-Hive 
table in identifier-clause.sql
cf47293b5fc7 is described below

commit cf47293b5fc7c80d19e50fda44a01f91d5e34530
Author: Wenchen Fan 
AuthorDate: Wed May 29 13:35:01 2024 -0700

[SPARK-48273][SQL][FOLLOWUP] Explicitly create non-Hive table in 
identifier-clause.sql

### What changes were proposed in this pull request?

A follow-up of https://github.com/apache/spark/pull/46580. It's better to 
create non-Hive tables in the tests, so that it's backport-safe, as old 
branches create Hive tables by default.

### Why are the changes needed?

fix branch-3.5 CI

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #46794 from cloud-fan/test.

Authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
---
 .../sql-tests/analyzer-results/identifier-clause.sql.out  | 8 
 .../src/test/resources/sql-tests/inputs/identifier-clause.sql | 6 +++---
 .../test/resources/sql-tests/results/identifier-clause.sql.out| 6 +++---
 3 files changed, 10 insertions(+), 10 deletions(-)

diff --git 
a/sql/core/src/test/resources/sql-tests/analyzer-results/identifier-clause.sql.out
 
b/sql/core/src/test/resources/sql-tests/analyzer-results/identifier-clause.sql.out
index f799c19a3bb8..b3e2cd5ada95 100644
--- 
a/sql/core/src/test/resources/sql-tests/analyzer-results/identifier-clause.sql.out
+++ 
b/sql/core/src/test/resources/sql-tests/analyzer-results/identifier-clause.sql.out
@@ -732,7 +732,7 @@ org.apache.spark.sql.AnalysisException
 
 
 -- !query
-CREATE TABLE IDENTIFIER(1)(c1 INT)
+CREATE TABLE IDENTIFIER(1)(c1 INT) USING csv
 -- !query analysis
 org.apache.spark.sql.AnalysisException
 {
@@ -754,7 +754,7 @@ org.apache.spark.sql.AnalysisException
 
 
 -- !query
-CREATE TABLE IDENTIFIER('a.b.c')(c1 INT)
+CREATE TABLE IDENTIFIER('a.b.c')(c1 INT) USING csv
 -- !query analysis
 org.apache.spark.sql.AnalysisException
 {
@@ -947,7 +947,7 @@ CacheTableAsSelect t1, (select my_col from (values (1), 
(2), (1) as (my_col)) gr
 
 
 -- !query
-create table identifier('t2') as (select my_col from (values (1), (2), (1) as 
(my_col)) group by 1)
+create table identifier('t2') using csv as (select my_col from (values (1), 
(2), (1) as (my_col)) group by 1)
 -- !query analysis
 CreateDataSourceTableAsSelectCommand `spark_catalog`.`default`.`t2`, 
ErrorIfExists, [my_col]
+- Aggregate [my_col#x], [my_col#x]
@@ -959,7 +959,7 @@ CreateDataSourceTableAsSelectCommand 
`spark_catalog`.`default`.`t2`, ErrorIfExis
 -- !query
 insert into identifier('t2') select my_col from (values (3) as (my_col)) group 
by 1
 -- !query analysis
-InsertIntoHadoopFsRelationCommand file:[not included in 
comparison]/{warehouse_dir}/t2, false, Parquet, [path=file:[not included in 
comparison]/{warehouse_dir}/t2], Append, `spark_catalog`.`default`.`t2`, 
org.apache.spark.sql.execution.datasources.InMemoryFileIndex(file:[not included 
in comparison]/{warehouse_dir}/t2), [my_col]
+InsertIntoHadoopFsRelationCommand file:[not included in 
comparison]/{warehouse_dir}/t2, false, CSV, [path=file:[not included in 
comparison]/{warehouse_dir}/t2], Append, `spark_catalog`.`default`.`t2`, 
org.apache.spark.sql.execution.datasources.InMemoryFileIndex(file:[not included 
in comparison]/{warehouse_dir}/t2), [my_col]
 +- Aggregate [my_col#x], [my_col#x]
+- SubqueryAlias __auto_generated_subquery_name
   +- SubqueryAlias as
diff --git a/sql/core/src/test/resources/sql-tests/inputs/identifier-clause.sql 
b/sql/core/src/test/resources/sql-tests/inputs/identifier-clause.sql
index 978b82c331fe..46461dcd048e 100644
--- a/sql/core/src/test/resources/sql-tests/inputs/identifier-clause.sql
+++ b/sql/core/src/test/resources/sql-tests/inputs/identifier-clause.sql
@@ -119,8 +119,8 @@ VALUES(IDENTIFIER(1));
 VALUES(IDENTIFIER(SUBSTR('HELLO', 1, RAND() + 1)));
 SELECT `IDENTIFIER`('abs')(c1) FROM VALUES(-1) AS T(c1);
 
-CREATE TABLE IDENTIFIER(1)(c1 INT);
-CREATE TABLE IDENTIFIER('a.b.c')(c1 INT);
+CREATE TABLE IDENTIFIER(1)(c1 INT) USING csv;
+CREATE TABLE IDENTIFIER('a.b.c')(c1 INT) USING csv;
 CREATE VIEW IDENTIFIER('a.b.c')(c1) AS VALUES(1);
 DROP TABLE IDENTIFIER('a.b.c');
 DROP VIEW IDENTIFIER('a.b.c');
@@ -135,7 +135,7 @@ CREATE TEMPORARY VIEW IDENTIFIER('default.v')(c1) AS 
VALUES(1);
 -- SPARK-48273: Aggregation operation in statements using identifier clause 
for table name
 create temporary view identifier('v1') as (select my_col from (values (1), 
(2), (1) as (my_col)) group by 1);
 cache table identifier('t1') as (select

(spark) branch master updated (0461745f1616 -> dc6b493dd1f4)

2024-05-29 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 0461745f1616 [SPARK-48281][SQL] Alter string search logic for 
UTF8_BINARY_LCASE collation (StringInStr, SubstringIndex)
 add dc6b493dd1f4 [SPARK-48462][SQL][TESTS] Use withSQLConf in tests: 
Refactor HiveQuerySuite and HiveTableScanSuite

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/hive/execution/HiveQuerySuite.scala  | 111 +++--
 .../sql/hive/execution/HiveTableScanSuite.scala|  18 ++--
 2 files changed, 67 insertions(+), 62 deletions(-)





(spark) branch master updated: [SPARK-48281][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (StringInStr, SubstringIndex)

2024-05-29 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0461745f1616 [SPARK-48281][SQL] Alter string search logic for 
UTF8_BINARY_LCASE collation (StringInStr, SubstringIndex)
0461745f1616 is described below

commit 0461745f161692c7ad2bc0e418c4e5fb75f71ef5
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Wed May 29 11:16:37 2024 -0700

[SPARK-48281][SQL] Alter string search logic for UTF8_BINARY_LCASE 
collation (StringInStr, SubstringIndex)

### What changes were proposed in this pull request?
String searching in UTF8_BINARY_LCASE now works on the character level, rather 
than on the byte level. For example, `instr("İ", "i")` now returns 0, because 
there exists no `start, len` such that `lowercase(substring("İ", start, len)) 
== "i"`.

### Why are the changes needed?
Fix functions that give unusable results due to one-to-many case mapping 
when performing string search under UTF8_BINARY_LCASE (see example above).

### Does this PR introduce _any_ user-facing change?
Yes, the behaviour of the `instr` and `substring_index` expressions changes for 
edge cases with one-to-many case mapping.

### How was this patch tested?
New unit tests in `CollationSupportSuite`.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46589 from uros-db/alter-lcase-vol2.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../catalyst/util/CollationAwareUTF8String.java| 48 ++---
 .../spark/sql/catalyst/util/CollationSupport.java  |  2 +-
 .../org/apache/spark/unsafe/types/UTF8String.java  | 13 +-
 .../spark/unsafe/types/CollationSupportSuite.java  | 50 +-
 4 files changed, 75 insertions(+), 38 deletions(-)

diff --git 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
index 0d0094d8d0a0..a6e96003ec34 100644
--- 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
+++ 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
@@ -345,14 +345,14 @@ public class CollationAwareUTF8String {
*/
   public static int lowercaseIndexOf(final UTF8String target, final UTF8String 
pattern,
   final int start) {
-if (pattern.numChars() == 0) return 0;
+if (pattern.numChars() == 0) return target.indexOfEmpty(start);
 return lowercaseFind(target, pattern.toLowerCase(), start);
   }
 
   public static int indexOf(final UTF8String target, final UTF8String pattern,
   final int start, final int collationId) {
 if (pattern.numBytes() == 0) {
-  return 0;
+  return target.indexOfEmpty(start);
 }
 
 StringSearch stringSearch = CollationFactory.getStringSearch(target, 
pattern, collationId);
@@ -444,47 +444,27 @@ public class CollationAwareUTF8String {
   return UTF8String.EMPTY_UTF8;
 }
 
-UTF8String lowercaseString = string.toLowerCase();
 UTF8String lowercaseDelimiter = delimiter.toLowerCase();
 
 if (count > 0) {
-  int idx = -1;
+  // Search left to right (note: the start code point is inclusive).
+  int matchLength = -1;
   while (count > 0) {
-idx = lowercaseString.find(lowercaseDelimiter, idx + 1);
-if (idx >= 0) {
-  count--;
-} else {
-  // can not find enough delim
-  return string;
-}
-  }
-  if (idx == 0) {
-return UTF8String.EMPTY_UTF8;
+matchLength = lowercaseFind(string, lowercaseDelimiter, matchLength + 
1);
+if (matchLength > MATCH_NOT_FOUND) --count; // Found a delimiter.
+else return string; // Cannot find enough delimiters in the string.
   }
-  byte[] bytes = new byte[idx];
-  copyMemory(string.getBaseObject(), string.getBaseOffset(), bytes, 
BYTE_ARRAY_OFFSET, idx);
-  return UTF8String.fromBytes(bytes);
-
+  return string.substring(0, matchLength);
 } else {
-  int idx = string.numBytes() - delimiter.numBytes() + 1;
+  // Search right to left (note: the end code point is exclusive).
+  int matchLength = string.numChars() + 1;
   count = -count;
   while (count > 0) {
-idx = lowercaseString.rfind(lowercaseDelimiter, idx - 1);
-if (idx >= 0) {
-  count--;
-} else {
-  // can not find enough delim
-  return string;
-}
+matchLength = lowercaseRFind(string, lowercaseDelimiter, matchLength - 
1);
+if (matchLength > MATCH_NOT_FOUND) -

(spark) branch master updated: [SPARK-48444][SQL][TESTS] Use withSQLConf in tests: Refactor SQLQuerySuite

2024-05-29 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 49204b10af58 [SPARK-48444][SQL][TESTS] Use withSQLConf in tests: 
Refactor SQLQuerySuite
49204b10af58 is described below

commit 49204b10af58230af2e6d9104ad61fb81f6a0bc3
Author: Rui Wang 
AuthorDate: Wed May 29 10:38:33 2024 -0700

[SPARK-48444][SQL][TESTS] Use withSQLConf in tests: Refactor SQLQuerySuite

### What changes were proposed in this pull request?

Use withSQLConf in tests when it is appropriate.

### Why are the changes needed?

Enforce good practice for setting config in test cases.

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

Existing UT

### Was this patch authored or co-authored using generative AI tooling?

NO

Closes #46778 from amaliujia/test_case_with_sql_config.

Authored-by: Rui Wang 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/hive/execution/SQLQuerySuite.scala   | 113 ++---
 1 file changed, 55 insertions(+), 58 deletions(-)

diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
index 0bcac639443c..05b73e31d115 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
@@ -178,24 +178,24 @@ abstract class SQLQuerySuiteBase extends QueryTest with 
SQLTestUtils with TestHi
 |PARTITIONED BY (state STRING, month INT)
 |STORED AS PARQUET
   """.stripMargin)
+withSQLConf("hive.exec.dynamic.partition.mode" -> "nonstrict") {
+  sql("INSERT INTO TABLE orders PARTITION(state, month) SELECT * FROM 
orders1")
+  sql("INSERT INTO TABLE orderupdates PARTITION(state, month) SELECT * 
FROM orderupdates1")
 
-sql("set hive.exec.dynamic.partition.mode=nonstrict")
-sql("INSERT INTO TABLE orders PARTITION(state, month) SELECT * FROM 
orders1")
-sql("INSERT INTO TABLE orderupdates PARTITION(state, month) SELECT * 
FROM orderupdates1")
-
-checkAnswer(
-  sql(
-"""
-  |select orders.state, orders.month
-  |from orders
-  |join (
-  |  select distinct orders.state,orders.month
-  |  from orders
-  |  join orderupdates
-  |on orderupdates.id = orders.id) ao
-  |  on ao.state = orders.state and ao.month = orders.month
+  checkAnswer(
+sql(
+  """
+|select orders.state, orders.month
+|from orders
+|join (
+|  select distinct orders.state,orders.month
+|  from orders
+|  join orderupdates
+|on orderupdates.id = orders.id) ao
+|  on ao.state = orders.state and ao.month = orders.month
 """.stripMargin),
-  (1 to 6).map(_ => Row("CA", 20151)))
+(1 to 6).map(_ => Row("CA", 20151)))
+}
   }
 }
   }
@@ -715,21 +715,23 @@ abstract class SQLQuerySuiteBase extends QueryTest with 
SQLTestUtils with TestHi
   }
 
   test("command substitution") {
-sql("set tbl=src")
-checkAnswer(
-  sql("SELECT key FROM ${hiveconf:tbl} ORDER BY key, value limit 1"),
-  sql("SELECT key FROM src ORDER BY key, value limit 1").collect().toSeq)
+withSQLConf("tbl" -> "src") {
+  checkAnswer(
+sql("SELECT key FROM ${hiveconf:tbl} ORDER BY key, value limit 1"),
+sql("SELECT key FROM src ORDER BY key, value limit 1").collect().toSeq)
+}
 
-sql("set spark.sql.variable.substitute=false") // disable the substitution
-sql("set tbl2=src")
-intercept[Exception] {
-  sql("SELECT key FROM ${hiveconf:tbl2} ORDER BY key, value limit 
1").collect()
+withSQLConf("tbl2" -> "src", "spark.sql.variable.substitute" -> "false") {
+  intercept[Exception] {
+sql("SELECT key FROM ${hiveconf:tbl2} ORDER BY key, value limit 
1").collect()
+  }
 }
 
-sql("set spark.sql.variable.substitute=true") // enable the substitution
-checkAnswer(
-  sql("SELECT key FROM ${hiveconf:tbl2} ORDER BY key, value limit 1"),
-  sql("SELECT key FROM src ORDER BY key, value li

(spark) branch master updated (a86bca131028 -> e6236af3d08c)

2024-05-28 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from a86bca131028 [SPARK-48445][SQL] Don't inline UDFs with expensive 
children
 add e6236af3d08c [SPARK-48000][SQL] Enable hash join support for all 
collations (StringType)

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/util/CollationFactory.java  |  11 ++
 .../catalyst/analysis/RewriteCollationJoin.scala   |  45 ++
 .../sql/catalyst/expressions/CollationKey.scala|  47 ++
 .../expressions/CollationExpressionSuite.scala |  26 
 .../spark/sql/execution/SparkOptimizer.scala   |   4 +-
 .../org/apache/spark/sql/CollationSuite.scala  | 166 -
 6 files changed, 264 insertions(+), 35 deletions(-)
 create mode 100644 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteCollationJoin.scala
 create mode 100644 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/CollationKey.scala





svn commit: r69427 - in /dev/spark/v4.0.0-preview1-rc3-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/R/articles/ _site/api/R/articles/sparkr-vignettes_files/ _site/api/R/articles/sparkr-vignettes_

2024-05-28 Thread wenchen
Author: wenchen
Date: Tue May 28 17:45:42 2024
New Revision: 69427

Log:
Apache Spark v4.0.0-preview1-rc3 docs


[This commit notification would consist of 4816 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]




(spark) branch master updated: [SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate)

2024-05-28 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 249390017ef4 [SPARK-48221][SQL] Alter string search logic for 
UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate)
249390017ef4 is described below

commit 249390017ef4a045037213dec386e16cca125080
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Tue May 28 10:05:12 2024 -0700

[SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE 
collation (Contains, StartsWith, EndsWith, StringLocate)

### What changes were proposed in this pull request?
String searching in UTF8_BINARY_LCASE now works at the character level, rather
than at the byte level. For example, `contains("İ", "i")` now returns **false**,
because there exists no `(start, len)` such that `lowercase(substring("İ", start,
len)) == "i"`.

### Why are the changes needed?
Fix functions that give unusable results due to one-to-many case mapping 
when performing string search under UTF8_BINARY_LCASE (see example above).

### Does this PR introduce _any_ user-facing change?
Yes, behaviour of `contains`, `startswith`, `endswith`, and 
`locate`/`position` expressions is changed for edge cases with one-to-many case 
mapping.

### How was this patch tested?
New unit tests in `CollationSupportSuite`.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46511 from uros-db/alter-lcase-impl.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../catalyst/util/CollationAwareUTF8String.java| 169 +
 .../spark/sql/catalyst/util/CollationSupport.java  |   8 +-
 .../org/apache/spark/unsafe/types/UTF8String.java  | 118 --
 .../spark/unsafe/types/CollationSupportSuite.java  | 129 +---
 .../apache/spark/unsafe/types/UTF8StringSuite.java | 105 -
 5 files changed, 278 insertions(+), 251 deletions(-)

diff --git 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
index ee0d611d7e65..0d0094d8d0a0 100644
--- 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
+++ 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
@@ -34,6 +34,155 @@ import java.util.Map;
  * Utility class for collation-aware UTF8String operations.
  */
 public class CollationAwareUTF8String {
+
+  /**
+   * The constant value to indicate that the match is not found when searching 
for a pattern
+   * string in a target string.
+   */
+  private static final int MATCH_NOT_FOUND = -1;
+
+  /**
+   * Returns whether the target string starts with the specified prefix, 
starting from the
+   * specified position (0-based index referring to character position in 
UTF8String), with respect
+   * to the UTF8_BINARY_LCASE collation. The method assumes that the prefix is 
already lowercased
+   * prior to method call to avoid the overhead of calling .toLowerCase() 
multiple times on the
+   * same prefix string.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return whether the target string starts with the specified prefix in 
UTF8_BINARY_LCASE
+   */
+  public static boolean lowercaseMatchFrom(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+return lowercaseMatchLengthFrom(target, lowercasePattern, startPos) != 
MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the length of the substring of the target string that starts with 
the specified
+   * prefix, starting from the specified position (0-based index referring to 
character position
+   * in UTF8String), with respect to the UTF8_BINARY_LCASE collation. The 
method assumes that the
+   * prefix is already lowercased. The method only considers the part of 
target string that
+   * starts from the specified (inclusive) position (that is, the method does 
not look at UTF8
+   * characters of the target string at or after position `endPos`). If the 
prefix is not found,
+   * MATCH_NOT_FOUND is returned.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return length of the target substring that starts with the specified 
prefix in lowercase
+   */
+  private static int lowercaseMatchLengthFrom(
+  final UTF8Str
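
A hedged sketch (plain Scala strings rather than `UTF8String`, simplified relative to the real implementation) of the match-length idea the Javadoc above describes: grow a candidate prefix of `target` starting at `startPos` until its lowercase form equals the already lowercased pattern, and return that prefix's length, or MATCH_NOT_FOUND if no prefix works. One-to-many case mappings are exactly why the matched length can differ from the pattern length.

```
object LowercaseMatchLengthSketch extends App {
  import java.util.Locale
  val MATCH_NOT_FOUND = -1

  def lowercaseMatchLengthFrom(target: String, lowercasePattern: String, startPos: Int): Int =
    (1 to target.length - startPos)
      .find { len =>
        target.substring(startPos, startPos + len).toLowerCase(Locale.ROOT) == lowercasePattern
      }
      .getOrElse(MATCH_NOT_FOUND)

  // "İ" at position 1 matches the two-code-point pattern "i" + combining dot with length 1.
  println(lowercaseMatchLengthFrom("x\u0130y", "i\u0307", 1))  // 1
  // No prefix of "İy" lowercases to exactly "i", so the match fails.
  println(lowercaseMatchLengthFrom("x\u0130y", "i", 1))        // -1
}
```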

(spark) branch master updated (731a2cfcffae -> e9a3ed857954)

2024-05-28 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 731a2cfcffae [SPARK-48273][SQL] Fix late rewrite of 
PlanWithUnresolvedIdentifier
 add e9a3ed857954 [SPARK-48159][SQL] Extending support for collated strings 
on datetime expressions

No new revisions were added by this update.

Summary of changes:
 .../catalyst/expressions/datetimeExpressions.scala |  38 ++--
 .../spark/sql/CollationSQLExpressionsSuite.scala   | 234 +
 2 files changed, 254 insertions(+), 18 deletions(-)





(spark) branch branch-3.5 updated: [SPARK-48273][SQL] Fix late rewrite of PlanWithUnresolvedIdentifier

2024-05-28 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 7313d71438e4 [SPARK-48273][SQL] Fix late rewrite of 
PlanWithUnresolvedIdentifier
7313d71438e4 is described below

commit 7313d71438e4691f7c086e90ded4a6f644cdcdc5
Author: Nikola Mandic 
AuthorDate: Tue May 28 09:59:53 2024 -0700

[SPARK-48273][SQL] Fix late rewrite of PlanWithUnresolvedIdentifier

### What changes were proposed in this pull request?

`PlanWithUnresolvedIdentifier` is rewritten later in analysis, which causes rules like
`SubstituteUnresolvedOrdinals` to miss the new plan. This causes the following queries to fail:
```
create temporary view identifier('v1') as (select my_col from (values (1), (2), (1) as (my_col)) group by 1);
--
cache table identifier('t1') as (select my_col from (values (1), (2), (1) as (my_col)) group by 1);
--
create table identifier('t2') as (select my_col from (values (1), (2), (1) as (my_col)) group by 1);
insert into identifier('t2') select my_col from (values (3) as (my_col)) group by 1;
```
Fix this by explicitly applying rules after plan rewrite.
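
As a toy illustration in plain Scala (not Catalyst classes) of the ordering problem described above: a rule that expands a placeholder late produces nodes that the earlier batches never visited, unless those earlier rules are re-applied to the expanded subtree, which is what the fix does for the analyzer's early batches.

```
object LateRewriteSketch extends App {
  sealed trait Node
  case class Placeholder(build: () => Node) extends Node   // stand-in for PlanWithUnresolvedIdentifier
  case class Ordinal(n: Int)                extends Node   // stand-in for GROUP BY 1
  case class Resolved(name: String)         extends Node

  val substituteOrdinals: Node => Node = {
    case Ordinal(1) => Resolved("my_col")
    case other      => other
  }
  val expandPlaceholder: Node => Node = {
    case Placeholder(build) => build()
    case other              => other
  }

  val plan: Node = Placeholder(() => Ordinal(1))

  // Buggy ordering: ordinal substitution runs first, the placeholder expands later and is missed.
  println(expandPlaceholder(substituteOrdinals(plan)))                       // Ordinal(1)
  // Fixed: re-run the early rule on whatever the placeholder expanded into.
  println(substituteOrdinals(expandPlaceholder(substituteOrdinals(plan))))   // Resolved(my_col)
}
```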

### Why are the changes needed?

To fix the described bug.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the mentioned problematic queries.

### How was this patch tested?

Updated existing `identifier-clause.sql` golden file.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46580 from nikolamand-db/SPARK-48273.

Authored-by: Nikola Mandic 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 731a2cfcffaeeeb1f1c107080ca77000330d79b5)
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/analysis/Analyzer.scala |  9 ++--
 .../analysis/ResolveIdentifierClause.scala | 11 ++--
 .../spark/sql/catalyst/rules/RuleExecutor.scala|  2 +-
 .../analyzer-results/identifier-clause.sql.out | 59 ++
 .../sql-tests/inputs/identifier-clause.sql |  9 
 .../sql-tests/results/identifier-clause.sql.out| 56 
 6 files changed, 139 insertions(+), 7 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index ed7b978045c7..5890a9692e20 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -255,7 +255,7 @@ class Analyzer(override val catalogManager: CatalogManager) 
extends RuleExecutor
 TypeCoercion.typeCoercionRules
   }
 
-  override def batches: Seq[Batch] = Seq(
+  private def earlyBatches: Seq[Batch] = Seq(
 Batch("Substitution", fixedPoint,
   // This rule optimizes `UpdateFields` expression chains so looks more 
like optimization rule.
   // However, when manipulating deeply nested schema, `UpdateFields` 
expression tree could be
@@ -275,7 +275,10 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 Batch("Simple Sanity Check", Once,
   LookupFunctions),
 Batch("Keep Legacy Outputs", Once,
-  KeepLegacyOutputs),
+  KeepLegacyOutputs)
+  )
+
+  override def batches: Seq[Batch] = earlyBatches ++ Seq(
 Batch("Resolution", fixedPoint,
   new ResolveCatalogs(catalogManager) ::
   ResolveInsertInto ::
@@ -319,7 +322,7 @@ class Analyzer(override val catalogManager: CatalogManager) 
extends RuleExecutor
   ResolveTimeZone ::
   ResolveRandomSeed ::
   ResolveBinaryArithmetic ::
-  ResolveIdentifierClause ::
+  new ResolveIdentifierClause(earlyBatches) ::
   ResolveUnion ::
   ResolveRowLevelCommandAssignments ::
   RewriteDeleteFromTable ::
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveIdentifierClause.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveIdentifierClause.scala
index e0d3e5629ef8..422bad3d89e2 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveIdentifierClause.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveIdentifierClause.scala
@@ -20,19 +20,24 @@ package org.apache.spark.sql.catalyst.analysis
 import org.apache.spark.sql.catalyst.expressions.{AliasHelper, EvalHelper, 
Expression}
 import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
-import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.cat

(spark) branch master updated (7fe1b93884aa -> 731a2cfcffae)

2024-05-28 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 7fe1b93884aa [SPARK-46841][SQL] Add collation support for ICU locales 
and collation specifiers
 add 731a2cfcffae [SPARK-48273][SQL] Fix late rewrite of 
PlanWithUnresolvedIdentifier

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/analysis/Analyzer.scala |  9 ++--
 .../analysis/ResolveIdentifierClause.scala | 11 ++--
 .../spark/sql/catalyst/rules/RuleExecutor.scala|  2 +-
 .../analyzer-results/identifier-clause.sql.out | 59 ++
 .../sql-tests/inputs/identifier-clause.sql |  9 
 .../sql-tests/results/identifier-clause.sql.out| 56 
 6 files changed, 139 insertions(+), 7 deletions(-)





(spark) branch master updated: [SPARK-46841][SQL] Add collation support for ICU locales and collation specifiers

2024-05-28 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 7fe1b93884aa [SPARK-46841][SQL] Add collation support for ICU locales 
and collation specifiers
7fe1b93884aa is described below

commit 7fe1b93884aa8e9ba20f19351b8537c687b8f59c
Author: Nikola Mandic 
AuthorDate: Tue May 28 09:56:16 2024 -0700

[SPARK-46841][SQL] Add collation support for ICU locales and collation 
specifiers

### What changes were proposed in this pull request?

Languages and localization for collations are supported by the ICU library.
The collation naming format is as follows:
```
<2-letter language code>[_<4-letter script>][_<3-letter country 
code>][_specifier_specifier...]
```
The locale specifier consists of the first part of the collation name (language +
script + country). Locale specifiers need to be stable across ICU versions; to
keep existing ids and names invariant, we introduce a golden file with the locale
table, which should cause a CI failure on any silent changes.

Currently supported optional specifiers:

- `CS`/`CI` - case sensitivity, default is case-sensitive; supported by 
configuring ICU collation levels
- `AS`/`AI` - accent sensitivity, default is accent-sensitive; supported by 
configuring ICU collation levels

Users can use collation specifiers in any order, except for the locale, which is
mandatory and must go first. There is a one-to-one mapping between collation
ids and collation names, defined in `CollationFactory`.
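
To make the naming format concrete, here is a small sketch in plain Scala that splits a few illustrative names (assumed examples built from the format above, not an authoritative list of collations supported by any particular Spark build) into the locale part and the optional specifiers:

```
object CollationNameSketch extends App {
  // Hypothetical example names following the documented format: locale first, specifiers after.
  val examples   = Seq("en", "sr_Cyrl_SRB", "en_USA_CI", "en_USA_CI_AI")
  val specifiers = Set("CS", "CI", "AS", "AI")

  for (name <- examples) {
    // span: everything up to the first specifier is the locale (language [+ script] [+ country]).
    val (localeParts, specParts) = name.split("_").toSeq.span(p => !specifiers.contains(p))
    println(s"$name -> locale = ${localeParts.mkString("_")}" +
      (if (specParts.nonEmpty) s", specifiers = ${specParts.mkString(", ")}" else ""))
  }
}
```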

### Why are the changes needed?

To add languages and localization support for collations.

### Does this PR introduce _any_ user-facing change?

Yes, it adds new predefined collations.

### How was this patch tested?

Added checks to `CollationFactorySuite` and ICU locale map golden file.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46180 from nikolamand-db/SPARK-46841.

Authored-by: Nikola Mandic 
    Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/util/CollationFactory.java  | 678 +
 .../spark/unsafe/types/CollationFactorySuite.scala | 323 +-
 .../src/main/resources/error/error-conditions.json |   4 +-
 .../apache/spark/sql/PlanGenerationTestSuite.scala |   4 +-
 .../src/main/protobuf/spark/connect/types.proto|   2 +-
 .../connect/common/DataTypeProtoConverter.scala|   9 +-
 .../query-tests/queries/csv_from_dataset.json  |   2 +-
 .../query-tests/queries/csv_from_dataset.proto.bin | Bin 158 -> 169 bytes
 .../query-tests/queries/function_lit_array.json|   4 +-
 .../queries/function_lit_array.proto.bin   | Bin 889 -> 911 bytes
 .../query-tests/queries/function_typedLit.json |  32 +-
 .../queries/function_typedLit.proto.bin| Bin 1199 -> 1381 bytes
 .../query-tests/queries/json_from_dataset.json |   2 +-
 .../queries/json_from_dataset.proto.bin| Bin 169 -> 180 bytes
 python/pyspark/sql/connect/proto/types_pb2.py  |  78 +--
 python/pyspark/sql/connect/proto/types_pb2.pyi |  11 +-
 python/pyspark/sql/connect/types.py|   5 +-
 python/pyspark/sql/types.py|  27 +-
 .../org/apache/spark/sql/internal/SQLConf.scala|  15 +-
 .../expressions/CollationExpressionSuite.scala |  33 +-
 .../resources/collations/ICU-collations-map.md | 143 +
 .../sql-tests/analyzer-results/collations.sql.out  |  77 +++
 .../test/resources/sql-tests/inputs/collations.sql |  13 +
 .../resources/sql-tests/results/collations.sql.out |  88 +++
 .../org/apache/spark/sql/CollationSuite.scala  |   2 +-
 .../apache/spark/sql/ICUCollationsMapSuite.scala   |  69 +++
 .../apache/spark/sql/internal/SQLConfSuite.scala   |   3 +-
 27 files changed, 1388 insertions(+), 236 deletions(-)

diff --git 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
index 0133c3feb611..fce12510afaf 100644
--- 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
+++ 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
@@ -19,6 +19,7 @@ package org.apache.spark.sql.catalyst.util;
 import java.text.CharacterIterator;
 import java.text.StringCharacterIterator;
 import java.util.*;
+import java.util.concurrent.ConcurrentHashMap;
 import java.util.function.BiFunction;
 import java.util.function.ToLongFunction;
 
@@ -173,26 +174,546 @@ public final class CollationFactory {
 }
 
 /**
- * Constructor with comparators that are inherited from the given collator.
+ * Collation ID is defined as 32-bit integer. We specify binary 

svn commit: r69426 - /dev/spark/v4.0.0-preview1-rc3-bin/

2024-05-28 Thread wenchen
Author: wenchen
Date: Tue May 28 16:50:57 2024
New Revision: 69426

Log:
Apache Spark v4.0.0-preview1-rc3

Added:
dev/spark/v4.0.0-preview1-rc3-bin/
dev/spark/v4.0.0-preview1-rc3-bin/SparkR_4.0.0-preview1.tar.gz   (with 
props)
dev/spark/v4.0.0-preview1-rc3-bin/SparkR_4.0.0-preview1.tar.gz.asc
dev/spark/v4.0.0-preview1-rc3-bin/SparkR_4.0.0-preview1.tar.gz.sha512
dev/spark/v4.0.0-preview1-rc3-bin/pyspark-4.0.0.dev1.tar.gz   (with props)
dev/spark/v4.0.0-preview1-rc3-bin/pyspark-4.0.0.dev1.tar.gz.asc
dev/spark/v4.0.0-preview1-rc3-bin/pyspark-4.0.0.dev1.tar.gz.sha512
dev/spark/v4.0.0-preview1-rc3-bin/pyspark_connect-4.0.0.dev1.tar.gz   (with 
props)
dev/spark/v4.0.0-preview1-rc3-bin/pyspark_connect-4.0.0.dev1.tar.gz.asc
dev/spark/v4.0.0-preview1-rc3-bin/pyspark_connect-4.0.0.dev1.tar.gz.sha512
dev/spark/v4.0.0-preview1-rc3-bin/spark-4.0.0-preview1-bin-hadoop3.tgz   
(with props)
dev/spark/v4.0.0-preview1-rc3-bin/spark-4.0.0-preview1-bin-hadoop3.tgz.asc

dev/spark/v4.0.0-preview1-rc3-bin/spark-4.0.0-preview1-bin-hadoop3.tgz.sha512

dev/spark/v4.0.0-preview1-rc3-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz   
(with props)

dev/spark/v4.0.0-preview1-rc3-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz.asc

dev/spark/v4.0.0-preview1-rc3-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz.sha512
dev/spark/v4.0.0-preview1-rc3-bin/spark-4.0.0-preview1.tgz   (with props)
dev/spark/v4.0.0-preview1-rc3-bin/spark-4.0.0-preview1.tgz.asc
dev/spark/v4.0.0-preview1-rc3-bin/spark-4.0.0-preview1.tgz.sha512

Added: dev/spark/v4.0.0-preview1-rc3-bin/SparkR_4.0.0-preview1.tar.gz
==
Binary file - no diff available.

Propchange: dev/spark/v4.0.0-preview1-rc3-bin/SparkR_4.0.0-preview1.tar.gz
--
svn:mime-type = application/octet-stream

Added: dev/spark/v4.0.0-preview1-rc3-bin/SparkR_4.0.0-preview1.tar.gz.asc
==
--- dev/spark/v4.0.0-preview1-rc3-bin/SparkR_4.0.0-preview1.tar.gz.asc (added)
+++ dev/spark/v4.0.0-preview1-rc3-bin/SparkR_4.0.0-preview1.tar.gz.asc Tue May 
28 16:50:57 2024
@@ -0,0 +1,17 @@
+-BEGIN PGP SIGNATURE-
+
+iQJHBAABCgAxFiEETclnbO+ag+mPygJ4TWYghDzYf1oFAmZWCNcTHHdlbmNoZW5A
+YXBhY2hlLm9yZwAKCRBNZiCEPNh/WqwvD/4/Lap7M5blsZvzAmevVyZ58wIESquP
+8BNt/2SVuj1gQJizxsqyTk+knyFSQ/NPPlasHc6G/yd5aUAAaggVO1S4QCtKvtQ9
+E9EQQ8BFLjC11Srg/93dDdMuLXsB+SNiYoT9yILtCK9Hs2M84i3aXlla3jPYs1qZ
+/E/a5JmqMhxaBeNk2L4uo6KqevanH5d2Xi9Xe8ulln2xJqJARSVVSOr3qO0BdZjb
+wv7xyDo7wRW96dQywx5gHPuZIL6Qu0bYqRRQAaQZvwmeJnxah9jLZZKWp6E1eLCq
+jD11l+FMauIzyO1B3BK9opsBze8G0mVTuUPFYww5C8DxfxwSDBzUZaGHlp1xmxiv
+lF35PmB/FpRk9ddpzNucJnWddjS582wj+rxi3KnlFIusbTtDFpRFa+5sTa0GG2LO
+wG5vBD2QHSWHQ3NnvGiffp6OIPOmw009+QNi7/JYfVrpsNHRqW5bBew3QeR756Jy
+tFvOCN37wLzLwfEOGDou3lNyYFBlsFk37HqlnQpkmvokPzBJ2giWmwVnIc7iYub5
+DHtB86r/Vmqb1mkqsG9PbsBIzbRX6e1rTAQtQQbBYenaA63rAVwrLFt65Y2rTIt8
+D8ewS9cLhEJaf6ajndb5AlQRxX/hth5xmuSMEXib0V5V/BGgNtw9kQ6GouPmf16J
+AOs0h20YWzkFmw==
+=aEv0
+-END PGP SIGNATURE-

Added: dev/spark/v4.0.0-preview1-rc3-bin/SparkR_4.0.0-preview1.tar.gz.sha512
==
--- dev/spark/v4.0.0-preview1-rc3-bin/SparkR_4.0.0-preview1.tar.gz.sha512 
(added)
+++ dev/spark/v4.0.0-preview1-rc3-bin/SparkR_4.0.0-preview1.tar.gz.sha512 Tue 
May 28 16:50:57 2024
@@ -0,0 +1 @@
+b2a81b7239d39b2af3a81a82fa8541db8551a7503a602766e37bdaf70495123e2d3fa68cd4b684af2df2386f0212167a291cbc260d54ac985fd968dc09b3a0d2
  SparkR_4.0.0-preview1.tar.gz

Added: dev/spark/v4.0.0-preview1-rc3-bin/pyspark-4.0.0.dev1.tar.gz
==
Binary file - no diff available.

Propchange: dev/spark/v4.0.0-preview1-rc3-bin/pyspark-4.0.0.dev1.tar.gz
--
svn:mime-type = application/octet-stream

Added: dev/spark/v4.0.0-preview1-rc3-bin/pyspark-4.0.0.dev1.tar.gz.asc
==
--- dev/spark/v4.0.0-preview1-rc3-bin/pyspark-4.0.0.dev1.tar.gz.asc (added)
+++ dev/spark/v4.0.0-preview1-rc3-bin/pyspark-4.0.0.dev1.tar.gz.asc Tue May 28 
16:50:57 2024
@@ -0,0 +1,17 @@
+-BEGIN PGP SIGNATURE-
+
+iQJHBAABCgAxFiEETclnbO+ag+mPygJ4TWYghDzYf1oFAmZWCNkTHHdlbmNoZW5A
+YXBhY2hlLm9yZwAKCRBNZiCEPNh/Wox1EAC7Eh7Yxy6rmYBTdG5AptbPLt7RyZei
+gjl/8ROpgQ1p7ehghXuLtEkcuy/GSSU02BLsrbkM/QtDwElRBkEDVABcQyS7Jgia
+2WFuK8E1BPPWlIl07KhqEmWwXSSzLRuVAQhMPFjT5g7Op/viqOCXbXSEGoHe8+8Z
+4OJ9zr8qpeMM9ZLivQppq5PAodcKohR7n5BBHFjShNhU3XJ3Cl3pFMxg9weCCuGD
+2SQgPIveai7P9Lhe5Cl5eXiSOCEG+r4QJjk9d5FjAH+VK0qcH0guW41eeHiv3k1y
+DFeh3PJlvUx1TP8/E7hiMUVA5H5HorHkzOraQrFaC+D+tqMWAQFSXThrJmYSRaEU
+h2SFOdQ8Bk4AsAzikzyALULT+gDKhGhtFWLpz5eyt2tWOKL8sCpcF0AnrrssusJp
+5p+9xhBvs9L

(spark) 01/01: Preparing Spark release v4.0.0-preview1-rc3

2024-05-28 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to tag v4.0.0-preview1-rc3
in repository https://gitbox.apache.org/repos/asf/spark.git

commit 7a7a8bc4bab591ac8b98b2630b38c57adf619b82
Author: Wenchen Fan 
AuthorDate: Tue May 28 16:23:00 2024 +

Preparing Spark release v4.0.0-preview1-rc3
---
 R/pkg/R/sparkR.R   | 4 ++--
 assembly/pom.xml   | 2 +-
 common/kvstore/pom.xml | 2 +-
 common/network-common/pom.xml  | 2 +-
 common/network-shuffle/pom.xml | 2 +-
 common/network-yarn/pom.xml| 2 +-
 common/sketch/pom.xml  | 2 +-
 common/tags/pom.xml| 2 +-
 common/unsafe/pom.xml  | 2 +-
 common/utils/pom.xml   | 2 +-
 common/variant/pom.xml | 2 +-
 connector/avro/pom.xml | 2 +-
 connector/connect/client/jvm/pom.xml   | 2 +-
 connector/connect/common/pom.xml   | 2 +-
 connector/connect/server/pom.xml   | 2 +-
 connector/docker-integration-tests/pom.xml | 2 +-
 connector/kafka-0-10-assembly/pom.xml  | 2 +-
 connector/kafka-0-10-sql/pom.xml   | 2 +-
 connector/kafka-0-10-token-provider/pom.xml| 2 +-
 connector/kafka-0-10/pom.xml   | 2 +-
 connector/kinesis-asl-assembly/pom.xml | 2 +-
 connector/kinesis-asl/pom.xml  | 2 +-
 connector/profiler/pom.xml | 2 +-
 connector/protobuf/pom.xml | 2 +-
 connector/spark-ganglia-lgpl/pom.xml   | 2 +-
 core/pom.xml   | 2 +-
 docs/_config.yml   | 6 +++---
 examples/pom.xml   | 2 +-
 graphx/pom.xml | 2 +-
 hadoop-cloud/pom.xml   | 2 +-
 launcher/pom.xml   | 2 +-
 mllib-local/pom.xml| 2 +-
 mllib/pom.xml  | 2 +-
 pom.xml| 2 +-
 python/pyspark/version.py  | 2 +-
 repl/pom.xml   | 2 +-
 resource-managers/kubernetes/core/pom.xml  | 2 +-
 resource-managers/kubernetes/integration-tests/pom.xml | 2 +-
 resource-managers/yarn/pom.xml | 2 +-
 sql/api/pom.xml| 2 +-
 sql/catalyst/pom.xml   | 2 +-
 sql/core/pom.xml   | 2 +-
 sql/hive-thriftserver/pom.xml  | 2 +-
 sql/hive/pom.xml   | 2 +-
 streaming/pom.xml  | 2 +-
 tools/pom.xml  | 2 +-
 46 files changed, 49 insertions(+), 49 deletions(-)

diff --git a/R/pkg/R/sparkR.R b/R/pkg/R/sparkR.R
index 0be7e5da24d2..478acf514ef3 100644
--- a/R/pkg/R/sparkR.R
+++ b/R/pkg/R/sparkR.R
@@ -456,8 +456,8 @@ sparkR.session <- function(
 
   # Check if version number of SparkSession matches version number of SparkR 
package
   jvmVersion <- callJMethod(sparkSession, "version")
-  # Remove -SNAPSHOT from jvm versions
-  jvmVersionStrip <- gsub("-SNAPSHOT", "", jvmVersion, fixed = TRUE)
+  # Remove -preview1 from jvm versions
+  jvmVersionStrip <- gsub("-preview1", "", jvmVersion, fixed = TRUE)
   rPackageVersion <- paste0(packageVersion("SparkR"))
 
   if (jvmVersionStrip != rPackageVersion) {
diff --git a/assembly/pom.xml b/assembly/pom.xml
index 58e7ae5bb0c7..417e7c23ca9f 100644
--- a/assembly/pom.xml
+++ b/assembly/pom.xml
@@ -21,7 +21,7 @@
   
 org.apache.spark
 spark-parent_2.13
-4.0.0-SNAPSHOT
+4.0.0-preview1
 ../pom.xml
   
 
diff --git a/common/kvstore/pom.xml b/common/kvstore/pom.xml
index 046648e9c2ae..e1a4497387a2 100644
--- a/common/kvstore/pom.xml
+++ b/common/kvstore/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.13
-4.0.0-SNAPSHOT
+4.0.0-preview1
 ../../pom.xml
   
 
diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml
index cdb5bd72158a..d8dff6996cec 100644
--- a/common/network-common/pom.xml
+++ b/common/network-common/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.13
-4.0.0-SNAPSHOT
+4.0.0-preview1
 ../../pom.xml
   
 
diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml
index 0f7036ef

(spark) tag v4.0.0-preview1-rc3 created (now 7a7a8bc4bab5)

2024-05-28 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to tag v4.0.0-preview1-rc3
in repository https://gitbox.apache.org/repos/asf/spark.git


  at 7a7a8bc4bab5 (commit)
This tag includes the following new commits:

 new 7a7a8bc4bab5 Preparing Spark release v4.0.0-preview1-rc3

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.






svn commit: r69425 - /dev/spark/KEYS

2024-05-28 Thread wenchen
Author: wenchen
Date: Tue May 28 16:09:45 2024
New Revision: 69425

Log:
Update KEYS

Modified:
dev/spark/KEYS

Modified: dev/spark/KEYS
==
--- dev/spark/KEYS (original)
+++ dev/spark/KEYS Tue May 28 16:09:45 2024
@@ -704,62 +704,61 @@ kyHyHY5kPG9HfDOSahPz
 =SDAz
 -END PGP PUBLIC KEY BLOCK-
 
-pub   rsa4096 2024-05-07 [SC]
-  4DC9676CEF9A83E98FCA02784D6620843CD87F5A
-uid  Wenchen Fan (CODE SIGNING KEY) 
-sub   rsa4096 2024-05-07 [E]
+pub   4096R/4F4FDC8A 2018-09-18
+uid  Wenchen Fan (CODE SIGNING KEY) 
+sub   4096R/6F3F5B0E 2018-09-18
 
 -BEGIN PGP PUBLIC KEY BLOCK-
+Version: GnuPG v1
 
-mQINBGY6XpcBEADBeNz3IBYriwrPzMYJJO5u1DaWAJ4Sryx6PUZgvssrcqojYVTh
-MjtlBkWRcNquAyDrVlU1vtq1yMq5KopQoAEi/l3xaEDZZ0IFAob6+GlGXEon2Jvf
-0FXQsx+Df4nMVl7KPqh68T++Z4GkvK5wyyN9uaUTWL2deGeinVxTh6qWQT8YiCd5
-wof+Dk5IIzKQ5VIBhU/U9S0jo/pqhH4okcZGTyT2Q7sfg4eXl5+Y2OR334RkvTcX
-uJjcnJ8BUbBSm1UhNg4OGBEJgi+lE1GEgw4juOfTAPh9fx8SCLhuX0m6Qc/y9bAK
-Q4zejbF5F2Um9dqrZqg6Egp+nlzydn59hq9owSnQ6JdoA/PLcgoign0sghu9xGCR
-GpgI2kS7Q8bu6dy7T0BfUerLZ1FHu7nCT2ZNSIh/Y2eOhuBhUr3llg8xa3PZZob/
-2sZE2dJ3g/qp2Nbo+s5Q5kELtuo6cZD0EISQwt68hGWIgxs0vtci2c2kQYFS0oqw
-fGynEeDFZRHV3ET5rioYaoPi70Cnibght5ocL0t6sl0RQQVp6k2i1aofJbZA480N
-ivuJ5agGaSRxmIDk6JlDsHJGxO9oC066ZLJiR6i0JUinGP7sw/nNmgup/AB+y4hW
-9WdeAFyYmuYysDRRyE6z1MPDp1R00MyGxHNFDF64/JPY/nKKFdXp+aCazwARAQAB
+mQINBFugiYgBEAC4DsJBWF3VjWiKEiD8XNPRTg3Bnw52fe4bTB9Jvh/q0VStJjO7
+CSHZ1/P5h60zbS5UWLP2mt+c0FaW6wv7PxafCnd1MPENGBkttZbC4UjWDSbPp0vx
+fkUfrAqflWvO1AaCveg2MlyQdLZ1HwVz+PDLWqE+Ev2p3Si4Jfx5P2O9FmWt8a/b
+Wea/4gfy/5zFWRberQjt4CkSBuNU+cOo19/n32JJJYbRqrzFAGs/DJUIxNXC1qef
+c2iB3dyff1mkLb9Vzd1RfhZaSNUElo67o4Vi6SswgvHxoE03wIcoJvBTafqLxy6p
+mt5SAzOyvvmOVcLNqP9i5+c4sBrxvQ2ZEZrZt7dKfhbh4W8ged/TNWMoNOCX2usD
+Fj17KrFAEaeqtEwRdwZMxGqKI/NxANkdPSxS4T/JQoi+N6LBJ88yzmeCquA8MT0b
+/H4ziyjgrSRugCE6jcsbuObQsDxiqPSSXeWSjPoYq876JcqAgZzSYYdlGVw2J9Vb
+46hhEqhGk+91vK6CtyuhKv5KXk1B3Rhhc5znKWcahD3cpISxwTSzN9OwQHEd8Ovv
+x0WAhY3WOexrBekH7Sy00gjaHSAHFj3ReITfffWkv6t4TGLyohEOfgdxFvq03Fhd
+p7bWDmux47jP6AUUjP0VXRsG9ev3ch+bbcbRlo15HPBtyehoPn4BellFAQARAQAB
 tDNXZW5jaGVuIEZhbiAoQ09ERSBTSUdOSU5HIEtFWSkgPHdlbmNoZW5AYXBhY2hl
-Lm9yZz6JAlEEEwEIADsWIQRNyWds75qD6Y/KAnhNZiCEPNh/WgUCZjpelwIbAwUL
-CQgHAgIiAgYVCgkICwIEFgIDAQIeBwIXgAAKCRBNZiCEPNh/WkofD/9sI7J3i9Ck
-NOlHpVnjAaHjyGX5cVA2dZGniJdLf5yOKOI6pu7dMW+NThsXO1Iv+BRYo7una6/Q
-vUquKKxCXIN3vNmKIB1e9lj4MaIhCRmXUSQxjkVa9JW3P/F520Ct3VjiCZ5IjPv+
-g1hF/wrkuuoAFlcC/bfGWafkaZgszavSpCdp27mUXUNbvLW0dPJ3+ay4cDPuT1DI
-6DhB8qpqN7gInDFACW2qtQ2KZh1JFGy5ZccQ9dB3t/B4BYiUie6a3eQWgKqLF1hw
-8yHY3DkCVGfnXJk4+LMWqgazQxoB6oZjBvoQYtGOPXr1ZbmtiRHCDM5KmZ+QmIXB
-ZGBXkLaqt2QGxlwUGlvn+nKuTsp8VL1APIlKdMpvMW59uz1ycZHMeTJGAMtZw8Qm
-kxG62kqnDYeZ6oWwinY3wYP4UmqFSWIfcHMfBwED4uOC//r9H1bO+JRFMwOxqSN7
-kGfFJoV5eOvMOwRnXPJiPpnQEHPEkp/TAl2ANHWzdXy9TifiHOvTln3NXQVpznnW
-H6f9+W36J1IE9EWktciptKUtvwY1np+G71Swa0Q4mNgb8OGf6UNJGv4vPbSlhzlr
-1a5oYP59eHO3XqANcuKyTFxfja+rgrMldufZFCk1hSnBdAic/jaHrhIQSLcTGFiJ
-QVyiC2VlO2eZCkCTfoSlolwgzzoY4wNumLkCDQRmOl6XARAAt+N+djFZOuJdLcSz
-pz6nG88gxLmPwf+Xlhv2+xDS3wyM1OWmDAkeMDNq8OuZMes6ZXwRxDvSj7w7dlE6
-dQ1BlDz4RP4GoYG++dnPlHp/NWQ8I/eW8XC5uxkvl56YG/0DudoTLb5nxHtv+kpm
-p+eVCqWRYI5RQPdcxEZzXEije+aEj2aMRQ8cO7RAgTamRWXr+XsRkSypZ8ttTISr
-u+UuQPKT6XRMtkB2i8ekwO+jIK/mMrAteIF/cK0jv2JTlYmWrBtmGgYjHZHlzZak
-/MzWN4tU5VbJMMXa9wHicZS0/cPV9Fz3dnR0sBVgaIDsK+/vRGxHd/LGFtXH+Wrp
-pPMaR4FHCx3r44aL17B5lJocwf7Xma2gavOl80NR+a8iOW6biKdlALRZKX4G4cJj
-1vnWHDJceZOuFWMVIs7zfJymvQpROCRED3q1el+zCICnLtBue6ikqv7nfyBNCaR2
-qZhw4TPMzzGTRIdKIalcSTi+bGfSYTsU2kVDBbH+0nD5I7Tx62H4shsJtgmwyP4R
-q2dxJPpC4i+L09crjyl7rYvwHu4QU8vxcQXN4cH4O5pKOr2GoGnV8Y7kpZaRUo6w
-/Q/Rx3I3UKAyYJv0R1mK4AifM0JzMkqxAUvUdUbs2obRT04sxtr1bA+9dLEv4b8c
-YGKmRgt96GCNx1XZ8Q+FPdmsaO0AEQEAAYkCNgQYAQgAIBYhBE3JZ2zvmoPpj8oC
-eE1mIIQ82H9aBQJmOl6XAhsMAAoJEE1mIIQ82H9aBfAQAKf6xHNuKibXcRMwqmcx
-rx18d0dbeMEjrPqSe5vGOylLQZRpwZmKwflU9kZgOU2WRuqZsaPE0w5wxhsNDe8s
-UqxW08xB6v8BVj6BT9umJQNyQF5CrsjkZe2EtmYlbdNmt4t8DMNEmhhasEglWUui
-0se3I0wIwDaYAW+KppwzweO8SrUZVaB6QhOckRFhz/1wCNyc2Yp90OjWjuATffOE
-ZWSeGPn9GCbtJ+SPtLtMUlxy/BoRA6OWv6H5VAt6pJVw3XPP/o450i7lYxbmbv8W
-qm5/8nWx1XBvTvOxGoT9h+45bWjLTXtJJ2RhEftGHZ9439VSgssXBl+S/yjpnHOa
-14tRCVABP8bgAQ7HEKZ9YyII6MOAEzNa2gNVKr7+gwB1ddrGdzx6TrIUwRlgilDJ
-XORdEON4Ssx31Y1+Dt+d4lkkGu5Ymkj8iFIeH6FNOnFWM/stTmL0fE4IGpWbUHc+
-nqz7zEgili8TanLQRUmz9ClVJTG4G9t31FYF8nNzDPxug9oSMJXBfVlzhRMRZH3z
-t/XdxNFHyu7rzXidiXTJSmujeqS++mKcXxx02m+V2qfwkAwnt6OS9NDLPVrzuuMN
-NDfY3Gr4dTCbd+JQxtC0w4GuUV1V3lcOwyEjPKJVYuZwUl0UspRbNmtsaybRbzVs
-+q68az33WU5++zSuqrU3fIRp
-=1zLb
+Lm9yZz6JAjgEEwECACIFAlugiYgCGwMGCwkIBwMCBhUIAgkKCwQWAgMBAh4BAheA
+AAoJEGuscolPT9yKvqUP/i34exSQNs9NcDvjOQhrvpjYCarL4mdQZOjIn9JWxeWr
+3nkzC9ozEIrb1zt8pqhiYr6qJhmx2EJgIwZTZZ9O0qHFMmYhYn/9/KKidE0XN6t3
+dFcbtRB1PGlc9b34PZNfdhD8PWA/UB1QC0TdTRNKhrIGGIZocrkaBral6uMJZAyV
+kbb+s21cRupPLM2wmU1k3U4WxnaIq2foErhaPC9+OEDAcLH/OxwiekJTCsvZypzE
+1laxo21rX1kgYzeAuqP4BfX5ARyrfM3O31Gh8asrx1bXD4z7dHqJxdJjh7ycdJdT

svn commit: r69418 - /dev/spark/v4.0.0-preview1-rc2-bin/

2024-05-28 Thread wenchen
Author: wenchen
Date: Tue May 28 07:41:08 2024
New Revision: 69418

Log:
Apache Spark v4.0.0-preview1-rc2

Added:
dev/spark/v4.0.0-preview1-rc2-bin/
dev/spark/v4.0.0-preview1-rc2-bin/SparkR_4.0.0-preview1.tar.gz   (with 
props)
dev/spark/v4.0.0-preview1-rc2-bin/SparkR_4.0.0-preview1.tar.gz.asc
dev/spark/v4.0.0-preview1-rc2-bin/SparkR_4.0.0-preview1.tar.gz.sha512
dev/spark/v4.0.0-preview1-rc2-bin/pyspark-4.0.0.dev1.tar.gz   (with props)
dev/spark/v4.0.0-preview1-rc2-bin/pyspark-4.0.0.dev1.tar.gz.asc
dev/spark/v4.0.0-preview1-rc2-bin/pyspark-4.0.0.dev1.tar.gz.sha512
dev/spark/v4.0.0-preview1-rc2-bin/pyspark_connect-4.0.0.dev1.tar.gz   (with 
props)
dev/spark/v4.0.0-preview1-rc2-bin/pyspark_connect-4.0.0.dev1.tar.gz.asc
dev/spark/v4.0.0-preview1-rc2-bin/pyspark_connect-4.0.0.dev1.tar.gz.sha512
dev/spark/v4.0.0-preview1-rc2-bin/spark-4.0.0-preview1-bin-hadoop3.tgz   
(with props)
dev/spark/v4.0.0-preview1-rc2-bin/spark-4.0.0-preview1-bin-hadoop3.tgz.asc

dev/spark/v4.0.0-preview1-rc2-bin/spark-4.0.0-preview1-bin-hadoop3.tgz.sha512

dev/spark/v4.0.0-preview1-rc2-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz   
(with props)

dev/spark/v4.0.0-preview1-rc2-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz.asc

dev/spark/v4.0.0-preview1-rc2-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz.sha512
dev/spark/v4.0.0-preview1-rc2-bin/spark-4.0.0-preview1.tgz   (with props)
dev/spark/v4.0.0-preview1-rc2-bin/spark-4.0.0-preview1.tgz.asc
dev/spark/v4.0.0-preview1-rc2-bin/spark-4.0.0-preview1.tgz.sha512

Added: dev/spark/v4.0.0-preview1-rc2-bin/SparkR_4.0.0-preview1.tar.gz
==
Binary file - no diff available.

Propchange: dev/spark/v4.0.0-preview1-rc2-bin/SparkR_4.0.0-preview1.tar.gz
--
svn:mime-type = application/octet-stream

Added: dev/spark/v4.0.0-preview1-rc2-bin/SparkR_4.0.0-preview1.tar.gz.asc
==
--- dev/spark/v4.0.0-preview1-rc2-bin/SparkR_4.0.0-preview1.tar.gz.asc (added)
+++ dev/spark/v4.0.0-preview1-rc2-bin/SparkR_4.0.0-preview1.tar.gz.asc Tue May 
28 07:41:08 2024
@@ -0,0 +1,17 @@
+-BEGIN PGP SIGNATURE-
+
+iQJHBAABCgAxFiEETclnbO+ag+mPygJ4TWYghDzYf1oFAmZVh/cTHHdlbmNoZW5A
+YXBhY2hlLm9yZwAKCRBNZiCEPNh/WtsEEACuWS2ZcvRneZIO8kTM0aH87vywbfEi
+bZYIr47b1LhcTgKRXsJ9qjfESgy7dCw78ykdSTOIw7kSec60kXrlqsAgNEKtxDpg
+B6yUD4+JSve9d9jQOCrwnYbUex+TzkIvWieDaU8DuuaAKf8gq1NMku2w26WBC11N
+lNtEBy93rKuT51si5L5RK7Of58J9s5z8T0b1t/zXO9M+N7C+eDJly6EQ4+6STuYN
+2q8+dne9l/tlthgQ30+YdOprU6ZRIwGukXRn830ZOOtfifF+ud7DVmk59dqmPzyX
++JiZuuVC56M19kpXt4hyg6cmOdG5wYoMZYApPueZCNUX+D4LC4pXkrI+4d90UnzL
+jlQDD92ChhrWFCUSCg1ysjFH20QXgfiqoLMHBJJ3jWZGJfAhvxBOW7Y9wLND68HI
+rFTxld/RkHFouwssasgxTL00mlWRZOXWdm/iByZS3J2U/bQgk4TbEqyHlCvKPUuK
+0UaHNVpO+jUjJ8uTCKnk9JgZTKTPGNx3nFtNdE/vckIKOuZkhYVq9jvIUBoRsoCb
+Rh/X5+aHUHZJT7faNOBVeNLgAGugIf8t/K3GysJSXxnXEBjDX94b5ruy8Mp5Odja
+OAAr4U/RpQQvoGLnoc0ZAYok8V5RQW7Vy7Q8Tf+0RnPis4VIWB0XeA4Ts5QLfyd9
+nb/DxsuDsKofxw==
+=t0/V
+-END PGP SIGNATURE-

Added: dev/spark/v4.0.0-preview1-rc2-bin/SparkR_4.0.0-preview1.tar.gz.sha512
==
--- dev/spark/v4.0.0-preview1-rc2-bin/SparkR_4.0.0-preview1.tar.gz.sha512 
(added)
+++ dev/spark/v4.0.0-preview1-rc2-bin/SparkR_4.0.0-preview1.tar.gz.sha512 Tue 
May 28 07:41:08 2024
@@ -0,0 +1 @@
+2df825b17df1103bc368a8c382e1a8accfb82163b58adfd56026b528d35af21c93342b243658d4ecee50300380dea2755ab3f7eb5a4296d84089f392c62a8440
  SparkR_4.0.0-preview1.tar.gz

Added: dev/spark/v4.0.0-preview1-rc2-bin/pyspark-4.0.0.dev1.tar.gz
==
Binary file - no diff available.

Propchange: dev/spark/v4.0.0-preview1-rc2-bin/pyspark-4.0.0.dev1.tar.gz
--
svn:mime-type = application/octet-stream

Added: dev/spark/v4.0.0-preview1-rc2-bin/pyspark-4.0.0.dev1.tar.gz.asc
==
--- dev/spark/v4.0.0-preview1-rc2-bin/pyspark-4.0.0.dev1.tar.gz.asc (added)
+++ dev/spark/v4.0.0-preview1-rc2-bin/pyspark-4.0.0.dev1.tar.gz.asc Tue May 28 
07:41:08 2024
@@ -0,0 +1,17 @@
+-BEGIN PGP SIGNATURE-
+
+iQJHBAABCgAxFiEETclnbO+ag+mPygJ4TWYghDzYf1oFAmZVh/kTHHdlbmNoZW5A
+YXBhY2hlLm9yZwAKCRBNZiCEPNh/WiYdD/9hTv6p373FBLLbZ2EMXaKvxNUE+CxW
+pM/25bKRUVhb8bn/jE6rK6PUEGMk16p2rULbS5Ml2/KFCv31U3MYMyxXrKh3xMXt
+Z+DDKcKv8tBjDaGxrGHBY9ob3ODU3Vng24HGtXKlAkesAbcbQfYsaVwI8Djl0tHT
+bcJ48rXV+aoQUUpRq5TrPoKN9BOv5GL+GVPjxFXysejsnwmz2vusNYDBV2hScrAA
+H2kwshbhX95zxxDQfP2jzZcEM/gFBHGYL9vbfS5yRpjjARP5LAJRFZRU9KL3evTa
+g17B09/m5ED2OJdDgDrx+caZqIau8RnQYB1l723iO+BM7zJkW5qHHRsoMaf10Vvi
+rDQrtIRE/YSEVmtWJYIuwLY2beloLFdUm1/4GwCMqkV+YpNEsBKqGsm31aqeP28Y
+1w6sPQZXbo9

svn commit: r69417 - /dev/spark/v4.0.0-preview1-rc2-bin/

2024-05-28 Thread wenchen
Author: wenchen
Date: Tue May 28 06:35:58 2024
New Revision: 69417

Log:
Deleting

Removed:
dev/spark/v4.0.0-preview1-rc2-bin/





(spark) branch branch-3.5 updated: [SPARK-41049][SQL][FOLLOW-UP] Mark map related expressions as stateful expressions

2024-05-27 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new f42c029fac5c [SPARK-41049][SQL][FOLLOW-UP] Mark map related 
expressions as stateful expressions
f42c029fac5c is described below

commit f42c029fac5c8015d80ad957fae325243a2ed30d
Author: Rui Wang 
AuthorDate: Mon May 27 22:40:13 2024 -0700

[SPARK-41049][SQL][FOLLOW-UP] Mark map related expressions as stateful 
expressions

MapConcat contains state, so it is stateful:
```
private lazy val mapBuilder = new ArrayBasedMapBuilder(dataType.keyType, dataType.valueType)
```

Similarly, `MapFromEntries`, `CreateMap`, `MapFromArrays`, `StringToMap`, and `TransformKeys` need the same change.

Stateful expressions should be marked as stateful.

No

N/A

No

Closes #46721 from amaliujia/statefulexpr.

Authored-by: Rui Wang 
Signed-off-by: Wenchen Fan 
(cherry picked from commit af1ac1edc2a96c9aba949e3100ddae37b6f0e5b2)
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/expressions/collectionOperations.scala  |  3 +++
 .../spark/sql/catalyst/expressions/complexTypeCreator.scala|  6 ++
 .../spark/sql/catalyst/expressions/higherOrderFunctions.scala  |  2 ++
 .../src/test/scala/org/apache/spark/sql/DataFrameSuite.scala   | 10 +-
 4 files changed, 20 insertions(+), 1 deletion(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
index 3ddbe38fdedf..45896382af67 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
@@ -712,6 +712,7 @@ case class MapConcat(children: Seq[Expression])
 }
   }
 
+  override def stateful: Boolean = true
   override def nullable: Boolean = children.exists(_.nullable)
 
   private lazy val mapBuilder = new ArrayBasedMapBuilder(dataType.keyType, 
dataType.valueType)
@@ -827,6 +828,8 @@ case class MapFromEntries(child: Expression)
 
   override def nullable: Boolean = child.nullable || nullEntries
 
+  override def stateful: Boolean = true
+
   @transient override lazy val dataType: MapType = dataTypeDetails.get._1
 
   override def checkInputDataTypes(): TypeCheckResult = dataTypeDetails match {
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
index c95a0987330d..1b6f86984be7 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
@@ -242,6 +242,8 @@ case class CreateMap(children: Seq[Expression], 
useStringTypeWhenEmpty: Boolean)
 
   private lazy val mapBuilder = new ArrayBasedMapBuilder(dataType.keyType, 
dataType.valueType)
 
+  override def stateful: Boolean = true
+
   override def eval(input: InternalRow): Any = {
 var i = 0
 while (i < keys.length) {
@@ -317,6 +319,8 @@ case class MapFromArrays(left: Expression, right: 
Expression)
   valueContainsNull = right.dataType.asInstanceOf[ArrayType].containsNull)
   }
 
+  override def stateful: Boolean = true
+
   private lazy val mapBuilder = new ArrayBasedMapBuilder(dataType.keyType, 
dataType.valueType)
 
   override def nullSafeEval(keyArray: Any, valueArray: Any): Any = {
@@ -563,6 +567,8 @@ case class StringToMap(text: Expression, pairDelim: 
Expression, keyValueDelim: E
 this(child, Literal(","), Literal(":"))
   }
 
+  override def stateful: Boolean = true
+
   override def first: Expression = text
   override def second: Expression = pairDelim
   override def third: Expression = keyValueDelim
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala
index fec1df108bcc..5b10b401af98 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala
@@ -918,6 +918,8 @@ case class TransformKeys(
 
   override def dataType: MapType = MapType(function.dataType, valueType, 
valueContainsNull)
 
+  override def stateful: Boolean = true
+
   override def checkInputDataTypes(): TypeCheckResult = {
 TypeUtils.checkForMapKeyType(function.dataType)
   }
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuit

(spark) branch master updated: [SPARK-41049][SQL][FOLLOW-UP] Mark map related expressions as stateful expressions

2024-05-27 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new af1ac1edc2a9 [SPARK-41049][SQL][FOLLOW-UP] Mark map related 
expressions as stateful expressions
af1ac1edc2a9 is described below

commit af1ac1edc2a96c9aba949e3100ddae37b6f0e5b2
Author: Rui Wang 
AuthorDate: Mon May 27 22:40:13 2024 -0700

[SPARK-41049][SQL][FOLLOW-UP] Mark map related expressions as stateful 
expressions

### What changes were proposed in this pull request?

MapConcat contains state, so it is stateful:
```
private lazy val mapBuilder = new ArrayBasedMapBuilder(dataType.keyType, dataType.valueType)
```

Similarly, `MapFromEntries`, `CreateMap`, `MapFromArrays`, `StringToMap`, and `TransformKeys` need the same change.
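
To illustrate why such expressions must be flagged, here is a minimal sketch in plain Scala (toy classes, not Catalyst's `Expression` API): if an object holding a mutable builder is shared across evaluations without being copied, state from one evaluation leaks into the next.

```
object StatefulExpressionSketch extends App {
  // Toy stand-in for an expression that keeps a mutable builder (like mapBuilder above).
  class MapBuildingExpr {
    private val builder = scala.collection.mutable.LinkedHashMap.empty[Int, Int]  // never reset
    def eval(row: Seq[(Int, Int)]): Map[Int, Int] = {
      row.foreach { case (k, v) => builder(k) = v }
      builder.toMap
    }
  }

  val shared = new MapBuildingExpr                 // one instance reused for every row
  println(shared.eval(Seq(1 -> 1)))                // Map(1 -> 1)
  println(shared.eval(Seq(2 -> 2)))                // Map(1 -> 1, 2 -> 2): leaked state from the first row

  // Using a fresh instance per evaluation avoids the leak, which is why statefulness must be visible to callers.
  println(new MapBuildingExpr().eval(Seq(2 -> 2))) // Map(2 -> 2)
}
```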

### Why are the changes needed?

Stateful expressions should be marked as stateful.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46721 from amaliujia/statefulexpr.

Authored-by: Rui Wang 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/expressions/collectionOperations.scala  |  3 +++
 .../spark/sql/catalyst/expressions/complexTypeCreator.scala|  6 ++
 .../spark/sql/catalyst/expressions/higherOrderFunctions.scala  |  2 ++
 .../src/test/scala/org/apache/spark/sql/DataFrameSuite.scala   | 10 +-
 4 files changed, 20 insertions(+), 1 deletion(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
index 632e2f3d3e97..ea117f876550 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
@@ -713,6 +713,7 @@ case class MapConcat(children: Seq[Expression])
 }
   }
 
+  override def stateful: Boolean = true
   override def nullable: Boolean = children.exists(_.nullable)
 
   private lazy val mapBuilder = new ArrayBasedMapBuilder(dataType.keyType, 
dataType.valueType)
@@ -828,6 +829,8 @@ case class MapFromEntries(child: Expression)
 
   override def nullable: Boolean = child.nullable || nullEntries
 
+  override def stateful: Boolean = true
+
   @transient override lazy val dataType: MapType = dataTypeDetails.get._1
 
   override def checkInputDataTypes(): TypeCheckResult = dataTypeDetails match {
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
index 4c0d00534060..167c02c0bafc 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
@@ -245,6 +245,8 @@ case class CreateMap(children: Seq[Expression], 
useStringTypeWhenEmpty: Boolean)
 
   private lazy val mapBuilder = new ArrayBasedMapBuilder(dataType.keyType, 
dataType.valueType)
 
+  override def stateful: Boolean = true
+
   override def eval(input: InternalRow): Any = {
 var i = 0
 while (i < keys.length) {
@@ -320,6 +322,8 @@ case class MapFromArrays(left: Expression, right: 
Expression)
   valueContainsNull = right.dataType.asInstanceOf[ArrayType].containsNull)
   }
 
+  override def stateful: Boolean = true
+
   private lazy val mapBuilder = new ArrayBasedMapBuilder(dataType.keyType, 
dataType.valueType)
 
   override def nullSafeEval(keyArray: Any, valueArray: Any): Any = {
@@ -568,6 +572,8 @@ case class StringToMap(text: Expression, pairDelim: 
Expression, keyValueDelim: E
 this(child, Literal(","), Literal(":"))
   }
 
+  override def stateful: Boolean = true
+
   override def first: Expression = text
   override def second: Expression = pairDelim
   override def third: Expression = keyValueDelim
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala
index 896f3e9774f3..80bcf156133e 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala
@@ -920,6 +920,8 @@ case class TransformKeys(
 
   override def dataType: MapType = MapType(function.dataType, valueType, 
valueContainsNull)
 
+  override def stateful: Boolean = true
+
   override 

svn commit: r69416 - in /dev/spark/v4.0.0-preview1-rc2-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/R/articles/ _site/api/R/articles/sparkr-vignettes_files/ _site/api/R/articles/sparkr-vignettes_

2024-05-27 Thread wenchen
Author: wenchen
Date: Tue May 28 05:29:59 2024
New Revision: 69416

Log:
Apache Spark v4.0.0-preview1-rc2 docs


[This commit notification would consist of 4816 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]




svn commit: r69415 - /dev/spark/v4.0.0-preview1-rc2-bin/

2024-05-27 Thread wenchen
Author: wenchen
Date: Tue May 28 04:31:46 2024
New Revision: 69415

Log:
Apache Spark v4.0.0-preview1-rc2

Added:
dev/spark/v4.0.0-preview1-rc2-bin/
dev/spark/v4.0.0-preview1-rc2-bin/SparkR_4.0.0-preview1.tar.gz   (with 
props)
dev/spark/v4.0.0-preview1-rc2-bin/SparkR_4.0.0-preview1.tar.gz.asc
dev/spark/v4.0.0-preview1-rc2-bin/SparkR_4.0.0-preview1.tar.gz.sha512
dev/spark/v4.0.0-preview1-rc2-bin/pyspark-4.0.0.dev1.tar.gz   (with props)
dev/spark/v4.0.0-preview1-rc2-bin/pyspark-4.0.0.dev1.tar.gz.asc
dev/spark/v4.0.0-preview1-rc2-bin/pyspark-4.0.0.dev1.tar.gz.sha512
dev/spark/v4.0.0-preview1-rc2-bin/spark-4.0.0-preview1-bin-hadoop3.tgz   
(with props)
dev/spark/v4.0.0-preview1-rc2-bin/spark-4.0.0-preview1-bin-hadoop3.tgz.asc

dev/spark/v4.0.0-preview1-rc2-bin/spark-4.0.0-preview1-bin-hadoop3.tgz.sha512

dev/spark/v4.0.0-preview1-rc2-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz   
(with props)

dev/spark/v4.0.0-preview1-rc2-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz.asc

dev/spark/v4.0.0-preview1-rc2-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz.sha512
dev/spark/v4.0.0-preview1-rc2-bin/spark-4.0.0-preview1.tgz   (with props)
dev/spark/v4.0.0-preview1-rc2-bin/spark-4.0.0-preview1.tgz.asc
dev/spark/v4.0.0-preview1-rc2-bin/spark-4.0.0-preview1.tgz.sha512

Added: dev/spark/v4.0.0-preview1-rc2-bin/SparkR_4.0.0-preview1.tar.gz
==
Binary file - no diff available.

Propchange: dev/spark/v4.0.0-preview1-rc2-bin/SparkR_4.0.0-preview1.tar.gz
--
svn:mime-type = application/octet-stream

Added: dev/spark/v4.0.0-preview1-rc2-bin/SparkR_4.0.0-preview1.tar.gz.asc
==
--- dev/spark/v4.0.0-preview1-rc2-bin/SparkR_4.0.0-preview1.tar.gz.asc (added)
+++ dev/spark/v4.0.0-preview1-rc2-bin/SparkR_4.0.0-preview1.tar.gz.asc Tue May 
28 04:31:46 2024
@@ -0,0 +1,17 @@
+-BEGIN PGP SIGNATURE-
+
+iQJHBAABCgAxFiEETclnbO+ag+mPygJ4TWYghDzYf1oFAmZVW6kTHHdlbmNoZW5A
+YXBhY2hlLm9yZwAKCRBNZiCEPNh/WvFCEACK8aTOoX+9lkY3dGTcjk9YY2RmlHzg
+k/gew4ZlH9liGSpXHcSPJ6iG/Su63rshfveHmM2ycKOkxLbzXbeRMBHtmQGI8sQf
+q+usChfuexo0GH4kBsMU0xtJkUe3SCwRdIr6aDq5eH6yxVEPNrqyzbCekwmqgE7y
+KV37qb7EQyq3sSZH0HFrAgEhgMMvQRRp/SD+WnHVoY4/dEtksZ4ip0TjXImKWIZG
+HowM6Xks7M/qXsnk2kXzbrSY/lpWbGcBVcTr3Hh+z0iYMS05ohXk8JRx7hMmhUGc
+sBcAYwupNzyai/lFWpToe17E6QI1mSIiG2CqOgtuYXs0za05673mZ6GcVwiLTrNz
+tGH0CBY2G+9iEjHYR51bJTlIs6J9KvHz/CJmO2OUk9s14LHGLpQ3DPbEiHQ/r9Ic
+Jb+WhDe/7Ajq8Ohq3bXm2fJIs7vDDC9bATFixaAY5o/jj6Q7hWeokZN7tyjFKigf
+yoTCtkXPa+WHz2JueiY21EXBu8pD/S0GKsy1wctyT4WiykBBB5M1Ue43D+UerKgK
+i/4UqZnEVTAVVMX1YkGgz5RxC6D/UdzmknNMbric7CFF/Imst0VYR7OBJtLRRl8X
+8REh4Am6wFfjgEhM0zZOCFZha1Dd9isHykzG2sRtf0BNXsGkRplPjMGmqGul71Vd
+4BHp/DiV4go/3g==
+=SsBi
+-END PGP SIGNATURE-

Added: dev/spark/v4.0.0-preview1-rc2-bin/SparkR_4.0.0-preview1.tar.gz.sha512
==
--- dev/spark/v4.0.0-preview1-rc2-bin/SparkR_4.0.0-preview1.tar.gz.sha512 
(added)
+++ dev/spark/v4.0.0-preview1-rc2-bin/SparkR_4.0.0-preview1.tar.gz.sha512 Tue 
May 28 04:31:46 2024
@@ -0,0 +1 @@
+dec2bf5ec07c86af950dcbe518be1fd5155d55c7a4c9b8e83c69e11dc2395806a18e526a3d2096c2d770569f0c2032d6fa96c7ade2ce83ded98ed6b5e26a
  SparkR_4.0.0-preview1.tar.gz

Added: dev/spark/v4.0.0-preview1-rc2-bin/pyspark-4.0.0.dev1.tar.gz
==
Binary file - no diff available.

Propchange: dev/spark/v4.0.0-preview1-rc2-bin/pyspark-4.0.0.dev1.tar.gz
--
svn:mime-type = application/octet-stream

Added: dev/spark/v4.0.0-preview1-rc2-bin/pyspark-4.0.0.dev1.tar.gz.asc
==
--- dev/spark/v4.0.0-preview1-rc2-bin/pyspark-4.0.0.dev1.tar.gz.asc (added)
+++ dev/spark/v4.0.0-preview1-rc2-bin/pyspark-4.0.0.dev1.tar.gz.asc Tue May 28 
04:31:46 2024
@@ -0,0 +1,17 @@
+-BEGIN PGP SIGNATURE-
+
+iQJHBAABCgAxFiEETclnbO+ag+mPygJ4TWYghDzYf1oFAmZVW6sTHHdlbmNoZW5A
+YXBhY2hlLm9yZwAKCRBNZiCEPNh/WgTOD/0UDo7uwuyjgxgtWNGD9uLxLs/JbhPp
+iNzKX4h3wWD0yyvfYGIt4HGbBRNTiJ071PgLhQe3HC+e5Di2bvjW4RAa6OzhssFF
+R6UVojuQK0AKR2+BidQNR/ODfx5wRyU5uPx8qJu7egjHepR+q/1NR+A/dyQpi93w
+5cpIqqWNC3pd/JSQ1nIBM0jxWJuGtmm0IvMvhwyRuUdpZzo2ONpEjJnlNn9tCR1Z
+xyRJnXnj/Zqd468E5Wn59iZJtwK7rSe1hNNYivLInEc+paDRZtKNz+xl/LWnXgzq
+R4eIiRAiOjZnQtfuZceXb3rftFbzcxkzD1hvb1MxQO+Vf/tAcste1G3d+RJdEhdg
+fPsOATbFe2K7+DHwwU1QnN2Pse/exuXCCa9KmJJXcGo8hnLEb2naDt3GuaweDb97
+CuwAqLcbwAJvng8G9RsZ8q+uKx06linFScOzgIw9Y8YzbubH4jy8PlgnZ+OYTM4p
+PYfj81c91/ZTv0KgPCkpPTpYkjZfQkrTzHF8rAodJT1EheyGfWvEotbmgwUqH8Gm
+nuNfkSmKBrzPpExUFvJiIlEapzg7C4u/mMO8WEOuLYKtwtOR9wwiZdPL0tTp16Ve
+luxFjEHKkzQzB/TyA6QsK1FO92PlCyXAXz7jHsccU7Fip

(spark) branch master updated: [SPARK-48239][INFRA][FOLLOWUP] install the missing `jq` library

2024-05-24 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 416d7f24fc35 [SPARK-48239][INFRA][FOLLOWUP] install the missing `jq` 
library
416d7f24fc35 is described below

commit 416d7f24fc354e912773ceb160210ad6a0c5fe99
Author: Wenchen Fan 
AuthorDate: Fri May 24 20:53:00 2024 -0700

[SPARK-48239][INFRA][FOLLOWUP] install the missing `jq` library

### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/46534. We
missed the `jq` library, which is needed to create git tags.

### Why are the changes needed?

fix bug

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

manual

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #46743 from cloud-fan/script.

Authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
---
 dev/create-release/release-util.sh | 3 +++
 dev/create-release/spark-rm/Dockerfile | 1 +
 2 files changed, 4 insertions(+)

diff --git a/dev/create-release/release-util.sh 
b/dev/create-release/release-util.sh
index 0394fb49c2fa..b5edbf40d487 100755
--- a/dev/create-release/release-util.sh
+++ b/dev/create-release/release-util.sh
@@ -128,6 +128,9 @@ function get_release_info {
 RC_COUNT=1
   fi
 
+  if [ "$GIT_BRANCH" = "master" ]; then
+RELEASE_VERSION="$RELEASE_VERSION-preview1"
+  fi
   export NEXT_VERSION
   export RELEASE_VERSION=$(read_config "Release" "$RELEASE_VERSION")
 
diff --git a/dev/create-release/spark-rm/Dockerfile 
b/dev/create-release/spark-rm/Dockerfile
index adaa4df3f579..5fdaf58feee2 100644
--- a/dev/create-release/spark-rm/Dockerfile
+++ b/dev/create-release/spark-rm/Dockerfile
@@ -58,6 +58,7 @@ RUN apt-get update && apt-get install -y \
 texinfo \
 texlive-latex-extra \
 qpdf \
+jq \
 r-base \
 ruby \
 ruby-dev \





(spark) tag v4.0.0-preview1-rc2 created (now 7cfe5a6e44e8)

2024-05-24 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to tag v4.0.0-preview1-rc2
in repository https://gitbox.apache.org/repos/asf/spark.git


  at 7cfe5a6e44e8 (commit)
This tag includes the following new commits:

 new 7cfe5a6e44e8 Preparing Spark release v4.0.0-preview1-rc2

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.



-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) 01/01: Preparing Spark release v4.0.0-preview1-rc2

2024-05-24 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to tag v4.0.0-preview1-rc2
in repository https://gitbox.apache.org/repos/asf/spark.git

commit 7cfe5a6e44e8d7079ae29ad3e2cee7231cd3dc66
Author: Wenchen Fan 
AuthorDate: Fri May 24 18:53:15 2024 +

Preparing Spark release v4.0.0-preview1-rc2
---
 R/pkg/R/sparkR.R   | 4 ++--
 assembly/pom.xml   | 2 +-
 common/kvstore/pom.xml | 2 +-
 common/network-common/pom.xml  | 2 +-
 common/network-shuffle/pom.xml | 2 +-
 common/network-yarn/pom.xml| 2 +-
 common/sketch/pom.xml  | 2 +-
 common/tags/pom.xml| 2 +-
 common/unsafe/pom.xml  | 2 +-
 common/utils/pom.xml   | 2 +-
 common/variant/pom.xml | 2 +-
 connector/avro/pom.xml | 2 +-
 connector/connect/client/jvm/pom.xml   | 2 +-
 connector/connect/common/pom.xml   | 2 +-
 connector/connect/server/pom.xml   | 2 +-
 connector/docker-integration-tests/pom.xml | 2 +-
 connector/kafka-0-10-assembly/pom.xml  | 2 +-
 connector/kafka-0-10-sql/pom.xml   | 2 +-
 connector/kafka-0-10-token-provider/pom.xml| 2 +-
 connector/kafka-0-10/pom.xml   | 2 +-
 connector/kinesis-asl-assembly/pom.xml | 2 +-
 connector/kinesis-asl/pom.xml  | 2 +-
 connector/profiler/pom.xml | 2 +-
 connector/protobuf/pom.xml | 2 +-
 connector/spark-ganglia-lgpl/pom.xml   | 2 +-
 core/pom.xml   | 2 +-
 docs/_config.yml   | 6 +++---
 examples/pom.xml   | 2 +-
 graphx/pom.xml | 2 +-
 hadoop-cloud/pom.xml   | 2 +-
 launcher/pom.xml   | 2 +-
 mllib-local/pom.xml| 2 +-
 mllib/pom.xml  | 2 +-
 pom.xml| 2 +-
 python/pyspark/version.py  | 2 +-
 repl/pom.xml   | 2 +-
 resource-managers/kubernetes/core/pom.xml  | 2 +-
 resource-managers/kubernetes/integration-tests/pom.xml | 2 +-
 resource-managers/yarn/pom.xml | 2 +-
 sql/api/pom.xml| 2 +-
 sql/catalyst/pom.xml   | 2 +-
 sql/core/pom.xml   | 2 +-
 sql/hive-thriftserver/pom.xml  | 2 +-
 sql/hive/pom.xml   | 2 +-
 streaming/pom.xml  | 2 +-
 tools/pom.xml  | 2 +-
 46 files changed, 49 insertions(+), 49 deletions(-)

diff --git a/R/pkg/R/sparkR.R b/R/pkg/R/sparkR.R
index 0be7e5da24d2..478acf514ef3 100644
--- a/R/pkg/R/sparkR.R
+++ b/R/pkg/R/sparkR.R
@@ -456,8 +456,8 @@ sparkR.session <- function(
 
   # Check if version number of SparkSession matches version number of SparkR 
package
   jvmVersion <- callJMethod(sparkSession, "version")
-  # Remove -SNAPSHOT from jvm versions
-  jvmVersionStrip <- gsub("-SNAPSHOT", "", jvmVersion, fixed = TRUE)
+  # Remove -preview1 from jvm versions
+  jvmVersionStrip <- gsub("-preview1", "", jvmVersion, fixed = TRUE)
   rPackageVersion <- paste0(packageVersion("SparkR"))
 
   if (jvmVersionStrip != rPackageVersion) {
diff --git a/assembly/pom.xml b/assembly/pom.xml
index 58e7ae5bb0c7..417e7c23ca9f 100644
--- a/assembly/pom.xml
+++ b/assembly/pom.xml
@@ -21,7 +21,7 @@
   
 org.apache.spark
 spark-parent_2.13
-4.0.0-SNAPSHOT
+4.0.0-preview1
 ../pom.xml
   
 
diff --git a/common/kvstore/pom.xml b/common/kvstore/pom.xml
index 046648e9c2ae..e1a4497387a2 100644
--- a/common/kvstore/pom.xml
+++ b/common/kvstore/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.13
-4.0.0-SNAPSHOT
+4.0.0-preview1
 ../../pom.xml
   
 
diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml
index cdb5bd72158a..d8dff6996cec 100644
--- a/common/network-common/pom.xml
+++ b/common/network-common/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.13
-4.0.0-SNAPSHOT
+4.0.0-preview1
 ../../pom.xml
   
 
diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml
index 0f7036ef

(spark) tag v4.0.0-preview-rc1 deleted (was 9fec87d16a04)

2024-05-24 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to tag v4.0.0-preview-rc1
in repository https://gitbox.apache.org/repos/asf/spark.git


*** WARNING: tag v4.0.0-preview-rc1 was deleted! ***

 was 9fec87d16a04 Preparing Spark release v4.0.0-preview-rc1

This change permanently discards the following revisions:

 discard 9fec87d16a04 Preparing Spark release v4.0.0-preview-rc1


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48364][SQL] Add AbstractMapType type casting and fix RaiseError parameter map to work with collated strings

2024-05-22 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6be3560f3c89 [SPARK-48364][SQL] Add AbstractMapType type casting and 
fix RaiseError parameter map to work with collated strings
6be3560f3c89 is described below

commit 6be3560f3c89e212e850a0788d24a7c0755ea35b
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Wed May 22 05:21:23 2024 -0700

[SPARK-48364][SQL] Add AbstractMapType type casting and fix RaiseError 
parameter map to work with collated strings

### What changes were proposed in this pull request?
Following up on the introduction of AbstractMapType (https://github.com/apache/spark/pull/46458) and the changes that introduce collation awareness for the RaiseError expression (https://github.com/apache/spark/pull/46461), this PR adds the appropriate type casting rules for AbstractMapType.

### Why are the changes needed?
Fix the CI failure for the `Support RaiseError misc expression with 
collation` test when ANSI is off.

### Does this PR introduce _any_ user-facing change?
Yes, type casting is now allowed for map types with collated strings.
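
As a hedged illustration (not taken from the patch), the new casting rule is what allows a lookup key with the default collation to be unified with explicitly collated map keys:

```sql
-- illustrative sketch: the lookup key 'A' is cast to the UNICODE_CI-collated
-- string type of the map keys before the comparison happens, so the
-- case-insensitive collation should make the lookup match key 'a'
SELECT element_at(map(collate('a', 'UNICODE_CI'), 1), 'A');
```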

### How was this patch tested?
Extended suite `CollationSQLExpressionsANSIOffSuite` with ANSI disabled.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46661 from uros-db/fix-abstract-map.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/analysis/CollationTypeCasts.scala | 15 -
 .../spark/sql/catalyst/analysis/TypeCoercion.scala | 13 +--
 .../spark/sql/catalyst/expressions/misc.scala  |  4 ++--
 .../spark/sql/CollationSQLExpressionsSuite.scala   | 10 +++--
 .../org/apache/spark/sql/CollationSuite.scala  | 25 ++
 5 files changed, 37 insertions(+), 30 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
index a50dad7c8cdb..00abdf4ee19d 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
@@ -25,7 +25,7 @@ import 
org.apache.spark.sql.catalyst.analysis.TypeCoercion.{hasStringType, haveS
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.errors.QueryCompilationErrors
 import org.apache.spark.sql.internal.SQLConf
-import org.apache.spark.sql.types.{ArrayType, DataType, StringType}
+import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StringType}
 
 object CollationTypeCasts extends TypeCoercionRule {
   override val transform: PartialFunction[Expression, Expression] = {
@@ -85,6 +85,11 @@ object CollationTypeCasts extends TypeCoercionRule {
   private def extractStringType(dt: DataType): StringType = dt match {
 case st: StringType => st
 case ArrayType(et, _) => extractStringType(et)
+case MapType(kt, vt, _) => if (hasStringType(kt)) {
+extractStringType(kt)
+  } else {
+extractStringType(vt)
+  }
   }
 
   /**
@@ -102,6 +107,14 @@ object CollationTypeCasts extends TypeCoercionRule {
   case st: StringType if st.collationId != castType.collationId => castType
   case ArrayType(arrType, nullable) =>
 castStringType(arrType, castType).map(ArrayType(_, nullable)).orNull
+  case MapType(keyType, valueType, nullable) =>
+val newKeyType = castStringType(keyType, castType).getOrElse(keyType)
+val newValueType = castStringType(valueType, 
castType).getOrElse(valueType)
+if (newKeyType != keyType || newValueType != valueType) {
+  MapType(newKeyType, newValueType, nullable)
+} else {
+  null
+}
   case _ => null
 }
 Option(ret)
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
index 936bb22baa46..7866f47c28b1 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
@@ -31,7 +31,7 @@ import org.apache.spark.sql.catalyst.trees.AlwaysProcess
 import org.apache.spark.sql.catalyst.types.DataTypeUtils
 import org.apache.spark.sql.errors.QueryCompilationErrors
 import org.apache.spark.sql.internal.SQLConf
-import org.apache.spark.sql.internal.types.{AbstractArrayType, 
AbstractStringType, StringTypeAnyCollation}
+import or

(spark) branch master updated: [SPARK-48215][SQL] Extending support for collated strings on date_format expression

2024-05-22 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e04d3d7c430a [SPARK-48215][SQL] Extending support for collated strings 
on date_format expression
e04d3d7c430a is described below

commit e04d3d7c430a1fa446f0379680f619b8b14b5eb5
Author: Nebojsa Savic 
AuthorDate: Wed May 22 04:28:06 2024 -0700

[SPARK-48215][SQL] Extending support for collated strings on date_format 
expression

### What changes were proposed in this pull request?
We are extending support for collated strings in the date_format function, since it currently throws a DATATYPE_MISMATCH exception when a collated string is passed as the "format" parameter. 
https://docs.databricks.com/en/sql/language-manual/functions/date_format.html
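
A minimal sketch of the call shape this enables, mirroring the test cases added below (the format string and collation name are illustrative):

```sql
-- previously this raised DATATYPE_MISMATCH because the format argument is collated
SELECT date_format('2021-01-01', collate('yyyy-MM-dd', 'UNICODE_CI'));
-- expected result: 2021-01-01
```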

### Why are the changes needed?
An exception is thrown when collated strings are passed as arguments to date_format.

### Does this PR introduce _any_ user-facing change?
No user-facing changes; this only extends support to collated inputs.

### How was this patch tested?
Tests are added with this PR.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46561 from nebojsa-db/SPARK-48215.

Authored-by: Nebojsa Savic 
Signed-off-by: Wenchen Fan 
---
 .../catalyst/expressions/datetimeExpressions.scala |  5 ++--
 .../spark/sql/CollationSQLExpressionsSuite.scala   | 32 ++
 2 files changed, 35 insertions(+), 2 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
index 081a42f5608e..8caf8c5d48c2 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
@@ -36,6 +36,7 @@ import org.apache.spark.sql.catalyst.util.DateTimeUtils._
 import org.apache.spark.sql.catalyst.util.LegacyDateFormats.SIMPLE_DATE_FORMAT
 import org.apache.spark.sql.errors.{QueryCompilationErrors, 
QueryExecutionErrors}
 import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.internal.types.StringTypeAnyCollation
 import org.apache.spark.sql.types._
 import org.apache.spark.sql.types.DayTimeIntervalType.DAY
 import org.apache.spark.unsafe.types.{CalendarInterval, UTF8String}
@@ -951,9 +952,9 @@ case class DateFormatClass(left: Expression, right: 
Expression, timeZoneId: Opti
 
   def this(left: Expression, right: Expression) = this(left, right, None)
 
-  override def dataType: DataType = StringType
+  override def dataType: DataType = SQLConf.get.defaultStringType
 
-  override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType, 
StringType)
+  override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType, 
StringTypeAnyCollation)
 
   override def withTimeZone(timeZoneId: String): TimeZoneAwareExpression =
 copy(timeZoneId = Option(timeZoneId))
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
index 0d48f9f0a88d..828245bb3fdd 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
@@ -1600,6 +1600,38 @@ class CollationSQLExpressionsSuite
 })
   }
 
+  test("DateFormat expression with collation") {
+case class DateFormatTestCase[R](date: String, format: String, collation: 
String, result: R)
+val testCases = Seq(
+  DateFormatTestCase("2021-01-01", "-MM-dd", "UTF8_BINARY", 
"2021-01-01"),
+  DateFormatTestCase("2021-01-01", "-dd", "UTF8_BINARY_LCASE", 
"2021-01"),
+  DateFormatTestCase("2021-01-01", "-MM-dd", "UNICODE", "2021-01-01"),
+  DateFormatTestCase("2021-01-01", "", "UNICODE_CI", "2021")
+)
+
+for {
+  collateDate <- Seq(true, false)
+  collateFormat <- Seq(true, false)
+} {
+  testCases.foreach(t => {
+val dateArg = if (collateDate) s"collate('${t.date}', 
'${t.collation}')" else s"'${t.date}'"
+val formatArg =
+  if (collateFormat) {
+s"collate('${t.format}', '${t.collation}')"
+  } else {
+s"'${t.format}'"
+  }
+
+withSQLConf(SqlApiConf.DEFAULT_COLLATION -> t.collation) {
+  val query = s"SELECT date_format(${dateArg}, ${formatArg})"
+

(spark) branch master updated: [SPARK-48031] Decompose viewSchemaMode config, add SHOW CREATE TABLE support

2024-05-20 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 617ac1aec748 [SPARK-48031] Decompose viewSchemaMode config, add SHOW 
CREATE TABLE support
617ac1aec748 is described below

commit 617ac1aec7481d6063af539b02980692e98beb70
Author: Serge Rielau 
AuthorDate: Mon May 20 16:01:24 2024 +0800

[SPARK-48031] Decompose viewSchemaMode config, add SHOW CREATE TABLE support

### What changes were proposed in this pull request?

We separate enabling the WITH SCHEMA ... clause from changing the default from SCHEMA BINDING to SCHEMA COMPENSATION.
This allows users to upgrade in three steps (see the sketch below):
1. Enable the feature, and deal with DESCRIBE EXTENDED.
2. Use ALTER VIEW to set SCHEMA BINDING for those views they aim to keep in that mode.
3. Switch the default.
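
A hedged sketch of the statements involved in step 2, assuming the `WITH SCHEMA BINDING` clause exercised by the view-schema-binding tests (the view name is illustrative):

```sql
-- pin an existing view to the old, strict behavior before switching the default
ALTER VIEW my_view WITH SCHEMA BINDING;

-- with this PR, the chosen binding mode also shows up in the generated DDL
SHOW CREATE TABLE my_view;
```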

### Why are the changes needed?

It allows customers to upgrade more safely.

### Does this PR introduce _any_ user-facing change?

Yes

### How was this patch tested?

Added more tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46652 from srielau/SPARK-48031-view-evolutiion-part2.

Lead-authored-by: Serge Rielau 
Co-authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
---
 docs/sql-migration-guide.md|   3 +-
 .../sql/catalyst/catalog/SessionCatalog.scala  |   6 +-
 .../spark/sql/catalyst/catalog/interface.scala |   6 +-
 .../spark/sql/catalyst/parser/AstBuilder.scala |  14 +-
 .../org/apache/spark/sql/internal/SQLConf.scala|  26 ++-
 .../spark/sql/execution/command/tables.scala   |   7 +
 .../view-schema-binding-config.sql.out | 166 +--
 .../analyzer-results/view-schema-binding.sql.out   |  24 +--
 .../inputs/view-schema-binding-config.sql  |  52 +++--
 .../sql-tests/inputs/view-schema-binding.sql   |   2 +-
 .../sql-tests/results/charvarchar.sql.out  |   1 +
 .../sql-tests/results/show-create-table.sql.out|   6 +
 .../results/view-schema-binding-config.sql.out | 231 ++---
 .../sql-tests/results/view-schema-binding.sql.out  |  25 +--
 .../apache/spark/sql/execution/SQLViewSuite.scala  |   2 +-
 .../spark/sql/execution/SQLViewTestSuite.scala |   7 +-
 16 files changed, 453 insertions(+), 125 deletions(-)

diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index 15205e9284cd..02a4fae5d262 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -54,7 +54,8 @@ license: |
 - Since Spark 4.0, The default value for 
`spark.sql.legacy.ctePrecedencePolicy` has been changed from `EXCEPTION` to 
`CORRECTED`. Instead of raising an error, inner CTE definitions take precedence 
over outer definitions.
 - Since Spark 4.0, The default value for `spark.sql.legacy.timeParserPolicy` 
has been changed from `EXCEPTION` to `CORRECTED`. Instead of raising an 
`INCONSISTENT_BEHAVIOR_CROSS_VERSION` error, `CANNOT_PARSE_TIMESTAMP` will be 
raised if ANSI mode is enable. `NULL` will be returned if ANSI mode is 
disabled. See [Datetime Patterns for Formatting and 
Parsing](sql-ref-datetime-pattern.html).
 - Since Spark 4.0, A bug falsely allowing `!` instead of `NOT` when `!` is not 
a prefix operator has been fixed. Clauses such as `expr ! IN (...)`, `expr ! 
BETWEEN ...`, or `col ! NULL` now raise syntax errors. To restore the previous 
behavior, set `spark.sql.legacy.bangEqualsNot` to `true`. 
-- Since Spark 4.0, Views allow control over how they react to underlying query 
changes. By default views tolerate column type changes in the query and 
compensate with casts. To restore the previous behavior, allowing up-casts 
only, set `spark.sql.viewSchemaBindingMode` to `DISABLED`. This disables the 
feature and also disallows the `WITH SCHEMA` clause.
+- Since Spark 4.0, By default views tolerate column type changes in the query 
and compensate with casts. To restore the previous behavior, allowing up-casts 
only, set `spark.sql.legacy.viewSchemaCompensation` to `false`.
+- Since Spark 4.0, Views allow control over how they react to underlying query 
changes. By default views tolerate column type changes in the query and 
compensate with casts. To disable this feature, set 
`spark.sql.legacy.viewSchemaBindingMode` to `false`. This also removes the 
clause from `DESCRIBE EXTENDED` and `SHOW CREATE TABLE`.
 
 ## Upgrading from Spark SQL 3.5.1 to 3.5.2
 
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
index 96883afcfc5c..dbf2102a183a 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog

(spark) branch master updated: [SPARK-48305][SQL] Add collation support for CurrentLike expressions

2024-05-20 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6a17b794338b [SPARK-48305][SQL] Add collation support for CurrentLike 
expressions
6a17b794338b is described below

commit 6a17b794338b0473c11ae17e5c8f1450c0b3f358
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Mon May 20 15:51:23 2024 +0800

[SPARK-48305][SQL] Add collation support for CurrentLike expressions

### What changes were proposed in this pull request?
Introduce collation awareness for CurrentLike expressions: 
current_database/current_schema, current_catalog, 
user/current_user/session_user.

### Why are the changes needed?
Add collation support for CurrentLike expressions in Spark.

### Does this PR introduce _any_ user-facing change?
Yes. The CurrentLike functions (current_database/current_schema, current_catalog, user/current_user/session_user) take no arguments, but their results now use the session default string type, so they follow the configured default collation.
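
As a hedged illustration (not taken from the patch): since these expressions now produce strings of the session default string type, their results take part in collation-aware comparisons like any other string value:

```sql
-- illustrative only: UNICODE_CI compares case-insensitively, so this should be
-- true in the default 'default' database; the literal's explicit collation wins
-- during collation type coercion
SELECT current_database() = collate('DEFAULT', 'UNICODE_CI');
```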

### How was this patch tested?
E2e sql tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46613 from uros-db/current-like-expressions.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../org/apache/spark/sql/catalyst/expressions/misc.scala |  6 +++---
 .../spark/sql/catalyst/optimizer/finishAnalysis.scala|  7 ---
 .../apache/spark/sql/CollationSQLExpressionsSuite.scala  | 16 
 3 files changed, 23 insertions(+), 6 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
index eda65ae48f00..e9fa362de14c 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
@@ -200,7 +200,7 @@ object AssertTrue {
   since = "1.6.0",
   group = "misc_funcs")
 case class CurrentDatabase() extends LeafExpression with Unevaluable {
-  override def dataType: DataType = StringType
+  override def dataType: DataType = SQLConf.get.defaultStringType
   override def nullable: Boolean = false
   override def prettyName: String = "current_schema"
   final override val nodePatterns: Seq[TreePattern] = Seq(CURRENT_LIKE)
@@ -219,7 +219,7 @@ case class CurrentDatabase() extends LeafExpression with 
Unevaluable {
   since = "3.1.0",
   group = "misc_funcs")
 case class CurrentCatalog() extends LeafExpression with Unevaluable {
-  override def dataType: DataType = StringType
+  override def dataType: DataType = SQLConf.get.defaultStringType
   override def nullable: Boolean = false
   override def prettyName: String = "current_catalog"
   final override val nodePatterns: Seq[TreePattern] = Seq(CURRENT_LIKE)
@@ -335,7 +335,7 @@ case class TypeOf(child: Expression) extends 
UnaryExpression {
 // scalastyle:on line.size.limit
 case class CurrentUser() extends LeafExpression with Unevaluable {
   override def nullable: Boolean = false
-  override def dataType: DataType = StringType
+  override def dataType: DataType = SQLConf.get.defaultStringType
   override def prettyName: String =
 getTagValue(FunctionRegistry.FUNC_ALIAS).getOrElse("current_user")
   final override val nodePatterns: Seq[TreePattern] = Seq(CURRENT_LIKE)
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala
index 92ac7599a8ff..48753fbfe326 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala
@@ -33,6 +33,7 @@ import org.apache.spark.sql.catalyst.util.DateTimeUtils
 import org.apache.spark.sql.catalyst.util.DateTimeUtils.{convertSpecialDate, 
convertSpecialTimestamp, convertSpecialTimestampNTZ, instantToMicros, 
localDateTimeToMicros}
 import org.apache.spark.sql.catalyst.util.TypeUtils.toSQLExpr
 import org.apache.spark.sql.connector.catalog.CatalogManager
+import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types._
 
 
@@ -151,11 +152,11 @@ case class ReplaceCurrentLike(catalogManager: 
CatalogManager) extends Rule[Logic
 
 plan.transformAllExpressionsWithPruning(_.containsPattern(CURRENT_LIKE)) {
   case CurrentDatabase() =>
-Literal.create(currentNamespace, StringType)
+Literal.create(currentNamespace, SQLConf.get.defaultStringType)
   case CurrentCatalog() =>
- 

(spark) branch master updated: [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE

2024-05-18 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6f6b4860268d [SPARK-48175][SQL][PYTHON] Store collation information in 
metadata and not in type for SER/DE
6f6b4860268d is described below

commit 6f6b4860268dc250d8e31a251d740733798aa512
Author: Stefan Kandic 
AuthorDate: Sat May 18 15:17:56 2024 +0800

[SPARK-48175][SQL][PYTHON] Store collation information in metadata and not 
in type for SER/DE

### What changes were proposed in this pull request?
Changing serialization and deserialization of collated strings so that the 
collation information is put in the metadata of the enclosing struct field - 
and then read back from there during parsing.

Format of serialization will look something like this:
```json
{
  "type": "struct",
  "fields": [
"name": "colName",
"type": "string",
"nullable": true,
"metadata": {
  "__COLLATIONS": {
"colName": "UNICODE"
  }
}
  ]
}
```

If we have a map we will add suffixes `.key` and `.value` in the metadata:
```json
{
  "type": "struct",
  "fields": [
{
  "name": "mapField",
  "type": {
"type": "map",
"keyType": "string",
"valueType": "string",
"valueContainsNull": true
  },
  "nullable": true,
  "metadata": {
"__COLLATIONS": {
  "mapField.key": "UNICODE",
  "mapField.value": "UNICODE"
}
  }
}
  ]
}
```
It is a similar story for arrays (we add the `.element` suffix). We could have multiple suffixes when working with deeply nested data types, e.g. Map[String, Array[Array[String]]] (see the tests for this example).

### Why are the changes needed?
Putting collation info in field metadata is the only way to avoid breaking old clients that read new tables with collations. `CharVarcharUtils` does a similar thing, but this approach is much less hacky and friendlier to all third-party clients, which is especially important since Delta also uses Spark for schema ser/de.

It will also remove the need for the additional logic introduced in #46083 to strip collations before writing to HMS, as the tables will now be fully HMS compatible.
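
A hedged user-level sketch (the `STRING COLLATE` column syntax and table name are assumptions, not part of this patch): old readers that know nothing about collations still see a plain `string` column, because the collation only lives in the field metadata.

```sql
-- illustrative only: the serialized schema of this table keeps "type": "string"
-- and records the collation under the "__COLLATIONS" metadata key instead
CREATE TABLE collated_tbl (colName STRING COLLATE UNICODE) USING parquet;
```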

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
With unit tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46280 from stefankandic/newDeltaSchema.

Lead-authored-by: Stefan Kandic 
Co-authored-by: Stefan Kandic 
<154237371+stefankan...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/util/CollationFactory.java  |  99 +++-
 .../src/main/resources/error/error-conditions.json |  12 +
 python/pyspark/errors/error-conditions.json|  10 +
 .../pyspark/sql/tests/connect/test_parity_types.py |   4 +
 python/pyspark/sql/tests/test_types.py | 249 +++--
 python/pyspark/sql/types.py| 178 +--
 .../org/apache/spark/sql/types/DataType.scala  |  74 +-
 .../org/apache/spark/sql/types/StringType.scala|   7 +
 .../org/apache/spark/sql/types/StructField.scala   |  62 -
 .../org/apache/spark/sql/types/DataTypeSuite.scala | 181 ++-
 .../apache/spark/sql/types/StructTypeSuite.scala   | 183 +++
 .../streaming/StreamingDeduplicationSuite.scala|   2 +-
 .../spark/sql/streaming/StreamingQuerySuite.scala  |   2 +-
 13 files changed, 1004 insertions(+), 59 deletions(-)

diff --git 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
index 863445b6..0133c3feb611 100644
--- 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
+++ 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
@@ -36,11 +36,62 @@ import org.apache.spark.unsafe.types.UTF8String;
  * Provides functionality to the UTF8String object which respects defined 
collation settings.
  */
 public final class CollationFactory {
+
+  /**
+   * Identifier for single a collation.
+   */
+  public static class CollationIdentifier {
+private final String provider;
+private final String name

(spark) branch master updated (15fb4787354a -> 3edd6c7e1d50)

2024-05-17 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 15fb4787354a [SPARK-48321][CONNECT][TESTS] Avoid using deprecated 
methods in dsl
 add 3edd6c7e1d50 [SPARK-48312][SQL] Improve 
Alias.removeNonInheritableMetadata performance

No new revisions were added by this update.

Summary of changes:
 .../main/scala/org/apache/spark/sql/types/Metadata.scala   |  7 +++
 .../spark/sql/catalyst/expressions/namedExpressions.scala  | 14 +++---
 2 files changed, 18 insertions(+), 3 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48308][CORE] Unify getting data schema without partition columns in FileSourceStrategy

2024-05-16 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 57948c865e06 [SPARK-48308][CORE] Unify getting data schema without 
partition columns in FileSourceStrategy
57948c865e06 is described below

commit 57948c865e064469a75c92f8b58c632b9b40fdd3
Author: Johan Lasperas 
AuthorDate: Thu May 16 22:38:02 2024 +0800

[SPARK-48308][CORE] Unify getting data schema without partition columns in 
FileSourceStrategy

### What changes were proposed in this pull request?
Compute the schema of the data without partition columns only once in 
FileSourceStrategy.

### Why are the changes needed?
In FileSourceStrategy, the schema of the data excluding partition columns is computed twice, in slightly different ways: once using an AttributeSet (`partitionSet`) and once using the attributes directly (`partitionColumns`).
These do not have exactly the same semantics: an AttributeSet compares only expression ids, while comparing the actual attributes also takes the name, type, nullability, and metadata into account. We want the former here.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46619 from johanl-db/reuse-schema-without-partition-columns.

Authored-by: Johan Lasperas 
Signed-off-by: Wenchen Fan 
---
 .../apache/spark/sql/execution/datasources/FileSourceStrategy.scala| 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
index 8333c276cdd8..d31cb111924b 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
@@ -216,9 +216,8 @@ object FileSourceStrategy extends Strategy with 
PredicateHelper with Logging {
   val requiredExpressions: Seq[NamedExpression] = filterAttributes.toSeq 
++ projects
   val requiredAttributes = AttributeSet(requiredExpressions)
 
-  val readDataColumns = dataColumns
+  val readDataColumns = dataColumnsWithoutPartitionCols
 .filter(requiredAttributes.contains)
-.filterNot(partitionColumns.contains)
 
   // Metadata attributes are part of a column of type struct up to this 
point. Here we extract
   // this column from the schema and specify a matcher for that.


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (fa83d0f8fce7 -> 4be0828e6e6a)

2024-05-16 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from fa83d0f8fce7 [SPARK-48296][SQL] Codegen Support for `to_xml`
 add 4be0828e6e6a [SPARK-48288] Add source data type for connector cast 
expression

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/connector/expressions/Cast.java   | 18 +-
 .../sql/connector/util/V2ExpressionSQLBuilder.java |  6 +++---
 .../spark/sql/catalyst/util/V2ExpressionBuilder.scala  |  2 +-
 .../scala/org/apache/spark/sql/jdbc/JdbcDialects.scala |  4 ++--
 4 files changed, 23 insertions(+), 7 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48252][SQL] Update CommonExpressionRef when necessary

2024-05-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ca3593288d57 [SPARK-48252][SQL] Update CommonExpressionRef when 
necessary
ca3593288d57 is described below

commit ca3593288d577435a193f356b5214cf6f4bd534a
Author: Wenchen Fan 
AuthorDate: Thu May 16 09:42:36 2024 +0800

[SPARK-48252][SQL] Update CommonExpressionRef when necessary

### What changes were proposed in this pull request?

The `With` expression assumes that it is created after all input expressions are fully resolved. This is mostly true (function lookup happens after the function's input expressions are resolved), but there is a special case of column resolution in HAVING: we use `TempResolvedColumn` to try one column resolution option. If it doesn't work, the column is re-resolved and may end up with a different data type. The `With` expression should update its refs when this happens.
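
A hedged SQL illustration of the re-resolution path (table and column names are made up): in the query below, `a` in the HAVING clause is first tentatively resolved to the integer column `t.a` via `TempResolvedColumn` and is then re-resolved to the double-typed aggregate alias. If such a reference sits inside a `With` common expression (for example one introduced by a function rewrite), the refs must be updated to the new data type.

```sql
-- illustrative only; mirrors the shape of the new unit test
SELECT avg(a) AS a FROM t GROUP BY b HAVING a = 1;
```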

### Why are the changes needed?

Bug fix; otherwise the query fails.

### Does this PR introduce _any_ user-facing change?

This feature is not released yet.

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #46552 from cloud-fan/with.

Lead-authored-by: Wenchen Fan 
Co-authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
---
 .../apache/spark/sql/catalyst/expressions/With.scala   | 18 +-
 .../optimizer/RewriteWithExpressionSuite.scala | 14 ++
 2 files changed, 31 insertions(+), 1 deletion(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/With.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/With.scala
index 29794b33641c..5f6f9afa5797 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/With.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/With.scala
@@ -40,7 +40,23 @@ case class With(child: Expression, defs: 
Seq[CommonExpressionDef])
   override def children: Seq[Expression] = child +: defs
   override protected def withNewChildrenInternal(
   newChildren: IndexedSeq[Expression]): Expression = {
-copy(child = newChildren.head, defs = 
newChildren.tail.map(_.asInstanceOf[CommonExpressionDef]))
+val newDefs = newChildren.tail.map(_.asInstanceOf[CommonExpressionDef])
+// If any `CommonExpressionDef` has been updated (data type or 
nullability), also update its
+// `CommonExpressionRef` in the `child`.
+val newChild = newDefs.filter(_.resolved).foldLeft(newChildren.head) { 
(result, newDef) =>
+  defs.find(_.id == newDef.id).map { oldDef =>
+if (newDef.dataType != oldDef.dataType || newDef.nullable != 
oldDef.nullable) {
+  val newRef = new CommonExpressionRef(newDef)
+  result.transform {
+case oldRef: CommonExpressionRef if oldRef.id == newRef.id =>
+  newRef
+  }
+} else {
+  result
+}
+  }.getOrElse(result)
+}
+copy(child = newChild, defs = newDefs)
   }
 
   /**
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala
index aa8ffb2b0454..0aeca961aa51 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala
@@ -18,6 +18,7 @@
 package org.apache.spark.sql.catalyst.optimizer
 
 import org.apache.spark.SparkException
+import org.apache.spark.sql.catalyst.analysis.TempResolvedColumn
 import org.apache.spark.sql.catalyst.dsl.expressions._
 import org.apache.spark.sql.catalyst.dsl.plans._
 import org.apache.spark.sql.catalyst.expressions._
@@ -438,4 +439,17 @@ class RewriteWithExpressionSuite extends PlanTest {
   Optimizer.execute(plan)
 }
   }
+
+  test("SPARK-48252: TempResolvedColumn in common expression") {
+val a = testRelation.output.head
+val tempResolved = TempResolvedColumn(a, Seq("a"))
+val expr = With(tempResolved) { case Seq(ref) =>
+  ref === 1
+}
+val plan = testRelation.having($"b")(avg("a").as("a"))(expr).analyze
+comparePlans(
+  Optimizer.execute(plan),
+  testRelation.groupBy($"b")(avg("a").as("a")).where($"a" === 1).analyze
+)
+  }
 }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.4 updated: [SPARK-48172][SQL] Fix escaping issues in JDBCDialects

2024-05-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 0e7156d2d801 [SPARK-48172][SQL] Fix escaping issues in JDBCDialects
0e7156d2d801 is described below

commit 0e7156d2d80171876c7a5e674349c53ee013be38
Author: Mihailo Milosevic 
AuthorDate: Wed May 15 22:15:52 2024 +0800

[SPARK-48172][SQL] Fix escaping issues in JDBCDialects

This PR is a fix of https://github.com/apache/spark/pull/46437. The 
previous PR was reverted as `LONGTEXT` is not supported by all dialects.

Special-case escaping for MySQL and fix issues with redundant escaping of the ' character.
New changes introduced in the fix include changing `LONGTEXT` -> `VARCHAR(50)`, as well as a fix for table naming in the tests.

When pushing down startsWith, endsWith, and contains, they are converted to LIKE. This requires adding escape characters to these expressions. Unfortunately, MySQL uses the ESCAPE '\' syntax instead of ESCAPE '', which would cause errors when trying to push down.

Yes

Tests for each existing dialect.

No.

Closes #46588 from mihailom-db/SPARK-48172.

Authored-by: Mihailo Milosevic 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 9e386b472981979e368a5921c58da5bfefe3acfe)
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/jdbc/v2/DB2IntegrationSuite.scala|   6 +
 .../sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala |  11 +
 .../sql/jdbc/v2/MsSqlServerIntegrationSuite.scala  |   6 +
 .../spark/sql/jdbc/v2/MySQLIntegrationSuite.scala  |   6 +
 .../spark/sql/jdbc/v2/OracleIntegrationSuite.scala |   6 +
 .../sql/jdbc/v2/PostgresIntegrationSuite.scala |   6 +
 .../org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala  | 229 +
 .../sql/connector/util/V2ExpressionSQLBuilder.java |   3 -
 .../sql/connector/expressions/expressions.scala|   4 +-
 .../org/apache/spark/sql/jdbc/H2Dialect.scala  |   7 -
 .../org/apache/spark/sql/jdbc/MySQLDialect.scala   |  15 ++
 .../org/apache/spark/sql/jdbc/JDBCV2Suite.scala|   6 +-
 12 files changed, 291 insertions(+), 14 deletions(-)

diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
index 1a25cd2802dd..fd99bb2a3bc5 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
@@ -67,6 +67,12 @@ class DB2IntegrationSuite extends 
DockerJDBCIntegrationV2Suite with V2JDBCTest {
 connection.prepareStatement(
   "CREATE TABLE employee (dept INTEGER, name VARCHAR(10), salary 
DECIMAL(20, 2), bonus DOUBLE)")
   .executeUpdate()
+connection.prepareStatement(
+  s"""CREATE TABLE pattern_testing_table (
+ |pattern_testing_col VARCHAR(50)
+ |)
+   """.stripMargin
+).executeUpdate()
   }
 
   override def testUpdateColumnType(tbl: String): Unit = {
diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
index 72edfc9f1bf1..5f4f0b7a3afb 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
@@ -38,6 +38,17 @@ abstract class DockerJDBCIntegrationV2Suite extends 
DockerJDBCIntegrationSuite {
   .executeUpdate()
 connection.prepareStatement("INSERT INTO employee VALUES (6, 'jen', 12000, 
1200)")
   .executeUpdate()
+
+connection.prepareStatement(
+  s"""
+ |INSERT INTO pattern_testing_table VALUES
+ |('special_character_quote''_present'),
+ |('special_character_quote_not_present'),
+ |('special_character_percent%_present'),
+ |('special_character_percent_not_present'),
+ |('special_character_underscore_present'),
+ |('special_character_underscorenot_present')
+ """.stripMargin).executeUpdate()
   }
 
   def tablePreparation(connection: Connection): Unit
diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
index a527c6f8cb5b..51f31220d9a5 100644
--

(spark) branch branch-3.5 updated: [SPARK-48172][SQL] Fix escaping issues in JDBCDialects

2024-05-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 210ed2521d3d [SPARK-48172][SQL] Fix escaping issues in JDBCDialects
210ed2521d3d is described below

commit 210ed2521d3dc1202cd1ba855ed5e729a5d940d0
Author: Mihailo Milosevic 
AuthorDate: Wed May 15 22:15:52 2024 +0800

[SPARK-48172][SQL] Fix escaping issues in JDBCDialects

This PR is a fix of https://github.com/apache/spark/pull/46437. The 
previous PR was reverted as `LONGTEXT` is not supported by all dialects.

Special-case escaping for MySQL and fix issues with redundant escaping of the ' character.
New changes introduced in the fix include changing `LONGTEXT` -> `VARCHAR(50)`, as well as a fix for table naming in the tests.

When pushing down startsWith, endsWith, and contains, they are converted to LIKE. This requires adding escape characters to these expressions. Unfortunately, MySQL uses the ESCAPE '\' syntax instead of ESCAPE '', which would cause errors when trying to push down.

Yes

Tests for each existing dialect.

No.

Closes #46588 from mihailom-db/SPARK-48172.

Authored-by: Mihailo Milosevic 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 9e386b472981979e368a5921c58da5bfefe3acfe)
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/jdbc/v2/DB2IntegrationSuite.scala|   6 +
 .../sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala |  11 +
 .../sql/jdbc/v2/MsSqlServerIntegrationSuite.scala  |   6 +
 .../spark/sql/jdbc/v2/MySQLIntegrationSuite.scala  |   6 +
 .../spark/sql/jdbc/v2/OracleIntegrationSuite.scala |   6 +
 .../sql/jdbc/v2/PostgresIntegrationSuite.scala |   6 +
 .../org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala  | 229 +
 .../sql/connector/util/V2ExpressionSQLBuilder.java |   3 -
 .../sql/connector/expressions/expressions.scala|   4 +-
 .../org/apache/spark/sql/jdbc/H2Dialect.scala  |   7 -
 .../org/apache/spark/sql/jdbc/MySQLDialect.scala   |  15 ++
 .../org/apache/spark/sql/jdbc/JDBCV2Suite.scala|   6 +-
 12 files changed, 291 insertions(+), 14 deletions(-)

diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
index 9a78244f5326..5bcc8afefb1d 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
@@ -80,6 +80,12 @@ class DB2IntegrationSuite extends 
DockerJDBCIntegrationV2Suite with V2JDBCTest {
 connection.prepareStatement(
   "CREATE TABLE employee (dept INTEGER, name VARCHAR(10), salary 
DECIMAL(20, 2), bonus DOUBLE)")
   .executeUpdate()
+connection.prepareStatement(
+  s"""CREATE TABLE pattern_testing_table (
+ |pattern_testing_col VARCHAR(50)
+ |)
+   """.stripMargin
+).executeUpdate()
   }
 
   override def testUpdateColumnType(tbl: String): Unit = {
diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
index 72edfc9f1bf1..5f4f0b7a3afb 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
@@ -38,6 +38,17 @@ abstract class DockerJDBCIntegrationV2Suite extends 
DockerJDBCIntegrationSuite {
   .executeUpdate()
 connection.prepareStatement("INSERT INTO employee VALUES (6, 'jen', 12000, 
1200)")
   .executeUpdate()
+
+connection.prepareStatement(
+  s"""
+ |INSERT INTO pattern_testing_table VALUES
+ |('special_character_quote''_present'),
+ |('special_character_quote_not_present'),
+ |('special_character_percent%_present'),
+ |('special_character_percent_not_present'),
+ |('special_character_underscore_present'),
+ |('special_character_underscorenot_present')
+ """.stripMargin).executeUpdate()
   }
 
   def tablePreparation(connection: Connection): Unit
diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
index 0dc3a39f4db5..0bb2ea8249b3 100644
--

(spark) branch master updated: [SPARK-48172][SQL] Fix escaping issues in JDBCDialects

2024-05-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9e386b472981 [SPARK-48172][SQL] Fix escaping issues in JDBCDialects
9e386b472981 is described below

commit 9e386b472981979e368a5921c58da5bfefe3acfe
Author: Mihailo Milosevic 
AuthorDate: Wed May 15 22:15:52 2024 +0800

[SPARK-48172][SQL] Fix escaping issues in JDBCDialects

This PR is a fix of https://github.com/apache/spark/pull/46437. The 
previous PR was reverted as `LONGTEXT` is not supported by all dialects.

### What changes were proposed in this pull request?
Special-case escaping for MySQL and fix issues with redundant escaping of the ' character.
New changes introduced in the fix include changing `LONGTEXT` -> `VARCHAR(50)`, as well as a fix for table naming in the tests.

### Why are the changes needed?
When pushing down startsWith, endsWith, and contains, they are converted to LIKE. This requires adding escape characters to these expressions. Unfortunately, MySQL uses the ESCAPE '\' syntax instead of ESCAPE '', which would cause errors when trying to push down.
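
As a hedged illustration of the pushdown (the exact generated SQL differs per dialect and is not taken verbatim from this patch), a prefix filter roughly becomes a LIKE with the wildcard characters in the search value escaped:

```sql
-- Spark filter: startswith(pattern_testing_col, 'special_character_percent%')
-- pushed down to the remote database as, roughly:
SELECT * FROM pattern_testing_table
WHERE pattern_testing_col LIKE 'special\_character\_percent\%%' ESCAPE '\';
```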

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
Tests for each existing dialect.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46588 from mihailom-db/SPARK-48172.

Authored-by: Mihailo Milosevic 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/jdbc/v2/DB2IntegrationSuite.scala|   6 +
 .../sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala |  11 +
 .../sql/jdbc/v2/MsSqlServerIntegrationSuite.scala  |   6 +
 .../spark/sql/jdbc/v2/MySQLIntegrationSuite.scala  |   6 +
 .../spark/sql/jdbc/v2/OracleIntegrationSuite.scala |   6 +
 .../sql/jdbc/v2/PostgresIntegrationSuite.scala |   6 +
 .../org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala  | 229 +
 .../sql/connector/util/V2ExpressionSQLBuilder.java |   1 -
 .../sql/connector/expressions/expressions.scala|   4 +-
 .../org/apache/spark/sql/jdbc/H2Dialect.scala  |   7 -
 .../org/apache/spark/sql/jdbc/MySQLDialect.scala   |  15 ++
 .../org/apache/spark/sql/jdbc/JDBCV2Suite.scala|   6 +-
 12 files changed, 291 insertions(+), 12 deletions(-)

diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
index 3642094d11b2..57129e9d846f 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
@@ -62,6 +62,12 @@ class DB2IntegrationSuite extends 
DockerJDBCIntegrationV2Suite with V2JDBCTest {
 connection.prepareStatement(
   "CREATE TABLE employee (dept INTEGER, name VARCHAR(10), salary 
DECIMAL(20, 2), bonus DOUBLE)")
   .executeUpdate()
+connection.prepareStatement(
+  s"""CREATE TABLE pattern_testing_table (
+ |pattern_testing_col VARCHAR(50)
+ |)
+   """.stripMargin
+).executeUpdate()
   }
 
   override def testUpdateColumnType(tbl: String): Unit = {
diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
index 72edfc9f1bf1..5f4f0b7a3afb 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
@@ -38,6 +38,17 @@ abstract class DockerJDBCIntegrationV2Suite extends 
DockerJDBCIntegrationSuite {
   .executeUpdate()
 connection.prepareStatement("INSERT INTO employee VALUES (6, 'jen', 12000, 
1200)")
   .executeUpdate()
+
+connection.prepareStatement(
+  s"""
+ |INSERT INTO pattern_testing_table VALUES
+ |('special_character_quote''_present'),
+ |('special_character_quote_not_present'),
+ |('special_character_percent%_present'),
+ |('special_character_percent_not_present'),
+ |('special_character_underscore_present'),
+ |('special_character_underscorenot_present')
+ """.stripMargin).executeUpdate()
   }
 
   def tablePreparation(connection: Connection): Unit
diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
 
b/connector/docker-integration-tests/

(spark) branch master updated (8c0a7ba82c98 -> 5e87e9fbd6e6)

2024-05-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 8c0a7ba82c98 [SPARK-48160][SQL] Add collation support for XPATH 
expressions
 add 5e87e9fbd6e6 [SPARK-48277] Improve error message for 
ErrorClassesJsonReader.getErrorMessage

No new revisions were added by this update.

Summary of changes:
 .../src/main/scala/org/apache/spark/ErrorClassesJSONReader.scala   | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48160][SQL] Add collation support for XPATH expressions

2024-05-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8c0a7ba82c98 [SPARK-48160][SQL] Add collation support for XPATH 
expressions
8c0a7ba82c98 is described below

commit 8c0a7ba82c98c7f7e686c4ee81d2aad49cc7a6e0
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Wed May 15 14:24:46 2024 +0800

[SPARK-48160][SQL] Add collation support for XPATH expressions

### What changes were proposed in this pull request?
Introduce collation awareness for XPath expressions: xpath_boolean, 
xpath_short, xpath_int, xpath_long, xpath_float, xpath_double, xpath_string, 
xpath.

### Why are the changes needed?
Add collation support for XPath expressions in Spark.

### Does this PR introduce _any_ user-facing change?
Yes, users should now be able to use collated strings within arguments for 
XPath functions: xpath_boolean, xpath_short, xpath_int, xpath_long, 
xpath_float, xpath_double, xpath_string, xpath.
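
A minimal sketch of a now-supported call, with the XML argument explicitly collated (the values are illustrative):

```sql
SELECT xpath_string(collate('<a><b>spark</b></a>', 'UNICODE_CI'), '/a/b');
-- expected result: spark
```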

### How was this patch tested?
E2e sql tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46508 from uros-db/xpath-expressions.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/expressions/xml/xpath.scala | 11 --
 .../spark/sql/CollationSQLExpressionsSuite.scala   | 44 ++
 2 files changed, 51 insertions(+), 4 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/xml/xpath.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/xml/xpath.scala
index c3a285178c11..f65061e8d0ea 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/xml/xpath.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/xml/xpath.scala
@@ -23,6 +23,8 @@ import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.expressions.Cast._
 import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
 import org.apache.spark.sql.catalyst.util.GenericArrayData
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.internal.types.StringTypeAnyCollation
 import org.apache.spark.sql.types._
 import org.apache.spark.unsafe.types.UTF8String
 
@@ -39,7 +41,8 @@ abstract class XPathExtract
   /** XPath expressions are always nullable, e.g. if the xml string is empty. 
*/
   override def nullable: Boolean = true
 
-  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType)
+  override def inputTypes: Seq[AbstractDataType] =
+Seq(StringTypeAnyCollation, StringTypeAnyCollation)
 
   override def checkInputDataTypes(): TypeCheckResult = {
 if (!path.foldable) {
@@ -47,7 +50,7 @@ abstract class XPathExtract
 errorSubClass = "NON_FOLDABLE_INPUT",
 messageParameters = Map(
   "inputName" -> toSQLId("path"),
-  "inputType" -> toSQLType(StringType),
+  "inputType" -> toSQLType(StringTypeAnyCollation),
   "inputExpr" -> toSQLExpr(path)
 )
   )
@@ -221,7 +224,7 @@ case class XPathDouble(xml: Expression, path: Expression) 
extends XPathExtract {
 // scalastyle:on line.size.limit
 case class XPathString(xml: Expression, path: Expression) extends XPathExtract 
{
   override def prettyName: String = "xpath_string"
-  override def dataType: DataType = StringType
+  override def dataType: DataType = SQLConf.get.defaultStringType
 
   override def nullSafeEval(xml: Any, path: Any): Any = {
 val ret = xpathUtil.evalString(xml.asInstanceOf[UTF8String].toString, 
pathString)
@@ -245,7 +248,7 @@ case class XPathString(xml: Expression, path: Expression) 
extends XPathExtract {
 // scalastyle:on line.size.limit
 case class XPathList(xml: Expression, path: Expression) extends XPathExtract {
   override def prettyName: String = "xpath"
-  override def dataType: DataType = ArrayType(StringType, containsNull = false)
+  override def dataType: DataType = ArrayType(SQLConf.get.defaultStringType, 
containsNull = false)
 
   override def nullSafeEval(xml: Any, path: Any): Any = {
 val nodeList = 
xpathUtil.evalNodeList(xml.asInstanceOf[UTF8String].toString, pathString)
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
index 48c3853bb5cf..37dcdf9bd721 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
@@ -548,6 +548,5

(spark) branch master updated: [SPARK-48162][SQL] Add collation support for MISC expressions

2024-05-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 723354039f1d [SPARK-48162][SQL] Add collation support for MISC 
expressions
723354039f1d is described below

commit 723354039f1de587cacdf4ba48c076a896fdffd1
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Wed May 15 14:23:31 2024 +0800

[SPARK-48162][SQL] Add collation support for MISC expressions

### What changes were proposed in this pull request?
Introduce collation awareness for misc expressions: raise_error, uuid, 
version, typeof, aes_encrypt, aes_decrypt.

### Why are the changes needed?
Add collation support for misc expressions in Spark.

### Does this PR introduce _any_ user-facing change?
Yes, users should now be able to use collated strings within arguments for 
misc functions: raise_error, uuid, version, typeof, aes_encrypt, aes_decrypt.
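A rough usage sketch (illustrative only; assumes a spark-shell session and example collation/key values):
```
// typeof should report the collated string type of its argument, and
// aes_encrypt/aes_decrypt accept collated strings for their string arguments.
spark.sql("SELECT typeof(collate('abc', 'UNICODE'))").show()
spark.sql(
  "SELECT cast(aes_decrypt(aes_encrypt('Spark', '0000111122223333'), '0000111122223333') AS STRING)"
).show()
```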

### How was this patch tested?
E2E SQL tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46461 from uros-db/misc-expressions.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../explain-results/function_aes_decrypt.explain   |   2 +-
 .../function_aes_decrypt_with_mode.explain |   2 +-
 .../function_aes_decrypt_with_mode_padding.explain |   2 +-
 ...ction_aes_decrypt_with_mode_padding_aad.explain |   2 +-
 .../explain-results/function_aes_encrypt.explain   |   2 +-
 .../function_aes_encrypt_with_mode.explain |   2 +-
 .../function_aes_encrypt_with_mode_padding.explain |   2 +-
 ...nction_aes_encrypt_with_mode_padding_iv.explain |   2 +-
 ...on_aes_encrypt_with_mode_padding_iv_aad.explain |   2 +-
 .../function_try_aes_decrypt.explain   |   2 +-
 .../function_try_aes_decrypt_with_mode.explain |   2 +-
 ...ction_try_aes_decrypt_with_mode_padding.explain |   2 +-
 ...n_try_aes_decrypt_with_mode_padding_aad.explain |   2 +-
 .../spark/sql/catalyst/expressions/misc.scala  |  14 ++-
 .../spark/sql/CollationSQLExpressionsSuite.scala   | 136 +
 15 files changed, 157 insertions(+), 19 deletions(-)

diff --git 
a/connector/connect/common/src/test/resources/query-tests/explain-results/function_aes_decrypt.explain
 
b/connector/connect/common/src/test/resources/query-tests/explain-results/function_aes_decrypt.explain
index 31e03b79eb98..55f1c314671a 100644
--- 
a/connector/connect/common/src/test/resources/query-tests/explain-results/function_aes_decrypt.explain
+++ 
b/connector/connect/common/src/test/resources/query-tests/explain-results/function_aes_decrypt.explain
@@ -1,2 +1,2 @@
-Project [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.ExpressionImplUtils, BinaryType, 
aesDecrypt, cast(g#0 as binary), cast(g#0 as binary), GCM, DEFAULT, cast( as 
binary), BinaryType, BinaryType, StringType, StringType, BinaryType, true, 
true, true) AS aes_decrypt(g, g, GCM, DEFAULT, )#0]
+Project [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.ExpressionImplUtils, BinaryType, 
aesDecrypt, cast(g#0 as binary), cast(g#0 as binary), GCM, DEFAULT, cast( as 
binary), BinaryType, BinaryType, StringTypeAnyCollation, 
StringTypeAnyCollation, BinaryType, true, true, true) AS aes_decrypt(g, g, GCM, 
DEFAULT, )#0]
 +- LocalRelation , [id#0L, a#0, b#0, d#0, e#0, f#0, g#0]
diff --git 
a/connector/connect/common/src/test/resources/query-tests/explain-results/function_aes_decrypt_with_mode.explain
 
b/connector/connect/common/src/test/resources/query-tests/explain-results/function_aes_decrypt_with_mode.explain
index fc572e8fe7c6..762a4f47a058 100644
--- 
a/connector/connect/common/src/test/resources/query-tests/explain-results/function_aes_decrypt_with_mode.explain
+++ 
b/connector/connect/common/src/test/resources/query-tests/explain-results/function_aes_decrypt_with_mode.explain
@@ -1,2 +1,2 @@
-Project [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.ExpressionImplUtils, BinaryType, 
aesDecrypt, cast(g#0 as binary), cast(g#0 as binary), g#0, DEFAULT, cast( as 
binary), BinaryType, BinaryType, StringType, StringType, BinaryType, true, 
true, true) AS aes_decrypt(g, g, g, DEFAULT, )#0]
+Project [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.ExpressionImplUtils, BinaryType, 
aesDecrypt, cast(g#0 as binary), cast(g#0 as binary), g#0, DEFAULT, cast( as 
binary), BinaryType, BinaryType, StringTypeAnyCollation, 
StringTypeAnyCollation, BinaryType, true, true, true) AS aes_decrypt(g, g, g, 
DEFAULT, )#0]
 +- LocalRelation , [id#0L, a#0, b#0, d#0, e#0, f#0, g#0]
diff --git 
a/connector/connect/common/src/test/resources/query-tests/explain-results/function_aes_decrypt_with_mode_padding.explain
 
b/connector/connect/com

(spark) branch master updated: [SPARK-48263] Collate function support for non UTF8_BINARY strings

2024-05-14 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 91da2caa409c [SPARK-48263] Collate function support for non 
UTF8_BINARY strings
91da2caa409c is described below

commit 91da2caa409cb156a970fea0fc8355fcd8c6a2e6
Author: Nebojsa Savic 
AuthorDate: Tue May 14 23:39:26 2024 +0800

[SPARK-48263] Collate function support for non UTF8_BINARY strings

### What changes were proposed in this pull request?
collate("xx", "") does not work when there is a config for 
default collation set which configures non UTF8_BINARY collation as default.

### Why are the changes needed?
Fixes the compatibility issue between the default collation config and the collate function.

### Does this PR introduce _any_ user-facing change?
Customers will be able to execute collation(, ) function 
even when default collation config is configured to some other collation than 
UTF8_BINARY. We are expanding the surface area for cx.
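A rough sketch of the user-facing behavior (spark-shell assumed; the config key below is an assumption, the test added in this PR uses the SqlApiConf.DEFAULT_COLLATION constant instead):
```
// With a non-UTF8_BINARY default collation configured, collate() still works
// and returns the explicitly requested collation.
spark.conf.set("spark.sql.session.collation.default", "UTF8_BINARY_LCASE") // key assumed
spark.sql("SELECT collate('aaa', 'UNICODE')").schema.head.dataType // expected: StringType("UNICODE")
```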

### How was this patch tested?
Added tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46574 from nebojsa-db/SPARK-48263.

Authored-by: Nebojsa Savic 
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/expressions/collationExpressions.scala|  4 ++--
 .../test/scala/org/apache/spark/sql/CollationSuite.scala   | 14 --
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala
index 6af00e193d94..7c02475a60ad 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala
@@ -57,14 +57,14 @@ object CollateExpressionBuilder extends ExpressionBuilder {
 expressions match {
   case Seq(e: Expression, collationExpr: Expression) =>
 (collationExpr.dataType, collationExpr.foldable) match {
-  case (StringType, true) =>
+  case (_: StringType, true) =>
 val evalCollation = collationExpr.eval()
 if (evalCollation == null) {
   throw QueryCompilationErrors.unexpectedNullError("collation", 
collationExpr)
 } else {
   Collate(e, evalCollation.toString)
 }
-  case (StringType, false) => throw 
QueryCompilationErrors.nonFoldableArgumentError(
+  case (_: StringType, false) => throw 
QueryCompilationErrors.nonFoldableArgumentError(
 funcName, "collationName", StringType)
   case (_, _) => throw 
QueryCompilationErrors.unexpectedInputDataTypeError(
 funcName, 1, StringType, collationExpr)
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala
index fce9ad3cc184..b22a762a2954 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala
@@ -67,8 +67,18 @@ class CollationSuite extends DatasourceV2SQLBase with 
AdaptiveSparkPlanHelper {
   }
 
   test("collate function syntax") {
-assert(sql(s"select collate('aaa', 'utf8_binary')").schema(0).dataType == 
StringType(0))
-assert(sql(s"select collate('aaa', 
'utf8_binary_lcase')").schema(0).dataType == StringType(1))
+assert(sql(s"select collate('aaa', 'utf8_binary')").schema(0).dataType ==
+  StringType("UTF8_BINARY"))
+assert(sql(s"select collate('aaa', 
'utf8_binary_lcase')").schema(0).dataType ==
+  StringType("UTF8_BINARY_LCASE"))
+  }
+
+  test("collate function syntax with default collation set") {
+withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UTF8_BINARY_LCASE") {
+  assert(sql(s"select collate('aaa', 
'utf8_binary_lcase')").schema(0).dataType ==
+StringType("UTF8_BINARY_LCASE"))
+  assert(sql(s"select collate('aaa', 'UNICODE')").schema(0).dataType == 
StringType("UNICODE"))
+}
   }
 
   test("collate function syntax invalid arg count") {





(spark) branch master updated: [SPARK-47301][SQL][TESTS][FOLLOWUP] Remove workaround for ParquetIOSuite

2024-05-14 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 97bf1ee9f6f7 [SPARK-47301][SQL][TESTS][FOLLOWUP] Remove workaround for 
ParquetIOSuite
97bf1ee9f6f7 is described below

commit 97bf1ee9f6f76d49df50560bf792135308f289a9
Author: panbingkun 
AuthorDate: Tue May 14 23:37:47 2024 +0800

[SPARK-47301][SQL][TESTS][FOLLOWUP] Remove workaround for ParquetIOSuite

### What changes were proposed in this pull request?
This PR removes the workaround in ParquetIOSuite.

### Why are the changes needed?
After https://github.com/apache/spark/pull/46562 was completed, the reason why the UT 
`SPARK-7837 Do not close output writer twice when commitTask() fails` failed 
(a differing event-processing order) no longer exists, so we remove the previous 
workaround here.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Manually tested.
- Passed GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46577 from panbingkun/SPARK-47301_FOLLOWUP.

Authored-by: panbingkun 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/execution/datasources/parquet/ParquetIOSuite.scala  | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala
index ba8fef0b3a8d..4fb8faa43a39 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala
@@ -1589,12 +1589,8 @@ class ParquetIOWithoutOutputCommitCoordinationSuite
 .coalesce(1)
   
df.write.partitionBy("a").options(extraOptions).parquet(dir.getCanonicalPath)
 }
-if (m2.getErrorClass != null) {
-  assert(m2.getErrorClass == "TASK_WRITE_FAILED")
-  assert(m2.getCause.getMessage.contains("Intentional exception for 
testing purposes"))
-} else {
-  assert(m2.getMessage.contains("TASK_WRITE_FAILED"))
-}
+assert(m2.getErrorClass == "TASK_WRITE_FAILED")
+assert(m2.getCause.getMessage.contains("Intentional exception for 
testing purposes"))
   }
 }
   }





(spark) branch branch-3.4 updated: [SPARK-48172][SQL] Fix escaping issues in JDBC Dialects

2024-05-14 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new a848e2790cba [SPARK-48172][SQL] Fix escaping issues in JDBC Dialects
a848e2790cba is described below

commit a848e2790cba0b7ee77d391dc534146bd35ee50a
Author: Mihailo Milosevic 
AuthorDate: Tue May 14 23:31:46 2024 +0800

[SPARK-48172][SQL] Fix escaping issues in JDBC Dialects

Special-case escaping for MySQL and fix issues with redundant escaping of the ' character.

When pushing down startsWith, endsWith and contains, they are converted to 
LIKE. This requires adding escape characters to these expressions. 
Unfortunately, MySQL uses the ESCAPE '\\' syntax instead of ESCAPE '\', which would 
cause errors when trying to push down.

Yes

Tests for each existing dialect.

No.

Closes #46437 from mihailom-db/SPARK-48172.

Authored-by: Mihailo Milosevic 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 47006a493f98ca85196194d16d58b5847177b1a3)
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/jdbc/v2/DB2IntegrationSuite.scala|   6 +
 .../sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala |  11 +
 .../sql/jdbc/v2/MsSqlServerIntegrationSuite.scala  |   6 +
 .../spark/sql/jdbc/v2/MySQLIntegrationSuite.scala  |   6 +
 .../spark/sql/jdbc/v2/OracleIntegrationSuite.scala |   6 +
 .../sql/jdbc/v2/PostgresIntegrationSuite.scala |   6 +
 .../org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala  | 229 +
 .../sql/connector/util/V2ExpressionSQLBuilder.java |   3 -
 .../sql/connector/expressions/expressions.scala|   4 +-
 .../org/apache/spark/sql/jdbc/H2Dialect.scala  |   7 -
 .../org/apache/spark/sql/jdbc/MySQLDialect.scala   |  15 ++
 .../org/apache/spark/sql/jdbc/JDBCV2Suite.scala|   6 +-
 12 files changed, 291 insertions(+), 14 deletions(-)

diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
index 1a25cd2802dd..11ddce68aecd 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
@@ -67,6 +67,12 @@ class DB2IntegrationSuite extends 
DockerJDBCIntegrationV2Suite with V2JDBCTest {
 connection.prepareStatement(
   "CREATE TABLE employee (dept INTEGER, name VARCHAR(10), salary 
DECIMAL(20, 2), bonus DOUBLE)")
   .executeUpdate()
+connection.prepareStatement(
+  s"""CREATE TABLE pattern_testing_table (
+ |pattern_testing_col LONGTEXT
+ |)
+   """.stripMargin
+).executeUpdate()
   }
 
   override def testUpdateColumnType(tbl: String): Unit = {
diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
index 72edfc9f1bf1..a42caeafe6fe 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
@@ -38,6 +38,17 @@ abstract class DockerJDBCIntegrationV2Suite extends 
DockerJDBCIntegrationSuite {
   .executeUpdate()
 connection.prepareStatement("INSERT INTO employee VALUES (6, 'jen', 12000, 
1200)")
   .executeUpdate()
+
+connection.prepareStatement(
+  s"""
+ |INSERT INTO pattern_testing_table VALUES
+ |('special_character_quote\\'_present'),
+ |('special_character_quote_not_present'),
+ |('special_character_percent%_present'),
+ |('special_character_percent_not_present'),
+ |('special_character_underscore_present'),
+ |('special_character_underscorenot_present')
+ """.stripMargin).executeUpdate()
   }
 
   def tablePreparation(connection: Connection): Unit
diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
index a527c6f8cb5b..6658b5ed6c77 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
@@ -66,6 +66,12 @@ class MsSqlServerIntegrationSuite extends 
DockerJ

(spark) branch branch-3.5 updated: [SPARK-48172][SQL] Fix escaping issues in JDBC Dialects

2024-05-14 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new f37fa436cd4e [SPARK-48172][SQL] Fix escaping issues in JDBC Dialects
f37fa436cd4e is described below

commit f37fa436cd4e0ef9f486a60f9af91a3ce0195df9
Author: Mihailo Milosevic 
AuthorDate: Tue May 14 23:31:46 2024 +0800

[SPARK-48172][SQL] Fix escaping issues in JDBC Dialects

Special-case escaping for MySQL and fix issues with redundant escaping of the ' character.

When pushing down startsWith, endsWith and contains, they are converted to 
LIKE. This requires adding escape characters to these expressions. 
Unfortunately, MySQL uses the ESCAPE '\\' syntax instead of ESCAPE '\', which would 
cause errors when trying to push down.

Yes

Tests for each existing dialect.

No.

Closes #46437 from mihailom-db/SPARK-48172.

Authored-by: Mihailo Milosevic 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 47006a493f98ca85196194d16d58b5847177b1a3)
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/jdbc/v2/DB2IntegrationSuite.scala|   6 +
 .../sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala |  11 +
 .../sql/jdbc/v2/MsSqlServerIntegrationSuite.scala  |   6 +
 .../spark/sql/jdbc/v2/MySQLIntegrationSuite.scala  |   6 +
 .../spark/sql/jdbc/v2/OracleIntegrationSuite.scala |   6 +
 .../sql/jdbc/v2/PostgresIntegrationSuite.scala |   6 +
 .../org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala  | 229 +
 .../sql/connector/util/V2ExpressionSQLBuilder.java |   3 -
 .../sql/connector/expressions/expressions.scala|   4 +-
 .../org/apache/spark/sql/jdbc/H2Dialect.scala  |   7 -
 .../org/apache/spark/sql/jdbc/MySQLDialect.scala   |  15 ++
 .../org/apache/spark/sql/jdbc/JDBCV2Suite.scala|   6 +-
 12 files changed, 291 insertions(+), 14 deletions(-)

diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
index 9a78244f5326..9b4916ddd36b 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
@@ -80,6 +80,12 @@ class DB2IntegrationSuite extends 
DockerJDBCIntegrationV2Suite with V2JDBCTest {
 connection.prepareStatement(
   "CREATE TABLE employee (dept INTEGER, name VARCHAR(10), salary 
DECIMAL(20, 2), bonus DOUBLE)")
   .executeUpdate()
+connection.prepareStatement(
+  s"""CREATE TABLE pattern_testing_table (
+ |pattern_testing_col LONGTEXT
+ |)
+   """.stripMargin
+).executeUpdate()
   }
 
   override def testUpdateColumnType(tbl: String): Unit = {
diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
index 72edfc9f1bf1..a42caeafe6fe 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
@@ -38,6 +38,17 @@ abstract class DockerJDBCIntegrationV2Suite extends 
DockerJDBCIntegrationSuite {
   .executeUpdate()
 connection.prepareStatement("INSERT INTO employee VALUES (6, 'jen', 12000, 
1200)")
   .executeUpdate()
+
+connection.prepareStatement(
+  s"""
+ |INSERT INTO pattern_testing_table VALUES
+ |('special_character_quote\\'_present'),
+ |('special_character_quote_not_present'),
+ |('special_character_percent%_present'),
+ |('special_character_percent_not_present'),
+ |('special_character_underscore_present'),
+ |('special_character_underscorenot_present')
+ """.stripMargin).executeUpdate()
   }
 
   def tablePreparation(connection: Connection): Unit
diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
index 0dc3a39f4db5..57a2667557fa 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
@@ -86,6 +86,12 @@ class MsSqlServerIntegrationSuite extends 
DockerJ

(spark) branch master updated: [SPARK-48172][SQL] Fix escaping issues in JDBC Dialects

2024-05-14 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 47006a493f98 [SPARK-48172][SQL] Fix escaping issues in JDBC Dialects
47006a493f98 is described below

commit 47006a493f98ca85196194d16d58b5847177b1a3
Author: Mihailo Milosevic 
AuthorDate: Tue May 14 23:31:46 2024 +0800

[SPARK-48172][SQL] Fix escaping issues in JDBC Dialects

### What changes were proposed in this pull request?
Special-case escaping for MySQL and fix issues with redundant escaping of the ' character.

### Why are the changes needed?
When pushing down startsWith, endsWith and contains, they are converted to 
LIKE. This requires adding escape characters to these expressions. 
Unfortunately, MySQL uses the ESCAPE '\\' syntax instead of ESCAPE '\', which would 
cause errors when trying to push down.
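A conceptual sketch of the escaping involved (not the dialect code from this PR; names are illustrative):
```
// Escape LIKE wildcards in a literal prefix before pushing startsWith down as LIKE.
def escapeForLike(literal: String, escapeChar: Char = '\\'): String =
  literal.flatMap {
    case c @ ('_' | '%')      => s"$escapeChar$c"
    case c if c == escapeChar => s"$escapeChar$c"
    case c                    => c.toString
  }

// startsWith("a_b") is pushed as: col LIKE 'a\_b%' ESCAPE '\'
// MySQL, however, expects the clause to be written as ESCAPE '\\',
// which is what the MySQL-specific handling in this PR accounts for.
val pushedPredicate = s"col LIKE '${escapeForLike("a_b")}%' ESCAPE '\\'"
```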

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Tests for each existing dialect.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46437 from mihailom-db/SPARK-48172.

Authored-by: Mihailo Milosevic 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/jdbc/v2/DB2IntegrationSuite.scala|   6 +
 .../sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala |  11 +
 .../sql/jdbc/v2/MsSqlServerIntegrationSuite.scala  |   6 +
 .../spark/sql/jdbc/v2/MySQLIntegrationSuite.scala  |   6 +
 .../spark/sql/jdbc/v2/OracleIntegrationSuite.scala |   6 +
 .../sql/jdbc/v2/PostgresIntegrationSuite.scala |   6 +
 .../org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala  | 229 +
 .../sql/connector/util/V2ExpressionSQLBuilder.java |   1 -
 .../sql/connector/expressions/expressions.scala|   4 +-
 .../org/apache/spark/sql/jdbc/H2Dialect.scala  |   7 -
 .../org/apache/spark/sql/jdbc/MySQLDialect.scala   |  15 ++
 .../org/apache/spark/sql/jdbc/JDBCV2Suite.scala|   6 +-
 12 files changed, 291 insertions(+), 12 deletions(-)

diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
index 3642094d11b2..36795747319d 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
@@ -62,6 +62,12 @@ class DB2IntegrationSuite extends 
DockerJDBCIntegrationV2Suite with V2JDBCTest {
 connection.prepareStatement(
   "CREATE TABLE employee (dept INTEGER, name VARCHAR(10), salary 
DECIMAL(20, 2), bonus DOUBLE)")
   .executeUpdate()
+connection.prepareStatement(
+  s"""CREATE TABLE pattern_testing_table (
+ |pattern_testing_col LONGTEXT
+ |)
+   """.stripMargin
+).executeUpdate()
   }
 
   override def testUpdateColumnType(tbl: String): Unit = {
diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
index 72edfc9f1bf1..a42caeafe6fe 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DockerJDBCIntegrationV2Suite.scala
@@ -38,6 +38,17 @@ abstract class DockerJDBCIntegrationV2Suite extends 
DockerJDBCIntegrationSuite {
   .executeUpdate()
 connection.prepareStatement("INSERT INTO employee VALUES (6, 'jen', 12000, 
1200)")
   .executeUpdate()
+
+connection.prepareStatement(
+  s"""
+ |INSERT INTO pattern_testing_table VALUES
+ |('special_character_quote\\'_present'),
+ |('special_character_quote_not_present'),
+ |('special_character_percent%_present'),
+ |('special_character_percent_not_present'),
+ |('special_character_underscore_present'),
+ |('special_character_underscorenot_present')
+ """.stripMargin).executeUpdate()
   }
 
   def tablePreparation(connection: Connection): Unit
diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
index b1b8aec5ad33..46530fe5419a 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
+++ 
b/connector/docker-integration-tes

(spark) branch master updated: [SPARK-48155][SQL] AQEPropagateEmptyRelation for join should check if remain child is just BroadcastQueryStageExec

2024-05-14 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e5ad5e94a8c8 [SPARK-48155][SQL] AQEPropagateEmptyRelation for join 
should check if remain child is just BroadcastQueryStageExec
e5ad5e94a8c8 is described below

commit e5ad5e94a8c891210637084a69359c1364201653
Author: Angerszh 
AuthorDate: Tue May 14 17:32:56 2024 +0800

[SPARK-48155][SQL] AQEPropagateEmptyRelation for join should check if 
remain child is just BroadcastQueryStageExec

### What changes were proposed in this pull request?
This is a new approach to fix 
[SPARK-39551](https://issues.apache.org/jira/browse/SPARK-39551).
The situation happened in AQEPropagateEmptyRelation when one join side is 
empty and the other side is a BroadcastQueryStageExec.
This PR avoids the propagation in that case, rather than reverting all of the 
queryStagePreparationRules' results.
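A simplified conceptual sketch of the added check (the real rule works on the logical plan; this only illustrates the idea):
```
// Skip empty-relation propagation when the surviving join child is an
// already-materialized broadcast query stage, since rewriting around it
// would invalidate the planned broadcast exchange.
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.adaptive.BroadcastQueryStageExec

def safeToPropagate(remainingChild: SparkPlan): Boolean = remainingChild match {
  case _: BroadcastQueryStageExec => false
  case _                          => true
}
```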

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually tested `SPARK-39551: Invalid plan check - invalid broadcast query 
stage`; it works well with the original fix removed and the current PR applied.

For added UT,
```
  test("SPARK-48155: AQEPropagateEmptyRelation check remained child for 
join") {
withSQLConf(
  SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
  val (_, adaptivePlan) = runAdaptiveAndVerifyResult(
"""
  |SELECT /*+ BROADCAST(t3) */ t3.b, count(t3.a) FROM testData2 t1
  |INNER JOIN (
  |  SELECT * FROM testData2
  |  WHERE b = 0
  |  UNION ALL
  |  SELECT * FROM testData2
  |  WHErE b != 0
  |) t2
  |ON t1.b = t2.b AND t1.a = 0
  |RIGHT OUTER JOIN testData2 t3
  |ON t1.a > t3.a
  |GROUP BY t3.b
""".stripMargin
  )
  assert(findTopLevelBroadcastNestedLoopJoin(adaptivePlan).size == 1)
  assert(findTopLevelUnion(adaptivePlan).size == 0)
}
  }
```

before this pr the adaptive plan is
```
*(9) HashAggregate(keys=[b#226], functions=[count(1)], output=[b#226, 
count(a)#228L])
+- AQEShuffleRead coalesced
   +- ShuffleQueryStage 3
  +- Exchange hashpartitioning(b#226, 5), ENSURE_REQUIREMENTS, 
[plan_id=356]
 +- *(8) HashAggregate(keys=[b#226], functions=[partial_count(1)], 
output=[b#226, count#232L])
+- *(8) Project [b#226]
   +- BroadcastNestedLoopJoin BuildRight, RightOuter, (a#23 > 
a#225)
  :- *(7) Project [a#23]
  :  +- *(7) SortMergeJoin [b#24], [b#220], Inner
  : :- *(5) Sort [b#24 ASC NULLS FIRST], false, 0
  : :  +- AQEShuffleRead coalesced
  : : +- ShuffleQueryStage 0
  : :+- Exchange hashpartitioning(b#24, 5), 
ENSURE_REQUIREMENTS, [plan_id=211]
  : :   +- *(1) Filter (a#23 = 0)
  : :  +- *(1) SerializeFromObject 
[knownnotnull(assertnotnull(input[0, 
org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#23, 
knownnotnull(assertnotnull(input[0, 
org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#24]
  : : +- Scan[obj#22]
  : +- *(6) Sort [b#220 ASC NULLS FIRST], false, 0
  :+- AQEShuffleRead coalesced
  :   +- ShuffleQueryStage 1
  :  +- Exchange hashpartitioning(b#220, 5), 
ENSURE_REQUIREMENTS, [plan_id=233]
  : +- Union
  ::- *(2) Project [b#220]
  ::  +- *(2) Filter (b#220 = 0)
  :: +- *(2) SerializeFromObject 
[knownnotnull(assertnotnull(input[0, 
org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#219, 
knownnotnull(assertnotnull(input[0, 
org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#220]
  ::+- Scan[obj#218]
  :+- *(3) Project [b#223]
  :   +- *(3) Filter NOT (b#223 = 0)
  :  +- *(3) SerializeFromObject 
[knownnotnull(assertnotnull(input[0, 
org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#222, 
knownnotnull(assertnotnull(input[0, 
org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#223]
  : +-

(spark) branch master updated: [SPARK-46707][SQL][FOLLOWUP] Push down throwable predicate through aggregates

2024-05-14 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6766c39b458a [SPARK-46707][SQL][FOLLOWUP] Push down throwable 
predicate through aggregates
6766c39b458a is described below

commit 6766c39b458ad7abacd1a5b11c896efabf36f95c
Author: zml1206 
AuthorDate: Tue May 14 15:53:43 2024 +0800

[SPARK-46707][SQL][FOLLOWUP] Push down throwable predicate through 
aggregates

### What changes were proposed in this pull request?
Push down throwable predicates through aggregates and add a UT for "can't push 
down nondeterministic filter through aggregate".

### Why are the changes needed?
If we can push down a filter through Aggregate, it means the filter only 
references the grouping keys. The Aggregate operator can't reduce the distinct 
values of the grouping keys, so the filter won't see any new data after pushdown. 
Therefore, pushing a throwable filter down through an aggregate does not affect 
whether an exception is thrown.
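A small demonstration of this reasoning (spark-shell assumed; view and column names are illustrative):
```
// The predicate `a > 0` references only the grouping key, so it can be
// evaluated before the aggregate without changing the result, even though
// such predicates may in general be throwable.
spark.range(10).selectExpr("id % 3 AS a", "id AS b").createOrReplaceTempView("t")
spark.sql("SELECT a, count(b) AS c FROM t GROUP BY a HAVING a > 0").explain(true)
// The optimized plan should show the Filter below the Aggregate.
```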

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
UT

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44975 from zml1206/SPARK-46707-FOLLOWUP.

Authored-by: zml1206 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/optimizer/Optimizer.scala  |  8 ++--
 .../sql/catalyst/optimizer/FilterPushdownSuite.scala  | 19 ---
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
index dfc1e17c2a29..4ee6d9027a9c 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
@@ -1768,6 +1768,10 @@ object PushPredicateThroughNonJoin extends 
Rule[LogicalPlan] with PredicateHelpe
   val aliasMap = getAliasMap(project)
   project.copy(child = Filter(replaceAlias(condition, aliasMap), 
grandChild))
 
+// We can push down deterministic predicate through Aggregate, including 
throwable predicate.
+// If we can push down a filter through Aggregate, it means the filter 
only references the
+// grouping keys or constants. The Aggregate operator can't reduce 
distinct values of grouping
+// keys so the filter won't see any new data after push down.
 case filter @ Filter(condition, aggregate: Aggregate)
   if aggregate.aggregateExpressions.forall(_.deterministic)
 && aggregate.groupingExpressions.nonEmpty =>
@@ -1777,8 +1781,8 @@ object PushPredicateThroughNonJoin extends 
Rule[LogicalPlan] with PredicateHelpe
   // attributes produced by the aggregate operator's child operator.
   val (pushDown, stayUp) = splitConjunctivePredicates(condition).partition 
{ cond =>
 val replaced = replaceAlias(cond, aliasMap)
-cond.deterministic && !cond.throwable &&
-  cond.references.nonEmpty && 
replaced.references.subsetOf(aggregate.child.outputSet)
+cond.deterministic && cond.references.nonEmpty &&
+  replaced.references.subsetOf(aggregate.child.outputSet)
   }
 
   if (pushDown.nonEmpty) {
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala
index 03e65412d166..5027222be6b8 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala
@@ -219,6 +219,17 @@ class FilterPushdownSuite extends PlanTest {
 comparePlans(optimized, correctAnswer)
   }
 
+  test("Can't push down nondeterministic filter through aggregate") {
+val originalQuery = testRelation
+  .groupBy($"a")($"a", count($"b") as "c")
+  .where(Rand(10) > $"a")
+  .analyze
+
+val optimized = Optimize.execute(originalQuery)
+
+comparePlans(optimized, originalQuery)
+  }
+
   test("filters: combines filters") {
 val originalQuery = testRelation
   .select($"a")
@@ -1483,14 +1494,16 @@ class FilterPushdownSuite extends PlanTest {
   test("SPARK-46707: push down predicate with sequence (without step) through 
aggregates") {
 val x = testRelation.subquery("x")
 
-// do not push down when sequence has step param
+// Always push down sequence as it's deterministic
 val queryWithStep = x.groupBy($"x.a", $"x.b"

(spark) branch master updated: [SPARK-48157][SQL] Add collation support for CSV expressions

2024-05-14 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e6c914f63079 [SPARK-48157][SQL] Add collation support for CSV 
expressions
e6c914f63079 is described below

commit e6c914f630793992eba7a409ec2cd061f385ce02
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Tue May 14 14:17:45 2024 +0800

[SPARK-48157][SQL] Add collation support for CSV expressions

### What changes were proposed in this pull request?
Introduce collation awareness for CSV expressions: from_csv, schema_of_csv, 
to_csv.

### Why are the changes needed?
Add collation support for CSV expressions in Spark.

### Does this PR introduce _any_ user-facing change?
Yes, users should now be able to use collated strings within arguments for 
CSV functions: from_csv, schema_of_csv, to_csv.
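A rough usage sketch (illustrative only; spark-shell assumed):
```
// Collated strings are accepted as CSV input, and schema_of_csv/to_csv
// return the session default string type.
spark.sql("SELECT schema_of_csv(collate('1,abc,true', 'UNICODE'))").show(truncate = false)
spark.sql("SELECT from_csv(collate('1,abc', 'UNICODE_CI'), 'a INT, b STRING')").show()
```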

### How was this patch tested?
E2E SQL tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46504 from uros-db/csv-expressions.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/expressions/csvExpressions.scala  |   7 +-
 .../spark/sql/CollationSQLExpressionsSuite.scala   | 112 +
 2 files changed, 116 insertions(+), 3 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala
index 4714fc1ded9c..cb10440c4832 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala
@@ -31,6 +31,7 @@ import org.apache.spark.sql.catalyst.util._
 import org.apache.spark.sql.catalyst.util.TypeUtils._
 import org.apache.spark.sql.errors.{QueryCompilationErrors, QueryErrorsBase}
 import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.internal.types.StringTypeAnyCollation
 import org.apache.spark.sql.types._
 import org.apache.spark.unsafe.types.UTF8String
 
@@ -146,7 +147,7 @@ case class CsvToStructs(
 converter(parser.parse(csv))
   }
 
-  override def inputTypes: Seq[AbstractDataType] = StringType :: Nil
+  override def inputTypes: Seq[AbstractDataType] = StringTypeAnyCollation :: 
Nil
 
   override def prettyName: String = "from_csv"
 
@@ -177,7 +178,7 @@ case class SchemaOfCsv(
 child = child,
 options = ExprUtils.convertToMapData(options))
 
-  override def dataType: DataType = StringType
+  override def dataType: DataType = SQLConf.get.defaultStringType
 
   override def nullable: Boolean = false
 
@@ -300,7 +301,7 @@ case class StructsToCsv(
 (row: Any) => 
UTF8String.fromString(gen.writeToString(row.asInstanceOf[InternalRow]))
   }
 
-  override def dataType: DataType = StringType
+  override def dataType: DataType = SQLConf.get.defaultStringType
 
   override def withTimeZone(timeZoneId: String): TimeZoneAwareExpression =
 copy(timeZoneId = Option(timeZoneId))
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
index 22b29154cd78..f8b3548b956c 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
@@ -313,6 +313,118 @@ class CollationSQLExpressionsSuite
 })
   }
 
+  test("Support CsvToStructs csv expression with collation") {
+case class CsvToStructsTestCase(
+ input: String,
+ collationName: String,
+ schema: String,
+ options: String,
+ result: Row,
+ structFields: Seq[StructField]
+)
+
+val testCases = Seq(
+  CsvToStructsTestCase("1", "UTF8_BINARY", "'a INT'", "",
+Row(1), Seq(
+  StructField("a", IntegerType, nullable = true)
+)),
+  CsvToStructsTestCase("true, 0.8", "UTF8_BINARY_LCASE", "'A BOOLEAN, B 
DOUBLE'", "",
+Row(true, 0.8), Seq(
+  StructField("A", BooleanType, nullable = true),
+  StructField("B", DoubleType, nullable = true)
+)),
+  CsvToStructsTestCase("\"Spark\"", "UNICODE", "'a STRING'", "",
+Row("Spark"), Seq(
+  StructField("a", StringType("UNICODE"), nullable = true)
+)),
+  CsvToStructsTestCase("26/08/2015", "UTF8_BINARY", "'time Timestamp'",
+   

(spark) branch master updated: [SPARK-48229][SQL] Add collation support for inputFile expressions

2024-05-14 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9241b8e8c0df [SPARK-48229][SQL] Add collation support for inputFile 
expressions
9241b8e8c0df is described below

commit 9241b8e8c0dfe35fbe1631fd440527eb72d88de8
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Tue May 14 14:08:30 2024 +0800

[SPARK-48229][SQL] Add collation support for inputFile expressions

### What changes were proposed in this pull request?
Introduce collation awareness for inputFile expressions: input_file_name.

### Why are the changes needed?
Add collation support for inputFile expressions in Spark.

### Does this PR introduce _any_ user-facing change?
Yes, users should now be able to use collated strings within arguments for 
inputFile functions: input_file_name.

### How was this patch tested?
E2E SQL tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46503 from uros-db/input-file-block.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/expressions/inputFileBlock.scala |  5 +++--
 .../apache/spark/sql/CollationSQLExpressionsSuite.scala | 17 +
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/inputFileBlock.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/inputFileBlock.scala
index 6cd88367aa9a..65eb995ff32f 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/inputFileBlock.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/inputFileBlock.scala
@@ -21,7 +21,8 @@ import org.apache.spark.rdd.InputFileBlockHolder
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, 
CodeGenerator, ExprCode, FalseLiteral}
 import org.apache.spark.sql.catalyst.expressions.codegen.Block._
-import org.apache.spark.sql.types.{DataType, LongType, StringType}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{DataType, LongType}
 import org.apache.spark.unsafe.types.UTF8String
 
 // scalastyle:off whitespace.end.of.line
@@ -39,7 +40,7 @@ case class InputFileName() extends LeafExpression with 
Nondeterministic {
 
   override def nullable: Boolean = false
 
-  override def dataType: DataType = StringType
+  override def dataType: DataType = SQLConf.get.defaultStringType
 
   override def prettyName: String = "input_file_name"
 
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
index dd5703d1284a..22b29154cd78 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
@@ -1275,6 +1275,23 @@ class CollationSQLExpressionsSuite
 })
   }
 
+  test("Support InputFileName expression with collation") {
+// Supported collations
+Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", 
"UNICODE_CI").foreach(collationName => {
+  val query =
+s"""
+   |select input_file_name()
+   |""".stripMargin
+  // Result
+  withSQLConf(SqlApiConf.DEFAULT_COLLATION -> collationName) {
+val testQuery = sql(query)
+checkAnswer(testQuery, Row(""))
+val dataType = StringType(collationName)
+assert(testQuery.schema.fields.head.dataType.sameType(dataType))
+  }
+})
+  }
+
   // TODO: Add more tests for other SQL expressions
 
 }





(spark) branch branch-3.5 updated: [SPARK-48265][SQL] Infer window group limit batch should do constant folding

2024-05-13 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 34588a82239a [SPARK-48265][SQL] Infer window group limit batch should 
do constant folding
34588a82239a is described below

commit 34588a82239a5c12fefed13e271edd963b821b1c
Author: Angerszh 
AuthorDate: Tue May 14 13:44:47 2024 +0800

[SPARK-48265][SQL] Infer window group limit batch should do constant folding

### What changes were proposed in this pull request?
The plan after PropagateEmptyRelation may contain a double local limit:
```
 GlobalLimit 21
 +- LocalLimit 21
!   +- Union false, false
!  :- LocalLimit 21
!  :  +- Project [item_id#647L]
!  : +- Filter ()
!  :+- Relation db.table[,... 91 more fields] parquet
!  +- LocalLimit 21
! +- Project [item_id#738L]
!+- LocalRelation , [, ... 91 more fields]
```
to
```
 GlobalLimit 21
+- LocalLimit 21
   - LocalLimit 21
  +- Project [item_id#647L]
+- Filter ()
   +- Relation db.table[,... 91 more fields] parquet
```
after the `Infer window group limit batch` batch's `EliminateLimits`, the plan 
becomes
```
 GlobalLimit 21
+- LocalLimit least(21, 21)
  +- Project [item_id#647L]
+- Filter ()
   +- Relation db.table[,... 91 more fields] parquet
```
This can't work; a `ConstantFolding` step is missing here.
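A simplified sketch of the missing step (constructs the expression directly rather than running the optimizer batch):
```
// EliminateLimits merges the two limits into least(21, 21); ConstantFolding
// is needed to collapse this foldable expression back into the literal 21.
import org.apache.spark.sql.catalyst.expressions.{Least, Literal}

val merged = Least(Seq(Literal(21), Literal(21)))
assert(merged.foldable)
val folded = Literal.create(merged.eval(), merged.dataType) // what ConstantFolding would produce
```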

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46568 from AngersZh/SPARK-48265.

Authored-by: Angerszh 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 7974811218c9fb52ac9d07f8983475a885ada81b)
Signed-off-by: Wenchen Fan 
---
 .../src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala
index 70a35ea91153..6173703ef3cd 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala
@@ -89,7 +89,8 @@ class SparkOptimizer(
   InferWindowGroupLimit,
   LimitPushDown,
   LimitPushDownThroughWindow,
-  EliminateLimits) :+
+  EliminateLimits,
+  ConstantFolding) :+
 Batch("User Provided Optimizers", fixedPoint, 
experimentalMethods.extraOptimizations: _*) :+
 Batch("Replace CTE with Repartition", Once, ReplaceCTERefWithRepartition)
 





(spark) branch master updated: [SPARK-48265][SQL] Infer window group limit batch should do constant folding

2024-05-13 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 7974811218c9 [SPARK-48265][SQL] Infer window group limit batch should 
do constant folding
7974811218c9 is described below

commit 7974811218c9fb52ac9d07f8983475a885ada81b
Author: Angerszh 
AuthorDate: Tue May 14 13:44:47 2024 +0800

[SPARK-48265][SQL] Infer window group limit batch should do constant folding

### What changes were proposed in this pull request?
The plan after PropagateEmptyRelation may contain a double local limit:
```
 GlobalLimit 21
 +- LocalLimit 21
!   +- Union false, false
!  :- LocalLimit 21
!  :  +- Project [item_id#647L]
!  : +- Filter ()
!  :+- Relation db.table[,... 91 more fields] parquet
!  +- LocalLimit 21
! +- Project [item_id#738L]
!+- LocalRelation , [, ... 91 more fields]
```
to
```
 GlobalLimit 21
+- LocalLimit 21
   - LocalLimit 21
  +- Project [item_id#647L]
+- Filter ()
   +- Relation db.table[,... 91 more fields] parquet
```
after the `Infer window group limit batch` batch's `EliminateLimits`, the plan 
becomes
```
 GlobalLimit 21
+- LocalLimit least(21, 21)
  +- Project [item_id#647L]
+- Filter ()
   +- Relation db.table[,... 91 more fields] parquet
```
This can't work; a `ConstantFolding` step is missing here.

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46568 from AngersZh/SPARK-48265.

Authored-by: Angerszh 
Signed-off-by: Wenchen Fan 
---
 .../src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala
index 70a35ea91153..6173703ef3cd 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala
@@ -89,7 +89,8 @@ class SparkOptimizer(
   InferWindowGroupLimit,
   LimitPushDown,
   LimitPushDownThroughWindow,
-  EliminateLimits) :+
+  EliminateLimits,
+  ConstantFolding) :+
 Batch("User Provided Optimizers", fixedPoint, 
experimentalMethods.extraOptimizations: _*) :+
 Batch("Replace CTE with Repartition", Once, ReplaceCTERefWithRepartition)
 





(spark) branch master updated: [SPARK-48027][SQL][FOLLOWUP] Add comments for the other code branch

2024-05-13 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0ea808880e22 [SPARK-48027][SQL][FOLLOWUP] Add comments for the other 
code branch
0ea808880e22 is described below

commit 0ea808880e22e2b6cc97a3e946123bec035ade93
Author: beliefer 
AuthorDate: Tue May 14 13:26:17 2024 +0800

[SPARK-48027][SQL][FOLLOWUP] Add comments for the other code branch

### What changes were proposed in this pull request?
This PR proposes to add comments for the other code branch.

### Why are the changes needed?
https://github.com/apache/spark/pull/46263 is missing the comments for the 
other code branch.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
N/A

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #46536 from beliefer/SPARK-48027_followup.

Authored-by: beliefer 
Signed-off-by: Wenchen Fan 
---
 .../catalyst/optimizer/InjectRuntimeFilter.scala| 21 -
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala
index 3bb7c4d1ceca..176e927b2d21 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala
@@ -123,21 +123,20 @@ object InjectRuntimeFilter extends Rule[LogicalPlan] with 
PredicateHelper with J
   case ExtractEquiJoinKeys(joinType, lkeys, rkeys, _, _, left, right, _) =>
 // Runtime filters use one side of the [[Join]] to build a set of join 
key values and prune
 // the other side of the [[Join]]. It's also OK to use a superset of 
the join key values
-// (ignore null values) to do the pruning.
+// (ignore null values) to do the pruning. We can also extract from 
the other side if the
+// join keys are transitive, and the other side always produces a 
superset output of join
+// key values. Any join side always produce a superset output of its 
corresponding
+// join keys, but for transitive join keys we need to check the join 
type.
 // We assume other rules have already pushed predicates through join 
if possible.
 // So the predicate references won't pass on anymore.
 if (left.output.exists(_.semanticEquals(targetKey))) {
   extract(left, AttributeSet.empty, hasHitFilter = false, 
hasHitSelectiveFilter = false,
 currentPlan = left, targetKey = targetKey).orElse {
-// We can also extract from the right side if the join keys are 
transitive, and
-// the right side always produces a superset output of join left 
keys.
-// Let's look at an example
+// An example that extract from the right side if the join keys 
are transitive.
 // left table: 1, 2, 3
 // right table, 3, 4
-// left outer join output: (1, null), (2, null), (3, 3)
-// left key output: 1, 2, 3
-// Any join side always produce a superset output of its 
corresponding
-// join keys, but for transitive join keys we need to check the 
join type.
+// right outer join output: (3, 3), (null, 4)
+// right key output: 3, 4
 if (canPruneLeft(joinType)) {
   lkeys.zip(rkeys).find(_._1.semanticEquals(targetKey)).map(_._2)
 .flatMap { newTargetKey =>
@@ -152,7 +151,11 @@ object InjectRuntimeFilter extends Rule[LogicalPlan] with 
PredicateHelper with J
 } else if (right.output.exists(_.semanticEquals(targetKey))) {
   extract(right, AttributeSet.empty, hasHitFilter = false, 
hasHitSelectiveFilter = false,
 currentPlan = right, targetKey = targetKey).orElse {
-// We can also extract from the left side if the join keys are 
transitive.
+// An example that extract from the left side if the join keys are 
transitive.
+// left table: 1, 2, 3
+// right table, 3, 4
+// left outer join output: (1, null), (2, null), (3, 3)
+// left key output: 1, 2, 3
 if (canPruneRight(joinType)) {
   rkeys.zip(lkeys).find(_._1.semanticEquals(targetKey)).map(_._2)
 .flatMap { newTargetKey =>


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-48241][SQL][3.5] CSV parsing failure with char/varchar type columns

2024-05-13 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 19d12b249f0f [SPARK-48241][SQL][3.5] CSV parsing failure with 
char/varchar type columns
19d12b249f0f is described below

commit 19d12b249f0fe4cb5b20b9722188c5a850147cec
Author: joey.ljy 
AuthorDate: Tue May 14 13:06:57 2024 +0800

[SPARK-48241][SQL][3.5] CSV parsing failure with char/varchar type columns

### What changes were proposed in this pull request?
A CSV table containing char or varchar columns results in the following 
error when selecting from it:
```
spark-sql (default)> show create table test_csv;
CREATE TABLE default.test_csv (
  id INT,
  name CHAR(10))
USING csv
```
```
java.lang.IllegalArgumentException: requirement failed: requiredSchema 
(struct) should be the subset of dataSchema 
(struct).
at scala.Predef$.require(Predef.scala:281)
at 
org.apache.spark.sql.catalyst.csv.UnivocityParser.(UnivocityParser.scala:56)
at 
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.$anonfun$buildReader$2(CSVFileFormat.scala:127)
at 
org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:155)
at 
org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:140)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
```

### Why are the changes needed?
For char and varchar types, Spark will convert them to `StringType` in 
`CharVarcharUtils.replaceCharVarcharWithStringInSchema` and record 
`__CHAR_VARCHAR_TYPE_STRING` in the metadata.

The reason for the above error is that the `StringType` columns in the 
`dataSchema` and `requiredSchema` of `UnivocityParser` are not consistent. The 
`StringType` in the `dataSchema` has metadata, while the metadata in the 
`requiredSchema` is empty. We need to retain the metadata when resolving the schema.
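
For illustration only (not part of this patch), a minimal reproduction sketch from `spark-shell`, assuming a local session and the `test_csv` table shown above:

```
// Hedged reproduction sketch: before this fix, the SELECT below failed with the
// IllegalArgumentException shown above, because the pruned requiredSchema lost the
// __CHAR_VARCHAR_TYPE_STRING metadata that the dataSchema still carried.
spark.sql("CREATE TABLE test_csv (id INT, name CHAR(10)) USING csv")
spark.sql("INSERT INTO test_csv VALUES (1, 'Bob')")
spark.sql("SELECT id, name FROM test_csv").show()
```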

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Add a new test case in `CSVSuite`.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46565 from liujiayi771/branch-3.5-SPARK-48241.

Authored-by: joey.ljy 
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/plans/logical/LogicalPlan.scala   |  4 +++-
 sql/core/src/test/resources/test-data/char.csv |  4 
 .../sql/execution/datasources/csv/CSVSuite.scala   | 24 ++
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
index 374eb070db1c..7fe8bd356ea9 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
@@ -116,7 +116,9 @@ abstract class LogicalPlan
   def resolve(schema: StructType, resolver: Resolver): Seq[Attribute] = {
 schema.map { field =>
   resolve(field.name :: Nil, resolver).map {
-case a: AttributeReference => a
+case a: AttributeReference =>
+  // Keep the metadata in the given schema.
+  a.withMetadata(field.metadata)
 case _ => throw 
QueryExecutionErrors.resolveCannotHandleNestedSchema(this)
   }.getOrElse {
 throw QueryCompilationErrors.cannotResolveAttributeError(
diff --git a/sql/core/src/test/resources/test-data/char.csv 
b/sql/core/src/test/resources/test-data/char.csv
new file mode 100644
index ..d2be68a15fc1
--- /dev/null
+++ b/sql/core/src/test/resources/test-data/char.csv
@@ -0,0 +1,4 @@
+color,name
+pink,Bob
+blue,Mike
+grey,Tom
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
index a91adb787838..3762c00ff1a1 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
@@ -80,6 +80,7 @@ abstract class CSVSuite
   private val valueMalformedFile = "test-data/value-malformed.csv"
   private val badAfterGoodFile = "test-data/bad_after_good.csv"
   privat

(spark) branch master updated: [SPARK-48241][SQL] CSV parsing failure with char/varchar type columns

2024-05-13 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b14abb3a2ed0 [SPARK-48241][SQL] CSV parsing failure with char/varchar 
type columns
b14abb3a2ed0 is described below

commit b14abb3a2ed086d2ff8f340f60c0dc1e460c7a59
Author: joey.ljy 
AuthorDate: Mon May 13 22:42:31 2024 +0800

[SPARK-48241][SQL] CSV parsing failure with char/varchar type columns

### What changes were proposed in this pull request?
A CSV table containing char and varchar columns results in the following error when selecting from it:
```
spark-sql (default)> show create table test_csv;
CREATE TABLE default.test_csv (
  id INT,
  name CHAR(10))
USING csv
```
```
java.lang.IllegalArgumentException: requirement failed: requiredSchema 
(struct) should be the subset of dataSchema 
(struct).
at scala.Predef$.require(Predef.scala:281)
at 
org.apache.spark.sql.catalyst.csv.UnivocityParser.(UnivocityParser.scala:56)
at 
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.$anonfun$buildReader$2(CSVFileFormat.scala:127)
at 
org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:155)
at 
org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:140)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
```

### Why are the changes needed?
For char and varchar types, Spark will convert them to `StringType` in 
`CharVarcharUtils.replaceCharVarcharWithStringInSchema` and record 
`__CHAR_VARCHAR_TYPE_STRING` in the metadata.

The reason for the above error is that the `StringType` columns in the 
`dataSchema` and `requiredSchema` of `UnivocityParser` are not consistent. The 
`StringType` in the `dataSchema` has metadata, while the metadata in the 
`requiredSchema` is empty. We need to retain the metadata when resolving the schema.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Add a new test case in `CSVSuite`.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46537 from liujiayi771/csv-char.

Authored-by: joey.ljy 
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/plans/logical/LogicalPlan.scala   |  4 +++-
 sql/core/src/test/resources/test-data/char.csv |  4 
 .../sql/execution/datasources/csv/CSVSuite.scala   | 24 ++
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
index b989233da674..98e91585c2a0 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
@@ -118,7 +118,9 @@ abstract class LogicalPlan
   def resolve(schema: StructType, resolver: Resolver): Seq[Attribute] = {
 schema.map { field =>
   resolve(field.name :: Nil, resolver).map {
-case a: AttributeReference => a
+case a: AttributeReference =>
+  // Keep the metadata in the given schema.
+  a.withMetadata(field.metadata)
 case _ => throw 
QueryExecutionErrors.resolveCannotHandleNestedSchema(this)
   }.getOrElse {
 throw QueryCompilationErrors.cannotResolveAttributeError(
diff --git a/sql/core/src/test/resources/test-data/char.csv 
b/sql/core/src/test/resources/test-data/char.csv
new file mode 100644
index ..d2be68a15fc1
--- /dev/null
+++ b/sql/core/src/test/resources/test-data/char.csv
@@ -0,0 +1,4 @@
+color,name
+pink,Bob
+blue,Mike
+grey,Tom
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
index 22ea133ee19a..0e58b96531da 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
@@ -80,6 +80,7 @@ abstract class CSVSuite
   private val valueMalformedFile = "test-data/value-malformed.csv"
   private val badAfterGoodFile = "test-data/bad_after_good.csv"
   private val malformedRowFile = "test-data/m

(spark) branch master updated (42f2132d1fc9 -> 3456d4f69a86)

2024-05-13 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 42f2132d1fc9 [SPARK-48206][SQL][TESTS] Add tests for window rewrites 
with RewriteWithExpression
 add 3456d4f69a86 [SPARK-47681][FOLLOWUP] Fix schema_of_variant(decimal)

No new revisions were added by this update.

Summary of changes:
 .../sql/catalyst/expressions/variant/variantExpressions.scala  |  7 +++
 .../test/scala/org/apache/spark/sql/VariantEndToEndSuite.scala | 10 ++
 2 files changed, 13 insertions(+), 4 deletions(-)





(spark) branch master updated: [SPARK-48206][SQL][TESTS] Add tests for window rewrites with RewriteWithExpression

2024-05-13 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 42f2132d1fc9 [SPARK-48206][SQL][TESTS] Add tests for window rewrites 
with RewriteWithExpression
42f2132d1fc9 is described below

commit 42f2132d1fc99bf2ec5bd398d21dcbdbd5cbde47
Author: Kelvin Jiang 
AuthorDate: Mon May 13 22:28:27 2024 +0800

[SPARK-48206][SQL][TESTS] Add tests for window rewrites with 
RewriteWithExpression

### What changes were proposed in this pull request?

This PR adds more testing for `RewriteWithExpression` around `Window` 
operators.

### Why are the changes needed?

Adds more testing for `RewriteWithExpression`, which can be fragile around 
`WindowExpressions`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46492 from kelvinjian-db/SPARK-48206-window.

Authored-by: Kelvin Jiang 
Signed-off-by: Wenchen Fan 
---
 .../optimizer/RewriteWithExpressionSuite.scala | 223 +
 1 file changed, 135 insertions(+), 88 deletions(-)

diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala
index 8f023fa4156b..aa8ffb2b0454 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala
@@ -24,7 +24,6 @@ import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.plans.PlanTest
 import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan}
 import org.apache.spark.sql.catalyst.rules.RuleExecutor
-import org.apache.spark.sql.types.IntegerType
 
 class RewriteWithExpressionSuite extends PlanTest {
 
@@ -37,6 +36,20 @@ class RewriteWithExpressionSuite extends PlanTest {
   private val testRelation = LocalRelation($"a".int, $"b".int)
   private val testRelation2 = LocalRelation($"x".int, $"y".int)
 
+  private def normalizeCommonExpressionIds(plan: LogicalPlan): LogicalPlan = {
+plan.transformAllExpressions {
+  case a: Alias if a.name.startsWith("_common_expr") =>
+a.withName("_common_expr_0")
+  case a: AttributeReference if a.name.startsWith("_common_expr") =>
+a.withName("_common_expr_0")
+}
+  }
+
+  override def comparePlans(
+plan1: LogicalPlan, plan2: LogicalPlan, checkAnalysis: Boolean = true): Unit = {
+super.comparePlans(normalizeCommonExpressionIds(plan1), normalizeCommonExpressionIds(plan2))
+  }
+
   test("simple common expression") {
 val a = testRelation.output.head
 val expr = With(a) { case Seq(ref) =>
@@ -52,65 +65,48 @@ class RewriteWithExpressionSuite extends PlanTest {
   ref * ref
 }
 val plan = testRelation.select(expr.as("col"))
-val commonExprId = expr.defs.head.id.id
-val commonExprName = s"_common_expr_$commonExprId"
 comparePlans(
   Optimizer.execute(plan),
   testRelation
-.select((testRelation.output :+ (a + a).as(commonExprName)): _*)
-.select(($"$commonExprName" * $"$commonExprName").as("col"))
+.select((testRelation.output :+ (a + a).as("_common_expr_0")): _*)
+.select(($"_common_expr_0" * $"_common_expr_0").as("col"))
 .analyze
 )
   }
 
   test("nested WITH expression in the definition expression") {
-val a = testRelation.output.head
+val Seq(a, b) = testRelation.output
 val innerExpr = With(a + a) { case Seq(ref) =>
   ref + ref
 }
-val innerCommonExprId = innerExpr.defs.head.id.id
-val innerCommonExprName = s"_common_expr_$innerCommonExprId"
-
-val b = testRelation.output.last
 val outerExpr = With(innerExpr + b) { case Seq(ref) =>
   ref * ref
 }
-val outerCommonExprId = outerExpr.defs.head.id.id
-val outerCommonExprName = s"_common_expr_$outerCommonExprId"
 
 val plan = testRelation.select(outerExpr.as("col"))
-val rewrittenOuterExpr = ($"$innerCommonExprName" + 
$"$innerCommonExprName" + b)
-  .as(outerCommonExprName)
-val outerExprAttr = AttributeReference(outerCommonExprName, IntegerType)(
-  exprId = rewrittenOuterExpr.exprId)
 comparePlans(
   Optimizer.execute(plan),
   testRelation
-.selec

svn commit: r69098 - /dev/spark/v4.0.0-preview1-rc1-bin/

2024-05-10 Thread wenchen
Author: wenchen
Date: Sat May 11 04:28:26 2024
New Revision: 69098

Log:
Apache Spark v4.0.0-preview1-rc1

Added:
dev/spark/v4.0.0-preview1-rc1-bin/
dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz   (with 
props)
dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.asc
dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.sha512
dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz   (with props)
dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc
dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.sha512
dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz.sha512
dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-hadoop3.tgz   
(with props)
dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-hadoop3.tgz.asc

dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-hadoop3.tgz.sha512

dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz   
(with props)

dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz.asc

dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz.sha512
dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1.tgz   (with props)
dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1.tgz.asc
dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1.tgz.sha512

Added: dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz
==
Binary file - no diff available.

Propchange: dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz
--
svn:mime-type = application/octet-stream

Added: dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.asc
==
--- dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.asc (added)
+++ dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.asc Sat May 
11 04:28:26 2024
@@ -0,0 +1,17 @@
+-BEGIN PGP SIGNATURE-
+
+iQJHBAABCgAxFiEETclnbO+ag+mPygJ4TWYghDzYf1oFAmY+8UQTHHdlbmNoZW5A
+YXBhY2hlLm9yZwAKCRBNZiCEPNh/WkV1D/44BoMRwBQPQybc9ldlemMhKNQ/1OLB
+mUwhLpeUryOpUjO8AXa60YBajHqg9hivRxAUiuoaBSn7HjWY+3+nwkbcA7ZyMaV2
+Hgvfu4orB2kYXx4JgiE+dd2Zbuq+HFTv32dDUe+FyiHvhFw/bL0TIYUNJfKNcBtq
+KZDl9K5wemNjmpUSQAfEh3/vkikv5xOGxV+yEohgpB3t5Wg3hTETISXLfx/mHDu5
+GPjdCZ1omcqxZsV16CFZHV/uzK5aEDXfPdo2OO5V94xyQL0EQaMnzzMUdHkxPJ3p
+747tTf/q5rXHOb7S67MtNoBZ8myR23mQGJTwlV6E8CJWcbH7R6SEHekG9kIPGd3i
+UHoBAmroi+KfAdRej2Nqvz7SfeDeAmFw2kBRIm42FYWIqalAqbKU9LlXSpjyvYkO
+82df+5mwOzJf5VSU9D3krmjqWMFdjlLbDI1O1hLMNHyZkCYzPf+pmFhABsfGMXZH
+D8vURqF5aL9BmEuwi1SF0zSa9bI0otQj0DBvCbZnUeULSHB+P/eFqHoXjtNX2ArB
+43zmyaDywfqPXoMItvb+sGGUvatbLTCjjl6yfwgZEKOHs5noCygmL1WoLVQV+UYe
+UXb/hOJrP4FdUARpnMmz6R0NYSgQ7RZ7lOjQqs3VB7W1ashh0EWDD1hbeqMpvdx/
++fBbOLMrdzxifw==
+=2il7
+-END PGP SIGNATURE-

Added: dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.sha512
==
--- dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.sha512 
(added)
+++ dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.sha512 Sat 
May 11 04:28:26 2024
@@ -0,0 +1 @@
+60c0f5348da36d3399b596648e104202b2e9925a5b52694bf83cec9be1b4e78db6b4aa7f2f9257bca74dd514dca176ab6b51ab4c0abad2b31fb3fc5b5c14
  SparkR_4.0.0-preview1.tar.gz

Added: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz
==
Binary file - no diff available.

Propchange: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz
--
svn:mime-type = application/octet-stream

Added: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc
==
--- dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc (added)
+++ dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc Sat May 11 
04:28:26 2024
@@ -0,0 +1,17 @@
+-BEGIN PGP SIGNATURE-
+
+iQJHBAABCgAxFiEETclnbO+ag+mPygJ4TWYghDzYf1oFAmY+8UYTHHdlbmNoZW5A
+YXBhY2hlLm9yZwAKCRBNZiCEPNh/WsnjD/4m0Dyb8ZcxS/JScvFxl3eg7KRWi8d8
+bGHs/pHZxdwS/HUkBRtv0w6HXJV6ZtQW1CPtbZ0VKOqElUfGPS/VaxE91I7c2Vmb
++/P2/buVX6fBlF+vIUPECyVgblnhBeZKbBb5Wcz3xpL1Jfj/6qi3o9uLnFFfy55S
+N6FWIJ5xrjl9mlo6+s4qqL/06u982NaEyUsu51eNgapTQcNUAjFKme13WC3W7n0S
+i6ixtW1oXmfY74CzSfn6KNC+5QvxKwJznS7ZxrG3g/chcaR8rApUZ526v4XL7LP0
+BDNeqCI+blAjVYFUzBIkvZp8SR/BbJv2HSySq5hbf0S6l0O+iuj8tZ/oa8Z0hCNf
+lXUw2ORG7RJKUZePdC+F+vYrmISyDRiWb4ddSUAjkzXy8KEWw6y55VULCq4vHbDc
+1Zwmf2izaujavcSJMjBnMhoZZ1PBlxgVQwHYu0Pi3qLCxyIn4oTd1wW7h6u5IGMr
++1LjMaGCrKbWSafp+cXGtzfJGjzPjCdIN2HqX6l53Vli4jn8I8yGJZs7hp+SZ281
+QBmzgiDLWUdQf+72bGNNlvy1FliPg0k7

svn commit: r69097 - /dev/spark/v4.0.0-preview1-rc1-bin/

2024-05-10 Thread wenchen
Author: wenchen
Date: Sat May 11 03:59:33 2024
New Revision: 69097

Log:
prepare for re-uploading

Removed:
dev/spark/v4.0.0-preview1-rc1-bin/





svn commit: r69092 - in /dev/spark/v4.0.0-preview1-rc1-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/R/articles/ _site/api/R/articles/sparkr-vignettes_files/ _site/api/R/articles/sparkr-vignettes_

2024-05-10 Thread wenchen
Author: wenchen
Date: Fri May 10 16:44:08 2024
New Revision: 69092

Log:
Apache Spark v4.0.0-preview1-rc1 docs


[This commit notification would consist of 4810 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]




(spark) branch master updated: [SPARK-48143][SQL] Use lightweight exceptions for control-flow between UnivocityParser and FailureSafeParser

2024-05-10 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a6632ffa16f6 [SPARK-48143][SQL] Use lightweight exceptions for 
control-flow between UnivocityParser and FailureSafeParser
a6632ffa16f6 is described below

commit a6632ffa16f6907eba96e745920d571924bf4b63
Author: Vladimir Golubev 
AuthorDate: Sat May 11 00:37:54 2024 +0800

[SPARK-48143][SQL] Use lightweight exceptions for control-flow between 
UnivocityParser and FailureSafeParser

### What changes were proposed in this pull request?
A new lightweight exception for control-flow between `UnivocityParser` and `FailureSafeParser` to speed up malformed CSV parsing.

This is a different way to implement these reverted changes: 
https://github.com/apache/spark/pull/46478

The previous implementation was more invasive: removing `cause` from `BadRecordException` could break code higher up the call stack that unwraps errors and checks the types of the causes. This implementation only touches `FailureSafeParser` and `UnivocityParser`, since in the codebase they are always used together, unlike `JacksonParser` and `StaxXmlParser`. Removing the stacktrace from `BadRecordException` is safe, since the cause itself has an adequate stacktrace (except in pure control-flow cases).
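
As an aside (illustrative only, not the class added by this patch), the usual pattern for such lightweight control-flow exceptions is to disable stacktrace collection at construction time; a minimal Scala sketch:

```
// Hypothetical sketch of a "lightweight" control-flow exception. Passing
// writableStackTrace = false to the Throwable constructor skips the expensive
// stacktrace filling; enableSuppression = false also avoids tracking suppressed exceptions.
class ControlFlowException(message: String)
  extends Exception(message, /* cause */ null, /* enableSuppression */ false, /* writableStackTrace */ false)
```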

### Why are the changes needed?
Parsing in `PermissiveMode` is slow due to heavy exception construction 
(stacktrace filling + string template substitution in `SparkRuntimeException`)

### Does this PR introduce _any_ user-facing change?
No, since `FailureSafeParser` unwraps `BadRecordException` and correctly 
rethrows user-facing exceptions in `FailFastMode`

### How was this patch tested?
- `testOnly org.apache.spark.sql.catalyst.csv.UnivocityParserSuite`
- Manually ran the CSV benchmark
- Manually checked correct and malformed CSV in spark-shell (org.apache.spark.SparkException is thrown with the stacktrace)

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46500 from 
vladimirg-db/vladimirg-db/use-special-lighweight-exception-for-control-flow-between-univocity-parser-and-failure-safe-parser.

Authored-by: Vladimir Golubev 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/csv/UnivocityParser.scala   |  5 +++--
 .../sql/catalyst/util/BadRecordException.scala | 22 +++---
 .../sql/catalyst/util/FailureSafeParser.scala  | 11 +--
 3 files changed, 31 insertions(+), 7 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
index a5158d8a22c6..4d95097e1681 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
@@ -316,7 +316,7 @@ class UnivocityParser(
   throw BadRecordException(
 () => getCurrentInput,
 () => Array.empty,
-QueryExecutionErrors.malformedCSVRecordError(""))
+LazyBadRecordCauseWrapper(() => QueryExecutionErrors.malformedCSVRecordError("")))
 }
 
 val currentInput = getCurrentInput
@@ -326,7 +326,8 @@ class UnivocityParser(
   // However, we still have chance to parse some of the tokens. It continues to parses the
   // tokens normally and sets null when `ArrayIndexOutOfBoundsException` occurs for missing
   // tokens.
-  Some(QueryExecutionErrors.malformedCSVRecordError(currentInput.toString))
+  Some(LazyBadRecordCauseWrapper(
+() => QueryExecutionErrors.malformedCSVRecordError(currentInput.toString)))
 } else None
 // When the length of the returned tokens is identical to the length of 
the parsed schema,
 // we just need to:
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BadRecordException.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BadRecordException.scala
index 65a56c1064e4..654b0b8c73e5 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BadRecordException.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BadRecordException.scala
@@ -67,16 +67,32 @@ case class PartialResultArrayException(
   extends Exception(cause)
 
 /**
- * Exception thrown when the underlying parser meet a bad record and can't parse it.
+ * Exception thrown when the underlying parser met a bad record and can't parse it.
+ * The stacktrace is not collected for better performance, and thus, this exception should
+ * not be used in a user-facing context.
  * @param record a function to return the record that caused the parser to fail
  * @param partialResults a fu

(spark) branch master updated: [SPARK-48146][SQL] Fix aggregate function in With expression child assertion

2024-05-10 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 7ef0440ef221 [SPARK-48146][SQL] Fix aggregate function in With 
expression child assertion
7ef0440ef221 is described below

commit 7ef0440ef22161a6160f7b9000c70b26c84eecf7
Author: Kelvin Jiang 
AuthorDate: Fri May 10 22:39:15 2024 +0800

[SPARK-48146][SQL] Fix aggregate function in With expression child assertion

### What changes were proposed in this pull request?

In https://github.com/apache/spark/pull/46034, there was a complicated edge case where common expression references in aggregate functions in the child of a `With` expression could become dangling. An assertion was added to prevent that case, but the assertion wasn't fully accurate, as a query like:
```
select
  id between max(if(id between 1 and 2, 2, 1)) over () and id
from range(10)
```
would fail the assertion.

This PR fixes the assertion to be more accurate.

### Why are the changes needed?

This addresses a regression in https://github.com/apache/spark/pull/46034.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46443 from kelvinjian-db/SPARK-48146-agg.

Authored-by: Kelvin Jiang 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/expressions/With.scala  | 26 +
 .../optimizer/RewriteWithExpressionSuite.scala | 27 +-
 2 files changed, 48 insertions(+), 5 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/With.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/With.scala
index 14deedd9c70f..29794b33641c 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/With.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/With.scala
@@ -17,7 +17,8 @@
 
 package org.apache.spark.sql.catalyst.expressions
 
-import org.apache.spark.sql.catalyst.trees.TreePattern.{AGGREGATE_EXPRESSION, 
COMMON_EXPR_REF, TreePattern, WITH_EXPRESSION}
+import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.trees.TreePattern.{COMMON_EXPR_REF, 
TreePattern, WITH_EXPRESSION}
 import org.apache.spark.sql.types.DataType
 
 /**
@@ -27,9 +28,11 @@ import org.apache.spark.sql.types.DataType
  */
 case class With(child: Expression, defs: Seq[CommonExpressionDef])
   extends Expression with Unevaluable {
-  // We do not allow With to be created with an AggregateExpression in the child, as this would
-  // create a dangling CommonExpressionRef after rewriting it in RewriteWithExpression.
-  assert(!child.containsPattern(AGGREGATE_EXPRESSION))
+  // We do not allow creating a With expression with an AggregateExpression that contains a
+  // reference to a common expression defined in that scope (note that it can contain another With
+  // expression with a common expression ref of the inner With). This is to prevent the creation of
+  // a dangling CommonExpressionRef after rewriting it in RewriteWithExpression.
+  assert(!With.childContainsUnsupportedAggExpr(this))
 
   override val nodePatterns: Seq[TreePattern] = Seq(WITH_EXPRESSION)
   override def dataType: DataType = child.dataType
@@ -92,6 +95,21 @@ object With {
 val commonExprRefs = commonExprDefs.map(new CommonExpressionRef(_))
 With(replaced(commonExprRefs), commonExprDefs)
   }
+
+  private[sql] def childContainsUnsupportedAggExpr(withExpr: With): Boolean = {
+lazy val commonExprIds = withExpr.defs.map(_.id).toSet
+withExpr.child.exists {
+  case agg: AggregateExpression =>
+// Check that the aggregate expression does not contain a reference to a common expression
+// in the outer With expression (it is ok if it contains a reference to a common expression
+// for a nested With expression).
+agg.exists {
+  case r: CommonExpressionRef => commonExprIds.contains(r.id)
+  case _ => false
+}
+  case _ => false
+}
+  }
 }
 
 case class CommonExpressionId(id: Long = CommonExpressionId.newId, 
canonicalized: Boolean = false) {
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala
index d482b18d9331..8f023fa4156b 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/

(spark) branch master updated (33cac4436e59 -> 2df494fd4e4e)

2024-05-10 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 33cac4436e59 [SPARK-47847][CORE] Deprecate 
`spark.network.remoteReadNioBufferConversion`
 add 2df494fd4e4e [SPARK-48158][SQL] Add collation support for XML 
expressions

No new revisions were added by this update.

Summary of changes:
 .../sql/catalyst/expressions/xmlExpressions.scala  |   9 +-
 .../spark/sql/CollationSQLExpressionsSuite.scala   | 124 +
 2 files changed, 129 insertions(+), 4 deletions(-)





(spark) branch master updated: [SPARK-48222][INFRA][DOCS] Sync Ruby Bundler to 2.4.22 and refresh Gem lock file

2024-05-09 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9a2818820f11 [SPARK-48222][INFRA][DOCS] Sync Ruby Bundler to 2.4.22 
and refresh Gem lock file
9a2818820f11 is described below

commit 9a2818820f11f9bdcc042f4ab80850918911c68c
Author: Nicholas Chammas 
AuthorDate: Fri May 10 09:58:16 2024 +0800

[SPARK-48222][INFRA][DOCS] Sync Ruby Bundler to 2.4.22 and refresh Gem lock 
file

### What changes were proposed in this pull request?

Sync the version of Bundler that we are using across various scripts and 
documentation. Also refresh the Gem lock file.

### Why are the changes needed?

We are seeing inconsistent build behavior, likely due to the inconsistent 
Bundler versions.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI + the preview release process.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46512 from nchammas/bundler-sync.

Authored-by: Nicholas Chammas 
Signed-off-by: Wenchen Fan 
---
 .github/workflows/build_and_test.yml   |  3 +++
 dev/create-release/spark-rm/Dockerfile |  2 +-
 docs/Gemfile.lock  | 16 
 docs/README.md |  2 +-
 4 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 4a11823aee60..881fb8cb0674 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -872,6 +872,9 @@ jobs:
 python3.9 -m pip install 'docutils<0.18.0' # See SPARK-39421
 - name: Install dependencies for documentation generation
   run: |
+# Keep the version of Bundler here in sync with the following locations:
+#   - dev/create-release/spark-rm/Dockerfile
+#   - docs/README.md
 gem install bundler -v 2.4.22
 cd docs
 bundle install
diff --git a/dev/create-release/spark-rm/Dockerfile 
b/dev/create-release/spark-rm/Dockerfile
index 8d5ca38ba88e..13f4112ca03d 100644
--- a/dev/create-release/spark-rm/Dockerfile
+++ b/dev/create-release/spark-rm/Dockerfile
@@ -38,7 +38,7 @@ ENV DEBCONF_NONINTERACTIVE_SEEN true
 ARG APT_INSTALL="apt-get install --no-install-recommends -y"
 
 ARG PIP_PKGS="sphinx==4.5.0 mkdocs==1.1.2 numpy==1.20.3 
pydata_sphinx_theme==0.13.3 ipython==7.19.0 nbsphinx==0.8.0 numpydoc==1.1.0 
jinja2==3.1.2 twine==3.4.1 sphinx-plotly-directive==0.1.3 
sphinx-copybutton==0.5.2 pandas==2.0.3 pyarrow==10.0.1 plotly==5.4.0 
markupsafe==2.0.1 docutils<0.17 grpcio==1.62.0 protobuf==4.21.6 
grpcio-status==1.62.0 googleapis-common-protos==1.56.4"
-ARG GEM_PKGS="bundler:2.3.8"
+ARG GEM_PKGS="bundler:2.4.22"
 
 # Install extra needed repos and refresh.
 # - CRAN repo
diff --git a/docs/Gemfile.lock b/docs/Gemfile.lock
index 4e38f18703f3..e137f0f039b9 100644
--- a/docs/Gemfile.lock
+++ b/docs/Gemfile.lock
@@ -4,16 +4,16 @@ GEM
 addressable (2.8.6)
   public_suffix (>= 2.0.2, < 6.0)
 colorator (1.1.0)
-concurrent-ruby (1.2.2)
+concurrent-ruby (1.2.3)
 em-websocket (0.5.3)
   eventmachine (>= 0.12.9)
   http_parser.rb (~> 0)
 eventmachine (1.2.7)
 ffi (1.16.3)
 forwardable-extended (2.6.0)
-google-protobuf (3.25.2)
+google-protobuf (3.25.3)
 http_parser.rb (0.8.0)
-i18n (1.14.1)
+i18n (1.14.5)
   concurrent-ruby (~> 1.0)
 jekyll (4.3.3)
   addressable (~> 2.4)
@@ -42,22 +42,22 @@ GEM
 kramdown-parser-gfm (1.1.0)
   kramdown (~> 2.0)
 liquid (4.0.4)
-listen (3.8.0)
+listen (3.9.0)
   rb-fsevent (~> 0.10, >= 0.10.3)
   rb-inotify (~> 0.9, >= 0.9.10)
 mercenary (0.4.0)
 pathutil (0.16.2)
   forwardable-extended (~> 2.6)
-public_suffix (5.0.4)
-rake (13.1.0)
+public_suffix (5.0.5)
+rake (13.2.1)
 rb-fsevent (0.11.2)
 rb-inotify (0.10.1)
   ffi (~> 1.0)
 rexml (3.2.6)
 rouge (3.30.0)
 safe_yaml (1.0.5)
-sass-embedded (1.69.7)
-  google-protobuf (~> 3.25)
+sass-embedded (1.63.6)
+  google-protobuf (~> 3.23)
   rake (>= 13.0.0)
 terminal-table (3.0.2)
   unicode-display_width (>= 1.1.1, < 3)
diff --git a/docs/README.md b/docs/README.md
index 414c8dbd8303..363f1c207636 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -36,7 +36,7 @@ You need to have [Ruby 3][ruby] and [Python 3][python] 
installed. Make sure the
 [python]: https://www.python.org/downloads/
 
 ```sh
-$ gem install bundler
+$ gem install bundler -v 2.4.22
 ```
 
 After this all the required Ruby dependencies can be installed from the 
`docs/` directory

svn commit: r69065 - /dev/spark/v4.0.0-preview1-rc1-bin/

2024-05-09 Thread wenchen
Author: wenchen
Date: Thu May  9 16:31:11 2024
New Revision: 69065

Log:
Apache Spark v4.0.0-preview1-rc1

Added:
dev/spark/v4.0.0-preview1-rc1-bin/
dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz   (with props)
dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc
dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.sha512
dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz   (with 
props)
dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz.asc
dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz.sha512
dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-hadoop3.tgz   
(with props)
dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-hadoop3.tgz.asc

dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-hadoop3.tgz.sha512

dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz   
(with props)

dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz.asc

dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz.sha512
dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1.tgz   (with props)
dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1.tgz.asc
dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1.tgz.sha512

Added: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz
==
Binary file - no diff available.

Propchange: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz
--
svn:mime-type = application/octet-stream

Added: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc
==
--- dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc (added)
+++ dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc Thu May  9 
16:31:11 2024
@@ -0,0 +1,17 @@
+-BEGIN PGP SIGNATURE-
+
+iQJHBAABCgAxFiEETclnbO+ag+mPygJ4TWYghDzYf1oFAmY8+e4THHdlbmNoZW5A
+YXBhY2hlLm9yZwAKCRBNZiCEPNh/Wv78D/9aNsBANuVpIjYr+XkWYaimRLJ5IT0Z
+qKehjJBuMBDaBMMN3iWconDHBiASQT0FTYGDBeYI72fLFSMKBna5+Lu22+KD/K6h
+V8SZxPSQsAHQABYq9ha++XXyo1Vo+msPQ0pQAblmTrSpsvSWZmC8spzb5GbKYvK5
+kxr4Qt1XnHeGNJNToqGlbl/Hc2Etg5PkPBxMPBWMh7kLknMEscMNUf87JqCIa8LG
+hMid/0lrrevEm8gkuu0ol9Vgz4P+dreKE9eCfmWOXCod04y8tJnVPs83wUOZfmKV
+dHkELaMVwz3fa40QP77gK38K5i22aUgYk6dvhB+OgtatZ5tk0Dxp3AI2OObngEUm
+4cGmQLwcses53vApwkExq427gS8td4sTE2G1D4+hSdEcm8Fj69w4Ado/DlIAHZob
+KLV15qtNOyaIapT4GxBqoeqsw7tnRmxiP8K8UxFcPV/vZC1yQKIIULigPjttZKoW
++REE2N7ZyPvbvgItwjAL8hpCeYEkd7RDa7ofHAv6icC1qSsJZ9gxFM4rJvriI4g2
+tnYEvZduGpBunhlwVb0R3kAF5XoLIZQ5qm6kyWAzioc0gxzYVc3Rd+bXjm+vmopt
+bXHOM6N2lLQwqnWlHsyjGVFugrkkRXZbQbIV6FynXpKaz5YtkUhUMkofz7mOYhBi
++1Z8nZ04B6YLbw==
+=85FX
+-END PGP SIGNATURE-

Added: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.sha512
==
--- dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.sha512 (added)
+++ dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.sha512 Thu May  
9 16:31:11 2024
@@ -0,0 +1 @@
+2509cf6473495b0cd5c132d87f5e1c33593fa7375ca01bcab1483093cea92bdb6ad7afc7c72095376b28fc5acdc71bb323935d17513f33ee5276c6991ff668d1
  pyspark-4.0.0.dev1.tar.gz

Added: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz
==
Binary file - no diff available.

Propchange: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz
--
svn:mime-type = application/octet-stream

Added: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz.asc
==
--- dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz.asc 
(added)
+++ dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz.asc Thu 
May  9 16:31:11 2024
@@ -0,0 +1,17 @@
+-BEGIN PGP SIGNATURE-
+
+iQJGBAABCgAxFiEETclnbO+ag+mPygJ4TWYghDzYf1oFAmY8+fATHHdlbmNoZW5A
+YXBhY2hlLm9yZwAKCRBNZiCEPNh/WoCMD/iZjkaGTUqt3jkIjWIUzpQo+kLn8//m
+f+hwUtAguXvbMJXwBOz/Q/f+KvGk0tutsbd6rmBB6cHjH4GoZPp1x6iBitFAO47r
+kHy/0xYkb70SPQCWIGQQpRv3g0uxTmpqL9H4YcIvexkV2wXG5VSwGvbSI4596n7l
+x7M3rRmFzrxhcNIYLQdhNuat0mwuJFWe6R7Zk7UYFFishn9dNt8EOYx8vsGAuMP8
+Uy3+7oZQOAGqdQGSL7Ev4Pqve7MrrPgGXaixGukXibi707NCURnHTDcenPfoEEiQ
+Hj83I3G+JrRhtsue/103a/GnHheUgwE8oEkefnUX7qC5tSn4T8lI2KpDBv9AL1pm
+Bv0eXf5X5xEM4wvO7DCgbeEDPLg72jjt9X8zjAYx05HddvTuPjeKEL+Ga6G0ueTz
+HRXHrgd1EFZ1znPZhWiSTmeqZTXdrb6wKTYt8Y6mk1oEGL3b0qE2LNkSED+4l40u
+41MlV3pmZyjRGYZl29XZKf4isKYyjec7UbJSM5ok4zCRF0p8Gvj0EihGS4X6rYpW
+9XxwjViKMIp7DCEcWjWpO6pJ8Ygb2Snh1UTFFgtzSVAoMqUgHnBHejJ4RA4ncHu6

(spark) branch master updated: [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions (for UTF8_BINARY & LCASE)

2024-05-09 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 21333f8c1fc0 [SPARK-47409][SQL] Add support for collation for 
StringTrim type of functions/expressions (for UTF8_BINARY & LCASE)
21333f8c1fc0 is described below

commit 21333f8c1fc01756e6708ad6ccf21f585fcb881d
Author: David Milicevic 
AuthorDate: Thu May 9 23:05:20 2024 +0800

[SPARK-47409][SQL] Add support for collation for StringTrim type of 
functions/expressions (for UTF8_BINARY & LCASE)

Recreating [original PR](https://github.com/apache/spark/pull/45749) 
because code has been reorganized in [this 
PR](https://github.com/apache/spark/pull/45978).

### What changes were proposed in this pull request?
This PR adds collation support to the StringTrim family of functions/expressions, specifically:
- `StringTrim`
- `StringTrimBoth`
- `StringTrimLeft`
- `StringTrimRight`

Changes:
- `CollationSupport.java`
  - Add new `StringTrim`, `StringTrimLeft` and `StringTrimRight` classes 
with corresponding logic.
  - `CollationAwareUTF8String` - add new `trim`, `trimLeft` and `trimRight` 
methods that actually implement trim logic.
- `UTF8String.java` - expose some of the methods publicly.
- `stringExpressions.scala`
  - Change input types.
  - Change eval and code gen logic.
- `CollationTypeCasts.scala` - add `StringTrim*` expressions to 
`CollationTypeCasts` rules.

### Why are the changes needed?
We are incrementally adding collation support to built-in string functions in Spark.

### Does this PR introduce _any_ user-facing change?
Yes:
- Users should now be able to use non-default collations in string trim functions, as sketched below.
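
An illustrative usage sketch only (not taken from this patch); the collation name `UTF8_BINARY_LCASE` and the exact syntax are assumptions and may differ across Spark versions:

```
// Hedged example, run from spark-shell: trimming with a case-insensitive collation,
// so both 'x' and 'X' would be stripped from the ends of the input string.
spark.sql("SELECT TRIM(BOTH 'x' FROM 'xXabcxX' COLLATE UTF8_BINARY_LCASE)").show()
```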

### How was this patch tested?
Already existing tests + new unit/e2e tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46206 from davidm-db/string-trim-functions.

Authored-by: David Milicevic 
Signed-off-by: Wenchen Fan 
---
 .../catalyst/util/CollationAwareUTF8String.java| 470 ++
 .../spark/sql/catalyst/util/CollationSupport.java  | 534 -
 .../org/apache/spark/unsafe/types/UTF8String.java  |   2 +-
 .../spark/unsafe/types/CollationSupportSuite.java  | 193 
 .../sql/catalyst/analysis/CollationTypeCasts.scala |   2 +-
 .../catalyst/expressions/stringExpressions.scala   |  53 +-
 .../sql/CollationStringExpressionsSuite.scala  | 161 ++-
 7 files changed, 1054 insertions(+), 361 deletions(-)

diff --git 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
new file mode 100644
index ..ee0d611d7e65
--- /dev/null
+++ 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
@@ -0,0 +1,470 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import com.ibm.icu.lang.UCharacter;
+import com.ibm.icu.text.BreakIterator;
+import com.ibm.icu.text.StringSearch;
+import com.ibm.icu.util.ULocale;
+
+import org.apache.spark.unsafe.UTF8StringBuilder;
+import org.apache.spark.unsafe.types.UTF8String;
+
+import static org.apache.spark.unsafe.Platform.BYTE_ARRAY_OFFSET;
+import static org.apache.spark.unsafe.Platform.copyMemory;
+
+import java.util.HashMap;
+import java.util.Map;
+
+/**
+ * Utility class for collation-aware UTF8String operations.
+ */
+public class CollationAwareUTF8String {
+  public static UTF8String replace(final UTF8String src, final UTF8String search,
+  final UTF8String replace, final int collationId) {
+// This collation aware implementation is based on existing implementation on UTF8String
+if (src.numBytes() == 0 || search.numBytes() == 0) {
+  return src;
+}
+
+StringSearch stringSearch = CollationFactory.getStringSearch(src, search, 
