(spark) branch branch-3.5 updated: [SPARK-49000][SQL][3.5] Fix "select count(distinct 1) from t" where t is empty table by expanding RewriteDistinctAggregates

2024-08-02 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 0008bd1df41a [SPARK-49000][SQL][3.5] Fix "select count(distinct 1) 
from t" where t is empty table by expanding RewriteDistinctAggregates
0008bd1df41a is described below

commit 0008bd1df41aabb155af6f38f4fc491b06d9f314
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Fri Aug 2 22:28:07 2024 +0800

[SPARK-49000][SQL][3.5] Fix "select count(distinct 1) from t" where t is 
empty table by expanding RewriteDistinctAggregates

### What changes were proposed in this pull request?
Fix `RewriteDistinctAggregates` rule to deal properly with aggregation on 
DISTINCT literals. Physical plan for `select count(distinct 1) from t`:

```
-- count(distinct 1)
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(distinct 1)], output=[count(DISTINCT 1)#2L])
   +- HashAggregate(keys=[], functions=[partial_count(distinct 1)], output=[count#6L])
      +- HashAggregate(keys=[], functions=[], output=[])
         +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=20]
            +- HashAggregate(keys=[], functions=[], output=[])
               +- FileScan parquet spark_catalog.default.t[] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/nikola.mandic/oss-spark/spark-warehouse/org.apache.spark.s..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
```

The problem occurs when the `HashAggregate(keys=[], functions=[], output=[])`
node yields one row to the `partial_count` node, which then counts that row.
This four-node structure is constructed by `AggUtils.planAggregateWithOneDistinct`.

To fix the problem, we add an `Expand` node which forces non-empty grouping
expressions in the `HashAggregateExec` nodes. This in turn enables streaming
zero rows to the parent `partial_count` node, yielding the correct final
result.

### Why are the changes needed?
Aggregation with a DISTINCT literal gives wrong results. For example, when
running on an empty table `t`:
`select count(distinct 1) from t` returns 1, while the correct result 
should be 0.
For reference:
`select count(1) from t` returns 0, which is the correct and expected 
result.
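
For illustration, a minimal spark-shell reproduction (the table name and schema
below are illustrative, not taken from this patch):
```scala
// Reproduction sketch, assuming an active `spark` session and that table `t` does not exist yet.
spark.sql("CREATE TABLE t (col INT) USING parquet")
spark.sql("SELECT COUNT(DISTINCT 1) FROM t").show()  // returned 1 before this fix; 0 is correct
spark.sql("SELECT COUNT(1) FROM t").show()           // 0, as expected
```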

### Does this PR introduce _any_ user-facing change?
Yes, this fixes a critical bug in Spark.

### How was this patch tested?
New e2e SQL tests for aggregates with DISTINCT literals.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47566 from uros-db/SPARK-49000-3.5.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../optimizer/RewriteDistinctAggregates.scala  |  16 ++-
 .../apache/spark/sql/DataFrameAggregateSuite.scala | 111 +
 2 files changed, 124 insertions(+), 3 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
index da3cf782f668..801bd2693af4 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
@@ -197,6 +197,17 @@ import org.apache.spark.util.collection.Utils
  * techniques.
  */
 object RewriteDistinctAggregates extends Rule[LogicalPlan] {
+  private def mustRewrite(
+  distinctAggs: Seq[AggregateExpression],
+  groupingExpressions: Seq[Expression]): Boolean = {
+// If there are any distinct AggregateExpressions with filter, we need to 
rewrite the query.
+// Also, if there are no grouping expressions and all distinct aggregate 
expressions are
+// foldable, we need to rewrite the query, e.g. SELECT COUNT(DISTINCT 1). 
Without this case,
+// non-grouping aggregation queries with distinct aggregate expressions 
will be incorrectly
+// handled by the aggregation strategy, causing wrong results when working 
with empty tables.
+distinctAggs.exists(_.filter.isDefined) || (groupingExpressions.isEmpty &&
+  distinctAggs.exists(_.aggregateFunction.children.forall(_.foldable)))
+  }
 
   private def mayNeedtoRewrite(a: Aggregate): Boolean = {
 val aggExpressions = collectAggregateExprs(a)
@@ -204,8 +215,7 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
 // We need at least two distinct aggregates or the single distinct aggregate group exists filter
 // clause for this rule

(spark) branch master updated: [SPARK-49093][SQL] GROUP BY with MapType nested inside complex type

2024-08-01 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new aca0d24c8724 [SPARK-49093][SQL] GROUP BY with MapType nested inside 
complex type
aca0d24c8724 is described below

commit aca0d24c8724f1de4e65ad6c663b8e2255dbccd0
Author: Nebojsa Savic 
AuthorDate: Fri Aug 2 14:28:20 2024 +0800

[SPARK-49093][SQL] GROUP BY with MapType nested inside complex type

### What changes were proposed in this pull request?
Currently we support GROUP BY on a column of MapType, but we don't support
scenarios where the column type contains MapType nested inside some complex
type (e.g. an array of maps); this PR addresses that.
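
As a hedged illustration of the newly supported shape (table and column names
below are made up, not from this patch):
```scala
// `tags` nests MapType inside an ArrayType; grouping by it now inserts MapSort recursively.
spark.sql("CREATE TABLE events (id INT, tags ARRAY<MAP<STRING, INT>>) USING parquet")
spark.sql("SELECT tags, COUNT(*) FROM events GROUP BY tags").show()
```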

### Why are the changes needed?
We are extending support for MapType columns in GROUP BY clause.

### Does this PR introduce _any_ user-facing change?
Users will be able to use MapType nested inside a complex type as part of the
GROUP BY clause.

### How was this patch tested?
Added tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47331 from nebojsa-db/SC-170296.

Authored-by: Nebojsa Savic 
Signed-off-by: Wenchen Fan 
---
 .../InsertMapSortInGroupingExpressions.scala   |  53 +++-
 .../sql-tests/analyzer-results/group-by.sql.out| 130 +++
 .../test/resources/sql-tests/inputs/group-by.sql   |  64 +
 .../resources/sql-tests/results/group-by.sql.out   | 143 +
 4 files changed, 386 insertions(+), 4 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InsertMapSortInGroupingExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InsertMapSortInGroupingExpressions.scala
index 3d883fa9d9ae..51f6ca374909 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InsertMapSortInGroupingExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InsertMapSortInGroupingExpressions.scala
@@ -17,11 +17,12 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.MapSort
+import org.apache.spark.sql.catalyst.expressions.{ArrayTransform, 
CreateNamedStruct, Expression, GetStructField, If, IsNull, LambdaFunction, 
Literal, MapFromArrays, MapKeys, MapSort, MapValues, NamedLambdaVariable}
 import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
 import org.apache.spark.sql.catalyst.rules.Rule
 import org.apache.spark.sql.catalyst.trees.TreePattern.AGGREGATE
-import org.apache.spark.sql.types.MapType
+import org.apache.spark.sql.types.{ArrayType, MapType, StructType}
+import org.apache.spark.util.ArrayImplicits.SparkArrayOps
 
 /**
  * Adds MapSort to group expressions containing map columns, as the key/value pairs need to be
@@ -34,12 +35,56 @@ object InsertMapSortInGroupingExpressions extends 
Rule[LogicalPlan] {
 _.containsPattern(AGGREGATE), ruleId) {
 case a @ Aggregate(groupingExpr, _, _) =>
   val newGrouping = groupingExpr.map { expr =>
-if (!expr.isInstanceOf[MapSort] && 
expr.dataType.isInstanceOf[MapType]) {
-  MapSort(expr)
+if (!expr.exists(_.isInstanceOf[MapSort])
+  && expr.dataType.existsRecursively(_.isInstanceOf[MapType])) {
+  insertMapSortRecursively(expr)
 } else {
   expr
 }
   }
   a.copy(groupingExpressions = newGrouping)
   }
+
+  /*
+  Inserts MapSort recursively taking into account when
+  it is nested inside a struct or array.
+   */
+  private def insertMapSortRecursively(e: Expression): Expression = {
+e.dataType match {
+  case m: MapType =>
+// Check if value type of MapType contains MapType (possibly nested)
+// and special handle this case.
+val mapSortExpr = if 
(m.valueType.existsRecursively(_.isInstanceOf[MapType])) {
+  MapFromArrays(MapKeys(e), insertMapSortRecursively(MapValues(e)))
+} else {
+  e
+}
+
+MapSort(mapSortExpr)
+
+  case StructType(fields)
+if 
fields.exists(_.dataType.existsRecursively(_.isInstanceOf[MapType])) =>
+val struct = CreateNamedStruct(fields.zipWithIndex.flatMap { case (f, 
i) =>
+  Seq(Literal(f.name), insertMapSortRecursively(
+GetStructField(e, i, Some(f.name
+}.toImmutableArraySeq)
+if (struct.valExprs.forall(_.isInstanceOf[GetStructField])) {
+  // No field needs MapSort processing, just return the original 
expression.
+  e
+} else if (e.nullable) {
+  If(IsNull(e), Literal(null, struct.dataType), struct)
+} else {
+  struct
+}
+
+  case ArrayType(et, containsNull) if 
et.e

[jira] [Assigned] (SPARK-49065) Rebasing in legacy formatters/parsers must support non JVM default time zones

2024-08-01 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-49065:
---

Assignee: Sumeet Varma

> Rebasing in legacy formatters/parsers must support non JVM default time zones
> -
>
> Key: SPARK-49065
> URL: https://issues.apache.org/jira/browse/SPARK-49065
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Sumeet Varma
>Assignee: Sumeet Varma
>Priority: Major
>  Labels: pull-request-available
>
> Currently, rebasing timestamps defaults to the JVM timezone, so it produces
> incorrect results when the explicitly overridden timezone in the
> TimestampFormatter library is not the same as the JVM timezone.
> To fix it, explicitly pass the overridden timezone parameter to
> {{rebaseJulianToGregorianMicros}} and {{rebaseGregorianToJulianMicros}}.
>  






[jira] [Resolved] (SPARK-49065) Rebasing in legacy formatters/parsers must support non JVM default time zones

2024-08-01 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49065.
-
Fix Version/s: 3.5.2
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 47541
[https://github.com/apache/spark/pull/47541]

> Rebasing in legacy formatters/parsers must support non JVM default time zones
> -
>
> Key: SPARK-49065
> URL: https://issues.apache.org/jira/browse/SPARK-49065
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Sumeet Varma
>Assignee: Sumeet Varma
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.2, 4.0.0
>
>
> Currently, rebasing timestamps defaults to the JVM timezone, so it produces
> incorrect results when the explicitly overridden timezone in the
> TimestampFormatter library is not the same as the JVM timezone.
> To fix it, explicitly pass the overridden timezone parameter to
> {{rebaseJulianToGregorianMicros}} and {{rebaseGregorianToJulianMicros}}.
>  






(spark) branch branch-3.5 updated: [SPARK-49065][SQL] Rebasing in legacy formatters/parsers must support non JVM default time zones

2024-08-01 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new a1e7fb18d0c4 [SPARK-49065][SQL] Rebasing in legacy formatters/parsers 
must support non JVM default time zones
a1e7fb18d0c4 is described below

commit a1e7fb18d0c4c4d51e2a5f5b050f528a045960d8
Author: Sumeet Varma 
AuthorDate: Thu Aug 1 22:29:08 2024 +0800

[SPARK-49065][SQL] Rebasing in legacy formatters/parsers must support non 
JVM default time zones

### What changes were proposed in this pull request?

Explicitly pass the overridden timezone parameter to 
`rebaseJulianToGregorianMicros` and `rebaseGregorianToJulianMicros`.
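
A small usage sketch of the new overloads (this assumes the catalyst
`DateTimeUtils` object mixes in `SparkDateTimeUtils`, as it does in the Spark
source tree; the zone and instant are arbitrary examples):
```scala
import java.sql.Timestamp
import org.apache.spark.sql.catalyst.util.DateTimeUtils

val zone = "America/Los_Angeles"   // the formatter's zone, not necessarily the JVM default
val micros = -2208988800000000L    // an old instant where Julian/Gregorian rebasing can matter
// Rebase using the given zone instead of the JVM default zone:
val ts: Timestamp = DateTimeUtils.toJavaTimestamp(zone, micros)
val roundTrip: Long = DateTimeUtils.fromJavaTimestamp(zone, ts)
```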

### Why are the changes needed?

Currently, rebasing timestamps defaults to the JVM timezone, so it produces
incorrect results when the explicitly overridden timezone in the
TimestampFormatter library is not the same as the JVM timezone.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New UT to capture this scenario

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47541 from sumeet-db/rebase_time_zone.

Authored-by: Sumeet Varma 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 63c08632f2d6e559bc6a8396e646389e5402c757)
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/util/SparkDateTimeUtils.scala |  6 
 .../sql/catalyst/util/TimestampFormatter.scala | 12 
 .../catalyst/util/TimestampFormatterSuite.scala| 34 ++
 3 files changed, 46 insertions(+), 6 deletions(-)

diff --git 
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala
 
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala
index 698e7b37a9ef..980eee9390d0 100644
--- 
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala
+++ 
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala
@@ -237,6 +237,9 @@ trait SparkDateTimeUtils {
   def toJavaTimestamp(micros: Long): Timestamp =
 toJavaTimestampNoRebase(rebaseGregorianToJulianMicros(micros))
 
+  def toJavaTimestamp(timeZoneId: String, micros: Long): Timestamp =
+toJavaTimestampNoRebase(rebaseGregorianToJulianMicros(timeZoneId, micros))
+
   /**
* Converts microseconds since the epoch to an instance of 
`java.sql.Timestamp`.
*
@@ -273,6 +276,9 @@ trait SparkDateTimeUtils {
   def fromJavaTimestamp(t: Timestamp): Long =
 rebaseJulianToGregorianMicros(fromJavaTimestampNoRebase(t))
 
+  def fromJavaTimestamp(timeZoneId: String, t: Timestamp): Long =
+rebaseJulianToGregorianMicros(timeZoneId, fromJavaTimestampNoRebase(t))
+
   /**
* Converts an instance of `java.sql.Timestamp` to the number of 
microseconds since
* 1970-01-01T00:00:00.00Z.
diff --git 
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
 
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
index 0866cee9334c..07b32af5c85e 100644
--- 
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
+++ 
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
@@ -429,11 +429,11 @@ class LegacyFastTimestampFormatter(
 val micros = cal.getMicros()
 cal.set(Calendar.MILLISECOND, 0)
 val julianMicros = Math.addExact(millisToMicros(cal.getTimeInMillis), 
micros)
-rebaseJulianToGregorianMicros(julianMicros)
+rebaseJulianToGregorianMicros(TimeZone.getTimeZone(zoneId), julianMicros)
   }
 
   override def format(timestamp: Long): String = {
-val julianMicros = rebaseGregorianToJulianMicros(timestamp)
+val julianMicros = 
rebaseGregorianToJulianMicros(TimeZone.getTimeZone(zoneId), timestamp)
 cal.setTimeInMillis(Math.floorDiv(julianMicros, MICROS_PER_SECOND) * 
MILLIS_PER_SECOND)
 cal.setMicros(Math.floorMod(julianMicros, MICROS_PER_SECOND))
 fastDateFormat.format(cal)
@@ -443,7 +443,7 @@ class LegacyFastTimestampFormatter(
 if (ts.getNanos == 0) {
   fastDateFormat.format(ts)
 } else {
-  format(fromJavaTimestamp(ts))
+  format(fromJavaTimestamp(zoneId.getId, ts))
 }
   }
 
@@ -467,7 +467,7 @@ class LegacySimpleTimestampFormatter(
   }
 
   override def parse(s: String): Long = {
-fromJavaTimestamp(new Timestamp(sdf.parse(s).getTime))
+fromJavaTimestamp(zoneId.getId, new Timestamp(sdf.parse(s).getTime))
   }
 
   override def parseOptional(s: String): Option[Long] = {
@@ -475,12 +475,12 @@ class LegacySimpleTimestampFormatter(
 if (date == null) {
   None
 } else {
-  Some(fromJavaTimestamp(new Timestamp(date.getTime)))
+  Some(fromJavaTimestamp(zoneId.getId, new Timestamp(date.getTime

(spark) branch master updated: [SPARK-49065][SQL] Rebasing in legacy formatters/parsers must support non JVM default time zones

2024-08-01 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 63c08632f2d6 [SPARK-49065][SQL] Rebasing in legacy formatters/parsers 
must support non JVM default time zones
63c08632f2d6 is described below

commit 63c08632f2d6e559bc6a8396e646389e5402c757
Author: Sumeet Varma 
AuthorDate: Thu Aug 1 22:29:08 2024 +0800

[SPARK-49065][SQL] Rebasing in legacy formatters/parsers must support non 
JVM default time zones

### What changes were proposed in this pull request?

Explicitly pass the overridden timezone parameter to 
`rebaseJulianToGregorianMicros` and `rebaseGregorianToJulianMicros`.

### Why are the changes needed?

Currently, rebasing timestamps defaults to the JVM timezone, so it produces
incorrect results when the explicitly overridden timezone in the
TimestampFormatter library is not the same as the JVM timezone.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New UT to capture this scenario

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47541 from sumeet-db/rebase_time_zone.

Authored-by: Sumeet Varma 
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/util/SparkDateTimeUtils.scala |  6 
 .../sql/catalyst/util/TimestampFormatter.scala | 12 
 .../catalyst/util/TimestampFormatterSuite.scala| 34 ++
 3 files changed, 46 insertions(+), 6 deletions(-)

diff --git 
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala
 
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala
index 0447d813e26a..a6592ad51c65 100644
--- 
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala
+++ 
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala
@@ -251,6 +251,9 @@ trait SparkDateTimeUtils {
   def toJavaTimestamp(micros: Long): Timestamp =
 toJavaTimestampNoRebase(rebaseGregorianToJulianMicros(micros))
 
+  def toJavaTimestamp(timeZoneId: String, micros: Long): Timestamp =
+toJavaTimestampNoRebase(rebaseGregorianToJulianMicros(timeZoneId, micros))
+
   /**
* Converts microseconds since the epoch to an instance of 
`java.sql.Timestamp`.
*
@@ -287,6 +290,9 @@ trait SparkDateTimeUtils {
   def fromJavaTimestamp(t: Timestamp): Long =
 rebaseJulianToGregorianMicros(fromJavaTimestampNoRebase(t))
 
+  def fromJavaTimestamp(timeZoneId: String, t: Timestamp): Long =
+rebaseJulianToGregorianMicros(timeZoneId, fromJavaTimestampNoRebase(t))
+
   /**
* Converts an instance of `java.sql.Timestamp` to the number of 
microseconds since
* 1970-01-01T00:00:00.00Z.
diff --git 
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
 
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
index 9f57f8375c54..79d627b493fd 100644
--- 
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
+++ 
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
@@ -435,11 +435,11 @@ class LegacyFastTimestampFormatter(
 val micros = cal.getMicros()
 cal.set(Calendar.MILLISECOND, 0)
 val julianMicros = Math.addExact(millisToMicros(cal.getTimeInMillis), 
micros)
-rebaseJulianToGregorianMicros(julianMicros)
+rebaseJulianToGregorianMicros(TimeZone.getTimeZone(zoneId), julianMicros)
   }
 
   override def format(timestamp: Long): String = {
-val julianMicros = rebaseGregorianToJulianMicros(timestamp)
+val julianMicros = 
rebaseGregorianToJulianMicros(TimeZone.getTimeZone(zoneId), timestamp)
 cal.setTimeInMillis(Math.floorDiv(julianMicros, MICROS_PER_SECOND) * 
MILLIS_PER_SECOND)
 cal.setMicros(Math.floorMod(julianMicros, MICROS_PER_SECOND))
 fastDateFormat.format(cal)
@@ -449,7 +449,7 @@ class LegacyFastTimestampFormatter(
 if (ts.getNanos == 0) {
   fastDateFormat.format(ts)
 } else {
-  format(fromJavaTimestamp(ts))
+  format(fromJavaTimestamp(zoneId.getId, ts))
 }
   }
 
@@ -473,7 +473,7 @@ class LegacySimpleTimestampFormatter(
   }
 
   override def parse(s: String): Long = {
-fromJavaTimestamp(new Timestamp(sdf.parse(s).getTime))
+fromJavaTimestamp(zoneId.getId, new Timestamp(sdf.parse(s).getTime))
   }
 
   override def parseOptional(s: String): Option[Long] = {
@@ -481,12 +481,12 @@ class LegacySimpleTimestampFormatter(
 if (date == null) {
   None
 } else {
-  Some(fromJavaTimestamp(new Timestamp(date.getTime)))
+  Some(fromJavaTimestamp(zoneId.getId, new Timestamp(date.getTime)))
 }
   }
 
   override def format(us: Long): String = {
-sdf.format(toJavaTimestamp(us))
+sdf.format

[jira] [Assigned] (SPARK-49074) Fix cached Variant with column size greater than 128KB or individual variant larger than 2kb

2024-07-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-49074:
---

Assignee: Richard Chen

> Fix cached Variant with column size greater than 128KB or individual variant 
> larger than 2kb
> 
>
> Key: SPARK-49074
> URL: https://issues.apache.org/jira/browse/SPARK-49074
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Richard Chen
>Assignee: Richard Chen
>Priority: Major
>  Labels: pull-request-available
>
> Currently, the `actualSize` method of the `VARIANT` `columnType` isn't 
> overridden, so we use the default size of 2kb for the `actualSize`. We should 
> define `actualSize` so the cached variant column can correctly be written to 
> the byte buffer






(spark) branch master updated: [SPARK-49074][SQL] Fix variant with `df.cache()`

2024-07-31 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new bf3ad7e94afb [SPARK-49074][SQL] Fix variant with `df.cache()`
bf3ad7e94afb is described below

commit bf3ad7e94afb5416d995dc22344566899ee7c4b0
Author: Richard Chen 
AuthorDate: Thu Aug 1 08:48:19 2024 +0800

[SPARK-49074][SQL] Fix variant with `df.cache()`

### What changes were proposed in this pull request?

Currently, the `actualSize` method of the `VARIANT` `columnType` isn't 
overridden, so we use the default size of 2kb for the `actualSize`. We should 
define `actualSize` so the cached variant column can correctly be written to 
the byte buffer.

Currently, if the avg per-variant size is greater than 2KB and the total 
column size is greater than 128KB (the default initial buffer size), an 
exception will be (incorrectly) thrown.
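
A hedged reproduction sketch modeled on the new unit tests (an active `spark`
session is assumed; the key count is arbitrary, chosen so each variant exceeds
the 2KB estimate and ten rows exceed the 128KB initial buffer):
```scala
val numKeys = 4 * 1024
val entries = (0 until numKeys).map(i => "\"" + i + "\": \"test\"")
val jsonStr = entries.mkString("{", ", ", "}")
val df = spark.sql(s"select parse_json('$jsonStr') v from range(0, 10)")
df.cache()
df.count()  // materializing the cached variant column used to throw before this fix
```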

### Why are the changes needed?

To fix caching of larger variants (in `df.cache()`), such as the ones included
in the UTs.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

added UT

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #47559 from richardc-db/fix_variant_cache.

Authored-by: Richard Chen 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/execution/columnar/ColumnType.scala  |  6 +++
 .../scala/org/apache/spark/sql/VariantSuite.scala  | 53 ++
 2 files changed, 59 insertions(+)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnType.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnType.scala
index 5cc3a3d83d4c..60695a6c5d49 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnType.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnType.scala
@@ -829,6 +829,12 @@ private[columnar] object VARIANT
   /** Chosen to match the default size set in `VariantType`. */
   override def defaultSize: Int = 2048
 
+  override def actualSize(row: InternalRow, ordinal: Int): Int = {
+val v = getField(row, ordinal)
+// 4 bytes each for the integers representing the 'value' and 'metadata' 
lengths.
+8 + v.getValue().length + v.getMetadata().length
+  }
+
   override def getField(row: InternalRow, ordinal: Int): VariantVal = 
row.getVariant(ordinal)
 
   override def setField(row: InternalRow, ordinal: Int, value: VariantVal): 
Unit =
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/VariantSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/VariantSuite.scala
index ce2643f9e239..0c8b0b501951 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/VariantSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/VariantSuite.scala
@@ -652,6 +652,21 @@ class VariantSuite extends QueryTest with 
SharedSparkSession with ExpressionEval
 checkAnswer(df, expected.collect())
   }
 
+  test("variant with many keys in a cached row-based df") {
+// The initial size of the buffer backing a cached dataframe column is 
128KB.
+// See `ColumnBuilder`.
+val numKeys = 128 * 1024
+var keyIterator = (0 until numKeys).iterator
+val entries = Array.fill(numKeys)(s"""\"${keyIterator.next()}\": 
\"test\"""")
+val jsonStr = s"{${entries.mkString(", ")}}"
+val query = s"""select parse_json('${jsonStr}') v from range(0, 10)"""
+val df = spark.sql(query)
+df.cache()
+
+val expected = spark.sql(query)
+checkAnswer(df, expected.collect())
+  }
+
   test("struct of variant in a cached row-based df") {
 val query = """select named_struct(
   'v', parse_json(format_string('{\"a\": %s}', id)),
@@ -680,6 +695,21 @@ class VariantSuite extends QueryTest with 
SharedSparkSession with ExpressionEval
 checkAnswer(df, expected.collect())
   }
 
+  test("array variant with many keys in a cached row-based df") {
+// The initial size of the buffer backing a cached dataframe column is 
128KB.
+// See `ColumnBuilder`.
+val numKeys = 128 * 1024
+var keyIterator = (0 until numKeys).iterator
+val entries = Array.fill(numKeys)(s"""\"${keyIterator.next()}\": 
\"test\"""")
+val jsonStr = s"{${entries.mkString(", ")}}"
+val query = s"""select array(parse_json('${jsonStr}')) v from range(0, 
10)"""
+val df = spark.sql(query)
+df.cache()
+
+val expected = spark.sql(query)
+checkAnswer(df, expected.collect())
+  }
+

[jira] [Resolved] (SPARK-49074) Fix cached Variant with column size greater than 128KB or individual variant larger than 2kb

2024-07-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49074.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47559
[https://github.com/apache/spark/pull/47559]

> Fix cached Variant with column size greater than 128KB or individual variant 
> larger than 2kb
> 
>
> Key: SPARK-49074
> URL: https://issues.apache.org/jira/browse/SPARK-49074
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Richard Chen
>Assignee: Richard Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently, the `actualSize` method of the `VARIANT` `columnType` isn't 
> overridden, so we use the default size of 2kb for the `actualSize`. We should 
> define `actualSize` so the cached variant column can correctly be written to 
> the byte buffer






[jira] [Resolved] (SPARK-49000) Aggregation with DISTINCT gives wrong results when dealing with literals

2024-07-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49000.
-
Fix Version/s: 3.5.2
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 47525
[https://github.com/apache/spark/pull/47525]

> Aggregation with DISTINCT gives wrong results when dealing with literals
> 
>
> Key: SPARK-49000
> URL: https://issues.apache.org/jira/browse/SPARK-49000
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.5.2, 4.0.0
>
>
> Aggregation with *DISTINCT* gives wrong results when dealing with literals. 
> It appears that this bug affects all (or most) released versions of Spark.
>  
> For example:
> {code:java}
> select count(distinct 1) from t{code}
> returns 1, while the correct result should be 0.
>  
> For reference:
> {code:java}
> select count(1) from t{code}
> returns 0, which is the correct and expected result.
>  
> In these examples, suppose that *t* is an empty table (with any columns).






[jira] [Assigned] (SPARK-49000) Aggregation with DISTINCT gives wrong results when dealing with literals

2024-07-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-49000:
---

Assignee: Uroš Bojanić

> Aggregation with DISTINCT gives wrong results when dealing with literals
> 
>
> Key: SPARK-49000
> URL: https://issues.apache.org/jira/browse/SPARK-49000
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Critical
>  Labels: pull-request-available
>
> Aggregation with *DISTINCT* gives wrong results when dealing with literals. 
> It appears that this bug affects all (or most) released versions of Spark.
>  
> For example:
> {code:java}
> select count(distinct 1) from t{code}
> returns 1, while the correct result should be 0.
>  
> For reference:
> {code:java}
> select count(1) from t{code}
> returns 0, which is the correct and expected result.
>  
> In these examples, suppose that *t* is an empty table (with any columns).






(spark) branch branch-3.5 updated: [SPARK-49000][SQL] Fix "select count(distinct 1) from t" where t is empty table by expanding RewriteDistinctAggregates

2024-07-31 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new c6df89079886 [SPARK-49000][SQL] Fix "select count(distinct 1) from t" 
where t is empty table by expanding RewriteDistinctAggregates
c6df89079886 is described below

commit c6df890798862d0863afbfff8fca0ee4df70354f
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Wed Jul 31 22:37:42 2024 +0800

[SPARK-49000][SQL] Fix "select count(distinct 1) from t" where t is empty 
table by expanding RewriteDistinctAggregates

Fix `RewriteDistinctAggregates` rule to deal properly with aggregation on 
DISTINCT literals. Physical plan for `select count(distinct 1) from t`:
```
-- count(distinct 1)
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(distinct 1)], output=[count(DISTINCT 1)#2L])
   +- HashAggregate(keys=[], functions=[partial_count(distinct 1)], output=[count#6L])
      +- HashAggregate(keys=[], functions=[], output=[])
         +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=20]
            +- HashAggregate(keys=[], functions=[], output=[])
               +- FileScan parquet spark_catalog.default.t[] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/nikola.mandic/oss-spark/spark-warehouse/org.apache.spark.s..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
```
The problem occurs when the `HashAggregate(keys=[], functions=[], output=[])`
node yields one row to the `partial_count` node, which then counts that row.
This four-node structure is constructed by `AggUtils.planAggregateWithOneDistinct`.

To fix the problem, we add an `Expand` node which forces non-empty grouping
expressions in the `HashAggregateExec` nodes. This in turn enables streaming
zero rows to the parent `partial_count` node, yielding the correct final
result.

Aggregation with a DISTINCT literal gives wrong results. For example, when
running on an empty table `t`:
`select count(distinct 1) from t` returns 1, while the correct result 
should be 0.
For reference:
`select count(1) from t` returns 0, which is the correct and expected 
result.

Yes, this fixes a critical bug in Spark.

New e2e SQL tests for aggregates with DISTINCT literals.

No.

Closes #47525 from nikolamand-db/SPARK-49000-spark-expand-approach.

Lead-authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Co-authored-by: Nikola Mandic 
Signed-off-by: Wenchen Fan 
(cherry picked from commit dfa21332f20fff4aa6052ffa556d206497c066cf)
Signed-off-by: Wenchen Fan 
---
 .../optimizer/RewriteDistinctAggregates.scala  |  13 ++-
 .../apache/spark/sql/DataFrameAggregateSuite.scala | 114 +
 2 files changed, 125 insertions(+), 2 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
index da3cf782f668..e91493188873 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
@@ -197,6 +197,15 @@ import org.apache.spark.util.collection.Utils
  * techniques.
  */
 object RewriteDistinctAggregates extends Rule[LogicalPlan] {
+  private def mustRewrite(
+  aggregateExpressions: Seq[AggregateExpression],
+  groupingExpressions: Seq[Expression]): Boolean = {
+// If there are any AggregateExpressions with filter, we need to rewrite 
the query.
+// Also, if there are no grouping expressions and all aggregate 
expressions are foldable,
+// we need to rewrite the query, e.g. SELECT COUNT(DISTINCT 1).
+aggregateExpressions.exists(_.filter.isDefined) || 
(groupingExpressions.isEmpty &&
+  
aggregateExpressions.exists(_.aggregateFunction.children.forall(_.foldable)))
+  }
 
   private def mayNeedtoRewrite(a: Aggregate): Boolean = {
 val aggExpressions = collectAggregateExprs(a)
@@ -205,7 +214,7 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
 // clause for this rule because aggregation strategy can handle a single 
distinct aggregate
 // group without filter clause.
 // This check can produce false-positives, e.g., SUM(DISTINCT a) & 
COUNT(DISTINCT a).
-distinctAggs.size > 1 || distinctAggs.exists(_.filter.isDefined)
+distinctAggs.size > 1 || mustRewrite(distinctAggs, a.groupingExpressions)
   }
 
   def apply(plan: LogicalPlan): LogicalP

(spark) branch master updated: [SPARK-49000][SQL] Fix "select count(distinct 1) from t" where t is empty table by expanding RewriteDistinctAggregates

2024-07-31 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new dfa21332f20f [SPARK-49000][SQL] Fix "select count(distinct 1) from t" 
where t is empty table by expanding RewriteDistinctAggregates
dfa21332f20f is described below

commit dfa21332f20fff4aa6052ffa556d206497c066cf
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Wed Jul 31 22:37:42 2024 +0800

[SPARK-49000][SQL] Fix "select count(distinct 1) from t" where t is empty 
table by expanding RewriteDistinctAggregates

### What changes were proposed in this pull request?
Fix `RewriteDistinctAggregates` rule to deal properly with aggregation on 
DISTINCT literals. Physical plan for `select count(distinct 1) from t`:
```
-- count(distinct 1)
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(distinct 1)], output=[count(DISTINCT 1)#2L])
   +- HashAggregate(keys=[], functions=[partial_count(distinct 1)], output=[count#6L])
      +- HashAggregate(keys=[], functions=[], output=[])
         +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=20]
            +- HashAggregate(keys=[], functions=[], output=[])
               +- FileScan parquet spark_catalog.default.t[] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/nikola.mandic/oss-spark/spark-warehouse/org.apache.spark.s..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
```
The problem occurs when the `HashAggregate(keys=[], functions=[], output=[])`
node yields one row to the `partial_count` node, which then counts that row.
This four-node structure is constructed by `AggUtils.planAggregateWithOneDistinct`.

To fix the problem, we add an `Expand` node which forces non-empty grouping
expressions in the `HashAggregateExec` nodes. This in turn enables streaming
zero rows to the parent `partial_count` node, yielding the correct final
result.

### Why are the changes needed?
Aggregation with a DISTINCT literal gives wrong results. For example, when
running on an empty table `t`:
`select count(distinct 1) from t` returns 1, while the correct result 
should be 0.
For reference:
`select count(1) from t` returns 0, which is the correct and expected 
result.

### Does this PR introduce _any_ user-facing change?
Yes, this fixes a critical bug in Spark.

### How was this patch tested?
New e2e SQL tests for aggregates with DISTINCT literals.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47525 from nikolamand-db/SPARK-49000-spark-expand-approach.

Lead-authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Co-authored-by: Nikola Mandic 
Signed-off-by: Wenchen Fan 
---
 .../optimizer/RewriteDistinctAggregates.scala  |  13 ++-
 .../apache/spark/sql/DataFrameAggregateSuite.scala | 114 +
 2 files changed, 125 insertions(+), 2 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
index da3cf782f668..e91493188873 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
@@ -197,6 +197,15 @@ import org.apache.spark.util.collection.Utils
  * techniques.
  */
 object RewriteDistinctAggregates extends Rule[LogicalPlan] {
+  private def mustRewrite(
+  aggregateExpressions: Seq[AggregateExpression],
+  groupingExpressions: Seq[Expression]): Boolean = {
+// If there are any AggregateExpressions with filter, we need to rewrite 
the query.
+// Also, if there are no grouping expressions and all aggregate 
expressions are foldable,
+// we need to rewrite the query, e.g. SELECT COUNT(DISTINCT 1).
+aggregateExpressions.exists(_.filter.isDefined) || 
(groupingExpressions.isEmpty &&
+  
aggregateExpressions.exists(_.aggregateFunction.children.forall(_.foldable)))
+  }
 
   private def mayNeedtoRewrite(a: Aggregate): Boolean = {
 val aggExpressions = collectAggregateExprs(a)
@@ -205,7 +214,7 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
 // clause for this rule because aggregation strategy can handle a single 
distinct aggregate
 // group without filter clause.
 // This check can produce false-positives, e.g., SUM(DISTINCT a) & 
COUNT(DISTINCT a).
-distinctAggs.size > 1 || distinctAggs.exists(_.filter.isDefin

[jira] [Assigned] (SPARK-48977) Optimize collation support for string search (UTF8_LCASE collation)

2024-07-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48977:
---

Assignee: Uroš Bojanić

> Optimize collation support for string search (UTF8_LCASE collation)
> ---
>
> Key: SPARK-48977
> URL: https://issues.apache.org/jira/browse/SPARK-48977
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-48977) Optimize collation support for string search (UTF8_LCASE collation)

2024-07-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48977.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47444
[https://github.com/apache/spark/pull/47444]

> Optimize collation support for string search (UTF8_LCASE collation)
> ---
>
> Key: SPARK-48977
> URL: https://issues.apache.org/jira/browse/SPARK-48977
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







(spark) branch master updated: [SPARK-48977][SQL] Optimize string searching under UTF8_LCASE collation

2024-07-31 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0e07873d368f [SPARK-48977][SQL] Optimize string searching under 
UTF8_LCASE collation
0e07873d368f is described below

commit 0e07873d368fa17dfff11e28fd45531f1f388864
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Wed Jul 31 17:16:17 2024 +0800

[SPARK-48977][SQL] Optimize string searching under UTF8_LCASE collation

### What changes were proposed in this pull request?
Modify string search under UTF8_LCASE collation to use the UTF8String code
point iterator, reducing the algorithmic complexity by one order.

### Why are the changes needed?
Optimize the implementation of the `contains`, `startsWith`, `endsWith`, and
`locate` expressions.
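
For context, a hedged example of these expressions under UTF8_LCASE (this
assumes Spark 4.0's `COLLATE` expression syntax; the commented results are what
case-insensitive matching should return):
```scala
spark.sql("SELECT contains('Apache Spark' COLLATE UTF8_LCASE, 'SPARK')").show()     // true
spark.sql("SELECT startswith('Apache Spark' COLLATE UTF8_LCASE, 'APACHE')").show()  // true
spark.sql("SELECT locate('spark', 'Apache SPARK' COLLATE UTF8_LCASE)").show()       // 8
```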

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47444 from uros-db/optimize-search.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../catalyst/util/CollationAwareUTF8String.java| 83 +++---
 1 file changed, 73 insertions(+), 10 deletions(-)

diff --git 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
index 430d1fb89832..5b005f152c51 100644
--- 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
+++ 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
@@ -85,13 +85,44 @@ public class CollationAwareUTF8String {
   final UTF8String lowercasePattern,
   int startPos) {
 assert startPos >= 0;
-for (int len = 0; len <= target.numChars() - startPos; ++len) {
-  if (lowerCaseCodePoints(target.substring(startPos, startPos + len))
-  .equals(lowercasePattern)) {
-return len;
+// Use code point iterators for efficient string search.
+Iterator targetIterator = target.codePointIterator();
+Iterator patternIterator = lowercasePattern.codePointIterator();
+// Skip to startPos in the target string.
+for (int i = 0; i < startPos; ++i) {
+  if (targetIterator.hasNext()) {
+targetIterator.next();
+  } else {
+return MATCH_NOT_FOUND;
   }
 }
-return MATCH_NOT_FOUND;
+// Compare the characters in the target and pattern strings.
+int matchLength = 0, codePointBuffer = -1, targetCodePoint, 
patternCodePoint;
+while (targetIterator.hasNext() && patternIterator.hasNext()) {
+  if (codePointBuffer != -1) {
+targetCodePoint = codePointBuffer;
+codePointBuffer = -1;
+  } else {
+// Use buffered lowercase code point iteration to handle one-to-many 
case mappings.
+targetCodePoint = getLowercaseCodePoint(targetIterator.next());
+if (targetCodePoint == CODE_POINT_COMBINED_LOWERCASE_I_DOT) {
+  targetCodePoint = CODE_POINT_LOWERCASE_I;
+  codePointBuffer = CODE_POINT_COMBINING_DOT;
+}
+++matchLength;
+  }
+  patternCodePoint = patternIterator.next();
+  if (targetCodePoint != patternCodePoint) {
+return MATCH_NOT_FOUND;
+  }
+}
+// If the pattern string has more characters, or the match is found at the 
middle of a
+// character that maps to multiple characters in lowercase, then match is 
not found.
+if (patternIterator.hasNext() || codePointBuffer != -1) {
+  return MATCH_NOT_FOUND;
+}
+// If all characters are equal, return the length of the match in the 
target string.
+return matchLength;
   }
 
   /**
@@ -155,13 +186,45 @@ public class CollationAwareUTF8String {
   final UTF8String target,
   final UTF8String lowercasePattern,
   int endPos) {
-assert endPos <= target.numChars();
-for (int len = 0; len <= endPos; ++len) {
-  if (lowerCaseCodePoints(target.substring(endPos - len, 
endPos)).equals(lowercasePattern)) {
-return len;
+assert endPos >= 0;
+// Use code point iterators for efficient string search.
+Iterator targetIterator = target.reverseCodePointIterator();
+Iterator patternIterator = 
lowercasePattern.reverseCodePointIterator();
+// Skip to startPos in the target string.
+for (int i = endPos; i < target.numChars(); ++i) {
+  if (targetIterator.hasNext()) {
+targetIterator.next();
+  } else {
+return MATCH_NOT_FOUND;
   }
 }
-return MATCH_NOT_FOUND;
+// Compare the characters in the target and pattern strings.
+

[jira] [Assigned] (SPARK-48725) Use lowerCaseCodePoints in string functions for UTF8_LCASE

2024-07-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48725:
---

Assignee: Uroš Bojanić

> Use lowerCaseCodePoints in string functions for UTF8_LCASE
> --
>
> Key: SPARK-48725
> URL: https://issues.apache.org/jira/browse/SPARK-48725
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-48725) Use lowerCaseCodePoints in string functions for UTF8_LCASE

2024-07-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48725.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47132
[https://github.com/apache/spark/pull/47132]

> Use lowerCaseCodePoints in string functions for UTF8_LCASE
> --
>
> Key: SPARK-48725
> URL: https://issues.apache.org/jira/browse/SPARK-48725
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







(spark) branch master updated: [SPARK-48725][SQL] Integrate CollationAwareUTF8String.lowerCaseCodePoints into string expressions

2024-07-30 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 98c365f6e652 [SPARK-48725][SQL] Integrate 
CollationAwareUTF8String.lowerCaseCodePoints into string expressions
98c365f6e652 is described below

commit 98c365f6e652f740e51c3f185ebd252918ae824a
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Wed Jul 31 13:13:49 2024 +0800

[SPARK-48725][SQL] Integrate CollationAwareUTF8String.lowerCaseCodePoints 
into string expressions

### What changes were proposed in this pull request?
Use `CollationAwareUTF8String.lowerCaseCodePoints` logic to properly 
lowercase strings according to UTF8_LCASE collation, instead of relying on 
`UTF8String.toLowerCase()` method calls.

### Why are the changes needed?
Avoid correctness issues with respect to code-point logic in UTF8_LCASE
(arising when Java performs string lowercasing) and ensure consistent
results.
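
A plain-JVM illustration of the discrepancy this avoids (not Spark API, just
Java's lowercasing behavior):
```scala
// 'İ' (U+0130) lowercases to "i" plus a combining dot above (two code points) in the
// root locale, the kind of one-to-many mapping that code-point-based lowercasing
// in CollationAwareUTF8String handles explicitly.
val capitalIWithDot = "\u0130"
println(capitalIWithDot.toLowerCase(java.util.Locale.ROOT).length)  // 2
```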

### Does this PR introduce _any_ user-facing change?
Yes, collation aware string function implementations will now rely on 
`CollationAwareUTF8String` string lowercasing for UTF8_LCASE collation, instead 
of `UTF8String` logic (which resorts to Java's implementation).

### How was this patch tested?
Existing tests, with some new cases in `CollationSupportSuite`.
6 expressions are affected by this change: `contains`, `instr`, 
`find_in_set`, `replace`, `locate`, `substr_index`.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47132 from uros-db/lcase-cp.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
    Signed-off-by: Wenchen Fan 
---
 .../catalyst/util/CollationAwareUTF8String.java|  27 ++-
 .../spark/sql/catalyst/util/CollationFactory.java  |  35 ++--
 .../spark/unsafe/types/CollationSupportSuite.java  | 224 +
 3 files changed, 256 insertions(+), 30 deletions(-)

diff --git 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
index b9868ca665a6..430d1fb89832 100644
--- 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
+++ 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
@@ -51,9 +51,8 @@ public class CollationAwareUTF8String {
   /**
* Returns whether the target string starts with the specified prefix, 
starting from the
* specified position (0-based index referring to character position in 
UTF8String), with respect
-   * to the UTF8_LCASE collation. The method assumes that the prefix is 
already lowercased
-   * prior to method call to avoid the overhead of calling .toLowerCase() 
multiple times on the
-   * same prefix string.
+   * to the UTF8_LCASE collation. The method assumes that the prefix is 
already lowercased prior
+   * to method call to avoid the overhead of lowercasing the same prefix 
string multiple times.
*
* @param target the string to be searched in
* @param lowercasePattern the string to be searched for
@@ -87,7 +86,8 @@ public class CollationAwareUTF8String {
   int startPos) {
 assert startPos >= 0;
 for (int len = 0; len <= target.numChars() - startPos; ++len) {
-  if (target.substring(startPos, startPos + 
len).toLowerCase().equals(lowercasePattern)) {
+  if (lowerCaseCodePoints(target.substring(startPos, startPos + len))
+  .equals(lowercasePattern)) {
 return len;
   }
 }
@@ -123,8 +123,7 @@ public class CollationAwareUTF8String {
* Returns whether the target string ends with the specified suffix, ending 
at the specified
* position (0-based index referring to character position in UTF8String), 
with respect to the
* UTF8_LCASE collation. The method assumes that the suffix is already 
lowercased prior
-   * to method call to avoid the overhead of calling .toLowerCase() multiple 
times on the same
-   * suffix string.
+   * to method call to avoid the overhead of lowercasing the same suffix 
string multiple times.
*
* @param target the string to be searched in
* @param lowercasePattern the string to be searched for
@@ -158,7 +157,7 @@ public class CollationAwareUTF8String {
   int endPos) {
 assert endPos <= target.numChars();
 for (int len = 0; len <= endPos; ++len) {
-  if (target.substring(endPos - len, 
endPos).toLowerCase().equals(lowercasePattern)) {
+  if (lowerCaseCodePoints(target.substring(endPos - len, 
endPos)).equals(lowercasePattern)) {
 return len;
   }
 }
@@ -191,10 +190,9 @@ public class CollationAwareUTF8String {
   }
 
   /**
-   * Lowercase U

[jira] [Resolved] (SPARK-49003) Fix calculating hash value of collated strings

2024-07-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-49003.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47502
[https://github.com/apache/spark/pull/47502]

> Fix calculating hash value of collated strings
> --
>
> Key: SPARK-49003
> URL: https://issues.apache.org/jira/browse/SPARK-49003
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.3
>Reporter: Marko Ilic
>Assignee: Marko Ilic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-49003) Fix calculating hash value of collated strings

2024-07-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-49003:
---

Assignee: Marko Ilic

> Fix calculating hash value of collated strings
> --
>
> Key: SPARK-49003
> URL: https://issues.apache.org/jira/browse/SPARK-49003
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.3
>Reporter: Marko Ilic
>Assignee: Marko Ilic
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-49003][SQL] Fix interpreted code path hashing to be collation aware

2024-07-30 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 5929412e65cd [SPARK-49003][SQL] Fix interpreted code path hashing to 
be collation aware
5929412e65cd is described below

commit 5929412e65cd82e72f74b21772d8c7ed04217399
Author: Marko 
AuthorDate: Wed Jul 31 12:26:35 2024 +0800

[SPARK-49003][SQL] Fix interpreted code path hashing to be collation aware

### What changes were proposed in this pull request?
Changed the hash function to be collation aware. This change covers only the 
interpreted code path; codegen hashing was already collation aware.

### Why are the changes needed?
We were getting the wrong hash for collated strings because our hash 
function only used the binary representation of the string. It didn't take 
collation into account.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added tests to `HashExpressionsSuite.scala`

### Was this patch authored or co-authored using generative AI tooling?
No
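
For reference, a minimal editor-added sketch (not part of the commit) of the
invariant the interpreted path now delegates to; `collationNameToId` is assumed
to resolve the collation id the same way the test suite does:

```scala
import org.apache.spark.sql.catalyst.util.CollationFactory
import org.apache.spark.unsafe.types.UTF8String

object CollatedHashSketch extends App {
  // Fetch the collation-aware hash function, as the fixed interpreted path does.
  val collationId = CollationFactory.collationNameToId("UTF8_LCASE")
  val hashFn = CollationFactory.fetchCollation(collationId).hashFunction

  val a = UTF8String.fromString("Spark")
  val b = UTF8String.fromString("SPARK")
  // The binary representations differ, but the strings are equal under
  // UTF8_LCASE, so their collation-aware hashes must agree.
  assert(hashFn.applyAsLong(a) == hashFn.applyAsLong(b))
}
```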

Closes #47502 from ilicmarkodb/ilicmarkodb/fix_string_hash.

Lead-authored-by: Marko 
Co-authored-by: Marko Ilic 
Co-authored-by: Marko Ilić 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/expressions/hash.scala  | 12 --
 .../expressions/HashExpressionsSuite.scala | 26 +-
 .../spark/sql/CollationExpressionWalkerSuite.scala |  8 ---
 .../spark/sql/CollationSQLExpressionsSuite.scala   | 12 +-
 4 files changed, 46 insertions(+), 12 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala
index fa342f641509..3a667f370428 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala
@@ -33,7 +33,7 @@ import org.apache.spark.sql.catalyst.expressions.Cast._
 import org.apache.spark.sql.catalyst.expressions.codegen._
 import org.apache.spark.sql.catalyst.expressions.codegen.Block._
 import org.apache.spark.sql.catalyst.types.DataTypeUtils
-import org.apache.spark.sql.catalyst.util.{ArrayData, MapData}
+import org.apache.spark.sql.catalyst.util.{ArrayData, CollationFactory, 
MapData}
 import org.apache.spark.sql.catalyst.util.DateTimeConstants._
 import org.apache.spark.sql.errors.QueryCompilationErrors
 import org.apache.spark.sql.internal.SQLConf
@@ -565,7 +565,15 @@ abstract class InterpretedHashFunction {
   case a: Array[Byte] =>
 hashUnsafeBytes(a, Platform.BYTE_ARRAY_OFFSET, a.length, seed)
   case s: UTF8String =>
-hashUnsafeBytes(s.getBaseObject, s.getBaseOffset, s.numBytes(), seed)
+val st = dataType.asInstanceOf[StringType]
+if (st.supportsBinaryEquality) {
+  hashUnsafeBytes(s.getBaseObject, s.getBaseOffset, s.numBytes(), seed)
+} else {
+  val stringHash = CollationFactory
+.fetchCollation(st.collationId)
+.hashFunction.applyAsLong(s)
+  hashLong(stringHash, seed)
+}
 
   case array: ArrayData =>
 val elementType = dataType match {
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala
index 474c27c0de9a..6f3890cafd2a 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala
@@ -31,7 +31,7 @@ import org.apache.spark.sql.{RandomDataGenerator, Row}
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.encoders.{ExamplePointUDT, 
ExpressionEncoder}
 import 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection
-import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, DateTimeUtils, 
GenericArrayData, IntervalUtils}
+import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, 
CollationFactory, DateTimeUtils, GenericArrayData, IntervalUtils}
 import org.apache.spark.sql.types.{ArrayType, StructType, _}
 import org.apache.spark.unsafe.types.UTF8String
 import org.apache.spark.util.ArrayImplicits._
@@ -620,6 +620,30 @@ class HashExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 checkHiveHashForDecimal("123456.123456789012345678901234567890", 38, 31, 
1728235666)
   }
 
+  for (collation <- Seq("UTF8_LCASE", "UNICODE_CI", "UTF8_BINARY")) {
+test(s"hash check for collated $collation string

[jira] [Created] (SPARK-49057) Do not block the AQE loop when submitting query stages

2024-07-30 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-49057:
---

 Summary: Do not block the AQE loop when submitting query stages
 Key: SPARK-49057
 URL: https://issues.apache.org/jira/browse/SPARK-49057
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48989) WholeStageCodeGen error resulting in NumberFormatException when calling SUBSTRING_INDEX

2024-07-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48989.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47481
[https://github.com/apache/spark/pull/47481]

> WholeStageCodeGen error resulting in NumberFormatException when calling 
> SUBSTRING_INDEX
> ---
>
> Key: SPARK-48989
> URL: https://issues.apache.org/jira/browse/SPARK-48989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
> Environment: This was tested from the {{spark-shell}}, in local mode. 
>  All Spark versions were run with default settings.
> Spark 4.0 SNAPSHOT:  Exception.
> Spark 4.0 Preview:  Exception.
> Spark 3.5.1:  Success.
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> One seems to run into a {{NumberFormatException}}, possibly from an error in 
> WholeStageCodeGen, when I exercise {{SUBSTRING_INDEX}} with a null row, thus:
> {code:scala}
> // Create integer table with one null.
> sql( " SELECT num FROM VALUES (1), (2), (3), (NULL) AS (num) 
> ").repartition(1).write.mode("overwrite").parquet("/tmp/mytable")
> // Exercise substring-index.
> sql( " SELECT num, SUBSTRING_INDEX('a_a_a', '_', num) AS subs FROM 
> PARQUET.`/tmp/mytable` ").show()
> {code}
> On Spark 4.0 (HEAD, as of today, and with the preview-1), I see the following 
> exception:
> {code}
> java.lang.NumberFormatException: For input string: "columnartorow_value_0"
>   at 
> java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)
>   at java.base/java.lang.Integer.parseInt(Integer.java:668)
>   at 
> org.apache.spark.sql.catalyst.expressions.SubstringIndex.$anonfun$doGenCode$29(stringExpressions.scala:1660)
>   at 
> org.apache.spark.sql.catalyst.expressions.TernaryExpression.$anonfun$defineCodeGen$3(Expression.scala:869)
>   at 
> org.apache.spark.sql.catalyst.expressions.TernaryExpression.nullSafeCodeGen(Expression.scala:888)
>   at 
> org.apache.spark.sql.catalyst.expressions.TernaryExpression.defineCodeGen(Expression.scala:868)
>   at 
> org.apache.spark.sql.catalyst.expressions.SubstringIndex.doGenCode(stringExpressions.scala:1659)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:207)
>   at scala.Option.getOrElse(Option.scala:201)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:202)
>   at 
> org.apache.spark.sql.catalyst.expressions.ToPrettyString.doGenCode(ToPrettyString.scala:62)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:207)
>   at scala.Option.getOrElse(Option.scala:201)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:202)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.genCode(namedExpressions.scala:162)
>   at 
> org.apache.spark.sql.execution.ProjectExec.$anonfun$doConsume$2(basicPhysicalOperators.scala:74)
>   at scala.collection.immutable.List.map(List.scala:247)
>   at scala.collection.immutable.List.map(List.scala:79)
>   at 
> org.apache.spark.sql.execution.ProjectExec.$anonfun$doConsume$1(basicPhysicalOperators.scala:74)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.withSubExprEliminationExprs(CodeGenerator.scala:1085)
>   at 
> org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:74)
>   at 
> org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:200)
>   at 
> org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:153)
>   at 
> org.apache.spark.sql.execution.ColumnarToRowExec.consume(Columnar.scala:68)
>   at 
> org.apache.spark.sql.execution.ColumnarToRowExec.doProduce(Columnar.scala:193)
>   at 
> org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:99)
> {code}
> The same query seems to run alright on Spark 3.5.x:
> {code}
> +----+-----+
> | num| subs|
> +----+-----+
> |   1|    a|
> |   2|  a_a|
> |   3|a_a_a|
> |NULL| NULL|
> +----+-----+
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-48989][SQL] Fix error result of throwing an exception of `SUBSTRING_INDEX` built-in function with Codegen

2024-07-29 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f9d63158bb98 [SPARK-48989][SQL] Fix error result of throwing an 
exception of `SUBSTRING_INDEX` built-in function with Codegen
f9d63158bb98 is described below

commit f9d63158bb98dd35fea7dd3df34baf1624ea4497
Author: Wei Guo 
AuthorDate: Tue Jul 30 12:39:27 2024 +0800

[SPARK-48989][SQL] Fix error result of throwing an exception of 
`SUBSTRING_INDEX` built-in function with Codegen

### What changes were proposed in this pull request?

This PR fixes the `SUBSTRING_INDEX` built-in function throwing an exception 
under codegen, a regression from the earlier work that added collated-string 
support for it.

### Why are the changes needed?

Fix a bug.  Currently, this function cannot be used because it throws an 
exception.

![image](https://github.com/user-attachments/assets/32aaddb8-b261-47a2-8eeb-d66ed79a7db2)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA and add related test cases.

### Was this patch authored or co-authored using generative AI tooling?

No.
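
As an editor-added end-to-end check (adapted from the repro in SPARK-48989; the
local master setting and parquet path are illustrative assumptions):

```scala
import org.apache.spark.sql.SparkSession

object SubstringIndexRepro extends App {
  val spark = SparkSession.builder()
    .appName("substring-index-repro").master("local[1]").getOrCreate()

  // Integer table with one NULL row, written to parquet.
  spark.sql("SELECT num FROM VALUES (1), (2), (3), (NULL) AS (num)")
    .repartition(1).write.mode("overwrite").parquet("/tmp/substring_index_repro")

  // Before this fix, codegen failed up front with
  // NumberFormatException: For input string: "columnartorow_value_0".
  spark.sql("SELECT num, SUBSTRING_INDEX('a_a_a', '_', num) AS subs " +
    "FROM parquet.`/tmp/substring_index_repro`").show()

  spark.stop()
}
```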

Closes #47481 from wayneguow/SPARK-48989.

Authored-by: Wei Guo 
Signed-off-by: Wenchen Fan 
---
 .../java/org/apache/spark/sql/catalyst/util/CollationSupport.java | 8 
 .../apache/spark/sql/catalyst/expressions/stringExpressions.scala | 2 +-
 .../spark/sql/catalyst/expressions/StringExpressionsSuite.scala   | 2 ++
 .../test/scala/org/apache/spark/sql/StringFunctionsSuite.scala| 8 
 4 files changed, 15 insertions(+), 5 deletions(-)

diff --git 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
index 453423ddbc33..672708bec82b 100644
--- 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
+++ 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
@@ -469,15 +469,15 @@ public final class CollationSupport {
   }
 }
 public static String genCode(final String string, final String delimiter,
-final int count, final int collationId) {
+final String count, final int collationId) {
   CollationFactory.Collation collation = 
CollationFactory.fetchCollation(collationId);
   String expr = "CollationSupport.SubstringIndex.exec";
   if (collation.supportsBinaryEquality) {
-return String.format(expr + "Binary(%s, %s, %d)", string, delimiter, 
count);
+return String.format(expr + "Binary(%s, %s, %s)", string, delimiter, 
count);
   } else if (collation.supportsLowercaseEquality) {
-return String.format(expr + "Lowercase(%s, %s, %d)", string, 
delimiter, count);
+return String.format(expr + "Lowercase(%s, %s, %s)", string, 
delimiter, count);
   } else {
-return String.format(expr + "ICU(%s, %s, %d, %d)", string, delimiter, 
count, collationId);
+return String.format(expr + "ICU(%s, %s, %d, %s)", string, delimiter, 
count, collationId);
   }
 }
 public static UTF8String execBinary(final UTF8String string, final 
UTF8String delimiter,
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
index 08f373a86ae3..95531c29a9af 100755
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
@@ -1657,7 +1657,7 @@ case class SubstringIndex(strExpr: Expression, delimExpr: 
Expression, countExpr:
 
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
 defineCodeGen(ctx, ev, (str, delim, count) =>
-  CollationSupport.SubstringIndex.genCode(str, delim, 
Integer.parseInt(count, 10), collationId))
+  CollationSupport.SubstringIndex.genCode(str, delim, count, collationId))
   }
 
   override protected def withNewChildrenInternal(
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
index 847c783e1936..7210979f0846 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
@@ -356,6 +356,8 @@ class StringExpressionsSuite extends SparkFunSuit

(spark) branch master updated: [SPARK-48865][SQL][FOLLOWUP] Add failOnError argument to replace TryEval in TryUrlDecode

2024-07-29 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 80c44329fec4 [SPARK-48865][SQL][FOLLOWUP] Add failOnError argument to 
replace TryEval in TryUrlDecode
80c44329fec4 is described below

commit 80c44329fec47f7eaa0f25b3dff96e2ee06c80c9
Author: wforget <643348...@qq.com>
AuthorDate: Tue Jul 30 11:25:39 2024 +0800

[SPARK-48865][SQL][FOLLOWUP] Add failOnError argument to replace TryEval in 
TryUrlDecode

### What changes were proposed in this pull request?

Add `failOnError` argument for `UrlDecode` to replace `TryEval` in 
`TryUrlDecode`.

### Why are the changes needed?

Address https://github.com/apache/spark/pull/47294#discussion_r1681150787

> I'm not a big fan of TryEval as it catches all the exceptions, including 
the ones not from UrlDecode, but from its input expressions.
>
> We should add a boolean flag in UrlDecode to control the null-on-error 
behavior.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing unit tests

### Was this patch authored or co-authored using generative AI tooling?

No
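
An editor-added sketch of the user-visible behavior (assumes an existing
SparkSession `spark`, as in spark-shell; the malformed input '%x' is
illustrative):

```scala
// Well-formed input decodes as before.
spark.sql("SELECT url_decode('https%3A%2F%2Fspark.apache.org')").show()

// try_url_decode now reaches UrlCodec.decode with failOnError = false, so a
// malformed input yields NULL instead of relying on a TryEval wrapper.
spark.sql("SELECT try_url_decode('%x')").show()

// url_decode('%x') would instead raise an error, since failOnError defaults to true.
```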

Closes #47514 from wForget/try_url_decode.

Authored-by: wforget <643348...@qq.com>
Signed-off-by: Wenchen Fan 
---
 .../explain-results/function_try_url_decode.explain |  2 +-
 .../explain-results/function_url_decode.explain |  2 +-
 .../spark/sql/catalyst/expressions/urlExpressions.scala | 17 ++---
 .../sql-tests/analyzer-results/url-functions.sql.out|  8 
 4 files changed, 16 insertions(+), 13 deletions(-)

diff --git 
a/connect/common/src/test/resources/query-tests/explain-results/function_try_url_decode.explain
 
b/connect/common/src/test/resources/query-tests/explain-results/function_try_url_decode.explain
index 74b360a6b5f3..d1d6d3b1476b 100644
--- 
a/connect/common/src/test/resources/query-tests/explain-results/function_try_url_decode.explain
+++ 
b/connect/common/src/test/resources/query-tests/explain-results/function_try_url_decode.explain
@@ -1,2 +1,2 @@
-Project [tryeval(static_invoke(UrlCodec.decode(g#0, UTF-8))) AS 
try_url_decode(g)#0]
+Project [static_invoke(UrlCodec.decode(g#0, UTF-8, false)) AS 
try_url_decode(g)#0]
 +- LocalRelation , [id#0L, a#0, b#0, d#0, e#0, f#0, g#0]
diff --git 
a/connect/common/src/test/resources/query-tests/explain-results/function_url_decode.explain
 
b/connect/common/src/test/resources/query-tests/explain-results/function_url_decode.explain
index 6111cc1374fb..ef203ac49a32 100644
--- 
a/connect/common/src/test/resources/query-tests/explain-results/function_url_decode.explain
+++ 
b/connect/common/src/test/resources/query-tests/explain-results/function_url_decode.explain
@@ -1,2 +1,2 @@
-Project [static_invoke(UrlCodec.decode(g#0, UTF-8)) AS url_decode(g)#0]
+Project [static_invoke(UrlCodec.decode(g#0, UTF-8, true)) AS url_decode(g)#0]
 +- LocalRelation , [id#0L, a#0, b#0, d#0, e#0, f#0, g#0]
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/urlExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/urlExpressions.scala
index c2b999f30161..84531d6b1ad6 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/urlExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/urlExpressions.scala
@@ -29,7 +29,7 @@ import org.apache.spark.sql.catalyst.trees.UnaryLike
 import org.apache.spark.sql.errors.{QueryCompilationErrors, 
QueryExecutionErrors}
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.internal.types.StringTypeAnyCollation
-import org.apache.spark.sql.types.{AbstractDataType, DataType}
+import org.apache.spark.sql.types.{AbstractDataType, BooleanType, DataType}
 import org.apache.spark.unsafe.types.UTF8String
 
 // scalastyle:off line.size.limit
@@ -86,16 +86,18 @@ case class UrlEncode(child: Expression)
   since = "3.4.0",
   group = "url_funcs")
 // scalastyle:on line.size.limit
-case class UrlDecode(child: Expression)
+case class UrlDecode(child: Expression, failOnError: Boolean = true)
   extends RuntimeReplaceable with UnaryLike[Expression] with 
ImplicitCastInputTypes {
 
+  def this(child: Expression) = this(child, true)
+
   override lazy val replacement: Expression =
 StaticInvoke(
   UrlCodec.getClass,
   SQLConf.get.defaultStringType,
   "decode",
-  Seq(child, Literal("UTF-8")),
-  Seq(StringTypeAnyCollation, StringTypeAnyCollation))
+  Seq(child, Literal("UTF-8"), Literal(failOnError)),
+  Seq(StringTypeAnyCollation, StringTypeAnyCollation, BooleanType))
 
   override protected def

(spark) branch master updated: [SPARK-48900] Add `reason` field for all internal calls for job/stage cancellation

2024-07-29 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 308669fc3019 [SPARK-48900] Add `reason` field for all internal calls 
for job/stage cancellation
308669fc3019 is described below

commit 308669fc301916837bacb7c3ec1ecef93190c094
Author: Mingkang Li 
AuthorDate: Tue Jul 30 09:02:27 2024 +0800

[SPARK-48900] Add `reason` field for all internal calls for job/stage 
cancellation

### What changes were proposed in this pull request?

The changes can be grouped into two categories:

- **Add `reason` field for all internal calls for job/stage cancellation**
  - Cancel because of exceptions:
- org/apache/spark/sql/connect/execution/ExecuteThreadRunner.scala
- org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala
- 
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
   - Cancel by user (Web UI):
  - org/apache/spark/ui/jobs/JobsTab.scala
   - Cancel when streaming terminates/query ends:
 - org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala
 - 
org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala

_(Developers familiar with these components: would appreciate it if you 
could provide suggestions for more helpful cancellation reason strings! :) )_

- **API Change for `JobWaiter`**
Currently, the `.cancel()` function in `JobWaiter` does not allow 
specifying a reason for cancellation. This limitation prevents us from 
reporting cancellation reasons in AQE, such as cleanup or canceling unnecessary 
jobs after query replanning. To address this, we should add a `reason: String` 
parameter to all relevant APIs along the chain.

https://github.com/user-attachments/assets/239cbcd6-8d78-446a-98d0-456e5e837494

### Why are the changes needed?

Today it is difficult to determine why a job, stage, or job group was 
canceled. We should leverage existing Spark functionality to provide a reason 
string explaining the cancellation cause, and should add new APIs to let us 
provide this reason when canceling job groups. For more context, please read 
[this JIRA ticket](https://issues.apache.org/jira/browse/SPARK-48900).

This feature can be implemented in two PRs:
1. [Modify the current SparkContext and its downstream APIs to add the 
reason string, such as cancelJobGroup and 
cancelJobsWithTag](https://github.com/apache/spark/pull/47361)
2. Add reasons for all internal calls to these methods.

**Note: This is the second of the two PRs to implement this new feature**

### Does this PR introduce _any_ user-facing change?

It adds reasons for jobs and stages cancelled internally, providing users 
with clearer explanations for cancellations by Spark, such as exceptions, end 
of streaming, AQE query replanning, etc.

### How was this patch tested?

Tests for the API changes on the `JobWaiter` chain are added in 
`JobCancellationSuite.scala`

### Was this patch authored or co-authored using generative AI tooling?

No.
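
An editor-added sketch of how a reason can now be threaded through a
cancellation (assumes a spark-shell style SparkSession `spark`; the
reason-accepting overload comes from the companion PR #47361, so its exact
shape should be treated as an assumption here):

```scala
val sc = spark.sparkContext
sc.setJobGroup("etl-backfill", "nightly backfill", interruptOnCancel = true)
// ... actions submitted under this job group ...
sc.cancelJobGroup("etl-backfill", "superseded by a newer backfill run")
// The reason string is then surfaced when the cancelled jobs are reported,
// instead of a bare "cancelled" status.
```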

Closes #47374 from mingkangli-db/reason_internal_calls.

Authored-by: Mingkang Li 
Signed-off-by: Wenchen Fan 
---
 .../connect/execution/ExecuteThreadRunner.scala|  4 ++-
 .../main/scala/org/apache/spark/FutureAction.scala | 15 ++---
 .../org/apache/spark/scheduler/JobWaiter.scala | 17 +++---
 .../scala/org/apache/spark/ui/jobs/JobsTab.scala   |  2 +-
 .../org/apache/spark/JobCancellationSuite.scala| 37 +-
 project/MimaExcludes.scala |  3 ++
 .../execution/adaptive/AdaptiveSparkPlanExec.scala |  2 +-
 .../sql/execution/adaptive/QueryStageExec.scala| 13 
 .../execution/exchange/BroadcastExchangeExec.scala | 11 ---
 .../execution/exchange/ShuffleExchangeExec.scala   |  6 ++--
 .../execution/streaming/MicroBatchExecution.scala  |  6 ++--
 .../streaming/continuous/ContinuousExecution.scala |  3 +-
 .../sql/execution/BroadcastExchangeSuite.scala |  3 +-
 .../SparkExecuteStatementOperation.scala   |  5 +--
 14 files changed, 94 insertions(+), 33 deletions(-)

diff --git 
a/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteThreadRunner.scala
 
b/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteThreadRunner.scala
index 4ef4f632204b..1a38f237ee09 100644
--- 
a/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteThreadRunner.scala
+++ 
b/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteThreadRunner.scala
@@ -115,7 +115,9 @@ private[connect] class ExecuteThreadRunner(executeHolder: 
ExecuteHolder)

[jira] [Resolved] (SPARK-48910) Slow linear searches in PreprocessTableCreation

2024-07-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48910.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47484
[https://github.com/apache/spark/pull/47484]

> Slow linear searches in PreprocessTableCreation
> ---
>
> Key: SPARK-48910
> URL: https://issues.apache.org/jira/browse/SPARK-48910
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Vladimir Golubev
>Assignee: Vladimir Golubev
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> PreprocessTableCreation does Seq.contains over partition columns, which 
> becomes very slow in case of 1000s of partitions



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48910) Slow linear searches in PreprocessTableCreation

2024-07-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48910:
---

Assignee: Vladimir Golubev

> Slow linear searches in PreprocessTableCreation
> ---
>
> Key: SPARK-48910
> URL: https://issues.apache.org/jira/browse/SPARK-48910
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Vladimir Golubev
>Assignee: Vladimir Golubev
>Priority: Major
>  Labels: pull-request-available
>
> PreprocessTableCreation does Seq.contains over partition columns, which 
> becomes very slow in case of 1000s of partitions



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-48910][SQL] Use HashSet/HashMap to avoid linear searches in PreprocessTableCreation

2024-07-29 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new db036537ab55 [SPARK-48910][SQL] Use HashSet/HashMap to avoid linear 
searches in PreprocessTableCreation
db036537ab55 is described below

commit db036537ab5593b2520742b3b1a1028bb0fcc7fa
Author: Vladimir Golubev 
AuthorDate: Mon Jul 29 22:07:02 2024 +0800

[SPARK-48910][SQL] Use HashSet/HashMap to avoid linear searches in 
PreprocessTableCreation

### What changes were proposed in this pull request?

Use `HashSet`/`HashMap` instead of doing linear searches over the `Seq`. In 
the case of thousands of partitions, this significantly improves performance.

### Why are the changes needed?

To avoid O(n*m) passes in `PreprocessTableCreation`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing UTs

### Was this patch authored or co-authored using generative AI tooling?

No
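
An editor-added sketch of the access-pattern change, standalone and only
loosely mirroring the rule's internals: a linear find per partition column is
O(n*m), while hashing the output once makes each lookup O(1).

```scala
import scala.collection.mutable.{HashMap, HashSet}

case class Attr(name: String)
val output: Seq[Attr] = (1 to 5000).map(i => Attr(s"c$i"))
val partitionColumnNames: Seq[String] = (4001 to 5000).map(i => s"c$i")

// Before: one linear scan of `output` per partition column.
// partitionColumnNames.map(p => output.find(_.name == p).get)

// After: build the lookup structures once, then do constant-time lookups.
val outputByName = HashMap(output.map(o => o.name -> o): _*)
val partitionAttrs = partitionColumnNames.map(outputByName)
val partitionAttrsSet = HashSet(partitionAttrs: _*)
val newOutput = output.filterNot(partitionAttrsSet.contains) ++ partitionAttrs
```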

Closes #47484 from 
vladimirg-db/vladimirg-db/get-rid-of-linear-searches-preprocess-table-creation.

Authored-by: Vladimir Golubev 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/execution/datasources/rules.scala| 22 +++---
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala
index 9bc0793650ac..9279862c9196 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala
@@ -19,6 +19,7 @@ package org.apache.spark.sql.execution.datasources
 
 import java.util.Locale
 
+import scala.collection.mutable.{HashMap, HashSet}
 import scala.jdk.CollectionConverters._
 
 import org.apache.spark.sql.{AnalysisException, SaveMode, SparkSession}
@@ -248,10 +249,14 @@ case class PreprocessTableCreation(catalog: 
SessionCatalog) extends Rule[Logical
 DDLUtils.checkTableColumns(tableDesc.copy(schema = 
analyzedQuery.schema))
 
 val output = analyzedQuery.output
+
+val outputByName = HashMap(output.map(o => o.name -> o): _*)
 val partitionAttrs = normalizedTable.partitionColumnNames.map { 
partCol =>
-  output.find(_.name == partCol).get
+  outputByName(partCol)
 }
-val newOutput = output.filterNot(partitionAttrs.contains) ++ 
partitionAttrs
+val partitionAttrsSet = HashSet(partitionAttrs: _*)
+val newOutput = output.filterNot(partitionAttrsSet.contains) ++ 
partitionAttrs
+
 val reorderedQuery = if (newOutput == output) {
   analyzedQuery
 } else {
@@ -263,12 +268,14 @@ case class PreprocessTableCreation(catalog: 
SessionCatalog) extends Rule[Logical
 DDLUtils.checkTableColumns(tableDesc)
 val normalizedTable = normalizeCatalogTable(tableDesc.schema, 
tableDesc)
 
+val normalizedSchemaByName = HashMap(normalizedTable.schema.map(s => 
s.name -> s): _*)
 val partitionSchema = normalizedTable.partitionColumnNames.map { 
partCol =>
-  normalizedTable.schema.find(_.name == partCol).get
+  normalizedSchemaByName(partCol)
 }
-
-val reorderedSchema =
-  
StructType(normalizedTable.schema.filterNot(partitionSchema.contains) ++ 
partitionSchema)
+val partitionSchemaSet = HashSet(partitionSchema: _*)
+val reorderedSchema = StructType(
+  normalizedTable.schema.filterNot(partitionSchemaSet.contains) ++ 
partitionSchema
+)
 
 c.copy(tableDesc = normalizedTable.copy(schema = reorderedSchema))
   }
@@ -360,8 +367,9 @@ case class PreprocessTableCreation(catalog: SessionCatalog) 
extends Rule[Logical
 messageParameters = Map.empty)
 }
 
+val normalizedPartitionColsSet = HashSet(normalizedPartitionCols: _*)
 schema
-  .filter(f => normalizedPartitionCols.contains(f.name))
+  .filter(f => normalizedPartitionColsSet.contains(f.name))
   .foreach { field =>
 if (!PartitioningUtils.canPartitionOn(field.dataType)) {
   throw 
QueryCompilationErrors.invalidPartitionColumnDataTypeError(field)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



Re: [VOTE] Release Spark 3.5.2 (RC4)

2024-07-29 Thread Wenchen Fan
+1

On Sat, Jul 27, 2024 at 10:03 AM Dongjoon Hyun 
wrote:

> +1
>
> Thank you, Kent.
>
> Dongjoon.
>
> On Fri, Jul 26, 2024 at 6:37 AM Kent Yao  wrote:
>
>> Hi dev,
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.5.2.
>>
>> The vote is open until Jul 29, 14:00:00 UTC, and passes if a majority +1
>> PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.5.2
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see https://spark.apache.org/
>>
>> The tag to be voted on is v3.5.2-rc4 (commit
>> 1edbddfadeb46581134fa477d35399ddc63b7163):
>> https://github.com/apache/spark/tree/v3.5.2-rc4
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc4-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1460/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc4-docs/
>>
>> The list of bug fixes going into 3.5.2 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12353980
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC via "pip install
>>
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc4-bin/pyspark-3.5.2.tar.gz
>> "
>> and see if anything important breaks.
>> In the Java/Scala, you can add the staging repository to your projects
>> resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out of date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.5.2?
>> ===
>>
>> The current list of open tickets targeted at 3.5.2 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for
>> "Target Version/s" = 3.5.2
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>> Thanks,
>> Kent Yao
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


[jira] [Assigned] (SPARK-45787) Support Catalog.listColumns() for clustering columns

2024-07-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-45787:
---

Assignee: Jiaheng Tang  (was: Terry Kim)

> Support Catalog.listColumns() for clustering columns
> 
>
> Key: SPARK-45787
> URL: https://issues.apache.org/jira/browse/SPARK-45787
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Terry Kim
>Assignee: Jiaheng Tang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Support Catalog.listColumns() for clustering columns so that
> `org.apache.spark.sql.catalog.Column` contains clustering info (e.g., isCluster).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45787) Support Catalog.listColumns() for clustering columns

2024-07-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-45787.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47451
[https://github.com/apache/spark/pull/47451]

> Support Catalog.listColumns() for clustering columns
> 
>
> Key: SPARK-45787
> URL: https://issues.apache.org/jira/browse/SPARK-45787
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Support Catalog.listColumns() for clustering columns so that
> `org.apache.spark.sql.catalog.Column` contains clustering info (e.g., isCluster).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45787) Support Catalog.listColumns() for clustering columns

2024-07-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-45787:
---

Assignee: Terry Kim

> Support Catalog.listColumns() for clustering columns
> 
>
> Key: SPARK-45787
> URL: https://issues.apache.org/jira/browse/SPARK-45787
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
>  Labels: pull-request-available
>
> Support Catalog.listColumns() for clustering columns so that
> `org.apache.spark.sql.catalog.Column` contains clustering info (e.g., isCluster).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated (f3b819ec642c -> e73ede7939f2)

2024-07-25 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from f3b819ec642c [SPARK-48503][SQL] Allow grouping on expressions in 
scalar subqueries, if they are bound to outer rows
 add e73ede7939f2 [SPARK-45787][SQL] Support Catalog.listColumns for 
clustering columns

No new revisions were added by this update.

Summary of changes:
 R/pkg/tests/fulltests/test_sparkSQL.R  |  3 +-
 .../org/apache/spark/sql/catalog/interface.scala   | 18 --
 python/pyspark/sql/catalog.py  |  2 ++
 python/pyspark/sql/connect/catalog.py  |  1 +
 python/pyspark/sql/tests/test_catalog.py   |  4 +++
 .../catalyst/catalog/ExternalCatalogSuite.scala|  9 +++--
 .../sql/catalyst/catalog/SessionCatalogSuite.scala |  5 ++-
 .../org/apache/spark/sql/catalog/interface.scala   | 17 +++--
 .../apache/spark/sql/internal/CatalogImpl.scala| 15 +---
 .../apache/spark/sql/internal/CatalogSuite.scala   | 41 ++
 10 files changed, 87 insertions(+), 28 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48503][SQL] Allow grouping on expressions in scalar subqueries, if they are bound to outer rows

2024-07-25 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f3b819ec642c [SPARK-48503][SQL] Allow grouping on expressions in 
scalar subqueries, if they are bound to outer rows
f3b819ec642c is described below

commit f3b819ec642c22327e84ff0c08d255a23bb51611
Author: Andrey Gubichev 
AuthorDate: Fri Jul 26 09:37:28 2024 +0800

[SPARK-48503][SQL] Allow grouping on expressions in scalar subqueries, if 
they are bound to outer rows

### What changes were proposed in this pull request?

Extends previous work in https://github.com/apache/spark/pull/46839, 
allowing the grouping expressions to be bound to outer references.

The most common example is:
`select *, (select count(*) from T_inner where cast(T_inner.x as date) = 
T_outer.date group by cast(T_inner.x as date))`

Here, we group by cast(T_inner.x as date), which is bound to an outer row. 
This guarantees that for every outer row, there is exactly one value of 
cast(T_inner.x as date), so it is safe to group on it.
Previously, we required that only columns could be bound to outer 
expressions, thus forbidding such subqueries.

### Why are the changes needed?

Extends supported subqueries

### Does this PR introduce _any_ user-facing change?

Yes, previously failing queries are now passing

### How was this patch tested?

Query tests

### Was this patch authored or co-authored using generative AI tooling?

No
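
An editor-added sketch of the newly allowed query shape (assumes a spark-shell
style SparkSession `spark`; table and column names are illustrative). The
subquery groups on an expression that the correlated predicate equates to the
outer row, so each outer row sees at most one group:

```scala
spark.sql("CREATE OR REPLACE TEMP VIEW t_outer AS SELECT DATE'2024-07-25' AS date")
spark.sql(
  "CREATE OR REPLACE TEMP VIEW t_inner AS SELECT TIMESTAMP'2024-07-25 10:00:00' AS x")

spark.sql("""
  SELECT *,
         (SELECT count(*)
          FROM t_inner
          WHERE CAST(t_inner.x AS DATE) = t_outer.date
          GROUP BY CAST(t_inner.x AS DATE)) AS cnt
  FROM t_outer
""").show()
```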

Closes #47388 from agubichev/group_by_cols.

Authored-by: Andrey Gubichev 
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/analysis/CheckAnalysis.scala  | 29 ++---
 .../spark/sql/catalyst/expressions/subquery.scala  | 73 --
 .../scalar-subquery-group-by.sql.out   | 53 
 .../scalar-subquery/scalar-subquery-group-by.sql   |  7 +++
 .../scalar-subquery-group-by.sql.out   | 41 
 5 files changed, 175 insertions(+), 28 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
index 2bc6785aa40c..3a18b0c4d2ea 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
@@ -905,26 +905,31 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog with QueryErrorsB
   // SPARK-18504/SPARK-18814: Block cases where GROUP BY columns
   // are not part of the correlated columns.
 
-  // Note: groupByCols does not contain outer refs - grouping by an outer 
ref is always ok
-  val groupByCols = 
AttributeSet(agg.groupingExpressions.flatMap(_.references))
-  // Collect the inner query attributes that are guaranteed to have a 
single value for each
-  // outer row. See comment on getCorrelatedEquivalentInnerColumns.
-  val correlatedEquivalentCols = getCorrelatedEquivalentInnerColumns(query)
-  val nonEquivalentGroupByCols = groupByCols -- correlatedEquivalentCols
+  // Collect the inner query expressions that are guaranteed to have a 
single value for each
+  // outer row. See comment on getCorrelatedEquivalentInnerExpressions.
+  val correlatedEquivalentExprs = 
getCorrelatedEquivalentInnerExpressions(query)
+  // Grouping expressions, except outer refs and constant expressions - 
grouping by an
+  // outer ref or a constant is always ok
+  val groupByExprs =
+ExpressionSet(agg.groupingExpressions.filter(x => 
!x.isInstanceOf[OuterReference] &&
+  x.references.nonEmpty))
+  val nonEquivalentGroupByExprs = groupByExprs -- correlatedEquivalentExprs
 
   val invalidCols = if (!SQLConf.get.getConf(
 
SQLConf.LEGACY_SCALAR_SUBQUERY_ALLOW_GROUP_BY_NON_EQUALITY_CORRELATED_PREDICATE))
 {
-nonEquivalentGroupByCols
+nonEquivalentGroupByExprs
   } else {
 // Legacy incorrect logic for checking for invalid group-by columns 
(see SPARK-48503).
 // Allows any inner attribute that appears in a correlated predicate, 
even if it is a
 // non-equality predicate or under an operator that can change the 
values of the attribute
 // (see comments on getCorrelatedEquivalentInnerColumns for examples).
+// Note: groupByCols does not contain outer refs - grouping by an 
outer ref is always ok
+val groupByCols = 
AttributeSet(agg.groupingExpressions.flatMap(_.references))
 val subqueryColumns = 
getCorrelatedPredicates(query).flatMap(_.references)
   .filterNot(conditions.flatMap(_.references).contains)
 val correlatedCols = A

Re: [外部邮件] [VOTE] Release Spark 3.5.2 (RC2)

2024-07-25 Thread Wenchen Fan
I'm changing my vote to -1 as we found a regression that breaks Delta
Lake's generated column feature. The fix was merged just now:
https://github.com/apache/spark/pull/47483

Can we cut a new RC?

On Thu, Jul 25, 2024 at 3:13 PM Mridul Muralidharan 
wrote:

>
> +1
>
> Signatures, digests, etc check out fine.
> Checked out tag and build/tested with -Phive -Pyarn -Pkubernetes
>
> Regards,
> Mridul
>
>
> On Tue, Jul 23, 2024 at 9:51 PM Kent Yao  wrote:
>
>> +1(non-binding), I have checked:
>>
>> - Download links are OK
>> - Signatures, Checksums, and the KEYS file are OK
>> - LICENSE and NOTICE are present
>> - No unexpected binary files in source releases
>> - Successfully built from source
>>
>> Thanks,
>> Kent Yao
>>
>> On 2024/07/23 06:55:28 yangjie01 wrote:
>> > +1, Thanks Kent Yao ~
>> >
>> > On 2024/7/22 17:01, Kent Yao wrote:
>> >
>> >
>> > Hi dev,
>> >
>> >
>> > Please vote on releasing the following candidate as Apache Spark
>> version 3.5.2.
>> >
>> >
>> > The vote is open until Jul 25, 09:00:00 AM UTC, and passes if a
>> majority +1
>> > PMC votes are cast, with
>> > a minimum of 3 +1 votes.
>> >
>> >
>> > [ ] +1 Release this package as Apache Spark 3.5.2
>> > [ ] -1 Do not release this package because ...
>> >
>> >
>> > To learn more about Apache Spark, please see https://spark.apache.org/
>> 
>> >
>> >
>> > The tag to be voted on is v3.5.2-rc2 (commit
>> > 6d8f511430881fa7a3203405260da174df424103):
>> > https://github.com/apache/spark/tree/v3.5.2-rc2
>> >
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc2-bin/
>> >
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1458/
>> >
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc2-docs/
>> >
>> >
>> > The list of bug fixes going into 3.5.2 can be found at the following URL:
>> > https://issues.apache.org/jira/projects/SPARK/versions/12353980
>> >
>> >
>> > FAQ
>> >
>> >
>> > =
>> > How can I help test this release?
>> > =
>> >
>> >
>> > If you are a Spark user, you can help us test this release by taking
>> > an existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> >
>> > If you're working in PySpark you can set up a virtual env and install
>> > the current RC via "pip install
>> >
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc2-bin/pyspark-3.5.2.tar.gz"
>> > and see if anything important breaks.
>> > In the Java/Scala, you can add the staging repository to your projects
>> > resolvers and test
>> > with the RC (make sure to clean up the artifact cache before/after so
>> > you don't end up building with an out of date RC going forward).
>> >
>> >
>> > ===
>> > What should happen to JIRA tickets still targeting 3.5.2?
>> > ===
>> >
>> >
>> > The current list of open tickets targeted at 3.5.2 can be found at:
>> > https://issues.apache.org/jira/projects/SPARK and search for
>> > "Target Version/s" = 3.5.2
>> >
>> >
>> > Committers should look at those and triage. Extremely important bug
>> > fixes, documentation, and API tweaks that impact compatibility should
>> > be worked on immediately. Everything else please retarget to an
>> > appropriate release.
>> >
>> >
>> > ==
>> > But my bug isn't fixed?
>> > ==
>> >
>> >
>> > In order to make timely releases, we will typically not hold the
>> > release unless the bug in question is a regression from the previous
>> > release. That being said, if there is something which is a regression
>> > that has not been correctly targeted please ping me or a committer to
>> > help target the issue.
>> >
>> >
>> > Thanks,
>> > Kent Yao
>> >
>> >
>> > -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >
>> >
>> >
>> >
>> >
>> >
>> > -
>> > To unsu

[jira] [Resolved] (SPARK-48761) Add clusterBy DataFrameWriter API for Scala

2024-07-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48761.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47301
[https://github.com/apache/spark/pull/47301]

> Add clusterBy DataFrameWriter API for Scala
> ---
>
> Key: SPARK-48761
> URL: https://issues.apache.org/jira/browse/SPARK-48761
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaheng Tang
>Assignee: Jiaheng Tang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add a new `clusterBy` DataFrameWriter API for Scala. This allows users to 
> interact with clustered tables using DataFrameWriter API.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48761) Add clusterBy DataFrameWriter API for Scala

2024-07-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48761:
---

Assignee: Jiaheng Tang

> Add clusterBy DataFrameWriter API for Scala
> ---
>
> Key: SPARK-48761
> URL: https://issues.apache.org/jira/browse/SPARK-48761
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaheng Tang
>Assignee: Jiaheng Tang
>Priority: Major
>  Labels: pull-request-available
>
> Add a new `clusterBy` DataFrameWriter API for Scala. This allows users to 
> interact with clustered tables using DataFrameWriter API.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-48761][SQL] Introduce clusterBy DataFrameWriter API for Scala

2024-07-24 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new bafce5de394b [SPARK-48761][SQL] Introduce clusterBy DataFrameWriter 
API for Scala
bafce5de394b is described below

commit bafce5de394bafe63226ab5dae0ce3ac4245a793
Author: Jiaheng Tang 
AuthorDate: Thu Jul 25 12:58:11 2024 +0800

[SPARK-48761][SQL] Introduce clusterBy DataFrameWriter API for Scala

### What changes were proposed in this pull request?

Introduce a new `clusterBy` DataFrame API in Scala. This PR adds the API 
for both the DataFrameWriter V1 and V2, as well as Spark Connect.

### Why are the changes needed?

Introduce more ways for users to interact with clustered tables.

### Does this PR introduce _any_ user-facing change?

Yes, it adds a new `clusterBy` DataFrame API in Scala to allow specifying 
the clustering columns when writing DataFrames.

### How was this patch tested?

New unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No
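
An editor-added sketch of the new API surface (assumes an existing DataFrame
`df` and a data source that supports clustering; the source and table names
are illustrative):

```scala
// DataFrameWriter (V1)
df.write
  .format("parquet")            // assumption: substitute a clustering-capable source
  .clusterBy("region", "day")
  .saveAsTable("events")

// DataFrameWriterV2
df.writeTo("events_v2")
  .using("parquet")             // assumption as above
  .clusterBy("region", "day")
  .create()
```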

Closes #47301 from zedtang/clusterby-scala-api.

Authored-by: Jiaheng Tang 
Signed-off-by: Wenchen Fan 
---
 .../src/main/resources/error/error-conditions.json | 14 
 .../sql/connect/planner/SparkConnectPlanner.scala  | 10 +++
 .../org/apache/spark/sql/connect/dsl/package.scala |  4 ++
 .../connect/planner/SparkConnectProtoSuite.scala   | 42 
 .../org/apache/spark/sql/DataFrameWriter.scala | 19 ++
 .../org/apache/spark/sql/DataFrameWriterV2.scala   | 23 +++
 .../org/apache/spark/sql/ClientDatasetSuite.scala  |  4 ++
 project/MimaExcludes.scala |  4 +-
 .../spark/sql/catalyst/catalog/interface.scala | 31 +
 .../spark/sql/errors/QueryCompilationErrors.scala  | 30 +
 .../sql/connector/catalog/InMemoryBaseTable.scala  |  4 ++
 .../org/apache/spark/sql/DataFrameWriter.scala | 66 +--
 .../org/apache/spark/sql/DataFrameWriterV2.scala   | 40 ++-
 .../execution/datasources/DataSourceUtils.scala|  5 ++
 .../spark/sql/execution/datasources/rules.scala| 19 +-
 .../apache/spark/sql/DataFrameWriterV2Suite.scala  | 50 +-
 .../execution/command/DescribeTableSuiteBase.scala | 51 ++
 .../sql/test/DataFrameReaderWriterSuite.scala  | 77 +-
 18 files changed, 482 insertions(+), 11 deletions(-)

diff --git a/common/utils/src/main/resources/error/error-conditions.json 
b/common/utils/src/main/resources/error/error-conditions.json
index 44f0a59a4b48..65e063518054 100644
--- a/common/utils/src/main/resources/error/error-conditions.json
+++ b/common/utils/src/main/resources/error/error-conditions.json
@@ -471,6 +471,20 @@
 ],
 "sqlState" : "0A000"
   },
+  "CLUSTERING_COLUMNS_MISMATCH" : {
+"message" : [
+  "Specified clustering does not match that of the existing table 
.",
+  "Specified clustering columns: [].",
+  "Existing clustering columns: []."
+],
+"sqlState" : "42P10"
+  },
+  "CLUSTERING_NOT_SUPPORTED" : {
+"message" : [
+  "'' does not support clustering."
+],
+"sqlState" : "42000"
+  },
   "CODEC_NOT_AVAILABLE" : {
 "message" : [
   "The codec  is not available."
diff --git 
a/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
 
b/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
index 013f0d83391c..e790a25ec97f 100644
--- 
a/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
+++ 
b/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
@@ -3090,6 +3090,11 @@ class SparkConnectPlanner(
   w.partitionBy(names.toSeq: _*)
 }
 
+if (writeOperation.getClusteringColumnsCount > 0) {
+  val names = writeOperation.getClusteringColumnsList.asScala
+  w.clusterBy(names.head, names.tail.toSeq: _*)
+}
+
 if (writeOperation.hasSource) {
   w.format(writeOperation.getSource)
 }
@@ -3153,6 +3158,11 @@ class SparkConnectPlanner(
   w.partitionedBy(names.head, names.tail: _*)
 }
 
+if (writeOperation.getClusteringColumnsCount > 0) {
+  val names = writeOperation.getClusteringColumnsList.asScala
+  w.clusterBy(names.head, names.tail.toSeq: _*)
+}
+
 writeOperation.getMode match {
   case proto.WriteOperationV2.Mode.MODE_CREATE =>
 if (writeOperation.hasProvider) {
diff --git 
a/connect/server/src/test/scala/org/apache/spark/sql/connect/dsl/package.scala 
b/connect/server/src

[jira] [Assigned] (SPARK-48990) Unified variable related SQL syntax keywords

2024-07-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48990:
---

Assignee: BingKun Pan

> Unified variable related SQL syntax keywords
> 
>
> Key: SPARK-48990
> URL: https://issues.apache.org/jira/browse/SPARK-48990
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48990) Unified variable related SQL syntax keywords

2024-07-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48990.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47469
[https://github.com/apache/spark/pull/47469]

> Unified variable related SQL syntax keywords
> 
>
> Key: SPARK-48990
> URL: https://issues.apache.org/jira/browse/SPARK-48990
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-48990][SQL] Unified variable related SQL syntax keywords

2024-07-24 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 34e65a8e7251 [SPARK-48990][SQL] Unified variable related SQL syntax 
keywords
34e65a8e7251 is described below

commit 34e65a8e72513d0445b5ff9b251e3388625ad1ec
Author: panbingkun 
AuthorDate: Wed Jul 24 22:52:04 2024 +0800

[SPARK-48990][SQL] Unified variable related SQL syntax keywords

### What changes were proposed in this pull request?
The PR unifies the `variable`-related `SQL syntax` keywords, enabling the 
`DECLARE (OR REPLACE)? ...` and `DROP TEMPORARY ...` syntax to support the 
keyword `VAR` (not only `VARIABLE`).

### Why are the changes needed?
When `setting` variables, we support `(VARIABLE | VAR)`, but when 
`declaring` and `dropping` variables, we only support the keyword `VARIABLE` 
(the keyword `VAR` is not supported).

https://github.com/user-attachments/assets/07084fef-4080-4410-a74c-e6001ae0a8fa


https://github.com/apache/spark/blob/285489b0225004e918b6e937f7367e492292815e/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4#L68-L72


https://github.com/apache/spark/blob/285489b0225004e918b6e937f7367e492292815e/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4#L218-L220

The current syntax seems `a bit weird` and gives end-users an `inconsistent experience` 
with variable-related SQL syntax, so I propose to `unify` it.
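
A minimal sketch of the unified keywords (runnable in spark-shell against a build
that includes this change; the variable names are illustrative):

```scala
// After this change, VAR and VARIABLE should be interchangeable in DECLARE and DROP,
// not only in SET.
spark.sql("DECLARE OR REPLACE VAR v1 INT DEFAULT 10")        // previously required VARIABLE
spark.sql("DECLARE OR REPLACE VARIABLE v2 INT DEFAULT 20")
spark.sql("SET VAR v1 = 42")                                 // SET already accepted both keywords
spark.sql("SELECT v1 + v2").show()
spark.sql("DROP TEMPORARY VAR IF EXISTS v1")                 // previously required VARIABLE
spark.sql("DROP TEMPORARY VARIABLE IF EXISTS v2")
```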

### Does this PR introduce _any_ user-facing change?
Yes, it enables end-users to write `variable related SQL` with `consistent` keywords.

### How was this patch tested?
Updated existing UTs.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47469 from panbingkun/SPARK-48990.

Authored-by: panbingkun 
Signed-off-by: Wenchen Fan 
---
 docs/sql-ref-syntax-ddl-declare-variable.md|   2 +-
 docs/sql-ref-syntax-ddl-drop-variable.md   |   2 +-
 .../spark/sql/catalyst/parser/SqlBaseParser.g4 |  17 +-
 .../analyzer-results/sql-session-variables.sql.out |   4 +-
 .../sql-tests/inputs/sql-session-variables.sql |   4 +-
 .../results/sql-session-variables.sql.out  |   4 +-
 .../command/DeclareVariableParserSuite.scala   | 188 +
 .../command/DropVariableParserSuite.scala  |  56 ++
 8 files changed, 263 insertions(+), 14 deletions(-)

diff --git a/docs/sql-ref-syntax-ddl-declare-variable.md 
b/docs/sql-ref-syntax-ddl-declare-variable.md
index 518770c4496e..ba9857bf1917 100644
--- a/docs/sql-ref-syntax-ddl-declare-variable.md
+++ b/docs/sql-ref-syntax-ddl-declare-variable.md
@@ -34,7 +34,7 @@ column default expressions, and generated column expressions.
 ### Syntax
 
 ```sql
-DECLARE [ OR REPLACE ] [ VARIABLE ]
+DECLARE [ OR REPLACE ] [ VAR | VARIABLE ]
 variable_name [ data_type ] [ { DEFAULT | = } default_expr ]
 ```
 
diff --git a/docs/sql-ref-syntax-ddl-drop-variable.md 
b/docs/sql-ref-syntax-ddl-drop-variable.md
index f58d944317d5..5b2b0da76945 100644
--- a/docs/sql-ref-syntax-ddl-drop-variable.md
+++ b/docs/sql-ref-syntax-ddl-drop-variable.md
@@ -27,7 +27,7 @@ be thrown if the variable does not exist.
 ### Syntax
 
 ```sql
-DROP TEMPORARY VARIABLE [ IF EXISTS ] variable_name
+DROP TEMPORARY { VAR | VARIABLE } [ IF EXISTS ] variable_name
 ```
 
 ### Parameters
diff --git 
a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4 
b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
index a50051715e20..c7aa56cf920a 100644
--- 
a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
+++ 
b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
@@ -66,8 +66,8 @@ compoundStatement
 ;
 
 setStatementWithOptionalVarKeyword
-: SET (VARIABLE | VAR)? assignmentList  
#setVariableWithOptionalKeyword
-| SET (VARIABLE | VAR)? LEFT_PAREN multipartIdentifierList RIGHT_PAREN EQ
+: SET variable? assignmentList  
#setVariableWithOptionalKeyword
+| SET variable? LEFT_PAREN multipartIdentifierList RIGHT_PAREN EQ
 LEFT_PAREN query RIGHT_PAREN
#setVariableWithOptionalKeyword
 ;
 
@@ -215,9 +215,9 @@ statement
 routineCharacteristics
 RETURN (query | expression)
#createUserDefinedFunction
 | DROP TEMPORARY? FUNCTION (IF EXISTS)? identifierReference
#dropFunction
-| DECLARE (OR REPLACE)? VARIABLE?
+| DECLARE (OR REPLACE)? variable?
 identifierReference dataType? variableDefaultExpression?   
#createVariable
-| DROP TEMPORARY VARIABLE (IF EXISTS)? identifierReference 
#dropVariable
+| DROP TEMPORARY vari

[jira] [Resolved] (SPARK-48833) Support variant in `InMemoryTableScan`

2024-07-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48833.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47252
[https://github.com/apache/spark/pull/47252]

> Support variant in `InMemoryTableScan`
> --
>
> Key: SPARK-48833
> URL: https://issues.apache.org/jira/browse/SPARK-48833
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Richard Chen
>Assignee: Richard Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently, df.cache() does not support tables with variant types. We should 
> add support for this.






(spark) branch master updated: [SPARK-48833][SQL][VARIANT] Support variant in `InMemoryTableScan`

2024-07-24 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0c9b072e1808 [SPARK-48833][SQL][VARIANT] Support variant in 
`InMemoryTableScan`
0c9b072e1808 is described below

commit 0c9b072e1808e180d8670e4100f3344f039cb072
Author: Richard Chen 
AuthorDate: Wed Jul 24 18:02:38 2024 +0800

[SPARK-48833][SQL][VARIANT] Support variant in `InMemoryTableScan`

### What changes were proposed in this pull request?

Adds support for the variant type in `InMemoryTableScan` (i.e. `df.cache()`) by 
supporting writing variant values to an in-memory buffer.

### Why are the changes needed?

Prior to this PR, calling `df.cache()` on a DataFrame that has a variant column would 
fail because `InMemoryTableScan` did not support reading variant types.
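
A minimal sketch of the newly supported pattern (runnable in spark-shell on a master
build with VARIANT support; the JSON literal is illustrative):

```scala
// Before this change, cache() on a DataFrame with a variant column failed once the
// cached relation was scanned, because the in-memory columnar format had no VARIANT support.
val df = spark.sql("""SELECT id, parse_json('{"a": 1, "b": "x"}') AS v FROM range(3)""")
df.cache()      // variant values are written to the in-memory columnar buffer
df.count()      // materializes the cache (InMemoryTableScan)
df.show(false)
```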

### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

added UTs
### Was this patch authored or co-authored using generative AI tooling?

no

Closes #47252 from richardc-db/variant_dfcache_support.

Authored-by: Richard Chen 
Signed-off-by: Wenchen Fan 
---
 .../sql/execution/columnar/ColumnAccessor.scala|  6 +-
 .../sql/execution/columnar/ColumnBuilder.scala |  4 ++
 .../spark/sql/execution/columnar/ColumnStats.scala | 15 +
 .../spark/sql/execution/columnar/ColumnType.scala  | 43 -
 .../columnar/GenerateColumnAccessor.scala  |  3 +-
 .../scala/org/apache/spark/sql/VariantSuite.scala  | 72 ++
 6 files changed, 139 insertions(+), 4 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnAccessor.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnAccessor.scala
index 9652a48e5270..2074649cc986 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnAccessor.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnAccessor.scala
@@ -28,7 +28,7 @@ import org.apache.spark.sql.errors.QueryExecutionErrors
 import 
org.apache.spark.sql.execution.columnar.compression.CompressibleColumnAccessor
 import org.apache.spark.sql.execution.vectorized.WritableColumnVector
 import org.apache.spark.sql.types._
-import org.apache.spark.unsafe.types.CalendarInterval
+import org.apache.spark.unsafe.types.{CalendarInterval, VariantVal}
 
 /**
  * An `Iterator` like trait used to extract values from columnar byte buffer. 
When a value is
@@ -111,6 +111,10 @@ private[columnar] class IntervalColumnAccessor(buffer: 
ByteBuffer)
   extends BasicColumnAccessor[CalendarInterval](buffer, CALENDAR_INTERVAL)
   with NullableColumnAccessor
 
+private[columnar] class VariantColumnAccessor(buffer: ByteBuffer)
+  extends BasicColumnAccessor[VariantVal](buffer, VARIANT)
+  with NullableColumnAccessor
+
 private[columnar] class CompactDecimalColumnAccessor(buffer: ByteBuffer, 
dataType: DecimalType)
   extends NativeColumnAccessor(buffer, COMPACT_DECIMAL(dataType))
 
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnBuilder.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnBuilder.scala
index 9fafdb794841..b65ef12f12d5 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnBuilder.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnBuilder.scala
@@ -131,6 +131,9 @@ class BinaryColumnBuilder extends ComplexColumnBuilder(new 
BinaryColumnStats, BI
 private[columnar]
 class IntervalColumnBuilder extends ComplexColumnBuilder(new 
IntervalColumnStats, CALENDAR_INTERVAL)
 
+private[columnar]
+class VariantColumnBuilder extends ComplexColumnBuilder(new 
VariantColumnStats, VARIANT)
+
 private[columnar] class CompactDecimalColumnBuilder(dataType: DecimalType)
   extends NativeColumnBuilder(new DecimalColumnStats(dataType), 
COMPACT_DECIMAL(dataType))
 
@@ -189,6 +192,7 @@ private[columnar] object ColumnBuilder {
   case s: StringType => new StringColumnBuilder(s)
   case BinaryType => new BinaryColumnBuilder
   case CalendarIntervalType => new IntervalColumnBuilder
+  case VariantType => new VariantColumnBuilder
   case dt: DecimalType if dt.precision <= Decimal.MAX_LONG_DIGITS =>
 new CompactDecimalColumnBuilder(dt)
   case dt: DecimalType => new DecimalColumnBuilder(dt)
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala
index 45f489cb13c2..4e4b3667fa24 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala
@@ -297,6 +297,2

[jira] [Assigned] (SPARK-48833) Support variant in `InMemoryTableScan`

2024-07-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48833:
---

Assignee: Richard Chen

> Support variant in `InMemoryTableScan`
> --
>
> Key: SPARK-48833
> URL: https://issues.apache.org/jira/browse/SPARK-48833
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Richard Chen
>Assignee: Richard Chen
>Priority: Major
>  Labels: pull-request-available
>
> Currently, df.cache() does not support tables with variant types. We should 
> add support for this.






[jira] [Resolved] (SPARK-48338) Sql Scripting support for Spark SQL

2024-07-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48338.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47404
[https://github.com/apache/spark/pull/47404]

> Sql Scripting support for Spark SQL
> ---
>
> Key: SPARK-48338
> URL: https://issues.apache.org/jira/browse/SPARK-48338
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Assignee: Aleksandar Tomic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: Sql Scripting - OSS.odt, [Design Doc] Sql Scripting - 
> OSS.pdf
>
>
> Design doc for this feature is in attachment.
> High level example of Sql Script:
> ```
> BEGIN
>   DECLARE c INT = 10;
>   WHILE c > 0 DO
> INSERT INTO tscript VALUES (c);
> SET c = c - 1;
>   END WHILE;
> END
> ```
> High level motivation behind this feature:
> SQL Scripting gives customers the ability to develop complex ETL and analysis 
> entirely in SQL. Until now, customers have had to write verbose SQL 
> statements or combine SQL + Python to efficiently write business logic. 
> Coming from another system, customers have to choose whether or not they want 
> to migrate to pyspark. Some customers end up not using Spark because of this 
> gap. SQL Scripting is a key milestone towards enabling SQL practitioners to 
> write sophisticated queries, without the need to use pyspark. Further, SQL 
> Scripting is a necessary step towards support for SQL Stored Procedures, and 
> along with SQL Variables (released) and Temp Tables (in progress), will allow 
> for more seamless data warehouse migrations.






[jira] [Assigned] (SPARK-48338) Sql Scripting support for Spark SQL

2024-07-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48338:
---

Assignee: Aleksandar Tomic

> Sql Scripting support for Spark SQL
> ---
>
> Key: SPARK-48338
> URL: https://issues.apache.org/jira/browse/SPARK-48338
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Assignee: Aleksandar Tomic
>Priority: Major
>  Labels: pull-request-available
> Attachments: Sql Scripting - OSS.odt, [Design Doc] Sql Scripting - 
> OSS.pdf
>
>
> Design doc for this feature is in attachment.
> High level example of Sql Script:
> ```
> BEGIN
>   DECLARE c INT = 10;
>   WHILE c > 0 DO
> INSERT INTO tscript VALUES (c);
> SET c = c - 1;
>   END WHILE;
> END
> ```
> High level motivation behind this feature:
> SQL Scripting gives customers the ability to develop complex ETL and analysis 
> entirely in SQL. Until now, customers have had to write verbose SQL 
> statements or combine SQL + Python to efficiently write business logic. 
> Coming from another system, customers have to choose whether or not they want 
> to migrate to pyspark. Some customers end up not using Spark because of this 
> gap. SQL Scripting is a key milestone towards enabling SQL practitioners to 
> write sophisticated queries, without the need to use pyspark. Further, SQL 
> Scripting is a necessary step towards support for SQL Stored Procedures, and 
> along with SQL Variables (released) and Temp Tables (in progress), will allow 
> for more seamless data warehouse migrations.






(spark) branch master updated: [SPARK-48338][SQL] Check variable declarations

2024-07-24 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 239d77b86ca4 [SPARK-48338][SQL] Check variable declarations
239d77b86ca4 is described below

commit 239d77b86ca4ae6ba1ab334db49f4d39622ae2e6
Author: Momcilo Mrkaic 
AuthorDate: Wed Jul 24 15:19:08 2024 +0800

[SPARK-48338][SQL] Check variable declarations

### What changes were proposed in this pull request?

Checks whether variable declarations appear only at the beginning of a BEGIN ... END 
block.

### Why are the changes needed?

The SQL standard states that variables can be declared only immediately after BEGIN.

### Does this PR introduce _any_ user-facing change?

Users will get an error if they try to declare a variable in a scope that does not 
start with BEGIN and end with END, or if the declarations are not placed immediately 
after BEGIN.
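
A sketch of the rule being enforced (script text only; how a script is submitted for
execution is still evolving in the SQL scripting work, so this is illustrative rather
than an end-to-end example):

```scala
// Declarations immediately after BEGIN: allowed.
val okScript =
  """BEGIN
    |  DECLARE c INT = 10;
    |  DECLARE msg STRING = 'hello';
    |  SET c = c - 1;
    |END""".stripMargin

// A declaration after another statement: expected to fail with
// INVALID_VARIABLE_DECLARATION.ONLY_AT_BEGINNING (error condition added in this change).
val badScript =
  """BEGIN
    |  SELECT 1;
    |  DECLARE c INT = 10;
    |END""".stripMargin
```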

### How was this patch tested?

Tests are in SqlScriptingParserSuite. There are two tests for now: one where 
declarations are correctly written, and one where declarations are not written 
immediately after BEGIN. There is a TODO to add a test where a declaration is 
located in a scope that is not a BEGIN ... END block.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47404 from momcilomrk-db/check_variable_declarations.

Authored-by: Momcilo Mrkaic 
Signed-off-by: Wenchen Fan 
---
 .../src/main/resources/error/error-conditions.json | 18 ++
 .../src/main/resources/error/error-states.json |  6 
 .../spark/sql/catalyst/parser/AstBuilder.scala | 42 ++
 .../spark/sql/errors/SqlScriptingErrors.scala  | 14 
 .../catalyst/parser/SqlScriptingParserSuite.scala  | 32 +
 5 files changed, 106 insertions(+), 6 deletions(-)

diff --git a/common/utils/src/main/resources/error/error-conditions.json 
b/common/utils/src/main/resources/error/error-conditions.json
index d8edc89ba83e..44f0a59a4b48 100644
--- a/common/utils/src/main/resources/error/error-conditions.json
+++ b/common/utils/src/main/resources/error/error-conditions.json
@@ -2941,6 +2941,24 @@
 ],
 "sqlState" : "22029"
   },
+  "INVALID_VARIABLE_DECLARATION" : {
+"message" : [
+  "Invalid variable declaration."
+],
+"subClass" : {
+  "NOT_ALLOWED_IN_SCOPE" : {
+"message" : [
+  "Variable  was declared on line , which is not 
allowed in this scope."
+]
+  },
+  "ONLY_AT_BEGINNING" : {
+"message" : [
+  "Variable  can only be declared at the beginning of the 
compound, but it was declared on line ."
+]
+  }
+},
+"sqlState" : "42K0M"
+  },
   "INVALID_VARIABLE_TYPE_FOR_QUERY_EXECUTE_IMMEDIATE" : {
 "message" : [
   "Variable type must be string type but got ."
diff --git a/common/utils/src/main/resources/error/error-states.json 
b/common/utils/src/main/resources/error/error-states.json
index 0cd55bda7ba3..c5c55f11a6aa 100644
--- a/common/utils/src/main/resources/error/error-states.json
+++ b/common/utils/src/main/resources/error/error-states.json
@@ -4619,6 +4619,12 @@
 "standard": "N",
 "usedBy": ["Spark"]
 },
+"42K0M": {
+"description": "Invalid variable declaration.",
+"origin": "Spark,",
+"standard": "N",
+"usedBy": ["Spark"]
+},
 "42KD0": {
 "description": "Ambiguous name reference.",
 "origin": "Databricks",
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
index f5cf3e717a3c..a046ededf964 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
@@ -48,8 +48,7 @@ import 
org.apache.spark.sql.catalyst.util.DateTimeUtils.{convertSpecialDate, con
 import org.apache.spark.sql.connector.catalog.{CatalogV2Util, 
SupportsNamespaces, TableCatalog}
 import org.apache.spark.sql.connector.catalog.TableChange.ColumnPosition
 import org.apache.spark.sql.connector.expressions.{ApplyTransform, 
BucketTransform, DaysTransform, Expression => V2Expression, FieldReference, 
HoursTransform, IdentityTransform, LiteralValue, MonthsTransform, Transform, 
YearsTransform}
-import org.apache.spark.sql.errors.{QueryCompilationE

[jira] [Resolved] (SPARK-48935) Make `checkEvaluation` directly check the `Collation` expression itself in UT

2024-07-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48935.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47401
[https://github.com/apache/spark/pull/47401]

> Make `checkEvaluation` directly check the `Collation` expression itself in UT 
> --
>
> Key: SPARK-48935
> URL: https://issues.apache.org/jira/browse/SPARK-48935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







(spark) branch master updated: [SPARK-48935][SQL][TESTS] Make `checkEvaluation` directly check the `Collation` expression itself in UT

2024-07-23 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 4de4ed10ebd5 [SPARK-48935][SQL][TESTS] Make `checkEvaluation` directly 
check the `Collation` expression itself in UT
4de4ed10ebd5 is described below

commit 4de4ed10ebd5ddc949d4d29fe77f1834e1447794
Author: panbingkun 
AuthorDate: Wed Jul 24 14:44:11 2024 +0800

[SPARK-48935][SQL][TESTS] Make `checkEvaluation` directly check the 
`Collation` expression itself in UT

### What changes were proposed in this pull request?
The PR aims to:
- make `checkEvaluation` directly check the `Collation` expression itself 
in UTs, rather than `Collation(...).replacement`.
- fix a check in a UT that was missing its `assert` wrapper.

### Why are the changes needed?
When checking a `RuntimeReplaceable` expression in a UT, there is no need 
to write `checkEvaluation(Collation(Literal("abc")).replacement, 
"UTF8_BINARY")`, because `checkEvaluation` already performs a similar replacement 
internally.

https://github.com/apache/spark/blob/1a428c1606645057ef94ac8a6cadbb947b9208a6/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala#L75
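
A minimal sketch of the resulting test style (assumes the catalyst test classpath; the
suite and test names are illustrative):

```scala
import org.apache.spark.SparkFunSuite
import org.apache.spark.sql.catalyst.expressions.{Collation, ExpressionEvalHelper, Literal}

// checkEvaluation unwraps RuntimeReplaceable expressions internally, so the Collation
// expression can be passed directly, without `.replacement`.
class CollationEvalExampleSuite extends SparkFunSuite with ExpressionEvalHelper {
  test("collation name of a plain string literal") {
    checkEvaluation(Collation(Literal("abc")), "UTF8_BINARY")
  }
}
```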

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Updated existing UTs.
- Passed GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47401 from panbingkun/SPARK-48935.

Authored-by: panbingkun 
Signed-off-by: Wenchen Fan 
---
 .../expressions/CollationExpressionSuite.scala |  12 +-
 .../CollationRegexpExpressionsSuite.scala  | 121 +
 2 files changed, 84 insertions(+), 49 deletions(-)

diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollationExpressionSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollationExpressionSuite.scala
index a4651c6c4c7e..175dd05d5911 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollationExpressionSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollationExpressionSuite.scala
@@ -28,14 +28,14 @@ class CollationExpressionSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 assert(collationId == 0)
 val collateExpr = Collate(Literal("abc"), "UTF8_BINARY")
 assert(collateExpr.dataType === StringType(collationId))
-collateExpr.dataType.asInstanceOf[StringType].collationId == 0
+assert(collateExpr.dataType.asInstanceOf[StringType].collationId == 0)
 checkEvaluation(collateExpr, "abc")
   }
 
   test("collate against literal") {
 val collateExpr = Collate(Literal("abc"), "UTF8_LCASE")
 val collationId = CollationFactory.collationNameToId("UTF8_LCASE")
-assert(collateExpr.dataType == StringType(collationId))
+assert(collateExpr.dataType === StringType(collationId))
 checkEvaluation(collateExpr, "abc")
   }
 
@@ -67,16 +67,16 @@ class CollationExpressionSuite extends SparkFunSuite with 
ExpressionEvalHelper {
   }
 
   test("collation on non-explicit default collation") {
-checkEvaluation(Collation(Literal("abc")).replacement, "UTF8_BINARY")
+checkEvaluation(Collation(Literal("abc")), "UTF8_BINARY")
   }
 
   test("collation on explicitly collated string") {
 checkEvaluation(
   Collation(Literal.create("abc",
-StringType(CollationFactory.UTF8_LCASE_COLLATION_ID))).replacement,
+StringType(CollationFactory.UTF8_LCASE_COLLATION_ID))),
   "UTF8_LCASE")
 checkEvaluation(
-  Collation(Collate(Literal("abc"), "UTF8_LCASE")).replacement,
+  Collation(Collate(Literal("abc"), "UTF8_LCASE")),
   "UTF8_LCASE")
   }
 
@@ -212,7 +212,7 @@ class CollationExpressionSuite extends SparkFunSuite with 
ExpressionEvalHelper {
   ("sR_cYRl_sRb", "sr_Cyrl_SRB")
 ).foreach {
   case (collation, normalized) =>
-checkEvaluation(Collation(Literal.create("abc", 
StringType(collation))).replacement,
+checkEvaluation(Collation(Literal.create("abc", 
StringType(collation))),
   normalized)
 }
   }
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollationRegexpExpressionsSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollationRegexpExpressionsSuite.scala
index 6f0d0c13b32a..2c1244eec365 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollationRegexpExpressionsSuite.scala
+++ 
b/sql/ca

Re: [External Email] [VOTE] Release Spark 3.5.2 (RC2)

2024-07-23 Thread Wenchen Fan
+1

On Wed, Jul 24, 2024 at 10:51 AM Kent Yao  wrote:

> +1(non-binding), I have checked:
>
> - Download links are OK
> - Signatures, Checksums, and the KEYS file are OK
> - LICENSE and NOTICE are present
> - No unexpected binary files in source releases
> - Successfully built from source
>
> Thanks,
> Kent Yao
>
> On 2024/07/23 06:55:28 yangjie01 wrote:
> > +1, Thanks Kent Yao ~
> >
> > On 2024/7/22 17:01, Kent Yao  wrote:
> >
> >
> > Hi dev,
> >
> >
> > Please vote on releasing the following candidate as Apache Spark version
> 3.5.2.
> >
> >
> > The vote is open until Jul 25, 09:00:00 AM UTC, and passes if a majority
> +1
> > PMC votes are cast, with
> > a minimum of 3 +1 votes.
> >
> >
> > [ ] +1 Release this package as Apache Spark 3.5.2
> > [ ] -1 Do not release this package because ...
> >
> >
> > To learn more about Apache Spark, please see https://spark.apache.org/
> >
> >
> > The tag to be voted on is v3.5.2-rc2 (commit
> > 6d8f511430881fa7a3203405260da174df424103):
> > https://github.com/apache/spark/tree/v3.5.2-rc2
> >
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc2-bin/
> >
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1458/
> 
> >
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc2-docs/
> >
> >
> > The list of bug fixes going into 3.5.2 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12353980
> >
> >
> > FAQ
> >
> >
> > =
> > How can I help test this release?
> > =
> >
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC via "pip install
> >
> https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc2-bin/pyspark-3.5.2.tar.gz"
> > and see if anything important breaks.
> > In the Java/Scala, you can add the staging repository to your projects
> > resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out of date RC going forward).
> >
> >
> > ===
> > What should happen to JIRA tickets still targeting 3.5.2?
> > ===
> >
> >
> > The current list of open tickets targeted at 3.5.2 can be found at:
> > https://issues.apache.org/jira/projects/SPARK and search for
> > "Target Version/s" = 3.5.2
> >
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> >
> > ==
> > But my bug isn't fixed?
> > ==
> >
> >
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
> >
> >
> > Thanks,
> > Kent Yao
> >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
> >
> >
> >
> >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


[jira] [Resolved] (SPARK-48914) Add OFFSET operator as an option in the subquery generator

2024-07-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48914.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47375
[https://github.com/apache/spark/pull/47375]

> Add OFFSET operator as an option in the subquery generator
> --
>
> Key: SPARK-48914
> URL: https://issues.apache.org/jira/browse/SPARK-48914
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 4.0.0
>Reporter: Nick Young
>Assignee: Nick Young
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> {{GeneratedSubquerySuite}} is a test suite that generates SQL with variations 
> of subqueries. Currently, the operators supported are Joins, Set Operations, 
> Aggregate (with/without group by) and Limit. Implementing OFFSET will 
> increase coverage by 1 additional axis, and should be simple.






(spark) branch master updated: [SPARK-48914][SQL][TESTS] Add OFFSET operator as an option in the subquery generator

2024-07-22 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0b6cb3e92a03 [SPARK-48914][SQL][TESTS] Add OFFSET operator as an 
option in the subquery generator
0b6cb3e92a03 is described below

commit 0b6cb3e92a03bc3d472f7bc03a6519c0be4187ae
Author: Avery Qi 
AuthorDate: Tue Jul 23 09:36:30 2024 +0800

[SPARK-48914][SQL][TESTS] Add OFFSET operator as an option in the subquery 
generator

### What changes were proposed in this pull request?
This adds the OFFSET operator to the subquery generator suite.

### Why are the changes needed?
To complete the subquery generator functionality.

### Does this PR introduce _any_ user-facing change?
Previously, no subqueries containing the OFFSET operator were tested. Now the OFFSET 
operator is covered as well.
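
An illustrative shape of the subqueries the generator can now produce (table and column
names are hypothetical; the suite generates its own schema, and per SPARK-46446 the
OFFSET clause is skipped for correlated subqueries):

```scala
// Illustrative only: an uncorrelated IN subquery with both LIMIT and OFFSET,
// ordered so the test results stay deterministic.
val exampleGeneratedQuery =
  """SELECT t1.a
    |FROM t1
    |WHERE t1.a IN (SELECT t2.b FROM t2 ORDER BY t2.b LIMIT 3 OFFSET 2)
    |ORDER BY t1.a""".stripMargin
```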

### How was this patch tested?
query test

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47375 from averyqi-db/offset_operator.

Authored-by: Avery Qi 
Signed-off-by: Wenchen Fan 
---
 .../jdbc/querytest/GeneratedSubquerySuite.scala| 51 +-
 .../apache/spark/sql/QueryGeneratorHelper.scala| 16 +--
 2 files changed, 51 insertions(+), 16 deletions(-)

diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/querytest/GeneratedSubquerySuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/querytest/GeneratedSubquerySuite.scala
index 8cde20529d7a..b526599482da 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/querytest/GeneratedSubquerySuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/querytest/GeneratedSubquerySuite.scala
@@ -126,33 +126,49 @@ class GeneratedSubquerySuite extends 
DockerJDBCIntegrationSuite with QueryGenera
   case _ => None
 }
 
-// For the OrderBy, consider whether or not the result of the subquery is 
required to be sorted.
-// This is to maintain test determinism. This is affected by whether the 
subquery has a limit
-// clause.
-val requiresLimitOne = isScalarSubquery && (operatorInSubquery match {
+// For some situation needs exactly one row as output, we force the
+// subquery to have a limit of 1 and no offset value (in case it outputs
+// empty result set).
+val requiresExactlyOneRowOutput = isScalarSubquery && (operatorInSubquery 
match {
   case a: Aggregate => a.groupingExpressions.nonEmpty
-  case l: Limit => l.limitValue > 1
   case _ => true
 })
 
-val orderByClause = if (requiresLimitOne || 
operatorInSubquery.isInstanceOf[Limit]) {
+// For the OrderBy, consider whether or not the result of the subquery is 
required to be sorted.
+// This is to maintain test determinism. This is affected by whether the 
subquery has a limit
+// clause or an offset clause.
+val orderByClause = if (
+  requiresExactlyOneRowOutput || 
operatorInSubquery.isInstanceOf[LimitAndOffset]
+) {
   Some(OrderByClause(projections))
 } else {
   None
 }
 
+// SPARK-46446: offset operator in correlated subquery is not supported
+// as it creates incorrect results for now.
+val requireNoOffsetInCorrelatedSubquery = correlationConditions.nonEmpty
+
 // For the Limit clause, consider whether the subquery needs to return 1 
row, or whether the
 // operator to be included is a Limit.
-val limitClause = if (requiresLimitOne) {
-  Some(Limit(1))
+val limitAndOffsetClause = if (requiresExactlyOneRowOutput) {
+  Some(LimitAndOffset(1, 0))
 } else {
   operatorInSubquery match {
-case limit: Limit => Some(limit)
+case lo: LimitAndOffset =>
+  val offsetValue = if (requireNoOffsetInCorrelatedSubquery) 0 else 
lo.offsetValue
+  if (offsetValue == 0 && lo.limitValue == 0) {
+None
+  } else {
+Some(LimitAndOffset(lo.limitValue, offsetValue))
+  }
 case _ => None
   }
 }
 
-Query(selectClause, fromClause, whereClause, groupByClause, orderByClause, 
limitClause)
+Query(
+  selectClause, fromClause, whereClause, groupByClause, orderByClause, 
limitAndOffsetClause
+)
   }
 
   /**
@@ -236,7 +252,7 @@ class GeneratedSubquerySuite extends 
DockerJDBCIntegrationSuite with QueryGenera
 val orderByClause = Some(OrderByClause(queryProjection))
 
 Query(selectClause, fromClause, whereClause, groupByClause = None,
-  orderByClause, limitClause = None)
+  orderByClause, limitAndOffsetClause = None)
   }
 
   private def getPostgresResult(stmt: Statement, sql: String): Array[Row] = {
@@ -

Re: [VOTE] Differentiate Spark without Spark Connect from Spark Connect

2024-07-22 Thread Wenchen Fan
+1

On Tue, Jul 23, 2024 at 8:40 AM Xinrong Meng  wrote:

> +1
>
> Thank you @Hyukjin Kwon  !
>
> On Mon, Jul 22, 2024 at 5:20 PM Gengliang Wang  wrote:
>
>> +1
>>
>> On Mon, Jul 22, 2024 at 5:19 PM Hyukjin Kwon 
>> wrote:
>>
>>> Starting with my own +1.
>>>
>>> On Tue, 23 Jul 2024 at 09:12, Hyukjin Kwon  wrote:
>>>
 Hi all,

 I’d like to start a vote for differentiating "Spark without Spark
 Connect" as "Spark Classic".

 Please also refer to:

- Discussion thread:
 https://lists.apache.org/thread/ys7zsod8cs9c7qllmf0p0msk6z2mz2ym

 Please vote on the SPIP for the next 72 hours:

 [ ] +1: Accept the proposal
 [ ] +0
 [ ] -1: I don’t think this is a good idea because …

 Thank you!

>>>


[jira] [Resolved] (SPARK-48959) Make NoSuchNamespaceException extend NoSuchDatabaseException

2024-07-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48959.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47433
[https://github.com/apache/spark/pull/47433]

> Make NoSuchNamespaceException extend NoSuchDatabaseException
> 
>
> Key: SPARK-48959
> URL: https://issues.apache.org/jira/browse/SPARK-48959
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-48959) Make NoSuchNamespaceException extend NoSuchDatabaseException

2024-07-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48959:
---

Assignee: Ruifeng Zheng

> Make NoSuchNamespaceException extend NoSuchDatabaseException
> 
>
> Key: SPARK-48959
> URL: https://issues.apache.org/jira/browse/SPARK-48959
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>







(spark) branch master updated: [SPARK-48959][SQL] Make `NoSuchNamespaceException` extend `NoSuchDatabaseException` to restore the exception handling

2024-07-22 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 3af2ea973c45 [SPARK-48959][SQL] Make `NoSuchNamespaceException` extend 
`NoSuchDatabaseException` to restore the exception handling
3af2ea973c45 is described below

commit 3af2ea973c458b9bd9818d6af733683fb15fbc19
Author: Ruifeng Zheng 
AuthorDate: Mon Jul 22 22:14:30 2024 +0800

[SPARK-48959][SQL] Make `NoSuchNamespaceException` extend 
`NoSuchDatabaseException` to restore the exception handling

### What changes were proposed in this pull request?
Make `NoSuchNamespaceException` extend `NoSuchDatabaseException`

### Why are the changes needed?
1. https://github.com/apache/spark/pull/47276 made many SQL commands throw 
`NoSuchNamespaceException` instead of `NoSuchDatabaseException`. This is more 
than an end-user-facing change; it is a breaking change that breaks exception 
handling in third-party libraries in the ecosystem.

2. `NoSuchNamespaceException` and `NoSuchDatabaseException` actually share 
the same error class, `SCHEMA_NOT_FOUND`.
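
A minimal sketch of the compatibility point (the helper below is hypothetical):
third-party code that matches on `NoSuchDatabaseException` keeps working even when the
thrown type is `NoSuchNamespaceException`, because the latter is now a subclass of the
former.

```scala
import org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException

// Hypothetical helper: after this change the catch clause below also matches
// NoSuchNamespaceException, restoring the pre-#47276 exception-handling behavior.
def existsQuietly(lookup: () => Unit): Boolean =
  try {
    lookup()
    true
  } catch {
    case _: NoSuchDatabaseException => false
  }
```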

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47433 from zhengruifeng/make_nons_nodb.

Authored-by: Ruifeng Zheng 
Signed-off-by: Wenchen Fan 
---
 .../apache/spark/sql/catalyst/analysis/noSuchItemsExceptions.scala| 4 ++--
 .../scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala  | 4 +---
 2 files changed, 3 insertions(+), 5 deletions(-)

diff --git 
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/analysis/noSuchItemsExceptions.scala
 
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/analysis/noSuchItemsExceptions.scala
index ac22d26ccfd1..8977d0be24d7 100644
--- 
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/analysis/noSuchItemsExceptions.scala
+++ 
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/analysis/noSuchItemsExceptions.scala
@@ -27,7 +27,7 @@ import org.apache.spark.util.ArrayImplicits._
  * Thrown by a catalog when an item cannot be found. The analyzer will rethrow 
the exception
  * as an [[org.apache.spark.sql.AnalysisException]] with the correct position 
information.
  */
-class NoSuchDatabaseException private(
+class NoSuchDatabaseException private[analysis](
 message: String,
 cause: Option[Throwable],
 errorClass: Option[String],
@@ -60,7 +60,7 @@ class NoSuchNamespaceException private(
 cause: Option[Throwable],
 errorClass: Option[String],
 messageParameters: Map[String, String])
-  extends AnalysisException(
+  extends NoSuchDatabaseException(
 message,
 cause = cause,
 errorClass = errorClass,
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala
index 645ed9e6bb0c..283c550c4556 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala
@@ -25,7 +25,7 @@ import scala.jdk.CollectionConverters._
 import org.apache.spark.SparkIllegalArgumentException
 import org.apache.spark.sql.AnalysisException
 import org.apache.spark.sql.catalyst.CurrentUserContext
-import org.apache.spark.sql.catalyst.analysis.{AsOfTimestamp, AsOfVersion, 
NamedRelation, NoSuchDatabaseException, NoSuchFunctionException, 
NoSuchNamespaceException, NoSuchTableException, TimeTravelSpec}
+import org.apache.spark.sql.catalyst.analysis.{AsOfTimestamp, AsOfVersion, 
NamedRelation, NoSuchDatabaseException, NoSuchFunctionException, 
NoSuchTableException, TimeTravelSpec}
 import org.apache.spark.sql.catalyst.catalog.ClusterBySpec
 import org.apache.spark.sql.catalyst.expressions.Literal
 import org.apache.spark.sql.catalyst.plans.logical.{SerdeInfo, TableSpec}
@@ -409,7 +409,6 @@ private[sql] object CatalogV2Util {
 } catch {
   case _: NoSuchTableException => None
   case _: NoSuchDatabaseException => None
-  case _: NoSuchNamespaceException => None
 }
 
   def getTable(
@@ -434,7 +433,6 @@ private[sql] object CatalogV2Util {
 } catch {
   case _: NoSuchFunctionException => None
   case _: NoSuchDatabaseException => None
-  case _: NoSuchNamespaceException => None
 }
   }
 





Re: [DISCUSS] Differentiate Spark without Spark Connect from Spark Connect

2024-07-21 Thread Wenchen Fan
Classic SGTM.

On Mon, Jul 22, 2024 at 1:12 PM Jungtaek Lim 
wrote:

> I'd propose not to change the name of "Spark Connect" - the name
> represents the characteristic of the mode (separation of layer for client
> and server). Trying to remove the part of "Connect" would just make
> confusion.
>
> +1 for Classic to existing mode, till someone comes up with better
> alternatives.
>
> On Mon, Jul 22, 2024 at 8:50 AM Hyukjin Kwon  wrote:
>
>> I was thinking about a similar option too but I ended up giving this up
>> .. It's quite unlikely at this moment but suppose that we have another
>> Spark Connect-ish component in the far future and it would be challenging
>> to come up with another name ... Another case is that we might have to cope
>> with the cases like Spark Connect, vs Spark (with Spark Connect) and Spark
>> (without Spark Connect) ..
>>
>> On Sun, 21 Jul 2024 at 09:59, Holden Karau 
>> wrote:
>>
>>> I think perhaps Spark Connect could be phrased as “Basic* Spark” &
>>> existing Spark could be “Full Spark” given the API limitations of Spark
>>> connect.
>>>
>>> *I was also thinking Core here but we’ve used core to refer to the RDD
>>> APIs for too long to reuse it here.
>>>
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>>
>>> On Sat, Jul 20, 2024 at 8:02 PM Xiao Li  wrote:
>>>
 Classic is much better than Legacy. : )

 Hyukjin Kwon  wrote on Thu, Jul 18, 2024 at 16:58:

> Hi all,
>
> I noticed that we need to standardize our terminology before moving
> forward. For instance, when documenting, 'Spark without Spark Connect' is
> too long and verbose. Additionally, I've observed that we use various 
> names
> for Spark without Spark Connect: Spark Classic, Classic Spark, Legacy
> Spark, etc.
>
> I propose that we consistently refer to it as Spark Classic (vs. Spark
> Connect).
>
> Please share your thoughts on this. Thanks!
>



(spark) branch master updated (1645046fbb83 -> 3f6e2d6ef354)

2024-07-19 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 1645046fbb83 [SPARK-48940][BUILD] Upgrade `Arrow` to 17.0.0
 add 3f6e2d6ef354 [SPARK-48498][SQL][FOLLOWUP] do padding for char-char 
comparison

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/util/CharVarcharUtils.scala | 14 ++-
 .../datasources/ApplyCharTypePadding.scala |  5 ++--
 .../apache/spark/sql/CharVarcharTestSuite.scala| 28 ++
 3 files changed, 35 insertions(+), 12 deletions(-)





(spark) branch branch-3.5 updated: [SPARK-48498][SQL][FOLLOWUP] do padding for char-char comparison

2024-07-19 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 28d33e37e2df [SPARK-48498][SQL][FOLLOWUP] do padding for char-char 
comparison
28d33e37e2df is described below

commit 28d33e37e2dfe525e7189b109be92c9eb24903a8
Author: Wenchen Fan 
AuthorDate: Fri Jul 19 15:27:26 2024 +0800

[SPARK-48498][SQL][FOLLOWUP] do padding for char-char comparison

### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/46832 to handle 
a missing case: char-char comparison. We should pad both sides if 
`READ_SIDE_CHAR_PADDING` is not enabled.

### Why are the changes needed?

Bug fix for the case where read-side char padding is disabled.
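
A minimal sketch of the scenario (spark-shell; the `spark.sql.readSideCharPadding` key
and the table names are assumptions for illustration):

```scala
// With read-side padding disabled, CHAR values of different declared lengths must be
// padded on both sides before comparing; this followup covers the char-char case.
spark.conf.set("spark.sql.readSideCharPadding", "false")
spark.sql("CREATE TABLE t1 (c CHAR(3)) USING parquet")
spark.sql("CREATE TABLE t2 (c CHAR(5)) USING parquet")
spark.sql("INSERT INTO t1 VALUES ('ab')")
spark.sql("INSERT INTO t2 VALUES ('ab')")
// 'ab ' (CHAR(3)) and 'ab   ' (CHAR(5)) should compare equal once both sides are padded.
spark.sql("SELECT * FROM t1 JOIN t2 ON t1.c = t2.c").show()
```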

### Does this PR introduce _any_ user-facing change?

No because it's a followup and the original PR is not released yet

### How was this patch tested?

new tests

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #47412 from cloud-fan/char.

Authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/util/CharVarcharUtils.scala | 14 ++-
 .../datasources/ApplyCharTypePadding.scala |  5 ++--
 .../apache/spark/sql/CharVarcharTestSuite.scala| 28 ++
 3 files changed, 35 insertions(+), 12 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CharVarcharUtils.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CharVarcharUtils.scala
index f3c272785a7b..87982e7a5f0b 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CharVarcharUtils.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CharVarcharUtils.scala
@@ -238,14 +238,14 @@ object CharVarcharUtils extends Logging with 
SparkCharVarcharUtils {
* attributes. When comparing two char type columns/fields, we need to pad 
the shorter one to
* the longer length.
*/
-  def addPaddingInStringComparison(attrs: Seq[Attribute]): Seq[Expression] = {
+  def addPaddingInStringComparison(attrs: Seq[Attribute], alwaysPad: Boolean): 
Seq[Expression] = {
 val rawTypes = attrs.map(attr => getRawType(attr.metadata))
 if (rawTypes.exists(_.isEmpty)) {
   attrs
 } else {
   val typeWithTargetCharLength = 
rawTypes.map(_.get).reduce(typeWithWiderCharLength)
   attrs.zip(rawTypes.map(_.get)).map { case (attr, rawType) =>
-padCharToTargetLength(attr, rawType, 
typeWithTargetCharLength).getOrElse(attr)
+padCharToTargetLength(attr, rawType, typeWithTargetCharLength, 
alwaysPad).getOrElse(attr)
   }
 }
   }
@@ -268,9 +268,10 @@ object CharVarcharUtils extends Logging with 
SparkCharVarcharUtils {
   private def padCharToTargetLength(
   expr: Expression,
   rawType: DataType,
-  typeWithTargetCharLength: DataType): Option[Expression] = {
+  typeWithTargetCharLength: DataType,
+  alwaysPad: Boolean): Option[Expression] = {
 (rawType, typeWithTargetCharLength) match {
-  case (CharType(len), CharType(target)) if target > len =>
+  case (CharType(len), CharType(target)) if alwaysPad || target > len =>
 Some(StringRPad(expr, Literal(target)))
 
   case (StructType(fields), StructType(targets)) =>
@@ -281,7 +282,8 @@ object CharVarcharUtils extends Logging with 
SparkCharVarcharUtils {
 while (i < fields.length) {
   val field = fields(i)
   val fieldExpr = GetStructField(expr, i, Some(field.name))
-  val padded = padCharToTargetLength(fieldExpr, field.dataType, 
targets(i).dataType)
+  val padded = padCharToTargetLength(
+fieldExpr, field.dataType, targets(i).dataType, alwaysPad)
   needPadding = padded.isDefined
   createStructExprs += Literal(field.name)
   createStructExprs += padded.getOrElse(fieldExpr)
@@ -291,7 +293,7 @@ object CharVarcharUtils extends Logging with 
SparkCharVarcharUtils {
 
   case (ArrayType(et, containsNull), ArrayType(target, _)) =>
 val param = NamedLambdaVariable("x", replaceCharVarcharWithString(et), 
containsNull)
-padCharToTargetLength(param, et, target).map { padded =>
+padCharToTargetLength(param, et, target, alwaysPad).map { padded =>
   val func = LambdaFunction(padded, Seq(param))
   ArrayTransform(expr, func)
 }
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ApplyCharTypePadding.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ApplyCharTypePadding.scala
index 1b7b0d702ab9..141767135a50 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/dataso

[jira] [Resolved] (SPARK-48388) [M0] Fix SET behavior for scripts

2024-07-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48388.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47272
[https://github.com/apache/spark/pull/47272]

> [M0] Fix SET behavior for scripts
> -
>
> Key: SPARK-48388
> URL: https://issues.apache.org/jira/browse/SPARK-48388
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Assignee: David Milicevic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> By standard, SET is used to set variable value in SQL scripts.
> On our end, SET is configured to work with some Hive configs, so the grammar 
> is a bit messed up and for that reason it was decided to use SET VAR instead 
> of SET to work with SQL variables.
> This is not by standard and we should figure out the way to be able to use 
> SET for SQL variables and forbid setting of Hive configs from SQL scripts.
>  
> For more details, design doc can be found in parent Jira item.






[jira] [Assigned] (SPARK-48388) [M0] Fix SET behavior for scripts

2024-07-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48388:
---

Assignee: David Milicevic

> [M0] Fix SET behavior for scripts
> -
>
> Key: SPARK-48388
> URL: https://issues.apache.org/jira/browse/SPARK-48388
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Assignee: David Milicevic
>Priority: Major
>  Labels: pull-request-available
>
> By standard, SET is used to set variable value in SQL scripts.
> On our end, SET is configured to work with some Hive configs, so the grammar 
> is a bit messed up and for that reason it was decided to use SET VAR instead 
> of SET to work with SQL variables.
> This is not by standard and we should figure out the way to be able to use 
> SET for SQL variables and forbid setting of Hive configs from SQL scripts.
>  
> For more details, design doc can be found in parent Jira item.






Re: [VOTE] Release Spark 3.5.2 (RC1)

2024-07-18 Thread Wenchen Fan
> The vote is open until Jul 18

Is it a typo? It's July 18 today.

On Thu, Jul 18, 2024 at 6:30 PM Kent Yao  wrote:

> Hi dev,
>
> Please vote on releasing the following candidate as Apache Spark version
> 3.5.2.
>
> The vote is open until Jul 18, 11 AM UTC, and passes if a majority +1
> PMC votes are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.5.2
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v3.5.2-rc1 (commit
> b1510127df952af90b7971684b91f6c0e804de24):
> https://github.com/apache/spark/tree/v3.5.2-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1457/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc1-docs/
>
> The list of bug fixes going into 3.5.2 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12353980
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC via "pip install
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc1-bin/pyspark-3.5.2.tar.gz
> "
> and see if anything important breaks.
> In the Java/Scala, you can add the staging repository to your projects
> resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.5.2?
> ===
>
> The current list of open tickets targeted at 3.5.2 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for
> "Target Version/s" = 3.5.2
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> Thanks,
> Kent Yao
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


(spark) branch master updated: [SPARK-48388][SQL] Fix SET statement behavior for SQL Scripts

2024-07-18 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 83a97917d46d [SPARK-48388][SQL] Fix SET statement behavior for SQL 
Scripts
83a97917d46d is described below

commit 83a97917d46debe9c3ebb24cf2540b07f6a902ec
Author: David Milicevic 
AuthorDate: Thu Jul 18 18:52:16 2024 +0800

[SPARK-48388][SQL] Fix SET statement behavior for SQL Scripts

### What changes were proposed in this pull request?
The `SET` statement is used to set config values, and it has a poorly designed 
grammar rule, `#setConfiguration`, that matches everything after `SET` (`SET 
.*?`). This conflicts with the usage of `SET` for setting session variables, so 
we had to introduce the `SET (VAR | VARIABLE)` grammar rule to distinguish between 
setting config values and session variables - see the [SET VAR pull 
request](https://github.com/apache/spark/pull/40474).

However, this is not per the SQL standard, so for SQL scripting 
([JIRA](https://issues.apache.org/jira/browse/SPARK-48338)) we are opting to 
disable `SET` for configs and use it only for session variables. This lets users 
write plain `SET` to assign values to session variables. Config values 
can still be set from SQL scripts using `EXECUTE IMMEDIATE`.

This change simply reorders grammar rules to achieve the above behavior, and 
alters only the visitor functions where a rule had to be renamed or a 
completely new rule was added.
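
A sketch of the intended behavior inside a script (script text only, since the `sql()`
entry point for executing scripts is still being added; the config name is just an
example):

```scala
// Inside BEGIN ... END, a plain SET now targets session variables; config values can
// still be changed through EXECUTE IMMEDIATE when needed.
val script =
  """BEGIN
    |  DECLARE c INT = 10;
    |  SET c = c - 1;                                        -- session variable, no VAR keyword
    |  EXECUTE IMMEDIATE 'SET spark.sql.ansi.enabled=true';  -- config, still possible this way
    |END""".stripMargin
```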

### Why are the changes needed?
These changes resolve the issues of the poorly designed `SET` statement for the case 
of SQL scripts.

### Does this PR introduce _any_ user-facing change?
No.
This PR is part of a series of PRs that will introduce changes to the sql() API 
to add support for SQL scripting, but for now, the API remains unchanged.
In the future, the API will remain the same as well, but it will gain the 
ability to execute SQL scripts.

### How was this patch tested?
Existing tests should cover the changes.
New tests for SQL scripts were added to:
- `SqlScriptingParserSuite`
- `SqlScriptingInterpreterSuite`

### Was this patch authored or co-authored using generative AI tooling?

Closes #47272 from davidm-db/sql_scripting_set_statement.

Authored-by: David Milicevic 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/parser/SqlBaseParser.g4 | 32 
 .../apache/spark/sql/catalyst/parser/parsers.scala |  2 +-
 .../spark/sql/catalyst/parser/AstBuilder.scala | 58 ++
 .../sql/catalyst/parser/ParserUtilsSuite.scala |  2 +-
 .../catalyst/parser/SqlScriptingParserSuite.scala  | 53 
 .../spark/sql/execution/SparkSqlParser.scala   | 21 
 .../scripting/SqlScriptingInterpreterSuite.scala   | 20 +++-
 7 files changed, 143 insertions(+), 45 deletions(-)

diff --git 
a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4 
b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
index dc2c2f079461..a50051715e20 100644
--- 
a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
+++ 
b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
@@ -61,11 +61,18 @@ compoundBody
 
 compoundStatement
 : statement
+| setStatementWithOptionalVarKeyword
 | beginEndCompoundBlock
 ;
 
+setStatementWithOptionalVarKeyword
+: SET (VARIABLE | VAR)? assignmentList  
#setVariableWithOptionalKeyword
+| SET (VARIABLE | VAR)? LEFT_PAREN multipartIdentifierList RIGHT_PAREN EQ
+LEFT_PAREN query RIGHT_PAREN
#setVariableWithOptionalKeyword
+;
+
 singleStatement
-: statement SEMICOLON* EOF
+: (statement|setResetStatement) SEMICOLON* EOF
 ;
 
 beginLabel
@@ -212,7 +219,7 @@ statement
 identifierReference dataType? variableDefaultExpression?   
#createVariable
 | DROP TEMPORARY VARIABLE (IF EXISTS)? identifierReference 
#dropVariable
 | EXPLAIN (LOGICAL | FORMATTED | EXTENDED | CODEGEN | COST)?
-statement  #explain
+(statement|setResetStatement)  #explain
 | SHOW TABLES ((FROM | IN) identifierReference)?
 (LIKE? pattern=stringLit)?
#showTables
 | SHOW TABLE EXTENDED ((FROM | IN) ns=identifierReference)?
@@ -251,26 +258,29 @@ statement
 | (MSCK)? REPAIR TABLE identifierReference
 (option=(ADD|DROP|SYNC) PARTITIONS)?   
#repairTable
 | op=(ADD | LIST) identifier .*?   
#manageResource
-| SET COLLATION collationName=identifier

(spark) branch master updated: [SPARK-36680][SQL][FOLLOWUP] Files with options should be put into resolveDataSource function

2024-07-18 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 1a428c160664 [SPARK-36680][SQL][FOLLOWUP] Files with options should be 
put into resolveDataSource function
1a428c160664 is described below

commit 1a428c1606645057ef94ac8a6cadbb947b9208a6
Author: lizongze 
AuthorDate: Thu Jul 18 15:58:38 2024 +0800

[SPARK-36680][SQL][FOLLOWUP] Files with options should be put into 
resolveDataSource function

### What changes were proposed in this pull request?

When reading CSV, JSON, and other files, pass the options parameter to the 
rules.resolveDataSource method so that the options actually take effect.

This is a bug fix for [#46707](https://github.com/apache/spark/pull/46707) 
(cc szehon-ho).

### Why are the changes needed?

For the following SQL, the options passed in do not take effect. This is 
because the rules.resolveDataSource method does not pass the options parameter 
during the data source construction process.

```
 SELECT * FROM csv.`/test/data.csv` WITH (`header` = true, 'delimiter' = 
'|')
 ```
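
For illustration, the same query issued through the Scala API (a hedged sketch
assuming an active `SparkSession` named `spark` and a hypothetical CSV path);
after the fix, the `WITH` options reach the underlying data source:

```
// The header/delimiter options now take effect for SQL-on-file reads
// (path is hypothetical).
val df = spark.sql(
  """SELECT *
    |FROM csv.`/tmp/data.csv`
    |WITH (`header` = true, 'delimiter' = '|')""".stripMargin)
df.show()
```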

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test in SQLQuerySuite

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47370 from logze/hint-options.

Authored-by: lizongze 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/execution/datasources/rules.scala|  8 +++-
 .../scala/org/apache/spark/sql/SQLQuerySuite.scala | 22 +-
 2 files changed, 28 insertions(+), 2 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala
index e4c3cd20dedb..5265c24b140d 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala
@@ -19,6 +19,8 @@ package org.apache.spark.sql.execution.datasources
 
 import java.util.Locale
 
+import scala.jdk.CollectionConverters._
+
 import org.apache.spark.sql.{AnalysisException, SaveMode, SparkSession}
 import org.apache.spark.sql.catalyst.analysis._
 import org.apache.spark.sql.catalyst.catalog._
@@ -50,7 +52,11 @@ class ResolveSQLOnFile(sparkSession: SparkSession) extends 
Rule[LogicalPlan] {
 
   private def resolveDataSource(unresolved: UnresolvedRelation): DataSource = {
 val ident = unresolved.multipartIdentifier
-val dataSource = DataSource(sparkSession, paths = Seq(ident.last), 
className = ident.head)
+val dataSource = DataSource(
+  sparkSession,
+  paths = Seq(ident.last),
+  className = ident.head,
+  options = unresolved.options.asScala.toMap)
 // `dataSource.providingClass` may throw ClassNotFoundException, the 
caller side will try-catch
 // it and return the original plan, so that the analyzer can report table 
not found later.
 val isFileFormat = 
classOf[FileFormat].isAssignableFrom(dataSource.providingClass)
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
index 56c364e20846..302b05e9b5ce 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
@@ -43,7 +43,7 @@ import 
org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper
 import org.apache.spark.sql.execution.aggregate._
 import org.apache.spark.sql.execution.columnar.InMemoryTableScanExec
 import org.apache.spark.sql.execution.command.DataWritingCommandExec
-import 
org.apache.spark.sql.execution.datasources.{InsertIntoHadoopFsRelationCommand, 
LogicalRelation}
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, 
InsertIntoHadoopFsRelationCommand, LogicalRelation}
 import org.apache.spark.sql.execution.datasources.v2.BatchScanExec
 import org.apache.spark.sql.execution.datasources.v2.orc.OrcScan
 import org.apache.spark.sql.execution.datasources.v2.parquet.ParquetScan
@@ -4856,6 +4856,26 @@ class SQLQuerySuite extends QueryTest with 
SharedSparkSession with AdaptiveSpark
 |"""
 )
   }
+
+  test("SPARK-36680: Files hint options should be put into resolveDataSource 
function") {
+val df1 = spark.range(100).toDF()
+withTempPath { f =>
+  df1.write.json(f.getCanonicalPath)
+  val df2 = sql(
+s"""
+   |SELECT id
+   |FROM json.`${f.getCanonicalPath}`
+   |WITH (`key1` = 1, `key2` = 2)
+""".stripMargin
+  )
+  checkAnswer(df2, df1)
+  val r

(spark) branch branch-3.5 updated: [SPARK-48791][CORE][3.5] Fix perf regression caused by the accumulators registration overhead using CopyOnWriteArrayList

2024-07-18 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new f07a54743c57 [SPARK-48791][CORE][3.5] Fix perf regression caused by 
the accumulators registration overhead using CopyOnWriteArrayList
f07a54743c57 is described below

commit f07a54743c574bf703b406d0deba0b6e21a54273
Author: Yi Wu 
AuthorDate: Thu Jul 18 15:37:20 2024 +0800

[SPARK-48791][CORE][3.5] Fix perf regression caused by the accumulators 
registration overhead using CopyOnWriteArrayList

This PR backports https://github.com/apache/spark/pull/47197 to branch-3.5.

### What changes were proposed in this pull request?

This PR proposes to use an `ArrayBuffer` together with a read/write lock, 
rather than a `CopyOnWriteArrayList`, for `TaskMetrics._externalAccums`.
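
For illustration, a minimal sketch of the pattern (not the actual `TaskMetrics`
code; the class and method names here are made up):

```
import java.util.concurrent.locks.ReentrantReadWriteLock

import scala.collection.mutable.ArrayBuffer

// Appends happen under the write lock, reads under the read lock. Unlike
// CopyOnWriteArrayList, registering an accumulator does not copy the whole
// underlying array on every add, which removes the registration overhead.
class AccumulatorRegistry[T] {
  private val lock = new ReentrantReadWriteLock()
  private val accums = ArrayBuffer.empty[T]

  def register(acc: T): Unit = {
    lock.writeLock().lock()
    try accums += acc
    finally lock.writeLock().unlock()
  }

  def foreach(f: T => Unit): Unit = {
    lock.readLock().lock()
    try accums.foreach(f)
    finally lock.readLock().unlock()
  }
}
```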

### Why are the changes needed?

Fix the perf regression caused by the accumulator registration overhead of 
`CopyOnWriteArrayList`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually tested.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47297 from Ngone51/SPARK-48791-3.5.

Authored-by: Yi Wu 
Signed-off-by: Wenchen Fan 
---
 core/pom.xml   |  4 ++
 .../org/apache/spark/util/ArrayImplicits.scala | 35 ++
 .../org/apache/spark/util/ArrayImplicits.scala | 35 ++
 .../org/apache/spark/executor/TaskMetrics.scala| 43 +-
 mllib-local/pom.xml| 12 ++
 mllib/pom.xml  | 10 +
 pom.xml|  6 +++
 7 files changed, 135 insertions(+), 10 deletions(-)

diff --git a/core/pom.xml b/core/pom.xml
index a3a02cbd1c56..a326e41b8e23 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -243,6 +243,10 @@
       <groupId>org.scala-lang.modules</groupId>
       <artifactId>scala-xml_${scala.binary.version}</artifactId>
     </dependency>
+    <dependency>
+      <groupId>org.scala-lang.modules</groupId>
+      <artifactId>scala-collection-compat_${scala.binary.version}</artifactId>
+    </dependency>
     <dependency>
       <groupId>org.scala-lang</groupId>
       <artifactId>scala-library</artifactId>
diff --git 
a/core/src/main/scala-2.12/org/apache/spark/util/ArrayImplicits.scala 
b/core/src/main/scala-2.12/org/apache/spark/util/ArrayImplicits.scala
new file mode 100644
index ..82c6c75bd51a
--- /dev/null
+++ b/core/src/main/scala-2.12/org/apache/spark/util/ArrayImplicits.scala
@@ -0,0 +1,35 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+import scala.collection.compat.immutable
+
+/**
+ * Implicit methods related to Scala Array.
+ */
+private[spark] object ArrayImplicits {
+
+  implicit class SparkArrayOps[T](xs: Array[T]) {
+
+/**
+ * Wraps an Array[T] as an immutable.ArraySeq[T] without copying.
+ */
+def toImmutableArraySeq: immutable.ArraySeq[T] =
+  immutable.ArraySeq.unsafeWrapArray(xs)
+  }
+}
diff --git 
a/core/src/main/scala-2.13/org/apache/spark/util/ArrayImplicits.scala 
b/core/src/main/scala-2.13/org/apache/spark/util/ArrayImplicits.scala
new file mode 100644
index ..38c2a415af3d
--- /dev/null
+++ b/core/src/main/scala-2.13/org/apache/spark/util/ArrayImplicits.scala
@@ -0,0 +1,35 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the Lic

(spark) branch master updated: [SPARK-48900] Add `reason` field for `cancelJobGroup` and `cancelJobsWithTag`

2024-07-17 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 41b37ae88416 [SPARK-48900] Add `reason` field for `cancelJobGroup` and 
`cancelJobsWithTag`
41b37ae88416 is described below

commit 41b37ae88416bae1b4c9d6ff72f40d5675913d17
Author: Mingkang Li 
AuthorDate: Thu Jul 18 10:22:52 2024 +0800

[SPARK-48900] Add `reason` field for `cancelJobGroup` and 
`cancelJobsWithTag`

### What changes were proposed in this pull request?

This PR introduces the optional `reason` field for `cancelJobGroup` and 
`cancelJobsWithTag` in `SparkContext.scala`, while keeping the old APIs without 
the `reason`, similar to how `cancelJob` is implemented currently.
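
For illustration, a hedged usage sketch of the new overloads (assuming an
active `SparkContext` named `sc`; the group ID, tag, and reason strings are
made up):

```
// The new overloads carry an explanatory reason that surfaces in logging
// and job failure messages; the old no-reason overloads are kept as-is.
sc.cancelJobGroup("nightly-etl", "superseded by a newer run")
sc.cancelJobsWithTag("ad-hoc-queries", "cluster entering maintenance")

// Existing behavior, unchanged:
sc.cancelJobGroup("nightly-etl")
```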

### Why are the changes needed?

Today it is difficult to determine why a job, stage, or job group was 
canceled. We should leverage existing Spark functionality to provide a reason 
string explaining the cancellation cause, and should add new APIs to let us 
provide this reason when canceling job groups.

**Details:**

Since [SPARK-19549](https://issues.apache.org/jira/browse/SPARK-19549) 
"Allow providing reasons for stage/job cancelling" (Spark 2.2.0), 
Spark’s cancelJob and cancelStage methods accept an optional reason: String 
that is added to logging output and user-facing error messages when jobs or 
stages are canceled. In our internal calls to these methods, we should always 
supply a reason. For example, we should set an appropriate reason when the 
“kill” links are clicked in the Spark U [...]
Other APIs currently lack a reason field. For example, cancelJobGroup and 
cancelJobsWithTag don’t provide any way to specify a reason, so we only see 
generic logs like “asked to cancel job group <group name>”. We should add the 
ability to pass in a group cancellation reason and thread that through into the 
scheduler’s logging and job failure reasons.

This feature can be implemented in two PRs:

1. Modify the current SparkContext and its downstream APIs to add the 
reason string, such as cancelJobGroup and cancelJobsWithTag

2. Add reasons for all internal calls to these methods.

**Note: This is the first of the two PRs to implement this new feature**

### Does this PR introduce _any_ user-facing change?

Yes, it modifies the SparkContext API, allowing users to pass an optional 
`reason: String` to `cancelJobsWithTag` and `cancelJobGroup`, while the old 
methods without the `reason` are also kept. This creates a more uniform 
interface where the user can supply an optional reason for all job/stage 
cancellation calls.

### How was this patch tested?

New tests are added to `JobCancellationSuite` to test the reason fields for 
these calls.

For the API changes in R and PySpark, tests are added to these files:
- R/pkg/tests/fulltests/test_context.R
- python/pyspark/tests/test_pin_thread.py

### Was this patch authored or co-authored using generative AI tooling?

 No

Closes #47361 from mingkangli-db/reason_job_cancellation.

Authored-by: Mingkang Li 
Signed-off-by: Wenchen Fan 
---
 .../main/scala/org/apache/spark/SparkContext.scala | 56 --
 .../apache/spark/api/java/JavaSparkContext.scala   | 23 
 .../org/apache/spark/scheduler/DAGScheduler.scala  | 29 -
 .../apache/spark/scheduler/DAGSchedulerEvent.scala |  7 ++-
 .../org/apache/spark/JobCancellationSuite.scala| 68 +-
 5 files changed, 162 insertions(+), 21 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala 
b/core/src/main/scala/org/apache/spark/SparkContext.scala
index 5ac2a090dcbe..485f0abcd25e 100644
--- a/core/src/main/scala/org/apache/spark/SparkContext.scala
+++ b/core/src/main/scala/org/apache/spark/SparkContext.scala
@@ -2633,10 +2633,42 @@ class SparkContext(config: SparkConf) extends Logging {
   /**
* Cancel active jobs for the specified group. See 
`org.apache.spark.SparkContext.setJobGroup`
* for more information.
+   *
+   * @param groupId the group ID to cancel
+   * @param reason reason for cancellation
+   *
+   * @since 4.0.0
+   */
+  def cancelJobGroup(groupId: String, reason: String): Unit = {
+assertNotStopped()
+dagScheduler.cancelJobGroup(groupId, cancelFutureJobs = false, 
Option(reason))
+  }
+
+  /**
+   * Cancel active jobs for the specified group. See 
`org.apache.spark.SparkContext.setJobGroup`
+   * for more information.
+   *
+   * @param groupId the group ID to cancel
*/
   def cancelJobGroup(groupId: String): Unit = {
 assertNotStopped()
-dagScheduler.cancelJobGroup(groupId)
+dagScheduler.cancelJobGroup(groupId, cancelFutureJobs = false, None)
+  }
+
+  /**
+   * Cancel active jobs for the specified group

[jira] [Resolved] (SPARK-48900) Add `reason` string to all job / stage / job group cancellation calls

2024-07-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48900.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47361
[https://github.com/apache/spark/pull/47361]

> Add `reason` string to all job / stage / job group cancellation calls
> -
>
> Key: SPARK-48900
> URL: https://issues.apache.org/jira/browse/SPARK-48900
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Mingkang Li
>Assignee: Mingkang Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Today it is difficult to determine _why_ a job, stage, or job group was 
> canceled. We should leverage existing Spark functionality to provide a 
> {{reason}} string explaining the cancellation cause, and should add new APIs 
> to let us provide this reason when canceling job groups.
> {*}Details{*}:
>  * Since SPARK-19549 "Allow providing reasons for stage/job cancelling" 
> (Spark 2.2.0), Spark’s {{cancelJob}} and {{cancelStage}} methods accept 
> an optional {{reason: String}} that is added to logging output and 
> user-facing error messages when jobs or stages are canceled. In our internal 
> calls to these methods, we should always supply a reason. For example, we 
> should set an appropriate reason when the “kill” links are clicked in the 
> Spark UI (see 
> [code|https://github.com/apache/spark/blob/b14c1f036f8f394ad1903998128c05d04dd584a9/core/src/main/scala/org/apache/spark/ui/jobs/JobsTab.scala#L54C1-L55]).
>  * Other APIs currently lack a {{reason}} field. For example, 
> {{cancelJobGroup}} and {{cancelJobsWithTag}} don’t provide any way to specify 
> a reason, so we only see generic logs like “asked to cancel job group <group name>”. We should add an ability to pass in a group cancellation reason and 
> thread that through into the scheduler’s logging and job failure reasons.
> This feature can be implemented in two PRs:
>  # Modify the current {{SparkContext}} and its downstream APIs to add the 
> {{reason}} string, such as {{cancelJobGroup}} and {{cancelJobsWithTag}}
>       2. Add reasons for all internal calls to these methods



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48900) Add `reason` string to all job / stage / job group cancellation calls

2024-07-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48900:
---

Assignee: Mingkang Li

> Add `reason` string to all job / stage / job group cancellation calls
> -
>
> Key: SPARK-48900
> URL: https://issues.apache.org/jira/browse/SPARK-48900
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Mingkang Li
>Assignee: Mingkang Li
>Priority: Major
>  Labels: pull-request-available
>
> Today it is difficult to determine _why_ a job, stage, or job group was 
> canceled. We should leverage existing Spark functionality to provide a 
> {{reason}} string explaining the cancellation cause, and should add new APIs 
> to let us provide this reason when canceling job groups.
> {*}Details{*}:
>  * Since SPARK-19549 "Allow providing reasons for stage/job cancelling" 
> (Spark 2.2.0), Spark’s {{cancelJob}} and {{cancelStage}} methods accept 
> an optional {{reason: String}} that is added to logging output and 
> user-facing error messages when jobs or stages are canceled. In our internal 
> calls to these methods, we should always supply a reason. For example, we 
> should set an appropriate reason when the “kill” links are clicked in the 
> Spark UI (see 
> [code|https://github.com/apache/spark/blob/b14c1f036f8f394ad1903998128c05d04dd584a9/core/src/main/scala/org/apache/spark/ui/jobs/JobsTab.scala#L54C1-L55]).
>  * Other APIs currently lack a {{reason}} field. For example, 
> {{cancelJobGroup}} and {{cancelJobsWithTag}} don’t provide any way to specify 
> a reason, so we only see generic logs like “asked to cancel job group <group name>”. We should add an ability to pass in a group cancellation reason and 
> thread that through into the scheduler’s logging and job failure reasons.
> This feature can be implemented in two PRs:
>  # Modify the current {{SparkContext}} and its downstream APIs to add the 
> {{reason}} string, such as {{cancelJobGroup}} and {{cancelJobsWithTag}}
>       2. Add reasons for all internal calls to these methods



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48915) Add inequality (!=, <, <=, >, >=) predicates for correlation in GeneratedSubquerySuite

2024-07-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48915:
---

Assignee: Wei Guo

> Add inequality (!=, <, <=, >, >=) predicates for correlation in 
> GeneratedSubquerySuite
> --
>
> Key: SPARK-48915
> URL: https://issues.apache.org/jira/browse/SPARK-48915
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 4.0.0
>Reporter: Nick Young
>Assignee: Wei Guo
>Priority: Major
>  Labels: pull-request-available
>
> {{GeneratedSubquerySuite}} is a test suite that generates SQL with variations 
> of subqueries. Currently, the operators supported are Joins, Set Operations, 
> Aggregate (with/without group by) and Limit. Implementing inequality (!=, <, 
> <=, >, >=) predicates will increase coverage by 1 additional axis, and should 
> be simple.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48915) Add inequality (!=, <, <=, >, >=) predicates for correlation in GeneratedSubquerySuite

2024-07-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48915.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47386
[https://github.com/apache/spark/pull/47386]

> Add inequality (!=, <, <=, >, >=) predicates for correlation in 
> GeneratedSubquerySuite
> --
>
> Key: SPARK-48915
> URL: https://issues.apache.org/jira/browse/SPARK-48915
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 4.0.0
>Reporter: Nick Young
>Assignee: Wei Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> {{GeneratedSubquerySuite}} is a test suite that generates SQL with variations 
> of subqueries. Currently, the operators supported are Joins, Set Operations, 
> Aggregate (with/without group by) and Limit. Implementing inequality (!=, <, 
> <=, >, >=) predicates will increase coverage by 1 additional axis, and should 
> be simple.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-48915][SQL][TESTS] Add some uncovered predicates(!=, <=, >, >=) in test cases of `GeneratedSubquerySuite`

2024-07-17 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9d4ebf7ec451 [SPARK-48915][SQL][TESTS] Add some uncovered 
predicates(!=, <=, >, >=) in test cases of `GeneratedSubquerySuite`
9d4ebf7ec451 is described below

commit 9d4ebf7ec451b31b2638027f852d094b695d98e1
Author: Wei Guo 
AuthorDate: Thu Jul 18 10:15:25 2024 +0800

[SPARK-48915][SQL][TESTS] Add some uncovered predicates(!=, <=, >, >=) in 
test cases of `GeneratedSubquerySuite`

### What changes were proposed in this pull request?

This PR aims to add some predicates(!=, <=, >, >=) which are not covered in 
test cases of `GeneratedSubquerySuite`.
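
For illustration, the kind of scalar subquery predicate shape the new variants
cover (the table and column names below are made up):

```
// One of the newly covered shapes: a correlated scalar subquery compared
// with an inequality predicate in the WHERE clause.
val exampleQuery =
  """SELECT t1.a
    |FROM t1
    |WHERE t1.b >= (SELECT MAX(t2.b) FROM t2 WHERE t2.a = t1.a)""".stripMargin
```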

### Why are the changes needed?

Better coverage of current subquery tests in `GeneratedSubquerySuite`.
For more information about subqueries in PostgreSQL, refer to:

https://www.postgresql.org/docs/current/functions-subquery.html#FUNCTIONS-SUBQUERY

https://www.postgresql.org/docs/current/functions-comparisons.html#ROW-WISE-COMPARISON

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Passed GA and manual testing with `GeneratedSubquerySuite`.

![image](https://github.com/user-attachments/assets/4b265def-a7a9-405e-94ce-e9902efb79fa)

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47386 from wayneguow/SPARK-48915.

Authored-by: Wei Guo 
Signed-off-by: Wenchen Fan 
---
 .../sql/jdbc/querytest/GeneratedSubquerySuite.scala   | 19 +--
 .../org/apache/spark/sql/QueryGeneratorHelper.scala   | 18 +-
 2 files changed, 34 insertions(+), 3 deletions(-)

diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/querytest/GeneratedSubquerySuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/querytest/GeneratedSubquerySuite.scala
index 015898b70b7c..fd3dafdda499 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/querytest/GeneratedSubquerySuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/querytest/GeneratedSubquerySuite.scala
@@ -182,8 +182,11 @@ class GeneratedSubquerySuite extends 
DockerJDBCIntegrationSuite with QueryGenera
 } else {
   Seq()
 }
-val isScalarSubquery = Seq(SubqueryType.ATTRIBUTE, 
SubqueryType.SCALAR_PREDICATE_EQUALS,
-  SubqueryType.SCALAR_PREDICATE_LESS_THAN).contains(subqueryType)
+val isScalarSubquery = Seq(SubqueryType.ATTRIBUTE,
+  SubqueryType.SCALAR_PREDICATE_EQUALS, 
SubqueryType.SCALAR_PREDICATE_NOT_EQUALS,
+  SubqueryType.SCALAR_PREDICATE_LESS_THAN, 
SubqueryType.SCALAR_PREDICATE_LESS_THAN_OR_EQUALS,
+  SubqueryType.SCALAR_PREDICATE_GREATER_THAN,
+  
SubqueryType.SCALAR_PREDICATE_GREATER_THAN_OR_EQUALS).contains(subqueryType)
 val subqueryOrganization = generateSubquery(
   innerTable, correlationConditions, isDistinct, operatorInSubquery, 
isScalarSubquery)
 
@@ -218,8 +221,16 @@ class GeneratedSubquerySuite extends 
DockerJDBCIntegrationSuite with QueryGenera
 val whereClausePredicate = subqueryType match {
   case SubqueryType.SCALAR_PREDICATE_EQUALS =>
 Equals(expr, Subquery(subqueryOrganization))
+  case SubqueryType.SCALAR_PREDICATE_NOT_EQUALS =>
+NotEquals(expr, Subquery(subqueryOrganization))
   case SubqueryType.SCALAR_PREDICATE_LESS_THAN =>
 LessThan(expr, Subquery(subqueryOrganization))
+  case SubqueryType.SCALAR_PREDICATE_LESS_THAN_OR_EQUALS =>
+LessThanOrEquals(expr, Subquery(subqueryOrganization))
+  case SubqueryType.SCALAR_PREDICATE_GREATER_THAN =>
+GreaterThan(expr, Subquery(subqueryOrganization))
+  case SubqueryType.SCALAR_PREDICATE_GREATER_THAN_OR_EQUALS =>
+GreaterThanOrEquals(expr, Subquery(subqueryOrganization))
   case SubqueryType.EXISTS => Exists(subqueryOrganization)
   case SubqueryType.NOT_EXISTS => Not(Exists(subqueryOrganization))
   case SubqueryType.IN => In(expr, subqueryOrganization)
@@ -294,7 +305,11 @@ class GeneratedSubquerySuite extends 
DockerJDBCIntegrationSuite with QueryGenera
 case SubqueryLocation.FROM => Seq(SubqueryType.RELATION)
 case SubqueryLocation.WHERE => Seq(
   SubqueryType.SCALAR_PREDICATE_LESS_THAN,
+  SubqueryType.SCALAR_PREDICATE_LESS_THAN_OR_EQUALS,
+  SubqueryType.SCALAR_PREDICATE_GREATER_THAN,
+  SubqueryType.SCALAR_PREDICATE_GREATER_THAN_OR_EQUALS,
   SubqueryType.SCALAR_PREDICATE_EQUALS,
+  SubqueryType.SCALAR_PREDICATE_NOT_EQUALS

(spark) branch master updated: [SPARK-48907][SQL] Fix the value `explicitTypes` in `COLLATION_MISMATCH.EXPLICIT`

2024-07-17 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 5d16c3134c44 [SPARK-48907][SQL] Fix the value `explicitTypes` in 
`COLLATION_MISMATCH.EXPLICIT`
5d16c3134c44 is described below

commit 5d16c3134c442a5546251fd7c42b1da9fdf3969e
Author: panbingkun 
AuthorDate: Wed Jul 17 22:57:41 2024 +0800

[SPARK-48907][SQL] Fix the value `explicitTypes` in 
`COLLATION_MISMATCH.EXPLICIT`

### What changes were proposed in this pull request?
The PR aims to:
- fix the value `explicitTypes` in `COLLATION_MISMATCH.EXPLICIT`.
- use `checkError` to check exceptions in `CollationSQLExpressionsSuite` and 
`CollationStringExpressionsSuite`.

### Why are the changes needed?
This only fixes a bug, e.g.:
```
SELECT concat_ws(' ', collate('Spark', 'UTF8_LCASE'), collate('SQL', 
'UNICODE'))
```

- Before:
  ```
  [COLLATION_MISMATCH.EXPLICIT] Could not determine which collation to use 
for string functions and operators. Error occurred due to the mismatch between 
explicit collations: `string collate UTF8_LCASE`.`string collate UNICODE`. 
Decide on a single explicit collation and remove others. SQLSTATE: 42P21
  ```
  (screenshot: https://github.com/user-attachments/assets/4e026cb5-2875-4370-9bb9-878f0b607f41)

- After:
  ```
  [COLLATION_MISMATCH.EXPLICIT] Could not determine which collation to use 
for string functions and operators. Error occurred due to the mismatch between 
explicit collations: [`string collate UTF8_LCASE`, `string collate UNICODE`]. 
Decide on a single explicit collation and remove others. SQLSTATE: 42P21
  ```
  (screenshot: https://github.com/user-attachments/assets/86f489a2-9f2d-4f59-bdb1-95c051a93ee8)
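
For illustration, a minimal sketch of why the message text changes (plain
Scala, not the actual error-framework code; the backtick quoting stands in for
`toSQLId`):

```
val explicitTypes = Seq("string collate UTF8_LCASE", "string collate UNICODE")

// Before: the whole sequence was quoted as if it were one multi-part
// identifier, producing the misleading `a`.`b` rendering.
val before = explicitTypes.map(t => s"`$t`").mkString(".")

// After: each collation type is quoted separately and joined with ", ",
// which the message template then wraps in [...].
val after = explicitTypes.map(t => s"`$t`").mkString(", ")
```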

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Updated existed UT.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47365 from panbingkun/SPARK-48907.

Authored-by: panbingkun 
Signed-off-by: Wenchen Fan 
---
 .../src/main/resources/error/error-conditions.json |   2 +-
 .../spark/sql/errors/QueryCompilationErrors.scala  |   2 +-
 .../spark/sql/CollationSQLExpressionsSuite.scala   |  80 +---
 .../apache/spark/sql/CollationSQLRegexpSuite.scala |   2 +-
 .../sql/CollationStringExpressionsSuite.scala  | 208 +
 .../org/apache/spark/sql/CollationSuite.scala  |  12 +-
 6 files changed, 194 insertions(+), 112 deletions(-)

diff --git a/common/utils/src/main/resources/error/error-conditions.json 
b/common/utils/src/main/resources/error/error-conditions.json
index b8ada7c6ed0a..84681ab8c225 100644
--- a/common/utils/src/main/resources/error/error-conditions.json
+++ b/common/utils/src/main/resources/error/error-conditions.json
@@ -514,7 +514,7 @@
 "subClass" : {
   "EXPLICIT" : {
 "message" : [
-  "Error occurred due to the mismatch between explicit collations: 
. Decide on a single explicit collation and remove others."
+  "Error occurred due to the mismatch between explicit collations: 
[]. Decide on a single explicit collation and remove others."
 ]
   },
   "IMPLICIT" : {
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
index 73a98f9fe4be..b489387c7868 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
@@ -3675,7 +3675,7 @@ private[sql] object QueryCompilationErrors extends 
QueryErrorsBase with Compilat
 new AnalysisException(
   errorClass = "COLLATION_MISMATCH.EXPLICIT",
   messageParameters = Map(
-"explicitTypes" -> toSQLId(explicitTypes)
+"explicitTypes" -> explicitTypes.map(toSQLId).mkString(", ")
   )
 )
   }
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
index 7994c496cb65..300f23c317ec 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala
@@ -680,11 +680,14 @@ class CollationSQLExpressionsSuite
 val number = "xx"
 val query = s"SELECT to_number('$number', '999');"
 withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
-  val e = intercep

[jira] [Assigned] (SPARK-48923) Fix the incorrect logic of `CollationFactorySuite`

2024-07-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48923:
---

Assignee: BingKun Pan

> Fix the incorrect logic of `CollationFactorySuite`
> --
>
> Key: SPARK-48923
> URL: https://issues.apache.org/jira/browse/SPARK-48923
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48923) Fix the incorrect logic of `CollationFactorySuite`

2024-07-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48923.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47382
[https://github.com/apache/spark/pull/47382]

> Fix the incorrect logic of `CollationFactorySuite`
> --
>
> Key: SPARK-48923
> URL: https://issues.apache.org/jira/browse/SPARK-48923
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated (41a2bebe725d -> 74ca836400b6)

2024-07-17 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 41a2bebe725d [SPARK-48865][SQL] Add try_url_decode function
 add 74ca836400b6 [SPARK-48923][SQL][TESTS] Fix the incorrect logic of 
`CollationFactorySuite`

No new revisions were added by this update.

Summary of changes:
 .../spark/unsafe/types/CollationFactorySuite.scala | 36 ++
 1 file changed, 16 insertions(+), 20 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[jira] [Assigned] (SPARK-48350) [M0] Support for exceptions thrown from parser/interpreter

2024-07-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48350:
---

Assignee: Milan Dankovic

> [M0] Support for exceptions thrown from parser/interpreter
> --
>
> Key: SPARK-48350
> URL: https://issues.apache.org/jira/browse/SPARK-48350
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Assignee: Milan Dankovic
>Priority: Major
>  Labels: pull-request-available
>
> In general, add support for SQL scripting related exceptions.
> By the time someone starts working on this item, some exception support might 
> already exist - check if it needs refactoring.
>  
> Have in mind that for some (all?) exceptions we might need to know which 
> line(s) in the script are responsible for it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48350) [M0] Support for exceptions thrown from parser/interpreter

2024-07-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48350.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47147
[https://github.com/apache/spark/pull/47147]

> [M0] Support for exceptions thrown from parser/interpreter
> --
>
> Key: SPARK-48350
> URL: https://issues.apache.org/jira/browse/SPARK-48350
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Assignee: Milan Dankovic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> In general, add support for SQL scripting related exceptions.
> By the time someone starts working on this item, some exception support might 
> already exist - check if it needs refactoring.
>  
> Have in mind that for some (all?) exceptions we might need to know which 
> line(s) in the script are responsible for it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated (5ff6a52426f5 -> 5dafa4c421d3)

2024-07-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 5ff6a52426f5 [SPARK-48886][SS] Add version info to changelog v2 to 
allow for easier evolution
 add 5dafa4c421d3 [SPARK-48350][SQL] Introduction of Custom Exceptions for 
Sql Scripting

No new revisions were added by this update.

Summary of changes:
 .../src/main/resources/error/error-conditions.json | 12 +++
 .../src/main/resources/error/error-states.json |  6 ++
 .../spark/sql/catalyst/parser/AstBuilder.scala | 11 +-
 .../SqlScriptingErrors.scala}  | 24 ++
 .../catalyst/parser/SqlScriptingParserSuite.scala  | 24 +-
 5 files changed, 54 insertions(+), 23 deletions(-)
 copy 
sql/catalyst/src/main/scala/org/apache/spark/sql/{catalyst/catalog/UserDefinedFunctionErrors.scala
 => errors/SqlScriptingErrors.scala} (55%)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[jira] [Resolved] (SPARK-48441) Alter string search logic for: trim (UTF8_BINARY_LCASE)

2024-07-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48441.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46762
[https://github.com/apache/spark/pull/46762]

> Alter string search logic for: trim (UTF8_BINARY_LCASE)
> ---
>
> Key: SPARK-48441
> URL: https://issues.apache.org/jira/browse/SPARK-48441
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48441) Alter string search logic for: trim (UTF8_BINARY_LCASE)

2024-07-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48441:
---

Assignee: Uroš Bojanić

> Alter string search logic for: trim (UTF8_BINARY_LCASE)
> ---
>
> Key: SPARK-48441
> URL: https://issues.apache.org/jira/browse/SPARK-48441
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-48441][SQL] Fix StringTrim behaviour for non-UTF8_BINARY collations

2024-07-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 95de02e80dee [SPARK-48441][SQL] Fix StringTrim behaviour for 
non-UTF8_BINARY collations
95de02e80dee is described below

commit 95de02e80deef19607a55b5913c674cc21521132
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Mon Jul 15 16:33:53 2024 +0800

[SPARK-48441][SQL] Fix StringTrim behaviour for non-UTF8_BINARY collations

### What changes were proposed in this pull request?
String searching in UTF8_LCASE now works at the character level, rather than 
at the byte level. For example: `ltrim("İ", "i")` now returns `"İ"`, because 
there are **no characters** in `"İ"`, starting from the left, whose lowercased 
version equals `"i"`. Note, however, that there is a byte subsequence of `"İ"` 
whose lowercased UTF-8 byte sequence equals `"i"` (so the new behaviour differs 
from the old behaviour).

Also, trimming for ICU collations works by repeatedly trimming the 
longest possible substring that matches a character in the trim string, 
starting from the left side of the input string, until trimming is done.
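
For illustration, a hedged SQL sketch of the described behaviour (assuming an
active `SparkSession` named `spark`; the exact test cases live in the suites
listed under testing below):

```
// Under UTF8_LCASE, trimming matches whole characters, so a lowercase 'i'
// trim character no longer strips the Turkish capital dotted I.
spark.sql("SELECT TRIM(LEADING 'i' FROM collate('İ', 'UTF8_LCASE'))").show()
// expected, per the description above: 'İ' is returned unchanged
```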

### Why are the changes needed?
Fix functions that give unusable results due to one-to-many case mapping 
when performing string search under UTF8_LCASE (see example above).

### Does this PR introduce _any_ user-facing change?
Yes, behaviour of `trim*` expressions is changed for collated strings for 
edge cases with one-to-many case mapping.

### How was this patch tested?
New unit tests in `CollationSupportSuite` and new e2e sql tests in 
`CollationStringExpressionsSuite`.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46762 from uros-db/alter-trim.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../catalyst/util/CollationAwareUTF8String.java| 306 +++---
 .../spark/sql/catalyst/util/CollationSupport.java  | 129 ++--
 .../spark/unsafe/types/CollationSupportSuite.java  | 663 +++--
 .../catalyst/expressions/stringExpressions.scala   |  16 +-
 .../sql/CollationStringExpressionsSuite.scala  |  45 +-
 5 files changed, 922 insertions(+), 237 deletions(-)

diff --git 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
index af152c87f88c..b9868ca665a6 100644
--- 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
+++ 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
@@ -33,6 +33,7 @@ import static 
org.apache.spark.unsafe.types.UTF8String.CodePointIteratorType;
 import java.text.CharacterIterator;
 import java.text.StringCharacterIterator;
 import java.util.HashMap;
+import java.util.HashSet;
 import java.util.Iterator;
 import java.util.Map;
 
@@ -841,117 +842,268 @@ public class CollationAwareUTF8String {
 return UTF8String.fromString(sb.toString());
   }
 
+  /**
+   * Trims the `srcString` string from both ends of the string using the 
specified `trimString`
+   * characters, with respect to the UTF8_LCASE collation. String trimming is 
performed by
+   * first trimming the left side of the string, and then trimming the right 
side of the string.
+   * The method returns the trimmed string. If the `trimString` is null, the 
method returns null.
+   *
+   * @param srcString the input string to be trimmed from both ends of the 
string
+   * @param trimString the trim string characters to trim
+   * @return the trimmed string (for UTF8_LCASE collation)
+   */
   public static UTF8String lowercaseTrim(
   final UTF8String srcString,
   final UTF8String trimString) {
-// Matching UTF8String behavior for null `trimString`.
-if (trimString == null) {
-  return null;
-}
+return lowercaseTrimRight(lowercaseTrimLeft(srcString, trimString), 
trimString);
+  }
 
-UTF8String leftTrimmed = lowercaseTrimLeft(srcString, trimString);
-return lowercaseTrimRight(leftTrimmed, trimString);
+  /**
+   * Trims the `srcString` string from both ends of the string using the 
specified `trimString`
+   * characters, with respect to all ICU collations in Spark. String trimming 
is performed by
+   * first trimming the left side of the string, and then trimming the right 
side of the string.
+   * The method returns the trimmed string. If the `trimString` is null, the 
method returns null.
+   *
+   * @param srcString the input string to be trimmed f

[jira] [Assigned] (SPARK-48440) Alter string search logic for: translate (UTF8_BINARY_LCASE)

2024-07-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48440:
---

Assignee: Uroš Bojanić

> Alter string search logic for: translate (UTF8_BINARY_LCASE)
> 
>
> Key: SPARK-48440
> URL: https://issues.apache.org/jira/browse/SPARK-48440
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48440) Alter string search logic for: translate (UTF8_BINARY_LCASE)

2024-07-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48440.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46761
[https://github.com/apache/spark/pull/46761]

> Alter string search logic for: translate (UTF8_BINARY_LCASE)
> 
>
> Key: SPARK-48440
> URL: https://issues.apache.org/jira/browse/SPARK-48440
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-48440][SQL] Fix StringTranslate behaviour for non-UTF8_BINARY collations

2024-07-12 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8e4bbdff80a1 [SPARK-48440][SQL] Fix StringTranslate behaviour for 
non-UTF8_BINARY collations
8e4bbdff80a1 is described below

commit 8e4bbdff80a1c069ccce71060751987e9e6c0b6b
Author: Uros Bojanic <157381213+uros...@users.noreply.github.com>
AuthorDate: Fri Jul 12 22:30:18 2024 +0800

[SPARK-48440][SQL] Fix StringTranslate behaviour for non-UTF8_BINARY 
collations

### What changes were proposed in this pull request?
String searching in UTF8_LCASE now works at the character level, rather than 
at the byte level. For example: `translate("İ", "i")` now returns `"İ"`, 
because there is no **single character** in `"İ"` whose lowercased version 
equals `"i"`. Note, however, that there _is_ a byte subsequence of `"İ"` whose 
lowercased UTF-8 byte sequence equals `"i"` (so the new behaviour differs from 
the old behaviour).

Also, translation for ICU collations works by repeatedly translating the 
longest possible substring that matches a key in the dictionary (under the 
specified collation), starting from the left side of the input string, until 
the entire string is translated.
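
For illustration, a hedged SQL sketch of the character-level matching described
above (assuming an active `SparkSession` named `spark`; the replacement
character is made up):

```
// Under UTF8_LCASE, translate now matches whole characters: no single
// character of 'İ' lowercases to 'i', so nothing is replaced.
spark.sql("SELECT translate(collate('İ', 'UTF8_LCASE'), 'i', 'x')").show()
// expected, per the description above: 'İ' is returned unchanged
```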

### Why are the changes needed?
Fix functions that give unusable results due to one-to-many case mapping 
when performing string search under UTF8_BINARY_LCASE (see example above).

### Does this PR introduce _any_ user-facing change?
Yes, behaviour of `translate` expression is changed for edge cases with 
one-to-many case mapping.

### How was this patch tested?
New unit tests in `CollationStringExpressionsSuite`.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46761 from uros-db/alter-translate.

Authored-by: Uros Bojanic <157381213+uros...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../catalyst/util/CollationAwareUTF8String.java| 218 +
 .../spark/sql/catalyst/util/CollationSupport.java  |  25 +--
 .../spark/unsafe/types/CollationSupportSuite.java  | 192 --
 .../catalyst/expressions/stringExpressions.scala   |  30 ++-
 .../sql/CollationStringExpressionsSuite.scala  |  51 +
 5 files changed, 402 insertions(+), 114 deletions(-)

diff --git 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
index 23adc772b7f3..af152c87f88c 100644
--- 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
+++ 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
@@ -18,6 +18,8 @@ package org.apache.spark.sql.catalyst.util;
 
 import com.ibm.icu.lang.UCharacter;
 import com.ibm.icu.text.BreakIterator;
+import com.ibm.icu.text.Collator;
+import com.ibm.icu.text.RuleBasedCollator;
 import com.ibm.icu.text.StringSearch;
 import com.ibm.icu.util.ULocale;
 
@@ -26,8 +28,12 @@ import org.apache.spark.unsafe.types.UTF8String;
 
 import static org.apache.spark.unsafe.Platform.BYTE_ARRAY_OFFSET;
 import static org.apache.spark.unsafe.Platform.copyMemory;
+import static org.apache.spark.unsafe.types.UTF8String.CodePointIteratorType;
 
+import java.text.CharacterIterator;
+import java.text.StringCharacterIterator;
 import java.util.HashMap;
+import java.util.Iterator;
 import java.util.Map;
 
 /**
@@ -424,19 +430,50 @@ public class CollationAwareUTF8String {
* @param codePoint The code point to convert to lowercase.
* @param sb The StringBuilder to append the lowercase character to.
*/
-  private static void lowercaseCodePoint(final int codePoint, final 
StringBuilder sb) {
-if (codePoint == 0x0130) {
+  private static void appendLowercaseCodePoint(final int codePoint, final 
StringBuilder sb) {
+int lowercaseCodePoint = getLowercaseCodePoint(codePoint);
+if (lowercaseCodePoint == CODE_POINT_COMBINED_LOWERCASE_I_DOT) {
   // Latin capital letter I with dot above is mapped to 2 lowercase 
characters.
   sb.appendCodePoint(0x0069);
   sb.appendCodePoint(0x0307);
+} else {
+  // All other characters should follow context-unaware ICU single-code 
point case mapping.
+  sb.appendCodePoint(lowercaseCodePoint);
+}
+  }
+
+  /**
+   * `CODE_POINT_COMBINED_LOWERCASE_I_DOT` is an internal representation of 
the combined lowercase
+   * code point for ASCII lowercase letter i with an additional combining dot 
character (U+0307).
+   * This integer value is not a valid code point itself, but rather an 
artificial code point
+   * marker use

[jira] [Assigned] (SPARK-48871) Fix INVALID_NON_DETERMINISTIC_EXPRESSIONS validation in CheckAnalysis

2024-07-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48871:
---

Assignee: Carmen Kwan

> Fix INVALID_NON_DETERMINISTIC_EXPRESSIONS validation in CheckAnalysis 
> --
>
> Key: SPARK-48871
> URL: https://issues.apache.org/jira/browse/SPARK-48871
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.2, 3.4.4
>Reporter: Carmen Kwan
>Assignee: Carmen Kwan
>Priority: Major
>  Labels: pull-request-available
>
> I encountered the following exception when attempting to use a 
> non-deterministic udf in my query.
> {code:java}
> [info] org.apache.spark.sql.catalyst.ExtendedAnalysisException: 
> [INVALID_NON_DETERMINISTIC_EXPRESSIONS] The operator expects a deterministic 
> expression, but the actual expression is "[some expression]".; line 2 pos 1
> [info] [some logical plan]
> [info] at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
> [info] at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2(CheckAnalysis.scala:761)
> [info] at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2$adapted(CheckAnalysis.scala:182)
> [info] at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:244)
> [info] at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0(CheckAnalysis.scala:182)
> [info] at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0$(CheckAnalysis.scala:164)
> [info] at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis0(Analyzer.scala:188)
> [info] at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:160)
> [info] at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:150)
> [info] at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:188)
> [info] at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:211)
> [info] at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330)
> [info] at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:208)
> [info] at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:77)
> [info] at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:138)
> [info] at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:219)
> [info] at 
> org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:546)
> [info] at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:219)
> [info] at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
> [info] at 
> org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:218)
> [info] at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:77)
> [info] at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:74)
> [info] at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:66){code}
> The non-deterministic expression can be safely allowed for my custom 
> LogicalPlan, but it is disabled in the checkAnalysis phase. The CheckAnalysis 
> rule is so strict that reasonable use cases of non-deterministic 
> expressions are also disabled.
> To fix this, we could add a trait that logical plans can extend to implement 
> a method that decides whether the operator allows non-deterministic 
> expressions, and check this method in checkAnalysis. This delegates the 
> validation to frameworks that extend Spark, so we can allow-list more 
> than just the few explicitly named logical plans (e.g. `Project`, `Filter`). 
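
For reference, a minimal sketch of the proposed trait (the names mirror what the
follow-up PR ended up adding; the snippet itself is illustrative and not part of
this issue):

{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Sketch of the proposed opt-in trait: an operator that extends it declares
// whether it tolerates non-deterministic expressions, and checkAnalysis
// consults this flag before raising INVALID_NON_DETERMINISTIC_EXPRESSIONS.
trait SupportsNonDeterministicExpression extends LogicalPlan {

  /** Returns whether this operator allows non-deterministic expressions. */
  def allowNonDeterministicExpression: Boolean
}
{code}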



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48871) Fix INVALID_NON_DETERMINISTIC_EXPRESSIONS validation in CheckAnalysis

2024-07-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48871.
-
Fix Version/s: 3.5.2
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 47304
[https://github.com/apache/spark/pull/47304]

> Fix INVALID_NON_DETERMINISTIC_EXPRESSIONS validation in CheckAnalysis 
> --
>
> Key: SPARK-48871
> URL: https://issues.apache.org/jira/browse/SPARK-48871
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.2, 3.4.4
>Reporter: Carmen Kwan
>Assignee: Carmen Kwan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.2, 4.0.0
>
>
> I encountered the following exception when attempting to use a 
> non-deterministic udf in my query.
> {code:java}
> [info] org.apache.spark.sql.catalyst.ExtendedAnalysisException: 
> [INVALID_NON_DETERMINISTIC_EXPRESSIONS] The operator expects a deterministic 
> expression, but the actual expression is "[some expression]".; line 2 pos 1
> [info] [some logical plan]
> [info] at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
> [info] at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2(CheckAnalysis.scala:761)
> [info] at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2$adapted(CheckAnalysis.scala:182)
> [info] at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:244)
> [info] at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0(CheckAnalysis.scala:182)
> [info] at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0$(CheckAnalysis.scala:164)
> [info] at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis0(Analyzer.scala:188)
> [info] at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:160)
> [info] at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:150)
> [info] at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:188)
> [info] at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:211)
> [info] at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330)
> [info] at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:208)
> [info] at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:77)
> [info] at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:138)
> [info] at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:219)
> [info] at 
> org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:546)
> [info] at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:219)
> [info] at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
> [info] at 
> org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:218)
> [info] at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:77)
> [info] at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:74)
> [info] at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:66){code}
> The non-deterministic expression can be safely allowed for my custom 
> LogicalPlan, but it is disabled in the checkAnalysis phase. The CheckAnalysis 
> rule is so strict that reasonable use cases of non-deterministic 
> expressions are also disabled.
> To fix this, we could add a trait that logical plans can extend to implement 
> a method that decides whether the operator allows non-deterministic 
> expressions, and check this method in checkAnalysis. This delegates the 
> validation to frameworks that extend Spark, so we can allow-list more 
> than just the few explicitly named logical plans (e.g. `Project`, `Filter`). 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-48871] Fix INVALID_NON_DETERMINISTIC_EXPRESSIONS validation in…

2024-07-11 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new b15a8725b25c [SPARK-48871] Fix INVALID_NON_DETERMINISTIC_EXPRESSIONS 
validation in…
b15a8725b25c is described below

commit b15a8725b25c0b3f78efcccfb1f69e8d7fbd9a72
Author: zhipeng.mao 
AuthorDate: Fri Jul 12 14:51:47 2024 +0800

[SPARK-48871] Fix INVALID_NON_DETERMINISTIC_EXPRESSIONS validation in…

… CheckAnalysis

### What changes were proposed in this pull request?
The PR added a trait that logical plans can extend to implement a method that 
decides whether the operator allows non-deterministic expressions; checkAnalysis 
now checks this method.

### Why are the changes needed?
I encountered the `INVALID_NON_DETERMINISTIC_EXPRESSIONS` exception when 
attempting to use a non-deterministic UDF in my query. The non-deterministic 
expression can be safely allowed for my custom LogicalPlan, but it is disabled 
in the checkAnalysis phase. The CheckAnalysis rule is so strict that 
reasonable use cases of non-deterministic expressions are also disabled.
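
As a rough illustration (not part of this patch), a hypothetical custom operator
could opt in through the new trait as sketched below; `MyCustomSink` and its
shape are made-up assumptions:

```
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, SupportsNonDeterministicExpression, UnaryNode}

// Hypothetical operator from a framework that extends Spark.
case class MyCustomSink(condition: Expression, child: LogicalPlan)
  extends UnaryNode with SupportsNonDeterministicExpression {

  // Opting in: CheckAnalysis no longer raises
  // INVALID_NON_DETERMINISTIC_EXPRESSIONS for this operator's expressions.
  override def allowNonDeterministicExpression: Boolean = true

  override def output: Seq[Attribute] = child.output

  override protected def withNewChildInternal(newChild: LogicalPlan): MyCustomSink =
    copy(child = newChild)
}
```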

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
The test case `"SPARK-48871: AllowsNonDeterministicExpression allow lists 
non-deterministic expressions"` is added.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47304 from zhipengmao-db/zhipengmao-db/SPARK-48871-check-analysis.

Lead-authored-by: zhipeng.mao 
Co-authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 9cbd5dd4e7477294f7d4289880c7ea0dd67b38d3)
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/analysis/CheckAnalysis.scala  | 12 +++
 .../plans/logical/basicLogicalOperators.scala  | 10 ++
 .../sql/catalyst/analysis/AnalysisErrorSuite.scala | 23 ++
 3 files changed, 45 insertions(+)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
index 485015f2efab..bb399e41d7d0 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
@@ -157,6 +157,17 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog with QueryErrorsB
 visited(cteId) = true
   }
 
+  /**
+   * Checks whether the operator allows non-deterministic expressions.
+   */
+  private def operatorAllowsNonDeterministicExpressions(plan: LogicalPlan): 
Boolean = {
+plan match {
+  case p: SupportsNonDeterministicExpression =>
+p.allowNonDeterministicExpression
+  case _ => false
+}
+  }
+
   def checkAnalysis(plan: LogicalPlan): Unit = {
 val inlineCTE = InlineCTE(alwaysInline = true)
 val cteMap = mutable.HashMap.empty[Long, (CTERelationDef, Int, 
mutable.Map[Long, Int])]
@@ -771,6 +782,7 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog with QueryErrorsB
 "dataType" -> toSQLType(mapCol.dataType)))
 
   case o if o.expressions.exists(!_.deterministic) &&
+!operatorAllowsNonDeterministicExpressions(o) &&
 !o.isInstanceOf[Project] && !o.isInstanceOf[Filter] &&
 !o.isInstanceOf[Aggregate] && !o.isInstanceOf[Window] &&
 !o.isInstanceOf[Expand] &&
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
index ca2c6a850561..f76e698a6400 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
@@ -1957,6 +1957,16 @@ case class DeduplicateWithinWatermark(keys: 
Seq[Attribute], child: LogicalPlan)
  */
 trait SupportsSubquery extends LogicalPlan
 
+/**
+ * Trait that logical plans can extend to check whether it can allow 
non-deterministic
+ * expressions and pass the CheckAnalysis rule.
+ */
+trait SupportsNonDeterministicExpression extends LogicalPlan {
+
+  /** Returns whether it allows non-deterministic expressions. */
+  def allowNonDeterministicExpression: Boolean
+}
+
 /**
  * Collect arbitrary (named) metrics from a dataset. As soon as the query 
reaches a completion
  * point (batch query completes or streaming query epoch completes) an event 
is emitted on the
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
index a7df53db936f..6b5f0fe3876d 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/Anal

(spark) branch master updated (5bbe9c850aaa -> 9cbd5dd4e747)

2024-07-11 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 5bbe9c850aaa [SPARK-48842][DOCS] Document non-determinism of max_by 
and min_by
 add 9cbd5dd4e747 [SPARK-48871] Fix INVALID_NON_DETERMINISTIC_EXPRESSIONS 
validation in…

No new revisions were added by this update.

Summary of changes:
 .../sql/catalyst/analysis/CheckAnalysis.scala  | 12 +++
 .../plans/logical/basicLogicalOperators.scala  | 10 ++
 .../sql/catalyst/analysis/AnalysisErrorSuite.scala | 23 ++
 3 files changed, 45 insertions(+)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48760][SQL] Fix CatalogV2Util.applyClusterByChanges

2024-07-11 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new cbe6846c477b [SPARK-48760][SQL] Fix CatalogV2Util.applyClusterByChanges
cbe6846c477b is described below

commit cbe6846c477bc8b6d94385ddd0097c4e97b05d41
Author: Jiaheng Tang 
AuthorDate: Fri Jul 12 11:10:38 2024 +0800

[SPARK-48760][SQL] Fix CatalogV2Util.applyClusterByChanges

### What changes were proposed in this pull request?

https://github.com/apache/spark/pull/47156/ introduced a bug in 
`CatalogV2Util.applyClusterByChanges`: it removes the existing 
`ClusterByTransform` first, regardless of whether there is a `ClusterBy` table 
change. This means any table change will remove the clustering columns from the 
table.

This PR fixes the bug by removing the `ClusterByTransform` only when there 
is a `ClusterBy` table change.
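
To make the failure mode concrete, a hedged repro sketch (the table name,
`USING parquet`, and the assumption that the target catalog supports clustered
tables are all illustrative, not taken from the patch):

```
// Assumes an active SparkSession `spark`.
spark.sql("CREATE TABLE t (col1 STRING, col2 STRUCT<x: INT, y: INT>) " +
  "USING parquet CLUSTER BY (col1, col2.x)")
// Before this fix, this unrelated change silently dropped the clustering columns.
spark.sql("ALTER TABLE t ALTER COLUMN col1 COMMENT 'this is comment'")
// Expected after the fix: the clustering information still lists col1, col2.x.
spark.sql("DESC EXTENDED t").show(truncate = false)
```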

### Why are the changes needed?

Without the fix, any table change on a clustered table would silently drop its 
clustering columns, which is a metadata-correctness bug.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Amended the existing test to catch this bug.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47288 from zedtang/fix-apply-cluster-by-changes.

Authored-by: Jiaheng Tang 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/connector/catalog/CatalogV2Util.scala   | 19 +++
 .../execution/command/DescribeTableSuiteBase.scala|  3 ++-
 2 files changed, 13 insertions(+), 9 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala
index c5888d72c2b2..645ed9e6bb0c 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala
@@ -178,16 +178,19 @@ private[sql] object CatalogV2Util {
  schema: StructType,
  changes: Seq[TableChange]): Array[Transform] = {
 
-val newPartitioning = 
partitioning.filterNot(_.isInstanceOf[ClusterByTransform]).toBuffer
-changes.foreach {
-  case clusterBy: ClusterBy =>
-newPartitioning += ClusterBySpec.extractClusterByTransform(
+var newPartitioning = partitioning
+// If there is a clusterBy change (only the first one), we overwrite the 
existing
+// clustering columns.
+val clusterByOpt = changes.collectFirst { case c: ClusterBy => c }
+clusterByOpt.foreach { clusterBy =>
+  newPartitioning = partitioning.map {
+case _: ClusterByTransform => ClusterBySpec.extractClusterByTransform(
   schema, ClusterBySpec(clusterBy.clusteringColumns.toIndexedSeq), 
conf.resolver)
-
-  case _ =>
-  // ignore other changes
+case other => other
+  }
 }
-newPartitioning.toArray
+
+newPartitioning
   }
 
   /**
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DescribeTableSuiteBase.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DescribeTableSuiteBase.scala
index 2588aa4313fa..02e8a5e68999 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DescribeTableSuiteBase.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DescribeTableSuiteBase.scala
@@ -181,8 +181,9 @@ trait DescribeTableSuiteBase extends QueryTest with 
DDLCommandTestUtils {
 
   test("describe a clustered table") {
 withNamespaceAndTable("ns", "tbl") { tbl =>
-  sql(s"CREATE TABLE $tbl (col1 STRING COMMENT 'this is comment', col2 
struct) " +
+  sql(s"CREATE TABLE $tbl (col1 STRING, col2 struct) " +
 s"$defaultUsing CLUSTER BY (col1, col2.x)")
+  sql(s"ALTER TABLE $tbl ALTER COLUMN col1 COMMENT 'this is comment';")
   val descriptionDf = sql(s"DESC $tbl")
   assert(descriptionDf.schema.map(field => (field.name, field.dataType)) 
=== Seq(
 ("col_name", StringType),


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[jira] [Resolved] (SPARK-48841) Include `collationName` to `sql()` of `Collate`

2024-07-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48841.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47265
[https://github.com/apache/spark/pull/47265]

> Include `collationName` to `sql()` of `Collate`
> ---
>
> Key: SPARK-48841
> URL: https://issues.apache.org/jira/browse/SPARK-48841
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-48841][SQL] Include `collationName` to `sql()` of `Collate`

2024-07-11 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 73eab92d30bf [SPARK-48841][SQL] Include `collationName` to `sql()` of 
`Collate`
73eab92d30bf is described below

commit 73eab92d30bf78d39c80dd5af03b052fe9fc5211
Author: panbingkun 
AuthorDate: Fri Jul 12 10:16:11 2024 +0800

[SPARK-48841][SQL] Include `collationName` to `sql()` of `Collate`

### What changes were proposed in this pull request?
In the PR, I propose to fix the `sql()` method of the `Collate` expression, 
and append the `collationName` clause.

### Why are the changes needed?
To distinguish column names when different `collationName` arguments are used 
with `collate`. Before the changes, column names could conflict, as in the 
example below, which could confuse users:
```
sql("CREATE TEMP VIEW tbl as (SELECT collate('A', 'UTF8_BINARY'), 
collate('A', 'UTF8_LCASE'))")
```
- Before:
```
[COLUMN_ALREADY_EXISTS] The column `collate(a)` already exists. Choose 
another name or rename the existing column. SQLSTATE: 42711
org.apache.spark.sql.AnalysisException: [COLUMN_ALREADY_EXISTS] The column 
`collate(a)` already exists. Choose another name or rename the existing column. 
SQLSTATE: 42711
at 
org.apache.spark.sql.errors.QueryCompilationErrors$.columnAlreadyExistsError(QueryCompilationErrors.scala:2595)
at 
org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:115)
at 
org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:97)
```

- After:
```
describe extended tbl;
+-----------------------+-------------------------+-------+
|col_name               |data_type                |comment|
+-----------------------+-------------------------+-------+
|collate(A, UTF8_BINARY)|string                   |NULL   |
|collate(A, UTF8_LCASE) |string collate UTF8_LCASE|NULL   |
+-----------------------+-------------------------+-------+

```
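
A quick way to observe the effect (sketch only; assumes an active SparkSession
`spark`, and the exact auto-generated names follow the new `Collate.sql()`):

```
val df = spark.sql("SELECT collate('A', 'UTF8_BINARY'), collate('A', 'UTF8_LCASE')")
// With this change the two auto-generated column names no longer collide.
df.columns.foreach(println)
// Roughly: collate(A, UTF8_BINARY) and collate(A, UTF8_LCASE)
```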

### Does this PR introduce _any_ user-facing change?
Should not.

### How was this patch tested?
Updated existing UT.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47265 from panbingkun/SPARK-48841.
    
Authored-by: panbingkun 
Signed-off-by: Wenchen Fan 
---
 .../explain-results/function_collate.explain   |   2 +-
 .../expressions/collationExpressions.scala |   4 +
 .../sql-functions/sql-expression-schema.md |   2 +-
 .../sql-tests/analyzer-results/collations.sql.out  |  60 ++---
 .../resources/sql-tests/results/collations.sql.out |  50 ++--
 .../spark/sql/CollationExpressionWalkerSuite.scala |   4 +-
 .../apache/spark/sql/CollationSQLRegexpSuite.scala | 292 -
 .../org/apache/spark/sql/CollationSuite.scala  |   4 +-
 8 files changed, 296 insertions(+), 122 deletions(-)

diff --git 
a/connect/common/src/test/resources/query-tests/explain-results/function_collate.explain
 
b/connect/common/src/test/resources/query-tests/explain-results/function_collate.explain
index c736abf67b11..e4e6aedc34de 100644
--- 
a/connect/common/src/test/resources/query-tests/explain-results/function_collate.explain
+++ 
b/connect/common/src/test/resources/query-tests/explain-results/function_collate.explain
@@ -1,2 +1,2 @@
-Project [collate(g#0, UNICODE) AS collate(g)#0]
+Project [collate(g#0, UNICODE) AS collate(g, UNICODE)#0]
 +- LocalRelation , [id#0L, a#0, b#0, d#0, e#0, f#0, g#0]
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala
index c528b523c5e7..c7fbb39ea285 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala
@@ -92,6 +92,10 @@ case class Collate(child: Expression, collationName: String)
 
   override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): 
ExprCode =
 defineCodeGen(ctx, ev, (in) => in)
+
+  override def sql: String = s"$prettyName(${child.sql}, $collationName)"
+
+  override def toString: String = s"$prettyName($child, $collationName)"
 }
 
 // scalastyle:off line.contains.tab
diff --git a/sql/core/src/test/resources/sql-functions/sql-expression-schema.md 
b/sql/core/src/test/resources/sql-functions/sql-expression-schema.md
index cf218becdf1d..bf4622accf41 100644
--- a/sql/core/src/test/resources/sql-functions/sql-expression-schema.md
+++ b/sql/core/src/test/resources/sql-functions/sql-expression-schema.md

(spark) branch master updated: [SPARK-46743][SQL][FOLLOW UP] Count bug after ScalarSubqery is folded if it has an empty relation

2024-07-11 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ef71600e1378 [SPARK-46743][SQL][FOLLOW UP] Count bug after 
ScalarSubqery is folded if it has an empty relation
ef71600e1378 is described below

commit ef71600e137886b9996ddd22b57a02e0c3955a22
Author: Andy Lam 
AuthorDate: Fri Jul 12 10:08:04 2024 +0800

[SPARK-46743][SQL][FOLLOW UP] Count bug after ScalarSubqery is folded if it 
has an empty relation

### What changes were proposed in this pull request?

In this PR https://github.com/apache/spark/pull/45125, we handled the case 
where an Aggregate is folded into a Project, causing a count bug. We missed 
cases where:
1. The entire ScalarSubquery's plan is regarded as empty relation, and is 
folded completely.
2. There are operations above the Aggregate in the subquery (such as filter 
and project).

### Why are the changes needed?

This PR fixes that by adding the case handling in ConstantFolding and 
OptimizeSubqueries.
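
For intuition, a hedged sketch of the bug shape (the view names are made up;
assumes an active SparkSession `spark`):

```
// A correlated scalar COUNT subquery over an empty relation must return 0 per
// outer row, even after the optimizer folds the empty subquery plan away.
spark.range(3).createOrReplaceTempView("outer_t")
spark.range(0).createOrReplaceTempView("empty_t")
spark.sql(
  """SELECT o.id,
    |       (SELECT COUNT(e.id) FROM empty_t e WHERE e.id = o.id) AS cnt
    |FROM outer_t o""".stripMargin).show()
// Expected: cnt = 0 for every row, not NULL.
```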

### Does this PR introduce _any_ user-facing change?

Yes. There was a correctness error that occurred when the scalar subquery was 
count-bug-susceptible and empty, and thus folded by `ConstantFolding`.

### How was this patch tested?

Added SQL query tests in `scalar-subquery-count-bug.sql`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47290 from andylam-db/decorr-bugs.

Authored-by: Andy Lam 
Signed-off-by: Wenchen Fan 
---
 .../jdbc/querytest/GeneratedSubquerySuite.scala|  17 +--
 .../spark/sql/catalyst/optimizer/Optimizer.scala   |  22 ++-
 .../spark/sql/catalyst/optimizer/expressions.scala |   6 +
 .../scalar-subquery-count-bug.sql.out  | 160 +
 .../scalar-subquery/scalar-subquery-count-bug.sql  |  67 +
 .../scalar-subquery-count-bug.sql.out  | 116 +++
 6 files changed, 367 insertions(+), 21 deletions(-)

diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/querytest/GeneratedSubquerySuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/querytest/GeneratedSubquerySuite.scala
index 8b27e9cb0e0a..015898b70b7c 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/querytest/GeneratedSubquerySuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/querytest/GeneratedSubquerySuite.scala
@@ -445,20 +445,5 @@ object GeneratedSubquerySuite {
   // Limit number of generated queries per test so that tests will not take 
too long.
   private val NUM_QUERIES_PER_TEST = 1000
 
-  // scalastyle:off line.size.limit
-  private val KNOWN_QUERIES_WITH_DIFFERENT_RESULTS = Seq(
-// SPARK-46743
-"SELECT outer_table.a, outer_table.b, (SELECT COUNT(null_table.a) AS 
aggFunctionAlias FROM null_table WHERE null_table.a = outer_table.a) AS 
subqueryAlias FROM outer_table ORDER BY a DESC NULLS FIRST, b DESC NULLS FIRST, 
subqueryAlias DESC NULLS FIRST;",
-// SPARK-46743
-"SELECT outer_table.a, outer_table.b, (SELECT COUNT(null_table.a) AS 
aggFunctionAlias FROM null_table INNER JOIN join_table ON null_table.a = 
join_table.a WHERE null_table.a = outer_table.a) AS subqueryAlias FROM 
outer_table ORDER BY a DESC NULLS FIRST, b DESC NULLS FIRST, subqueryAlias DESC 
NULLS FIRST;",
-// SPARK-46743
-"SELECT outer_table.a, outer_table.b, (SELECT COUNT(null_table.a) AS 
aggFunctionAlias FROM null_table LEFT OUTER JOIN join_table ON null_table.a = 
join_table.a WHERE null_table.a = outer_table.a) AS subqueryAlias FROM 
outer_table ORDER BY a DESC NULLS FIRST, b DESC NULLS FIRST, subqueryAlias DESC 
NULLS FIRST;",
-// SPARK-46743
-"SELECT outer_table.a, outer_table.b, (SELECT COUNT(null_table.a) AS 
aggFunctionAlias FROM null_table RIGHT OUTER JOIN join_table ON null_table.a = 
join_table.a WHERE null_table.a = outer_table.a) AS subqueryAlias FROM 
outer_table ORDER BY a DESC NULLS FIRST, b DESC NULLS FIRST, subqueryAlias DESC 
NULLS FIRST;",
-// SPARK-46743
-"SELECT outer_table.a, outer_table.b, (SELECT COUNT(innerSubqueryAlias.a) 
AS aggFunctionAlias FROM (SELECT null_table.a, null_table.b FROM null_table 
INTERSECT SELECT join_table.a, join_table.b FROM join_table) AS 
innerSubqueryAlias WHERE innerSubqueryAlias.a = outer_table.a) AS subqueryAlias 
FROM outer_table ORDER BY a DESC NULLS FIRST, b DESC NULLS FIRST, subqueryAlias 
DESC NULLS FIRST;",
-// SPARK-46743
-"SELECT outer_table.a, outer_table.b, (SELECT COUNT(innerSubqueryAlias.a) 
AS aggFunctionAlias FROM (SELECT null_table.a, null_table.b FROM null_table 
EXCEPT SELECT join_table.a

[jira] [Resolved] (SPARK-48793) Unify v1 and v2 `ALTER TABLE .. `DROP|RENAME` COLUMN[S] ...` tests

2024-07-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48793.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47199
[https://github.com/apache/spark/pull/47199]

> Unify v1 and v2 `ALTER TABLE .. `DROP|RENAME` COLUMN[S] ...` tests
> --
>
> Key: SPARK-48793
> URL: https://issues.apache.org/jira/browse/SPARK-48793
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated: [SPARK-48793][SQL][TESTS][S] Unify v1 and v2 `ALTER TABLE .. `DROP|RENAME` COLUMN ...` tests

2024-07-11 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new dbc420f02e81 [SPARK-48793][SQL][TESTS][S] Unify v1 and v2 `ALTER TABLE 
.. `DROP|RENAME` COLUMN ...` tests
dbc420f02e81 is described below

commit dbc420f02e817d03b0605ccd5a6195d50a5af07e
Author: panbingkun 
AuthorDate: Thu Jul 11 20:52:06 2024 +0800

[SPARK-48793][SQL][TESTS][S] Unify v1 and v2 `ALTER TABLE .. `DROP|RENAME` 
COLUMN ...` tests

### What changes were proposed in this pull request?
The PR aims to:
- Move parser tests from `o.a.s.s.c.p.DDLParserSuite` and 
`o.a.s.s.c.p.ErrorParserSuite` to `AlterTableRenameColumnParserSuite` & 
`AlterTableDropColumnParserSuite`
- Add a test for DSv2 ALTER TABLE .. `DROP|RENAME` to 
`v2.AlterTableDropColumnSuite` & `v2.AlterTableRenameColumnSuite`

(This PR includes the unification of two commands: `DROP COLUMN` & `RENAME 
COLUMN`)

### Why are the changes needed?
- To improve test coverage.
- Align with other similar tests, eg: AlterTableRename*

### Does this PR introduce _any_ user-facing change?
No, only tests.

### How was this patch tested?
- Add new UT
- Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47199 from panbingkun/alter_table_drop_column.

Authored-by: panbingkun 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/parser/DDLParserSuite.scala |  51 --
 .../sql/catalyst/parser/ErrorParserSuite.scala |  18 --
 .../command/AlterTableDropColumnParserSuite.scala  |  68 
 .../command/AlterTableDropColumnSuiteBase.scala|  38 +
 .../AlterTableRenameColumnParserSuite.scala|  56 ++
 .../command/AlterTableRenameColumnSuiteBase.scala  |  39 +
 .../command/v1/AlterTableDropColumnSuite.scala |  57 +++
 .../command/v1/AlterTableRenameColumnSuite.scala   |  56 ++
 .../command/v2/AlterTableDropColumnSuite.scala | 184 
 .../command/v2/AlterTableRenameColumnSuite.scala   | 190 +
 .../command/AlterTableDropColumnSuite.scala|  26 +++
 .../command/AlterTableRenameColumnSuite.scala  |  26 +++
 12 files changed, 740 insertions(+), 69 deletions(-)

diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala
index f88c516f0019..d570716c3767 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala
@@ -1204,15 +1204,6 @@ class DDLParserSuite extends AnalysisTest {
   )))
   }
 
-  test("alter table: rename column") {
-comparePlans(
-  parsePlan("ALTER TABLE table_name RENAME COLUMN a.b.c TO d"),
-  RenameColumn(
-UnresolvedTable(Seq("table_name"), "ALTER TABLE ... RENAME COLUMN"),
-UnresolvedFieldName(Seq("a", "b", "c")),
-"d"))
-  }
-
   test("alter table: update column type using ALTER") {
 comparePlans(
   parsePlan("ALTER TABLE table_name ALTER COLUMN a.b.c TYPE bigint"),
@@ -1322,48 +1313,6 @@ class DDLParserSuite extends AnalysisTest {
 None))
   }
 
-  test("alter table: drop column") {
-comparePlans(
-  parsePlan("ALTER TABLE table_name DROP COLUMN a.b.c"),
-  DropColumns(
-UnresolvedTable(Seq("table_name"), "ALTER TABLE ... DROP COLUMNS"),
-Seq(UnresolvedFieldName(Seq("a", "b", "c"))),
-ifExists = false))
-
-comparePlans(
-  parsePlan("ALTER TABLE table_name DROP COLUMN IF EXISTS a.b.c"),
-  DropColumns(
-UnresolvedTable(Seq("table_name"), "ALTER TABLE ... DROP COLUMNS"),
-Seq(UnresolvedFieldName(Seq("a", "b", "c"))),
-ifExists = true))
-  }
-
-  test("alter table: drop multiple columns") {
-val sql = "ALTER TABLE table_name DROP COLUMN x, y, a.b.c"
-Seq(sql, sql.replace("COLUMN", "COLUMNS")).foreach { drop =>
-  comparePlans(
-parsePlan(drop),
-DropColumns(
-  UnresolvedTable(Seq("table_name"), "ALTER TABLE ... DROP COLUMNS"),
-  Seq(UnresolvedFieldName(Seq("x")),
-UnresolvedFieldName(Seq("y")),
-UnresolvedFieldName(Seq("a", "b", "c"))),
-  ifExists = false))
-}
-
-val sqlIfExists = "ALTER TABLE tab

[jira] [Resolved] (SPARK-48773) Document config "spark.default.parallelism"

2024-07-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48773.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47171
[https://github.com/apache/spark/pull/47171]

> Document config "spark.default.parallelism"
> ---
>
> Key: SPARK-48773
> URL: https://issues.apache.org/jira/browse/SPARK-48773
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



(spark) branch master updated (261dbf4a9047 -> 896c15e2a30c)

2024-07-11 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 261dbf4a9047 [SPARK-48866][SQL] Fix hints of valid charset in the 
error message of INVALID_PARAMETER_VALUE.CHARSET
 add 896c15e2a30c [SPARK-48773] Document config "spark.default.parallelism" 
by config builder framework

No new revisions were added by this update.

Summary of changes:
 .../src/main/scala/org/apache/spark/Partitioner.scala |  3 ++-
 .../org/apache/spark/internal/config/package.scala| 12 
 .../cluster/CoarseGrainedSchedulerBackend.scala   |  4 ++--
 .../spark/scheduler/local/LocalSchedulerBackend.scala |  2 +-
 core/src/test/scala/org/apache/spark/FileSuite.scala  |  2 +-
 .../scala/org/apache/spark/PartitioningSuite.scala|  7 ---
 .../spark/SparkContextSchedulerCreationSuite.scala|  3 ++-
 .../org/apache/spark/rdd/PairRDDFunctionsSuite.scala  | 19 ++-
 .../CoarseGrainedSchedulerBackendSuite.scala  |  2 +-
 .../spark/scheduler/SchedulerIntegrationSuite.scala   |  4 ++--
 .../scala/org/apache/spark/graphx/GraphSuite.scala|  3 ++-
 .../datasources/FileSourceStrategySuite.scala |  3 ++-
 12 files changed, 41 insertions(+), 23 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[jira] [Resolved] (SPARK-48851) Attach full name for `SCHEMA_NOT_FOUND`

2024-07-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48851.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47276
[https://github.com/apache/spark/pull/47276]

> Attach full name for `SCHEMA_NOT_FOUND`
> ---
>
> Key: SPARK-48851
> URL: https://issues.apache.org/jira/browse/SPARK-48851
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48851) Attach full name for `SCHEMA_NOT_FOUND`

2024-07-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48851:
---

Assignee: BingKun Pan

> Attach full name for `SCHEMA_NOT_FOUND`
> ---
>
> Key: SPARK-48851
> URL: https://issues.apache.org/jira/browse/SPARK-48851
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


