This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
     new 0eb87211fc1 [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`
0eb87211fc1 is described below

commit 0eb87211fc12bb221156aaf5775dec2b6a56950e
Author: Dongjoon Hyun <dongj...@apache.org>
AuthorDate: Tue Sep 27 02:14:22 2022 -0700

    [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`
    
    ### What changes were proposed in this pull request?
    
    This PR aims to add a new legacy configuration to keep the `grouping__id` values consistent with the released Apache Spark 3.2 and 3.3.
    
    Please note that this syntax is not part of the SQL standard, and even Hive does not support it.
    
    ```SQL
    hive> SELECT version();
    OK
    3.1.3 r4df4d75bf1e16fe0af75aad0b4179c34c07fc975
    Time taken: 0.111 seconds, Fetched: 1 row(s)
    
    hive> SELECT count(*), grouping__id from t GROUP BY a GROUPING SETS (b);
    FAILED: SemanticException 1:63 [Error 10213]:
    Grouping sets expression is not in GROUP BY key. Error encountered near token 'b'
    ```
    
    ### Why are the changes needed?
    
    SPARK-40218 fixed a bug introduced by SPARK-34932 (in Apache Spark 3.2.0). As a side effect, the `grouping__id` values changed.
    
    - Apache Spark 3.2.0, 3.2.1, 3.2.2, 3.3.0.
    ```scala
    scala> sql("SELECT count(*), grouping__id from (VALUES (1,1,1),(2,2,2)) AS 
t(k1,k2,v) GROUP BY k1 GROUPING SETS (k2) ").show()
    +--------+------------+
    |count(1)|grouping__id|
    +--------+------------+
    |       1|           1|
    |       1|           1|
    +--------+------------+
    ```
    
    - After SPARK-40218, Apache Spark 3.4.0, 3.3.1, 3.2.3
    ```scala
    scala> sql("SELECT count(*), grouping__id from (VALUES (1,1,1),(2,2,2)) AS 
t(k1,k2,v) GROUP BY k1 GROUPING SETS (k2) ").show()
    +--------+------------+
    |count(1)|grouping__id|
    +--------+------------+
    |       1|           2|
    |       1|           2|
    +--------+------------+
    ```
    
    - This PR (Apache Spark 3.4.0, 3.3.1, 3.2.3)
    ```scala
    scala> sql("SELECT count(*), grouping__id from (VALUES (1,1,1),(2,2,2)) AS 
t(k1,k2,v) GROUP BY k1 GROUPING SETS (k2) ").show()
    +--------+------------+
    |count(1)|grouping__id|
    +--------+------------+
    |       1|           2|
    |       1|           2|
    +--------+------------+
    
    scala> sql("set spark.sql.legacy.groupingIdWithAppendedUserGroupBy=true")
    res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
    
    scala> sql("SELECT count(*), grouping__id from (VALUES (1,1,1),(2,2,2)) AS t(k1,k2,v) GROUP BY k1 GROUPING SETS (k2) ").show()
    +--------+------------+
    |count(1)|grouping__id|
    +--------+------------+
    |       1|           1|
    |       1|           1|
    +--------+------------+
    ```
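    
    For reference (not part of the patch), the 1-vs-2 difference above comes from the order of the grouping columns in the `grouping__id` bitmask: the first grouping column maps to the most significant bit, and a bit is set when that column is aggregated away by the current grouping set. A minimal sketch under that assumption, reproducing both values:
    
    ```scala
    // Illustrative only: bit i (MSB-first over the grouping columns) is 1 when
    // that column is NOT part of the current grouping set.
    def groupingId(groupingColumns: Seq[String], groupingSet: Set[String]): Int =
      groupingColumns.foldLeft(0) { (id, col) =>
        (id << 1) | (if (groupingSet.contains(col)) 0 else 1)
      }
    
    groupingId(Seq("k1", "k2"), Set("k2"))  // default order (user group-by first): 0b10 = 2
    groupingId(Seq("k2", "k1"), Set("k2"))  // legacy order (grouping set first):   0b01 = 1
    ```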
    
    ### Does this PR introduce _any_ user-facing change?
    
    No. This simply adds back the previous behavior behind a legacy configuration.
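    
    As a usage note (also not from the patch), the flag can be set when the session is created instead of via the `SET` command shown above; a minimal sketch assuming a plain `SparkSession`:
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    // Opt the whole session back into the pre-3.3.1 / pre-3.2.3 grouping__id ordering.
    val spark = SparkSession.builder()
      .appName("legacy-grouping-id")
      .config("spark.sql.legacy.groupingIdWithAppendedUserGroupBy", "true")
      .getOrCreate()
    ```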
    
    ### How was this patch tested?
    
    Pass the CIs.
    
    Closes #38001 from dongjoon-hyun/SPARK-40562.
    
    Authored-by: Dongjoon Hyun <dongj...@apache.org>
    Signed-off-by: Dongjoon Hyun <dongj...@apache.org>
    (cherry picked from commit 5c0ebf3d97ae49b6e2bd2096c2d590abf4d725bd)
    Signed-off-by: Dongjoon Hyun <dongj...@apache.org>
---
 docs/sql-migration-guide.md                            |  2 ++
 .../spark/sql/catalyst/expressions/grouping.scala      | 18 ++++++++++++++----
 .../scala/org/apache/spark/sql/internal/SQLConf.scala  | 12 ++++++++++++
 3 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index 4214c2b9aee..5c46343d994 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -68,6 +68,8 @@ license: |
   
  - Since Spark 3.3, the precision of the return type of round-like functions has been fixed. This may cause Spark throw `AnalysisException` of the `CANNOT_UP_CAST_DATATYPE` error class when using views created by prior versions. In such cases, you need to recreate the views using ALTER VIEW AS or CREATE OR REPLACE VIEW AS with newer Spark versions.
 
+  - Since Spark 3.3.1 and 3.2.3, for `SELECT ... GROUP BY a GROUPING SETS (b)`-style SQL statements, `grouping__id` returns different values from Apache Spark 3.2.0, 3.2.1, 3.2.2, and 3.3.0. It computes based on user-given group-by expressions plus grouping set columns. To restore the behavior before 3.3.1 and 3.2.3, you can set `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`. For details, see [SPARK-40218](https://issues.apache.org/jira/browse/SPARK-40218) and [SPARK-40562](https:/ [...]
+
 ## Upgrading from Spark SQL 3.1 to 3.2
 
  - Since Spark 3.2, ADD FILE/JAR/ARCHIVE commands require each path to be enclosed by `"` or `'` if the path contains whitespaces.
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala
index 22e25b31f2e..8856a3fe94b 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala
@@ -154,12 +154,22 @@ case class GroupingSets(
   // Note that, we must put `userGivenGroupByExprs` at the beginning, to preserve the order of
   // grouping columns specified by users. For example, GROUP BY (a, b) GROUPING SETS (b, a), the
   // final grouping columns should be (a, b).
-  override def children: Seq[Expression] = userGivenGroupByExprs ++ flatGroupingSets
+  override def children: Seq[Expression] =
+    if (SQLConf.get.groupingIdWithAppendedUserGroupByEnabled) {
+      flatGroupingSets ++ userGivenGroupByExprs
+    } else {
+      userGivenGroupByExprs ++ flatGroupingSets
+    }
+
   override protected def withNewChildrenInternal(
       newChildren: IndexedSeq[Expression]): GroupingSets =
-    copy(
-      userGivenGroupByExprs = newChildren.take(userGivenGroupByExprs.length),
-      flatGroupingSets = newChildren.drop(userGivenGroupByExprs.length))
+    if (SQLConf.get.groupingIdWithAppendedUserGroupByEnabled) {
+      super.legacyWithNewChildren(newChildren).asInstanceOf[GroupingSets]
+    } else {
+      copy(
+        userGivenGroupByExprs = newChildren.take(userGivenGroupByExprs.length),
+        flatGroupingSets = newChildren.drop(userGivenGroupByExprs.length))
+    }
 }
 
 object GroupingSets {
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 7f41e463d89..d9e38ea9258 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -3450,6 +3450,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(false)
 
+   val LEGACY_GROUPING_ID_WITH_APPENDED_USER_GROUPBY =
+    buildConf("spark.sql.legacy.groupingIdWithAppendedUserGroupBy")
+      .internal()
+      .doc("When true, grouping_id() returns values based on grouping set 
columns plus " +
+        "user-given group-by expressions order like Spark 3.2.0, 3.2.1, 3.2.2, 
and 3.3.0.")
+      .version("3.2.3")
+      .booleanConf
+      .createWithDefault(false)
+
   val PARQUET_INT96_REBASE_MODE_IN_WRITE =
     buildConf("spark.sql.parquet.int96RebaseModeInWrite")
       .internal()
@@ -4480,6 +4489,9 @@ class SQLConf extends Serializable with Logging {
 
   def integerGroupingIdEnabled: Boolean = getConf(SQLConf.LEGACY_INTEGER_GROUPING_ID)
 
+  def groupingIdWithAppendedUserGroupByEnabled: Boolean =
+    getConf(SQLConf.LEGACY_GROUPING_ID_WITH_APPENDED_USER_GROUPBY)
+
   def metadataCacheTTL: Long = getConf(StaticSQLConf.METADATA_CACHE_TTL_SECONDS)
 
   def coalesceBucketsInJoinEnabled: Boolean = getConf(SQLConf.COALESCE_BUCKETS_IN_JOIN_ENABLED)


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
