This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new dc9041a9df0 [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`

dc9041a9df0 is described below

commit dc9041a9df0f2f338de1c06c1485a0ab70ff3e16
Author: Dongjoon Hyun <dongj...@apache.org>
AuthorDate: Tue Sep 27 02:14:22 2022 -0700

    [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`

    ### What changes were proposed in this pull request?

    This PR aims to add a new legacy configuration to keep the `grouping__id` values the same as the released Apache Spark 3.2.0, 3.2.1, 3.2.2, and 3.3.0.

    Please note that this syntax is not part of the SQL standard, and even Hive does not support it:

    ```SQL
    hive> SELECT version();
    OK
    3.1.3 r4df4d75bf1e16fe0af75aad0b4179c34c07fc975
    Time taken: 0.111 seconds, Fetched: 1 row(s)

    hive> SELECT count(*), grouping__id from t GROUP BY a GROUPING SETS (b);
    FAILED: SemanticException 1:63 [Error 10213]: Grouping sets expression is not in GROUP BY key. Error encountered near token 'b'
    ```

    ### Why are the changes needed?

    SPARK-40218 fixed a bug introduced by SPARK-34932 (at Apache Spark 3.2.0). As a side effect, the `grouping__id` values changed.

    - Apache Spark 3.2.0, 3.2.1, 3.2.2, and 3.3.0:

    ```scala
    scala> sql("SELECT count(*), grouping__id from (VALUES (1,1,1),(2,2,2)) AS t(k1,k2,v) GROUP BY k1 GROUPING SETS (k2) ").show()
    +--------+------------+
    |count(1)|grouping__id|
    +--------+------------+
    |       1|           1|
    |       1|           1|
    +--------+------------+
    ```

    - After SPARK-40218 (Apache Spark 3.4.0, 3.3.1, and 3.2.3):

    ```scala
    scala> sql("SELECT count(*), grouping__id from (VALUES (1,1,1),(2,2,2)) AS t(k1,k2,v) GROUP BY k1 GROUPING SETS (k2) ").show()
    +--------+------------+
    |count(1)|grouping__id|
    +--------+------------+
    |       1|           2|
    |       1|           2|
    +--------+------------+
    ```

    - This PR (Apache Spark 3.4.0, 3.3.1, and 3.2.3):

    ```scala
    scala> sql("SELECT count(*), grouping__id from (VALUES (1,1,1),(2,2,2)) AS t(k1,k2,v) GROUP BY k1 GROUPING SETS (k2) ").show()
    +--------+------------+
    |count(1)|grouping__id|
    +--------+------------+
    |       1|           2|
    |       1|           2|
    +--------+------------+

    scala> sql("set spark.sql.legacy.groupingIdWithAppendedUserGroupBy=true")
    res1: org.apache.spark.sql.DataFrame = [key: string, value: string]

    scala> sql("SELECT count(*), grouping__id from (VALUES (1,1,1),(2,2,2)) AS t(k1,k2,v) GROUP BY k1 GROUPING SETS (k2) ").show()
    +--------+------------+
    |count(1)|grouping__id|
    +--------+------------+
    |       1|           1|
    |       1|           1|
    +--------+------------+
    ```

    ### Does this PR introduce _any_ user-facing change?

    No. This simply adds back the previous behavior via the legacy configuration.

    ### How was this patch tested?

    Pass the CIs.

    Closes #38001 from dongjoon-hyun/SPARK-40562.

    Authored-by: Dongjoon Hyun <dongj...@apache.org>
    Signed-off-by: Dongjoon Hyun <dongj...@apache.org>
    (cherry picked from commit 5c0ebf3d97ae49b6e2bd2096c2d590abf4d725bd)
    Signed-off-by: Dongjoon Hyun <dongj...@apache.org>
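A note on why the child order changes the value: `grouping__id` is a bitmask over the final grouping columns, where a column's bit is 1 when that column is aggregated away (absent from the current grouping set) and the first grouping column maps to the most significant bit. The following minimal model is a sketch, not Spark's actual implementation, showing how the two orders yield 2 and 1 for the example query above:

```scala
// Toy model of the grouping-ID bitmask: the i-th grouping column
// (counting from the most significant bit) contributes a 1 bit when
// it is NOT part of the current grouping set.
def groupingId(groupingCols: Seq[String], groupingSet: Set[String]): Int =
  groupingCols.zipWithIndex.map { case (col, i) =>
    val bit = if (groupingSet.contains(col)) 0 else 1
    bit << (groupingCols.length - 1 - i)
  }.sum

// GROUP BY k1 GROUPING SETS (k2): the only grouping set is {k2}.
groupingId(Seq("k1", "k2"), Set("k2")) // fixed order (user exprs first): 0b10 = 2
groupingId(Seq("k2", "k1"), Set("k2")) // legacy order (grouping sets first): 0b01 = 1
```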
---
 docs/sql-migration-guide.md                           |  2 ++
 .../spark/sql/catalyst/expressions/grouping.scala     | 18 ++++++++++++++----
 .../scala/org/apache/spark/sql/internal/SQLConf.scala | 12 ++++++++++++
 3 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index da132544780..93d45cdc408 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -105,6 +105,8 @@ license: |
 
   - In Spark 3.2, date +/- interval with only day-time fields such as `date '2011-11-11' + interval 12 hours` returns timestamp. In Spark 3.1 and earlier, the same expression returns date. To restore the behavior before Spark 3.2, you can use `cast` to convert timestamp as date.
 
+  - Since Spark 3.2.3, for `SELECT ... GROUP BY a GROUPING SETS (b)`-style SQL statements, `grouping__id` returns different values from Apache Spark 3.2.0, 3.2.1, and 3.2.2. It is computed based on the user-given group-by expressions plus the grouping set columns. To restore the behavior before 3.2.3, you can set `spark.sql.legacy.groupingIdWithAppendedUserGroupBy` to `true`. For details, see [SPARK-40218](https://issues.apache.org/jira/browse/SPARK-40218) and [SPARK-40562](https://issues.apache.org/jira/browse/SPARK-40562).
+
 ## Upgrading from Spark SQL 3.0 to 3.1
 
   - In Spark 3.1, statistical aggregation function includes `std`, `stddev`, `stddev_samp`, `variance`, `var_samp`, `skewness`, `kurtosis`, `covar_samp`, `corr` will return `NULL` instead of `Double.NaN` when `DivideByZero` occurs during expression evaluation, for example, when `stddev_samp` applied on a single element set. In Spark version 3.0 and earlier, it will return `Double.NaN` in such case. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.statisticalAggrega [...]

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala
index 22e25b31f2e..8856a3fe94b 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala
@@ -154,12 +154,22 @@ case class GroupingSets(
   // Note that, we must put `userGivenGroupByExprs` at the beginning, to preserve the order of
   // grouping columns specified by users. For example, GROUP BY (a, b) GROUPING SETS (b, a), the
   // final grouping columns should be (a, b).
-  override def children: Seq[Expression] = userGivenGroupByExprs ++ flatGroupingSets
+  override def children: Seq[Expression] =
+    if (SQLConf.get.groupingIdWithAppendedUserGroupByEnabled) {
+      flatGroupingSets ++ userGivenGroupByExprs
+    } else {
+      userGivenGroupByExprs ++ flatGroupingSets
+    }
+
   override protected def withNewChildrenInternal(
       newChildren: IndexedSeq[Expression]): GroupingSets =
-    copy(
-      userGivenGroupByExprs = newChildren.take(userGivenGroupByExprs.length),
-      flatGroupingSets = newChildren.drop(userGivenGroupByExprs.length))
+    if (SQLConf.get.groupingIdWithAppendedUserGroupByEnabled) {
+      super.legacyWithNewChildren(newChildren).asInstanceOf[GroupingSets]
+    } else {
+      copy(
+        userGivenGroupByExprs = newChildren.take(userGivenGroupByExprs.length),
+        flatGroupingSets = newChildren.drop(userGivenGroupByExprs.length))
+    }
 }
 
 object GroupingSets {
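A side note on the `withNewChildrenInternal` branch above: under the legacy child order, slicing `newChildren` by `userGivenGroupByExprs.length` would assign the grouping-set expressions to the wrong field, which is why the patch falls back to `legacyWithNewChildren` in that case. A toy illustration, with plain `Seq[String]`s standing in for Catalyst expressions:

```scala
// Stand-ins for the two expression lists held by GroupingSets.
val userGivenGroupByExprs = Seq("k1")
val flatGroupingSets = Seq("k2")

// Fixed order: children = user group-by exprs ++ grouping sets, so
// take(userGivenGroupByExprs.length) recovers the user expressions.
val fixed = userGivenGroupByExprs ++ flatGroupingSets
assert(fixed.take(userGivenGroupByExprs.length) == Seq("k1"))

// Legacy order: children = grouping sets ++ user group-by exprs, so
// the same slice would grab the grouping-set expressions instead.
val legacy = flatGroupingSets ++ userGivenGroupByExprs
assert(legacy.take(userGivenGroupByExprs.length) == Seq("k2"))
```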
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 33765619823..b28bfeee245 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -3154,6 +3154,15 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)
 
+  val LEGACY_GROUPING_ID_WITH_APPENDED_USER_GROUPBY =
+    buildConf("spark.sql.legacy.groupingIdWithAppendedUserGroupBy")
+      .internal()
+      .doc("When true, grouping_id() returns values based on grouping set columns plus " +
+        "user-given group-by expressions order like Spark 3.2.0, 3.2.1, 3.2.2, and 3.3.0.")
+      .version("3.2.3")
+      .booleanConf
+      .createWithDefault(false)
+
   val PARQUET_INT96_REBASE_MODE_IN_WRITE = buildConf("spark.sql.parquet.int96RebaseModeInWrite")
     .internal()
@@ -4109,6 +4118,9 @@ class SQLConf extends Serializable with Logging {
 
   def integerGroupingIdEnabled: Boolean = getConf(SQLConf.LEGACY_INTEGER_GROUPING_ID)
 
+  def groupingIdWithAppendedUserGroupByEnabled: Boolean =
+    getConf(SQLConf.LEGACY_GROUPING_ID_WITH_APPENDED_USER_GROUPBY)
+
   def metadataCacheTTL: Long = getConf(StaticSQLConf.METADATA_CACHE_TTL_SECONDS)
 
   def coalesceBucketsInJoinEnabled: Boolean = getConf(SQLConf.COALESCE_BUCKETS_IN_JOIN_ENABLED)
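For completeness, a sketch of how the new flag is consumed, assuming a build that includes this commit and using only names present in the patch: Catalyst code reads configuration through the thread-local `SQLConf.get`, so the accessor added above is shorthand for reading the `ConfigEntry` directly.

```scala
import org.apache.spark.sql.internal.SQLConf

// The active session's conf comes from a thread-local, so expressions
// such as GroupingSets need no explicit SparkSession reference.
val conf = SQLConf.get

// The convenience accessor added by this patch ...
val viaAccessor = conf.groupingIdWithAppendedUserGroupByEnabled
// ... reads the same entry as a direct getConf call.
val viaEntry = conf.getConf(SQLConf.LEGACY_GROUPING_ID_WITH_APPENDED_USER_GROUPBY)
assert(viaAccessor == viaEntry)
```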