[GitHub] [spark] AngersZhuuuu commented on a change in pull request #30144: [SPARK-33229][SQL] Support GROUP BY use Separate columns and CUBE/ROLLUP

GitBox Tue, 03 Nov 2020 06:05:39 -0800


AngersZhuuuu commented on a change in pull request #30144:
URL: https://github.com/apache/spark/pull/30144#discussion_r515926187




##########
File path: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
##########
@@ -3691,6 +3691,32 @@ class SQLQuerySuite extends QueryTest with 
SharedSparkSession with AdaptiveSpark
       checkAnswer(sql("SELECT id FROM t WHERE (SELECT true)"), Row(0L))
     }
   }
+
+  test("SPARK-33229: Support GROUP BY use Separate columns and CUBE/ROLLUP") {
+    withTable("t") {
+      sql("CREATE TABLE t USING PARQUET AS SELECT id AS a, id AS b, id AS c 
FROM range(1)")
+      checkAnswer(sql("SELECT a, b, c, count(*) FROM t GROUP BY CUBE(a, b, 
c)"),
+        Row(0, 0, 0, 1) :: Row(0, 0, null, 1) ::
+          Row(0, null, 0, 1) :: Row(0, null, null, 1) ::
+          Row(null, 0, 0, 1) :: Row(null, 0, null, 1) ::
+          Row(null, null, 0, 1) :: Row(null, null, null, 1) :: Nil)
+      checkAnswer(sql("SELECT a, b, c, count(*) FROM t GROUP BY a, CUBE(b, 
c)"),

Review comment:
       > what's the semantic of it?
   
   If we want some dimensional analysis group by `a` and  different dimensional 
about combine `b` & `c`, in current we need to write 
   `group by cube(a, b, c)` and `where a !=NULL` to remove interfering data， 
with this patch we can just write 
   
   ```
   group by a, cube(b, c)
   ```
   And this set of PR can make  Grouping Analytics more flexible as Postgres 
SQL.  And we do have this need for analysis。
   
   

##########
File path: sql/core/src/test/resources/sql-tests/inputs/group-analytics.sql
##########
@@ -59,4 +59,12 @@ SELECT course, year FROM courseSales GROUP BY CUBE(course, 
year) ORDER BY groupi
 -- Aliases in SELECT could be used in ROLLUP/CUBE/GROUPING SETS
 SELECT a + b AS k1, b AS k2, SUM(a - b) FROM testData GROUP BY CUBE(k1, k2);
 SELECT a + b AS k, b, SUM(a - b) FROM testData GROUP BY ROLLUP(k, b);
-SELECT a + b, b AS k, SUM(a - b) FROM testData GROUP BY a + b, k GROUPING 
SETS(k)
+SELECT a + b, b AS k, SUM(a - b) FROM testData GROUP BY a + b, k GROUPING 
SETS(k);
+
+-- GROUP BY use mixed Separate columns and CUBE/ROLLUP
+SELECT a, b, count(1) FROM testData GROUP BY a, b, CUBE(a, b);
+SELECT a, b, count(1) FROM testData GROUP BY a, b, ROLLUP(a, b);
+SELECT a, b, count(1) FROM testData GROUP BY CUBE(a, b), ROLLUP(a, b);
+SELECT a, b, count(1) FROM testData GROUP BY a, CUBE(a, b), ROLLUP(b);
+SELECT a, b, count(1) FROM testData GROUP BY CUBE(a, b), ROLLUP(a, b) GROUPING 
SETS(a, b);

Review comment:
       This is not unsupported, but it can be fixed easy after refactor 
GROUPING ANALYTICS




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AngersZhuuuu commented on a change in pull request #30144: [SPARK-33229][SQL] Support GROUP BY use Separate columns and CUBE/ROLLUP

Reply via email to