[ https://issues.apache.org/jira/browse/SPARK-31583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102118#comment-17102118 ]
Takeshi Yamamuro edited comment on SPARK-31583 at 5/8/20, 12:09 AM:
--------------------------------------------------------------------

> the order they were first seen in the specified grouping sets.

Ah, I got it. Thanks for the explanation. Yes, as you imagined, Spark currently decides the order based on where it first sees columns in the grouping-set clause when no columns are listed in the group-by clause: [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L552-L555]

I think the most promising approach for getting a predictable order is to list the columns explicitly in the group-by clause, e.g.,
{code:java}
select a, b, c, d, count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
from abc
group by a, b, c, d -- listed in the preferred order
GROUPING SETS (
  (),
  (a,b,d),
  (a,c),
  (a,d)
)
{code}
The suggested approach based on ordinal positions in the select clause looks fine for simple cases, but what about the case where only some of the columns appear in the select clause? e.g.,
{code:java}
select d, a, -- partially selected
  count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
from abc
group by GROUPING SETS (
  (),
  (a,b,d),
  (a,c),
  (a,d)
)
{code}
I personally think this makes the resolution logic complicated and more unpredictable. Btw, do any other DBMS-like systems follow the suggested behaviour? If we change the behaviour, we'd better follow them.

> grouping_id calculation should be improved
> ------------------------------------------
>
> Key: SPARK-31583
> URL: https://issues.apache.org/jira/browse/SPARK-31583
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Costas Piliotis
> Priority: Minor
>
> Unrelated to SPARK-21858, which identifies that grouping_id is determined by exclusion from a grouping set rather than inclusion: when complex grouping sets are specified out of the order of the base select statement, the bit in the grouping_id seems to be flipped when the grouping set is identified rather than according to the order in which the columns are selected in the SQL. I will of course use the exclusion strategy identified in SPARK-21858 as the baseline for this.
>
> {code:scala}
> import spark.implicits._
> val df = Seq(
>   ("a","b","c","d"),
>   ("a","b","c","d"),
>   ("a","b","c","d"),
>   ("a","b","c","d")
> ).toDF("a","b","c","d").createOrReplaceTempView("abc")
> {code}
> I expected these bit assignments in the grouping_id:
> d=1
> c=2
> b=4
> a=8
> {code:scala}
> spark.sql("""
> select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
> from abc
> group by GROUPING SETS (
>   (),
>   (a,b,d),
>   (a,c),
>   (a,d)
> )
> """).show(false)
> {code}
> This returns:
> {noformat}
> +----+----+----+----+--------+---+-------+
> |a   |b   |c   |d   |count(1)|gid|gid_bin|
> +----+----+----+----+--------+---+-------+
> |a   |null|c   |null|4       |6  |110    |
> |null|null|null|null|4       |15 |1111   |
> |a   |null|null|d   |4       |5  |101    |
> |a   |b   |null|d   |4       |1  |1      |
> +----+----+----+----+--------+---+-------+
> {noformat}
> In other words, I expected the bits to be excluded according to the select-clause order, but I received them excluded in the order they were first seen in the specified grouping sets:
> a,b,d included = excludes c=2; expected gid=2, received gid=1
> a,d included = excludes b=4, c=2; expected gid=6, received gid=5
> (Both schemes are worked through in the sketch at the end of this message.)
> The grouping_id that is actually produced corresponds to the column order (a,b,d,c):
> {code:scala}
> spark.sql("""
> select a,b,c,d,count(*), grouping_id(a,b,d,c) as gid,
> bin(grouping_id(a,b,d,c)) as gid_bin
> from abc
> group by GROUPING SETS (
>   (),
>   (a,b,d),
>   (a,c),
>   (a,d)
> )
> """).show(false)
> {code}
> The columns forming the grouping_id seem to be assigned bits as the grouping sets are identified, rather than by ordinal position in the parent query.
> I'd like to at least point out that grouping_id is documented in many other RDBMSs, and I believe the Spark project should adopt a policy of flipping the bits so that 1=inclusion and 0=exclusion in the grouping set.
> However, many RDBMSs that do have a grouping_id feature implement it by the ordinal position of the fields recognized in the select clause, rather than allocating bits as the columns are observed in the grouping sets.
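For illustration, here is a minimal, self-contained Scala sketch (plain Scala, not Spark's implementation) of the two bit-assignment schemes discussed above, assuming the column orders (a,b,c,d) by ordinal position in the select clause and (a,b,d,c) as first seen in the grouping sets:
{code:scala}
object GroupingIdSketch {
  // grouping_id under the exclusion convention (SPARK-21858): a bit is 1 when
  // the column is NOT in the grouping set; the first column gets the highest bit.
  def groupingId(columnOrder: Seq[String], groupingSet: Set[String]): Int =
    columnOrder.zipWithIndex.foldLeft(0) { case (gid, (col, i)) =>
      val bit = columnOrder.length - 1 - i
      if (groupingSet.contains(col)) gid else gid | (1 << bit)
    }

  def main(args: Array[String]): Unit = {
    val sets = Seq(Set.empty[String], Set("a", "b", "d"), Set("a", "c"), Set("a", "d"))
    val ordinalOrder   = Seq("a", "b", "c", "d") // order in the select clause
    val firstSeenOrder = Seq("a", "b", "d", "c") // order first seen in the grouping sets
    for (s <- sets) {
      val expected = groupingId(ordinalOrder, s)   // d=1, c=2, b=4, a=8
      val received = groupingId(firstSeenOrder, s) // what the report observes
      println(f"set=${s.mkString("(", ",", ")")}%-9s expected=$expected%2d received=$received%2d")
    }
  }
}
{code}
Running this reproduces the pairs from the report, e.g. gid 2 vs. 1 for (a,b,d) and 6 vs. 5 for (a,d), which makes the difference between the two schemes concrete.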