[jira] [Comment Edited] (SPARK-31583) grouping_id calculation should be improved

Costas Piliotis (Jira) Tue, 28 Apr 2020 09:17:24 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-31583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094651#comment-17094651
 ]


Costas Piliotis edited comment on SPARK-31583 at 4/28/20, 4:16 PM:
-------------------------------------------------------------------

[~maropu] I'm trying to avoid referencing the SPARK-21858 which already 
addresses the flipped bits.  Specifically this is about how spark decides where 
to allocate the grouping_id based on the ordinal position in the grouping sets 
rather than the ordinal position in the select clause.   Does that make sense?

So if I have  SELECT a,b,c,d FROM... GROUPING SETS (  (a,b,d), (a,b,c) ) the 
grouping_id bits would be determined as cdba the or instead of dcba.    I 
believe if we look at most RDBMS that has grouping sets identified, my only 
suggestion is that it would be more predictable if the bit order in the 
grouping_id were determined by the ordinal position in the select.   

The flipped bits, is a separate ticket and I do believe the implementation 
should be predictably the same as other implementation in established RDBMS SQL 
implementations where 1=included, 0=excluded, but that matter is closed to 
discussion.   


was (Author: cpiliotis):
[~maropu] I'm trying to avoid referencing the SPARK-21858 which already 
addresses the flipped bits.  Specifically this is about how spark decides where 
to allocate the grouping_id based on the ordinal position in the grouping sets 
rather than the ordinal position in the select clause.   Does that make sense?

So if I have  SELECT a,b,c,d FROM... GROUPING SETS (  (a,b,d), (a,b,c) ) the 
grouping_id would be abdc instead of abcd.    I believe if we look at most 
RDBMS that has grouping sets identified, my only suggestion is that it would be 
more predictable if the bit order in the grouping_id were determined by the 
ordinal position in the select.   

> grouping_id calculation should be improved
> ------------------------------------------
>
>                 Key: SPARK-31583
>                 URL: https://issues.apache.org/jira/browse/SPARK-31583
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Costas Piliotis
>            Priority: Minor
>
> Unrelated to SPARK-21858 which identifies that grouping_id is determined by 
> exclusion from a grouping_set rather than inclusion, when performing complex 
> grouping_sets that are not in the order of the base select statement, 
> flipping the bit in the grouping_id seems to be happen when the grouping set 
> is identified rather than when the columns are selected in the sql.   I will 
> of course use the exclusion strategy identified in SPARK-21858 as the 
> baseline for this.  
>  
> {code:scala}
> import spark.implicits._
> val df= Seq(
>  ("a","b","c","d"),
>  ("a","b","c","d"),
>  ("a","b","c","d"),
>  ("a","b","c","d")
> ).toDF("a","b","c","d").createOrReplaceTempView("abc")
> {code}
> expected to have these references in the grouping_id:
>  d=1
>  c=2
>  b=4
>  a=8
> {code:scala}
> spark.sql("""
>  select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
>  from abc
>  group by GROUPING SETS (
>  (),
>  (a,b,d),
>  (a,c),
>  (a,d)
>  )
>  """).show(false)
> {code}
> This returns:
> {noformat}
> +----+----+----+----+--------+---+-------+
> |a   |b   |c   |d   |count(1)|gid|gid_bin|
> +----+----+----+----+--------+---+-------+
> |a   |null|c   |null|4       |6  |110    |
> |null|null|null|null|4       |15 |1111   |
> |a   |null|null|d   |4       |5  |101    |
> |a   |b   |null|d   |4       |1  |1      |
> +----+----+----+----+--------+---+-------+
> {noformat}
>  
>  In other words, I would have expected the excluded values one way but I 
> received them excluded in the order they were first seen in the specified 
> grouping sets.
>  a,b,d included = excldes c = 2; expected gid=2. received gid=1
>  a,d included = excludes b=4, c=2 expected gid=6, received gid=5
> The grouping_id that actually is expected is (a,b,d,c) 
> {code:scala}
> spark.sql("""
>  select a,b,c,d,count(*), grouping_id(a,b,d,c) as gid, 
> bin(grouping_id(a,b,d,c)) as gid_bin
>  from abc
>  group by GROUPING SETS (
>  (),
>  (a,b,d),
>  (a,c),
>  (a,d)
>  )
>  """).show(false)
> {code}
>  columns forming groupingid seem to be creatred as the grouping sets are 
> identified rather than ordinal position in parent query.
> I'd like to at least point out that grouping_id is documented in many other 
> rdbms and I believe the spark project should use a policy of flipping the 
> bits so 1=inclusion; 0=exclusion in the grouping set.
> However many rdms that do have the feature of a grouping_id do implement it 
> by the ordinal position recognized as fields in the select clause, rather 
> than allocating them as they are observed in the grouping sets.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-31583) grouping_id calculation should be improved

Reply via email to