[ https://issues.apache.org/jira/browse/SPARK-21858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16166467#comment-16166467 ]
Dongjoon Hyun commented on SPARK-21858:
---------------------------------------

Thank you for the conclusion, [~cloud_fan]!

> Make Spark grouping_id() compatible with Hive grouping__id
> ----------------------------------------------------------
>
>                 Key: SPARK-21858
>                 URL: https://issues.apache.org/jira/browse/SPARK-21858
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Yann Byron
>
> If you migrate ETL jobs that use `grouping__id` in Hive to Spark and
> replace Hive `grouping__id` with Spark `grouping_id()`, you will find
> that the two evaluate differently.
> Here is an example.
> {code:java}
> select A, B, grouping__id/grouping_id() from t group by A, B grouping
> sets((), (A), (B), (A,B))
> {code}
> Running it on Hive and on Spark separately gives the following. (An
> attribute that is selected in the grouping set is marked (/), and one
> that is not selected is marked (x).)
> ||A B||Binary Expression in Spark||Spark||Hive||Binary Expression in Hive||B A||
> |(x) (x)|11|3|0|00|(x) (x)|
> |(x) (/)|10|2|2|10|(/) (x)|
> |(/) (x)|01|1|1|01|(x) (/)|
> |(/) (/)|00|0|3|11|(/) (/)|
> As shown above, in Hive (/) is set to 1 and (x) is set to 0, and in Spark
> it is the opposite.
> Moreover, Hive reverses the order of the `group by` attributes before
> evaluating them, whereas Spark evaluates them in the order given.
> I suggest modifying the behavior of `grouping_id()` to make it compatible
> with Hive `grouping__id`.
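
For readers comparing the two encodings, here is a minimal Python sketch that reproduces the table in the issue description. It models only the behavior reported in that table and has not been checked against any particular Hive or Spark release. The function names (`spark_grouping_id`, `hive_grouping_id`, `spark_to_hive`) are hypothetical helpers written for this illustration and are not part of either project.

```python
from typing import Sequence, Set


def spark_grouping_id(group_by: Sequence[str], selected: Set[str]) -> int:
    """Encoding shown in the Spark column of the table (assumed model).

    The first GROUP BY column is the most significant bit; a bit is 1 when
    the column is NOT in the current grouping set.
    """
    gid = 0
    for col in group_by:
        gid = (gid << 1) | (0 if col in selected else 1)
    return gid


def hive_grouping_id(group_by: Sequence[str], selected: Set[str]) -> int:
    """Encoding shown in the Hive column of the table (assumed model).

    The GROUP BY columns are reversed first, so the first column becomes the
    least significant bit; a bit is 1 when the column IS in the grouping set.
    """
    gid = 0
    for col in reversed(group_by):
        gid = (gid << 1) | (1 if col in selected else 0)
    return gid


def spark_to_hive(spark_id: int, n_columns: int) -> int:
    """Convert a Spark-style id to the Hive-style id from the table.

    Step 1: invert the low n_columns bits. Step 2: reverse their order.
    """
    flipped = ~spark_id & ((1 << n_columns) - 1)
    out = 0
    for _ in range(n_columns):
        out = (out << 1) | (flipped & 1)
        flipped >>= 1
    return out


if __name__ == "__main__":
    cols = ["A", "B"]
    for grouping_set in [set(), {"B"}, {"A"}, {"A", "B"}]:
        s = spark_grouping_id(cols, grouping_set)
        h = hive_grouping_id(cols, grouping_set)
        print(sorted(grouping_set), "spark =", s, "hive =", h,
              "converted =", spark_to_hive(s, len(cols)))
```

Under these assumptions, a migrated query could keep its Hive-style values by applying the same two steps (invert the bits, then reverse them) to the result of `grouping_id()`.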