[
https://issues.apache.org/jira/browse/SPARK-11275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14986764#comment-14986764
]
Apache Spark commented on SPARK-11275:
--------------------------------------
User 'aray' has created a pull request for this issue:
https://github.com/apache/spark/pull/9429
> [SQL] Incorrect results when using rollup/cube
> -----------------------------------------------
>
> Key: SPARK-11275
> URL: https://issues.apache.org/jira/browse/SPARK-11275
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.3.0, 1.4.0, 1.5.1
> Reporter: Xiao Li
>
> Spark SQL is unable to generate a correct result when the following query
> using rollup.
> "select a, b, sum(a + b) as sumAB, GROUPING__ID from mytable group by a,
> b with rollup"
> Spark SQL generates a wrong result:
> [2,4,6,3]
> [2,null,null,1]
> [1,null,null,1]
> [null,null,null,0]
> [1,2,3,3]
> The table mytable is super simple, containing two rows and two columns:
> testData = Seq((1, 2), (2, 4)).toDF("a", "b")
> After turning off codegen, the query plan is like
> == Parsed Logical Plan ==
> 'Rollup ['a,'b],
> [unresolvedalias('a),unresolvedalias('b),unresolvedalias('sum(('a + 'b)) AS
> sumAB#20),unresolvedalias('GROUPING__ID)]
> 'UnresolvedRelation `mytable`, None
> == Analyzed Logical Plan ==
> a: int, b: int, sumAB: bigint, GROUPING__ID: int
> Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as
> bigint)) AS sumAB#20L,GROUPING__ID#23]
> Expand [0,1,3], [a#2,b#3], grouping__id#23
> Subquery mytable
> Project [_1#0 AS a#2,_2#1 AS b#3]
> LocalRelation [_1#0,_2#1], [[1,2],[2,4]]
> == Optimized Logical Plan ==
> Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as
> bigint)) AS sumAB#20L,GROUPING__ID#23]
> Expand [0,1,3], [a#2,b#3], grouping__id#23
> LocalRelation [a#2,b#3], [[1,2],[2,4]]
> == Physical Plan ==
> Aggregate false, [a#2,b#3,grouping__id#23], [a#2,b#3,sum(PartialSum#24L) AS
> sumAB#20L,grouping__id#23]
> Exchange hashpartitioning(a#2,b#3,grouping__id#23,5)
> Aggregate true, [a#2,b#3,grouping__id#23],
> [a#2,b#3,grouping__id#23,sum(cast((a#2 + b#3) as bigint)) AS PartialSum#24L]
> Expand [List(null, null, 0),List(a#2, null, 1),List(a#2, b#3, 3)],
> [a#2,b#3,grouping__id#23]
> LocalTableScan [a#2,b#3], [[1,2],[2,4]]
> Below are my observations:
> 1. Generation of GROUP__ID looks OK.
> 2. The problem still exists no matter whether turning on/off CODEGEN
> 3. Rollup still works in a simple query when group-by columns have only one
> column. For example, "select b, sum(a), GROUPING__ID from mytable group by b
> with rollup"
> 4. The buckets in "HiveDataFrameAnalytcisSuite" are misleading.
> Unfortunately, they hide the bugs. Although the buckets passed, they just
> compare the results of SQL and Dataframe. This way is unable to capture the
> regression when both return the same wrong results.
> 5. The same problem also exists in cube. I have not started the investigation
> in cube, but I believe the root causes should be the same.
> 6. It looks like all the logical plans are correct.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]