Yin Huai created SPARK-10169:
--------------------------------

             Summary: Evaluating AggregateFunction1 (old code path) may return wrong answers when grouping expressions are used as arguments of aggregate functions
                 Key: SPARK-10169
                 URL: https://issues.apache.org/jira/browse/SPARK-10169
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.4.1, 1.3.1, 1.2.2, 1.1.1
            Reporter: Yin Huai
            Assignee: Yin Huai
            Priority: Critical


Before Spark 1.5, if an aggregate function uses a grouping expression as an input argument, the result of the query can be wrong. The reason is that we use transformUp when rewriting the aggregate result expressions (see https://github.com/apache/spark/blob/branch-1.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala#L154): the grouping expression inside the aggregate function's argument is rewritten to the partial-group attribute before the aggregate expression itself can be matched, so the final aggregate re-evaluates the function over the partial rows instead of merging the partial result.
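
A minimal, Catalyst-free sketch of that failure mode (hypothetical Expr/Sum/If/Attr types introduced only for illustration, not Spark's Expression API):
{code}
// Toy illustration (NOT Spark code): a bottom-up rewrite with two rules, one that
// should replace the whole aggregate call with its partial result and one that
// replaces the grouping expression with the partial-group attribute.
sealed trait Expr {
  def transformUp(rule: PartialFunction[Expr, Expr]): Expr = {
    // Rewrite children first, then try the rule on the (possibly rebuilt) node.
    val withNewChildren = this match {
      case Sum(child)  => Sum(child.transformUp(rule))
      case If(c, t, f) => If(c.transformUp(rule), t.transformUp(rule), f.transformUp(rule))
      case leaf        => leaf
    }
    rule.applyOrElse(withNewChildren, (e: Expr) => e)
  }
}
case class Attr(name: String)            extends Expr
case class GroupExpr(desc: String)       extends Expr // stands for `i % 10` (the `= 5` is elided)
case class If(c: Expr, t: Expr, f: Expr) extends Expr
case class Sum(child: Expr)              extends Expr

object RewriteDemo extends App {
  val grouping  = GroupExpr("i % 10")
  val aggregate = Sum(If(grouping, Attr("1"), Attr("0"))) // sum(if(i % 10 = 5, 1, 0))

  val rewritten = aggregate.transformUp {
    case e if e == aggregate => Attr("PartialSum#232L")  // never fires: the Sum was rebuilt
    case `grouping`          => Attr("PartialGroup#234") // fires first, bottom-up
  }

  // Prints Sum(If(Attr(PartialGroup#234),Attr(1),Attr(0))) instead of
  // Attr(PartialSum#232L), i.e. the final aggregate re-sums the partial rows.
  println(rewritten)
}
{code}
In the sketch, a top-down traversal would match the whole aggregate before its argument is touched; with the bottom-up traversal the rebuilt Sum node no longer matches, which mirrors what the explain() output below shows (the final Aggregate recomputes the SUM over PartialGroup#234 instead of reading PartialSum#232L).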
 

To reproduce the problem, you can use
{code}
import org.apache.spark.sql.functions._
sc.parallelize((1 to 1000), 50).map(i => Tuple1(i)).toDF("i").registerTempTable("t")
sqlContext.sql(""" 
select i % 10, sum(if(i % 10 = 5, 1, 0)), count(i)
from t
where i % 10 = 5
group by i % 10""").explain()

== Physical Plan ==
Aggregate false, [PartialGroup#234], [PartialGroup#234 AS _c0#225,SUM(CAST(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf((PartialGroup#234 = 5),1,0), LongType)) AS _c1#226L,Coalesce(SUM(PartialCount#233L),0) AS _c2#227L]
 Exchange (HashPartitioning [PartialGroup#234], 200)
  Aggregate true, [(i#191 % 10)], [(i#191 % 10) AS PartialGroup#234,SUM(CAST(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf(((i#191 % 10) = 5),1,0), LongType)) AS PartialSum#232L,COUNT(1) AS PartialCount#233L]
   Project [_1#190 AS i#191]
    Filter ((_1#190 % 10) = 5)
     PhysicalRDD [_1#190], MapPartitionsRDD[93] at mapPartitions at ExistingRDD.scala:37

sqlContext.sql(""" 
select i % 10, sum(if(i % 10 = 5, 1, 0)), count(i)
from t
where i % 10 = 5
group by i % 10""").show

_c0 _c1 _c2
5   50  100
{code}
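
For reference, the expected answer can be checked with plain Scala (no Spark): 100 of the values in 1 to 1000 satisfy i % 10 = 5, so both the sum and the count should be 100, while the query above returns 50 for the sum.
{code}
// Plain Scala check of what the query should return (no Spark involved).
val rows     = (1 to 1000).filter(_ % 10 == 5)   // the 100 values 5, 15, ..., 995
val expected = (5, rows.map(i => if (i % 10 == 5) 1 else 0).sum, rows.size)
println(expected)                                // (5,100,100), not (5,50,100)
{code}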

In Spark 1.5, the new aggregation code path does not have this problem. The old code path is fixed by https://github.com/apache/spark/commit/dd9ae7945ab65d353ed2b113e0c1a00a0533ffd6.


