[ 
https://issues.apache.org/jira/browse/SPARK-10169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-10169:
-------------------------------
    Labels: correctness  (was: )

> Evaluating AggregateFunction1 (old code path) may return wrong answers when 
> grouping expressions are used as arguments of aggregate functions
> ---------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-10169
>                 URL: https://issues.apache.org/jira/browse/SPARK-10169
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.1.1, 1.2.2, 1.3.1, 1.4.1
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>            Priority: Critical
>              Labels: correctness
>             Fix For: 1.3.2, 1.4.2
>
>
> Before Spark 1.5, if an aggregate function use an grouping expression as 
> input argument, the result of the query can be wrong. The reason is we are 
> using transformUp when we do aggregate results rewriting (see 
> https://github.com/apache/spark/blob/branch-1.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala#L154).
>  
> To reproduce the problem, you can use
> {code}
> import org.apache.spark.sql.functions._
> sc.parallelize((1 to 1000), 50).map(i => 
> Tuple1(i)).toDF("i").registerTempTable("t")
> sqlContext.sql(""" 
> select i % 10, sum(if(i % 10 = 5, 1, 0)), count(i)
> from t
> where i % 10 = 5
> group by i % 10""").explain()
> == Physical Plan ==
> Aggregate false, [PartialGroup#234], [PartialGroup#234 AS 
> _c0#225,SUM(CAST(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf((PartialGroup#234
>  = 5),1,0), LongType)) AS _c1#226L,Coalesce(SUM(PartialCount#233L),0) AS 
> _c2#227L]
>  Exchange (HashPartitioning [PartialGroup#234], 200)
>   Aggregate true, [(i#191 % 10)], [(i#191 % 10) AS 
> PartialGroup#234,SUM(CAST(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf(((i#191
>  % 10) = 5),1,0), LongType)) AS PartialSum#232L,COUNT(1) AS PartialCount#233L]
>    Project [_1#190 AS i#191]
>     Filter ((_1#190 % 10) = 5)
>      PhysicalRDD [_1#190], MapPartitionsRDD[93] at mapPartitions at 
> ExistingRDD.scala:37
> sqlContext.sql(""" 
> select i % 10, sum(if(i % 10 = 5, 1, 0)), count(i)
> from t
> where i % 10 = 5
> group by i % 10""").show
> _c0 _c1 _c2
> 5   50  100
> {code}
> In Spark 1.5, new aggregation code path does not have the problem. The old 
> code path is fixed by 
> https://github.com/apache/spark/commit/dd9ae7945ab65d353ed2b113e0c1a00a0533ffd6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to