[ https://issues.apache.org/jira/browse/SPARK-10169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Rosen updated SPARK-10169:
-------------------------------
    Labels: correctness  (was: )

> Evaluating AggregateFunction1 (old code path) may return wrong answers when
> grouping expressions are used as arguments of aggregate functions
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-10169
>                 URL: https://issues.apache.org/jira/browse/SPARK-10169
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.1.1, 1.2.2, 1.3.1, 1.4.1
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>            Priority: Critical
>              Labels: correctness
>             Fix For: 1.3.2, 1.4.2
>
>
> Before Spark 1.5, if an aggregate function uses a grouping expression as an
> input argument, the result of the query can be wrong. The reason is that we
> use transformUp when rewriting the aggregate result expressions (see
> https://github.com/apache/spark/blob/branch-1.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala#L154).
> Because transformUp visits children first, the grouping expression inside the
> aggregate function's argument is replaced by the partial grouping attribute
> before the aggregate function itself is matched, so the final aggregate
> re-evaluates the function on partial rows instead of merging partial results.
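The effect of the traversal order can be sketched on a toy expression tree. This is only an illustration: the case classes, `RewriteDemo`, and the two-case rule below are hypothetical stand-ins and not Catalyst's actual classes or the code in patterns.scala. `Attr("PartialSum")` stands in for the final evaluation of the split aggregate.

```scala
// Toy model only: hypothetical stand-ins, not Catalyst's Expression classes.
sealed trait Expr
case class Attr(name: String) extends Expr
case class Lit(value: Int) extends Expr
case class Mod(left: Expr, right: Expr) extends Expr
case class Eq(left: Expr, right: Expr) extends Expr
case class If(cond: Expr, onTrue: Expr, onFalse: Expr) extends Expr
case class Sum(child: Expr) extends Expr

object RewriteDemo {
  type Rule = PartialFunction[Expr, Expr]

  // Rebuild a node with f applied to each direct child.
  def mapChildren(e: Expr, f: Expr => Expr): Expr = e match {
    case Mod(l, r)   => Mod(f(l), f(r))
    case Eq(l, r)    => Eq(f(l), f(r))
    case If(c, t, x) => If(f(c), f(t), f(x))
    case Sum(c)      => Sum(f(c))
    case leaf        => leaf
  }

  // Bottom-up: rewrite the children first, then try the rule on the rebuilt node.
  def transformUp(e: Expr)(rule: Rule): Expr = {
    val rebuilt = mapChildren(e, transformUp(_)(rule))
    if (rule.isDefinedAt(rebuilt)) rule(rebuilt) else rebuilt
  }

  // Top-down: try the rule on the node first, then descend into the result.
  def transformDown(e: Expr)(rule: Rule): Expr = {
    val afterRule = if (rule.isDefinedAt(e)) rule(e) else e
    mapChildren(afterRule, transformDown(_)(rule))
  }

  val grouping: Expr = Mod(Attr("i"), Lit(10))                        // i % 10
  val aggregate: Expr = Sum(If(Eq(grouping, Lit(5)), Lit(1), Lit(0))) // sum(if(i % 10 = 5, 1, 0))

  // Replace the whole aggregate by its (toy) final evaluation,
  // and the grouping expression by the partial grouping attribute.
  val rule: Rule = {
    case e if e == aggregate => Attr("PartialSum")
    case e if e == grouping  => Attr("PartialGroup")
  }

  val bottomUp: Expr = transformUp(aggregate)(rule)
  val topDown: Expr = transformDown(aggregate)(rule)

  def main(args: Array[String]): Unit = {
    println(s"transformUp:   $bottomUp")
    println(s"transformDown: $topDown")
  }
}
```

In this toy model the bottom-up pass rewrites `i % 10` to `PartialGroup` first, so the enclosing `Sum` no longer equals the original aggregate and is left in place as `Sum(If(Eq(Attr(PartialGroup), ...)))`. The top-down pass matches the aggregate before touching its argument. Whether the real fix takes exactly this form is not stated in this message.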
>
> To reproduce the problem, you can use
> {code}
> import org.apache.spark.sql.functions._
> sc.parallelize((1 to 1000), 50).map(i => Tuple1(i)).toDF("i").registerTempTable("t")
> sqlContext.sql("""
> select i % 10, sum(if(i % 10 = 5, 1, 0)), count(i)
> from t
> where i % 10 = 5
> group by i % 10""").explain()
> == Physical Plan ==
> Aggregate false, [PartialGroup#234], [PartialGroup#234 AS _c0#225,SUM(CAST(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf((PartialGroup#234 = 5),1,0), LongType)) AS _c1#226L,Coalesce(SUM(PartialCount#233L),0) AS _c2#227L]
>  Exchange (HashPartitioning [PartialGroup#234], 200)
>   Aggregate true, [(i#191 % 10)], [(i#191 % 10) AS PartialGroup#234,SUM(CAST(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf(((i#191 % 10) = 5),1,0), LongType)) AS PartialSum#232L,COUNT(1) AS PartialCount#233L]
>    Project [_1#190 AS i#191]
>     Filter ((_1#190 % 10) = 5)
>      PhysicalRDD [_1#190], MapPartitionsRDD[93] at mapPartitions at ExistingRDD.scala:37
> sqlContext.sql("""
> select i % 10, sum(if(i % 10 = 5, 1, 0)), count(i)
> from t
> where i % 10 = 5
> group by i % 10""").show
> _c0 _c1 _c2
> 5   50  100
> {code}
> Here _c1 should be 100, because each of the 100 rows that pass the filter
> contributes 1 to the sum, but the old code path returns 50. In the plan, the
> final Aggregate evaluates SUM(if(PartialGroup = 5, 1, 0)) over the partial
> rows instead of summing PartialSum.
> In Spark 1.5, the new aggregation code path does not have this problem. The
> old code path is fixed by
> https://github.com/apache/spark/commit/dd9ae7945ab65d353ed2b113e0c1a00a0533ffd6.
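The value 50 in the output above can be reproduced with plain Scala collections, without Spark. This is a rough sketch under the assumption that `sc.parallelize(1 to 1000, 50)` splits the range into 50 slices of 20 consecutive values; `PartialAggDemo` and `PartialRow` are hypothetical names used only for illustration.

```scala
object PartialAggDemo {
  // Assumption: 50 partitions of 20 consecutive values each.
  val partitions: Seq[Seq[Int]] = (1 to 1000).grouped(20).toVector

  // One partial-aggregation output row per group and partition.
  case class PartialRow(group: Int, partialSum: Long, partialCount: Long)

  // Map side: WHERE i % 10 = 5, then GROUP BY i % 10 inside each partition.
  val partialRows: Seq[PartialRow] = partitions.flatMap { part =>
    part.filter(_ % 10 == 5).groupBy(_ % 10).map { case (g, rows) =>
      PartialRow(g, rows.map(i => if (i % 10 == 5) 1L else 0L).sum, rows.size.toLong)
    }
  }

  // Intended final step: merge the partial sums.
  val correctSum: Long = partialRows.map(_.partialSum).sum

  // What the plan above does: re-evaluate if(PartialGroup = 5, 1, 0)
  // once per partial row and sum the results.
  val buggySum: Long = partialRows.map(r => if (r.group == 5) 1L else 0L).sum

  // The count is merged from PartialCount, as in the plan.
  val count: Long = partialRows.map(_.partialCount).sum

  def main(args: Array[String]): Unit = {
    println(s"partial rows = ${partialRows.size}")
    println(s"correct sum = $correctSum, buggy sum = $buggySum, count = $count")
  }
}
```

Under that assumption every partition holds two matching rows, so there are 50 partial rows. Merging the partial sums gives 100, while counting one per partial row gives 50, which matches the wrong `_c1` value shown by `show`.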