Github user maryannxue commented on a diff in the pull request:

https://github.com/apache/spark/pull/19488#discussion_r144656360

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
@@ -205,14 +205,17 @@ object PhysicalAggregation {
     case logical.Aggregate(groupingExpressions, resultExpressions, child) =>
       // A single aggregate expression might appear multiple times in resultExpressions.
       // In order to avoid evaluating an individual aggregate function multiple times, we'll
-      // build a set of the distinct aggregate expressions and build a function which can
+      // build a map of the distinct aggregate expressions and build a function which can
       // be used to re-write expressions so that they reference the single copy of the
-      // aggregate function which actually gets computed.
-      val aggregateExpressions = resultExpressions.flatMap { expr =>
+      // aggregate function which actually gets computed. Note that aggregate expressions
+      // should always be deterministic, so we can use its canonicalized expression as its
--- End diff --

So we are talking about two kinds of "non-deterministic" here:

1. Non-deterministic across queries but deterministic within a query. The same expression can produce different results over the same input in different runs, but should always give the same result within a single run. sum/avg on floating point numbers could be an example. Should we make sure that "select sum(f) - sum(f) from t" always returns 0? Similarly for "first()": should "select first_value(c) = first_value(c) over ..." always return true? It is important to define the behavior first, because the two possible answers lead to opposite approaches to handling the "deterministic" field here.

2. Non-deterministic both across queries and within a query, which I don't think is allowed anyway.
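
To make the first case concrete, here is a minimal sketch in plain Scala. It is not Spark code: `FloatSumOrder`, `sumInOrder`, and the sample values are hypothetical and chosen only for illustration. It shows that summing the same doubles in a different order can change the result, so "sum(f) - sum(f)" is only guaranteed to be 0 if both occurrences refer to one computed value.

```scala
// Hypothetical illustration only; none of these names exist in Spark.
object FloatSumOrder {
  // Sums the values strictly left to right, standing in for one particular
  // order in which partial results might happen to be combined.
  def sumInOrder(values: Seq[Double]): Double = values.foldLeft(0.0)(_ + _)

  def main(args: Array[String]): Unit = {
    // The same three inputs, visited in two different orders.
    val orderA = Seq(1e17, 1.0, -1e17) // the 1.0 is absorbed by 1e17, sum is 0.0
    val orderB = Seq(1e17, -1e17, 1.0) // the large terms cancel first, sum is 1.0

    // Aggregate computed once and referenced twice: always exactly 0.0.
    val once = sumInOrder(orderA)
    println(once - once)

    // Aggregate computed twice with different orders: -1.0 here, not 0.0.
    println(sumInOrder(orderA) - sumInOrder(orderB))
  }
}
```

If the answer to the question above is "yes", both occurrences have to map to the same computed value, which is what keying on the canonicalized expression would achieve under the assumption that the aggregate is deterministic within a query.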