[ https://issues.apache.org/jira/browse/SPARK-25914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiao Li reassigned SPARK-25914: ------------------------------- Assignee: (was: Dilip Biswal) > Separate projection from grouping and aggregate in logical Aggregate > -------------------------------------------------------------------- > > Key: SPARK-25914 > URL: https://issues.apache.org/jira/browse/SPARK-25914 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.4.0 > Reporter: Maryann Xue > Priority: Major > > Currently the Spark SQL logical Aggregate has two expression fields: > {{groupingExpressions}} and {{aggregateExpressions}}, in which > {{aggregateExpressions}} is actually the result expressions, or in other > words, the project list in the SELECT clause. > > This would cause an exception while processing the following query: > {code:java} > SELECT concat('x', concat(a, 's')) > FROM testData2 > GROUP BY concat(a, 's'){code} > After optimization, the query becomes: > {code:java} > SELECT concat('x', a, 's') > FROM testData2 > GROUP BY concat(a, 's'){code} > The optimization rule {{CombineConcats}} optimizes the expressions by > flattening "concat" and causes the query to fail since the expression > {{concat('x', a, 's')}} in the SELECT clause is neither referencing a > grouping expression nor a aggregate expression. > > The problem is that we try to mix two operations in one operator, and worse, > in one field: the group-and-aggregate operation and the project operation. > There are two ways to solve this problem: > 1. Break the two operations into two logical operators, which means a > group-by query can usually be mapped into a Project-over-Aggregate pattern. > 2. Break the two operations into multiple fields in the Aggregate operator, > the same way we do for physical aggregate classes (e.g., > {{HashAggregateExec}}, or {{SortAggregateExec}}). Thus, > {{groupingExpressions}} would still be the expressions from the GROUP BY > clause (as before), but {{aggregateExpressions}} would contain aggregate > functions only, and {{resultExpressions}} would be the project list in the > SELECT clause holding references to either {{groupingExpressions}} or > {{aggregateExpressions}}. > > I would say option 1 is even clearer, but it would be more likely to break > the pattern matching in existing optimization rules and thus require more > changes in the compiler. So we'd probably wanna go with option 2. That said, > I suggest we achieve this goal through two iterative steps: > > Phase 1: Keep the current fields of logical Aggregate as > {{groupingExpressions}} and {{aggregateExpressions}}, but change the > semantics of {{aggregateExpressions}} by replacing the grouping expressions > with corresponding references to expressions in {{groupingExpressions}}. The > aggregate expressions in {{aggregateExpressions}} will remain the same. > > Phase 2: Add {{resultExpressions}} for the project list, and keep only > aggregate expressions in {{aggregateExpressions}}. > -- This message was sent by Atlassian JIRA (v7.6.14#76016) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org