[ 
https://issues.apache.org/jira/browse/SPARK-25914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-25914:
-------------------------------

    Assignee:     (was: Dilip Biswal)

> Separate projection from grouping and aggregate in logical Aggregate
> --------------------------------------------------------------------
>
>                 Key: SPARK-25914
>                 URL: https://issues.apache.org/jira/browse/SPARK-25914
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Maryann Xue
>            Priority: Major
>
> Currently the Spark SQL logical Aggregate has two expression fields: 
> {{groupingExpressions}} and {{aggregateExpressions}}, in which 
> {{aggregateExpressions}} is actually the result expressions, or in other 
> words, the project list in the SELECT clause.
>   
>  This would cause an exception while processing the following query:
> {code:java}
> SELECT concat('x', concat(a, 's'))
> FROM testData2
> GROUP BY concat(a, 's'){code}
>  After optimization, the query becomes:
> {code:java}
> SELECT concat('x', a, 's')
> FROM testData2
> GROUP BY concat(a, 's'){code}
> The optimization rule {{CombineConcats}} optimizes the expressions by 
> flattening "concat" and causes the query to fail since the expression 
> {{concat('x', a, 's')}} in the SELECT clause is neither referencing a 
> grouping expression nor a aggregate expression.
>   
>  The problem is that we try to mix two operations in one operator, and worse, 
> in one field: the group-and-aggregate operation and the project operation. 
> There are two ways to solve this problem:
>  1. Break the two operations into two logical operators, which means a 
> group-by query can usually be mapped into a Project-over-Aggregate pattern.
>  2. Break the two operations into multiple fields in the Aggregate operator, 
> the same way we do for physical aggregate classes (e.g., 
> {{HashAggregateExec}}, or {{SortAggregateExec}}). Thus, 
> {{groupingExpressions}} would still be the expressions from the GROUP BY 
> clause (as before), but {{aggregateExpressions}} would contain aggregate 
> functions only, and {{resultExpressions}} would be the project list in the 
> SELECT clause holding references to either {{groupingExpressions}} or 
> {{aggregateExpressions}}.
>   
>  I would say option 1 is even clearer, but it would be more likely to break 
> the pattern matching in existing optimization rules and thus require more 
> changes in the compiler. So we'd probably wanna go with option 2. That said, 
> I suggest we achieve this goal through two iterative steps:
>   
>  Phase 1: Keep the current fields of logical Aggregate as 
> {{groupingExpressions}} and {{aggregateExpressions}}, but change the 
> semantics of {{aggregateExpressions}} by replacing the grouping expressions 
> with corresponding references to expressions in {{groupingExpressions}}. The 
> aggregate expressions in  {{aggregateExpressions}} will remain the same.
>   
>  Phase 2: Add {{resultExpressions}} for the project list, and keep only 
> aggregate expressions in {{aggregateExpressions}}.
>   



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to