bersprockets opened a new pull request, #38565:
URL: https://github.com/apache/spark/pull/38565

   ### What changes were proposed in this pull request?
   
   `RewriteDistinctAggregates` doesn't typically patch foldable children of the 
distinct aggregation functions except in one odd case (and seemingly by 
accident). This PR extends the policy of not patching foldables to that odd 
case.
   
   
   ### Why are the changes needed?
   
   This query produces incorrect results:
   ```
   select a, count(distinct 100) as cnt1, count(distinct b, 100) as cnt2
   from values (1, 2), (4, 5) as data(a, b)
   group by a;
   
   +---+----+----+
   |a  |cnt1|cnt2|
   +---+----+----+
   |1  |1   |0   |
   |4  |1   |0   |
   +---+----+----+
   ```
   The values for `cnt2` should be 1 and 1 (not 0 and 0).
   
   If you change the literal used in the first aggregate function, the second 
aggregate function now works correctly:
   ```
   select a, count(distinct 101) as cnt1, count(distinct b, 100) as cnt2
   from values (1, 2), (4, 5) as data(a, b)
   group by a;
   
   +---+----+----+
   |a  |cnt1|cnt2|
   +---+----+----+
   |1  |1   |1   |
   |4  |1   |1   |
   +---+----+----+
   ```
   The bug is in the rule `RewriteDistinctAggregates`. When a distinct 
aggregation has only foldable children, `RewriteDistinctAggregates` uses the 
first child as the grouping key (_grouping key_ in this context means the 
function children of distinct aggregate functions: `RewriteDistinctAggregates` 
groups distinct aggregations by function children to determine the `Expand` 
projections  it needs to create). Therefore, the first foldable child gets 
included in the `Expand` projection associated with the aggregation, with a 
corresponding output attribute that is also included in the map for patching 
aggregate functions in the final aggregation.
   
   The `Expand` projections for all other distinct aggregate groups will have 
`null` in the slot associated with that output attribute. If the same foldable 
expression is used in a distinct aggregation associated with a different group, 
`RewriteDistinctAggregates` will improperly patch the associated aggregate 
function to use the previous aggregation's output attribute. Since the output 
attribute is associated with a different group, the value of that slot in the 
`Expand` projection will always be `null`.
   
   In the example above, `count(distinct 101) as cnt1` is the aggregation with 
only foldable children, and `count(distinct b, 100) as cnt2` is the aggregation 
that gets inappropriately patched with the wrong group's output attribute. As a 
result `count(distinct b, 100) as cnt2` (from the first example above) 
essentially becomes `count(distinct b, null) as cnt2`, which is always zero.
   
   `RewriteDistinctAggregates` doesn't typically patch foldable children of the 
distinct aggregation functions in the final aggregation. It potentially patches 
foldable expressions only when there is a distinct aggregation with only 
foldable children, and even then it doesn't patch the aggregation that has only 
foldable children, but instead some other unlucky aggregate function that 
happened to use the same foldable expression.
   
   This PR skips patching any foldable expressions in the aggregate functions 
to avoid patching an aggregation with a different group's output attribute.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   New unit test.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to