james727 opened a new pull request #1534:
URL: https://github.com/apache/arrow-datafusion/pull/1534


   # Which issue does this PR close?
   This partially addresses 
https://github.com/apache/arrow-datafusion/issues/1512
   
    # Rationale for this change
   Right now `array_agg(distinct ...)` doesn't work. The physical plan 
construction logic uses the non-distinct `array_agg` whether or not distinct 
was specified. Interestingly enough it still works correctly under certain 
conditions, due to the `SingleDistinctToGroupBy` optimizer rule. 
   
   As an example, consider the following queries:
   ```sql
   --- Works, since logical plan is rewritten with a subquery and non-distinct 
agg. Will return:
   --- +------------------------------------------+
   --- | ARRAYAGG(DISTINCT aggregate_test_100.c2) |
   --- +------------------------------------------+
   --- | [2, 3, 5, 1, 4]                          |
   --- +------------------------------------------+
   SELECT array_agg(DISTINCT c2) FROM aggregate_test_100;
   
   -- Returns incorrect results, since SingleDistinctToGroupBy optimizer rule 
does not apply:
   --- 
+--------------------------------------------------------------------------+
   --- | ARRAYAGG(DISTINCT aggregate_test_100.c2)      | COUNT(DISTINCT 
UInt8(1)) |
   --- 
+--------------------------------------------------------------------------+
   --- | [2, 5, 1, 1, 5, 4, 3, 3, 1, 4, 1, 4, 3, ...]  | 1                      
  |
   --- 
+--------------------------------------------------------------------------+
   SELECT array_agg(DISTINCT c2), count(distinct 1) FROM aggregate_test_100;
   ```
   
   After this change distinct array agg will throw an error when 
`SingleDistinctToGroupBy` does not apply.
   
   I'm planning on working on actually implementing distinct array_agg after 
this, but figured this was worth fixing for now.
   # What changes are included in this PR?
   This marks the aggregation not implemented, and adds a block for testing 
count/approx distinct/array agg with `distinct = true`.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to