james727 opened a new pull request #1534: URL: https://github.com/apache/arrow-datafusion/pull/1534
# Which issue does this PR close?

This partially addresses https://github.com/apache/arrow-datafusion/issues/1512

# Rationale for this change

Right now `array_agg(distinct ...)` doesn't work. The physical plan construction logic uses the non-distinct `array_agg` whether or not distinct was specified. Interestingly enough it still works correctly under certain conditions, due to the `SingleDistinctToGroupBy` optimizer rule. As an example, consider the following queries:

```sql
--- Works, since logical plan is rewritten with a subquery and non-distinct agg. Will return:
--- +------------------------------------------+
--- | ARRAYAGG(DISTINCT aggregate_test_100.c2) |
--- +------------------------------------------+
--- | [2, 3, 5, 1, 4]                          |
--- +------------------------------------------+
SELECT array_agg(DISTINCT c2) FROM aggregate_test_100;

-- Returns incorrect results, since SingleDistinctToGroupBy optimizer rule does not apply:
--- +----------------------------------------------+--------------------------+
--- | ARRAYAGG(DISTINCT aggregate_test_100.c2)     | COUNT(DISTINCT UInt8(1)) |
--- +----------------------------------------------+--------------------------+
--- | [2, 5, 1, 1, 5, 4, 3, 3, 1, 4, 1, 4, 3, ...] | 1                        |
--- +----------------------------------------------+--------------------------+
SELECT array_agg(DISTINCT c2), count(distinct 1) FROM aggregate_test_100;
```

After this change distinct array agg will throw an error when `SingleDistinctToGroupBy` does not apply. I'm planning on working on actually implementing distinct array_agg after this, but figured this was worth fixing for now.

# What changes are included in this PR?

This marks the aggregation not implemented, and adds a block for testing count/approx distinct/array agg with `distinct = true`.
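To illustrate the kind of change described above, here is a minimal Rust sketch of dispatching on both the aggregate function and the `distinct` flag, so that an unsupported combination fails loudly instead of silently falling back to the non-distinct aggregate. Every name here (`AggregateFunction`, `PhysicalAggregate`, `PlanError`, `create_aggregate`) and the error message are hypothetical stand-ins for illustration; this is not DataFusion's actual code or API.

```rust
// Hypothetical, simplified stand-ins; these are NOT DataFusion's real types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum AggregateFunction {
    Count,
    ArrayAgg,
}

// Hypothetical physical aggregate expressions that planning can produce.
#[derive(Debug, PartialEq)]
enum PhysicalAggregate {
    Count,
    DistinctCount,
    ArrayAgg,
}

// Hypothetical planning error carrying a human-readable message.
#[derive(Debug, PartialEq)]
enum PlanError {
    NotImplemented(String),
}

// Match on (function, distinct) together so that the distinct flag can
// never be ignored by accident when choosing the physical aggregate.
fn create_aggregate(
    fun: AggregateFunction,
    distinct: bool,
) -> Result<PhysicalAggregate, PlanError> {
    match (fun, distinct) {
        (AggregateFunction::Count, false) => Ok(PhysicalAggregate::Count),
        (AggregateFunction::Count, true) => Ok(PhysicalAggregate::DistinctCount),
        (AggregateFunction::ArrayAgg, false) => Ok(PhysicalAggregate::ArrayAgg),
        // Previously this case would have fallen through to the
        // non-distinct aggregate; here it is an explicit error.
        (AggregateFunction::ArrayAgg, true) => Err(PlanError::NotImplemented(
            "ARRAY_AGG(DISTINCT) is not implemented".to_string(),
        )),
    }
}
```

In this sketch, `create_aggregate(AggregateFunction::ArrayAgg, true)` returns the `NotImplemented` error, while the non-distinct call still succeeds. Matching on the tuple also means the compiler flags any newly added function that does not handle both values of `distinct`.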