ic4y opened a new issue #1282: URL: https://github.com/apache/arrow-datafusion/issues/1282
In most query engines, the execution cost of the distinct aggregation function is huge, but it can be optimized by groupBy. I want to bring this optimization to datafushion Currently for a single distinct aggregation scenario as follows ``` - Aggregation GROUP BY (k) F1(DISTINCT s0, s1, ...), F2(DISTINCT s0, s1, ...), - X into - Aggregation GROUP BY (k) F1(x) F2(x) - Aggregation GROUP BY (k, s0, s1, ...) - X ``` I used a test data set of 60 million to test datafunshion before and after using the optimizer.After optimization,the performance has doubled and the execution time has been reduced from 12 seconds to 6 seconds The test results and the logical plan before and after optimization are as follows ``` sql : select count(distinct LO_EXTENDEDPRICE) from lineorder_flat; ------------------original--------------------- Display: Projection: #COUNT(DISTINCT lineorder_flat.LO_EXTENDEDPRICE) [COUNT(DISTINCT lineorder_flat.LO_EXTENDEDPRICE):UInt64;N] Aggregate: groupBy=[[]], aggr=[[COUNT(DISTINCT #lineorder_flat.LO_EXTENDEDPRICE)]] [COUNT(DISTINCT lineorder_flat.LO_EXTENDEDPRICE):UInt64;N] TableScan: lineorder_flat projection=Some([9]) [LO_EXTENDEDPRICE:Int64] +-------------------------------------------------+ | COUNT(DISTINCT lineorder_flat.LO_EXTENDEDPRICE) | +-------------------------------------------------+ | 1040570 | +-------------------------------------------------+ usage millis: 12033 ----------------after optimization------------- Display: Projection: #COUNT(lineorder_flat.LO_EXTENDEDPRICE) [COUNT(lineorder_flat.LO_EXTENDEDPRICE):UInt64;N] Aggregate: groupBy=[[]], aggr=[[COUNT(#lineorder_flat.LO_EXTENDEDPRICE)]] [COUNT(lineorder_flat.LO_EXTENDEDPRICE):UInt64;N] Aggregate: groupBy=[[#lineorder_flat.LO_EXTENDEDPRICE]], aggr=[[]] [LO_EXTENDEDPRICE:Int64] TableScan: lineorder_flat projection=Some([9]) [LO_EXTENDEDPRICE:Int64] +----------------------------------------+ | COUNT(lineorder_flat.LO_EXTENDEDPRICE) | +----------------------------------------+ | 1040570 | +----------------------------------------+ usage millis: 5817 ``` In the case of common aggregation functions and distinct aggregation functions used together, optimization can also be done in a way similar to GROUPING SET. Although it has not been tested on datafushion, I did the above optimization in trino. In our production environment Has a very good performance -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org