Rachelint commented on issue #20773: URL: https://github.com/apache/datafusion/issues/20773#issuecomment-4017011592
As far as I can see, these two articles talk about how `duckdb` improves performance for high-cardinality group aggregation. But `datafusion` and `duckdb` take quite different approaches to aggregation today, so the methods in `duckdb` seem unlikely to help much with continuing to improve `high cardinality groups aggregation` in datafusion. First, let's look at the differences in `high cardinality groups aggregation`.

# Duckdb 0.7

https://duckdb.org/2022/03/07/aggregate-hashtable#parallel-aggregation

This article describes the overall method used to improve the performance of high-cardinality group aggregation in duckdb 0.7. As I see it, there are some mistakes in it: it uses a very different approach from the one described in [Leis et al](https://15721.courses.cs.cmu.edu/spring2016/papers/p743-leis.pdf).

- In the low-cardinality case (hashtable length in `partial aggr` <= 10K):
  - use a single hashtable in `partial aggr`
  - use a single thread to perform the `final aggr` (merging the hashtables from the `partial aggr`s)
- In the high-cardinality case (hashtable length in `partial aggr` > 10K):
  - use a partitioned hashtable in `partial aggr`
  - use multiple threads to perform the `final aggr` in parallel (like the `final aggr` in datafusion)

In effect, it gives up cache efficiency of the hashtable in `partial aggr`, and relies on the parallel `final aggr` to improve performance.

# Duckdb now

https://db.in.tum.de/~leis/papers/morsels.pdf

Duckdb currently uses a method similar to the one in this paper.

- In `partial aggr`, it always keeps a very small, fixed-size hashtable
  - when the hashtable in `partial aggr` becomes too big, it converts the data into partitions and resets the hashtable
- In `final aggr`, like datafusion, it always partitions and merges in parallel

We have tried a similar approach in datafusion before, see https://github.com/apache/datafusion/issues/6937#issuecomment-1681310199 , but found no obvious improvement. I guess this is due to the execution model of datafusion (pull-based and async, which is hard to make cache-efficient).
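To make the high-cardinality path concrete, here is a minimal toy sketch in Rust of partitioned partial aggregation: rows are routed into one of N hashtables by the low bits of the key, so each partition can later be merged by an independent thread in `final aggr`. This is illustrative only, not duckdb's or datafusion's actual code; `NUM_PARTITIONS` and the identity-hash routing are simplifying assumptions.

```rust
use std::collections::HashMap;

// Number of partitions; real engines pick a power of two so the
// partition index is just the low bits of the hash. (Illustrative value.)
const NUM_PARTITIONS: usize = 4;

// Route a hash value to a partition via its low bits.
fn partition_of(hash: u64) -> usize {
    (hash as usize) & (NUM_PARTITIONS - 1)
}

// Partitioned partial aggregation of SUM(value) GROUP BY key.
// A real engine would hash the key; the identity "hash" keeps the sketch short.
fn partial_aggregate(keys: &[u64], values: &[i64]) -> Vec<HashMap<u64, i64>> {
    let mut partitions: Vec<HashMap<u64, i64>> = vec![HashMap::new(); NUM_PARTITIONS];
    for (&k, &v) in keys.iter().zip(values) {
        let p = partition_of(k);
        *partitions[p].entry(k).or_insert(0) += v;
    }
    partitions
}

fn main() {
    // keys 1 and 5 both land in partition 1 (1 & 3 == 5 & 3 == 1).
    let parts = partial_aggregate(&[1, 2, 5, 1], &[10, 20, 30, 40]);
    let total: i64 = parts.iter().flat_map(|m| m.values()).sum();
    println!("total = {total}"); // total = 100
    println!("partition 1, group 1 = {}", parts[1][&1]); // 10 + 40 = 50
}
```

Because every group key lands in exactly one partition, the per-partition hashtables in `final aggr` can be merged by separate threads with no coordination, which is the parallelism the duckdb 0.7 article relies on.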
# Datafusion

datafusion uses `skip partial aggregations` to improve `high cardinality groups aggregation`, like `doris` does.

- In `partial aggr`, we perform the normal group-aggregation logic as long as the threshold is not exceeded
- Once the threshold is exceeded, we skip the `partial aggr` entirely
- So I guess little obvious improvement can be gained by optimizing `partial aggr` in `datafusion`, because we skip it entirely in `high cardinality groups aggregation`.
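The skip decision above can be sketched roughly like this (a toy Rust sketch; `PROBE_ROWS` and `MAX_GROUP_RATIO` are made-up illustrative names, not datafusion's actual options): the partial operator aggregates a probe prefix of the input, and if nearly every probed row produced a new group, it gives up and forwards rows unchanged to the `final aggr`.

```rust
use std::collections::HashMap;

// Illustrative thresholds, not datafusion's real configuration.
const PROBE_ROWS: usize = 8;
const MAX_GROUP_RATIO: f64 = 0.8;

enum PartialMode {
    // Low cardinality: keep aggregating, starting from the probed table.
    Aggregating(HashMap<u64, i64>),
    // High cardinality: skip partial aggregation, forward rows as-is.
    Passthrough,
}

// Aggregate the first PROBE_ROWS rows, then decide whether partial
// aggregation is worthwhile based on the groups-per-row ratio.
fn decide_mode(keys: &[u64], values: &[i64]) -> PartialMode {
    let mut table: HashMap<u64, i64> = HashMap::new();
    for (&k, &v) in keys.iter().take(PROBE_ROWS).zip(values) {
        *table.entry(k).or_insert(0) += v;
    }
    let probed = keys.len().min(PROBE_ROWS);
    // Almost one distinct group per input row => aggregating here buys nothing.
    if table.len() as f64 > probed as f64 * MAX_GROUP_RATIO {
        PartialMode::Passthrough
    } else {
        PartialMode::Aggregating(table)
    }
}

fn main() {
    let high_card: Vec<u64> = (0..8).collect(); // 8 rows, 8 groups
    let low_card = [1u64; 8]; // 8 rows, 1 group
    let vals = [1i64; 8];
    assert!(matches!(decide_mode(&high_card, &vals), PartialMode::Passthrough));
    assert!(matches!(decide_mode(&low_card, &vals), PartialMode::Aggregating(_)));
    println!("ok");
}
```

In passthrough mode the `partial aggr` does essentially no hashtable work at all, which is why optimizations targeting the partial-side hashtable would have little effect on this path.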
