Dandandan opened a new issue, #20773: URL: https://github.com/apache/datafusion/issues/20773
### Is your feature request related to a problem or challenge?

The paper "Morsel-Driven Parallelism" [1] by Leis et al. introduces a cache-efficient aggregation algorithm, which is also implemented by DuckDB [2]. We can implement it and see if it improves performance.

[1] https://db.in.tum.de/~leis/papers/morsels.pdf
[2] https://duckdb.org/2022/03/07/aggregate-hashtable#parallel-aggregation

### Describe the solution you'd like

The steps are as follows:

1. Start with a single hashtable (the current approach) without pre-repartitioning.
2. Once the table exceeds a threshold (i.e. roughly CPU cache size), (radix-)repartition the map into a number of hashmaps (e.g. we can start off using `target_partitions` of them).
3. Once the data is repartitioned, the output batches can be sent directly to the target partitions (no Repartition needed!).

<img width="351" height="274" alt="Image" src="https://github.com/user-attachments/assets/d41845bb-1c14-4715-9f44-dcbb6b516014" />

The main benefit, I think, is that with local grouping we process the rows through smaller hashmaps partition-by-partition, so while doing so they are more likely to fit in the CPU cache. Care must be taken to make the code efficient, i.e. avoid materializing the batches upfront and instead accumulate based on indices.

### Describe alternatives you've considered

_No response_

### Additional context

_No response_
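The three steps above could be sketched roughly as follows. This is a minimal illustration only, not DataFusion code: the names (`AdaptiveAggState`, `partition_of`, `update`, `finish`) are hypothetical, `u64` keys stand in for already-hashed group values, and a running sum stands in for the real accumulators. It shows the adaptive part of the idea: one hashmap until a size threshold is crossed, then a radix repartition into `target_partitions`-many smaller maps whose outputs can be emitted per-partition.

```rust
use std::collections::HashMap;

/// Illustrative sketch (not a DataFusion API): single-table aggregation
/// that radix-repartitions itself once it exceeds a cache-size threshold.
struct AdaptiveAggState {
    threshold: usize,
    num_partitions: usize, // e.g. target_partitions
    single: Option<HashMap<u64, i64>>, // phase 1: one table
    partitioned: Vec<HashMap<u64, i64>>, // phase 2: radix-partitioned tables
}

impl AdaptiveAggState {
    fn new(threshold: usize, num_partitions: usize) -> Self {
        Self {
            threshold,
            num_partitions,
            single: Some(HashMap::new()),
            partitioned: Vec::new(),
        }
    }

    /// Toy radix function: low bits of the (already hashed) group key.
    fn partition_of(&self, key: u64) -> usize {
        (key as usize) % self.num_partitions
    }

    /// Accumulate one (group key, value) pair; repartition on overflow.
    fn update(&mut self, key: u64, value: i64) {
        if let Some(table) = self.single.as_mut() {
            *table.entry(key).or_insert(0) += value;
            if table.len() > self.threshold {
                self.repartition();
            }
        } else {
            let p = self.partition_of(key);
            *self.partitioned[p].entry(key).or_insert(0) += value;
        }
    }

    /// Spill the single table into `num_partitions` radix partitions.
    fn repartition(&mut self) {
        let table = self.single.take().expect("already partitioned");
        self.partitioned = (0..self.num_partitions).map(|_| HashMap::new()).collect();
        for (key, value) in table {
            let p = self.partition_of(key);
            self.partitioned[p].insert(key, value);
        }
    }

    /// Emit per-partition (key, sum) batches; each inner Vec could be sent
    /// directly to the matching target partition, with no extra Repartition.
    fn finish(self) -> Vec<Vec<(u64, i64)>> {
        match self.single {
            Some(table) => vec![table.into_iter().collect()],
            None => self
                .partitioned
                .into_iter()
                .map(|m| m.into_iter().collect())
                .collect(),
        }
    }
}
```

A real implementation would repartition by reusing the stored hash values rather than rehashing, and would accumulate via row indices into the input batches instead of materializing per-row, per the efficiency note above.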
