Dandandan opened a new issue, #20773: URL: https://github.com/apache/datafusion/issues/20773
### Is your feature request related to a problem or challenge?

The paper "Morsel-Driven Parallelism" [1] by Leis et al. introduces a cache-efficient aggregation algorithm, which is also implemented by DuckDB [2]. We can implement it and see if it improves performance.

[1] https://db.in.tum.de/~leis/papers/morsels.pdf
[2] https://duckdb.org/2022/03/07/aggregate-hashtable#parallel-aggregation

### Describe the solution you'd like

The steps are as follows:

1. Start with a single hashtable (the current approach) without pre-repartitioning.
2. Once the table exceeds a threshold (i.e. roughly CPU cache size), (radix-)repartition the map into a number of hashmaps (e.g. we can start off using `target_partitions` of them).
3. Once the data is repartitioned, the output batches can be sent directly to the target partitions (no Repartition needed!).

<img width="351" height="274" alt="Image" src="https://github.com/user-attachments/assets/d41845bb-1c14-4715-9f44-dcbb6b516014" />

The main benefit, I think, is that with local grouping we process the rows through smaller hashmaps partition-by-partition, so while doing so they are more likely to fit in the CPU cache. Care must be taken to make the code efficient, i.e. avoid materializing the batches upfront and instead accumulate based on indices.

### Describe alternatives you've considered

_No response_

### Additional context

_No response_
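The three steps above could be sketched roughly as follows. This is a minimal illustration only, not DataFusion code: the names (`AdaptiveAggState`, `partition_of`, `update`, `finish`) are hypothetical, `u64` keys stand in for already-hashed group values, and a running sum stands in for the real accumulators. It shows the adaptive part of the idea: one hashmap until a size threshold is crossed, then a radix repartition into `target_partitions`-many smaller maps whose outputs can be emitted per-partition.

```rust
use std::collections::HashMap;

/// Illustrative sketch (not a DataFusion API): single-table aggregation
/// that radix-repartitions itself once it exceeds a cache-size threshold.
struct AdaptiveAggState {
    threshold: usize,
    num_partitions: usize, // e.g. target_partitions
    single: Option<HashMap<u64, i64>>, // phase 1: one table
    partitioned: Vec<HashMap<u64, i64>>, // phase 2: radix-partitioned tables
}

impl AdaptiveAggState {
    fn new(threshold: usize, num_partitions: usize) -> Self {
        Self {
            threshold,
            num_partitions,
            single: Some(HashMap::new()),
            partitioned: Vec::new(),
        }
    }

    /// Toy radix function: low bits of the (already hashed) group key.
    fn partition_of(&self, key: u64) -> usize {
        (key as usize) % self.num_partitions
    }

    /// Accumulate one (group key, value) pair; repartition on overflow.
    fn update(&mut self, key: u64, value: i64) {
        if let Some(table) = self.single.as_mut() {
            *table.entry(key).or_insert(0) += value;
            if table.len() > self.threshold {
                self.repartition();
            }
        } else {
            let p = self.partition_of(key);
            *self.partitioned[p].entry(key).or_insert(0) += value;
        }
    }

    /// Spill the single table into `num_partitions` radix partitions.
    fn repartition(&mut self) {
        let table = self.single.take().expect("already partitioned");
        self.partitioned = (0..self.num_partitions).map(|_| HashMap::new()).collect();
        for (key, value) in table {
            let p = self.partition_of(key);
            self.partitioned[p].insert(key, value);
        }
    }

    /// Emit per-partition (key, sum) batches; each inner Vec could be sent
    /// directly to the matching target partition, with no extra Repartition.
    fn finish(self) -> Vec<Vec<(u64, i64)>> {
        match self.single {
            Some(table) => vec![table.into_iter().collect()],
            None => self
                .partitioned
                .into_iter()
                .map(|m| m.into_iter().collect())
                .collect(),
        }
    }
}
```

A real implementation would repartition by reusing the stored hash values rather than rehashing, and would accumulate via row indices into the input batches instead of materializing per-row, per the efficiency note above.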
