Dandandan commented on issue #20773:
URL: https://github.com/apache/datafusion/issues/20773#issuecomment-4017058606

   > We have tried the similar approach in datafusion before, see 
https://github.com/apache/datafusion/issues/6937#issuecomment-1681310199 , but 
found no obvious improvement.
   
   Note that this is significantly different:
   
   The algorithm used by DuckDB/in the paper:
   * inserts into the hashmap *partition by partition* / *hashmap by hashmap* (I think this is the most important point). This makes the partial aggregation much more cache efficient, even for aggregations that are not *that* big, because the hashmap that is "in progress" is more likely to fit in cache, so lookups are more likely to hit cache.
   * avoids the extra partitioning step (both the double hashing and the copying).
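   To illustrate the idea (this is a toy sketch, not DataFusion's actual operator): hash each key once, scatter rows into partitions by a few bits of that hash, then aggregate one partition at a time with its own small hashmap. The hash function and partition count here are made up for the example; a real implementation would also reuse the stored hash inside the per-partition table (e.g. via a raw-table API), which `std::collections::HashMap` cannot do.

   ```rust
   use std::collections::HashMap;

   // Power-of-two partition count so we can mask instead of mod.
   const NUM_PARTITIONS: usize = 4;

   // Stand-in multiplicative hash; a real engine would use its row hash.
   fn hash_key(key: u64) -> u64 {
       key.wrapping_mul(0x9E37_79B9_7F4A_7C15)
   }

   /// Radix-partitioned COUNT(*) by key:
   /// 1. one partitioning pass computing each hash once (no double hashing),
   /// 2. then aggregation partition by partition, so each "in progress"
   ///    hashmap stays small and is more likely to fit in cache.
   fn partitioned_count(keys: &[u64]) -> HashMap<u64, u64> {
       // Pass 1: scatter (hash, key) into partitions by the low hash bits.
       let mut partitions: Vec<Vec<(u64, u64)>> = vec![Vec::new(); NUM_PARTITIONS];
       for &key in keys {
           let h = hash_key(key);
           partitions[(h as usize) & (NUM_PARTITIONS - 1)].push((h, key));
       }
       // Pass 2: aggregate each partition with its own small hashmap.
       // Equal keys always land in the same partition, so the per-partition
       // results are disjoint and can simply be merged.
       let mut result = HashMap::new();
       for part in partitions {
           let mut local: HashMap<u64, u64> = HashMap::new();
           for (_h, key) in part {
               *local.entry(key).or_insert(0) += 1;
           }
           result.extend(local);
       }
       result
   }

   fn main() {
       let counts = partitioned_count(&[1, 2, 1, 3, 1, 2]);
       println!("{counts:?}");
   }
   ```

   The cache benefit comes entirely from pass 2: instead of one large hashmap touched by every row, each partition's hashmap covers only ~1/NUM_PARTITIONS of the distinct keys.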


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

