Rachelint commented on issue #20773: URL: https://github.com/apache/datafusion/issues/20773#issuecomment-4021933734
> I wonder if we can also make things more cache aware with some sorting / partitioning based on hash, with more minimal changes, perhaps getting a similar win in cache efficiency for high cardinality (without the large changes):
>
> * sort/partition hashes so that access to the table is more linear (and also detect/skip duplicate keys, avoiding probes)
> * sort/partition group indices to make accumulator state access more linear (rather than scattered over the entire group state)

Yes, after reading the DuckDB code, here are my current thoughts:

- For `partial aggr`, skipping is already a good way to handle high-cardinality groups; what we should continue to do for `partial aggr`, I think, is improve the performance of `RepartitionExec`.
- For `final aggr`, we should use the partition-wise method to keep the hashmap always small, as is done in DuckDB. Luckily, I found that the code changes needed to reach this are not actually large; I am experimenting with this.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
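For readers unfamiliar with the partition-wise idea discussed in the comment above, here is a minimal sketch in Rust. It radix-partitions rows by a few bits of their key hash and then aggregates each partition with its own small hash table, so no single table ever holds all groups. All names, constants, and the toy hash function are illustrative assumptions, not DataFusion's or DuckDB's actual implementation.

```rust
use std::collections::HashMap;

// Number of partitions; a power of two so we can mask hash bits.
// Real engines use more partitions and size them to fit cache.
const NUM_PARTITIONS: usize = 4;

// Toy hash mix (hypothetical); a real engine uses a proper hash function.
fn hash_key(key: u64) -> u64 {
    key.wrapping_mul(0x9E37_79B9_7F4A_7C15)
}

/// Sum `values` grouped by `keys`, partition-wise.
fn partition_wise_sum(keys: &[u64], values: &[i64]) -> HashMap<u64, i64> {
    // Step 1: radix-partition row indices by the top bits of the hash.
    let mut partitions: Vec<Vec<usize>> = vec![Vec::new(); NUM_PARTITIONS];
    for (i, &k) in keys.iter().enumerate() {
        let p = (hash_key(k) >> 62) as usize & (NUM_PARTITIONS - 1);
        partitions[p].push(i);
    }

    // Step 2: aggregate each partition with its own small hash table.
    // A given key always lands in the same partition, so the per-partition
    // results are disjoint and merging is just a union.
    let mut result = HashMap::new();
    for part in partitions {
        let mut table: HashMap<u64, i64> = HashMap::new();
        for i in part {
            *table.entry(keys[i]).or_insert(0) += values[i];
        }
        result.extend(table);
    }
    result
}

fn main() {
    let r = partition_wise_sum(&[1, 2, 1, 3, 2], &[10, 20, 30, 40, 50]);
    assert_eq!(r[&1], 40);
    assert_eq!(r[&2], 70);
    assert_eq!(r[&3], 40);
    println!("{:?}", r);
}
```

The design point is that each per-partition table stays small enough to be cache-resident, and partitions can also be processed by different threads without sharing a table.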
