Rachelint commented on issue #20773: URL: https://github.com/apache/datafusion/issues/20773#issuecomment-4021933734
> I wonder if we can also make things more cache aware with some sorting / partitioning based on hash, with more minimal changes, perhaps getting a similar win in cache efficiency for high cardinality (without the large changes):
>
> * sort/partition hashes so that access to the table is more linear (and also detect/skip duplicate keys, avoiding probes)
> * sort/partition group indices to make accumulator state access more linear (rather than scattered over the entire group state)

Yes, after reading the DuckDB code, here are my current thoughts:

- For `partial aggr`, skipping is already a good way to handle high-cardinality groups; what we should continue to do for `partial aggr`, I think, is improve the performance of `RepartitionExec`.
- For `final aggr`, we should use the partition-wise method to keep the hashmap always small, as is done in DuckDB. Luckily, I found that the code changes needed to reach this are not actually large; I am experimenting with this.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
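For readers unfamiliar with the partition-wise idea discussed in the comment above, here is a minimal sketch in Rust. It radix-partitions rows by a few bits of their key hash and then aggregates each partition with its own small hash table, so no single table ever holds all groups. All names, constants, and the toy hash function are illustrative assumptions, not DataFusion's or DuckDB's actual implementation.

```rust
use std::collections::HashMap;

// Number of partitions; a power of two so we can mask hash bits.
// Real engines use more partitions and size them to fit cache.
const NUM_PARTITIONS: usize = 4;

// Toy hash mix (hypothetical); a real engine uses a proper hash function.
fn hash_key(key: u64) -> u64 {
    key.wrapping_mul(0x9E37_79B9_7F4A_7C15)
}

/// Sum `values` grouped by `keys`, partition-wise.
fn partition_wise_sum(keys: &[u64], values: &[i64]) -> HashMap<u64, i64> {
    // Step 1: radix-partition row indices by the top bits of the hash.
    let mut partitions: Vec<Vec<usize>> = vec![Vec::new(); NUM_PARTITIONS];
    for (i, &k) in keys.iter().enumerate() {
        let p = (hash_key(k) >> 62) as usize & (NUM_PARTITIONS - 1);
        partitions[p].push(i);
    }

    // Step 2: aggregate each partition with its own small hash table.
    // A given key always lands in the same partition, so the per-partition
    // results are disjoint and merging is just a union.
    let mut result = HashMap::new();
    for part in partitions {
        let mut table: HashMap<u64, i64> = HashMap::new();
        for i in part {
            *table.entry(keys[i]).or_insert(0) += values[i];
        }
        result.extend(table);
    }
    result
}

fn main() {
    let r = partition_wise_sum(&[1, 2, 1, 3, 2], &[10, 20, 30, 40, 50]);
    assert_eq!(r[&1], 40);
    assert_eq!(r[&2], 70);
    assert_eq!(r[&3], 40);
    println!("{:?}", r);
}
```

The design point is that each per-partition table stays small enough to be cache-resident, and partitions can also be processed by different threads without sharing a table.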
