adriangb commented on PR #17529: URL: https://github.com/apache/datafusion/pull/17529#issuecomment-3379650299
@rkrishn7 sorry if I haven't looped back here. I went on a bit of a tangent exploring https://github.com/apache/datafusion/pull/17632 and then had some vacation and a team offsite. This is overall very exciting work that I think will help a lot of people. My main concern with this change is the overhead and making a CPU / memory tradeoff decision for users. I think we might be able to ship it as an experimental thing with the feature flag defaulting to false as you've done in this PR but long term I worry that an [extra 8GB or RAM consumed](https://github.com/apache/datafusion/pull/17529#discussion_r2411406442) might be too much. Do you have any numbers on how fast and how much RAM these 3 different scenarios use for some queries? I don't mean to ask you to run them all but I do remember you mentioning you have already. 1. No hashes pushdown. 2. Build a single hash table (this PR). 3. Push down the existing hash tables w/ a CASE statement for which hash table to check based on the same partitioning as RepartitionExec. My hypothesis is that the table will look something like this: | | (1) | (2) | (3) | |----------|------|------|-----| | Runtime | 300s | 25s | 30s | | Peak mem | 8GB | 16GB | 8GB | *If* that were the case, which is just a guess at this point, making a query 10x faster with no extra memory use is easy to justify, everyone wants that! Choosing to make some queries 17% faster for 100% more memory use is harder to justify. If the performance difference is larger and we think it is justified in some cases maybe we can at least try to reserve the extra memory and fall back to re-using the existing hash tables? I also think it's worth thinking about integrating your suggestion from our conversation to use an `IN LIST` expression because it's even integrated into bloom filter pruning and predicate pruning. I see basically no reason to not do that for small build side result sets. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
