nuno-faria opened a new issue, #22098: URL: https://github.com/apache/datafusion/issues/22098
Currently, the order of a hash join (i.e., which relation is hashed and which is probed) is dictated by the size of the relations in bytes. This is controlled by the `should_swap_join_order` function: https://github.com/apache/datafusion/blob/e62f06c914cfdfb388abf0b49a45afe015d76fcb/datafusion/physical-optimizer/src/join_selection.rs#L79-L87

This is a safe heuristic, since we want to increase the probability of the hash table fitting in memory. However, if one of the tables is considerably wider than the other, we can end up choosing a narrower table with many more rows as the build side, which can be slower.

I ran some tests joining a narrow table (fixed at 100M rows) with a wide one (10M to 100M rows), disabling automatic join reordering with `datafusion.optimizer.join_reordering` so that both the narrow+wide and wide+narrow orders could be measured:

```sql
-- narrow (fixed to 100M rows)
create table t1 (k int, v int)
insert into t1 select i as k, i as v from generate_series(1, 100000000) t(i)

-- wide (variable number of rows)
create table t2 (k int, v varchar)
insert into t2 select i as k, i || repeat('0', 54) as v from generate_series(1, ...) t(i)
```

Here are the results (8 vCPUs, 32 GB memory, average of 10 runs):

<img width="400" alt="Image" src="https://github.com/user-attachments/assets/9d57880f-7c99-477d-8ff8-1aeb2acf9597" />

The wide table with 10M rows is slightly larger in bytes than the narrow one with 100M rows, so the current DataFusion implementation always uses the `narrow_join_wide` version. That version is only faster once the wide table reaches 90M rows, i.e. only 10% fewer rows than the narrow one.

So what I'd like to discuss is whether the hash join order should be decided by a more nuanced heuristic. For example: "if the difference in size between the tables is less than X, go by row count, otherwise go by byte size".
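That combined rule could be sketched roughly as follows. This is a minimal standalone sketch, not the actual DataFusion API: the standalone function signature and the `SIZE_RATIO_THRESHOLD` value are illustrative assumptions, and real statistics would come from plan cardinality estimates rather than exact counts.

```rust
/// Illustrative threshold: "X" from the proposal above, chosen arbitrarily here.
const SIZE_RATIO_THRESHOLD: f64 = 2.0;

/// Hypothetical variant of `should_swap_join_order`: returns true if the
/// sides should be swapped so the cheaper side becomes the build (hashed) side.
fn should_swap_join_order(
    left_bytes: usize,
    left_rows: usize,
    right_bytes: usize,
    right_rows: usize,
) -> bool {
    // Ratio between the larger and smaller side in bytes (guard against /0).
    let ratio = left_bytes.max(right_bytes) as f64
        / left_bytes.min(right_bytes).max(1) as f64;

    if ratio < SIZE_RATIO_THRESHOLD {
        // Byte sizes are comparable: prefer building on the side with fewer
        // rows, since hash-table insertion cost scales with row count.
        left_rows > right_rows
    } else {
        // Byte sizes differ a lot: keep the current byte-based heuristic to
        // maximize the chance of the build side fitting in memory.
        left_bytes > right_bytes
    }
}
```

With this shape, the narrow-100M-rows vs. wide-10M-rows case from the benchmark would pick the wide table as the build side (similar bytes, far fewer rows), while a genuinely huge wide table would still be probed rather than hashed.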
It appears that Postgres also does something like this:

```sql
-- t1 (narrow): 10M rows
-- t2 (wide): 8M rows <-- hashed
Hash Cond: (t1.k = t2.k)
->  Seq Scan on t1 (... width=8) (actual time=0.016..452.699 rows=10000000.00 loops=1)
->  Hash (...) (actual time=3315.965..3315.966 rows=8000000.00 loops=1)
      Buckets: 8388608  Batches: 1  Memory Usage: 830076kB
      ->  Seq Scan on t2 (... width=65) (actual time=0.027..462.818 rows=8000000.00 loops=1)

-- t1 (narrow): 10M rows <-- hashed
-- t2 (wide): 9.5M rows
Hash Cond: (t2.k = t1.k)
->  Seq Scan on t2 (... width=65) (actual time=0.010..437.648 rows=9500000.00 loops=1)
->  Hash (...) (actual time=3013.599..3013.600 rows=10000000.00 loops=1)
      Buckets: 16777216  Batches: 1  Memory Usage: 521697kB
      ->  Seq Scan on t1 (... width=8) (actual time=0.028..472.798 rows=10000000.00 loops=1)
```

For reference, here are the sizes of the tables. Building on the narrow table only became faster once the wide one was 9x larger in bytes.

<img width="400" alt="Image" src="https://github.com/user-attachments/assets/bd136af9-acec-44be-aecd-3dfb6b753e66" />

cc: @alamb @Dandandan @adriangb since I think you are also interested in hash join performance.
