Re: [I] Dynamic hash join order switching [datafusion]

via GitHub Thu, 20 Nov 2025 02:42:12 -0800


Dandandan commented on issue #18840:
URL: https://github.com/apache/datafusion/issues/18840#issuecomment-3557207693


   In the proposal as written, we don't necessarily need to decide after 
loading only a fraction of the left side (partition) into memory; we can wait 
until all of the build-side data has been loaded.
   
   BigQuery does this (as does Spark), but because it is a distributed query 
engine that materializes intermediate stages/shuffles, it always has complete 
statistics available when executing a join.
   
   Of course, the proposal to check after loading only works if both sides of 
the join can be held in memory at the same time, which might not be feasible. 
It could also consider spilling the left side, though, which might be an 
interesting option since it might also reduce peak memory consumption.
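
   To make the idea concrete, here is a minimal sketch of the decision step, 
not DataFusion code. `BuildSide`, `JoinPlan`, `choose_plan`, and the 
`memory_budget` parameter are all hypothetical names. It assumes the byte 
sizes of both inputs are known once they have been loaded, and that a spilled 
left side can still be re-read later.

```rust
/// Which input is used to build the hash table.
#[derive(Debug, PartialEq, Eq)]
enum BuildSide {
    Left,
    Right,
}

/// Hypothetical decision record; not a DataFusion type.
#[derive(Debug, PartialEq, Eq)]
struct JoinPlan {
    build: BuildSide,
    /// True when both inputs do not fit in memory together,
    /// so the left side would be written to disk.
    spill_left: bool,
}

/// Pick the build side after both inputs' sizes are known.
/// Assumption: sizes are exact byte counts measured after loading.
fn choose_plan(left_bytes: usize, right_bytes: usize, memory_budget: usize) -> JoinPlan {
    // Both sides can only be buffered at once if their sum fits the budget.
    let both_fit = left_bytes.saturating_add(right_bytes) <= memory_budget;

    // Build the hash table on the smaller input; keep the left side on ties,
    // since that is the default build side.
    let build = if right_bytes < left_bytes {
        BuildSide::Right
    } else {
        BuildSide::Left
    };

    JoinPlan {
        build,
        spill_left: !both_fit,
    }
}
```

   For example, `choose_plan(100, 10, 1000)` switches the build side to the 
right input without spilling, while `choose_plan(800, 300, 1000)` also marks 
the left side for spilling because the two inputs exceed the budget together.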


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

