ctsk commented on issue #17259:
URL: https://github.com/apache/datafusion/issues/17259#issuecomment-3269376835

   Predicate filters are likely not a large help on tpch, since that data is 
notoriously uniform - thus little skipping will occur.
   
   Ultimately, the issue is that the machine is running out of ram -  minor 
optimizations of the hash joins wont really help it: Instead we must do better 
choices on what we pick as build and probe side.
   
   I've had a look at polars join impl and it seems that they are doing 
something smart: They pick build and probe side during execution. A rough 
description would be:
   
   - First buffer a constant amount of rows of both inputs to the join. 
   - If one side is exhausted before the threshold is reached, then that side 
becomes the build side.
         - if the exhausted side is very small, pick a nested loop join
   - ...else the side with the smaller estimated cardinality (HLL sketched) 
becomes the build side.
   
   Perhaps this can side-step some of the bad join decisions / large build 
sides that datafusion makes.
   
   Thoughts on incorporating such an approach in datafusion? @alamb 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to