Re: [I] Improve performance of TPCH q4, q7, q9 at TPCH SF 100 [datafusion]

via GitHub Wed, 17 Sep 2025 12:47:14 -0700


ctsk commented on issue #17259:
URL: https://github.com/apache/datafusion/issues/17259#issuecomment-3269376835


   Predicate filters are likely not a large help on tpch, since that data is 
notoriously uniform - thus little skipping will occur.
   
   Ultimately, the issue is that the machine is running out of ram -  minor 
optimizations of the hash joins wont really help it: Instead we must do better 
choices on what we pick as build and probe side.
   
   I've had a look at polars join impl and it seems that they are doing 
something smart: They pick build and probe side during execution. A rough 
description would be:
   
   - First buffer a constant amount of rows of both inputs to the join. 
   - If one side is exhausted before the threshold is reached, then that side 
becomes the build side.
         - if the exhausted side is very small, pick a nested loop join
   - ...else the side with the smaller estimated cardinality (HLL sketched) 
becomes the build side.
   
   Perhaps this can side-step some of the bad join decisions / large build 
sides that datafusion makes.
   
   Thoughts on incorporating such an approach in datafusion? @alamb 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] Improve performance of TPCH q4, q7, q9 at TPCH SF 100 [datafusion]

Reply via email to