ctsk commented on issue #17259: URL: https://github.com/apache/datafusion/issues/17259#issuecomment-3269376835
Predicate filters are unlikely to help much on TPC-H: that data is notoriously uniform, so little skipping will occur.

Ultimately, the issue is that the machine is running out of RAM, and minor optimizations of the hash joins won't really help with that. Instead, we must make better choices about which side we pick as the build side and which as the probe side.

I've had a look at the Polars join implementation, and it seems they are doing something smart: they pick the build and probe sides during execution. A rough description would be:

- First, buffer a constant number of rows from both inputs to the join.
- If one side is exhausted before the threshold is reached, that side becomes the build side.
  - If the exhausted side is very small, pick a nested loop join.
- Otherwise, the side with the smaller estimated cardinality (HLL sketched) becomes the build side.

Perhaps this can side-step some of the bad join decisions / large build sides that DataFusion makes.

Thoughts on incorporating such an approach in DataFusion? @alamb
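
To make the steps above concrete, here is a minimal Python sketch of the decision logic only. It is not the Polars or DataFusion code. The names `choose_join_plan`, `JoinPlan`, `buffer_rows`, and `nested_loop_rows` are hypothetical, and so are the threshold values. An exact distinct count over the buffered keys stands in for the HLL sketch, and the handling of the case where both sides are exhausted (and of ties) is my own assumption, since the description above does not cover it.

```python
from itertools import islice
from typing import Hashable, Iterable, Iterator, List, NamedTuple, Tuple


class JoinPlan(NamedTuple):
    strategy: str    # "hash" or "nested_loop"
    build_side: str  # "left" or "right"


def _buffer(keys: Iterator[Hashable], limit: int) -> Tuple[List[Hashable], bool]:
    """Pull up to `limit` join keys and report whether the input ran dry first."""
    buf = list(islice(keys, limit))
    return buf, len(buf) < limit


def choose_join_plan(
    left: Iterable[Hashable],
    right: Iterable[Hashable],
    buffer_rows: int = 1024,     # hypothetical buffering threshold
    nested_loop_rows: int = 16,  # hypothetical "very small" cutoff
) -> JoinPlan:
    # Step 1: buffer a constant number of rows from both inputs.
    left_buf, left_done = _buffer(iter(left), buffer_rows)
    right_buf, right_done = _buffer(iter(right), buffer_rows)

    # Step 2: a side that is exhausted below the threshold becomes the build side.
    if left_done or right_done:
        if left_done and right_done:
            # Assumption: when both are exhausted, build on the shorter one.
            side = "left" if len(left_buf) <= len(right_buf) else "right"
        else:
            side = "left" if left_done else "right"
        size = len(left_buf) if side == "left" else len(right_buf)
        # Step 2a: a very small exhausted side switches to a nested loop join.
        strategy = "nested_loop" if size <= nested_loop_rows else "hash"
        return JoinPlan(strategy, side)

    # Step 3: neither side is exhausted, so compare estimated cardinalities.
    # Stand-in for HLL: exact distinct count of the buffered keys.
    left_card = len(set(left_buf))
    right_card = len(set(right_buf))
    return JoinPlan("hash", "left" if left_card <= right_card else "right")
```

For example, `choose_join_plan(range(5), range(10_000))` chooses a nested loop join with the left input as the build side, because the left input is exhausted after 5 rows. A real operator would also have to hand the buffered rows on to whichever join it picks rather than discard them, which this sketch leaves out.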