nuno-faria commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3073363029
I have also been looking at join performance and I think the main limitation is the order, followed by the lack of join parameterization. In TPC-H, 6 queries use a bad join order, although we can manually rewrite the SQL to force the correct one. Also, 3 TPC-H queries could benefit from join parameterization. As an extreme (and nonsensical) example, this query will take too long to complete with Datafusion due to the large joins, while only taking 200ms with DuckDB. ```sql SELECT * FROM lineitem, orders WHERE l_orderkey = o_orderkey AND o_orderkey = 1 AND l_quantity < ( SELECT avg(l_quantity) FROM lineitem WHERE l_orderkey = o_orderkey ); ``` If we emulate the parameterization here, Datafusion ends up more than 2x faster than DuckDB (altough manually parameterization is not feasible in most queries). Here is the breakdown of the queries in TPC-H: <img width="500" height="500" alt="Image" src="https://github.com/user-attachments/assets/3056e14c-ec91-4a3c-a395-0c44810bbdf7" /> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org