nuno-faria commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3073363029

   I have also been looking at join performance and I think the main limitation 
is the order, followed by the lack of join parameterization.
   In TPC-H, 6 queries use a bad join order, although we can manually rewrite 
the SQL to force the correct one. Also, 3 TPC-H queries could benefit from join 
parameterization. As an extreme (and nonsensical) example, this query will take 
too long to complete with Datafusion due to the large joins, while only taking 
200ms with DuckDB.
   
   ```sql
   SELECT *
   FROM lineitem, orders
   WHERE l_orderkey = o_orderkey
       AND o_orderkey = 1
       AND l_quantity < (
           SELECT avg(l_quantity)
           FROM lineitem
           WHERE l_orderkey = o_orderkey
       );
   ```
   
   If we emulate the parameterization here, Datafusion ends up more than 2x 
faster than DuckDB (altough manually parameterization is not feasible in most 
queries).
   
   Here is the breakdown of the queries in TPC-H:
   <img width="500" height="500" alt="Image" 
src="https://github.com/user-attachments/assets/3056e14c-ec91-4a3c-a395-0c44810bbdf7";
 />
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to