duongcongtoai opened a new issue, #14554: URL: https://github.com/apache/datafusion/issues/14554
### Is your feature request related to a problem or challenge? Datafusion already support decorrelating simple scalar subqueries in this PR: https://github.com/apache/datafusion/pull/6457 This follow the first approach in [TUM paper](https://btw-2015.informatik.uni-hamburg.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf) (simple unnesting), and allow decorrelating [this simple query](https://github.com/apache/datafusion/blob/873d4178ab3c98ca6f78ef553c8e33e12dd342fb/datafusion/core/tests/sqllogictests/test_files/subquery.slt#L797) ``` explain select t1.t1_int from t1 where (select count(*) from t2 where t1.t1_id = t2.t2_id) < t1.t1_int ``` However, if we add an `or` condition this subquery ``` explain select t1.t1_int from t1 where (select count(*) from t2 where t1.t1_id = t2.t2_id or t1.t1_name=t2.t2_name) < t1.t1_int ``` Datafusion cannot decorrelate it ``` +--------------+----------------------------------------------------------------------------------------+ | plan_type | plan | +--------------+----------------------------------------------------------------------------------------+ | logical_plan | Projection: t1.t1_int | | | Filter: (<subquery>) < CAST(t1.t1_int AS Int64) | | | Subquery: | | | Projection: count(*) | | | Aggregate: groupBy=[[]], aggr=[[count(Int64(1)) AS count(*)]] | | | Filter: outer_ref(t1.t1_id) = t2.t2_id OR outer_ref(t1.t1_name) = t2.t2_name | | | TableScan: t2 | | | TableScan: t1 projection=[t1_id, t1_name, t1_int] | +--------------+----------------------------------------------------------------------------------------+ ``` ### Describe the solution you'd like Support decorrelating this query following the second method mentioned in the paper ### Describe alternatives you've considered _No response_ ### Additional context General framework for decorrelation maybe discussed here https://github.com/apache/datafusion/issues/5492 But the steps needed to make this work is followed Allow decorrelation for this type of filter exprs in this code: https://github.com/apache/datafusion/blob/813220d54f08c5203ad79bfb066ca638abe208ed/datafusion/optimizer/src/decorrelate.rs#L162 Add more logic to handle complex query decorrelation: - Build domain/magic relation - Rewrite the subquery to join inner table (table of the subquery) with domain/magic relation using its complex filter expression (i.e `t2.t2_id = domain.t1_id OR t2.t2_name = domain.t1_name`) - Rewrite aggregation to group by the additional columns mentioned in the domain/magic relation - Join the outer relation with the newly built aggregation For example the above mentioned query may be rewritten like ``` explain select t1.t1_int from t1, ( select count(*) as count_all, domain.t1_id as t1_id, domain.t1_name as t1_name from ( select distinct t1_id, t1_name from t1 ) as domain join t2 where t2.t2_id = domain.t1_id or t2.t2_name=domain.t1_name group by domain.t1_id, domain.t1_name ) as pulled_up where t1.t1_id=pulled_up.t1_id and pulled_up.count_all < t1.t1_int ``` Logical plan may look like ``` | logical_plan | Projection: t1.t1_int | | | Inner Join: t1.t1_id = pulled_up.t1_id Filter: pulled_up.count_all < CAST(t1.t1_int AS Int64) | | | TableScan: t1 projection=[t1_id, t1_int] | | | SubqueryAlias: pulled_up | | | Projection: count(*) AS count_all, domain.t1_id | | | Aggregate: groupBy=[[domain.t1_id, domain.t1_name]], aggr=[[count(Int64(1)) AS count(*)]] | | | Projection: domain.t1_id, domain.t1_name | | | Inner Join: Filter: t2.t2_id = domain.t1_id OR t2.t2_name = domain.t1_name | | | SubqueryAlias: domain | | | Aggregate: groupBy=[[t1.t1_id, t1.t1_name]], aggr=[[]] | | | TableScan: t1 projection=[t1_id, t1_name] | | | TableScan: t2 projection=[t2_id, t2_name] ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org