Thank you, Stamatis. This is helpful. I have linked to this email thread in CALCITE-5631.
> On Apr 16, 2023, at 2:44 AM, Stamatis Zampetakis <zabe...@gmail.com> wrote: > > Few quick thoughts about this. > > For to the problem of query minimization/redundant joins the simpler > scenarios that I can think of are the following: > > # Scenario A > select e1.id from emp e1 inner join emp e2 on e1.name = e2.name; > > If you know the name column is UNIQUE then you can drop the join on e2. > > # Scenario B > select e.name from emp e inner join dept d on e.dept = d.id; > > If you know that e.dept and d.id is a foreign key relationship then > you can drop the join on dept. > > There are probably other cases to define and handle but we should move > incrementally. > > As Julian pointed out, the issue logged in CALITE-5631 could also be > addressed by employing common table expression related optimizations. > CTE optimizations and query minimization are both interesting and > powerful techniques to reduce the cost of the query (whatever that is > speed, money, resources, etc). > > I would suggest focusing on query minimization first since it is > pretty well defined and we could come up with solutions much faster > than CTEs. CTEs usually come up with decisions about materializing or > not the common expressions which are closer to lower level ("physical > plan") optimizations. > > Most minimization techniques focus on select project join (SPJ) > queries so I guess we would have to do some preprocessing to bring the > plan in this format (only Scan, Project, Filter, and Join operators) > before applying the rule. It would be a separate planning phase > combining a bunch of existing rules followed by some new which is > inline with what Julian was saying about bottom-up unification. > > The new rule could be something similar to LoptOptimizeJoinRule that > operates on a MultiJoin. I haven't checked if the MultiJoin operator > is sufficient to express an SPJ query but I think the general idea of > grouping joins together seems to be a promising direction for writing > new rules. > > Best, > Stamatis > > On Sun, Apr 16, 2023 at 2:27 AM Julian Hyde <jh...@apache.org> wrote: >> >> Ian Bertolacci recently logged >> https://issues.apache.org/jira/browse/CALCITE-5631, to convert >> >> select >> (select numarrayagg(C5633_203) from T893 where C5633_586 = T895.id), >> (select numarrayagg(C5633_170) from T893 where C5633_586 = T895.id) >> from T895 >> >> into >> >> select agg.agg1, >> agg.agg2 >> from T895 >> left join ( >> select C5633_586, >> numarrayagg(C5633_203) as agg1, >> numarrayagg(C5633_170) as agg2 >> from T893 >> where C5633_586 is not null >> group by C5633_586) as agg >> on agg.C5633_586 = T895.id >> >> This seems to me an interesting and important problem. But it's also a >> hard problem, and it's not clear to me which approach is the best. >> Does anyone have any ideas for how to approach it? >> >> Also, we could use more example queries that illustrate the general >> pattern. (Preferably in terms of simple databases such as EMP and >> DEPT.) >> >> In Calcite rewrite rules (RelRule) are usually the preferred approach. >> Because the common relational expressions scans can be an arbitrary >> distance apart in the RelNode tree, RelRule doesn't seem suitable. >> >> There seem to be some similarities to algorithms to use materialized >> views, which use bottom-up unification. >> >> Ian's original query actually has correlated scalar sub-queries rather >> than explicit joins. Would it be better to target common sub-queries >> rather than joins? >> >> Lastly, there are similarities with the WinMagic algorithm, which >> converts correlated sub-queries into window aggregates. Is that a >> useful direction? (My implementation of measures in CALCITE-4496 >> naturally creates correlated scalar sub-queries that can be inlined in >> the enclosing query if simple, or converted to window aggregates if >> more complex.) >> >> Julian