Dear Calcite Devs, I would like to start prototyping the approach for the optimization mentioned in (CALCITE-5631 <https://issues.apache.org/jira/browse/CALCITE-5631>). So I just want to follow up on the approach mentioned in the above document. Please let me know if this approach seems reasonable to all of you.
@Stamatis and @Julian, Please let me know if you got a chance to take a look at the document and are in-line with the approach. Any suggestions/clarifications on this approach would be very helpful. Thanks, On Mon, Apr 24, 2023 at 7:25 PM Hanumath Maduri <hanumathmad...@gmail.com> wrote: > > Thanks Julian and Stamatis for sharing your thoughts on this optimization. > I have jotted down a general approach which could optimize the common > subexpression with join elimination use cases in the following document. > Please go through this > <https://docs.google.com/document/d/1rF6sZOipfXe2UdBViSbIF33FMXyz5Ev_s-q4IMe8gEI/edit?usp=sharing> > document and let me know what you think of this approach. > > In gist the approach mentioned in this document is > 1. Finding the common subexpressions > 2. Using the subexpression information to eliminate the joins when > possible. > > I did think about implementing this optimization in the decorrelator as > well, but this approach was not explored further because it could miss > finding non correlated common sub expressions. > > Let me know if you have any questions or suggestions on this approach. > > Thanks, > Hanu > > > > > On Wed, Apr 19, 2023 at 11:15 AM Julian Hyde <jhyde.apa...@gmail.com> > wrote: > >> Thank you, Stamatis. This is helpful. I have linked to this email thread >> in CALCITE-5631. >> >> > On Apr 16, 2023, at 2:44 AM, Stamatis Zampetakis <zabe...@gmail.com> >> wrote: >> > >> > Few quick thoughts about this. >> > >> > For to the problem of query minimization/redundant joins the simpler >> > scenarios that I can think of are the following: >> > >> > # Scenario A >> > select e1.id from emp e1 inner join emp e2 on e1.name = e2.name; >> > >> > If you know the name column is UNIQUE then you can drop the join on e2. >> > >> > # Scenario B >> > select e.name from emp e inner join dept d on e.dept = d.id; >> > >> > If you know that e.dept and d.id is a foreign key relationship then >> > you can drop the join on dept. >> > >> > There are probably other cases to define and handle but we should move >> > incrementally. >> > >> > As Julian pointed out, the issue logged in CALITE-5631 could also be >> > addressed by employing common table expression related optimizations. >> > CTE optimizations and query minimization are both interesting and >> > powerful techniques to reduce the cost of the query (whatever that is >> > speed, money, resources, etc). >> > >> > I would suggest focusing on query minimization first since it is >> > pretty well defined and we could come up with solutions much faster >> > than CTEs. CTEs usually come up with decisions about materializing or >> > not the common expressions which are closer to lower level ("physical >> > plan") optimizations. >> > >> > Most minimization techniques focus on select project join (SPJ) >> > queries so I guess we would have to do some preprocessing to bring the >> > plan in this format (only Scan, Project, Filter, and Join operators) >> > before applying the rule. It would be a separate planning phase >> > combining a bunch of existing rules followed by some new which is >> > inline with what Julian was saying about bottom-up unification. >> > >> > The new rule could be something similar to LoptOptimizeJoinRule that >> > operates on a MultiJoin. I haven't checked if the MultiJoin operator >> > is sufficient to express an SPJ query but I think the general idea of >> > grouping joins together seems to be a promising direction for writing >> > new rules. >> > >> > Best, >> > Stamatis >> > >> > On Sun, Apr 16, 2023 at 2:27 AM Julian Hyde <jh...@apache.org> wrote: >> >> >> >> Ian Bertolacci recently logged >> >> https://issues.apache.org/jira/browse/CALCITE-5631, to convert >> >> >> >> select >> >> (select numarrayagg(C5633_203) from T893 where C5633_586 = >> T895.id), >> >> (select numarrayagg(C5633_170) from T893 where C5633_586 = T895.id) >> >> from T895 >> >> >> >> into >> >> >> >> select agg.agg1, >> >> agg.agg2 >> >> from T895 >> >> left join ( >> >> select C5633_586, >> >> numarrayagg(C5633_203) as agg1, >> >> numarrayagg(C5633_170) as agg2 >> >> from T893 >> >> where C5633_586 is not null >> >> group by C5633_586) as agg >> >> on agg.C5633_586 = T895.id >> >> >> >> This seems to me an interesting and important problem. But it's also a >> >> hard problem, and it's not clear to me which approach is the best. >> >> Does anyone have any ideas for how to approach it? >> >> >> >> Also, we could use more example queries that illustrate the general >> >> pattern. (Preferably in terms of simple databases such as EMP and >> >> DEPT.) >> >> >> >> In Calcite rewrite rules (RelRule) are usually the preferred approach. >> >> Because the common relational expressions scans can be an arbitrary >> >> distance apart in the RelNode tree, RelRule doesn't seem suitable. >> >> >> >> There seem to be some similarities to algorithms to use materialized >> >> views, which use bottom-up unification. >> >> >> >> Ian's original query actually has correlated scalar sub-queries rather >> >> than explicit joins. Would it be better to target common sub-queries >> >> rather than joins? >> >> >> >> Lastly, there are similarities with the WinMagic algorithm, which >> >> converts correlated sub-queries into window aggregates. Is that a >> >> useful direction? (My implementation of measures in CALCITE-4496 >> >> naturally creates correlated scalar sub-queries that can be inlined in >> >> the enclosing query if simple, or converted to window aggregates if >> >> more complex.) >> >> >> >> Julian >> >>