Dear Calcite Devs,

I would like to start prototyping the approach for the optimization
mentioned in (CALCITE-5631
<https://issues.apache.org/jira/browse/CALCITE-5631>). So I just want to
follow up on the approach mentioned in the above document.
Please let me know if this approach seems reasonable to all of you.

@Stamatis and @Julian,  Please let me know if you got a chance to take a
look at the document and are in-line with the approach.
Any suggestions/clarifications on this approach would be very helpful.

Thanks,


On Mon, Apr 24, 2023 at 7:25 PM Hanumath Maduri <hanumathmad...@gmail.com>
wrote:

>
> Thanks Julian and Stamatis for sharing your thoughts on this optimization.
> I have jotted down a general approach which could optimize the common
> subexpression with join elimination use cases in the following document.
> Please go through this
> <https://docs.google.com/document/d/1rF6sZOipfXe2UdBViSbIF33FMXyz5Ev_s-q4IMe8gEI/edit?usp=sharing>
> document and let me know what you think of this approach.
>
> In gist the approach mentioned in this document is
> 1. Finding the common subexpressions
> 2. Using the subexpression information to eliminate the joins when
> possible.
>
> I did think about implementing this optimization in the decorrelator as
> well, but this approach was not explored further because it could miss
> finding non correlated common sub expressions.
>
> Let me know if you have any questions or suggestions on this approach.
>
> Thanks,
> Hanu
>
>
>
>
> On Wed, Apr 19, 2023 at 11:15 AM Julian Hyde <jhyde.apa...@gmail.com>
> wrote:
>
>> Thank you, Stamatis. This is helpful. I have linked to this email thread
>> in CALCITE-5631.
>>
>> > On Apr 16, 2023, at 2:44 AM, Stamatis Zampetakis <zabe...@gmail.com>
>> wrote:
>> >
>> > Few quick thoughts about this.
>> >
>> > For to the problem of query minimization/redundant joins the simpler
>> > scenarios that I can think of are the following:
>> >
>> > # Scenario A
>> > select e1.id from emp e1 inner join emp e2 on e1.name = e2.name;
>> >
>> > If you know the name column is UNIQUE then you can drop the join on e2.
>> >
>> > # Scenario B
>> > select e.name from emp e inner join dept d on e.dept = d.id;
>> >
>> > If you know that e.dept and d.id is a foreign key relationship then
>> > you can drop the join on dept.
>> >
>> > There are probably other cases to define and handle but we should move
>> > incrementally.
>> >
>> > As Julian pointed out, the issue logged  in CALITE-5631 could also be
>> > addressed by employing common table expression related optimizations.
>> > CTE optimizations and query minimization are both interesting and
>> > powerful techniques to reduce the cost of the query (whatever that is
>> > speed, money, resources, etc).
>> >
>> > I would suggest focusing on query minimization first since it is
>> > pretty well defined and we could come up with solutions much faster
>> > than CTEs. CTEs usually come up with decisions about materializing or
>> > not the common expressions which are closer to lower level ("physical
>> > plan") optimizations.
>> >
>> > Most minimization techniques focus on select project join (SPJ)
>> > queries so I guess we would have to do some preprocessing to bring the
>> > plan in this format (only Scan, Project, Filter, and Join operators)
>> > before applying the rule. It would be a separate planning phase
>> > combining a bunch of existing rules followed by some new which is
>> > inline with what Julian was saying about bottom-up unification.
>> >
>> > The new rule could be something similar to LoptOptimizeJoinRule that
>> > operates on a MultiJoin. I haven't checked if the MultiJoin operator
>> > is sufficient to express an SPJ query but I think the general idea of
>> > grouping joins together seems to be a promising direction for writing
>> > new rules.
>> >
>> > Best,
>> > Stamatis
>> >
>> > On Sun, Apr 16, 2023 at 2:27 AM Julian Hyde <jh...@apache.org> wrote:
>> >>
>> >> Ian Bertolacci recently logged
>> >> https://issues.apache.org/jira/browse/CALCITE-5631, to convert
>> >>
>> >>  select
>> >>     (select numarrayagg(C5633_203) from T893 where C5633_586 =
>> T895.id),
>> >>     (select numarrayagg(C5633_170) from T893 where C5633_586 = T895.id)
>> >>  from T895
>> >>
>> >> into
>> >>
>> >>  select agg.agg1,
>> >>      agg.agg2
>> >>  from T895
>> >>  left join (
>> >>    select C5633_586,
>> >>        numarrayagg(C5633_203) as agg1,
>> >>        numarrayagg(C5633_170) as agg2
>> >>    from T893
>> >>    where C5633_586 is not null
>> >>    group by C5633_586) as agg
>> >>  on agg.C5633_586 = T895.id
>> >>
>> >> This seems to me an interesting and important problem. But it's also a
>> >> hard problem, and it's not clear to me which approach is the best.
>> >> Does anyone have any ideas for how to approach it?
>> >>
>> >> Also, we could use more example queries that illustrate the general
>> >> pattern.  (Preferably in terms of simple databases such as EMP and
>> >> DEPT.)
>> >>
>> >> In Calcite rewrite rules (RelRule) are usually the preferred approach.
>> >> Because the common relational expressions scans can be an arbitrary
>> >> distance apart in the RelNode tree, RelRule doesn't seem suitable.
>> >>
>> >> There seem to be some similarities to algorithms to use materialized
>> >> views, which use bottom-up unification.
>> >>
>> >> Ian's original query actually has correlated scalar sub-queries rather
>> >> than explicit joins. Would it be better to target common sub-queries
>> >> rather than joins?
>> >>
>> >> Lastly, there are similarities with the WinMagic algorithm, which
>> >> converts correlated sub-queries into window aggregates. Is that a
>> >> useful direction? (My implementation of measures in CALCITE-4496
>> >> naturally creates correlated scalar sub-queries that can be inlined in
>> >> the enclosing query if simple, or converted to window aggregates if
>> >> more complex.)
>> >>
>> >> Julian
>>
>>

Reply via email to