Hi Vladimir
Thanks for your reply:
1. I run TpchTest#testQuery07 in my MPB, the avg time is 43s, and after
re-run the PR ut, the ut is passed, so this ut timeout may be an occasional
situation.

2. For the pull/2222, it due increase the time of TpchTest#testQuery07,
about 23%, sorry for that, we test the feature in our product env(Calcite
is our SQL facade in ByteDance, it process hundreds of thousands SQL per
day), and we have monitor about optimization time, no obvious changes were
found. Without pull/2222, some cases will fail due to "There are not enough
rules to produce a node with desired properties: There is 1 empty subset",
the root cause is the outdated rule matches have been fired.

Actually, I found actually the time was taken by RelSubset#getRelList, a
fresh case I found yesterday, see
https://github.com/apache/calcite/pull/2283, it OOM
in VolcanoRuleCall.matchRecurse when calling getRelList(but the root clause
must be the deadlock between rules), so I think the pull/2222's fix is not
a problem, the true problem is in "getRelList", my proposal is we can store
the relList in RelSubSet like RelSet#rels.

3. For the case TpchTest#testQuery07, after optimization, the largest rel
ID is increased to *900k*, and has *20k* rels remain in volcano planner,
so it's pretty large. In fact, after profile, I found most of the time is
taken by ProjectMergeRule, in our company's error cases(OOM, Stack
Overflow) are also related to this rule, maybe we can see if we can
optimize this.

Finally, to summarize:
1. TpchTest#testQuery07 timeout is an occasional situation, normally it
ends about 40~45s(in my MBP15 2019).
2. pull/2222 itself is a positive fix, it due increase the time but the
true problem is RelSubset#getRelList, we can improve pull/2222 by
improving RelSubset#getRelList, this not only benefits this, it
makes getRelList a cheap function as its name, we can call this with
no burden
3. Calcite sometime will generate huge alternate rels, it actually not
reasonable most of the time, we've found many OOM/Stack Overflow/timeout,
but the SQL is highly private, If you are interested, I can share the
cases' root clause using blog or email, but I may have no time to construct
these cases. Let's workout together to decrease them.

Regards!

Aron Tao


Vladimir Sitnikov <sitnikov.vladi...@gmail.com> 于2020年12月2日周三 下午10:01写道:

> Hi,
>
> I've made a quick check, and the results are as follows:
> 1) The test was enabled recently (2020-10-20) in
>
> https://github.com/apache/calcite/commit/b5a761e559ca1c1c914e388df4c6a0958dc17fc8
> 2) One of the recent changes that do significantly impact the performance
> of the test is https://github.com/apache/calcite/pull/2222
> 3) It is sad we commit performance-related fixes without taking performance
> into consideration :-(
>
> Do you think you could look into 2222 again and check if the slowdown was
> expected?
> Do you think you could improve 2222?
> Do you think we should revert 2222?
>
> Vladimir
>

Reply via email to