Re: TpchTest#testQuery07 timeout when running continuous-integration/appveyor/pr

Chunwei Lei Thu, 03 Dec 2020 00:01:44 -0800

Thank you for your effort, Aron.

I am wondering if you can file a jira issue and paste your findings.
I would like to look into it if I find some time.


Best,
Chunwei


On Thu, Dec 3, 2020 at 1:15 PM JiaTao Tao <taojia...@gmail.com> wrote:

> Hi Vladimir
> Thanks for your reply:
> 1. I run TpchTest#testQuery07 in my MPB, the avg time is 43s, and after
> re-run the PR ut, the ut is passed, so this ut timeout may be an occasional
> situation.
>
> 2. For the pull/2222, it due increase the time of TpchTest#testQuery07,
> about 23%, sorry for that, we test the feature in our product env(Calcite
> is our SQL facade in ByteDance, it process hundreds of thousands SQL per
> day), and we have monitor about optimization time, no obvious changes were
> found. Without pull/2222, some cases will fail due to "There are not enough
> rules to produce a node with desired properties: There is 1 empty subset",
> the root cause is the outdated rule matches have been fired.
>
> Actually, I found actually the time was taken by RelSubset#getRelList, a
> fresh case I found yesterday, see
> https://github.com/apache/calcite/pull/2283, it OOM
> in VolcanoRuleCall.matchRecurse when calling getRelList(but the root clause
> must be the deadlock between rules), so I think the pull/2222's fix is not
> a problem, the true problem is in "getRelList", my proposal is we can store
> the relList in RelSubSet like RelSet#rels.
>
> 3. For the case TpchTest#testQuery07, after optimization, the largest rel
> ID is increased to *900k*, and has *20k* rels remain in volcano planner,
> so it's pretty large. In fact, after profile, I found most of the time is
> taken by ProjectMergeRule, in our company's error cases(OOM, Stack
> Overflow) are also related to this rule, maybe we can see if we can
> optimize this.
>
> Finally, to summarize:
> 1. TpchTest#testQuery07 timeout is an occasional situation, normally it
> ends about 40~45s(in my MBP15 2019).
> 2. pull/2222 itself is a positive fix, it due increase the time but the
> true problem is RelSubset#getRelList, we can improve pull/2222 by
> improving RelSubset#getRelList, this not only benefits this, it
> makes getRelList a cheap function as its name, we can call this with
> no burden
> 3. Calcite sometime will generate huge alternate rels, it actually not
> reasonable most of the time, we've found many OOM/Stack Overflow/timeout,
> but the SQL is highly private, If you are interested, I can share the
> cases' root clause using blog or email, but I may have no time to construct
> these cases. Let's workout together to decrease them.
>
> Regards!
>
> Aron Tao
>
>
> Vladimir Sitnikov <sitnikov.vladi...@gmail.com> 于2020年12月2日周三 下午10:01写道：
>
> > Hi,
> >
> > I've made a quick check, and the results are as follows:
> > 1) The test was enabled recently (2020-10-20) in
> >
> >
> https://github.com/apache/calcite/commit/b5a761e559ca1c1c914e388df4c6a0958dc17fc8
> > 2) One of the recent changes that do significantly impact the performance
> > of the test is https://github.com/apache/calcite/pull/2222
> > 3) It is sad we commit performance-related fixes without taking
> performance
> > into consideration :-(
> >
> > Do you think you could look into 2222 again and check if the slowdown was
> > expected?
> > Do you think you could improve 2222?
> > Do you think we should revert 2222?
> >
> > Vladimir
> >
>

Re: TpchTest#testQuery07 timeout when running continuous-integration/appveyor/pr

Reply via email to