Thank you for your effort, Aron. I am wondering if you can file a jira issue and paste your findings. I would like to look into it if I find some time.
Best, Chunwei On Thu, Dec 3, 2020 at 1:15 PM JiaTao Tao <taojia...@gmail.com> wrote: > Hi Vladimir > Thanks for your reply: > 1. I run TpchTest#testQuery07 in my MPB, the avg time is 43s, and after > re-run the PR ut, the ut is passed, so this ut timeout may be an occasional > situation. > > 2. For the pull/2222, it due increase the time of TpchTest#testQuery07, > about 23%, sorry for that, we test the feature in our product env(Calcite > is our SQL facade in ByteDance, it process hundreds of thousands SQL per > day), and we have monitor about optimization time, no obvious changes were > found. Without pull/2222, some cases will fail due to "There are not enough > rules to produce a node with desired properties: There is 1 empty subset", > the root cause is the outdated rule matches have been fired. > > Actually, I found actually the time was taken by RelSubset#getRelList, a > fresh case I found yesterday, see > https://github.com/apache/calcite/pull/2283, it OOM > in VolcanoRuleCall.matchRecurse when calling getRelList(but the root clause > must be the deadlock between rules), so I think the pull/2222's fix is not > a problem, the true problem is in "getRelList", my proposal is we can store > the relList in RelSubSet like RelSet#rels. > > 3. For the case TpchTest#testQuery07, after optimization, the largest rel > ID is increased to *900k*, and has *20k* rels remain in volcano planner, > so it's pretty large. In fact, after profile, I found most of the time is > taken by ProjectMergeRule, in our company's error cases(OOM, Stack > Overflow) are also related to this rule, maybe we can see if we can > optimize this. > > Finally, to summarize: > 1. TpchTest#testQuery07 timeout is an occasional situation, normally it > ends about 40~45s(in my MBP15 2019). > 2. pull/2222 itself is a positive fix, it due increase the time but the > true problem is RelSubset#getRelList, we can improve pull/2222 by > improving RelSubset#getRelList, this not only benefits this, it > makes getRelList a cheap function as its name, we can call this with > no burden > 3. Calcite sometime will generate huge alternate rels, it actually not > reasonable most of the time, we've found many OOM/Stack Overflow/timeout, > but the SQL is highly private, If you are interested, I can share the > cases' root clause using blog or email, but I may have no time to construct > these cases. Let's workout together to decrease them. > > Regards! > > Aron Tao > > > Vladimir Sitnikov <sitnikov.vladi...@gmail.com> 于2020年12月2日周三 下午10:01写道: > > > Hi, > > > > I've made a quick check, and the results are as follows: > > 1) The test was enabled recently (2020-10-20) in > > > > > https://github.com/apache/calcite/commit/b5a761e559ca1c1c914e388df4c6a0958dc17fc8 > > 2) One of the recent changes that do significantly impact the performance > > of the test is https://github.com/apache/calcite/pull/2222 > > 3) It is sad we commit performance-related fixes without taking > performance > > into consideration :-( > > > > Do you think you could look into 2222 again and check if the slowdown was > > expected? > > Do you think you could improve 2222? > > Do you think we should revert 2222? > > > > Vladimir > > >