[ 
http://issues.apache.org/jira/browse/DERBY-2130?page=comments#action_12456541 ] 
            
A B commented on DERBY-2130:
----------------------------

> Is it possible that the 10.1 vs 10.3 database is a red herring here,
> and this is simply the same "non-determinism" that I was observing
> in my initial tests?

Good point; it very well could be.  That hadn't occurred to me.

> My guess is that if you created 10.1 database A, and ran 10.3 against
> it, and then created 10.1 database B, and ran 10.3 against it, you
> might see dramatic differences even in just those two runs.

Have you yourself seen this variance with *10.1* databases, as well?  I was 
limiting myself to the details in the description of this issue, which only 
mention the variance with 10.2 and trunk.  If you are also seeing the variance 
in 10.1, then can we assume that the non-deterministic variance aspect of this 
issue is not a regression, but is an existing problem with Derby since 
who-knows-when?  That might correlate with Mike's earlier comments about that 
being a potential issue in the past (hash codes based on DDL)...

> I'm wondering if there is something going on, correlated to the DDL
> statements executed by the repro script, which sometimes results in
> tables that return "reasonable" estimates, and sometimes results in
> tables that return "infinite" estimates.

That's a great question.  A follow-up question would be whether this 
DDL-based variance is a regression from 10.1 or whether it has been there from 
the "beginning".

As you can tell, my primary focus here has been on potentially regressed 
behavior.  As I've been looking at this problem, I have been asking myself "what 
changed from 10.1 to 10.2 to cause this sudden slow-down?"  So that's where my 
mind (and all of my questions) has been leading...

And on that point, I mentioned in my last comment that the JUMP behavior "does 
not take effect" for 10.3, but it does for 10.1.  I spent some time tracing 
through the code and quickly realized that this change in behavior is an 
indirect result of DERBY-1357.  Here's why.

As explained on the wiki page that describes jumping 
(http://wiki.apache.org/db-derby/JoinOrderPermutations), the optimizer will 
find the first complete join order for a query before attempting to jump.  In 
the repro query the first join order is simply "0 1 2 3 4 5 6 7".  The 
associated cost is 5.607E62.  Using the row counts calculated for this join 
order, the "target" join order then becomes "1 4 2 6 7 0 3 5".  This means that 
the next 8 calls to "getNextPermutation()" will return partial join orders that 
build up to this target, i.e.:

1 -1 -1 -1 -1 -1 -1 -1
1  4 -1 -1 -1 -1 -1 -1
1  4  2 -1 -1 -1 -1 -1
etc.
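
Just to make that stepping concrete, here is a tiny, hypothetical sketch (plain 
Java, not the actual OptimizerImpl code; the class and field names are made up) 
of how each call fills in one more position of the target order, with -1 
marking positions that have not been assigned yet:

    // Hypothetical illustration only -- not Derby's OptimizerImpl.
    public class JumpSketch {
        private final int[] target = {1, 4, 2, 6, 7, 0, 3, 5}; // "target" join order
        private int joinPosition = -1;                          // last position filled so far

        /** Returns the next partial join order on the way to the target. */
        public int[] getNextPermutation() {
            int[] partial = new int[target.length];
            java.util.Arrays.fill(partial, -1);   // -1 == position not yet assigned
            joinPosition++;
            for (int i = 0; i <= joinPosition && i < target.length; i++) {
                partial[i] = target[i];
            }
            return partial;
        }

        public static void main(String[] args) {
            JumpSketch sketch = new JumpSketch();
            for (int i = 0; i < 3; i++) {
                System.out.println(java.util.Arrays.toString(sketch.getNextPermutation()));
            }
            // Prints [1, -1, ...], then [1, 4, -1, ...], then [1, 4, 2, -1, ...]
        }
    }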

Prior to DERBY-1357 the optimizer's "short-circuit" logic was not working, so 
even if a partial join order was more expensive than the best complete join 
order found so far, the optimizer would continue to waste time optimizing join 
orders that were always going to return an estimate higher than the best so 
far.  So in 10.1 (which does not have the fix for DERBY-1357) we'll merrily 
continue on our way from [1 4 2 ... ] all the way up until we either time out 
or find a join order that is cheaper than 5.607E62.  We eventually find a 
cheaper join order after about 2500 permutations, and that's the point at which 
optimization stops (see details from my previous comment).

With 10.2 and trunk, though, the DERBY-1357 short-circuit logic has been 
corrected.  And as it happens, the estimated cost for the partial join order 
"1 4 2 6 7 0 3" is "Infinity"; since that's greater than 5.607E62, we do not 
advance the join order, and thus we effectively skip every join order that 
starts with that prefix (because we know it's going to return a cost that's 
worse than 5.607E62).  That part is, I think, correct.
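
To spell out that comparison, here is a rough, self-contained sketch of the 
idea (not the actual OptimizerImpl logic; the variable names are mine):

    // Rough sketch of the DERBY-1357 "short-circuit" check -- not Derby code.
    public class ShortCircuitSketch {
        public static void main(String[] args) {
            double bestCostSoFar = 5.607E62;               // best complete join order so far
            double partialCost = Double.POSITIVE_INFINITY; // estimate for prefix "1 4 2 6 7 0 3"

            // If the partial order already costs at least as much as the best
            // complete plan, no completion of it can be cheaper, so there is
            // no point advancing the join order into that subtree.
            boolean advanceJoinOrder = partialCost < bestCostSoFar;
            System.out.println("advance join order? " + advanceJoinOrder); // false
        }
    }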

But having chosen to NOT advance the join order, now we come to the following 
block of code in OptimizerImpl:

    if (permuteState == JUMPING && !joinPosAdvanced && joinPosition >= 0)
    {
        //not feeling well in the middle of jump
        // Note: we have to make sure we reload the best plans
        // as we rewind since they may have been clobbered
        // (as part of the current join order) before we gave
        // up on jumping.
        reloadBestPlan = true;
        rewindJoinOrder();  //fall
        permuteState = NO_JUMP;  //give up
    }

Since we're in the middle of jumping and since we did *not* advance the join 
order, we execute this "if" block--i.e. we rewind the join order back to the 
beginning and start all over.  Or, put another way, we give up on the "JUMP" and 
go back to normal processing.  From there we then have to go through thousands 
and thousands of permutations before we find a plan that's cheaper than 
5.607E62.  Depending on which database is in use, that can be ten thousand 
permutations or it can be forty thousand, which explains the variance in the 
time taken to complete optimization.

I think the changes for DERBY-1357 are correct; what's not clear is whether the 
"if" block shown above is really necessary.  The idea sort of makes sense to 
me: if we broke away from normal processing and jumped to a _bad_ join order 
then we want to abort the jump as quickly as possible and return to normal 
processing.  However, now that the optimizer "short-circuit" logic is fixed, 
I'm not so sure we need to revert back to "normal" processing when we jump to a 
bad join order.  If the join order is bad then the short-circuit logic will 
make sure that we do not waste "too much" (yes, that's an awfully subjective 
term) time trying out bad join orders.  So would it make sense to just 
"iterate" the target join order as usual without defaulting back to square one?

As a sanity check I commented the above "if" block out in the trunk codeline 
and ran the repro; when I did so, the script executed in about the same time as 
it does on the latest 10.1 branch.  So this seems to confirm my findings.  FAR 
more testing is needed, though, to see if this simple removal of the "if" block 
is a viable solution.
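
For what it's worth, here is a rough, untested sketch (purely illustrative, not 
a patch) of a gentler variant of the same idea: keep the block but drop the 
rewind, so we still give up on jumping when it stalls, but we continue iterating 
from the current position instead of going back to square one:

    // Purely illustrative sketch, NOT a tested patch: same condition as the
    // block quoted above, but without rewinding the join order.  We still
    // give up on the jump; we just don't throw away the positions assigned
    // so far, and instead let the normal (now short-circuited) iteration
    // continue from where we are.
    if (permuteState == JUMPING && !joinPosAdvanced && joinPosition >= 0)
    {
        reloadBestPlan = true;   // still reload best plans that may have been clobbered
        permuteState = NO_JUMP;  // give up on jumping, but do NOT rewindJoinOrder()
    }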

All of that said, I still think the BIGGER problems we are seeing are 1) 
DERBY-1905 and 2) as Bryan has said, the apparent variance in cost estimates 
based on DDL.  DERBY-1905 is definitely playing a huge role here, what with the 
Infinity cost estimates followed by NaN.  These kinds of estimates render the 
vast majority of the join order cost comparisons useless, even if we do remove 
the "if" block mentioned above.

Also note: when I removed the "if" block as mentioned above, I found that 
sometimes the cost estimate for the "best" join order ("2 0 1 3 4 5 6 7") was 
"14116" and sometimes it was "247".  This was on the *same* database, so it 
does not appear to be the result of DDL statements.  I'm still looking into 
this little tidbit to see what might be going on there...

As usual, some of this may be useful, some of it may not be. Please excuse me 
if I'm babbling...

> Optimizer performance slowdown from 10.1 to 10.2
> ------------------------------------------------
>
>                 Key: DERBY-2130
>                 URL: http://issues.apache.org/jira/browse/DERBY-2130
>             Project: Derby
>          Issue Type: Bug
>          Components: Performance, SQL
>    Affects Versions: 10.2.1.6, 10.1.3.1, 10.3.0.0
>            Reporter: Bryan Pendleton
>         Attachments: plan10_1_2_1.txt, plan10_2.txt, plans.diff, repro.sql
>
>
> Attached is 'repro.sql', an IJ script which demonstrates what I
> believe to be a serious performance issue in the Optimizer.
> I have run this script in a number of configurations:
>  - 10.1.2.1: the script runs successfully. The 'prepare' statement
>    takes about 90 seconds, on a fairly powerful Windows machine
>  - 10.1.3.1: the script produces a NPE. I believe this is DERBY-1777
>  - 10.2.1.8/trunk: the script runs successfully. The 'prepare' statement
>    often takes about 220 seconds, on the same Windows machine
>    Intermittently, on 10.2 and on the trunk, the prepare statement takes
>    15+ minutes. I cannot reliably reproduce this; I run the same script
>    several times in a row and I cannot predict whether it will take 220
>    seconds or whether it will take 15+ minutes.
> I am quite motivated to work on this problem, as this is blocking me from
> using Derby for a project that I'm quite keen on, but I need some
> suggestions and ideas about how to attack it. From my perspective
> there are 3 primary topics:
> 1) Why did optimizer performance for this query degrade so significantly
> from 10.1.2.1 to 10.2? The optimizer seems to be at least 2.5 times slower,
> for this particular query at least, in 10.2. Sometimes it is 10x slower.
> 2) What is the source of the non-determinism? Why does the optimizer
> often take 4 minutes to optimize this query on the trunk, but sometimes
> take 15+ minutes? I don't believe that I'm changing anything from
> run to run.
> 3) Can we improve the optimizer performance even beyond what it was
> for 10.1.2? I realize that this is an ugly query, but I was hoping to
> see an optimization time of 5-10 seconds, not 90 seconds (and certainly
> not 220 seconds).
> I have attempted to start answering some of these questions, with
> limited success. Here is some of what I think I've discovered so far:
>  - the optimizer changes in 10.2 seem to have given the optimizer many
>    more choices of possible query plans to consider. I think this means
>    that, if the optimizer does not time out, it will spend substantially
>    more time optimizing because there are more choices to evaluate. Does
>    this by itself mean that the optimizer will take 2.5 times longer in
>    10.2 than it did in 10.1?
>  - something about this query seems to make the costing mechanism go
>    haywire, and produce extreme costs. While stepping through the
>    optimization of this query in the debugger I have seen it compute
>    costs like 1e63 and 1e200. This might be very closely related to
>    DERBY-1905, although I don't think I'm doing any subqueries here.
>    But maybe I'm misunderstanding the term "subquery" in DERBY-1905.
>    At any rate, due to the enormous estimated costs, timeout does not
>    occur.
>  - the WHERE clause in this query is converted during compilation to 
>    an equivalent IN clause, I believe, which then causes me to run into
>    a number of the problems described in DERBY-47 and DERBY-713.
>    Specifically, rather than constructing a plan which involves 4
>    index probes for the 4 WHERE clause values, the optimizer decides
>    that an index scan must be performed and that it will have to process
>    the entire index (because the query uses parameter markers, not
>    literal values). So perhaps solving DERBY-47 would help me
>  - the optimizer in fact comes up with a "decent" query plan quite quickly.
>    I have experimented with placing a hard limit into the optimizer
>    timeout code, so that I can force optimization to stop after an
>    arbitrary fixed period of time. Then I have been able to set that
>    value to as low as 1 second, and the optimizer has produced plans
>    that then execute in a few milliseconds. Of course, I have only tried
>    this with a trivial amount of data in my database, so it's possible
>    that the plan produced by the optimizer after just a second of
>    optimizing is in fact poor, and I'm just not noticing it because my
>    data sizes are so small.
> At this point, what would be really helpful to me would be some suggestions
> about some general approaches or techniques to try to start breaking down
> and analyzing this problem.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
