[ 
https://issues.apache.org/jira/browse/DERBY-3926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699721#action_12699721
 ] 

Bryan Pendleton commented on DERBY-3926:
----------------------------------------

Thanks Army for taking the time to look at this, and for pointing us to a good 
theory!

I like Army's suggestion. I wish I had a clearer grasp on the concept of a 
'partial join order'. I think that
this is an intermediate stage during optimization, where the optimizer has so 
far chosen an ordering
for some, but not yet all, of the tables. It looks like the code is trying to 
handle the problem of how
to make a sort avoidance check at this intermediate point in optimization.

I think that all *chosen* query plans eventually reach a *complete* join order, 
but a query plan which
is *discarded* due to being too expensive may never get beyond a *partial* join 
order. Is that true?

It seems to me that there may be two topics here:

1) it would be valid to make such a check for the purposes of doing some 
preliminary costing of 
the join order up to this point, but for such an algorithm, the code would have 
to then re-visit the 
sort avoidance decision later in the optimization, once the partial join order 
had become a full join order.
That is, I wonder if the overall flow is something like:
 - optimizer investigates a partial join order, decides to cost it, determines 
that (so far) a sort is not required.
 - optimizer later completes this join order, decides that it is acceptable, 
but does NOT re-analyze
   whether a sort is now required for the complete join order
 - optimizer then chooses the correct full join order, but incorrectly avoids 
the sort due to the decision
   it made when considering the partial join order.

2) The comment ("ORDER BY S.A, T.B, S.C") raises the interesting question of 
the situation in
which each column, considered individually, is ordered properly, but because 
the ORDER BY
clause interleaves columns from different tables, a sort is still required. 
That is, if S had an index
on (A, C), and T had an index on B, we might look at ORDER BY S.A, T.B, S.C and 
think that no
sorting of the results was required, because the join from S -> T would emit 
the rows in the
correct order, but that is wrong; the interleaving of the columns means that 
the sort must still be performed.
And, presumably, there is the interesting situation where we need to cost out a 
query at a point
where we have determined a partial join order that contains a position for S, 
but not T, or vice versa.

It would be interesting to know more about the Wisconsin test cases, about the 
queries involved, and
about the before- and after- query plan differences, with respect to sorting. 
Are *all* the changes due
to situations where we formerly avoided a sort, but now we choose one? That's 
interesting, I think; we
want to be careful to avoid introducing un-necessary sorts because that could 
be a substantial
performance regression (of course, if the old query was returning the rows in 
the wrong order, but
the test didn't check, and the new query is now being performed correctly, 
that's important to know, too!)

It would also be interesting to see if we can construct an example along the 
lines of the
ORDER BY S.A, T.B, S.C case, such that various other permutations (ORDER BY 
S.A, S.C, T.B or
ORDER BY T.B, S.A, S.C) did not require sorting, but ORDER BY S.A, T.B, S.C 
did, and see how
the query plans emitted for these various cases behaved.


> Incorrect ORDER BY caused by index
> ----------------------------------
>
>                 Key: DERBY-3926
>                 URL: https://issues.apache.org/jira/browse/DERBY-3926
>             Project: Derby
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 10.1.3.3, 10.2.3.0, 10.3.3.1, 10.4.2.0
>            Reporter: Tars Joris
>         Attachments: derby-reproduce.zip
>
>
> I think I found a bug in Derby that is triggered by an index on a large 
> column: VARCHAR(1024). I know it  is generally not a good idea to have an 
> index on such a large column.
> I have a table (table2) with a column "value", my query orders on this column 
> but the result is not sorted. It is sorted if I remove the index on that 
> column.
> The output of the attached script is as follows (results should be ordered on 
> the middle column):
> ID                  |VALUE        |VALUE
> ----------------------------------------------
> 2147483653          |000002       |21857
> 2147483654          |000003       |21857
> 4294967297          |000001       |21857
> While I would expect:
> ID                  |VALUE        |VALUE
> ----------------------------------------------
> 4294967297          |000001       |21857
> 2147483653          |000002       |21857
> 2147483654          |000003       |21857
> This is the definition:
> CREATE TABLE table1 (id BIGINT NOT NULL, PRIMARY KEY(id));
> CREATE INDEX key1 ON table1(id);
> CREATE TABLE table2 (id BIGINT NOT NULL, name VARCHAR(40) NOT NULL, value 
> VARCHAR(1024), PRIMARY KEY(id, name));
> CREATE UNIQUE INDEX key2 ON table2(id, name);
> CREATE INDEX key3 ON table2(value);
> This is the query:
> SELECT table1.id, m0.value, m1.value
> FROM table1, table2 m0, table2 m1
> WHERE table1.id=m0.id
> AND m0.name='PageSequenceId'
> AND table1.id=m1.id
> AND m1.name='PostComponentId'
> AND m1.value='21857'
> ORDER BY m0.value;
> The bug can be reproduced by just executing the attached script with the 
> ij-tool.
> Note that the result of the query becomes correct when enough data is 
> changed. This prevented me from creating a smaller example.
> See the attached file "derby-reproduce.zip" for sysinfo, derby.log and 
> script.sql.
> Michael Segel pointed out:
> "It looks like its hitting the index ordering on id,name from table 2 and is 
> ignoring the order by clause."

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to