[ 
https://issues.apache.org/jira/browse/DERBY-3926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mamta A. Satoor updated DERBY-3926:
-----------------------------------

    Attachment: DERBY3926_patch6_060309_stat.txt
                DERBY3926_patch6_060309_diff.txt

I am attaching DERBY3926_patch5_060309_diff.txt Hopefully, this is the final 
patch for this jira entry. This patch takes care of the original problem query 
and one query from wisconsin which was getting an unnecessary sort node on it 
with the previous patch (DERBY3926_patch5_052709_stat.txt). 

Following are the files that were touched by the patch.
M      java\engine\org\apache\derby\impl\sql\compile\RowOrderingImpl.java
M      java\engine\org\apache\derby\impl\sql\compile\OrderByList.java
M      java\engine\org\apache\derby\impl\sql\compile\FromBaseTable.java
M      java\engine\org\apache\derby\impl\sql\compile\OptimizerImpl.java
M      java\engine\org\apache\derby\impl\sql\compile\PredicateList.java
M      java\engine\org\apache\derby\iapi\sql\compile\RowOrdering.java
M      java\engine\org\apache\derby\iapi\sql\compile\RequiredRowOrdering.java
M      
java\engine\org\apache\derby\iapi\sql\compile\OptimizablePredicateList.java
M      
java\testing\org\apache\derbyTesting\functionTests\tests\lang\wisc_setup.sql
M      java\testing\org\apache\derbyTesting\functionTests\tests\lang\_Suite.java
M      
java\testing\org\apache\derbyTesting\functionTests\tests\lang\OrderByAndSortAvoidance.java
M      java\testing\org\apache\derbyTesting\functionTests\master\wisconsin.out


Following is the patch description.
The problem with the trunk codeline is that when optimizer goes through 
optimizables in a join order, it only looks at those optimizables individually 
to decide whether sorting can be avoided on them or not. That approach leaves 
out few queries which require sorting but do not get sorted. The decision for 
avoiding sorting should also include relationship between the optimizables in a 
given join order. Following query demonstrates the trunk problem
SELECT table1.id, table2.value, table3.value FROM --DERBY-PROPERTIES 
joinOrder=FIXED 
table3 -- DERBY-PROPERTIES index=nonUniqueOnValue_Table3 
, table2 -- DERBY-PROPERTIES index=nonUniqueOnValue_Table2 
, table1 
WHERE table1.id=table2.id AND table2.name='PageSequenceId' 
AND table1.id=table3.id 
AND table3.name='PostComponentId' 
AND table3.value='21857' ORDER BY table2.value; 

In the query above, when optimizer is considering [table3, table2, -1] join 
order, it determines that sorting can be avoided on this join order because the 
order by column table2.value is already covered by the index 
nonUniqueOnValue_Table2. It does not see that the outermost optimizable table3 
will qualify more than one row and hence it will be a multi-row resulset and 
for each one of those rows, we will be doing a scan into table2. In other 
words, there will be multiple scans into table2(and the rows returned by each 
one of those scans will be ordered on table2.value) but the collective rows 
from those multiple scans are not necessarily going to be ordered on 
table2.value. This patch is attempting to fix that problem.

Currently, in trunk, a column is marked always ordered during a query 
processing when the optimizer finds that there is constant comparison predicate 
on the order by column. If the column does not have a constant predicate (as in 
our example above), we next see if we are using an index which will provide the 
required ordering on column (which is true in our case. The required ordering 
on table2.value is provided by the index nonUniqueOnValue_Table2). But as we 
can see in the query above, this index coverage is not enough to say that 
sorting is not needed. We need to add 2 more conditions before we can decide to 
avoid the sorting. One of those cases is 1)if the order by column does not 
belong to the outermost optimizable, then check if the order by column's 
optimizable is a one-row resultset. If yes, then it will be safe for the 
optimizer to avoid the sorting. The second case to consider is 2)if the order 
by column does not belong to the outermost optimizable, then check if the order 
by column's optimizable is multi-row resultset BUT all the outer optimizables 
are one-row resulsets. If either of these 2 additional conditions are satisfied 
then optimizer can choose to avoid the sorting. Otherwise
sorting should be added to the query plan. The example query above does not 
satisfy the 2 additional checks and hence sorting should be done as part of the 
query plan.

The changes for the 1)check above has gone into 
OrderbyByList.sortRequired(RowOrdering, JBitSet, OptimizableList). The 
implementation of this change just required us to check the outer optimizables 
to be one row since the order by column's optimizable is not one row. If outer 
optimizables are all one-row, then we say that sorting can be avoided. 
Otherwise sorting is required.

The changes for the 2)check above has gone into 
FromBaseTable.nextAccessPath(Optimizer optimizer, OptimizablePredicateList 
predList, RowOrdering rowOrdering) The implementation of this change requires 
us to see if the order by column is involved in equijoin with outer 
optimizable's indexed column. If yes, then we know that since outer optimizable 
is ordered, the rows qualified via the equijoin will also be ordered and hence 
sorting can be avoided. But if this is not true, then we can't rely on outer 
optimizables' rows to be ordered on the order by column. To avoid sorting, we 
need to identify this case 2) as another case when the column can be marked as 
always ordered and that is when there is an equijoin predicate on the order by 
column with some other column
which is already known to be always ordered. Taking the query from wisconsin as 
an example will explain this behavior
select * from --DERBY-PROPERTIES joinOrder=FIXED 
TENKTUP1 -- DERBY-PROPERTIES index=TK1UNIQUE1 
, TENKTUP2 -- DERBY-PROPERTIES index=TK2UNIQUE1 
where TENKTUP1.unique1 = TENKTUP2.unique1 
order by TENKTUP1.unique1, TENKTUP2.unique1; 

For the above query, as per the current trunk codeline, none of the order by 
columns are marked as always ordered because there is no constant comparison 
predicate on them. But, for the given join order, with TENKTUP1 as the 
outermost resultset and with the index TK1UNIQUE1, we know that the current row 
ordering at this point is going to ensure that rows from TENKTUP1 are ordered 
on UNIQUE1. Next, when we process TEKTUP2 in the 2nd join order position, we 
find that there is no constant predicate on TENKTUP2.unique1 and hence we 
conculde that the rows from TENKTUP2 are not going to be ordered and we decide 
to force a sort node on the top of the query. But in reality, even though the 
outer optimizable is not a single row resultset, it is ordered on 
TENKTUP1.unique1 and hence all those rows from outer optimizable are going to 
be ordered on TENKTUP1.unique1 and the inner optimizable has an equality join 
on 

TENKTUP1.unique1 using the order by column TENKTUP2.unique1 What that 
translates to is that even if there will be multiple scans into TENKTUP2, the 
rows qualified are going to be all ordered because of the equijoin between the 
outer and inner optimizables on the order by columns. So, with my latest patch, 
I have expanded the notion of always ordered columns to include both constant 
comparison predicates AND ordered column that has equijoin with an outer 
optimizable's ordered column. 

I think this patch is also improving the existing queries to include a better 
path than what it was picking up before. Following is an example of one such 
query from wisconsin.
select * from TENKTUP1, TENKTUP2 
where TENKTUP1.unique1 = TENKTUP2.unique1 
and TENKTUP2.unique1 < 100
order by TENKTUP1.unique1; 
For this query, the trunk currently decides to use TENKTUP1 as the outermost 
optimizable using the TK1UNIQUE1 index and then those rows are filtered using 
TENKTUP2.unique1 < 100. Each of the 2 tables involved in the query have 10000 
rows each. So we are going through 10000 qualified indexed rows from TENKTUP1 
and then applying TENKTUP2.unique1 < 100 on them. With the attached patch, we 
use TENKTUP2 as the outermost optimizable with the index TK2UNIQUE1 and only 
gets the indexed rows which satisfy TENKTUP2.unique1 < 100 and then on them, we 
use the equlity join to fetch qualified rows from TENKTUP1. 

I hope the above explanation helps understand the patch. I would appreciate if 
someone can take the time to go through the patch and provide any feedback they 
may have. If I don't hear anything by early next week, I will go ahead and 
commit the patch.


> Incorrect ORDER BY caused by index
> ----------------------------------
>
>                 Key: DERBY-3926
>                 URL: https://issues.apache.org/jira/browse/DERBY-3926
>             Project: Derby
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 10.1.3.3, 10.2.3.0, 10.3.3.1, 10.4.2.0
>            Reporter: Tars Joris
>            Assignee: Mamta A. Satoor
>         Attachments: d3926_repro.sql, derby-reproduce.zip, 
> DERBY3926_notforcheckin_patch1_051109_diff.txt, 
> DERBY3926_notforcheckin_patch1_051109_stat.txt, 
> DERBY3926_notforcheckin_patch2_051109_diff.txt, 
> DERBY3926_patch3_051509_diff.txt, DERBY3926_patch3_051509_stat.txt, 
> DERBY3926_patch4_051519_diff.txt, DERBY3926_patch4_051519_stat.txt, 
> DERBY3926_patch5_052709_diff.txt, DERBY3926_patch5_052709_stat.txt, 
> DERBY3926_patch6_060309_diff.txt, DERBY3926_patch6_060309_stat.txt, 
> script3.sql, script3WithUserFriendlyIndexNames.sql, test-script.zip
>
>
> I think I found a bug in Derby that is triggered by an index on a large 
> column: VARCHAR(1024). I know it  is generally not a good idea to have an 
> index on such a large column.
> I have a table (table2) with a column "value", my query orders on this column 
> but the result is not sorted. It is sorted if I remove the index on that 
> column.
> The output of the attached script is as follows (results should be ordered on 
> the middle column):
> ID                  |VALUE        |VALUE
> ----------------------------------------------
> 2147483653          |000002       |21857
> 2147483654          |000003       |21857
> 4294967297          |000001       |21857
> While I would expect:
> ID                  |VALUE        |VALUE
> ----------------------------------------------
> 4294967297          |000001       |21857
> 2147483653          |000002       |21857
> 2147483654          |000003       |21857
> This is the definition:
> CREATE TABLE table1 (id BIGINT NOT NULL, PRIMARY KEY(id));
> CREATE INDEX key1 ON table1(id);
> CREATE TABLE table2 (id BIGINT NOT NULL, name VARCHAR(40) NOT NULL, value 
> VARCHAR(1024), PRIMARY KEY(id, name));
> CREATE UNIQUE INDEX key2 ON table2(id, name);
> CREATE INDEX key3 ON table2(value);
> This is the query:
> SELECT table1.id, m0.value, m1.value
> FROM table1, table2 m0, table2 m1
> WHERE table1.id=m0.id
> AND m0.name='PageSequenceId'
> AND table1.id=m1.id
> AND m1.name='PostComponentId'
> AND m1.value='21857'
> ORDER BY m0.value;
> The bug can be reproduced by just executing the attached script with the 
> ij-tool.
> Note that the result of the query becomes correct when enough data is 
> changed. This prevented me from creating a smaller example.
> See the attached file "derby-reproduce.zip" for sysinfo, derby.log and 
> script.sql.
> Michael Segel pointed out:
> "It looks like its hitting the index ordering on id,name from table 2 and is 
> ignoring the order by clause."

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to