[ 
https://issues.apache.org/jira/browse/DERBY-47?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A B updated DERBY-47:
---------------------

    Attachment: d47_engine_doNotCommit_v1.stat
                d47_engine_doNotCommit_v1.patch

I've spent some time looking at this issue and have come up with a solution 
that is based on two earlier comments for this issue. Namely:

Comment #1:

> I was thinking if there was a way to re-write the query using something
> along the lines of a VALUES clause.
>
>  SELECT * FROM tableA TABLE(VALUES constC, ?, ?, constD) AS X(C)
>    where X.C = tableA.columnB = X.C;
>
> no idea how that would perform, it also needs tweaking to remove duplicates
> from the list of values in the VALUES clause.

The solution I've come up with does in fact rewrite IN-list queries to create a 
join between the target table and "something along the lines of a VALUES 
clause".  That said, though, a VALUES expression in Derby is internally parsed 
into a chain of nested UNIONs between RowResultSetNodes.  I did a quick 
prototype to duplicate that behavior and found that for IN lists with thousands 
of values in them, a chain of UNIONs is not acceptable.  More specifically, I 
found that such a chain inevitably leads to stack overflow because of recursion 
in preprocessing and/or optimization.  And even if the list is small enough to 
avoid stack overflow, the time it takes to optimize a UNION chain with, say, a 
thousand values is far too long (I was seeing optimization times of over 20 
seconds with 1000 IN-list values).  And lastly, assuming that we were able to 
code around stack overflow and optimization time, the limit to the size of a 
UNION chain that Derby can successfully generate is far less than the current 
limit on IN-lists--see DERBY-1315 for the details--which means we would have to 
skip the rewrite when the list has more than xxx values.  That's doable, but 
not pretty.

So in my proposed solution we do not actually create a normal VALUEs 
expression.  Instead we create a new type of node called an 
"InListResultSetNode" that is specifically designed to handle potentially large 
lists of constant and/or parameter values.  Then we create a new equality 
predicate to join the InListRSN with the appropriate table, and since 
InListResultSetNode is an instance of Optimizable, the optimizer can then use 
the predicate to optimize the join.

Note that I also made changes to the optimizer so that it recognizes 
InListResultSetNodes as "fixed outer tables", meaning that they are prepended 
to the join order and remain fixed at their prepended positions throughout the 
optimization process.  This ensures that the optimizer does not have to deal 
with additional join orders as a result of the IN-list rewrite (it already 
looks at too many; see DERBY-1907).

Comment #2:

> Based on past experience it would be good to avoid generating code per
> literal value, as that's an easy way to hit limits in the compiled byte
> code. I think you should be able to build up a collection of literal values
> (DataValueDescriptors) at compile time and store it as a saved object. The
> runtime ResultSet can then obtain the same collection from the saved objects.
> See CompilerContext.addSavedObject

In the solution I've written we do in fact create a "collection of literal 
values (DataValueDescriptors) at compile time and store it".  We do not, 
however, use saved objects to store/retrieve them.  Instead, we use existing 
code in InListOperatorNode to create an execution time *array* of 
DataValueDescriptors and we pass that array to a new execution time result set, 
InListResultSet, that returns the list of values as a single-column result set. 
 This approach works for both constants and parameters.

By re-using the existing InListOperator code generation we ensure that the max 
size of an IN-list after these changes will be the same as what it was before 
these changes.

----

All of that said, the internal addition of a new InListResultSetNode to a 
FromList is not without its side effects.  The following are the three areas in 
which I've noticed an unintended change in behavior caused by adding an 
InListRSN to a FromList.  I have a "progress-not-perfection" workaround for the 
first issue; I still need to investigate the latter two (any input from others 
would of course be much appreciated).

  1) Derby has internal restrictions on the types of queries that it can 
flatten.  Most notably (and as James Synge also noted) a subquery cannot be 
flattened if its FromList contains more than a single FromTable.  When 
rewriting the IN-list operator as described above, we add the new inListRSN to 
the appropriate FromList.  That means that the FromList will now have more than 
one FromTable and thus the subquery is no longer flattenable.  So for the 
following query:

  select * from t2, (select * from t1 where i in (1, 2, 3)) x

the inner select is currently flattened during preprocessing.  With the changes 
as I've described them, though, the IN list for the subquery would become an 
InListResultSetNode that is added to the inner FromList, thus rendering the 
inner query non-flattenable.  In the interest of making progress (instead of 
perfection) on this issue, I simply added logic to skip rewriting an IN-list if 
appears in a subquery.  In that case we will default to normal IN-list 
processing as it is today.

  2) I just discovered this morning that the addition of an InListResultSet to 
the FromList causes all of the SUR (scrollable updatable result set) tests that 
use an IN-clause to fail--apparently the presence of the InListResultSet 
results in a downgrade of the cursor scrollability to "CONCUR_READ_ONLY".  I do 
not yet know why this is the case, nor do I know how to prevent it.

  3) The store/readlocks.sql test fails with the proposed changes because of 
missing ROW locks.  I do not know if these missing locks are a problem or just 
a side effect that can be "fixed" with a master update.  More investigation 
required...

There were of course other tests that failed with row orderings and/or 
different plans, but so far as I can tell all of those are expected and 
correct--so I will just update the master files as needed.

For now I have just attached an initial version of the engine changes, 
d47_engine_doNotCommit_v1.patch, for general review/comments if anyone has any. 
 I plan to look into the above issues more and will probably go to derby-dev 
with questions where needed.  In the meantime, any feedback on the general 
approach as outlined above would be appreciated.

Oh, and by the way: I ran Derby47PerformanceTest.java (as attached to this 
issue) with 100,000 rows after applying this patch.  Whereas the "Markers" 
strategy was by far the worse query before my changes, it ends up being the 
best strategy after my changes, beating out the "Marker" and "JoinTemp" 
strategies (which were previously the best) by roughly 30 and 25 percent, 
respectively.

> Some possible improvements to IN optimization
> ---------------------------------------------
>
>                 Key: DERBY-47
>                 URL: https://issues.apache.org/jira/browse/DERBY-47
>             Project: Derby
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 10.0.2.0
>         Environment: all
>            Reporter: Sunitha Kambhampati
>         Assigned To: A B
>         Attachments: d47_engine_doNotCommit_v1.patch, 
> d47_engine_doNotCommit_v1.stat, derby-47-performance-data.txt, 
> derby-47-performance-data.txt, Derby47PerformanceTest.java, 
> Derby47PerformanceTest.java, InListOperatorNode.java, 
> QueryPlanUniqueIndexAndWordIndexOneTerm.txt, 
> QueryPlanUniqueIndexAndWordIndexTwoTerms.txt, 
> QueryPlanUniqueIndexOnlyOneTerm.txt, QueryPlanUniqueIndexOnlyTwoTerms.txt
>
>
> Consider a simple case of  - 
> A table tbl has 10000 rows, there is a primary key index on i1
> and the query in question is 
>  select * from tbl where i1 in (-1,100000)
> derby does a table scan of the entire table even though the "IN" list has 
> only two values and the comparison is on a field that has an index.
> Briefly looking at the code, it seems like we insert a between and use the IN 
> list to get the start and stop values for the scan. Thus the range of the 
> values in the "IN" list here plays an important role. 
> Thus if the query was changed to select * from tbl where i1 in (-1, 1), an 
> index scan would be chosen.
> It would be nice if we could do something clever in this case where there is 
> clearly an index on the field and the number of values in the IN list is 
> known. Maybe use the rowcount estimate and the IN list size to do some 
> optimizations.  
> - consider the length of the "IN" list to do searches on the table.  ie use 
> the IN list values to do index key searches on the table,
> -or try to convert it to a join. Use the "IN" list values to create a 
> temporary table and do a join. It is most likely that the optimizer will 
> choose the table with "IN" list here as the outer table in the join and thus 
> will do key searches on the larger table. 
> -------------------------------------------------------------------
> some query plans that I logged using derby.language.logQueryPlan=true for 
> some similar queries:
> Table has ascending values from 0 - 9999 for i1. primary key index on i1.
> GMT Thread[UT0,5,main] (XID = 19941), (SESSIONID = 0), select * from 
> scanfixed where i1 in (-1,9999,9998,9997,9996,9995,9994,9993,9992,9991,9990) 
> ******* Project-Restrict ResultSet (2):
> Number of opens = 1
> Rows seen = 10000
> Rows filtered = 9990
> restriction = true
> projection = false
>       constructor time (milliseconds) = 0
>       open time (milliseconds) = 0
>       next time (milliseconds) = 0
>       close time (milliseconds) = 0
>       restriction time (milliseconds) = 0
>       projection time (milliseconds) = 0
>       optimizer estimated row count:          750.38
>       optimizer estimated cost:         8579.46
> Source result set:
>       Table Scan ResultSet for SCANFIXED at read committed isolation level 
> using instantaneous share row locking chosen by the optimizer
>       Number of opens = 1
>       Rows seen = 10000
>       Rows filtered = 0
>       Fetch Size = 16
>               constructor time (milliseconds) = 0
>               open time (milliseconds) = 0
>               next time (milliseconds) = 0
>               close time (milliseconds) = 0
>               next time in milliseconds/row = 0
>       scan information: 
>               Bit set of columns fetched=All
>               Number of columns fetched=9
>               Number of pages visited=417
>               Number of rows qualified=10000
>               Number of rows visited=10000
>               Scan type=heap
>               start position: 
> null          stop position: 
> null          qualifiers:
> Column[0][0] Id: 0
> Operator: <=
> Ordered nulls: false
> Unknown return value: false
> Negate comparison result: false
> Column[0][1] Id: 0
> Operator: <
> Ordered nulls: false
> Unknown return value: true
> Negate comparison result: true
>               optimizer estimated row count:          750.38
>               optimizer estimated cost:         8579.46
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> l
> 2004-10-14 18:59:47.577 GMT Thread[UT0,5,main] (XID = 19216), (SESSIONID = 
> 0), select * from scanfixed where i1 in 
> (9999,9998,9997,9996,9995,9994,9993,9992,9991,9990) ******* Project-Restrict 
> ResultSet (3):
> Number of opens = 1
> Rows seen = 10
> Rows filtered = 0
> restriction = true
> projection = true
>       constructor time (milliseconds) = 0
>       open time (milliseconds) = 0
>       next time (milliseconds) = 0
>       close time (milliseconds) = 0
>       restriction time (milliseconds) = 0
>       projection time (milliseconds) = 0
>       optimizer estimated row count:            4.80
>       optimizer estimated cost:           39.53
> Source result set:
>       Index Row to Base Row ResultSet for SCANFIXED:
>       Number of opens = 1
>       Rows seen = 10
>       Columns accessed from heap = {0, 1, 2, 3, 4, 5, 6, 7, 8}
>               constructor time (milliseconds) = 0
>               open time (milliseconds) = 0
>               next time (milliseconds) = 0
>               close time (milliseconds) = 0
>               optimizer estimated row count:            4.80
>               optimizer estimated cost:           39.53
>               Index Scan ResultSet for SCANFIXED using index SCANFIXEDX at 
> read committed isolation level using instantaneous share row locking chosen 
> by the optimizer
>               Number of opens = 1
>               Rows seen = 10
>               Rows filtered = 0
>               Fetch Size = 16
>                       constructor time (milliseconds) = 0
>                       open time (milliseconds) = 0
>                       next time (milliseconds) = 0
>                       close time (milliseconds) = 0
>                       next time in milliseconds/row = 0
>               scan information: 
>                       Bit set of columns fetched=All
>                       Number of columns fetched=2
>                       Number of deleted rows visited=0
>                       Number of pages visited=2
>                       Number of rows qualified=10
>                       Number of rows visited=10
>                       Scan type=btree
>                       Tree height=2
>                       start position: 
>       >= on first 1 column(s).
>       Ordered null semantics on the following columns: 
>                       stop position: 
>       > on first 1 column(s).
>       Ordered null semantics on the following columns: 
>                       qualifiers:
> None
>                       optimizer estimated row count:            4.80
>                       optimizer estimated cost:           39.53

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to