On 23/01/2012 3:09 PM, Ryan Johnson wrote:
On 23/01/2012 12:51 PM, Richard Hipp wrote:
On Mon, Jan 23, 2012 at 12:48 PM, Simon Slavin<slav...@bigfraud.org>
wrote:
I don't know if Dr Hipp is pursuing this privately or expecting it
to be
solved collaboratively on this list.
I don't have a test case to work on.
My database file is ~150MB, but was generated by the TPC-H data
generator program. Assuming a linux-like environment (including
cygwin), the following will reproduce the setup in under five minutes:
The problem persists with a freshly-generated database on my machine,
using a just-compiled sqlite-3.7.10.
OK, it looks like I didn't install the 3.7.10 binary properly, because
the situation is significantly different once ANALYZE has run.
The first query from the OP is remarkably similar to a query mentioned
in where.c (from the 3.7.5 sources, which I happened to have handy):
Previous versions of SQLite [did not always find the lowest-cost plan]
for scripts such as the following:
**
** CREATE TABLE t1(a, b);
** CREATE TABLE t2(c, d);
** SELECT * FROM t2, t1 WHERE t2.rowid = t1.a;
**
The best strategy is to iterate through table t1 first. However it is
not possible to determine this with a simple greedy algorithm. Since
the cost of a linear scan through table t2 is the same as the cost of
a linear scan through table t1, a simple greedy algorithm may choose
to use t2 for the outer loop, which is a much costlier approach.
Re-casting my first query as the above gives:
sqlite> explain query plan select * from customer t2, orders t1 where
t2.rowid = t1.custkey;
0|0|0|SCAN TABLE customer AS t2 (~15000 rows)
0|1|1|SEARCH TABLE orders AS t1 USING INDEX OrderCustomers (custKey=?)
(~15 rows)
The same query plan is chosen if t1 comes first; execution times and
stats confirm the improvement.
For query 2, the situation is less clear. First of all, it appears to
execute more than 4x faster overall (10s and .2s -- WOW!) but the
optimizer still seems to choose the wrong plan (13.6M rows examined):
1|0|0|SCAN TABLE orders (~37500 rows)
1|0|0|USE TEMP B-TREE FOR DISTINCT
0|0|2|SCAN SUBQUERY 1 AS Y (~37500 rows)
0|1|0|SCAN TABLE orders AS O (~75000 rows)
0|2|1|SEARCH TABLE lineitem AS L USING INDEX LineItemOrders
(orderKey=?) (~2 rows)
If I force the proper ordering, "X cross join Y", then the following,
better-than-expected plan is used instead (only 300k rows examined!):
1|0|0|SCAN TABLE orders (~37500 rows)
1|0|0|USE TEMP B-TREE FOR DISTINCT
0|0|0|SCAN TABLE orders AS O (~75000 rows)
0|1|1|SEARCH TABLE lineitem AS L USING INDEX LineItemOrders
(orderKey=?) (~5 rows)
0|2|2|SCAN SUBQUERY 1 AS Y (~18750 rows)
The main difference seems to be that Y should be the innermost loop but
isn't.
Thoughts?
Ryan
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users