On 23/01/2012 7:24 PM, Ryan Johnson wrote:
On 23/01/2012 3:09 PM, Ryan Johnson wrote:
On 23/01/2012 12:51 PM, Richard Hipp wrote:
On Mon, Jan 23, 2012 at 12:48 PM, Simon Slavin <slav...@bigfraud.org> wrote:
I don't know if Dr Hipp is pursuing this privately or expecting it to be solved collaboratively on this list.
I don't have a test case to work on.
My database file is ~150MB, but was generated by the TPC-H data
generator program. Assuming a Linux-like environment (including
Cygwin), the following will reproduce the setup in under five minutes:
The problem persists with a freshly-generated database on my machine,
using a just-compiled sqlite-3.7.10.
OK, it looks like I didn't install the 3.7.10 binary properly, because
the situation is significantly different once ANALYZE has run.
The first query from the OP is remarkably similar to a query mentioned
in where.c (from the 3.7.5 sources, which I happened to have handy):
Previous versions of SQLite [did not always find the lowest-cost
plan] for scripts such as the following:
**
** CREATE TABLE t1(a, b);
** CREATE TABLE t2(c, d);
** SELECT * FROM t2, t1 WHERE t2.rowid = t1.a;
**
The best strategy is to iterate through table t1 first. However it is
not possible to determine this with a simple greedy algorithm. Since
the cost of a linear scan through table t2 is the same as the cost of
a linear scan through table t1, a simple greedy algorithm may choose
to use t2 for the outer loop, which is a much costlier approach.
Re-casting my first query as the above gives:
sqlite> explain query plan select * from customer t2, orders t1 where
t2.rowid = t1.custkey;
0|0|0|SCAN TABLE customer AS t2 (~15000 rows)
0|1|1|SEARCH TABLE orders AS t1 USING INDEX OrderCustomers (custKey=?) (~15 rows)
The same query plan is chosen if t1 comes first; execution times and
stats confirm the improvement.
For query 2, the situation is less clear. First of all, it now executes
roughly 50x faster overall (10s down to .2s -- WOW!), but the
optimizer still seems to choose the wrong plan (13.6M rows examined):
1|0|0|SCAN TABLE orders (~37500 rows)
1|0|0|USE TEMP B-TREE FOR DISTINCT
0|0|2|SCAN SUBQUERY 1 AS Y (~37500 rows)
0|1|0|SCAN TABLE orders AS O (~75000 rows)
0|2|1|SEARCH TABLE lineitem AS L USING INDEX LineItemOrders (orderKey=?) (~2 rows)
If I force the proper ordering, "X cross join Y", then the following,
better-than-expected plan is used instead (only 300k rows examined!):
1|0|0|SCAN TABLE orders (~37500 rows)
1|0|0|USE TEMP B-TREE FOR DISTINCT
0|0|0|SCAN TABLE orders AS O (~75000 rows)
0|1|1|SEARCH TABLE lineitem AS L USING INDEX LineItemOrders (orderKey=?) (~5 rows)
0|2|2|SCAN SUBQUERY 1 AS Y (~18750 rows)
The main difference seems to be that Y should be the innermost loop
but isn't.
Update: Running the different pieces alone shows:
- The scan of Orders for Y returns 5681 rows (not 37k)
- The temp B+Tree collapses that to 90 distinct rows
- The scan of Orders for X returns 104k rows (75k is the right ball-park)
- Each order averages 4 line items (5 was a pretty good estimate)
- Subquery 1 as Y has 90 rows (not 18k or 37k)
Given the above, the problem seems to lie with cardinality estimation,
particularly because the estimate of DISTINCT's effectiveness (~18750
rows predicted vs. 90 actual) was vastly too conservative. I assume the
cardinality estimates in the two plans differ for the reason mentioned
in where.c: the estimated cost of a scan may not look as low when it
sits in the outer loop, because certain (index-related?) cardinality
reductions only apply in the innermost loop.
While I certainly won't complain if the above weakness could be fixed,
at this point I'm satisfied that there's not an easy code tweak (or
glaring omission in the SQL) to make the problem go away...
Ryan
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users