Hi all,
I'm playing around with a small TPC-H dataset (scale factor 100) in
sqlite-3.7.3, and have noticed that several of the optimizations
described at http://www.sqlite.org/optoverview.html don't seem to take
effect, even after running ANALYZE.
In one case the optimizer makes a different decision depending on which
order I write the join in; in the other, the join order chosen is bad,
and the problem is compounded by an expensive subquery not being
materialized into a transient table as it should be.
For the first issue, consider the following query:
select count(*)
from orders O, Customer C
where C.custkey = O.custkey
  and C.name like '%115';
With .stats enabled, the shell reports 149999 fullscan steps, and
EXPLAIN QUERY PLAN shows:
0|0|TABLE orders AS O
1|1|TABLE Customer AS C USING PRIMARY KEY
Putting Customer first in the FROM clause makes the query markedly
faster, executing only 14999 fullscan steps. The query plan confirms
the change:
0|0|TABLE Customer AS C
1|1|TABLE orders AS O WITH INDEX OrderCustomers
Cardinalities of the tables are customer: 15k and orders: 150k, so I
would expect any predicate on Customer to get the optimizer's attention.
If the LIKE clause were merely confusing the optimizer's cardinality
estimates, I would expect it to choose the same plan regardless of FROM
order, but it doesn't.
Note that the index referenced above corresponds to a foreign key in the
schema
CREATE INDEX OrderCustomers on Orders(custKey);
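In case it helps anyone reproduce the first issue without loading TPC-H,
here's a minimal sketch using tiny synthetic tables (the row contents
and sizes are made up, just mimicking the 1:10 customer-to-order ratio).
It checks that both FROM orderings return the same count and prints the
plan SQLite picks for each:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE Customer (custkey INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders   (orderkey INTEGER PRIMARY KEY, custkey INTEGER);
    CREATE INDEX OrderCustomers ON orders(custkey);
""")
# 150 customers, 1500 orders (10 per customer) -- synthetic stand-ins
cur.executemany("INSERT INTO Customer VALUES (?, ?)",
                [(i, "Customer#%09d" % i) for i in range(1, 151)])
cur.executemany("INSERT INTO orders VALUES (?, ?)",
                [(i, (i % 150) + 1) for i in range(1, 1501)])
cur.execute("ANALYZE")

q_orders_first = ("SELECT count(*) FROM orders O, Customer C "
                  "WHERE C.custkey = O.custkey AND C.name LIKE '%115'")
q_customer_first = ("SELECT count(*) FROM Customer C, orders O "
                    "WHERE C.custkey = O.custkey AND C.name LIKE '%115'")

n1 = cur.execute(q_orders_first).fetchone()[0]
n2 = cur.execute(q_customer_first).fetchone()[0]
assert n1 == n2  # same answer either way; only the plan differs

# Show which table ends up outermost in each ordering
for q in (q_orders_first, q_customer_first):
    print(cur.execute("EXPLAIN QUERY PLAN " + q).fetchall())
```

(The exact EXPLAIN QUERY PLAN text differs between 3.7.3's table-style
output and newer versions, but the choice of outer table should still be
visible either way.)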
A second problem lies with non-flattened subqueries: instead of
materializing the result in a transient table, SQLite re-runs the
subquery for each row of the other relation, even though the subquery
cannot possibly be correlated:
select count(*)
from
  (select julianday(O.orderdate) ordered,
          julianday(L.receiptdate) received
   from orders O, lineitem L
   where L.orderkey = O.orderkey
     and ordered >= julianday('1994-01-01')
     and received < julianday('1994-04-01')
  ) X,
  (select distinct julianday(orderdate) d
   from orders
   where d >= julianday('1994-01-01')
     and d < julianday('1994-04-01')
  ) Y
where Y.d between X.ordered and X.received;
The first subquery has cardinality 5918 and examines 150k rows; the
second has cardinality 90 and also examines 150k rows. The overall query
should therefore examine roughly 150k + 150k + 90*5918 = ~830k rows.
Instead, it takes 45s and 13650087 fullscan steps to run, or roughly
90*150k + 150k + 90 (re-running X once per row of Y, plus evaluating Y,
plus iterating over Y).
Reordering the FROM clause doesn't help (X really should go first, but
the optimizer insists on Y). Disabling the join-order optimization by
writing X cross join Y cuts the runtime to 1.6s and ~827k fullscan
steps.
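For what it's worth, manually materializing Y into a temp table also
seems to sidestep the repeated re-evaluation. Here's a sketch on toy
data (synthetic rows, not the TPC-H tables; `y_mat` is a name I made up)
verifying the rewrite gives the same answer as the nested form:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE orders   (orderkey INTEGER PRIMARY KEY, orderdate TEXT);
    CREATE TABLE lineitem (orderkey INTEGER, receiptdate TEXT);
    INSERT INTO orders VALUES (1, '1994-01-10'), (2, '1994-02-05'),
                              (3, '1994-03-20'), (4, '1994-05-01');
    INSERT INTO lineitem VALUES (1, '1994-02-01'), (2, '1994-03-01'),
                                (3, '1994-03-25'), (4, '1994-05-10');
""")

x_sql = """SELECT julianday(O.orderdate) ordered,
                  julianday(L.receiptdate) received
           FROM orders O, lineitem L
           WHERE L.orderkey = O.orderkey
             AND ordered >= julianday('1994-01-01')
             AND received < julianday('1994-04-01')"""
y_sql = """SELECT DISTINCT julianday(orderdate) d
           FROM orders
           WHERE d >= julianday('1994-01-01')
             AND d < julianday('1994-04-01')"""

# Original nested form: Y is re-run per row of the outer loop
nested = cur.execute(
    "SELECT count(*) FROM (%s) X, (%s) Y "
    "WHERE Y.d BETWEEN X.ordered AND X.received" % (x_sql, y_sql)
).fetchone()[0]

# Workaround: evaluate Y exactly once into a temp table, then join
cur.execute("CREATE TEMP TABLE y_mat AS " + y_sql)
materialized = cur.execute(
    "SELECT count(*) FROM (%s) X, y_mat Y "
    "WHERE Y.d BETWEEN X.ordered AND X.received" % x_sql
).fetchone()[0]

assert nested == materialized  # same result, Y evaluated only once
```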
Is there something obvious I'm missing here? The second case, in
particular, doesn't seem to depend on cardinalities: a non-correlated
subquery should be materialized rather than re-run repeatedly
(according to the docs, at least), at which point the join order
wouldn't matter nearly so much.
Thanks,
Ryan
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users