Hi all,

I'm playing around with a small TPC-H dataset (scale factor 0.1) in sqlite-3.7.3, and have noticed that several of the optimizations described at http://www.sqlite.org/optoverview.html don't seem to take effect, even after running ANALYZE.

In one case the optimizer makes a different decision depending on which order I write the join in; in the other, the join ordering chosen is bad, and the problem is compounded by an expensive subquery not being materialized into a transient table as it should be.

For the first issue, consider the following query:

select count(*) from orders O, Customer C where C.custkey=O.custkey and C.name like '%115';

With this ordering, .stats/explain report 149999 fullscan steps and the following query plan:
0|0|TABLE orders AS O
1|1|TABLE Customer AS C USING PRIMARY KEY
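
In case the setup matters, the fullscan counts here come from the sqlite3 shell with .stats enabled, and the plans from explain query plan, i.e. something like:

.stats on
select count(*) from orders O, Customer C where C.custkey=O.custkey and C.name like '%115';
explain query plan select count(*) from orders O, Customer C where C.custkey=O.custkey and C.name like '%115';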

Putting Customer first in the FROM clause makes the query markedly faster: it executes only 14999 fullscan steps, and the query plan confirms the change:
0|0|TABLE Customer AS C
1|1|TABLE orders AS O WITH INDEX OrderCustomers
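
For reference, the faster ordering is simply:

select count(*) from Customer C, orders O where C.custkey=O.custkey and C.name like '%115';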

The cardinalities of the tables are customer: 15k and orders: 150k, so I would expect any predicate on Customer to get the optimizer's attention. If the LIKE clause were just confusing the optimizer's cardinality estimates, I would expect it to always choose the same query plan, but it doesn't.
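
If the ANALYZE statistics themselves are suspect I'm happy to post them; as I understand it they can be read straight out of sqlite_stat1, e.g.:

-- one row per analyzed index: table name, index name, and the gathered counts
select tbl, idx, stat from sqlite_stat1;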

Note that the index referenced above corresponds to a foreign key in the schema:
CREATE INDEX OrderCustomers on Orders(custKey);
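
For context, the tables are the usual TPC-H ones; trimmed down to just the columns used here they look roughly like this (types abbreviated, remaining columns omitted):

CREATE TABLE Customer(
    custKey   INTEGER PRIMARY KEY,  -- the "USING PRIMARY KEY" lookup in the first plan
    name      TEXT
);
CREATE TABLE Orders(
    orderKey  INTEGER PRIMARY KEY,
    custKey   INTEGER,              -- foreign key into Customer, covered by OrderCustomers
    orderDate TEXT
);
CREATE TABLE Lineitem(
    orderKey    INTEGER,            -- foreign key into Orders
    receiptDate TEXT
);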


A second problem lies with non-flattened nested queries: instead of materializing the subquery's result in a transient table, sqlite re-runs the subquery for each row of the other relation, even though it cannot possibly be correlated:

select count(*)
from
    (select
        julianday(O.orderdate) ordered,
        julianday(L.receiptdate) received
    from orders O, lineitem L
    where
        L.orderkey=O.orderkey
        and ordered >= julianday('1994-01-01')
        and received < julianday('1994-04-01')
    ) X,
    (select distinct(julianday(orderdate)) d
    from orders
    where
        d >= julianday('1994-01-01')
        and d < julianday('1994-04-01')
    ) Y
where Y.d between X.ordered and X.received;

The first subquery has cardinality 5918 and examines 150k rows, while the second has cardinality 90 and also examines 150k rows; the overall query should therefore examine roughly 150k + 150k + 90*5918 = ~830k rows. Instead, it takes 45s and 13650087 fullscan steps to run, or roughly 90*150k + 150k + 90 (running X once per row in Y, plus evaluating Y once, plus iterating over Y).

Reordering the FROM clause doesn't help (X really should go first, but the optimizer insists on Y). Disabling join optimization (X cross join Y) cuts the cost to 1.6s and 827k fullscan steps.
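
Concretely, the fast version just replaces the comma between the two subqueries with an explicit cross join, so the FROM clause becomes:

    (select ... same X subquery as above ... ) X
    cross join
    (select ... same Y subquery as above ... ) Y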

Is there something obvious I'm missing here? The second case, in particular, doesn't seem to depend on cardinalities: the non-correlated subquery should be materialized rather than re-run repeatedly (according to the docs, at least), at which point the join ordering wouldn't matter nearly so much.
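
For what it's worth, materializing Y by hand into a temp table should sidestep the re-execution in the same way; something along these lines (temp table name just for illustration, and I haven't timed this variant as carefully as the cross join one):

create temp table y_dates as
    select distinct julianday(orderdate) d
    from orders
    where julianday(orderdate) >= julianday('1994-01-01')
      and julianday(orderdate) < julianday('1994-04-01');

select count(*)
from
    (select
        julianday(O.orderdate) ordered,
        julianday(L.receiptdate) received
    from orders O, lineitem L
    where
        L.orderkey=O.orderkey
        and ordered >= julianday('1994-01-01')
        and received < julianday('1994-04-01')
    ) X,
    y_dates Y
where Y.d between X.ordered and X.received;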

Thanks,
Ryan
