Hi all,
I'm playing around with a small TPC-H dataset (scale factor 100) in
sqlite-3.7.3, and have noticed that several of the optimizations
described at http://www.sqlite.org/optoverview.html don't seem to take
effect, even after running ANALYZE.
In one case the optimizer makes a different decision depending on which
order I write the join in; in the other, the join order chosen is bad,
and the problem is compounded by an expensive subquery not being
materialized into a transient table as it should be.
For the first issue, consider the following query:
select count(*)
from orders O, Customer C
where C.custkey = O.custkey
  and C.name like '%115';
With .stats enabled, the shell reports 149999 fullscan steps, and
EXPLAIN QUERY PLAN shows:
0|0|TABLE orders AS O
1|1|TABLE Customer AS C USING PRIMARY KEY
Putting Customer first in the FROM clause makes the query markedly
faster, executing only 14999 fullscan steps. The query plan confirms
the change:
0|0|TABLE Customer AS C
1|1|TABLE orders AS O WITH INDEX OrderCustomers
Cardinalities of the tables are customer: 15k and orders: 150k, so I
would expect any predicate on Customer to get the optimizer's attention.
If the LIKE clause were merely confusing the optimizer's cardinality
estimates, I would expect it to choose the same plan regardless of FROM
order, but it doesn't.
Note that the index referenced above corresponds to a foreign key in the
schema
CREATE INDEX OrderCustomers on Orders(custKey);
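In case it helps anyone reproduce the first issue without loading TPC-H,
here's a minimal sketch using tiny synthetic tables (the row contents
and sizes are made up, just mimicking the 1:10 customer-to-order ratio).
It checks that both FROM orderings return the same count and prints the
plan SQLite picks for each:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE Customer (custkey INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders   (orderkey INTEGER PRIMARY KEY, custkey INTEGER);
    CREATE INDEX OrderCustomers ON orders(custkey);
""")
# 150 customers, 1500 orders (10 per customer) -- synthetic stand-ins
cur.executemany("INSERT INTO Customer VALUES (?, ?)",
                [(i, "Customer#%09d" % i) for i in range(1, 151)])
cur.executemany("INSERT INTO orders VALUES (?, ?)",
                [(i, (i % 150) + 1) for i in range(1, 1501)])
cur.execute("ANALYZE")

q_orders_first = ("SELECT count(*) FROM orders O, Customer C "
                  "WHERE C.custkey = O.custkey AND C.name LIKE '%115'")
q_customer_first = ("SELECT count(*) FROM Customer C, orders O "
                    "WHERE C.custkey = O.custkey AND C.name LIKE '%115'")

n1 = cur.execute(q_orders_first).fetchone()[0]
n2 = cur.execute(q_customer_first).fetchone()[0]
assert n1 == n2  # same answer either way; only the plan differs

# Show which table ends up outermost in each ordering
for q in (q_orders_first, q_customer_first):
    print(cur.execute("EXPLAIN QUERY PLAN " + q).fetchall())
```

(The exact EXPLAIN QUERY PLAN text differs between 3.7.3's table-style
output and newer versions, but the choice of outer table should still be
visible either way.)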
A second problem lies with non-flattened subqueries: instead of
materializing the result in a transient table, SQLite re-runs the
subquery for each row of the other relation, even though the subquery
cannot possibly be correlated:
select count(*)
from
  (select julianday(O.orderdate) ordered,
          julianday(L.receiptdate) received
   from orders O, lineitem L
   where L.orderkey = O.orderkey
     and ordered >= julianday('1994-01-01')
     and received < julianday('1994-04-01')
  ) X,
  (select distinct julianday(orderdate) d
   from orders
   where d >= julianday('1994-01-01')
     and d < julianday('1994-04-01')
  ) Y
where Y.d between X.ordered and X.received;
The first subquery has cardinality 5918 and examines 150k rows; the
second has cardinality 90 and also examines 150k rows. The overall query
should therefore examine roughly 150k + 150k + 90*5918 = ~830k rows.
Instead, it takes 45s and 13650087 fullscan steps to run, or roughly
90*150k + 150k + 90 (re-running X once per row of Y, plus evaluating Y,
plus iterating over Y).
Reordering the FROM clause doesn't help (X really should go first, but
the optimizer insists on Y). Disabling the join-order optimization by
writing X cross join Y cuts the runtime to 1.6s and ~827k fullscan
steps.
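For what it's worth, manually materializing Y into a temp table also
seems to sidestep the repeated re-evaluation. Here's a sketch on toy
data (synthetic rows, not the TPC-H tables; `y_mat` is a name I made up)
verifying the rewrite gives the same answer as the nested form:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE orders   (orderkey INTEGER PRIMARY KEY, orderdate TEXT);
    CREATE TABLE lineitem (orderkey INTEGER, receiptdate TEXT);
    INSERT INTO orders VALUES (1, '1994-01-10'), (2, '1994-02-05'),
                              (3, '1994-03-20'), (4, '1994-05-01');
    INSERT INTO lineitem VALUES (1, '1994-02-01'), (2, '1994-03-01'),
                                (3, '1994-03-25'), (4, '1994-05-10');
""")

x_sql = """SELECT julianday(O.orderdate) ordered,
                  julianday(L.receiptdate) received
           FROM orders O, lineitem L
           WHERE L.orderkey = O.orderkey
             AND ordered >= julianday('1994-01-01')
             AND received < julianday('1994-04-01')"""
y_sql = """SELECT DISTINCT julianday(orderdate) d
           FROM orders
           WHERE d >= julianday('1994-01-01')
             AND d < julianday('1994-04-01')"""

# Original nested form: Y is re-run per row of the outer loop
nested = cur.execute(
    "SELECT count(*) FROM (%s) X, (%s) Y "
    "WHERE Y.d BETWEEN X.ordered AND X.received" % (x_sql, y_sql)
).fetchone()[0]

# Workaround: evaluate Y exactly once into a temp table, then join
cur.execute("CREATE TEMP TABLE y_mat AS " + y_sql)
materialized = cur.execute(
    "SELECT count(*) FROM (%s) X, y_mat Y "
    "WHERE Y.d BETWEEN X.ordered AND X.received" % x_sql
).fetchone()[0]

assert nested == materialized  # same result, Y evaluated only once
```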
Is there something obvious I'm missing here? The second case, in
particular, doesn't seem to depend on cardinalities: a non-correlated
subquery should be materialized rather than re-run repeatedly
(according to the docs, at least), at which point the join order
wouldn't matter nearly so much.
Thanks,
Ryan
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users