On Tue, Apr 08, 2008 at 12:20:01PM -0700, Ken scratched on the wall: > sqlite> explain query plan select * from foo where a is not null; > order|from|detail > 0|0|TABLE foo > > sqlite> explain query plan select * from foo where a is null; > order|from|detail > 0|0|TABLE foo WITH INDEX idx_a > > Sqlite is using the fact that You have an order by clause on both statements. > Removing the order by clears this up....
Not really... it is using or not using idx_a based off the inversion of "is not null" vs "is null." Removing the ORDER BY only prevents the system from using idx_b because it has no reason to do so. > > create table foo (id integer primary key autoincrement, a integer, > > b integer); > > > > And i fill it with 100,000 rows, with the a column always null and the > > b column an incrementing integer. I also add two indexes, one on > > foo.a and one on foo.b, and then run analyze. Now compare the query > > plans from these two selects: > > > > sqlite> explain query plan select * from foo where a is not null order by b; > > 0|0|TABLE foo WITH INDEX idx_b ORDER BY > > > > sqlite> explain query plan select * from foo where a is null order by b; > > 0|0|TABLE foo WITH INDEX idx_a The query planner assumes idx_a is really good and will be able to pull out an extremely small number of rows. This is incorrect. My guess (and this *is* a guess) is that the query planner assumes NULL values are much less common than non-NULL values. Additionally, (this is not a guess) the analysis system considers all the NULL values in idx_a to be unique rather than considering to be the same value. Logic aside, I'm not sure that's right for this case. In the first case, the planner seems will assume very few rows will be rejected via the "is not null" clause, so it is better to get the free "order by" from idx_b. In the second case, the planner seems to assume that nearly every row will be rejected by the "is null" clause, so it uses idx_a to do that operation and quickly extract the rows without a large scan. The planner assumes it will be left with only a handful of rows, and should be able to quickly order them. There are two areas where this breaks down. First, you've got an index on a column that has a high-percentage of similar values (in this case, 100%) and the analysis system misrepresents this. Indexes can actually hurt performance when they are filled with large numbers of similar values, since the planner will assume the index provides a more efficient lookup (and/or a high percentage of rejected rows). In this case it essentially ends up doing a full-scan anyways (since all the values in that index are the same), but it doesn't matter. Second, you're not operating under the assumptions of the planner and/or the analysis. There isn't a lot you can do about this, other than fixing the analysis. In the first case, the second issue doesn't really matter as you have to more or less do a full scan anyways-- but there is no sort cost and you end up rejecting all the rows anyways. No matter which index you use, the result would be the same: one full scan, all rows rejected, no rows sorted. In the second case, the planner assumes it can pick out a small handful of NULL rows out of the index without doing a full scan, and then take a lower cost hit to sort that small handful of rows after-the-fact. Because of the wacked out index, it ends up doing a full scan anyways, and then takes the huge hit of sorting the whole table post-process. The stat table actually hurts in this case, since the stats for both indexes are the same. This appears to be because two NULL values appear to be considered distinct values, so the analysis system sees column a as being full of unique values. That makes the "usefulness" weight of idx_a highly inflated. I have filed a bug on that. We'll see if the developers agree. If you go through this step-by-step it explains Ken's results as well, even if he didn't do an ANALYZE. -j -- Jay A. Kreibich < J A Y @ K R E I B I.C H > "'People who live in bamboo houses should not throw pandas.' Jesus said that." - "The Ninja", www.AskANinja.com, "Special Delivery 10: Pop!Tech 2006" _______________________________________________ sqlite-users mailing list [email protected] http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

