On Tue, Apr 08, 2008 at 12:20:01PM -0700, Ken scratched on the wall:
> sqlite> explain query plan select * from foo where a is not null;
> order|from|detail
> 0|0|TABLE foo
> 
> sqlite> explain query plan select * from foo where a is null;
> order|from|detail
> 0|0|TABLE foo WITH INDEX idx_a
> 
> Sqlite is using the fact that You have an order by clause on both statements.
> Removing the order by clears this up....

  Not really... it is using or not using idx_a based off the inversion
  of "is not null" vs "is null."  Removing the ORDER BY only prevents
  the system from using idx_b because it has no reason to do so.


> >  create table foo (id integer primary key autoincrement, a integer,
> >  b integer);
> > 
> > And i fill it with 100,000 rows, with the a column always null and the
> > b column an incrementing integer.  I also add two indexes, one on
> > foo.a and one on foo.b, and then run analyze.  Now compare the query
> > plans from these two selects:
> > 
> > sqlite> explain query plan select * from foo where a is not null order by b;
> > 0|0|TABLE foo WITH INDEX idx_b ORDER BY
> > 
> > sqlite> explain query plan select * from foo where a is null order by b;
> > 0|0|TABLE foo WITH INDEX idx_a

  The query planner assumes idx_a is really good and will be able to
  pull out an extremely small number of rows.  This is incorrect.


  My guess (and this *is* a guess) is that the query planner assumes
  NULL values are much less common than non-NULL values.  Additionally,
  (this is not a guess) the analysis system  considers all the NULL
  values in idx_a to be unique rather than considering to be the
  same value.  Logic aside, I'm not sure that's right for this case.
 

  In the first case, the planner seems will assume very few rows will be
  rejected via the "is not null" clause, so it is better to get the
  free "order by" from idx_b.

  In the second case, the planner seems to assume that nearly every row
  will be rejected by the "is null" clause, so it uses idx_a to do that
  operation and quickly extract the rows without a large scan.  The
  planner assumes it will be left with only a handful of rows, and
  should be able to quickly order them.

 
  There are two areas where this breaks down.  First, you've got an
  index on a column that has a high-percentage of similar values (in
  this case, 100%) and the analysis system misrepresents this.
  
  Indexes can actually hurt performance when they are filled with large
  numbers of similar values, since the planner will assume the index
  provides a more efficient lookup (and/or a high percentage of rejected
  rows).  In this case it essentially ends up doing a full-scan anyways
  (since all the values in that index are the same), but it doesn't matter.

  Second, you're not operating under the assumptions of the planner
  and/or the analysis.  There isn't a lot you can do about this, other
  than fixing the analysis.


  In the first case, the second issue doesn't really matter as you
  have to more or less do a full scan anyways-- but there is no sort cost
  and you end up rejecting all the rows anyways.  No matter which index
  you use, the result would be the same: one full scan, all rows
  rejected, no rows sorted.

  In the second case, the planner assumes it can pick out a small
  handful of NULL rows out of the index without doing a full scan, and
  then take a lower cost hit to sort that small handful of rows
  after-the-fact.  Because of the wacked out index, it ends up doing a
  full scan anyways, and then takes the huge hit of sorting the whole
  table post-process.

  The stat table actually hurts in this case, since the stats for both
  indexes are the same.  This appears to be because two NULL values
  appear to be considered distinct values, so the analysis system sees
  column a as being full of unique values.  That makes the "usefulness"
  weight of idx_a highly inflated.

  I have filed a bug on that.  We'll see if the developers agree.



  If you go through this step-by-step it explains Ken's results as
  well, even if he didn't do an ANALYZE.

   -j

-- 
Jay A. Kreibich < J A Y  @  K R E I B I.C H >

"'People who live in bamboo houses should not throw pandas.' Jesus said that."
   - "The Ninja", www.AskANinja.com, "Special Delivery 10: Pop!Tech 2006"
_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to