Seth Falcon <[EMAIL PROTECTED]> wrote:
I'm working with a table with ~3 million rows and 7 columns.
Recently, I found that a simple query takes longer _after_ adding an
index.  I find this surprising.  Can someone provide some insight?

Here are some details:

The schema:

   CREATE TABLE pmfeature (
       fid integer not null,
       strand integer,
       allele integer,
       fsetid integer not null references "featureSet" ("fsetid"),
       pos integer,
       x integer,
       y integer,
       UNIQUE("fid"));
   CREATE INDEX foo on pmfeature ("allele");

The query:

   select fid from pmfeature where allele = 0;

On my laptop, if I do (where basic.sql contains the above query):

   time sqlite thedb.sqlite < basic.sql > /dev/null

it takes about 9 sec without the 'foo' index and 13 sec with it.  The
query returns about half the records (1.5 million).

This is not unusual when you index a field with low cardinality (number of distinct values). To illustrate, imagine that allele=0 in all rows. A simple scan without using an index would look once at each row, for a complexity of O(N) (where N is the number of rows in a table). A scan using an index would visit every entry in the index (there are N of them), obtain rowid of the corresponding row and then look up the field "fid" based on this rowid (a log(N) operation), for an overall complexity of O(N log N).

In this particular case, you can improve performance of this specific query if you declare fid as "integer primary key" (and drop UNIQUE constraint, as it is implied by primary key). This will make fid a synonym for rowid, and the query will be satisfied by scanning the index alone, without ever referring to the table.

In general, however, you should avoid indexing low cardinality fields at all: it hurts more often than it helps.

Igor Tandetnik

-----------------------------------------------------------------------------
To unsubscribe, send email to [EMAIL PROTECTED]
-----------------------------------------------------------------------------

Reply via email to