Daniel Franke <[EMAIL PROTECTED]> wrote:
> 
> Is there any chance to speed this up? 
> 
> CREATE TABLE genotypes(
>     markerid integer NOT NULL REFERENCES marker(id),
>     individualid integer NOT NULL REFERENCES individuals(id),
>     genA integer,
>     genB integer,
>     UNIQUE(markerid, individualid));
> 
> CREATE INDEX genotypeidx ON genotypes(markerid, individualid);
> 

Inserting bulk data into a table is very fast since the insert
can be done by a simple append.  Inserting data into an index,
on the other hand, is slower because the data has to be inserted
in order.  So if you are inserting into both a large table
and a large index, the index insertion time dominates.

In the schema above, you have created two identical indices.
The clause

    UNIQUE(markerid, individualid)

that appears in the table defintion defines an index on the
two fields.  Then you turn around and create a second, redundant
index on those same two fields.

Simply eliminating one or the other of the two indices should
increase your performance by a constant factor which approaches 2.0.

We have also found that inserts go faster still if you do the
table inserts first, then after the table is fully populated
create the index.  If you are actively using the uniqueness 
constraint to reject redundant entries, this approach will not
work for you - you will need to specify the uniqueness constraint
and thus create the index before any data has been inserted.  But
if your initial bulk insert contains no redundancy, then delaying
the index creating until after the insert completes will improve
performance.  Since you are using INSERT OR ROLLBACK, you
can presumably do without the uniqueness constraint during your
initial data insert.  Then add the constraint by specifying
the UNIQUE keyword when you create your index:

    CREATE UNIQUE INDEX genotypeidx ON ....
           ^^^^^^

Anything you can do to improve locality of reference while
inserting data into the database will help.  If you are
inserting all markers for the same individual at once, then
you will do better to create your index as

    CREATE UNIQUE INDEX idx ON genotypes(individualid, markerid)

rather than the other way around.  On the other hand, if you
use your database to search by markerid then you will want 
to use your original ordering, even if it is slower to insert.

Finally, parsing a few million INSERT statements is very fast
in SQLite but it still takes time.  You can increase the speed
by a factor of 3 or more by using prepared statements, if you
are not already.

--
D. Richard Hipp   <[EMAIL PROTECTED]>

Reply via email to