Daniel Franke <[EMAIL PROTECTED]> wrote: > > Is there any chance to speed this up? > > CREATE TABLE genotypes( > markerid integer NOT NULL REFERENCES marker(id), > individualid integer NOT NULL REFERENCES individuals(id), > genA integer, > genB integer, > UNIQUE(markerid, individualid)); > > CREATE INDEX genotypeidx ON genotypes(markerid, individualid); >
Inserting bulk data into a table is very fast since the insert can be done by a simple append. Inserting data into an index, on the other hand, is slower because the data has to be inserted in order. So if you are inserting into both a large table and a large index, the index insertion time dominates. In the schema above, you have created two identical indices. The clause UNIQUE(markerid, individualid) that appears in the table defintion defines an index on the two fields. Then you turn around and create a second, redundant index on those same two fields. Simply eliminating one or the other of the two indices should increase your performance by a constant factor which approaches 2.0. We have also found that inserts go faster still if you do the table inserts first, then after the table is fully populated create the index. If you are actively using the uniqueness constraint to reject redundant entries, this approach will not work for you - you will need to specify the uniqueness constraint and thus create the index before any data has been inserted. But if your initial bulk insert contains no redundancy, then delaying the index creating until after the insert completes will improve performance. Since you are using INSERT OR ROLLBACK, you can presumably do without the uniqueness constraint during your initial data insert. Then add the constraint by specifying the UNIQUE keyword when you create your index: CREATE UNIQUE INDEX genotypeidx ON .... ^^^^^^ Anything you can do to improve locality of reference while inserting data into the database will help. If you are inserting all markers for the same individual at once, then you will do better to create your index as CREATE UNIQUE INDEX idx ON genotypes(individualid, markerid) rather than the other way around. On the other hand, if you use your database to search by markerid then you will want to use your original ordering, even if it is slower to insert. Finally, parsing a few million INSERT statements is very fast in SQLite but it still takes time. You can increase the speed by a factor of 3 or more by using prepared statements, if you are not already. -- D. Richard Hipp <[EMAIL PROTECTED]>