Re: [sqlite] large table performance
"Daniel Franke" <[EMAIL PROTECTED]> wrote:
> Another question that arose today: Is there any penalty for switching
> tables during INSERTs within the same COMMIT? E.g.
>
>     BEGIN;
>     INSERT INTO tableA VALUES ...;
>     INSERT INTO tableB VALUES ...;
>     INSERT INTO tableA VALUES ...;
>     INSERT INTO tableB VALUES ...;
>     :
>     COMMIT;
>
> as opposed to
>
>     BEGIN;
>     INSERT INTO tableA VALUES ...;
>     INSERT INTO tableA VALUES ...;
>     INSERT INTO tableA VALUES ...;
>     :
>     COMMIT;
>     BEGIN;
>     INSERT INTO tableB VALUES ...;
>     INSERT INTO tableB VALUES ...;
>     INSERT INTO tableB VALUES ...;
>     :
>     COMMIT;
>
> Yesterday I did the former and it seemed to take ages. Today I used the
> latter ... it seems to be faster?!

My guess is that locality of reference would make the second approach
faster than the first. I would also guess that the resulting database
would run queries faster as well.
--
D. Richard Hipp <[EMAIL PROTECTED]>
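[Editor's note: the locality-of-reference guess above can be tried out directly. Below is a minimal, illustrative Python sketch using the stdlib sqlite3 bindings rather than the C API the thread discusses; the one-column tableA/tableB schema and the helper names are assumptions, not part of the original exchange.]

```python
import sqlite3
import time

def interleaved(db, rows):
    # One transaction, alternating between the two tables on every insert
    # (the first pattern from the question).
    with db:
        for a, b in rows:
            db.execute("INSERT INTO tableA VALUES (?)", (a,))
            db.execute("INSERT INTO tableB VALUES (?)", (b,))

def grouped(db, rows):
    # One transaction per table: all tableA inserts, then all tableB inserts
    # (the second pattern from the question).
    with db:
        db.executemany("INSERT INTO tableA VALUES (?)", ((a,) for a, _ in rows))
    with db:
        db.executemany("INSERT INTO tableB VALUES (?)", ((b,) for _, b in rows))

def timed(fn, rows):
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE tableA(x)")
    db.execute("CREATE TABLE tableB(x)")
    t0 = time.perf_counter()
    fn(db, rows)
    return time.perf_counter() - t0, db

rows = [(i, -i) for i in range(100_000)]
t1, db1 = timed(interleaved, rows)
t2, db2 = timed(grouped, rows)
print(f"interleaved: {t1:.3f}s  grouped: {t2:.3f}s")
```

An in-memory database understates the effect; on disk, grouping the inserts per table keeps writes to nearby pages together, which is the locality argument being made above.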
Re: [sqlite] large table performance
While we are talking performance: did anyone compare sqlite in a simplistic

    CREATE TABLE t(key BLOB PRIMARY KEY, value BLOB);

scenario to other dedicated key/value DBs (like Berkeley DB, gdbm, ...)?

Elrond
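[Editor's note: for readers wanting to benchmark this themselves, the key/value schema above takes only a few lines to exercise. This is a hedged Python sketch; the put/get helper names are invented for illustration and are not any library's API.]

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t(key BLOB PRIMARY KEY, value BLOB)")

def put(k: bytes, v: bytes):
    # INSERT OR REPLACE gives dbm-style "last write wins" semantics
    # on the primary key.
    db.execute("INSERT OR REPLACE INTO t VALUES (?, ?)", (k, v))

def get(k: bytes):
    row = db.execute("SELECT value FROM t WHERE key = ?", (k,)).fetchone()
    return row[0] if row else None

put(b"species", b"H. sapiens")
print(get(b"species"))  # b'H. sapiens'
```

A fair comparison against gdbm or Berkeley DB would also have to match durability settings (synchronous mode, transaction batching), since those dominate write throughput.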
Re: [sqlite] large table performance
Daniel Franke <[EMAIL PROTECTED]> wrote:
> Is there any chance to speed this up?
>
>     CREATE TABLE genotypes(
>         markerid integer NOT NULL REFERENCES marker(id),
>         individualid integer NOT NULL REFERENCES individuals(id),
>         genA integer,
>         genB integer,
>         UNIQUE(markerid, individualid));
>     CREATE INDEX genotypeidx ON genotypes(markerid, individualid);

Inserting bulk data into a table is very fast, since the insert can be done by a simple append. Inserting data into an index, on the other hand, is slower because the data has to be inserted in order. So if you are inserting into both a large table and a large index, the index insertion time dominates.

In the schema above, you have created two identical indices. The clause UNIQUE(markerid, individualid) that appears in the table definition defines an index on the two fields. Then you turn around and create a second, redundant index on those same two fields. Simply eliminating one or the other of the two indices should increase your performance by a constant factor which approaches 2.0.

We have also found that inserts go faster still if you do the table inserts first, then create the index after the table is fully populated. If you are actively using the uniqueness constraint to reject redundant entries, this approach will not work for you - you will need to specify the uniqueness constraint, and thus create the index, before any data has been inserted. But if your initial bulk insert contains no redundancy, then delaying the index creation until after the insert completes will improve performance. Since you are using INSERT OR ROLLBACK, you can presumably do without the uniqueness constraint during your initial data insert. Then add the constraint by specifying the UNIQUE keyword when you create your index:

    CREATE UNIQUE INDEX genotypeidx ON ...
           ^^^^^^

Anything you can do to improve locality of reference while inserting data into the database will help.
If you are inserting all markers for the same individual at once, then you will do better to create your index as

    CREATE UNIQUE INDEX idx ON genotypes(individualid, markerid)

rather than the other way around. On the other hand, if you use your database to search by markerid, then you will want to use your original ordering, even if it is slower to insert.

Finally, parsing a few million INSERT statements is very fast in SQLite, but it still takes time. You can increase the speed by a factor of 3 or more by using prepared statements, if you are not already.
--
D. Richard Hipp <[EMAIL PROTECTED]>
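[Editor's note: in the C API the prepared-statement advice means sqlite3_prepare_v2() plus sqlite3_bind_*() and sqlite3_reset(), instead of re-parsing SQL text through sqlite3_exec(). As an illustrative sketch, Python's stdlib sqlite3 module gets the same effect through executemany(), which prepares the statement once and only re-binds parameters per row; the toy column list below is an assumption.]

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE genotypes(individualid, markerid, genA, genB)")

# executemany() prepares the INSERT once and re-binds parameters for
# every row, analogous to a prepare/bind/step/reset loop in the C API,
# rather than parsing a fresh SQL string per row.
rows = ((ind, mk, 0, 1) for ind in range(100) for mk in range(100))
with db:  # one transaction around the whole batch
    db.executemany("INSERT INTO genotypes VALUES (?,?,?,?)", rows)

print(db.execute("SELECT count(*) FROM genotypes").fetchone()[0])  # 10000
```

Note the rows are generated individual-first, matching the (individualid, markerid) index ordering recommended above for this insertion pattern.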
RE: [sqlite] large table performance
> Given the schema below, feeding a million INSERTs into the database by
> sqlite3_exec() takes about 30 minutes (this includes transactions, indices
> and "PRAGMA synchronous=off" as well as a fully optimized build of sqlite).
> Is there any chance to speed this up? Production datasets could easily
> bring a billion genotypes ...

I assume from your description that you populate many rows in one shot. If that is the case, I recommend that you just create the tables without indices and populate the DB with the data. Then create the indices afterwards to improve reading performance.
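[Editor's note: a small Python sketch of the load-then-index pattern suggested in this thread, with toy row counts; the UNIQUE keyword on the index restores the uniqueness constraint after the bulk load.]

```python
import sqlite3

db = sqlite3.connect(":memory:")
# No UNIQUE constraint and no index yet: rows land as plain appends.
db.execute("""CREATE TABLE genotypes(
    markerid INTEGER NOT NULL,
    individualid INTEGER NOT NULL,
    genA INTEGER, genB INTEGER)""")

# Bulk load first, in one transaction, with no per-row index maintenance.
rows = [(m, i, 0, 1) for m in range(200) for i in range(50)]
with db:
    db.executemany("INSERT INTO genotypes VALUES (?,?,?,?)", rows)

# Only now build the unique index, in a single ordered pass over the data.
db.execute(
    "CREATE UNIQUE INDEX genotypeidx ON genotypes(markerid, individualid)")
```

If the bulk data did contain duplicate (markerid, individualid) pairs, the CREATE UNIQUE INDEX step would fail, which is why this approach only suits loads known to be duplicate-free.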
Re: [sqlite] large table performance
Daniel Franke wrote:

Hi all. I spent the last days bragging that a single database file as provided by sqlite is a far better approach to store data than, literally, thousands of flat files. Now I got a small amount of testing data and wow ... I'm stuck.

Area: Bioinformatics. Imagine a matrix of data: genetic marker names (attribute A) in columns and individuals (attribute B) in rows. Since the number of features per attribute varies between projects, I decided to create three tables:

 * Table markers: the genetic markers (attribute A), e.g. 100.000 rows
 * Table individuals: individual ids (attribute B), e.g. 1.000 rows
 * Table genotypes: the genetic data

Tables "markers" and "individuals" have 2 and 6 columns respectively, a unique primary key, and (basically) the name of the feature. "genotypes" holds foreign keys to "markers" and "individuals" respectively, as well as the genotype column(s), see below. Genotypes are inserted by:

    INSERT OR ROLLBACK INTO genotypes
    VALUES ((SELECT id FROM marker WHERE name='$markername$'),
            (SELECT id FROM individuals
              WHERE pedigree='$pedigree$' AND person='$person$'),
            $genA$, $genB$);

where $markername$, ..., $genB$ are replaced with the appropriate values.

Given the schema below, feeding a million INSERTs into the database by sqlite3_exec() takes about 30 minutes (this includes transactions, indices and "PRAGMA synchronous=off", as well as a fully optimized build of sqlite). Is there any chance to speed this up? Production datasets could easily bring a billion genotypes ... Any pointer would be appreciated!
With kind regards
Daniel Franke

--
The database schema:

    CREATE TABLE marker (
        id integer PRIMARY KEY AUTOINCREMENT,
        name varchar UNIQUE);
    CREATE INDEX markernameidx ON marker(name);

    CREATE TABLE individuals (
        id integer PRIMARY KEY AUTOINCREMENT,
        pedigree varchar NOT NULL,
        person varchar NOT NULL,
        father varchar,
        mother varchar,
        sex integer NOT NULL,
        UNIQUE(pedigree, person));
    CREATE INDEX individualidx ON individuals(pedigree, person);

    CREATE TABLE genotypes(
        markerid integer NOT NULL REFERENCES marker(id),
        individualid integer NOT NULL REFERENCES individuals(id),
        genA integer,
        genB integer,
        UNIQUE(markerid, individualid));
    CREATE INDEX genotypeidx ON genotypes(markerid, individualid);

Test with the sqlite3 command line shell with transaction(s). That would be a better (and easier to reproduce) test.

Regards
Boguslaw
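[Editor's note: as a side point on the INSERT shown in the original question, the $markername$-style string substitution can be replaced with bound parameters, which avoids re-parsing and quoting problems. A minimal Python sketch against a trimmed-down copy of the schema above; the sample values are invented.]

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE marker(id INTEGER PRIMARY KEY AUTOINCREMENT,
                    name varchar UNIQUE);
CREATE TABLE individuals(id INTEGER PRIMARY KEY AUTOINCREMENT,
                         pedigree varchar NOT NULL,
                         person varchar NOT NULL,
                         UNIQUE(pedigree, person));
CREATE TABLE genotypes(markerid INTEGER NOT NULL,
                       individualid INTEGER NOT NULL,
                       genA INTEGER, genB INTEGER,
                       UNIQUE(markerid, individualid));
INSERT INTO marker(name) VALUES ('rs123');
INSERT INTO individuals(pedigree, person) VALUES ('ped1', 'p1');
""")

# Bound parameters replace the string-substituted $markername$ etc.:
# the statement is prepared once and values are passed separately.
sql = """INSERT INTO genotypes VALUES (
           (SELECT id FROM marker WHERE name = ?),
           (SELECT id FROM individuals
             WHERE pedigree = ? AND person = ?),
           ?, ?)"""
db.execute(sql, ("rs123", "ped1", "p1", 1, 2))
```

The subselect lookups still cost an index probe per row; caching the marker and individual ids in the application (they are known after loading the two small tables) would avoid even that.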