Daniel Franke wrote:
Hi all.
I spent the last days bragging that a single database file as provided by
sqlite is a far better approach to store data than -literally- thousands of
flat files. Now, I got a small amount of testing data an wow ... I'm stuck.
Area: Bioinformatics. Imagine a matrix of data: genetic marker names
(attribute A) in columns and individuals (attribute B) in rows. Since the
number of features per attribute varies between projects, I decided to create
three tables:
* Table markers: the genetic markers (attribute A), e.g. 100.000 rows
* Table individuals: individual ids (attribute B), e.g. 1.000 rows
* Table genotypes: the genetic data
Tables "markers" and "individuals" have 2 and 6 columns respectively, a unique
primary key, and the (basically) the name of the feature, "genotypes" holds
foreign keys to "markers"/"individuals" respectively as well as the genotype
column(s), see below.
Genotypes are inserted by:
INSERT OR ROLLBACK INTO genotypes VALUES ((SELECT id FROM marker WHERE
name='$markername$'), (SELECT id FROM individuals WHERE pedigree='$pedigree$'
AND person='$person$'), $genA$, $genB$);
Where $markername$, ..., $genB$ are replaced with the appropiate values.
Given the schema below, feeding a million INSERTs into the database by
sqlite3_exec() takes about 30 minutes (this includes transactions, indices
and "PRAGMA synchronous=off" as well as a fully optimized build of sqlite).
Is there any chance to speed this up? Production datasets could easily bring a
billion genotypes ...
Any pointer would be appreciated!
With kind regards
Daniel Franke
--
The database schema:
CREATE TABLE marker (
id integer PRIMARY KEY AUTOINCREMENT,
name varchar UNIQUE);
CREATE INDEX markernameidx on marker(name);
CREATE TABLE individuals (
id integer PRIMARY KEY AUTOINCREMENT,
pedigree varchar NOT NULL,
person varchar NOT NULL,
father varchar,
mother varchar,
sex integer NOT NULL,
UNIQUE(pedigree, person));
CREATE INDEX individualidx ON individuals (pedigree, person);
CREATE TABLE genotypes(
markerid integer NOT NULL REFERENCES marker(id),
individualid integer NOT NULL REFERENCES individuals(id),
genA integer,
genB integer,
UNIQUE(markerid, individualid));
CREATE INDEX genotypeidx ON genotypes(markerid, individualid);
Test with sqlite3 command line shell with transaction(s)
That would be a better (an easier to reproduce) test.
Regards
Boguslaw