Re: [sqlite] large table performance

2006-03-03 Thread drh
"Daniel Franke" <[EMAIL PROTECTED]> wrote:
>> 
> Another question that arose today:
> Is there any penalty for switching tables during INSERTs within the
> same COMMIT? E.g.
> 
> BEGIN;
> INSERT INTO tableA VALUES ...;
> INSERT INTO tableB VALUES ...;
> INSERT INTO tableA VALUES ...;
> INSERT INTO tableB VALUES ...;
>  :
> COMMIT;
> 
> opposed to
> 
> BEGIN;
> INSERT INTO tableA VALUES ...;
> INSERT INTO tableA VALUES ...;
> INSERT INTO tableA VALUES ...;
>   :
> COMMIT;
> BEGIN;
> INSERT INTO tableB VALUES ...;
> INSERT INTO tableB VALUES ...;
> INSERT INTO tableB VALUES ...;
>   :
> COMMIT;
> 
> Yesterday I did the former; it seemed to take ages. Today I used the
> latter ... it seems to be faster?!
> 

My guess is that locality of reference would make the second
approach faster than the first.  I would also guess that the
resulting database would run queries faster as well.
--
D. Richard Hipp   <[EMAIL PROTECTED]>



Re: [sqlite] large table performance

2006-03-03 Thread Elrond

While we're talking performance:

Did anyone compare sqlite in a simplistic

CREATE TABLE t(key BLOB PRIMARY KEY, value BLOB);

scenario to other dedicated key/value DBs (like Berkeley
DB, gdbm, ...)?


Elrond


Re: [sqlite] large table performance

2006-03-03 Thread drh
Daniel Franke <[EMAIL PROTECTED]> wrote:
> 
> Is there any chance to speed this up? 
> 
> CREATE TABLE genotypes(
> markerid integer NOT NULL REFERENCES marker(id),
> individualid integer NOT NULL REFERENCES individuals(id),
> genA integer,
> genB integer,
> UNIQUE(markerid, individualid));
> 
> CREATE INDEX genotypeidx ON genotypes(markerid, individualid);
> 

Inserting bulk data into a table is very fast since the insert
can be done by a simple append.  Inserting data into an index,
on the other hand, is slower because the data has to be inserted
in order.  So if you are inserting into both a large table
and a large index, the index insertion time dominates.

In the schema above, you have created two identical indices.
The clause

UNIQUE(markerid, individualid)

that appears in the table definition defines an index on the
two fields.  Then you turn around and create a second, redundant
index on those same two fields.

Simply eliminating one or the other of the two indices should
increase your performance by a constant factor which approaches 2.0.
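
For example (a sketch against the schema quoted above, not something
from the original post), either of these removes the duplication:

    -- keep only the implicit index created by the UNIQUE clause
    DROP INDEX genotypeidx;

    -- or: omit UNIQUE(markerid, individualid) from the CREATE TABLE
    -- and keep only the explicit index, declared UNIQUE
    CREATE UNIQUE INDEX genotypeidx ON genotypes(markerid, individualid);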

We have also found that inserts go faster still if you do the
table inserts first, then after the table is fully populated
create the index.  If you are actively using the uniqueness 
constraint to reject redundant entries, this approach will not
work for you - you will need to specify the uniqueness constraint
and thus create the index before any data has been inserted.  But
if your initial bulk insert contains no redundancy, then delaying
the index creation until after the insert completes will improve
performance.  Since you are using INSERT OR ROLLBACK, you
can presumably do without the uniqueness constraint during your
initial data insert.  Then add the constraint by specifying
the UNIQUE keyword when you create your index:

CREATE UNIQUE INDEX genotypeidx ON genotypes(markerid, individualid);
       ^^^^^^
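
Putting the pieces together, the bulk-load pattern looks roughly like
this (a sketch only; the literal values are made up, and it assumes the
initial data contains no duplicate (markerid, individualid) pairs):

    BEGIN;
    INSERT INTO genotypes VALUES(17, 1, 1, 2);
    INSERT INTO genotypes VALUES(17, 2, 2, 2);
    ...
    COMMIT;
    CREATE UNIQUE INDEX genotypeidx
        ON genotypes(markerid, individualid);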

Anything you can do to improve locality of reference while
inserting data into the database will help.  If you are
inserting all markers for the same individual at once, then
you will do better to create your index as

CREATE UNIQUE INDEX idx ON genotypes(individualid, markerid)

rather than the other way around.  On the other hand, if you
use your database to search by markerid then you will want 
to use your original ordering, even if it is slower to insert.

Finally, parsing a few million INSERT statements is very fast
in SQLite but it still takes time.  You can increase the speed
by a factor of 3 or more by using prepared statements, if you
are not already.
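
For instance, a minimal sketch of the prepared-statement loop using the
C API (error checking omitted; the function and array names are made up
for illustration):

    #include <sqlite3.h>

    /* Insert n genotype rows using one precompiled INSERT statement. */
    static void insert_genotypes(sqlite3 *db, int n,
                                 const int *markerid, const int *individualid,
                                 const int *genA, const int *genB){
      sqlite3_stmt *stmt;
      int i;
      sqlite3_exec(db, "BEGIN", 0, 0, 0);
      sqlite3_prepare(db,
          "INSERT INTO genotypes VALUES(?1, ?2, ?3, ?4)", -1, &stmt, 0);
      for(i=0; i<n; i++){
        sqlite3_bind_int(stmt, 1, markerid[i]);
        sqlite3_bind_int(stmt, 2, individualid[i]);
        sqlite3_bind_int(stmt, 3, genA[i]);
        sqlite3_bind_int(stmt, 4, genB[i]);
        sqlite3_step(stmt);    /* run the INSERT for this row */
        sqlite3_reset(stmt);   /* rewind so the statement can be reused */
      }
      sqlite3_finalize(stmt);
      sqlite3_exec(db, "COMMIT", 0, 0, 0);
    }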

--
D. Richard Hipp   <[EMAIL PROTECTED]>



RE: [sqlite] large table performance

2006-03-03 Thread Brandon, Nicholas (UK)



>Given the schema below, feeding a million INSERTs into the database by
>sqlite3_exec() takes about 30 minutes (this includes transactions, indices
>and "PRAGMA synchronous=off" as well as a fully optimized build of sqlite).

>Is there any chance to speed this up? Production datasets could easily bring a
>billion genotypes ...

I assumed from your description that you populate many rows in one shot. If
that is the case, I recommend that you just create the tables without indices
and populate the DB with the data. Then create the indices afterwards to
improve read performance.





Re: [sqlite] large table performance

2006-03-03 Thread Bogusław Brandys

Daniel Franke wrote:

Hi all.

I spent the last days bragging that a single database file as provided by 
sqlite is a far better approach to store data than -literally- thousands of 
flat files. Now, I got a small amount of testing data and wow ... I'm stuck.


Area: Bioinformatics. Imagine a matrix of data: genetic marker names 
(attribute A) in columns and individuals (attribute B) in rows. Since the 
number of features per attribute varies between projects, I decided to create 
three tables:


 * Table markers: the genetic markers (attribute A), e.g. 100,000 rows
 * Table individuals: individual ids (attribute B), e.g. 1,000 rows
 * Table genotypes:   the genetic data

Tables "markers" and "individuals" have 2 and 6 columns respectively: a unique
primary key and (basically) the name of the feature. "genotypes" holds foreign
keys to "markers" and "individuals" respectively, as well as the genotype
column(s), see below.


Genotypes are inserted by:

INSERT OR ROLLBACK INTO genotypes VALUES ((SELECT id FROM marker WHERE 
name='$markername$'), (SELECT id FROM individuals WHERE pedigree='$pedigree$' 
AND person='$person$'), $genA$, $genB$);


Where $markername$, ..., $genB$ are replaced with the appropriate values.

Given the schema below, feeding a million INSERTs into the database by 
sqlite3_exec() takes about 30 minutes (this includes transactions, indices 
and "PRAGMA synchronous=off" as well as a fully optimized build of sqlite). 

Is there any chance to speed this up? Production datasets could easily bring a 
billion genotypes ...



Any pointer would be appreciated!

With kind regards

Daniel Franke


--

The database schema:


CREATE TABLE marker (
id integer PRIMARY KEY AUTOINCREMENT,
name varchar UNIQUE);

CREATE INDEX markernameidx on marker(name);


CREATE TABLE individuals (
id integer PRIMARY KEY AUTOINCREMENT,
pedigree varchar NOT NULL,
person varchar NOT NULL,
father varchar,
mother varchar,
sex integer NOT NULL,
UNIQUE(pedigree, person));

CREATE INDEX individualidx ON individuals (pedigree, person);


CREATE TABLE genotypes(
markerid integer NOT NULL REFERENCES marker(id),
individualid integer NOT NULL REFERENCES individuals(id),
genA integer,
genB integer,
UNIQUE(markerid, individualid));

CREATE INDEX genotypeidx ON genotypes(markerid, individualid);





Test with the sqlite3 command line shell with transaction(s).
That would be a better (and easier to reproduce) test.


Regards
Boguslaw