Hi all

It seems that I am running into a problem with the way SQLite accesses the disk 
when inserting rows into databases that are large in number of records 
but not necessarily big on disk (I am talking millions of records in files that 
are on the order of a few hundred MBytes).

I reduced it to this test case:

I have a table that I create with:
PRAGMA cache_size = 32000;
CREATE TABLE IF NOT EXISTS name2uid(fname VARCHAR(100) NOT NULL, lname VARCHAR(32) NOT NULL, uid INTEGER NOT NULL,
 PRIMARY KEY (fname, lname));
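
For reference, roughly how I set this up through the C API (paraphrased sketch, error handling mostly trimmed; the function name and path argument are just placeholders):

    #include <sqlite3.h>
    #include <stdio.h>

    static sqlite3 *open_test_db(const char *path)
    {
        sqlite3 *db = NULL;
        if (sqlite3_open(path, &db) != SQLITE_OK) {
            fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
            return NULL;
        }
        /* bigger page cache for the test */
        sqlite3_exec(db, "PRAGMA cache_size = 32000", NULL, NULL, NULL);
        sqlite3_exec(db,
            "CREATE TABLE IF NOT EXISTS name2uid("
            " fname VARCHAR(100) NOT NULL,"
            " lname VARCHAR(32) NOT NULL,"
            " uid INTEGER NOT NULL,"
            " PRIMARY KEY (fname, lname))",
            NULL, NULL, NULL);
        return db;
    }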

Then for my test, I do inserts that look like this:
REPLACE INTO name2uid (fname, lname, uid) VALUES ('%s','SMITH',%d)
where I set fname to U%xser (%x replaced by the uid, which I generate myself).

I group the REPLACEs into transactions of 10000 at a time, roughly as sketched below.
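
The loop looks something like this (a paraphrased sketch, error checks dropped; the batching into 10000-row transactions is the important part):

    #include <sqlite3.h>
    #include <stdio.h>

    static void run_inserts(sqlite3 *db, int n)
    {
        char sql[256];
        int i;
        for (i = 1; i <= n; i++) {
            int uid = i;                    /* Test 1: consecutive uids */
            if ((i - 1) % 10000 == 0)
                sqlite3_exec(db, "BEGIN", NULL, NULL, NULL);
            /* fname is U<hex uid>ser, lname is constant */
            snprintf(sql, sizeof sql,
                     "REPLACE INTO name2uid (fname, lname, uid) "
                     "VALUES ('U%xser','SMITH',%d)", (unsigned)uid, uid);
            sqlite3_exec(db, sql, NULL, NULL, NULL);
            if (i % 10000 == 0)
                sqlite3_exec(db, "COMMIT", NULL, NULL, NULL);
        }
    }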

I am running on Linux Fedora Core 3 on an Opteron 146 (2 GHz), with 2 GB RAM and a SATA 
drive with a partition reserved for the SQLite database.

IMPORTANT: I umount/mount the partition between tests to clear the OS disk caches.

Test 1:
From an empty DB.
I loop so that the uids are consecutive numbers from 1 to 1 million.
At this point, each transaction takes less than 1 second to execute. The whole 
1M inserts (100 transactions) take 74 seconds, i.e. about 13,000 inserts/second.

Test 2:
With the DB from Test 1, I run the exact same sequence of inserts.
The 1M inserts take 103 seconds to execute; that's still about 9,700 inserts/second.
The first transaction takes 8 seconds, then each subsequent one about 1 second.
At this point everything is OK.

Test 3:
From the DB from Test 2, I run 1 million inserts where the uid is selected 
randomly between 1 and 1 million.
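(The only change from Tests 1 and 2 is how uid is chosen, something along these lines, with rand() from stdlib.h used here purely for illustration:

    int uid = (rand() % 1000000) + 1;   /* random key in [1, 1M] instead of the loop counter */
)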
At this point, the performance is pretty bad:
the first 10000-insert transaction takes 31 seconds to run, and the next ones take 
over 5 seconds each, for a total run time of 623 seconds.
That's about 1,600 inserts/second (6 times slower than the ordered case).

It seems that performance degrades pretty badly with the number of records, even 
though this is still a relatively small dataset (especially given the simplicity 
of the table).
To me, it looks like there is a problem in the way the files are accessed.

I tried to partition the dataset by creating separate databases or tables, but 
creating separate databases makes things slower (expected, as the problem is 
probably disk I/O seeks), and partitioning into separate tables gives me only a 
20% speed gain on those 1M-insert tests.

Things get really slow afterwards, for example when moving to 2M records (137 MB on 
disk):
Test 1 w/ 2M: 138 seconds, about 2 times slower than the 1M case (normal).
Test 2 w/ 2M: 192 seconds, 2 times slower (normal).
Test 3 w/ 2M: 2390 seconds, about 4 times slower than the 1M case (12 times 
slower than the ordered case).

I didn't try these tests with bigger sets (my original app was on a DB of about 
8M records, and things were very, very slow).

Given the size of the DB on disk, I would think that the speed could be much 
more consistent than that (especially considering that it should cache a lot of 
the information quickly).

Any ideas?

Nicolas


PS: I use the C API for running these tests.
