Sorry it took me some time to get back to this thread.

----- Original Message ----
From: Christian Smith <[EMAIL PROTECTED]>

> When your database does not fit in memory, yes, you're right, the OS may
> well get caching wrong, and in the worst way possible. Two things though:
> - SQLite does have known scaling issues when using multi-gigabyte
>   databases.
> - Memory is cheap. If it doesn't fit, spend the few hundred dollars a few
>   days of your time is worth and buy another few gigs of RAM.

The machine has 2 GB of RAM and the table that seems to cause the problems is 
less than 700 megs (proven by the fact that if I precache this database, things 
get zippy).
To me it seems like the problem is related to the way the reads are done, but I 
could be wrong: the caches never really seem to get "hot" (and with nothing 
else running on the machine, the OS is pretty much caching all reads done by 
sqlite).
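
(By "precache" I mean nothing fancier than reading the database file 
sequentially before opening it, so the OS page cache is warm before sqlite 
starts seeking around. A minimal sketch, with a made-up path:

    #include <stdio.h>

    /* Warm the OS page cache: one cheap sequential scan of the whole
     * file, instead of the random reads sqlite would do cold. */
    static void precache_file(const char *path)
    {
        char buf[1 << 16];
        FILE *f = fopen(path, "rb");
        if (!f)
            return;
        while (fread(buf, 1, sizeof buf, f) > 0)
            ;  /* discard the data; we only want the pages cached */
        fclose(f);
    }

called as precache_file("users.db") before sqlite3_open().)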

>
> >Right now, sqlite shows performance that is on par with a simple
> >filesystem structure (except it's much faster to backup because
> >traversing a multimillion-file structure takes several hours). I was
> >expecting a few things by moving to sqlite:
> >* getting a better ramp up (cold cache) performance than a dumb
> >  filesystem structure.

> Nothing will speed up the physical disks.

I agree that disk I/O is the bottleneck, but what can be worked around are the 
latencies of seeking in the file and the way operations are sent out to the 
disk, so that the OS caches more useful information and/or caches it more 
effectively.

> >* having a [much] higher throughput (so that I can combine multiple
> >  servers into one), as the records are small and there is no
> >  fragmentation of the folders here.
> >* having consistent latencies (filesystems tend to vary a lot).
> >
> >> - Have you tested other databases? What sort of performance did you get
> >>   from those?
> >I am in the process of setting up mysql with our app, I will keep you
> >posted on the result.
> 
> Prepare to be disappointed, IMO. The most important thing when dealing
> with big fast databases is disk IO, and anything (MySQL, PGSQL, Oracle,
> SQLite) will find disks a bottleneck.

The reason it took me a while to reply to this thread was that I set up a MySQL 
server (5.0.19, run on the same machine) and adapted my application to run with 
it.

The results on the exact same operations that I did before:
* importing the data is faster with MySQL, and performance doesn't degrade 
exponentially with the number of rows (insertions at the beginning and at the 
end of the import operation are of similar speed). sqlite and MySQL started at 
around the same speed, but after a few million inserts, sqlite becomes slow. (I 
sketch the shape of my import loop right after this list.)
* caches cleared, on my typical test run (read&write combination), MySQL ran 2 
minutes faster than sqlite (6 minutes vs 8 minutes), getting nice latencies 
after about 3 minutes (and that's where the difference is, I think). I think 
that after 3 minutes, MySQL manages to have most of the critical data cached in 
RAM.
* with precached dbs (file sizes are very similar for sqlite and MySQL, with 
the difference that MySQL separates the data and the index into 2 files), MySQL 
is faster too (1m30 vs 2 minutes).
* latencies stay low (as in, it rarely blocks for > 4 seconds; sqlite would 
block for > 10 seconds, especially on the early queries when there is no cache).
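
For reference, the import loop is shaped roughly like the sketch below; the 
table and column names are made up for the example and error checking is 
omitted, but the batching (commit every ~10 seconds rather than per row, as I 
mention further down) is the real scheme:

    #include <sqlite3.h>
    #include <stdio.h>
    #include <time.h>

    /* Batch inserts into periodic transactions so we don't pay one
     * synchronous commit per row. Illustrative names, no error checks. */
    static void import_names(sqlite3 *db, FILE *in)
    {
        sqlite3_stmt *ins;
        char name[256];
        sqlite3_int64 uid = 0;
        time_t last_commit = time(NULL);

        sqlite3_prepare_v2(db,
            "INSERT INTO users(name, uid) VALUES(?, ?)", -1, &ins, NULL);
        sqlite3_exec(db, "BEGIN", NULL, NULL, NULL);

        while (fscanf(in, "%255s", name) == 1) {
            sqlite3_bind_text(ins, 1, name, -1, SQLITE_TRANSIENT);
            sqlite3_bind_int64(ins, 2, ++uid);
            sqlite3_step(ins);
            sqlite3_reset(ins);

            if (time(NULL) - last_commit >= 10) {  /* commit every ~10s */
                sqlite3_exec(db, "COMMIT", NULL, NULL, NULL);
                sqlite3_exec(db, "BEGIN", NULL, NULL, NULL);
                last_commit = time(NULL);
            }
        }
        sqlite3_exec(db, "COMMIT", NULL, NULL, NULL);
        sqlite3_finalize(ins);
    }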

Maybe the reason MySQL performs better is simply that it uses 2 files instead 
of 1.

It does make sense: if the index/offsets are all together on disk, then the OS 
can very efficiently cache this information for us.
Was splitting the file into 2 ever tried for sqlite?
I mean, there is already the -journal file, so why not have a -index file as 
well?

> > To accommodate a lot of users, I have one main DB that holds all users ->
> > unique ID The other dbs are a partitioned db really, so that later
> > threads conflict only when accessing users within the same range and the
> > dbs don't get out of proportion (because blobs are used I thought a lot
> > of records could become a performance bottleneck). If I precache the big
> > table name -> uid, the import of 650,000 records takes about 2 minutes.
> >If I don't precache it, it takes about 8 minutes. I start/stop
> >transactions every X seconds (right now, I have it set at 10 seconds).
> 
> Are these name->uid lookups used in a read-only fashion after import? If
> so, then multiple threads can indeed read the table in parallel, and you
> should have good performance.

A lot of the time this table is accessed read-only, yes.
I tried to have several threads access the table at the same time before, and 
the performance was a bit lower (maybe because transactions occur at the same 
time and block the readers?). But the reason I rolled back this change is that 
the busy handler, which does a simple "usleep", causes starvation and thus 
potentially high latencies.
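
For the record, my handler is essentially a fixed usleep() that returns 1. 
Something like the backoff-with-jitter sketch below (arbitrary constants, not 
what I currently run) should at least stop all blocked threads from waking up 
in lockstep, though it still doesn't guarantee fairness:

    #include <sqlite3.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Busy handler: exponential backoff with jitter. 'count' is how many
     * times sqlite has invoked us for this lock; returning 0 gives up
     * (the caller sees SQLITE_BUSY), nonzero asks sqlite to retry. */
    static int backoff_handler(void *unused, int count)
    {
        (void)unused;
        if (count > 20)
            return 0;                        /* ~12s total: give up */
        int delay = 1000 << (count < 10 ? count : 10);  /* 1ms .. ~1s */
        usleep(delay + rand() % 1000);       /* jitter breaks lockstep */
        return 1;
    }

registered with sqlite3_busy_handler(db, backoff_handler, NULL); the built-in 
sqlite3_busy_timeout(db, ms) is the simpler alternative.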

> 
> For now, I'd say:
> - If the import/update case is predominant, look at another DB. I'd
>   suggest PostgreSQL, as its MVCC lockless row protocol allows concurrent
>   updates to disjoint sets of rows in the same table.
> - If read-only access is the predominant use, SQLite may well fit the
>   bill, and take the performance hit when importing data.

What has been confusing me since the beginning is that sqlite performs well for 
databases that are rarely updated. I think the performance page should 
underline this fact.

Also, I think the performance page should be updated to reflect the fact that 
sqlite might not be the solution for bigger databases (and maybe add a section 
with tests that last more than 5 seconds :)).
I'll see if I can take the test scripts from the performance page and change 
the parameters to run a 10-minute test or so (clearing caches, and probably 
reformatting the partition used for the data, between tests, so that all 
engines are tested equally).
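
For the cache-clearing part, a reasonably recent Linux kernel can drop the 
page cache between runs without reformatting. A sketch (Linux-specific, needs 
root):

    #include <stdio.h>
    #include <unistd.h>

    /* Flush dirty pages, then ask Linux (2.6.16+) to drop the page
     * cache and slab objects, so every engine starts cold. Needs root. */
    static int drop_caches(void)
    {
        sync();  /* write back dirty pages first */
        FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
        if (!f)
            return -1;
        fputs("3\n", f);  /* 3 = page cache + dentries/inodes */
        return fclose(f) == 0 ? 0 : -1;
    }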

I am a bit disappointed that I can't use sqlite at this point: I really liked 
the fact that there is no setup for the end user, and I thought that because I 
was not doing complex sql queries, I could get decent performance (like I said 
earlier, low latency is really what I need here).
Of course, if anyone comes up with an idea, I am more than willing to try it 
and help improve this great library.

 Nicolas



