[sqlite] Disk caching impacts performance.

2007-11-08 Thread Julien Renggli
Hello,

 

I'm currently working on a project which is very likely to use SQLite as
its data storage. Since performance is important for us, I have already
found on the SQLite website many ways to optimise the code (always
working in transactions where possible, using a page size of 4096 since
it's running on the Windows platform, using integer primary keys, ...).
But I have one problem that I "solved" in an unorthodox way; it works,
but maybe you have a better solution than mine?

 

I've been doing some tests with a ~100 MB database, in which I have three
tables: one for the structured data (Objects, 2000 entries), one for the
blobs we have to store (ObjectBlobs: ID INTEGER PRIMARY KEY, Data BLOB)
(8000 entries), and one which binds the structured data to the blobs
(8000 entries). As you can imagine, each Object has 4 blobs linked to
it; the blobs can be quite large (let's say up to 1 MB).

My (C++) application just has to read the table "Objects", and one of
the Blobs for each Object.

 

Now the first time I run my application, it takes quite a long time
(30s) to load the blobs. But if I re-run the app, it only takes 1s to
load them. It's clearly a disk caching issue: if I copy huge files to
the disk between two runs, it again takes 30s to load the blobs (i.e.
the DB is no longer in the disk cache). Profiling the application
indicates sqlite::winRead() is the bottleneck.

 

I then had the following idea: SQLite is probably reading the file
randomly, depending on where the data lies. If I can force the DB to be
cached, everything should be fine. So before connecting to the database,
I first read it sequentially (using a C++ ifstream) until the end of
file. This perfectly solves the problem, even though I still notice a
difference (3s to read the file on the 1st run, 0.2s later). But 3s is
OK where 30s was worrying me.
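The sequential pre-read trick can be sketched roughly like this (chunk size and error convention are my own choices, not part of the original code):

```cpp
#include <fstream>
#include <string>
#include <vector>

// Read the whole file sequentially so the OS pulls it into the disk
// cache before SQLite starts issuing random-access reads.
// Returns the number of bytes read, or -1 if the file could not be opened.
long long prewarm_file(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in)
        return -1;

    std::vector<char> buffer(1 << 20);  // 1 MB chunks
    long long total = 0;
    while (in.read(buffer.data(), buffer.size()) || in.gcount() > 0)
        total += in.gcount();  // gcount() covers the final partial chunk
    return total;
}
```

The data is thrown away; the only purpose is the side effect of populating the OS file cache.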

 

I hope I've explained the situation clearly enough, and now ask you: is
this the only way to do it? I find the trick a bit nasty and don't like
it; maybe I missed something? Before you ask: I tried to VACUUM the DB,
but it only reduced the timing to 16s, which was still too slow for our
requirements.

Tests with a larger DB (it can get much bigger than my example) and on
different machines tend to confirm my theory.

 

Thanks in advance (and a big thank you for SQLite, which is really nice
and easy to use!),

 

Julien Renggli

 

P.S.: Some technical information:

- SQLite v3.3.16 (we will upgrade to the latest version later), C API

- Windows XP SP2

- Timings on Pentium 4 3.4GHz, 2GB RAM

 



RE: [sqlite] Disk caching impacts performance.

2007-11-08 Thread Julien Renggli
Thanks for your quick answer.

RaghavendraK 70574 wrote:
> Try using mmap (I assume your DB is read-only); this is much faster and
> better than an ifstream read. Also, an ifstream read can keep the data
> in cache only as long as no other serious I/O occurs.

Our DB is read-write, though in my example read-only is a valid option.
I just saw that the sqlite3_open_v2 function takes a READONLY flag as an
argument, so I'll try to use it for my example and see if it makes any
difference compared to READWRITE mode...

> You need to accept it, as we work with virtual memory, or
> write your own file system which is memory-based and short-circuits OS
> calls. SQLite 3.5.x has good support for such extensions.

As I said, the "nasty" trick works and we'll live with it (writing our
own file system is not an option). We would just like to understand
better what's happening, to know why and when the "first run" is so much
slower. Should we read the whole file beforehand, or only parts of it?
Does it depend on how much RAM is installed, ...? If you have any hints,
they are welcome. I guess I should ask NTFS experts as well.

> Regards,
> Ragha

Regards,

Julien




[sqlite] Re: Disk caching impacts performance.

2007-11-09 Thread Julien Renggli
Mark Spiegel wrote:

> [...]

 

Thanks for the explanation. Yes, I'll have to make sure not to use this
technique when the file is too large. But I think there is definitely an
improvement for us in pulling data into the cache whenever possible.

 

> Given that you can't write your own VFS, there is not much to suggest,
> but one question to ask. Is the sum of the time for the pre-read you
> perform and the subsequent database operation(s) smaller than doing the
> database operation(s) without the pre-read? I see that in the 3.5.x
> source Dr. Hipp gives the file system the proper random access hint
> when opening databases. This is just a hint to the cache manager and it
> is not obligated to honor it, but it will effectively shut down most
> read ahead and large block reads, which is what you are getting when
> you sequentially pre-read.

 

Here are a few timings I have, to give you an idea:

 

Initialisation: connects to the DB, pre-compiles some queries, loads the
structured data (SELECT * FROM Objects; i.e. sequentially, no blobs)

Load Blobs: loads 1/4 of all Blobs (each ~23k in size, I just checked),
"randomly"

 

- Uncached:
  - 14s to initialise
  - 31s to load blobs

- Uncached, but VACUUMed:
  - 3s to initialise
  - 16s to load blobs

- Pre-cached, no VACUUM:
  - 3s to cache and initialise
  - 1s to load blobs

 

Notes:

- VACUUM took (as expected) a very long time to do its task (a few
minutes).

- This test is for a ~100MB file. From other tests I made, it looks like
the timings depend linearly on the size of the file and the number of
blobs loaded.

 

Pre-caching is clearly a winner here.

 

> One more thing, did raising the limit on the number of pages SQLite
> can cache internally have any effect?

 

I just tried (I hadn't noticed that option before) going from 2000 to
4000 and then to 8000 pages, without noticing any difference. I might
try next week to raise the page size to 50k and see if it makes a
difference.