Sleepycat Software writes:
> All you're saving is the disk seek, and that's not worth saving.
> In Berkeley DB, the cache absolutely has to work. If you have
> to do I/O, you've already lost the performance contest, doing
> an additional seek doesn't really matter.
Ok.
> This will be complex code to write, and it only helps when the
> application or system crashes. How often does that happen,
> anyway? If you're crashing more than once a month, you've got
> hardware problems. It's not worth optimizing for cases that
> only happen rarely.
Ok.
> It seems to me that you can do this reasonably easily by putting
> all the work into the mpool disk read/write functions. Lock
> the buffer down and then compress/write it, or read/decompress
> it. The in-memory buffers stay the way they are.
>
> The tricky part (well, the only tricky part I've thought of so
> far) is that blocks are normally addressed by page number, which
> means that the first block in the set of compressed blocks that
> make up the on-disk group is expected to be at the offset
> (page-number * real-block-size).
>
> Continuing to do that would result in holes in the on-disk file,
> which isn't what you want. There's going to have to be some
> additional mapping layer that makes all this work, I think, and
> I haven't thought of any clean way to do that.
It's much simpler than that: an 8k page in memory is compressed to a
4k page on disk. To find a page, seek to page-number * 4k. No holes.
The idea is to do the following when writing an 8k page (a C sketch
follows below):
Write operation
. compress the page
. if compressed size <= 4k
    write the compressed data at page-number * 4k
. if compressed size > 4k
    allocate a new overflow page
    store a reference to the new page
    write the first uncompressed 4k at page-number * 4k
    write the last uncompressed 4k at new-page-number * 4k
And the read operation is similar.
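
Here is a minimal C sketch of that write path, just to make it
concrete. The page sizes, the overflow page allocator
(alloc_overflow_page) and the way the overflow reference is recorded
(set_overflow_ref) are made-up placeholders, not Berkeley DB
internals:

#include <zlib.h>
#include <unistd.h>

#define MEM_PAGE  8192                  /* in-memory page size */
#define DISK_PAGE 4096                  /* on-disk page size   */

/* Hypothetical helpers, not Berkeley DB functions. */
extern unsigned long alloc_overflow_page(void);
extern void set_overflow_ref(unsigned long pgno, unsigned long ov_pgno);

int write_page(int fd, unsigned long pgno, const unsigned char *page)
{
    unsigned char buf[MEM_PAGE + MEM_PAGE / 1000 + 64]; /* zlib bound */
    uLongf clen = sizeof(buf);

    if (compress2(buf, &clen, page, MEM_PAGE, Z_DEFAULT_COMPRESSION) != Z_OK)
        return -1;

    if (clen <= DISK_PAGE) {
        /* Common case: the 8k page fits in a single 4k disk slot. */
        if (pwrite(fd, buf, clen, (off_t)pgno * DISK_PAGE) != (ssize_t)clen)
            return -1;
        return 0;
    }

    /* Exception: store the page uncompressed, split over two 4k slots. */
    unsigned long ov = alloc_overflow_page();
    set_overflow_ref(pgno, ov);
    if (pwrite(fd, page, DISK_PAGE, (off_t)pgno * DISK_PAGE) != DISK_PAGE ||
        pwrite(fd, page + DISK_PAGE, DISK_PAGE,
               (off_t)ov * DISK_PAGE) != DISK_PAGE)
        return -1;
    return 0;
}

The read path is the mirror image: look up the overflow reference
first; if there is none, read 4k at page-number * 4k and uncompress,
otherwise read the two uncompressed 4k halves back to back.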
The tricky part is the exception case. I don't yet have a clear idea
of its exact implications.
I did some tests yesterday. I took a 900Mb db file containing a btree
whose entries look like what we are going to use. I took each 4k page
and compressed it individually using zlib. The result is that only
171 pages out of 230 000 compressed to a size greater than 2k. That
is less than 0.1%.
In order to make my test valid I used a btree built from randomly
inserted keys (the one that only compresses to 55% with gzip) instead
of the result of a db_dump|db_load (which compresses to 80% with gzip).
I am therefore in the worst case, and the number of blocks that do
not compress enough and lead to the exception is small. The exception
handling can therefore afford to be complex and time consuming. I'll
try an implementation today, and try to find a way not to steal bits
from the page structure: it would be annoying to store a page
reference in every page when only 0.1% of them will use it.
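
One possible way to avoid stealing bits from the page structure, and
this is only an idea to explore, not something Berkeley DB offers, is
to keep the pgno -> overflow-pgno mapping out of band, e.g. in a
small sorted table kept in memory and saved to a reserved area of the
file. Since fewer than 0.1% of the pages need it, the table stays
tiny. A rough sketch of the lookup:

#include <stdlib.h>

/* Hypothetical out-of-band overflow map entry. */
struct ov_entry {
    unsigned long pgno;      /* page that did not compress to 4k */
    unsigned long ov_pgno;   /* where its second 4k half lives   */
};

static int ov_cmp(const void *a, const void *b)
{
    const struct ov_entry *x = a, *y = b;
    return (x->pgno > y->pgno) - (x->pgno < y->pgno);
}

/* Returns the overflow page number, or 0 if the page is compressed. */
unsigned long lookup_overflow(const struct ov_entry *map, size_t n,
                              unsigned long pgno)
{
    struct ov_entry key = { pgno, 0 };
    const struct ov_entry *e = bsearch(&key, map, n, sizeof(*map), ov_cmp);
    return e ? e->ov_pgno : 0;
}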
Regarding performance, the compression time when compressing
individual pages as if they were individual files is the same as
compressing the file as a whole using gzip. The time needed to
compress the pages is equal to the time needed to build the db file
from scratch when keys are sorted. This is far from negligible but
still acceptable in my opinion.
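
For reference, the per-page test amounts to something like the
following (a rough sketch, not the exact program I ran; the 4k page
size and 2k threshold are just the parameters of my test):

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

#define PAGE      4096
#define THRESHOLD 2048

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s db-file\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "rb");
    if (f == NULL) {
        perror(argv[1]);
        return 1;
    }

    unsigned char page[PAGE];
    unsigned char buf[PAGE + PAGE / 1000 + 64];   /* zlib bound */
    unsigned long total = 0, too_big = 0;

    /* Compress every 4k page individually and count the ones that
     * do not fit in 2k. */
    while (fread(page, 1, PAGE, f) == PAGE) {
        uLongf clen = sizeof(buf);
        total++;
        if (compress2(buf, &clen, page, PAGE, Z_DEFAULT_COMPRESSION) != Z_OK)
            continue;
        if (clen > THRESHOLD)
            too_big++;
    }
    fclose(f);
    printf("%lu pages, %lu compressed to more than %d bytes\n",
           total, too_big, THRESHOLD);
    return 0;
}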
> All that said, this change puts a stronger burden on the cache,
> I think, as a miss in the cache may translate into multiple disk
> seek/read combinations, and a flush from the cache may translate
> into multiple seek/write combinations. That's probably OK, but
> worth mentioning.
I agree, but we know that fewer than 0.1% of the cases will lead
to this 2 x seek/write/read.
> BTW -- I noticed we're cc'ing a mailing list -- is this just
> for archival purposes, or are we boring the hell out of lots of
> people? :-)
It's also for archival, but mainly for Geoff. Besides, the list is
dedicated to htdig development and we're discussing something that is
essential for coding htdig. I've had no complaints yet :-)
--
Loic Dachary
ECILA
100 av. du Gal Leclerc
93500 Pantin - France
Tel: 33 1 56 96 09 80, Fax: 33 1 56 96 09 61
e-mail: [EMAIL PROTECTED] URL: http://www.senga.org/