>> I'm not sure why you want pages ordered the same way keys are?
>> I don't see why that's an advantage.
>
> Maybe it's a narrow, htdig-oriented view, but I think it will be useful
> to others. A DB_NEXT on a cursor goes a lot faster when pages are ordered
> the same way keys are, because moving from one page to the next needs
> less disk access. It's related to the idea that internal pages should be
> as close as possible to the pages they reference (locality of reference),
> but that last one is much more complex and I can't see that you win much.
All you're saving is the disk seek, and that's not worth saving.
In Berkeley DB, the cache absolutely has to work. If you have
to do I/O, you've already lost the performance contest; doing
an additional seek doesn't really matter.
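For reference, the access pattern in question is roughly the following
sequential scan -- DB_NEXT hands back items in key order, so key-ordered
pages only pay off when a scan like this misses the cache. (A sketch only;
open/create calls are elided, and cursor signatures vary a little between
DB releases.)

    #include <string.h>
    #include <db.h>

    /*
     * Sketch of a sequential scan.  DB_NEXT returns items in key
     * order, so when the leaf pages are also laid out in key order
     * on disk, a cache miss on the "next" page is a short read
     * rather than a long seek.
     */
    void
    scan(DB *dbp)
    {
        DBC *dbc;
        DBT key, data;

        memset(&key, 0, sizeof(key));
        memset(&data, 0, sizeof(data));

        if (dbp->cursor(dbp, NULL, &dbc, 0) != 0)
            return;
        while (dbc->c_get(dbc, &key, &data, DB_NEXT) == 0)
            ;                       /* process the key/data pair */
        (void)dbc->c_close(dbc);
    }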
> I have another idea in mind, though. If we are able to reorganize db pages
> transparently, we could do something that would greatly speed up the
> bootstrapping and use of large databases. Cached pages could migrate to
> the beginning (or the end) of the file according to how frequently they
> are used. This would naturally cluster frequently used pages in the same
> portion of the disk and therefore reduce disk-head latency when the cache
> misses. When restarting applications that have a common usage pattern,
> they would be able to repopulate their cache faster because the frequently
> used pages are close to each other.
This will be complex code to write, and it only helps when the
application or system crashes. How often does that happen,
anyway? If you're crashing more than once a month, you've got
hardware problems. It's not worth optimizing for cases that
only happen rarely.
> We are using static tables. The question is, as always, how to avoid
> db_dump + db_load. Calculating the 'new' table on the fly is stupid,
> though. It should be calculated after all pages that have 'old'
> compression are upgraded to 'current' compression. I don't understand
> what you mean by 'stole bits'?
The on-disk pages have to be marked as to what compression they're
using. That requires bits in the on-disk format, which takes away
from bits I can use to represent data.
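Roughly speaking (and this is purely illustrative, not the actual page
layout), it means reserving a couple of flag bits in every page header
for a compression-scheme id:

    #include <stdint.h>

    /* Illustrative only -- not the actual Berkeley DB page header. */
    #define P_COMP_MASK  0x03       /* 2 flag bits: compression scheme */
    #define P_COMP_NONE  0x00       /* page is not compressed */
    #define P_COMP_OLD   0x01       /* compressed with the 'old' table */
    #define P_COMP_CUR   0x02       /* compressed with the 'current' table */

    struct page_hdr {
        uint32_t pgno;              /* logical page number */
        uint8_t  type;              /* page type */
        uint8_t  flags;             /* low two bits stolen for P_COMP_* */
        uint16_t entries;           /* number of items on the page */
    };

Those two bits are then unavailable for anything else the format might
need -- that's what I meant by stealing them.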
> You seem skeptical about that. I hope this won't be a dead end.
I honestly don't know.
It seems to me that you can do this reasonably easily by putting
all the work into the mpool disk read/write functions. Lock
the buffer down and then compress/write it, or read/decompress
it. The in-memory buffers stay the way they are.
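Something along these lines, say with zlib -- the function names are made
up, not Berkeley DB interfaces, but they show the shape of it: the
in-memory buffer keeps its normal page size, and compression happens only
at the disk boundary.

    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <zlib.h>

    /* Hypothetical names -- not Berkeley DB functions. */

    /* Flush path: compress the locked buffer, write the smaller image. */
    int
    page_out(int fd, off_t disk_off, const void *page, size_t pagesize)
    {
        uLongf clen = compressBound(pagesize);
        Bytef *cbuf = malloc(clen);
        int ret = -1;

        if (cbuf != NULL &&
            compress2(cbuf, &clen, page, pagesize,
                Z_DEFAULT_COMPRESSION) == Z_OK &&
            pwrite(fd, cbuf, clen, disk_off) == (ssize_t)clen)
            ret = 0;
        free(cbuf);
        return (ret);
    }

    /* Fault path: read the compressed image, decompress into the buffer. */
    int
    page_in(int fd, off_t disk_off, size_t disk_len,
        void *page, size_t pagesize)
    {
        uLongf plen = pagesize;
        Bytef *cbuf = malloc(disk_len);
        int ret = -1;

        if (cbuf != NULL &&
            pread(fd, cbuf, disk_len, disk_off) == (ssize_t)disk_len &&
            uncompress(page, &plen, cbuf, disk_len) == Z_OK &&
            plen == pagesize)
            ret = 0;
        free(cbuf);
        return (ret);
    }

Note that the disk_off and disk_len arguments quietly assume we already
know where each page's compressed image lives and how big it is, which
brings me to the next problem.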
The tricky part (well, the only tricky part I've thought of so
far) is that blocks are normally addressed by page number, which
means that the first block in the set of compressed blocks that
make up the on-disk group is expected to be at the offset
(page-number * real-block-size).
Continuing to do that would result in holes in the on-disk file,
which isn't what you want. There's going to have to be some
additional mapping layer that makes all this work, I think, and
I haven't thought of any clean way to do that.
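Whatever that mapping layer looks like, its job is something like this
(again, hypothetical): translate a logical page number into the offset
and length of its compressed on-disk image.

    #include <stdint.h>
    #include <sys/types.h>

    /* Hypothetical mapping layer -- not part of Berkeley DB.  Logical
     * page numbers keep their meaning inside the cache; the map says
     * where the compressed image of each page actually lives on disk. */
    struct pg_map_entry {
        off_t    disk_off;          /* byte offset of compressed image */
        uint32_t disk_len;          /* compressed length in bytes */
    };

    struct pg_map {
        struct pg_map_entry *slots; /* indexed by logical page number */
        uint32_t             npages;
    };

    /* Translate a logical page number; returns 0 on success. */
    static int
    pg_map_lookup(const struct pg_map *map, uint32_t pgno,
        off_t *offp, uint32_t *lenp)
    {
        if (pgno >= map->npages)
            return (-1);
        *offp = map->slots[pgno].disk_off;
        *lenp = map->slots[pgno].disk_len;
        return (0);
    }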
All that said, this change puts a stronger burden on the cache,
I think, as a miss in the cache may translate into multiple disk
seek/read combinations, and a flush from the cache may translate
into multiple seek/write combinations. That's probably OK, but
worth mentioning.
BTW -- I noticed we're cc'ing a mailing list -- is this just
for archival purposes, or are we boring the hell out of lots of
people? :-)
Regards,
--keith
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Keith Bostic
Sleepycat Software Inc. [EMAIL PROTECTED]
394 E. Riding Dr. +1-978-287-4781
Carlisle, MA 01741-1601 http://www.sleepycat.com