Since I've been working on the optimization patch, I've been watching the disk cache's behavior for a while, and I noticed a few relatively obvious things. If the objects you put on disk are all the same size, the recycle bin is very effective. Once you reach the max keys, the disk size stays constant, since every item written to disk can reuse an existing spot.
Overall, the pre-optimized disk size stays much smaller when the objects are all the same size than when the object sizes vary. The more variation, the greater the chance that a recycle bin spot will be a bit larger than you need, and that extra space is wasted. Do this enough and you end up with a lot of free but unusable gaps between items. Since many real world cases involve items of varying sizes, the disk will eventually require optimization.

There may be a way to make optimization unnecessary. Rather than using exactly the amount of space an item needs, we could use fixed block sizes. If an item fits in a block, great; at worst we waste some space in the block, as we almost always do with the current recycle bin. If an item doesn't fit in one block, it gets divided across several. The disadvantage of this model is that a divided item is slower to write and slower to read, so the block size should be large enough to minimize splits. If the size is too large, however, you waste too much space.

The big advantage is that you never need to optimize. The recycle bin can be much larger and very simple: just a list of positions. No size data is needed, since all the blocks are the same size. We'd never need to optimize, since all the spaces would eventually be reclaimed if the disk cache ever reached the same size again. A position simply becomes an int; we don't need an offset, merely the slot number.

The key map would be smaller and easier to write to disk. The first 4 bytes of the data file would indicate the size of all the blocks, so we could determine the block size on startup. We could persist the key map in a similar fashion. In memory we would keep the key, an array of slot numbers for the data in the data file, and an array of slot numbers for the key in the key file. The key file would hold the key and the block number.
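As a rough sketch of the allocation side of this (the class and method names here are mine, not existing JCS code): since every block is the same size, the recycle bin reduces to a queue of free slot numbers, and placing an item is just ceiling division plus slot reuse.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a fixed-block allocator for the proposed model.
// The block size is constant, so the recycle bin is just a queue of free
// slot numbers -- no per-entry size bookkeeping is needed.
public class BlockAllocator {
    private final int blockSize;
    private int nextSlot = 0;                       // next never-used slot
    private final Deque<Integer> freeSlots = new ArrayDeque<>();

    public BlockAllocator(int blockSize) {
        this.blockSize = blockSize;
    }

    // Number of blocks an item of the given byte length occupies.
    public int blocksNeeded(int itemLength) {
        return (itemLength + blockSize - 1) / blockSize;  // ceiling division
    }

    // Allocate slots for an item, reusing recycled slots first.
    public int[] allocate(int itemLength) {
        int n = blocksNeeded(itemLength);
        int[] slots = new int[n];
        for (int i = 0; i < n; i++) {
            slots[i] = freeSlots.isEmpty() ? nextSlot++ : freeSlots.pop();
        }
        return slots;
    }

    // On removal, every slot goes straight back into the recycle bin.
    public void free(int[] slots) {
        for (int s : slots) {
            freeSlots.push(s);
        }
    }

    // Byte offset of a slot in the data file: just slot * blockSize.
    public long offsetOf(int slot) {
        return (long) slot * blockSize;
    }
}
```

With 4 KB blocks, a 10 KB item takes three slots; freeing it and inserting again reuses those same slots, so the file never grows past its high-water mark and no optimization pass is needed.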
If we could require that every key fit into one block, then we could just use the first block number in the data file as the key block number. Either way, we could bake in key persistence.

The advantages make this worth trying. I'm considering building another indexed disk cache that works on the block model. By default we could use the size of the first element inserted as the block size, and we could also make the block size configurable.

Aaron
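To make the 4-byte header idea concrete, here is a small sketch (again, hypothetical names, not existing code) of encoding the block size as the file header and computing a slot's byte offset relative to it, so a restarted cache can recover the block size before touching any blocks.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: the first 4 bytes of the data file record the block
// size, so the cache can determine it on startup before reading any blocks.
public class BlockFileHeader {
    public static final int HEADER_BYTES = 4;

    // Encode the block size as the 4-byte file header.
    public static byte[] encode(int blockSize) {
        return ByteBuffer.allocate(HEADER_BYTES).putInt(blockSize).array();
    }

    // Decode the block size from the first 4 bytes read at startup.
    public static int decode(byte[] header) {
        return ByteBuffer.wrap(header).getInt();
    }

    // Slot n begins right after the header: 4 + n * blockSize.
    public static long offsetOf(int slot, int blockSize) {
        return HEADER_BYTES + (long) slot * blockSize;
    }
}
```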
