Since I've been working on the optimization patch, I've been watching the disk cache's behavior for a while, and I noticed a few relatively obvious things. If the objects you put on disk are all the same size, the recycle bin is very effective. Once you reach the max keys, the disk size stays constant, since every item written to disk can reuse an existing spot.
Overall, the pre-optimized disk size stays much smaller when the objects are all the same size than when the object sizes vary. The more variation, the greater the chance that a recycle bin spot will be a bit larger than you need, and that extra space is wasted. Do this enough and you end up with a lot of free but unusable gaps between items. Since many real world cases involve items of varying sizes, the disk will eventually require optimization.

There may be a way to make optimization unnecessary. Rather than using exactly the amount of space an item needs, we could use fixed block sizes. If an item fits in a block, great; at worst we waste some space in the block, as we almost always do with the current recycle bin. If an item doesn't fit in one block, it gets divided across several. The disadvantage of this model is that a divided item is slower to write and slower to read, so the block size should be large enough to minimize splits. If the size is too large, however, you waste too much space.

The big advantage is that you never need to optimize. The recycle bin can be much larger and very simple: just a list of positions. No size data is needed, since all the blocks are the same size. We'd never need to optimize, since all the spaces would eventually be reclaimed if the disk cache ever reached the same size again. A position simply becomes an int; we don't need an offset, merely the slot number.

The key map would be smaller and easier to write to disk. The first 4 bytes of the data file would indicate the size of all the blocks, so we could determine the block size on startup. We could persist the key map in a similar fashion. In memory we would keep the key, an array of slot numbers for the data in the data file, and an array of slot numbers for the key in the key file. The key file would hold the key and the block number.
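As a rough sketch of the allocation side of this (the class and method names here are mine, not existing JCS code): since every block is the same size, the recycle bin reduces to a queue of free slot numbers, and placing an item is just ceiling division plus slot reuse.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a fixed-block allocator for the proposed model.
// The block size is constant, so the recycle bin is just a queue of free
// slot numbers -- no per-entry size bookkeeping is needed.
public class BlockAllocator {
    private final int blockSize;
    private int nextSlot = 0;                       // next never-used slot
    private final Deque<Integer> freeSlots = new ArrayDeque<>();

    public BlockAllocator(int blockSize) {
        this.blockSize = blockSize;
    }

    // Number of blocks an item of the given byte length occupies.
    public int blocksNeeded(int itemLength) {
        return (itemLength + blockSize - 1) / blockSize;  // ceiling division
    }

    // Allocate slots for an item, reusing recycled slots first.
    public int[] allocate(int itemLength) {
        int n = blocksNeeded(itemLength);
        int[] slots = new int[n];
        for (int i = 0; i < n; i++) {
            slots[i] = freeSlots.isEmpty() ? nextSlot++ : freeSlots.pop();
        }
        return slots;
    }

    // On removal, every slot goes straight back into the recycle bin.
    public void free(int[] slots) {
        for (int s : slots) {
            freeSlots.push(s);
        }
    }

    // Byte offset of a slot in the data file: just slot * blockSize.
    public long offsetOf(int slot) {
        return (long) slot * blockSize;
    }
}
```

With 4 KB blocks, a 10 KB item takes three slots; freeing it and inserting again reuses those same slots, so the file never grows past its high-water mark and no optimization pass is needed.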
If we could require that every key fit into one block, then we could just use the first block number in the data file as the key block number. Either way, we could bake in key persistence.

The advantages make this worth trying. I'm considering building another indexed disk cache that works on the block model. By default we could use the size of the first element inserted as the block size, and we could also make the block size configurable.

Aaron
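To make the 4-byte header idea concrete, here is a small sketch (again, hypothetical names, not existing code) of encoding the block size as the file header and computing a slot's byte offset relative to it, so a restarted cache can recover the block size before touching any blocks.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: the first 4 bytes of the data file record the block
// size, so the cache can determine it on startup before reading any blocks.
public class BlockFileHeader {
    public static final int HEADER_BYTES = 4;

    // Encode the block size as the 4-byte file header.
    public static byte[] encode(int blockSize) {
        return ByteBuffer.allocate(HEADER_BYTES).putInt(blockSize).array();
    }

    // Decode the block size from the first 4 bytes read at startup.
    public static int decode(byte[] header) {
        return ByteBuffer.wrap(header).getInt();
    }

    // Slot n begins right after the header: 4 + n * blockSize.
    public static long offsetOf(int slot, int blockSize) {
        return HEADER_BYTES + (long) slot * blockSize;
    }
}
```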
