On Friday 12 June 2009 21:07:32 Evan Daniel wrote:
> Currently, the Bloom filters use 46 bits per key (23 buckets, 2-bit
> counter per bucket) of RAM, using 16 hash functions.  This gives a
> false positive rate of 1.5E-5.
> 
> Oddly enough, a simple hash table has lower memory usage.  Create an
> array in memory of n-bit values, one value per slot in the salted-hash
> store.  The value stored in the array is an n-bit hash of the key in
> the corresponding store location.  On a given lookup, the odds of a
> false positive are 2^-n.  Because the store does quadratic probing
> with 4 slots, there are 4 lookups per key request.  The false positive
> rate is then 2^-(n-2).  For n=18, we get a false positive rate of
> 1.5E-5.  n=16 would align on byte boundaries and save a tiny bit of
> memory at a cost of a false positive rate of 6.1E-5.
> 
> The reason this works better than the Bloom filter is because a given
> key can only go in a limited set of locations in the salted hash
> store.  The Bloom filter works equally well whether or not that
> restriction is present.  With this method, as the number of possible
> locations grows, so does the false positive rate.  For low
> associativity, the simple hash table wins.
> 
> This also makes updates and removal trivial.  However, we probably
> can't share it with our neighbors without giving away our salt value.
> For that, we probably want to continue planning to use Bloom filters.
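[Editorial sketch, for concreteness: a minimal Java version of the per-slot hash scheme described above. The class name, hash choice, salting and probe sequence are illustrative assumptions, not the actual salted-hash store code.]

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/**
 * Per-slot n-bit hash filter: one entry per slot of a salted-hash store
 * that does quadratic probing over 4 candidate slots.
 */
public class SlotHashFilter {
    private static final int PROBES = 4;  // candidate slots checked per key request
    private final int[] slotHash;         // n-bit hash of the key held in each slot, 0 = empty
    private final int nBits;              // e.g. 18 -> per-request FP rate ~2^-(18-2) = 1.5E-5
    private final byte[] salt;            // secret; revealing it lets peers force collisions

    public SlotHashFilter(int slots, int nBits, byte[] salt) {
        this.slotHash = new int[slots];   // a packed bit array would be used in practice
        this.nBits = nBits;
        this.salt = salt;
    }

    /** Salted digest of the key, truncated to n bits and kept non-zero. */
    private int hashKey(byte[] key) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(salt);
            byte[] d = md.digest(key);
            int h = ((d[0] & 0xff) << 16) | ((d[1] & 0xff) << 8) | (d[2] & 0xff);
            h &= (1 << nBits) - 1;
            return h == 0 ? 1 : h;        // reserve 0 to mean "empty slot"
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e);
        }
    }

    /** i'th candidate slot for this key (illustrative quadratic probe). */
    private int probeSlot(byte[] key, int i) {
        int base = Math.floorMod(Arrays.hashCode(key) ^ Arrays.hashCode(salt), slotHash.length);
        return (base + i * i) % slotHash.length;
    }

    public void onStore(byte[] key, int slot) { slotHash[slot] = hashKey(key); } // trivial update
    public void onRemove(int slot)            { slotHash[slot] = 0; }            // trivial removal

    /** False positives: ~2^-n per probe, so roughly 4 * 2^-n = 2^-(n-2) per request. */
    public boolean mightBePresent(byte[] key) {
        int h = hashKey(key);
        for (int i = 0; i < PROBES; i++) {
            if (slotHash[probeSlot(key, i)] == h) return true;
        }
        return false;
    }
}

The "trivial updates and removals" point falls out directly: onStore/onRemove just overwrite or zero one array entry, with no counters to maintain.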

Hmmm. Yes, it is important that we not give away our salt value, because with 
it a nearby peer can force collisions to remove specific keys.

And we can't generate the Bloom filter from the above structure, because 18 
bits presumably isn't enough input data.

Anyway, in context:

The current proposal is to split each datastore (or perhaps each cache:store 
pair) by hashed keyspace into manageably sized chunks, each of which has its 
own Bloom filter. One issue with this: the chunks will naturally vary somewhat in 
the number of keys assigned to each. How much bigger does a Bloom filter for 
1/Nth of the keyspace need to be than 1/Nth of the size of a filter for the 
whole keyspace, given that the whole-keyspace filter covers a bounded number of 
keys while an individual chunk's key count can overshoot its share? I guess we 
could keep a list of keys for each filter on disk, and thus be able to resize 
the keyspace range of each chunk later?
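[Editorial back-of-envelope check, as a Java sketch. The total key count, the chunk count, and the assumption that keys spread roughly binomially across chunks are illustrative guesses, not figures from this thread; it just plugs the usual Bloom filter estimate (1 - e^(-k*n/m))^k into per-chunk numbers to see how much the false positive rate drifts when a chunk overshoots its 1/Nth share.]

/** Sizing check for per-chunk Bloom filters.
 *  Assumed: 1,000,000 keys total, 64 chunks, roughly binomial spread,
 *  23 counters (46 bits of RAM) per nominal key, k = 16 hash functions. */
public class ChunkFilterSizing {

    // Standard Bloom filter false-positive estimate: (1 - e^(-k*n/m))^k
    static double falsePositiveRate(long mCounters, long nKeys, int k) {
        return Math.pow(1.0 - Math.exp(-(double) k * nKeys / mCounters), k);
    }

    public static void main(String[] args) {
        long totalKeys = 1_000_000;            // illustrative store size
        int chunks = 64;                       // N chunks by hashed keyspace
        int k = 16;                            // hash functions, as in the current filters
        long nominal = totalKeys / chunks;     // 1/Nth share of the keys
        long counters = nominal * 23;          // 23 counters per nominal key, as now

        long sigma = Math.round(Math.sqrt((double) nominal)); // ~sqrt(K/N) binomial spread
        for (long nKeys : new long[]{nominal, nominal + sigma, nominal + 3 * sigma}) {
            System.out.printf("keys=%6d  fp=%.2e%n",
                    nKeys, falsePositiveRate(counters, nKeys, k));
        }
    }
}

If the spread really is close to binomial, the overshoot at this chunk size is only a few percent of the mean, so the per-chunk filters may not need much headroom; keeping the on-disk key list would still let us rebuild a filter if a chunk drifts further than expected.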

On new nodes, we can use this both for sharing and for checking the store. On 
old nodes, we want our shareable Bloom filters not to contain keys added before 
the (related) caching changes (new keys will be flagged in the store entry 
flags), so we will need a separate filter covering all keys to serve the function 
of the current filters.

IMHO memory is not the main concern here, because we will be using a large 
amount of memory for other people's Bloom filters.

In the long term we want to be using only the split filters. In the short term 
we want to avoid additional code complexity, so it makes sense to keep the 
existing filters. No?
