Suppose we have a 20GB datastore, divided into 32kB blocks. Thus, we have 655,360 keys. So for a bloom filter, n = 655360.
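As a quick sanity check on the figures below, here's a small standalone Java sketch (nothing from the Freenet codebase, just arithmetic; the class name is made up) using the standard approximations optimal k = (m/n) ln 2, rounded to an integer, and p(false positive) = (1 - e^(-kn/m))^k:

// Bloom filter sizing for n = 655,360 keys (20GB store / 32kB blocks).
// Assumes the standard formulas: optimal k = (m/n) * ln 2, rounded to the
// nearest integer, and p(false positive) = (1 - e^(-k*n/m))^k for an m-bit filter.
public class BloomSizing {
    public static void main(String[] args) {
        long n = 655360L;
        System.out.println("m (bits)  size    k  p(false positive)");
        for (int exp = 20; exp <= 23; exp++) {
            long m = 1L << exp; // filter size in bits
            int k = (int) Math.round((double) m / n * Math.log(2));
            double p = Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
            System.out.printf("2^%d      %4dkB  %d  %.2f%%%n", exp, m / 8 / 1024, k, p * 100);
        }
    }
}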
With optimal k, we get p(false positive) ≈ 0.62^(m/n). Now, for plausible values of m:

m (bits)   size      p(false positive)   optimal k
2^20       128kB     47%                 1
2^21       256kB     22%                 2
2^22       512kB     5%                  4
2^23       1024kB    0.2%                9

So, if we are simply trying to avoid superfluous datastore lookups (which are expensive, as they may involve disk I/O), we might be able to make some profitable use of Bloom filters with a relatively large bit array. However, since keys are also removed from the store, we would have to use a counting Bloom filter, which would be several times larger.

One application I had considered, based on Oskar's suggestions, was to track the content on our immediate peers, their immediate peers, and so on, using Bloom filters. One problem with this is that nodes go up and down, so we would most likely have to keep a separate Bloom filter for every node within our horizon. Also, if we are going to forward a request on the strength of a filter hit, even if only as a last resort (to compensate for movement of data when locations have been swapped, for example), we need a reasonable confidence level. Let's say we set it at 1MB Bloom filters.

We are then limited by both RAM (or disk I/O) and the traffic required to keep everyone's filters up to date (even if via periodic updates). Let's just consider the latter: k = 9, so on average a single inserted key will change 9 bits. To describe this change, unless we update extremely infrequently, we will need around 23 bits per changed bit (to index a position in a 2^23-bit filter), i.e. 23*9 = 207 bits, or about 26 bytes.

Let our horizon be x nodes, and let's assume that a fraction p of requests, which occur every 33kB of data bandwidth used, result in a new key being stored on a typical node. Then the traffic required to keep everything up to date is, for a single update on a single node, 26 bytes * x nodes. If our node A is tracking only node B, then B will send 26 bytes on every stored key; we are tracking x nodes, so we receive 26 bytes * x nodes for every key stored globally. The overall bandwidth used scales quadratically with x, but the bandwidth used by any given node scales linearly with it. So our overall bandwidth usage is 26 * x * p bytes for every 33kB of data transferred by an average node (assuming that we bundle updates sensibly). Each node will therefore send and receive 26 * x * p bytes for every 33kB of data it sends.

I suspect that p will be quite high, but that's just my suspicion; let's assume it's 50%. The overhead incurred by tracking what our nearby nodes have in their stores is therefore 26*x*p / (33*1024). This is really rather low: for 5% overhead, we can track 130 nodes; for 1% overhead, we can still track 26 nodes. (In practice memory concerns mean we probably wouldn't want to...) We need to keep memory usage within reason; we could use 512kB filters instead of 1MB ones to reduce it a bit. Obviously the real overhead may be a bit higher than predicted here, hence the need for it to be very low.

It is not hard to make this reasonably secure; we can detect when nodes are consistently lying. It does mean that a certain amount of the nearby network topology must be exposed, so that we can separate nodes one hop away from those two hops away, and so on.

There is plainly a limit to the extent to which this can scale; we still need small-world darknet routing. There's a strong argument that we should only try this after we have got a DNF through regular routing, possibly only from the node closest to the key, and certainly we should send the data to all nodes on the natural routing chain if we do find it. But it might help to bridge the gap caused by data being left behind when a location swap occurs.
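For what it's worth, here is the same kind of standalone sketch for the update-traffic arithmetic (again, only the assumptions already made above: 2^23-bit filters, k = 9, one request per 33kB, p = 0.5; the class name is made up):

// Rough check of the filter-update overhead figures above. Assumptions are
// taken straight from the text: 2^23-bit (1MB) filters, k = 9 bits changed
// per inserted key, one request per 33kB of data transferred, p = 0.5.
public class UpdateOverhead {
    public static void main(String[] args) {
        int bitsPerIndex = 23;          // bits to address one position in a 2^23-bit filter
        int k = 9;                      // bits changed per inserted key
        int deltaBytes = (int) Math.ceil(bitsPerIndex * k / 8.0); // 207 bits, ~26 bytes per key
        double p = 0.5;                 // fraction of requests that store a new key
        double bytesPerRequest = 33 * 1024;

        System.out.println("Delta per stored key: " + deltaBytes + " bytes");
        // overhead(x) = deltaBytes * x * p / bytesPerRequest, so
        // x = overhead * bytesPerRequest / (deltaBytes * p)
        for (double overhead : new double[] {0.05, 0.01}) {
            double x = overhead * bytesPerRequest / (deltaBytes * p);
            System.out.printf("%.0f%% overhead -> roughly %.0f nodes tracked%n", overhead * 100, x);
        }
    }
}

This only checks the arithmetic, of course; it says nothing about churn or protocol overhead.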
Not perhaps something for 0.7.0, but perhaps something to consider for later versions. Of course, to be sure that it works, we would need a testnet much bigger than the update horizon... There are obviously issues with larger stores...

--
Matthew J Toseland - toad at amphibian.dyndns.org
Freenet Project Official Codemonkey - http://freenetproject.org/
ICTHUS - Nothing is impossible. Our Boss says so.
