Of course, one disadvantage is that we lose any "was it there before you asked?" plausible deniability for the datastore. But that argument is pretty dodgy anyway, IMHO. The real solution to that problem is to decouple your datastore from your browsing habits, via premix routing etc.
On Wed, Jan 11, 2006 at 03:59:24AM +0000, Matthew Toseland wrote:
> Suppose we have a 20GB datastore, divided into 32kB blocks. Thus, we
> have 655,360 keys. So for a Bloom filter, n = 655360.
>
> With optimal k, we get p(false positive) = 0.62^(m/n).
>
> Now, for plausible values of m:
>
> m (bits)   (bytes)   p(false positive)   k
> 2^20       128kB     47%                 1
> 2^21       256kB     22%                 2
> 2^22       512kB     5%                  4
> 2^23       1024kB    0.2%                9
>
> So, if we are simply trying to avoid superfluous datastore lookups
> (which are expensive, as they may involve disk I/O), we might be able
> to make some profitable use of Bloom filters with a relatively large
> bit array; however, we would have to use a counting Bloom filter,
> which would be several times larger.
>
> One application I had considered, based on Oskar's suggestions, was
> that we track the content on our immediate peers, their immediate
> peers, etc., using Bloom filters. One problem with this is that nodes
> go up and down, so we would most likely have to keep a Bloom filter
> for every node within our horizon. Also, if we are going to forward a
> request, even if only as a last resort (to compensate for movement of
> data when locations have been swapped, for example), we need a
> reasonable confidence level.
>
> Let's say we set it at 1MB Bloom filters. We are then limited by both
> RAM (or disk I/O) and the traffic required to keep everyone's filters
> up to date (even if via periodic updates).
>
> Let's just consider the latter: k = 9, so on average a single
> inserted key will change 9 bits. To describe this change, unless we
> update extremely infrequently, we will need around 23*9 = 207 bits =
> 26 bytes.
>
> Let our horizon be x nodes.
>
> Let's assume that a fraction p of requests (which occur every 33kB of
> data bandwidth used) result in new keys being stored on a typical
> node.
>
> Then the traffic required to keep everything up to date is: for a
> single update on a single node, 26 bytes * x nodes.
>
> If our node A is tracking only node B, then B will send 26 bytes on
> every stored key; we are tracking x nodes, so we receive 26 bytes * x
> nodes for every key stored globally. The overall bandwidth used
> scales quadratically with x, but the bandwidth used by any given node
> scales linearly with it.
>
> So our overall bandwidth usage is 26 * x * p bytes for every 33kB of
> data transferred by an average node (assuming that we bundle the
> updates sensibly).
>
> Each node will therefore send and receive 26 * x * p bytes for every
> 33kB of data it sends. I suspect that p will be quite high, but
> that's just my suspicion; let's assume it's 50%. The overhead
> incurred by tracking what our nearby nodes have in their stores is
> therefore (26 * x * p) / (33 * 1024).
>
> This is really rather low. For 5% overhead, we can track 130 nodes.
> For 1% overhead, we can still track 26 nodes. (In practice, memory
> concerns mean we probably wouldn't want to...) We need to keep memory
> usage within reason; we can use 512kB filters instead of 1MB ones to
> reduce it a bit. Obviously the overhead may be a bit higher than is
> predicted here, hence the need for it to be very low. It is not hard
> to make this reasonably secure; we can detect when nodes are
> consistently lying. It does mean that a certain amount of the nearby
> network topology must be exposed, so that we can separate
> one-hop-away from two-hops-away, etc.
>
> There is plainly a limit to the extent to which this can scale. We
> still need small-world darknet routing.
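As a quick sanity check on the figures above, here is a throwaway Java
sketch. The class name and structure are made up for illustration (this
is not code from Fred); it just recomputes the false-positive table and
the (26 * x * p) / (33 * 1024) overhead estimate:

    // Illustrative only: recompute the Bloom filter table and the
    // update-overhead estimate from the figures quoted above.
    public class BloomFilterSizing {
        public static void main(String[] args) {
            long n = 655360; // 20GB store / 32kB blocks = number of keys

            // With optimal k ~= (m/n) * ln 2, p(false positive) ~= 0.62^(m/n).
            for (int exp = 20; exp <= 23; exp++) {
                long mBits = 1L << exp;
                double bitsPerKey = (double) mBits / n;
                double p = Math.pow(0.62, bitsPerKey);
                long k = Math.round(bitsPerKey * Math.log(2));
                System.out.printf("m=2^%d (%4dkB)  p=%5.2f%%  k=%d%n",
                        exp, mBits / 8 / 1024, p * 100, k);
            }

            // Overhead of keeping x peers' filters fresh: ~26 bytes per stored
            // key, one request per ~33kB of payload, half of requests storing
            // a new key (p = 0.5).
            double storeFraction = 0.5;
            for (int x : new int[] { 26, 130 }) {
                double overhead = 26.0 * x * storeFraction / (33 * 1024);
                System.out.printf("x=%3d peers -> %.1f%% overhead%n",
                        x, overhead * 100);
            }
        }
    }

This reproduces the table above (to within rounding) and the roughly 1% /
5% overhead figures for 26 / 130 tracked nodes.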
> There's a strong argument that we should only try this after we have
> got a DNF through regular routing, possibly only from the node
> closest to the key, and certainly we should send the data to all
> nodes on the natural routing chain if we do find it. But it might
> help to bridge the gap caused by data being left behind when a
> location swap occurs.
>
> Not perhaps something for 0.7.0, but perhaps something to consider
> for later versions. Of course, to be sure that it works, we would
> need a testnet much bigger than the update horizon... There are
> obviously issues with larger stores...
> --
> Matthew J Toseland - toad at amphibian.dyndns.org
> Freenet Project Official Codemonkey - http://freenetproject.org/
> ICTHUS - Nothing is impossible. Our Boss says so.
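To make the last-resort fallback more concrete, here is how I read the
policy, as a rough Java sketch: consult the Bloom filter hints only after
regular routing has returned a DNF (data not found), preferably only when
we are the closest node to the key that we know of, and if the fallback
does turn up the data, push it back along the natural routing chain.
Every type and method name below is invented for illustration; this is
not a proposal for actual Fred code.

    import java.util.List;

    // Purely illustrative sketch of the fallback policy; none of these
    // types exist in Fred.
    interface Key {}
    interface Block {}
    interface PeerNode {}

    abstract class LastResortFallback {
        // Imaginary hooks into the rest of the node.
        abstract Block routeNormally(Key key, List<PeerNode> chain);
        abstract boolean weAreClosestTo(Key key);
        abstract List<PeerNode> peersWhoseFilterMatches(Key key);
        abstract Block fetchDirectly(PeerNode peer, Key key);
        abstract void offer(PeerNode peer, Key key, Block data);

        Block handleRequest(Key key, List<PeerNode> chain) {
            Block data = routeNormally(key, chain); // normal darknet routing first
            if (data != null)
                return data;

            // Regular routing gave a DNF. Only now consult the Bloom filter
            // hints, and only if we are the closest node to the key we know of.
            if (!weAreClosestTo(key))
                return null;

            for (PeerNode peer : peersWhoseFilterMatches(key)) {
                data = fetchDirectly(peer, key);    // may be a false positive
                if (data == null)
                    continue;
                // Heal the gap left by location swaps: offer the data to
                // every node on the natural routing chain.
                for (PeerNode hop : chain)
                    offer(hop, key, data);
                return data;
            }
            return null;                            // still a DNF
        }
    }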
