Suppose we have a 20GB datastore, divided into 32kB blocks. Thus, we have 655,360 keys. So for a bloom filter, n = 655360.
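As a quick sanity check on the figures below, here's a small standalone Java sketch (nothing from the Freenet codebase, just arithmetic; the class name is made up) using the standard approximations optimal k = (m/n) ln 2, rounded to an integer, and p(false positive) = (1 - e^(-kn/m))^k:

// Bloom filter sizing for n = 655,360 keys (20GB store / 32kB blocks).
// Assumes the standard formulas: optimal k = (m/n) * ln 2, rounded to the
// nearest integer, and p(false positive) = (1 - e^(-k*n/m))^k for an m-bit filter.
public class BloomSizing {
    public static void main(String[] args) {
        long n = 655360L;
        System.out.println("m (bits)  size    k  p(false positive)");
        for (int exp = 20; exp <= 23; exp++) {
            long m = 1L << exp; // filter size in bits
            int k = (int) Math.round((double) m / n * Math.log(2));
            double p = Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
            System.out.printf("2^%d      %4dkB  %d  %.2f%%%n", exp, m / 8 / 1024, k, p * 100);
        }
    }
}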
With optimal k, we get p(false positive) ≈ 0.62^(m/n). Now, for plausible values of m:

m (bits)   size      p(false positive)   optimal k
2^20       128kB     47%                 1
2^21       256kB     22%                 2
2^22       512kB     5%                  4
2^23       1024kB    0.2%                9

So, if we are simply trying to avoid superfluous datastore lookups (which are expensive, as they may involve disk I/O), we might be able to make some profitable use of Bloom filters with a relatively large bit array. However, since keys are also removed from the store, we would have to use a counting Bloom filter, which would be several times larger.

One application I had considered, based on Oskar's suggestions, was to track the content on our immediate peers, their immediate peers, and so on, using Bloom filters. One problem with this is that nodes go up and down, so we would most likely have to keep a separate Bloom filter for every node within our horizon. Also, if we are going to forward a request on the strength of a filter hit, even if only as a last resort (to compensate for movement of data when locations have been swapped, for example), we need a reasonable confidence level. Let's say we set it at 1MB Bloom filters.

We are then limited by both RAM (or disk I/O) and the traffic required to keep everyone's filters up to date (even if via periodic updates). Let's just consider the latter: k = 9, so on average a single inserted key will change 9 bits. To describe this change, unless we update extremely infrequently, we will need around 23 bits per changed bit (to index a position in a 2^23-bit filter), i.e. 23*9 = 207 bits, or about 26 bytes.

Let our horizon be x nodes, and let's assume that a fraction p of requests, which occur every 33kB of data bandwidth used, result in a new key being stored on a typical node. Then the traffic required to keep everything up to date is, for a single update on a single node, 26 bytes * x nodes. If our node A is tracking only node B, then B will send 26 bytes on every stored key; we are tracking x nodes, so we receive 26 bytes * x nodes for every key stored globally. The overall bandwidth used scales quadratically with x, but the bandwidth used by any given node scales linearly with it. So our overall bandwidth usage is 26 * x * p bytes for every 33kB of data transferred by an average node (assuming that we bundle updates sensibly). Each node will therefore send and receive 26 * x * p bytes for every 33kB of data it sends.

I suspect that p will be quite high, but that's just my suspicion; let's assume it's 50%. The overhead incurred by tracking what our nearby nodes have in their stores is therefore 26*x*p / (33*1024). This is really rather low: for 5% overhead, we can track 130 nodes; for 1% overhead, we can still track 26 nodes. (In practice memory concerns mean we probably wouldn't want to...) We need to keep memory usage within reason; we could use 512kB filters instead of 1MB ones to reduce it a bit. Obviously the real overhead may be a bit higher than predicted here, hence the need for it to be very low.

It is not hard to make this reasonably secure; we can detect when nodes are consistently lying. It does mean that a certain amount of the nearby network topology must be exposed, so that we can separate nodes one hop away from those two hops away, and so on.

There is plainly a limit to the extent to which this can scale; we still need small-world darknet routing. There's a strong argument that we should only try this after we have got a DNF through regular routing, possibly only from the node closest to the key, and certainly we should send the data to all nodes on the natural routing chain if we do find it. But it might help to bridge the gap caused by data being left behind when a location swap occurs.
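For what it's worth, here is the same kind of standalone sketch for the update-traffic arithmetic (again, only the assumptions already made above: 2^23-bit filters, k = 9, one request per 33kB, p = 0.5; the class name is made up):

// Rough check of the filter-update overhead figures above. Assumptions are
// taken straight from the text: 2^23-bit (1MB) filters, k = 9 bits changed
// per inserted key, one request per 33kB of data transferred, p = 0.5.
public class UpdateOverhead {
    public static void main(String[] args) {
        int bitsPerIndex = 23;          // bits to address one position in a 2^23-bit filter
        int k = 9;                      // bits changed per inserted key
        int deltaBytes = (int) Math.ceil(bitsPerIndex * k / 8.0); // 207 bits, ~26 bytes per key
        double p = 0.5;                 // fraction of requests that store a new key
        double bytesPerRequest = 33 * 1024;

        System.out.println("Delta per stored key: " + deltaBytes + " bytes");
        // overhead(x) = deltaBytes * x * p / bytesPerRequest, so
        // x = overhead * bytesPerRequest / (deltaBytes * p)
        for (double overhead : new double[] {0.05, 0.01}) {
            double x = overhead * bytesPerRequest / (deltaBytes * p);
            System.out.printf("%.0f%% overhead -> roughly %.0f nodes tracked%n", overhead * 100, x);
        }
    }
}

This only checks the arithmetic, of course; it says nothing about churn or protocol overhead.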
Not perhaps something for 0.7.0, but perhaps something to consider for later versions. Of course, to be sure that it works, we would need a testnet much bigger than the update horizon... There are obviously issues with larger stores...

--
Matthew J Toseland - toad at amphibian.dyndns.org
Freenet Project Official Codemonkey - http://freenetproject.org/
ICTHUS - Nothing is impossible. Our Boss says so.
