On Fri, May 01, 2009 at 03:58:27PM -0400, Evan Daniel wrote:
> On Fri, May 1, 2009 at 3:37 PM, Robert Hailey <robert at freenetproject.org> wrote:
> >
> > On May 1, 2009, at 9:35 AM, Matthew Toseland wrote:
> >
> > IMPLEMENTING IT:
> > Main tasks:
> > - Converting our datastore indexes into a compact, slightly more lossy,
> >   1-bit filter. (Easy)
> > - Creating a snapshot once an hour, and keeping it for some period (1 week?). (Easy)
> > - Creating an efficient binary diff format for hour-to-hour diffs, and
> >   keeping them. (Moderate)
> >
> > This might actually be quite hard as an efficient bloom filter will scatter
> > (even a few) updates all over the filter.
>
> Actually, it's not hard at all.
>
> The naive version is to observe that a tiny fraction of the bits in
> the filter changed, and just record the location of each bit that
> changed. At 0.24 writes per second, you'll get on average 0.24*3600*2
> keys changed per hour (each write represents 1 add and 1 delete), or
> 0.24*3600*2*16 counters changed per hour. Unless I'm mistaken,
> changing a counter in the counting bloom filter has ~ 50% probability
> of changing the bit in the compressed non-counting version. So that
> means 0.24*3600*2*16*0.5=13824 bits changed per hour.
>
> The address of a bit can be represented in 32 bits trivially (for
> bloom filters < 512MiB in size), so the 1-hour diff should consume
> 13824*32/8=55296 bytes. That represents 15.36 bytes/s of traffic for
> each peer, or 307.2B/s across 20 peers.
>
> That encoding isn't terribly efficient. More efficient is to sort the
> addresses and compute the deltas. (So if bits 19, 7, and 34 changed,
> I send the numbers 7, 12, 15.) Those deltas should follow a geometric
> distribution with mean (number of bits changed) / (size of filter).
> It's easy to build an arithmetic coding for that data that will
> achieve near-perfect compression (see
> http://en.wikipedia.org/wiki/Arithmetic_coding for example). My BOTE
> estimate using toad's 84MiB filter is that it would compress to 14.5 bits per
> address (instead of the 30 or 32 you'd get with no compression; gzip
> or lzw should be somewhere in between).
>
> Evan
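For concreteness, here is a rough Java sketch of that delta scheme (class and method names are mine, not anything in the tree): it XORs two hourly snapshots of the 1-bit filter and Golomb-Rice codes the gaps between the changed bit positions. Rice coding is a simpler stand-in for the geometric-distribution arithmetic coder Evan describes; it should land within a bit or two per address of his 14.5-bit estimate while being trivial to implement.

import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/**
 * Sketch: encode the positions of bits that changed between two hourly
 * snapshots of the 1-bit filter as sorted gaps, Rice-coded.
 * For toad's 84MiB filter and ~13824 changed bits per hour the mean gap
 * is roughly 51000, so a Rice parameter around 15 is reasonable.
 */
public class FilterDiffEncoder {

    /** Bit positions that differ between the old and new filter, sorted. */
    static int[] changedBits(byte[] oldFilter, byte[] newFilter) {
        int[] tmp = new int[1024];
        int n = 0;
        for (int i = 0; i < oldFilter.length; i++) {
            int diff = (oldFilter[i] ^ newFilter[i]) & 0xFF;
            while (diff != 0) {
                int bit = Integer.numberOfTrailingZeros(diff);
                if (n == tmp.length) tmp = Arrays.copyOf(tmp, n * 2);
                tmp[n++] = i * 8 + bit;
                diff &= diff - 1; // clear lowest set bit
            }
        }
        return Arrays.copyOf(tmp, n); // sorted by construction
    }

    /** Rice-code the gaps between successive changed positions. */
    static byte[] encode(int[] positions, int riceK) {
        BitWriter out = new BitWriter();
        int prev = -1;
        for (int pos : positions) {
            int gap = pos - prev - 1;
            prev = pos;
            // unary quotient, then riceK-bit remainder
            for (int q = gap >>> riceK; q > 0; q--) out.writeBit(1);
            out.writeBit(0);
            for (int b = riceK - 1; b >= 0; b--) out.writeBit((gap >>> b) & 1);
        }
        return out.toByteArray();
    }

    /** Minimal MSB-first bit writer; the last byte is zero-padded. */
    static class BitWriter {
        private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        private int cur, used;
        void writeBit(int bit) {
            cur = (cur << 1) | (bit & 1);
            if (++used == 8) { bytes.write(cur); cur = 0; used = 0; }
        }
        byte[] toByteArray() {
            if (used > 0) bytes.write(cur << (8 - used));
            return bytes.toByteArray();
        }
    }
}

The receiver decodes the same way and XORs the positions back into its copy of the filter; the diff message would also need to carry the number of addresses (or a terminator), since the last byte is padded.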
CHUNK SIZE PROBLEM:
===================

Current plans call for splitting the keyspace up for each datastore, and assigning keys to a manageably sized bloom filter for each section of the (hashed) keyspace. These can then be transferred separately, are useful immediately after being transferred, can be juggled in memory, etc. However, we cannot guarantee that the population of such a chunk will be small enough for the filter to be effective. Solving this requires either moving the boundaries dynamically (possibly continually), or making the chunks significantly larger than they would need to be in an ideal world.

FALSE POSITIVES ISSUES WITH SECOND PHASE:
=========================================

23 bits per key gives a false positives probability of 0.0000159 (0.00159%) per lookup.

In the second phase, we will want to check all of our queued keys against an hourly Bloom filter update. Probably we would do this hourly plus on connect. We simply keep a file containing a list of all the keys we care about, and read it linearly, comparing with the Bloom filters as we go (thanks evanbd); a rough sketch of this loop appears after the next section. One problem with this is that the false positives rate applies to each lookup, and since we'd be looking up the entire queue, we could expect a lot of false positives. Only a small fraction of the filter has changed, so most false positives will have already been hit at some point... If we can be sure we have tried all opportunities already, we can safely ignore any hit that is in both the old and new filters; otherwise we will need to keep a list of false positives for the node. This is only consulted when we are about to send a request, so it is probably acceptable overhead. Or we can leave it to chance: 23 bits gives 532 false positives out of a 1TB queue (roughly 33.5 million 32KiB blocks).

HASHES VERSUS BLOOMS:
=====================

Curiously enough, we could cut the update traffic by a factor of 4 by sending 64-bit hashes of the new and removed keys, rather than the positions of the changed bits. Even compared to an arithmetic encoding of the positions this cuts it by a factor of 2: ((0.24*3600*2*64)/8)/3600 = 3.84 bytes/sec/peer for the same assumptions as the numbers above.

The size in memory, and the size of the initial transfer, are the interesting bits though. Clearly if the hash size is 64 bits, the original transfer would be larger than with bloom filters' 23 bits, even taking into account the chunk size problem. But the hash size only needs to be 16 bits (which gives us the same false positives as blooms of 23 bits) plus an upper bound on the log2 of the datastore size. So 31 bits per key for a 500GB datastore. Since the data is only updated hourly, we could simply keep a sorted array of the hashed keys in RAM... On the server side we can do something very similar, generating a sorted list and then recording new and removed keys separately and consolidating hourly.

In terms of security I doubt this is significantly different from Bloom filter sharing, especially if we have a separate (but public) salt for each node's key hashes. We will probably need that even with Bloom filters.
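To make the sorted-array idea concrete, a minimal Java sketch (all names invented; the FNV-style mix is just a placeholder for whatever salted hash we actually settle on): each peer's store is a sorted array of truncated salted key hashes, membership is a binary search, and the hourly update is two sorted lists of removed and added hashes merged in.

import java.util.Arrays;

/**
 * Sketch of the per-peer sorted-hash-array idea. Hypothetical names, not
 * from the codebase: each peer's datastore is represented locally as a
 * sorted array of truncated salted key hashes; the hourly update is just
 * two sorted lists (removed, added) merged into the array.
 */
public class PeerStoreHashes {

    private long[] hashes;       // sorted, one truncated hash per stored key
    private final long salt;     // per-node, public salt
    private final int hashBits;  // e.g. ~31-40 bits depending on store size

    PeerStoreHashes(long[] sortedHashes, long salt, int hashBits) {
        this.hashes = sortedHashes;
        this.salt = salt;
        this.hashBits = hashBits;
    }

    /** Salted hash of the routing key, truncated to hashBits bits (FNV-style placeholder). */
    long hashKey(byte[] routingKey) {
        long h = salt ^ 0xcbf29ce484222325L;
        for (byte b : routingKey) {
            h ^= (b & 0xFF);
            h *= 0x100000001b3L;
        }
        return h & ((1L << hashBits) - 1);
    }

    /** Might the peer have this key? False positives possible, no false negatives. */
    boolean mightHave(byte[] routingKey) {
        return Arrays.binarySearch(hashes, hashKey(routingKey)) >= 0;
    }

    /** Apply an hourly diff: sorted hashes of removed and newly stored keys. */
    void applyDiff(long[] removedSorted, long[] addedSorted) {
        // Merge the current array with the additions...
        long[] merged = new long[hashes.length + addedSorted.length];
        int i = 0, j = 0, k = 0;
        while (i < hashes.length || j < addedSorted.length) {
            if (j == addedSorted.length || (i < hashes.length && hashes[i] <= addedSorted[j]))
                merged[k++] = hashes[i++];
            else
                merged[k++] = addedSorted[j++];
        }
        // ...then drop one occurrence of each removed hash.
        long[] result = new long[merged.length];
        int r = 0, n = 0;
        for (long h : merged) {
            while (r < removedSorted.length && removedSorted[r] < h) r++; // removal not present, skip
            if (r < removedSorted.length && removedSorted[r] == h) { r++; continue; }
            result[n++] = h;
        }
        hashes = Arrays.copyOf(result, n);
    }
}

In reality the array would presumably be bit-packed rather than stored as longs, and chunked like the Bloom filters, but the lookup and merge logic stays this cheap.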
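And going back to the second-phase check above: a rough sketch of the linear pass over the queued-keys file (again, invented names; the false-positive bookkeeping is the point). Each peer exposes a mightHave() test, which could be the Bloom filter or the hash array above, plus the set of keys its filter has already lied to us about; anything that passes both tests becomes a candidate request.

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/**
 * Sketch of the second-phase check: stream the queued keys from disk,
 * test each against every peer's filter, skip keys the filter has
 * already produced a false positive for, and collect the rest.
 */
public class QueuedKeyChecker {

    static final int KEY_LENGTH = 32; // routing key size in bytes (assumed)

    /** A queued key that some peer's filter claims to have. */
    static class Candidate {
        final byte[] key;
        final PeerFilter peer;
        Candidate(byte[] key, PeerFilter peer) { this.key = key; this.peer = peer; }
    }

    interface PeerFilter {
        boolean mightHave(byte[] routingKey); // Bloom filter / hash array lookup
        Set<Long> knownFalsePositives();      // fingerprints this filter already lied about
    }

    /** Run after receiving the hourly filter updates (or on connect). */
    static List<Candidate> checkQueue(String queueFile, List<PeerFilter> peers) throws IOException {
        List<Candidate> candidates = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new FileInputStream(queueFile))) {
            byte[] key = new byte[KEY_LENGTH];
            while (true) {
                try { in.readFully(key); } catch (EOFException e) { break; }
                long fingerprint = fingerprint(key);
                for (PeerFilter peer : peers) {
                    if (!peer.mightHave(key)) continue;
                    if (peer.knownFalsePositives().contains(fingerprint)) continue; // already tried
                    candidates.add(new Candidate(key.clone(), peer));
                }
            }
        }
        return candidates;
    }

    /** Cheap 64-bit fingerprint, used only for the local false-positive list. */
    static long fingerprint(byte[] key) {
        long h = 0xcbf29ce484222325L; // FNV-1a style
        for (byte b : key) { h ^= (b & 0xFF); h *= 0x100000001b3L; }
        return h;
    }
}

When a candidate later comes back data-not-found, its fingerprint goes into that peer's knownFalsePositives() set, so the next hourly pass skips it; that is the "list of false positives for the node" mentioned above.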
LOCAL SECURITY ISSUES: AFTER-THE-FACT TRACING BY POWERFUL ATTACKERS
===================================================================

We have IMHO adequately addressed the network-level security issues, by not caching until HTL has been reduced a bit. A local security issue is that any bloom filter sharing or similar neighbours'-stores scheme will have to keep hourly diffs of its datastore for a reasonable period, probably a week.

Whether this is bloom filters or key hashes, it potentially leaves a trail for an attacker capable of compromising lots of nodes after a request has already occurred. He would need to be able to compromise the peers of each node along the chain to find where the previous hop was. This is a rare case of an attack that might actually be harder on opennet. The datastores of the nodes will yield similar data, although this might be obscured if the cache has turned over since the original post; the bloom data tells him *when* the keys were stored, which could be important information. Caching changes would help to a degree.

An attacker capable of such feats would in any case be able to trace big inserts or requests back to their originator in real time, provided he could identify them and compromise a large enough proportion of the nodes they pass through (e.g. unpatched Windows systems); that is true now, without bloom filters. He could even break cryptographic tunnels, provided he could either break every node on the chain or find another chain related to the same data and break every node on that chain... (It is likely to be necessary to have multiple tunnels for most requests in practice, for performance reasons.) He could also exploit ULPRs if he was quick enough. IMHO this is a fundamental issue with distributed systems: storing bloom updates may make his life easier if churn is low (= large caches, small bandwidth), but such a powerful attacker can probably do whatever he wants in any case...
