"Fredrik Lundh" <[EMAIL PROTECTED]> writes: > btw, since my brain is on vacation, could anyone who already has all > the necessary background information in their head (Paul?) tell me > if using a chopped-up MD5 or SHA-1 (or whatever) digest as the hash > functions for a bloom filter is a good idea...
Erm, my brain isn't on vacation, it's just plain missing ;-). However I'd say there's nothing wrong in principle with chopping up sha/md5 output if you want to get several hashes as you're doing, assuming the total hash length is enough. The advantage over doing separate hashes for each piece is optimization: you don't compute as many hashes. But there's more flexibility with something like (untested): def multi_hash(data, i, size): # find i'th slice of hash return sha.new('%x,%s'% (i, data)).digest()[:size] Also, the idea of a Bloom filter is to use less memory than a hash table, by not storing the keys. In your implementation you chop the sha/md5 output into 4-byte pieces, which means about 4 billion "slots" in each set, which might be too many for a typical computer. Also, you end up storing each of those 4-byte values as a Python string. So I think in practice you'd want smaller sets, and to represent them as bit vectors instead of as Python sets (untested): import array class bit_vector(object): def __init__(self, size_in_bits): nbytes = (size_in_bits + 7) // 8 self.vec = array.array('B', '\0' * nbytes) def __getattr__(self, i): # get i'th bit from vector a,b = divmod(i, 8) return bool(self.vec[a] & (1 << b)) def __setattr__(self, i, value): # set i'th bit in vector a,b = divmod(i, 8) v = int(bool(value)) # the actual bit to set, 1 or 0 self.vec[a] = (self.vec[a] & ~(1 << b)) | (v << b) If we use a bit vector representation we want the hash value to be an integer (untested): def bloom_hash(data, i, size_in_bits): # my favorite way to get an integer value from sha h = int(sha.new('%x,%s'% (i, data)).hexdigest(), 16) return h % size_in_bits We'd then implement the Bloom filter (untested): class bloom(object): def __init__(self, nlayers, layersize): self.nlayers, self.layersize = nlayers, layersize self.filter = [bit_vector(layersize) for i in xrange(nlayers)] def add(self, data): for i in xrange(self.nlayers): self.filter[i][bloom_hash(data, i, self.layersize)] = True def __contains__(self, data): return False not in \ (self.filter[i][bloom_hash(data, i, self.layersize)] \ for i in xrange(self.nlayers)) Finally, I think to use Bloom filters effectively you have to choose nlayers and layersize carefully, according to the number of keys you expect to see, the amount of memory you have available, and/or the probability of false matches that you're willing to accept. The standard references (whatever they are) about Bloom filters probably have some suitable formulas for computing these things. -- http://mail.python.org/mailman/listinfo/python-list