Re: [collections] Bloom filters

Alex Herbert Thu, 05 Mar 2020 18:15:01 -0800

The change to make the CountingBloomFilter an interface is in this PR [1].

This renames the existing CountingBloomFilter to MapCountingBloomFilter. I’ve 
left this alone and just done my work on new files:


- CountingBloomFilter: An interface to extend BloomFilter
- ArrayCountingBloomFilter: An implementation using a backing array

Have a look and see what you think. To implement the CountingBloomFilter you 
have to state the policy for handling any negative counts that may be generated 
by incorrect usage.

For this implementation the policy is that negative counts are left alone and 
the filter is invalid. In most realistic use cases the generation of negative 
counts should not happen. It means that you are using the filter incorrectly by 
removing something you never added. So I don’t think there is a need to police 
this event strictly with the associated overhead of doing so.

There was previous discussion about the limitations of using an array with 
regard to how many bits the filter can have. I’ve explained this limitation in 
the javadoc header for ArrayCountingBloomFilter. However it seems to apply to 
the entire API given that it is based around bit indexes as 32-bit integers. So 
the maximum size for any Bloom filter in this API is 2^31 - 1. I note that a 
BitSet only supports a size up to this level. You could store 64 times more 
than this using the long[] storage representation. But then you would have to 
address them using a long for the index. Thus this warning in the 
ArrayCountingBloomFilter should probably be taken out and put in the 
package-info somewhere.

The new renamed MapCountingBloomFilter could either be updated to implement the 
CountingBloomFilter interface or removed from the API.

Other changes to the existing API should follow after this (i.e. returning 
iterators/splitators of indexes, returning booleans after merge operations, 
etc.)

One detail is I have removed the toString() implementation which concatenated 
all indexes and their counts. This may be helpful for debugging but printing 
out all the set indexes in a big filter could overflow a StringBuilder buffer! 
I think that all the toString() overrides should be removed or changed to a 
simple summary such as the filter shape and the current cardinality.

If you want a set notation for the current state of the filter (e.g. like 
BitSet.toString()) then a helper class could be provided that takes the filter 
or counting filter and extracts the string representation:

BloomFilterConverter
static Appendable toAppendable(BloomFilter bf, Appendable out);

BloomFilter bf = … // set bits 1,2,3,10
BloomFilterConverter.toAppendable(bf, new StringBuilder()).toString()

"{1,2,5,10}”


[1] https://github.com/apache/commons-collections/pull/137
 <https://github.com/apache/commons-collections/pull/137>

Re: [collections] Bloom filters

Reply via email to