The change to make the CountingBloomFilter an interface is in this PR [1].
This renames the existing CountingBloomFilter to MapCountingBloomFilter. I’ve
left this alone and just done my work on new files:
- CountingBloomFilter: An interface to extend BloomFilter
- ArrayCountingBloomFilter: An implementation using a backing array
Have a look and see what you think. To implement the CountingBloomFilter you
have to state the policy for handling any negative counts that may be generated
by incorrect usage.
For this implementation the policy is that negative counts are left alone and
the filter is invalid. In most realistic use cases the generation of negative
counts should not happen. It means that you are using the filter incorrectly by
removing something you never added. So I don’t think there is a need to police
this event strictly with the associated overhead of doing so.
There was previous discussion about the limitations of using an array with
regard to how many bits the filter can have. I’ve explained this limitation in
the javadoc header for ArrayCountingBloomFilter. However it seems to apply to
the entire API given that it is based around bit indexes as 32-bit integers. So
the maximum size for any Bloom filter in this API is 2^31 - 1. I note that a
BitSet only supports a size up to this level. You could store 64 times more
than this using the long[] storage representation. But then you would have to
address them using a long for the index. Thus this warning in the
ArrayCountingBloomFilter should probably be taken out and put in the
package-info somewhere.
The new renamed MapCountingBloomFilter could either be updated to implement the
CountingBloomFilter interface or removed from the API.
Other changes to the existing API should follow after this (i.e. returning
iterators/splitators of indexes, returning booleans after merge operations,
etc.)
One detail is I have removed the toString() implementation which concatenated
all indexes and their counts. This may be helpful for debugging but printing
out all the set indexes in a big filter could overflow a StringBuilder buffer!
I think that all the toString() overrides should be removed or changed to a
simple summary such as the filter shape and the current cardinality.
If you want a set notation for the current state of the filter (e.g. like
BitSet.toString()) then a helper class could be provided that takes the filter
or counting filter and extracts the string representation:
BloomFilterConverter
static Appendable toAppendable(BloomFilter bf, Appendable out);
BloomFilter bf = … // set bits 1,2,3,10
BloomFilterConverter.toAppendable(bf, new StringBuilder()).toString()
"{1,2,5,10}”
[1] https://github.com/apache/commons-collections/pull/137
<https://github.com/apache/commons-collections/pull/137>