I've updated master with some of the fixes discussed.

I've been looking through the rest of the BloomFilter code and the CountingBloomFilter appears to be broken:

1. When another CountingBloomFilter is merged or removed, the counts of the other filter are ignored. This does not match the documented behaviour.

2. When a remove operation is performed and the count for an index decrements past zero, an exception is thrown. This is not documented, but it is enforced in the unit tests, and it is very unfriendly. Digging into this part of the code, there is duplicated logic for remove: the first part does not throw if the decrement passes zero, but then the same index is decremented again (with the same new value) and this time it throws for underflow. So perhaps at some point it did not throw, and now it does.

I commented the code where I think the problems are and added tests to master that fail with the current implementation.

This type of logic seemed ideally suited to a Bag, so I rewrote the class using Bag&lt;Integer&gt;. It passes the same tests but with two differences:

a) It does not throw when a remove would underflow. The count for the bit index simply becomes zero (and that item is removed from the Bag).

b) Bag has no support for overflow in the count of each item. It actually has a weird quirk: adding 2 to a count of Integer.MAX_VALUE overflows to negative, and subtracting 1 then resets the count to zero because it is still negative after the subtraction. In practice this is not an issue for Bag, since it is limited by Collection.size() returning an int for the total count of everything in the Bag; if you add that many items the collection will break elsewhere anyway.
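For illustration, the clamp-at-zero remove of (a) can be sketched with a plain HashMap standing in for the Bag (the class and method names below are hypothetical, not the Commons API):

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of Bag-like per-index counts with clamp-at-zero remove. */
class CountSketch {
    private final Map<Integer, Integer> counts = new HashMap<>();

    void add(int index, int n) {
        counts.merge(index, n, Integer::sum);
    }

    /** Decrement by n; if the count would underflow, drop the entry instead of throwing. */
    void remove(int index, int n) {
        counts.computeIfPresent(index, (k, c) -> {
            int next = c - n;
            return next > 0 ? Integer.valueOf(next) : null; // returning null removes the mapping
        });
    }

    int getCount(int index) {
        return counts.getOrDefault(index, 0);
    }
}
```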

I think change (a) is an improvement. But are there situations where you want to know that you cannot remove one filter exactly from another? If so, perhaps a removeExactly(...) method for that case. I'd prefer to just drop the throwing of exceptions; after all, andCardinality(BloomFilter) does not throw when one filter cannot be "subtracted" from the other.

Change (b) is faster but could lead to errors. So how likely is overflow in a counting Bloom filter?
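For reference, the wrap-around described in (b) in plain int arithmetic:

```java
public class OverflowDemo {
    public static void main(String[] args) {
        int count = Integer.MAX_VALUE;
        count += 2;       // wraps to -2147483647 (Integer.MIN_VALUE + 1)
        count -= 1;       // still negative
        // A guard such as "if (count <= 0) remove the entry" would now
        // silently reset this index to zero.
        System.out.println(count <= 0); // prints true
    }
}
```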

Both can be supported, but that would require a custom Bag implementation. This is fairly trivial, and it would actually speed up other parts of the filter by giving direct access to the Map underlying the Bag for faster merge and remove operations. The backing store would then change from the current Map&lt;Integer, Integer&gt; to a Map&lt;Integer, MutableInteger&gt; to store counts.
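A sketch of that idea, with a minimal hypothetical MutableInteger holder (commons-lang has MutableInt for this purpose; a plain field holder is shown here to stay self-contained), so each merge mutates the stored count in place rather than re-boxing and re-inserting:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: counts stored in a mutable holder for in-place updates. */
class MutableCounts {
    static final class MutableInteger {
        int value;
    }

    private final Map<Integer, MutableInteger> counts = new HashMap<>();

    /** One map lookup per index; the count is then mutated in place. */
    void add(int index, int n) {
        counts.computeIfAbsent(index, k -> new MutableInteger()).value += n;
    }

    int getCount(int index) {
        MutableInteger c = counts.get(index);
        return c == null ? 0 : c.value;
    }
}
```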

I can update the code to fix the implementation but will only do so after a decision on overflow is made. We could even have a custom implementation that stores counts as a long for very high counting filters. Is this likely to be used? Only in a situation where the exceptions from overflow are an issue.


Other questions about this:

- Why public Stream<Map.Entry<Integer, Integer>> getCounts()?

Perhaps this is better served raw as int arrays of length 2, which have a lower memory overhead:

public Stream<int[]> getCounts()

I've ruled out an accessor method as there is no equivalent of boolean getBit(int index) in BloomFilter, e.g.:

public int getCount(int index)
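For illustration, a Stream&lt;int[]&gt; view is straightforward to produce from the backing map (a sketch with hypothetical names, not the current API):

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

class CountsView {
    private final Map<Integer, Integer> counts = new TreeMap<>();

    void put(int index, int count) {
        counts.put(index, count);
    }

    /** Each element is {index, count}, avoiding a boxed Map.Entry per pair in the API. */
    Stream<int[]> getCounts() {
        return counts.entrySet().stream()
                     .map(e -> new int[] { e.getKey(), e.getValue() });
    }
}
```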

- Why a TreeMap? Is the order of the indices important?

Or do you have some benchmarks showing that a TreeMap handles lots of growth and shrinkage better than a HashMap? There are situations where each would be the better choice, so perhaps this is a case for a CountingBloomFilter with a choice of backing Map. That could be done using an abstract class with concrete implementations. This is one to leave until some benchmarks are in place in the main code.


On 18/02/2020 09:54, Claude Warren wrote:
On Tue, Feb 18, 2020 at 9:12 AM Alex Herbert <alex.d.herb...@gmail.com>
wrote:


My maths is rusty. If A=0xF000ABCD interpreted as unsigned and B=0xF000ABCD interpreted as signed, does (A mod N) = (B mod N) for all positive values of N? If so then you are correct: Signedness does not matter and can be removed. I thought that the statement was false.

Yes, you are correct. Addition and multiplication use the bits in the same
way. Modulus does not.
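To illustrate (my example, not from the original mails): the same 32 bits agree under addition and multiplication in the low bits, but give different results under modulus, depending on the interpretation:

```java
public class ModDemo {
    public static void main(String[] args) {
        int bits = 0xF000ABCD;
        long asSigned = bits;                           // sign-extended: -268391475
        long asUnsigned = Integer.toUnsignedLong(bits); // zero-extended: 4026575821

        System.out.println(Math.floorMod(asSigned, 7L));   // prints 5
        System.out.println(Math.floorMod(asUnsigned, 7L)); // prints 2
    }
}
```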

So are the Murmur3 hash functions actually unsigned? I thought the
original c code returned unsigned 64-bit values (uint64_t).

IIUC the purpose of the Signedness enum is to specify whether a negative value returned from the hash function should be treated as a numerical magnitude when used in downstream computations, since sign extension on negatives sets the leading bits to 1s. I.e. using the negative as if it were unsigned (with all bits random) is not valid. Perhaps you can give an example of how signed and unsigned would be handled differently.


The murmur hash is not a good example here as it returns the resulting
buffer as 2 signed longs. MD5 is probably a better example: the
messageDigest.digest() returns a byte[16] (128 bits), which the code then
interprets as 2 signed longs. A C implementation could interpret that as 2
unsigned longs. Additionally, there are hashes that return smaller buffers,
such as murmur2, which returns a 32-bit unsigned int in the C code. In Java
that could be represented as a long as per the Integer.toUnsignedLong()
method, or it could just be cast to a long and thus be signed.

The values returned are then passed to floorMod(value, numberOfBits) to
turn on the appropriate bit in the resulting bit vector. As noted above,
this will yield different values depending upon how the byte[] was
interpreted.

Claude

