I've updated master with some of the fixes discussed.

I've been looking through the rest of the BloomFilter code and the CountingBloomFilter appears to be broken:

1. When another CountingBloomFilter is merged or removed, the counts of the other filter are ignored. This does not match the documented behaviour.

2. When a remove operation is performed and the count for an index decrements past zero, an exception is thrown. This is not documented, but it is enforced in the unit tests, and it is very unfriendly. Digging into this part of the code, there is duplicated logic for remove: the first part does not throw if the decrement passes zero, but then the same index is decremented again (with the same new value) and this time it throws for underflow. So perhaps at some point it did not throw, and now it does.

I commented the code where I think the problems are and added tests to master that fail with the current implementation.

This type of logic seemed ideally suited to a Bag, so I rewrote the class using Bag&lt;Integer&gt;. It passes the same tests but with two differences:

a) It does not throw when a remove would underflow. The count for the bit index simply becomes zero (and that item is removed from the Bag).

b) Bag has no support for overflow in the count of each item. It actually has a weird quirk: adding 2 to a count of Integer.MAX_VALUE overflows to negative, and subtracting 1 then resets the count to zero because it is still negative after the subtraction. In practice this is not an issue for Bag, since it is limited by Collection.size() returning an int for the total count of everything in the Bag; if you add that many items the collection will break elsewhere anyway.
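For illustration, the clamp-at-zero remove of (a) can be sketched with a plain HashMap standing in for the Bag (the class and method names below are hypothetical, not the Commons API):

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of Bag-like per-index counts with clamp-at-zero remove. */
class CountSketch {
    private final Map<Integer, Integer> counts = new HashMap<>();

    void add(int index, int n) {
        counts.merge(index, n, Integer::sum);
    }

    /** Decrement by n; if the count would underflow, drop the entry instead of throwing. */
    void remove(int index, int n) {
        counts.computeIfPresent(index, (k, c) -> {
            int next = c - n;
            return next > 0 ? Integer.valueOf(next) : null; // returning null removes the mapping
        });
    }

    int getCount(int index) {
        return counts.getOrDefault(index, 0);
    }
}
```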

I think change (a) is an improvement. But are there situations where you want to know that you cannot remove one filter exactly from another? If so, perhaps a removeExactly(...) method for that case. I'd prefer to just drop the throwing of exceptions; after all, andCardinality(BloomFilter) does not throw when one filter cannot be "subtracted" from the other.

Change (b) is faster but could lead to errors. So how likely is overflow in a counting Bloom filter?
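For reference, the wrap-around described in (b) in plain int arithmetic:

```java
public class OverflowDemo {
    public static void main(String[] args) {
        int count = Integer.MAX_VALUE;
        count += 2;       // wraps to -2147483647 (Integer.MIN_VALUE + 1)
        count -= 1;       // still negative
        // A guard such as "if (count <= 0) remove the entry" would now
        // silently reset this index to zero.
        System.out.println(count <= 0); // prints true
    }
}
```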

Both can be supported, but that would require a custom Bag implementation. This is fairly trivial, and it would actually speed up other parts of the filter by giving direct access to the Map underlying the Bag for faster merge and remove operations. The backing store would then change from the current Map&lt;Integer, Integer&gt; to a Map&lt;Integer, MutableInteger&gt; to store counts.
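A sketch of that idea, with a minimal hypothetical MutableInteger holder (commons-lang has MutableInt for this purpose; a plain field holder is shown here to stay self-contained), so each merge mutates the stored count in place rather than re-boxing and re-inserting:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: counts stored in a mutable holder for in-place updates. */
class MutableCounts {
    static final class MutableInteger {
        int value;
    }

    private final Map<Integer, MutableInteger> counts = new HashMap<>();

    /** One map lookup per index; the count is then mutated in place. */
    void add(int index, int n) {
        counts.computeIfAbsent(index, k -> new MutableInteger()).value += n;
    }

    int getCount(int index) {
        MutableInteger c = counts.get(index);
        return c == null ? 0 : c.value;
    }
}
```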

I can update the code to fix the implementation but will only do so after a decision on overflow is made. We could even have a custom implementation that stores counts as a long for very high counting filters. Is this likely to be used? Only in a situation where the exceptions from overflow are an issue.


Other questions about this:

- Why public Stream<Map.Entry<Integer, Integer>> getCounts()?

Perhaps this is better served raw as int arrays of length 2, which have a lower memory overhead:

public Stream<int[]> getCounts()

I've ruled out an accessor method as there is no equivalent of boolean getBit(int index) in BloomFilter, e.g.:

public int getCount(int index)
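For illustration, a Stream&lt;int[]&gt; view is straightforward to produce from the backing map (a sketch with hypothetical names, not the current API):

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

class CountsView {
    private final Map<Integer, Integer> counts = new TreeMap<>();

    void put(int index, int count) {
        counts.put(index, count);
    }

    /** Each element is {index, count}, avoiding a boxed Map.Entry per pair in the API. */
    Stream<int[]> getCounts() {
        return counts.entrySet().stream()
                     .map(e -> new int[] { e.getKey(), e.getValue() });
    }
}
```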

- Why a TreeMap? Is the order of the indices important?

Or do you have some benchmarks showing that a TreeMap handles lots of growth and shrinkage better than a HashMap? There are situations where each would be the better choice, so perhaps this is a case for a CountingBloomFilter with a choice of backing Map. That could be done using an abstract class with concrete implementations. This is one to leave until some benchmarks are in place in the main code.


On 18/02/2020 09:54, Claude Warren wrote:
On Tue, Feb 18, 2020 at 9:12 AM Alex Herbert <alex.d.herb...@gmail.com>
wrote:


My maths is rusty. If A=0xF000ABCD interpreted as unsigned and B=0xF000ABCD interpreted as signed, does (A mod N) = (B mod N) for all positive values of N? If so then you are correct: Signedness does not matter and can be removed. I thought that the statement was false.

Yes, you are correct. Addition and multiplication use the bits in the same
way. Modulus does not.
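To illustrate (my example, not from the original mails): the same 32 bits agree under addition and multiplication in the low bits, but give different results under modulus, depending on the interpretation:

```java
public class ModDemo {
    public static void main(String[] args) {
        int bits = 0xF000ABCD;
        long asSigned = bits;                           // sign-extended: -268391475
        long asUnsigned = Integer.toUnsignedLong(bits); // zero-extended: 4026575821

        System.out.println(Math.floorMod(asSigned, 7L));   // prints 5
        System.out.println(Math.floorMod(asUnsigned, 7L)); // prints 2
    }
}
```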

So are the Murmur3 hash functions actually unsigned? I thought the
original c code returned unsigned 64-bit values (uint64_t).

IIUC the purpose of the Signedness enum is to specify whether a negative value returned from the hash function should be treated as a numerical magnitude when used in downstream computations, since sign extension on negatives sets the leading bits to 1s. I.e. using the negative as if it were unsigned (with all bits random) is not valid. Perhaps you can give an example of how signed and unsigned would be handled differently.


The murmur hash is not a good example here as it returns the resulting
buffer as 2 signed longs. MD5 is probably a better example: the
messageDigest.digest() returns a byte[16] (128 bits), which the code then
interprets as 2 signed longs. A C implementation could interpret that as 2
unsigned longs. Additionally, there are hashes that return smaller buffers,
such as murmur2, which returns a 32-bit unsigned int in the C code. In Java
that could be represented as a long as per the Integer.toUnsignedLong()
method, or it could just be cast to a long and thus be signed.

The values returned are then passed to floorMod(value, numberOfBits) to
turn on the appropriate bit in the resulting bit vector. As noted above,
this will yield different values depending upon how the byte[] was
interpreted.

Claude

