Way back in your first post on this topic you talked about changing the AbstractBloomFilter to use:
@Override
public int cardinality() {
    int count = 0;
    for (final long bits : getBits()) {
        count += Long.bitCount(bits);
    }
    return count;
}

among other changes. Those were the changes I was referring to.

Claude

On Wed, Feb 19, 2020 at 11:33 PM Alex Herbert <alex.d.herb...@gmail.com> wrote:
>
> > On 19 Feb 2020, at 21:14, Claude Warren <cla...@xenei.com> wrote:
> >
> > I think the compromise solution of removing the thrown exception and adding an isInvalid() method makes sense and fits within the parameters of how the counting Bloom filter should work. However, I think the add() and remove() methods (or add and subtract) should only accept CountingBloomFilter arguments. The reasoning is that the methods are specific to the counting filter and are defined in terms of the internal structure of the filter (the count of each bit), which is not present in the "standard" Bloom filter. I think that makes it lexically clear.
>
> OK. We can always add a transaction counting Bloom filter later that will ensure entire filters are added or removed cleanly. For now it can be user-beware documentation that states that if you subtract a filter you did not add then you can end up with a set of counts that are not a sum of filters.
>
> I do note that you could do this anyway even with a transaction counting Bloom filter, e.g. A+B+C+D - E. It may or may not throw, and the trust is on the user to ensure they do not subtract E from something that never had it added.
>
> Thinking about the invalid flag, perhaps the methods merge/add/subtract should just return 0 if the filter state is valid and -1 if invalid. Thus once a filter is invalid all calls would return the invalid flag.
>
> The reason for this is efficiency. The counts are allowed to roam freely over any range for an integer. You have a valid flag that is the bitwise OR of every count you have adjusted. If any over/underflow occurs then the valid flag will have a negative sign bit. The methods then return the signed shift of the valid flag:
>
> return valid >> 31;
>
> This would be -1 if any count ever over/underflows, zero otherwise.
>
> This eliminates any conditional statements from the filter operations. The user just has to know this is how it works, and so any downstream consumption of the counts would have to first check the isInvalid flag and sanitise the counts as necessary. You can even view the counts as unsigned int32 if you never subtract filters.
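A minimal sketch of that sign-bit flag, assuming a hypothetical int[] backing store and method shape (illustrative only, not the actual implementation):

// 'state' accumulates the sign bit of every count that is written.
private int[] counts;
private int state;

// Adds the counts from another filter.
// Returns 0 while the filter is valid, -1 once any count has over/underflowed.
int add(final int[] otherCounts) {
    for (int i = 0; i < otherCounts.length; i++) {
        counts[i] += otherCounts[i];
        state |= counts[i];  // a negative result sets the sign bit, and it sticks
    }
    return state >> 31;      // signed shift: -1 if invalid, 0 otherwise; no branches
}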
> > Do you have a branch someplace where you are doing this work? I would like to see the proposed changes you outlined for the AbstractBloomFilter implemented, as I think you are correct and they would be faster and cleaner.
>
> No but I'll create one. I'm waiting for a big PR against collections to go through (an update to force use of checkstyle). Then I'll revert the changes I made to the CountingBloomFilterTest to its previous state and put the changes into a branch. These would have a new internal structure for the counts and the methods for add/subtract.
>
> For the AbstractBloomFilter do you mean the use of the mod(long, int) method in the Signedness enum? I think the removal of Math.floorMod can be done in two places with a bit shift, but should we also use the % operator instead of Math.floorMod for Signedness.UNSIGNED?
>
> > Claude
> >
> > On Wed, Feb 19, 2020 at 12:41 AM Alex Herbert <alex.d.herb...@gmail.com> wrote:
> >>
> >>> On 18 Feb 2020, at 22:34, Gary Gregory <garydgreg...@gmail.com> wrote:
> >>>
> >>> On Tue, Feb 18, 2020 at 5:32 PM Claude Warren <cla...@xenei.com> wrote:
> >>>
> >>>> Last one first, why a tree map? I think it is a holdover from an earlier implementation. It can be any reasonable Map (e.g. HashMap).
> >>>>
> >>>> On the remove() issue: you are absolutely correct, there is a bug.
> >>>
> >>> May you please add a test to validate whatever fix you guys implement? I'm not following the details here but I want to stress the importance of tests to help future maintainers validate any future changes and protect us from regression bugs. IOW, while the current code base is in your head, more tests! ;-)
> >>
> >> I added a test that shows the current implementation does not do what I thought it should do based on the documentation in the javadoc:
> >>
> >> "For each bit that is turned on in the other filter; if the other filter is also a CountingBloomFilter the count is added to this filter, otherwise the count is incremented by one."
> >>
> >> My test checks the counts are added.
> >>
> >> However it does what Claude thought it should do. The original test checks that the indices are incremented by 1.
> >>
> >> So what to do about counting? If overflow or underflow occurs on addition or subtraction you can throw an exception to signal the error, but if a partial operation has been made this leaves the filter in an undetermined state. The filter does not actually correspond to a sum of filters anymore, so it is invalid. Should any further operations on the filter be prevented? In this case throwing an IllegalStateException for all further operations seems extreme. It may be better to have the addition/subtraction return true/false if the operation was clean, i.e. similar to a standard Set collection that returns true if adding/removing an item changes the state, false otherwise. The filter could also maintain a state flag to indicate that an overflow/underflow has occurred and the filter no longer represents a sum of filters.
> >>
> >> Given this idea we change merge to increment the indices by 1 for any BloomFilter, including a counting filter. It will not throw if it overflows but will set an invalid state flag. The logic here is that the method is inherited from BloomFilter and so will function in the same way as other filters by accumulating set bits (indices).
> >>
> >> Then add some methods:
> >>
> >> boolean add(BloomFilter)
> >> boolean remove(BloomFilter)
> >> boolean isInvalid()
> >>
> >> These update the counts using the counts of the other filter if it is a CountingBloomFilter, otherwise they add or subtract 1. The methods return true if the operation added or removed the filter entirely, i.e. it is now a part of the sum or was removed from the sum. Otherwise they return false and the filter is marked as an invalid sum of filters. This leaves the counting filter in an undetermined state but it can still be used as a non-counting filter as the indices will be correctly maintained.
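Read as a contract, a hypothetical caller of the proposed methods might look like this (names as proposed above, the rest illustrative):

CountingBloomFilter counts = ...;
counts.add(filterA);  // true: filterA is now part of the sum

if (!counts.remove(filterB)) {
    // An underflow occurred somewhere: the counts are no longer an exact
    // sum of filters, but the indices still work as a plain Bloom filter.
    assert counts.isInvalid();
}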
> >> It is a compromise but there does not seem to be a nice way to do this without a lot of overhead. That would be to check that the add/remove can be done entirely before actually performing the operation. If the operation would under/overflow then an exception can be thrown and the filter is left in a usable state. This could be achieved with some efficiency if the filter stored the maximum count. Then any add would be able to check for potential overflow quickly and do a safe add. However the subtraction could not know the indices that may underflow using a max value (since 0 - x underflows) and would always have to do a safe removal by checking all decrements before actually doing the entire removal. One idea here is that a reusable list is kept to store all the indices that are to be decremented. Stage 1 finds all the indices and checks for underflow in the subtraction. If underflow then throw. Otherwise traverse the list and actually do the subtraction. The overhead of the second list traversal is likely to be much lower than the initial identification of indices in the backing map. This all becomes easy to do if the backing data structure stores both the index and a mutable field, e.g.
> >>
> >> Set<MutableIndex> counts;
> >> class MutableIndex {
> >>     int index;
> >>     int count;
> >>     int hashCode() { return index; }
> >>     // etc
> >> }
> >>
> >> You then find all the MutableIndex objects in the Set and put them in a reusable temp List (with the new count stored in a parallel list). If all can be incremented/decremented without under/overflow then just go over the list and update the count using the precomputed new count. The entire operation then works as a single transaction.
> >>
> >> It may be better put into a TransactionCountingBloomFilter, and then CountingBloomFilter is left to be faster and not care about over/underflow.
> >>
> >> Thoughts on this approach?
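A sketch of that two-stage, all-or-nothing subtraction, assuming the Set<MutableIndex> store above, java.util imports, and a hypothetical lookup(index) helper that finds an entry in the Set:

final List<MutableIndex> entries = new ArrayList<>(); // reusable temp list
final List<Integer> updated = new ArrayList<>();      // parallel list of precomputed counts

// Stage 1: locate every index and check each decrement for underflow
// before anything is mutated. Counts in 'other' are assumed positive.
other.getCounts().forEach(e -> {                      // Stream<Map.Entry<Integer, Integer>>
    final MutableIndex mi = lookup(e.getKey());
    final int count = (mi == null) ? 0 : mi.count;
    if (count - e.getValue() < 0) {
        throw new IllegalStateException("Underflow on index " + e.getKey());
    }
    entries.add(mi);
    updated.add(count - e.getValue());
});

// Stage 2: every decrement is known to be safe; apply the precomputed counts.
for (int i = 0; i < entries.size(); i++) {
    entries.get(i).count = updated.get(i);
}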
> >>> Gary
> >>>
> >>>> It looks like a partial edit got restored back into the code. The first block should be removed. In fact it should probably read
> >>>>
> >>>> if (val == null || val == 0) {
> >>>>     throw new IllegalStateException("Underflow on index " + idx);
> >>>> } else if (val == 1) {
> >>>>     counts.remove(idx);
> >>>> } else {
> >>>>     counts.put(idx, val - 1);
> >>>> }
> >>>>
> >>>> The question of merging/removing counting Bloom filters is an interesting one.
> >>>>
> >>>> Merging a "normal" Bloom filter A into a counting Bloom filter B simply increments all the bits specified by A.getBits().
> >>>>
> >>>> I took the approach that a counting Bloom filter was just another Bloom filter. So merging counting Bloom filter A into counting Bloom filter B simply incremented all the bits specified by A.getBits().
> >>>>
> >>>> However the source has moved on since the original implementation was laid down, and it might make sense that merging Bloom filters adds the counts from A into B. Then if the developer wanted to perform a simple merge for a Bloom filter the developer could use B.merge( A.getHasher() ). However, I felt that since a counting Bloom filter is an instance of Bloom filter the merge should be driven by getBits().
> >>>>
> >>>> Perhaps it makes sense for counting Bloom filters to have another method add( CountingBloomFilter a) that would add the counts from A into B. This would necessitate a subtract( CountingBloomFilter a) as well.
> >>>>
> >>>> The getCounts() method returns a Map.Entry because it was more semantic in nature (e.g. it is obvious that it is a key-value pair), but I am not averse to changing it to an int[]. If the question is why have the method at all, then the answers are: 1. it is needed for testing (bad answer); 2. it is the only way to retrieve the counts for each bit, which can be required by some applications.
> >>>>
> >>>> Using a Bag might make sense, but my reading of AbstractMapBag is that it tracks the total of all counts in the bag, so we would overflow that counter before we overflowed a single count. This definition of size() differs from the definition of size() for a map, and so we would have to rewrite the cardinality() method. However, it would be nice if we could lift the MutableInteger class.
> >>>>
> >>>> In addition, we do need to check for overflow and underflow. An underflow indicates that a filter was removed that was not originally added. Overflow is possible when processing streams as in a "stable Bloom filter" (https://en.wikipedia.org/wiki/Bloom_filter#Stable_Bloom_filters).
> >>>>
> >>>> Claude
> >>>>
> >>>> On Tue, Feb 18, 2020 at 5:31 PM Alex Herbert <alex.d.herb...@gmail.com> wrote:
> >>>>
> >>>>> I've updated master with some of the fixes discussed.
> >>>>>
> >>>>> I've been looking through the rest of the BloomFilter code and the CountingBloomFilter appears to be broken:
> >>>>>
> >>>>> 1. When another CountingBloomFilter is merged or removed the counts of the other filter are ignored. This is not as documented.
> >>>>>
> >>>>> 2. When a remove operation is performed and the count for an index decrements past zero it throws an exception. This is not documented but is enforced in the unit tests. This is very unfriendly. Digging into this part of the code, there is a duplication of logic for remove. The first part will not throw if the decrement passes zero. Then the same index is decremented again (with the same new value) and it will throw this time for underflow. So maybe at some point it didn't throw and now it does.
> >>>>>
> >>>>> I commented the code where I think the problems are and added tests to master that fail with the current implementation.
> >>>>>
> >>>>> I thought this type of logic is ideally suited to a Bag. So I rewrote the class using Bag<Integer>. This passes the same tests but has two differences:
> >>>>>
> >>>>> a) It will not throw when remove would underflow. The bit index just gets set to zero (and that item is removed from the Bag).
> >>>>>
> >>>>> b) Bag has no support for overflow in the count of each item. It actually has a weird feature that adding 2 to Integer.MAX_VALUE will overflow to negative, and then subtracting 1 would reset the count to zero because it was still negative after subtracting 1. In Bag this is not an issue as it is limited by Collection.size() returning an int for the count of everything in the Bag. So if you add that many items the collection will break elsewhere too.
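The behaviour described in (a), illustrated with the commons-collections4 Bag (a small example, not code from the branch):

import org.apache.commons.collections4.Bag;
import org.apache.commons.collections4.bag.HashBag;

Bag<Integer> counts = new HashBag<>();
counts.add(42);          // count for index 42 is now 1
counts.remove(42, 5);    // removing more copies than are present does not throw...
counts.getCount(42);     // ...the count is simply 0; the item is gone from the Bag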
> >>>>> I think the change (a) is an improvement. But are there situations where you want to know that you cannot remove a filter exactly from another one? If so, perhaps a removeExactly(...) method for this case. I'd prefer to just drop the throwing of exceptions. After all, if you do an andCardinality(BloomFilter) you do not get errors thrown when one cannot be "subtracted" from the other.
> >>>>>
> >>>>> Change (b) is faster but could lead to errors. So how likely is overflow in a counting Bloom filter?
> >>>>>
> >>>>> Both can be supported but it would have to use a custom Bag implementation. This is pretty trivial and would actually speed up other parts of the filter by having direct access to the Map underlying the Bag for faster merge and remove operations. This would then be a change from the current Map<Integer, Integer> to a Map<Integer, MutableInteger> to store counts.
> >>>>>
> >>>>> I can update the code to fix the implementation but will only do so after a decision on overflow is made. We could even have a custom implementation that stores counts as a long for very high counting filters. Is this likely to be used? Only in a situation where the exceptions from overflow are an issue.
> >>>>>
> >>>>> Other questions about this:
> >>>>>
> >>>>> - Why public Stream<Map.Entry<Integer, Integer>> getCounts()?
> >>>>>
> >>>>> Perhaps this is better served raw as arrays of length 2. This has lower memory overhead:
> >>>>>
> >>>>> public Stream<int[]> getCounts()
> >>>>>
> >>>>> I've ruled out an accessor method as there is no equivalent for boolean getBit(int index) in BloomFilter, e.g.:
> >>>>>
> >>>>> public int getCount(int index)
> >>>>>
> >>>>> - Why a TreeMap? Is the order of the indices important?
> >>>>>
> >>>>> Or do you have some benchmarks to show that the TreeMap handles lots of growth and shrinkage better than a HashMap? There are situations where each one would be a better choice, and so perhaps this is a case for having a CountingBloomFilter with a choice for the backing Map. This could be done using an abstract class with implementations. This is one to leave until some benchmarks are in place in the main code.
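A sketch of the array-based form suggested above, assuming a Map<Integer, Integer> backing store named counts (illustrative):

// Each (index, count) pair is emitted as a length-2 array rather than a
// boxed Map.Entry exposed from the internal map.
public Stream<int[]> getCounts() {
    return counts.entrySet().stream()
                 .map(e -> new int[] {e.getKey(), e.getValue()});
}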
Perhaps you > can > >>>>> give > >>>>>>> an example of how signed or unsigned would be handled differently. > >>>>>>> > >>>>>>> > >>>>>> The murmur hash is not a good example here as it returns the > resulting > >>>>>> buffer as 2 signed longs. The MD5 is probably a better example as > the > >>>>>> messageDigest.digest() returns a byte[128]. The code then > interprets > >>>>> that > >>>>>> as 2 signed longs. A C implementation could interpret that as 2 > >>>> unsigned > >>>>>> longs. Additionally, there are hashes that return smaller buffers > >> such > >>>>> as > >>>>>> with murmur2 which returns a 32bit unsigned int in the C code. In > >> java > >>>>>> that could be represented as a long as per the > >> Integer.asUnsignedlong() > >>>>>> method or it could just be cast to a long and thus signed. > >>>>>> > >>>>>> The values returned are then passed to floorMod( value, > numberOfBits ) > >>>> to > >>>>>> turn on the appropriate bit in the resulting bit vector. As noted > >>>> above > >>>>>> this will yield different values depending upon how the byte[] was > >>>>>> represented. > >>>>>> > >>>>>> Claude > >>>>>> > >>>>> > >>>>> --------------------------------------------------------------------- > >>>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > >>>>> For additional commands, e-mail: dev-h...@commons.apache.org > >>>>> > >>>>> > >>>> > >>>> -- > >>>> I like: Like Like - The likeliest place on the web > >>>> <http://like-like.xenei.com> > >>>> LinkedIn: http://www.linkedin.com/in/claudewarren > >>>> > >> > >> > >> --------------------------------------------------------------------- > >> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > >> For additional commands, e-mail: dev-h...@commons.apache.org > >> > >> > > > > -- > > I like: Like Like - The likeliest place on the web > > <http://like-like.xenei.com> > > LinkedIn: http://www.linkedin.com/in/claudewarren > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > For additional commands, e-mail: dev-h...@commons.apache.org > > -- I like: Like Like - The likeliest place on the web <http://like-like.xenei.com> LinkedIn: http://www.linkedin.com/in/claudewarren