Way back in your first post on this topic you talked about changing the AbstractBloomFilter to use:
@Override
public int cardinality() {
    int count = 0;
    for (final long bits : getBits()) {
        count += Long.bitCount(bits);
    }
    return count;
}

among other changes. Those were the changes I was referring to.

Claude

On Wed, Feb 19, 2020 at 11:33 PM Alex Herbert <alex.d.herb...@gmail.com> wrote:
>
> > On 19 Feb 2020, at 21:14, Claude Warren <cla...@xenei.com> wrote:
> >
> > I think the compromise solution of removing the thrown exception and adding an isInvalid() method makes sense and fits within the parameters of how the counting Bloom filter should work. However, I think the add() and remove() methods (or add and subtract) should only accept CountingBloomFilter arguments. The reasoning is that the methods are specific to the counting filter and are defined in terms of the internal structure of the filter (the count of each bit), which is not present in the "standard" Bloom filter. I think that makes it lexically clear.
>
> OK. We can always add a transaction counting Bloom filter later that will ensure entire filters are added or removed cleanly. For now it can be user-beware documentation that states that if you subtract a filter you did not add then you can end up with a set of counts that are not a sum of filters.
>
> I do note that you could do this anyway even with a transaction counting Bloom filter, e.g. A+B+C+D - E. It may or may not throw, and the trust is on the user to ensure they do not subtract E from something that never had it added.
>
> Thinking about the invalid flag, perhaps the methods merge/add/subtract should just return 0 if the filter state is valid and -1 if invalid. Thus once a filter is invalid all calls would return the invalid flag.
>
> The reason for this is efficiency. The counts are allowed to roam freely over any range for an integer. You have a valid flag that is the bitwise OR of every count you have adjusted. If any over/underflow occurs then the valid flag will have a negative sign bit. The methods then return the signed shift of the valid flag:
>
> return valid >> 31;
>
> This would be -1 if any count ever over/underflows, zero otherwise.
>
> This eliminates any conditional statements from the filter operations. The user just has to know this is how it works, and so any downstream consumption of the counts would have to first check the isInvalid flag and sanitise the counts as necessary. You can even view the counts as unsigned int32 if you never subtract filters.
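A minimal sketch of that sign-bit flag, assuming a hypothetical int[] backing store and method shape (illustrative only, not the actual implementation):

// 'state' accumulates the sign bit of every count that is written.
private int[] counts;
private int state;

// Adds the counts from another filter.
// Returns 0 while the filter is valid, -1 once any count has over/underflowed.
int add(final int[] otherCounts) {
    for (int i = 0; i < otherCounts.length; i++) {
        counts[i] += otherCounts[i];
        state |= counts[i];  // a negative result sets the sign bit, and it sticks
    }
    return state >> 31;      // signed shift: -1 if invalid, 0 otherwise; no branches
}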
> > Do you have a branch someplace where you are doing this work? I would like to see the proposed changes you outlined for the AbstractBloomFilter implemented, as I think you are correct and they would be faster and cleaner.
>
> No but I'll create one. I'm waiting for a big PR against collections to go through (an update to force use of checkstyle). Then I'll revert the changes I made to the CountingBloomFilterTest to its previous state and put the changes into a branch. These would have a new internal structure for the counts and the methods for add/subtract.
>
> For the AbstractBloomFilter do you mean the use of the mod(long, int) method in the Signedness enum? I think the removal of Math.floorMod can be done in two places with a bit shift, but should we also use the % operator instead of Math.floorMod for Signedness.UNSIGNED?
>
> > Claude
> >
> > On Wed, Feb 19, 2020 at 12:41 AM Alex Herbert <alex.d.herb...@gmail.com> wrote:
> >>
> >>> On 18 Feb 2020, at 22:34, Gary Gregory <garydgreg...@gmail.com> wrote:
> >>>
> >>> On Tue, Feb 18, 2020 at 5:32 PM Claude Warren <cla...@xenei.com> wrote:
> >>>
> >>>> Last one first, why a tree map? I think it is a holdover from an earlier implementation. It can be any reasonable Map (e.g. HashMap).
> >>>>
> >>>> On the remove() issue: you are absolutely correct, there is a bug.
> >>>
> >>> May you please add a test to validate whatever fix you guys implement? I'm not following the details here but I want to stress the importance of tests to help future maintainers validate any future changes and protect us from regression bugs. IOW, while the current code base is in your head, more tests! ;-)
> >>
> >> I added a test that shows the current implementation does not do what I thought it should do based on the documentation in the javadoc:
> >>
> >> "For each bit that is turned on in the other filter; if the other filter is also a CountingBloomFilter the count is added to this filter, otherwise the count is incremented by one."
> >>
> >> My test checks the counts are added.
> >>
> >> However it does what Claude thought it should do. The original test checks that the indices are incremented by 1.
> >>
> >> So what to do about counting? If overflow or underflow occurs on addition or subtraction you can throw an exception to signal the error, but if a partial operation has been made this leaves the filter in an undetermined state. The filter does not actually correspond to a sum of filters anymore, so it is invalid. Should any further operations on the filter be prevented? In this case throwing an IllegalStateException for all further operations seems extreme. It may be better to have the addition/subtraction return true/false if the operation was clean, i.e. similar to a standard Set collection that returns true if adding/removing an item changes the state, false otherwise. The filter could also maintain a state flag to indicate that an overflow/underflow has occurred and the filter no longer represents a sum of filters.
> >>
> >> Given this idea we change merge to increment the indices by 1 for any BloomFilter, including a counting filter. It will not throw if it overflows but will set an invalid state flag. The logic here is that the method is inherited from BloomFilter and so will function in the same way as other filters by accumulating set bits (indices).
> >>
> >> Then add some methods:
> >>
> >> boolean add(BloomFilter)
> >> boolean remove(BloomFilter)
> >> boolean isInvalid()
> >>
> >> These update the counts using the counts of the other filter if it is a CountingBloomFilter, otherwise they add or subtract 1. The methods return true if the operation added or removed the filter entirely, i.e. it is now a part of the sum or was removed from the sum. Otherwise they return false and the filter is marked as an invalid sum of filters. This leaves the counting filter in an undetermined state but it can still be used as a non-counting filter as the indices will be correctly maintained.
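Read as a contract, a hypothetical caller of the proposed methods might look like this (names as proposed above, the rest illustrative):

CountingBloomFilter counts = ...;
counts.add(filterA);  // true: filterA is now part of the sum

if (!counts.remove(filterB)) {
    // An underflow occurred somewhere: the counts are no longer an exact
    // sum of filters, but the indices still work as a plain Bloom filter.
    assert counts.isInvalid();
}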
> >> It is a compromise but there does not seem to be a nice way to do this without a lot of overhead. That would be to check that the add/remove can be done entirely before actually performing the operation. If the operation would under/overflow then an exception can be thrown and the filter is left in a usable state. This could be achieved with some efficiency if the filter stored the maximum count. Then any add would be able to check for potential overflow quickly and do a safe add. However the subtraction could not know the indices that may underflow using a max value (since 0 - x underflows) and would always have to do a safe removal by checking all decrements before actually doing the entire removal. One idea here is that a reusable list is kept to store all the indices that are to be decremented. Stage 1 finds all the indices and checks for underflow in the subtraction. If underflow then throw. Otherwise traverse the list and actually do the subtraction. The overhead of the second list traversal is likely to be much lower than the initial identification of indices in the backing map. This all becomes easy to do if the backing data structure stores both the index and a mutable field, e.g.
> >>
> >> Set<MutableIndex> counts;
> >> class MutableIndex {
> >>     int index;
> >>     int count;
> >>     int hashCode() { return index; }
> >>     // etc
> >> }
> >>
> >> You then find all the MutableIndex objects in the Set and put them in a reusable temp List (with the new count stored in a parallel list). If all can be incremented/decremented without under/overflow then just go over the list and update the count using the precomputed new count. The entire operation then works as a single transaction.
> >>
> >> It may be better put into a TransactionCountingBloomFilter, and then CountingBloomFilter is left to be faster and not care about over/underflow.
> >>
> >> Thoughts on this approach?
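A sketch of that two-stage, all-or-nothing subtraction, assuming the Set<MutableIndex> store above, java.util imports, and a hypothetical lookup(index) helper that finds an entry in the Set:

final List<MutableIndex> entries = new ArrayList<>(); // reusable temp list
final List<Integer> updated = new ArrayList<>();      // parallel list of precomputed counts

// Stage 1: locate every index and check each decrement for underflow
// before anything is mutated. Counts in 'other' are assumed positive.
other.getCounts().forEach(e -> {                      // Stream<Map.Entry<Integer, Integer>>
    final MutableIndex mi = lookup(e.getKey());
    final int count = (mi == null) ? 0 : mi.count;
    if (count - e.getValue() < 0) {
        throw new IllegalStateException("Underflow on index " + e.getKey());
    }
    entries.add(mi);
    updated.add(count - e.getValue());
});

// Stage 2: every decrement is known to be safe; apply the precomputed counts.
for (int i = 0; i < entries.size(); i++) {
    entries.get(i).count = updated.get(i);
}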
> >>> Gary
> >>>
> >>>> It looks like a partial edit got restored back into the code. The first block should be removed. In fact it should probably read
> >>>>
> >>>> if (val == null || val == 0) {
> >>>>     throw new IllegalStateException("Underflow on index " + idx);
> >>>> } else if (val == 1) {
> >>>>     counts.remove(idx);
> >>>> } else {
> >>>>     counts.put(idx, val - 1);
> >>>> }
> >>>>
> >>>> The question of merging/removing counting Bloom filters is an interesting one.
> >>>>
> >>>> Merging a "normal" Bloom filter A into a counting Bloom filter B simply increments all the bits specified by A.getBits().
> >>>>
> >>>> I took the approach that a counting Bloom filter was just another Bloom filter. So merging counting Bloom filter A into counting Bloom filter B simply incremented all the bits specified by A.getBits().
> >>>>
> >>>> However the source has moved on since the original implementation was laid down, and it might make sense that merging Bloom filters adds the counts from A into B. Then if the developer wanted to perform a simple merge for a Bloom filter the developer could use B.merge( A.getHasher() ). However, I felt that since a counting Bloom filter is an instance of Bloom filter the merge should be driven by getBits().
> >>>>
> >>>> Perhaps it makes sense for counting Bloom filters to have another method add( CountingBloomFilter a) that would add the counts from A into B. This would necessitate a subtract( CountingBloomFilter a) as well.
> >>>>
> >>>> The getCounts() method returns a Map.Entry because it was more semantic in nature (e.g. it is obvious that it is a key-value pair), but I am not averse to changing it to an int[]. If the question is why have the method at all, then the answers are: 1. it is needed for testing (bad answer); 2. it is the only way to retrieve the counts for each bit, which can be required by some applications.
> >>>>
> >>>> Using a Bag might make sense, but my reading of AbstractMapBag is that it tracks the total of all counts in the bag, so we would overflow that counter before we overflowed a single count. This definition of size() differs from the definition of size() for a map, and so we would have to rewrite the cardinality() method. However, it would be nice if we could lift the MutableInteger class.
> >>>>
> >>>> In addition, we do need to check for overflow and underflow. An underflow indicates that a filter was removed that was not originally added. Overflow is possible when processing streams as in a "stable Bloom filter" (https://en.wikipedia.org/wiki/Bloom_filter#Stable_Bloom_filters).
> >>>>
> >>>> Claude
> >>>>
> >>>> On Tue, Feb 18, 2020 at 5:31 PM Alex Herbert <alex.d.herb...@gmail.com> wrote:
> >>>>
> >>>>> I've updated master with some of the fixes discussed.
> >>>>>
> >>>>> I've been looking through the rest of the BloomFilter code and the CountingBloomFilter appears to be broken:
> >>>>>
> >>>>> 1. When another CountingBloomFilter is merged or removed the counts of the other filter are ignored. This is not as documented.
> >>>>>
> >>>>> 2. When a remove operation is performed and the count for an index decrements past zero it throws an exception. This is not documented but is enforced in the unit tests. This is very unfriendly. Digging into this part of the code, there is a duplication of logic for remove. The first part will not throw if the decrement passes zero. Then the same index is decremented again (with the same new value) and it will throw this time for underflow. So maybe at some point it didn't throw and now it does.
> >>>>>
> >>>>> I commented the code where I think the problems are and added tests to master that fail with the current implementation.
> >>>>>
> >>>>> I thought this type of logic is ideally suited to a Bag. So I rewrote the class using Bag<Integer>. This passes the same tests but has two differences:
> >>>>>
> >>>>> a) It will not throw when remove would underflow. The bit index just gets set to zero (and that item is removed from the Bag).
> >>>>>
> >>>>> b) Bag has no support for overflow in the count of each item. It actually has a weird feature that adding 2 to Integer.MAX_VALUE will overflow to negative, and then subtracting 1 would reset the count to zero because it was still negative after subtracting 1. In Bag this is not an issue as it is limited by Collection.size() returning an int for the count of everything in the Bag. So if you add that many items the collection will break elsewhere too.
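The behaviour described in (a), illustrated with the commons-collections4 Bag (a small example, not code from the branch):

import org.apache.commons.collections4.Bag;
import org.apache.commons.collections4.bag.HashBag;

Bag<Integer> counts = new HashBag<>();
counts.add(42);          // count for index 42 is now 1
counts.remove(42, 5);    // removing more copies than are present does not throw...
counts.getCount(42);     // ...the count is simply 0; the item is gone from the Bag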
> >>>>> I think the change (a) is an improvement. But are there situations where you want to know that you cannot remove a filter exactly from another one? If so, perhaps a removeExactly(...) method for this case. I'd prefer to just drop the throwing of exceptions. After all, if you do an andCardinality(BloomFilter) you do not get errors thrown when one cannot be "subtracted" from the other.
> >>>>>
> >>>>> Change (b) is faster but could lead to errors. So how likely is overflow in a counting Bloom filter?
> >>>>>
> >>>>> Both can be supported but it would have to use a custom Bag implementation. This is pretty trivial and would actually speed up other parts of the filter by having direct access to the Map underlying the Bag for faster merge and remove operations. This would then be a change from the current Map<Integer, Integer> to a Map<Integer, MutableInteger> to store counts.
> >>>>>
> >>>>> I can update the code to fix the implementation but will only do so after a decision on overflow is made. We could even have a custom implementation that stores counts as a long for very high counting filters. Is this likely to be used? Only in a situation where the exceptions from overflow are an issue.
> >>>>>
> >>>>> Other questions about this:
> >>>>>
> >>>>> - Why public Stream<Map.Entry<Integer, Integer>> getCounts()?
> >>>>>
> >>>>> Perhaps this is better served raw as arrays of length 2. This has lower memory overhead:
> >>>>>
> >>>>> public Stream<int[]> getCounts()
> >>>>>
> >>>>> I've ruled out an accessor method as there is no equivalent for boolean getBit(int index) in BloomFilter, e.g.:
> >>>>>
> >>>>> public int getCount(int index)
> >>>>>
> >>>>> - Why a TreeMap? Is the order of the indices important?
> >>>>>
> >>>>> Or do you have some benchmarks to show that the TreeMap handles lots of growth and shrinkage better than a HashMap? There are situations where each one would be a better choice, and so perhaps this is a case for having a CountingBloomFilter with a choice for the backing Map. This could be done using an abstract class with implementations. This is one to leave until some benchmarks are in place in the main code.
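A sketch of the array-based form suggested above, assuming a Map<Integer, Integer> backing store named counts (illustrative):

// Each (index, count) pair is emitted as a length-2 array rather than a
// boxed Map.Entry exposed from the internal map.
public Stream<int[]> getCounts() {
    return counts.entrySet().stream()
                 .map(e -> new int[] {e.getKey(), e.getValue()});
}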
Perhaps you > can > >>>>> give > >>>>>>> an example of how signed or unsigned would be handled differently. > >>>>>>> > >>>>>>> > >>>>>> The murmur hash is not a good example here as it returns the > resulting > >>>>>> buffer as 2 signed longs. The MD5 is probably a better example as > the > >>>>>> messageDigest.digest() returns a byte[128]. The code then > interprets > >>>>> that > >>>>>> as 2 signed longs. A C implementation could interpret that as 2 > >>>> unsigned > >>>>>> longs. Additionally, there are hashes that return smaller buffers > >> such > >>>>> as > >>>>>> with murmur2 which returns a 32bit unsigned int in the C code. In > >> java > >>>>>> that could be represented as a long as per the > >> Integer.asUnsignedlong() > >>>>>> method or it could just be cast to a long and thus signed. > >>>>>> > >>>>>> The values returned are then passed to floorMod( value, > numberOfBits ) > >>>> to > >>>>>> turn on the appropriate bit in the resulting bit vector. As noted > >>>> above > >>>>>> this will yield different values depending upon how the byte[] was > >>>>>> represented. > >>>>>> > >>>>>> Claude > >>>>>> > >>>>> > >>>>> --------------------------------------------------------------------- > >>>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > >>>>> For additional commands, e-mail: dev-h...@commons.apache.org > >>>>> > >>>>> > >>>> > >>>> -- > >>>> I like: Like Like - The likeliest place on the web > >>>> <http://like-like.xenei.com> > >>>> LinkedIn: http://www.linkedin.com/in/claudewarren > >>>> > >> > >> > >> --------------------------------------------------------------------- > >> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > >> For additional commands, e-mail: dev-h...@commons.apache.org > >> > >> > > > > -- > > I like: Like Like - The likeliest place on the web > > <http://like-like.xenei.com> > > LinkedIn: http://www.linkedin.com/in/claudewarren > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > For additional commands, e-mail: dev-h...@commons.apache.org > > -- I like: Like Like - The likeliest place on the web <http://like-like.xenei.com> LinkedIn: http://www.linkedin.com/in/claudewarren