[GitHub] [commons-collections] Claudenw commented on a change in pull request #137: WIP: CountingBloomFilter

GitBox Sat, 07 Mar 2020 09:50:40 -0800

Claudenw commented on a change in pull request #137: WIP: CountingBloomFilter
URL: 
https://github.com/apache/commons-collections/pull/137#discussion_r389299774


 ##########
 File path: 
src/main/java/org/apache/commons/collections4/bloomfilter/ArrayCountingBloomFilter.java
 ##########
 @@ -0,0 +1,396 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import java.util.BitSet;
+import java.util.HashSet;
+import java.util.NoSuchElementException;
+import java.util.PrimitiveIterator;
+import java.util.PrimitiveIterator.OfInt;
+import java.util.function.Consumer;
+import java.util.function.IntConsumer;
+import java.util.Set;
+
+import org.apache.commons.collections4.bloomfilter.hasher.Hasher;
+import org.apache.commons.collections4.bloomfilter.hasher.Shape;
+import org.apache.commons.collections4.bloomfilter.hasher.StaticHasher;
+
+/**
+ * A counting Bloom filter using an array to track counts for each enabled bit
+ * index.
+ *
+ * <p>Any operation that results in negative counts or integer overflow of counts will
+ * mark this filter as invalid. This transition is not reversible. The counts for the
+ * filter immediately prior to the operation that created invalid counts can be recovered.
+ * See the documentation in {@link #isValid()} for details.
+ *
+ * <p>All the operations in the filter assume the counts are currently valid. Behaviour
+ * of an invalid filter is undefined. It will no longer function identically to a standard
+ * Bloom filter that is the merge of all the Bloom filters that have been added
+ * to and not later subtracted from the counting Bloom filter.
+ *
+ * <p>The maximum supported number of items that can be stored in the filter is
+ * limited by the maximum array size combined with the {@link Shape}. For
+ * example an implementation using a {@link Shape} with a false-positive
+ * probability of 1e-6 and {@link Integer#MAX_VALUE} bits can reversibly store
+ * approximately 75 million items using 20 hash functions per item with a memory
+ * consumption of approximately 8 GB.
+ *
+ * @since 4.5
+ * @see Shape
+ */
+public class ArrayCountingBloomFilter extends AbstractBloomFilter implements CountingBloomFilter {
+
+    /**
+     * The count of each bit index in the filter.
+     */
+    private final int[] counts;
+
+    /**
+     * The state flag. This is a bitwise OR of the entire history of all updated
+     * counts. If negative then a negative count or integer overflow has occurred on
+     * one or more counts in the history of the filter and the state is invalid.
+     *
+     * <p>Maintenance of this state flag is branch-free for improved performance. It
+     * eliminates a conditional check for a negative count during remove/subtract
+     * operations and a conditional check for integer overflow during merge/add
+     * operations.
+     *
+     * <p>Note: Integer overflow is unlikely in realistic usage scenarios. A count
+     * that overflows indicates that the number of items in the filter exceeds the
+     * maximum possible size (number of bits) of any Bloom filter constrained by
+     * integer indices. At this point the filter is most likely full (all bits are
+     * non-zero) and thus useless.
+     *
+     * <p>Negative counts are a concern if the filter is used incorrectly by
+     * removing an item that was never added. It is expected that a user of a
+     * counting Bloom filter will not perform this action as it is a mistake.
+     * Enabling an explicit recovery path for negative or overflow counts is a major
+     * performance burden not deemed necessary for the unlikely scenarios when an
+     * invalid state is created. Maintenance of the state flag is a concession to
+     * flag improper use that should not have a major performance impact.
+     */
+    private int state;
+
+    /**
+     * An iterator of all indexes with non-zero counts.
+     *
+     * <p>In the event that the filter state is invalid any index with a negative count
+     * will also be produced by the iterator.
+     */
+    private class IndexIterator implements PrimitiveIterator.OfInt {
+        /** The next non-zero index (or counts.length). */
+        private int next;
+
+        /**
+         * Create an instance.
+         */
+        IndexIterator() {
+            advance();
+        }
+
+        /**
+         * Advance to the next non-zero index.
+         */
+        void advance() {
+            while (next < counts.length && counts[next] == 0) {
+                next++;
+            }
+        }
+
+        @Override
+        public boolean hasNext() {
+            return next < counts.length;
+        }
+
+        @Override
+        public int nextInt() {
+            if (hasNext()) {
+                final int result = next++;
+                advance();
+                return result;
+            }
+            // Currently unreachable as the iterator is only used by
+            // the StaticHasher which iterates correctly.
+            throw new NoSuchElementException();
+        }
+    }
+
+    /**
+     * Constructs an empty counting Bloom filter with the specified shape.
+     *
+     * @param shape the shape of the filter
+     */
+    public ArrayCountingBloomFilter(final Shape shape) {
+        super(shape);
+        counts = new int[shape.getNumberOfBits()];
+    }
+
+    /**
+     * Constructs a counting Bloom filter from a hasher and a shape.
+     *
+     * <p>The filter will be equal to the result of merging the hasher with an empty
+     * filter; specifically duplicate indexes in the hasher are ignored.
+     *
+     * @param hasher the hasher to build the filter from
+     * @param shape the shape of the filter
+     * @throws IllegalArgumentException if the hasher cannot generate indices for
+     * the shape
+     * @see #merge(Hasher)
+     */
+    public ArrayCountingBloomFilter(final Hasher hasher, final Shape shape) {
+        super(shape);
+        // Given the filter is empty we can optimise the operation of merge(hasher)
+        verifyHasher(hasher);
+        // Delay array allocation until after hasher is verified
+        counts = new int[shape.getNumberOfBits()];
+        // All counts are zero. Ignore duplicates by initialising to 1
+        hasher.getBits(shape).forEachRemaining((IntConsumer) idx -> counts[idx] = 1);
+    }
+
+    @Override
+    public int cardinality() {
+        int size = 0;
+        for (final int c : counts) {
+            if (c != 0) {
+                size++;
+            }
+        }
+        return size;
+    }
+
+    @Override
+    public boolean contains(BloomFilter other) {
+        // The AbstractBloomFilter implementation converts both filters to long[] bits.
+        // This would involve checking all indexes in this filter against zero.
+        // Ideally we use an iterator of bit indexes to allow fail-fast on the
+        // first bit index that is zero.
+        if (other instanceof ArrayCountingBloomFilter) {
+            verifyShape(other);
+            return contains(((ArrayCountingBloomFilter) other).iterator());
+        }
+
+        // Note:
+        // This currently creates a StaticHasher which stores all the indexes.
+        // It would greatly benefit from direct generation of the index iterator
+        // avoiding the intermediate storage.
+        return contains(other.getHasher());
 
 Review comment:
   Would this work for the BloomFilter Hasher implementation?
   
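   ```java
       import java.util.PrimitiveIterator;
       import java.util.function.Function;

       import org.apache.commons.collections4.bloomfilter.hasher.HashFunctionIdentity;
       import org.apache.commons.collections4.bloomfilter.hasher.Hasher;
       import org.apache.commons.collections4.bloomfilter.hasher.Shape;

       /** A Hasher view over an existing BloomFilter; func supplies the index iterator. */
       public class BloomFilterHasher implements Hasher {
           private final BloomFilter bf;
           private final Function<BloomFilter, PrimitiveIterator.OfInt> func;

           BloomFilterHasher(BloomFilter bf, Function<BloomFilter, PrimitiveIterator.OfInt> func) {
               this.bf = bf;
               this.func = func;
           }

           @Override
           public PrimitiveIterator.OfInt getBits(Shape shape) {
               // The wrapped filter can only produce indexes for its own shape.
               if (!bf.getShape().equals(shape)) {
                   throw new IllegalArgumentException(String.format(
                       "Hasher shape (%s) is not the same as shape (%s)", bf.getShape(), shape));
               }
               return func.apply(bf);
           }

           @Override
           public HashFunctionIdentity getHashFunctionIdentity() {
               return bf.getShape().getHashFunctionIdentity();
           }

           @Override
           public boolean isEmpty() {
               return bf.cardinality() == 0;
           }
       }
   ```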
   
   `BloomFilter.getHasher()` would have to be changed to return a `Hasher` (rather than a `StaticHasher`), or perhaps a `UniqueHasher`, where `UniqueHasher extends Hasher` and serves only as a marker to indicate that the values do not contain duplicates.
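   
   To make the marker idea concrete, here is a minimal, self-contained sketch. It does not use the real commons-collections types: `SimpleHasher`, `SketchCountingFilter`, `asHasher()` and `count()` are hypothetical stand-ins invented for illustration, and `UniqueHasher` is only the proposed name, not an existing interface. It shows a filter exposing its non-zero indexes lazily, a merge that skips de-duplication when the marker is present, and a fail-fast `contains`.
   
   ```java
   import java.util.BitSet;
   import java.util.PrimitiveIterator;
   import java.util.stream.IntStream;

   /** Simplified stand-in for Hasher: only index generation is modelled. */
   interface SimpleHasher {
       PrimitiveIterator.OfInt getBits(int numberOfBits);
   }

   /** Hypothetical marker: getBits() is promised to yield each index at most once. */
   interface UniqueHasher extends SimpleHasher {
   }

   /** Toy counting filter; not the ArrayCountingBloomFilter from the PR. */
   final class SketchCountingFilter {
       private final int[] counts;

       SketchCountingFilter(final int numberOfBits) {
           counts = new int[numberOfBits];
       }

       /** Exposes the non-zero indexes lazily, with no intermediate index storage. */
       UniqueHasher asHasher() {
           return numberOfBits -> {
               if (numberOfBits != counts.length) {
                   throw new IllegalArgumentException("Shape mismatch");
               }
               return IntStream.range(0, counts.length).filter(i -> counts[i] != 0).iterator();
           };
       }

       /** Increments each distinct index once; the marker lets us skip de-duplication. */
       void merge(final SimpleHasher hasher) {
           final PrimitiveIterator.OfInt it = hasher.getBits(counts.length);
           final BitSet seen = hasher instanceof UniqueHasher ? null : new BitSet(counts.length);
           while (it.hasNext()) {
               final int idx = it.nextInt();
               if (seen == null || !seen.get(idx)) {
                   counts[idx]++;
                   if (seen != null) {
                       seen.set(idx);
                   }
               }
           }
       }

       /** Fail-fast containment: stops at the first index whose count is zero. */
       boolean contains(final SimpleHasher hasher) {
           final PrimitiveIterator.OfInt it = hasher.getBits(counts.length);
           while (it.hasNext()) {
               if (counts[it.nextInt()] == 0) {
                   return false;
               }
           }
           return true;
       }

       /** Returns the count stored at a single index. */
       int count(final int idx) {
           return counts[idx];
       }
   }
   ```
   
   In this sketch the marker carries no methods, so the only cost of the contract is the `instanceof` check in `merge`. Whether the real `merge(Hasher)` could rely on such a marker depends on how strictly implementations honour the no-duplicates promise.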

