Claudenw commented on a change in pull request #258:
URL: 
https://github.com/apache/commons-collections/pull/258#discussion_r798325679



##########
File path: src/main/java/org/apache/commons/collections4/bloomfilter/Shape.java
##########
@@ -0,0 +1,486 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import java.util.Objects;
+
+/**
+ * The definition of a Bloom filter shape.
+ *
+ * <p> This class contains the values for the filter configuration and is used 
to
+ * convert a Hasher into a BloomFilter as well as verify that two Bloom 
filters are
+ * compatible. (i.e. can be compared or merged)</p>
+ *
+ * <h2>Interrelatedness of values</h2>
+ *
+ * <dl> <dt>Number of Items ({@code n})</dt>
+ * <dd>{@code n = ceil(m / (-k / ln(1 - exp(ln(p) / k))))}</dd> 
<dt>Probability of
+ * False Positives ({@code p})</dt> <dd>{@code p = pow(1 - exp(-k / (m / n)), 
k)}</dd> <dt>Number
+ * of Bits ({@code m})</dt>
+ * <dd>{@code m = ceil((n * ln(p)) / ln(1 / pow(2, ln(2))))}</dd> <dt>Number of
+ * Functions ({@code k})</dt> <dd>{@code k = round((m / n) * ln(2))}</dd> </dl>
+ *
+ * @see <a href="http://hur.st/bloomfilter?n=3&p=1.0E-5";>Bloom Filter 
calculator</a>
+ * @see <a href="https://en.wikipedia.org/wiki/Bloom_filter";>Bloom filter
+ * [Wikipedia]</a>
+ * @since 4.5
+ */
+public final class Shape implements Comparable<Shape> {
+
+    /**
+     * Number of hash functions to create a filter ({@code k}).
+     */
+    private final int numberOfHashFunctions;
+
+    /**
+     * Number of bits in the filter ({@code m}).
+     */
+    private final int numberOfBits;
+
+    /**
+     * Constructs a filter configuration with the specified number of items 
({@code n}) and
+     * bits ({@code m}).
+     *
+     * <p>The optimal number of hash functions ({@code k}) is computed.
+     * <pre>k = round((m / n) * ln(2))</pre>
+     *
+     * <p>The false-positive probability is computed using the number of 
items, bits and hash
+     * functions. An exception is raised if this is greater than or equal to 1 
(i.e. the
+     * shape is invalid for use as a Bloom filter).
+     *
+     * @param numberOfHashFunctions Number of hash functions to use for each 
item placed in the filter.
+     * @param numberOfBits The number of bits in the filter
+     * @throws IllegalArgumentException if {@code numberOfHashFunctions < 1} 
or {@code numberOfBits < 1}
+     */
+    public Shape(final int numberOfHashFunctions, final int numberOfBits) {
+        this.numberOfBits = checkNumberOfBits(numberOfBits);
+        this.numberOfHashFunctions = 
checkNumberOfHashFunctions(numberOfHashFunctions);
+    }
+
+    /**
+     * Check number of bits is strictly positive.
+     *
+     * @param numberOfBits the number of bits
+     * @return the number of bits
+     * @throws IllegalArgumentException if the number of bits is {@code < 1}
+     */
+    private static int checkNumberOfBits(final int numberOfBits) {
+        if (numberOfBits < 1) {
+            throw new IllegalArgumentException("Number of bits must be greater 
than 0: " + numberOfBits);
+        }
+        return numberOfBits;
+    }
+
+    /**
+     * Check number of hash functions is strictly positive
+     *
+     * @param numberOfHashFunctions the number of hash functions
+     * @return the number of hash functions
+     * @throws IllegalArgumentException if the number of hash functions is 
{@code < 1}
+     */
+    private static int checkNumberOfHashFunctions(final int 
numberOfHashFunctions) {
+        if (numberOfHashFunctions < 1) {
+            throw new IllegalArgumentException(
+                    "Number of hash functions must be greater than 0: " + 
numberOfHashFunctions);
+        }
+        return numberOfHashFunctions;
+    }
+
+    @Override
+    public int compareTo(Shape other) {
+        int i = Integer.compare(numberOfBits, other.numberOfBits);
+        return i == 0 ? Integer.compare(numberOfHashFunctions, 
other.numberOfHashFunctions) : i;
+    }
+
+    @Override
+    public boolean equals(final Object o) {
+        return (o instanceof Shape) ? compareTo((Shape) o) == 0 : false;
+    }
+
+    @Override
+    public int hashCode() {
+        return Objects.hash(numberOfBits, numberOfHashFunctions);
+    }
+
+    /**
+     * Gets the number of bits in the Bloom filter.
+     * This is also known as {@code m}.
+     *
+     * @return the number of bits in the Bloom filter ({@code m}).
+     */
+    public int getNumberOfBits() {
+        return numberOfBits;
+    }
+
+    /**
+     * Gets the number of hash functions used to construct the filter.
+     * This is also known as {@code k}.
+     *
+     * @return the number of hash functions used to construct the filter 
({@code k}).
+     */
+    public int getNumberOfHashFunctions() {
+        return numberOfHashFunctions;
+    }
+
+    /**
+     * Calculates the probability of false positives ({@code p}) given
+     * numberOfItems ({@code n}), numberOfBits ({@code m}) and 
numberOfHashFunctions ({@code k}).
+     * <pre>p = pow(1 - exp(-k / (m / n)), k)</pre>
+     *
+     * <p>This is the probability that a Bloom filter will return true for the 
presence of an item
+     * when it does not contain the item.</p>
+     *
+     * <p>The probability assumes that the Bloom filter is filled with the 
expected number of
+     * items. If the filter contains fewer items then the actual probability 
will be lower.
+     * Thus, this returns the worst-case false positive probability for a 
filter that has not
+     * exceeded its expected number of items.</p>
+     *
+     * @param numberOfItems the number of items hashed into the Bloom filter.
+     * @return the probability of false positives.
+     */
+    public double getProbability(int numberOfItems) {
+        if (numberOfItems < 0) {
+            throw new IllegalArgumentException("Number of items must be 
greater than or equal to 0: " + numberOfItems);
+        }
+        if (numberOfItems == 0) {
+            return 0;
+        }
+        return Math.pow(1.0 - Math.exp(-1.0 * numberOfHashFunctions * 
numberOfItems / numberOfBits),
+                numberOfHashFunctions);
+    }
+
+    @Override
+    public String toString() {
+        return String.format("Shape[ m=%s k=%s ]", numberOfBits, 
numberOfHashFunctions);
+    }
+
+    /**
+     * Determines if a cardinality is sparse based on the shape.
+     * <p>This method assumes that BitMaps are 64bits and indexes are 32bits.  
If the memory
+     * necessary to store the cardinality as indexes is less than the 
estimated memory for BitMaps,
+     * the cardinality is determined to be {@code sparse}.</p>
+     * @param cardinality the cardinality to check.
+     * @return true if the cardinality is sparse within the shape.
+     */
+    public boolean isSparse(int cardinality) {
+        /*
+         * Since the size of a BitMap is a long and the size of an index is an 
int,
+         * there can be 2 indexes for each bitmap. In Bloom filters indexes 
are evenly
+         * distributed across the range of possible values, Thus if the 
cardinality
+         * (number of indexes) is less than or equal to 2*number of BitMaps the
+         * cardinality is sparse within the shape.
+         */
+        return cardinality <= (BitMap.numberOfBitMaps(getNumberOfBits()) * 2);
+    }
+
+    /**
+     * Estimate the number of items in a Bloom filter with this shape and the 
specified number of bits enabled.
+     *
+     * <p><em>Note:</em></p>
+     * <ul>
+     * <li> if hammingValue == numberOfBits, then result is infinity.</li>
+     * <li> if hammingValue &gt; numberOfBits, then result is NaN.</li>
+     * </ul>
+     *
+     * @param hammingValue the number of enabled  bits.
+     * @return An estimate of the number of items in the Bloom filter.
+     */
+    public double estimateN(int hammingValue) {
+        double c = hammingValue;
+        double m = numberOfBits;
+        double k = numberOfHashFunctions;
+        return -(m / k) * Math.log(1.0 - (c / m));
+    }
+
+    /**
+     * The factory to assist in the creation of proper Shapes.
+     *
+     * In the methods of this factory the `from` names are appended with the 
standard variable
+     * names in the order expected:
+     *
+     * <dl>
+     * <dt>{@code N})</dt><dd>The number of items to be placed in the Bloom 
filter</dd>
+     * <dt>{@code M})</dt><dd>The number of bits in the Bloom filter</dd>
+     * <dt>{@code K})</dt><dd>The number of hash functions for each item 
placed in the Bloom filter</dd>
+     * <dt>{@code P})</dt><dd>The probability of a collision once N items have 
been placed in the Bloom filter</dd>
+     * </dl>
+     */
+    public static class Factory {
+
+        /**
+         * The natural logarithm of 2. Used in several calculations. 
Approximately 0.693147180559945.
+         */
+        private static final double LN_2 = Math.log(2.0);
+
+        /**
+         * ln(1 / 2^ln(2)). Used in calculating the number of bits. 
Approximately -0.480453013918201.
+         *
+         * <p>ln(1 / 2^ln(2)) = ln(1) - ln(2^ln(2)) = -ln(2) * ln(2)
+         */
+        private static final double DENOMINATOR = -LN_2 * LN_2;
+
+        /**
+         * Do not instantiate.
+         */
+        private Factory() {
+
+        }
+
+        /**
+         * Constructs a filter configuration with a desired false-positive 
probability ({@code p}) and the
+         * specified number of bits ({@code m}) and hash functions ({@code k}).
+         *
+         * <p>The number of items ({@code n}) to be stored in the filter is 
computed.
+         * <pre>n = ceil(m / (-k / ln(1 - exp(ln(p) / k))))</pre>
+         *
+         * <p>The actual probability will be approximately equal to the
+         * desired probability but will be dependent upon the calculated Bloom 
filter capacity
+         * (number of items). An exception is raised if this is greater than 
or equal to 1 (i.e. the
+         * shape is invalid for use as a Bloom filter).
+         *
+         * @param probability The desired false-positive probability in the 
range {@code (0, 1)}
+         * @param numberOfBits The number of bits in the filter
+         * @param numberOfHashFunctions The number of hash functions in the 
filter
+         * @return a valid Shape.
+         * @throws IllegalArgumentException if the desired probability is not 
in the range {@code (0, 1)},
+         * {@code numberOfBits < 1}, {@code numberOfHashFunctions < 1}, or the 
actual
+         * probability is {@code >= 1.0}
+         */
+        public static Shape fromPMK(final double probability, final int 
numberOfBits, final int numberOfHashFunctions) {
+            checkProbability(probability);
+            checkNumberOfBits(numberOfBits);
+            checkNumberOfHashFunctions(numberOfHashFunctions);
+
+            // Number of items (n):
+            // n = ceil(m / (-k / ln(1 - exp(ln(p) / k))))
+            final double n = Math.ceil(numberOfBits
+                    / (-numberOfHashFunctions / Math.log(1 - 
Math.exp(Math.log(probability) / numberOfHashFunctions))));
+
+            // log of probability is always < 0
+            // number of hash functions is >= 1
+            // e^x where x < 0 = [0,1)
+            // log 1-e^x = [log1, log0) = <0 with an effective lower limit of 
-53

Review comment:
       Added test cases to verify extreme value calculations. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to