(commons-collections) branch master updated: [Collections-857] update bloom filter documentation (#508)

ggregory Wed, 03 Jul 2024 06:14:18 -0700

This is an automated email from the ASF dual-hosted git repository.

ggregory pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/commons-collections.git



The following commit(s) were added to refs/heads/master by this push:
     new 812539116 [Collections-857] update bloom filter documentation (#508)
812539116 is described below

commit 81253911656a0723944e686efe5e634a9a5f59ef
Author: Claude Warren <cla...@apache.org>
AuthorDate: Wed Jul 3 14:13:48 2024 +0100

    [Collections-857] update bloom filter documentation (#508)
    
    * clarification and links
    
    * updated documentation
---
 .../collections4/bloomfilter/package-info.java     |  72 +++++----
 src/site/markdown/bloomFilters/advanced.md         | 174 ++++++++++++++++++++-
 src/site/markdown/bloomFilters/implementation.md   |  21 ++-
 src/site/markdown/bloomFilters/intro.md            |  14 +-
 src/site/markdown/bloomFilters/multidimensional.md |   4 +-
 5 files changed, 230 insertions(+), 55 deletions(-)

diff --git 
a/src/main/java/org/apache/commons/collections4/bloomfilter/package-info.java 
b/src/main/java/org/apache/commons/collections4/bloomfilter/package-info.java
index d251fc525..e2b71f3b7 100644
--- 
a/src/main/java/org/apache/commons/collections4/bloomfilter/package-info.java
+++ 
b/src/main/java/org/apache/commons/collections4/bloomfilter/package-info.java
@@ -32,7 +32,7 @@
  * list. There are lots of other uses, and in most cases the reason is to 
perform a fast check as a gateway for a longer
  * operation.</p>
  *
- * <p>Some Bloom filters (e.g. {@link CountingBloomFilter}) use counters 
rather than bits. In this case each counter
+ * <p>Some Bloom filters (e.g. {@link 
org.apache.commons.collections4.bloomfilter.CountingBloomFilter}) use counters 
rather than bits. In this case each counter
  * is called a <em>cell</em>.</p>
  *
  * <h3>BloomFilter</h3>
@@ -46,19 +46,25 @@
  * <ul>
  *     <li><em>bit map</em> - In the {@code bloomfilter} package a <em>bit 
map</em> is not a structure but a logical construct.  It is conceptualized
  *     as an ordered collection of {@code long} values each of which is 
interpreted as the enabled true/false state of 64 continuous indices.  The 
mapping of
- *     bits into the {@code long} values is described in the {@link BitMaps} 
Javadoc.</li>
+ *     bits into the {@code long} values is described in the {@link 
org.apache.commons.collections4.bloomfilter.BitMaps} Javadoc.</li>
  *
  *     <li><em>index</em> - In the {@code bloomfilter} package an Index is a 
logical collection of {@code int}s specifying the enabled
  *     bits in the bit map.</li>
  *
- *     <li><em>cell</em> - Some Bloom filters (e.g. {@link 
CountingBloomFilter}) use counters rather than bits.  In the {@code 
bloomfilter} package
- *     Cells are pairs of ints representing an index and a value.  They are 
not {@code Pair} objects.  </li>
+ *     <li><em>cell</em> - Some Bloom filters (e.g. {@link 
org.apache.commons.collections4.bloomfilter.CountingBloomFilter}) use counters 
rather than bits.  In the {@code bloomfilter} package
+ *     Cells are pairs of ints representing an index and a value.  They are 
not the standard Java {@code Pair} objects,
+ *     nor the Apache Commons Lang version either.</li>
  *
- *     <li><em>extractor</em> - The extractors are {@link 
FunctionalInterface}s that are conceptually iterators on a bit map, an 
<em>index</em>, or a
+ *     <li><em>extractor</em> - The extractors are {@link 
java.lang.FunctionalInterface}s that are conceptually iterators on a bit map, 
an <em>index</em>, or a
  *     collection of <em>cells</em>, with an early termination switch.  
Extractors have
- *     names like {@link 
org.apache.commons.collections4.bloomfilter.BitMapExtractor} or {@link 
org.apache.commons.collections4.bloomfilter.IndexExtractor} and have a {@code 
processXs} methods that take a
- *     {@code Predicate<X>} argument (e.g. {@link 
org.apache.commons.collections4.bloomfilter.BitMapExtractor#processBitMaps(java.util.function.LongPredicate)}
 or {@code processIndicies(IntPredicate)}).
- *     That predicate is expected to process each of the Xs in turn and return 
{@code true} if the processing should continue
+ *     names like {@link 
org.apache.commons.collections4.bloomfilter.BitMapExtractor} or
+ *     {@link org.apache.commons.collections4.bloomfilter.IndexExtractor} and 
have a {@code processXs} methods that take a
+ *     type specialization of {@link java.util.function.Predicate}.
+ *     {@code Predicate} type argument.
+ *     (e.g. {@link 
org.apache.commons.collections4.bloomfilter.BitMapExtractor#processBitMaps(java.util.function.LongPredicate)},
+ *     {@link 
org.apache.commons.collections4.bloomfilter.IndexExtractor#processIndices(java.util.function.IntPredicate)},
+ *     and {@link 
org.apache.commons.collections4.bloomfilter.CellExtractor#processCells(org.apache.commons.collections4.bloomfilter.CellExtractor.CellPredicate)}).
+ *     The predicate is expected to process each of the Xs in turn and return 
{@code true} if the processing should continue
  *     or {@code false} to stop it. </li>
  * </ul>
  *
@@ -69,66 +75,72 @@
  * <h4>Implementation Notes</h4>
  *
  * <p>The architecture is designed so that the implementation of the storage 
of bits is abstracted. Rather than specifying a
- * specific state representation we require that all Bloom filters implement 
the {@link org.apache.commons.collections4.bloomfilter.BitMapExtractor} and 
{@link org.apache.commons.collections4.bloomfilter.IndexExtractor} interfaces,
+ * specific state representation we require that all Bloom filters implement 
the {@link org.apache.commons.collections4.bloomfilter.BitMapExtractor}
+ * and {@link org.apache.commons.collections4.bloomfilter.IndexExtractor} 
interfaces,
  * Counting-based Bloom filters implement {@link 
org.apache.commons.collections4.bloomfilter.CellExtractor} as well.  There are 
static
  * methods in the various Extractor interfaces to convert from one type to 
another.</p>
  *
- * <p>Programs that utilize the Bloom filters may use the {@link 
org.apache.commons.collections4.bloomfilter.BitMapExtractor} or {@link 
org.apache.commons.collections4.bloomfilter.IndexExtractor} to retrieve
+ * <p>Programs that utilize the Bloom filters may use the {@link 
org.apache.commons.collections4.bloomfilter.BitMapExtractor}
+ * or {@link org.apache.commons.collections4.bloomfilter.IndexExtractor} to 
retrieve
  * or process a representation of the internal structure.
- * Additional methods are available in the {@link 
org.apache.commons.collections4.bloomfilter.BitMaps} class to assist in 
manipulation of bit map representations.</p>
+ * Additional methods are available in the {@link 
org.apache.commons.collections4.bloomfilter.BitMaps} class to assist in
+ * manipulation of bit map representations.</p>
  *
  * <p>The Bloom filter is an interface that requires implementation of 9 
methods:</p>
  * <ul>
- * <li>{@link BloomFilter#cardinality()} returns the number of bits enabled in 
the Bloom filter.</li>
+ * <li>{@link 
org.apache.commons.collections4.bloomfilter.BloomFilter#cardinality()} returns 
the number of bits enabled in the Bloom filter.</li>
  *
- * <li>{@link BloomFilter#characteristics()} which returns an integer of 
characteristics flags.</li>
+ * <li>{@link 
org.apache.commons.collections4.bloomfilter.BloomFilter#characteristics()} 
which returns an integer of characteristics flags.</li>
  *
- * <li>{@link BloomFilter#clear()} which resets the Bloomfilter to its initial 
empty state.</li>
+ * <li>{@link org.apache.commons.collections4.bloomfilter.BloomFilter#clear()} 
which resets the Bloomfilter to its initial empty state.</li>
  *
- * <li>{@link BloomFilter#contains(IndexExtractor)} which returns true if the 
bits specified by the indices generated by
- * IndexExtractor are enabled in the Bloom filter.</li>
+ * <li>{@link 
org.apache.commons.collections4.bloomfilter.BloomFilter#contains(IndexExtractor)}
 which returns true if the bits specified
+ * by the indices generated by IndexExtractor are enabled in the Bloom 
filter.</li>
  *
- * <li>{@link BloomFilter#copy()} which returns a fresh copy of the 
bitmap.</li>
+ * <li>{@link org.apache.commons.collections4.bloomfilter.BloomFilter#copy()} 
which returns a fresh copy of the bitmap.</li>
  *
- * <li>{@link BloomFilter#getShape()} which returns the shape the Bloom filter 
was created with.</li>
+ * <li>{@link 
org.apache.commons.collections4.bloomfilter.BloomFilter#getShape()} which 
returns the shape the Bloom filter was created with.</li>
  *
- * <li>{@link BloomFilter#merge(BitMapExtractor)} which merges the BitMaps 
from the BitMapExtractor into the internal
- * representation of the Bloom filter.</li>
+ * <li>{@link 
org.apache.commons.collections4.bloomfilter.BloomFilter#merge(BitMapExtractor)} 
which merges the BitMaps from the BitMapExtractor
+ * into the internal representation of the Bloom filter.</li>
  *
- * <li>{@link BloomFilter#merge(IndexExtractor)} which merges the indices from 
the IndexExtractor into the internal
- * representation of the Bloom filter.</li>
+ * <li>{@link 
org.apache.commons.collections4.bloomfilter.BloomFilter#merge(IndexExtractor)} 
which merges the indices from the IndexExtractor
+ * into the internal representation of the Bloom filter.</li>
  * </ul>
  *
  * <p>Other methods should be implemented where they can be done so more 
efficiently than the default implementations.</p>
  *
  * <h3>CountingBloomFilter</h3>
  *
- * <p>The {@link 
org.apache.commons.collections4.bloomfilter.CountingBloomFilter} extends the 
Bloom filter by counting the number of times a specific bit has been
+ * <p>The {@link 
org.apache.commons.collections4.bloomfilter.CountingBloomFilter} extends the 
Bloom filter by counting the number
+ * of times a specific bit has been
  * enabled or disabled. This allows the removal (opposite of merge) of Bloom 
filters at the expense of additional
  * overhead.</p>
  *
  * <h3>LayeredBloomFilter</h3>
  *
- * <p>The {@link 
org.apache.commons.collections4.bloomfilter.LayeredBloomFilter} extends the 
Bloom filter by creating layers of Bloom filters that can be queried as a single
+ * <p>The {@link 
org.apache.commons.collections4.bloomfilter.LayeredBloomFilter} extends the 
Bloom filter by creating layers of Bloom
+ * filters that can be queried as a single
  * Filter or as a set of filters. This adds the ability to perform windowing 
on streams of data.</p>
  *
  * <h3>Shape</h3>
  *
- * <p>The {@link org.apache.commons.collections4.bloomfilter.Shape} describes 
the Bloom filter using the number of bits and the number of hash functions.  It 
can be specified
+ * <p>The {@link org.apache.commons.collections4.bloomfilter.Shape} describes 
the Bloom filter using the number of bits and the number
+ * of hash functions.  It can be specified
  * by the number of expected items and desired false positive rate.</p>
  *
  * <h3>Hasher</h3>
  *
- * <p>A {@link org.apache.commons.collections4.bloomfilter.Hasher} converts 
bytes into a series of integers based on a Shape. Each hasher represents one 
item being added
+ * <p>A {@link org.apache.commons.collections4.bloomfilter.Hasher} converts 
bytes into a series of integers based on a Shape.
+ * Each hasher represents one item being added
  * to the Bloom filter.</p>
  *
- * <p>The {@link 
org.apache.commons.collections4.bloomfilter.EnhancedDoubleHasher} uses a 
combinatorial generation technique to create the integers. It is easily
+ * <p>The {@link 
org.apache.commons.collections4.bloomfilter.EnhancedDoubleHasher} uses a 
combinatorial generation technique to
+ * create the integers. It is easily
  * initialized by using a byte array returned by the standard {@link 
java.security.MessageDigest} or other hash function to
  * initialize the Hasher. Alternatively, a pair of a long values may also be 
used.</p>
  *
- * <p>Other implementations of the {@link 
org.apache.commons.collections4.bloomfilter.Hasher} are easy to implement, and 
should make use of the {@code Hasher.Filter}
- * and/or {@code Hasher.FileredIntConsumer} classes to filter out duplicate 
indices when implementing
- * {@code Hasher.uniqueIndices(Shape)}.</p>
+ * <p>Other implementations of the {@link 
org.apache.commons.collections4.bloomfilter.Hasher} are easy to implement.</p>
  *
  * <h2>References</h2>
  *
diff --git a/src/site/markdown/bloomFilters/advanced.md 
b/src/site/markdown/bloomFilters/advanced.md
index 6090eeb26..97a0bc7f1 100644
--- a/src/site/markdown/bloomFilters/advanced.md
+++ b/src/site/markdown/bloomFilters/advanced.md
@@ -220,9 +220,9 @@ However, the issue with this solution is that once the 
filters are saturated, th
 ```
 The above calculation is dependent upon `BloomFilter.cardinality()` function, 
so it is advisable to use BloomFilters that track cardinality or can calculate 
cardinality quickly.
 
-## Counting Bloom filters
+## Counting Bloom Filters
 
-Standard Bloom filters do not have a mechanism to remove items.  One of the 
solutions for this is to convert each bit to a counter<span><a 
class="footnote-ref" href="#fn1">1</a></span>. The counter and index together 
are commonly called a cell.  As items are added to the filter, the values of 
the cells associated with the enabled bits are incremented.  When an item is 
removed the values of the cells associated with the enabled bits are 
decremented.  This solution supports removal of item [...]
+Standard Bloom filters do not have a mechanism to remove items.  One of the 
solutions for this is to convert each bit to a counter<span><a 
class="footnote-ref" href="#fn1">1</a></span>. The counter and index together 
are commonly called a cell.  As items are added to the filter, the values of 
the cells associated with the enabled bits are incremented.  When an item is 
removed the values of the cells associated with the enabled bits are 
decremented.  This solution supports removal of item [...]
 
 The counting Bloom filter also has a couple of operations not found in other 
Bloom filters:
  * Counting Bloom filters can be added together so that the sum of their 
counts is achieved.
@@ -269,7 +269,7 @@ The stable filter works well in environments where inserts 
occur at a fairly fix
 
 There is no implementation of a stable Bloom filter in commons collections.
 
-### Layered Bloom filter
+### Layered Bloom Filter
 
 The layered Bloom filter<span><a class="footnote-ref" href="#fn3">3</a></span> 
creates a list of filters.  Items are added to a target filter and, after a 
specified condition is met, a new filter is added to the list and it becomes 
the target.  Filters are removed from the list based on other specified 
conditions.
 
@@ -279,7 +279,7 @@ The layered filter also has the capability to locate the 
layer in which a filter
 
 The layered filter can also be used in situations where the actual number of 
items is unknown when the Bloom filter is defined.  By using a layered filter 
that adds a target whenever the saturation of the current target reaches 1, the 
false positive rate for the entire filter does not rise above the value 
calculated for the shape.
 
-### Apache Commons Collections Implementation
+#### Apache Commons Collections Implementation
 
 The Apache Commons Collections Bloom filter package contains a layered Bloom 
filter implementation.  To support the `LayeredBloomFilter` there are a 
`LayerManager` class and a `BloomFilterExtractor` interface.
 
@@ -301,9 +301,173 @@ The `LayeredBloomFilter` class defines several new 
methods:
 * `contains(BloomFilterExtractor others)` - returns true if the layered filter 
contains all of the other filters.
 * `find(BitmapExtractor)`, `find(BloomFilter)`, `find(Hasher)`, and 
`find(IndexExtractor)` - returns an array of ints identifying which layers 
match the pattern.
 * `get(int layer)` - returns the Bloom filter for the layer.
-* `getDepth()` - returns the number of layers in the filter..
+* `getDepth()` - returns the number of layers in the filter.
 * `next()` - forces the the creation of a new layer.
 
+#### Layered Bloom Filter Example
+
+In this example we are building a layered Bloom filter that will 
+* Retain information for a specified period of time: `layerExpiry`.
+* Create a new layer when  
+  * the current layer becomes saturates: `numberOfItems`; or
+  * a timer has expired `layerExpiry / layerCount`.
+
+This construct has the interesting effect of growing and shrinking the number 
of layers as demand grows or shrinks.  If there is a steady flow of items that 
does not exceed \\( \frac{numberOfItems}{layerExpiry / layerCount} \\) then 
there will be `layerCount` layers.  If the rate of entities is higher, then the 
number of layers will grow to handle the rate.  When the rate decreases the 
number of layers will decrease as the items age out `(layerExpiry)`.  If the 
rate drops to zero there w [...]
+
+This implementation is not thread safe and has the effect that the item is 
removed from the filter after `layerExpiry` has elapsed even if it was seen 
again. If the desired effect is to retain the reference until after the last 
time the item was seen then the check needs to verify that the item was seen in 
or near the last filter.
+
+```java
+package org.example;
+
+import org.apache.commons.collections4.bloomfilter.BloomFilter;
+import org.apache.commons.collections4.bloomfilter.Hasher;
+import org.apache.commons.collections4.bloomfilter.LayerManager;
+import org.apache.commons.collections4.bloomfilter.LayeredBloomFilter;
+import org.apache.commons.collections4.bloomfilter.Shape;
+import org.apache.commons.collections4.bloomfilter.SimpleBloomFilter;
+import org.apache.commons.collections4.bloomfilter.WrappedBloomFilter;
+
+import java.time.Duration;
+import java.util.Deque;
+import java.util.Iterator;
+import java.util.function.Consumer;
+import java.util.function.Predicate;
+
+public class LayeredExample {
+    // the layered Bloom filter.
+    private final LayeredBloomFilter<TimestampedBloomFilter> bloomFilter;
+    // the expiry time for a layer
+    private final Duration layerExpiry;
+    // the expected number of layers.
+    private final int layerCount;
+
+    /**
+     * @param numberOfItems the expected number of item in a layer.
+     * @param falsePositiveRate the acceptable false positive rate for the 
filter.
+     * @param layerCount the number of expected layers.
+     * @param layerExpiry The length of time a layer should exist
+     */
+    public LayeredExample(int numberOfItems, double falsePositiveRate, int 
layerCount, Duration layerExpiry) {
+        this.layerCount = layerCount;
+        this.layerExpiry = layerExpiry;
+        Shape shape = Shape.fromNP(numberOfItems, falsePositiveRate);
+
+        // we will create a new Bloom filter (advance) every time the active 
filter in the layered filter becomes full
+        Predicate<LayerManager<TimestampedBloomFilter>> advance = 
LayerManager.ExtendCheck.advanceOnCount(numberOfItems);
+        //  or when the next window should be started.
+        advance = advance.or(new TimerPredicate());
+
+        // the cleanup for the LayerManager determines when to remove a layer.
+        // always remove the empty target.
+        Consumer<Deque<TimestampedBloomFilter>> cleanup = 
LayerManager.Cleanup.removeEmptyTarget();
+        // remove any expired layers from the front of the list.
+        cleanup = cleanup.andThen(
+                lst -> {
+                    if (!lst.isEmpty()) {
+                        long now = System.currentTimeMillis();
+                        Iterator<TimestampedBloomFilter> iter = lst.iterator();
+                        while (iter.hasNext()) {
+                            if (iter.next().expires < now) {
+                                iter.remove();
+                            } else {
+                                break;
+                            }
+                        }
+                    }
+                });
+        
+        LayerManager.Builder<TimestampedBloomFilter> builder = 
LayerManager.builder();
+        // the layer manager for the Bloom filter, performs automatic cleanup 
and advance when necessary.
+        LayerManager<TimestampedBloomFilter> layerManager = 
builder.setExtendCheck(advance)
+                .setCleanup(cleanup)
+                // create a new TimestampedBloomFilter when needed.
+                .setSupplier(() -> new TimestampedBloomFilter(shape, 
layerExpiry.toMillis()))
+                .build();
+        // create the layered bloom filter.
+        bloomFilter = new LayeredBloomFilter<TimestampedBloomFilter>(shape, 
layerManager);
+    }
+
+    // merge a hasher into the bloom filter.
+    public LayeredExample merge(Hasher hasher) {
+        bloomFilter.merge(hasher);
+        return this;
+    }
+
+    /**
+     * Returns true if {@code bf} is found in the layered filter.
+     *
+     * @param bf the filter to look for.
+     * @return true if found.
+     */
+    public boolean contains(BloomFilter bf) {
+        return bloomFilter.contains(bf);
+    }
+
+    /**
+     * Returns true if {@code bf} is found in the layered filter.
+     *
+     * @param hasher the hasher representation to search for.
+     * @return true if found.
+     */
+    public boolean contains(Hasher hasher) {
+        return bloomFilter.contains(hasher);
+    }
+    
+    /**
+     * @return true if there are no entities being tracked for the principal.
+     */
+    public boolean isEmpty() {
+        bloomFilter.cleanup(); // forces clean
+        return bloomFilter.isEmpty();
+    }
+
+    /**
+     * A cleanup() predicate that triggers when a new layer should be created 
based on time
+     * and the desired number of layers.
+     */
+    class TimerPredicate implements 
Predicate<LayerManager<TimestampedBloomFilter>> {
+        long expires;
+
+        TimerPredicate() {
+            expires = System.currentTimeMillis() + layerExpiry.toMillis() / 
layerCount;
+        }
+
+        @Override
+        public boolean test(LayerManager o) {
+            long now = System.currentTimeMillis();
+            if (expires > now) {
+                return false;
+            }
+            expires = now + layerExpiry.toMillis() / layerCount;
+            return true;
+        }
+    }
+    
+    /**
+     * A Bloom filter implementation that has a timestamp indicating when it 
expires.
+     * Used as the Bloom filter for the layer in the LayeredBloomFilter
+     */
+    static class TimestampedBloomFilter extends WrappedBloomFilter {
+        long expires;
+
+        TimestampedBloomFilter(Shape shape, long ttl) {
+            super(new SimpleBloomFilter(shape));
+            expires = System.currentTimeMillis() + ttl;
+        }
+
+        private TimestampedBloomFilter(TimestampedBloomFilter other) {
+            super(other.getWrapped().copy());
+            this.expires = other.expires;
+        }
+
+        @Override
+        public TimestampedBloomFilter copy() {
+            return new TimestampedBloomFilter(this);
+        }
+    }
+}
+```
+
 ## Review
 
 In this post we covered some unusual uses for Bloom filters as well as a 
couple of interesting unusual Bloom filters.  In the next post, we will 
introduce the reference Bloom filter, and delve into multidimensional Bloom 
filters.  We will also show how multidimensional Bloom filters can be used to 
search encrypted data without decrypting.
diff --git a/src/site/markdown/bloomFilters/implementation.md 
b/src/site/markdown/bloomFilters/implementation.md
index 3020c8a2b..0bd48d928 100644
--- a/src/site/markdown/bloomFilters/implementation.md
+++ b/src/site/markdown/bloomFilters/implementation.md
@@ -35,19 +35,19 @@ The `Shape` is constructed from one of five combinations of 
parameters specified
 
 ## The Extractors
 
-The question of how to efficiently externally represent the internal 
representation of a Bloom filter led to the development of two “extractors”.   
One that logically represents an ordered collection of BitMap longs, and the 
other that represents a collection of enabled bits indices.  In both cases the 
extractor feeds each value to a function - called a predicate - that does some 
operation with the value and returns true to continue processing, or false to 
stop processing.
+The question of how to efficiently externally represent the internal 
representation of a Bloom filter led to the development of two “extractors”.   
One that logically represents an ordered collection of bit map longs, and the 
other that represents a collection of enabled bits indices.  In both cases the 
extractor feeds each value to a function - called a predicate - that does some 
operation with the value and returns true to continue processing, or false to 
stop processing.
 
 The extractors allow implementation of different internal storage as long as 
implementation can produce one or both of the standard producers.  Each 
producer has a static method to convert from the other type so only one 
implementation is required of the internal storage.
 
 ### BitMapExtractor
 
-The BitMap extractor is an interface that defines objects that can produce 
BitMap vectors.  BitMap vectors are vectors of long values where the bits are 
enabled as per the `BitMap` class.  All Bloom filters implement this interface. 
 The `BitMapExtractor` has static methods to create a producer from an array of 
long values, as well as from an `IndexExtractor`.  The interface has default 
implementations that convert the extractor into an array of long values, and 
one that executes a LongP [...]
+The `BitMapExtractor` is an interface that defines objects that can produce 
bit map vectors.  Bit map vectors are vectors of long values where the bits are 
enabled as per the `BitMap` class.  All Bloom filters implement this interface. 
 The `BitMapExtractor` has static methods to create an extractor from an array 
of long values, as well as from an `IndexExtractor`.  The interface has default 
implementations that convert the extractor into an array of long values, and 
one that executes a  [...]
 
-The method `processBitMaps(LongPredicate)` is the standard access method for 
the extractor.  Each long value in the BitMap is passed to the predicate in 
order.  Processing continues until the last BitMap is processed or the 
`LongPredicate` returns false.
+The method `processBitMaps(LongPredicate)` is the standard access method for 
the extractor.  Each long value in the bit map is passed to the predicate in 
order.  Processing continues until the last bit map is processed or the 
`LongPredicate` returns false.
 
 ### IndexExtractor
 
-The index extractor produces the index values for the enabled bits in a Bloom 
filter.  All Bloom filters implement this interface.  The `IndexExtractor` has 
static methods to create the extractor from an array of integers or a 
`BitMapExtractor`.  The interface also has default implementations to convert 
the extractor to an array of integers and one that executes an `IntPredicate` 
on each index.
+The `IndexExtractor` produces the index values for the enabled bits in a Bloom 
filter.  All Bloom filters implement this interface.  The `IndexExtractor` has 
static methods to create the extractor from an array of integers or a 
`BitMapExtractor`.  The interface also has default implementations to convert 
the extractor to an array of integers and one that executes an `IntPredicate` 
on each index.
 
 The method `processIndices(IntPredicate)` is the standard access method for 
the extractor.  Each int value in the extractor is passed to the predicate.  
Processing continues until the last int is processed or the `IntPredicate` 
returns false.  The order and uniqueness of the values is not guaranteed.
 
@@ -72,12 +72,11 @@ The constructor accepts either two longs or a byte array 
from a hashing function
 ### Bloom Filter
 
 Now we come to the Bloom filter.  As noted above the `BloomFilter` interface 
implements both the `IndexExtractor` and the `BitMapExtractor`.  The two 
extractors are the external representations of the internal representation of 
the Bloom filter.  Bloom filter implementations are free to store bit values in 
any way deemed fit.  The requirements for bit value storage are:
- * Must be able to produce an IndexExtractor.
- * Must be able to produce a BitMapExtractor.
+ * Must be able to produce an `IndexExtractor` or `BitMapExtractor`.
  * Must be able to clear the values.  Reset the cardinality to zero.
- * Must specify if the filter prefers (is faster creating) the IndexExtractor 
(Sparse characteristic) or the BitMapExtractor (Not sparse characteristic).
- * Must be able to merge hashers, Bloom filters, index extractors, and BitMap 
extractors.  When handling extractors it is often the case that an 
implementation will convert one type of extractor to the other for merging.  
The BloomFilter interface has default implementations for Bloom filter and 
hasher merging.
- * Must be able to determine if the filter contains hashers, Bloom filters, 
index extractors, and BitMap extractors. The BloomFilter interface has default 
implementations for BitMap extractor, Bloom filter and hasher checking.
+ * Must specify if the filter prefers (is faster creating) the 
`IndexExtractor` (Sparse characteristic) or the `BitMapExtractor` (Not sparse 
characteristic).
+ * Must be able to merge hashers, Bloom filters, index extractors, and bit map 
extractors.  When handling extractors it is often the case that an 
implementation will convert one type of extractor to the other for merging.  
The BloomFilter interface has default implementations for Bloom filter and 
hasher merging.
+ * Must be able to determine if the filter contains hashers, Bloom filters, 
index extractors, and bit map extractors. The BloomFilter interface has default 
implementations for bit map extractor, Bloom filter and hasher checking.
  * Must be able to make a deep copy of itself.
  * Must be able to produce its `Shape`.
 
@@ -85,10 +84,10 @@ Several implementations of Bloom filter are provided.  We 
will start by focusing
 
 ### SimpleBloomFilter
 
-The `SimpleBloomFilter` implements its storage as an on-heap array of longs.  
This implementation is suitable for many applications.  It can also serve as a 
good model for how to implement `BitMap` storage in specific off-heap 
situations.
+The `SimpleBloomFilter` implements its storage as an on-heap array of longs.  
This implementation is suitable for many applications.  It can also serve as a 
good model for how to implement bit map storage in specific off-heap situations.
 
 ### SparseBloomFilter
-The `SparseBloomFilter` implements its storage as a TreeSet of Integers.  
Since each bit is randomly selected across the entire [0,shape.numberOfBits) 
range the Sparse Bloom filter only makes sense when \\( \frac{2 \times m}{64} 
\lt k \times n \\).  The reasoning being that every BitMap in the array is 
equivalent to 2 integers in the sparse filter.  The number of integers is the 
number of hash functions times the number of items – this is an overestimation 
due to the number of hash colli [...]
+The `SparseBloomFilter` implements its storage as a TreeSet of Integers.  
Since each bit is randomly selected across the entire [0,shape.numberOfBits) 
range the Sparse Bloom filter only makes sense when \\( \frac{2m}{64} \lt kn 
\\).  The reasoning being that every bit map in the array is equivalent to 2 
integers in the sparse filter.  The number of integers is the number of hash 
functions times the number of items – this is an overestimation due to the 
number of hash collisions that will [...]
 
 ### Other
 
diff --git a/src/site/markdown/bloomFilters/intro.md 
b/src/site/markdown/bloomFilters/intro.md
index ea14fd880..dec248e8f 100644
--- a/src/site/markdown/bloomFilters/intro.md
+++ b/src/site/markdown/bloomFilters/intro.md
@@ -35,7 +35,7 @@ The equation for matching can be expressed as: \\( T \cap C = 
T \\)
 
 There are several properties that define a Bloom filter: the number of bits in 
the vector (`m`), the number of items that will be merged into the filter 
(`n`), the number of hash functions for each item (`k`), and the probability of 
false positives (`p`).  All of these values are mathematically related. 
Mitzenmacher and Upfal<span><a class="footnote-ref" href="#fn2">2</a></span> 
have shown that the relationship between these properties is
 
-\\[ p = \left( 1 - e^{-kn/m} \rigth) ^k \\]
+\\[ p = \left( 1 - e^{-kn/m} \right) ^k \\]
 
 However, it has been reported  that the false positive rate in real 
deployments is higher than the value given by this equation.<span><a 
class="footnote-ref" href="#fn3">3</a></span><span><a class="footnote-ref" 
href="#fn4">4</a></span> Theoretically it has been proven that the equation 
offered a lower bound of the false positive rate, and a more accurate false 
positive rate has been discovered.<span><a class="footnote-ref" 
href="#fn5">5</a></span>
 
@@ -48,7 +48,7 @@ Bloom filters are often used to reduce the search space. For 
example, consider a
 
 Examples of large Bloom filter collections can be found in 
bioinformatics<span><a class="footnote-ref" href="#fn6">6</a></span><span><a 
class="footnote-ref" href="#fn7">7</a></span><span><a class="footnote-ref" 
href="#fn8">8</a></span> where Bloom filters are used to represent gene 
sequences, and Bloom filter based databases where records are encoded into 
Bloom filters.<span><a class="footnote-ref" href="#fn9">9</a></span><span><a 
class="footnote-ref" href="#fn10">10</a></span>
 
-Medium, the digital publishing company, uses Bloom filters to track what 
articles have been read.<span><a class="footnote-ref" 
href="#fn11">11</a></span>  Bitcoin uses them to speed up clients.<span><a 
class="footnote-ref" href="#fn12">12</a></span>  They have been used to improve 
caching performance<span><a class="footnote-ref" href="#fn13">13</a></span> and 
in detecting malicious websites.<span><a class="footnote-ref" 
href="#fn14">14</a></span>
+Medium, the digital publishing company, uses Bloom filters to track what 
articles have been read.<span><a class="footnote-ref" 
href="#fn11">11</a></span>  Bitcoin uses them to speed up clients.<span><a 
class="footnote-ref" href="#fn12">12</a></span>  Bloom filters have been used 
to improve caching performance<span><a class="footnote-ref" 
href="#fn13">13</a></span> and in detecting malicious websites.<span><a 
class="footnote-ref" href="#fn14">14</a></span>
 
 ## How?
 So, let’s work through an example.  Let's assume we want to put 3 items in a 
filter `(n = 3)` with a 1/5 probability of collision `(p = 1/5 = 0.2)`.  
Solving \\( p = \left( 1 - e^{-kn/m} \right) ^k \\) yields  `m=11` and `k=3`.  
Thomas Hurst has provided an online calculator<span><a class="footnote-ref" 
href="#fn15">15</a></span> where you can explore the interactions between the 
values.
@@ -107,22 +107,22 @@ The Jaccard distance, like the cosine distance, is 
calculated as `1 - Jaccard si
 
 The similarity and distance statistics can be used to group similar Bloom 
filters together; for example when distributing files across a system that uses 
Bloom filters to determine where the file might be located.  In this case it 
might make sense to store Bloom filters in the collection that has minimal 
distance.
 
-In addition to basic similarity and difference, if the shape of the filter is 
known some information about the data behind the filters can be estimated.  For 
example the number of items merged into a filter (n) can be estimated provided 
we have the cardinality (`c`), number of bits in the vector (`m`) and the 
number of hash functions (`k`) used when adding each element.
+In addition to basic similarity and difference, if the shape of the filter is 
known some information about the data behind the filters can be estimated.  For 
example the number of items merged into a filter (`n`) can be estimated 
provided we have the cardinality (`c`), number of bits in the vector (`m`) and 
the number of hash functions (`k`) used when adding each element.
 
 \\[ n = \frac{-m ln(1 - c/m)}{k} \\]
 
-Estimating the size of the union of two filters is simply calculating n for 
the union (bitwise ‘or’) of the two filters.
+Estimating the size of the union of two filters is simply calculating `n` for 
the union (bitwise ‘or’) of the two filters.
 
-Estimating the size of the intersection of two filters is the estimated n of 
the first + the estimated n of the second - the estimated union of the two.  
There are some tricky edge conditions, such as when one or both of the 
estimates of n is infinite.
+Estimating the size of the intersection of two filters is the estimated `n` of 
the first + the estimated `n` of the second - the estimated union of the two.  
There are some tricky edge conditions, such as when one or both of the 
estimates of `n` is infinite.
 
 ## Usage Errors
 There are several places that errors can creep into Bloom filter usage.
 
 ### Saturation errors
 
-Saturation errors arise from under estimating the number of items that will be 
placed into the filter.  Let’s define  “saturation” as the number of items 
merged into a filter divided by the number of items specified in the Shape.  
Then once `n` items have been inserted the filter is at a saturation of 1.  As 
the saturation increases the false positive rate increases.   Using the 
calculation for the false positive rate noted above, we can calculate the 
expected false positive rate for the [...]
+Saturation errors arise from underestimating the number of items that will be 
placed into the filter.  Let’s define  “saturation” as the number of items 
merged into a filter divided by the number of items specified in the Shape.  
Then once `n` items have been inserted the filter is at a saturation of 1.  As 
the saturation increases the false positive rate increases.   Using the 
calculation for the false positive rate noted above, we can calculate the 
expected false positive rate for the  [...]
 
-For Bloom filters defined as k=17 and n=3 the calculation yields m=72, and 
p=0.00001.  As the saturation increases the rates of false positives increase 
as per the following table:
+For Bloom filters defined as `k=17` and `n=3` the calculation yields `m=72`, 
and `p=0.00001`.  As the saturation increases the rates of false positives 
increase as per the following table:
 
 | Saturation | Probability of false positive |
 |------------|-------------------------------|
diff --git a/src/site/markdown/bloomFilters/multidimensional.md 
b/src/site/markdown/bloomFilters/multidimensional.md
index 10f1fdbcc..693a3b4d2 100644
--- a/src/site/markdown/bloomFilters/multidimensional.md
+++ b/src/site/markdown/bloomFilters/multidimensional.md
@@ -22,7 +22,7 @@ We can use reference Bloom filters to index data by storing 
the Bloom filter alo
 
 Searching can be performed by creating a target Bloom filter with partial 
data, for example name and date of birth from the person example, and then 
searching through the list as described above.  The associated records either 
have the name and birthdate or are false positives and need to be filtered out 
during retrieval.
 
-## Multidimensional Bloom filters
+## Multidimensional Bloom Filters
 
 The description above is a multidimensional Bloom filter.  A multidimensional 
Bloom filter is simply a collection of searchable filters, the simplest 
implementation being a list.  In fact, for fewer than 10K filters the list is 
the fastest possible solution.  There are two basic reasons for this:
   * Bloom filter comparisons are extremely fast taking on approximately five 
(5) machine instructions for the simple comparison.
@@ -72,7 +72,7 @@ This solution utilizes the rapidity of the standard list 
solution, while providi
 
 Natural Bloofi uses a Tree structure like Bloofi does except that each node in 
the tree is a filter that was inserted into the index.<span><a 
class="footnote-ref" href="#fn2">2</a></span>  Natural Bloofi operates like the 
sharded list except that if the Bloom filter for a node is contained by a node 
in the list then it is made a child of that node.  If the Bloom filter node 
contains a node in the list, then it becomes the parent of that node.  This 
yields a flat Bloofi tree where the mor [...]
 
-## Encrypted indexing
+## Encrypted Indexing
 
 The idea of using Bloom filters for indexing encrypted data is not a new 
idea.<span><a class="footnote-ref" href="#fn3">3</a></span><span><a 
class="footnote-ref" href="#fn4">4</a></span><span><a class="footnote-ref" 
href="#fn5">5</a></span><span><a class="footnote-ref" href="#fn6">6</a></span>  
  The salient points are that Bloom filters are a very effective one way hash 
with matching capabilities.  The simplest solution is to create a reference 
Bloom filter comprising the plain text of  [...]

(commons-collections) branch master updated: [Collections-857] update bloom filter documentation (#508)

Reply via email to