[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960025#comment-16960025
 ] 

Alex Herbert commented on COLLECTIONS-728:
------------------------------------------

Following on from a conversation had on GitHub for the initial BloomFilter code 
contribution PR the example there was a Bloom filter created for 10,000 objects 
with a false positive probability of 1/50000. This requires a Bloom filter of 
225,200 bits with 16 bits set per object hash. Thus even though the Bloom 
filter may require a lot of storage to represent the combination of thousands 
of objects each object only requires a very small amount of storage. In 
addition you only need to find a single bit index in the object hash that is 
not in the Bloom filter to determine that there is no match.

Thus the common use case of checking if an object matches the filter is to 
compute a relatively small set of indices and check each one is in the Bloom 
filter. This would suggest the following:
{code:java}
public interface BloomFilter {
    public interface BitIterator {
        /**
         * Get the next set bit, or -1 if no more bits are set.
         *
         * @return the bit index (or -1)
         */
        int nextSetBit();

        static BitIterator of(int... bits) {
            return new BitIterator() {
                int count;

                @Override
                public int nextSetBit() {
                    if (count < bits.length) {
                        return bits[count++];
                    }
                    return -1;
                }
            }
        }
    }

    default boolean match(BitIterator bits) {
        int index = bits.nextSetBit();
        if (index == -1) {
            // Cannot match no indices
            return false;
        }
        while (index != -1) {
            if (!get(index)) {
                return false;
            }
        }
        return true;
    }

    boolean get(int bitIndex);
}{code}
Thus the Bloom filter satisfies the following:
 # A BloomFilter is a representation of the logical OR (combination) of many 
hashes.
 # A hash is a specification of indices that are set to 1.
 # A BloomFilter can answer the question: is bit index _n_ set?
 # A BloomFilter can answer the question: are all bit indices _N_ set?
 # A "hasher" is able to convert an object to a set of indices.
 # A comparison of each index in turn opens the possibility of dynamic 
computation of each hash index with fail fast matching.

To use a BloomFilter would require an implementation of hashing that generates 
a number of indices for your objects and a concrete representation of the 
BloomFilter composite of many hashes.

This is just an idea and I can see that if constant time access to the method 
{{get(int)}} is not available then the filter would be slow. It is an 
alternative and can be combined with the current idea to build an entire 
BloomFilter for each new object and perform a match of two BloomFilters:
{code:java}
public interface BloomFilter {
    public interface BitIterator {
        /**
         * Get the next set bit, or -1 if no more bits are set.
         *
         * @return the bit index (or -1)
         */
        int nextSetBit();
    }

    boolean match(BloomFilter other);

    boolean match(BitIterator bits);
}
{code}
 

> BloomFilter contribution
> ------------------------
>
>                 Key: COLLECTIONS-728
>                 URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
>             Project: Commons Collections
>          Issue Type: Task
>            Reporter: Claude Warren
>            Priority: Minor
>         Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to