First it is important to remember that Bloom filters tell you where things are NOT. Second it is important to understand that Bloom filters can give false positives but never false negatives. Seems kind of pointless I know but consider the case where you have 10K buckets that may contain the item you are looking for. If you can reduce the number of buckets you are searching you can significantly speed up the search. In a case like this a bloom filter could be used "in front" of each bucket as a gatekeeper. When ever an object goes in the bucket the objects bloom filter is added to the bucket bloom filter. If you want to search the 10K buckets for an item then you build the bloom filter for the item you are looking for and check the bloom filter on each bucket. If the filter says that the item is not in the bucket then you can skip that bucket, if the filter says it is in the bucket you search the bucket to verify that it is not a false positive. A common use for bloom filters is to determine if an expensive call should be made. For example many browsers have a bloom filter that comprises all the known bad URLs (ones that serve malware, etc). When the URL is entered in the browser it is checked against the bloom filter. If it is not there the request goes through as normal. If it is there then the browser makes the expensive lookup call to a server to determine if the URL really is in the database of bad URLs.
So a bloom filter is generally used to front a collection to determine if the collection should be searched. And as has been pointed out it doesn't make much sense to use it in front of an in-memory hash table. However, applications like Cassandra and Hadoop use bloom filters for various reasons. I have recently been made aware of an Apache Incubator project that wants to implement bloom filters as part of their project. Other uses for bloom filters include sharding data. There is a measure of difference between filters called a hamming distance. This is the number of bits that have to be "flipped" to turn one filter into another, and is very similar to Hamming measures found in string and other similar comparisons. By using the hamming value it is possible to distribute data among a set of buckets by simply putting the value in the bucket that it is "closest" to in terms of Hamming distance. Searcing takes place as noted above. However this has some interesting properties. For example you can add new buckets at any time simply by adding an empty bucket and bloom filter to the collection of buckets and the system will start filling the bucket as appropriate. In addition if a bucket/shard becomes "full", where "full" is an implementation dependent decision (e.g. the index on a DB table reaches the inflection point where performance degradation begins), you can pull a bucket out of consideration for inserts but still search it without significant stress or change to the system. Internally Bloom filters are bit vectors. The length of the vector being determined by the number of items that are to be placed in the bucket and the acceptable hash collision rate. There is a function that will calculate the length of the vector and the number of functions to use to turn on the bits.[1] In general you build a bloom filter by creating a hash and using the modulus of that to determine which bit in the vector to turn on. You then furn a second hash, usually the same hash function with a different seed to determine the next bit and so on until the number of functions has been executed. Importantly, there are comments tin the Cassandra code that describe a new and much faster way of doing this using 128-bit hashes and splitting them into a pair of longs. To check membership in a bloom filter you buid the bloom-filter for the target (T - the thing we are looking for) and get the filter for the candidate (C - the bucket) and evaluate T&C = T if it evaluates as true there is a match if it not then T is guaranteed not to be in the bucket. There are several of us at Apache that work on bloom filters and we have been unable to locate an open source library that is under the Apache or similar license. I have done work on a concept call a proto-bloom filter that does the hashing early and then makes it faster to generated concrete bloom filters of various sizes, thus enabling a more efficient layering of filters. Several of us have done research into ways to index filters so that if you have a collection of filters you can quickly locate the candidates. This is not as simple as it sounds due to the way in which filters are checked and the issues with over filled filters yielding high false positive counts, in addition the check is so fast that the over head for most indexing eats up any increase in speed. My research has shown that for filter collections of less than 1000 it is always faster to do a linear search through an array than any other means. Above 1000 entries there are techniques that can yield faster evaluation in some cases. The long and short of this is that there is no good unencumbered open source library available at the current time. Myself and several others, in conversation here at ApacheCon, have expressed interest in creating such a library. We have fairly mature code that we are willing to contribute along with code that embodies new thinking in the bloom filter arena (like proto-bloom filters). We just need a space within the Apache family to host it. The code base seems to small to be a separate project and so we come to Apache Commons seeking a home. Claude [1] https://hur.st/bloomfilter/ On Wed, Sep 11, 2019 at 12:36 PM Gary Gregory <garydgreg...@gmail.com> wrote: > I would like to know more. I am curious since looking up whether an element > is in a set is done via a hash code. How do you do better than that? > > Gary > > On Tue, Sep 10, 2019, 16:51 Bruno P. Kinoshita <ki...@apache.org> wrote: > > > +1 Collections sounds like a good place for a bloom filter. > > > > Bruno > > > > On Wednesday, 11 September 2019, 8:00:45 am NZST, Jochen Wiedmann < > > jochen.wiedm...@gmail.com> wrote: > > > > Hi, Claude, > > > > having read, what a bloom filter is, a subproject sounds unnecessary > > to me. I'd recommend, that you contribute your code to Commons > > Collections, which seems to me to be a logical target. > > > > Jochen > > > > On Tue, Sep 10, 2019 at 8:45 PM Claude Warren <cla...@xenei.com> wrote: > > > > > > Having spoken with several people at ApacheCon, I would like to see a > > > bloomfilter sub project. I have code that is already under Apache > > License > > > that I am willing to contribute as the basis The goal of the > sub-project > > > would be to produce a reference implementation that could be used by > > other > > > projects that desire to have use bloom filters and bloom filter based > > > collections. > > > > > > Is there any objection to doing this? Other than asking here, what is > > the > > > proper path to get a sub-project created, What does the Commons PMC > > > require? > > > > > > Any assistance and comments would be apprecieated. > > > Claude > > > > > > -- > > > I like: Like Like - The likeliest place on the web > > > <http://like-like.xenei.com> > > > LinkedIn: http://www.linkedin.com/in/claudewarren > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > > For additional commands, e-mail: dev-h...@commons.apache.org > > > > > -- I like: Like Like - The likeliest place on the web <http://like-like.xenei.com> LinkedIn: http://www.linkedin.com/in/claudewarren