Re: New Sub-project Proposal.

Claude Warren Wed, 11 Sep 2019 08:07:21 -0700

First it is important to remember that Bloom filters tell you where things
are NOT.  Second it is important to understand that Bloom filters can give
false positives but never false negatives.  Seems kind of pointless I know
but consider the case where you have 10K buckets that may contain the item
you are looking for.  If you can reduce the number of buckets you are
searching you can significantly speed up the search.  In a case like this a
bloom filter could be used "in front" of each bucket as a gatekeeper.  When
ever an object goes in the bucket the objects bloom filter is added to the
bucket bloom filter.  If you want to search the 10K buckets for an item
then you build the  bloom filter for the item you are looking for and check
the bloom filter on each bucket.  If the filter says that the item is not
in the bucket then you can skip that bucket, if the filter says it is in
the bucket you search the bucket to verify that it is not a false
positive.  A common use for bloom filters is to determine if an expensive
call should be made.  For example many browsers have a bloom filter that
comprises all the known bad URLs (ones that serve malware, etc).  When the
URL is entered in the browser it is checked against the bloom filter.  If
it is not there the request goes through as normal.  If it is there then
the browser makes the expensive lookup call to a server to determine if the
URL really is in the database of bad URLs.

So a bloom filter is generally used to front a collection to determine if
the collection should be searched.  And as has been pointed out it doesn't
make much sense to use it in front of an in-memory hash table.  However,
applications like Cassandra and Hadoop use bloom filters for various
reasons.  I have recently been made aware of an Apache Incubator project
that wants to implement bloom filters as part of their project.  Other uses
for bloom filters include sharding data.  There is a measure of difference
between filters called a hamming distance.  This is the number of bits that
have to be "flipped" to turn one filter into another, and is very similar
to Hamming measures found in string and other similar comparisons.  By
using the hamming value it is possible to distribute data among a set of
buckets by simply putting the value in the bucket that it is "closest" to
in terms of Hamming distance.  Searcing takes place as noted above.
However this has some interesting properties.  For example you can add new
buckets at any time simply by adding an empty bucket and bloom filter to
the collection of buckets and the system will start filling the bucket as
appropriate.  In addition if a bucket/shard becomes "full", where "full" is
an implementation dependent decision (e.g. the index on a DB table reaches
the inflection point where performance degradation begins), you can pull a
bucket out of consideration for inserts but still search it without
significant stress or change to the system.

Internally Bloom filters are bit vectors.  The length of the vector being
determined by the number of items that are to be placed in the bucket and
the acceptable hash collision rate.  There is a function that will
calculate the length of the vector and the number of  functions to use to
turn on the bits.[1]  In general you build a bloom filter by creating a
hash and using the modulus of that to determine which bit in the vector to
turn on.  You then furn a second hash, usually the same hash function with
a different seed to determine the next bit and so on until the number of
functions has been executed.  Importantly, there are comments tin the
Cassandra code that describe a new and much faster way of doing this using
128-bit hashes and splitting them into a pair of longs.  To check
membership in a bloom filter you buid the bloom-filter for the target (T -
the thing we are looking for) and get the filter for the candidate (C - the
bucket) and evaluate T&C = T
if it evaluates as true there is a match if it not then T is guaranteed not
to be in the bucket.

There are several of us at Apache that work on bloom filters and we have
been unable to locate an open source library that is under the Apache or
similar license.  I have done work on a concept call a proto-bloom filter
that does the hashing early and then makes it faster to generated concrete
bloom filters of various sizes, thus enabling a more efficient layering of
filters.

Several of us have done research into ways to index filters so that if you
have a collection of filters you can quickly locate the candidates.  This
is not as simple as it sounds due to the way in which filters are checked
and the issues with over filled filters yielding high false positive
counts, in addition the check is so fast that the over head for most
indexing eats up any increase in speed. My research has shown that for
filter collections of less than 1000 it is always faster to do a linear
search through an array than any other means.  Above 1000 entries there are
techniques that can yield faster evaluation in some cases.

The long and short of this is that there is no good unencumbered open
source library available at the current time.  Myself and several others,
in conversation here at ApacheCon, have expressed interest in creating such
a library.  We have fairly mature code that we are willing to contribute
along with code that embodies new thinking in the bloom filter arena (like
proto-bloom filters).  We just need a space within the Apache family to
host it.  The code base seems to small to be a separate project and so we
come to Apache Commons seeking a home.

Claude

[1] https://hur.st/bloomfilter/

On Wed, Sep 11, 2019 at 12:36 PM Gary Gregory <garydgreg...@gmail.com>
wrote:

> I would like to know more. I am curious since looking up whether an element
> is in a set is done via a hash code. How do you do better than that?
>
> Gary
>
> On Tue, Sep 10, 2019, 16:51 Bruno P. Kinoshita <ki...@apache.org> wrote:
>
> >  +1 Collections sounds like a good place for a bloom filter.
> >
> > Bruno
> >
> >     On Wednesday, 11 September 2019, 8:00:45 am NZST, Jochen Wiedmann <
> > jochen.wiedm...@gmail.com> wrote:
> >
> >  Hi, Claude,
> >
> > having read, what a bloom filter is, a subproject sounds unnecessary
> > to me. I'd recommend, that you contribute your code to Commons
> > Collections, which seems to me to be a logical target.
> >
> > Jochen
> >
> > On Tue, Sep 10, 2019 at 8:45 PM Claude Warren <cla...@xenei.com> wrote:
> > >
> > > Having spoken with several people at ApacheCon, I would like to see a
> > > bloomfilter sub project.  I have code that is already under Apache
> > License
> > > that I am willing to contribute as the basis The goal of the
> sub-project
> > > would be to produce a reference implementation that could be used by
> > other
> > > projects that desire to have use bloom filters and bloom filter based
> > > collections.
> > >
> > > Is there any objection to doing this?  Other than asking here, what is
> > the
> > > proper path to get a sub-project created,  What does the Commons PMC
> > > require?
> > >
> > > Any assistance and comments would be apprecieated.
> > > Claude
> > >
> > > --
> > > I like: Like Like - The likeliest place on the web
> > > <http://like-like.xenei.com>
> > > LinkedIn: http://www.linkedin.com/in/claudewarren
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> > For additional commands, e-mail: dev-h...@commons.apache.org
> >
> >
>

-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: New Sub-project Proposal.

Reply via email to