Ikai,

Prompted by your request for a link, I took time out this evening to write a
walkthrough of how to use my Bloom filter implementation and combined it
with some very brief build instructions to make this blog post:

http://blog.tomgibara.com/post/14190080839/crinch-bloom-filter

Hopefully others will find it useful, but it was high time I documented it
anyway.

Having just scanned my previous comments, I realize that they might be
dissuasive; that wasn't my intention. I use Bloom filters heavily in a
variety of situations. For example, I recently came up with the idea of
using them to identify which columns in a large dataset contain unique
values: a first pass populates the BloomFilter and possible duplicates are
stored in an in-memory set, which a second pass then uses to verify
whether genuine duplicates exist (incidentally, if anyone knows a better
approach I'm interested, but I was pleased to come up with this one).
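
In rough code, the idea looks something like the sketch below. I'm using
Guava's BloomFilter here purely for illustration rather than my own API, so
treat it as a sketch of the approach rather than a drop-in implementation:

import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.util.HashSet;
import java.util.Set;

public class UniqueColumnCheck {

    // Returns true if every value in the column is distinct. The column is
    // iterated twice, so it only needs to be streamable, not held in memory.
    static boolean columnIsUnique(Iterable<String> columnValues, int expectedRows) {
        BloomFilter<CharSequence> seen = BloomFilter.create(
                Funnels.stringFunnel(Charsets.UTF_8), expectedRows, 0.01);
        Set<String> candidates = new HashSet<String>();

        // Pass 1: put() returns false when the value *may* have been added
        // before, so such values become candidate duplicates (some of these
        // are just false positives from the filter).
        for (String value : columnValues) {
            if (!seen.put(value)) {
                candidates.add(value);
            }
        }
        if (candidates.isEmpty()) return true;

        // Pass 2: track only the candidates exactly; a second sighting of
        // any candidate is a genuine duplicate.
        Set<String> seenOnce = new HashSet<String>();
        for (String value : columnValues) {
            if (candidates.contains(value) && !seenOnce.add(value)) {
                return false;
            }
        }
        return true;
    }
}

Only the candidate set is held exactly, which is what keeps the in-memory
footprint small when duplicates really are rare.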

On the subject of computer science-y solutions, if Bloom filters provide a
weakened abstraction of a set, then compact approximators provide an
equivalent abstraction for maps. Unfortunately they have nowhere near the
popularity of Bloom filters, even though it's possible to find all sorts of
opportunities for them to 'bound' queries over big data.

http://blog.tomgibara.com/post/664922709/compact-approximators

For example, in the problem initially posed, a compact approximator could
be used to return an actual upper bound on the number of matches returned
instead of simply whether a match exists or not. This comes at the expense
of greater memory overhead, and whether it's worthwhile naturally depends
on the application.
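
To make the 'bound' concrete, here is a minimal sketch of one way to get that
behaviour. Strictly it is a count-min-style structure over counts rather than
necessarily the construction described in the post, but it has the property
that matters here: a lookup can never under-report, so the result is a true
upper bound.

import java.util.Random;

// A tiny count-min-style approximator: every occurrence of a key bumps k
// counters, and a lookup takes the minimum of those counters, which can
// only ever over-count, never under-count.
public class UpperBoundApproximator {

    private final int[][] cells;  // k rows of m counters
    private final int[] seeds;    // one hash seed per row

    public UpperBoundApproximator(int k, int m) {
        cells = new int[k][m];
        seeds = new int[k];
        Random r = new Random(42);
        for (int i = 0; i < k; i++) seeds[i] = r.nextInt();
    }

    // Toy per-row hashing; a real implementation would use stronger,
    // independent hash functions.
    private int index(String key, int row) {
        int h = key.hashCode() ^ seeds[row];
        h ^= (h >>> 16);
        return (h & 0x7fffffff) % cells[row].length;
    }

    // Record one occurrence (e.g. one entity matching the key).
    public void add(String key) {
        for (int i = 0; i < cells.length; i++) {
            cells[i][index(key, i)]++;
        }
    }

    // An upper bound on how many times the key was added; zero is a
    // guaranteed miss, exactly as with a Bloom filter.
    public int upperBound(String key) {
        int min = Integer.MAX_VALUE;
        for (int i = 0; i < cells.length; i++) {
            min = Math.min(min, cells[i][index(key, i)]);
        }
        return min;
    }
}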

Tom.

On 13 December 2011 19:33, Ikai Lan (Google) <ika...@google.com> wrote:

> Tom,
>
> Can you link directly to the Bloom Filter implementation? It's not obvious
> from the main project page how to find it. Any time we can get people to
> think about computer science-y solutions to big data problems instead of
> just subscribing to RDBMS dogma, we win =).
>
> While bloom filters are not a solution for everything, there are places
> where they would greatly improve performance. One place where you'd likely
> see huge benefits from Bloom filters is in N x N comparison-type
> operations. Here's an example from RapLeaf:
>
> http://blog.rapleaf.com/dev/2007/09/05/bloomfilter/
>
> By doing a single pass over all entities to build the initial filter, when
> you cross-check on the second pass, you can do the same operation in much
> less time if you know that matches tend to be sparse (for other folks
> reading this thread that are new to the concept: this means spread out -
> most comparison operations will be NO MATCH). In the N x N hash comparison
> example, you know that you are only going to do a write in the first phase
> with no deletes, and you know you will only be reading the filter later on
> so you can use local memory.
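
In code, the shape of that cross-check is roughly the following (again, I'm
using Guava's BloomFilter purely for illustration, with string keys standing
in for whatever identifies an entity):

import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.util.ArrayList;
import java.util.List;

public class SparseCrossCheck {

    // Phase 1: a single write-only pass over one side builds the filter.
    static BloomFilter<CharSequence> buildFilter(Iterable<String> keys, int expectedSize) {
        BloomFilter<CharSequence> filter = BloomFilter.create(
                Funnels.stringFunnel(Charsets.UTF_8), expectedSize, 0.02);
        for (String key : keys) {
            filter.put(key);
        }
        return filter; // read-only from here on, so it can sit in local memory
    }

    // Phase 2: only keys the filter admits are worth the expensive comparison;
    // when matches are sparse, almost every key is rejected here for free.
    static List<String> possibleMatches(Iterable<String> otherKeys,
                                        BloomFilter<CharSequence> filter) {
        List<String> candidates = new ArrayList<String>();
        for (String key : otherKeys) {
            if (filter.mightContain(key)) {
                candidates.add(key); // includes the occasional false positive
            }
        }
        return candidates; // verify these with the real comparison
    }
}
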
>
> --
> Ikai Lan
> Developer Programs Engineer, Google App Engine
> plus.ikailan.com | twitter.com/ikai
>
>
>
> On Mon, Dec 12, 2011 at 5:07 PM, Tom Gibara <tomgib...@gmail.com> wrote:
>
>> I've used Bloom filters for a number of purposes, including the scenario
>> you describe - though not on App Engine.
>>
>> Bloom filters are an extremely versatile tool, but are not a panacea.
>> Firstly, they aren't as compact as one might expect. By my reckoning, a 1MB
>> Bloom filter will accommodate 1M entries with a false positive rate of
>> around 2%. A well-implemented trie with a suitable set of keys can approach
>> similar levels of compactness.
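
(For anyone who wants to check that reckoning: the standard approximation
p = (1 - e^(-kn/m))^k with the optimal number of hash functions works out at
roughly the quoted 2% (about 1.8%) for those figures.)

// Back-of-envelope check of the 1MB / 1M entries / ~2% claim.
public class BloomSizing {
    public static void main(String[] args) {
        double m = 8.0 * 1024 * 1024;  // 1MB of filter = 8,388,608 bits
        double n = 1000000;            // entries
        int k = (int) Math.round((m / n) * Math.log(2)); // optimal hash count
        double p = Math.pow(1 - Math.exp(-k * n / m), k);
        System.out.printf("k = %d, false positive rate ~ %.1f%%%n", k, p * 100);
        // prints: k = 6, false positive rate ~ 1.8%
    }
}
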
>>
>> Secondly, Bloom filters are obviously sensitive to their initial sizing;
>> though they degrade gracefully, in some applications it can be difficult to
>> resize them in-situ without causing unacceptable spikes in throughput.
>>
>> Finally, just in case you hadn't considered it: Bloom filters don't
>> support element removal. This may or may not be an issue for your
>> application.
>>
>> Also, I'm not sure if Brandon's suggestion to consider "Short misses" was
>> targeted at storing more data in the Bloom filter. Bloom filter capacity is
>> independent of key size and smaller keys may result in poorer performance
>> due to weaker hashing.
>>
>> If it's any use I've made an open source Bloom filter
>> implementation available as part of my collections package at
>> http://code.google.com/p/tomgibara/
>>
>> Tom.
>>
>> On 12 December 2011 03:17, John Tantalo <john.tant...@gmail.com> wrote:
>>
>>> I wonder whether anybody has tried to build an in-memory bloom filter in
>>> front of an index to reduce datastore read operations?
>>>
>>> In my application, I have an exact-match query on a single field, and it
>>> commonly matches no results. However, I still have to pay for datastore
>>> read operations in this case.
>>>
>>> My idea was to build a bloom filter on every value of the field in my
>>> datastore. Given a query input, if the bloom filter says the value is a
>>> member of the set, I will query the datastore for it, which may or may not
>>> match results (i.e., a false positive).
>>>
>>> The bloom filter would be wrapped in an app engine model and stored in
>>> the datastore and memcached. The write rate to the datastore for this index
>>> is rather low, so I plan to update the bloom filter transactionally and
>>> cache it on every write. The updates could also be done offline in a task
>>> queue.
>>>
>>> The goal is to reduce the cost of searches, especially in the "no
>>> matches" case. I believe this change would reduce costs on datastore read
>>> operations, but increase CPU time because each request would have to read
>>> and deserialize a potentially large bloom filter from memcached. Clearly,
>>> this tradeoff could be tuned to the needs of the app, as a larger bloom
>>> filter would produce fewer false positives and wasted datastore reads.
>>>
>>> Thoughts?
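
A minimal sketch of the read path John describes, assuming Guava's BloomFilter
and the App Engine memcache service purely for illustration (the cache key is
made up, and rebuilding the filter from the datastore copy is left as a
comment):

import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;
import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class FieldFilter {

    private static final String CACHE_KEY = "field-bloom-filter"; // hypothetical key

    // Returns false only when the exact-match query is guaranteed to find
    // nothing, in which case no datastore read is issued at all.
    static boolean mightMatch(String value) throws IOException {
        MemcacheService memcache = MemcacheServiceFactory.getMemcacheService();
        byte[] bytes = (byte[]) memcache.get(CACHE_KEY);
        if (bytes == null) {
            // cache miss: rebuild from the copy persisted in the datastore
            // (or from a scan over the field) and put it back into memcache
            return true; // fail open until the filter is available again
        }
        BloomFilter<CharSequence> filter = BloomFilter.readFrom(
                new ByteArrayInputStream(bytes), Funnels.stringFunnel(Charsets.UTF_8));
        return filter.mightContain(value);
    }

    // On each (infrequent) write, add the new value and re-cache the filter;
    // persisting it transactionally alongside the entity happens elsewhere.
    static void recordWrite(BloomFilter<CharSequence> filter, String value)
            throws IOException {
        filter.put(value);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        filter.writeTo(out);
        MemcacheServiceFactory.getMemcacheService().put(CACHE_KEY, out.toByteArray());
    }
}
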
>>>
>>
>
