Tom,

Awesome! Your writing is really accessible, and I'd love to see more of it.
I think Bloom filters have caught on because, conceptually, it's much easier to figure out optimizations when you know you can do space/time-efficient set inclusion checks. Bloom filters are used everywhere - they're used in several places here, including in parts of the datastore. You don't hear as much about Compact Approximators. In fact ... I can't seem to find a library for them. Someone needs to take this paper and turn it into code: http://arxiv.org/pdf/cs.DS/0306046

--
Ikai Lan
Developer Programs Engineer, Google App Engine
plus.ikailan.com | twitter.com/ikai

On Tue, Dec 13, 2011 at 5:04 PM, Tom Gibara <tomgib...@gmail.com> wrote:

> Ikai,
>
> Prompted by your request for a link, I took time out this evening to write a walkthrough for how to use my Bloom filter implementation and combined it with some very brief build instructions to make this blog post:
>
> http://blog.tomgibara.com/post/14190080839/crinch-bloom-filter
>
> Hopefully others will find it useful, but it was high time I documented it anyway.
>
> Having just scanned my previous comments, I realize that they might be dissuasive; that wasn't my intention. I use Bloom filters heavily in a variety of situations. For example, I recently came up with the idea of using them to identify which columns in a large dataset contain unique values: a first pass populates the Bloom filter and possible duplicates are stored in an in-memory set that a second pass then uses to verify that genuine duplicates exist (incidentally, if anyone knows a better approach I'm interested, but I was pleased to come up with this one).
>
> On the subject of computer science-y solutions, if Bloom filters provide a weakened abstraction of a set, then compact approximators provide an equivalent abstraction for maps. Unfortunately they have nowhere near the popularity of Bloom filters, even though it's possible to find all sorts of opportunities for them to 'bound' queries over big data.
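Tom's two-pass duplicate check can be sketched roughly as below. This is a toy, self-contained filter for illustration, not his crinch implementation; the bit count, hash scheme, and all names are illustrative assumptions:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m bits, k hash positions derived from SHA-256."""
    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))

def find_duplicates(values):
    """Two passes over `values`: pass 1 flags *possible* duplicates via the
    Bloom filter; pass 2 confirms which candidates are genuine duplicates."""
    bf = BloomFilter(m_bits=8 * 1024 * 1024, k_hashes=6)
    possible = set()
    for v in values:             # pass 1
        if bf.might_contain(v):  # seen before, or a false positive
            possible.add(v)
        bf.add(v)
    counts = {v: 0 for v in possible}
    for v in values:             # pass 2: count only the candidates
        if v in counts:
            counts[v] += 1
    return {v for v, n in counts.items() if n > 1}
```

A column contains only unique values exactly when `find_duplicates(column)` comes back empty; false positives from pass 1 cost a little candidate-set memory but are weeded out in pass 2.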
>
> http://blog.tomgibara.com/post/664922709/compact-approximators
>
> For example, in the problem initially posed, a compact approximator could be used to return an actual upper bound on the number of matches returned, instead of simply whether a match exists or not. This comes at the expense of greater memory overhead, and whether it's worthwhile is naturally dependent on the application.
>
> Tom.
>
> On 13 December 2011 19:33, Ikai Lan (Google) <ika...@google.com> wrote:
>
>> Tom,
>>
>> Can you link directly to the Bloom filter implementation? It's not obvious from the main project page how to find it. Any time we can get people to think about computer science-y solutions to big data problems instead of just subscribing to RDBMS dogma, we win =).
>>
>> While Bloom filters are not a solution for everything, there are places where they would greatly improve performance. One place where you'd likely see huge benefits from Bloom filters is in N x N comparison types of operations. Here's an example from RapLeaf:
>>
>> http://blog.rapleaf.com/dev/2007/09/05/bloomfilter/
>>
>> By doing a single pass over all entities to build the initial filter, when you cross-check on the second pass, you can do the same operation in much less time if you know that matches tend to be sparse (for other folks reading this thread who are new to the concept: this means spread out - most comparison operations will be NO MATCH). In the N x N hash comparison example, you know that you are only going to do writes in the first phase with no deletes, and you know you will only be reading the filter later on, so you can use local memory.
>>
>> --
>> Ikai Lan
>> Developer Programs Engineer, Google App Engine
>> plus.ikailan.com | twitter.com/ikai
>>
>> On Mon, Dec 12, 2011 at 5:07 PM, Tom Gibara <tomgib...@gmail.com> wrote:
>>
>>> I've used Bloom filters for a number of purposes, including the scenario you describe - though not on App Engine.
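For readers who haven't met the structure: a compact approximator, as described in Tom's post, generalizes a Bloom filter's bit array to an array of lattice cells. An update takes a join (here, max) at each hashed cell, and a lookup takes a meet (min) over them, so the result always bounds the true value from above, just as a Bloom filter never gives a false negative. A minimal sketch, where the cell count, hashing, and choice of the max/min lattice on counts are illustrative assumptions:

```python
import hashlib

class CompactApproximator:
    """Compact approximator over the (max, min) lattice on non-negative
    integers: lookup returns an upper bound on the value recorded for a key,
    never an underestimate."""
    def __init__(self, m_cells, k_hashes):
        self.m = m_cells
        self.k = k_hashes
        self.cells = [0] * m_cells   # bottom of the lattice

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def update(self, key, value):
        # join: each cell keeps the max of everything hashed into it
        for pos in self._positions(key):
            self.cells[pos] = max(self.cells[pos], value)

    def upper_bound(self, key):
        # meet: the smallest of the k cells still bounds the true value
        return min(self.cells[pos] for pos in self._positions(key))
```

For the problem posed at the start of the thread, one could `update` with each value's match count during an indexing pass; an `upper_bound` of 0 at query time would mean the query can be skipped entirely.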
>>>
>>> Bloom filters are an extremely versatile tool, but they are not a panacea. Firstly, they aren't as compact as one might expect. By my reckoning, a 1MB Bloom filter will accommodate 1M entries with a false positive rate of around 2%. A well-implemented trie with a suitable set of keys can approach similar levels of compactness.
>>>
>>> Secondly, Bloom filters are obviously sensitive to their initial sizing; though they degrade gracefully, in some applications it can be difficult to resize them in situ without causing unacceptable spikes in throughput.
>>>
>>> Finally, just in case you hadn't considered it: Bloom filters don't support element removal. This may or may not be an issue for your application.
>>>
>>> Also, I'm not sure if Brandon's suggestion to consider "short misses" was targeted at storing more data in the Bloom filter. Bloom filter capacity is independent of key size, and smaller keys may result in poorer performance due to weaker hashing.
>>>
>>> If it's any use, I've made an open source Bloom filter implementation available as part of my collections package at http://code.google.com/p/tomgibara/
>>>
>>> Tom.
>>>
>>> On 12 December 2011 03:17, John Tantalo <john.tant...@gmail.com> wrote:
>>>
>>>> I wonder whether anybody has tried to build an in-memory Bloom filter in front of an index to reduce datastore read operations?
>>>>
>>>> In my application, I have an exact-match query on a single field, and it commonly matches no results. However, I still have to pay for datastore read operations in this case.
>>>>
>>>> My idea was to build a Bloom filter on every value of the field in my datastore. Given a query input, if the Bloom filter says the value is a member of the set, I will query the datastore for it, which may or may not match results (i.e., a false positive).
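Tom's sizing reckoning above is easy to verify: for a filter of m bits holding n entries, the optimal number of hash functions is about k = ln 2 · m/n, and the false positive rate is approximately (1 - e^(-kn/m))^k. A quick check, assuming "1MB" means 8,388,608 bits:

```python
import math

def false_positive_rate(m_bits, n_entries, k_hashes):
    """Standard approximation of a Bloom filter's false positive rate."""
    return (1 - math.exp(-k_hashes * n_entries / m_bits)) ** k_hashes

m = 8 * 1024 * 1024              # a 1MB filter = 8,388,608 bits
n = 1_000_000                    # 1M entries
k = round(math.log(2) * m / n)   # optimal k is about 5.8, so use 6
print(k, false_positive_rate(m, n, k))   # prints 6 and roughly 0.018
```

Roughly a 1.8% false positive rate at about 8.4 bits per entry, which matches the "around 2%" figure in the message above.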
>>>>
>>>> The Bloom filter would be wrapped in an App Engine model and stored in the datastore and memcache. The write rate to the datastore for this index is rather low, so I plan to update the Bloom filter transactionally and cache it on every write. The updates could also be done offline in a task queue.
>>>>
>>>> The goal is to reduce the cost of searches, especially in the "no matches" case. I believe this change would reduce costs on datastore read operations, but increase CPU time, because each request would have to read and deserialize a potentially large Bloom filter from memcache. Clearly, this tradeoff could be tuned to the needs of the app, as a larger Bloom filter would produce fewer false positives and wasted datastore reads.
>>>>
>>>> Thoughts?
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google Groups "Google App Engine" group.
>>>> To view this discussion on the web visit https://groups.google.com/d/msg/google-appengine/-/ViVc7VJ8iOAJ.
>>>> To post to this group, send email to google-appengine@googlegroups.com.
>>>> To unsubscribe from this group, send email to google-appengine+unsubscr...@googlegroups.com.
>>>> For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.
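The gated-lookup scheme John proposes can be sketched as follows. The filter here is a deliberate stand-in stub (a plain set, i.e. a "filter" with no false positives) just to show the control flow; in a real deployment it would be the Bloom filter deserialized from memcache, and the query would hit the datastore. All class and function names are illustrative:

```python
class StubFilter:
    """Stand-in for the memcache-resident Bloom filter: anything exposing
    might_contain() works. A real filter may also return false positives."""
    def __init__(self, indexed_values):
        self._seen = set(indexed_values)

    def might_contain(self, value):
        return value in self._seen

class CountingStore:
    """Pretend datastore that counts charged read operations."""
    def __init__(self, rows):
        self.rows = rows
        self.reads = 0

    def query(self, value):
        self.reads += 1   # every real query costs read operations
        return [r for r in self.rows if r == value]

def gated_query(flt, store, value):
    """Only a 'maybe' from the filter pays for a datastore read; a definite
    miss is answered from memory, which is the saving in the no-matches case."""
    if not flt.might_contain(value):
        return []
    return store.query(value)

store = CountingStore(["alice", "bob"])
flt = StubFilter(store.rows)
gated_query(flt, store, "carol")   # definite miss: no charged read
gated_query(flt, store, "alice")   # possible hit: one charged read
print(store.reads)                 # prints 1
```

With a real Bloom filter, a false positive simply falls through to the datastore and returns no rows, so correctness is unaffected; only the savings shrink.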