Re: [google-appengine] Bloom filter index

Tom Gibara Thu, 15 Dec 2011 02:30:53 -0800

Already done :)

The package containing my Bloom filter implementation already contains a
general-purpose compact approximator implementation; something else I need
to document though :(


On 14 December 2011 01:32, Ikai Lan (Google) <ika...@google.com> wrote:

> Tom,
>
> Awesome! Your writing is really accessible, and I'd love to see more of it.
>
> I think Bloom Filters have caught on because conceptually, it's much
> easier to figure out optimizations when you know you can space/time
> efficient set inclusion checks. Bloom Filters are used everywhere - they're
> used in several places here, including in parts of the datastore. You don't
> hear as much about Compact Approximators. In fact ... I can't seem to find
> a library for them. Someone needs to take this paper and turn it into code:
>
> http://arxiv.org/pdf/cs.DS/0306046
>
> --
> Ikai Lan
> Developer Programs Engineer, Google App Engine
> plus.ikailan.com | twitter.com/ikai
>
>
>
> On Tue, Dec 13, 2011 at 5:04 PM, Tom Gibara <tomgib...@gmail.com> wrote:
>
>> Ikai,
>>
>> Prompted by your request for a link, I took time out this evening to
>> write walkthrough for how to use my Bloom filter implementation and
>> combined it with some very brief build instructions to make this blog post:
>>
>> http://blog.tomgibara.com/post/14190080839/crinch-bloom-filter
>>
>> Hopefully others will find it useful but it was high-time I documented it
>> anyway.
>>
>> Having just scanned my previous comments, I realize that they might be
>> dissuasive; that wasn't my intention. I use Bloom filters heavily in a
>> variety of situations. For example, I recently came up with the idea of
>> using them to identify which columns in a large dataset contain unique
>> values: a first pass populates the BloomFilter and possible duplicates are
>> stored in an in-memory set that a second pass then uses to verify
>> that genuine duplicates exist (incidentally, if anyone knows a better
>> approach I'm interested, but I was pleased to come up this one).
>>
>> On the subject of computer science-y solutions, if Bloom filters provide
>> a weakened abstraction of a set, then compact approximators provide an
>> equivalent abstraction for maps. Unfortunately they have nowhere near the
>> popularity of Bloom filters even though its possible to find all sorts of
>> opportunities for them to 'bound' queries over big-data.
>>
>> http://blog.tomgibara.com/post/664922709/compact-approximators
>>
>> For example, in the problem initially posed, a compact approximator could
>> be used to return an actual upper bound on the number of matches returned
>> instead of simply whether a match exists or not. This is expense of a
>> greater memory overhead and whether its worthwhile is naturally dependent
>> on the application.
>>
>> Tom.
>>
>> On 13 December 2011 19:33, Ikai Lan (Google) <ika...@google.com> wrote:
>>
>>> Tom,
>>>
>>> Can you link directly to the Bloom Filter implementation? It's not
>>> obvious from the main project page how to find it. Any time we can get
>>> people to think about computer science-y solutions to big data problems
>>> instead of just subscribing to RDBMS dogma, we win =).
>>>
>>> While bloom filters are not a solution for everything, there are places
>>> where they would greatly improve performance. One place where you'd likely
>>> see huge benefits from bloom filters are in N x N comparison type of
>>> operations. Here's an example from RapLeaf:
>>>
>>> http://blog.rapleaf.com/dev/2007/09/05/bloomfilter/
>>>
>>> By doing a single pass over all entities to build the initial filter,
>>> when you cross-check on the second pass, you can do the same operation in
>>> much less time if you know that matches tend to be sparse (for other folks
>>> reading this thread that are new to the concept: this means spread out -
>>> most comparison operations will be NO MATCH). In the N x N hash comparison
>>> example, you know that you are only going to do a write in the first phase
>>> with no deletes, and you know you will only be reading the filter later on
>>> so you can use local memory.
>>>
>>> --
>>> Ikai Lan
>>> Developer Programs Engineer, Google App Engine
>>> plus.ikailan.com | twitter.com/ikai
>>>
>>>
>>>
>>> On Mon, Dec 12, 2011 at 5:07 PM, Tom Gibara <tomgib...@gmail.com> wrote:
>>>
>>>> I've used Bloom filters for a number of purposes, including the
>>>> scenario you describe - though not on App Engine.
>>>>
>>>> Bloom filters are an extremely versatile tool, but are not a panacea.
>>>> Firstly they aren't as compact as one might expect. By my reckoning, a 1MB
>>>> bloom filter will accommodate 1M entries with a false positive rate of
>>>> around 2%. A well implemented trie with a suitable set of keys can approach
>>>> similar levels of compactness.
>>>>
>>>> Secondly, Bloom filters are obviously sensitive to their initial
>>>> sizing; though they degrade gracefully, in some applications it can be
>>>> difficult to resize them in-situ without causing unacceptable spikes in
>>>> throughput.
>>>>
>>>> Finally, just in case you hadn't considered it; Bloom filters don't
>>>> support element removal. This may or may not be an issue for your
>>>> application.
>>>>
>>>> Also, I'm not sure if Brandon's suggestion to consider "Short misses"
>>>> was targeted at storing more data in the Bloom filter. Bloom filter
>>>> capacity is independent of key size and smaller keys may result in poorer
>>>> performance due to weaker hashing.
>>>>
>>>> If it's any use I've made an open source Bloom filter
>>>> implementation available as part of my collections package at
>>>> http://code.google.com/p/tomgibara/
>>>>
>>>> Tom.
>>>>
>>>> On 12 December 2011 03:17, John Tantalo <john.tant...@gmail.com> wrote:
>>>>
>>>>> I wonder whether anybody has tried to build an in-memory bloom filter
>>>>> in front of an index to reduce datastore read operations?
>>>>>
>>>>> In my application, I have an exact-match query on a single field, and
>>>>> it commonly matches no results. However, I still have to pay for datastore
>>>>> read operations in this case.
>>>>>
>>>>> My idea was to build a bloom filter on every value of the field in my
>>>>> datastore. Given a query input, if the bloom filter says the value is a
>>>>> member of the set, I will query the datastore for it, which may or may not
>>>>> match results (i.e., a false positive).
>>>>>
>>>>> The bloom filter would be wrapped in an app engine model and stored in
>>>>> the datastore and memcached. The write rate to the datastore for this 
>>>>> index
>>>>> is rather low, so I plan to update the bloom filter transactionally and
>>>>> cache it on every write. The updates could also be done offline in a task
>>>>> queue.
>>>>>
>>>>> The goal is to reduce the cost of searches, especially in the "no
>>>>> matches" case. I believe this change would reduce costs on datastore read
>>>>> operations, but increase CPU time because each request would have to read
>>>>> and deserialize a potentially large bloom filter from memcached. Clearly,
>>>>> this tradeoff could be tuned to the needs of the app, as a larger bloom
>>>>> filter would produce fewer false positives and wasted datastore reads.
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "Google App Engine" group.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msg/google-appengine/-/ViVc7VJ8iOAJ.
>>>>> To post to this group, send email to google-appengine@googlegroups.com
>>>>> .
>>>>> To unsubscribe from this group, send email to
>>>>> google-appengine+unsubscr...@googlegroups.com.
>>>>> For more options, visit this group at
>>>>> http://groups.google.com/group/google-appengine?hl=en.
>>>>>
>>>>
>>>>  --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "Google App Engine" group.
>>>> To post to this group, send email to google-appengine@googlegroups.com.
>>>> To unsubscribe from this group, send email to
>>>> google-appengine+unsubscr...@googlegroups.com.
>>>> For more options, visit this group at
>>>> http://groups.google.com/group/google-appengine?hl=en.
>>>>
>>>
>>>  --
>>> You received this message because you are subscribed to the Google
>>> Groups "Google App Engine" group.
>>> To post to this group, send email to google-appengine@googlegroups.com.
>>> To unsubscribe from this group, send email to
>>> google-appengine+unsubscr...@googlegroups.com.
>>> For more options, visit this group at
>>> http://groups.google.com/group/google-appengine?hl=en.
>>>
>>
>>  --
>> You received this message because you are subscribed to the Google Groups
>> "Google App Engine" group.
>> To post to this group, send email to google-appengine@googlegroups.com.
>> To unsubscribe from this group, send email to
>> google-appengine+unsubscr...@googlegroups.com.
>> For more options, visit this group at
>> http://groups.google.com/group/google-appengine?hl=en.
>>
>
>  --
> You received this message because you are subscribed to the Google Groups
> "Google App Engine" group.
> To post to this group, send email to google-appengine@googlegroups.com.
> To unsubscribe from this group, send email to
> google-appengine+unsubscr...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/google-appengine?hl=en.
>

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To post to this group, send email to google-appengine@googlegroups.com.
To unsubscribe from this group, send email to 
google-appengine+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en.

Re: [google-appengine] Bloom filter index

Reply via email to