Re: [google-appengine] Bloom filter index

Tom Gibara Mon, 12 Dec 2011 17:08:11 -0800

I've used Bloom filters for a number of purposes, including the scenario
you describe - though not on App Engine.


Bloom filters are an extremely versatile tool, but are not a panacea.
Firstly they aren't as compact as one might expect. By my reckoning, a 1MB
bloom filter will accommodate 1M entries with a false positive rate of
around 2%. A well implemented trie with a suitable set of keys can approach
similar levels of compactness.

Secondly, Bloom filters are obviously sensitive to their initial sizing;
though they degrade gracefully, in some applications it can be difficult to
resize them in-situ without causing unacceptable spikes in throughput.

Finally, just in case you hadn't considered it; Bloom filters don't support
element removal. This may or may not be an issue for your application.

Also, I'm not sure if Brandon's suggestion to consider "Short misses" was
targeted at storing more data in the Bloom filter. Bloom filter capacity is
independent of key size and smaller keys may result in poorer performance
due to weaker hashing.

If it's any use I've made an open source Bloom filter
implementation available as part of my collections package at
http://code.google.com/p/tomgibara/

Tom.

On 12 December 2011 03:17, John Tantalo <john.tant...@gmail.com> wrote:

> I wonder whether anybody has tried to build an in-memory bloom filter in
> front of an index to reduce datastore read operations?
>
> In my application, I have an exact-match query on a single field, and it
> commonly matches no results. However, I still have to pay for datastore
> read operations in this case.
>
> My idea was to build a bloom filter on every value of the field in my
> datastore. Given a query input, if the bloom filter says the value is a
> member of the set, I will query the datastore for it, which may or may not
> match results (i.e., a false positive).
>
> The bloom filter would be wrapped in an app engine model and stored in the
> datastore and memcached. The write rate to the datastore for this index is
> rather low, so I plan to update the bloom filter transactionally and cache
> it on every write. The updates could also be done offline in a task queue.
>
> The goal is to reduce the cost of searches, especially in the "no matches"
> case. I believe this change would reduce costs on datastore read
> operations, but increase CPU time because each request would have to read
> and deserialize a potentially large bloom filter from memcached. Clearly,
> this tradeoff could be tuned to the needs of the app, as a larger bloom
> filter would produce fewer false positives and wasted datastore reads.
>
> Thoughts?
>
> --
> You received this message because you are subscribed to the Google Groups
> "Google App Engine" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/google-appengine/-/ViVc7VJ8iOAJ.
> To post to this group, send email to google-appengine@googlegroups.com.
> To unsubscribe from this group, send email to
> google-appengine+unsubscr...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/google-appengine?hl=en.
>

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To post to this group, send email to google-appengine@googlegroups.com.
To unsubscribe from this group, send email to 
google-appengine+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en.

Re: [google-appengine] Bloom filter index

Reply via email to