Seems like even a 1MB bloom filter (the largest size you can fit in a
single entity/memcache bucket) would accomodate an enormously large
dataset.

The not-so-obvious part is how to update the memcache in sync with the
datastore.  If you have transactional requirements, you can't update
the memcache on datastore write without risking collisions.  The
answer is to clear the value on write and use CAS operations on read.
If you're in Javaland you can use Objectify's
CachingAsyncDatastoreService directly with the low-level api.  If
you're in Pythonland you'll have to implement this yourself... just
remember, don't try to update memcache on datastore write.  It will
fail.

Jeff

On Sun, Dec 11, 2011 at 11:35 PM, Brandon Wirtz <drak...@digerat.com> wrote:
> Your CPU time will likely be lower. Your Bloom won’t be able to be more than
> about 28 megs in size.
>
> You may consider “Short” misses if your data doesn’t all fit.  (Using
> partial matches based on fewer characters of the complete value   Look up
> “Robert ZCXYNVsnup” Do I have any last names that start with ZCXY No. Great
> don’t look in data store.)
>
>
>
> You may want to play with Instance ram vs Memcache. This will likely be a
> factor of the Instance life and the number of running instances.
>
>
>
>
>
>
>
>
>
> From: google-appengine@googlegroups.com
> [mailto:google-appengine@googlegroups.com] On Behalf Of John Tantalo
> Sent: Sunday, December 11, 2011 7:17 PM
> To: google-appengine@googlegroups.com
> Subject: [google-appengine] Bloom filter index
>
>
>
> I wonder whether anybody has tried to build an in-memory bloom filter in
> front of an index to reduce datastore read operations?
>
>
>
> In my application, I have an exact-match query on a single field, and it
> commonly matches no results. However, I still have to pay for datastore read
> operations in this case.
>
>
>
> My idea was to build a bloom filter on every value of the field in my
> datastore. Given a query input, if the bloom filter says the value is a
> member of the set, I will query the datastore for it, which may or may not
> match results (i.e., a false positive).
>
>
>
> The bloom filter would be wrapped in an app engine model and stored in the
> datastore and memcached. The write rate to the datastore for this index is
> rather low, so I plan to update the bloom filter transactionally and cache
> it on every write. The updates could also be done offline in a task queue.
>
>
>
> The goal is to reduce the cost of searches, especially in the "no matches"
> case. I believe this change would reduce costs on datastore read operations,
> but increase CPU time because each request would have to read and
> deserialize a potentially large bloom filter from memcached. Clearly, this
> tradeoff could be tuned to the needs of the app, as a larger bloom filter
> would produce fewer false positives and wasted datastore reads.
>
>
>
> Thoughts?
>
> --
> You received this message because you are subscribed to the Google Groups
> "Google App Engine" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/google-appengine/-/ViVc7VJ8iOAJ.
> To post to this group, send email to google-appengine@googlegroups.com.
> To unsubscribe from this group, send email to
> google-appengine+unsubscr...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/google-appengine?hl=en.
>
> --
> You received this message because you are subscribed to the Google Groups
> "Google App Engine" group.
> To post to this group, send email to google-appengine@googlegroups.com.
> To unsubscribe from this group, send email to
> google-appengine+unsubscr...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/google-appengine?hl=en.



-- 
We are the 20%

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To post to this group, send email to google-appengine@googlegroups.com.
To unsubscribe from this group, send email to 
google-appengine+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en.

Reply via email to