Tom,

Awesome! Your writing is really accessible, and I'd love to see more of it.
I think Bloom filters have caught on because, conceptually, it's much easier to figure out optimizations when you know you can do space/time-efficient set inclusion checks. Bloom filters are used everywhere - they're used in several places here, including in parts of the datastore. You don't hear as much about Compact Approximators. In fact ... I can't seem to find a library for them. Someone needs to take this paper and turn it into code: http://arxiv.org/pdf/cs.DS/0306046

--
Ikai Lan
Developer Programs Engineer, Google App Engine
plus.ikailan.com | twitter.com/ikai

On Tue, Dec 13, 2011 at 5:04 PM, Tom Gibara <tomgib...@gmail.com> wrote:

> Ikai,
>
> Prompted by your request for a link, I took time out this evening to write a walkthrough for how to use my Bloom filter implementation and combined it with some very brief build instructions to make this blog post:
>
> http://blog.tomgibara.com/post/14190080839/crinch-bloom-filter
>
> Hopefully others will find it useful, but it was high time I documented it anyway.
>
> Having just scanned my previous comments, I realize that they might be dissuasive; that wasn't my intention. I use Bloom filters heavily in a variety of situations. For example, I recently came up with the idea of using them to identify which columns in a large dataset contain unique values: a first pass populates the Bloom filter and possible duplicates are stored in an in-memory set that a second pass then uses to verify that genuine duplicates exist (incidentally, if anyone knows a better approach I'm interested, but I was pleased to come up with this one).
>
> On the subject of computer science-y solutions, if Bloom filters provide a weakened abstraction of a set, then compact approximators provide an equivalent abstraction for maps. Unfortunately they have nowhere near the popularity of Bloom filters, even though it's possible to find all sorts of opportunities for them to 'bound' queries over big data.
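Tom's two-pass duplicate check can be sketched roughly as below. This is a toy, self-contained filter for illustration, not his crinch implementation; the bit count, hash scheme, and all names are illustrative assumptions:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m bits, k hash positions derived from SHA-256."""
    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))

def find_duplicates(values):
    """Two passes over `values`: pass 1 flags *possible* duplicates via the
    Bloom filter; pass 2 confirms which candidates are genuine duplicates."""
    bf = BloomFilter(m_bits=8 * 1024 * 1024, k_hashes=6)
    possible = set()
    for v in values:             # pass 1
        if bf.might_contain(v):  # seen before, or a false positive
            possible.add(v)
        bf.add(v)
    counts = {v: 0 for v in possible}
    for v in values:             # pass 2: count only the candidates
        if v in counts:
            counts[v] += 1
    return {v for v, n in counts.items() if n > 1}
```

A column contains only unique values exactly when `find_duplicates(column)` comes back empty; false positives from pass 1 cost a little candidate-set memory but are weeded out in pass 2.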
>
> http://blog.tomgibara.com/post/664922709/compact-approximators
>
> For example, in the problem initially posed, a compact approximator could be used to return an actual upper bound on the number of matches returned, instead of simply whether a match exists or not. This comes at the expense of greater memory overhead, and whether it's worthwhile is naturally dependent on the application.
>
> Tom.
>
> On 13 December 2011 19:33, Ikai Lan (Google) <ika...@google.com> wrote:
>
>> Tom,
>>
>> Can you link directly to the Bloom filter implementation? It's not obvious from the main project page how to find it. Any time we can get people to think about computer science-y solutions to big data problems instead of just subscribing to RDBMS dogma, we win =).
>>
>> While Bloom filters are not a solution for everything, there are places where they would greatly improve performance. One place where you'd likely see huge benefits from Bloom filters is in N x N comparison types of operations. Here's an example from RapLeaf:
>>
>> http://blog.rapleaf.com/dev/2007/09/05/bloomfilter/
>>
>> By doing a single pass over all entities to build the initial filter, when you cross-check on the second pass, you can do the same operation in much less time if you know that matches tend to be sparse (for other folks reading this thread who are new to the concept: this means spread out - most comparison operations will be NO MATCH). In the N x N hash comparison example, you know that you are only going to do writes in the first phase with no deletes, and you know you will only be reading the filter later on, so you can use local memory.
>>
>> --
>> Ikai Lan
>> Developer Programs Engineer, Google App Engine
>> plus.ikailan.com | twitter.com/ikai
>>
>> On Mon, Dec 12, 2011 at 5:07 PM, Tom Gibara <tomgib...@gmail.com> wrote:
>>
>>> I've used Bloom filters for a number of purposes, including the scenario you describe - though not on App Engine.
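For readers who haven't met the structure: a compact approximator, as described in Tom's post, generalizes a Bloom filter's bit array to an array of lattice cells. An update takes a join (here, max) at each hashed cell, and a lookup takes a meet (min) over them, so the result always bounds the true value from above, just as a Bloom filter never gives a false negative. A minimal sketch, where the cell count, hashing, and choice of the max/min lattice on counts are illustrative assumptions:

```python
import hashlib

class CompactApproximator:
    """Compact approximator over the (max, min) lattice on non-negative
    integers: lookup returns an upper bound on the value recorded for a key,
    never an underestimate."""
    def __init__(self, m_cells, k_hashes):
        self.m = m_cells
        self.k = k_hashes
        self.cells = [0] * m_cells   # bottom of the lattice

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def update(self, key, value):
        # join: each cell keeps the max of everything hashed into it
        for pos in self._positions(key):
            self.cells[pos] = max(self.cells[pos], value)

    def upper_bound(self, key):
        # meet: the smallest of the k cells still bounds the true value
        return min(self.cells[pos] for pos in self._positions(key))
```

For the problem posed at the start of the thread, one could `update` with each value's match count during an indexing pass; an `upper_bound` of 0 at query time would mean the query can be skipped entirely.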
>>>
>>> Bloom filters are an extremely versatile tool, but they are not a panacea. Firstly, they aren't as compact as one might expect. By my reckoning, a 1MB Bloom filter will accommodate 1M entries with a false positive rate of around 2%. A well-implemented trie with a suitable set of keys can approach similar levels of compactness.
>>>
>>> Secondly, Bloom filters are obviously sensitive to their initial sizing; though they degrade gracefully, in some applications it can be difficult to resize them in situ without causing unacceptable spikes in throughput.
>>>
>>> Finally, just in case you hadn't considered it: Bloom filters don't support element removal. This may or may not be an issue for your application.
>>>
>>> Also, I'm not sure if Brandon's suggestion to consider "short misses" was targeted at storing more data in the Bloom filter. Bloom filter capacity is independent of key size, and smaller keys may result in poorer performance due to weaker hashing.
>>>
>>> If it's any use, I've made an open source Bloom filter implementation available as part of my collections package at http://code.google.com/p/tomgibara/
>>>
>>> Tom.
>>>
>>> On 12 December 2011 03:17, John Tantalo <john.tant...@gmail.com> wrote:
>>>
>>>> I wonder whether anybody has tried to build an in-memory Bloom filter in front of an index to reduce datastore read operations?
>>>>
>>>> In my application, I have an exact-match query on a single field, and it commonly matches no results. However, I still have to pay for datastore read operations in this case.
>>>>
>>>> My idea was to build a Bloom filter on every value of the field in my datastore. Given a query input, if the Bloom filter says the value is a member of the set, I will query the datastore for it, which may or may not match results (i.e., a false positive).
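Tom's sizing reckoning above is easy to verify: for a filter of m bits holding n entries, the optimal number of hash functions is about k = ln 2 · m/n, and the false positive rate is approximately (1 - e^(-kn/m))^k. A quick check, assuming "1MB" means 8,388,608 bits:

```python
import math

def false_positive_rate(m_bits, n_entries, k_hashes):
    """Standard approximation of a Bloom filter's false positive rate."""
    return (1 - math.exp(-k_hashes * n_entries / m_bits)) ** k_hashes

m = 8 * 1024 * 1024              # a 1MB filter = 8,388,608 bits
n = 1_000_000                    # 1M entries
k = round(math.log(2) * m / n)   # optimal k is about 5.8, so use 6
print(k, false_positive_rate(m, n, k))   # prints 6 and roughly 0.018
```

Roughly a 1.8% false positive rate at about 8.4 bits per entry, which matches the "around 2%" figure in the message above.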
>>>>
>>>> The Bloom filter would be wrapped in an App Engine model and stored in the datastore and memcache. The write rate to the datastore for this index is rather low, so I plan to update the Bloom filter transactionally and cache it on every write. The updates could also be done offline in a task queue.
>>>>
>>>> The goal is to reduce the cost of searches, especially in the "no matches" case. I believe this change would reduce costs on datastore read operations, but increase CPU time, because each request would have to read and deserialize a potentially large Bloom filter from memcache. Clearly, this tradeoff could be tuned to the needs of the app, as a larger Bloom filter would produce fewer false positives and wasted datastore reads.
>>>>
>>>> Thoughts?
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google Groups "Google App Engine" group.
>>>> To view this discussion on the web visit https://groups.google.com/d/msg/google-appengine/-/ViVc7VJ8iOAJ.
>>>> To post to this group, send email to google-appengine@googlegroups.com.
>>>> To unsubscribe from this group, send email to google-appengine+unsubscr...@googlegroups.com.
>>>> For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.
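The gated-lookup scheme John proposes can be sketched as follows. The filter here is a deliberate stand-in stub (a plain set, i.e. a "filter" with no false positives) just to show the control flow; in a real deployment it would be the Bloom filter deserialized from memcache, and the query would hit the datastore. All class and function names are illustrative:

```python
class StubFilter:
    """Stand-in for the memcache-resident Bloom filter: anything exposing
    might_contain() works. A real filter may also return false positives."""
    def __init__(self, indexed_values):
        self._seen = set(indexed_values)

    def might_contain(self, value):
        return value in self._seen

class CountingStore:
    """Pretend datastore that counts charged read operations."""
    def __init__(self, rows):
        self.rows = rows
        self.reads = 0

    def query(self, value):
        self.reads += 1   # every real query costs read operations
        return [r for r in self.rows if r == value]

def gated_query(flt, store, value):
    """Only a 'maybe' from the filter pays for a datastore read; a definite
    miss is answered from memory, which is the saving in the no-matches case."""
    if not flt.might_contain(value):
        return []
    return store.query(value)

store = CountingStore(["alice", "bob"])
flt = StubFilter(store.rows)
gated_query(flt, store, "carol")   # definite miss: no charged read
gated_query(flt, store, "alice")   # possible hit: one charged read
print(store.reads)                 # prints 1
```

With a real Bloom filter, a false positive simply falls through to the datastore and returns no rows, so correctness is unaffected; only the savings shrink.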