Already done :) The package containing my Bloom filter implementation already contains a general-purpose compact approximator implementation; something else I need to document though :(
On 14 December 2011 01:32, Ikai Lan (Google) <ika...@google.com> wrote: > Tom, > > Awesome! Your writing is really accessible, and I'd love to see more of it. > > I think Bloom Filters have caught on because conceptually, it's much > easier to figure out optimizations when you know you can space/time > efficient set inclusion checks. Bloom Filters are used everywhere - they're > used in several places here, including in parts of the datastore. You don't > hear as much about Compact Approximators. In fact ... I can't seem to find > a library for them. Someone needs to take this paper and turn it into code: > > http://arxiv.org/pdf/cs.DS/0306046 > > -- > Ikai Lan > Developer Programs Engineer, Google App Engine > plus.ikailan.com | twitter.com/ikai > > > > On Tue, Dec 13, 2011 at 5:04 PM, Tom Gibara <tomgib...@gmail.com> wrote: > >> Ikai, >> >> Prompted by your request for a link, I took time out this evening to >> write walkthrough for how to use my Bloom filter implementation and >> combined it with some very brief build instructions to make this blog post: >> >> http://blog.tomgibara.com/post/14190080839/crinch-bloom-filter >> >> Hopefully others will find it useful but it was high-time I documented it >> anyway. >> >> Having just scanned my previous comments, I realize that they might be >> dissuasive; that wasn't my intention. I use Bloom filters heavily in a >> variety of situations. For example, I recently came up with the idea of >> using them to identify which columns in a large dataset contain unique >> values: a first pass populates the BloomFilter and possible duplicates are >> stored in an in-memory set that a second pass then uses to verify >> that genuine duplicates exist (incidentally, if anyone knows a better >> approach I'm interested, but I was pleased to come up this one). >> >> On the subject of computer science-y solutions, if Bloom filters provide >> a weakened abstraction of a set, then compact approximators provide an >> equivalent abstraction for maps. Unfortunately they have nowhere near the >> popularity of Bloom filters even though its possible to find all sorts of >> opportunities for them to 'bound' queries over big-data. >> >> http://blog.tomgibara.com/post/664922709/compact-approximators >> >> For example, in the problem initially posed, a compact approximator could >> be used to return an actual upper bound on the number of matches returned >> instead of simply whether a match exists or not. This is expense of a >> greater memory overhead and whether its worthwhile is naturally dependent >> on the application. >> >> Tom. >> >> On 13 December 2011 19:33, Ikai Lan (Google) <ika...@google.com> wrote: >> >>> Tom, >>> >>> Can you link directly to the Bloom Filter implementation? It's not >>> obvious from the main project page how to find it. Any time we can get >>> people to think about computer science-y solutions to big data problems >>> instead of just subscribing to RDBMS dogma, we win =). >>> >>> While bloom filters are not a solution for everything, there are places >>> where they would greatly improve performance. One place where you'd likely >>> see huge benefits from bloom filters are in N x N comparison type of >>> operations. Here's an example from RapLeaf: >>> >>> http://blog.rapleaf.com/dev/2007/09/05/bloomfilter/ >>> >>> By doing a single pass over all entities to build the initial filter, >>> when you cross-check on the second pass, you can do the same operation in >>> much less time if you know that matches tend to be sparse (for other folks >>> reading this thread that are new to the concept: this means spread out - >>> most comparison operations will be NO MATCH). In the N x N hash comparison >>> example, you know that you are only going to do a write in the first phase >>> with no deletes, and you know you will only be reading the filter later on >>> so you can use local memory. >>> >>> -- >>> Ikai Lan >>> Developer Programs Engineer, Google App Engine >>> plus.ikailan.com | twitter.com/ikai >>> >>> >>> >>> On Mon, Dec 12, 2011 at 5:07 PM, Tom Gibara <tomgib...@gmail.com> wrote: >>> >>>> I've used Bloom filters for a number of purposes, including the >>>> scenario you describe - though not on App Engine. >>>> >>>> Bloom filters are an extremely versatile tool, but are not a panacea. >>>> Firstly they aren't as compact as one might expect. By my reckoning, a 1MB >>>> bloom filter will accommodate 1M entries with a false positive rate of >>>> around 2%. A well implemented trie with a suitable set of keys can approach >>>> similar levels of compactness. >>>> >>>> Secondly, Bloom filters are obviously sensitive to their initial >>>> sizing; though they degrade gracefully, in some applications it can be >>>> difficult to resize them in-situ without causing unacceptable spikes in >>>> throughput. >>>> >>>> Finally, just in case you hadn't considered it; Bloom filters don't >>>> support element removal. This may or may not be an issue for your >>>> application. >>>> >>>> Also, I'm not sure if Brandon's suggestion to consider "Short misses" >>>> was targeted at storing more data in the Bloom filter. Bloom filter >>>> capacity is independent of key size and smaller keys may result in poorer >>>> performance due to weaker hashing. >>>> >>>> If it's any use I've made an open source Bloom filter >>>> implementation available as part of my collections package at >>>> http://code.google.com/p/tomgibara/ >>>> >>>> Tom. >>>> >>>> On 12 December 2011 03:17, John Tantalo <john.tant...@gmail.com> wrote: >>>> >>>>> I wonder whether anybody has tried to build an in-memory bloom filter >>>>> in front of an index to reduce datastore read operations? >>>>> >>>>> In my application, I have an exact-match query on a single field, and >>>>> it commonly matches no results. However, I still have to pay for datastore >>>>> read operations in this case. >>>>> >>>>> My idea was to build a bloom filter on every value of the field in my >>>>> datastore. Given a query input, if the bloom filter says the value is a >>>>> member of the set, I will query the datastore for it, which may or may not >>>>> match results (i.e., a false positive). >>>>> >>>>> The bloom filter would be wrapped in an app engine model and stored in >>>>> the datastore and memcached. The write rate to the datastore for this >>>>> index >>>>> is rather low, so I plan to update the bloom filter transactionally and >>>>> cache it on every write. The updates could also be done offline in a task >>>>> queue. >>>>> >>>>> The goal is to reduce the cost of searches, especially in the "no >>>>> matches" case. I believe this change would reduce costs on datastore read >>>>> operations, but increase CPU time because each request would have to read >>>>> and deserialize a potentially large bloom filter from memcached. Clearly, >>>>> this tradeoff could be tuned to the needs of the app, as a larger bloom >>>>> filter would produce fewer false positives and wasted datastore reads. >>>>> >>>>> Thoughts? >>>>> >>>>> -- >>>>> You received this message because you are subscribed to the Google >>>>> Groups "Google App Engine" group. >>>>> To view this discussion on the web visit >>>>> https://groups.google.com/d/msg/google-appengine/-/ViVc7VJ8iOAJ. >>>>> To post to this group, send email to google-appengine@googlegroups.com >>>>> . >>>>> To unsubscribe from this group, send email to >>>>> google-appengine+unsubscr...@googlegroups.com. >>>>> For more options, visit this group at >>>>> http://groups.google.com/group/google-appengine?hl=en. >>>>> >>>> >>>> -- >>>> You received this message because you are subscribed to the Google >>>> Groups "Google App Engine" group. >>>> To post to this group, send email to google-appengine@googlegroups.com. >>>> To unsubscribe from this group, send email to >>>> google-appengine+unsubscr...@googlegroups.com. >>>> For more options, visit this group at >>>> http://groups.google.com/group/google-appengine?hl=en. >>>> >>> >>> -- >>> You received this message because you are subscribed to the Google >>> Groups "Google App Engine" group. >>> To post to this group, send email to google-appengine@googlegroups.com. >>> To unsubscribe from this group, send email to >>> google-appengine+unsubscr...@googlegroups.com. >>> For more options, visit this group at >>> http://groups.google.com/group/google-appengine?hl=en. >>> >> >> -- >> You received this message because you are subscribed to the Google Groups >> "Google App Engine" group. >> To post to this group, send email to google-appengine@googlegroups.com. >> To unsubscribe from this group, send email to >> google-appengine+unsubscr...@googlegroups.com. >> For more options, visit this group at >> http://groups.google.com/group/google-appengine?hl=en. >> > > -- > You received this message because you are subscribed to the Google Groups > "Google App Engine" group. > To post to this group, send email to google-appengine@googlegroups.com. > To unsubscribe from this group, send email to > google-appengine+unsubscr...@googlegroups.com. > For more options, visit this group at > http://groups.google.com/group/google-appengine?hl=en. > -- You received this message because you are subscribed to the Google Groups "Google App Engine" group. To post to this group, send email to google-appengine@googlegroups.com. To unsubscribe from this group, send email to google-appengine+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.