Hi,

This is a very interesting topic, and I agree it would be lovely to have something like this in CouchDB. However, I have one concern: how do you handle deletions? A standard Bloom filter cannot remove a key, because the bits that key set may be shared with other keys, and clearing them could make keys that are still present look absent. Deleted documents therefore stay in the filter, so the more deletions you have, the more false positives the filter produces and the more the performance benefit erodes.
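
To make the deletion concern concrete, here is a minimal sketch in Python. It is a toy filter written for this mail, not pybloomfiltermmap: the `BloomFilter` class, its `naive_remove` method, and the deliberately tiny size (16 bits, 3 hashes) are all assumptions chosen so that bit sharing is easy to observe.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: m bits, k hash positions per key (illustrative only)."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [False] * m

    def _positions(self, key):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def __contains__(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

    def naive_remove(self, key):
        # Unsafe on purpose: these bits may also belong to other keys.
        for pos in self._positions(key):
            self.bits[pos] = False


def broken_keys_after_naive_remove(keys, victim, m=16, k=3):
    """Return the live keys that look absent after clearing the victim's bits."""
    bf = BloomFilter(m, k)
    for key in keys:
        bf.add(key)
    bf.naive_remove(victim)
    return [key for key in keys if key != victim and key not in bf]


keys = [f"doc-{i}" for i in range(100)]
lost = broken_keys_after_naive_remove(keys, "doc-0")
print(f"{len(lost)} live keys now look absent after removing doc-0")
```

Clearing bits turns the filter's guarantee upside down (it starts returning false negatives), which is why deleted keys normally have to stay in the filter until it is rebuilt.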

Aside from that, I think it's a very good idea, and I'd love to collaborate on adding it to CouchDB if possible.

Regards

On Sun, Jul 16, 2017 at 10:23 PM aa mm <assaf.mor...@gmail.com> wrote:

> Hi guys.
>
> For a couple of months now I've been using the bulk API to query a lot of
> data from my databases. I have some databases with hundreds of millions of
> documents and a few with billions of documents. All in all, about 10TB of
> hard disk is used.
>
> I'm on 2.0 in single-node mode.
>
> Sometimes querying for 1000-2000 keys at once can take up to 150 seconds,
> especially with reduce=true, group=true and include_docs=true. I found that
> ~80% of the query keys are unknown to my databases.
>
> What I've discovered is that using bloom filters I can reduce query times
> in these situations to ~2-3 seconds!
>
> The general flow of my setup is as follows:
> 1. Get all the keys of a view (e.g. curl "$view_url" -G -d reduce=false |
> awk -F '"' '{print $6}' > keys)
> 2. Build a bloom filter for this view. This can be very large. In some of
> my views I use this configuration -
> https://hur.st/bloomfilter?n=20000000000&p=1e-7 - which cannot cheaply be
> stored in memory. This is why I used this library -
> https://github.com/axiak/pybloomfiltermmap - that uses mmap and is memory
> efficient. (I probably should use p=1e-4 or p=1e-3 because a false positive
> is okay here)
> 3. When a query of multiple keys comes along, use the CouchDB bulk API only
> on the keys that can be found in the bloom filter.
>
> This has worked pretty well for me, but the downside is obviously step 1 -
> getting all the keys - which takes a lot of time. A more efficient
> solution would be to use the changes API. This is my next plan.
>
> It would be great if this were part of CouchDB (checking whether a key
> exists in a bloom filter before querying the database), but in the meantime
> I'm just sharing my experience. Maybe someone will find it useful.
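
Step 3 of the flow quoted above can be sketched roughly as follows. This is only an illustration under assumptions: `split_keys`, `query_view` and the injectable `post` helper are hypothetical names, and `bloom` is any object that supports `key in bloom` (which I believe pybloomfiltermmap's filter does, but please check). The only CouchDB-specific part is POSTing a JSON body of the form `{"keys": [...]}` to the view URL.

```python
import json
import urllib.request


def split_keys(keys, bloom):
    """Partition keys into (maybe_present, definitely_absent)."""
    maybe, absent = [], []
    for key in keys:
        (maybe if key in bloom else absent).append(key)
    return maybe, absent


def _post_json(url, payload):
    # Plain stdlib POST of a JSON body; returns the decoded JSON response.
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def query_view(view_url, keys, bloom, post=None):
    """Query the view only for keys the filter says may exist."""
    maybe, _absent = split_keys(keys, bloom)
    if not maybe:
        return []  # nothing can match, so skip the round trip entirely
    post = post or _post_json
    return post(view_url, {"keys": maybe})["rows"]
```

With ~80% of the query keys unknown to the database, most keys never reach CouchDB. A false positive only costs one wasted lookup, which is why a looser error rate is acceptable here.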
>
> I wrote an HTTP REST wrapper for https://github.com/axiak/pybloomfiltermmap,
> so it is independent of my business logic code and I can query it
> remotely. I also wrote an efficient command line tool to create and
> populate a bloom filter using https://github.com/axiak/pybloomfiltermmap.
>
> I'll open source my code in the near future.
>
--
Carlos Alonso
Data Engineer
Madrid, Spain
carlos.alo...@cabify.com