Hi guys.

For a couple of months now I've been using the bulk API to query a lot of
data from my databases. I have some databases with hundreds of millions of
documents and a few with billions of documents. All in all, about 10TB of
disk is used.

I'm on CouchDB 2.0 in single-node mode.

Sometimes querying for 1000-2000 keys at once can take up to 150 seconds,
especially with reduce=true, group=true and include_docs=true. I found that
~80% of the queried keys don't exist in my databases.
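
For reference, this is roughly the kind of multi-key view query I mean,
written as a small Python sketch (the server URL, database and view names
are placeholders):

import requests

# Placeholder URL; in reality this points at one of my large views.
VIEW_URL = "http://localhost:5984/mydb/_design/stats/_view/by_key"

keys = ["key-1", "key-2", "key-3"]  # usually 1000-2000 keys per request

# CouchDB accepts the keys in a JSON body when you POST to a view.
resp = requests.post(
    VIEW_URL,
    params={"reduce": "true", "group": "true"},
    json={"keys": keys},
)
resp.raise_for_status()
print(resp.json()["rows"])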

What I've discovered is that using bloom filters I can reduce query times
in these situations to ~2-3 seconds!

The general flow of my setup is as follows:
1. Get all the keys of a view (e.g. curl "$view_url" -G -d reduce=false |
awk -F '"' '{print $6}' > keys).
2. Build a bloom filter for this view. It can be very large. For some of
my views I use this configuration -
https://hur.st/bloomfilter?n=20000000000&p=1e-7 - which cannot cheaply be
stored in memory. This is why I used this library -
https://github.com/axiak/pybloomfiltermmap - which uses mmap and is memory
efficient. (I probably should use p=1e-4 or p=1e-3, because a false positive
is okay here.)
3. When a query for multiple keys comes along, use the CouchDB bulk API only
on the keys that can be found in the bloom filter (see the sketch after this
list).
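
To make the flow concrete, here's a minimal Python sketch of steps 2 and 3
using pybloomfiltermmap. The file names, capacity and error rate are just
examples, and the view URL is a placeholder:

import requests
from pybloomfilter import BloomFilter

VIEW_URL = "http://localhost:5984/mydb/_design/stats/_view/by_key"  # placeholder

# Step 2: build an mmap-backed filter from the dumped keys file.
# The capacity and error rate here are examples; my real filters are much larger.
bf = BloomFilter(20000000, 0.001, "by_key.bloom")
with open("keys") as f:
    for line in f:
        bf.add(line.rstrip("\n"))
bf.sync()

# Step 3: only send CouchDB the keys that the filter might contain.
def query_keys(keys):
    candidates = [k for k in keys if k in bf]  # false positives are fine
    if not candidates:
        return []
    resp = requests.post(
        VIEW_URL,
        params={"reduce": "true", "group": "true"},
        json={"keys": candidates},
    )
    resp.raise_for_status()
    return resp.json()["rows"]

Since ~80% of the incoming keys aren't in the database at all, most of them
never reach CouchDB, which is where the speedup comes from.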

This has worked pretty well for me, but the downside is obviously step 1 -
getting all the keys - which takes a lot of time. A more efficient
solution would be to use the changes API to keep the filter up to date
incrementally. This is my next plan.
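
Something along these lines is what I have in mind - only a rough sketch:
listen to the _changes feed and add new keys to the existing filter. It
assumes the view key can be derived from the document itself
(doc["key_field"] below is hypothetical), which isn't true for every view:

import json
import requests
from pybloomfilter import BloomFilter

DB_URL = "http://localhost:5984/mydb"  # placeholder
bf = BloomFilter.open("by_key.bloom")  # the filter built earlier

# Continuous changes feed; 'since' should be persisted between runs.
resp = requests.get(
    DB_URL + "/_changes",
    params={"feed": "continuous", "include_docs": "true", "since": "0"},
    stream=True,
)
for line in resp.iter_lines():
    if not line:  # skip heartbeat/blank lines
        continue
    change = json.loads(line)
    doc = change.get("doc")
    if doc and not doc.get("_deleted"):
        # Hypothetical: the view emits doc["key_field"] as its key.
        bf.add(doc["key_field"])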

It would be great if this were part of CouchDB (checking whether a key
exists in a bloom filter before querying the database), but in the meantime
I'm just sharing my experience. Maybe someone will find it useful.

I wrote an HTTP REST wrapper for https://github.com/axiak/pybloomfiltermmap,
so it stays independent from my business logic code and I can query it
remotely. I also wrote an efficient command-line tool that creates and
populates a bloom filter using https://github.com/axiak/pybloomfiltermmap.
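
Until the code is published, here's roughly what such a wrapper could look
like - an illustrative sketch with Flask, not my actual implementation, and
the /check endpoint name is made up:

from flask import Flask, jsonify, request
from pybloomfilter import BloomFilter

app = Flask(__name__)
bf = BloomFilter.open("by_key.bloom")  # the mmap-backed filter built earlier

@app.route("/check", methods=["POST"])
def check():
    # Expects {"keys": [...]} and returns only the keys that might exist.
    keys = request.get_json()["keys"]
    return jsonify({"keys": [k for k in keys if k in bf]})

if __name__ == "__main__":
    app.run(port=8080)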

I'll open source my code in the near future.
