On Tue, 2 Feb 2016, at 10:08 PM, Anik Das wrote:
> Hello All,
> 
> We were developing an application where we had to insert approximately
> 600,000 documents into a database. The database had only one view
> (value emitted as null).
> 
> It was not a batch insertion. After the insertion the database took up
> 3.5GB to our wonder. I googled around and did a compact query. After
> the compact query the size reduced to 350MB.
> 
> I am new to couchdb and I'm unable to figure out what exactly is
> happening/happened.
> 
> Anik Das

Welcome Anik :-)

Some quick points:

- we use a B-tree in CouchDB
- it's append-only
- to find a doc we walk down the tree from the root node
- the root node is always the last node written in the .couch btree file
- adding or updating a doc means appending (in order) the doc, any
intermediary levels, and finally the new root node of the tree
- thus a single doc update has to rewrite at least 2 nodes: the doc
itself plus the new root
- as the tree gets wider (more leaf documents) it also slowly gets
deeper, gaining levels
- which means more intermediate nodes to rewrite as we go along

http://horicky.blogspot.co.at/2008/10/couchdb-implementation.html has a
very nice, if old, picture of this.
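
To make it concrete, here's a rough back-of-envelope sketch (plain
Python, nothing CouchDB-specific; the node size and tree depth are
made-up numbers) of why single-doc updates make the file balloon until
you compact:

    # toy model: every single-doc update appends a fresh copy of the
    # whole root-to-leaf path; the old copies stay in the file until
    # compaction copies only the live nodes into a new file
    NODE_SIZE = 4096   # pretend each serialized node is 4 KiB (made up)
    tree_depth = 3     # leaf + one intermediate level + root (made up)

    appended = 0
    for _ in range(1000):            # 1000 single-doc updates
        appended += tree_depth * NODE_SIZE

    print("file grew by %d KiB for 1000 single-doc updates"
          % (appended // 1024))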

You should always plan to compact after a big upload or replication, but
a couple of things will ease the pain:

- use _bulk_docs (and do some testing for optimum chunk size)
- upload docs in uuid order (don't rely on couch-generated uuids)

Both of these reduce the number of interim updates to the tree: the
first by rewriting the path only once at the end of each bulk update,
the second because adding data in sorted order means fewer intermediary
nodes need updating.
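
Something like this covers both points at once (a sketch only; the
local URL, the "mydb" database name, the chunk size and the zero-padded
_ids are all assumptions you'd adapt to your own setup):

    import requests

    COUCH = "http://localhost:5984/mydb"   # assumed local node + db name
    CHUNK = 1000                           # test different sizes

    # client-assigned, monotonically increasing _ids => docs arrive in
    # sorted order, so fewer intermediary nodes get rewritten
    docs = [{"_id": "%012d" % i, "value": None} for i in range(600000)]

    for start in range(0, len(docs), CHUNK):
        batch = docs[start:start + CHUNK]
        resp = requests.post(COUCH + "/_bulk_docs", json={"docs": batch})
        resp.raise_for_status()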

Most people run compaction through a cron job or a similar out-of-hours
scheduling tool.
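
The trigger itself is just a POST to _compact, e.g. (again assuming a
local node and a db called "mydb"):

    import requests

    # CouchDB wants an explicit JSON content type on this POST
    resp = requests.post("http://localhost:5984/mydb/_compact",
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()
    print(resp.json())   # {"ok": true} once compaction is scheduled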

A+
Dave
