Thanks Dave,
That was a very comprehensive answer. We are currently running compaction nightly. That's helping for now.

Regards,

> On Feb 5 2016, at 7:36 pm, Dave Cottlehuber <[email protected]> wrote:
>
> > On Tue, 2 Feb 2016, at 10:08 PM, Anik Das wrote:
> > Hello All,
> >
> > We were developing an application where we had to insert approximately
> > 600,000 documents into a database. The database had only one view
> > (value emitted as null).
> >
> > It was not a batch insertion. After the insertion the database took up
> > 3.5 GB, to our surprise. I googled around and ran a compaction. After
> > compaction the size dropped to 350 MB.
> >
> > I am new to CouchDB and I'm unable to figure out what exactly is
> > happening/happened.
> >
> > Anik Das
>
> Welcome Anik :-)
>
> Some quick points:
>
> - we use a B-tree in CouchDB - it's append-only
> - to find a doc we walk down the tree from the root node
> - the root node is always the last node in the .couch btree file
> - adding or updating a doc requires appending (in order) the doc, any
>   intermediary levels, and finally the new root node of the tree
> - thus a single doc update needs to rewrite at least 2 nodes: itself +
>   the new root
> - as the tree gets wider (more leaf-node documents) it also slowly gains
>   levels
> - this adds more intermediate nodes to be updated as we go along
>
> http://horicky.blogspot.co.at/2008/10/couchdb-implementation.html is a
> very nice but old picture of this.
>
> You should always plan to compact after a big upload or replication, but
> a couple of things will ease the pain:
>
> - use _bulk_docs (and do some testing for the optimum chunk size)
> - upload docs in uuid order (don't rely on couch-generated uuids)
>
> Both of these reduce the number of interim updates to the tree: the first
> simply by only rewriting at the end of each bulk update, the last by
> adding data in sorted order, so fewer intermediary nodes need updating.
>
> Most people run compaction through a cron job or a similar out-of-hours
> scheduling tool.
>
> A+
> Dave
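
For anyone following the thread, here is a minimal sketch of the upload pattern Dave describes: client-generated uuids sorted by _id, chunked writes through _bulk_docs, and a compaction triggered at the end. It assumes Python with the `requests` library and a local CouchDB instance; the URL, database name, and chunk size are placeholders to tune against your own measurements.

    import uuid
    import requests

    COUCH_URL = "http://localhost:5984"   # assumed local CouchDB instance
    DB = "mydb"                           # hypothetical database name
    CHUNK_SIZE = 1000                     # worth benchmarking, per Dave's advice

    def make_docs(records):
        """Assign client-side uuids so we control the _id ordering ourselves."""
        return [{"_id": uuid.uuid4().hex, **rec} for rec in records]

    def bulk_insert(records):
        # Sort by _id so the whole upload arrives in key order, which means
        # fewer intermediary B-tree nodes get rewritten along the way.
        docs = sorted(make_docs(records), key=lambda d: d["_id"])
        for i in range(0, len(docs), CHUNK_SIZE):
            chunk = docs[i:i + CHUNK_SIZE]
            # _bulk_docs writes the whole chunk in one request, so the tree
            # is updated once per chunk instead of once per document.
            resp = requests.post(f"{COUCH_URL}/{DB}/_bulk_docs", json={"docs": chunk})
            resp.raise_for_status()

    def compact():
        # Trigger compaction after the big upload to reclaim the space left
        # behind by the append-only intermediate nodes.
        resp = requests.post(
            f"{COUCH_URL}/{DB}/_compact",
            headers={"Content-Type": "application/json"},
        )
        resp.raise_for_status()

In practice you would call bulk_insert() with your 600,000 records and then compact() once, or leave compaction to a nightly cron job as mentioned above.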
