[jira] Updated: (COUCHDB-220) Extreme sparseness in couch files

Chris Anderson (JIRA) Tue, 07 Apr 2009 18:27:35 -0700

     [ https://issues.apache.org/jira/browse/COUCHDB-220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Chris Anderson updated COUCHDB-220:
-----------------------------------

    Attachment: stream.diff

I applied Robert's patch (benchmarking before and after with multiple methods) 
and saw very little change. I then also applied the IRC suggestion (set 
min_alloc=1 instead of 64kb), and saw substantial speedups as well as much 
tighter file sizes. 

Summary of my benchmarks:

Bulk_docs posts of 1000 docs (roughly 100 bytes each) did not seem to be 
affected by this patch at all.

However, when loading docs (using a custom Erlang loader) where each doc has a 
4kb attachment (100 concurrent writers, with docs committed in batches of 10), I 
saw big improvements. Here's what I saw after running my loader for 30 
seconds:

Before the patch:

db-file-size: 364.8 MB
docs: 5690
bytes/doc = 67229.25 (that's more than 10x wasted space)
docs/sec = 190

After the patch:

db-file-size: 76.15 MB
docs: 13340 (wow, more than twice as many in the same amount of time!)
bytes/doc = 5985.45 (this is a better size for sure)
docs/sec = 445
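
The derived figures above can be re-checked from the raw numbers. A minimal sketch of the arithmetic, assuming "MB" means 2**20 bytes and that each run lasted exactly 30 seconds (the helper name `summarize` is only for illustration and is not part of the loader):

```python
# Recompute the derived benchmark figures from the raw numbers reported above.
# Assumptions: "MB" means 2**20 bytes and each run lasted exactly 30 seconds.
MIB = 2 ** 20
RUN_SECONDS = 30


def summarize(file_size_mb, docs, seconds=RUN_SECONDS):
    """Return (bytes per doc, docs per second) for one benchmark run."""
    return file_size_mb * MIB / docs, docs / seconds


before = summarize(364.8, 5690)
after = summarize(76.15, 13340)

print("before: %.0f bytes/doc, %.0f docs/sec" % before)
print("after:  %.0f bytes/doc, %.0f docs/sec" % after)
print("space per doc shrank %.1fx" % (before[0] / after[0]))
print("throughput grew %.1fx" % (after[1] / before[1]))
```

Small differences from the posted bytes/doc values are expected, since the file sizes are rounded.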

I've attached the changes I made (stream.diff). I'm hoping Damien can look at 
it before we apply it to trunk. It seems strange that bypassing min_alloc would 
make such a big difference; maybe there's a better answer we don't see.
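
One possible reading of the numbers, offered only as a toy model: if every attachment write reserved at least min_alloc bytes, a 4kb attachment would occupy 64kb of file space with the old setting. The sketch below assumes exactly that policy; `reserved_bytes` is a hypothetical helper, not a function from the CouchDB stream code, and the real mechanism may differ.

```python
# Toy model only: assumes every attachment write reserves at least
# `min_alloc` bytes in the file. This is a guess at the mechanism,
# not CouchDB's actual stream code.
KB = 1024


def reserved_bytes(attachment_size, min_alloc):
    """Bytes reserved for one attachment under the assumed policy."""
    return max(attachment_size, min_alloc)


attachment = 4 * KB
for min_alloc in (64 * KB, 1):
    reserved = reserved_bytes(attachment, min_alloc)
    padding = reserved - attachment
    print("min_alloc=%6d: reserved=%6d bytes, padding=%6d bytes"
          % (min_alloc, reserved, padding))
```

Under this assumption the reserved sizes (65536 and 4096 bytes) sit close to the measured bytes/doc before and after the change, with the remainder plausibly being document bodies and index overhead. That is suggestive, not proof.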


> Extreme sparseness in couch files
> ---------------------------------
>
>                 Key: COUCHDB-220
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-220
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Database Core
>    Affects Versions: 0.9
>         Environment: ubuntu 8.10 64-bit, ext3
>            Reporter: Robert Newson
>         Attachments: 220.patch, 220.patch, attachment_sparseness.js, 
> stream.diff
>
>
> When adding ten thousand documents, each with a small attachment, the 
> discrepancy between reported file size and actual file size becomes huge;
> ls -lh shard0.couch
> 698M 2009-01-23 13:42 shard0.couch
> du -sh shard0.couch
> 57M   shard0.couch
> On filesystems that do not support write holes, this will cause an order of 
> magnitude more I/O.
> I think it was introduced by the streaming attachment patch as each 
> attachment is followed by huge swathes of zeroes when viewed with 'hd -v'.
> Compacting this database reduced it to 7.8mb, indicating other sparseness 
> besides attachments.
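
The ls/du comparison in the report can also be done programmatically. A minimal sketch, assuming a POSIX system where `os.stat` exposes `st_blocks` (counted in 512-byte units); the function name `sparseness` and the temporary demo file are illustrative, not part of CouchDB:

```python
import os
import tempfile


def sparseness(path):
    """Return (apparent_bytes, allocated_bytes) for a file.

    apparent_bytes is the size `ls -l` reports; allocated_bytes is the
    space `du` reports. st_blocks is in 512-byte units and is only
    available on POSIX systems.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512


# Demo: extend an empty file to 1 MiB without writing any data.
# On filesystems that support holes this allocates little or nothing.
with tempfile.NamedTemporaryFile() as f:
    f.truncate(1 << 20)
    f.flush()
    apparent, allocated = sparseness(f.name)
    print("apparent=%d bytes, allocated=%d bytes" % (apparent, allocated))
```

Calling `sparseness("shard0.couch")` on the file from the report would be expected to show the same gap as the ls and du output above.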

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
