I've been doing some performance testing of the various ways that attachments can be uploaded to CouchDB. I think that what I'm seeing points to some pathological behavior inside couch, but that's just a guess (I don't really know much about couch internals). However, if I'm understanding the implications correctly, there may be an opportunity to make replication much, much faster for large attachments (by speeding up the multipart API).
To get the data yourself, run 'python makedata.py' once, then repeatedly run 'bash do-curls.sh' to get timing information (perhaps while making performance tweaks, if you're a dev). The code is on GitHub: https://github.com/wickedgrey/couchdb-attachment-speed It's a bit janky, but it gets the job done.

The main takeaway: the multipart API is just as slow as base64-encoding everything. Expect to pay roughly a 10x performance penalty for using either API versus uploading the attachment separately.

All of the tests were run against a local 1.1.1 couch, recently installed via brew, with delayed commits set to false. The hardware was a 2010 MacBook Pro with 8 GB of RAM, lightly loaded (a browser and an IDE were running but idle while the tests ran). The general shape of the timing data didn't change over multiple runs. I haven't looked into couch memory or CPU usage while handling the uploads.

n    raw        base64      multipart   py b64 encode    py b64 decode
1    0m0.136s   0m0.014s    0m0.013s    0:00:00.000015   0:00:00.000009
2    0m0.014s   0m0.016s    0m0.015s    0:00:00.000012   0:00:00.000011
3    0m0.015s   0m0.017s    0m1.027s    0:00:00.000016   0:00:00.000021
4    0m0.015s   0m0.018s    0m2.020s    0:00:00.000057   0:00:00.000090
5    0m0.017s   0m0.035s    0m2.027s    0:00:00.000361   0:00:00.000801
6    0m0.054s   0m0.202s    0m1.133s    0:00:00.003541   0:00:00.005455
7    0m0.361s   0m1.859s    0m2.318s    0:00:00.043847   0:00:00.059307
8    0m3.531s   0m19.336s   0m15.820s   0:00:00.472431   0:00:00.822210
9    0m36.594s  3m24.152s   5m45.110s   ?                ?

One interesting issue I ran into while constructing the data was trying to run a gig of text data through the Python JSON parser. It seemed that a couple of copies of the data were being made (I'd guess the original data, then an escaped version, and then the final string?), which slowed things down quite a bit.

The current state of affairs is especially frustrating for me, since my use case doesn't permit having documents in an attachment-less (read: inconsistent) state.
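For anyone who hasn't compared the two slow paths side by side, here is a rough sketch of the request bodies involved (the attachment name "data.bin" and the helper names are mine, and this only builds the bodies rather than being a full client; the multipart layout follows my reading of couch's multipart/related PUT format, so treat it as illustrative):

```python
import base64
import json
import uuid

def inline_base64_doc(attachment, name="data.bin"):
    # Slow path 1: the whole attachment is base64-encoded inside the
    # document JSON, so couch has to decode it back out on the server side.
    return json.dumps({"_attachments": {name: {
        "content_type": "application/octet-stream",
        "data": base64.b64encode(attachment).decode("ascii")}}})

def multipart_related_body(attachment, name="data.bin"):
    # Slow path 2: a multipart/related PUT. The document JSON comes first,
    # with a "follows": true stub for the attachment, and the raw bytes
    # follow as their own part.
    boundary = uuid.uuid4().hex
    doc = json.dumps({"_attachments": {name: {
        "content_type": "application/octet-stream",
        "length": len(attachment),
        "follows": True}}})
    parts = [
        f"--{boundary}\r\ncontent-type: application/json\r\n\r\n{doc}\r\n".encode(),
        f"--{boundary}\r\n\r\n".encode(),
        attachment,
        f"\r\n--{boundary}--".encode(),
    ]
    return b"".join(parts), f"multipart/related; boundary={boundary}"

# The fast path in the table ("raw") is neither of these: it's a plain PUT
# of the bare attachment bytes to /db/docid/attname.
```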
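For what it's worth, the base64 inflation on the inline path can be sketched like this (a rough illustration with a small stand-in payload and a made-up attachment name; the "copies" commentary is my guess from the symptoms, not a profile of the json module):

```python
import base64
import json

payload = b"x" * 300_000  # stand-in for the real attachment bytes

# Copy 1: the base64 text, 4/3 the size of the original data.
b64 = base64.b64encode(payload).decode("ascii")

# Copy 2: json.dumps assembles (and escapes) the final document string,
# which again contains the entire encoded attachment.
doc = json.dumps({"_attachments": {"data.bin": {
    "content_type": "application/octet-stream",
    "data": b64}}})

assert len(b64) == len(payload) * 4 // 3  # base64 overhead alone
assert len(doc) > len(b64)                # plus the JSON wrapper on top
```

At gigabyte scale, each of those intermediate strings is itself over a gigabyte, which lines up with the slowdown I saw.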
My ideal outcome would be to have the multipart API:
- sped up to be roughly the same speed as standalone attachments
- extended/changed/supplemented to allow for multiple documents at once, like the bulk API

In any case, thanks for reading. I hope this helps make CouchDB even better. :)

Cheers,
Eli
