On Wed, Jun 29, 2011 at 7:59 PM, Zdravko Gligic <zgli...@gmail.com> wrote:
> Hi Folks,
> In many places I have read how Erlang runs on small devices and how
> (as a result) it is very frugal with resources.  I think that I have
> read that or at least something to that effect.  However, none of that
> seems to apply to CouchDB.
> I believe that I read somewhere that the length of key names can make
> a significant reduction in disk usage - as in, cutting it in half or
> less.  However, when I asked about it on #couchdb, a very smart person
> stated point blank (with a bit of attitude or maybe just conviction)
> that if I was worried about disk then I should not be using CouchDB.
> In many places I have read how both DB and View compactions can free
> up as much as 90% of occupied space.  Similarly, I have read how
> CouchDB would be struggling on smaller VPS allocations and how a mere
> 2GB database would struggle with anything less than that much in RAM -
> especially when compactions and/or cleanups are running.
> Whenever I come across such CouchDB resources related postings, I keep
> thinking about all of those Couches on all of those mobile devices (at
> least in all of those presentations and slides) and asking my self
> "how do they do that" ?
> Regards,
> teslan


I'm not sure where you were getting the impression that Erlang was
frugal with disk space. In general, it's true that Erlang is pretty
good at using a minimal amount of CPU/RAM resources while it runs,
though as in all things, that usage will scale with load.

As to disk usage, that's a direct trade-off in the design of CouchDB.
The append-only b+tree is going to cause fragmentation in the database
files. There are of course games we could play to minimize that to a
certain extent, like log-structured merge trees with more aggressive
compaction, but then the issue becomes that we end up requiring more
active file descriptors per database, which in turn hurts people that
are hosting a large number of databases on a single node (think
hosting, or db per user account).

My guess is that whoever it was on IRC was just speaking with
conviction. We don't try to hide the fact that CouchDB uses quite a
bit more space than people would expect at first.

As to the amount of space that can be cleaned up, it really depends on
the specific load patterns and how aggressive people are at keeping
the database files compacted. Obviously I could write a single
document hundreds of thousands of times without compacting, and then
compact and have a database that is a percent or less of the
"uncompacted" size.
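
To make that concrete, here is a toy sketch in Python. It is not
CouchDB's storage code or file format; AppendOnlyStore and its methods
are made-up names, and the "file" is just an in-memory byte buffer. It
only illustrates the idea that every update appends a new revision and
that compaction copies the live revisions into a fresh file.

```python
import json


class AppendOnlyStore:
    """Toy append-only store (hypothetical, not CouchDB's real format)."""

    def __init__(self):
        self.log = bytearray()  # stands in for the database file
        self.index = {}         # doc id -> (offset, length) of latest revision

    def put(self, doc_id, doc):
        # Never overwrite in place: append the new revision and repoint the index.
        record = json.dumps({"_id": doc_id, "doc": doc}).encode("utf-8")
        self.index[doc_id] = (len(self.log), len(record))
        self.log.extend(record)

    def get(self, doc_id):
        offset, length = self.index[doc_id]
        raw = self.log[offset:offset + length].decode("utf-8")
        return json.loads(raw)["doc"]

    def compact(self):
        # Copy only the latest revision of each document into a new store.
        fresh = AppendOnlyStore()
        for doc_id in self.index:
            fresh.put(doc_id, self.get(doc_id))
        return fresh


store = AppendOnlyStore()
for n in range(100_000):
    store.put("counter", {"value": n})  # same document, rewritten many times

compacted = store.compact()
ratio = len(compacted.log) / len(store.log)
print(f"before: {len(store.log)} bytes, after: {len(compacted.log)} bytes")
print(f"compacted size is {ratio:.4%} of the uncompacted size")
```

A real database file also carries b+tree nodes and headers, so actual
numbers will differ, but the shape of the result is the same: the dead
revisions are what take up the space.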

I'm also not sure about why someone would say that a 2GiB database
would struggle with less than 2GiB of RAM. RAM usage is more or less
tied to the number of concurrent clients you have accessing the
database and the amount and type of view generations you have running.
It's not really tied to the physical size of the database as we don't
hold caches to anything. There used to be a silly benchmark floating
around that showed CouchDB handling a couple thousand requests for a
small doc and it was only using 9M of RAM. Granted that's a super
idealized case, but I'd just point out that it's more about access
patterns rather than disk usage.

As to the mobile stuff, my guess would probably be "don't store a lot
of data on the device". AFAIK the story for mobile developers revolves
quite a bit around the fact that replicating data in and out from The
Cloud™ makes it super easy for them to have bits and pieces of
a much larger database.

But in the end, CouchDB uses much more disk space than some would
expect because that's the trade-off in the grand design. There are
features we have like database snapshots, append
only storage to simplify guarantees on consistency (also, hot backups)
and hosting a large number of db's in a single Erlang VM that end up
intersecting in such a way that the price we pay is using more bytes.

Also, I'd like to recommend you keep an eye on development because
this is an active area of optimization. Filipe has been doing awesome
work integrating things like snappy compression and other things deep
down at the storage layer to improve the situation. We may be frank in
saying we use a non-trivial amount of extra space, but it's not like
we're not working on improving that situation. :D

That ended up longer than expected. Let us know if you have any other questions.

