Re: Space (in)efficiency

Roger Binns Thu, 07 Jan 2010 17:32:39 -0800

Sorry if I wasn't clear but your statement above is the complete opposite of
what is happening.

My objects have references to each other (circular and recursive) and I have
a script that generates JSON for them, one per line.  A separate script
imports those into CouchDB.

My generation script also generates the _id as it is referenced by other
items that have been or will be emitted.  The algorithm I used to generate
the _id was the same as CouchDB's default - 16 random hex digits.  That also
resulted in a massive database.  Switching my algorithm to generate 4
"digits" resulted in a significantly smaller database.
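For anyone wanting to reproduce this, here is a minimal sketch of the kind of _id generation described above. It is not my actual script: the names make_id and emit_line, the alphabet parameter, and the collision check are all assumptions made for illustration. Note that 4 hex digits allow only 65,536 distinct ids, so a short _id needs a larger alphabet or a small data set.

```python
import json
import random

HEX = "0123456789abcdef"

def make_id(length, seen, alphabet=HEX, rng=random):
    """Return a random _id of `length` characters that is not already in `seen`.

    Hypothetical helper: retries on collision, which matters for short ids.
    It loops forever if every possible id is already taken.
    """
    while True:
        candidate = "".join(rng.choice(alphabet) for _ in range(length))
        if candidate not in seen:
            seen.add(candidate)
            return candidate

def emit_line(doc, doc_id):
    """Serialize one document as a single JSON line carrying its _id."""
    return json.dumps(dict(doc, _id=doc_id), sort_keys=True)

if __name__ == "__main__":
    seen = set()
    long_id = make_id(16, seen)   # 16 random hex digits
    short_id = make_id(4, seen)   # 4 "digits"
    # One object referring to another by _id, one JSON document per line.
    print(emit_line({"name": "a", "other": short_id}, long_id))
    print(emit_line({"name": "b", "other": long_id}, short_id))
```

Generating the _id up front is what lets one object refer to another that has not been emitted yet.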

The bottom line is that CouchDB file size is very dependent on the size of
the _id.  Not only that, the size seems to grow much faster than linearly
with the length of the _id.

That information does not appear to be documented anywhere, nor does it seem
to be a "good thing".

> The cause, if you're interested, is simply that the low locality of
> reference between identifiers causes lots more btree nodes to be
> updated on each insert, which increases the size of the file and
> requires more effort to traverse during compaction.

That still doesn't explain the major difference in file sizes, especially
post compaction, which is what I was measuring.  Even better, how about a
formula to describe how big the database should be?  It appears to be
something like:

  size = nrecords * (avg_record_size + len_id ^ 3)

The power is probably based on log of the len_id, but in any event shows
just how dramatically database size can increase.
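To show what that guess implies, here is a small sketch that evaluates the formula for two _id lengths. The function name estimated_size, the record count, and the average record size are made-up values for illustration, not measurements, and the exponent is only my guess from above.

```python
def estimated_size(nrecords, avg_record_size, len_id, power=3):
    """Evaluate the guessed formula: nrecords * (avg_record_size + len_id ** power).

    `power` defaults to 3, matching the rough guess above; it is not a
    measured or documented CouchDB constant.
    """
    return nrecords * (avg_record_size + len_id ** power)

if __name__ == "__main__":
    # Hypothetical workload: 1000 records averaging 100 bytes each.
    big = estimated_size(1000, 100, 16)    # 1000 * (100 + 4096) = 4196000
    small = estimated_size(1000, 100, 4)   # 1000 * (100 + 64)   = 164000
    print(big, small, round(big / small, 1))
```

With small records the _id term dominates, and the estimate for 16-character ids is roughly 25 times the estimate for 4-character ids.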

> I saw similar amounts of "bloat", which is why I contributed the
> original sequential uuids patch some months ago. The uuid's generated
> by "sequential" (the default is called "random") are still very
> unlikely to collide but are much kinder to the btree algorithm.

You have done what I did - addressed the symptoms rather than the cause :-)

> Finally, for large documents, or modest amounts of attachments, this
> bloat, even with random 128-bit uuid's, is much reduced.

Are you saying that how much the uuids differ from each other, rather than
their size, is the biggest determinant of space consumption?

Roger