Ok, then switching the algorithm in couchdb won't help you.

Longer ids take up more space for two reasons. Firstly, there's the
obvious storage requirement. Secondly, the .couch file will contain
lots of obsolete btree nodes, since each insert will very likely
invalidate several nodes (random ids have no locality of reference);
this is probably compounded by the unbalanced btree insertion
algorithm in couch.
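To make the locality point concrete, here's a toy model (the page
counts and sizes are invented for illustration, not CouchDB
internals): treat the btree's leaf level as fixed-size pages and count
how many distinct pages a batch of inserts dirties.

```python
import os

# Toy model, assumed numbers: NPAGES leaf pages covering a 128-bit key
# space; count the distinct pages a batch of BATCH inserts touches.
NPAGES = 10_000
BATCH = 1_000
KEY_SPACE = 2 ** 128

def page_of(key_int):
    # the leaf page covering this key's sort position
    return key_int * NPAGES // KEY_SPACE

# Random 128-bit ids scatter across the whole key space, so nearly
# every insert in the batch dirties a different page...
random_keys = [int.from_bytes(os.urandom(16), "big") for _ in range(BATCH)]
print(len({page_of(k) for k in random_keys}))   # ~950 distinct pages

# ...while sequential ids cluster at one point in the key space and
# keep rewriting the same page(s).
start = int.from_bytes(os.urandom(16), "big") % (KEY_SPACE - BATCH)
print(len({page_of(start + i) for i in range(BATCH)}))  # 1 or 2 pages
```

In an append-only file like couch's, every dirtied node is rewritten
rather than updated in place, so the scattered case leaves far more
obsolete copies behind.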
Random 128-bit values will be among the worst-performing ids, whether
couchdb generates them or you do. Since you are generating random ids
(that is, you're not forced to choose a specific value), why not
follow the sequential algorithm instead?
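For reference, the "sequential" scheme works roughly like this. This
is a Python sketch of the idea behind couch_uuids.erl, not the actual
Erlang; the exact counter width and step size are assumptions:

```python
import os
import random

class SequentialUUIDs:
    """Sketch of a couch_uuids-style "sequential" generator: a random
    26-hex-char prefix plus a 6-hex-digit counter bumped by a small
    random step, so ids are still hard to guess but sort adjacently."""

    def __init__(self):
        self._roll_prefix()

    def _roll_prefix(self):
        self.prefix = os.urandom(13).hex()      # 26 hex chars
        self.counter = random.randint(1, 0xFFE)

    def next(self):
        if self.counter >= 0xFFF000:            # near overflow: new prefix
            self._roll_prefix()
        uid = f"{self.prefix}{self.counter:06x}"  # 32 hex chars total
        self.counter += random.randint(1, 0xFFE)
        return uid

gen = SequentialUUIDs()
ids = [gen.next() for _ in range(1000)]
assert ids == sorted(ids)     # monotonic: inserts hit adjacent btree nodes
assert len(set(ids)) == 1000  # still unique
```

Because each new id sorts just after the previous one, inserts keep
landing on the same rightmost part of the tree instead of scattering
across it.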

I'd be against switching the default to sequential, or recommending it
to others as the default, only because it often is a premature
optimization. In many cases (where documents are large or attachments
are common), the overhead largely evaporates. It seems to matter in
your case (assuming a 21GB file is really a problem; it wouldn't be
for me), so you might find couch_uuids.erl instructive.

B.

On Fri, Jan 8, 2010 at 1:31 AM, Roger Binns <rog...@rogerbinns.com> wrote:
>
> Robert Newson wrote:
>> The sequential uuids mentioned above are a configuration option. Since
>> it's implicit in what you've said, you must be allowing couchdb to
>> generate uuids on the server-side,
>
> Sorry if I wasn't clear but your statement above is the complete opposite of
> what is happening.
>
> My objects have references to each other (circular and recursive) and I have
> a script that generates JSON for them, one per line.  A separate script
> imports those into CouchDB.
>
> My generation script also generates the _id as it is referenced by other
> items that have been or will be emitted.  The algorithm I used to generate
> the _id was the same as CouchDB's default - 16 random hex digits.  That also
> resulted in a massive database.  Switching my algorithm to generate 4
> "digits" resulted in a significantly smaller database.
>
> The bottom line is that CouchDB file size is very dependent on the size of
> the _id.  Not only that, it seems to grow as a power of the _id size.
>
> That information does not appear to be documented anywhere, nor does it seem
> to be a "good thing".
>
>> The cause, if you're interested, is simply that the low locality of
>> reference between identifiers causes lots more btree nodes to be
>> updated on each insert, which increases the size of the file and
>> requires more effort to traverse during compaction.
>
> That still doesn't explain the major difference in file sizes, especially
> post compaction, which is what I was measuring.  Even better, how about a
> formula to describe how big the database should be?  It appears to be
> something like:
>
>  size = nrecords * (avg_record_size + len_id ^ 3)
>
> The power is probably based on the log of len_id, but in any event it shows
> just how dramatically database size can increase.
>
>> I saw similar amounts of "bloat", which is why I contributed the
>> original sequential uuids patch some months ago. The uuid's generated
>> by "sequential" (the default is called "random") are still very
>> unlikely to collide but are much kinder to the btree algorithm.
>
> You have done what I did - addressed the symptoms rather than the cause :-)
>
>> Finally, for large documents, or modest amounts of attachments, this
>> bloat, even with random 128-bit uuid's, is much reduced.
>
> Are you saying that how different the uuids are from each other is the
> biggest determinant of space consumption, rather than their size?
>
> Roger
