-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Newson wrote:
> The sequential uuids mentioned above are a configuration option. Since
> it's implicit in what you've said, you must be allowing couchdb to
> generate uuids on the server-side,

Sorry if I wasn't clear, but your statement above is the complete
opposite of what is happening. My objects have references to each other
(circular and recursive) and I have a script that generates JSON for
them, one per line. A separate script imports those into CouchDB.

My generation script also generates the _id, as it is referenced by
other items that have been or will be emitted. The algorithm I used to
generate the _id was the same as CouchDB's default - 16 random hex
digits. That also resulted in a massive database. Switching my
algorithm to generate 4 "digits" resulted in a significantly smaller
database.

The bottom line is that CouchDB file size is very dependent on the size
of the _id. Not only that, it seems to have an exponential factor of
the _id size. That information does not appear to be documented
anywhere, nor does it seem to be a "good thing".

> The cause, if you're interested, is simply that the low locality of
> reference between identifiers causes lots more btree nodes to be
> updated on each insert, which increases the size of the file and
> requires more effort to traverse during compaction.

That still doesn't explain the major difference in file sizes,
especially post compaction, which is what I was measuring. Even better,
how about a formula to describe how big the database should be? It
appears to be something like:

    size = nrecords * (avg_record_size + len_id ^ 3)

The power is probably based on the log of len_id, but in any event it
shows just how dramatically database size can increase.

> I saw similar amounts of "bloat", which is why I contributed the
> original sequential uuids patch some months ago. The uuid's generated
> by "sequential" (the default is called "random") are still very
> unlikely to collide but are much kinder to the btree algorithm.

You have done what I did - addressed the symptoms rather than the
cause :-)

> Finally, for large documents, or modest amounts of attachments, this
> bloat, even with random 128-bit uuid's, is much reduced.

Are you saying that how different the uuids are from each other is the
biggest determinant of space consumption rather than their size?

Roger
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAktGiv0ACgkQmOOfHg372QRMRwCgryfiGdKLma7qjnGmJOwAEpcR
Io4AnjcSewuZjiFnKnrecxaHgWvnHOyL
=PTkC
-----END PGP SIGNATURE-----
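
For anyone who wants to play with the numbers: the guessed formula in the message above, size = nrecords * (avg_record_size + len_id ^ 3), can be evaluated with a few lines of Python. This is only a sketch of that guess, not a documented CouchDB property, and the function name, record count and average record size below are hypothetical values picked for illustration.

```python
def estimated_size(nrecords, avg_record_size, len_id):
    """Evaluate the guessed formula from the message.

    This is the poster's empirical guess, not a documented CouchDB rule.
    """
    return nrecords * (avg_record_size + len_id ** 3)


# Hypothetical inputs, chosen only to compare the two _id lengths
# mentioned in the message (4 and 16 characters).
NRECORDS = 100_000
AVG_RECORD_SIZE = 200  # bytes, assumed

small = estimated_size(NRECORDS, AVG_RECORD_SIZE, 4)
large = estimated_size(NRECORDS, AVG_RECORD_SIZE, 16)

print("4-char ids: ", small)
print("16-char ids:", large)
print("ratio:      ", round(large / small, 2))
```

Comparing the estimate against the real post-compaction file size for a few _id lengths would show quickly whether a cubic term is anywhere near the right shape.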