On Thu, Jan 7, 2010 at 3:53 PM, Robert Newson <robert.new...@gmail.com> wrote:
> The sequential uuids mentioned above are a configuration option. Since
> it's implicit in what you've said, you must be allowing couchdb to
> generate uuids on the server-side, otherwise the inefficiencies that
> they cause would not be in your report. Switch the setting and try
> again, I think it remedies this problem. If I've misread you and you
> are generating the document id's yourself, then the problem is not
> with couch.

I'd be up for making sequential the default for uuids. Fast by default
isn't a bad philosophy.

>
> The cause, if you're interested, is simply that the low locality of
> reference between identifiers causes lots more btree nodes to be
> updated on each insert, which increases the size of the file and
> requires more effort to traverse during compaction.
>
> I saw similar amounts of "bloat", which is why I contributed the
> original sequential uuids patch some months ago. The uuids generated
> by "sequential" (the default is called "random") are still very
> unlikely to collide but are much kinder to the btree algorithm.
>
> Finally, for large documents, or modest amounts of attachments, this
> bloat, even with random 128-bit uuids, is much reduced.
>
> B.
>
> On Thu, Jan 7, 2010 at 9:26 PM, Roger Binns <rog...@rogerbinns.com> wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> Chris Anderson wrote:
>>>> You might try using CouchDB's builtin sequential "uuids". These should
>>>> give you some more storage efficiency.
>>
>> I can't, as various items reference various other items. The source data is
>> in a SQLite database - highly normalized. From that I generate couch
>> documents that are effectively denormalized, using a SQLite temporary table
>> to help with mapping SQL primary keys of various tables into CouchDB
>> document ids.
>>
>>>> Thanks for reporting. The last time we tried, we were able to make
>>>> some major progress in storage size efficiency. I'm not sure how much
>>>> low-hanging fruit we have left, but if you try sequential uuids that
>>>> would be a good start.
>>
>> Using _ids that were maximum 4 bytes long resulted in *massively* less
>> CouchDB storage consumption. Here are the numbers, first column is
>> Gigabytes. There are 9.8 million couchdb documents. None of the CouchDB
>> databases have any views added (yet).
>>
>>  1.3 SQLite database (53 tables, raw data)
>>  2.3 SQLite database after adding indices
>>
>>  2.5 Couch objects, JSON, one per line text file (no _rev, 16 byte _id)
>> 21.4 Those same couch objects after compaction (saved ~2GB doing compaction)
>>
>>  2.0 Couch objects, JSON, one per line text file (no _rev, 4 byte _id)
>>  3.8 Those same couch objects after compaction (didn't note pre-compaction)
>>
>> Doing compaction on the 21GB database took about 24 hours. Doing it on the
>> 3.8GB database took about 30 mins and probably way less. The machine has
>> 6GB ram. (It also doesn't help that compaction does frequent fsyncs - they
>> really are not needed.)
>>
>> What this shows is that CouchDB storage efficiency is *highly*
>> correlated with _id size. Rather than using 16 hex digits you could get the
>> same number of values but using base 62 (10 digits, 26 lower, 26 upper) and
>> only need 4 or so of those "digits".
>>
>> Roger
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v1.4.9 (GNU/Linux)
>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>>
>> iEYEARECAAYFAktGUZcACgkQmOOfHg372QTRjwCdHqiFHqBzNCYAR/DFbsNNq4PV
>> 3aIAn2/AOBTAWrejsYa8XIyiBoUfe8/R
>> =m7Pn
>> -----END PGP SIGNATURE-----
>>
>>
>

-- 
Chris Anderson
http://jchrisa.net
http://couch.io
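
On Roger's base-62 point above, here is a minimal sketch of what such short ids could look like if you generate them client-side from a running integer. This is not something CouchDB provides; `encode_base62` and `decode_base62` are hypothetical helper names, and the alphabet order (digits, lowercase, uppercase) is just the one listed in the thread. Note that 4 characters give 62**4 = 14,776,336 values, which is enough for the 9.8 million documents discussed here but far fewer than a random uuid covers.

```python
import string

# Alphabet from the thread: 10 digits, 26 lowercase, 26 uppercase.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)  # 62


def encode_base62(n: int) -> str:
    """Encode a non-negative integer as a base-62 string (hypothetical helper)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, BASE)
        digits.append(ALPHABET[rem])
    # Digits were collected least-significant first, so reverse them.
    return "".join(reversed(digits))


def decode_base62(s: str) -> int:
    """Inverse of encode_base62 (hypothetical helper)."""
    n = 0
    for ch in s:
        n = n * BASE + ALPHABET.index(ch)
    return n


if __name__ == "__main__":
    doc_count = 9_800_000  # document count quoted in the thread
    last_id = encode_base62(doc_count - 1)
    # The largest id for this document count still fits in 4 characters.
    print(last_id, len(last_id), BASE ** 4)
```

The ids grow in length as the counter grows, so early documents get 1 to 3 character ids. Whether ids built this way also give the btree-friendly insert order that the sequential uuids aim for depends on how they collate, which this sketch does not address.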