The sequential uuids mentioned above are a configuration option. Since
it's implicit in what you've said, you must be letting CouchDB
generate uuids on the server side; otherwise the inefficiencies they
cause would not show up in your report. Switch the setting and try
again; I think it remedies the problem. If I've misread you and you
are generating the document IDs yourself, then the problem is not
with couch.
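For reference, the switch lives in the [uuids] section of the config
(a sketch of the stanza as I remember it from the patch; check your
local.ini for the exact layout):

```ini
[uuids]
; "sequential" gives mostly-increasing ids, which cluster inserts
; at the right edge of the id btree; "random" scatters them
algorithm = sequential
```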

The cause, if you're interested, is simply that the low locality of
reference between identifiers forces many more btree nodes to be
rewritten on each insert, which grows the file and requires more
effort to traverse during compaction.
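To make the mechanics concrete, here is a toy model (not CouchDB's
actual btree code; the page and batch sizes are made-up figures) of
how many leaf "pages" an append-only tree must rewrite per commit
batch under sequential versus random 128-bit ids:

```python
import random
from bisect import insort, bisect_left

PAGE = 64    # keys per leaf "page" (made-up figure)
BATCH = 100  # docs committed per write batch (made-up figure)

def leaf_copies_written(ids, page=PAGE, batch=BATCH):
    """Rough model: in an append-only file, each commit batch
    appends a fresh copy of every leaf page it touched."""
    tree = []    # sorted key list standing in for the leaf level
    copies = 0
    for start in range(0, len(ids), batch):
        touched = set()
        for key in ids[start:start + batch]:
            insort(tree, key)
            touched.add(bisect_left(tree, key) // page)
        copies += len(touched)
    return copies

random.seed(1)
n = 10_000
sequential = [f"{i:032x}" for i in range(n)]
scattered = [f"{random.getrandbits(128):032x}" for _ in range(n)]

print(leaf_copies_written(sequential))  # a few pages per batch
print(leaf_copies_written(scattered))   # many more pages per batch
```

With the sequential stream every batch lands on the same rightmost
leaf or two, so far fewer stale page copies accumulate in the file
between compactions.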

I saw similar amounts of "bloat", which is why I contributed the
original sequential uuids patch some months ago. The uuids generated
by "sequential" (the default is called "random") are still very
unlikely to collide but are much kinder to the btree algorithm.

Finally, with large documents or a modest number of attachments, this
bloat is much reduced even with random 128-bit uuids.

B.

On Thu, Jan 7, 2010 at 9:26 PM, Roger Binns <rog...@rogerbinns.com> wrote:
>
> Chris Anderson wrote:
>>> You might try using CouchDB's builtin sequential "uuids". These should
>>> give you some more storage efficiency.
>
> I can't as various items reference various other items.  The source data is
> in a SQLite database - highly normalized.  I then generate couch documents
> from it that are effectively denormalized, using a SQLite temporary table
> to help map the SQL primary keys of the various tables onto CouchDB
> document ids.
>
> >>> Thanks for reporting. The last time we tried, we were able to make
>>> some major progress in storage size efficiency. I'm not sure how much
>>> low-hanging fruit we have left, but if you try sequential uuids that
>>> would be a good start.
>
> Using _ids that were at most 4 bytes long resulted in *massively* lower
> CouchDB storage consumption.  Here are the numbers; the first column is
> gigabytes.  There are 9.8 million CouchDB documents.  None of the CouchDB
> databases have any views added (yet).
>
>  1.3 SQLite database (53 tables, raw data)
>  2.3 SQLite database after adding indices
>
>  2.5 Couch objects, JSON, one per line text file (no _rev, 16 byte _id)
> 21.4 The same objects loaded into CouchDB, after compaction (compaction saved ~2GB)
>
>  2.0 Couch objects, JSON, one per line text file (no _rev, 4 byte _id)
>  3.8 The same objects loaded into CouchDB, after compaction (pre-compaction size not noted)
>
> Doing compaction on the 21GB database took about 24 hours.  Doing it on
> the 3.8GB database finished in under 30 minutes.  The machine has 6GB of
> RAM.  (It also doesn't help that compaction does frequent fsyncs - they
> really are not needed.)
>
> What this shows is that CouchDB storage efficiency is *highly* correlated
> with _id size.  Rather than 16 hex digits, a base-62 alphabet (10 digits,
> 26 lower, 26 upper) would cover the ~10 million documents here with only
> 4 or so of those "digits".
>
> Roger
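A base-62 codec along the lines Roger describes might look like this
(a hypothetical helper, not part of CouchDB; the alphabet order is
arbitrary):

```python
# Hypothetical helper: pack an integer key (e.g. a SQLite rowid)
# into a short base-62 string for use as a CouchDB _id.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62(n):
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, r = divmod(n, 62)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

# 62**4 = 14,776,336 distinct ids, enough to cover the 9.8 million
# documents in the report above with 4-character _ids:
print(base62(9_800_000))  # a 4-character id
```

Of course, assigning ids from a counter like this trades away the
collision resistance that random uuids give you, which is fine when
the ids come from an existing set of primary keys.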
