On Thu, Jan 7, 2010 at 3:53 PM, Robert Newson <robert.new...@gmail.com> wrote:
> The sequential uuids mentioned above are a configuration option. It's
> implicit in what you've said that you are letting CouchDB generate
> uuids on the server side; otherwise the inefficiencies they cause
> would not show up in your report. Switch the setting and try again; I
> think it remedies this problem. If I've misread you and you are
> generating the document ids yourself, then the problem is not with
> couch.

I'd be up for making sequential the default for uuids.

Fast by default isn't a bad philosophy.

>
> The cause, if you're interested, is simply that the low locality of
> reference between identifiers causes lots more btree nodes to be
> updated on each insert, which increases the size of the file and
> requires more effort to traverse during compaction.
>
> I saw similar amounts of "bloat", which is why I contributed the
> original sequential uuids patch some months ago. The uuids generated
> by "sequential" (the default is called "random") are still very
> unlikely to collide but are much kinder to the btree algorithm.
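
To make the locality point concrete, here's a rough sketch in Python of the
idea (not the actual Erlang implementation): a random prefix shared across a
run of ids plus an increasing suffix, so consecutive ids sort next to each
other and keep hitting the same btree leaves, while the prefix keeps
collisions unlikely:

    # sketch of the idea behind "sequential" uuids, not CouchDB's exact code:
    # a random 26-hex-char prefix reused for a run of ids, plus an increasing
    # 6-hex-char suffix, so successive ids are neighbours in the btree
    import os

    _prefix = os.urandom(13).hex()     # 26 hex chars
    _counter = 0

    def next_id():
        global _prefix, _counter
        _counter += 1
        if _counter >= 0xFFFFFF:       # suffix space exhausted: new prefix
            _prefix = os.urandom(13).hex()
            _counter = 1
        return _prefix + format(_counter, "06x")

    print(next_id())   # e.g. ...a3f4b1 followed by 000001
    print(next_id())   # same prefix, suffix 000002
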
>
> Finally, for large documents, or a modest number of attachments, this
> bloat, even with random 128-bit uuids, is much reduced.
>
> B.
>
> On Thu, Jan 7, 2010 at 9:26 PM, Roger Binns <rog...@rogerbinns.com> wrote:
>>
>> Chris Anderson wrote:
>>>> You might try using CouchDB's builtin sequential "uuids". These should
>>>> give you some more storage efficiency.
>>
>> I can't, as various items reference various other items.  The source data
>> is in a SQLite database - highly normalized.  From that I generate couch
>> documents that are effectively denormalized, using a SQLite temporary
>> table to help map the SQL primary keys of the various tables to CouchDB
>> document ids.
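
(Purely as illustration of that mapping - the table and column names below
are made up, not Roger's actual schema - a temporary-table approach could
look something like this:)

    # hypothetical sketch of mapping SQL primary keys to short CouchDB _ids
    # via a SQLite temp table; names are invented, only the shape matters
    import itertools
    import sqlite3

    conn = sqlite3.connect("source.db")       # the normalized source database
    conn.execute("""
        CREATE TEMP TABLE id_map (
            src_table TEXT NOT NULL,
            src_pk    INTEGER NOT NULL,
            doc_id    TEXT NOT NULL,
            PRIMARY KEY (src_table, src_pk)
        )
    """)
    _next = itertools.count(1)

    def doc_id_for(src_table, src_pk):
        """Return (and remember) a short document id for a source row."""
        row = conn.execute(
            "SELECT doc_id FROM id_map WHERE src_table = ? AND src_pk = ?",
            (src_table, src_pk)).fetchone()
        if row:
            return row[0]
        doc_id = format(next(_next), "x")      # compact hex id, a few bytes
        conn.execute("INSERT INTO id_map VALUES (?, ?, ?)",
                     (src_table, src_pk, doc_id))
        return doc_id
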
>>
>>>> Thanks for reporting. The last time we tried, we were able to make
>>>> some major progress in storage size efficiency. I'm not sure how much
>>>> low-hanging fruit we have left, but if you try sequential uuids that
>>>> would be a good start.
>>
>> Using _ids that were at most 4 bytes long resulted in *massively* less
>> CouchDB storage consumption.  Here are the numbers; the first column is
>> gigabytes.  There are 9.8 million CouchDB documents.  None of the CouchDB
>> databases have any views added (yet).
>>
>>  1.3 SQLite database (53 tables, raw data)
>>  2.3 SQLite database after adding indices
>>
>>  2.5 Couch objects, JSON, one per line text file (no _rev, 16 byte _id)
>> 21.4 Those same objects loaded into CouchDB, after compaction (saved ~2GB)
>>
>>  2.0 Couch objects, JSON, one per line text file (no _rev, 4 byte _id)
>>  3.8 Those same objects loaded into CouchDB, after compaction
>>      (didn't note the pre-compaction size)
>>
>> Doing compaction on the 21GB database took about 24 hours.  Doing it on the
>> 3.8GB database took about 30 minutes, probably less.  The machine has
>> 6GB of RAM.  (It also doesn't help that compaction does frequent fsyncs -
>> they really are not needed.)
>>
>> What this shows is that CouchDB storage efficiency is *highly* correlated
>> with _id size.  Rather than 16 hex digits, you could encode the ids in
>> base 62 (10 digits, 26 lowercase, 26 uppercase letters) and need only 4 or
>> so characters to cover this many documents.
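
As a quick sketch of that arithmetic: 62^4 is about 14.8 million, so four
base-62 characters comfortably cover 9.8 million documents:

    # sketch: encode integer keys as short base-62 strings for use as _ids
    ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

    def to_base62(n):
        if n == 0:
            return ALPHABET[0]
        digits = []
        while n:
            n, rem = divmod(n, 62)
            digits.append(ALPHABET[rem])
        return "".join(reversed(digits))

    # 62**4 = 14,776,336 distinct 4-character ids > 9.8 million documents
    print(to_base62(9800000))   # -> "F7qw", four characters
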
>>
>> Roger
>>
>>
>



-- 
Chris Anderson
http://jchrisa.net
http://couch.io
