Re: Unicode normalization (was Re: The 1.0 Thread)

Damien Katz Thu, 25 Jun 2009 14:37:55 -0700

I am now working on an implementation of deterministic revs. After a lot of thinking about this, I've decided not to reuse the revision ids for integrity checking. The canonicalization problem is unresolved, and using a CouchDB-specific canonicalization means other libs/langs/platforms can't play easily with CouchDB replication.

Integrity will be preserved by use of Content-MD5 when transferring/replicating documents, and by checking the document hash when reading from disk. The replicator HTTP client will check the integrity of the network bodies.
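As a rough illustration of the network-side check, here is a minimal sketch in Python (not CouchDB's Erlang code). It assumes the standard Content-MD5 convention of base64-encoding the raw MD5 digest of the body; the function names are made up for this example.

```python
import base64
import hashlib


def content_md5(body: bytes) -> str:
    """Return a Content-MD5 header value: base64 of the raw MD5 digest."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")


def verify_body(body: bytes, header_value: str) -> bool:
    """Check a received body against the Content-MD5 value sent with it."""
    return content_md5(body) == header_value
```

The sender would compute `content_md5(body)` and put it in the header; the receiver calls `verify_body` on what it actually received and rejects the transfer on a mismatch.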

If you need end-to-end integrity checking, you can use an application-specific scheme to sign/hash various fields and attachments, if you can deal with the string and floating point canonicalization issues.

My plan is that when generating new rev ids, CouchDB will deterministically generate the same revision id when edited with the same data. But it is still specific to the version of CouchDB and its dependencies (version of Erlang, version of ICU, etc). It will usually be the same across versions, but that is not guaranteed.

What this will allow is for a single client to send the same edits to 2 identical Erlang servers and see the same revids generated on both. Optionally, it will also allow that if 2 clients make byte-identical saves for a document, they will get the same revision, and you don't need to return a conflict error to the second client to save. I'm not sure about implementing this though.
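To make the optional behaviour concrete, here is a toy in-memory sketch in Python. Everything in it is hypothetical (the `Store` class, the `Conflict` error, deriving the rev from the body bytes alone); it only shows how a byte-identical second save could be answered with the existing rev instead of a conflict.

```python
import hashlib


class Conflict(Exception):
    """Raised when a save is based on a stale revision."""


class Store:
    """Toy single-revision store; not how CouchDB tracks revisions."""

    def __init__(self):
        self.current = {}  # doc_id -> (rev, body)

    def save(self, doc_id, parent_rev, body: bytes) -> str:
        # Assumption: the rev is derived from the saved bytes only.
        new_rev = hashlib.md5(body).hexdigest()
        existing = self.current.get(doc_id)
        current_rev = existing[0] if existing else None
        if current_rev == parent_rev:
            # Normal case: the client edited the latest revision.
            self.current[doc_id] = (new_rev, body)
            return new_rev
        if current_rev == new_rev:
            # Stale parent, but the bytes match what is already stored:
            # report the existing rev rather than a conflict.
            return new_rev
        raise Conflict(doc_id)
```

With this sketch, two clients that both save the same bytes for a new document get the same rev back, while a third client saving different bytes against the stale parent still gets a conflict.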

To implement this, CouchDB will store an MD5 hash of all the attachments along with the JSON document. When saving a new document, we hash the native document and the attachment hashes together to generate the revision id.
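The scheme above could be sketched roughly as follows, in Python rather than Erlang. The byte encoding of the "native document" is not specified here, so the sketch substitutes `json.dumps` with sorted keys as a stand-in; that choice, the function names, and the way attachment names are mixed in are all assumptions.

```python
import hashlib
import json
from typing import Dict


def attachment_hashes(attachments: Dict[str, bytes]) -> Dict[str, str]:
    """MD5 each attachment body; these digests are stored with the document."""
    return {name: hashlib.md5(data).hexdigest() for name, data in attachments.items()}


def revision_id(doc: dict, attachments: Dict[str, bytes]) -> str:
    """Hash the document and the attachment digests together."""
    h = hashlib.md5()
    # Stand-in serialization (assumption): compact JSON with sorted keys.
    h.update(json.dumps(doc, sort_keys=True, separators=(",", ":")).encode("utf-8"))
    # Sort by attachment name so the result does not depend on dict order.
    for name, digest in sorted(attachment_hashes(attachments).items()):
        h.update(name.encode("utf-8"))
        h.update(digest.encode("ascii"))
    return h.hexdigest()
```

Two servers running this same code on the same input produce the same id, but a different serializer or library version could change the bytes being hashed, which is why the id is not guaranteed to be stable across versions.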

CouchDB will also store an MD5 hash of the JSON document itself. This will give us disk integrity checking for all documents and their attachments in a database. When CouchDB encounters a corrupt document or attachment, it will stop what it's doing and return an error. The admin can restore from backup, or recreate the database by deleting it and re-replicating from a peer.
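A minimal sketch of the read-side check, again in Python and with a made-up record layout (16-byte MD5 digest followed by the body) that is not CouchDB's actual file format:

```python
import hashlib


class CorruptDocumentError(Exception):
    """Raised when stored bytes no longer match their stored hash."""


def write_record(body: bytes) -> bytes:
    """Prefix the body with its 16-byte MD5 digest (hypothetical layout)."""
    return hashlib.md5(body).digest() + body


def read_record(record: bytes) -> bytes:
    """Return the body, or raise if it does not match the stored digest."""
    stored, body = record[:16], record[16:]
    if hashlib.md5(body).digest() != stored:
        raise CorruptDocumentError("stored hash does not match document body")
    return body
```

The reader stops and raises on the first mismatch instead of returning possibly damaged data, which matches the "stop and return an error" behaviour described above.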

I think this is the most pragmatic way to do deterministic revs and integrity checking. That is, do as little as possible and let others deal with the problems and implications of canonicalization if they want to do end-to-end integrity checking.


Feedback please.

-Damien
