Re: Unicode normalization (was Re: The 1.0 Thread)

Damien Katz Fri, 26 Jun 2009 04:09:10 -0700

MD5 here is for integrity purposes, not security, so manufactured collisions aren't a problem we are worried about. And I don't think there is a standard SHA1 header, not that I could find anyway.


-Damien



On Jun 26, 2009, at 1:32 AM, kowsik wrote:

Please use SHA-1 because creating collisions with MD5 is trivial:

http://web.archive.org/web/20070604205756/http://www.infosec.sdu.edu.cn/paper/md5-attack.pdf
http://www.mscs.dal.ca/~selinger/md5collision/

etc.

Google for "md5 collision". Effectively, what this means is that it's
easy to generate two documents that have the same MD5 hash. I'm sure
SHA-1 will be an issue at "some point in the future", but MD5 is
already broken from a hashing perspective.

K.

On Thu, Jun 25, 2009 at 2:37 PM, Damien Katz<[email protected]> wrote:
I am now working on an implementation of deterministic revs. After a lot of
thinking about this, I've decided to not reuse the revision ids for
integrity checking. The canonicalization problem is unresolved, and using a
CouchDB-specific canonicalization means other libs/langs/platforms can't
play easily with CouchDB replication.

Integrity will be preserved by use of Content-MD5 when
transferring/replicating documents, and by checking the document hash when
reading from disk. The replicator HTTP client will check the integrity of
the network bodies.
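
A minimal sketch of the Content-MD5 check described above, in Python for
illustration only. The helper names `content_md5` and `verify_body` are
hypothetical and not part of CouchDB; the header value follows RFC 1864
(base64 of the raw 16-byte MD5 digest).

```python
import base64
import hashlib


def content_md5(body: bytes) -> str:
    """Return a Content-MD5 header value: base64 of the raw MD5 digest."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")


def verify_body(body: bytes, header_value: str) -> bool:
    """Check a received body against the Content-MD5 value sent with it."""
    return content_md5(body) == header_value


# Sender side: compute the header for the bytes actually sent on the wire.
sent_body = b'{"_id":"doc1","value":42}'
header = content_md5(sent_body)

# Receiver side: an unchanged body verifies, a damaged one does not.
intact_ok = verify_body(sent_body, header)
damaged_ok = verify_body(sent_body + b" ", header)
```

This only detects accidental corruption in transit. It says nothing about
who produced the body, which matches the "integrity, not security" goal.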

If you need end-to-end integrity checking, you can use an
application-specific scheme to sign/hash various fields and attachments, if
you can deal with the string and floating point canonicalization issues.
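
To illustrate the canonicalization issues mentioned above, here is a small
Python sketch (not CouchDB code). It assumes an application that hashes
UTF-8 text and JSON numbers directly; the helper names are hypothetical.

```python
import hashlib
import json
import unicodedata


def md5_hex(text: str) -> str:
    """MD5 of the UTF-8 encoding of text, as a hex string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def canonical_md5_hex(text: str) -> str:
    """Hash after normalizing to NFC, one possible application-level choice."""
    return md5_hex(unicodedata.normalize("NFC", text))


composed = "caf\u00e9"      # 'e with acute' as a single code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# The two strings render identically but hash differently as raw bytes.
raw_hashes_differ = md5_hex(composed) != md5_hex(decomposed)

# After normalization both sides agree on the bytes being hashed.
normalized_hashes_match = canonical_md5_hex(composed) == canonical_md5_hex(decomposed)

# Numbers have the same problem: 1 and 1.0 compare equal but serialize
# to different text, so a hash over the serialized form differs.
number_forms_differ = json.dumps(1) != json.dumps(1.0)
```

Any end-to-end scheme has to pick rules like these and apply them
identically on every platform that signs or verifies.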

My plan is that when generating new rev ids, CouchDB will deterministically
generate the same revision id when edited with the same data. But it is
still specific to the version of CouchDB and its dependencies (version of
Erlang, version of ICU, etc). It will usually be the same across versions,
but that is not guaranteed.

What this will allow is for a single client to send the same edits to 2
identical Erlang servers and see the same rev ids generated on both.

Optionally, it will allow that if 2 clients make byte-identical saves for a
document, they will get the same revision, and you don't need to return a
conflict error to the second client to save. I'm not sure about
implementing this though.

To implement this, CouchDB will store an MD5 hash of all the attachments
along with the JSON document. When saving a new document, we hash the
native document and the attachment hashes together to generate the
revision id.
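
A rough Python sketch of that idea, not the actual CouchDB implementation.
As an assumption, sorted-key JSON stands in for the "native document" (in
CouchDB that would be an Erlang term), and `revision_id` and
`attachment_md5s` are hypothetical names.

```python
import hashlib
import json


def attachment_md5s(attachments: dict) -> list:
    """Return (name, md5 hex) pairs in a fixed order so hashing is repeatable."""
    return sorted(
        (name, hashlib.md5(data).hexdigest()) for name, data in attachments.items()
    )


def revision_id(doc: dict, attachments: dict) -> str:
    """Hash the document body together with the attachment hashes."""
    h = hashlib.md5()
    # Assumption: a sorted-key, compact JSON dump as the stable byte form.
    h.update(json.dumps(doc, sort_keys=True, separators=(",", ":")).encode("utf-8"))
    for name, digest in attachment_md5s(attachments):
        h.update(name.encode("utf-8"))
        h.update(digest.encode("ascii"))
    return h.hexdigest()


doc = {"title": "notes", "count": 3}
files = {"a.txt": b"hello", "b.txt": b"world"}

# The same data always yields the same id, regardless of dict key order.
rev_a = revision_id(doc, files)
rev_b = revision_id({"count": 3, "title": "notes"}, dict(reversed(list(files.items()))))

# Changing an attachment changes the id.
rev_c = revision_id(doc, {"a.txt": b"hello", "b.txt": b"WORLD"})
```

Two servers running the same code would produce the same id for the same
edit, which is the property described above.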

CouchDB will also store an MD5 hash of the JSON document itself. This will
give us disk integrity checking for all documents and their attachments in
a database. When CouchDB encounters a corrupt document or attachment, it
will stop what it's doing and return an error. The admin can restore from
backup or recreate by deleting and re-replicating from a peer.
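
The read-time check could look something like the following Python sketch.
The record layout, `store`, `load`, and `CorruptDocumentError` are all
hypothetical and only illustrate "hash on write, verify on read, fail
loudly on mismatch".

```python
import hashlib


class CorruptDocumentError(Exception):
    """Raised when stored bytes no longer match their recorded MD5."""


def store(body: bytes) -> dict:
    """Keep the MD5 digest next to the bytes at write time."""
    return {"body": body, "md5": hashlib.md5(body).digest()}


def load(record: dict) -> bytes:
    """Verify the digest before handing the bytes back to the caller."""
    if hashlib.md5(record["body"]).digest() != record["md5"]:
        raise CorruptDocumentError("stored document failed its MD5 check")
    return record["body"]


record = store(b'{"_id":"doc1"}')
good = load(record)  # returns the original bytes

# Simulate on-disk corruption by changing the body after it was stored.
damaged = {"body": b'{"_id":"docX"}', "md5": record["md5"]}
try:
    load(damaged)
    corruption_detected = False
except CorruptDocumentError:
    corruption_detected = True
```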

I think this is the most pragmatic way to do deterministic revs and
integrity checking. That is, do as little as possible and let others deal
with the problems and implications of canonicalization if they want to do
end-to-end integrity checking.

Feedback please.

-Damien
