On Mon, Jun 22, 2009 at 03:15:24PM -0400, Paul Davis wrote:

> I think he means optimization in so much as that the deterministic
> revision algorithm still works regardless.
For what definition of "works", though? Should two versions of the same
document, one using combining code points and the other using single code
points, generate the same hash? (See the P.S. for a concrete illustration.)
If the answer is no, then why don't we drop the pretence of working with a
canonical version of the document and just SHA1 the binary serialisation?

> If clients want to avoid spurious conflicts, then they should send normalized
> unicode to avoid the issue.

You're arguing for using a blind binary hash like SHA1, and putting the onus
on clients to perform canonicalisation. That's fine, as long as you realise
that this isn't what was originally being proposed.

> In other words, if we just write the algorithm to not care about normalization
> it'll solve lots of cases for free and the cases that aren't solved can be
> solved if the client so desires.

I'm not sure what this "solves" other than implementation effort for us.
Again, it seems like you're offering us two routes:

 * calculate the document hash from a canonical binary serialisation

 * calculate the document hash from the raw binary serialisation

Both of these are reasonable choices, depending on our goals. But we need to
realise that JSON canonicalisation REQUIRES Unicode canonicalisation, and so
the choice isn't about ignoring Unicode issues, it's about deciding what a
canonical serialisation hash buys us above and beyond a totally blind one.

> The thing that worries me most about normalization is that we would end up
> causing more problems by being complete than if we just took the naive route.
> Requiring that a client have an implementation of normalization byte identical
> to the one CouchDB uses instead of just being internally consistent seems like
> it could trip up alot of clients.

Interestingly enough, your suggestion to use a blind binary hash such as SHA1
pushes the canonicalisation issues onto the client, and would force them to
find a complete Unicode library. If CouchDB calculated the hash from a
canonical serialisation internally, we would remove this burden from the
clients.

Best,

--
Noah Slater, http://tumbolia.org/nslater
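P.S. To make the combining code point example concrete, here is a quick
sketch in Python (purely illustrative, and nothing to do with CouchDB's
actual revision algorithm) of how a blind SHA1 over the bytes diverges for
two logically identical strings, and how normalising first (NFC here, picked
only as an example normalisation form) brings them back into agreement:

    import hashlib
    import unicodedata

    # Two logically identical strings: "e" with an acute accent as a
    # single code point (U+00E9) versus a base letter plus a combining
    # acute accent (U+0065 U+0301).
    composed = "\u00e9"
    decomposed = "e\u0301"

    # A blind binary hash treats them as different documents.
    print(hashlib.sha1(composed.encode("utf-8")).hexdigest())
    print(hashlib.sha1(decomposed.encode("utf-8")).hexdigest())

    # Hashing a canonical (NFC-normalised) form makes them agree.
    for s in (composed, decomposed):
        canonical = unicodedata.normalize("NFC", s)
        print(hashlib.sha1(canonical.encode("utf-8")).hexdigest())

The same trade-off plays out at the JSON level: a canonical serialisation
folds the Unicode form (along with the rest of the serialisation details)
into one agreed shape before hashing, whereas a blind hash leaves all of
that to the client.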