On Sun, Jun 21, 2009 at 4:40 PM, Antony Blakey <[email protected]> wrote:
>
> On 22/06/2009, at 7:26 AM, Paul Davis wrote:
>
>> Also +lots on deterministic revisions. As a side note, we've been
>> worrying a bit about how to calculate the revision id's in the face of
>> JSON differences in clients. I make a motion that we just stop caring
>> and define how we calculate the signature. Ie, instead of calling it
>> canonical JSON just call it, "The CouchDB deterministic revision
>> algorithm" or some such. Then we can feel free to punt on any of the
>> awesome issues we'd run into with a full canonical JSON standard.
>
> I haven't seen the recent discussions about canonicalisation, but IMO a
> minimum requirement is that the infoset <-> serialisation mapping must be
> 1:1, which requires completeness and prescription. Doing unicode
> normalisation (NFC probably) is IMO also an absolute requirement - it's
> virtually impossible to construct documents by hand with repeatable results
> without it.
>
My gut reaction is that normalizing strings using NFC [1] is not
appropriate for a database. Here's why we should treat strings as binary
and not worry about unicode normalization at all:

First of all, I'm certain we can't require that all input already be NFC
normalized. The real-life failure condition would be: "your language /
operating system is not supported by CouchDB." A normal user is not going
to understand the first bit of the fact that the underlying binary
representation of their text could be subtly different in a way that is
invisible to them. And even if they did understand that, they'd be hard
pressed to change it. So rejecting non-normalized strings is unacceptable.

Secondly, we're a database, so I find the notion that we should quietly
auto-normalize user input highly suspicious. Maybe normalization is not
lossy, but one particular use case (however slim) that we can't support if
we auto-normalize is a document which lists variations on the same string,
to illustrate how non-normalized forms look the same but have different
binary representations. A database which can't store that seems flawed, to
me.

So we can't require normalized input and we can't auto-normalize. Where
does this leave us?

Under the current (raw binary) string handling, two variations on a
document which would, when NFC normalized, be binary identical, could have
different deterministic revs. Since 99+% of content is already normalized,
we're looking at a very small set of cases where we'd have distinct revs
for documents that have similar (or identical, depending on your pov)
content. The fact that there are rare pairs of documents out there which
one could argue are the same, but which have different revs, strikes me as
ever so slightly non-optimal, but not really a big deal.
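To make the scenario concrete, here's a minimal Python sketch. The SHA-1
of the UTF-8 bytes stands in for whatever deterministic rev function we
end up with (it is an illustration, not CouchDB's actual algorithm); the
two strings render identically but differ at the codepoint level:

```python
import hashlib
import unicodedata

# Two visually identical strings: "e-acute" as the precomposed codepoint
# U+00E9 vs "e" followed by the combining acute accent U+0301.
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"

# They look the same on screen but are not equal as codepoint sequences...
assert precomposed != decomposed

# ...so any rev computed over the raw bytes differs too. SHA-1 over UTF-8
# is just a stand-in for a deterministic rev function.
rev_a = hashlib.sha1(precomposed.encode("utf-8")).hexdigest()
rev_b = hashlib.sha1(decomposed.encode("utf-8")).hexdigest()
assert rev_a != rev_b

# NFC collapses the decomposed form to the precomposed one, so after
# normalization the bytes (and hence the revs) would match.
assert unicodedata.normalize("NFC", decomposed) == precomposed
```

This is exactly the rare divergence described above: same-looking content,
different revs, unless clients normalize before writing.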
I think the potential optimization can be nicely accounted for by a simple
recommendation:

* If you are doing independent updates (from distinct client software) of
strings in a document, and relying on deterministic revs to avoid
conflict-on-replication, you should NFC normalize your content.

The antecedents in that clause show how the case where normalization
matters for deterministic revs is even rarer than the existence of
non-normalized unicode.

A secondary recommendation for people relying on deterministic revs to
avoid conflicts on multi-node updates would be: "don't mutilate strings
you didn't edit." As long as client software doesn't go jiggling forms to
other random look-alike codepoints without asking, any potential trouble
is confined to fields actually affected by an update.

The common use case for these revs is not lots of distinct client software
all doing identical updates by hand and then pushing them to different
eventually-replicating CouchDB cluster members. (Which is the use case
where any of the above discussion is relevant.) The paradigm use case of
deterministic revs is a single piece of software, running on a single box,
creating a document and saving it to multiple cluster members using the
same rev. Treating strings as binary completely and totally serves this
use case.

Chris

[1] http://www.macchiato.com/unicode/nfc-faq

> Unicode normalisation is an issue for clients because it requires they have
> access to a Unicode NFC function.
>
> Antony Blakey
> --------------------------
> CTO, Linkuistics Pty Ltd
> Ph: 0438 840 787
>
> It is as useless to argue with those who have renounced the use of reason as
> to administer medication to the dead.
>   -- Thomas Jefferson
>

--
Chris Anderson
http://jchrisa.net
http://couch.io
