Re: Unicode normalization (was Re: The 1.0 Thread)

Paul Davis Mon, 22 Jun 2009 09:17:13 -0700

A couple of things after thinking this over for a night.

Firstly, I was about to write an email almost exactly like Chris' last
night, but during the time of drafting it I started looking into
unicode normalization and what effects it might have. As it turns out,
we're already messing around with strings in such a way as to be
confusing. For instance, unicode escaped latin characters are mutated
when going through the JSON decoder. As in, "\u0043" -> "C".
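To make that concrete, here is a rough sketch of the same effect using Python's standard json module as a stand-in (an assumption on my part; it is not the decoder CouchDB uses, but any conforming JSON decoder resolves the escape the same way). The variable names are made up for illustration.

```python
import json

# The JSON text as a client might send it: the six-character escape
# sequence \u0043 wrapped in double quotes.
raw = '"\\u0043"'

# Decoding resolves the escape to the character it names.
decoded = json.loads(raw)

# Re-encoding emits the plain character, so the original bytes are gone.
reencoded = json.dumps(decoded)

print(repr(raw), "->", repr(decoded), "->", repr(reencoded))
```

The round trip is lossless for the value but not for the bytes, which is why hashing the raw request body and hashing the decoded document can disagree.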


Secondly, as Noah points out, we shouldn't be using the derived type
for storage, only for the revision calculation, and as Antony points
out, we could always implement an endpoint to make the normalization
algorithm available to clients. And lest we forget, there are also
major headaches awaiting us with float normalization.
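A minimal sketch of "normalize for the revision calculation only", again in Python purely for illustration. Everything here is an assumption rather than what CouchDB does: the helper names nfc and revision_hash are hypothetical, SHA-256 is an arbitrary choice of hash, and sorted keys with compact separators is just one possible canonical serialization.

```python
import hashlib
import json
import unicodedata


def nfc(value):
    """Return a copy of a JSON value with every string (keys too) in NFC."""
    if isinstance(value, str):
        return unicodedata.normalize("NFC", value)
    if isinstance(value, list):
        return [nfc(item) for item in value]
    if isinstance(value, dict):
        # Note: two keys that differ only by normalization would collapse
        # into one here; a real implementation would have to decide what
        # to do about that.
        return {nfc(key): nfc(item) for key, item in value.items()}
    return value  # numbers, booleans and null pass through untouched


def revision_hash(doc):
    """Hash a temporary canonical copy; the caller's document is not modified."""
    canonical = json.dumps(
        nfc(doc), sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same visible text, different code points: precomposed vs. combining accent.
composed = {"name": "caf\u00e9"}
decomposed = {"name": "cafe\u0301"}

print(revision_hash(composed) == revision_hash(decomposed))
print(decomposed["name"] == "cafe\u0301")  # stored form is left as submitted
```

Both documents hash identically while each keeps the exact strings the client sent. Floats are serialized however json.dumps happens to print them, so this sketch does nothing about the float normalization problem.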

Yay fun stuff!

Paul

On Mon, Jun 22, 2009 at 11:19 AM, Antony Blakey<antony.bla...@gmail.com> wrote:
>
> On 23/06/2009, at 12:06 AM, Noah Slater wrote:
>
>> On Sun, Jun 21, 2009 at 11:21:00PM -0700, Chris Anderson wrote:
>>>
>>> My gut reaction is that normalizing strings using NFC [1] is not
>>> appropriate
>>> for a database. Here's why we should treat strings as binary and not
>>> worry
>>> about unicode normalization at all:
>>
>> [...]
>>>
>>> First of all, I'm certain we can't require that all input already be NFC
>>> normalized.
>>
>> [...]
>>>
>>> Secondly, we're a database, so I find highly suspicious the notion that
>>> we
>>> should auto-normalize user input on-the-quiet.
>>
>> [...]
>>>
>>> So we can't require normalized input and we can't auto-normalize.
>>
>> CouchDB would create a canonicalised copy of the document while creating
>> the
>> document hash. There is no reason why CouchDB, or the clients, should
>> worry
>> about canonicalising the actual documents.
>>
>>> Where does this leave us?
>>
>> Canonicalisation is a temporary step, so there are no problems.
>
> +1 to those two points.
>
>>>> Unicode normalisation is an issue for clients because it requires they
>>>> have
>>>> access to a Unicode NFC function.
>>
>> Why would clients need to worry about this? CouchDB is creating the
>> hashes.
>
> At the moment, sure, but I was anticipating cases where the canonical
> form, or a hash thereof, would then creep into other contexts, i.e. once you
> have the facility, who knows what you might want to do. OTOH, this could be
> dealt with via a canonicalisation service e.g. POST json payload(s), get
> back hashes of the canonical form(s) (or the forms themselves), which means
> that systems without access to unicode normalisation can still function with
> future facilities.
>
> Antony Blakey
> --------------------------
> CTO, Linkuistics Pty Ltd
> Ph: 0438 840 787
>
> The greatest challenge to any thinker is stating the problem in a way that
> will allow a solution
>  -- Bertrand Russell
>
>
