On Mon, Jun 22, 2009 at 09:21:44AM -0700, Chris Anderson wrote:

> My larger point is that normalization is basically an optimization.
Optimisation of what? Unicode normalisation should be considered
absolutely critical to any canonical form. If we want to use some
proprietary algorithm for determining a document hash, then fine - but
if we are advertising that it is calculated from some canonical
serialisation, then Unicode normalisation really is a base requirement
for that.

> Occasionally getting the hash wrong (for whatever reason) is just
> going to result in spurious conflicts, which aren't critical errors,
> just an annoyance as the application will sometimes have to repair the
> conflicts. Presumably you'll already be repairing conflicts, so fixing
> ones that result from spurious conflicts (especially as they are so
> rare) is not a big cost, and should be easy as the application can
> just pick a random version of the doc, with no ill effects.

As native English speakers, it's fairly easy for us to assume that most
documents are composed of some simple Latin character subset. As soon
as you start working with languages that make heavy use of combining
characters - accents, diacritical marks, and so on - this character
normalisation becomes a major issue in a multi-user environment.

Consider the following code point sequences:

  U+006B U+014D U+0061 U+006E

  U+006B U+006F U+0304 U+0061 U+006E

Both of these render as "kōan", yet which byte sequence you get depends
on my input method.

> I think this makes the NFC stuff a nice-to-have, not a necessity.

I disagree strongly.
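To make that concrete, here is a quick sketch - Python and its
unicodedata module, purely for illustration, not how CouchDB would
actually implement it - showing that the two sequences only compare
equal after NFC normalisation:

  # Illustration only: two code point sequences that both render as "kōan".
  import unicodedata

  precomposed = "\u006B\u014D\u0061\u006E"        # k, ō (U+014D), a, n
  decomposed  = "\u006B\u006F\u0304\u0061\u006E"  # k, o, U+0304 combining macron, a, n

  print(precomposed == decomposed)                 # False - different code point sequences
  print(unicodedata.normalize("NFC", precomposed) ==
        unicodedata.normalize("NFC", decomposed))  # True - identical after NFC

Any hash computed over the raw, un-normalised bytes will differ for
those two inputs, which is exactly the kind of spurious conflict
described above.

Best,

--
Noah Slater, http://tumbolia.org/nslater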
