On Mon, Jun 22, 2009 at 2:46 PM, Noah Slater<nsla...@apache.org> wrote:
> On Mon, Jun 22, 2009 at 09:21:44AM -0700, Chris Anderson wrote:
>> My larger point is that normalization is basically an optimization.
>
> Optimisation of what? Unicode normalisation should be considered absolutely
> critical to any canonical form. If we want to use some proprietary algorithm
> for determining a document hash, then fine - but if we are advertising that
> it is calculated from some canonical serialisation, then Unicode
> normalisation really is a base requirement for that.
>
I think he means optimization insofar as the deterministic revision algorithm
still works regardless. If clients want to avoid spurious conflicts, then they
should send normalized Unicode to avoid the issue. In other words, if we just
write the algorithm to not care about normalization, it'll solve lots of cases
for free, and the cases that aren't solved can be solved if the client so
desires.

>> Occasionally getting the hash wrong (for whatever reason) is just
>> going to result in spurious conflicts, which aren't critical errors,
>> just an annoyance as the application will sometimes have to repair the
>> conflicts. Presumably you'll already be repairing conflicts, so fixing
>> ones that result from spurious conflicts (especially as they are so
>> rare) is not a big cost, and should be easy as the application can
>> just pick a random version of the doc, with no ill effects.
>
> As native English speakers, it's fairly easy for us to assume that most
> documents are comprised of some simple Latin character subset. As soon as you
> start working with languages that make heavy use of combining characters -
> accents, diacritical marks, etc - then this character normalisation becomes a
> major issue in a multi-user environment.
>
> Consider the following byte sequences:
>
> U+006B U+014D U+0061 U+006E
>
> U+006B U+006F U+0304 U+0061 U+006E
>
> Both of these look like "kōan" yet the byte sequence depends on my input
> method.
>

The question is what we are trying to accomplish. We could sit around and
argue about whether we're creating revisions of sequences of code points or
sequences of characters. I'm more than happy to say sequences of bytes (and
thus code points), because that is applicable to most clients and does not
prevent the algorithm from working on normalized Unicode.

>> I think this makes the NFC stuff a nice-to-have, not a necessity.
>
> I disagree strongly.
>
> Best,
>
> --
> Noah Slater, http://tumbolia.org/nslater
>

The thing that worries me most about normalization is that we would end up
causing more problems by being complete than if we just took the naive route.
Requiring that a client have an implementation of normalization that is
byte-identical to the one CouchDB uses, instead of just being internally
consistent, seems like it could trip up a lot of clients. Granted, there are a
lot of other corners to get snagged on, so perhaps for the time being we
should just say screw it, implement something, and see what clients have
problems. Whether or not we add normalization as a step, I could go with a
coin toss at this point, assuming our ICU library dependency can do it.
Either way, enough hand waving for one day.
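P.S. To make Noah's "kōan" example concrete, here's a quick sketch of the
spurious conflict. Python and MD5 are just stand-ins for illustration, not a
claim about what the rev algorithm or CouchDB would actually use; the point is
only that the two sequences hash differently until someone normalizes them.

    # The two "kōan" sequences from Noah's mail: same rendered text,
    # different code points, so different UTF-8 bytes and a different hash.
    import hashlib
    import unicodedata

    composed = "\u006B\u014D\u0061\u006E"          # k ō a n (precomposed)
    decomposed = "\u006B\u006F\u0304\u0061\u006E"  # k, o + combining macron, a, n

    print(composed == decomposed)                              # False
    print(hashlib.md5(composed.encode("utf-8")).hexdigest() ==
          hashlib.md5(decomposed.encode("utf-8")).hexdigest()) # False: spurious conflict

    # If the client applies NFC before sending, the two agree and hash the same.
    nfc = unicodedata.normalize("NFC", decomposed)
    print(nfc == composed)                                     # True
    print(hashlib.md5(nfc.encode("utf-8")).hexdigest() ==
          hashlib.md5(composed.encode("utf-8")).hexdigest())   # True

That's the trade-off in miniature: the algorithm is deterministic either way,
and normalization only decides whose job it is to avoid the conflict.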