On Tuesday, May 31, 2016 14:30:08 Andrei Alexandrescu via Digitalmars-d wrote:
> On 5/31/16 2:11 PM, Jonathan M Davis via Digitalmars-d wrote:
> > On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via Digitalmars-d wrote:
> >> On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
> >>> Saying that operating at the code point level - UTF-32 - is correct
> >>> is like saying that operating at UTF-16 instead of UTF-8 is correct.
> >>
> >> Could you please substantiate that? My understanding is that code unit
> >> is a higher-level Unicode notion independent of encoding, whereas code
> >> point is an encoding-dependent representation detail. -- Andrei
> >
> Does walkLength yield the same number for all representations?

walkLength treats a code point like it's a character. My point is that that's incorrect behavior. It will not result in correct string processing in the general case, because a code point is not guaranteed to be a full character. walkLength does not report the length of a character as one in all cases, just as length does not. walkLength is counting bigger units than length, but it's still counting pieces of a character rather than counting full characters.

> > And you can even put that accent on 0 by doing something like
> >
> > assert("0"d ~ "é"d.normalize!NFKD[1] == "0́"d);
> >
> > One or more code units combine to make a single code point, but one or more
> > code points also combine to make a grapheme.
>
> That's right. D's handling of UTF is at the code unit level (like all of
> Unicode is portably defined). If you want graphemes use byGrapheme.
>
> It seems you destroyed your own argument, which was:
> > Saying that operating at the code point level - UTF-32 - is correct
> > is like saying that operating at UTF-16 instead of UTF-8 is correct.
>
> You can't claim code units are just a special case of code points.

The point is that treating a code point like it's a full character is just as wrong as treating a code unit as if it were a full character. It's _not_ guaranteed to be a full character. Treating code points as full characters does give you the correct result in more cases than treating code units as full characters does, but it still gives you the wrong result in many cases. If we want to have fully correct behavior without making the programmer deal with all of the Unicode issues themselves, then we need to operate at the grapheme level so that we are operating on full characters (though that obviously comes at a high cost to efficiency).

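To make the three levels concrete, here is a minimal sketch of the counts involved, assuming a reasonably recent Phobos. The helper names (countCodeUnits, countCodePoints, countGraphemes) are made up for this illustration and are not part of Phobos; they only wrap length, walkLength, and byGrapheme.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helpers, named only for this illustration.
size_t countCodeUnits(string s)  { return s.length; }                // UTF-8 code units
size_t countCodePoints(string s) { return s.walkLength; }            // autodecoded code points
size_t countGraphemes(string s)  { return s.byGrapheme.walkLength; } // grapheme clusters

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one character to the user.
    string s = "e\u0301";

    assert(countCodeUnits(s) == 3);  // 1 byte for 'e' plus 2 bytes for U+0301
    assert(countCodePoints(s) == 2); // bigger units, but still pieces of the character
    assert(countGraphemes(s) == 1);  // the full character
}
```

For plain ASCII text all three counts agree, which is part of why treating code points (or even code units) as characters appears to work until combining characters show up.
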
Treating code points as characters like we do right now does not give the correct result in the general case just like treating code units as characters doesn't give the correct result in the general case. Both work some of the time, but neither works all of the time.

Autodecoding attempts to hide the fact that it's operating on Unicode but does not actually go far enough to result in correct behavior. So, we pay the cost of decoding without getting the benefit of correctness.

- Jonathan M Davis