On Thu, Jun 02, 2016 at 05:19:48PM -0700, Walter Bright via Digitalmars-d wrote:
> On 6/2/2016 3:27 PM, John Colvin wrote:
> > > I wonder what rationale there is for Unicode to have two different
> > > sequences of codepoints be treated as the same. It's madness.
> >
> > There are languages that make heavy use of diacritics, often several
> > on a single "character". Hebrew is a good example. Should there be
> > only one valid ordering of any given set of diacritics on any given
> > character?
>
> I didn't say ordering, I said there should be no such thing as
> "normalization" in Unicode, where two codepoints are considered to be
> identical to some other codepoint.
I think it was a combination of historical baggage and trying to accommodate unusual but still valid use cases.

The historical baggage was that Unicode was trying to unify all of the various already-existing codepages out there, and many of those codepages already came with various precomposed characters. To maximize compatibility with existing codepages, Unicode tried to preserve as much of the original mappings as possible within each 256-point block, so these precomposed characters became part of the standard.

However, there weren't enough of them -- some people demanded less common character + diacritic combinations, and some languages had writing so complex their characters had to be composed from more basic parts. The original Unicode range was 16-bit, so there wasn't enough room to fit all of the precomposed characters people demanded, plus there were other things people wanted, like multiple diacritics (e.g., in IPA). So the concept of combining diacritics was invented, partly to prevent a combinatorial explosion from soaking up the available code point space, and partly to allow for novel combinations of diacritics that somebody out there might want to represent.

However, this meant that some precomposed characters were "redundant": they represented character + diacritic combinations that could equally well be expressed separately. Normalization was the inevitable consequence. (Normalization also interacts with a few other things, such as collation, but this is one of the factors behind it.)

(This is a greatly over-simplified description, of course. At the time Unicode also had to grapple with tricky issues like what to do with lookalike characters that served different purposes or had different meanings, e.g., the mu sign in the math block vs.
the real letter mu in the Greek block. Or the Cyrillic А, which looks and behaves exactly like the Latin A, whereas the Cyrillic Р, which looks like the Latin P, does *not* mean the same thing (it's the equivalent of R); or the Cyrillic В, whose lowercase is в, not b, and which has a different sound, while lowercase Latin b looks very similar to Cyrillic ь, which serves a completely different purpose (its uppercase is Ь, not B, you see).

Then you have the wonderful Indic and Arabic cursive writings, where letterforms mutate depending on the surrounding context and which, if you were to include all variants as distinct code points, would occupy many more pages than they currently do. And there are sticky issues like the oft-mentioned Turkish i, which is encoded as a Latin i but behaves differently w.r.t. upper/lowercasing in a Turkish locale -- some cases of this, IIRC, are unfixable bugs in Phobos because we currently do not handle locales.

So you see, imagining that code points == the solution to Unicode string handling is a joke. Writing correct Unicode handling is *hard*.)

As with all sufficiently complex software projects, Unicode represents a compromise between many contradictory factors -- writing systems in the world being the complex, not-very-consistent beasts they are -- so such "dirty" details are somewhat inevitable.


T

-- 
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are, by
definition, not smart enough to debug it. -- Brian W. Kernighan
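[To make the normalization and lookalike points above concrete, here is a small Python sketch -- an illustrative addition, not part of the original message -- using the standard unicodedata module:]

```python
import unicodedata

# Precomposed U+00E9 vs. 'e' + combining acute U+0301: the same abstract
# character, but two different code point sequences.
precomposed = "\u00e9"   # é as a single precomposed code point
decomposed = "e\u0301"   # é as base letter + combining diacritic

# Raw code-point comparison sees them as different strings...
assert precomposed != decomposed

# ...but normalization maps them onto each other (NFC composes, NFD decomposes).
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed

# Lookalikes, by contrast, are *not* unified: Cyrillic А (U+0410) remains
# a distinct code point from Latin A (U+0041) because it means something else.
assert "\u0410" != "A"
assert unicodedata.normalize("NFC", "\u0410") != "A"
print(unicodedata.name("\u0410"))  # CYRILLIC CAPITAL LETTER A

# And Python's str.upper() is locale-independent, so the Turkish dotted
# uppercase İ (U+0130) is not produced for a plain Latin i:
assert "i".upper() == "I"   # a Turkish locale would want "\u0130" instead
```

[The NFC/NFD pair shown here is exactly the "two sequences treated as the same" situation the thread is arguing about: equality of Unicode text cannot, in general, be decided by comparing code points alone.]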