On Fri, May 27, 2016 at 02:42:27PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:
> On 5/27/16 12:40 PM, H. S. Teoh via Digitalmars-d wrote:
> > Exactly. And we just keep getting stuck on this point. It seems that
> > the message just isn't getting through. The unfounded assumption
> > continues to be made that iterating by code point is somehow
> > "correct" by definition and nobody can challenge it.
>
> Which languages are covered by code points, and which languages
> require graphemes consisting of multiple code points? How does
> normalization play into this? -- Andrei
This is a complicated issue; for a full explanation you'll probably want
to peruse the Unicode codices. For example:

	http://www.unicode.org/faq/char_combmark.html

But in brief: it's mostly a number of common European languages that
have a 1-to-1 code point to character mapping, along with Chinese
writing. Outside of this narrow set, you're on shaky ground. Examples
(that I can think of; there are many others):

- Almost all Korean characters are composed of multiple code points.

- The Indic languages (which cover quite a good number of Unicode code
  pages) have ligatures that require multiple code points.

- The Thai block contains a series of combining diacritics for vowels
  and tones.

- Hebrew vowel points require multiple code points.

- A good number of native American scripts require combining marks,
  e.g., Navajo.

- The International Phonetic Alphabet (primarily for linguistic use,
  but potentially widespread because it's relevant everywhere language
  is spoken).

- Classical Greek accents (though these are less common, mostly being
  used only in academic circles).

Even within the realm of European languages and other languages that
use some version of the Latin script, there is an entire block of code
points in Unicode (the U+0300 block) dedicated to combining diacritics,
and a good number of the possible combinations have no precomposed
characters.

As far as normalization is concerned, it only helps if a particular
combination of diacritics on a base glyph has a precomposed form. A
large number of the above languages do not have precomposed characters,
simply because of the sheer number of possible combinations. The only
reason the CJK block includes a huge number of precomposed characters
is that the rules for combining the base forms are too complex to
encode compositionally. Otherwise, most languages with combining
diacritics would not have precomposed characters assigned to their
respective blocks. In fact, a good number (all?) of the precomposed
Latin characters were included in Unicode only because they existed in
pre-Unicode encodings, and some form of compatibility with those
encodings was desired back when Unicode was not yet widely adopted.

So basically: outside of a small number of languages, the idea that
1 code point == 1 character is pretty unworkable, especially in this
day and age of worldwide connectivity.

T

-- 
The diminished 7th chord is the most flexible and fear-instilling chord. Use it often, use it unsparingly, to subdue your listeners into submission!
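[Editorial aside: the code point vs. grapheme distinction discussed above, and the limits of normalization, can be made concrete in a few lines. Python is used here purely for illustration (the thread itself concerns D); Python's `str` indexes by code point, much like iterating a D string by `dchar`. The specific characters chosen — é, q + combining tilde, and the Korean syllable 한 — are the editor's examples, not the poster's.]

```python
import unicodedata

# Precomposed vs. combining sequence: both render as "é" (one grapheme),
# but they contain a different number of code points.
precomposed = "\u00E9"   # é  (LATIN SMALL LETTER E WITH ACUTE)
combining = "e\u0301"    # e + COMBINING ACUTE ACCENT
assert len(precomposed) == 1
assert len(combining) == 2

# Normalization (NFC) helps here, because a precomposed form exists:
assert unicodedata.normalize("NFC", combining) == precomposed

# But many combinations have no precomposed character. q + COMBINING
# TILDE (historically used for Guaraní) survives NFC as two code points:
q_tilde = "q\u0303"
assert unicodedata.normalize("NFC", q_tilde) == q_tilde
assert len(unicodedata.normalize("NFC", q_tilde)) == 2

# Korean: one visible syllable, three code points after NFD
# decomposition into its constituent jamo.
syllable = "\uD55C"  # 한
assert len(unicodedata.normalize("NFD", syllable)) == 3
```

In each case the user-perceived character count is 1, while the code point count varies with encoding form — which is exactly why code-point iteration cannot be "correct by definition".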