The "Wide character support in D" thread got me to question and double-check some of my assumptions about Unicode. From double-checking the UTF-8 encoding, and looking at the charts at http://www.unicode.org/charts/ , I realized that Japanese, Chinese and Korean characters are almost entirely (if not entirely) 3 bytes in UTF-8. For some reason I had been under the impression that the Japanese kanas and at least a few of the Chinese characters were 2 bytes in UTF-8. Turns out that's not the case. I thought I'd share that in case anyone else didn't know. Also, FWIW, Cyrillic (e.g., Russian, AIUI) and Greek appear to be primarily, if not entirely, 2 bytes in UTF-8.
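Those byte counts are easy to sanity-check. A quick sketch in Python (used here only because it's handy for poking at Unicode; the encoded lengths are the same no matter what language does the encoding, including D's UTF-8 char[] strings):

```python
# UTF-8 byte lengths of a few sample characters: anything in the range
# U+0800..U+FFFF (which covers the kana blocks, the CJK ideographs,
# and Hangul) takes 3 bytes; U+0080..U+07FF (Cyrillic, Greek) takes 2.
for label, ch in [
    ("hiragana SO (U+305D)", "\u305D"),
    ("CJK ideograph (U+65E5)", "\u65E5"),
    ("Cyrillic YA (U+044F)", "\u044F"),
    ("Greek ALPHA (U+03B1)", "\u03B1"),
    ("ASCII 'a' (U+0061)", "a"),
]:
    print(label, "->", len(ch.encode("utf-8")), "bytes")
# -> 3, 3, 2, 2, 1 bytes respectively
```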
But then I noticed something on the charts for the Japanese kanas (e.g., http://www.unicode.org/charts/PDF/U3040.pdf ).

Umm, first of all, for those unfamiliar with Japanese: there are two phonetic alphabets, hiragana and katakana (in addition to the Chinese characters), and they're based more on syllables than on the individual sounds of western-style letters. Also, some of the sounds are formed by adding a modifier to the symbol for a similar sound. For instance: そ (U+305D, hiragana "so") is the sound "so", and to make "zo" you add what looks like a double quote to it: ぞ (U+305E, hiragana "zo"). (You may need to increase your font size to see it well.) That same modifier converts most of the "s"s to "z"s, any of the "h"s to "b"s, etc. And there's also another modifier (it looks like a little circle) that converts the "h"s to "p"s.

The thing is, there also appear to be Unicode code points for these modifiers by themselves (U+3099 and U+309A). Maybe I'm understanding it wrong, but according to page 3 of the document I linked above, it looks like these are intended to be used in conjunction with the regular letters in order to modify them. So it seems there are two valid ways to encode a single character like ぞ ("zo"): either (U+305E) or (U+305D, U+3099). I think these are what people call "combining characters", but every explanation of Unicode I've ever seen that actually mentions such things just hand-waves it away with "oh, yeah, and then there's something called 'combining characters' that can complicate things", and that's all they ever say.

So, my questions:

1. Am I correct in all of that?
2. Is there a proper way to encode that modifier character by itself? For instance, if you wanted to write "Japanese has a (the modifier by itself here) that changes a sound".
3. A text editor, for instance, is intended to treat something like (U+305D, U+3099) as a single character, right?
4. When comparing strings, are (U+305E) and (U+305D, U+3099) intended to compare as equal?
5. Does Phobos/Tango correctly abide by whatever the answer to #4 is?
6. Are there other languages with similar things for which the answers to #3 and #4 are different? (And if so, how does Phobos/Tango handle it?)
7. I assume Unicode doesn't have any provisions for furigana, right? I assume that would be outside the scope of Unicode, but I thought I'd ask.
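Incidentally, the two encodings of "zo" can be poked at with Python's unicodedata module (a sketch in Python rather than D, since I don't know offhand what Phobos/Tango expose for this). Unicode's name for the relation in question #4 is apparently "canonical equivalence", and normalizing both strings to the same form (NFC or NFD) is the standard way to test it:

```python
import unicodedata

precomposed = "\u305E"        # "zo" as a single code point
decomposed = "\u305D\u3099"   # "so" + combining voicing mark

# A naive code-point (or byte) comparison sees two different strings...
print(precomposed == decomposed)  # False

# ...but normalizing both to NFC makes them compare equal:
# NFC composes (U+305D, U+3099) into the single code point U+305E.
print(unicodedata.normalize("NFC", precomposed) ==
      unicodedata.normalize("NFC", decomposed))  # True

# U+3099 really is classed as a combining character
# (nonzero canonical combining class).
print(unicodedata.combining("\u3099"))  # nonzero
```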