Yes, I understand that the rendering needs to be localized, but I believe that is not at all out of keeping with the Unicode design philosophy. For instance, if I had a list of words taken from Spanish, French and English, I would not be able to sort them in language-specific order from the characters alone, because each language induces its own collation order. It seems to me that your friend is still talking about glyphic variants, and glyphic variants are not grounds for breaking a character into 2.
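To make the collation point concrete, here is a rough sketch in Python. The alphabet table and the `spanish_key` helper are hypothetical and hand-written for illustration; they model only the traditional Spanish convention of treating "ch" and "ll" as single letters, and are not a real collation implementation. The same code points come out in two different orders depending on which language's rules you apply:

```python
# Toy collation table: hypothetical and hand-written for illustration.
# It models the traditional Spanish alphabet, in which "ch" and "ll"
# count as single letters sorted after "c" and "l" respectively.
TRADITIONAL_SPANISH = [
    "a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l", "ll",
    "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",
]
RANK = {letter: i for i, letter in enumerate(TRADITIONAL_SPANISH)}


def spanish_key(word):
    """Return a sort key that treats 'ch' and 'll' as single letters."""
    key = []
    w = word.lower()
    i = 0
    while i < len(w):
        pair = w[i:i + 2]
        if len(pair) == 2 and pair in RANK:
            # Digraph: consume both characters as one collation element.
            key.append(RANK[pair])
            i += 2
        else:
            # Single letter; anything unknown sorts after the alphabet.
            key.append(RANK.get(w[i], len(RANK)))
            i += 1
    return key


words = ["cuna", "chico", "luz", "llama", "dama"]

# Plain code point order, which knows nothing about any language.
by_code_point = sorted(words)

# The same strings under the toy traditional-Spanish rules.
by_spanish = sorted(words, key=spanish_key)

print(by_code_point)  # ['chico', 'cuna', 'dama', 'llama', 'luz']
print(by_spanish)     # ['cuna', 'chico', 'dama', 'luz', 'llama']
```

Nothing about the characters changed between the two sorts; only the language rules applied on top of them did, which is the same division of labour I'm arguing for with glyph variants.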
Unicode has been insistent from the beginning on _not_ encoding information about specific languages. Languages change far more frequently than characters. What if next year the Chinese variant of U+516B becomes popular for use in Vietnam? Should the character set be rewritten to accommodate that? No: clearly the language has changed and not the character. You need to indicate in your locale settings (including choice of font) which variants you want to use.

To put it another way: imagine your theoretical article with Chinese, Japanese, Korean and Vietnamese text being displayed. Let's say you're a Japanese person reading it. You use a "Japanese" locale, which includes loading a font made by a Japanese foundry that supports only the Japanese glyph variants. There are 2 possibilities:

(1) You can, through some amazing miracle of language training, read all 4 languages in their native orthography. You grew up with the Japanese variant of U+516B, so when you see it in the middle of the Vietnamese block of text you read it as just a "Japanese-friendly font rendition" of an obviously Vietnamese character, make the reasonable assumption that the author meant the Vietnamese variant, which just doesn't happen to exist in your Japan-made font, and carry on.

(2) You can't read the other 3 languages anyway. Who cares what glyphs they use? You can read U+516B in the Japanese portions of the text, which are the only parts you understand.

I really don't think this is a CJKV-specific problem. The same thing will happen to me if I find myself in Germany writing email in English using a German-localized email client: if I type 2 consecutive "s" characters, I'm not going to be terribly surprised when they form a ligature, one which my friends who have never read German will take, at a glance, for a capital "B". It's a hazard of localization, imo. Unicode is not intended to create an environment in which everyone can magically understand each other.
The aim is merely an environment in which pairs of people speaking the same language can understand each other without having to use specialized versions of the software, and in which automatic tools like grep and sed have some hope of working correctly without knowing which language they're scanning.

-graydon