Hi,
At Fri, 2 Feb 2001 18:09:10 +0100 ,
Karlsson Kent - keka <[EMAIL PROTECTED]> wrote:
>>'mojibake' - broken characters caused by encoding mismatching.
>
> I'm not sure what you mean. Most of the "broken characters"
> in 10646/Unicode come from East Asian legacy encodings, and
> there are even more in Unicode 3.1 because of recent additions
> of Han compatibility characters going into 10646-2.
Imagine you start xterm in UTF-8 mode and run an 8-bit application
in it. The garbage you see on the screen is mojibake. It has nothing
to do with a shortage of characters. It is, as I wrote, an encoding
mismatch.
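The xterm case can be imitated in a few lines of Python. This is only
a sketch: the sample word and the choice of ISO-8859-1 as the 8-bit
encoding are assumptions for illustration, and show_as_utf8 is a
hypothetical helper standing in for a terminal in UTF-8 mode.

```python
def show_as_utf8(raw: bytes) -> str:
    """Decode bytes the way a UTF-8 terminal would, marking invalid input."""
    return raw.decode("utf-8", errors="replace")


# An 8-bit application writes "cafe" with e-acute in ISO-8859-1 (assumed).
latin1_bytes = "caf\u00e9".encode("iso-8859-1")   # b'caf\xe9'

# The UTF-8 terminal cannot interpret the lone byte 0xE9, so it shows
# U+FFFD REPLACEMENT CHARACTER instead of the intended letter.
garbled = show_as_utf8(latin1_bytes)

# The opposite mismatch: UTF-8 bytes read by an 8-bit (Latin-1) terminal.
utf8_bytes = "caf\u00e9".encode("utf-8")          # b'caf\xc3\xa9'
garbled_reverse = utf8_bytes.decode("iso-8859-1")

# ascii() keeps the output printable whatever the terminal encoding is.
print(ascii(garbled))
print(ascii(garbled_reverse))
```

In both directions every byte is delivered intact; only the
interpretation differs, which is why adding characters to Unicode
cannot help here.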
> Originally Unicode was intended only for characters in active use
> these days. But the scope has been extended to cover also historic
> characters. That is why about 43 000 new Han characters are
> going into Unicode/10646, still with the same unification principles
> as for the BMP. Most are collected from various dictionaries,
> only a smaller number are compatibility Han (Kanji) characters
> (insisted on by Japan).
You misunderstand what I mean. That is understandable if you don't
know CJK languages. I am not talking about historic ideographs.
Let me explain in more detail.
In Unicode, CJK characters with the same meaning and a similar shape
are unified. For example, U+9AA8 (ideograph 'bone') unifies 0x3947 from
GB2312 (Mainland China), 0x586C from CNS11643-1 (Taiwan), 0x397C from
JISX0208 (Japan), and 0x4D69 from KSX1001 (Korea). However, although
these characters share a common origin, today they have different
shapes, and CJK people cannot tolerate one being substituted for
another. Note that none of these characters is historic; all are in
daily use. Also note that no future extension can fix this problem,
because code points already assigned in Unicode will not be changed.
(Moreover, if they were changed, confusion would result.)
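The unification can be observed from the legacy side with Python's
standard codecs. A minimal sketch, assuming that the gb2312, euc_jp and
euc_kr codecs are the EUC forms of GB2312, JISX0208 and KSX1001 (each
byte is the 7-bit code with the high bit set). legacy_code is a
hypothetical helper, and CNS11643 is left out because the standard
library has no codec for it. The printed codes can be compared with
the values quoted above.

```python
IDEOGRAPH = "\u9aa8"  # U+9AA8, ideograph 'bone'

# Python codec name -> national standard it carries in EUC form (assumed).
EUC_CODECS = {
    "gb2312": "GB2312",
    "euc_jp": "JISX0208",
    "euc_kr": "KSX1001",
}


def legacy_code(char: str, codec: str) -> int:
    """Return the 7-bit two-byte code of `char` in an EUC-encoded standard."""
    hi, lo = char.encode(codec)          # fails unless exactly two bytes
    return ((hi & 0x7F) << 8) | (lo & 0x7F)


# One Unicode code point, looked up in each national standard.
codes = {std: legacy_code(IDEOGRAPH, codec)
         for codec, std in EUC_CODECS.items()}

for std, code in codes.items():
    print(f"{std}: 0x{code:04X}")

# Decoding any of the legacy byte sequences yields the same code point,
# so the information about which national shape was meant is lost.
decoded = {IDEOGRAPH.encode(codec).decode(codec) for codec in EUC_CODECS}
print(len(decoded))
```

After the round trip through Unicode only U+9AA8 remains, so the
shape has to be chosen by something outside the text, such as the font.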
Note that not all unifications are bad, because some CJK ideographs
have exactly the same shape.
This is why XFree86 has the very similar
-misc-fixed-medium-r-normal-ja- and
-misc-fixed-medium-r-normal-ko- fonts. We would also need
-misc-fixed-medium-r-normal-zhcn- and
-misc-fixed-medium-r-normal-zhtw- fonts.
They have different glyphs for the badly unified ideographs and
exactly the same glyphs for all other characters.
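An application could pick among such fonts from the user's locale.
The following is only a sketch of that idea: font_prefix_for_locale
and the locale-to-tag table are hypothetical, the font names are kept
as truncated as they appear above, and the zhcn and zhtw entries refer
to fonts that are wished for here, not ones known to exist.

```python
from typing import Optional

# Common part of the font names as written in this mail.
XLFD_PREFIX = "-misc-fixed-medium-r-normal-"

# Locale (language or language_TERRITORY) -> style tag; an assumed mapping.
STYLE_TAG = {
    "ja": "ja",
    "ko": "ko",
    "zh_CN": "zhcn",   # hypothetical font
    "zh_TW": "zhtw",   # hypothetical font
}


def font_prefix_for_locale(locale_name: str) -> Optional[str]:
    """Return the font-name prefix for a POSIX locale name, or None."""
    # Drop the codeset and modifier: "ja_JP.UTF-8" -> "ja_JP".
    base = locale_name.split(".", 1)[0].split("@", 1)[0]
    language = base.split("_", 1)[0]
    # Try language_TERRITORY first, then the bare language.
    tag = STYLE_TAG.get(base) or STYLE_TAG.get(language)
    return f"{XLFD_PREFIX}{tag}-" if tag else None


for name in ("ja_JP.UTF-8", "ko_KR.eucKR", "zh_TW.Big5", "en_US"):
    print(name, font_prefix_for_locale(name))
```

A bare "zh" deliberately gives no result, because the language alone
does not say which of the two Chinese glyph sets is wanted.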
---
Tomohiro KUBOTA <[EMAIL PROTECTED]>
http://surfchem0.riken.go.jp/~kubota/
"Introduction to I18N"
http://www.debian.org/doc/manuals/intro-i18n/
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/