Dear Jim,
On 2024/11/08 20:58, Jim Breen wrote:
On Fri, 8 Nov 2024 at 19:05, suzuki toshiya <[email protected]> wrote:
I understand your background is the academic study of the Japanese
language, but is there any special reason to mention JIS X 0213 during
a discussion of the general-purpose encoding scheme UTF-8?
It was an aside. (My academic background is in computer science;
Japanese NLP is a diversion which I have followed in my retirement.)
Oh, thank you for correcting my misunderstanding!
The original question was about the source code for UTF-8, and the OP
mentioned using Debian Linux. I wanted to point out that there was
source code available for conversion of codes to UTF-8. I tossed in a
representation of the conversion of 16-bit Unicode points into 3-byte
UTF-8 sequences. (All the characters in JIS X 0208 and JIS X 0212 were
incorporated in the initial Unicode version.) Markus Scherer added
the representation of 21-bit Unicode in UTF-8, so I pointed out that
relatively few kanji in the JIS standards have 21-bit codepoints.
Correct. But why should we restrict the focus to the JIS character sets?
I could not find any priority given to the JIS charsets in the (painful)
discussion... If I focused on the ISO 8859 character sets, the use of
16-bit codepoints would be rare, but if I said so, (I believe) many
experts on this mailing list would say, "sorry, our discussion is more
generic".
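To make the 16-bit and 21-bit cases concrete, here is a minimal Python sketch of the UTF-8 byte layout. The helper name utf8_encode is hypothetical and is not taken from any source code mentioned in this thread; it is checked against Python's built-in encoder only as a sanity test.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 (hypothetical helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:       # up to 7 bits: 1 byte
        return bytes([cp])
    if cp < 0x800:      # up to 11 bits: 2 bytes
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:    # up to 16 bits: 3 bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # up to 21 bits: 4 bytes
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# U+00E9 (in ISO 8859-1), U+53F1 (in JIS X 0208), U+20B9F (beyond 16 bits)
for ch in "\u00e9\u53f1\U00020b9f":
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(utf8_encode(ord(ch)))} bytes")
```

The three sample characters take 2, 3, and 4 bytes respectively, which is the whole difference between the ISO 8859, JIS X 0208, and 21-bit cases as far as UTF-8 is concerned.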
In Japan, many running systems are still restricted to JIS X 0208,
especially in the public sector.
Interesting comment. I guess you are aware that several of the changes
and additions made in the 2010 revision of the 常用漢字 involved the use
of kanji from outside JIS X 0208. Also, government bodies such as 文化庁
have been encouraging the use of Unicode-only kanji in lists such as
the 表外漢字字体表.
[...]
It's questionable whether 文化庁 was so ambitious as to replace JIS
X 0208 + 0212 with ISO/IEC 10646. I guess they did not understand
character encoding or the industrial standards.
I guess the original motivation for the 表外漢字字体表 was not to
extend the character set; rather, it was to eliminate the
"non-authentic simplified forms" of characters, to the extent that
they had been exceptionally permitted by the 常用漢字 of 1981.
Maybe the people driving the 表外漢字字体表 dreamed that their
result would urge Japanese IT companies to replace the simplified glyphs
on JIS X 0208:1983-based systems with more traditional glyph shapes,
like 鷗, 𠮟, 噓, etc., without changing the character encoding scheme.
Unfortunately, it was too late to realize such a dream. As you know,
these glyph shapes were already coded as different characters in
ISO/IEC 10646, and Japanese IT companies could no longer afford to
rebuild their systems without ISO/IEC 10646-based frameworks.
Even when government customers ask Japanese vendors to build
a system supporting characters which are not in JIS X 0208
but exist in ISO/IEC 10646, some vendors sell systems in which
the non-JIS characters are coded at the PUA codepoints of a JIS
X 0208-based encoding (like Windows-31J), because they have no
experience designing any other mechanism.
The "authentic traditional forms" coded in JIS X 0213:2004 are
still tagged as [環境依存文字] ("environment-dependent characters")
by the Microsoft IME, so many people think they are non-portable
characters. In fact, if I create a file whose name includes "𠮟"
(U+20B9F) instead of "叱" (U+53F1) on Microsoft Windows as sold in
Japan (running under the Japanese locale), I cannot put it in a ZIP
file with the built-in file manager (the so-called "Explorer"). The
file manager warns that "there are characters which cannot be used in
a compressed folder". Clearly, this is a restriction inherited from
Windows-31J.
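The underlying limitation can be sketched in Python, assuming that Python's built-in "cp932" codec is a reasonable stand-in for Windows-31J. Whether Explorer performs exactly this check is an assumption on my side, and the function name fits_windows_31j is hypothetical.

```python
def fits_windows_31j(name: str) -> bool:
    """Return True if every character of name can be encoded in cp932.

    cp932 is used here as a stand-in for Windows-31J (an assumption).
    """
    try:
        name.encode("cp932")
        return True
    except UnicodeEncodeError:
        return False


print(fits_windows_31j("\u53f1.txt"))      # U+53F1, in JIS X 0208
print(fits_windows_31j("\U00020b9f.txt"))  # U+20B9F, outside the 16-bit range
```

The first name is representable and the second is not, which is consistent with the warning shown by the file manager.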
I think the prevalence of "21-bit Unicode codepoints" in Japanese text
depends heavily on the category of the text.
Absolutely. Despite some misguided grumbling in Japan about Unicode
in its early days, it's what virtually everyone uses now, and no-one
is really aware whether the codepoints are 16 or 21 bits.
As I recall, some lecturers at Japanese universities, whose role is to
teach information technology to young students, still teach that
there are two kinds of character encoding: single-byte encodings
like ASCII, and double-byte encodings like JIS kanji...
Regards,
mpsuzuki