On Fri, 26 Mar 2004, Mark Keasling wrote:
>> The basic problem is that Japanese, Chinese, and Korean all use a
>> large number of the same "characters" and when
>
> Not the same, but similar.  The unification effort took characters
> with both a similar appearance and a similar meaning and lumped them
> together.

Ah, Mark, have you actually read the Han unification rules in the Unicode specification?


The characters *are* the same, and the rules for unification are actually quite conservative. For example, U+5FB3 and U+5FB7 both mean "virtue": Chinese ("de"), Japanese ("toku"), and Korean readers all agree on that. But the two are not unified, because CNS 11643 treats them differently and, more importantly, because the Chinese/Korean virtue (U+5FB7) has an extra stroke over the "heart" radical that the Japanese virtue (U+5FB3) lacks.
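
If you want to check this yourself, here's a minimal sketch in Python
(my choice of language for illustration; it needs only the standard
unicodedata module):

    import unicodedata

    # U+5FB3 (Japanese "toku") and U+5FB7 (Chinese/Korean "de") both
    # mean "virtue", yet they are separate codepoints: the glyphs
    # differ by a stroke, and CNS 11643 keeps them apart.
    for cp in (0x5FB3, 0x5FB7):
        ch = chr(cp)
        print(f"U+{cp:04X} {ch} {unicodedata.name(ch)}")
    # U+5FB3 徳 CJK UNIFIED IDEOGRAPH-5FB3
    # U+5FB7 德 CJK UNIFIED IDEOGRAPH-5FB7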

To be unified, a pair of characters must have the same:
 . number of components
 . relative position of components in each complete character
 . structure of a corresponding component
 . treatment in a source character set
 . radical contained in a component
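
When a pair passes all of these tests, the characters from the national
charsets map onto a single Unicode codepoint. A small Python sketch
(U+9AA8, "bone", is a stock example: it is unified even though typical
Chinese and Japanese fonts draw its inner component slightly
differently):

    bone = "\u9aa8"  # present in both GB 2312 and JIS X 0208
    gb = bone.encode("gb2312")
    sj = bone.encode("shift_jis")
    print(gb, sj)
    # Round-tripping either legacy encoding yields the same unified
    # character.
    assert gb.decode("gb2312") == sj.decode("shift_jis") == bone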

When you examine the characters that have been unified, you will find that the differences are so exceedingly minor that most native Japanese or Chinese readers would not notice them. [In fact, such surveys have been made.] The so-called "Chinese" version of a glyph can be found in Japanese fonts, and vice versa. When drawing the "house" radical, does the second stroke extend one pixel further... that kind of thing.

There is greater glyph variability in choosing a gothic font versus a mincho versus a saimincho font.

>> Chinese characters CAN be displayed more or less intelligibly with a
>> Japanese font (and vice versa)
>
> less intelligibly

That only occurs with JIS fonts, not with Unicode fonts. JIS fonts lack many Chinese characters, so they have to substitute other, seemingly "equivalent", characters.


This does not happen with Unicode fonts, because the Chinese characters that are not in JIS are present in Unicode.
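
You can see the repertoire gap directly; a minimal sketch in Python,
with the standard shift_jis codec standing in for the JIS X 0208
repertoire (the example character is just a convenient pick):

    # U+4EEC (the simplified-Chinese plural marker) is an ordinary
    # GB 2312 character and is present in Unicode, but JIS X 0208 has
    # no cell for it, so a JIS font must substitute something else.
    men = "\u4eec"
    print(men.encode("gb2312"))  # encodes fine
    try:
        men.encode("shift_jis")
    except UnicodeEncodeError:
        print("U+4EEC is not representable in JIS X 0208")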

There are such things as specifically "Chinese" or "Japanese" Unicode fonts, but most native Chinese or Japanese readers will not see a difference.

Although "only a font problem", this is a problem interfering with the
acceptance of Unicode (it's a cultural identity issue and, I think,
will not be easily resolved).
not just cultural identity but also cultural animosity.

Now we're getting to the real problem. Certain individuals in Japan are terribly offended that a Chinese source (the Kangxi Zidian) is used as the primary source, and a Japanese source (the Daikanwa Jiten) is secondary.


Among other things, these individuals have falsely claimed that Unicode will force Japanese to adopt Chinese forms for kanji, and similar nonsense.

These are a very small number of individuals, but they have successfully spread FUD for many years. Chief among them is an individual who demands that Unicode be abandoned in favor of a 24-bit character set that attempts to represent all languages as monotype glyphs (meaning that ligature-based scripts such as Arabic must have a separate character for every word). He talks a lot about plaintext (you probably know who I mean).


They've succeeded mostly because the issues *are* complex, and very few people are willing/able to investigate the claims and determine their veracity. Instead, most people just believe what they've been told.

Several years ago I believed these claims too; then I was persuaded to undertake the careful study and examination needed to verify them for myself. I found that these claims are nonsense.

> Sure, they could go to the trouble of having the Chinese and Korean
> fonts on hand for those rare occasions; but Japanese newspapers are
> printed mainly for Japanese readers and not for Chinese or Koreans.
> Is it worth the trouble for them when in most cases the audience
> won't notice the difference?

The audience won't notice the difference even if they are not Japanese. It's a font difference, and there is far more variability between fonts within a single language than between fonts across languages.


> The people who are complaining are those for whom the difference is
> crucial: literary scholars, linguists, and so on.

Have you asked any literary scholars, linguists, and so on?


I've been part of a group of literary scholars, linguists, etc. who have dealt with this issue for many years.

> The main complaint is that the Unicode charset loses the
> language/font association implied by national charsets, and that the
> differences in the characters of the CJK fonts are enough to render
> the text unreadable.

This complaint is bogus. I know that it's a common rumor in Japan, but it is bogus.


I could fax you a long Japanese-language text printed in a "Chinese" Unicode font; you could give it to Yoshii-san and ask him to read it and see if there are any problems. He won't find any, because it will be perfectly good Japanese to him.

It's also rather remarkable, in this age of HTML and language-tagged texts, that we are still talking about inferring a language from the character set. This has never been possible with most charsets in the world.

Incidentally, Unicode *does* have codepoints for language tags in plaintext. They're in plane 14, and when they appear they shift the language for the subsequent text. Thus, the claim that Unicode "loses the language/font association" is not only meaningless, it is also false.
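
For the curious, here is roughly how such a tag is spelled out, as a
Python sketch (the helper name is mine; the mechanism, U+E0001
LANGUAGE TAG followed by tag characters at the ASCII value plus
0xE0000, is what the standard specifies):

    import unicodedata

    def language_tag(tag):
        """Build a plane-14 plaintext language tag, e.g. "ja"."""
        return "\U000E0001" + "".join(
            chr(0xE0000 + ord(c)) for c in tag)

    tagged = language_tag("ja") + "\u6f22\u5b57"  # tag + 漢字
    for ch in tagged:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
    # U+E0001 LANGUAGE TAG
    # U+E006A TAG LATIN SMALL LETTER J
    # U+E0061 TAG LATIN SMALL LETTER A
    # U+6F22 CJK UNIFIED IDEOGRAPH-6F22
    # U+5B57 CJK UNIFIED IDEOGRAPH-5B57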

-- Mark --

http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.
