On 28 December 2017 at 5:34 AM, "Karl Williamson via Unicode" <[email protected]> wrote:
>
> In UTS 39, it says, that optionally,
>
> "Mark Chinese strings as “mixed script” if they contain both simplified (S) and traditional (T) Chinese characters, using the Unihan data in the Unicode Character Database [UCD].
>
> "The criterion can only be applied if the language of the string is known to be Chinese."
>
> What does it mean for the language to "be known to be Chinese"?

That is, the string is written in the Chinese language: not Japanese, not older Korean or Vietnamese text that uses Chinese characters, nor any other language that uses Chinese characters. As far as I know, some Chinese dialects/variants also use both Simplified and Traditional characters together, with different etymologies, and that probably shouldn't be considered mixed script either, although such usage isn't really common and isn't mentioned in the UTS.

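To make the quoted criterion concrete, here is a minimal Python sketch of the S/T check, gated on the language being supplied from outside. The two character sets are tiny hand-picked stand-ins for the Unihan variant data (the kSimplifiedVariant and kTraditionalVariant fields), and the function name and the `language_is_chinese` flag are my own illustrative assumptions, not anything UTS 39 defines.

```python
from typing import Optional

# Hypothetical toy tables; a real check would derive these from Unihan.
SIMPLIFIED_ONLY = {"汉", "发", "国"}
TRADITIONAL_ONLY = {"漢", "發", "國"}


def is_mixed_s_t(text: str, language_is_chinese: bool) -> Optional[bool]:
    """Return True/False for the S/T mix, or None if the check does not apply."""
    if not language_is_chinese:
        # The criterion only applies when the language is known to be Chinese.
        return None
    has_simplified = any(ch in SIMPLIFIED_ONLY for ch in text)
    has_traditional = any(ch in TRADITIONAL_ONLY for ch in text)
    return has_simplified and has_traditional
```

Note that the flag has to come from the caller; nothing in the sketch tries to infer the language from the characters themselves.
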
> Is this something algorithmically determinable, or does it come from information about the input text that comes from outside the UCD?
>
> The example given shows some Hiragana in the text. That clearly indicates the language isn't Chinese. So in this example we can algorithmically rule out that it's Chinese.

Usually, when Japanese kana are in the mix, the text is Japanese rather than Chinese. However, the reverse is not necessarily true: a string containing only Chinese characters can still be Japanese text, especially a single word or short phrase, older-style text, and the like.

>
> And what does Chinese really mean here?
> The written form of the (Mandarin) Chinese language?
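
The kana observation above can be sketched in Python as a one-way test: finding kana rules Chinese out, but finding none proves nothing. This is only a rough approximation that matches on character names from the standard `unicodedata` module rather than the real Script property, and the function name is an illustrative assumption.

```python
import unicodedata


def contains_kana(text: str) -> bool:
    """Return True if any character's Unicode name mentions Hiragana or Katakana."""
    for ch in text:
        name = unicodedata.name(ch, "")  # "" for unnamed code points
        if "HIRAGANA" in name or "KATAKANA" in name:
            return True
    return False
```

For example, `contains_kana("写真だよ")` is true, so that string can be ruled out as Chinese, while `contains_kana("写真")` is false even though 写真 is a perfectly ordinary Japanese word. The language still has to be known from somewhere else.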

