So we both agree that Unihan is not designed to tell people how to convert between traditional and simplified characters.
Yep.

Though there is some confusion as to what other questions are being discussed here.
I think I misused the expression "folding" at some point. But the original query explicitly asked about "do[ing] traditional to simplified folding for indexing and query processing (/when the mapping is unambiguous/)" (emphasis added), so I wasn't really sure where parts of the discussion were going :-)
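
To make "folding when the mapping is unambiguous" concrete, here is a minimal Python sketch. The table is a tiny hand-written illustration, not data extracted from Unihan, and the names CANDIDATES and fold_unambiguous are made up for this example. The idea is simply that a character is folded only if it has exactly one simplified candidate, and is left alone otherwise.

```python
# Hand-written illustrative table (an assumption for this sketch, NOT Unihan data).
# Each traditional character maps to the set of forms it may take in simplified text.
CANDIDATES = {
    "廣": {"广"},
    "發": {"发"},
    "髮": {"发"},
    "乾": {"干", "乾"},  # usually 干, but stays 乾 in words such as 乾坤
}


def fold_unambiguous(text: str) -> str:
    """Fold traditional characters to simplified ones, but only where
    the table lists exactly one candidate; leave everything else as-is."""
    out = []
    for ch in text:
        targets = CANDIDATES.get(ch)
        if targets is not None and len(targets) == 1:
            # Exactly one candidate: safe to fold.
            out.append(next(iter(targets)))
        else:
            # Unknown or ambiguous character: keep the original.
            out.append(ch)
    return "".join(out)
```

Note that 發 and 髮 both fold to 发, which is fine for indexing in this direction; it is the reverse direction, and cases like 乾, where the trouble starts.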

Japanese has well-established traditions for simplifying CJK ideographs which are not identical to the Chinese ones. If one were to use a folding approach to deal with simplifications, then there should be differences between Chinese and Japanese.
I think the kyūjitai-shinjitai mappings are not in Unihan. (Compare the entries of 廣 (U+5EE3) and the characteristically Japanese character 広 (U+5E83).) I know that certain contexts retain older forms (KenL talks about this somewhere too). Btw if you know about other mappings or good resources, I'll be curious to know.

"quite well documented" is a relative term
I highly respect the work in Cheung & Bauer, but it makes no attempt to tell us how easily understood the characters are. Many of them are ad-hoc coinages that are not understood by any of my informants; sometimes for say 6 ways of writing a syllable-morpheme, I can make my informants tell me that perhaps /one/ of them is passable. This problem isn't easily solved, but then the source isn't helpful in knowing which out of the approx 1000 characters are actually used nowadays. I won't give you a number, as I'd have to check more carefully to be quotable. The number of morphemes for which there truly seems to be no written representation is /very/ low, but often the characters in existence aren't exactly comprehensible to many native speakers either, and not all of them are unambiguous. This will give you an idea.

Zhuang Sawndip
Sounds exciting.

By best choice do you mean (a) the person producing the electronic form was unable to use the character they wished because it is not yet in Unicode, (b) even though it is in Unicode, the person did not know how to type it and so typed another character instead, or (c) a less than perfect, or ambiguous, 'spelling'? All of these are found both for Sinitic and non-Sinitic languages when written in CJK ideographs, be it in printed publications, web pages or text messages between native speakers.
Nearly all of Cantonese is in Unicode and therefore typeable in theory (though some people will not be used to such writing, but I'm sure you know this), so it's not (a). I would say it's largely (c) (people will often make up their own plausible thing), even though (b) is a reason too.

Not standardized does not mean totally beyond analysis or processing, or even necessarily that confusing to a native speaker; they are not random, though admittedly more complex than in a standardized locale.
Yes. And we both agree that standardization is desirable.

Stephan
