I don't think the idea is that a codepage equals a language. Rather, a
codepage corresponds to a writing system, which consists of one or more
scripts (e.g., 6 scripts for ShiftJIS). As such, the codepage is a useful
cue in choosing an appropriate font for rendering text. In the RichEdit
edit engine, we use a generalization of the codepage called a CharRep and
break Unicode plain text into runs, each characterized by a particular
CharRep. We then bind these runs to appropriate fonts for rendering.
There are many additional considerations, so unfortunately this isn't an
easy task, but with enough refinements it works quite well.
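
As a rough illustration of the run-breaking step (not RichEdit's actual
code), here is a minimal Python sketch. The labels, the code point ranges,
and the function names char_rep and split_runs are all assumptions made up
for this example; a real implementation would need many more categories
and would have to resolve neutral characters such as spaces and digits
from context.

```python
from itertools import groupby

def char_rep(ch):
    """Map one character to a hypothetical CharRep-like label."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x30FF:   # Hiragana and Katakana blocks
        return "KANA"
    if 0x4E00 <= cp <= 0x9FFF:   # CJK Unified Ideographs block
        return "CJK_IDEOGRAPH"
    if 0x0400 <= cp <= 0x04FF:   # Cyrillic block
        return "CYRILLIC"
    if 0x0370 <= cp <= 0x03FF:   # Greek and Coptic block
        return "GREEK"
    return "LATIN"               # crude default for everything else

def split_runs(text):
    """Break text into (label, substring) runs of consecutive characters
    that share the same label."""
    return [(rep, "".join(chars)) for rep, chars in groupby(text, key=char_rep)]

for rep, run in split_runs("abc Привет 日本語のテキスト"):
    print(rep, repr(run))
```

Each run would then be bound to a font that covers its label. Note that
the ideographs in the last run come out as a generic CJK label: the code
points alone don't say whether they are Japanese or Chinese, which is
where the codepage or other cues come in.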

The bottom line is that if text was generated using a particular
codepage, it's likely that its creator intended it to be rendered with a
font that supports that codepage. For text tagged with no codepage, we do
our best to translate the keyboard language to a CharRep and proceed as
above. When neither keyboard nor codepage info is available, we use a set
of heuristics to break the text into CharRep runs. Among the many
heuristics used are 1) a string containing Kana is likely to have a
Japanese CharRep, and 2) a CJK string that round-trips through CHT, CHS,
or ShiftJIS may well belong to one of those CharReps. In particular, if a
CJK string doesn't round-trip through CHT, it's probably not Traditional
Chinese.
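
The two heuristics above can be sketched in Python roughly as follows.
This is only an illustration under stated assumptions: Python's cp950,
cp936, and cp932 codecs stand in for the CHT, CHS, and ShiftJIS codepages,
and the names round_trips, has_kana, and guess_char_reps are hypothetical,
not part of RichEdit.

```python
# Assumed stand-ins: Python codec names for the three codepages.
CANDIDATES = {"CHT": "cp950", "CHS": "cp936", "ShiftJIS": "cp932"}

def round_trips(text, codec):
    """True if text survives Unicode -> codepage -> Unicode unchanged."""
    try:
        return text.encode(codec).decode(codec) == text
    except UnicodeError:
        # At least one character has no mapping in this codepage.
        return False

def has_kana(text):
    """True if the text contains any Hiragana or Katakana character."""
    return any(0x3040 <= ord(ch) <= 0x30FF for ch in text)

def guess_char_reps(text):
    """Return the candidate labels that remain plausible for a CJK string."""
    if has_kana(text):
        # Heuristic 1: Kana strongly suggests Japanese.
        return ["ShiftJIS"]
    # Heuristic 2: keep only the codepages the string round-trips through.
    return [name for name, codec in CANDIDATES.items()
            if round_trips(text, codec)]

print(guess_char_reps("こんにちは"))
print(guess_char_reps("我们"))
```

A string can survive more than one round trip, so the result is a list of
candidates rather than a single answer. The round trip is most useful as a
negative test: a string that fails for CHT is dropped from the Traditional
Chinese candidates.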

Murray
