On Oct 21, 1:45 am, Paul Boddie <[EMAIL PROTECTED]> wrote:
> From the Wikipedia page, it appears that you need to convert GB2312
> values to EUC-CN by a relatively straightforward process, and can then
> output the resulting byte sequence in an ASCII compatible way,
> provided that you filter out all the byte values greater than 127:
> these filtered bytes would produce nonsense for anyone using a program
> not expecting EUC-CN. UTF-8 has some similar properties, but as I
> noted above, you wouldn't want to read most of the output if your
> program wasn't expecting UTF-8.

What the Wikipedia page doesn't say is that the number of people who grok the concept of a GB2312 codepoint is vanishingly small, and the number of people who would actually have GB2312 codepoints in a file is smaller still. When people say their data is GB2312, they mean "GB<something> encoded as EUC-CN". So the relatively straightforward conversion process is not required in practice.

I don't understand the point or value of filtering out all byte values greater than 127. If the data really is GB2312, every Chinese character is encoded as two bytes that are both greater than 127, so the filter would throw out all the Chinese characters. If the GB<something> is, as is likely, really GBK aka cp936 (a superset of GB2312), then the second byte of a Chinese character may be in the ASCII range, and the result of the filter would comprise the true ASCII characters plus some garbage ASCII characters.
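
To make the two cases concrete, here is a minimal sketch using Python's built-in "gb2312" (EUC-CN) and "gbk" codecs. The helper name strip_high_bytes and the sample strings are mine, purely for illustration; the byte pair 0x81 0x40 is used only because it is a GBK character whose second byte falls in the ASCII range.

```python
def strip_high_bytes(data: bytes) -> bytes:
    """Hypothetical helper: drop every byte above 127, as the proposed filter would."""
    return bytes(b for b in data if b < 128)


# Case 1: "GB2312" data as found in practice, i.e. EUC-CN bytes.
# Both bytes of every Chinese character are above 127, so the filter
# removes the Chinese text entirely and leaves only the real ASCII.
gb2312_sample = "Python 中文".encode("gb2312")
gb2312_filtered = strip_high_bytes(gb2312_sample)

# Case 2: GBK (cp936) data. GBK allows a second byte in the ASCII range.
# The pair 0x81 0x40 is one such character: its second byte is 0x40, ASCII "@".
gbk_char = b"\x81\x40".decode("gbk")
gbk_sample = ("Python " + gbk_char).encode("gbk")
gbk_filtered = strip_high_bytes(gbk_sample)

print(gb2312_filtered)  # only the true ASCII characters survive
print(gbk_filtered)     # true ASCII plus a stray "@" that was never in the text
```

In the second case the filter cannot tell the leftover "@" from a real one, which is the garbage I was referring to.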