Martin v. Löwis wrote:
> So "valid" yes; "meaningful" no. Therefore, for all practical
> purposes, 8-bit single-byte characters sets *will not* produce
> byte sequences that are valid in UTF-8 (although they could -
> it just won't happen).
>
> > In fact I can't think of any multi-byte encoding that can't produce
> > valid UTF-8 byte sequence.
>
> The same reasoning applies for them.

While your reasoning may apply to European single-byte character sets, it
doesn't apply as well to Far East multi-byte encodings. Take ISO-2022-JP
(RFC 1468), for example: it is a 7-bit encoding, so any string encoded in
it is valid UTF-8 as far as Python is concerned.

About 1% of the EUC-JP encoded words and phrases listed in EDICT, a
Japanese-English dictionary, decode as valid UTF-8 strings. I get similar
results with CEDICT, a Chinese-English dictionary: about 1% for the Big5
encoded version of the file and about 4.5% for the GB 2312 version.

It would be nearly impossible to find all the strings in Freedb that
decode as UTF-8 but aren't really encoded in UTF-8, but they do exist.
One example I managed to find is the pair of GB 2312 encoded TTITLE5 and
TTITLE13 records of disc id 020f5210.

Ross Ridge
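
For anyone who wants to try this kind of check, here is a minimal sketch in Python 3. It is an illustration only, not the script behind the figures above: the helper names decodes_as_utf8 and fraction_valid_utf8 are hypothetical, and the sample strings are made up for the example rather than taken from EDICT, CEDICT or Freedb.

```python
def decodes_as_utf8(raw: bytes) -> bool:
    """Return True if the byte string is well-formed UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def fraction_valid_utf8(words, encoding):
    """Fraction of words whose bytes in `encoding` also decode as UTF-8.

    `words` is any non-empty list of str; `encoding` is a Python codec name.
    (Hypothetical helper, named here only for illustration.)
    """
    hits = sum(decodes_as_utf8(w.encode(encoding)) for w in words)
    return hits / len(words)


sample = "日本語"  # made-up sample text, not from any dictionary file

# ISO-2022-JP output consists only of 7-bit bytes, so it always passes.
iso2022_ok = decodes_as_utf8(sample.encode("iso2022_jp"))

# The same text in EUC-JP happens not to be well-formed UTF-8.
eucjp_ok = decodes_as_utf8(sample.encode("euc_jp"))

# A two-byte sequence that is both a GB 2312 character and valid UTF-8:
# lead byte in 0xC2-0xDF, trail byte in 0xA1-0xBF.
ambiguous = b"\xc4\xa1"
as_gb2312 = ambiguous.decode("gb2312")  # one Chinese character
as_utf8 = ambiguous.decode("utf-8")     # U+0121, a Latin letter

print(iso2022_ok, eucjp_ok, len(as_gb2312), hex(ord(as_utf8)))
```

Running fraction_valid_utf8 over the entries of a dictionary file, one encoding at a time, would give percentages comparable in kind to the ones quoted above, though the exact numbers depend on the file version and on how entries are split.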