Re: recycling internationalized garbage

Ross Ridge Thu, 16 Mar 2006 15:17:35 -0800

Martin v. Löwis wrote:
> So "valid" yes; "meaningful" no. Therefore, for all practical
> purposes, 8-bit single-byte characters sets *will not* produce
> byte sequences that are valid in UTF-8 (although they could -
> it just won't happen).
>
> > In fact I can't think of any multi-byte encoding that can't produce
> > valid UTF-8 byte sequence.
>
> The same reasoning applies for them.


While your reasoning may apply to European single-byte character
sets, it doesn't apply as well to Far East multi-byte encodings.  Take
ISO-2022-JP (RFC 1468), for example, where any string is valid UTF-8 as
far as Python is concerned.  About 1% of the EUC-JP encoded words and
phrases listed in EDICT, a Japanese-English dictionary, decode as valid
UTF-8 strings.  I get similar results with CEDICT, a Chinese-English
dictionary: about 1% for the Big5 encoded version of the file and about
4.5% for the GB 2312 version.
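
A rough sketch of how such a check could be done in Python follows.  It
is not the script used for the numbers above; the helper names
decodes_as_utf8 and fraction_valid_utf8 are made up for illustration,
and the entries are assumed to be already extracted from the dictionary
file as Unicode strings.  ISO-2022-JP passes trivially because it is a
7-bit encoding (ASCII bytes plus escape sequences), and every 7-bit
byte string is valid UTF-8.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the raw bytes happen to be well-formed UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def fraction_valid_utf8(entries, encoding):
    """Encode each entry in a legacy encoding and return the share of
    the resulting byte strings that also decode as UTF-8."""
    encoded = [entry.encode(encoding) for entry in entries]
    hits = sum(1 for raw in encoded if decodes_as_utf8(raw))
    return hits / len(encoded)


# ISO-2022-JP output contains only bytes below 0x80, so it always
# passes a UTF-8 validity check.
jis = "日本語".encode("iso2022_jp")
assert all(byte < 0x80 for byte in jis)
assert decodes_as_utf8(jis)

# A GB 2312 character whose two bytes also form a UTF-8 two-byte
# sequence: lead byte in 0xC2-0xDF, trail byte in 0xA1-0xBF.
ambiguous = b"\xc4\xa3"
ambiguous.decode("gb2312")          # valid GB 2312, no exception
assert decodes_as_utf8(ambiguous)   # and also well-formed UTF-8
```

Calling fraction_valid_utf8(entries, "euc_jp"), or the same with "big5"
or "gb2312", on the headwords of a dictionary file would give a
percentage comparable in kind to the ones quoted above.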

It would be nearly impossible to find all the strings in Freedb that
decode as UTF-8 but aren't really encoded in UTF-8, but they do exist.
One example I managed to find is the pair of GB 2312 encoded TTITLE5
and TTITLE13 records of disc id 020f5210.

                   Ross Ridge

-- 
http://mail.python.org/mailman/listinfo/python-list
