> The question is rather: how often does web-XXX mojibake happen?

Very often. Particularly web-1252 mixed up with UTF-8.

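To make that mix-up concrete, here is a minimal Python sketch of one such case: a right curly quote (U+201D) encoded as UTF-8 and then decoded as codepage 1252. `decode_web1252` is a hypothetical helper written for this example, not a stdlib or ftfy function. It assumes web-1252 differs from Python's cp1252 only in passing the five undefined bytes through as C1 control characters.

```python
# Bytes that Python's cp1252 codec leaves undefined. The assumption here is
# that web-1252 maps each of them to the C1 control character with the same
# code point, which is also what latin-1 does.
UNDEFINED_IN_CP1252 = frozenset({0x81, 0x8D, 0x8F, 0x90, 0x9D})


def decode_web1252(data: bytes) -> str:
    """Hypothetical helper: decode bytes the way web-1252 would."""
    return "".join(
        chr(b) if b in UNDEFINED_IN_CP1252 else bytes([b]).decode("cp1252")
        for b in data
    )


# Right curly quote encoded as UTF-8: b'\xe2\x80\x9d'
utf8_bytes = "\u201d".encode("utf-8")

# Python's cp1252 with errors='replace': 0x9D becomes U+FFFD.
python_replace = utf8_bytes.decode("cp1252", errors="replace")

# web-1252 (as sketched above): 0x9D passes through as U+009D.
web_result = decode_web1252(utf8_bytes)

# Python's cp1252 with the default errors='strict': the decode fails.
try:
    utf8_bytes.decode("cp1252")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

The three outcomes differ only in how the byte 0x9D is handled: a replacement character, a control character, or an exception.
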
My ftfy library is tested on data from Twitter and the Common Crawl, both prime sources of mojibake. One common mojibake sequence is when a right curly quote is encoded as UTF-8 and decoded as codepage 1252.

In Python's official windows-1252, this would at best be "â€�", using the 'replace' error handler. In web-1252, this would be "â€\x9d".

The web-1252 version is more common. Of course, since Python itself is widespread, there is some survivorship bias here. Another thing you could get instead of "â€�" is your code crashing.

On Thu, 11 Jan 2018 at 12:20 Random832 <[email protected]> wrote:

> On Thu, Jan 11, 2018, at 03:58, M.-A. Lemburg wrote:
> > There's a problem with these encodings: they are mostly meant
> > for decoding (broken) data, but as soon as we have them in the stdlib,
> > people will also start using them for encoding data, producing more
> > corrupted data.
>
> Is it really corrupted?
>
> > Do you really think it's a good idea to support this natively
> > in Python ?
>
> The problem is, that's ignoring the very real fact that this is, and has
> always been* the behavior of the native encodings built in to Windows. My
> opinion is that Microsoft, for whatever reason, misrepresented their
> encodings when they submitted them to Unicode. The native APIs for text
> conversion have mechanisms for error reporting, and these supposedly
> undefined characters do not trigger them as they do for e.g. CP932 0xA0.
>
> Without the MB_ERR_INVALID_CHARS flag, cp932 0xA0 maps to U+F8F0 (private
> use), a best fit mapping, and cp1252 0x81 maps to U+0081 (one of the
> mappings being discussed here)
> If you do set the MB_ERR_INVALID_CHARS flag, however, cp932 0xA0 returns
> an error 1113** (ERROR_NO_UNICODE_TRANSLATION), whereas cp1252 0x81 still
> returns U+0081.
>
> As far as the actual encoding implemented in windows is concerned,
> CP1252's 0x81->U+0081 mapping is a wholly valid one (though undocumented),
> and not in any way a fallback or a "best fit" or an invalid character.
>
> *except for the addition of the Euro sign to each encoding at typically
> 0x80 in circa 1998.
> **It's worth mentioning that our cp932 returns U+F8F0, even with
> errors='strict', despite this not being present in the unicode published
> mapping. It has done this at least since the CJKCodecs change in 2004. I
> can't determine where (or if) it was implemented at all before that.
> _______________________________________________
> Python-ideas mailing list
> [email protected]
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/
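
For anyone who wants to check the Python side of the quoted points, here is a small sketch. It only exercises Python's own codecs, not the Windows conversion APIs or the MB_ERR_INVALID_CHARS flag, and `strict_decode` is just an illustrative name, not part of any library.

```python
def strict_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if strict decoding fails."""
    try:
        return data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return None


# Python's cp1252 treats 0x81 as undefined, so strict decoding fails.
cp1252_0x81 = strict_decode(b"\x81", "cp1252")

# latin-1 maps every byte to the code point of the same value, so 0x81
# becomes U+0081, the same result the quoted message reports for Windows.
latin1_0x81 = strict_decode(b"\x81", "latin-1")

# Python's cp932 decodes 0xA0 to the private-use character U+F8F0,
# even under errors='strict'.
cp932_0xa0 = strict_decode(b"\xa0", "cp932")

print(repr(cp1252_0x81), repr(latin1_0x81), repr(cp932_0xa0))
```
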
