On 2018-01-11 19:42, Rob Speer wrote:
 > The question is rather: how often does web-XXX mojibake happen?

Very often. Particularly web-1252 mixed up with UTF-8.

My ftfy library is tested on data from Twitter and the Common Crawl, both prime sources of mojibake. One common mojibake sequence arises when a right curly quote (U+201D) is encoded as UTF-8, giving the bytes E2 80 9D, and then decoded as codepage 1252. Byte 0x9D is unassigned in Python's official windows-1252, so there the result would at best be "â€�", using the 'replace' error handler. In web-1252, which passes 0x9D through as U+009D, this would be "â€\x9d". The web-1252 version is more common.

Of course, since Python itself is widespread, there is some survivorship bias here. Another thing you could get instead of "�" is your code crashing with a UnicodeDecodeError, since the default error handler is 'strict'.
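
To make the three outcomes concrete, here is a minimal sketch using only the standard library. Python has no built-in web-1252 codec, so `decode_web_1252` is a hypothetical helper written for this illustration: it assumes web-1252 behaves like cp1252 except that the five unassigned bytes fall through to the matching C1 control characters, which is roughly what browsers do.

```python
def decode_web_1252(data: bytes) -> str:
    """Hypothetical helper: decode as cp1252, but map the bytes that
    Python's cp1252 leaves unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D)
    to the same-numbered code points, as latin-1 would."""
    chars = []
    for value in data:
        single = bytes([value])
        try:
            chars.append(single.decode("cp1252"))
        except UnicodeDecodeError:
            chars.append(single.decode("latin-1"))
    return "".join(chars)


# A right curly quote, encoded as UTF-8.
raw = "\u201d".encode("utf-8")  # b'\xe2\x80\x9d'

# 1. Python's cp1252 with the default 'strict' handler: the code crashes.
strict_error = None
try:
    raw.decode("cp1252")
except UnicodeDecodeError as exc:
    strict_error = exc

# 2. Python's cp1252 with 'replace': 0x9D becomes U+FFFD.
replaced = raw.decode("cp1252", errors="replace")

# 3. Web-style decoding: 0x9D survives as U+009D.
web = decode_web_1252(raw)

print(type(strict_error).__name__)
print(ascii(replaced))
print(ascii(web))
```

The web-style result keeps every original byte recoverable, whereas the 'replace' result has already lost the third byte, which is why the two kinds of mojibake need different handling when repairing text.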

FWIW, I've occasionally seen that kind of mojibake on the news ticker of the BBC News channel. :-(

[snip]
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/
