On 2018-01-11 19:42, Rob Speer wrote:
> The question is rather: how often does web-XXX mojibake happen?
Very often. Particularly web-1252 mixed up with UTF-8.
My ftfy library is tested on data from Twitter and the Common Crawl,
both prime sources of mojibake. One common mojibake sequence is when a
right curly quote is encoded as UTF-8 and decoded as codepage 1252. In
Python's official windows-1252, this would at best be "â€�", using the
'replace' error handler. In web-1252, this would be "â€\x9d". The
web-1252 version is more common.
Of course, since Python itself is widespread, there is some survivorship
bias here. Another thing you could get instead of "â€�" is your code
crashing.
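
For anyone who wants to reproduce the three outcomes described above, here is a minimal sketch using only the standard library. The helper decode_web1252 is a hypothetical stand-in written for this illustration, not ftfy's or any browser's actual implementation; it assumes that web-1252 differs from Python's cp1252 only in mapping the five undefined bytes (0x81, 0x8d, 0x8f, 0x90, 0x9d) to the C1 control characters with the same code points.

```python
# U+201D RIGHT DOUBLE QUOTATION MARK, encoded as UTF-8.
raw = "\u201d".encode("utf-8")  # three bytes: e2 80 9d

# Outcome 1: strict decoding with Python's cp1252 fails, because byte
# 0x9d has no mapping in that codec.
try:
    raw.decode("cp1252")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# Outcome 2: the 'replace' error handler substitutes U+FFFD for the
# undecodable byte, so the original byte value is lost.
replaced = raw.decode("cp1252", errors="replace")


def decode_web1252(data: bytes) -> str:
    """Hypothetical approximation of web-1252 decoding.

    Bytes that Python's cp1252 leaves undefined are mapped to the
    code point with the same numeric value (a C1 control character).
    """
    chars = []
    for byte in data:
        try:
            chars.append(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            chars.append(chr(byte))
    return "".join(chars)


# Outcome 3: the web-1252 style result keeps 0x9d as U+009D.
web = decode_web1252(raw)

print(repr(strict_error))
print(ascii(replaced))
print(ascii(web))
```

The practical difference is that the web-1252 style result still carries the original byte value as U+009D, so the mojibake can in principle be reversed, whereas the U+FFFD in the 'replace' result cannot be mapped back.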
FWIW, I've occasionally seen that kind of mojibake on the news ticker of
the BBC News channel. :-(
[snip]