Chris Angelico writes:

 > Can anyone give an example of a current in-use system encoding that
 > would have [ASCII bytes in non-ASCII text]?

Shift JIS, Big5.  (Both can have bytes < 128 inside multibyte
characters.)  I don't know if Big5 is still in use as the default
encoding anywhere, but Shift JIS is, although it's decreasing.
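
To make that concrete, here is a small sketch using the codecs that ship with CPython. The helper name and the sample characters are my own choices for illustration, not anything from the thread.

```python
def ascii_range_bytes(char, encoding):
    """Return the bytes below 0x80 in the encoded form of one character."""
    return [b for b in char.encode(encoding) if b < 0x80]


# Sample non-ASCII characters, picked for illustration only.
samples = [("表", "shift_jis"), ("功", "big5"), ("表", "utf-8")]

for char, encoding in samples:
    encoded = char.encode(encoding)
    low = ascii_range_bytes(char, encoding)
    # Show the raw bytes and which of them fall in the ASCII range.
    print(encoding, encoded.hex(" "), [hex(b) for b in low])
```

In both legacy encodings the second byte of the sample character is 0x5C, which is the ASCII backslash. The UTF-8 form has no byte below 0x80, which is why the problem does not arise there.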

For both of those, once you encounter a non-ASCII byte you can just
switch over, and none of the previous text will have been mis-decoded.
But that's only if you *know* the language is Japanese (respectively
Chinese).  Remember, no encoding can be distinguished from ISO 8859-1
(and several other Latin encodings) simply based on the bytes found,
since ISO 8859-1 uses all 256 byte values, so every byte sequence is
valid in it.
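
A rough illustration of the ambiguity, assuming CPython's "latin-1" and "shift_jis" codecs. The helper decodes_cleanly is a hypothetical name used only for this sketch.

```python
def decodes_cleanly(data, encoding):
    """Return True if data decodes under encoding without an error."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Japanese sample text, encoded as Shift JIS (illustrative input).
data = "日本語".encode("shift_jis")

# Every possible byte value is acceptable to the Latin-1 codec.
print(decodes_cleanly(bytes(range(256)), "latin-1"))

# The same bytes are valid in both encodings but give different text,
# so validity alone cannot tell the two apart.
print(decodes_cleanly(data, "shift_jis"), decodes_cleanly(data, "latin-1"))
print(data.decode("shift_jis") == data.decode("latin-1"))
```

The Latin-1 decode never fails, so a "try it and see if it raises" test can rule Shift JIS out for some inputs but can never rule Latin-1 out.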

 > How likely is it that you'd get even one line of text that purports
 > to be ASCII?

Program source code where the higher-level functions (likely to
contain literal strings) come late in the file is frequently
misdetected as ASCII based on the earlier bytes.
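
As a sketch of how that happens, assume a naive detector that only samples the start of the file. The function names, the 32-byte sample size, and the source snippet below are all made up for illustration. No real detection tool is being modelled.

```python
def looks_ascii(data, sample_size=32):
    """Naive check: inspect only the first sample_size bytes."""
    return data[:sample_size].isascii()


def first_non_ascii_offset(data):
    """Return the offset of the first byte >= 0x80, or -1 if none."""
    for offset, byte in enumerate(data):
        if byte >= 0x80:
            return offset
    return -1


# Hypothetical source file: helpers first, a Latin-1 string literal last.
source = (
    b"import sys\n"
    b"\n"
    b"def helper(x):\n"
    b"    return x + 1\n"
    b"\n"
    b"def main():\n"
    b"    print('caf\xe9')\n"
)

print(looks_ascii(source))             # verdict from the prefix only
print(source.isascii())                # verdict from the whole file
print(first_non_ascii_offset(source))  # where the first high byte sits
```

The prefix check and the whole-file check disagree here, because the only non-ASCII byte sits in a string literal near the end of the file.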

Steve
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/ZB2LM3KYLQ34DHA276SPZA73BHJBRQMF/
Code of Conduct: http://python.org/psf/codeofconduct/
