On Sun, Jan 24, 2021, at 13:18, MRAB wrote:
> Well, if you see patterns like b'\x00H\x00e\x00l\x00l\x00o' then it's 
> probably UTF16-BE and if you see patterns like 
> b'H\x00e\x00l\x00l\x00o\x00' then it's probably UTF16-LE.
> 
> You could also look for, say, sequences of Latin characters and 
> sequences of Han characters.

This is dangerous, as Microsoft discovered: a sequence of ASCII Latin 
characters can look a lot like a sequence of UTF-16 Han characters.
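
To make the ambiguity concrete, here is a small sketch. The sample string 
is my own choice of an even-length run of plain ASCII, and the "Han" test is 
a simplification that only checks the CJK Unified Ideographs block (U+4E00 
to U+9FFF). Reinterpreted as UTF-16-LE, every byte pair lands in that block, 
so a "looks like Han text" heuristic would accept it.

```python
# Hypothetical sample: 18 bytes of plain ASCII, chosen for illustration only.
ascii_bytes = b"Bush hid the facts"

# Reinterpret the same bytes as UTF-16-LE: each pair of bytes becomes one
# 16-bit code unit, with the second byte of the pair as the high byte.
as_utf16 = ascii_bytes.decode("utf-16-le")


def is_han(ch: str) -> bool:
    """Simplified check: is ch in the CJK Unified Ideographs block?"""
    return 0x4E00 <= ord(ch) <= 0x9FFF


# Number of decoded characters, and whether all of them pass the Han check.
print(len(as_utf16), all(is_han(ch) for ch in as_utf16))
```

This works because a lowercase ASCII letter (0x61 to 0x7A) in the high-byte 
position always yields a code unit between U+6100 and U+7AFF.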

On Windows, Notepad always writes UTF-16 with BOM, even though it now writes 
UTF-8 without it by default.

Probably the winning combination is "if there is a UTF-16 BOM, it's UTF-16, 
else if the first few non-ASCII bytes encountered are valid UTF-8, it's UTF-8, 
otherwise it's the system default 'ANSI' locale".
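
A minimal sketch of that rule follows. The function name guess_encoding and 
its fallback parameter are hypothetical, and it validates the whole buffer 
it is given rather than only the first few non-ASCII bytes. It uses an 
incremental decoder so that a buffer cut off in the middle of a character 
is not rejected as invalid UTF-8.

```python
import codecs
import locale
from typing import Optional


def guess_encoding(data: bytes, fallback: Optional[str] = None) -> str:
    """Guess an encoding: UTF-16 BOM, then valid UTF-8, then the locale default.

    `fallback` is a hypothetical override for the locale default, mainly
    useful for testing; it is not part of any existing API.
    """
    # 1. A UTF-16 BOM (either byte order) settles it.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"

    # 2. No BOM: accept UTF-8 if the bytes seen so far are valid UTF-8.
    #    final=False tolerates an incomplete character at the very end.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(data, final=False)
    except UnicodeDecodeError:
        # 3. Otherwise fall back to the system default ("ANSI" on Windows).
        return fallback or locale.getpreferredencoding(False)
    return "utf-8"
```

Pure ASCII input is reported as UTF-8 here, which is harmless for decoding 
since ASCII is a subset of UTF-8.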

The one problem with that is what to do if something like a pipe or a socket 
gets a sequence of bytes that is a valid *partial* UTF-8 character, then 
doesn't get any more data for a while. It's unacceptable to have to wait for 
more data before interpreting data that has been read.
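
To illustrate the situation (not to settle the policy question), here is a 
sketch using the standard library's incremental UTF-8 decoder. The helper 
name decode_chunks is hypothetical. The decoder hands back everything that 
is already complete and holds only the trailing partial character, so the 
caller never blocks; what it cannot tell you is whether those held bytes 
will turn out to be UTF-8 at all.

```python
import codecs
from typing import Iterable, List


def decode_chunks(chunks: Iterable[bytes]) -> List[str]:
    """Decode each chunk as it arrives, returning the text available so far.

    Bytes that form an incomplete UTF-8 character at the end of a chunk are
    buffered inside the decoder and emitted once the rest arrives.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    out = []
    for chunk in chunks:
        # final=False: do not treat a truncated trailing sequence as an error.
        out.append(decoder.decode(chunk, final=False))
    return out


# The euro sign is b"\xe2\x82\xac"; split it across two reads.
print(decode_chunks([b"abc\xe2\x82", b"\xac"]))
```

The first read yields "abc" immediately, and the euro sign appears only 
after the second read supplies its last byte.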

Notepad has the luxury of only working on ordinary files, and being able to 
scan the whole file before making a decision about the character set [I believe 
it mmaps the file rather than using ordinary open/read calls].
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/DR4GEIPOWNQFWHETWM6L5Y2GGRZL2YRH/