[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

Random832 Sun, 24 Jan 2021 13:50:14 -0800

On Sun, Jan 24, 2021, at 13:18, MRAB wrote:
> Well, if you see patterns like b'\x00H\x00e\x00l\x00l\x00o' then it's 
> probably UTF16-BE and if you see patterns like 
> b'H\x00e\x00l\x00l\x00o\x00' then it's probably UTF16-LE.
> 
> You could also look for, say, sequences of Latin characters and 
> sequences of Han characters.


This is dangerous, as Microsoft discovered: a sequence of ASCII Latin 
characters can look a lot like a sequence of UTF-16 Han characters.
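
To make the ambiguity concrete, here is a small sketch. The sample string and 
the is_han() helper are my own illustrative choices, and the helper only 
checks the basic CJK Unified Ideographs block. Every byte pair in this ASCII 
text, read as little-endian UTF-16, lands in that block:

```python
# Illustrative ASCII sample; it needs an even number of bytes so that it
# splits cleanly into 16-bit code units.
ascii_bytes = b"Bush hid the facts"

# Deliberately misread the ASCII bytes as UTF-16-LE (no BOM present).
misread = ascii_bytes.decode("utf-16-le")


def is_han(ch: str) -> bool:
    """Hypothetical helper: True if ch is in the basic CJK Unified block."""
    return 0x4E00 <= ord(ch) <= 0x9FFF


all_han = all(is_han(ch) for ch in misread)
print(len(misread), all_han)  # prints: 9 True
```

A heuristic that only asks "does this decode to plausible Han text?" would 
accept this input as UTF-16.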

On Windows, Notepad always writes UTF-16 with BOM, even though it now writes 
UTF-8 without it by default.

Probably the winning combination is "if there is a UTF-16 BOM, it's UTF-16; 
else if the first few non-ASCII bytes encountered are valid UTF-8, it's UTF-8; 
otherwise it's the system default 'ANSI' code page".
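
A minimal sketch of that rule, assuming the whole buffer is already in memory. 
The function name guess_encoding and its fallback parameter are hypothetical, 
and for simplicity it validates the entire buffer as UTF-8 rather than only 
the first few non-ASCII bytes, so pure ASCII input is also reported as UTF-8:

```python
import codecs
import locale
from typing import Optional


def guess_encoding(data: bytes, fallback: Optional[str] = None) -> str:
    """Hypothetical helper: BOM first, then UTF-8 validity, then the locale."""
    # 1. A UTF-16 BOM settles it. Python's "utf-16" codec reads the BOM
    #    to pick the byte order and strips it from the result.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # 2. No BOM: accept UTF-8 if the bytes decode cleanly.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # 3. Otherwise use the caller's fallback, or the locale's preferred
        #    encoding (the "ANSI" code page on Windows).
        return fallback or locale.getpreferredencoding(False)
    return "utf-8"
```

For example, guess_encoding(b"caf\xe9", fallback="cp1252") returns "cp1252", 
because a lone 0xE9 byte is not valid UTF-8. Note that this sketch also falls 
back when the buffer merely ends in the middle of a multi-byte character.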

The one problem with that is what to do if something like a pipe or a socket 
gets a sequence of bytes that are a valid *partial* UTF-8 character, then 
doesn't get any more data for a while. It's unacceptable to have to wait for 
more data before interpreting data that has been read.
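
Here is a sketch of that situation using the standard library's incremental 
decoder. The sample character and the way it is split are just illustrative. 
After the first chunk there is nothing to hand to the caller yet, and the 
same two bytes would also be perfectly valid text in a code page such as 
cp1252:

```python
import codecs

encoded = "\u20ac".encode("utf-8")  # EURO SIGN, three bytes: e2 82 ac

# Simulate a pipe or socket delivering the character in two reads.
decoder = codecs.getincrementaldecoder("utf-8")()
first = decoder.decode(encoded[:2])   # incomplete sequence, buffered
second = decoder.decode(encoded[2:])  # final byte arrives

print(repr(first), repr(second))  # prints: '' '€'

# The partial prefix is ambiguous: it is also valid cp1252 text.
as_ansi = encoded[:2].decode("cp1252")
```

The decoder can only return an empty string for the first read, which is 
exactly the wait-or-guess dilemma described above.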

Notepad has the luxury of only working on ordinary files, and being able to 
scan the whole file before making a decision about the character set [I believe 
it mmaps the file rather than using ordinary open/read calls].
