On Sun, Jan 24, 2021 at 01:32:28AM +0000, MRAB wrote:
> On 2021-01-24 01:14, Guido van Rossum wrote:
> >I have definitely seen BOMs written by Notepad on Windows 10.
> >
> >Why can’t the future be that open() in text mode guesses the encoding?
> >
> "In the face of ambiguity, refuse the temptation to guess."

"Although practicality beats purity."


The Zen is like scripture: there's a koan for any position you wish to 
take :-)

If you want to be pedantic, and I certainly do *wink*, providing any 
default for the encoding parameter is a guess. The encoding of all text 
files is ambiguous (the intended encoding is metadata which is not 
recorded in the file format). Most text files on Linux and Mac OS use 
UTF-8, and many on Windows too, but not *all*, so setting the default to 
UTF-8 is just a guess.
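
To make the ambiguity concrete, here is a minimal sketch showing the same 
bytes decoded under two different assumed encodings. The sample word and 
the choice of cp1252 as the "other" default are only illustrative 
assumptions; neither decode raises an error, which is why a wrong guess 
can go unnoticed.

```python
# The same bytes on disk, read under two different assumed encodings.
data = "café".encode("utf-8")        # b'caf\xc3\xa9'

as_utf8 = data.decode("utf-8")       # the encoding the writer intended
as_cp1252 = data.decode("cp1252")    # a common legacy default on Windows

# Both calls succeed; only one of them gives the text the writer meant.
print(as_utf8)
print(as_cp1252)
```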

I understand that there are good heuristics for auto-detection of 
encodings which are reliable and widely used in other software. If 
auto-detection is a "guess", it's an *educated* guess and not much 
different from the status quo, which usually guesses correctly on Linux 
and Mac but often guesses wrongly on Windows. This proposal is to 
improve the quality of the guess by inspecting the file's contents.
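
The cheapest form of content inspection is looking for a BOM, like the 
ones Notepad writes. What follows is only a rough sketch of that idea, 
not a proposal for what open() should do: the function name sniff_bom 
and the "utf-8" fallback are my own assumptions, while the BOM constants 
and codec names come from the standard codecs module.

```python
import codecs

# Order matters: the UTF-32-LE BOM (FF FE 00 00) starts with the
# UTF-16-LE BOM (FF FE), so the longer marks are tested first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_bom(data, default="utf-8"):
    """Return a codec name based on a leading BOM, else the default.

    Hypothetical helper: the returned codecs ("utf-8-sig", "utf-16",
    "utf-32") all consume the BOM themselves when decoding.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return default

raw = "Hello World".encode("utf-8-sig")   # UTF-8 with a BOM
print(sniff_bom(raw), repr(raw.decode(sniff_bom(raw))))
```

A file with no BOM falls through to the default, so this only improves 
the guess for files that carry one.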

For example, a file opened in text mode where every second byte is 
a NUL is *almost certainly* UTF-16. The chances that somebody actually 
intended to write:

    H\0e\0l\0l\0o\0 \0W\0o\0r\0l\0d\0

rather than "Hello World" are negligible.
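
A minimal sketch of that check for BOM-less data, assuming the text is 
mostly in the ASCII/Latin-1 range (so one byte of each 16-bit unit is 
zero). The function name and the 0.9 threshold are made up for 
illustration; real detectors use considerably more evidence than this.

```python
def guess_utf16_by_nuls(data, threshold=0.9):
    """Return 'utf-16-le', 'utf-16-be', or None for BOM-less bytes.

    Hypothetical heuristic: looks only at where the NUL bytes fall.
    """
    if len(data) < 4 or len(data) % 2:
        return None                     # too short, or not whole 16-bit units
    even = data[0::2]                   # first byte of each 16-bit unit
    odd = data[1::2]                    # second byte of each 16-bit unit
    even_nuls = even.count(0) / len(even)
    odd_nuls = odd.count(0) / len(odd)
    if odd_nuls >= threshold and even_nuls == 0:
        return "utf-16-le"              # low byte first: H\0e\0l\0...
    if even_nuls >= threshold and odd_nuls == 0:
        return "utf-16-be"              # high byte first: \0H\0e\0l...
    return None

print(guess_utf16_by_nuls("Hello World".encode("utf-16-le")))
print(guess_utf16_by_nuls(b"Hello World!"))
```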

Before we consider changing the default encoding to "auto-detect", I 
would like to see some estimate of how many UTF-8 encoded files will be 
misclassified as something else. That is, if we make this change, how 
much software that currently guesses UTF-8 correctly (the default 
encoding is the actual intended encoding) will break because it guesses 
something else? That surely won't happen with mostly-ASCII files, but I 
suppose it could happen with some non-English languages?
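
One way to get such an estimate would be a small harness like the sketch 
below: encode known text as UTF-8, run a detector over it, and count how 
often the answer is something else. Everything here is a placeholder: 
the function names, the four sample strings, and especially the stand-in 
detector, which just tries a strict UTF-8 decode and falls back to 
cp1252. A meaningful number would need a real corpus and whichever 
detector is actually being proposed.

```python
def misclassification_rate(texts, detector, true_encoding="utf-8"):
    """Fraction of samples the detector labels as not true_encoding."""
    wrong = 0
    for text in texts:
        guess = detector(text.encode(true_encoding))
        if guess.lower().replace("_", "-") != true_encoding:
            wrong += 1
    return wrong / len(texts)

def strict_utf8_first(data):
    # Stand-in detector (an assumption, not a real library): accept
    # UTF-8 if the bytes decode cleanly, otherwise fall back to cp1252.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "cp1252"

# Tiny illustrative sample; not a substitute for a real corpus.
samples = ["Hello World", "Grüße aus Köln", "Привет, мир", "こんにちは世界"]
rate = misclassification_rate(samples, strict_utf8_first)
print(f"{rate:.0%} of UTF-8 samples misclassified")
```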

-- 
Steve
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/U2T4JSKOUGSEXVVW3Y7LTXR7HQ5UJUKI/