On 10 Aug 2016 11:46, Random832 wrote:
> On Wed, Aug 10, 2016, at 14:10, Steve Dower wrote:
>> To summarise the proposals (remembering that these would only affect
>> Python 3.6 on Windows):
>>
>> * change sys.getfilesystemencoding() to return 'utf-8'
>> * automatically decode byte paths assuming they are utf-8
>> * remove the deprecation warning on byte paths
>
> Why? What's the use case?

Allowing library developers who support POSIX and Windows to just use bytes everywhere to represent paths.
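
As a concrete illustration of the pattern this enables (a minimal sketch - today it only round-trips reliably on POSIX; under the proposal it would behave the same way on Windows):

    import os

    # Bytes in, bytes out: os.listdir(b'...') returns bytes entries.
    for name in os.listdir(b'.'):
        path = os.path.join(b'.', name)   # pure bytes manipulation, no decode step
        print(os.fsdecode(path))          # decodes with sys.getfilesystemencoding()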

>> * make the default open() encoding check for a BOM or else use utf-8
>> * [ALTERNATIVE] make the default open() encoding check for a BOM or else
>> use sys.getpreferredencoding()

> For reading, I assume. When opened for writing, it should probably be
> utf-8-sig [if it's not mbcs] to match what Notepad does. What about
> files opened for appending or updating? In theory it could ingest the
> whole file to see if it's valid UTF-8, but that has a time cost.

Writing out the BOM automatically basically makes your files incompatible with other platforms, which rarely expect a BOM. By omitting it but both writing and reading UTF-8, we ensure that Python can handle its own files on any platform, while potentially upsetting some older applications on Windows, or on platforms that don't assume UTF-8 as a default.
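
For anyone unfamiliar with the distinction, here's a minimal sketch of what the two codecs actually emit (the filenames are just placeholders):

    # 'utf-8-sig' prepends a BOM on write; plain 'utf-8' does not.
    with open('with_bom.txt', 'w', encoding='utf-8-sig') as f:
        f.write('hello')
    with open('no_bom.txt', 'w', encoding='utf-8') as f:
        f.write('hello')

    with open('with_bom.txt', 'rb') as f:
        print(f.read())   # b'\xef\xbb\xbfhello' - the three BOM bytes
    with open('no_bom.txt', 'rb') as f:
        print(f.read())   # b'hello'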

> Notepad, if there's no BOM, checks the first 256 bytes of the file for
> whether it's likely to be utf-16 or mbcs [utf-8 isn't considered AFAIK],
> and can get it wrong for certain very short files [i.e. the infamous
> "this app can break"]

Yeah, this is a pretty horrible idea :) I don't want to go there by default, but people can install chardet if they want the functionality.
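
For reference, opting in to guessing with chardet looks something like this (a sketch; 'mystery.txt' is a placeholder, and detect() can return None for the encoding):

    import chardet

    with open('mystery.txt', 'rb') as f:
        raw = f.read()
    guess = chardet.detect(raw)   # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
    text = raw.decode(guess['encoding'] or 'utf-8')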

> What to do on opening a pipe or device? [Is os.fstat able to detect
> these cases?]

We should be able to detect them, but why treat them any differently from a file? If you aren't specifying 'b' or an encoding, they're just as broken today as they will be after the change - probably more broken, since you'll at least get fewer encoding errors once the default is UTF-8.
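
To answer the bracketed question: yes, os.fstat plus the stat module can classify the descriptor. A minimal sketch:

    import os
    import stat
    import sys

    mode = os.fstat(sys.stdin.fileno()).st_mode
    if stat.S_ISFIFO(mode):
        print('pipe')
    elif stat.S_ISCHR(mode):
        print('character device (console, serial port, ...)')
    elif stat.S_ISREG(mode):
        print('regular file')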

> Maybe the BOM detection phase should be deferred until the first read.
> What should the encoding be at that point if this is done? Is there a
> "utf-any" encoding that can handle all five BOMs? If not, should there
> be? How are "utf-16" and "utf-32" files opened for appending or updating
> handled today?

Yes, I think it should be. I suspect we'd have to leave the encoding unknown until the first read, and perhaps force it to utf-8-sig if someone asks before we start reading. I don't *think* this is any less predictable than the current behaviour, given it only applies when you've left out any encoding specification, but maybe it is.

It probably also entails opening the file descriptor in binary mode, which might break programs that pass the fd directly to CRT functions. Personally I wish they wouldn't, but it's too late to stop them now.
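
There's no "utf-any" codec in the stdlib as far as I know, but the check-for-a-BOM step itself is simple. A rough sketch of what the deferred detection could do on first read (sniff_encoding is a made-up name for illustration):

    import codecs

    # Longest BOMs first: BOM_UTF32_LE starts with the same two bytes
    # as BOM_UTF16_LE, so UTF-32 must be tested before UTF-16.
    _BOMS = [
        (codecs.BOM_UTF32_LE, 'utf-32'),
        (codecs.BOM_UTF32_BE, 'utf-32'),
        (codecs.BOM_UTF8,     'utf-8-sig'),
        (codecs.BOM_UTF16_LE, 'utf-16'),
        (codecs.BOM_UTF16_BE, 'utf-16'),
    ]

    def sniff_encoding(path, default='utf-8'):
        """Return an encoding based on a leading BOM, else the default."""
        with open(path, 'rb') as f:
            head = f.read(4)
        for bom, name in _BOMS:
            if head.startswith(bom):
                return name   # these codecs consume the BOM themselves
        return default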

>> * force the console encoding to UTF-8 on initialize and revert on
>> finalize

> Why not implement a true unicode console? What if sys.stdin/stdout are
> pipes (or non-console devices such as a serial port)?

Mostly because it's much more work. As I mentioned in my other post, an alternative would be to bring win_unicode_console into the stdlib and enable it by default (which, considering the package was largely developed on bugs.python.org, is probably okay, but we'd probably need to rewrite it in C, which is basically implementing a true Unicode console).
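
For anyone who hasn't tried it, enabling the package today is just:

    import win_unicode_console
    win_unicode_console.enable()   # replaces sys.std* with streams that go
                                   # through the wide-char console APIs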

You're right that changing the console encoding after launching Python is probably going to mess with pipes. We can detect whether the streams are interactive or not and adjust accordingly, but that's going to get messy if you're only piping in/out and stdin/out end up with different encodings. I'll put some more thought into this part.
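
The detection itself is easy enough - the messy part is the policy. A sketch of the check I mean:

    import sys

    for name in ('stdin', 'stdout', 'stderr'):
        stream = getattr(sys, name)
        # isatty() is False when the stream is redirected to a pipe or file,
        # so only streams actually attached to the console would be touched.
        print(name, stream.isatty(), stream.encoding)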

Thanks,
Steve

_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/
