Re: [Python-ideas] Fix default encodings on Windows

Steve Dower Wed, 10 Aug 2016 16:41:20 -0700

On 10Aug2016 1431, Chris Angelico wrote:

On Thu, Aug 11, 2016 at 6:09 AM, Random832 <random...@fastmail.com> wrote:

On Wed, Aug 10, 2016, at 15:22, Steve Dower wrote:

Why? What's the use case? [byte paths]


Allowing library developers who support POSIX and Windows to just use
bytes everywhere to represent paths.


Okay, how is that use case impacted by it being mbcs instead of utf-8?


AIUI, the data flow would be: Python bytes object -> decode to Unicode
text -> encode to UTF-16 -> Windows API.  If you do the first
transformation using mbcs, you're guaranteed *some* result (all
Windows codepages have definitions for all byte values, if I'm not
mistaken), but a hard-to-predict one - and worse, one that can change
based on system settings. Also, if someone naively types
"bytepath.decode()", Python will default to UTF-8, *not* to the system
codepage.

I'd rather a single consistent default encoding.

I'm proposing to make that single consistent default encoding utf-8. It
sounds like we're in agreement?

What about only doing the deprecation warning if non-ascii bytes are
present in the value?


-1. Data-dependent warnings just serve to strengthen the feeling that
"weird characters" keep breaking your programs, instead of writing
your program to cope with all characters equally. It's like being
racist against non-ASCII characters :)


Agreed. This won't happen.

On Thu, Aug 11, 2016 at 4:10 AM, Steve Dower <steve.do...@python.org> wrote:

To summarise the proposals (remembering that these would only affect Python
3.6 on Windows):

* change sys.getfilesystemencoding() to return 'utf-8'
* automatically decode byte paths assuming they are utf-8
* remove the deprecation warning on byte paths


+1 on these.
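
For reference, a small sketch of how the filesystem encoding shows up in
user code today. The file name is an assumption for illustration; the
decoded text only reads as 'café.txt' where the reported encoding is
UTF-8:

```python
import os
import sys

# Whatever this reports is the encoding os.fsencode()/os.fsdecode() use.
fs_encoding = sys.getfilesystemencoding()

raw = b"caf\xc3\xa9.txt"            # a path expressed as UTF-8 bytes
as_text = os.fsdecode(raw)          # bytes -> str using fs_encoding
round_trip = os.fsencode(as_text)   # str -> bytes using fs_encoding

print(fs_encoding, repr(as_text), round_trip == raw)
```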

* make the default open() encoding check for a BOM or else use utf-8


-0.5. Is there any precedent for this kind of data-based detection
being the default? An explicit "utf-sig" could do a full detection,
but even then it's not perfect - how do you distinguish UTF-32LE from
UTF-16LE that starts with U+0000? Do you say "UTF-32 is rare so we'll
assume UTF-16", or do you say "files starting U+0000 are rare, so
we'll assume UTF-32"?
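
The ambiguity can be seen directly from the BOM constants in the standard
codecs module. This is only a sketch of the problem, not a proposed
detection algorithm:

```python
import codecs

# The UTF-32LE BOM begins with the UTF-16LE BOM.
utf16_bom = codecs.BOM_UTF16_LE   # b'\xff\xfe'
utf32_bom = codecs.BOM_UTF32_LE   # b'\xff\xfe\x00\x00'
shares_prefix = utf32_bom.startswith(utf16_bom)

# UTF-16LE data whose first character after the BOM is U+0000 is
# byte-for-byte identical to a bare UTF-32LE BOM.
utf16_with_nul = utf16_bom + "\x00".encode("utf-16-le")
ambiguous = utf16_with_nul == utf32_bom

# Both decoders accept the same four bytes but disagree on the content.
as_utf16 = utf16_with_nul.decode("utf-16")   # one NUL character
as_utf32 = utf16_with_nul.decode("utf-32")   # empty string
```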

The BOM exists solely for data-based detection, and the UTF-8 BOM is
different from the UTF-16 and UTF-32 ones. So we either find an exact
BOM (which IIRC decodes as a no-op spacing character, though I have a
feeling some version of Unicode redefined it exclusively for being the
marker) or we use utf-8.

But the main reason for detecting the BOM is that currently opening
files with 'utf-8' does not skip the BOM if it exists. I'd be quite
happy with changing the default encoding to:


* utf-8-sig when reading (so the UTF-8 BOM is skipped if it exists)
* utf-8 when writing (so the BOM is *not* written)

This provides the best compatibility when reading/writing files without
making any guesses. We could reasonably extend this to read utf-16 and
utf-32 if they have a BOM, but that's an extension and not necessary for
the main change.
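
A short sketch of the read/write asymmetry described above, using the
existing 'utf-8' and 'utf-8-sig' codecs explicitly. The temporary file
and its contents are made up for the example:

```python
import codecs
import os
import tempfile

# Create a file the way some Windows editors do: UTF-8 with a leading BOM.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + "héllo".encode("utf-8"))

with open(path, encoding="utf-8") as f:
    plain = f.read()      # the BOM survives as a leading U+FEFF

with open(path, encoding="utf-8-sig") as f:
    skipped = f.read()    # the BOM is dropped if present

# Writing with plain utf-8 never emits a BOM.
with open(path, "w", encoding="utf-8") as f:
    f.write(skipped)
with open(path, "rb") as f:
    written = f.read()

os.remove(path)
```

Reading with 'utf-8-sig' is also safe for files that have no BOM at all,
which is why it works as a reading default without any guessing.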

* force the console encoding to UTF-8 on initialize and revert on finalize


-0 for Python itself; +1 for Python's interactive interpreter.
Programs that mess with console settings get annoying when they crash
out and don't revert properly. Unless there is *no way* that you could
externally kill the process without also bringing the terminal down,
there's the distinct possibility of messing everything up.

The main problem here is that if the console is not forced to UTF-8 then
it won't render any of the characters correctly.

Would it be possible to have a "sys.setconsoleutf8()" that changes the
console encoding and slaps in an atexit() to revert? That would at
least leave it in the hands of the app.
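
A rough sketch of what such an opt-in helper might look like. The name
set_console_utf8 and its kernel32/register parameters are hypothetical
(nothing like this exists in sys); only the Win32 calls GetConsoleCP,
GetConsoleOutputCP, SetConsoleCP and SetConsoleOutputCP are real:

```python
import atexit
import sys

CP_UTF8 = 65001  # Windows code page identifier for UTF-8


def set_console_utf8(kernel32=None, register=atexit.register):
    """Hypothetical helper: switch the console to UTF-8, revert at exit.

    Returns True if the code pages were changed, False otherwise.
    The parameters exist only so the sketch can be exercised off Windows.
    """
    if kernel32 is None:
        if sys.platform != "win32":
            return False  # nothing to do outside Windows
        import ctypes
        kernel32 = ctypes.windll.kernel32

    old_in = kernel32.GetConsoleCP()
    old_out = kernel32.GetConsoleOutputCP()
    if not (old_in and old_out):
        return False  # 0 means no console is attached

    kernel32.SetConsoleCP(CP_UTF8)
    kernel32.SetConsoleOutputCP(CP_UTF8)

    def restore():
        # Put back whatever code pages were active before.
        kernel32.SetConsoleCP(old_in)
        kernel32.SetConsoleOutputCP(old_out)

    register(restore)
    return True
```

Note that atexit handlers do not run when the process is killed from
outside or exits via os._exit(), so this does not remove the "crash out
and don't revert" concern raised above.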

Yes, but if the app is going to opt-in then I'd suggest the
win_unicode_console package, which won't require any particular changes.

It sounds like we'll have to look into effectively merging that package
into the core. I'm afraid that'll come with a much longer tail of bugs
(and will quite likely break code that expects to use file descriptors
to access stdin/out), but it's the least impactful way to do it.


Cheers,
Steve

