I've trimmed fairly aggressively so that the rest of the list doesn't mute our discussion (again :) ). Stephen - feel free to email me off-list if I go too far or misrepresent you.

As a summary for people who don't want to read on (and Stephen will correct me if I misquote):

* we agree on removing use of the *A APIs within Python, which means Python will have to decode bytes before passing them to the operating system
* we agree on allowing users to switch the encoding between utf-8 and mbcs:replace (the current default)
* we agree on making utf-8 the default for 3.6.0b1 and closely monitoring the reaction
* Stephen sees "no reason not to change locale.getpreferredencoding()" (default encoding for open()) at the same time with the same switches, while I'm not quite as confident. Do users generally specify an encoding these days? I know I always put utf-8 there.

Does anyone else have concerns or questions?



On 22Aug2016 2121, Stephen J. Turnbull wrote:
> UTF-8 is absolutely not equivalent to UTF-16 from the point of view of
> developers. Passing it to Windows APIs requires decoding to UTF-16 (or
> from a Python developer's point of view, decoding to str and use of
> str APIs).  That fact is what got you started on this whole proposal!

As encoded bytes, that's true, but as far as correctly encoding text, they are equivalent.
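
To make that distinction concrete, here is a minimal sketch (the sample string and the names in it are just illustrative, and cp1252 is only standing in for a typical active code page). The two encodings produce different bytes, but both round-trip the same text without loss, which a legacy code page cannot promise:

```python
# Illustrative sample: a Windows-style path with Latin, Cyrillic and
# non-BMP characters.
sample = "C:\\Users\\José\\файл-😀.txt"


def round_trips(text: str, encoding: str) -> bool:
    """Return True if encoding and then decoding gives back the same text."""
    return text.encode(encoding).decode(encoding) == text


utf8_bytes = sample.encode("utf-8")
utf16_bytes = sample.encode("utf-16-le")

# Different byte sequences, identical text once decoded.
same_text = utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-le")

# A single-byte code page has to substitute characters it cannot represent.
lossy = sample.encode("cp1252", errors="replace").decode("cp1252")
```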

> > All MSVC users have been pushed towards Unicode for many years.
>
> But that "push" is due to the use of UTF-16-based *W APIs and
> deprecation of ACP-based *A APIs, right?  The input to *W APIs must be
> decoded from all text/* content "out there", including UTF-8 content.
> I don't see evidence that users have been pushed toward *UTF-8* in that
> statement; they may be decoding from something else.  Unicode != UTF-8
> for our purposes!

Yes, the operating system pushes people towards *W APIs, and the languages commonly used on that operating system follow.

Windows has (for as long as it matters) always been UTF-16 for paths and bytes for content. Nowhere does the operating system tell you how to read your text file except as raw bytes, and content types are meant to provide the encoding information you need. Languages each determine how to read files in "text" mode, but that's not bound to or enforced by the operating system in any way.

> > The .NET Framework has defaulted to UTF-8
>
> Default != enforce, though.  Do you know that almost nobody changes
> the default, and that behavior is fairly uniform across different
> classes of organization (specifically by language)?  Or did you mean
> "enforce"?

This will also not enforce anything that the operating system doesn't enforce. Windows uses Unicode to represent paths and requires them to be passed as UTF-16 encoded bytes. If you don't do that, it'll convert for you. My proposal is for Python to do the conversion instead.

(In .NET, users have to decode a byte array if they want to get a string. There aren't any APIs that take byte[] as if it were text, so it's basically the same separation between bytes/str that Python 3 introduced, except without any allowance for bytes to still be used in places where text is needed.)

> To be clear: asking users who want backward-compatible behavior to set
> an environment variable does not count as a "screw" -- some will
> complain, but "the defaults always suck for somebody".  Reasonable
> people know that, and we can't do anything about the hysterics.

Good. Glad we agree on this.


> 1.  Organizations which behave like ".NET users" already have pure
>     UTF-8 environments.  They win from Python defaulting to UTF-8,
>     since Windows won't let them do it for themselves.  Now they can
>     plug in bytes-oriented code written for the POSIX environment
>     straight from upstream.
>
>     Is that correct?  Ie, without transcoding, they can't now use
>     bytes because their environment hands them UTF-8 but when Python
>     hands those bytes to Windows, it assumes anything else but UTF-8?

If you give Windows anything but UTF-16 as a path, it will convert to UTF-16. The change is to convert to UTF-16 ourselves, so Windows will never see the original bytes. To do that conversion, we need to know what encoding the incoming bytes are encoded with.

Python users will either transcode from bytes in encoding X to str, transcode from bytes in encoding X to bytes in UTF-8, or keep their bytes in UTF-8 if that's how they started.
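
As a rough sketch of what I mean (the function names are made up for illustration, this is not the actual implementation, and cp1252 is only standing in for whatever the active code page happens to be), the conversion and the transcoding options look something like this:

```python
def decode_path(raw: bytes, encoding: str = "utf-8") -> str:
    """Bytes in a known encoding -> str (hypothetical helper)."""
    return raw.decode(encoding)


def to_wide(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Bytes in a known encoding -> UTF-16-LE code units, which is the
    form the *W APIs expect. Windows never sees the original bytes."""
    return decode_path(raw, encoding).encode("utf-16-le")


def transcode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Bytes in encoding X -> bytes in UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Bytes produced under a legacy code page (cp1252 assumed here).
legacy = "café".encode("cp1252")

# Transcode once at the boundary, then keep everything in UTF-8.
as_utf8 = transcode_to_utf8(legacy, "cp1252")
wide = to_wide(as_utf8)
```

Passing `legacy` straight to `to_wide()` with the default encoding raises UnicodeDecodeError, because those bytes are not valid UTF-8. That is why we need to know which encoding the incoming bytes use. The sketch uses strict error handling only to keep it short.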

(I feel like there's some other misunderstanding going on here, because I know you understand how encoding works, but I can't figure out what it is or what I need to say to trigger clarity. :( )

Windows does not support using UTF-8 encoded bytes as text. UTF-16 is the universal encoding. (Basically the only thing you can reliably do with UTF-8 bytes in the Windows API is convert them to UTF-16 - see the MultiByteToWideChar function. Everything else just treats it like a blob of meaningless data.)

> BTW, I wonder how those organizations manage to get pure UTF-8
> environments, given that Windows itself won't default to that.  Is it
> just that they live in .NET and other applications that default to
> producing UTF-8 text (in the rare(?) case that text is generated at
> all, vs some application/* medium), and so never get near applications
> that produce text in the active code page, and especially not near
> applications that embed file system names encoded in a non-UTF-8
> encoding in text/* media?

I doubt they're pure UTF-8, but they pay attention to what files are encoded with and explicitly decode into a common internal encoding.

> Overall, I think Nick's hybrid strategy is the way to go.  First, give
> users the choice of 'mbcs' or 'utf-8' for the Windows encoding.  I see
> no reason not to do this for locale.getpreferredencoding() at the same
> time, as long as it's an option.

The thing about this is that it's always been an option (the encoding argument to open() et al.), and specifically, an option that's required on all platforms. So I see one reason to not do it - users can (and do) override it in a cross-platform compatible way.
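
For anyone who hasn't seen the cross-platform override in practice, a minimal sketch (the helper names are hypothetical, and utf-8 is just the encoding I would pick) is to pass the encoding explicitly so that locale.getpreferredencoding() is never consulted:

```python
import os
import tempfile


def write_text(path: str, text: str, encoding: str = "utf-8") -> None:
    """Write text with an explicit encoding rather than the locale default."""
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(text)


def read_text(path: str, encoding: str = "utf-8") -> str:
    """Read text with an explicit encoding rather than the locale default."""
    with open(path, "r", encoding=encoding, newline="") as f:
        return f.read()


# Usage: the bytes on disk are the same on every platform.
demo_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
write_text(demo_path, "naïve ☃")
round_tripped = read_text(demo_path)
```

Code written this way behaves identically whatever the platform default is, so it would not notice a change to that default.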

The biggest difference from the file system encoding is that the encoding for file contents is entirely the business of the application (and whichever other applications it talks to), while the OS is the main recipient of file system encoded text and so it gets a say in the chosen encoding.

I'm happy for this to be on the table, but *I* need convincing that it's a good idea to do it now.

Cheers,
Steve

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev