Re: [Python-Dev] File system path encoding on Windows

Steve Dower Tue, 23 Aug 2016 09:11:00 -0700

I've trimmed fairly aggressively for the sake of not causing the rest of the list to mute our discussion (again :) ). Stephen - feel free to email me off list if I go too far or misrepresent you.

As a summary for people who don't want to read on (and Stephen will correct me if I misquote):

* we agree on removing use of the *A APIs within Python, which means Python will have to decode bytes before passing them to the operating system
* we agree on allowing users to switch the encoding between utf-8 and mbcs:replace (the current default)
* we agree on making utf-8 the default for 3.6.0b1 and closely monitoring the reaction
* Stephen sees "no reason not to change locale.getpreferredencoding()" (default encoding for open()) at the same time with the same switches, while I'm not quite as confident. Do users generally specify an encoding these days? I know I always put utf-8 there.


Does anyone else have concerns or questions?



On 22Aug2016 2121, Stephen J. Turnbull wrote:

UTF-8 is absolutely not equivalent to UTF-16 from the point of view of
developers. Passing it to Windows APIs requires decoding to UTF-16 (or
from a Python developer's point of view, decoding to str and use of
str APIs).  That fact is what got you started on this whole proposal!

As encoded bytes, that's true, but as far as correctly encoding text, they are equivalent.
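
A minimal sketch of that point, using only Python's built-in codecs. The sample string is arbitrary, and cp1252 is an assumption standing in for whatever the active code page happens to be:

```python
# Arbitrary sample text with Latin-1, CJK and non-BMP characters.
text = "na\u00efve caf\u00e9 \u65e5\u672c\u8a9e \U0001F40D"

# Both Unicode encoding forms round-trip every code point losslessly.
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
assert utf8.decode("utf-8") == text
assert utf16.decode("utf-16-le") == text

# The byte sequences differ, so UTF-8 bytes still have to be converted
# before a *W API can use them.
assert utf8 != utf16

# A legacy code page cannot represent everything; cp1252 is only a
# stand-in here for an active code page, with "replace" error handling.
lossy = text.encode("cp1252", errors="replace").decode("cp1252")
assert lossy != text
```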

 > All MSVC users have been pushed towards Unicode for many years.

But that "push" is due to the use of UTF-16-based *W APIs and
deprecation of ACP-based *A APIs, right?  The input to *W APIs must be
decoded from all text/* content "out there", including UTF-8 content.
I don't see evidence that users have been pushed toward *UTF-8* in that
statement; they may be decoding from something else.  Unicode != UTF-8
for our purposes!

Yes, the operating system pushes people towards *W APIs, and the languages commonly used on that operating system follow.

Windows has (for as long as it matters) always been UTF-16 for paths and bytes for content. Nowhere does the operating system tell you how to read your text file except as raw bytes, and content types are meant to provide the encoding information you need. Languages each determine how to read files in "text" mode, but that's not bound to or enforced by the operating system in any way.

 > The .NET Framework has defaulted to UTF-8

Default != enforce, though.  Do you know that almost nobody changes
the default, and that behavior is fairly uniform across different
classes of organization (specifically by language)?  Or did you mean
"enforce"?

This will also not enforce anything that the operating system doesn't enforce. Windows uses Unicode to represent paths and requires them to be passed as UTF-16 encoded bytes. If you don't do that, it'll convert for you. My proposal is for Python to do the conversion instead.

(In .NET, users have to decode a byte array if they want to get a string. There aren't any APIs that take byte[] as if it were text, so it's basically the same separation between bytes/str that Python 3 introduced, except without any allowance for bytes to still be used in places where text is needed.)

To be clear: asking users who want backward-compatible behavior to set
an environment variable does not count as a "screw" -- some will
complain, but "the defaults always suck for somebody".  Reasonable
people know that, and we can't do anything about the hysterics.


Good. Glad we agree on this.

1.  Organizations which behave like ".NET users" already have pure
    UTF-8 environments.  They win from Python defaulting to UTF-8,
    since Windows won't let them do it for themselves.  Now they can
    plug in bytes-oriented code written for the POSIX environment
    straight from upstream.

    Is that correct?  Ie, without transcoding, they can't now use
    bytes because their environment hands them UTF-8 but when Python
    hands those bytes to Windows, it assumes anything else but UTF-8?

If you give Windows anything but UTF-16 as a path, it will convert to UTF-16. The change is to convert to UTF-16 ourselves, so Windows will never see the original bytes. To do that conversion, we need to know what encoding the incoming bytes are encoded with.
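
Roughly, the conversion looks like the sketch below. This is illustration only: to_wide_path is a hypothetical helper, not anything in CPython, the "strict" default error handler is an assumption, and cp1252 stands in for mbcs so the example runs on any platform.

```python
def to_wide_path(raw: bytes, encoding: str = "utf-8",
                 errors: str = "strict") -> bytes:
    """Hypothetical helper: decode path bytes with a known encoding,
    then re-encode them as the UTF-16-LE bytes a *W API expects."""
    text = raw.decode(encoding, errors)
    return text.encode("utf-16-le")


name = "r\u00e9sum\u00e9.txt"

# If we know the real encoding of the incoming bytes, the result is the
# same UTF-16 path regardless of which encoding the caller started from.
from_utf8 = to_wide_path(name.encode("utf-8"))
from_acp = to_wide_path(name.encode("cp1252"),
                        encoding="cp1252", errors="replace")
assert from_utf8 == from_acp == name.encode("utf-16-le")

# If we guess the encoding wrong, the conversion still "succeeds" but
# produces a different (mojibake) file name.
wrong = to_wide_path(name.encode("utf-8"),
                     encoding="cp1252", errors="replace")
assert wrong.decode("utf-16-le") != name
```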

Python users will either transcode from bytes in encoding X to str, transcode from bytes in encoding X to bytes in UTF-8, or keep their bytes in UTF-8 if that's how they started.

(I feel like there's some other misunderstanding going on here, because I know you understand how encoding works, but I can't figure out what it is or what I need to say to trigger clarity. :( )

Windows does not support using UTF-8 encoded bytes as text. UTF-16 is the universal encoding. (Basically the only thing you can reliably do with UTF-8 bytes in the Windows API is convert them to UTF-16 - see the MultiByteToWideChar function. Everything else just treats it like a blob of meaningless data.)

BTW, I wonder how those organizations manage to get pure UTF-8
environments, given that Windows itself won't default to that.  Is it
just that they live in .NET and other applications that default to
producing UTF-8 text (in the rare(?) case that text is generated at
all, vs some application/* medium), and so never get near applications
that produce text in the active code page, and especially not near
applications that embed file system names encoded in a non-UTF-8
encoding in text/* media?

I doubt they're pure UTF-8, but they pay attention to what files are encoded with and explicitly decode into a common internal encoding.

Overall, I think Nick's hybrid strategy is the way to go.  First, give
users the choice of 'mbcs' or 'utf-8' for the Windows encoding.  I see
no reason not to do this for locale.getpreferredencoding() at the same
time, as long as it's an option.

The thing about this is that it's always been an option (the encoding argument to open() et al.), and specifically, an option that's required on all platforms. So I see one reason to not do it - users can (and do) override it in a cross-platform compatible way.
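
For anyone following along, the cross-platform override is just the following (a small sketch; the temporary file and sample text are made up for illustration):

```python
import locale
import os
import tempfile

text = "Gr\u00fc\u00dfe, \u4e16\u754c"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")

    # Passing encoding= explicitly means the result does not depend on
    # what locale.getpreferredencoding() returns on this machine.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "r", encoding="utf-8") as f:
        round_tripped = f.read()

# What open() would have used had we not specified an encoding; the
# value varies by platform and configuration.
default_encoding = locale.getpreferredencoding()

assert round_tripped == text
```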

The biggest difference from the file system encoding is that the encoding for file contents is entirely the business of the application (and whichever other applications it talks to), while the OS is the main recipient of file system encoded text and so it gets a say in the chosen encoding.

I'm happy for this to be on the table, but *I* need convincing that it's a good idea to do it now.


Cheers,
Steve

