Re: [Python-ideas] Fix default encodings on Windows
On Fri, Aug 19, 2016 at 12:30 AM, Nick Coghlan wrote:
>> So in porting to py3, they would have had to *add* that 'b' (and a
>> bunch of b'filename') to keep the good old bytes is text interface.
>>
>> Why would anyone do that?
>
> For a fair amount of *nix-centric code that primarily works with ASCII
> data, adding the 'b' prefix is the easiest way to get into the common
> subset of Python 2 & 3.

Sure -- but it's entirely unnecessary, yes? If you don't change your
code, you'll get py2 (bytes) strings as paths in py2, and py3 (Unicode)
strings as paths on py3. So different, yes. But wouldn't it all work?

So folks are making an active choice to change their code to get some
perceived (real?) performance benefit???

However, as I understand it, py3 string paths did NOT "just work" in
place of py2 paths before surrogate pairs were introduced (when was
that?) -- so are we dealing with all of this because some (a lot, and
important) libraries ported to py3 early in the game?

What I'm getting at is whether there is anything other than inertia that
keeps folks using bytes paths in py3 code? Maybe it wouldn't be THAT
hard to get folks to make the switch: it's EASIER to port your code to
py3 this way!

-CHB

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959 voice
7600 Sand Point Way NE   (206) 526-6329 fax
Seattle, WA 98115        (206) 526-6317 main reception

chris.bar...@noaa.gov

___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/
Re: [Python-ideas] Fix default encodings on Windows
On Thu, Aug 18, 2016 at 3:25 PM, Steve Dower wrote:
> allow us to change locale.getpreferredencoding() to utf-8 on Windows

_bootlocale.getpreferredencoding would need to be hard coded to return
'utf-8' on Windows. _locale._getdefaultlocale() itself shouldn't return
'utf-8' as the encoding because the CRT doesn't allow it as a locale
encoding.

site.aliasmbcs() uses getpreferredencoding, so it will need to be
modified. The codecs module could add get_acp and get_oemcp functions
based on GetACP and GetOEMCP, returning for example 'cp1252' and
'cp850'. Then aliasmbcs could call get_acp.

Adding get_oemcp would also help with decoding output from
subprocess.Popen. There's been discussion about adding encoding and
errors options to Popen, and what the default should be. When writing
to a pipe or file, some programs use OEM, some use ANSI, some use the
console codepage if available, and far fewer use Unicode encodings.
Obviously it's better to specify the encoding in each case if you know
it.

Regarding the locale module, how about modernizing
_locale._getdefaultlocale to return the Windows locale name [1] from
GetUserDefaultLocaleName? For example, it could return a tuple such as
('en-UK', None) and ('uz-Latn-UZ', None) -- always with the encoding
set to None.

The CRT accepts the new locale names, but it isn't quite up to speed.
It still sets a legacy locale when the locale string is empty. In this
case the high-level setlocale could call _getdefaultlocale. Also
_parse_localename, which is called by getlocale, needs to return a
tuple with the encoding as None. Currently it raises a ValueError for
Windows locale names as defined by [1].

[1]: https://msdn.microsoft.com/en-us/library/dd373814
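The ANSI/OEM distinction matters in practice because the same byte
decodes to different characters under each code page. A small,
platform-independent sketch of this ('cp1252' and 'cp850' are the
common Western-European defaults mentioned above, not values queried
from the OS):

```python
# A single byte means different things under the ANSI ('cp1252') and
# OEM ('cp850') code pages, which is why one "mbcs" answer is not
# enough when decoding console or subprocess output.
data = bytes([0x82, 0xE9])

ansi = data.decode('cp1252')  # Windows-1252: U+201A, U+00E9
oem = data.decode('cp850')    # OEM 850:      U+00E9, U+00DA

print(ansi)  # '\u201a\xe9'
print(oem)   # '\xe9\xda'
```

Note that 'é' is byte 0xE9 in the ANSI code page but byte 0x82 in the
OEM one, so decoding console output with the wrong code page silently
produces the wrong text rather than an error.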
Re: [Python-ideas] Fix default encodings on Windows
On 8/18/2016 1:39 PM, Steve Dower wrote:
> On 18Aug2016 1036, Terry Reedy wrote:
>> On 8/18/2016 11:25 AM, Steve Dower wrote:
>>> In this case, we would announce in 3.6 that using bytes as paths on
>>> Windows is no longer deprecated,
>>
>> My understanding is that the first 2 fixes refine the deprecation
>> rather than reversing it. And #3 simply applies it.
>
> #3 certainly just applies the deprecation. As for the first two, I
> don't see any reason to deprecate the functionality once the issues
> are resolved. If using utf-8 encoded bytes is going to work fine in
> all the same cases as using str, why discourage it?

As I understand it, you are still proposing to remove the use of bytes
encoded with anything other than utf-8 (and the corresponding *A
internal functions) and in particular stop lossy path transformations.
Am I wrong?

--
Terry Jan Reedy
Re: [Python-ideas] Fix default encodings on Windows
On Thu, Aug 18, 2016 at 6:23 AM, Steve Dower wrote:
> "You consistently ignore Makefiles, .ini, etc."
>
> Do people really do open('makefile', 'rb'), extract filenames and try
> to use them without ever decoding the file contents?

I'm sure they do :-(

But this has always confused me - back in the python2 "good old days"
text and binary mode were exactly the same on *nix -- so folks
sometimes fell into the trap of opening binary files as text on *nix,
and then it failing on Windows. But I can't imagine why anyone would
have done the opposite.

So in porting to py3, they would have had to *add* that 'b' (and a
bunch of b'filename') to keep the good old bytes is text interface.

Why would anyone do that? Honestly confused.

> I've honestly never seen that, and it certainly looks like the sort
> of thing Python 3 was intended to discourage.

exactly -- we really don't need to support folks reading text files in
binary mode and not considering encoding...

-CHB

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959 voice
7600 Sand Point Way NE   (206) 526-6329 fax
Seattle, WA 98115        (206) 526-6317 main reception

chris.bar...@noaa.gov
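The pattern being endorsed here - read the file in text mode with an
explicit encoding so that filenames arrive as str - can be sketched as
follows (the file name 'filelist.txt' and its contents are invented for
the illustration):

```python
import os
import tempfile

# Write a small list of filenames, then read it back in *text* mode
# with an explicit encoding. The names arrive as str and can be passed
# to filesystem APIs on any platform without re-encoding.
with tempfile.TemporaryDirectory() as d:
    listing = os.path.join(d, 'filelist.txt')
    with open(listing, 'w', encoding='utf-8') as f:
        f.write('data\uab00.txt\nnotes.txt\n')

    with open(listing, encoding='utf-8') as f:
        names = [line.strip() for line in f if line.strip()]

# Two str filenames, no bytes involved anywhere.
print(names)
```

Opening the same file with 'rb' instead would hand back undecoded bytes
and reintroduce exactly the code-page problems discussed in this
thread.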
Re: [Python-ideas] Fix default encodings on Windows
On 18Aug2016 1036, Terry Reedy wrote:
> On 8/18/2016 11:25 AM, Steve Dower wrote:
>> In this case, we would announce in 3.6 that using bytes as paths on
>> Windows is no longer deprecated,
>
> My understanding is that the first 2 fixes refine the deprecation
> rather than reversing it. And #3 simply applies it.

#3 certainly just applies the deprecation. As for the first two, I
don't see any reason to deprecate the functionality once the issues are
resolved. If using utf-8 encoded bytes is going to work fine in all the
same cases as using str, why discourage it?
Re: [Python-ideas] Fix default encodings on Windows
On 8/18/2016 11:25 AM, Steve Dower wrote:
> In this case, we would announce in 3.6 that using bytes as paths on
> Windows is no longer deprecated,

My understanding is that the first 2 fixes refine the deprecation
rather than reversing it. And #3 simply applies it.

--
Terry Jan Reedy
Re: [Python-ideas] Fix default encodings on Windows
On 18Aug2016 0900, Chris Angelico wrote:
> On Fri, Aug 19, 2016 at 1:54 AM, Steve Dower wrote:
>> On 18Aug2016 0829, Chris Angelico wrote:
>>> The second call to glob doesn't have any Unicode characters at all,
>>> the way I see it - it's all bytes. Am I completely misunderstanding
>>> this?
>>
>> You're not the only one - I think this has been the most common
>> misunderstanding.
>>
>> On Windows, the paths as stored in the filesystem are actually all
>> text - more precisely, utf-16-le encoded bytes, represented as
>> 16-bit character strings.
>>
>> Converting to an 8-bit character representation only exists for
>> compatibility with code written for other platforms (either Linux,
>> or much older versions of Windows). The operating system has one way
>> to do the conversion to bytes, which Python currently uses, but
>> since we control that transformation I'm proposing an alternative
>> conversion that is more reliable than compatible (with Windows
>> 3.1... shouldn't affect compatibility with code that properly
>> handles multibyte encodings, which should include anything developed
>> for Linux in the last decade or two).
>>
>> Does that help? I tried to keep the explanation short and focused :)
>
> Ah, I think I see what you mean. There's a slight ambiguity in the
> word "missing" here.
>
> 1) The Unicode character in the result lacks some of the information
> it should have
> 2) The Unicode character in the file name is information that has now
> been lost.
>
> My reading was the first, but AIUI you actually meant the second. If
> so, I'd be inclined to reword it very slightly, eg: "The Unicode
> character in the second call to glob is now lost information."
>
> Is that a correct interpretation?

I think so, though I find the wording a little awkward (and on
rereading, my original wording was pretty bad). How about: "The second
call to glob has replaced the Unicode character with '?', which means
the actual filename cannot be recovered and the path is no longer
valid."

Cheers,
Steve
Re: [Python-ideas] Fix default encodings on Windows
On Fri, Aug 19, 2016 at 1:54 AM, Steve Dower wrote:
> On 18Aug2016 0829, Chris Angelico wrote:
>> The second call to glob doesn't have any Unicode characters at all,
>> the way I see it - it's all bytes. Am I completely misunderstanding
>> this?
>
> You're not the only one - I think this has been the most common
> misunderstanding.
>
> On Windows, the paths as stored in the filesystem are actually all
> text - more precisely, utf-16-le encoded bytes, represented as 16-bit
> character strings.
>
> Converting to an 8-bit character representation only exists for
> compatibility with code written for other platforms (either Linux, or
> much older versions of Windows). The operating system has one way to
> do the conversion to bytes, which Python currently uses, but since we
> control that transformation I'm proposing an alternative conversion
> that is more reliable than compatible (with Windows 3.1... shouldn't
> affect compatibility with code that properly handles multibyte
> encodings, which should include anything developed for Linux in the
> last decade or two).
>
> Does that help? I tried to keep the explanation short and focused :)

Ah, I think I see what you mean. There's a slight ambiguity in the word
"missing" here.

1) The Unicode character in the result lacks some of the information it
should have
2) The Unicode character in the file name is information that has now
been lost.

My reading was the first, but AIUI you actually meant the second. If
so, I'd be inclined to reword it very slightly, eg: "The Unicode
character in the second call to glob is now lost information."

Is that a correct interpretation?

ChrisA
Re: [Python-ideas] Fix default encodings on Windows
On 18Aug2016 0829, Chris Angelico wrote:
> The second call to glob doesn't have any Unicode characters at all,
> the way I see it - it's all bytes. Am I completely misunderstanding
> this?

You're not the only one - I think this has been the most common
misunderstanding.

On Windows, the paths as stored in the filesystem are actually all text
- more precisely, utf-16-le encoded bytes, represented as 16-bit
character strings.

Converting to an 8-bit character representation only exists for
compatibility with code written for other platforms (either Linux, or
much older versions of Windows). The operating system has one way to do
the conversion to bytes, which Python currently uses, but since we
control that transformation I'm proposing an alternative conversion
that is more reliable than compatible (with Windows 3.1... shouldn't
affect compatibility with code that properly handles multibyte
encodings, which should include anything developed for Linux in the
last decade or two).

Does that help? I tried to keep the explanation short and focused :)

Cheers,
Steve
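The "paths are text, stored as utf-16-le" point can be illustrated from
Python on any platform (the filename here is invented):

```python
# A Windows path is natively a string of 16-bit code units, so
# utf-16-le round-trips any filename the filesystem can store.
name = 'caf\xe9\uab00.txt'

wide = name.encode('utf-16-le')          # the on-disk representation
assert wide.decode('utf-16-le') == name  # lossless round trip
assert len(wide) == 2 * len(name)        # one 16-bit unit per BMP char
```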
Re: [Python-ideas] Fix default encodings on Windows
On Fri, Aug 19, 2016 at 1:25 AM, Steve Dower wrote:
> >>> open('test\uAB00.txt', 'wb').close()
> >>> import glob
> >>> glob.glob('test*')
> ['test\uab00.txt']
> >>> glob.glob(b'test*')
> [b'test?.txt']
>
> The Unicode character in the second call to glob is missing
> information. You can observe the same results in os.listdir() or any
> function that matches its result type to the parameter type.

Apologies if this is just noise, but I'm a little confused by this. The
second call to glob doesn't have any Unicode characters at all, the way
I see it - it's all bytes. Am I completely misunderstanding this?

ChrisA
Re: [Python-ideas] Fix default encodings on Windows
Summary for python-dev.

This is the email I'm proposing to take over to the main mailing list
to get some actual decisions made. As I don't agree with some of the
possible recommendations, I want to make sure that they're represented
fairly. I also want to summarise the background leading to why we
should consider making a change here at all, rather than simply leaving
it alone. There's a chance this will all make its way into a PEP,
depending on how controversial the core team thinks this is.

Please let me know if you think I've misrepresented (or unfairly
represented) any of the positions, or if you think I can
simplify/clarify anything in here. Please don't treat this like a PEP
review - it's just going to be an email to python-dev - but the more we
can avoid having the discussions there we've already had here the
better.

Cheers,
Steve

---

Background
==========

File system paths are almost universally represented as text in some
encoding determined by the file system. In Python, we expose these
paths via a number of interfaces, such as the os and io modules. Paths
may be passed either direction across these interfaces, that is, from
the filesystem to the application (for example, os.listdir()), or from
the application to the filesystem (for example, os.unlink()).

When paths are passed between the filesystem and the application, they
are either passed through as a bytes blob or converted to/from str
using sys.getfilesystemencoding(). The result of encoding a string with
sys.getfilesystemencoding() is a blob of bytes in the native format for
the default file system.

On Windows, the native format for the filesystem is utf-16-le. The
recommended platform APIs for accessing the filesystem all accept and
return text encoded in this format. However, prior to Windows NT (and
possibly further back), the native format was a configurable machine
option and a separate set of APIs existed to accept this format.
The option (the "active code page") and these APIs (the "*A functions")
still exist in recent versions of Windows for backwards compatibility,
though new functionality often only has a utf-16-le API (the "*W
functions").

In Python, we recommend using str as the default format on Windows
because it can correctly round-trip all the characters representable in
utf-16-le. Our support for bytes explicitly uses the *A functions and
hence the encoding for the bytes is "whatever the active code page is".
Since the active code page cannot represent all Unicode characters, the
conversion of a path into bytes can lose information without warning.

As a demonstration of this:

>>> open('test\uAB00.txt', 'wb').close()
>>> import glob
>>> glob.glob('test*')
['test\uab00.txt']
>>> glob.glob(b'test*')
[b'test?.txt']

The Unicode character in the second call to glob is missing
information. You can observe the same results in os.listdir() or any
function that matches its result type to the parameter type.

Why is this a problem?
======================

While the obvious and correct answer is to just use str everywhere, it
remains well known that on Linux and MacOS it is perfectly okay to use
bytes when taking values from the filesystem and passing them back.
Doing so also avoids the cost of decoding and reencoding, such that
(theoretically), code like below should be faster because of the
`b'.'`:

>>> for f in os.listdir(b'.'):
...     os.stat(f)
...

On Windows, if a filename exists that cannot be encoded with the active
code page, you will receive an error from the above code. These errors
are why in Python 3.3 the use of bytes paths on Windows was deprecated
(listed in the What's New, but not clearly obvious in the documentation
- more on this later). The above code produces multiple deprecation
warnings in 3.3, 3.4 and 3.5 on Windows.

However, we still keep seeing libraries use bytes paths, which can
cause unexpected issues on Windows.
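The '?' in the bytes result comes from the encoding step, not from glob
itself; the same loss can be reproduced without touching the filesystem
('cp1252' here merely stands in for an arbitrary active code page):

```python
# Simulate the lossy active-code-page conversion performed via the
# *A functions, versus a lossless utf-8 conversion.
name = 'test\uab00.txt'

lossy = name.encode('cp1252', errors='replace')
assert lossy == b'test?.txt'           # '\uab00' degraded to '?'
assert lossy.decode('cp1252') != name  # the original is unrecoverable

# A utf-8 conversion round-trips the same name without loss:
assert name.encode('utf-8').decode('utf-8') == name
```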
Given that the current approach of quietly recommending that library
developers either write their code twice (once for bytes and once for
str) or use str exclusively is not working, we should consider
alternative mitigations.

Proposals
=========

There are two dimensions here - the fix and the timing. We can
basically choose any fix and any timing.

The main differences between the fixes are the balance between
incorrect behaviour and backwards-incompatible behaviour. The main
issue with respect to timing is whether or not we believe using bytes
as paths on Windows was correctly deprecated in 3.3 and sufficiently
advertised since to allow us to change the behaviour in 3.6.

Fixes
-----

Fix #1: Change sys.getfilesystemencoding() to utf-8 on Windows

Currently the default filesystem encoding is 'mbcs', which is a
meta-encoder that uses the active code page. In reality, our
implementation uses the *A APIs and we don't explicitly decode bytes in
order to pass them to the filesystem. This allows the OS to quietly
Re: [Python-ideas] Fix default encodings on Windows
"You consistently ignore Makefiles, .ini, etc."

Do people really do open('makefile', 'rb'), extract filenames and try
to use them without ever decoding the file contents? I've honestly
never seen that, and it certainly looks like the sort of thing Python 3
was intended to discourage. (As soon as you open(..., 'r') you're only
affected by this change if you explicitly encode again with mbcs.)

Top-posted from my Windows Phone

-----Original Message-----
From: "Stephen J. Turnbull" <turnbull.stephen...@u.tsukuba.ac.jp>
Sent: 8/17/2016 19:43
To: "Steve Dower" <steve.do...@python.org>
Cc: "Paul Moore" <p.f.mo...@gmail.com>; "Python-Ideas"
<python-ideas@python.org>
Subject: Re: [Python-ideas] Fix default encodings on Windows
Re: [Python-ideas] Fix default encodings on Windows
On Thu, Aug 18, 2016 at 2:32 AM, Stephen J. Turnbull wrote:
> So it's not just invalid surrogate *pairs*, it's invalid surrogates
> of all kinds. This means that it's theoretically possible (though I
> gather that it's unlikely in the extreme) for a real Windows filename
> to be indistinguishable from one generated by Python's
> surrogateescape handler.

Absolutely, if the filesystem is one of Microsoft's such as NTFS,
FAT32, exFAT, ReFS, NPFS (named pipes), MSFS (mailslots) -- and I'm
pretty sure it's also possible with CDFS and UDFS. UDF allows any
Unicode character except NUL.

> What happens when Python's directory manipulation functions on
> Windows encounter such a filename? Do they try to write it to the
> disk directory? Do they succeed? Does that depend on surrogateescape?

Python allows these 'Unicode' (but not strictly UTF compatible)
strings, so it doesn't have a problem with such filenames, as long as
it's calling the Windows wide-character APIs.

> Is there a reason in practice to allow surrogateescape at all on
> names in Windows filesystems, at least when using the *W API? You
> mention non-Microsoft filesystems; are they common enough to matter?

Previously I gave an example with a VirtualBox shared folder, which
rejects names with invalid surrogates. I don't know how common that is
in general. I typically switch between 2 guests on a Linux host and
share folders between systems. In Windows I mount shared folders as
directory symlinks in C:\Mount.

I just tested another example that led to different results. Ext2Fsd is
a free ext2/ext3 filesystem driver for Windows. I mounted an ext2 disk
in Windows 10. Next, in Python I created a file named
"\udc00b\udc00a\udc00d" in the root directory. Ext2Fsd defaults to
using UTF-8 as the drive codepage, so I expected it to reject this
filename, just like VBoxSF does.
But it worked:

>>> os.listdir('.')[-1]
'\udc00b\udc00a\udc00d'

As expected the ANSI API substitutes question marks for the surrogate
codes:

>>> os.listdir(b'.')[-1]
b'?b?a?d'

So what did Ext2Fsd write in this supposedly UTF-8 filesystem? I
mounted the disk in Linux to check:

>>> os.listdir(b'.')[-1]
b'\xed\xb0\x80b\xed\xb0\x80a\xed\xb0\x80d'

It blindly encoded the surrogate codes, creating invalid UTF-8. I think
it's called WTF-8 (Wobbly Transformation Format). The file manager in
Linux displays this file as "���b���a���d (invalid encoding)", and ls
prints "???b???a???d". Python uses its surrogateescape error handler:

>>> os.listdir('.')[-1]
'\udced\udcb0\udc80b\udced\udcb0\udc80a\udced\udcb0\udc80d'

The original name can be decoded using the surrogatepass error handler:

>>> os.listdir(b'.')[-1].decode(errors='surrogatepass')
'\udc00b\udc00a\udc00d'
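The two error handlers compared here can be exercised without any
special filesystem; a self-contained sketch of the same decoding
behaviour:

```python
# The WTF-8 byte sequence Ext2Fsd wrote for one lone surrogate U+DC00:
raw = b'\xed\xb0\x80'

# 'surrogatepass' decodes the CESU-8-style sequence back to the lone
# surrogate it originally encoded:
assert raw.decode('utf-8', errors='surrogatepass') == '\udc00'

# 'surrogateescape' instead smuggles each undecodable byte through as
# U+DCxx, which round-trips back to the identical bytes:
escaped = raw.decode('utf-8', errors='surrogateescape')
assert escaped == '\udced\udcb0\udc80'
assert escaped.encode('utf-8', errors='surrogateescape') == raw
```

This is why the two listings in the message above differ: the
surrogateescape result preserves the bytes, while surrogatepass
recovers the original (invalid-Unicode) name.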
Re: [Python-ideas] Fix default encodings on Windows
Steve Dower writes:
> On 17Aug2016 0235, Stephen J. Turnbull wrote:
>> So a full statement is, "How do we best represent Windows file
>> system paths in bytes for interoperability with systems that
>> natively represent paths in bytes?" ("Other systems" refers to
>> both other platforms and existing programs on Windows.)
>
> That's incorrect, or at least possible to interpret correctly as
> the wrong thing. The goal is "code compatibility with systems ...",
> not interoperability.

You're right, I stated that incorrectly. I don't have anything to add
to your corrected version.

>> In a properly set up POSIX locale[1], it Just Works by design,
>> especially if you use UTF-8 as the preferred encoding. It's
>> Windows developers and users who suffer, not those who wrote the
>> code, nor their primary audience which uses POSIX platforms.
>
> You mentioned "locale", "preferred" and "encoding" in the same
> sentence, so I hope you're not thinking of
> locale.getpreferredencoding()? Changing that function is orthogonal
> to this discussion,

You consistently ignore Makefiles, .ini, etc. It is *not* orthogonal,
it is *the* reason for all opposition to your proposal or request that
it be delayed. Filesystem names *are* text in part because they are
*used as filenames in text*.

> When Windows developers and users suffer, I see it as my
> responsibility to reduce that suffering. Changing Python on Windows
> should do that without affecting developers on Linux, even though the
> Right Way is to change all the developers on Linux to use str for
> paths.

I resent that. If I were a partisan Linux fanboy, I'd be cheering you
on, because I think your proposal is going to hurt an identifiable and
large class of *Windows* users. I know about and fear this possibility
because they use a language I love (Japanese) and an encoding I hate
but have achieved a state of peaceful coexistence with (Shift JIS).

And on the general principle, *I* don't disagree.
I mentioned earlier that I use only the str interfaces in my own code
on Linux and Mac OS X, and that I suspect that there are no real
efficiency implications to using str rather than bytes for those
interfaces.

On the other hand, the programming convenience of reading the
occasional "text" filename (or other text, such as XML tags) out of a
binary stream and passing it directly to filesystem APIs cannot be
denied. I think that the kind of usage you propose (a fixed, universal
codec, universally accepted; ie, 'utf-8') is the best way to handle
that in the long run. But as Grandmaster Lasker said, "Before the end
game, the gods have placed the middle game." (Lord Keynes isn't
relevant here, Python will outlive all of us. :-)

> I don't think there's any reasonable way to noisily deprecate these
> functions within Python, but certainly the docs can be made clearer.
> People who explicitly encode with sys.getfilesystemencoding() should
> not get the deprecation message, but we can't tell whether they got
> their bytes from the right encoding or a RNG, so there's no way to
> discriminate.

I agree with you within Python; the custom is for DeprecationWarnings
to be silent by default.

As for "making noise", how about announcing the deprecation as the top
headline for 3.6, postponing the actual change to 3.7, and in the
meantime you and Nick do a keynote duet at PyCon? (Your partner could
be Guido, too, but Nick has been the most articulate proponent for this
particular aspect of "inclusion". I think having a representative from
the POSIX world explaining the importance of this for "all of us" would
greatly multiply the impact.) Perhaps, given my proposed timing, a
discussion at the language summit in '17 and the keynote in '18 would
be the best timing.

(OT, political: I've been strongly influenced in this proposal by
recently reading http://blog.aurynn.com/contempt-culture.
There's not as much of it in Python as in other communities I'm
involved in, but I think this would be a good symbolic opportunity to
express our opposition to it. "Inclusion" isn't just about gender and
race!)

> I'm going to put together a summary post here (hopefully today) and
> get those who have been contributing to basically sign off on it,
> then I'll take it to python-dev. The possible outcomes I'll propose
> will basically be "do we keep the status quo, undeprecate and change
> the functionality, deprecate the deprecation and undeprecate/change
> in a couple releases, or say that it wasn't a real deprecation so we
> can deprecate and then change functionality in a couple releases".

FWIW, of those four, I dislike 'status quo' the most, and like 'say it
wasn't real, deprecate and change' the best. Although I lean toward
phrasing that as "we deprecated it, but we realize that practitioners
are by and large not aware of the deprecation, and nobody expects the
Spanish Inquisition".

@Nick, if you're watching: I wonder if it would be
Re: [Python-ideas] Fix default encodings on Windows
eryk sun writes:
> On Wed, Aug 17, 2016 at 9:35 AM, Stephen J. Turnbull wrote:
>> BTW, why "surrogate pairs"? Does Windows validate surrogates to
>> ensure they come in pairs, but not necessarily in the right order
>> (or perhaps sometimes they resolve to non-characters such as U+1)?
>
> Microsoft's filesystems remain compatible with UCS2

So it's not just invalid surrogate *pairs*, it's invalid surrogates of
all kinds. This means that it's theoretically possible (though I gather
that it's unlikely in the extreme) for a real Windows filename to be
indistinguishable from one generated by Python's surrogateescape
handler.

What happens when Python's directory manipulation functions on Windows
encounter such a filename? Do they try to write it to the disk
directory? Do they succeed? Does that depend on surrogateescape?

Is there a reason in practice to allow surrogateescape at all on names
in Windows filesystems, at least when using the *W API? You mention
non-Microsoft filesystems; are they common enough to matter?

I admit that as we converge on sanity (UTF-8 for text/* content, some
kind of Unicode for filesystem names) none of this is very likely to
matter, but I'm a worrywart.

Steve
Re: [Python-ideas] Fix default encodings on Windows
On 17Aug2016 0901, Nick Coghlan wrote: On 17 August 2016 at 02:06, Chris Barker wrote: So the Solution is to either: (A) get everyone to use Unicode "properly", which will work on all platforms (but only on py3.5 and above?) or (B) kludge some *nix-compatible support for byte paths into Windows, that will work at least much of the time. It's clear (to me at least) that (A) is the "Right Thing", but real world experience has shown that it's unlikely to happen any time soon. Practicality beats Purity and all that -- this is a judgment call. Have I got that right?

Yep, pretty much. Based on Stephen Turnbull's concerns, I wonder if we could make a whitelist of universal encodings that Python-on-Windows will use in preference to UTF-8 if they're configured as the current code page. If we accepted GB18030, GB2312, Shift-JIS, and ISO-2022-* as overrides, then problems would be significantly less likely. Another alternative would be to apply a similar solution as we do on Linux with regards to the "surrogateescape" error handler: there are some interfaces (like the standard streams) where we only enable that error handler specifically if the preferred encoding is reported as ASCII. In 2016, we're *very* skeptical about any properly configured system actually being ASCII-only (rather than that value showing up because the POSIX standards mandate it as the default), so we don't really believe the OS when it tells us that. The equivalent for Windows would be to disbelieve the configured code page only when it was reported as "mbcs" - for folks that had configured their system to use something other than the default, Python would believe them, just as we do on Linux.

The problem here is that "mbcs" is not configurable - it's a meta-encoder that uses whatever is configured as the "language (system locale) to use when displaying text in programs that do not support Unicode" (quote from the dialog where administrators can configure this). So there's nothing to disbelieve here. 
And even on machines where the current code page is "reliable", UTF-16 is still the actual encoding, which means UTF-8 is still a better choice for representing the path as a blob of bytes. Currently we have inconsistent encoding between different Windows machines and could either remove that inconsistency completely or simply reduce it for (approx.) English speakers. I would rather go to an extreme here - either make it consistent regardless of user configuration, or make it so broken that nobody can use it at all. (And note that the correct way to support *some* other FS encodings would be to change the return value from sys.getfilesystemencoding(), which breaks people who currently ignore that just as badly as changing it to utf-8 would.) Cheers, Steve ___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
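Steve's point that a legacy code page "generally cannot represent all paths" is easy to demonstrate from Python, using cp1252 here as a stand-in for an ANSI code page (the filename is invented):

```python
name = 'отчёт.txt'                  # a perfectly legal NTFS filename
utf8 = name.encode('utf-8')         # always succeeds, round-trips losslessly
try:
    name.encode('cp1252')          # a Western legacy code page cannot hold it
except UnicodeEncodeError as exc:
    print('lost under cp1252:', exc.reason)
assert utf8.decode('utf-8') == name
```

Under the *A APIs such a name would be silently mapped through best-fit or default characters, which is exactly the data loss being discussed.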
Re: [Python-ideas] Fix default encodings on Windows
On 17 August 2016 at 02:06, Chris Barker wrote: > Just to make sure this is clear, the Pragmatic logic is thus: > > * There are more *nix-centric developers in the Python ecosystem than > Windows-centric (or even Windows-agnostic) developers. > > * The bytes path approach works fine on *nix systems. For the given value of "works fine" that is "works fine, except when it doesn't, and then you end up with mojibake". > * Whatever might be Right and Just -- the reality is that a number of > projects, including important and widely used libraries and frameworks, use > the bytes API for working with filenames and paths, etc. > > Therefore, there is a lot of code that does not work right on Windows. > > Currently, to get it to work right on Windows, you need to write Windows > specific code, which many folks don't want or know how to do (or just can't > support one way or the other). > > So the Solution is to either: > > (A) get everyone to use Unicode "properly", which will work on all > platforms (but only on py3.5 and above?) > > or > > (B) kludge some *nix-compatible support for byte paths into Windows, that > will work at least much of the time. > > It's clear (to me at least) that (A) is the "Right Thing", but real world > experience has shown that it's unlikely to happen any time soon. > > Practicality beats Purity and all that -- this is a judgment call. > > Have I got that right? Yep, pretty much. Based on Stephen Turnbull's concerns, I wonder if we could make a whitelist of universal encodings that Python-on-Windows will use in preference to UTF-8 if they're configured as the current code page. If we accepted GB18030, GB2312, Shift-JIS, and ISO-2022-* as overrides, then problems would be significantly less likely. 
Another alternative would be to apply a similar solution as we do on Linux with regards to the "surrogateescape" error handler: there are some interfaces (like the standard streams) where we only enable that error handler specifically if the preferred encoding is reported as ASCII. In 2016, we're *very* skeptical about any properly configured system actually being ASCII-only (rather than that value showing up because the POSIX standards mandate it as the default), so we don't really believe the OS when it tells us that. The equivalent for Windows would be to disbelieve the configured code page only when it was reported as "mbcs" - for folks that had configured their system to use something other than the default, Python would believe them, just as we do on Linux. Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia ___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
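The Linux-side heuristic Nick describes - trust the locale unless it claims ASCII, in which case assume misconfiguration - can be sketched as a small helper. `stream_errors` is a hypothetical name for illustration, not an actual CPython function:

```python
import codecs

def stream_errors(reported_encoding):
    """Sketch of the heuristic: if the locale reports ASCII, assume it is
    the POSIX-mandated default rather than a deliberate choice, and
    enable surrogateescape on the standard streams."""
    if codecs.lookup(reported_encoding).name == 'ascii':
        return 'surrogateescape'
    return 'strict'

print(stream_errors('ANSI_X3.4-1968'))  # surrogateescape
print(stream_errors('utf-8'))           # strict
```

The Windows analogue Nick proposes would key the same kind of check on the code page reporting as "mbcs" instead of ASCII, which Steve's follow-up explains is not workable.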
Re: [Python-ideas] Fix default encodings on Windows
On 17Aug2016 0235, Stephen J. Turnbull wrote: Paul Moore writes: > On 16 August 2016 at 16:56, Steve Dower wrote: > > This discussion is for the developers who insist on using bytes > > for paths within Python, and the question is, "how do we best > > represent UTF-16 encoded paths in bytes?" That's incomplete, AFAICS. (Paul makes this point somewhat differently.) We don't want to represent paths in bytes on Windows if we can avoid it. Nor does UTF-16 really enter into it (except for the technical issue of invalid surrogate pairs). So a full statement is, "How do we best represent Windows file system paths in bytes for interoperability with systems that natively represent paths in bytes?" ("Other systems" refers to both other platforms and existing programs on Windows.) That's incorrect, or at least easy to interpret as the wrong thing. The goal is "code compatibility with systems ...", not interoperability. Nothing about this will make it easier to take a path from Windows and use it on Linux or vice versa, but it will make it easier/more reliable to take code that uses paths on Linux and use it on Windows. BTW, why "surrogate pairs"? Does Windows validate surrogates to ensure they come in pairs, but not necessarily in the right order (or perhaps sometimes they resolve to non-characters such as U+1)? Eryk answered this better than I would have. Paul says: > People passing bytes to open() have in my view, already chosen not > to follow the standard advice of "decode incoming data at the > boundaries of your application". They may have good reasons for > that, but it's perfectly reasonable to expect them to take > responsibility for manually tracking the encoding of the resulting > bytes values flowing through their code. Abstractly true, but in practice there's no such need for those who made the choice! In a properly set up POSIX locale[1], it Just Works by design, especially if you use UTF-8 as the preferred encoding. 
It's Windows developers and users who suffer, not those who wrote the code, nor their primary audience which uses POSIX platforms. You mentioned "locale", "preferred" and "encoding" in the same sentence, so I hope you're not thinking of locale.getpreferredencoding()? Changing that function is orthogonal to this discussion, despite the fact that in most cases it returns the same code page as what is going to be used by the file system functions (which in most cases will also be used by the encoding returned from sys.getfilesystemencoding()). When Windows developers and users suffer, I see it as my responsibility to reduce that suffering. Changing Python on Windows should do that without affecting developers on Linux, even though the Right Way is to change all the developers on Linux to use str for paths. > > If you see an alternative choice to those listed above, feel free > > to contribute it. Otherwise, can we focus the discussion on these > > (or any new) choices? > > Accept that we should have deprecated builtin open and the io module, > but didn't do so. Extend the existing deprecation of bytes paths on > Windows, to cover *all* APIs, not just the os module, But modify the > deprecation to be "use of the Windows CP_ACP code page (via the ...A > Win32 APIs) is deprecated and will be replaced with use of UTF-8 as > the implied encoding for all bytes paths on Windows starting in Python > 3.7". Document and publicise it much more prominently, as it is a > breaking change. Then leave it one release for people to prepare for > the change. I like this one! If my paranoid fears are realized, in practice it might have to wait two releases, but at least this announcement should get people who are at risk to speak up. If they don't, then you can just call me "Chicken Little" and go ahead! I don't think there's any reasonable way to noisily deprecate these functions within Python, but certainly the docs can be made clearer. 
People who explicitly encode with sys.getfilesystemencoding() should not get the deprecation message, but we can't tell whether they got their bytes from the right encoding or a RNG, so there's no way to discriminate. I'm going to put together a summary post here (hopefully today) and get those who have been contributing to basically sign off on it, then I'll take it to python-dev. The possible outcomes I'll propose will basically be "do we keep the status quo, undeprecate and change the functionality, deprecate the deprecation and undeprecate/change in a couple releases, or say that it wasn't a real deprecation so we can deprecate and then change functionality in a couple releases". Cheers, Steve ___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
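For reference, the supported way to "explicitly encode with sys.getfilesystemencoding()" is os.fsencode/os.fsdecode, which also apply the filesystem error handler so the conversion round-trips:

```python
import os
import sys

path = 'café.txt'                    # example filename
raw = os.fsencode(path)              # uses sys.getfilesystemencoding()
assert os.fsdecode(raw) == path      # lossless round-trip
print(sys.getfilesystemencoding(), raw)
```

As Steve notes, though, nothing distinguishes bytes produced this way from bytes obtained anywhere else, which is why a targeted deprecation warning is not practical.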
Re: [Python-ideas] Fix default encodings on Windows
On Tue, Aug 16, 2016, at 12:12, Chris Barker wrote: > * convert and fail on invalid surrogate pairs > > where would an invalid surrogate pair come from? never from a file system > API call, yes? In principle it could, if the filesystem contains a file with an invalid surrogate pair. Nothing else, in general, prevents such a file from being created, though it's not easy to do so by accident. ___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
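Random832's point - that nothing in general prevents such a file from existing - rests on lone surrogates being representable as UTF-16 code units even though strict UTF-16 rejects them. In Python terms (the string is fabricated for illustration):

```python
s = 'bad\ud800name'                      # contains a lone high surrogate
try:
    s.encode('utf-16-le')                # strict UTF-16 refuses it
except UnicodeEncodeError:
    print('strict encode fails')
# ...but the raw code unit itself fits in 16 bits, as on an NTFS volume
units = s.encode('utf-16-le', 'surrogatepass')
assert units.decode('utf-16-le', 'surrogatepass') == s
```

A filesystem that stores names as arbitrary 16-bit units, the way NTFS does, will happily keep that name even though no strict UTF-16 encoder would produce it.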
Re: [Python-ideas] Fix default encodings on Windows
> On Mon, Aug 15, 2016 at 6:26 PM, Steve Dower wrote: > > and using the *W APIs exclusively is the right way to go. My proposal was to use the wide-character APIs, but transcoding CP_ACP without best-fit characters and raising a warning whenever the default character is used (e.g. substituting Katakana middle dot when creating a file using a bytes path that has an invalid sequence in CP932). This proposal was in response to the case made by Stephen Turnbull. If using UTF-8 is getting such heavy pushback, I thought half a solution was better than nothing, and it also sets up the infrastructure to easily switch to UTF-8 if that idea eventually gains acceptance. It could raise exceptions instead of warnings if that's preferred, since bytes paths on Windows are already deprecated. > *Any* encoding that may silently lose data is a problem, which basically > leaves utf-16 as the only option. However, as that causes other problems, > maybe we can accept the tradeoff of returning utf-8 and failing when a > path contains invalid surrogate pairs Are there any common sources of illegal UTF-16 surrogates in Windows filenames? I see that WTF-8 (Wobbly) was developed to handle this problem. A WTF-8 path would roundtrip back to the filesystem, but it should only be used internally in a program. ___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
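The WTF-8 behaviour Eryk mentions is available in Python as the 'surrogatepass' error handler on the UTF-8 codec: it round-trips unpaired surrogates that strict UTF-8 rejects. A small sketch (the filename is made up):

```python
s = 'photo\udc37.jpg'                    # name with an unpaired low surrogate
try:
    s.encode('utf-8')                    # plain UTF-8 refuses it
except UnicodeEncodeError:
    pass
wobbly = s.encode('utf-8', 'surrogatepass')
assert wobbly.decode('utf-8', 'surrogatepass') == s
print(wobbly)                            # b'photo\xed\xb0\xb7.jpg'
```

As the WTF-8 specification itself warns, such byte strings are for internal use and should not be emitted as interchange UTF-8.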
Re: [Python-ideas] Fix default encodings on Windows
On 16 August 2016 at 11:34, Chris Barker - NOAA Federal wrote: >> Given that, I'm proposing adding support for using byte strings encoded with >> UTF-8 in file system functions on Windows. This allows Python users to omit >> switching code like: >> >> if os.name == 'nt': >>     f = os.stat(os.listdir('.')[-1]) >> else: >>     f = os.stat(os.listdir(b'.')[-1]) > > REALLY? Do we really want to encourage using bytes as paths? IIUC, > anyone that wants to platform-independentify that code just needs to > use proper strings (or pathlib) for paths everywhere, yes? The problem is that bytes-as-paths actually *does* work for Mac OS X and systemd based Linux distros properly configured to use UTF-8 for OS interactions. This means that a lot of backend network service code makes that assumption, especially when it was originally written for Python 2, and rather than making it work properly on Windows, folks just drop Windows support as part of migrating to Python 3. At an ecosystem level, that means we're faced with a choice between implicitly encouraging folks to make their code *nix only, and finding a way to provide a more *nix like experience when running on Windows (where UTF-8 encoded binary data just works, and either other encodings lead to mojibake or else you use chardet to figure things out). Steve is suggesting that the latter option is preferable, a view I agree with since it lowers barriers to entry for Windows based developers to contribute to primarily *nix focused projects. > I understand that pre-surrogate-escape, there was a need for bytes > paths, but those days are gone, yes? No, UTF-8 encoded bytes are still the native language of network service development: http://utf8everywhere.org/ It also helps with cases where folks are switching back and forth between Python and other environments like JavaScript and Go where the UTF-8 assumption is more prevalent. > So why, at this late date, kludge what should be a deprecated pattern > into the Windows build??? 
Promoting cross-platform consistency often leads to enabling patterns that are considered a bad idea from a native platform perspective, and this strikes me as an example of that (just as the binary/text separation itself is a case where Python 3 diverged from the POSIX text model to improve consistency across *nix, Windows, JVM and CLR environments). Cheers, Nick. ___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
Re: [Python-ideas] Fix default encodings on Windows
On 15Aug2016 0954, Random832 wrote: On Mon, Aug 15, 2016, at 12:35, Steve Dower wrote: I'm still not sure we're talking about the same thing right now. For `open(path_as_bytes).read()`, are we talking about the way path_as_bytes is passed to the file system? Or the codec used to decide the returned string? We are talking about the way path_as_bytes is passed to the filesystem, and in particular what encoding path_as_bytes is *actually* in, when it was obtained from a file or other stream opened in binary mode.

Okay good, we are talking about the same thing. Passing path_as_bytes in that location has been deprecated since 3.3, so we are well within our rights (and probably overdue) to make it a TypeError in 3.6. While it's obviously an invalid assumption, for the purposes of changing the language we can assume that no existing code is passing bytes into any functions where it has been deprecated. As far as I'm concerned, there are currently no filesystem APIs on Windows that accept paths as bytes. Given that, I'm proposing adding support for using byte strings encoded with UTF-8 in file system functions on Windows. This allows Python users to omit switching code like:

if os.name == 'nt':
    f = os.stat(os.listdir('.')[-1])
else:
    f = os.stat(os.listdir(b'.')[-1])

Or simply using the bytes variant unconditionally because they heard it was faster (sacrificing cross-platform correctness, since it may not correctly round-trip on Windows). My proposal is to remove all use of the *A APIs and only use the *W APIs. That completely removes the (already deprecated) use of bytes as paths. I then propose to change the (unused on Windows) sys.getfsdefaultencoding() to 'utf-8' and handle bytes being passed into filesystem functions by transcoding into UTF-16 and calling the *W APIs. 
This completely removes the active codepage from the chain, allows paths returned from the filesystem to correctly roundtrip via bytes in Python, and allows those bytes paths to be manipulated at '\' characters. (Frankly I don't mind what encoding we use, and I'd be quite happy to force bytes paths to be UTF-16-LE encoded, which would also round-trip invalid surrogate pairs. But that would prevent basic manipulation which seems to be a higher priority.) This does not allow you to take bytes from an arbitrary source and assume that they are correctly encoded for the file system. Python 3.3, 3.4 and 3.5 have been warning that doing that is deprecated and the path needs to be decoded to a known encoding first. At this stage, it's time for us to either make byte paths an error, or to specify a suitable encoding that can correctly round-trip paths. If this does not answer the question, I'm going to need the question to be explained more clearly for me. Cheers, Steve ___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
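The decode-then-recode step Steve proposes for the filesystem functions can be sketched in pure Python. `bytes_path_to_wide` is a hypothetical helper for illustration, not the actual path_converter C code:

```python
def bytes_path_to_wide(p):
    """Sketch of the proposal: treat an incoming bytes path as UTF-8
    and transcode it to the UTF-16-LE that the *W APIs expect."""
    return p.decode('utf-8').encode('utf-16-le')

wide = bytes_path_to_wide(b'caf\xc3\xa9.txt')
assert wide == 'café.txt'.encode('utf-16-le')
```

Because both directions are strict, an invalid UTF-8 sequence in the bytes path would raise rather than silently producing a best-fit name, which is the behaviour change being debated.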
Re: [Python-ideas] Fix default encodings on Windows
On Mon, Aug 15, 2016, at 09:23, Steve Dower wrote: > I guess I'm not sure what your question is then. > > Using text internally is of course the best way to deal with it. But for > those who insist on using bytes, this change at least makes Windows a > feasible target without requiring manual encoding/decoding at every > boundary. Why isn't it already? What's "not feasible" about requiring manual encoding/decoding? Basically your assumption is that people using Python on windows and having to deal with files that contain filename data encoded as bytes are more likely to be dealing with data that is either UTF-8 anyway (coming from Linux or some other platform) or came from the current version of Python (which will encode things in UTF-8 under the change) than they are to deal with data that came from other Windows programs that encoded things in the codepage used by them and by other Windows users in the same country / who speak the same language. ___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
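The risk Random832 describes is that the same byte string means different things under different ANSI code pages, so bytes from another Windows program are not interchangeable with UTF-8 bytes. A two-line illustration:

```python
raw = b'\x82\xa0'                  # bytes as written by some other program
print(raw.decode('cp932'))         # 'あ' under the Japanese ANSI code page
print(raw.decode('cp1252'))        # '\u201a\xa0' - mojibake under a Western one
```

Whether those two bytes are one Japanese character or two Western ones depends entirely on which code page the producing program assumed, which is the information a UTF-8-only policy would discard.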
Re: [Python-ideas] Fix default encodings on Windows
> The last point is correct: if you get bytes from a file system API, you should be able to pass them back in without losing information. CP_ACP (a.k.a. the *A API) does not allow this, so I'm proposing using the *W API everywhere and encoding to utf-8 when the user wants/gives bytes. You run into trouble when the filename comes from a file, another application, a registry key, ... which is encoded with CP_ACP. Do you plan to transcode all this data? (decode from CP_ACP, encode back to UTF-8) ___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
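The transcoding Victor is asking about is mechanically simple; sketched here with cp1252 standing in for CP_ACP (real Windows code would use the 'mbcs' codec, which only exists on Windows):

```python
# Bytes as another application wrote them, in the ANSI code page
ansi = 'Ségolène'.encode('cp1252')
# The proposed conversion step: ANSI in, UTF-8 out
utf8 = ansi.decode('cp1252').encode('utf-8')
assert utf8.decode('utf-8') == 'Ségolène'
```

The hard part is not the conversion but Victor's actual question: knowing, for every external source, that its bytes really are CP_ACP in the first place.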
Re: [Python-ideas] Fix default encodings on Windows
Random832 writes: > And what's going to happen if you shovel those bytes into the > filesystem without conversion on Linux, or worse, OSX? Off topic. See Subject: field. > This proposal embodies an assumption that bytes from unknown sources > used as filenames are more likely to be UTF-8 than in the locale ACP Then it's irrelevant: most bytes are not from "unknown sources", they're from correspondents (or from yourself!) -- and for most users most of the time, those correspondents share the locale encoding with them. At least where I live, they use that encoding frequently. > the only solution is to require the application to make a > considered decision That's not a solution. Code is not written with every decision considered, and it never will be. The (long-run) solution is a la Henry Ford: "you can encode text any way you want, as long as it's UTF-8". Then it won't matter if people ever make considered decisions about encoding! But trying to enforce that instead of letting it evolve naturally (as it is doing) will cause unnecessary pain for Python programmers, and I believe quite a lot of pain. I used to be in the "make them speak UTF-8" camp. But in the 15 years since PEP 263, experience has shown me that mostly it doesn't matter, and that when it does matter, you have to deal with the large variety of encodings anyway -- assuming UTF-8 is not a win. For use cases that can be encoding-agnostic because all cooperating participants share a locale encoding, making them explicitly specify the locale encoding is just a matter of "misery loves company". Please, let's not do things for that reason. > I think the use case that the proposal has in mind is a > file-names-are-just-bytes program (or set of programs) that reads > from the filesystem, converts to bytes for a file/network, and then > eventually does the reverse - either end may be on windows. You have misspoken somewhere. 
The programs under discussion do not "convert" input to bytes; they *receive* bytes, either from POSIX APIs or from Windows *A APIs, and use them as is. Unless I am greatly mistaken, Steve simply wants that to work as well on Windows as on POSIX platforms, so that POSIX programmers who do encoding-agnostic programming have one less barrier to supporting their software on Windows. But you'll have to ask Steve to rule on that. Steve ___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
Re: [Python-ideas] Fix default encodings on Windows
Hello, I'm on holiday and I'm writing on a phone, so sorry in advance for the short answer. In short: we should drop support for the bytes API. Just use Unicode on all platforms, especially for filenames. Sorry but most of these changes look like very bad ideas. Or maybe I misunderstood something. Windows bytes API are broken in different ways, in short your proposal is to put another layer on top of it to try to workaround issues. Unicode is complex. Unicode issues are hard to debug. Adding a new layer makes debugging even harder. Is the bug in the input data? In the layer? In the final Windows function? In my experience on UNIX, the most important part is the interoperability with other applications. I understand that Python 2 will speak ANSI code page but Python 3 will speak UTF-8. I don't understand how it can work. Almost all Windows applications speak the ANSI code page (I'm talking about stdin, stdout, pipes, ...). Do you propose to first try to decode from UTF-8 or fallback on decoding from the ANSI code page? What about encoding? Always encode to UTF-8? About BOM: I hate them. Many applications don't understand them. Again, think about Python 2. I recall vaguely that the Unicode standard suggests to not use BOM (I have to check). I recall a bug in gettext. The tool doesn't understand BOM. When I opened the file in vim, the BOM was invisible (hidden). I had to use hexdump to understand the issue! BOM introduces issues very difficult to debug :-/ I also think that it goes in the wrong direction in term of interoperability. For the Windows console: I played with all Windows functions, tried all fonts and many code pages. I also read technical blog articles of Microsoft employees. I gave up on this issue. It doesn't seem possible to fully support Unicode in the Windows console (at least the last time I checked). By the way, it seems like Windows functions have bugs, and the code page 65001 fixes a few issues but introduces new issues... 
Victor On 10 August 2016 at 20:16, "Steve Dower" wrote: > I suspect there's a lot of discussion to be had around this topic, so I > want to get it started. There are some fairly drastic ideas here and I need > help figuring out whether the impact outweighs the value. > > Some background: within the Windows API, the preferred encoding is UTF-16. > This is a 16-bit format that is typed as wchar_t in the APIs that use it. > These APIs are generally referred to as the *W APIs (because they have a W > suffix). > > There are also (broadly deprecated) APIs that use an 8-bit format (char), > where the encoding is assumed to be "the user's active code page". These > are *A APIs. AFAIK, there are no cases where a *A API should be preferred > over a *W API, and many newer APIs are *W only. > > In general, Python passes byte strings into the *A APIs and text strings > into the *W APIs. > > Right now, sys.getfilesystemencoding() on Windows returns "mbcs", which > translates to "the system's active code page". As this encoding generally > cannot represent all paths on Windows, it is deprecated and Unicode strings > are recommended instead. This, however, means you need to write > significantly different code between POSIX (use bytes) and Windows (use > text). > > ISTM that changing sys.getfilesystemencoding() on Windows to "utf-8" and > updating path_converter() (Python/posixmodule.c; likely similar code in > other places) to decode incoming byte strings would allow us to undeprecate > byte strings and add the requirement that they *must* be encoded with > sys.getfilesystemencoding(). I assume that this would allow cross-platform > code to handle paths similarly by encoding to whatever the sys module says > they should and using bytes consistently (starting this thread is meant to > validate/refute my assumption). > > (Yes, I know that people on POSIX should just change to using Unicode and > surrogateescape. 
Unfortunately, rather than doing that they complain about > Windows and drop support for the platform. If you want to keep hitting them > with the stick, go ahead, but I'm inclined to think the carrot is more > valuable here.) > > Similarly, locale.getpreferredencoding() on Windows returns a legacy value > - the user's active code page - which should generally not be used for any > reason. The one exception is as a default encoding for opening files when > no other information is available (e.g. a Unicode BOM or explicit encoding > argument). BOMs are very common on Windows, since the default assumption is > nearly always a bad idea. > > Making open()'s default encoding detect a BOM before falling back to > locale.getpreferredencoding() would resolve many issues, but I'm also > inclined towards making the fallback utf-8, leaving > locale.getpreferredencoding() solely as a way to get the active system > codepage (with suitable warnings about it only being useful for > back-compat). This would match the behavior that the .NET Framework has >
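The BOM detection Steve describes for open() is essentially what the 'utf-8-sig' codec already does for data read in binary:

```python
import codecs

# A typical Windows text file: UTF-8 BOM followed by UTF-8 text
data = codecs.BOM_UTF8 + 'héllo'.encode('utf-8')
assert data.decode('utf-8-sig') == 'héllo'          # BOM consumed
assert data.decode('utf-8').startswith('\ufeff')    # plain utf-8 leaks it through
```

Victor's gettext anecdote is the second case in reverse: a tool that decodes without BOM awareness ends up with an invisible U+FEFF at the start of its data.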
Re: [Python-ideas] Fix default encodings on Windows
On Fri Aug 12 11:33:35 EDT 2016, Random832 wrote: > On Wed, Aug 10, 2016, at 15:08, Steve Dower wrote: >> That's the hope, though that module approaches the solution differently >> and may still use... An alternative way for us to fix this whole thing >> would be to bring win_unicode_console into the standard library and use >> it by default (or probably whenever PYTHONIOENCODING is not specified). > > I have concerns about win_unicode_console: > - For the "text_transcoded" streams, stdout.encoding is utf-8. For the > "text" streams, it is utf-16. UTF-16 is the "native" encoding since it corresponds to the wide chars used by Read/WriteConsoleW. UTF-8 is used just as a signal for the consumers of PyOS_Readline. > - There is no object, as far as I can find, which can be used as an > unbuffered unicode I/O object. There is no buffer just on those wrapping streams because the bytes I have are not in UTF-8. Adding one would mean a fake buffer that just decodes and writes to the text stream. AFAIK there is no guarantee that sys.std* objects have buffer attribute and any code relying on that is incorrect. But I understand that there may be such code and we may want to be compatible. > - raw output streams silently drop the last byte if an odd number of > bytes are written. That's not true, it doesn't write an odd number of bytes, but returns the correct number of bytes written. If only one byte is given, it raises a ValueError. > - The sys.stdout obtained via streams.enable does not support .buffer / > .buffer.raw / .detach > - All of these objects provide a fileno() interface. Is this wrong? If I remember correctly, I provide it because of some check -- maybe in input() -- to be viewed as a stdio stream. > - When using os.read/write for data that represents text, the data still > should be encoded in the console encoding and not in utf-8 or utf-16. I don't know what to do with this. Generally I wouldn't use bytes to communicate textual data. 
Regards, Adam Bartoš ___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
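Adam's description of the wrapper - a text stream that advertises encoding='utf-8' but never actually produces UTF-8 bytes - can be caricatured in a few lines. `ConsoleWriter` and the list-based sink are inventions for illustration; win_unicode_console's real classes differ:

```python
import io

class ConsoleWriter(io.TextIOBase):
    """Toy transcoding-style wrapper: str in, str out. The 'utf-8'
    encoding attribute is only a signal to code that inspects it,
    as Adam says of PyOS_Readline consumers."""
    encoding = 'utf-8'

    def __init__(self, sink):
        self._sink = sink            # stands in for WriteConsoleW

    def writable(self):
        return True

    def write(self, s):
        self._sink.append(s)         # str passes straight through, no encoding
        return len(s)

out = []
w = ConsoleWriter(out)
w.write('héllo ☃')
assert out == ['héllo ☃']
assert w.encoding == 'utf-8'
```

Note there is deliberately no .buffer on this object, mirroring Adam's point that code assuming sys.std*.buffer exists is relying on a guarantee the io docs do not make.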
Re: [Python-ideas] Fix default encodings on Windows
I was thinking we would end up using the console API for input but stick with the standard handles for output, mostly to minimize the amount of magic switching we have to do. But since we can just switch the entire stream object in __std*__ once at startup if nothing is redirected it probably isn't that much of a simplification. I have some airport/aeroplane time today where I can experiment. Top-posted from my Windows Phone -Original Message- From: "eryk sun" <eryk...@gmail.com> Sent: 8/12/2016 5:40 To: "python-ideas" <python-ideas@python.org> Subject: Re: [Python-ideas] Fix default encodings on Windows On Thu, Aug 11, 2016 at 9:07 AM, Paul Moore <p.f.mo...@gmail.com> wrote: > set codepage to UTF-8 > ... > set codepage back > spawn subprocess X, but don't wait for it > set codepage to UTF-8 > ... > ... At this point what codepage does Python see? What codepage does > process X see? (Note that they are both sharing the same console). The input and output codepages are global data in conhost.exe. They aren't tracked for each attached process (unlike input history and aliases). That's how chcp.com works in the first place. Otherwise its calls to SetConsoleCP and SetConsoleOutputCP would be pointless. But IMHO all talk of using codepage 65001 is a waste of time. I think the trailing garbage output with this codepage in Windows 7 is unacceptable. And getting EOF for non-ASCII input is a show stopper. The problem occurs in conhost. All you get is the EOF result from ReadFile/ReadConsoleA, so it can't be worked around. This kills the REPL and raises EOFError for input(). ISTM the only people who think codepage 65001 actually works are those using Windows 8+ who occasionally need to print non-OEM text and never enter (or paste) anything but ASCII text. 
___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
Re: [Python-ideas] Fix default encodings on Windows
Eryk Sun wrote: > IMO, Python needs a C implementation of the win_unicode_console > module, using the wide-character APIs ReadConsoleW and WriteConsoleW. > Note that this sets sys.std*.encoding as UTF-8 and transcodes, so > Python code never has to work directly with UTF-16 encoded text. > > The transcoding wrappers with 'utf-8' encoding are used just as a workaround for the fact that the Python tokenizer cannot use utf-16-le and that the readlinehook machinery is unfortunately bytes-based. The transcoding wrapper just has encoding 'utf-8' and no buffer attribute, so there is no actual transcoding in sys.std* objects. It's just a signal for PyOS_Readline consumers, and the transcoding occurs in a custom readline hook. Nothing like this would be needed if PyOS_Readline was replaced by some Python API wrapper around sys.readlinehook that would be Unicode string based. Adam Bartoš ___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
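[Editor's note: Adam's arrangement — an encoding attribute that is only a signal, no buffer attribute, and no real transcoding inside the stream — can be sketched with a toy class. The class and names below are hypothetical illustrations, not win_unicode_console's actual API.]

```python
import io

class ConsoleIn(io.TextIOBase):
    """Toy stand-in for a console reader that holds text natively.

    encoding is advertised as 'utf-8' purely as a signal to consumers
    (such as PyOS_Readline callers); since there is no .buffer
    attribute, no transcoding happens inside the stream object itself.
    """

    def __init__(self, lines):
        self._lines = iter(lines)

    @property
    def encoding(self):
        return 'utf-8'  # a signal, not a description of real transcoding

    def readable(self):
        return True

    def readline(self, size=-1):
        return next(self._lines, '')

stream = ConsoleIn(['héllo\n'])
```

A consumer that checks stream.encoding sees 'utf-8', while hasattr(stream, 'buffer') is False — the combination Adam describes.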
Re: [Python-ideas] Fix default encodings on Windows
On Thu, Aug 11, 2016, at 10:25, Steven D'Aprano wrote: > > Interesting. Are you assuming that a text file cannot be empty? > > Hmmm... not consciously, but I guess I was. > > If the file is empty, how do you know it's text? Heh. That's the *other* thing that Notepad does wrong in the opinion of people coming from the Unix world - a Windows text file does not need to end with a [CR]LF, and normally will not. > But we're getting off topic here. In context of Steve's suggestion, we > should only autodetect UTF-8. In other words, if there's a UTF-8 BOM, > skip it, otherwise treat the file as UTF-8. I think there's still room for UTF-16. It's two of the four encodings supported by Notepad, after all. ___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
Re: [Python-ideas] Fix default encodings on Windows
On Thu, Aug 11, 2016 at 02:09:00PM +1000, Chris Angelico wrote: > On Thu, Aug 11, 2016 at 1:14 PM, Steven D'Aprano wrote: > > The way I have done auto-detection based on BOMs is you start by reading > > four bytes from the file in binary mode. (If there are fewer than four > > bytes, it cannot be a text file with a BOM.) > > Interesting. Are you assuming that a text file cannot be empty? Hmmm... not consciously, but I guess I was. If the file is empty, how do you know it's text? > Because 0xFF 0xFE could represent an empty file in UTF-16, and 0xEF > 0xBB 0xBF likewise for UTF-8. Or maybe you don't care about files with > less than one character in them? I'll have to think about it some more :-) > For a default file-open encoding detection, I would minimize the > number of options. The UTF-7 BOM could be the beginning of a file > containing Base 64 data encoded in ASCII, which is a very real > possibility. I'm coming from the assumption that you're reading unformatted text in an unknown encoding, rather than some structured format. But we're getting off topic here. In context of Steve's suggestion, we should only autodetect UTF-8. In other words, if there's a UTF-8 BOM, skip it, otherwise treat the file as UTF-8. > When was the last time you saw a UTF-32LE-BOM file? Two minutes ago, when I looked at my test suite :-P -- Steve ___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
Re: [Python-ideas] Fix default encodings on Windows
On 11 August 2016 at 01:41, Chris Angelico wrote: > I've almost never seen files stored in UTF-32 (even UTF-16 isn't all > that common compared to UTF-8), so I wouldn't stress too much about > that. Recognizing FE FF or FF FE and decoding as UTF-16 might be worth > doing, but it could easily be retrofitted (that byte sequence won't > decode as UTF-8). I see UTF-16 relatively often as a result of redirecting stdout in PowerShell and forgetting that it defaults (stupidly, IMO) to UTF-16. >> The main problem here is that if the console is not forced to UTF-8 then it >> won't render any of the characters correctly. > > Ehh, that's annoying. Is there a way to guarantee, at the process > level, that the console will be returned to "normal state" when Python > exits? If not, there's the risk that people run a Python program and > then the *next* program gets into trouble. There's also the risk that Python programs using subprocess.Popen start the subprocess with the console in a non-standard state. Should we be temporarily restoring the console codepage in that case? How does the following work?

    set codepage to UTF-8
    ...
    set codepage back
    spawn subprocess X, but don't wait for it
    set codepage to UTF-8
    ...
    ... At this point what codepage does Python see? What codepage does
    process X see? (Note that they are both sharing the same console).
    ... restore codepage

Paul ___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
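[Editor's note: one way to bound the "console left in a non-standard state" risk Paul describes is to scope every codepage change in a context manager so it is restored even if the block raises. This is only a sketch: GetConsoleCP/SetConsoleCP and their output counterparts are real kernel32 APIs, but the no-op fallback on non-Windows platforms is an addition for illustration, and nothing here helps if the process is killed outright, since conhost's codepage state is global rather than per-process.]

```python
import contextlib
import ctypes
import sys

@contextlib.contextmanager
def console_codepage(cp=65001):
    """Temporarily switch the console input/output codepages (Windows only).

    On non-Windows platforms this is a no-op so the code stays importable.
    """
    if sys.platform != 'win32':
        yield
        return
    kernel32 = ctypes.windll.kernel32
    old_in = kernel32.GetConsoleCP()
    old_out = kernel32.GetConsoleOutputCP()
    kernel32.SetConsoleCP(cp)
    kernel32.SetConsoleOutputCP(cp)
    try:
        yield
    finally:
        # Restore on the way out, even on an exception.
        kernel32.SetConsoleCP(old_in)
        kernel32.SetConsoleOutputCP(old_out)

with console_codepage(65001):
    result = 'ran inside the scoped codepage'
```

Even this cannot answer Paul's subprocess question: a child process inherits whatever the shared console's codepage happens to be at spawn time.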
Re: [Python-ideas] Fix default encodings on Windows
On 11 August 2016 at 00:30, Random832 wrote: >> Python could copy how >> configure_text_mode() handles the BOM, except it shouldn't write a BOM >> for new UTF-8 files. > > I disagree. I think that *on windows* it should, just like *on windows* > it should write CR-LF for line endings. Tools like git and hg, and cross-platform text editors, handle transparently managing the differences between line endings for you. But nothing much handles BOM stripping/adding automatically. So while in theory the two cases are similar, in practice lack of tool support means that if we start adding BOMs on Windows (and requiring them so that we can detect UTF8) then we'll be setting up new interoperability problems for Python users, for little benefit. Paul ___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
Re: [Python-ideas] Fix default encodings on Windows
On Wed, Aug 10, 2016, at 17:31, Chris Angelico wrote: > AIUI, the data flow would be: Python bytes object Nothing _starts_ as a Python bytes object. It has to be read from somewhere or encoded in the source code as a literal. The scenario is very different for "defined internally within the program" (how are these not gonna be ASCII) vs "user input" (user input how? from the console? from tkinter? how'd that get converted to bytes?) vs "from a network or something like a tar file where it represents a path on some other system" (in which case it's in whatever encoding that system used, or *maybe* an encoding defined as part of the network protocol or file format). The use case has not been described adequately enough to answer my question. ___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
Re: [Python-ideas] Fix default encodings on Windows
On Thu, Aug 11, 2016 at 1:14 PM, Steven D'Aprano wrote: > On Wed, Aug 10, 2016 at 04:40:31PM -0700, Steve Dower wrote: > >> On 10Aug2016 1431, Chris Angelico wrote: >> >>* make the default open() encoding check for a BOM or else use utf-8 >> > >> >-0.5. Is there any precedent for this kind of data-based detection >> >being the default? > > There is precedent: the Python interpreter will accept a BOM instead of > an encoding cookie when importing .py files. Okay, that's good enough for me. > [Chris] >> >An explicit "utf-sig" could do a full detection, >> >but even then it's not perfect - how do you distinguish UTF-32LE from >> >UTF-16LE that starts with U+0000? > > BOMs are a heuristic, nothing more. If you're reading arbitrary files > that could start with anything, then of course they can guess wrong. But then > if I dumped a bunch of arbitrary Unicode codepoints in your lap and > asked you to guess the language, you would likely get it wrong too :-) I have my own mental heuristics, but I can't recognize one Cyrillic language from another. And some Slavic languages can be written with either Latin or Cyrillic letters, just to further confuse matters. Of course, "arbitrary Unicode codepoints" might not all come from one language, and might not be any language at all. (Do you wanna build a U+2603?) > [Chris] >> >Do you say "UTF-32 is rare so we'll >> >assume UTF-16", or do you say "files starting U+0000 are rare, so >> >we'll assume UTF-32"? > > The way I have done auto-detection based on BOMs is you start by reading > four bytes from the file in binary mode. (If there are fewer than four > bytes, it cannot be a text file with a BOM.) Interesting. Are you assuming that a text file cannot be empty? Because 0xFF 0xFE could represent an empty file in UTF-16, and 0xEF 0xBB 0xBF likewise for UTF-8. Or maybe you don't care about files with less than one character in them? 
> Compare those first four > bytes against the UTF-32 BOMs first, and the UTF-16 BOMs *second* > (otherwise UTF-16 will shadow UTF-32). Note that there are two BOMs > (big-endian and little-endian). Then check for UTF-8, and if you're > really keen, UTF-7 and UTF-1. For a default file-open encoding detection, I would minimize the number of options. The UTF-7 BOM could be the beginning of a file containing Base 64 data encoded in ASCII, which is a very real possibility.

>     elif bom.startswith(b'\x2B\x2F\x76'):
>         if len(bom) == 4 and bom[3] in b'\x2B\x2F\x38\x39':
>             return 'utf_7'

So I wouldn't include UTF-7 in the detection. Nor UTF-1. Both are rare. Even UTF-32 doesn't necessarily have to be included. When was the last time you saw a UTF-32LE-BOM file? > [Steve] >> But the main reason for detecting the BOM is that currently opening >> files with 'utf-8' does not skip the BOM if it exists. I'd be quite >> happy with changing the default encoding to: >> >> * utf-8-sig when reading (so the UTF-8 BOM is skipped if it exists) >> * utf-8 when writing (so the BOM is *not* written) > > Sounds reasonable to me. > > Rather than hard-coding that behaviour, can we have a new encoding that > does that? "utf-8-readsig" perhaps. +1. Makes the documentation easier by having the default value for encoding not depend on the value for mode. ChrisA ___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
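[Editor's note: the ordering point in this exchange — UTF-16 BOMs must be checked after UTF-32, because b'\xff\xfe' is a prefix of the UTF-32-LE signature — can be verified against the stdlib's own BOM constants. This is a minimal sniffer in the spirit of the bom2enc function quoted above, not a replacement for it; it deliberately omits UTF-7 and UTF-1, per Chris's suggestion.]

```python
import codecs

# Longest signatures first, so UTF-16 cannot shadow UTF-32.
SIGNATURES = [
    (codecs.BOM_UTF32_BE, 'utf_32'),  # b'\x00\x00\xfe\xff'
    (codecs.BOM_UTF32_LE, 'utf_32'),  # b'\xff\xfe\x00\x00'
    (codecs.BOM_UTF8, 'utf_8_sig'),   # b'\xef\xbb\xbf'
    (codecs.BOM_UTF16_BE, 'utf_16'),  # b'\xfe\xff'
    (codecs.BOM_UTF16_LE, 'utf_16'),  # b'\xff\xfe'
]

def sniff(prefix, default='utf_8'):
    """Guess an encoding name from the first four bytes of a file."""
    for bom, name in SIGNATURES:
        if prefix.startswith(bom):
            return name
    return default
```

Note that b'\xff\xfe\x00\x00' is reported as UTF-32 here even though it could equally be UTF-16-LE text beginning with U+0000 — the ambiguity discussed above is inherent, and the ordering only picks which guess wins.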
Re: [Python-ideas] Fix default encodings on Windows
On Wed, Aug 10, 2016 at 04:40:31PM -0700, Steve Dower wrote: > On 10Aug2016 1431, Chris Angelico wrote: > >>* make the default open() encoding check for a BOM or else use utf-8 > > > >-0.5. Is there any precedent for this kind of data-based detection > >being the default? There is precedent: the Python interpreter will accept a BOM instead of an encoding cookie when importing .py files. [Chris] > >An explicit "utf-sig" could do a full detection, > >but even then it's not perfect - how do you distinguish UTF-32LE from > >UTF-16LE that starts with U+0000? BOMs are a heuristic, nothing more. If you're reading arbitrary files that could start with anything, then of course they can guess wrong. But then if I dumped a bunch of arbitrary Unicode codepoints in your lap and asked you to guess the language, you would likely get it wrong too :-) [Chris] > >Do you say "UTF-32 is rare so we'll > >assume UTF-16", or do you say "files starting U+0000 are rare, so > >we'll assume UTF-32"? The way I have done auto-detection based on BOMs is you start by reading four bytes from the file in binary mode. (If there are fewer than four bytes, it cannot be a text file with a BOM.) Compare those first four bytes against the UTF-32 BOMs first, and the UTF-16 BOMs *second* (otherwise UTF-16 will shadow UTF-32). Note that there are two BOMs (big-endian and little-endian). Then check for UTF-8, and if you're really keen, UTF-7 and UTF-1. 
    def bom2enc(bom, default=None):
        """Return encoding name from a four-byte BOM."""
        if bom.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
            return 'utf_32'
        elif bom.startswith((b'\xFE\xFF', b'\xFF\xFE')):
            return 'utf_16'
        elif bom.startswith(b'\xEF\xBB\xBF'):
            return 'utf_8_sig'
        elif bom.startswith(b'\x2B\x2F\x76'):
            if len(bom) == 4 and bom[3] in b'\x2B\x2F\x38\x39':
                return 'utf_7'
        elif bom.startswith(b'\xF7\x64\x4C'):
            return 'utf_1'
        elif default is None:
            raise ValueError('no recognisable BOM signature')
        else:
            return default

[Steve Dower] > The BOM exists solely for data-based detection, and the UTF-8 BOM is > different from the UTF-16 and UTF-32 ones. So we either find an exact > BOM (which IIRC decodes as a no-op spacing character, though I have a > feeling some version of Unicode redefined it exclusively for being the > marker) or we use utf-8. The Byte Order Mark is always U+FEFF encoded into whatever bytes your encoding uses. You should never use U+FEFF except as a BOM, but of course arbitrary Unicode strings might include it in the middle of the string Just Because. In that case, it may be interpreted as a legacy "ZERO WIDTH NON-BREAKING SPACE" character. But new content should never do that: you should use U+2060 "WORD JOINER" instead, and treat a U+FEFF inside the body of your file or string as an unsupported character. http://www.unicode.org/faq/utf_bom.html#BOM [Steve] > But the main reason for detecting the BOM is that currently opening > files with 'utf-8' does not skip the BOM if it exists. I'd be quite > happy with changing the default encoding to: > > * utf-8-sig when reading (so the UTF-8 BOM is skipped if it exists) > * utf-8 when writing (so the BOM is *not* written) Sounds reasonable to me. Rather than hard-coding that behaviour, can we have a new encoding that does that? "utf-8-readsig" perhaps. [Steve] > This provides the best compatibility when reading/writing files without > making any guesses. 
We could reasonably extend this to read utf-16 and > utf-32 if they have a BOM, but that's an extension and not necessary for > the main change. The use of a BOM is always a guess :-) Maybe I just happen to have a Latin1 file that starts with "ï»¿", or a Mac Roman file that starts with "Ôªø". Either case will be wrongly detected as UTF-8. That's the risk you take when using a heuristic. And if you don't want to use that heuristic, then you must specify the actual encoding in use. -- Steven D'Aprano ___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
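[Editor's note: both edge cases from this exchange are easy to reproduce with the codecs machinery alone — the UTF-8 BOM bytes are perfectly valid Latin-1 or Mac Roman text, and a file containing only a BOM decodes as an empty string.]

```python
bom = b'\xef\xbb\xbf'  # the UTF-8 BOM

# The same three bytes decode happily under legacy encodings, so BOM
# sniffing can misidentify a legacy file as UTF-8.
latin1_view = bom.decode('latin-1')       # 'ï»¿'
mac_roman_view = bom.decode('mac_roman')  # 'Ôªø'

# A BOM alone decodes as the *empty* string, so "empty text file" and
# "file with no bytes" are different questions.
empty_utf8 = bom.decode('utf-8-sig')
empty_utf16 = b'\xff\xfe'.decode('utf-16')
```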
Re: [Python-ideas] Fix default encodings on Windows
On Thu, Aug 11, 2016 at 9:40 AM, Steve Dower wrote: > On 10Aug2016 1431, Chris Angelico wrote: >> I'd rather a single consistent default encoding. > > I'm proposing to make that single consistent default encoding utf-8. It > sounds like we're in agreement? Yes, we are. I was disagreeing with Random's suggestion that mbcs would also serve. Defaulting to UTF-8 everywhere is (a) consistent on all systems, regardless of settings; and (b) consistent with bytes.decode() and str.encode(), both of which default to UTF-8. >> -0.5. Is there any precedent for this kind of data-based detection >> being the default? An explicit "utf-sig" could do a full detection, >> but even then it's not perfect - how do you distinguish UTF-32LE from >> UTF-16LE that starts with U+0000? Do you say "UTF-32 is rare so we'll >> assume UTF-16", or do you say "files starting U+0000 are rare, so >> we'll assume UTF-32"? > > > The BOM exists solely for data-based detection, and the UTF-8 BOM is > different from the UTF-16 and UTF-32 ones. So we either find an exact BOM > (which IIRC decodes as a no-op spacing character, though I have a feeling > some version of Unicode redefined it exclusively for being the marker) or we > use utf-8. > > But the main reason for detecting the BOM is that currently opening files > with 'utf-8' does not skip the BOM if it exists. I'd be quite happy with > changing the default encoding to: > > * utf-8-sig when reading (so the UTF-8 BOM is skipped if it exists) > * utf-8 when writing (so the BOM is *not* written) > > This provides the best compatibility when reading/writing files without > making any guesses. We could reasonably extend this to read utf-16 and > utf-32 if they have a BOM, but that's an extension and not necessary for the > main change. AIUI the utf-8-sig encoding is happy to decode something that doesn't have a signature, right? If so, then yes, I would definitely support that mild mismatch in defaults. Chew up that UTF-8 aBOMination and just use UTF-8 as is. 
I've almost never seen files stored in UTF-32 (even UTF-16 isn't all that common compared to UTF-8), so I wouldn't stress too much about that. Recognizing FE FF or FF FE and decoding as UTF-16 might be worth doing, but it could easily be retrofitted (that byte sequence won't decode as UTF-8). >>> * force the console encoding to UTF-8 on initialize and revert on >>> finalize >> >> >> -0 for Python itself; +1 for Python's interactive interpreter. >> Programs that mess with console settings get annoying when they crash >> out and don't revert properly. Unless there is *no way* that you could >> externally kill the process without also bringing the terminal down, >> there's the distinct possibility of messing everything up. > > > The main problem here is that if the console is not forced to UTF-8 then it > won't render any of the characters correctly. Ehh, that's annoying. Is there a way to guarantee, at the process level, that the console will be returned to "normal state" when Python exits? If not, there's the risk that people run a Python program and then the *next* program gets into trouble. But if that happens only on abnormal termination ("I killed Python from Task Manager, and it left stuff messed up so I had to close the console"), it's probably an acceptable risk. And the benefit sounds well worthwhile. Revising my recommendation to +0.9. ChrisA ___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
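[Editor's note: Chris's reading of utf-8-sig is correct and easy to confirm — it strips a leading BOM when present and is a no-op otherwise, while plain utf-8 never writes one, which is exactly the read/write asymmetry Steve proposes.]

```python
# Decoding: utf-8-sig skips a leading BOM if present, and is otherwise
# identical to utf-8.
with_bom = b'\xef\xbb\xbfspam'.decode('utf-8-sig')
without_bom = b'spam'.decode('utf-8-sig')

# Encoding: utf-8 writes no BOM; utf-8-sig would prepend one.
plain = 'spam'.encode('utf-8')
signed = 'spam'.encode('utf-8-sig')
```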
Re: [Python-ideas] Fix default encodings on Windows
On Wed, Aug 10, 2016 at 11:30 PM, Random832 wrote: > Er... utf-8 doesn't work reliably with arbitrary bytes paths either, > unless you intend to use surrogateescape (which you could also do with > mbcs). > > Is there any particular reason to expect all bytes paths in this > scenario to be valid UTF-8? The problem is more that data is lost without an error when using the legacy ANSI API. If the path is invalid UTF-8, Python will at least raise an exception when decoding it. To work around this, the developers may decide they need to just bite the bullet and use Unicode, or maybe there could be legacy Latin-1 and ANSI modes enabled by an environment variable or sys flag. ___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
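[Editor's note: Eryk's two points — strict decoding fails loudly where the ANSI API silently best-fits, and surrogateescape (mentioned in Random832's message) can round-trip the raw bytes — can both be shown with the same b'\x81\xad' sequence discussed later in the thread.]

```python
bad = b'\x81\xad'  # not a defined sequence in codepage 932

# Strict decoding raises instead of silently mangling the path the way
# the ANSI filesystem API does.
try:
    bad.decode('cp932')
    failed_loudly = False
except UnicodeDecodeError:
    failed_loudly = True

# surrogateescape smuggles the undecodable bytes through str and back
# losslessly, so the original path bytes survive a round trip.
smuggled = bad.decode('cp932', 'surrogateescape')
round_tripped = smuggled.encode('cp932', 'surrogateescape')
```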
Re: [Python-ideas] Fix default encodings on Windows
On Wed, Aug 10, 2016 at 8:09 PM, Random832 wrote: > On Wed, Aug 10, 2016, at 15:22, Steve Dower wrote: >> >> Allowing library developers who support POSIX and Windows to just use >> bytes everywhere to represent paths. > > Okay, how is that use case impacted by it being mbcs instead of utf-8? Using 'mbcs' doesn't work reliably with arbitrary bytes paths in locales that use a DBCS codepage such as 932. If a sequence is invalid, it gets passed to the filesystem as the default Unicode character, so it won't successfully roundtrip. In the following example b"\x81\xad", which isn't defined in CP932, gets mapped to the codepage's default Unicode character, Katakana middle dot, which encodes back as b"\x81E":

    >>> locale.getpreferredencoding()
    'cp932'
    >>> open(b'\x81\xad', 'w').close()
    >>> os.listdir('.')
    ['・']
    >>> unicodedata.name(os.listdir('.')[0])
    'KATAKANA MIDDLE DOT'
    >>> '・'.encode('932')
    b'\x81E'

This isn't a problem for single-byte codepages, since every byte value uniquely maps to a Unicode code point, even if it's simply b'\x81' => u"\x81". Obviously there's still the general problem of dealing with arbitrary Unicode filenames created by other programs, since the ANSI API can only return a best-fit encoding of the filename, which is useless for actually accessing the file. >> It probably also entails opening the file descriptor in bytes mode, >> which might break programs that pass the fd directly to CRT functions. >> Personally I wish they wouldn't, but it's too late to stop them now. > > The only thing O_TEXT does rather than O_BINARY is convert CRLF line > endings (and maybe end on ^Z), and I don't think we even expose the > constants for the CRT's unicode modes. Python 3 uses O_BINARY when opening files, unless you explicitly call os.open. Specifically, FileIO.__init__ adds O_BINARY to the open flags if the platform defines it. The Windows CRT reads the BOM for the Unicode modes O_WTEXT, O_U16TEXT, and O_U8TEXT. 
For O_APPEND | O_WRONLY mode, this requires opening the file twice, the first time with read access. See configure_text_mode() in "Windows Kits\10\Source\10.0.10586.0\ucrt\lowio\open.cpp". Python doesn't expose or use these Unicode text-mode constants. That's for the best because in Unicode mode the CRT invokes the invalid parameter handler when a buffer doesn't have an even number of bytes, i.e. a multiple of sizeof(wchar_t). Python could copy how configure_text_mode() handles the BOM, except it shouldn't write a BOM for new UTF-8 files. ___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
Re: [Python-ideas] Fix default encodings on Windows
On Wed, 10 Aug 2016 at 11:16 Steve Dower wrote: > [SNIP] > > Finally, the encoding of stdin, stdout and stderr are currently > (correctly) inferred from the encoding of the console window that Python > is attached to. However, this is typically a codepage that is different > from the system codepage (i.e. it's not mbcs) and is almost certainly > not Unicode. If users are starting Python from a console, they can use > "chcp 65001" first to switch to UTF-8, and then *most* functionality > works (input() has some issues, but those can be fixed with a slight > rewrite and possibly breaking readline hooks). > > It is also possible for Python to change the current console encoding to > be UTF-8 on initialize and change it back on finalize. (This would leave > the console in an unexpected state if Python segfaults, but console > encoding is probably the least of anyone's worries at that point.) So > I'm proposing actively changing the current console to be Unicode while > Python is running, and hence sys.std[in|out|err] will default to utf-8. > > So that's a broad range of changes, and I have little hope of figuring > out all the possible issues, back-compat risks, and flow-on effects on > my own. Please let me know (either on-list or off-list) how a change > like this would affect your projects, either positively or negatively, > and whether you have any specific experience with these changes/fixes > and think they should be approached differently. > > > To summarise the proposals (remembering that these would only affect > Python 3.6 on Windows): > > [SNIP] > * force the console encoding to UTF-8 on initialize and revert on finalize > Don't have enough Windows experience to comment on the other parts of this proposal, but for the console encoding I am a hearty +1 as I'm tired of Unicode characters failing to show up in the REPL. 
___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/