Re: [Python-ideas] Fix default encodings on Windows

2016-08-19 Thread Chris Barker
On Fri, Aug 19, 2016 at 12:30 AM, Nick Coghlan  wrote:

> > So in porting to py3, they would have had to *add* that 'b' (and a bunch
> of
> > b'filename') to keep the good old bytes is text interface.
> >
> > Why would anyone do that?
>
> For a fair amount of *nix-centric code that primarily works with ASCII
> data, adding the 'b' prefix is the easiest way to get into the common
> subset of Python 2 & 3.
>

Sure -- but it's entirely unnecessary, yes? If you don't change your code,
you'll get py2 (bytes) strings as paths on py2, and py3 (Unicode) strings as
paths on py3. So different, yes. But wouldn't it all work?

So folks are making an active choice to change their code to get some
perceived (real?) performance benefit???

However, as I understand it, py3 string paths did NOT "just work" in place
of py2 paths before surrogateescape was introduced (when was that?) -- so
are we dealing with all of this because some (a lot, and important)
libraries ported to py3 early in the game?

What I'm getting at is whether there is anything other than inertia that
keeps folks using bytes paths in py3 code? Maybe it wouldn't be THAT hard
to get folks to make the switch: it's EASIER to port your code to py3 this
way!

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR  (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Re: [Python-ideas] Fix default encodings on Windows

2016-08-19 Thread eryk sun
On Thu, Aug 18, 2016 at 3:25 PM, Steve Dower  wrote:
> allow us to change locale.getpreferredencoding() to utf-8 on Windows

_bootlocale.getpreferredencoding would need to be hard coded to return
'utf-8' on Windows. _locale._getdefaultlocale() itself shouldn't
return 'utf-8' as the encoding because the CRT doesn't allow it as a
locale encoding.

site.aliasmbcs() uses getpreferredencoding, so it will need to be
modified. The codecs module could add get_acp and get_oemcp functions
based on GetACP and GetOEMCP, returning for example 'cp1252' and
'cp850'. Then aliasmbcs could call get_acp.

Adding get_oemcp would also help with decoding output from
subprocess.Popen. There's been discussion about adding encoding and
errors options to Popen, and what the default should be. When writing
to a pipe or file, some programs use OEM, some use ANSI, some use the
console codepage if available, and far fewer use Unicode encodings.
Obviously it's better to specify the encoding in each case if you know
it.
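For illustration, this is the explicit-decoding pattern that works today,
assuming you happen to know the child process emits UTF-8 (the child here is
just another Python interpreter standing in for an arbitrary program):

```python
import subprocess
import sys

# Ask a child Python to write known UTF-8 bytes on stdout.
child_code = 'import sys; sys.stdout.buffer.write("caf\\u00e9".encode("utf-8"))'
raw = subprocess.check_output([sys.executable, '-c', child_code])

# Decode explicitly, rather than guessing between OEM, ANSI,
# or the console code page.
text = raw.decode('utf-8')
```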

Regarding the locale module, how about modernizing
_locale._getdefaultlocale to return the Windows locale name [1] from
GetUserDefaultLocaleName? For example, it could return a tuple such as
('en-GB', None) and ('uz-Latn-UZ', None) -- always with the encoding
set to None. The CRT accepts the new locale names, but it isn't quite
up to speed. It still sets a legacy locale when the locale string is
empty. In this case the high-level setlocale could call
_getdefaultlocale. Also _parse_localename, which is called by
getlocale, needs to return a tuple with the encoding as None.
Currently it raises a ValueError for Windows locale names as defined
by [1].

[1]: https://msdn.microsoft.com/en-us/library/dd373814
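A hypothetical parser sketch (not the stdlib's _parse_localename) showing the
tolerant behaviour described above -- POSIX-style names keep their encoding,
Windows locale names as in [1] come back with the encoding set to None:

```python
def parse_locale_name(localename):
    # Hypothetical sketch: tolerate both POSIX "language.encoding"
    # strings and BCP-47-style Windows locale names.
    if '.' in localename:
        lang, _, enc = localename.partition('.')
        return lang, enc
    # Windows names like 'en-GB' or 'uz-Latn-UZ' carry no encoding.
    return localename, None
```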


Re: [Python-ideas] Fix default encodings on Windows

2016-08-18 Thread Terry Reedy

On 8/18/2016 1:39 PM, Steve Dower wrote:

On 18Aug2016 1036, Terry Reedy wrote:

On 8/18/2016 11:25 AM, Steve Dower wrote:


In this case, we would announce in 3.6 that using bytes as paths on
Windows is no longer deprecated,


My understanding is that the first 2 fixes refine the deprecation rather
than reversing it.  And #3 simply applies it.


#3 certainly just applies the deprecation.

As for the first two, I don't see any reason to deprecate the
functionality once the issues are resolved. If using utf-8 encoded bytes
is going to work fine in all the same cases as using str, why discourage
it?


As I understand it, you are still proposing to remove the use of bytes 
encoded with anything other than utf-8 (and the corresponding *A 
internal functions) and in particular to stop lossy path transformations. 
Am I wrong?


--
Terry Jan Reedy



Re: [Python-ideas] Fix default encodings on Windows

2016-08-18 Thread Chris Barker
On Thu, Aug 18, 2016 at 6:23 AM, Steve Dower  wrote:

> "You consistently ignore Makefiles, .ini, etc."
>
> Do people really do open('makefile', 'rb'), extract filenames and try to
> use them without ever decoding the file contents?
>

I'm sure they do :-(

But this has always confused me -- back in the python2 "good old days", text
and binary mode were exactly the same on *nix -- so folks sometimes fell
into the trap of opening binary files as text on *nix, which then failed
on Windows. But I can't imagine why anyone would have done the opposite.

So in porting to py3, they would have had to *add* that 'b' (and a bunch of
b'filename') to keep the good old bytes is text interface.

Why would anyone do that?

Honestly confused.

I've honestly never seen that, and it certainly looks like the sort of
> thing Python 3 was intended to discourage.
>

exactly -- we really don't need to support folks reading text files in
binary mode and not considering encoding...

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR  (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov

Re: [Python-ideas] Fix default encodings on Windows

2016-08-18 Thread Steve Dower

On 18Aug2016 1036, Terry Reedy wrote:

On 8/18/2016 11:25 AM, Steve Dower wrote:


In this case, we would announce in 3.6 that using bytes as paths on
Windows is no longer deprecated,


My understanding is that the first 2 fixes refine the deprecation rather
than reversing it.  And #3 simply applies it.


#3 certainly just applies the deprecation.

As for the first two, I don't see any reason to deprecate the 
functionality once the issues are resolved. If using utf-8 encoded bytes 
is going to work fine in all the same cases as using str, why discourage it?




Re: [Python-ideas] Fix default encodings on Windows

2016-08-18 Thread Terry Reedy

On 8/18/2016 11:25 AM, Steve Dower wrote:


In this case, we would announce in 3.6 that using bytes as paths on
Windows is no longer deprecated,


My understanding is that the first 2 fixes refine the deprecation rather 
than reversing it.  And #3 simply applies it.



--
Terry Jan Reedy



Re: [Python-ideas] Fix default encodings on Windows

2016-08-18 Thread Steve Dower

On 18Aug2016 0900, Chris Angelico wrote:

On Fri, Aug 19, 2016 at 1:54 AM, Steve Dower  wrote:

On 18Aug2016 0829, Chris Angelico wrote:


The second call to glob doesn't have any Unicode characters at all,
the way I see it - it's all bytes. Am I completely misunderstanding
this?



You're not the only one - I think this has been the most common
misunderstanding.

On Windows, the paths as stored in the filesystem are actually all text -
more precisely, utf-16-le encoded bytes, represented as 16-bit character
strings.

Converting to an 8-bit character representation only exists for
compatibility with code written for other platforms (either Linux, or much
older versions of Windows). The operating system has one way to do the
conversion to bytes, which Python currently uses, but since we control that
transformation I'm proposing an alternative conversion that is more reliable
than compatible (with Windows 3.1... shouldn't affect compatibility with
code that properly handles multibyte encodings, which should include
anything developed for Linux in the last decade or two).

Does that help? I tried to keep the explanation short and focused :)


Ah, I think I see what you mean. There's a slight ambiguity in the
word "missing" here.

1) The Unicode character in the result lacks some of the information
it should have

2) The Unicode character in the file name is information that has now been lost.

My reading was the first, but AIUI you actually meant the second. If
so, I'd be inclined to reword it very slightly, eg:

"The Unicode character in the second call to glob is now lost information."

Is that a correct interpretation?


I think so, though I find the wording a little awkward (and on 
rereading, my original wording was pretty bad). How about:


"The second call to glob has replaced the Unicode character with '?', 
which means the actual filename cannot be recovered and the path is no 
longer valid."


Cheers,
Steve



Re: [Python-ideas] Fix default encodings on Windows

2016-08-18 Thread Chris Angelico
On Fri, Aug 19, 2016 at 1:54 AM, Steve Dower  wrote:
> On 18Aug2016 0829, Chris Angelico wrote:
>>
>> The second call to glob doesn't have any Unicode characters at all,
>> the way I see it - it's all bytes. Am I completely misunderstanding
>> this?
>
>
> You're not the only one - I think this has been the most common
> misunderstanding.
>
> On Windows, the paths as stored in the filesystem are actually all text -
> more precisely, utf-16-le encoded bytes, represented as 16-bit character
> strings.
>
> Converting to an 8-bit character representation only exists for
> compatibility with code written for other platforms (either Linux, or much
> older versions of Windows). The operating system has one way to do the
> conversion to bytes, which Python currently uses, but since we control that
> transformation I'm proposing an alternative conversion that is more reliable
> than compatible (with Windows 3.1... shouldn't affect compatibility with
> code that properly handles multibyte encodings, which should include
> anything developed for Linux in the last decade or two).
>
> Does that help? I tried to keep the explanation short and focused :)

Ah, I think I see what you mean. There's a slight ambiguity in the
word "missing" here.

1) The Unicode character in the result lacks some of the information
it should have

2) The Unicode character in the file name is information that has now been lost.

My reading was the first, but AIUI you actually meant the second. If
so, I'd be inclined to reword it very slightly, eg:

"The Unicode character in the second call to glob is now lost information."

Is that a correct interpretation?

ChrisA


Re: [Python-ideas] Fix default encodings on Windows

2016-08-18 Thread Steve Dower

On 18Aug2016 0829, Chris Angelico wrote:

The second call to glob doesn't have any Unicode characters at all,
the way I see it - it's all bytes. Am I completely misunderstanding
this?


You're not the only one - I think this has been the most common 
misunderstanding.


On Windows, the paths as stored in the filesystem are actually all text 
- more precisely, utf-16-le encoded bytes, represented as 16-bit 
character strings.


Converting to an 8-bit character representation only exists for 
compatibility with code written for other platforms (either Linux, or 
much older versions of Windows). The operating system has one way to do 
the conversion to bytes, which Python currently uses, but since we 
control that transformation I'm proposing an alternative conversion that 
is more reliable than compatible (with Windows 3.1... shouldn't affect 
compatibility with code that properly handles multibyte encodings, which 
should include anything developed for Linux in the last decade or two).


Does that help? I tried to keep the explanation short and focused :)

Cheers,
Steve


Re: [Python-ideas] Fix default encodings on Windows

2016-08-18 Thread Chris Angelico
On Fri, Aug 19, 2016 at 1:25 AM, Steve Dower  wrote:
> >>> open('test\uAB00.txt', 'wb').close()
> >>> import glob
> >>> glob.glob('test*')
> ['test\uab00.txt']
> >>> glob.glob(b'test*')
> [b'test?.txt']
>
> The Unicode character in the second call to glob is missing information. You
> can observe the same results in os.listdir() or any function that matches
> its result type to the parameter type.

Apologies if this is just noise, but I'm a little confused by this.
The second call to glob doesn't have any Unicode characters at all,
the way I see it - it's all bytes. Am I completely misunderstanding
this?

ChrisA


Re: [Python-ideas] Fix default encodings on Windows

2016-08-18 Thread Steve Dower

Summary for python-dev.

This is the email I'm proposing to take over to the main mailing list to 
get some actual decisions made. As I don't agree with some of the 
possible recommendations, I want to make sure that they're represented 
fairly.


I also want to summarise the background leading to why we should 
consider making a change here at all, rather than simply leaving it 
alone. There's a chance this will all make its way into a PEP, depending 
on how controversial the core team thinks this is.


Please let me know if you think I've misrepresented (or unfairly 
represented) any of the positions, or if you think I can 
simplify/clarify anything in here. Please don't treat this like a PEP 
review - it's just going to be an email to python-dev - but the more we 
can avoid having the discussions there we've already had here the better.


Cheers,
Steve

---

Background
==

File system paths are almost universally represented as text in some 
encoding determined by the file system. In Python, we expose these paths 
via a number of interfaces, such as the os and io modules. Paths may be 
passed either direction across these interfaces, that is, from the 
filesystem to the application (for example, os.listdir()), or from the 
application to the filesystem (for example, os.unlink()).


When paths are passed between the filesystem and the application, they 
are either passed through as a bytes blob or converted to/from str using 
sys.getfilesystemencoding(). The result of encoding a string with 
sys.getfilesystemencoding() is a blob of bytes in the native format for 
the default file system.


On Windows, the native format for the filesystem is utf-16-le. The 
recommended platform APIs for accessing the filesystem all accept and 
return text encoded in this format. However, prior to Windows NT (and 
possibly further back), the native format was a configurable machine 
option and a separate set of APIs existed to accept this format. The 
option (the "active code page") and these APIs (the "*A functions") 
still exist in recent versions of Windows for backwards compatibility, 
though new functionality often only has a utf-16-le API (the "*W 
functions").


In Python, we recommend using str as the default format on Windows 
because it can correctly round-trip all the characters representable in 
utf-16-le. Our support for bytes explicitly uses the *A functions and 
hence the encoding for the bytes is "whatever the active code page is". 
Since the active code page cannot represent all Unicode characters, the 
conversion of a path into bytes can lose information without warning.


As a demonstration of this:

>>> open('test\uAB00.txt', 'wb').close()
>>> import glob
>>> glob.glob('test*')
['test\uab00.txt']
>>> glob.glob(b'test*')
[b'test?.txt']

The Unicode character in the second call to glob is missing information. 
You can observe the same results in os.listdir() or any function that 
matches its result type to the parameter type.
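The same loss can be reproduced on any platform by encoding with a
single-byte code page; cp1252 below is just a stand-in for whatever the
active code page happens to be:

```python
# U+AB00 is not representable in cp1252, so the *A-style
# conversion degrades it to '?' -- silently, with errors='replace'.
name = 'test\uAB00.txt'
lossy = name.encode('cp1252', errors='replace')
assert lossy == b'test?.txt'

# The original name cannot be recovered from the bytes form.
assert lossy.decode('cp1252') != name
```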


Why is this a problem?
==

While the obvious and correct answer is to just use str everywhere, it 
remains well known that on Linux and MacOS it is perfectly okay to use 
bytes when taking values from the filesystem and passing them back. 
Doing so also avoids the cost of decoding and reencoding, such that 
(theoretically), code like below should be faster because of the `b'.'`:


>>> for f in os.listdir(b'.'):
... os.stat(f)
...
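On POSIX this bytes round-trip can be exercised directly; os.fsencode and
os.fsdecode apply sys.getfilesystemencoding() with the surrogateescape
handler (the temporary directory and filename are made up for the example):

```python
import os
import tempfile

d = tempfile.mkdtemp()
open(os.path.join(d, 'example.txt'), 'w').close()

# Result type matches parameter type: str in -> str out, bytes in -> bytes out.
str_names = os.listdir(d)
bytes_names = os.listdir(os.fsencode(d))

# os.fsdecode applies sys.getfilesystemencoding() + surrogateescape,
# so both listings agree on a well-formed name.
assert [os.fsdecode(n) for n in bytes_names] == str_names
```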

On Windows, if a filename exists that cannot be encoded with the active 
code page, you will receive an error from the above code. These errors 
are why in Python 3.3 the use of bytes paths on Windows was deprecated 
(listed in the What's New, but not clearly obvious in the documentation 
- more on this later). The above code produces multiple deprecation 
warnings in 3.3, 3.4 and 3.5 on Windows.


However, we still keep seeing libraries use bytes paths, which can cause 
unexpected issues on Windows. Given that the current approach of quietly 
recommending that library developers either write their code twice (once 
for bytes and once for str) or use str exclusively is not working, we 
should consider alternative mitigations.


Proposals
=

There are two dimensions here - the fix and the timing. We can basically 
choose any fix and any timing.


The main differences between the fixes are the balance between incorrect 
behaviour and backwards-incompatible behaviour. The main issue with 
respect to timing is whether or not we believe using bytes as paths on 
Windows was correctly deprecated in 3.3 and sufficiently advertised 
since to allow us to change the behaviour in 3.6.


Fixes
-

Fix #1: Change sys.getfilesystemencoding() to utf-8 on Windows

Currently the default filesystem encoding is 'mbcs', which is a 
meta-encoder that uses the active code page. In reality, our 
implementation uses the *A APIs and we don't explicitly decode bytes in 
order to pass them to the filesystem. This allows the OS to quietly 

Re: [Python-ideas] Fix default encodings on Windows

2016-08-18 Thread Steve Dower
"You consistently ignore Makefiles, .ini, etc."

Do people really do open('makefile', 'rb'), extract filenames and try to use 
them without ever decoding the file contents?

I've honestly never seen that, and it certainly looks like the sort of thing 
Python 3 was intended to discourage. (As soon as you open(..., 'r') you're only 
affected by this change if you explicitly encode again with mbcs.)
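As a sketch of that decode-once pattern (the directory and file names here
are invented for the example): read the makefile-like file in text mode,
and pass the resulting str filename straight to the filesystem APIs, with
no mbcs round-trip anywhere.

```python
import os
import tempfile

d = tempfile.mkdtemp()
open(os.path.join(d, 'data.txt'), 'w').close()

# A makefile-like text file that mentions the target by name.
listing = os.path.join(d, 'makefile')
with open(listing, 'w', encoding='utf-8') as f:
    f.write('data.txt\n')

# open(..., 'r') decodes the contents once; the str path is then
# used directly, so no further encoding step is needed.
with open(listing, encoding='utf-8') as f:
    name = f.readline().strip()
exists = os.path.exists(os.path.join(d, name))
```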

Top-posted from my Windows Phone

-Original Message-
From: "Stephen J. Turnbull" <turnbull.stephen...@u.tsukuba.ac.jp>
Sent: 8/17/2016 19:43
To: "Steve Dower" <steve.do...@python.org>
Cc: "Paul Moore" <p.f.mo...@gmail.com>; "Python-Ideas" <python-ideas@python.org>
Subject: Re: [Python-ideas] Fix default encodings on Windows

Steve Dower writes:
 > On 17Aug2016 0235, Stephen J. Turnbull wrote:

 > > So a full statement is, "How do we best represent Windows file
 > > system paths in bytes for interoperability with systems that
 > > natively represent paths in bytes?"  ("Other systems" refers to
 > > both other platforms and existing programs on Windows.)
 > 
 > That's incorrect, or at least possible to interpret correctly as
 > the wrong thing. The goal is "code compatibility with systems ...",
 > not interoperability.

You're right, I stated that incorrectly.  I don't have anything to add
to your corrected version.

 > > In a properly set up POSIX locale[1], it Just Works by design,
 > > especially if you use UTF-8 as the preferred encoding.  It's
 > > Windows developers and users who suffer, not those who wrote the
 > > code, nor their primary audience which uses POSIX platforms.
 > 
 > You mentioned "locale", "preferred" and "encoding" in the same sentence, 
 > so I hope you're not thinking of locale.getpreferredencoding()? Changing 
 > that function is orthogonal to this discussion,

You consistently ignore Makefiles, .ini, etc.  It is *not* orthogonal,
it is *the* reason for all opposition to your proposal or request that
it be delayed.  Filesystem names *are* text in part because they are
*used as filenames in text*.

 > When Windows developers and users suffer, I see it as my responsibility 
 > to reduce that suffering. Changing Python on Windows should do that 
 > without affecting developers on Linux, even though the Right Way is to 
 > change all the developers on Linux to use str for paths.

I resent that.  If I were a partisan Linux fanboy, I'd be cheering you
on because I think your proposal is going to hurt an identifiable and
large class of *Windows* users.  I know about and fear this possibility
because they use a language I love (Japanese) and an encoding I hate
but have achieved a state of peaceful coexistence with (Shift JIS).

And on the general principle, *I* don't disagree.  I mentioned earlier
that I use only the str interfaces in my own code on Linux and Mac OS
X, and that I suspect that there are no real efficiency implications
to using str rather than bytes for those interfaces.

On the other hand, the programming convenience of reading the
occasional "text" filename (or other text, such as XML tags) out of a
binary stream and passing it directly to filesystem APIs cannot be
denied.  I think that the kind of usage you propose (a fixed,
universal codec, universally accepted; ie, 'utf-8') is the best way to
handle that in the long run.  But as Grandmaster Lasker said, "Before
the end game, the gods have placed the middle game."  (Lord Keynes
isn't relevant here, Python will outlive all of us. :-)

 > I don't think there's any reasonable way to noisily deprecate these
 > functions within Python, but certainly the docs can be made
 > clearer. People who explicitly encode with
 > sys.getfilesystemencoding() should not get the deprecation message,
 > but we can't tell whether they got their bytes from the right
 > encoding or a RNG, so there's no way to discriminate.

I agree with you within Python; the custom is for DeprecationWarnings
to be silent by default.

As for "making noise", how about announcing the deprecation as like
the top headline for 3.6, postponing the actual change to 3.7, and in
the meantime you and Nick do a keynote duet at PyCon?  (Your partner
could be Guido, too, but Nick has been the most articulate proponent
for this particular aspect of "inclusion".  I think having a
representative from the POSIX world explaining the importance of this
for "all of us" would greatly multiply the impact.)  Perhaps, given my
proposed timing, a discussion at the language summit in '17 and the
keynote in '18 would be the best timing.

(OT, political: I've been strongly influenced in this proposal by
recently reading http://blog.aurynn.com/contempt-culture.  There's not
as much of it in Pytho

Re: [Python-ideas] Fix default encodings on Windows

2016-08-18 Thread eryk sun
On Thu, Aug 18, 2016 at 2:32 AM, Stephen J. Turnbull
 wrote:
>
> So it's not just invalid surrogate *pairs*, it's invalid surrogates of
> all kinds.  This means that it's theoretically possible (though I
> gather that it's unlikely in the extreme) for a real Windows filename
to be indistinguishable from one generated by Python's surrogateescape
> handler.

Absolutely if the filesystem is one of Microsoft's such as NTFS,
FAT32, exFAT, ReFS, NPFS (named pipes), MSFS (mailslots) -- and I'm
pretty sure it's also possible with CDFS and UDFS. UDF allows any
Unicode character except NUL.

> What happens when Python's directory manipulation functions on Windows
> encounter such a filename?  Do they try to write it to the disk
> directory?  Do they succeed?  Does that depend on surrogateescape?

Python allows these 'Unicode' (but not strictly UTF compatible)
strings, so it doesn't have a problem with such filenames, as long as
it's calling the Windows wide-character APIs.

> Is there a reason in practice to allow surrogateescape at all on names
> in Windows filesystems, at least when using the *W API?  You mention
> non-Microsoft filesystems; are they common enough to matter?

Previously I gave an example with a VirtualBox shared folder, which
rejects names with invalid surrogates. I don't know how common that is
in general. I typically switch between 2 guests on a Linux host and
share folders between systems. In Windows I mount shared folders as
directory symlinks in C:\Mount.

I just tested another example that led to different results. Ext2Fsd
is a free ext2/ext3 filesystem driver for Windows. I mounted an ext2
disk in Windows 10. Next, in Python I created a file named
"\udc00b\udc00a\udc00d" in the root directory. Ext2Fsd defaults to
using UTF-8 as the drive codepage, so I expected it to reject this
filename, just like VBoxSF does. But it worked:

>>> os.listdir('.')[-1]
'\udc00b\udc00a\udc00d'

As expected the ANSI API substitutes question marks for the surrogate codes:

>>> os.listdir(b'.')[-1]
b'?b?a?d'

So what did Ext2Fsd write in this supposedly UTF-8 filesystem? I
mounted the disk in Linux to check:

>>> os.listdir(b'.')[-1]
b'\xed\xb0\x80b\xed\xb0\x80a\xed\xb0\x80d'

It blindly encoded the surrogate codes, creating invalid UTF-8. I
think it's called WTF-8 (Wobbly Transformation Format). The file
manager in Linux displays this file as "���b���a���d (invalid
encoding)", and ls prints "???b???a???d". Python uses its
surrogateescape error handler:

>>> os.listdir('.')[-1]
'\udced\udcb0\udc80b\udced\udcb0\udc80a\udced\udcb0\udc80d'

The original name can be decoded using the surrogatepass error handler:

>>> os.listdir(b'.')[-1].decode(errors='surrogatepass')
'\udc00b\udc00a\udc00d'
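That whole round trip can be reproduced in pure Python, without the driver:

```python
name = '\udc00b\udc00a\udc00d'

# 'surrogatepass' blindly encodes the lone surrogates, producing the
# invalid UTF-8 (sometimes called WTF-8) that ended up on disk.
wtf8 = name.encode('utf-8', 'surrogatepass')
assert wtf8 == b'\xed\xb0\x80b\xed\xb0\x80a\xed\xb0\x80d'

# A POSIX Python sees those bytes and escapes each invalid byte...
escaped = wtf8.decode('utf-8', 'surrogateescape')
assert escaped == '\udced\udcb0\udc80b\udced\udcb0\udc80a\udced\udcb0\udc80d'

# ...while 'surrogatepass' recovers the original name exactly.
assert wtf8.decode('utf-8', 'surrogatepass') == name
```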

Re: [Python-ideas] Fix default encodings on Windows

2016-08-17 Thread Stephen J. Turnbull
Steve Dower writes:
 > On 17Aug2016 0235, Stephen J. Turnbull wrote:

 > > So a full statement is, "How do we best represent Windows file
 > > system paths in bytes for interoperability with systems that
 > > natively represent paths in bytes?"  ("Other systems" refers to
 > > both other platforms and existing programs on Windows.)
 > 
 > That's incorrect, or at least possible to interpret correctly as
 > the wrong thing. The goal is "code compatibility with systems ...",
 > not interoperability.

You're right, I stated that incorrectly.  I don't have anything to add
to your corrected version.

 > > In a properly set up POSIX locale[1], it Just Works by design,
 > > especially if you use UTF-8 as the preferred encoding.  It's
 > > Windows developers and users who suffer, not those who wrote the
 > > code, nor their primary audience which uses POSIX platforms.
 > 
 > You mentioned "locale", "preferred" and "encoding" in the same sentence, 
 > so I hope you're not thinking of locale.getpreferredencoding()? Changing 
 > that function is orthogonal to this discussion,

You consistently ignore Makefiles, .ini, etc.  It is *not* orthogonal,
it is *the* reason for all opposition to your proposal or request that
it be delayed.  Filesystem names *are* text in part because they are
*used as filenames in text*.

 > When Windows developers and users suffer, I see it as my responsibility 
 > to reduce that suffering. Changing Python on Windows should do that 
 > without affecting developers on Linux, even though the Right Way is to 
 > change all the developers on Linux to use str for paths.

I resent that.  If I were a partisan Linux fanboy, I'd be cheering you
on because I think your proposal is going to hurt an identifiable and
large class of *Windows* users.  I know about and fear this possibility
because they use a language I love (Japanese) and an encoding I hate
but have achieved a state of peaceful coexistence with (Shift JIS).

And on the general principle, *I* don't disagree.  I mentioned earlier
that I use only the str interfaces in my own code on Linux and Mac OS
X, and that I suspect that there are no real efficiency implications
to using str rather than bytes for those interfaces.

On the other hand, the programming convenience of reading the
occasional "text" filename (or other text, such as XML tags) out of a
binary stream and passing it directly to filesystem APIs cannot be
denied.  I think that the kind of usage you propose (a fixed,
universal codec, universally accepted; ie, 'utf-8') is the best way to
handle that in the long run.  But as Grandmaster Lasker said, "Before
the end game, the gods have placed the middle game."  (Lord Keynes
isn't relevant here, Python will outlive all of us. :-)

 > I don't think there's any reasonable way to noisily deprecate these
 > functions within Python, but certainly the docs can be made
 > clearer. People who explicitly encode with
 > sys.getfilesystemencoding() should not get the deprecation message,
 > but we can't tell whether they got their bytes from the right
 > encoding or a RNG, so there's no way to discriminate.

I agree with you within Python; the custom is for DeprecationWarnings
to be silent by default.

As for "making noise", how about announcing the deprecation as like
the top headline for 3.6, postponing the actual change to 3.7, and in
the meantime you and Nick do a keynote duet at PyCon?  (Your partner
could be Guido, too, but Nick has been the most articulate proponent
for this particular aspect of "inclusion".  I think having a
representative from the POSIX world explaining the importance of this
for "all of us" would greatly multiply the impact.)  Perhaps, given my
proposed timing, a discussion at the language summit in '17 and the
keynote in '18 would be the best timing.

(OT, political: I've been strongly influenced in this proposal by
recently reading http://blog.aurynn.com/contempt-culture.  There's not
as much of it in Python as in other communities I'm involved in, but I
think this would be a good symbolic opportunity to express our
opposition to it.  "Inclusion" isn't just about gender and race!)

 > I'm going to put together a summary post here (hopefully today) and get 
 > those who have been contributing to basically sign off on it, then I'll 
 > take it to python-dev. The possible outcomes I'll propose will basically 
 > be "do we keep the status quo, undeprecate and change the functionality, 
 > deprecate the deprecation and undeprecate/change in a couple releases, 
 > or say that it wasn't a real deprecation so we can deprecate and then 
 > change functionality in a couple releases".

FWIW, of those four, I dislike 'status quo' the most, and like 'say it
wasn't real, deprecate and change' the best.  Although I lean toward
phrasing that as "we deprecated it, but we realize that practitioners
are by and large not aware of the deprecation, and nobody expects the
Spanish Inquisition".

@Nick, if you're watching: I wonder if it would be 

Re: [Python-ideas] Fix default encodings on Windows

2016-08-17 Thread Stephen J. Turnbull
eryk sun writes:
 > On Wed, Aug 17, 2016 at 9:35 AM, Stephen J. Turnbull
 >  wrote:
 > > BTW, why "surrogate pairs"?  Does Windows validate surrogates to
 > > ensure they come in pairs, but not necessarily in the right order (or
 > > perhaps sometimes they resolve to non-characters such as U+1)?
 > 
 > Microsoft's filesystems remain compatible with UCS2

So it's not just invalid surrogate *pairs*, it's invalid surrogates of
all kinds.  This means that it's theoretically possible (though I
gather that it's unlikely in the extreme) for a real Windows filename
to be indistinguishable from one generated by Python's surrogateescape
handler.
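For readers unfamiliar with the handler, a minimal sketch of how surrogateescape smuggles undecodable bytes into a str (plain, platform-independent Python):

```python
# An undecodable byte round-trips through str as a lone low surrogate.
raw = b"abc\xff"                                # not valid UTF-8
name = raw.decode("utf-8", "surrogateescape")   # byte 0xFF -> U+DCFF
assert name == "abc\udcff"                      # contains a lone surrogate
assert name.encode("utf-8", "surrogateescape") == raw  # lossless round-trip
```

A filename that genuinely contained U+DCFF would decode to the same str, hence the ambiguity.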

What happens when Python's directory manipulation functions on Windows
encounter such a filename?  Do they try to write it to the disk
directory?  Do they succeed?  Does that depend on surrogateescape?

Is there a reason in practice to allow surrogateescape at all on names
in Windows filesystems, at least when using the *W API?  You mention
non-Microsoft filesystems; are they common enough to matter?

I admit that as we converge on sanity (UTF-8 for text/* content, some
kind of Unicode for filesystem names) none of this is very likely to
matter, but I'm a worrywart.

Steve
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Fix default encodings on Windows

2016-08-17 Thread Steve Dower

On 17Aug2016 0901, Nick Coghlan wrote:

On 17 August 2016 at 02:06, Chris Barker  wrote:

So the Solution is to either:

 (A) get everyone to use Unicode  "properly", which will work on all
platforms (but only on py3.5 and above?)

or

(B) kludge some *nix-compatible support for byte paths into Windows, that
will work at least much of the time.

It's clear (to me at least) that (A) is the "Right Thing", but real world
experience has shown that it's unlikely to happen any time soon.

Practicality beats Purity and all that -- this is a judgment call.

Have I got that right?


Yep, pretty much. Based on Stephen Turnbull's concerns, I wonder if we
could make a whitelist of universal encodings that Python-on-Windows
will use in preference to UTF-8 if they're configured as the current
code page. If we accepted GB18030, GB2312, Shift-JIS, and ISO-2022-*
as overrides, then problems would be significantly less likely.

Another alternative would be to apply a similar solution as we do on
Linux with regards to the "surrogateescape" error handler: there are
some interfaces (like the standard streams) where we only enable that
error handler specifically if the preferred encoding is reported as
ASCII. In 2016, we're *very* skeptical about any properly configured
system actually being ASCII-only (rather than that value showing up
because the POSIX standards mandate it as the default), so we don't
really believe the OS when it tells us that.

The equivalent for Windows would be to disbelieve the configured code
page only when it was reported as "mbcs" - for folks that had
configured their system to use something other than the default,
Python would believe them, just as we do on Linux.


The problem here is that "mbcs" is not configurable - it's a 
meta-encoder that uses whatever is configured as the "language (system 
locale) to use when displaying text in programs that do not support 
Unicode" (quote from the dialog where administrators can configure 
this). So there's nothing to disbelieve here.


And even on machines where the current code page is "reliable", UTF-16 
is still the actual encoding, which means UTF-8 is still a better choice 
for representing the path as a blob of bytes. Currently we have 
inconsistent encoding between different Windows machines and could 
either remove that inconsistency completely or simply reduce it for 
(approx.) English speakers. I would rather an extreme here - either make 
it consistent regardless of user configuration, or make it so broken 
that nobody can use it at all. (And note that the correct way to support 
*some* other FS encodings would be to change the return value from 
sys.getfilesystemencoding(), which breaks people who currently ignore 
that just as badly as changing it to utf-8 would.)


Cheers,
Steve



Re: [Python-ideas] Fix default encodings on Windows

2016-08-17 Thread Nick Coghlan
On 17 August 2016 at 02:06, Chris Barker  wrote:
> Just to make sure this is clear, the Pragmatic logic is thus:
>
> * There are more *nix-centric developers in the Python ecosystem than
> Windows-centric (or even Windows-agnostic) developers.
>
> * The bytes path approach works fine on *nix systems.

For the given value of "works fine" that is "works fine, except when
it doesn't, and then you end up with mojibake".

> * Whatever might be Right and Just -- the reality is that a number of
> projects, including important and widely used libraries and frameworks, use
> the bytes API for working with filenames and paths, etc.
>
> Therefore, there is a lot of code that does not work right on Windows.
>
> Currently, to get it to work right on Windows, you need to write Windows
> specific code, which many folks don't want or know how to do (or just can't
> support one way or the other).
>
> So the Solution is to either:
>
>  (A) get everyone to use Unicode  "properly", which will work on all
> platforms (but only on py3.5 and above?)
>
> or
>
> (B) kludge some *nix-compatible support for byte paths into Windows, that
> will work at least much of the time.
>
> It's clear (to me at least) that (A) is the "Right Thing", but real world
> experience has shown that it's unlikely to happen any time soon.
>
> Practicality beats Purity and all that -- this is a judgment call.
>
> Have I got that right?

Yep, pretty much. Based on Stephen Turnbull's concerns, I wonder if we
could make a whitelist of universal encodings that Python-on-Windows
will use in preference to UTF-8 if they're configured as the current
code page. If we accepted GB18030, GB2312, Shift-JIS, and ISO-2022-*
as overrides, then problems would be significantly less likely.

Another alternative would be to apply a similar solution as we do on
Linux with regards to the "surrogateescape" error handler: there are
some interfaces (like the standard streams) where we only enable that
error handler specifically if the preferred encoding is reported as
ASCII. In 2016, we're *very* skeptical about any properly configured
system actually being ASCII-only (rather than that value showing up
because the POSIX standards mandate it as the default), so we don't
really believe the OS when it tells us that.

The equivalent for Windows would be to disbelieve the configured code
page only when it was reported as "mbcs" - for folks that had
configured their system to use something other than the default,
Python would believe them, just as we do on Linux.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-ideas] Fix default encodings on Windows

2016-08-17 Thread Steve Dower

On 17Aug2016 0235, Stephen J. Turnbull wrote:

Paul Moore writes:
 > On 16 August 2016 at 16:56, Steve Dower  wrote:

 > > This discussion is for the developers who insist on using bytes
 > > for paths within Python, and the question is, "how do we best
 > > represent UTF-16 encoded paths in bytes?"

That's incomplete, AFAICS.  (Paul makes this point somewhat
differently.)  We don't want to represent paths in bytes on Windows if
we can avoid it.  Nor does UTF-16 really enter into it (except for the
technical issue of invalid surrogate pairs).  So a full statement is,
"How do we best represent Windows file system paths in bytes for
interoperability with systems that natively represent paths in bytes?"
("Other systems" refers to both other platforms and existing programs
on Windows.)


That's incorrect, or at least easy to interpret as the 
wrong thing. The goal is "code compatibility with systems ...", not 
interoperability.


Nothing about this will make it easier to take a path from Windows and 
use it on Linux or vice versa, but it will make it easier/more reliable 
to take code that uses paths on Linux and use it on Windows.



BTW, why "surrogate pairs"?  Does Windows validate surrogates to
ensure they come in pairs, but not necessarily in the right order (or
perhaps sometimes they resolve to non-characters such as U+1)?


Eryk answered this better than I would have.


Paul says:

 > People passing bytes to open() have in my view, already chosen not
 > to follow the standard advice of "decode incoming data at the
 > boundaries of your application". They may have good reasons for
 > that, but it's perfectly reasonable to expect them to take
  > responsibility for manually tracking the encoding of the resulting
 > bytes values flowing through their code.

Abstractly true, but in practice there's no such need for those who
made the choice!  In a properly set up POSIX locale[1], it Just Works by
design, especially if you use UTF-8 as the preferred encoding.  It's
Windows developers and users who suffer, not those who wrote the code,
nor their primary audience which uses POSIX platforms.


You mentioned "locale", "preferred" and "encoding" in the same sentence, 
so I hope you're not thinking of locale.getpreferredencoding()? Changing 
that function is orthogonal to this discussion, despite the fact that in 
most cases it returns the same code page as what is going to be used by 
the file system functions (which in most cases will also be used by the 
encoding returned from sys.getfilesystemencoding()).


When Windows developers and users suffer, I see it as my responsibility 
to reduce that suffering. Changing Python on Windows should do that 
without affecting developers on Linux, even though the Right Way is to 
change all the developers on Linux to use str for paths.



 > > If you see an alternative choice to those listed above, feel free
 > > to contribute it. Otherwise, can we focus the discussion on these
 > > (or any new) choices?
 >
 > Accept that we should have deprecated builtin open and the io module,
 > but didn't do so. Extend the existing deprecation of bytes paths on
 > Windows, to cover *all* APIs, not just the os module, But modify the
 > deprecation to be "use of the Windows CP_ACP code page (via the ...A
 > Win32 APIs) is deprecated and will be replaced with use of UTF-8 as
 > the implied encoding for all bytes paths on Windows starting in Python
 > 3.7". Document and publicise it much more prominently, as it is a
 > breaking change. Then leave it one release for people to prepare for
 > the change.

I like this one!  If my paranoid fears are realized, in practice it
might have to wait two releases, but at least this announcement should
get people who are at risk to speak up.  If they don't, then you can
just call me "Chicken Little" and go ahead!


I don't think there's any reasonable way to noisily deprecate these 
functions within Python, but certainly the docs can be made clearer. 
People who explicitly encode with sys.getfilesystemencoding() should not 
get the deprecation message, but we can't tell whether they got their 
bytes from the right encoding or a RNG, so there's no way to discriminate.


I'm going to put together a summary post here (hopefully today) and get 
those who have been contributing to basically sign off on it, then I'll 
take it to python-dev. The possible outcomes I'll propose will basically 
be "do we keep the status quo, undeprecate and change the functionality, 
deprecate the deprecation and undeprecate/change in a couple releases, 
or say that it wasn't a real deprecation so we can deprecate and then 
change functionality in a couple releases".


Cheers,
Steve



Re: [Python-ideas] Fix default encodings on Windows

2016-08-16 Thread Random832
On Tue, Aug 16, 2016, at 12:12, Chris Barker wrote:
> * convert and fail on invalid surrogate pairs
> 
> where would an invalid surrogate pair come from? never from a file system
> API call, yes?

In principle it could, if the filesystem contains a file with an invalid
surrogate pair. Nothing else, in general, prevents such a file from
being created, though it's not easy to do so by accident.
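A quick illustration of why such a file is awkward for any strict UTF-8 based bytes API (a sketch, not tied to any particular filesystem):

```python
# A lone (unpaired) surrogate is a legal code point in a Python str,
# but strict UTF-8 refuses to encode it, so a filename containing one
# has no plain UTF-8 byte representation.
lone = "file\ud800name"
try:
    lone.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
assert not encodable
```

Schemes like "surrogatepass" (or WTF-8, mentioned elsewhere in the thread) exist precisely to carry such names through a bytes interface.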


Re: [Python-ideas] Fix default encodings on Windows

2016-08-16 Thread eryk sun
>> On Mon, Aug 15, 2016 at 6:26 PM, Steve Dower 
>> wrote:
>
> and using the *W APIs exclusively is the right way to go.

My proposal was to use the wide-character APIs, but transcoding CP_ACP
without best-fit characters and raising a warning whenever the default
character is used (e.g. substituting Katakana middle dot when creating
a file using a bytes path that has an invalid sequence in CP932). This
proposal was in response to the case made by Stephen Turnbull. If
using UTF-8 is getting such heavy pushback, I thought half a solution
was better than nothing, and it also sets up the infrastructure to
easily switch to UTF-8 if that idea eventually gains acceptance. It
could raise exceptions instead of warnings if that's preferred, since
bytes paths on Windows are already deprecated.
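The strict-transcoding step described above can be sketched with Python's own codecs (cp932 here; the byte values are illustrative, not from the original mail):

```python
# 0x81 0x39 is an invalid CP932 double-byte sequence (the trail byte
# must fall in roughly 0x40-0xFC), so strict decoding surfaces it
# instead of silently substituting a best-fit/default character.
bad = b"abc\x81\x39"
try:
    bad.decode("cp932")          # errors='strict' is the default
    valid = True
except UnicodeDecodeError:
    valid = False
assert not valid
```

A wrapper could turn that exception into the warning (or error) eryk describes before any filesystem call is made.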

> *Any* encoding that may silently lose data is a problem, which basically
> leaves utf-16 as the only option. However, as that causes other problems,
> maybe we can accept the tradeoff of returning utf-8 and failing when a
> path contains invalid surrogate pairs

Are there any common sources of illegal UTF-16 surrogates in Windows
filenames? I see that WTF-8 (Wobbly) was developed to handle this
problem. A WTF-8 path would roundtrip back to the filesystem, but it
should only be used internally in a program.


Re: [Python-ideas] Fix default encodings on Windows

2016-08-15 Thread Nick Coghlan
On 16 August 2016 at 11:34, Chris Barker - NOAA Federal
 wrote:
>> Given that, I'm proposing adding support for using byte strings encoded with 
>> UTF-8 in file system functions on Windows. This allows Python users to omit 
>> switching code like:
>>
>> if os.name == 'nt':
>>     f = os.stat(os.listdir('.')[-1])
>> else:
>>     f = os.stat(os.listdir(b'.')[-1])
>
> REALLY? Do we really want to encourage using bytes as paths? IIUC,
> anyone that wants to platform-independentify that code just needs to
> use proper strings (or pat glib) for paths everywhere, yes?

The problem is that bytes-as-paths actually *does* work for Mac OS X
and systemd based Linux distros properly configured to use UTF-8 for
OS interactions. This means that a lot of backend network service code
makes that assumption, especially when it was originally written for
Python 2, and rather than making it work properly on Windows, folks
just drop Windows support as part of migrating to Python 3.

At an ecosystem level, that means we're faced with a choice between
implicitly encouraging folks to make their code *nix only, and finding
a way to provide a more *nix like experience when running on Windows
(where UTF-8 encoded binary data just works, and either other
encodings lead to mojibake or else you use chardet to figure things
out).

Steve is suggesting that the latter option is preferable, a view I
agree with since it lowers barriers to entry for Windows based
developers to contribute to primarily *nix focused projects.

> I understand that pre-surrogate-escape, there was a need for bytes
> paths, but those days are gone, yes?

No, UTF-8 encoded bytes are still the native language of network
service development: http://utf8everywhere.org/

It also helps with cases where folks are switching back and forth
between Python and other environments like JavaScript and Go where the
UTF-8 assumption is more prevalent.

> So why, at this late date, kludge what should be a deprecated pattern
> into the Windows build???

Promoting cross-platform consistency often leads to enabling patterns
that are considered a bad idea from a native platform perspective, and
this strikes me as an example of that (just as the binary/text
separation itself is a case where Python 3 diverged from the POSIX
text model to improve consistency across *nix, Windows, JVM and CLR
environments).

Cheers,
Nick.


Re: [Python-ideas] Fix default encodings on Windows

2016-08-15 Thread Steve Dower

On 15Aug2016 0954, Random832 wrote:

On Mon, Aug 15, 2016, at 12:35, Steve Dower wrote:

I'm still not sure we're talking about the same thing right now.

For `open(path_as_bytes).read()`, are we talking about the way
path_as_bytes is passed to the file system? Or the codec used to decide
the returned string?


We are talking about the way path_as_bytes is passed to the filesystem,
and in particular what encoding path_as_bytes is *actually* in, when it
was obtained from a file or other stream opened in binary mode.


Okay good, we are talking about the same thing.

Passing path_as_bytes in that location has been deprecated since 3.3, so 
we are well within our rights (and probably overdue) to make it a 
TypeError in 3.6. While it's obviously an invalid assumption, for the 
purposes of changing the language we can assume that no existing code is 
passing bytes into any functions where it has been deprecated.


As far as I'm concerned, there are currently no filesystem APIs on 
Windows that accept paths as bytes.



Given that, I'm proposing adding support for using byte strings encoded 
with UTF-8 in file system functions on Windows. This allows Python users 
to omit switching code like:


if os.name == 'nt':
    f = os.stat(os.listdir('.')[-1])
else:
    f = os.stat(os.listdir(b'.')[-1])

Or simply using the bytes variant unconditionally because they heard it 
was faster (sacrificing cross-platform correctness, since it may not 
correctly round-trip on Windows).


My proposal is to remove all use of the *A APIs and only use the *W 
APIs. That completely removes the (already deprecated) use of bytes as 
paths. I then propose to change the (unused on Windows) 
sys.getfilesystemencoding() to 'utf-8' and handle bytes being passed into 
filesystem functions by transcoding into UTF-16 and calling the *W APIs.


This completely removes the active codepage from the chain, allows paths 
returned from the filesystem to correctly roundtrip via bytes in Python, 
and allows those bytes paths to be manipulated at '\' characters. 
(Frankly I don't mind what encoding we use, and I'd be quite happy to 
force bytes paths to be UTF-16-LE encoded, which would also round-trip 
invalid surrogate pairs. But that would prevent basic manipulation which 
seems to be a higher priority.)
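To see why UTF-16-LE bytes would block that basic manipulation, consider (a sketch):

```python
# In UTF-16-LE every code unit is two bytes, so the one-byte separator
# b"\\" (0x5C) splits between the halves of code units and strands the
# NUL bytes, mangling every component of the path.
p = "C:\\spam".encode("utf-16-le")
parts = p.split(b"\\")
assert parts == [b"C\x00:\x00", b"\x00s\x00p\x00a\x00m\x00"]
```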


This does not allow you to take bytes from an arbitrary source and 
assume that they are correctly encoded for the file system. Python 3.3, 
3.4 and 3.5 have been warning that doing that is deprecated and the path 
needs to be decoded to a known encoding first. At this stage, it's time 
for us to either make byte paths an error, or to specify a suitable 
encoding that can correctly round-trip paths.
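For comparison, the known-encoding round-trip already exists on POSIX via os.fsencode/os.fsdecode; a sketch of that pattern (POSIX semantics, where surrogateescape is the error handler):

```python
import os

# os.fsdecode/os.fsencode apply sys.getfilesystemencoding() with the
# surrogateescape handler, so even bytes that do not decode cleanly
# survive the trip through str.
raw = b"abc\xff"          # may be invalid in the filesystem encoding
name = os.fsdecode(raw)
assert os.fsencode(name) == raw
```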



If this does not answer the question, I'm going to need the question to 
be explained more clearly for me.


Cheers,
Steve



Re: [Python-ideas] Fix default encodings on Windows

2016-08-15 Thread Random832
On Mon, Aug 15, 2016, at 09:23, Steve Dower wrote:
> I guess I'm not sure what your question is then.
> 
> Using text internally is of course the best way to deal with it. But for
> those who insist on using bytes, this change at least makes Windows a
> feasible target without requiring manual encoding/decoding at every
> boundary.

Why isn't it already? What's "not feasible" about requiring manual
encoding/decoding?

Basically your assumption is that people using Python on windows and
having to deal with files that contain filename data encoded as bytes
are more likely to be dealing with data that is either UTF-8 anyway
(coming from Linux or some other platform) or came from the current
version of Python (which will encode things in UTF-8 under the change)
than they are to deal with data that came from other Windows programs
that encoded things in the codepage used by them and by other Windows
users in the same country / who speak the same language.


Re: [Python-ideas] Fix default encodings on Windows

2016-08-14 Thread Victor Stinner
> The last point is correct: if you get bytes from a file system API, you
should be able to pass them back in without losing information. CP_ACP
(a.k.a. the *A API) does not allow this, so I'm proposing using the *W API
everywhere and encoding to utf-8 when the user wants/gives bytes.

You get troubles when the filename comes a file, another application, a
registry key, ... which is encoded to CP_ACP.

Do you plan to transcode all these data? (decode from CP_ACP, encode back
to UTF-8)
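The transcoding step Victor is asking about would look something like this (cp1252 stands in for an arbitrary active code page; the filename is illustrative):

```python
# Bytes obtained from a CP_ACP source (a file, a registry value, a
# pipe, ...) would need re-encoding to UTF-8 before being used as a
# path under the proposal.
acp_bytes = "Ångström.txt".encode("cp1252")
utf8_bytes = acp_bytes.decode("cp1252").encode("utf-8")
assert utf8_bytes.decode("utf-8") == "Ångström.txt"
assert utf8_bytes != acp_bytes   # the two byte representations differ
```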

Re: [Python-ideas] Fix default encodings on Windows

2016-08-13 Thread Stephen J. Turnbull
Random832 writes:

 > And what's going to happen if you shovel those bytes into the
 > filesystem without conversion on Linux, or worse, OSX?

Off topic.  See Subject: field.

 > This proposal embodies an assumption that bytes from unknown sources
 > used as filenames are more likely to be UTF-8 than in the locale ACP

Then it's irrelevant: most bytes are not from "unknown sources",
they're from correspondents (or from yourself!) -- and for most users
most of the time, those correspondents share the locale encoding with
them.  At least where I live, they use that encoding frequently.

 > the only solution is to require the application to make a
 > considered decision

That's not a solution.  Code is not written with every decision
considered, and it never will be.  The (long-run) solution is a la
Henry Ford: "you can encode text any way you want, as long as it's
UTF-8".  Then it won't matter if people ever make considered decisions
about encoding!  But trying to enforce that instead of letting it
evolve naturally (as it is doing) will cause unnecessary pain for
Python programmers, and I believe quite a lot of pain.

I used to be in the "make them speak UTF-8" camp.  But in the 15 years
since PEP 263, experience has shown me that mostly it doesn't matter,
and that when it does matter, you have to deal with the large variety
of encodings anyway -- assuming UTF-8 is not a win.  For use cases
that can be encoding-agnostic because all cooperating participants
share a locale encoding, making them explicitly specify the locale
encoding is just a matter of "misery loves company".  Please, let's
not do things for that reason.

 > I think the use case that the proposal has in mind is a
 > file-names-are-just-bytes program (or set of programs) that reads
 > from the filesystem, converts to bytes for a file/network, and then
 > eventually does the reverse - either end may be on windows.

You have misspoken somewhere.  The programs under discussion do not
"convert" input to bytes; they *receive* bytes, either from POSIX APIs
or from Windows *A APIs, and use them as is.  Unless I am greatly
mistaken, Steve simply wants that to work as well on Windows as on
POSIX platforms, so that POSIX programmers who do encoding-agnostic
programming have one less barrier to supporting their software on
Windows.  But you'll have to ask Steve to rule on that.

Steve


Re: [Python-ideas] Fix default encodings on Windows

2016-08-12 Thread Victor Stinner
Hello,

I'm in holiday and I'm writing on a phone, so sorry in advance for the
short answer.

In short: we should drop support for the bytes API. Just use Unicode on all
platforms, especially for filenames.

Sorry but most of these changes look like very bad ideas. Or maybe I
misunderstood something. Windows bytes API are broken in different ways, in
short your proposal is to put another layer on top of it to try to
workaround issues.

Unicode is complex. Unicode issues are hard to debug. Adding a new layer
makes debugging even harder. Is the bug in the input data? In the layer? In
the final Windows function?

In my experience on UNIX, the most important part is the interoperability
with other applications. I understand that Python 2 will speak ANSI code
page but Python 3 will speak UTF-8. I don't understand how it can work.
Almsot all Windows applications speak the ANSI code page (I'm talking about
stdin, stdout, pipes, ...).

Do you propose to first try to decode from UTF-8 or fallback on decoding
from the ANSI code page? What about encoding? Always encode to UTF-8?

About BOMs: I hate them. Many applications don't understand them. Again,
think about Python 2. I vaguely recall that the Unicode standard suggests
not using BOMs (I have to check).

I recall a bug in gettext. The tool doesn't understand BOM. When I opened
the file in vim, the BOM was invisible (hidden). I had to use hexdump to
understand the issue!

BOM introduces issues very difficult to debug :-/ I also think that it goes
in the wrong direction in term of interoperability.

For the Windows console: I played with all Windows functions, tried all
fonts and many code pages. I also read technical blog articles of Microsoft
employees. I gave up on this issue. It doesn't seem possible to support
fully Unicode the Windows console (at least the last time I checked). By
the way, it seems like Windows functions have bugs, and the code page 65001
fixes a few issues but introduces new issues...

Victor

On 10 Aug 2016 at 20:16, "Steve Dower"  wrote:

> I suspect there's a lot of discussion to be had around this topic, so I
> want to get it started. There are some fairly drastic ideas here and I need
> help figuring out whether the impact outweighs the value.
>
> Some background: within the Windows API, the preferred encoding is UTF-16.
> This is a 16-bit format that is typed as wchar_t in the APIs that use it.
> These APIs are generally referred to as the *W APIs (because they have a W
> suffix).
>
> There are also (broadly deprecated) APIs that use an 8-bit format (char),
> where the encoding is assumed to be "the user's active code page". These
> are *A APIs. AFAIK, there are no cases where a *A API should be preferred
> over a *W API, and many newer APIs are *W only.
>
> In general, Python passes byte strings into the *A APIs and text strings
> into the *W APIs.
>
> Right now, sys.getfilesystemencoding() on Windows returns "mbcs", which
> translates to "the system's active code page". As this encoding generally
> cannot represent all paths on Windows, it is deprecated and Unicode strings
> are recommended instead. This, however, means you need to write
> significantly different code between POSIX (use bytes) and Windows (use
> text).
>
> ISTM that changing sys.getfilesystemencoding() on Windows to "utf-8" and
> updating path_converter() (Python/posixmodule.c; likely similar code in
> other places) to decode incoming byte strings would allow us to undeprecate
> byte strings and add the requirement that they *must* be encoded with
> sys.getfilesystemencoding(). I assume that this would allow cross-platform
> code to handle paths similarly by encoding to whatever the sys module says
> they should and using bytes consistently (starting this thread is meant to
> validate/refute my assumption).
>
> (Yes, I know that people on POSIX should just change to using Unicode and
> surrogateescape. Unfortunately, rather than doing that they complain about
> Windows and drop support for the platform. If you want to keep hitting them
> with the stick, go ahead, but I'm inclined to think the carrot is more
> valuable here.)
>
> Similarly, locale.getpreferredencoding() on Windows returns a legacy value
> - the user's active code page - which should generally not be used for any
> reason. The one exception is as a default encoding for opening files when
> no other information is available (e.g. a Unicode BOM or explicit encoding
> argument). BOMs are very common on Windows, since the default assumption is
> nearly always a bad idea.
>
> Making open()'s default encoding detect a BOM before falling back to
> locale.getpreferredencoding() would resolve many issues, but I'm also
> inclined towards making the fallback utf-8, leaving
> locale.getpreferredencoding() solely as a way to get the active system
> codepage (with suitable warnings about it only being useful for
> back-compat). This would match the behavior that the .NET Framework has
> 
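The BOM sniffing described above could be sketched as follows (a hypothetical helper, not CPython's actual open() logic):

```python
import codecs

def sniff_encoding(raw, fallback="utf-8"):
    """Return an encoding based on a leading BOM, else the fallback."""
    # The UTF-8 BOM (EF BB BF) cannot be confused with the UTF-16 BOMs;
    # the "utf-16" codec consumes its own BOM when decoding.
    for bom, enc in ((codecs.BOM_UTF8, "utf-8-sig"),
                     (codecs.BOM_UTF16_LE, "utf-16"),
                     (codecs.BOM_UTF16_BE, "utf-16")):
        if raw.startswith(bom):
            return enc
    return fallback

data = "hi".encode("utf-16")          # includes a native-order BOM
assert data.decode(sniff_encoding(data)) == "hi"
assert sniff_encoding(b"plain text") == "utf-8"
```

An open() built on this would only consult locale.getpreferredencoding() (or the utf-8 fallback) when no BOM is present.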

Re: [Python-ideas] Fix default encodings on Windows

2016-08-12 Thread Adam Bartoš
*On Fri Aug 12 11:33:35 EDT 2016, *

*Random832 wrote:*> On Wed, Aug 10, 2016, at 15:08, Steve Dower wrote:
>>* That's the hope, though that module approaches the solution differently
*>>* and may still uses. An alternative way for us to fix this whole thing
*>>* would be to bring win_unicode_console into the standard library and use
*>>* it by default (or probably whenever PYTHONIOENCODING is not specified).
*>
> I have concerns about win_unicode_console:
> - For the "text_transcoded" streams, stdout.encoding is utf-8. For the
> "text" streams, it is utf-16.

UTF-16 is the "native" encoding since it corresponds to the wide chars used
by Read/WriteConsoleW. UTF-8 is used just as a signal for the consumers
of PyOS_Readline.

> - There is no object, as far as I can find, which can be used as an
> unbuffered unicode I/O object.

There is no buffer just on those wrapping streams because the bytes I have
are not in UTF-8. Adding one would mean a fake buffer that just decodes and
writes to the text stream. AFAIK there is no guarantee that sys.std*
objects have a buffer attribute, and any code relying on that is incorrect.
But I understand that there may be such code and we may want to be
compatible.


> - raw output streams silently drop the last byte if an odd number of
> bytes are written.

That's not true, it doesn't write an odd number of bytes, but returns the
correct number of bytes written. If only one byte is given, it raises a
ValueError.


> - The sys.stdout obtained via streams.enable does not support .buffer /
> .buffer.raw / .detach
> - All of these objects provide a fileno() interface.

Is this wrong? If I remember, I provide it because of some check -- maybe
in input() -- to be viewed as a stdio stream.


> - When using os.read/write for data that represents text, the data still
> should be encoded in the console encoding and not in utf-8 or utf-16.

I don't know what to do with this. Generally I wouldn't use bytes to
communicate textual data.


Regards,
Adam Bartoš

Re: [Python-ideas] Fix default encodings on Windows

2016-08-12 Thread Steve Dower
I was thinking we would end up using the console API for input but stick with 
the standard handles for output, mostly to minimize the amount of magic 
switching we have to do. But since we can just switch the entire stream object 
in __std*__ once at startup if nothing is redirected it probably isn't that 
much of a simplification.

I have some airport/aeroplane time today where I can experiment.

Top-posted from my Windows Phone

-Original Message-
From: "eryk sun" <eryk...@gmail.com>
Sent: 8/12/2016 5:40
To: "python-ideas" <python-ideas@python.org>
Subject: Re: [Python-ideas] Fix default encodings on Windows

On Thu, Aug 11, 2016 at 9:07 AM, Paul Moore <p.f.mo...@gmail.com> wrote:
> set codepage to UTF-8
> ...
> set codepage back
> spawn subprocess X, but don't wait for it
> set codepage to UTF-8
> ...
> ... At this point what codepage does Python see? What codepage does
> process X see? (Note that they are both sharing the same console).

The input and output codepages are global data in conhost.exe. They
aren't tracked for each attached process (unlike input history and
aliases). That's how chcp.com works in the first place. Otherwise its
calls to SetConsoleCP and SetConsoleOutputCP would be pointless.

But IMHO all talk of using codepage 65001 is a waste of time. I think
the trailing garbage output with this codepage in Windows 7 is
unacceptable. And getting EOF for non-ASCII input is a show stopper.
The problem occurs in conhost. All you get is the EOF result from
ReadFile/ReadConsoleA, so it can't be worked around. This kills the
REPL and raises EOFError for input(). ISTM the only people who think
codepage 65001 actually works are those using Windows 8+ who
occasionally need to print non-OEM text and never enter (or paste)
anything but ASCII text.

Re: [Python-ideas] Fix default encodings on Windows

2016-08-12 Thread eryk sun
On Thu, Aug 11, 2016 at 9:07 AM, Paul Moore  wrote:
> set codepage to UTF-8
> ...
> set codepage back
> spawn subprocess X, but don't wait for it
> set codepage to UTF-8
> ...
> ... At this point what codepage does Python see? What codepage does
> process X see? (Note that they are both sharing the same console).

The input and output codepages are global data in conhost.exe. They
aren't tracked for each attached process (unlike input history and
aliases). That's how chcp.com works in the first place. Otherwise its
calls to SetConsoleCP and SetConsoleOutputCP would be pointless.

But IMHO all talk of using codepage 65001 is a waste of time. I think
the trailing garbage output with this codepage in Windows 7 is
unacceptable. And getting EOF for non-ASCII input is a show stopper.
The problem occurs in conhost. All you get is the EOF result from
ReadFile/ReadConsoleA, so it can't be worked around. This kills the
REPL and raises EOFError for input(). ISTM the only people who think
codepage 65001 actually works are those using Windows 8+ who
occasionally need to print non-OEM text and never enter (or paste)
anything but ASCII text.


Re: [Python-ideas] Fix default encodings on Windows

2016-08-11 Thread Adam Bartoš
Eryk Sun wrote:

> IMO, Python needs a C implementation of the win_unicode_console
> module, using the wide-character APIs ReadConsoleW and WriteConsoleW.
> Note that this sets sys.std*.encoding as UTF-8 and transcodes, so
> Python code never has to work directly with UTF-16 encoded text.
>
>
The transcoding wrappers with 'utf-8' encoding are used just as a workaround
for the fact that the Python tokenizer cannot use utf-16-le and that the
readlinehook machinery is unfortunately bytes-based. The transcoding wrapper
just has encoding 'utf-8' and no buffer attribute, so there is no actual
transcoding in sys.std* objects. It's just a signal for PyOS_Readline
consumers, and the transcoding occurs in a custom readline hook. Nothing
like this would be needed if PyOS_Readline were replaced by some Python API
wrapper around sys.readlinehook that would be Unicode string based.

Adam Bartoš

Re: [Python-ideas] Fix default encodings on Windows

2016-08-11 Thread Random832
On Thu, Aug 11, 2016, at 10:25, Steven D'Aprano wrote:
> > Interesting. Are you assuming that a text file cannot be empty?
> 
> Hmmm... not consciously, but I guess I was.
> 
> If the file is empty, how do you know it's text?

Heh. That's the *other* thing that Notepad does wrong in the opinion of
people coming from the Unix world - a Windows text file does not need to
end with a [CR]LF, and normally will not.

> But we're getting off topic here. In context of Steve's suggestion, we 
> should only autodetect UTF-8. In other words, if there's a UTF-8 BOM, 
> skip it, otherwise treat the file as UTF-8.

I think there's still room for UTF-16. It's two of the four encodings
supported by Notepad, after all.


Re: [Python-ideas] Fix default encodings on Windows

2016-08-11 Thread Steven D'Aprano
On Thu, Aug 11, 2016 at 02:09:00PM +1000, Chris Angelico wrote:
> On Thu, Aug 11, 2016 at 1:14 PM, Steven D'Aprano  wrote:

> > The way I have done auto-detection based on BOMs is you start by reading
> > four bytes from the file in binary mode. (If there are fewer than four
> > bytes, it cannot be a text file with a BOM.)
> 
> Interesting. Are you assuming that a text file cannot be empty?

Hmmm... not consciously, but I guess I was.

If the file is empty, how do you know it's text?

> Because 0xFF 0xFE could represent an empty file in UTF-16, and 0xEF
> 0xBB 0xBF likewise for UTF-8. Or maybe you don't care about files with
> less than one character in them?

I'll have to think about it some more :-)


> For a default file-open encoding detection, I would minimize the
> number of options. The UTF-7 BOM could be the beginning of a file
> containing Base 64 data encoded in ASCII, which is a very real
> possibility.

I'm coming from the assumption that you're reading unformatted text in an 
unknown encoding, rather than some structured format.

But we're getting off topic here. In context of Steve's suggestion, we 
should only autodetect UTF-8. In other words, if there's a UTF-8 BOM, 
skip it, otherwise treat the file as UTF-8.


> When was the last time you saw a UTF-32LE-BOM file?

Two minutes ago, when I looked at my test suite :-P


-- 
Steve


Re: [Python-ideas] Fix default encodings on Windows

2016-08-11 Thread Paul Moore
On 11 August 2016 at 01:41, Chris Angelico  wrote:
> I've almost never seen files stored in UTF-32 (even UTF-16 isn't all
> that common compared to UTF-8), so I wouldn't stress too much about
> that. Recognizing FE FF or FF FE and decoding as UTF-16 might be worth
> doing, but it could easily be retrofitted (that byte sequence won't
> decode as UTF-8).

I see UTF-16 relatively often as a result of redirecting stdout in
Powershell and forgetting that it defaults (stupidly, IMO) to UTF-16.

>> The main problem here is that if the console is not forced to UTF-8 then it
>> won't render any of the characters correctly.
>
> Ehh, that's annoying. Is there a way to guarantee, at the process
> level, that the console will be returned to "normal state" when Python
> exits? If not, there's the risk that people run a Python program and
> then the *next* program gets into trouble.

There's also the risk that Python programs using subprocess.Popen
start the subprocess with the console in a non-standard state. Should
we be temporarily restoring the console codepage in that case? How
does the following work?


set codepage to UTF-8
...
set codepage back
spawn subprocess X, but don't wait for it
set codepage to UTF-8
...
... At this point what codepage does Python see? What codepage does
process X see? (Note that they are both sharing the same console).
...

restore codepage
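A save-and-restore pattern like the sequence above can be sketched with
ctypes. This is a hypothetical helper, not anything Python ships: it assumes
the kernel32 functions GetConsoleCP, GetConsoleOutputCP, SetConsoleCP and
SetConsoleOutputCP (which do exist in the Win32 console API), is a no-op off
Windows, and of course cannot restore anything if the process is killed
outright -- which is exactly the risk being discussed.

```python
import contextlib
import ctypes
import sys

@contextlib.contextmanager
def console_codepage(cp=65001):
    """Temporarily switch the console input/output codepages (sketch).

    No-op on non-Windows platforms. Restores the previous codepages on
    exit, even if the body raises.
    """
    if sys.platform != 'win32':
        yield
        return
    kernel32 = ctypes.windll.kernel32
    old_in = kernel32.GetConsoleCP()       # codepages are global per console
    old_out = kernel32.GetConsoleOutputCP()
    kernel32.SetConsoleCP(cp)
    kernel32.SetConsoleOutputCP(cp)
    try:
        yield
    finally:
        kernel32.SetConsoleCP(old_in)
        kernel32.SetConsoleOutputCP(old_out)
```

Note that a subprocess spawned inside the `with` body still sees the changed
codepage, since (as eryk sun points out) the codepages live in conhost.exe
and are shared by every process attached to the console.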

Paul


Re: [Python-ideas] Fix default encodings on Windows

2016-08-11 Thread Paul Moore
On 11 August 2016 at 00:30, Random832  wrote:
>> Python could copy how
>> configure_text_mode() handles the BOM, except it shouldn't write a BOM
>> for new UTF-8 files.
>
> I disagree. I think that *on windows* it should, just like *on windows*
> it should write CR-LF for line endings.

Tools like git and hg, and cross platform text editors, handle
transparently managing the differences between line endings for you.
But nothing much handles BOM stripping/adding automatically. So while
in theory the two cases are similar, in practice lack of tool support
means that if we start adding BOMs on Windows (and requiring them so
that we can detect UTF8) then we'll be setting up new interoperability
problems for Python users, for little benefit.

Paul


Re: [Python-ideas] Fix default encodings on Windows

2016-08-10 Thread Random832
On Wed, Aug 10, 2016, at 17:31, Chris Angelico wrote:
> AIUI, the data flow would be: Python bytes object 

Nothing _starts_ as a Python bytes object. It has to be read from
somewhere or encoded in the source code as a literal. The scenario is
very different for "defined internally within the program" (how are
these not gonna be ASCII) vs "user input" (user input how? from the
console? from tkinter? how'd that get converted to bytes?) vs "from a
network or something like a tar file where it represents a path on some
other system" (in which case it's in whatever encoding that system used,
or *maybe* an encoding defined as part of the network protocol or file
format).

The use case has not been described adequately enough to answer my
question.


Re: [Python-ideas] Fix default encodings on Windows

2016-08-10 Thread Chris Angelico
On Thu, Aug 11, 2016 at 1:14 PM, Steven D'Aprano  wrote:
> On Wed, Aug 10, 2016 at 04:40:31PM -0700, Steve Dower wrote:
>
>> On 10Aug2016 1431, Chris Angelico wrote:
>> >>* make the default open() encoding check for a BOM or else use utf-8
>> >
>> >-0.5. Is there any precedent for this kind of data-based detection
>> >being the default?
>
> There is precedent: the Python interpreter will accept a BOM instead of
> an encoding cookie when importing .py files.

Okay, that's good enough for me.

> [Chris]
>> >An explicit "utf-sig" could do a full detection,
>> >but even then it's not perfect - how do you distinguish UTF-32LE from
>> >UTF-16LE that starts with U+?
>
> BOMs are a heuristic, nothing more. If you're reading arbitrary files that
> could start with anything, then of course they can guess wrong. But then
> if I dumped a bunch of arbitrary Unicode codepoints in your lap and
> asked you to guess the language, you would likely get it wrong too :-)

I have my own mental heuristics, but I can't recognize one Cyrillic
language from another. And some Slavic languages can be written with
either Latin or Cyrillic letters, just to further confuse matters. Of
course, "arbitrary Unicode codepoints" might not all come from one
language, and might not be any language at all.

(Do you wanna build a U+2603?)

> [Chris]
>> >Do you say "UTF-32 is rare so we'll
>> >assume UTF-16", or do you say "files starting U+ are rare, so
>> >we'll assume UTF-32"?
>
> The way I have done auto-detection based on BOMs is you start by reading
> four bytes from the file in binary mode. (If there are fewer than four
> bytes, it cannot be a text file with a BOM.)

Interesting. Are you assuming that a text file cannot be empty?
Because 0xFF 0xFE could represent an empty file in UTF-16, and 0xEF
0xBB 0xBF likewise for UTF-8. Or maybe you don't care about files with
less than one character in them?

> Compare those first four
> bytes against the UTF-32 BOMs first, and the UTF-16 BOMs *second*
> (otherwise UTF-16 will shadow UTF-32). Note that there are two BOMs
> (big-endian and little-endian). Then check for UTF-8, and if you're
> really keen, UTF-7 and UTF-1.

For a default file-open encoding detection, I would minimize the
number of options. The UTF-7 BOM could be the beginning of a file
containing Base 64 data encoded in ASCII, which is a very real
possibility.

> elif bom.startswith(b'\x2B\x2F\x76'):
>     if len(bom) == 4 and bom[3] in b'\x2B\x2F\x38\x39':
>         return 'utf_7'

So I wouldn't include UTF-7 in the detection. Nor UTF-1. Both are
rare. Even UTF-32 doesn't necessarily have to be included. When was
the last time you saw a UTF-32LE-BOM file?

> [Steve]
>> But the main reason for detecting the BOM is that currently opening
>> files with 'utf-8' does not skip the BOM if it exists. I'd be quite
>> happy with changing the default encoding to:
>>
>> * utf-8-sig when reading (so the UTF-8 BOM is skipped if it exists)
>> * utf-8 when writing (so the BOM is *not* written)
>
> Sounds reasonable to me.
>
> Rather than hard-coding that behaviour, can we have a new encoding that
> does that? "utf-8-readsig" perhaps.

+1. Makes the documentation easier by having the default value for
encoding not depend on the value for mode.

ChrisA


Re: [Python-ideas] Fix default encodings on Windows

2016-08-10 Thread Steven D'Aprano
On Wed, Aug 10, 2016 at 04:40:31PM -0700, Steve Dower wrote:

> On 10Aug2016 1431, Chris Angelico wrote:
> >>* make the default open() encoding check for a BOM or else use utf-8
> >
> >-0.5. Is there any precedent for this kind of data-based detection
> >being the default?

There is precedent: the Python interpreter will accept a BOM instead of 
an encoding cookie when importing .py files.


[Chris]
> >An explicit "utf-sig" could do a full detection,
> >but even then it's not perfect - how do you distinguish UTF-32LE from
> >UTF-16LE that starts with U+? 

BOMs are a heuristic, nothing more. If you're reading arbitrary files that 
could start with anything, then of course they can guess wrong. But then 
if I dumped a bunch of arbitrary Unicode codepoints in your lap and 
asked you to guess the language, you would likely get it wrong too :-)

[Chris]
> >Do you say "UTF-32 is rare so we'll
> >assume UTF-16", or do you say "files starting U+ are rare, so
> >we'll assume UTF-32"?

The way I have done auto-detection based on BOMs is you start by reading 
four bytes from the file in binary mode. (If there are fewer than four 
bytes, it cannot be a text file with a BOM.) Compare those first four 
bytes against the UTF-32 BOMs first, and the UTF-16 BOMs *second* 
(otherwise UTF-16 will shadow UTF-32). Note that there are two BOMs 
(big-endian and little-endian). Then check for UTF-8, and if you're 
really keen, UTF-7 and UTF-1.

def bom2enc(bom, default=None):
    """Return encoding name from a four-byte BOM."""
    if bom.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    elif bom.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    elif bom.startswith(b'\xEF\xBB\xBF'):
        return 'utf_8_sig'
    elif bom.startswith(b'\x2B\x2F\x76'):
        if len(bom) == 4 and bom[3] in b'\x2B\x2F\x38\x39':
            return 'utf_7'
    elif bom.startswith(b'\xF7\x64\x4C'):
        return 'utf_1'
    elif default is None:
        raise ValueError('no recognisable BOM signature')
    else:
        return default
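The four-byte sniffing step described above can be driven end to end like
this. It is a self-contained sketch (the helper names are mine, not a
proposed API) with a compact detector inlined; the signature table lists the
longer BOMs first, for the same shadowing reason given above:

```python
BOMS = [  # longest signatures first, so UTF-16 cannot shadow UTF-32
    (b'\x00\x00\xFE\xFF', 'utf_32'),
    (b'\xFF\xFE\x00\x00', 'utf_32'),
    (b'\xEF\xBB\xBF', 'utf_8_sig'),
    (b'\xFE\xFF', 'utf_16'),
    (b'\xFF\xFE', 'utf_16'),
]

def sniff_encoding(path, default='utf-8'):
    # Read up to four bytes in binary mode; a shorter file simply
    # cannot match the longer signatures.
    with open(path, 'rb') as f:
        head = f.read(4)
    for sig, name in BOMS:
        if head.startswith(sig):
            return name
    return default
```

Usage would be `open(path, encoding=sniff_encoding(path))`, with the caveat
discussed elsewhere in the thread that any such detection is a guess.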



[Steve Dower]
> The BOM exists solely for data-based detection, and the UTF-8 BOM is 
> different from the UTF-16 and UTF-32 ones. So we either find an exact 
> BOM (which IIRC decodes as a no-op spacing character, though I have a 
> feeling some version of Unicode redefined it exclusively for being the 
> marker) or we use utf-8.

The Byte Order Mark is always U+FEFF encoded into whatever bytes your 
encoding uses. You should never use U+FEFF except as a BOM, but of 
course arbitrary Unicode strings might include it in the middle of the 
string Just Because. In that case, it may be interpreted as a legacy 
"ZERO WIDTH NON-BREAKING SPACE" character. But new content should never 
do that: you should use U+2060 "WORD JOINER" instead, and treat a U+FEFF 
inside the body of your file or string as an unsupported character.

http://www.unicode.org/faq/utf_bom.html#BOM


[Steve]
> But the main reason for detecting the BOM is that currently opening 
> files with 'utf-8' does not skip the BOM if it exists. I'd be quite 
> happy with changing the default encoding to:
> 
> * utf-8-sig when reading (so the UTF-8 BOM is skipped if it exists)
> * utf-8 when writing (so the BOM is *not* written)

Sounds reasonable to me.

Rather than hard-coding that behaviour, can we have a new encoding that 
does that? "utf-8-readsig" perhaps.
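Pending such a codec, the asymmetric behaviour can be approximated today by
picking the codec per mode. This is a sketch of the *proposed* behaviour
(the helper name is mine, and mixed modes like 'r+' are glossed over), not
an actual "utf-8-readsig" codec:

```python
def open_readsig(path, mode='r', **kwargs):
    # Reading: 'utf-8-sig' skips a UTF-8 BOM if present and happily
    # decodes BOM-less data too. Writing: plain 'utf-8', so no BOM is
    # ever emitted ('utf-8-sig' would prepend one when encoding).
    encoding = 'utf-8' if any(c in mode for c in 'wax') else 'utf-8-sig'
    return open(path, mode, encoding=encoding, **kwargs)
```

The asymmetry matters because `b'\xef\xbb\xbfabc'.decode('utf-8-sig')` and
`b'abc'.decode('utf-8-sig')` both give `'abc'`, while
`'abc'.encode('utf-8-sig')` prepends the BOM.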


[Steve]
> This provides the best compatibility when reading/writing files without 
> making any guesses. We could reasonably extend this to read utf-16 and 
> utf-32 if they have a BOM, but that's an extension and not necessary for 
> the main change.

The use of a BOM is always a guess :-) Maybe I just happen to have a 
Latin1 file that starts with "ï»¿", or a Mac Roman file that starts with 
"Ôªø". Either case will be wrongly detected as UTF-8. That's the risk 
you take when using a heuristic.

And if you don't want to use that heuristic, then you must specify the 
actual encoding in use.
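That misdetection risk is easy to check with the codecs themselves -- a
quick illustration, not proposed behaviour:

```python
# The UTF-8 BOM bytes are perfectly valid Latin-1 text, so a BOM
# sniffer will misread a Latin-1 file that merely starts with these
# three characters as UTF-8.
bom = b'\xef\xbb\xbf'
assert bom.decode('latin-1') == '\u00ef\u00bb\u00bf'   # the text 'ï»¿'
assert (bom + b'abc').decode('utf-8-sig') == 'abc'     # sniffed as UTF-8
```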


-- 
Steven D'Aprano


Re: [Python-ideas] Fix default encodings on Windows

2016-08-10 Thread Chris Angelico
On Thu, Aug 11, 2016 at 9:40 AM, Steve Dower  wrote:
> On 10Aug2016 1431, Chris Angelico wrote:
>> I'd rather a single consistent default encoding.
>
> I'm proposing to make that single consistent default encoding utf-8. It
> sounds like we're in agreement?

Yes, we are. I was disagreeing with Random's suggestion that mbcs
would also serve. Defaulting to UTF-8 everywhere is (a) consistent on
all systems, regardless of settings; and (b) consistent with
bytes.decode() and str.encode(), both of which default to UTF-8.

>> -0.5. Is there any precedent for this kind of data-based detection
>> being the default? An explicit "utf-sig" could do a full detection,
>> but even then it's not perfect - how do you distinguish UTF-32LE from
>> UTF-16LE that starts with U+? Do you say "UTF-32 is rare so we'll
>> assume UTF-16", or do you say "files starting U+ are rare, so
>> we'll assume UTF-32"?
>
>
> The BOM exists solely for data-based detection, and the UTF-8 BOM is
> different from the UTF-16 and UTF-32 ones. So we either find an exact BOM
> (which IIRC decodes as a no-op spacing character, though I have a feeling
> some version of Unicode redefined it exclusively for being the marker) or we
> use utf-8.
>
> But the main reason for detecting the BOM is that currently opening files
> with 'utf-8' does not skip the BOM if it exists. I'd be quite happy with
> changing the default encoding to:
>
> * utf-8-sig when reading (so the UTF-8 BOM is skipped if it exists)
> * utf-8 when writing (so the BOM is *not* written)
>
> This provides the best compatibility when reading/writing files without
> making any guesses. We could reasonably extend this to read utf-16 and
> utf-32 if they have a BOM, but that's an extension and not necessary for the
> main change.

AIUI the utf-8-sig encoding is happy to decode something that doesn't
have a signature, right? If so, then yes, I would definitely support
that mild mismatch in defaults. Chew up that UTF-8 aBOMination and
just use UTF-8 as is.

I've almost never seen files stored in UTF-32 (even UTF-16 isn't all
that common compared to UTF-8), so I wouldn't stress too much about
that. Recognizing FE FF or FF FE and decoding as UTF-16 might be worth
doing, but it could easily be retrofitted (that byte sequence won't
decode as UTF-8).
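Both points -- the UTF-16/UTF-32 ambiguity and the safe retrofit -- can be
checked directly against the codecs module (a quick illustration, not
library behaviour being proposed):

```python
# FF FE 00 00 is simultaneously a UTF-32-LE BOM and a UTF-16-LE BOM
# followed by U+0000: byte content alone cannot distinguish the two.
data = b'\xff\xfe\x00\x00'
assert data.decode('utf-32') == ''       # BOM consumed, empty text
assert data.decode('utf-16') == '\x00'   # BOM, then a NUL character

# A UTF-16 BOM can be retrofitted safely: it is not valid UTF-8.
try:
    b'\xff\xfeabc'.decode('utf-8')
except UnicodeDecodeError:
    pass  # expected -- 0xFF can never begin a UTF-8 sequence
else:
    raise AssertionError('unexpectedly decoded as UTF-8')
```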

>>> * force the console encoding to UTF-8 on initialize and revert on
>>> finalize
>>
>>
>> -0 for Python itself; +1 for Python's interactive interpreter.
>> Programs that mess with console settings get annoying when they crash
>> out and don't revert properly. Unless there is *no way* that you could
>> externally kill the process without also bringing the terminal down,
>> there's the distinct possibility of messing everything up.
>
>
> The main problem here is that if the console is not forced to UTF-8 then it
> won't render any of the characters correctly.

Ehh, that's annoying. Is there a way to guarantee, at the process
level, that the console will be returned to "normal state" when Python
exits? If not, there's the risk that people run a Python program and
then the *next* program gets into trouble.

But if that happens only on abnormal termination ("I killed Python
from Task Manager, and it left stuff messed up so I had to close the
console"), it's probably an acceptable risk. And the benefit sounds
well worthwhile. Revising my recommendation to +0.9.

ChrisA


Re: [Python-ideas] Fix default encodings on Windows

2016-08-10 Thread eryk sun
On Wed, Aug 10, 2016 at 11:30 PM, Random832  wrote:
> Er... utf-8 doesn't work reliably with arbitrary bytes paths either,
> unless you intend to use surrogateescape (which you could also do with
> mbcs).
>
> Is there any particular reason to expect all bytes paths in this
> scenario to be valid UTF-8?

The problem is more so that data is lost without an error when using
the legacy ANSI API. If the path is invalid UTF-8, Python will at
least raise an exception when decoding it. To work around this, the
developers may decide they need to just bite the bullet and use
Unicode, or maybe there could be legacy Latin-1 and ANSI modes enabled
by an environment variable or sys flag.
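The surrogateescape point can be made concrete: it lets arbitrary bytes
round-trip through str under any codec, at the cost of smuggling lone
surrogates, whereas a strict decode fails loudly instead of corrupting data:

```python
raw = b'\xff\xfeabc'  # not valid UTF-8
s = raw.decode('utf-8', 'surrogateescape')
assert '\udcff' in s  # each invalid byte becomes a lone surrogate
assert s.encode('utf-8', 'surrogateescape') == raw  # exact round-trip

# Strict decoding of the same bytes raises rather than losing data.
try:
    raw.decode('utf-8')
except UnicodeDecodeError:
    pass  # expected
```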


Re: [Python-ideas] Fix default encodings on Windows

2016-08-10 Thread eryk sun
On Wed, Aug 10, 2016 at 8:09 PM, Random832  wrote:
> On Wed, Aug 10, 2016, at 15:22, Steve Dower wrote:
>>
>> Allowing library developers who support POSIX and Windows to just use
>> bytes everywhere to represent paths.
>
> Okay, how is that use case impacted by it being mbcs instead of utf-8?

Using 'mbcs' doesn't work reliably with arbitrary bytes paths in
locales that use a DBCS codepage such as 932. If a sequence is
invalid, it gets passed to the filesystem as the default Unicode
character, so it won't successfully roundtrip. In the following
example b"\x81\xad", which isn't defined in CP932, gets mapped to the
codepage's default Unicode character, Katakana middle dot, which
encodes back as b"\x81E":

>>> locale.getpreferredencoding()
'cp932'
>>> open(b'\x81\xad', 'w').close()
>>> os.listdir('.')
['・']
>>> unicodedata.name(os.listdir('.')[0])
'KATAKANA MIDDLE DOT'
>>> '・'.encode('932')
b'\x81E'

This isn't a problem for single-byte codepages, since every byte value
uniquely maps to a Unicode code point, even if it's simply b'\x81' =>
u"\x81". Obviously there's still the general problem of dealing with
arbitrary Unicode filenames created by other programs, since the ANSI
API can only return a best-fit encoding of the filename, which is
useless for actually accessing the file.
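The failure mode above can be reproduced on any platform with Python's own
cp932 codec, which refuses the invalid sequence outright where the Windows
ANSI layer silently substitutes the best-fit character:

```python
# b'\x81\xad' is an unassigned sequence in codepage 932.
try:
    b'\x81\xad'.decode('cp932')
except UnicodeDecodeError:
    pass  # Python raises; the ANSI API substitutes instead

# The substituted character in the example above, KATAKANA MIDDLE DOT,
# encodes back as different bytes -- so the name cannot round-trip.
assert '\u30fb'.encode('cp932') == b'\x81\x45'  # i.e. b'\x81E'
```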

>> It probably also entails opening the file descriptor in bytes mode,
>> which might break programs that pass the fd directly to CRT functions.
>> Personally I wish they wouldn't, but it's too late to stop them now.
>
> The only thing O_TEXT does rather than O_BINARY is convert CRLF line
> endings (and maybe end on ^Z), and I don't think we even expose the
> constants for the CRT's unicode modes.

Python 3 uses O_BINARY when opening files, unless you explicitly call
os.open. Specifically, FileIO.__init__ adds O_BINARY to the open flags
if the platform defines it.

The Windows CRT reads the BOM for the Unicode modes O_WTEXT,
O_U16TEXT, and O_U8TEXT. For O_APPEND | O_WRONLY mode, this requires
opening the file twice, the first time with read access. See
configure_text_mode() in "Windows
Kits\10\Source\10.0.10586.0\ucrt\lowio\open.cpp".

Python doesn't expose or use these Unicode text-mode constants. That's
for the best because in Unicode mode the CRT invokes the invalid
parameter handler when a buffer doesn't have an even number of bytes,
i.e. a multiple of sizeof(wchar_t). Python could copy how
configure_text_mode() handles the BOM, except it shouldn't write a BOM
for new UTF-8 files.

Re: [Python-ideas] Fix default encodings on Windows

2016-08-10 Thread Brett Cannon
On Wed, 10 Aug 2016 at 11:16 Steve Dower  wrote:

> [SNIP]
>
> Finally, the encoding of stdin, stdout and stderr are currently
> (correctly) inferred from the encoding of the console window that Python
> is attached to. However, this is typically a codepage that is different
> from the system codepage (i.e. it's not mbcs) and is almost certainly
> not Unicode. If users are starting Python from a console, they can use
> "chcp 65001" first to switch to UTF-8, and then *most* functionality
> works (input() has some issues, but those can be fixed with a slight
> rewrite and possibly breaking readline hooks).
>
> It is also possible for Python to change the current console encoding to
> be UTF-8 on initialize and change it back on finalize. (This would leave
> the console in an unexpected state if Python segfaults, but console
> encoding is probably the least of anyone's worries at that point.) So
> I'm proposing actively changing the current console to be Unicode while
> Python is running, and hence sys.std[in|out|err] will default to utf-8.
>
> So that's a broad range of changes, and I have little hope of figuring
> out all the possible issues, back-compat risks, and flow-on effects on
> my own. Please let me know (either on-list or off-list) how a change
> like this would affect your projects, either positively or negatively,
> and whether you have any specific experience with these changes/fixes
> and think they should be approached differently.
>
>
> To summarise the proposals (remembering that these would only affect
> Python 3.6 on Windows):
>
> [SNIP]
> * force the console encoding to UTF-8 on initialize and revert on finalize
>

Don't have enough Windows experience to comment on the other parts of this
proposal, but for the console encoding I am a hearty +1 as I'm tired of
Unicode characters failing to show up in the REPL.