Re: [Python-ideas] PEP 540: Add a new UTF-8 mode

2017-01-06 Thread Steve Dower
Passing universal_newlines will use whatever locale.getpreferredencoding() 
returns (which at least on Windows is useless enough that I added the encoding 
and errors parameters in 3.6). So it sounds like it'll only actually do Unicode 
on Linux if enough of the planets have aligned, which is what Victor is trying 
to do, but you can't force the other process to use a particular encoding. 
universal_newlines may become a bad choice if the default encoding no longer 
matches what the environment says, and personally, I wouldn't lose much sleep 
over that.
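
A minimal sketch of the difference, assuming a child process that writes
UTF-8 bytes regardless of its own locale. The names child_code and run_child
are made up for the illustration; encoding and errors are the subprocess
parameters added in 3.6.

```python
import subprocess
import sys

# Hypothetical child: writes UTF-8 bytes directly, ignoring its own locale.
child_code = "import sys; sys.stdout.buffer.write('caf\\u00e9\\n'.encode('utf-8'))"


def run_child(**text_options):
    """Run the child and return whatever subprocess hands back for stdout."""
    completed = subprocess.run(
        [sys.executable, "-c", child_code],
        stdout=subprocess.PIPE,
        check=True,
        **text_options,
    )
    return completed.stdout


raw = run_child()                       # bytes: no decoding at all
explicit = run_child(encoding="utf-8")  # str: decoded with the encoding we named
# run_child(universal_newlines=True) would also return str, but decoded with
# the locale's preferred encoding, which may not match what the child wrote.
```

Naming the encoding only controls how the parent decodes the pipe; it does
not change what the child writes.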

(As an aside, when I was doing all the Unicode changes for Windows in 3.6, I 
eventually decided that changing locale.getpreferredencoding() was too big a 
breaking change to ever be a good idea. Perhaps that will be the same result 
here too, but I'm nowhere near familiar enough with the conventions at play to 
state that with any certainty.)

Cheers,
Steve

Top-posted from my Windows Phone

-Original Message-
From: "Barry Warsaw" 
Sent: 1/6/2017 14:04
To: "python-ideas@python.org" 
Subject: Re: [Python-ideas] PEP 540: Add a new UTF-8 mode

On Jan 05, 2017, at 05:50 PM, Victor Stinner wrote:

>I guess that all users and most developers are more in the "UNIX mode"
>camp. *If* we want to change the default, I suggest to use the "UNIX
>mode" by default.

FWIW, it seems to be a general and widespread recommendation to always pass
universal_newlines=True to Popen and friends when you only want to deal with
unicode from subprocesses:

If encoding or errors are specified, or universal_newlines is true, the
file objects stdin, stdout and stderr will be opened in text mode using
the encoding and errors specified in the call or the defaults for
io.TextIOWrapper.

Cheers,
-Barry
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Re: [Python-ideas] PEP 540: Add a new UTF-8 mode

2017-01-06 Thread Victor Stinner
2017-01-07 1:06 GMT+01:00 Barry Warsaw :
> For some reason it's not configured: (...)

Ok, thanks for the information.

> I'm not sure why that's the default inside a chroot.

I found at least one good reason to use the POSIX locale to build a
package: it helps to get reproducible builds, see:
https://reproducible-builds.org/docs/locales/

I used it as an example in my new rationale:
https://www.python.org/dev/peps/pep-0540/#it-s-not-a-bug-you-must-fix-your-locale-is-not-an-acceptable-answer

I tried to explain how using LANG=C can be a smart choice in some
cases, and why Python 3 should therefore do its best not to annoy the user
with Unicode errors.

I also started to list cases where you get the POSIX locale "by
mistake". As I wrote previously, I'm not sure that it's correct to add
"by mistake".
https://www.python.org/dev/peps/pep-0540/#posix-locale-used-by-mistake

By the way, I tried to force the POSIX locale in my benchmarking
"perf" module. The idea is to get more reproducible results between
heterogeneous computers. But I got a bug report. So I decided to copy
the locale by default and add an opt-in --no-locale option to ignore
the locale (force the POSIX locale).
https://github.com/haypo/perf/issues/15
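
The "copy the locale by default, opt in to ignoring it" behaviour could be
sketched roughly as follows. This is not the perf module's actual code:
is_locale_var and build_child_env are hypothetical helpers, and only the
--no-locale option name comes from the message above.

```python
import argparse


def is_locale_var(name):
    """True for the environment variables that select the locale."""
    return name in ("LANG", "LANGUAGE") or name.startswith("LC_")


def build_child_env(parent_env, no_locale=False):
    """Return the environment for a worker process.

    By default the parent's locale variables are copied unchanged.  With
    no_locale=True they are dropped, so the worker falls back to the POSIX
    locale.
    """
    if not no_locale:
        return dict(parent_env)
    return {k: v for k, v in parent_env.items() if not is_locale_var(k)}


parser = argparse.ArgumentParser()
parser.add_argument("--no-locale", action="store_true",
                    help="ignore the locale (force the POSIX locale)")

args = parser.parse_args(["--no-locale"])
env = build_child_env(
    {"LANG": "fr_FR.UTF-8", "LC_ALL": "", "PATH": "/usr/bin"},
    no_locale=args.no_locale,
)
```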

Victor


Re: [Python-ideas] PEP 540: Add a new UTF-8 mode

2017-01-06 Thread Barry Warsaw
On Jan 06, 2017, at 11:33 PM, Victor Stinner wrote:

>Barry: About chroot, why do you get a C locale? Is it because no
>locale is explicitly configured? Or because no locale is installed in
>the chroot?

For some reason it's not configured:

% schroot -u root -c sid-amd64
(sid-amd64)# locale
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
(sid-amd64)# export LC_ALL=C.UTF-8
(sid-amd64)# locale
LANG=
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_PAPER="C.UTF-8"
LC_NAME="C.UTF-8"
LC_ADDRESS="C.UTF-8"
LC_TELEPHONE="C.UTF-8"
LC_MEASUREMENT="C.UTF-8"
LC_IDENTIFICATION="C.UTF-8"
LC_ALL=C.UTF-8

I'm not sure why that's the default inside a chroot.  I thought there was a
bug or discussion about this, but I can't find it right now.

Generally when this happens, exporting this environment variable in your
debian/rules file is the way to work around the default.

Cheers,
-Barry



Re: [Python-ideas] PEP 540: Add a new UTF-8 mode

2017-01-06 Thread Victor Stinner
2017-01-06 22:20 GMT+01:00 Barry Warsaw :
>>Because I have the impression that nowadays all Linux distributions are UTF-8
>>by default and you have to show some bloody-mindedness to end up with a POSIX
>>locale.
>
> It can still happen in some corner cases, even on Debian and Ubuntu where
> C.UTF-8 is available and e.g. my desktop defaults to en_US.UTF-8.  For
> example, in an sbuild/schroot environment[*], the default locale is C and I've
> seen package build failures because of this.  There may be other such "corner
> case" environments where this happens too.

Right, that's the whole point of Nick's PEP 538 and my PEP 540:
it's still common to get the POSIX locale.

I began to list examples of practical use cases where you get the POSIX locale.
https://www.python.org/dev/peps/pep-0540/#posix-locale-used-by-mistake

I'm not sure about the title of the section: "POSIX locale used by mistake".

Barry: About chroot, why do you get a C locale? Is it because no
locale is explicitly configured? Or because no locale is installed in
the chroot?

Would it work if we had a tool to copy the locale from the host when
creating the chroot: env vars and the data files required by the
locale (if any)?

The chroot issue seems close to the reported chroot issue:
http://bugs.python.org/issue28180

I understand that it's more a configuration issue, than a deliberate
choice to use the POSIX locale. Again, the user requirement is that
Python 3 should just work without any kind of specific configuration,
like other classic UNIX tools.

Victor


Re: [Python-ideas] PEP 540: Add a new UTF-8 mode

2017-01-06 Thread Chris Angelico
On Sat, Jan 7, 2017 at 8:20 AM, Barry Warsaw  wrote:
> On Jan 06, 2017, at 07:22 AM, Stephan Houben wrote:
>
>>Because I have the impression that nowadays all Linux distributions are UTF-8
>>by default and you have to show some bloody-mindedness to end up with a POSIX
>>locale.
>
> It can still happen in some corner cases, even on Debian and Ubuntu where
> C.UTF-8 is available and e.g. my desktop defaults to en_US.UTF-8.  For
> example, in an sbuild/schroot environment[*], the default locale is C and I've
> seen package build failures because of this.  There may be other such "corner
> case" environments where this happens too.

A lot of background jobs get run in a purged environment, too. I don't
remember exactly which ones land in the C locale and which don't, but
check cron jobs, systemd background processes, inetd, etc, etc, etc.
Having Python DTRT in those situations would be a Good Thing.
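
One way to see what Python picks in such an environment is to start a child
interpreter with the locale variables stripped and LC_ALL=C set. The sketch
below is only an illustration; PROBE and probe_encodings are made-up names.
No expected output is shown because the result depends on the Python version
and the platform.

```python
import json
import os
import subprocess
import sys

# One-liner run in the child: report the encodings it ended up with.
PROBE = (
    "import json, locale, sys; "
    "print(json.dumps({"
    "'preferred': locale.getpreferredencoding(False), "
    "'stdout': sys.stdout.encoding, "
    "'filesystem': sys.getfilesystemencoding()}))"
)


def probe_encodings(locale_name="C"):
    """Report the encodings a child interpreter picks under locale_name."""
    # Drop every inherited locale variable, then force a single one.
    env = {k: v for k, v in os.environ.items()
           if k not in ("LANG", "LANGUAGE") and not k.startswith("LC_")}
    env["LC_ALL"] = locale_name
    out = subprocess.run([sys.executable, "-c", PROBE], env=env,
                         stdout=subprocess.PIPE, check=True).stdout
    return json.loads(out.decode("ascii"))


report = probe_encodings("C")
```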

ChrisA


Re: [Python-ideas] PEP 540: Add a new UTF-8 mode

2017-01-06 Thread Barry Warsaw
On Jan 06, 2017, at 07:22 AM, Stephan Houben wrote:

>Because I have the impression that nowadays all Linux distributions are UTF-8
>by default and you have to show some bloody-mindedness to end up with a POSIX
>locale.

It can still happen in some corner cases, even on Debian and Ubuntu where
C.UTF-8 is available and e.g. my desktop defaults to en_US.UTF-8.  For
example, in an sbuild/schroot environment[*], the default locale is C and I've
seen package build failures because of this.  There may be other such "corner
case" environments where this happens too.

Cheers,
-Barry

[*] Where sbuild/schroot is a very common suite of package building tools.



Re: [Python-ideas] PEP 540: Add a new UTF-8 mode

2017-01-06 Thread Barry Warsaw
On Jan 05, 2017, at 05:50 PM, Victor Stinner wrote:

>I guess that all users and most developers are more in the "UNIX mode"
>camp. *If* we want to change the default, I suggest to use the "UNIX
>mode" by default.

FWIW, it seems to be a general and widespread recommendation to always pass
universal_newlines=True to Popen and friends when you only want to deal with
unicode from subprocesses:

If encoding or errors are specified, or universal_newlines is true, the
file objects stdin, stdout and stderr will be opened in text mode using
the encoding and errors specified in the call or the defaults for
io.TextIOWrapper.

Cheers,
-Barry



Re: [Python-ideas] PEP 540: Add a new UTF-8 mode

2017-01-06 Thread Oleg Broytman
On Fri, Jan 06, 2017 at 10:15:52AM +0900, INADA Naoki wrote:
> >> Always use UTF-8
> >>
> >> Python already always uses the UTF-8 encoding on Mac OS X, Android and
> >> Windows. Since UTF-8 became the de facto encoding, it makes sense to
> >> always use it on all platforms with any locale.
> >
> >Please don't! I use different locales and encodings, sometimes it's
> > utf-8, sometimes not - but I have properly configured LC_* settings and
> > I prefer Python to follow my command. It'd be disgusting if Python
> > starts to bend me to its preferences.
> 
> For stdio (including console), PYTHONIOENCODING can be used for
> supporting legacy system.
> e.g. `export PYTHONIOENCODING=$(locale charmap)`

   This means one more thing to reconfigure when I switch locales
instead of Python catching up automatically.
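
   For readers who have not used it, a small sketch of what PYTHONIOENCODING
does: it only changes the encoding of the standard streams of the interpreter
that sees it. The helper name child_stdout_bytes is made up for the example.

```python
import os
import subprocess
import sys


def child_stdout_bytes(io_encoding):
    """Run a child that prints U+00E9 and return the bytes it emitted."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    completed = subprocess.run(
        [sys.executable, "-c", "print('\\u00e9')"],
        env=env, stdout=subprocess.PIPE, check=True,
    )
    # strip() removes the trailing newline (and "\r" on Windows).
    return completed.stdout.strip()


latin1_bytes = child_stdout_bytes("latin-1")
utf8_bytes = child_stdout_bytes("utf-8")
```

The same character comes out as one byte or two depending on the variable,
which is why it has to be kept in sync with the locale by hand.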

> For commandline argument and filepath, UTF-8/surrogateescape can round trip.
> But mojibake may happens when pass the path to GUI.
> 
> If we chose "Always use UTF-8 for fs encoding", I think
> PYTHONFSENCODING envvar should be
> added again.  (It should be used from startup: decoding command line 
> argument).
> 
> >
> >> The risk is to introduce mojibake if the locale uses a different encoding,
> >> especially for locales other than the POSIX locale.
> >
> >There is no such risk for me as I already have mojibake in my
> > systems. Two most notable sources of mojibake are:
> >
> > 1) FTP servers - people create files (both names and content) in
> >different encodings; w32 FTP clients usually send file names and
> >content in cp1251 (Russian Windows encoding), sometimes in cp866
> >(Russian Windows OEM encoding).
> >
> > 2) MP3 tags and play lists - almost always cp1251.
> >
> >So whatever my personal encoding is - koi8-r or utf-8 - I have to
> > deal with file names and content in different encodings.
> 
> 3) unzip zip file sent by Windows.   Windows user use no-ASCII filenames, and
> create legacy (no UTF-8) zip file very often.

   Good example, thank you! I forgot about it because I wrote my
own zip.py and unzip.py that encode/decode filenames.
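
   The scripts themselves are not shown in this thread. A sketch of the
usual trick, under the assumption that the legacy names are cp866:
the zipfile module decodes member names without the UTF-8 flag as cp437,
so the original bytes can be recovered and decoded again. recover_name is
a hypothetical helper.

```python
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: "file name is UTF-8"


def recover_name(info, legacy_encoding="cp866"):
    """Return the member name, re-decoding it if the archive is a legacy one.

    zipfile decodes names without the UTF-8 flag as cp437.  Python's cp437
    codec maps every byte, so the original bytes can be recovered by
    encoding back to cp437 and decoding with the real encoding.
    """
    if info.flag_bits & UTF8_FLAG:
        return info.filename
    return info.filename.encode("cp437").decode(legacy_encoding)


# Simulate what zipfile would hand us for a name stored as cp866 bytes.
original = "\u043f\u0440\u0438\u0432\u0435\u0442.txt"  # Russian "privet.txt"
as_seen_by_zipfile = original.encode("cp866").decode("cp437")
member = zipfile.ZipInfo(as_seen_by_zipfile)
recovered = recover_name(member)
```

The legacy encoding still has to be guessed or configured; nothing in the
archive records it.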

> I think people using non UTF-8 should solve encoding issue by themselves.
> People should use ASCII or UTF-8 always if they don't want to see mojibake.

   Impossible. Even if I'd always use UTF-8 I still will receive a lot
of cp1251/cp866.

Oleg.
-- 
     Oleg Broytman            http://phdru.name/            p...@phdru.name
   Programmers don't die, they just GOSUB without RETURN.


Re: [Python-ideas] PEP 540: Add a new UTF-8 mode

2017-01-06 Thread Stephan Houben
Hi Victor,

2017-01-06 13:01 GMT+01:00 Victor Stinner :
>
> What do you mean by "eating mojibake"?

OK, I erroneously understood that the failure mode was that mojibake was
produced.

> Users complain because their
> application is stopped by a Python exception.

Got it.

> Currently, most Python 3
> applications doesn't produce or display mojibake, since Python is
> strict on outputs. (One exception: stdout with the POSIX locale since
> Python 3.5).

OK, I now tried it myself and indeed it produces the following error:

UnicodeEncodeError: 'ascii' codec can't encode character '\xfe' in position
0: ordinal not in range(128)

My suggestion would be to make this error message more specific.
In particular, if we have LC_CTYPE/LANG=C or unset,
we could print something like the following information
(on Linux only):

"""
You are attempting to use non-ASCII Unicode characters while your system
has been configured (possibly erroneously) to operate in the legacy "C"
locale, which is pure ASCII.
It is strongly recommended that you configure your system to allow
arbitrary non-ASCII Unicode characters. This can be done by configuring
a UTF-8 locale, for example:

export LANG=en_US.UTF-8

Use:
locale -a | grep UTF-8

to get a list of all valid UTF-8 locales on your system.
"""

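A rough sketch of how such a hint could be attached to the error, assuming
the effective LC_CTYPE locale name is already known. locale_hint is a
hypothetical helper, not something Python provides, and the wording of the
hint is shortened.

```python
def locale_hint(exc, ctype_locale):
    """Return extra advice for an ASCII encode failure under the C locale."""
    if exc.encoding == "ascii" and ctype_locale in ("C", "POSIX"):
        return ("Your system is configured for the legacy %r locale, which "
                "is pure ASCII; consider configuring a UTF-8 locale."
                % ctype_locale)
    return ""


try:
    "\xfe".encode("ascii")
except UnicodeEncodeError as exc:
    message = str(exc)                           # the error quoted above
    hint_for_c = locale_hint(exc, "C")           # advice is added
    hint_for_utf8 = locale_hint(exc, "en_US.UTF-8")  # nothing to add
```
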
Stephan

Re: [Python-ideas] PEP 540: Add a new UTF-8 mode

2017-01-06 Thread Victor Stinner
2017-01-06 10:50 GMT+01:00 M.-A. Lemburg :
> Victor: I think you are taking the UTF-8 idea a bit too far.

Hum, sorry, the PEP is still a draft, the rationale is far from
perfect yet. Let me try to simplify the issue: users are unable to
configure a locale for various reasons and expect that Python 3 must
"just work", so never fail on encoding or decoding.

Do you mean that you must try to fix this issue? Or that my approach
is not the right one?


> Nick was trying to address the situation where the locale is
> set to "C", or rather not set at all (in which case the lib C
> defaults to the "C" locale). The latter is a fairly standard
> situation when piping data on Unix or when spawning processes
> which don't inherit the current OS environment.

In the second version of my PEP, Python 3.7 will basically "just work"
with the POSIX locale (or C locale if you prefer). This locale enables
the UTF-8 mode which forces UTF-8/surrogateescape, and this error
handler prevents the most common encode/decode error (but not all of
them!).
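
As a small illustration of what the surrogateescape error handler does, and
of one error it does not prevent (the byte string is an arbitrary example):

```python
raw = b"caf\xe9.txt"  # Latin-1 bytes, not valid UTF-8

# Decoding: the undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same codec and handler restores the original bytes.
round_trip = text.encode("utf-8", "surrogateescape")

# One of the remaining failure modes: a strict encoder rejects the surrogate.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```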

When I read the different issues on the bug tracker, I understood that
people have different opinions because they have different use cases
and so different expectations.

I tried to describe a few use cases to help to understand why we don't
have the same expectations:
https://www.python.org/dev/peps/pep-0540/#replace-a-word-in-a-text

I guess that "piping data on Unix" is represented by my "Replace a
word in a text" example, right? It implements the "sed -e
s/apple/orange/g" command using Python 3. Classical usage:

   cat input_file | sed -e s/apple/orange/g > output

"UNIX users" don't want Unicode errors here.

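A sketch of a filter in that spirit which cannot raise a Unicode error on
arbitrary input bytes. It is not the code from the PEP: replace_word is a
hypothetical function, and the choice of UTF-8 with surrogateescape on both
sides is the assumption that makes undecodable bytes pass through unchanged.

```python
import io


def replace_word(src, dst, old="apple", new="orange"):
    """Copy the binary stream src to dst, replacing old with new."""
    reader = io.TextIOWrapper(src, encoding="utf-8",
                              errors="surrogateescape", newline="")
    writer = io.TextIOWrapper(dst, encoding="utf-8",
                              errors="surrogateescape", newline="")
    try:
        for line in reader:
            writer.write(line.replace(old, new))
        writer.flush()
    finally:
        # Detach so that the underlying binary streams stay open.
        reader.detach()
        writer.detach()


source = io.BytesIO(b"apple \xff pie\n")  # 0xFF is never valid UTF-8
output = io.BytesIO()
replace_word(source, output)
result = output.getvalue()
```

As a script it would be called as
replace_word(sys.stdin.buffer, sys.stdout.buffer).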

> The problem with the "C" locale is that the encoding defaults to
> "ASCII" and thus does not allow Python to show its built-in
> Unicode support.

I don't think that's the main annoyance for users.

Users complain because basic functions like (1) "List a directory into
stdout" or (2) "List a directory into a text file" fail badly:

(1) https://www.python.org/dev/peps/pep-0540/#list-a-directory-into-stdout
(2) https://www.python.org/dev/peps/pep-0540/#list-a-directory-into-a-text-file

They don't really care about powerful Unicode features, but are bitten
early just on writing data back to the disk, into a pipe, or something
else.

Python 3.6 tries to be nice with users when *getting* data, and it is
very pedantic when you try to put the data somewhere. The only
exception is that stdout now uses the surrogateescape error handler,
but only with the POSIX locale.
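
The asymmetry can be reproduced without touching the real filesystem. In
this sketch the name is decoded by hand the way Python decodes undecodable
filenames, and try_write is a hypothetical helper that writes it to an
in-memory text stream with a given error handler.

```python
import io


def try_write(text, errors):
    """Write text to a UTF-8 text stream; return the bytes or None on error."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding="utf-8", errors=errors)
    try:
        stream.write(text)
        stream.flush()
    except UnicodeEncodeError:
        return None
    finally:
        stream.detach()  # keep raw open so getvalue() still works
    return raw.getvalue()


# "Getting" the data always works: the bad byte becomes a lone surrogate.
name = b"caf\xe9".decode("utf-8", "surrogateescape")

strict_result = try_write(name, "strict")            # "putting" it fails
escaped_result = try_write(name, "surrogateescape")  # original bytes restored
```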


> Nick's PEP and the discussion on the ticket
> http://bugs.python.org/issue28180 are trying to address this
> particular situation, not enforce any particular encoding
> overriding the user's configured environment.
>
> So I think it would be better if you'd focus your PEP on the
> same situation: locale set to "C" or not set at all.

I'm not sure that I understood: do you suggest only modifying the
behaviour when the POSIX locale is used, without adding any option to
ignore the locale and force UTF-8?

At least, I would like to get a UTF-8/strict mode which would require
an option to enable it.

About -X utf8, the idea is to write explicitly that you are sure that
all inputs are encoded to UTF-8 and that you request to encode outputs
to UTF-8.
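
For reference, the option did ship later: PEP 540 was accepted for Python
3.7, where -X utf8 enables the mode and sys.flags.utf8_mode reports it. The
sketch below assumes Python 3.7 or newer; PROBE is just a name for the
one-liner run in the child.

```python
import subprocess
import sys

# Run in the child: print the UTF-8 mode flag and the filesystem encoding.
PROBE = "import sys; print(sys.flags.utf8_mode, sys.getfilesystemencoding())"

completed = subprocess.run(
    [sys.executable, "-X", "utf8", "-c", PROBE],
    stdout=subprocess.PIPE, check=True,
)
utf8_mode, fs_encoding = completed.stdout.decode("ascii").split()
```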

I guess that you are concerned by locales using encodings other than
ASCII or UTF-8 like Latin1, ShiftJIS or something else?


> BTW: You mention a locale "POSIX" in a few places. I have
> never seen this used in practice and wonder why we should
> even consider this in Python as possible work-around for
> a particular set of features. The locale setting in your
> environment does have a lot of influence on your user
> experience, so forcing people to set a "POSIX" locale doesn't
> sound like a good idea - if they have to go through the
> trouble of correctly setting up their environment for Python
> to correctly run, they would much more likely use the correct
> setting rather than a generic one like "POSIX", which is
> defined as alias for the "C" locale and not as a separate
> locale: (...)

Hum, the POSIX locale is the "C" locale in my PEP.

I don't request users to force the POSIX locale. I propose to make
Python nicer when users already *get* the POSIX locale for various
reasons:

* OS not correctly configured
* SSH connection failing to set the locale
* user using LANG=C to get messages in English
* LANG=C used for a bad reason
* program run in an empty environment
* user locale set to a non-existent locale => the libc falls back on POSIX
* etc.

"LANG=C": "LC_ALL=C" is more correct, but it seems like LANG=C is more
common than LC_ALL=C or LC_CTYPE=C in the wild.
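
Why LC_ALL=C is "more correct" follows from the POSIX precedence of the
variables, sketched here as a pure function. effective_ctype is a made-up
helper, and the "C" fallback when nothing is set is an assumption (POSIX
leaves that default implementation-defined).

```python
def effective_ctype(environ):
    """Return the locale name that LC_CTYPE resolves to.

    POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    Empty values count as unset.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(name)
        if value:
            return value
    return "C"  # assumed fallback when no variable is set


# LANG=C loses against a more specific variable...
lang_only = effective_ctype({"LANG": "C", "LC_CTYPE": "en_US.UTF-8"})
# ...while LC_ALL=C wins over everything else.
lc_all = effective_ctype({"LC_ALL": "C", "LANG": "en_US.UTF-8"})
```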


>> It's actually very similar to your PEP, except that instead of adding
>> the ability to make CPython ignore the C level locale settings (which
>> I think is a bad idea based on your own previous work in that area and
>> on the way that CPython interacts with other C/C++ components in the
>> same pro

Re: [Python-ideas] PEP 540: Add a new UTF-8 mode

2017-01-06 Thread Victor Stinner
2017-01-06 7:22 GMT+01:00 Stephan Houben :
> How common is this problem?

In the last 2 or 3 years, I don't recall having been bitten by such an issue.

On the bug tracker, new issues are opened infrequently.

* http://bugs.python.org/issue19977 opened at 2013-12-13, closed at 2014-04-27
* http://bugs.python.org/issue19846 opened at 2013-11-30, closed as
NOTABUG at 2015-05-17 22, but got new comments after it was closed
* http://bugs.python.org/issue19847 opened at 2013-11-30, closed as
NOTABUG at 2013-12-13
* http://bugs.python.org/issue28180 opened at 2016-09-16, still open

Again, I don't think that this list is complete, I recall other similar issues.


> I realise there is some attractiveness in solving the issue "for Python",
> since that will reduce the amount of bug reports
> and get people off the chests of the maintainers, but to get this fixed in
> the wider Linux ecosystem it might be preferable to
> "Let them eat mojibake", to paraphrase what Marie-Antoinette never said.

What do you mean by "eating mojibake"? Users complain because their
application is stopped by a Python exception. Currently, most Python 3
applications don't produce or display mojibake, since Python is
strict on outputs. (One exception: stdout with the POSIX locale since
Python 3.5).

Victor


Re: [Python-ideas] PEP 540: Add a new UTF-8 mode

2017-01-06 Thread Victor Stinner
2017-01-06 8:21 GMT+01:00 INADA Naoki :
> I want UTF-8 mode is enabled by default (opt-out option) even if
> locale is not POSIX,
> like `PYTHONLEGACYWINDOWSFSENCODING`.

You do, I don't :-)

It shouldn't be hard to find very concrete issues from the mojibake
issues described at:
https://www.python.org/dev/peps/pep-0540/#expected-mojibake-issues

IMHO there are 3 steps before being able to reach your dream:

1) add opt-in support for UTF-8
2) use UTF-8 if the locale is POSIX
3) UTF-8 is enabled by default

I would prefer to begin with a first Python release at stage (1) or
(2), wait for user complaints, and later decide if we can move to (3).

I haven't implemented PEP 540 yet, so I haven't been able to
experiment with anything in practice.

Well, at least it means that I have to elaborate the "Always use
UTF-8" alternative of my PEP to explain why I consider that we are not
ready to switch directly to this "obvious" option.


> Users depends on locale know what locale is and how to configure it.

It's not a matter of users, but a matter of code in the wild which
directly uses C functions like mbstowcs() or wcstombs(). These
functions use the current locale encoding, they are not aware of the
new Python UTF-8 mode.


> But many people lives in "UTF-8 everywhere" world, and don't know about 
> locale.

PEP 540 was written to help users in very concrete cases. I have been
repeating since Python 3.0 that users must learn how to configure
their locale. Well, 8 years later, I keep getting exactly the same
user complaint: "Python doesn't work, it must just work!".

It's really hard to decode bytes and later encode the text while
preventing any kind of encoding error. That's why no solution was
proposed before.


> `-X utf8` option should be parsed before converting commandline (...)

Yeah, that's a tough technical issue. I'm not sure right now how to
implement this with a clean design. Maybe I will just try with a hack?
:-)

Victor


Re: [Python-ideas] PEP 540: Add a new UTF-8 mode

2017-01-06 Thread M.-A. Lemburg
On 06.01.2017 04:32, Nick Coghlan wrote:
> On 6 January 2017 at 12:37, Victor Stinner  wrote:
>> 2017-01-06 3:10 GMT+01:00 Stephen J. Turnbull
>> :
>>> I've quoted Victor out of context, and his other posts make me very
>>> doubtful that he considers this a serious alternative.  That said, I'm
>>> +1 on "don't!"
>>
>> The "always ignore locale and force UTF-8" option has supporters. For
>> example, Nick Coghlan wrote a whole PEP, PEP 538, to support this.
> 
> Err, no, that's not what PEP 538 does. PEP 538 doesn't do *anything*
> if a locale is already properly configured - it only changes the
> locale if the current locale is "C".

Victor: I think you are taking the UTF-8 idea a bit too far.
Nick was trying to address the situation where the locale is
set to "C", or rather not set at all (in which case the lib C
defaults to the "C" locale). The latter is a fairly standard
situation when piping data on Unix or when spawning processes
which don't inherit the current OS environment.

The problem with the "C" locale is that the encoding defaults to
"ASCII" and thus does not allow Python to show its built-in
Unicode support.

Nick's PEP and the discussion on the ticket
http://bugs.python.org/issue28180 are trying to address this
particular situation, not enforce any particular encoding
overriding the user's configured environment.

So I think it would be better if you'd focus your PEP on the
same situation: locale set to "C" or not set at all.

BTW: You mention a locale "POSIX" in a few places. I have
never seen this used in practice and wonder why we should
even consider this in Python as possible work-around for
a particular set of features. The locale setting in your
environment does have a lot of influence on your user
experience, so forcing people to set a "POSIX" locale doesn't
sound like a good idea - if they have to go through the
trouble of correctly setting up their environment for Python
to correctly run, they would much more likely use the correct
setting rather than a generic one like "POSIX", which is
defined as alias for the "C" locale and not as a separate
locale:

http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html

> It's actually very similar to your PEP, except that instead of adding
> the ability to make CPython ignore the C level locale settings (which
> I think is a bad idea based on your own previous work in that area and
> on the way that CPython interacts with other C/C++ components in the
> same process and in subprocesses), it just *changes* those settings
> when we're pretty sure they're wrong.

... and this is taking the original intent of the ticket
a little too far as well :-)

The original request was to have the FS encoding default to
UTF-8, in case the locale is not set or set to "C", with the
reasoning being that this makes it easier to use Python in
situations where you have exactly this situation (see above).

Your PEP takes this approach further by fixing the locale
setting to "C.UTF-8" in those two cases - intentionally, with all
the implications this has on other parts of the C lib.

The latter only has an effect on the C lib, if the "C.UTF-8" locale
is available on the system, which it isn't on many systems,
since C locales have to be explicitly generated.
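
Whether "C.UTF-8" is available on a given system can be checked from Python
itself. A minimal sketch; locale_available is a hypothetical helper, and
since setlocale() changes process-wide state it should not be called while
other threads depend on the locale.

```python
import locale


def locale_available(name):
    """Return True if setlocale() accepts name for LC_CTYPE."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)
    except locale.Error:
        return False
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # always restore
    return True


has_c_utf8 = locale_available("C.UTF-8")  # True or False, system-dependent
```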

Without the "C.UTF-8" locale available, your PEP only affects
the FS encoding, AFAICT, unless other parts of the application
try to interpret the locale env settings as well and use their
own logic for the interpretation.

For the purpose of experimentation, I would find it better
to start with just fixing the FS encoding in 3.7 and
leaving the option to adjust the locale setting turned off
per default.

-- 
Marc-Andre Lemburg
eGenix.com




Re: [Python-ideas] incremental hashing in __hash__

2017-01-06 Thread Neil Girdhar
On Fri, Jan 6, 2017 at 3:59 AM Paul Moore  wrote:

> On 6 January 2017 at 07:26, Neil Girdhar  wrote:
> > On Fri, Jan 6, 2017 at 2:07 AM Stephen J. Turnbull
> >  wrote:
> >>
> >> Neil Girdhar writes:
> >>
> >>  > I don't understand this?  How is providing a default method in an
> >>  > abstract base class a pessimization?  If it happens to be slower
> >>  > than the code in the current methods, it can still be overridden.
> >>
> >> How often will people override until it's bitten them?  How many
> >> people will not even notice until they've lost business due to slow
> >> response?  If you don't have a default, that's much more obvious.
> >> Note that if there is a default, the collections are "large", and
> >> equality comparisons are "rare", this could be a substantial overhead.
> >
> >
> > I still don't understand what you're talking about here.  You're saying
> that
> > we shouldn't provide a __hash__ in case the default hash happens to be
> > slower than what the user wants and so by not providing it, we force the
> > user to write a fast one?  Doesn't that argument apply to all methods
> > provided by abcs?
>
> The point here is that ABCs should provide defaults for methods where
> there is an *obvious* default. It's not at all clear that there's an
> obvious default for __hash__.



>
> Unless I missed a revision of your proposal, what you suggested was:
>
>
Yeah, looks like you missed a revision.  There were two emails.  I
suggested adding ImmutableIterable and ImmutableSet, and so there is an
obvious implementation of __hash__ for both.
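
A rough sketch of what such an ABC might look like. No ImmutableIterable
exists in collections.abc; the class below and the Point example are
hypothetical, and hashing the items as a tuple is only one possible choice
of "obvious" implementation.

```python
from collections.abc import Iterable


class ImmutableIterable(Iterable):
    """Hypothetical ABC: the hash is derived from the items in order."""

    __slots__ = ()

    def __hash__(self):
        return hash(tuple(self))


class Point(ImmutableIterable):
    def __init__(self, x, y):
        self._coords = (x, y)

    def __iter__(self):
        return iter(self._coords)

    def __eq__(self, other):
        return isinstance(other, Point) and self._coords == other._coords

    # Defining __eq__ sets __hash__ to None unless it is restated here.
    __hash__ = ImmutableIterable.__hash__


point_hash = hash(Point(1, 2))
```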


> > A better option is to add collections.abc.ImmutableIterable that derives
> > from Iterable and provides __hash__.
>
> So what classes would derive from ImmutableIterable? Frozenset? A
> user-defined frozendict? There's no "obvious" default that would work
> for both those cases. And that's before we even get to the question of
> whether the default has the right performance characteristics, which
> is highly application-dependent.
>
> It's not clear to me if you expect ImmutableIterable to provide
> anything other than a default implementation of hash. If not, then how
> is making it an ABC any better than simply providing a helper function
> that people can use in their own __hash__ implementation? That would
> make it explicit what people are doing, and avoid any tendency towards
> people thinking they "should" inherit from ImmutableIterable and yet
> needing to override the only method it provides.
>
> Paul
>

Re: [Python-ideas] incremental hashing in __hash__

2017-01-06 Thread Paul Moore
On 6 January 2017 at 07:26, Neil Girdhar  wrote:
> On Fri, Jan 6, 2017 at 2:07 AM Stephen J. Turnbull
>  wrote:
>>
>> Neil Girdhar writes:
>>
>>  > I don't understand this?  How is providing a default method in an
>>  > abstract base class a pessimization?  If it happens to be slower
>>  > than the code in the current methods, it can still be overridden.
>>
>> How often will people override until it's bitten them?  How many
>> people will not even notice until they've lost business due to slow
>> response?  If you don't have a default, that's much more obvious.
>> Note that if there is a default, the collections are "large", and
>> equality comparisons are "rare", this could be a substantial overhead.
>
>
> I still don't understand what you're talking about here.  You're saying that
> we shouldn't provide a __hash__ in case the default hash happens to be
> slower than what the user wants and so by not providing it, we force the
> user to write a fast one?  Doesn't that argument apply to all methods
> provided by abcs?

The point here is that ABCs should provide defaults for methods where
there is an *obvious* default. It's not at all clear that there's an
obvious default for __hash__.

Unless I missed a revision of your proposal, what you suggested was:

> A better option is to add collections.abc.ImmutableIterable that derives from 
> Iterable and provides __hash__.

So what classes would derive from ImmutableIterable? Frozenset? A
user-defined frozendict? There's no "obvious" default that would work
for both those cases. And that's before we even get to the question of
whether the default has the right performance characteristics, which
is highly application-dependent.

It's not clear to me if you expect ImmutableIterable to provide
anything other than a default implementation of hash. If not, then how
is making it an ABC any better than simply providing a helper function
that people can use in their own __hash__ implementation? That would
make it explicit what people are doing, and avoid any tendency towards
people thinking they "should" inherit from ImmutableIterable and yet
needing to override the only method it provides.
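
To make the helper-function alternative concrete, here is a minimal sketch. The names hash_ordered, hash_unordered and FrozenDict are hypothetical, invented for illustration; nothing like them was proposed in this thread. The point is only that the class author picks the ordered or unordered flavour explicitly inside __hash__.

```python
from functools import reduce
from operator import xor


def hash_ordered(items):
    """Order-sensitive hash: delegate to the built-in tuple hash."""
    return hash(tuple(items))


def hash_unordered(items):
    """Order-insensitive hash: XOR the element hashes together.

    Deliberately simplistic: XOR is commutative, so element order does
    not matter, but repeated elements cancel each other out.
    """
    return reduce(xor, (hash(item) for item in items), 0)


class FrozenDict:
    """Hypothetical minimal immutable mapping, for illustration only."""

    def __init__(self, *args, **kwargs):
        self._data = dict(*args, **kwargs)

    def __eq__(self, other):
        if not isinstance(other, FrozenDict):
            return NotImplemented
        return self._data == other._data

    def __hash__(self):
        # Explicit choice: hash the (key, value) pairs, ignoring order,
        # because dict equality ignores insertion order too.
        return hash_unordered(self._data.items())
```

A tuple-like class would call hash_ordered instead, which is exactly the kind of decision a single inherited default could not make for both cases.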

Paul


Re: [Python-ideas] incremental hashing in __hash__

2017-01-06 Thread Stephen J. Turnbull
Neil Girdhar writes:

 > I don't understand this?  How is providing a default method in an
 > abstract base class a pessimization?  If it happens to be slower
 > than the code in the current methods, it can still be overridden.

How often will people override until it's bitten them?  How many
people will not even notice until they've lost business due to slow
response?  If you don't have a default, that's much more obvious.
Note that if there is a default, the collections are "large", and
equality comparisons are "rare", this could be a substantial overhead.

 > > BTW, it occurs to me that now that dictionaries are versioned, in some
 > > cases it *may* make sense to hash dictionaries even though they are
 > > mutable, although the "hash" would need to somehow account for the
 > > version changing.  Seems messy but maybe someone has an idea?

 > I think it's important to keep in mind that dictionaries are not versioned
 > in Python. They happen to be versioned in CPython as an unexposed
 > implementation detail.  I don't think that such details should have any
 > bearing on potential changes to Python.

AFAIK the use of the hash member for equality checking is an
implementation detail too, although the language reference does
mention that set, frozenset and dict are "hashed collections".  The
basic requirements on hashes are that (1) objects that compare equal
must hash to the same value, and (2) the hash bucket must not change
over an object's lifetime (this is the "messy" aspect that probably
kills the idea -- you'd need to fix up all hashed collections that
contain the object as a key).
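
A small sketch of why requirement (2) is the messy part. The Key class is hypothetical; its hash depends on mutable state, which is exactly what hashing a mutable dict would involve. Once the key is mutated after insertion, the dict still holds the object but lookups by the old value fail.

```python
class Key:
    """Hypothetical mutable object whose hash follows its current value."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, Key) and self.value == other.value

    def __hash__(self):
        return hash(self.value)


k = Key(1)
table = {k: "payload"}

# Requirement (1): an equal object hashes the same, so the lookup works.
found_before = table.get(Key(1))

# Break requirement (2): change the key's hash while it sits in the dict.
k.value = 2

# No stored key compares equal to Key(1) any more, so this lookup fails.
found_old_value = table.get(Key(1))

# The object is still stored; it is merely filed under its old hash.
still_stored = any(key is k for key in table)
```

Repairing this would mean re-inserting k into every dict and set that contains it, which is the fix-up mentioned above.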



Re: [Python-ideas] PEP 540: Add a new UTF-8 mode

2017-01-06 Thread Stephan Houben
Hi all,

One meta-question I have which may already have been discussed much earlier
in this whole proposal series, is:
How common is this problem?

Because I have the impression that nowadays all Linux distributions are
UTF-8 by default and you have to show some
bloody-mindedness to end up with a POSIX locale.

Docker was mentioned, is this not really an issue which should be solved at
the Docker level?
Since it would affect *all* applications which are doing something
non-trivial with encodings?

I realise there is some attractiveness in solving the issue "for Python",
since that will reduce the number of bug reports
and get people off the maintainers' backs, but to get this fixed in
the wider Linux ecosystem it might be preferable to
"Let them eat mojibake", to paraphrase what Marie-Antoinette never said.

Stephan

2017-01-06 5:49 GMT+01:00 Steven D'Aprano :

> On Fri, Jan 06, 2017 at 02:54:49AM +0100, Victor Stinner wrote:
>
> > Let's say that you have the filename b'nonascii\xff': it's decoded as
> > 'nonascii\udcff' by the UTF-8 mode. How do GUIs handle such filename?
> > (I don't know the answer, it's a real question ;-))
>
> I ran this in Python 2.7 to create the file:
>
> open(b'/tmp/nonascii\xff-', 'w')
>
> and then confirmed the filename:
>
> [steve@ando tmp]$ ls -b nonascii*
> nonascii\377-
>
> Konqueror in KDE 3 displays it with *two* "missing character" glyphs
> (small hollow boxes) before the hyphen. The KDE "Open File" dialog box
> shows the file with two blank spaces before the hyphen.
>
> My interpretation of this is that the difference is due to using
> different fonts: the file name is shown the same way, but in one font
> the missing character is a small box and in the other it is a blank
> space.
>
> I cannot tell what KDE is using for the invalid character; if I copy it
> as text and paste it into a file I just get the original \xFF.
>
> The Geany text editor, which I think uses the same GUI toolkit as Gnome,
> shows the file with a single "missing glyph" character, this time a
> black diamond with a question mark in it.
>
> It looks like Geany (Gnome?) is displaying the invalid byte as U+FFFD,
> the Unicode "REPLACEMENT CHARACTER".
>
> So at least two Linux GUI environments are capable of dealing with
> filenames that are invalid UTF-8, in two different ways.
>
> Does this answer your question about GUIs?
>
>
> --
> Steve
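
For reference, the two treatments of the invalid byte discussed above can be reproduced with the standard codec error handlers. This is only a sketch of the decoding step, using the filename from Steven's experiment; it says nothing about what KDE or Geany do internally.

```python
raw = b"nonascii\xff-"

# surrogateescape maps the undecodable byte 0xFF to the lone surrogate
# U+DCFF, so the original bytes can be recovered later.
escaped = raw.decode("utf-8", errors="surrogateescape")
round_tripped = escaped.encode("utf-8", errors="surrogateescape")

# replace substitutes U+FFFD REPLACEMENT CHARACTER, which is lossy.
replaced = raw.decode("utf-8", errors="replace")

# A strict encode of the escaped string fails, because lone surrogates
# are not valid in UTF-8.
try:
    escaped.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False
```

The replace form matches what Geany appears to display, while the surrogateescape form is the one that lets the filename survive a round trip back to bytes.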

Re: [Python-ideas] incremental hashing in __hash__

2017-01-06 Thread Paul Moore
On 6 January 2017 at 09:02, Neil Girdhar  wrote:
>
> Yeah, looks like you missed a revision.  There were two emails.  I suggested
> adding ImmutableIterable and ImmutableSet, and so there is an obvious
> implementation of __hash__ for both.

OK, sorry.

The proposal is still getting more complicated, though, and I really
don't see how it's better than having some low-level helper functions
for people who need to build custom __hash__ implementations. The "one
obvious way" to customise hashing is to implement __hash__, not to
derive from a base class that says my class is an
Immutable{Iterable,Set}.
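
For comparison, collections.abc.Set already works this way: it ships a _hash() helper but leaves __hash__ undefined, and the documentation suggests opting in with __hash__ = Set._hash. A sketch follows; the TupleSet class is hypothetical.

```python
from collections.abc import Hashable, Set


class TupleSet(Set, Hashable):
    """Hypothetical immutable set backed by a tuple, for illustration."""

    def __init__(self, iterable=()):
        items = []
        for item in iterable:
            if item not in items:  # drop duplicates, keep first occurrence
                items.append(item)
        self._items = tuple(items)

    def __contains__(self, item):
        return item in self._items

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)

    # Hashing is an explicit opt-in that reuses the mixin's helper.
    __hash__ = Set._hash
```

Set supplies __eq__ as a mixin method, so two TupleSet instances with the same elements compare equal and hash equal regardless of element order.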

Paul


Re: [Python-ideas] PEP 540: Add a new UTF-8 mode

2017-01-06 Thread INADA Naoki
LGTM.

Some comments:

I want UTF-8 mode to be enabled by default (with an opt-out option) even if
the locale is not POSIX,
like `PYTHONLEGACYWINDOWSFSENCODING`.

Users who depend on the locale know what a locale is and how to configure it.
They can understand the difference between locale mode and UTF-8 mode,
and they can opt out of UTF-8 mode.
But many people live in a "UTF-8 everywhere" world and don't know about locales.
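
As a rough illustration of opting in and out, the sketch below assumes an interpreter that implements the PEP as it later shipped (CPython 3.7 or newer), where the mode is exposed as sys.flags.utf8_mode and `-X utf8=0` disables it. The run_probe helper is hypothetical.

```python
import subprocess
import sys

# One-liner run in a child interpreter: report the mode flag and the
# filesystem encoding that the child actually uses.
PROBE = "import sys; print(sys.flags.utf8_mode, sys.getfilesystemencoding())"


def run_probe(*interpreter_args):
    """Run PROBE in a child interpreter started with the given options."""
    result = subprocess.run(
        [sys.executable, *interpreter_args, "-c", PROBE],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.split()


opted_in = run_probe("-X", "utf8")     # explicit opt-in
opted_out = run_probe("-X", "utf8=0")  # explicit opt-out
```

With the opt-in, the child reports the flag as 1 and "utf-8" as its filesystem encoding; with the opt-out, the flag is 0 and the encoding again follows the platform and locale.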


The `-X utf8` option should be parsed before converting command-line
arguments to wchar_t*.
How about adding Py_UnixMain(int argc, char** argv), which would be available
only on Unix?

I dislike the wchar_t type and the mbstowcs functions on Unix. (I love wchar_t
on Windows, of course.)
I hope we can remove `wchar_t *wstr` from PyASCIIObject and deprecate
all wchar_t APIs
on Unix in the future.


On Fri, Jan 6, 2017 at 10:43 AM, Victor Stinner
 wrote:
> Ok, I modified my PEP: the POSIX locale now enables the UTF-8 mode.
>
> 2017-01-05 18:10 GMT+01:00 Victor Stinner :
>> A common request is that "Python just works" without having to pass a
>> command line option or set an environment variable. Maybe the default
>> behaviour should be left unchanged, but the behaviour with the POSIX
>> locale should change.
>
> http://bugs.python.org/issue28180 asks to "change the default" to get
> a Python which "just works" without any kind of configuration, in the
> context of a Docker image (I don't have any details about the image yet).
>
>
>> Maybe we can enable the UTF-8 mode (or "UNIX mode") of the PEP 540
>> when the POSIX locale is used?
>
> I read the other issues again and I confirm that users are looking for a
> Python 3 which behaves like Python 2: simply don't bother them with
> encodings. I see the UTF-8 mode as an opportunity to answer this
> request.
>
> Moreover, the most common cause of encoding issues is a program run
> with no locale variable set and so using the POSIX locale.
>
> So I modified my PEP 540: the POSIX locale now enables the UTF-8 mode.
> I had to update the "Backward Compatibility" section since the PEP now
> introduces a backward incompatible change (POSIX locale), but my bet
> is that the new behaviour is the one expected by users and that it
> cannot break applications.
>
> I moved my initial proposition as an alternative.
>
> I added a "Use Cases" section to explain in depth the "always work"
> behaviour, which I called the "UNIX mode" in my previous email.
>
> Latest version of the PEP:
> https://github.com/python/peps/blob/master/pep-0540.txt
>
> https://www.python.org/dev/peps/pep-0540/ will be updated shortly.
>
> Victor


Re: [Python-ideas] incremental hashing in __hash__

2017-01-06 Thread Neil Girdhar
On Fri, Jan 6, 2017 at 2:07 AM Stephen J. Turnbull <
turnbull.stephen...@u.tsukuba.ac.jp> wrote:

> Neil Girdhar writes:
>
>  > I don't understand this?  How is providing a default method in an
>  > abstract base class a pessimization?  If it happens to be slower
>  > than the code in the current methods, it can still be overridden.
>
> How often will people override until it's bitten them?  How many
> people will not even notice until they've lost business due to slow
> response?  If you don't have a default, that's much more obvious.
> Note that if there is a default, the collections are "large", and
> equality comparisons are "rare", this could be a substantial overhead.
>

I still don't understand what you're talking about here.  You're saying
that we shouldn't provide a __hash__ in case the default hash happens to be
slower than what the user wants and so by not providing it, we force the
user to write a fast one?  Doesn't that argument apply to all methods
provided by abcs?


>  > > BTW, it occurs to me that now that dictionaries are versioned, in some
>  > > cases it *may* make sense to hash dictionaries even though they are
>  > > mutable, although the "hash" would need to somehow account for the
>  > > version changing.  Seems messy but maybe someone has an idea?
>
>  > I think it's important to keep in mind that dictionaries are not versioned
>  > in Python. They happen to be versioned in CPython as an unexposed
>  > implementation detail.  I don't think that such details should have any
>  > bearing on potential changes to Python.
>
> AFAIK the use of the hash member for equality checking is an
> implementation detail too, although the language reference does
> mention that set, frozenset and dict are "hashed collections".  The
> basic requirements on hashes are that (1) objects that compare equal
> must hash to the same value, and (2) the hash bucket must not change
> over an object's lifetime (this is the "messy" aspect that probably
> kills the idea -- you'd need to fix up all hashed collections that
> contain the object as a key).
>

Are you saying that __hash__ is called by __eq__?
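
One way to check is to trace the calls. The sketch below uses a hypothetical Traced class and reflects how CPython's dict behaves: a plain == comparison never touches __hash__, while a dict lookup hashes the probe key first and only then calls __eq__ on a candidate whose stored hash matches.

```python
calls = []


class Traced:
    """Hypothetical class that records which special methods are invoked."""

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        calls.append("hash")
        return hash(self.value)

    def __eq__(self, other):
        calls.append("eq")
        return isinstance(other, Traced) and self.value == other.value


a, b = Traced(1), Traced(1)

# A plain comparison: only __eq__ runs.
calls.clear()
equal = a == b
comparison_calls = list(calls)

# A dict lookup with an equal but distinct key: __hash__ of the probe key
# runs first, then __eq__ confirms the match against the stored key.
table = {a: "x"}
calls.clear()
value = table[b]
lookup_calls = list(calls)
```

So it is the container, not __eq__, that calls __hash__; the hash narrows the search and __eq__ settles it.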