
[Python-Dev] PEP 540: Add a new UTF-8 mode

2017-12-05 Thread Victor Stinner

Hi,

Since it's the PEP Acceptance Week, I'll try my luck! Here is my very
long PEP to propose a tiny change. The PEP is very long in order to
explain the rationale and limitations.

Inaccurate tl;dr: with the UTF-8 mode, Unicode "just works" as expected.

Reminder: INADA Naoki was nominated as the BDFL-Delegate.

https://www.python.org/dev/peps/pep-0540/

Full-text below.

Victor


PEP: 540
Title: Add a new UTF-8 mode
Version: $Revision$
Last-Modified: $Date$
Author: Victor Stinner,
        Nick Coghlan
BDFL-Delegate: INADA Naoki
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 5-January-2016
Python-Version: 3.7


Abstract
========

Add a new UTF-8 mode, enabled by default in the POSIX locale, to ignore
the locale and force the usage of the UTF-8 encoding for external
operating system interfaces, including the standard IO streams.

Essentially, the UTF-8 mode behaves as Python 2 and other C based
applications on \*nix systems: it aims to process text as best it can,
but it errs on the side of producing or propagating mojibake to
subsequent components in a processing pipeline rather than requiring
strictly valid encodings at every step in the process.

The UTF-8 mode can be configured as strict to reduce the risk of
producing or propagating mojibake.

A new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
variable are added to explicitly control the UTF-8 mode (including
turning it off entirely, even in the POSIX locale).
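
As a rough illustration of the two switches named above, the sketch below starts a child interpreter and reports whether UTF-8 mode is active. It assumes a Python 3.7+ interpreter in which the mode is exposed as ``sys.flags.utf8_mode`` (that attribute comes from the released implementation, not from this draft), and ``utf8_mode_in_child`` is a hypothetical helper name.

```python
import os
import subprocess
import sys


def utf8_mode_in_child(*interpreter_args, env_value=None):
    """Return sys.flags.utf8_mode as seen by a freshly started interpreter."""
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)        # start from a known state
    if env_value is not None:
        env["PYTHONUTF8"] = env_value  # "1" enables the mode, "0" disables it
    result = subprocess.run(
        [sys.executable, *interpreter_args, "-c",
         "import sys; print(sys.flags.utf8_mode)"],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    print("-X utf8      ->", utf8_mode_in_child("-X", "utf8"))
    print("PYTHONUTF8=0 ->", utf8_mode_in_child(env_value="0"))
```

Running the check in a child process matters because the mode is fixed at interpreter startup; it cannot be toggled from inside a running program.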


Rationale
=========

"It's not a bug, you must fix your locale" is not an acceptable answer
----------------------------------------------------------------------

Since Python 3.0 was released in 2008, the usual answer to users getting
Unicode errors is to ask developers to fix their code to handle Unicode
properly. Most applications and Python modules were fixed, but users
kept reporting Unicode errors regularly: see the long list of issues in
the `Links`_ section below.

In fact, a second class of bugs comes from a locale which is not properly
configured. The usual answer to such a bug report is: "it is not a bug,
you must fix your locale".

Technically, the answer is correct, but from a practical point of view,
the answer is not acceptable. In many cases, "fixing the issue" is a
hard task. Moreover, sometimes, the usage of the POSIX locale is
deliberate.

A good example of a concrete issue is a build system which creates a
fresh environment for each build using a chroot, a container, a virtual
machine or something else to get reproducible builds. Such a setup
usually uses the POSIX locale.  To get 100% reproducible builds, the
POSIX locale is a good choice: see the `Locales section of
reproducible-builds.org
`_.

PEP 538 lists additional problems related to the use of Linux containers to
run network services and command line applications.

UNIX users don't expect Unicode errors, since the common command line
tools like ``cat``, ``grep`` or ``sed`` never fail with Unicode errors -
they produce mostly-readable text instead.

These users similarly expect that tools written in Python 3 (including
those updated from Python 2), continue to tolerate locale
misconfigurations and avoid bothering them with text encoding details.
From their point of view, the bug is not their locale but is
obviously Python 3 ("Everything else works, including Python 2, so
what's wrong with Python 3?").

Since Python 2 handles data as bytes, similar to system utilities
written in C and C++, it's rarer in Python 2 compared to Python 3 to get
explicit Unicode errors. It also contributes significantly to why many
affected users perceive Python 3 as the root cause of their Unicode
errors.

At the same time, the stricter text handling model was deliberately
introduced into Python 3 to reduce the frequency of data corruption bugs
arising in production services due to mismatched assumptions regarding
text encodings.  It's one thing to emit mojibake to a user's terminal
while listing a directory, but something else entirely to store that in
a system manifest in a database, or to send it to a remote client
attempting to retrieve files from the system.

Since different groups of users have different expectations, there is no
silver bullet which solves all issues at once. Last but not least,
backward compatibility should be preserved whenever possible.

Locale and operating system data
--------------------------------

.. _operating system data:

Python uses an encoding called the "filesystem encoding" to decide how
to encode and decode data from/to the operating system:

* file content
* command line arguments: ``sys.argv``
* standard streams: ``sys.stdin``, ``sys.stdout``, ``sys.stderr``
* environment variables: ``os.environ``
* filenames: ``os.listdir(str)`` for example
* pipes: ``subprocess.Popen`` using ``subprocess.PIPE`` for example
* error messages: ``os.strerror(code)`` for example
* user and terminal names: ``os``, ``grp`` and ``pwd`` mod
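
To make the "filesystem encoding" concrete, here is a small sketch that prints the encoding and error handler the running interpreter uses for operating system data, then round-trips a filename through ``os.fsdecode`` and ``os.fsencode``. It assumes Python 3.6 or later (for ``sys.getfilesystemencodeerrors``); the filename is made up for the example.

```python
import os
import sys

# Encoding and error handler used for filenames, argv, environ and so on.
print("filesystem encoding:", sys.getfilesystemencoding())
print("error handler:      ", sys.getfilesystemencodeerrors())

# A made-up filename, given as the raw bytes the OS would hand back
# (UTF-8 encoded "cafe" with an acute accent on the e).
raw_name = b"caf\xc3\xa9.txt"

decoded = os.fsdecode(raw_name)       # bytes -> str using the filesystem encoding
round_tripped = os.fsencode(decoded)  # str -> bytes using the same encoding

# On POSIX systems the surrogateescape handler makes this round trip lossless
# even when the bytes are not valid in the filesystem encoding.
print("round trip preserved bytes:", round_tripped == raw_name)
```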

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode

2017-12-05 Thread Guido van Rossum

I've been discussing this PEP offline with Victor, but he suggested we
should discuss it in public instead.

I am very worried about this long and rambling PEP, and I propose that it
not be accepted without a major rewrite to focus on clarity of the
specification. The "Unicode just works" summary is more a wish than a
proper summary of the PEP.

For others interested in reviewing this, the implementation link is hidden
in the long list of links; it is http://bugs.python.org/issue29240.


FWIW the relationship with PEP 538 is also pretty unclear. (Or maybe that's
another case of the forest and the trees.) And that PEP (while already
accepted) also comes across as rambling and vague, and I have no idea what
it actually does. And it seems to mention PEP 540 quite a few times.

So I guess PEP acceptance week is over. :-(


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode

2017-12-05 Thread Victor Stinner

2017-12-05 22:18 GMT+01:00 Guido van Rossum :
> So I guess PEP acceptance week is over. :-(

My bad, Barry wrote "PEP Acceptance Day", not week, on twitter ;-)
https://twitter.com/pumpichank/status/937770805076905990

Victor
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode

2017-12-05 Thread Chris Barker

On Tue, Dec 5, 2017 at 1:18 PM, Guido van Rossum  wrote:
>
>
> I am very worried about this long and rambling PEP,
>

FWIW, I read the PEP on the bus this morning on a phone, and while long, I
didn't find it too rambling. And this topic has been very often discussed
in very long and rambling mailing list threads, etc. So I think a long (if
not rambling) PEP is in order.

This is a very important topic for Python -- the py2-3 transition got a LOT
of flak, to the point of people claiming that it was easier to learn a
whole new language than convert to py3 -- and THIS particular issue was a
big part of it:

The truth is that any system that does not use a clearly defined encoding
for filenames (and everything else) is broken, plain and simple. But the
other truth is (as talked about in the PEP) that some *nix systems are that
broken because C code that simply passed around char* still works fine. And
no matter how you slice it, telling people that they need to fix their
broken system in order for your software to run is not a popular option.

When Python added surrogateescape to its Unicode implementation, the tools
were there to work with broken (OK, I'll be charitable: misconfigured)
systems. Now we just need some easier defaults.
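
For readers who have not used it, this is roughly what surrogateescape (PEP 383) does with bytes that are not valid UTF-8. The byte string is a made-up filename; only standard ``bytes.decode`` and ``str.encode`` calls are used.

```python
# A made-up filename containing a byte (0xFF) that is never valid in UTF-8.
raw = b"report-\xff.txt"

# Strict decoding refuses the data outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape smuggles the bad byte through as a lone surrogate (U+DCFF).
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")

print(strict_ok)        # strict decoding failed
print(ascii(text))      # shows the escaped byte as \udcff
print(restored == raw)  # the round trip is lossless
```

The escaped string cannot be encoded with the strict handler, which is where errors still surface if such text reaches an interface that insists on valid UTF-8.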

OK, now I'm getting long and rambling

TL;DR -- The proposal in the PEP is an important step forward, and the
issue is fraught with enough history and controversy that a long PEP is
probably a good idea.

So the addition of a better summary of the specification up at the top, and
editing of the rest, and we could have a good PEP.

Too late for this release, but what can you do?

> The "Unicode just works" summary is more a wish than a proper summary of
> the PEP.
>

well, yeah.

> FWIW the relationship with PEP 538 is also pretty unclear. (Or maybe
> that's another case of the forest and the trees.) And that PEP (while
> already accepted) also comes across as rambling and vague, and I have no
> idea what it actually does. And it seems to mention PEP 540 quite a few
> times.
>

I just took another look at 538 -- and yes, the relationship between the
two is really unclear. In particular, with 538, why do we need 540? I
honestly don't know.

-Chris

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode

2017-12-05 Thread Victor Stinner

Chris:
> I just took another look at 538 -- and yes, the relationship between the two
> is really unclear. In particular, with 538, why do we need 540? I honestly
> don't know.

PEP 538 only affects platforms which provide the C.UTF-8 locale or
a variant: only a few recent Linux distributions. I know Fedora has it;
maybe a few others do as well? FreeBSD and macOS are completely ignored by
PEP 538. PEP 540 uses the UTF-8 encoding for the POSIX locale on
*all* platforms.

Moreover, PEP 538 only concerns the POSIX locale (locale "C"),
whereas PEP 540 is usable with any locale. For example, using the
"fr_FR.iso88591" locale, the encoding is Latin1. But if you enable the
UTF-8 mode with this locale, Python will use UTF-8.

The other difference is that PEP 538 is implemented with
setlocale(LC_CTYPE, "C.UTF-8"), whereas PEP 540 is implemented in
Python internals and ignores the locale. The scope of PEP 540 is limited
to Python: non-Python code running in the same process is not aware of the
"Python UTF-8 mode".
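
A hedged sketch of that last difference: the child interpreter below is started in the C locale with UTF-8 mode switched on and locale coercion switched off, then reports both Python's preferred encoding and the C-level LC_CTYPE locale name. It assumes Python 3.7+; ``encodings_in_utf8_mode`` is a hypothetical helper, and the exact locale string printed depends on the platform.

```python
import os
import subprocess
import sys

CHILD_CODE = (
    "import locale; "
    "print(locale.getpreferredencoding(False)); "
    "print(locale.setlocale(locale.LC_CTYPE, None))"
)


def encodings_in_utf8_mode():
    """Return (Python's preferred encoding, C-level LC_CTYPE locale name)."""
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)
    env["LC_ALL"] = "C"               # force the C locale for the child
    env["PYTHONCOERCECLOCALE"] = "0"  # keep PEP 538 coercion out of the picture
    result = subprocess.run(
        [sys.executable, "-X", "utf8", "-c", CHILD_CODE],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    preferred, c_locale = result.stdout.splitlines()
    return preferred, c_locale


if __name__ == "__main__":
    preferred, c_locale = encodings_in_utf8_mode()
    print("Python preferred encoding:", preferred)
    print("C-level LC_CTYPE locale:  ", c_locale)
```

Python reports UTF-8 while the process locale is left alone, which is why C code in the same process does not see the change.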

Victor

[Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-05 Thread Victor Stinner

new UTF-8 mode of this PEP is only enabled by the POSIX locale, it can
be enabled manually for any other locale.

The PEP 538 is implemented with ``setlocale(LC_CTYPE, "C.UTF-8")``: any
non-Python code running in the process is impacted by this change.  This
PEP is implemented in Python internals and ignores the locale:
non-Python code running in the same process is not aware of the "Python UTF-8
mode".


Links
=====

* `bpo-29240: Implementation of the PEP 540: Add a new UTF-8 mode
  <http://bugs.python.org/issue29240>`_
* `PEP 538 <https://www.python.org/dev/peps/pep-0538/>`_:
  "Coercing the legacy C locale to C.UTF-8"
* `PEP 529 <https://www.python.org/dev/peps/pep-0529/>`_:
  "Change Windows filesystem encoding to UTF-8"
* `PEP 528 <https://www.python.org/dev/peps/pep-0528/>`_:
  "Change Windows console encoding to UTF-8"
* `PEP 383 <https://www.python.org/dev/peps/pep-0383/>`_:
  "Non-decodable Bytes in System Character Interfaces"


Post History
============

* 2017-12: `[Python-Dev] PEP 540: Add a new UTF-8 mode
  <https://mail.python.org/pipermail/python-dev/2017-December/151054.html>`_
* 2017-04: `[Python-Dev] Proposed BDFL Delegate update for PEPs 538 &
  540 (assuming UTF-8 for *nix system boundaries)
  <https://mail.python.org/pipermail/python-dev/2017-April/147795.html>`_
* 2017-01: `[Python-ideas] PEP 540: Add a new UTF-8 mode
  <https://mail.python.org/pipermail/python-ideas/2017-January/044089.html>`_
* 2017-01: `bpo-28180: Implementation of the PEP 538: coerce C locale to
  C.utf-8 (msg284764) <https://bugs.python.org/issue28180#msg284764>`_
* 2016-08-17: `bpo-27781: Change sys.getfilesystemencoding() on Windows
  to UTF-8 (msg272916) <https://bugs.python.org/issue27781#msg272916>`_
  -- Victor proposed ``-X utf8`` for the :pep:`529` (Change Windows
  filesystem encoding to UTF-8)


Copyright
=========

This document has been placed in the public domain.

[Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-07 Thread Victor Stinner

-Python running in the same process is not aware of the
"Python UTF-8 mode". The benefit of the PEP 538 approach is that it helps
ensure that encoding handling in binary extension modules and subprocesses
is consistent with CPython's encoding handling. The upside of the PEP 540
approach is that it allows an embedding application to change the
interpreter's behaviour without having to change the process global
locale settings.


Links
=====

* `bpo-29240: Implementation of the PEP 540: Add a new UTF-8 mode
  <http://bugs.python.org/issue29240>`_
* `PEP 538 <https://www.python.org/dev/peps/pep-0538/>`_:
  "Coercing the legacy C locale to C.UTF-8"
* `PEP 529 <https://www.python.org/dev/peps/pep-0529/>`_:
  "Change Windows filesystem encoding to UTF-8"
* `PEP 528 <https://www.python.org/dev/peps/pep-0528/>`_:
  "Change Windows console encoding to UTF-8"
* `PEP 383 <https://www.python.org/dev/peps/pep-0383/>`_:
  "Non-decodable Bytes in System Character Interfaces"


Post History
============

* 2017-12: `[Python-Dev] PEP 540: Add a new UTF-8 mode
  <https://mail.python.org/pipermail/python-dev/2017-December/151054.html>`_
* 2017-04: `[Python-Dev] Proposed BDFL Delegate update for PEPs 538 &
  540 (assuming UTF-8 for *nix system boundaries)
  <https://mail.python.org/pipermail/python-dev/2017-April/147795.html>`_
* 2017-01: `[Python-ideas] PEP 540: Add a new UTF-8 mode
  <https://mail.python.org/pipermail/python-ideas/2017-January/044089.html>`_
* 2017-01: `bpo-28180: Implementation of the PEP 538: coerce C locale to
  C.utf-8 (msg284764) <https://bugs.python.org/issue28180#msg284764>`_
* 2016-08-17: `bpo-27781: Change sys.getfilesystemencoding() on Windows
  to UTF-8 (msg272916) <https://bugs.python.org/issue27781#msg272916>`_
  -- Victor proposed ``-X utf8`` for the :pep:`529` (Change Windows
  filesystem encoding to UTF-8)


Copyright
=========

This document has been placed in the public domain.

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-05 Thread Victor Stinner

> Annex: Differences between the PEP 538 and the PEP 540
> ======================================================
>
> The PEP 538 uses the "C.UTF-8" locale which is quite new and only
> supported by a few Linux distributions; this locale is not currently
> supported by FreeBSD or macOS for example. This PEP 540 supports all
> operating systems.
>
> The PEP 538 only changes the behaviour for the POSIX locale. While the
> new UTF-8 mode of this PEP is only enabled by the POSIX locale, it can
> be enabled manually for any other locale.
>
> The PEP 538 is implemented with ``setlocale(LC_CTYPE, "C.UTF-8")``: any
> non-Python code running in the process is impacted by this change.  This
> PEP is implemented in Python internals and ignores the locale:
> non-Python running in the same process is not aware of the "Python UTF-8
> mode".

The main advantage of the PEP 538 *over* the PEP 540 is that, for the
POSIX locale, non-Python code running in the same process gets the
UTF-8 encoding.

To be honest, I'm not sure that there is a lot of code in the wild
which uses "text" types like the C type wchar_t* and relies on the
locale encoding. Almost all C libraries handle data as bytes using the
char* type, like filenames and environment variables.

At first I understood that the PEP 538 changed the locale encoding using
an environment variable. But no, it's implemented with
setlocale(LC_CTYPE, "C.UTF-8") which only impacts the current process
and is not inherited by child processes. So I'm no longer sure that
PEP 538 and PEP 540 are really complementary.

I'm not sure how PyGTK interacts with the PEP 538 for example. Does it
use UTF-8 with the POSIX locale?

Victor

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-05 Thread Nick Coghlan

On 6 December 2017 at 11:01, Victor Stinner  wrote:
>> Annex: Differences between the PEP 538 and the PEP 540
>> ======================================================
>>
>> The PEP 538 uses the "C.UTF-8" locale which is quite new and only
>> supported by a few Linux distributions; this locale is not currently
>> supported by FreeBSD or macOS for example. This PEP 540 supports all
>> operating systems.
>>
>> The PEP 538 only changes the behaviour for the POSIX locale. While the
>> new UTF-8 mode of this PEP is only enabled by the POSIX locale, it can
>> be enabled manually for any other locale.
>>
>> The PEP 538 is implemented with ``setlocale(LC_CTYPE, "C.UTF-8")``: any
>> non-Python code running in the process is impacted by this change.  This
>> PEP is implemented in Python internals and ignores the locale:
>> non-Python running in the same process is not aware of the "Python UTF-8
>> mode".

I submitted a PR to reword this part: https://github.com/python/peps/pull/493

> The main advantage of the PEP 538 *over* the PEP 540 is that, for the
> POSIX locale, non-Python code running in the same process gets the
> UTF-8 encoding.
>
> To be honest, I'm not sure that there is a lot of code in the wild
> which uses "text" types like the C type wchar_t* and rely on the
> locale encoding. Almost all C library handle data as bytes using the
> char* type, like filenames and environment variables.

At the very least, GNU readline breaks if you don't change the locale
setting: 
https://www.python.org/dev/peps/pep-0538/#considering-locale-coercion-independently-of-utf-8-mode

Given that we found an example of this directly in the standard
library, I assume that there are plenty more in third party extension
modules (especially once we take C++ extensions into account, not just
C ones).

> First I understood that the PEP 538 changed the locale encoding using
> an environment variable. But no, it's implemented with
> setlocale(LC_CTYPE, "C.UTF-8") which only impacts the current process
> and is not inherited by child processes. So I'm not sure anymore that
> PEP 538 and PEP 540 are really complementary.

It sets the LC_CTYPE environment variable as well:
https://www.python.org/dev/peps/pep-0538/#explicitly-setting-lc-ctype-for-utf-8-locale-coercion

The relevant code is in _coerce_default_locale_settings (currently at
https://github.com/python/cpython/blob/master/Python/pylifecycle.c#L448)
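
One way to observe this from the outside is to start a child interpreter in the C locale and ask it what ``LC_CTYPE`` looks like in its own environment. This is only a sketch: ``lc_ctype_seen_by_child`` is a hypothetical helper, and the result depends on whether the platform offers a suitable UTF-8 locale, so no particular output is claimed.

```python
import os
import subprocess
import sys


def lc_ctype_seen_by_child():
    """Start Python in the C locale and return os.environ['LC_CTYPE'] there."""
    # Drop settings that would override LC_CTYPE or disable locale coercion.
    skip = ("LC_ALL", "LANG", "PYTHONUTF8", "PYTHONCOERCECLOCALE")
    env = {k: v for k, v in os.environ.items() if k not in skip}
    env["LC_CTYPE"] = "C"
    result = subprocess.run(
        [sys.executable, "-c",
         "import os; print(os.environ.get('LC_CTYPE', ''))"],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # A value other than "C" means the interpreter rewrote the variable,
    # so any grandchild process would inherit the coerced setting.
    print("LC_CTYPE inside the child:", lc_ctype_seen_by_child())
```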

> I'm not sure how PyGTK interacts with the PEP 538 for example. Does it
> use UTF-8 with the POSIX locale?

Desktop environments aim not to get into this situation in the first
place by ensuring they're using a more appropriate locale :)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-05 Thread INADA Naoki

I'm sorry about my laziness.
I've been very busy these past months, but I'm back in the OSS world as of today.

While I should carefully review it again, I think I'm close to accepting PEP 540.

* PEP 540 really helps with containers and old Linux machines where PEP 538
  doesn't work. And containers are really important these days. Many new
  Pythonistas who are not Linux experts start out using containers.

* In recent years, UTF-8 has fixed many mojibake problems. Now UnicodeError is
  more of a usability problem for many Python users. So I agree that an opt-out
  UTF-8 mode is better than an opt-in one in the POSIX locale.

I don't have enough time to read all the mails in the ML archive.
So if someone has an opposing opinion, please remind me by this weekend.

Regards,

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-05 Thread INADA Naoki

Oh, the revised version is really short!

And I have one point that worries me.
With UTF-8 mode, open()'s default encoding/error handler is
UTF-8/surrogateescape.

Container usage is really growing. PyCharm supports Docker, and many new Python
developers use Docker instead of installing Python directly on their system,
especially on Windows.

And opening a binary file without the "b" option is a very common mistake among
new developers. If the default error handler is surrogateescape, they lose a
chance to notice their bug.

On the other hand, it helps some use cases where the user wants byte-transparent
behavior, without modifying code to use "surrogateescape" explicitly.

Which scenario is more important? Does anyone have an opinion about it?
Are there any rationales or use cases I'm missing?
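
To make the trade-off concrete, here is a small sketch of the mistake described above: a binary file opened in text mode. The error handlers are passed explicitly so the example behaves the same whatever the interpreter's defaults are; the file content and the ``read_as_text`` helper are made up for illustration.

```python
import os
import tempfile


def read_as_text(path, errors):
    """Open a file in text mode (no "b") as UTF-8 with the given error handler."""
    with open(path, encoding="utf-8", errors=errors, newline="") as handle:
        return handle.read()


# Made-up binary payload containing bytes that are not valid UTF-8.
payload = b"\x89PNG\xff\x00"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "image.bin")
    with open(path, "wb") as handle:
        handle.write(payload)

    # "strict" makes the mistake visible immediately.
    try:
        read_as_text(path, "strict")
        strict_raised = False
    except UnicodeDecodeError:
        strict_raised = True

    # "surrogateescape" reads the file silently; the bug goes unnoticed.
    escaped = read_as_text(path, "surrogateescape")

print("strict raised UnicodeDecodeError:", strict_raised)
print("surrogateescape result:", ascii(escaped))
print("bytes recoverable:", escaped.encode("utf-8", "surrogateescape") == payload)
```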

Regards,

INADA Naoki  


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-05 Thread Nick Coghlan

Something I've just noticed that needs to be clarified: on Linux, "C"
locale and "POSIX" locale are aliases, but this isn't true in general
(e.g. it's not the case on *BSD systems, including Mac OS X).

To handle that in PEP 538, I made it clear that everything is keyed
specifically off the "C" locale, since that's what you actually get by
default.

So if PEP 540 is going to implicitly trigger switching encodings, it
needs to specify whether it's going to look for the C locale or the
POSIX locale (I'd suggest C locale, since that's the actual default
that causes problems).

The precedence relationship with locale coercion also needs to be
spelled out: successful locale coercion should skip implicitly
enabling UTF-8 mode (for opt-in UTF-8 mode, we'd still try to coerce
the locale setting as appropriate, so extension modules are more
likely to behave themselves).

On 6 December 2017 at 14:07, INADA Naoki  wrote:
> Oh, revised version is really short!
>
> And I have one worrying point.
> With UTF-8 mode, open()'s default encoding/error handler is
> UTF-8/surrogateescape.
>
> Containers are really growing.  PyCharm supports Docker and many new Python
> developers use Docker instead of installing Python directly on their system,
> especially on Windows.
>
> And opening binary file without "b" option is very common mistake of new
> developers.  If default error handler is surrogateescape, they lose a chance
> to notice their bug.
>
> On the other hand, it helps some use cases when user want byte-transparent
> behavior, without modifying code to use "surrogateescape" explicitly.
>
> Which is more important scenario?  Anyone has opinion about it?
> Are there any rationals and use cases I missing?

For platforms that offer a C.UTF-8 locale, I'd like "LC_CTYPE=C.UTF-8
python" and "PYTHONCOERCECLOCALE=0 LC_CTYPE=C PYTHONUTF8=1" to be
equivalent (aside from the known limitation that extension modules may
not do the right thing in the latter case).

For the locale coercion case, the default error handler for `open`
remains as "strict", which means I'd be in favour of keeping it as
"strict" by default in UTF-8 mode as well. That would flip the toggle
in the PEP: "strict UTF-8" would be the default selection for
"PYTHONUTF8=1", and you'd choose the more relaxed option via
"PYTHONUTF8=permissive".

That way, the combination of PEPs 538 and 540 would give us the
following situation in the C locale:

1. Our preferred approach is to coerce LC_CTYPE in the C locale to a
UTF-8 based equivalent
2. Only if that fails (e.g. as it will on CentOS 7) do we resort to
implicitly enabling CPython's internal UTF-8 mode (which should behave
like C.UTF-8, *except* for the fact extension modules won't respect
it)

That way, the ideal outcome is that a UTF-8 based locale exists, and
we use it automatically when needed. UTF-8 mode then lets us cope with
older platforms where neither C.UTF-8 nor an equivalent exists.
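
The precedence proposed in this message can be sketched as a small pure function. This models the proposal as worded here, not CPython's actual startup code; ``StartupPlan`` and ``plan_startup`` are hypothetical names, and "a UTF-8 locale is available" is reduced to a boolean for the sake of the example.

```python
from typing import NamedTuple


class StartupPlan(NamedTuple):
    coerce_locale: bool  # switch LC_CTYPE to a UTF-8 based locale
    utf8_mode: bool      # enable CPython's internal UTF-8 mode


def plan_startup(lc_ctype, utf8_locale_available, utf8_mode_requested=False):
    """Sketch of the proposed precedence between coercion and UTF-8 mode."""
    in_c_locale = lc_ctype == "C"
    # 1. Preferred: coerce the C locale to a UTF-8 based equivalent.
    coerce_locale = in_c_locale and utf8_locale_available
    # 2. Fallback: implicitly enable UTF-8 mode only if coercion is impossible.
    #    An explicit opt-in always enables it, and coercion is still attempted.
    utf8_mode = utf8_mode_requested or (in_c_locale and not coerce_locale)
    return StartupPlan(coerce_locale, utf8_mode)


if __name__ == "__main__":
    print(plan_startup("C", utf8_locale_available=True))
    print(plan_startup("C", utf8_locale_available=False))
    print(plan_startup("C", utf8_locale_available=True, utf8_mode_requested=True))
    print(plan_startup("en_US.UTF-8", utf8_locale_available=True))
```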

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-05 Thread Chris Angelico

On Wed, Dec 6, 2017 at 4:46 PM, Nick Coghlan  wrote:
> Something I've just noticed that needs to be clarified: on Linux, "C"
> locale and "POSIX" locale are aliases, but this isn't true in general
> (e.g. it's not the case on *BSD systems, including Mac OS X).

For those of us with little to no BSD/MacOS experience, can you give a
quick run-down of the differences between "C" and "POSIX"?

ChrisA

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-05 Thread Nick Coghlan

On 6 December 2017 at 15:59, Chris Angelico  wrote:
> On Wed, Dec 6, 2017 at 4:46 PM, Nick Coghlan  wrote:
>> Something I've just noticed that needs to be clarified: on Linux, "C"
>> locale and "POSIX" locale are aliases, but this isn't true in general
>> (e.g. it's not the case on *BSD systems, including Mac OS X).
>
> For those of us with little to no BSD/MacOS experience, can you give a
> quick run-down of the differences between "C" and "POSIX"?

The one that's relevant to default locale detection is just the string
that "setlocale(LC_CTYPE, NULL)" returns.

On Linux (or, more accurately, with glibc), after setting
"LC_CTYPE=POSIX", that call still returns "C" (since the "POSIX"
locale is defined as an alias for the "C" locale).

By contrast, on *BSD, it will return "POSIX" (since "POSIX" is
actually a distinct locale there).
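
From Python, the same query is available through the ``locale`` module: passing ``None`` as the second argument reads the current setting without changing it. The sketch below wraps that call; ``looks_like_legacy_default`` is a hypothetical helper that accepts both spellings, which is an assumption made for the example rather than what either PEP specifies.

```python
import locale


def current_ctype_locale():
    """Query (without changing) the process-wide LC_CTYPE locale name."""
    return locale.setlocale(locale.LC_CTYPE, None)


def looks_like_legacy_default(name):
    """Hypothetical check treating both "C" and "POSIX" as the legacy default."""
    return name in ("C", "POSIX")


if __name__ == "__main__":
    name = current_ctype_locale()
    print("LC_CTYPE locale:", name)
    print("legacy default: ", looks_like_legacy_default(name))
```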

Beyond that, I don't know what the actual functional differences are.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-05 Thread Glenn Linderman


On 12/5/2017 8:07 PM, INADA Naoki wrote:
> Oh, revised version is really short!
>
> And I have one worrying point.
> With UTF-8 mode, open()'s default encoding/error handler is
> UTF-8/surrogateescape.
>
> Containers are really growing.  PyCharm supports Docker and many new Python
> developers use Docker instead of installing Python directly on their system,
> especially on Windows.
>
> And opening binary file without "b" option is very common mistake of new
> developers.  If default error handler is surrogateescape, they lose a chance
> to notice their bug.


"b" mostly matters on Windows, correct? And Windows doesn't use C or 
POSIX locale, correct? And if these are correct, then is this an issue? 
And if so, why?



> On the other hand, it helps some use cases when user want byte-transparent
> behavior, without modifying code to use "surrogateescape" explicitly.
>
> Which is more important scenario?  Anyone has opinion about it?
> Are there any rationals and use cases I missing?
>
> Regards,
>
> INADA Naoki
>
> On Wed, Dec 6, 2017 at 12:17 PM, INADA Naoki  wrote:
>> I'm sorry about my laziness.
>> I've very busy these months, but I'm back to OSS world from today.
>>
>> While I should review carefully again, I think I'm close to accept PEP 540.
>>
>> * PEP 540 really helps containers and old Linux machines PEP 538 doesn't
>>   work.  And containers is really important for these days.  Many new
>>   Pythonistas who is not Linux experts start using containers.
>>
>> * In recent years, UTF-8 fixed many mojibakes.  Now UnicodeError is more
>>   usability problem for many Python users.  So I agree opt-out UTF-8 mode
>>   is better than opt-in on POSIX locale.
>>
>> I don't have enough time to read all mails in ML archive.
>> So if someone have opposite opinion, please remind me by this weekend.
>>
>> Regards,


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-05 Thread Nick Coghlan

On 6 December 2017 at 16:18, Glenn Linderman  wrote:
> "b" mostly matters on Windows, correct? And Windows doesn't use C or POSIX
> locale, correct? And if these are correct, then is this an issue? And if so,
> why?

In Python 3, "b" matters everywhere, since it controls whether the
stream gets wrapped in TextIOWrapper or not.

It's only in Python 2 that the distinction is Windows-specific (where
it controls how "\r\n" sequences get handled).
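
A small sketch of that distinction in Python 3, using a temporary file so it is self-contained. The explicit `encoding="utf-8"` is an assumption made here to keep the result independent of the locale.

```python
import io
import os
import tempfile

# Write a few bytes to a temporary file, then open it both ways.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"line one\r\nline two\r\n")
    path = tmp.name

try:
    with open(path, "rb") as binary_file:
        # Binary mode: a buffered byte stream, no TextIOWrapper involved.
        binary_is_text = isinstance(binary_file, io.TextIOWrapper)
        raw = binary_file.read()

    with open(path, "r", encoding="utf-8") as text_file:
        # Text mode: the stream is wrapped in TextIOWrapper, which decodes
        # the bytes and translates line endings on every platform.
        text_is_text = isinstance(text_file, io.TextIOWrapper)
        text = text_file.read()
finally:
    os.remove(path)

print(binary_is_text, type(raw).__name__)
print(text_is_text, type(text).__name__)
```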

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-06 Thread Victor Stinner

Hi Naoki,

2017-12-06 5:07 GMT+01:00 INADA Naoki :
> Oh, revised version is really short!
>
> And I have one worrying point.
> With UTF-8 mode, open()'s default encoding/error handler is
> UTF-8/surrogateescape.

The Strict UTF-8 Mode is for you if you prioritize correctness over usability.

In the very first version of my PEP/idea, I wanted to use
UTF-8/strict. But then I started to play with the implementation and I
got many "practical" issues. Using UTF-8/strict, you quickly get
encoding errors. For example, you become unable to read undecodable
bytes from stdin. stdin.read() only gives you an error, without
letting you decide how to handle these "invalid" data. Same issue with
stdout.
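
A minimal sketch of that difference, simulating stdin with an in-memory stream rather than the real `sys.stdin`. The helper `read_all` and the sample bytes are made up for illustration.

```python
import io

# Bytes as they might arrive on stdin: valid UTF-8 plus one stray Latin-1 byte.
data = b"caf\xc3\xa9 and caf\xe9\n"


def read_all(raw, errors):
    """Decode raw bytes through a text wrapper, as a standard stream would."""
    stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", errors=errors)
    return stream.read()


try:
    read_all(data, "strict")
    strict_failed = False
except UnicodeDecodeError:
    # The whole read fails; the caller gets no data to inspect or repair.
    strict_failed = True

# surrogateescape maps the undecodable byte 0xE9 to the lone surrogate U+DCE9.
text = read_all(data, "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_trip = text.encode("utf-8", "surrogateescape")
```

With the strict handler the application never sees the data; with surrogateescape it can pass the bytes through unchanged or decide what to do with them.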

Compare encodings of the UTF-8 mode and the Strict UTF-8 Mode:
https://www.python.org/dev/peps/pep-0540/#encoding-and-error-handler

I tried to summarize all these kinds of issues in the second short
subsection of the rationale:
https://www.python.org/dev/peps/pep-0540/#passthough-undecodable-bytes-surrogateescape

In the old long version of the PEP, I tried to explain UTF-8/strict
issues with very concrete examples, the removed "Use Cases" section:
https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt#L490

Tell me if I should rephrase the rationale of the PEP 540 to better
justify the usage of surrogateescape.

Maybe the "UTF-8 Mode" should be renamed to "UTF-8 with
surrogateescape, or backslashreplace for stderr, or surrogatepass for
fsdecode/fsencode on Windows, or strict for Strict UTF-8 Mode"... But
the PEP title would be too long, no? :-)


> And opening binary file without "b" option is very common mistake of new
> developers.  If default error handler is surrogateescape, they lose a chance
> to notice their bug.

When open() is used in text mode to read "binary data", usually the
developer would only notice when getting the POSIX locale (ASCII
encoding). But the PEP 538 already changed that by using the C.UTF-8
locale (and so the UTF-8 encoding, instead of the ASCII encoding).

I'm not sure that locales are the best way to detect such class of
bugs. I suggest using the -b or -bb option to detect such bugs without
having to care about the locale.


> On the other hand, it helps some use cases when user want byte-transparent
> behavior, without modifying code to use "surrogateescape" explicitly.
>
> Which is more important scenario?  Anyone has opinion about it?
> Are there any rationals and use cases I missing?

Usually users expect that Python 3 "just works" and doesn't bother them
with the locale (that nobody understands).

The old version of the PEP contains a long list of issues:
https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt#L924-L986

I already replaced the strict error handler with surrogateescape for
sys.stdin and sys.stdout on the POSIX locale in Python 3.5:
https://bugs.python.org/issue19977

For the rationale, read for example these comments:

* https://bugs.python.org/issue19846#msg205727 "As I would state it,
the problem is that python's boundary with the OS is not yet uniform.
(...) Note that currently, input() and sys.stdin.read() won't read
undecodable data so this is somewhat symmetrical but it seems to me
that saying "everything that interfaces with the OS except the
standard streams will use surrogateescape on undecodable bytes" is
drawing a line in an unintuitive location."

* https://bugs.python.org/issue19977#msg206141 "My impression was that
python3 was supposed to help get rid of UnicodeError tracebacks, not
mojibake.  If mojibake was the problem then we should never have gone
down the surrogateescape path for input."

* https://bugs.python.org/issue19846#msg205646 "For example I'm using
[LANG=C] for testcases to set the language uncomplicated to english."

In bug reports, to get the user expectations, just ignore all core
developers' comments :-)

Users set the locale to C to get messages in English and still expect
"Unicode" to work properly.

Only Python 3 is so strict about encodings. Most other programming
languages, like Python 2, "just work", since they process data as
bytes.

Victor

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-06 Thread Victor Stinner

Nick:
> So if PEP 540 is going to implicitly trigger switching encodings, it
> needs to specify whether it's going to look for the C locale or the
> POSIX locale (I'd suggest C locale, since that's the actual default
> that causes problems).

I'm thinking of the test already used by check_force_ascii() (the function
that checks whether the LC_CTYPE locale uses the ASCII encoding or
something else):

    loc = setlocale(LC_CTYPE, NULL);
    if (loc == NULL)
        goto error;
    if (strcmp(loc, "C") != 0) {
        /* the LC_CTYPE locale is different than C */
        return 0;
    }

Victor

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-06 Thread Nick Coghlan

On 6 December 2017 at 20:38, Victor Stinner  wrote:
> Nick:
>> So if PEP 540 is going to implicitly trigger switching encodings, it
>> needs to specify whether it's going to look for the C locale or the
>> POSIX locale (I'd suggest C locale, since that's the actual default
>> that causes problems).
>
> I'm thinking at the test already used by check_force_ascii() (function
> checking if the LC_CTYPE uses the ASCII encoding or something else):
>
> loc = setlocale(LC_CTYPE, NULL);
> if (loc == NULL)
> goto error;
> if (strcmp(loc, "C") != 0) {
> /* the LC_CTYPE locale is different than C */
> return 0;
> }

Yeah, the locale coercion code changes the locale multiple times to
make sure we have a coercion target that will actually work (and then
checks nl_langinfo as well, since that sometimes breaks on BSD
systems, even if the original setlocale() call claimed to work). Once
we've found a locale that appears to work though, then we configure
the LC_CTYPE environment variable, and reload the locale from the
environment.
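
A much simplified Python sketch of the "try a target, then double-check it" part of that procedure. The candidate names are the coercion targets listed in PEP 538; the function name `find_working_utf8_locale` is hypothetical, and unlike the real C code this sketch restores the original locale instead of configuring the environment.

```python
import locale

# Coercion targets listed in PEP 538, tried in order.
CANDIDATES = ("C.UTF-8", "C.utf8", "UTF-8")


def find_working_utf8_locale():
    """Return the first candidate that can be set and reports UTF-8, else None.

    Simplified sketch: the process locale is restored before returning.
    """
    saved = locale.setlocale(locale.LC_CTYPE, None)
    try:
        for name in CANDIDATES:
            try:
                locale.setlocale(locale.LC_CTYPE, name)
            except locale.Error:
                continue  # this platform does not provide the locale
            # setlocale() succeeding is not enough: also ask which codeset
            # the C library now reports (nl_langinfo is missing on Windows).
            if hasattr(locale, "nl_langinfo"):
                codeset = locale.nl_langinfo(locale.CODESET)
                if codeset.upper().replace("-", "") != "UTF8":
                    continue
            return name
        return None
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)


print(find_working_utf8_locale())
```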

It's all annoyingly convoluted and arcane, but it works well enough
for 
https://github.com/python/cpython/blob/master/Lib/test/test_c_locale_coercion.py
to pass across the full BuildBot fleet :)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-06 Thread INADA Naoki

>> And I have one worrying point.
>> With UTF-8 mode, open()'s default encoding/error handler is
>> UTF-8/surrogateescape.
>
> The Strict UTF-8 Mode is for you if you prioritize correctness over usability.

Yes, but as I said, I care about inexperienced developers
who don't know what UTF-8 mode is.

>
> In the very first version of my PEP/idea, I wanted to use
> UTF-8/strict. But then I started to play with the implementation and I
> got many "practical" issues. Using UTF-8/strict, you quickly get
> encoding errors. For example, you become unable to read undecodable
> bytes from stdin. stdin.read() only gives you an error, without
> letting you decide how to handle these "invalid" data. Same issue with
> stdout.
>

I don't care about stdio, because PEP 538 uses surrogateescape for stdio/error
https://www.python.org/dev/peps/pep-0538/#changes-to-the-default-error-handling-on-the-standard-streams

I care only about the builtin open()'s behavior.
PEP 538 doesn't change the default error handler of open().

I think PEP 538 and PEP 540 should behave almost identically, except for
whether the locale is changed.  So I need a very strong reason if PEP 540
changes the default error handler of open().


> In the old long version of the PEP, I tried to explain UTF-8/strict
> issues with very concrete examples, the removed "Use Cases" section:
> https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt#L490
>
> Tell me if I should rephrase the rationale of the PEP 540 to better
> justify the usage of surrogateescape.

OK, the "List a directory into a text file" example demonstrates why
surrogateescape is used for open().  If os.listdir() returns
surrogateescaped data, file.write() will fail.
All other examples are about stdio.
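
A minimal sketch of that failure, with the filename simulated rather than read from a real directory and an in-memory stream standing in for the text file. The helper `write_name` is made up for illustration.

```python
import io

# A filename as os.listdir() could return it when the on-disk name is
# Latin-1 bytes and the filesystem encoding is UTF-8 (simulated here).
name = b"caf\xe9.txt".decode("utf-8", "surrogateescape")  # 'caf\udce9.txt'


def write_name(errors):
    """Write the name to an in-memory UTF-8 text file; return the bytes."""
    raw = io.BytesIO()
    out = io.TextIOWrapper(raw, encoding="utf-8", errors=errors, newline="\n")
    out.write(name + "\n")
    out.flush()
    return raw.getvalue()


try:
    write_name("strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True  # lone surrogates cannot be encoded to UTF-8

# With surrogateescape the original filename bytes are written back out.
escaped = write_name("surrogateescape")
```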

But we should achieve a good balance between correctness and usability of
the default behavior.

>
> Maybe the "UTF-8 Mode" should be renamed to "UTF-8 with
> surrogateescape, or backslashreplace for stderr, or surrogatepass for
> fsencode/fsencode on Windows, or strict for Strict UTF-8 Mode"... But
> the PEP title would be too long, no? :-)
>

I feel the short name is enough.

>
>> And opening binary file without "b" option is very common mistake of new
>> developers.  If default error handler is surrogateescape, they lose a chance
>> to notice their bug.
>
> When open() in used in text mode to read "binary data", usually the
> developer would only notify when getting the POSIX locale (ASCII
> encoding). But the PEP 538 already changed that by using the C.UTF-8
> locale (and so the UTF-8 encoding, instead of the ASCII encoding).
>

With PEP 538 (C.UTF-8 locale), open() uses UTF-8/strict, not
UTF-8/surrogateescape.

For example, this code raises UnicodeDecodeError with PEP 538 if the
file is a JPEG file:

    with open(fn) as f:
        f.read()
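
A self-contained sketch of both behaviours. The encoding and error handler are passed explicitly here (an assumption, so the result does not depend on the locale), and the sample bytes are just the start of a typical JPEG header.

```python
import os
import tempfile

# The first bytes of a typical JPEG file; 0xFF is never valid in UTF-8.
jpeg_header = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"

with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as tmp:
    tmp.write(jpeg_header)
    path = tmp.name

try:
    try:
        with open(path, encoding="utf-8", errors="strict") as f:
            f.read()
        strict_error = None
    except UnicodeDecodeError as exc:
        strict_error = exc  # the mistake is reported immediately

    with open(path, encoding="utf-8", errors="surrogateescape") as f:
        silent = f.read()  # no error: a str containing lone surrogates
finally:
    os.remove(path)

print(type(strict_error).__name__)
print(ascii(silent))
```

The second read succeeds silently, which is the lost chance to notice the bug described above.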


> I'm not sure that locales are the best way to detect such class of
> bytes. I suggest to use -b or -bb option to detect such bugs without
> having to care of the locale.
>

But many new developers don't use or know about the -b or -bb option.

>
>> On the other hand, it helps some use cases when user want byte-transparent
>> behavior, without modifying code to use "surrogateescape" explicitly.
>>
>> Which is more important scenario?  Anyone has opinion about it?
>> Are there any rationals and use cases I missing?
>
> Usually users expect that Python 3 "just works" and don't bother them
> with the locale (thay nobody understands).
>
> The old version of the PEP contains a long list of issues:
> https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt#L924-L986
>
> I already replaced the strict error handler with surrogateescape for
> sys.stdin and sys.stdout on the POSIX locale in Python 3.5:
> https://bugs.python.org/issue19977
>
> For the rationale, read for example these comments:
>
[snip]

OK, I'll read them and think again about open()'s default behavior.
But I still hope open()'s behavior is consistent with PEP 538 and PEP 540.

Regards,

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-06 Thread Jakub Wilk


* Nick Coghlan , 2017-12-06, 16:15:
>>> Something I've just noticed that needs to be clarified: on Linux, "C"
>>> locale and "POSIX" locale are aliases, but this isn't true in general
>>> (e.g. it's not the case on *BSD systems, including Mac OS X).
>>
>> For those of us with little to no BSD/MacOS experience, can you give a
>> quick run-down of the differences between "C" and "POSIX"?


POSIX says that "C" and "POSIX" are equivalent[0].

> The one that's relevant to default locale detection is just the string
> that "setlocale(LC_CTYPE, NULL)" returns.


POSIX doesn't require any particular return value for setlocale() calls. 
It's only guaranteed that the returned string can be used in subsequent 
setlocale() calls to restore the original locale.


So in the POSIX locale, a compliant setlocale() implementation could 
return "C", or "POSIX", or even something entirely different.
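
The one guarantee mentioned above, that the returned string can be used to restore the locale, can be sketched in Python as follows. This treats the string as opaque and deliberately makes no assumption about its spelling.

```python
import locale

# Query the current LC_CTYPE locale; the exact string is platform-specific.
saved = locale.setlocale(locale.LC_CTYPE, None)

# Switch to the C locale, which every implementation provides.
locale.setlocale(locale.LC_CTYPE, "C")
in_c_locale = locale.setlocale(locale.LC_CTYPE, None)

# The saved string is opaque, but it can be passed back to restore the locale.
locale.setlocale(locale.LC_CTYPE, saved)
restored = locale.setlocale(locale.LC_CTYPE, None)

print(repr(in_c_locale), repr(restored))
```

Code that compares `in_c_locale` against a fixed spelling relies on behaviour that the standard does not promise.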



> Beyond that, I don't know what the actual functional differences are.


I don't believe there are any.


[0] http://pubs.opengroup.org/onlinepubs/9699919799/functions/setlocale.html

--
Jakub Wilk

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-06 Thread Brett Cannon

On Wed, 6 Dec 2017 at 06:10 INADA Naoki  wrote:

> >> And I have one worrying point.
> >> With UTF-8 mode, open()'s default encoding/error handler is
> >> UTF-8/surrogateescape.
> >
> > The Strict UTF-8 Mode is for you if you prioritize correctness over
> usability.
>
> Yes, but as I said, I cares about not experienced developer
> who doesn't know what UTF-8 mode is.
>
> >
> > In the very first version of my PEP/idea, I wanted to use
> > UTF-8/strict. But then I started to play with the implementation and I
> > got many "practical" issues. Using UTF-8/strict, you quickly get
> > encoding errors. For example, you become unable to read undecodable
> > bytes from stdin. stdin.read() only gives you an error, without
> > letting you decide how to handle these "invalid" data. Same issue with
> > stdout.
> >
>
> I don't care about stdio, because PEP 538 uses surrogateescape for
> stdio/error
>
> https://www.python.org/dev/peps/pep-0538/#changes-to-the-default-error-handling-on-the-standard-streams
>
> I care only about builtin open()'s behavior.
> PEP 538 doesn't change default error handler of open().
>
> I think PEP 538 and PEP 540 should behave almost identical except
> changing locale
> or not.  So I need very strong reason if PEP 540 changes default error
> handler of open().
>

I don't have enough locale experience to weigh in as an expert, but I
already was leaning towards INADA-san's logic of not wanting to change
open() and this makes me really not want to change it.

-Brett



Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-06 Thread Greg Ewing


Victor Stinner wrote:

> Maybe the "UTF-8 Mode" should be renamed to "UTF-8 with
> surrogateescape, or backslashreplace for stderr, or surrogatepass for
> fsencode/fsencode on Windows, or strict for Strict UTF-8 Mode"... But
> the PEP title would be too long, no? :-)


Relaxed UTF-8 Mode?

UTF8-Yeah-I'm-Fine-With-That mode?

--
Greg

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-06 Thread Antoine Pitrou

On Wed, 6 Dec 2017 01:49:41 +0100
Victor Stinner  wrote:
> Hi,
> 
> I knew that I had to rewrite my PEP 540, but I was too lazy. Since
> Guido explicitly requested a shorter PEP, here you have!
> 
> https://www.python.org/dev/peps/pep-0540/
> 
> Trust me, it's the same PEP, but focused on the most important
> information and with a shorter rationale ;-)

Congrats on the rewriting!  The shortening is appreciated :-)

One question: how do you plan to test for the POSIX locale?  Apparently
you need to check at least for the "C" and "POSIX" strings, but perhaps
other aliases as well?

Regards

Antoine.



Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-06 Thread Victor Stinner

2017-12-06 23:07 GMT+01:00 Antoine Pitrou :
> One question: how do you plan to test for the POSIX locale?

I'm not sure. I will probably rely on Nick for that ;-) Nick already
implemented this exact check for his PEP 538 which is already
implemented in Python 3.7.

I already implemented the PEP 540:

   https://bugs.python.org/issue29240
   https://github.com/python/cpython/pull/855

Right now, my implementation uses:

   char *ctype = _PyMem_RawStrdup(setlocale(LC_CTYPE, ""));
   ...
   if (strcmp(ctype, "C") == 0) ...

Victor

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-06 Thread Antoine Pitrou

On Wed, 6 Dec 2017 23:20:41 +0100
Victor Stinner  wrote:
> 2017-12-06 23:07 GMT+01:00 Antoine Pitrou :
> > One question: how do you plan to test for the POSIX locale?  
> 
> I'm not sure. I will probably rely on Nick for that ;-) Nick already
> implemented this exact check for his PEP 538 which is already
> implemented in Python 3.7.

Other than that, +1 on the PEP.

Regards

Antoine.

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-06 Thread Victor Stinner

2017-12-06 23:36 GMT+01:00 Antoine Pitrou :
> Other than that, +1 on the PEP.

Naoki doesn't seem to be comfortable with the usage of the
surrogateescape error handler by default for open(). Are you ok with
that? If yes, would you mind explaining why? :-)

Victor

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-06 Thread Antoine Pitrou

On Thu, 7 Dec 2017 00:22:52 +0100
Victor Stinner  wrote:
> 2017-12-06 23:36 GMT+01:00 Antoine Pitrou :
> > Other than that, +1 on the PEP.  
> 
> Naoki doesn't seem to be confortable with the usage of the
> surrogateescape error handler by default for open(). Are you ok with
> that? If yes, would you mind to explain why? :-)

Sorry, I had missed that objection.  I agree with Inada Naoki: it's
better to keep it strict.

Regards

Antoine.

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-06 Thread Nick Coghlan

On 7 December 2017 at 01:59, Jakub Wilk  wrote:
> * Nick Coghlan , 2017-12-06, 16:15:
>> The one that's relevant to default locale detection is just the string
>> that "setlocale(LC_CTYPE, NULL)" returns.
>
> POSIX doesn't require any particular return value for setlocale() calls.
> It's only guaranteed that the returned string can be used in subsequent
> setlocale() calls to restore the original locale.
>
> So in the POSIX locale, a compliant setlocale() implementation could return
> "C", or "POSIX", or even something entirely different.

Thanks. I'd been wondering if we should also handle the "POSIX" case
in the legacy locale detection logic, and you've convinced me that we
should. Issue filed for that here: https://bugs.python.org/issue32238

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-06 Thread Nick Coghlan

On 7 December 2017 at 08:20, Victor Stinner  wrote:
> 2017-12-06 23:07 GMT+01:00 Antoine Pitrou :
>> One question: how do you plan to test for the POSIX locale?
>
> I'm not sure. I will probably rely on Nick for that ;-) Nick already
> implemented this exact check for his PEP 538 which is already
> implemented in Python 3.7.
>
> I already implemented the PEP 540:
>
>    https://bugs.python.org/issue29240
>    https://github.com/python/cpython/pull/855
>
> Right now, my implementation uses:
>
>    char *ctype = _PyMem_RawStrdup(setlocale(LC_CTYPE, ""));
>    ...
>    if (strcmp(ctype, "C") == 0) ...

We have a private helper for this as a result of the PEP 538
implementation: _Py_LegacyLocaleDetected()

Details are in the source code at
https://github.com/python/cpython/blob/master/Python/pylifecycle.c#L345

As per my comment there, and Jakub Wilk's post to this thread, we're
missing a case to also check for the string "POSIX" (which will fix
several of the current locale coercion discrepancies between Linux and
*BSD systems).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-06 Thread INADA Naoki

> I care only about builtin open()'s behavior.
> PEP 538 doesn't change default error handler of open().
>
> I think PEP 538 and PEP 540 should behave almost identical except
> changing locale
> or not.  So I need very strong reason if PEP 540 changes default error
> handler of open().
>

I just came up with a crazy idea: changing the default error handler of
open() to "surrogateescape" only when the open mode is "w" or "a".

When reading, "surrogateescape" error handler is dangerous because
it can produce arbitrary broken unicode string by mistake.

On the other hand, "surrogateescape" error handler for writing
is not so dangerous if encoding is UTF-8.
When writing normal unicode string, it doesn't create broken data.
When writing string containing surrogateescaped data, data is
(partially) broken before writing.

This idea allows the following code:

    with open("files.txt", "w") as f:
        for fn in os.listdir():  # may return surrogateescaped strings
            f.write(fn+'\n')

And it doesn't allow the following code:

    with open("image.jpg", "r") as f:  # Binary data, not UTF-8
        return f.read()
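
The rule could be sketched as a small function that maps an open() mode string to a default error handler. The name `proposed_default_errors` is hypothetical, and treating every mode that can read (including "r+" and "w+") as strict is an assumption of this sketch, not something the idea specifies.

```python
def proposed_default_errors(mode):
    """Pick an error handler from an open() mode string (hypothetical rule).

    Write-only text modes get "surrogateescape"; anything that can read
    keeps "strict". Binary modes have no error handler, so None is returned.
    """
    if "b" in mode:
        return None
    can_read = "r" in mode or "+" in mode
    return "strict" if can_read else "surrogateescape"


for mode in ("r", "w", "a", "r+", "w+", "rb"):
    print(mode, proposed_default_errors(mode))
```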


I'm not sure whether this is a good idea.  And I don't know when the write
error handler should change: only when PEP 538 or PEP 540 is in effect?
Or always when the filesystem encoding (sys.getfilesystemencoding()) is UTF-8?

Any thoughts?

INADA Naoki  

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-07 Thread Victor Stinner

While I'm not strongly convinced that the open() error handler must be
changed to surrogateescape, I would first like to make sure that it's
really a very bad idea before changing it :-)


2017-12-07 7:49 GMT+01:00 INADA Naoki :
> I just came up with crazy idea; changing default error handler of open()
> to "surrogateescape" only when open mode is "w" or "a".

The idea is tempting but I'm not sure that it's a good idea. Moreover,
what about "r+" and "w+" modes?

I dislike getting a different behaviour for inputs and outputs. The
motivation for surrogateescape is to "pass through" undecodable bytes:
you need to handle them on the input side and on the output side.

That's why I decided to not only change sys.stdin error handler to
surrogateescape for the POSIX locale, but also sys.stdout:
https://bugs.python.org/issue19977


> When reading, "surrogateescape" error handler is dangerous because
> it can produce arbitrary broken unicode string by mistake.

I'm fine with that. I wouldn't say that it's the purpose of the PEP,
but sadly it's an expected, known and documented side effect.

You get the same behaviour with Unix command line tools and most
Python 2 applications (processing data as bytes). Nothing new under
the sun.

The PEP 540 allows users to write applications behaving like Unix
tools/Python 2 with the power of the Python 3 language and stdlib.

Again, use the Strict UTF-8 mode if you prioritize *correctness* over
*usability*.

Honestly, I'm not even sure that the Strict UTF-8 mode is *usable* in
practice, since we are all surrounded by old documents encoded to
various "legacy" encodings (where legacy means: "not UTF-8", like
Latin1 or ShiftJIS). The first non-ASCII character which is not
encoded to UTF-8 is going to "crash" the application (big traceback
with a Unicode error).


Maybe the problem is the feature name: "UTF-8 mode". Users may think
of "strict" when they read "UTF-8", since UTF-8 is known to be a
strict encoding. For example, UTF-8 is much stricter than latin1, which
cannot tell whether a document was encoded to latin1 or to anything else.
UTF-8 is able to tell if a document was actually encoded to UTF-8 or
not, thanks to the design of the encoding itself.
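
A short sketch of that property. The helper `decodes_as` and the sample text are made up for illustration.

```python
def decodes_as(data, encoding):
    """Return True if data is valid in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


sample = "d\u00e9j\u00e0 vu"
utf8_bytes = sample.encode("utf-8")
latin1_bytes = sample.encode("latin-1")

# UTF-8 rejects the Latin-1 bytes, so it can detect the wrong encoding.
print(decodes_as(utf8_bytes, "utf-8"), decodes_as(latin1_bytes, "utf-8"))

# Latin-1 accepts every byte sequence, so it silently produces mojibake
# when handed UTF-8 bytes.
print(decodes_as(utf8_bytes, "latin-1"), decodes_as(latin1_bytes, "latin-1"))
mojibake = utf8_bytes.decode("latin-1")
```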



> And it doesn't allow following code:
>
> with open("image.jpg", "r") as f:  # Binary data, not UTF-8
> return f.read()

Using a JPEG image, the example is obviously wrong.

But using surrogateescape on open() is meant for reading *text files*
which are mostly correctly encoded to UTF-8, except for a few bytes.

I'm not sure how to explain the issue. The Mercurial wiki page has a
good example of this issue that they call the "Makefile problem":
https://www.mercurial-scm.org/wiki/EncodingStrategy#The_.22makefile_problem.22

While it's not exactly the discussed issue, it gives you an idea of
the kind of issues that you have when you use open(filename,
encoding="utf-8", errors="strict") versus open(filename,
encoding="utf-8", errors="surrogateescape").
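
A minimal sketch of that kind of issue: editing one line of a mostly-UTF-8 file that contains a single stray Latin-1 byte. The file content and the helper `bump_optimization` are made up for illustration.

```python
import os
import tempfile

# A mostly-UTF-8 "Makefile" with one Latin-1 byte (0xE9) in a comment.
original = b"# auteur: Andr\xe9\nCFLAGS = -O1\n"


def bump_optimization(path, errors):
    """Rewrite -O1 to -O2 in place, decoding with the given error handler."""
    with open(path, encoding="utf-8", errors=errors, newline="") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8", errors=errors, newline="") as f:
        f.write(text.replace("-O1", "-O2"))


with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(original)
    path = tmp.name

try:
    try:
        bump_optimization(path, "strict")
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True  # the file cannot even be read

    # surrogateescape carries the odd byte through the edit unchanged.
    bump_optimization(path, "surrogateescape")
    with open(path, "rb") as f:
        edited = f.read()
finally:
    os.remove(path)
```

The strict version refuses to touch the file at all, while the surrogateescape version changes only the line that was edited and leaves the undecodable byte as it was.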


> I'm not sure about this is good idea.  And I don't know when is good for
> changing write error handler; only when PEP 538 or PEP 540 is used?
> Or always when os.fsencoding() is UTF-8?
>
> Any thoughts?

The PEP 538 doesn't affect the error handler of open(). The PEP 540 only
changes the error handler for the POSIX locale, it's a deliberate choice.
The PEP 538 is only enabled for the POSIX locale, and the PEP 540 will
also be enabled by default in this locale.

I dislike the idea of changing the error handler if the filesystem
encoding is UTF-8. The UTF-8 mode must be enabled explicitly, on
purpose. This reduces any risk of regression, and prepares users who
enable it for any potential issue.

Victor
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-07 Thread Victor Stinner

2017-12-06 5:07 GMT+01:00 INADA Naoki :
> And opening binary file without "b" option is very common mistake of new
> developers.  If default error handler is surrogateescape, they lose a chance
> to notice their bug.

To come back to your original point, I didn't know that it was a
common mistake to open binary files in text mode.

Honestly, I didn't try recently. How does Python behave when you do that?

Is it possible to write a full binary parser using the text mode? You
should quickly get issues pointing you to your mistake, no?

Victor

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-07 Thread Guido van Rossum

On Thu, Dec 7, 2017 at 3:02 PM, Victor Stinner 
wrote:

> 2017-12-06 5:07 GMT+01:00 INADA Naoki :
> > And opening binary file without "b" option is very common mistake of new
> > developers.  If default error handler is surrogateescape, they lose a
> > chance to notice their bug.
>
> To come back to your original point, I didn't know that it was a
> common mistake to open binary files in text mode.
>

It probably is because in Python 2 it makes no difference on UNIX, and on
Windows the only difference is that binary mode preserves \r.

> Honestly, I didn't try recently. How does Python behave when you do that?
>
> Is it possible to write a full binary parser using the text mode? You
> should quickly get issues pointing you to your mistake, no?
>

You will quickly get decoding errors, and that is INADA's point. (Unless
you use encoding='Latin-1'.) His worry is that the surrogateescape error
handler makes it so that you won't get decoding errors, and then the
failure mode is much harder to debug.
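
This can be reproduced with a few bytes standing in for a binary file.
The data below is a made-up JPEG-like header, not a real image, and
`read_as_text` is a helper invented for this sketch:

```python
import io

# Invented stand-in for binary content: JPEG magic bytes plus padding.
binary_data = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"


def read_as_text(data: bytes, **kwargs) -> str:
    """Read bytes through a text-mode wrapper, as open(path, "r") would."""
    return io.TextIOWrapper(io.BytesIO(data), **kwargs).read()


# Strict UTF-8: the mistake is reported immediately.
try:
    read_as_text(binary_data, encoding="utf-8", errors="strict")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# surrogateescape: no error is raised; the invalid bytes are smuggled
# through as lone surrogates and the bug surfaces later, if at all.
escaped = read_as_text(binary_data, encoding="utf-8",
                       errors="surrogateescape")

# Latin-1 never fails either, since every byte maps to a character.
latin1 = read_as_text(binary_data, encoding="latin-1")

print(type(strict_error).__name__)      # UnicodeDecodeError
print(ascii(escaped))
print(len(latin1) == len(binary_data))  # True
```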

-- 
--Guido van Rossum (python.org/~guido)

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-07 Thread Victor Stinner

2017-12-08 0:26 GMT+01:00 Guido van Rossum :
> You will quickly get decoding errors, and that is INADA's point. (Unless you
> use encoding='Latin-1'.) His worry is that the surrogateescape error handler
> makes it so that you won't get decoding errors, and then the failure mode is
> much harder to debug.

Hum, my question was more about whether Python fails because an
operation on strings fails where bytes were expected, or whether
Python fails with a decoding error... But now I'm not sure anymore
that this level of detail really matters.


Let me think out loud. To explain unicode issues, I like to use
filenames, since it's something that users view commonly, handle
directly and can modify (and so enter many non-ASCII characters like
diacritics and emojis ;-)).

Filenames can be found on the command line, in environment variables
(PYTHONSTARTUP), stdin (read a list of files from stdin), stdout
(write the list of files into stdout), but also in text files (the
Mercurial "makefile problem").

I consider that the command line and environment variables should
"just work" and so use surrogateescape. It would be too annoying to
not even be able to *start* Python because of a Unicode error. For
example, it wouldn't be easy to identify which environment variable
causes the issue. Fortunately, the UTF-8 mode doesn't change anything here:
surrogateescape has been used since Python 3.3 for the command line
and environment variables.
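
The round trip that makes this safe can be sketched as follows. The byte
string is an invented Latin-1 file name, and the explicit decode/encode
calls only stand in for what happens at the system boundary when the
filesystem encoding is UTF-8:

```python
# Invented example: a file name encoded in Latin-1, not UTF-8.
raw_name = b"r\xe9sum\xe9.txt"

# Decoding never fails: each undecodable byte 0xNN becomes U+DCNN.
name = raw_name.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly,
# so the name can still be passed back to the operating system.
restored = name.encode("utf-8", "surrogateescape")

# A strict encoder refuses the lone surrogates instead.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(ascii(name))           # 'r\udce9sum\udce9.txt'
print(restored == raw_name)  # True
print(strict_ok)             # False
```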

For stdin/stdout, I think that the main motivation here is to write
Unix command line tools using Python 3: pass-through undecodable bytes
without bugging the user with Unicode. Users don't use stdin and
stdout as regular files, they are more used as pipes to pass data
between programs with the Unix pipe in a shell like "producer |
consumer". Sometimes stdout is redirected to a file, but I consider
that it is expected to behave as a pipe and the regular TTY stdout.
IMHO we are still in the safe surrogateescape area (for the specific
case of the UTF-8 mode).
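
As an illustration of the pass-through idea, here is an in-memory sketch
of a filter sitting in a "producer | filter | consumer" pipeline. The
function name and the sample data are invented; real code would use
sys.stdin and sys.stdout, which are replaced here by BytesIO objects so
that the example is self-contained:

```python
import io


def upper_filter(stdin_bytes: bytes) -> bytes:
    """Uppercase each line while passing undecodable bytes through."""
    stdin = io.TextIOWrapper(
        io.BytesIO(stdin_bytes), encoding="utf-8",
        errors="surrogateescape", newline="",
    )
    out_buffer = io.BytesIO()
    stdout = io.TextIOWrapper(
        out_buffer, encoding="utf-8",
        errors="surrogateescape", newline="",
    )
    for line in stdin:
        stdout.write(line.upper())  # ordinary text processing on str
    stdout.flush()
    return out_buffer.getvalue()


# One valid UTF-8 name and one Latin-1 name (invented sample data).
result = upper_filter(b"caf\xc3\xa9.txt\nna\xefve.txt\n")
print(result)  # b'CAF\xc3\x89.TXT\nNA\xefVE.TXT\n'
```

The valid UTF-8 name is processed as text, and the stray 0xEF byte of the
second name reaches the consumer unchanged instead of stopping the
pipeline.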


Ok, now comes the real question, open().

For open(), I used the example of a code snippet *writing* the content
of a directory (os.listdir) into a text file. Another example is to
read filenames from a text file but pass through undecodable bytes
thanks to surrogateescape.

But Naoki explained that open() is commonly misused to open binary
files and Python should somehow fail badly to notify the developer of
their mistake.

If I have to choose between the two categories of usage of
open(), "read undecodable bytes in UTF-8 from a text file" versus
"misuse open() on binary file", I expect that the latter is more common
and that open() shouldn't use surrogateescape by default.

While stdin and stdout are usually associated to Unix pipes and Unix
tools working on bytes, files are more commonly associated to
important data that must not be lost nor corrupted. Python is expected
to "help" the developer to use the proper options to read content from
a file and to write content into a file. So I understand that open()
should use the "strict" error handler in the UTF-8 mode, rather than
"surrogateescape".

I can live with this "tiny" change to my PEP. I just posted a 3rd
version of my PEP where the open() error handler remains strict (it is
no longer changed by the PEP).

Victor

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-07 Thread Glenn Linderman


On 12/7/2017 4:48 PM, Victor Stinner wrote:


> Ok, now comes the real question, open().
>
> For open(), I used the example of a code snippet *writing* the content
> of a directory (os.listdir) into a text file. Another example is to
> read filenames from a text file but pass through undecodable bytes
> thanks to surrogateescape.
>
> But Naoki explained that open() is commonly misused to open binary
> files and Python should somehow fail badly to notify the developer of
> their mistake.


So the real problem here is that open has a default mode of text. 
Instead of forcing the user to specify either "text" or "binary" when 
opening, text is used as a default, binary as an option to be specified.


I understand that default has a long history in Unix-land, dating at 
least as far back as 1977 when I first learned how to use the Unix open() 
function.


And now it would be an incompatible change to change it.

The real question is whether or not it is a good idea to change it... at 
this point in time, with Unicode and UTF-8 so prevalent, text and binary 
modes are far different than back in 1977, when they mostly just 
documented that this was a binary file that was being opened, and that 
one could more likely expect to see read() than fgets() in the following 
code.


If it were to be changed, one could add a text-mode option in 3.7, say 
"t" in the mode string, and a PendingDeprecationWarning for open calls 
without the specification of either t or b in the mode string.


In 3.8, the warning would be changed to DeprecationWarning.

In 3.9, all open calls would need to have either t or b, or would fail.

Meanwhile, back on the PEP 540 ranch, text mode open calls could 
immediately use surrogateescape, binary mode open calls would not, and 
unspecified open calls would not.
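
The first step of such a transition could look roughly like the sketch
below. The wrapper `explicit_open` is hypothetical and exists only to
illustrate the idea; nothing like it is part of PEP 540 or of the
standard library:

```python
import os
import tempfile
import warnings


def explicit_open(file, mode="r", *args, **kwargs):
    """Hypothetical wrapper: warn when neither "t" nor "b" is in mode."""
    if "t" not in mode and "b" not in mode:
        warnings.warn(
            'open() mode should state "t" (text) or "b" (binary) explicitly',
            PendingDeprecationWarning,
            stacklevel=2,
        )
    return open(file, mode, *args, **kwargs)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        with explicit_open(path, "w", encoding="utf-8") as f:   # warns
            f.write("x")
        with explicit_open(path, "rt", encoding="utf-8") as f:  # silent
            f.read()
        with explicit_open(path, "rb") as f:                    # silent
            f.read()

pending = [w for w in caught
           if issubclass(w.category, PendingDeprecationWarning)]
print(len(pending))  # 1
```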

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-07 Thread Jonathan Goble

On Thu, Dec 7, 2017 at 8:38 PM Glenn Linderman 
wrote:

> If it were to be changed, one could add a text-mode option in 3.7, say "t"
> in the mode string, and a PendingDeprecationWarning for open calls without
> the specification of either t or b in the mode string.
>

"t" is already supported in open()'s mode argument [1] as a way to
explicitly request text mode, though it's essentially ignored right now
since text is the default anyway. So since the option is already present,
the only thing needed at this stage for your plan would be to begin
deprecating not using it.

*goes back to lurking*

[1] https://docs.python.org/3/library/functions.html#open
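
A quick check of that statement, using a throwaway file created only for
this example: "r" and "rt" return the same kind of object and read the
same data.

```python
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    with open(path, "wt", encoding="utf-8") as f:  # explicit text mode
        f.write("hello\n")

    with open(path, "r", encoding="utf-8") as implicit, \
            open(path, "rt", encoding="utf-8") as explicit:
        # Both calls produce a text wrapper and yield identical content.
        same_type = type(implicit) is type(explicit) is io.TextIOWrapper
        same_data = implicit.read() == explicit.read()

print(same_type, same_data)  # True True
```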

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-07 Thread Glenn Linderman


On 12/7/2017 5:45 PM, Jonathan Goble wrote:
> On Thu, Dec 7, 2017 at 8:38 PM Glenn Linderman wrote:
>
> > If it were to be changed, one could add a text-mode option in 3.7,
> > say "t" in the mode string, and a PendingDeprecationWarning for
> > open calls without the specification of either t or b in the mode
> > string.
>
> "t" is already supported in open()'s mode argument [1] as a way to
> explicitly request text mode, though it's essentially ignored right
> now since text is the default anyway. So since the option is already
> present, the only thing needed at this stage for your plan would be to
> begin deprecating not using it.
>
> *goes back to lurking*
>
> [1] https://docs.python.org/3/library/functions.html#open


Thanks for briefly de-lurking.

So then for PEP 540... use surrogateescape immediately for t mode.

Then, when the user encounters an encoding error, there would be three 
solutions: switch to t mode, explicitly switch to surrogateescape, or 
fix the locale.

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-07 Thread Chris Barker - NOAA Federal

I’m a bit confused:

File names and the like are one thing, and the CONTENTS of files is quite
another.

I get that there is theoretically a “default” encoding for the contents of
text files, but that is SO likely to be wrong as to be ignorable.

open() already defaults to utf-8. Which is a fine default if you are going
to have one, but it seems a bad idea to have it default to surrogateescape
EVER, regardless of the locale or anything else.

If the file is binary, or a different encoding, or simply broken, it’s much
better to get an encoding error as soon as possible.

Why does this have anything to do with the PEP?

Perhaps the issue of reading a filename from the system, writing it to a
file, then reading it back in again.

I actually do that a lot — but mostly so I can pass that file to another
system, so I really don’t want broken encoding in it anyway.

-CHB



Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-07 Thread Chris Barker - NOAA Federal


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-07 Thread INADA Naoki

shreplace  UTF-8/backslashreplace
>   ===
> ==  ==
>
> By comparison, Python 3.6 uses:
>
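> ============================  ======================  ==========================
> Function                      Default                 Legacy Windows FS encoding
> ============================  ======================  ==========================
> open()                        mbcs/strict             mbcs/strict
> os.fsdecode(), os.fsencode()  UTF-8/surrogatepass     **mbcs/replace**
> sys.stdin, sys.stdout         UTF-8/surrogateescape   UTF-8/surrogateescape
> sys.stderr                    UTF-8/backslashreplace  UTF-8/backslashreplace
> ============================  ======================  ==========================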
>
> The "Legacy Windows FS encoding" is enabled by the
> ``PYTHONLEGACYWINDOWSFSENCODING`` environment variable.
>
> If stdin and/or stdout is redirected to a pipe, ``sys.stdin`` and/or
> ``sys.stdout`` use ``mbcs`` encoding by default rather than UTF-8. But
> in the UTF-8 mode, ``sys.stdin`` and ``sys.stdout`` always use the UTF-8
> encoding.
>
> .. note::
>    There is no POSIX locale on Windows. The ANSI code page is used as the
>    locale encoding, and this code page never uses the ASCII encoding.
>
>
> Annex: Differences between PEP 538 and PEP 540
> ==============================================
>
> PEP 538's locale coercion is only effective if a suitable UTF-8
> based locale is available as a coercion target. PEP 540's
> UTF-8 mode can be enabled even for operating systems that don't
> provide a suitable platform locale (such as CentOS 7).
>
> PEP 538 only changes the interpreter's behaviour for the C locale. While the
> new UTF-8 mode of this PEP is only enabled by default in the C locale, it can
> also be enabled manually for any other locale.
>
> PEP 538 is implemented with ``setlocale(LC_CTYPE, "")`` and
> ``setenv("LC_CTYPE", "")``, so any non-Python code running
> in the process and any subprocesses that inherit the environment is impacted
> by the change. PEP 540 is implemented in Python internals and ignores the
> locale: non-Python running in the same process is not aware of the
> "Python UTF-8 mode". The benefit of the PEP 538 approach is that it helps
> ensure that encoding handling in binary extension modules and subprocesses
> is consistent with CPython's encoding handling. The upside of the PEP 540
> approach is that it allows an embedding application to change the
> interpreter's behaviour without having to change the process global
> locale settings.
>
>
> Links
> =====
>
> * `bpo-29240: Implementation of the PEP 540: Add a new UTF-8 mode
>   <http://bugs.python.org/issue29240>`_
> * `PEP 538 <https://www.python.org/dev/peps/pep-0538/>`_:
>   "Coercing the legacy C locale to C.UTF-8"
> * `PEP 529 <https://www.python.org/dev/peps/pep-0529/>`_:
>   "Change Windows filesystem encoding to UTF-8"
> * `PEP 528 <https://www.python.org/dev/peps/pep-0528/>`_:
>   "Change Windows console encoding to UTF-8"
> * `PEP 383 <https://www.python.org/dev/peps/pep-0383/>`_:
>   "Non-decodable Bytes in System Character Interfaces"
>
>
> Post History
> ============
>
> * 2017-12: `[Python-Dev] PEP 540: Add a new UTF-8 mode
>   <https://mail.python.org/pipermail/python-dev/2017-December/151054.html>`_
> * 2017-04: `[Python-Dev] Proposed BDFL Delegate update for PEPs 538 &
>   540 (assuming UTF-8 for *nix system boundaries)
>   <https://mail.python.org/pipermail/python-dev/2017-April/147795.html>`_
> * 2017-01: `[Python-ideas] PEP 540: Add a new UTF-8 mode
>   <https://mail.python.org/pipermail/python-ideas/2017-January/044089.html>`_
> * 2017-01: `bpo-28180: Implementation of the PEP 538: coerce C locale to
>   C.utf-8 (msg284764) <https://bugs.python.org/issue28180#msg284764>`_
> * 2016-08-17: `bpo-27781: Change sys.getfilesystemencoding() on Windows
>   to UTF-8 (msg272916) <https://bugs.python.org/issue27781#msg272916>`_
>   -- Victor proposed ``-X utf8`` for the :pep:`529` (Change Windows
>   filesystem encoding to UTF-8)
>
>
> Copyright
> =========
>
> This document has been placed in the public domain.

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-07 Thread INADA Naoki

> Or locale.getpreferredencoding() returns UTF-8 in UTF-8 mode too?

Or should we change locale.getpreferredencoding() to return UTF-8
instead of ASCII always, regardless of PEP 538 and 540?

INADA Naoki  

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-07 Thread Greg Ewing


Victor Stinner wrote:

> Users don't use stdin and
> stdout as regular files, they are more used as pipes to pass data
> between programs with the Unix pipe in a shell like "producer |
> consumer". Sometimes stdout is redirected to a file, but I consider
> that it is expected to behave as a pipe and the regular TTY stdout.


It seems weird to me to make a distinction between stdin/stdout
connected to a file and accessing the file some other way.

It would be surprising, for example, if the following two
commands behaved differently with respect to encoding:

   cat foo | sort

   cat < foo | sort


> But Naoki explained that open() is commonly misused to open binary
> files and Python should somehow fail badly to notify the developer of
> their mistake.


Maybe if you *explicitly* open the file in text mode it
should default to surrogateescape, but use strict if text
mode is being used by default?

I.e.

   open("foo", "rt") --> surrogateescape
   open("foo")   --> strict

That way you can easily open a file in a way that's
compatible with the way stdin/stdout behave, but you
will get bitten if you mistakenly open a binary file
as text.
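
A sketch of how that rule might behave, written as a wrapper around the
real open(). The function `open_explicit_text` and the sample file are
hypothetical; this only illustrates the suggestion and is not something
the PEP defines:

```python
import os
import tempfile


def open_explicit_text(file, mode="r", **kwargs):
    """Hypothetical open(): an explicit "t" opts in to surrogateescape."""
    if "b" not in mode:
        handler = "surrogateescape" if "t" in mode else "strict"
        kwargs.setdefault("errors", handler)
    return open(file, mode, **kwargs)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "foo")
    with open(path, "wb") as f:
        f.write(b"abc\xff\n")  # one byte that is not valid UTF-8

    # Explicit text mode: the bad byte is passed through as U+DCFF.
    with open_explicit_text(path, "rt", encoding="utf-8") as f:
        explicit_text = f.read()

    # Default mode: the same file fails loudly.
    try:
        with open_explicit_text(path, encoding="utf-8") as f:
            f.read()
        default_failed = False
    except UnicodeDecodeError:
        default_failed = True

print(ascii(explicit_text))  # 'abc\udcff\n'
print(default_failed)        # True
```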

--
Greg


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-08 Thread Victor Stinner

Hi,

Oh, locale.getpreferredencoding(), that's a good question :-)

2017-12-08 6:02 GMT+01:00 INADA Naoki :
> But I want to clarify more about difference/relationship between PEP
> 538 and 540.
>
> If I understand correctly:
>
> Both of PEP 538 (locale coercion) and PEP 540 (UTF-8 mode) shares
> same logic to detect POSIX locale.
>
> When POSIX locale is detected, locale coercion is tried first. And if
> locale coercion
> succeeds,  UTF-8 mode is not used because locale is not POSIX anymore.

No, I would like to enable the UTF-8 mode as well in this case.

In short, locale coercion and UTF-8 mode will be both enabled by the
POSIX locale.


> If locale coercion is disabled or failed, UTF-8 mode is used automatically,
> unless it is disabled explicitly.

PEP 540 is always enabled if the POSIX locale is detected. Only
PYTHONUTF8=0 or -X utf8=0 disable it in this case.
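
For readers who want to check the switch themselves, here is a small
sketch that starts child interpreters with and without the option and
reads back sys.flags.utf8_mode. It assumes an interpreter that
implements the PEP (Python 3.7 or later); the helper name is invented:

```python
import subprocess
import sys

CODE = "import sys; print(sys.flags.utf8_mode)"


def utf8_mode_with(*options: str) -> str:
    """Run a child interpreter with the given options, return its flag."""
    proc = subprocess.run(
        [sys.executable, *options, "-c", CODE],
        stdout=subprocess.PIPE, universal_newlines=True, check=True,
    )
    return proc.stdout.strip()


print(utf8_mode_with("-X", "utf8"))    # 1
print(utf8_mode_with("-X", "utf8=0"))  # 0
```

Without any option the result depends on the locale of the calling
environment, which is why only the two explicit cases are shown.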

Disabling locale coercion doesn't disable the PEP 540.


> UTF-8 mode is similar to C.UTF-8 or other locale coercion target locales.
> But UTF-8 mode is different from C.UTF-8 locale in these ways because
> actual locale is not changed:
>
> * Libraries using locale (e.g. readline) works as in POSIX locale.  So UTF-8
>   cannot be used in such libraries.

My assumption is that very few C libraries rely on the locale encoding.
The wchar_t* type is rarely used. You may only get issues if Python
passes a UTF-8 encoded string to a C library which tries to decode it from
the locale encoding which is not UTF-8. For example, with the POSIX
locale, if the locale encoding is ASCII, you can get a decoding error
if a C library tries to decode a UTF-8 encoded string coming from
Python.

But the encoding problem is not restricted to the current process. For
the "producer | consumer" model, if the producer is a Python 3.7
application using UTF-8 mode and so encoding text to UTF-8 to stdout,
the consumer application may be unable to decode the UTF-8 data. Here we
enter the grey area of encodings. Which applications rely on the locale
encoding? Which applications always use UTF-8? Do some applications
try UTF-8 first, and fall back on the locale encoding? (OpenSSL does
that on filenames for example, as does glib if I recall correctly.)

Until we know exactly how UTF-8 is used in the "wild", I chose to make
the UTF-8 mode an opt-in option for locales other than POSIX. I expect a
few bug reports later which will help us to adjust our encodings.

> * locale.getpreferredencoding() returns 'ASCII' instead of 'UTF-8'.  So
>   libraries depending on locale.getpreferredencoding() may raise
>   UnicodeErrors.

Right.


> Or locale.getpreferredencoding() returns UTF-8 in UTF-8 mode too?

Here is where the PEP 538 plays very nicely with the PEP 540. On
platforms where the locale coercion is supported (Fedora, macOS,
FreeBSD, maybe other Linux distributions), with the POSIX locale,
locale.getpreferredencoding() will return UTF-8 and functions like
mbstowcs() will use the UTF-8 encoding internally.

Currently, in the implementation of my PEP 540, I chose to modify
open() to use UTF-8 if the UTF-8 mode is used, rather than using
locale.getpreferredencoding().

Victor

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-08 Thread Victor Stinner

2017-12-08 6:11 GMT+01:00 INADA Naoki :
> Or should we change locale.getpreferredencoding() to return UTF-8
> instead of ASCII always, regardless of PEP 538 and 540?

On the POSIX locale, if the locale coercion works (PEP 538),
locale.getpreferredencoding() returns UTF-8. We are good.

The question is for platforms like Centos 7 where the locale coercion
(PEP 538) doesn't work and so Python uses UTF-8 (PEP 540), whereas the
locale probably uses ASCII (or maybe Latin1).

My current implementation of the PEP 540 is cheating for open(): if
sys.flags.utf8_mode is non-zero, use the UTF-8 encoding rather than
calling locale.getpreferredencoding().
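
In Python terms, the choice described above amounts to something like
the following sketch. The function name is invented and this is not the
CPython source, only an illustration of the decision:

```python
import locale
import sys


def default_text_encoding() -> str:
    """Pick the default encoding for text files, as described above."""
    # sys.flags.utf8_mode only exists on interpreters implementing the
    # PEP, so fall back to 0 (disabled) elsewhere.
    if getattr(sys.flags, "utf8_mode", 0):
        return "utf-8"
    # Otherwise keep the historical behaviour: ask the locale.
    return locale.getpreferredencoding(False)


print(default_text_encoding())
```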

I checked the stdlib, and I found many places where
locale.getpreferredencoding() is used to get the user preferred
encoding:

* builtin open(): default encoding
* cgi.FieldStorage: encode the query string
* encodings._alias_mbcs(): check if the requested encoding is the ANSI code page
* gettext.GNUTranslations: lgettext() and lngettext() methods
* xml.etree.ElementTree: ElementTree.write(encoding='unicode')

In the UTF-8 mode, I would expect that cgi, gettext and xml.etree all
use the UTF-8 encoding by default. So locale.getpreferredencoding()
should return UTF-8 if the UTF-8 mode is enabled.

The private _alias_mbcs() method can be modified to call directly
_locale._getdefaultlocale()[1] to get the ANSI code page.

Question: do we need to add an option to getpreferredencoding() to
return the locale encoding even if the UTF-8 mode is enabled? If yes,
what should be the API? locale.getpreferredencoding(utf8_mode=False)?

Victor

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-08 Thread INADA Naoki

On Fri, Dec 8, 2017 at 7:22 PM, Victor Stinner  wrote:
>>
>> Both of PEP 538 (locale coercion) and PEP 540 (UTF-8 mode) shares
>> same logic to detect POSIX locale.
>>
>> When POSIX locale is detected, locale coercion is tried first. And if
>> locale coercion
>> succeeds,  UTF-8 mode is not used because locale is not POSIX anymore.
>
> No, I would like to enable the UTF-8 mode as well in this case.
>
> In short, locale coercion and UTF-8 mode will be both enabled by the
> POSIX locale.
>

Hm, it is a bit surprising because I thought UTF-8 mode was a fallback
for locale coercion when coercion has failed or is disabled.

As per PEP 538 [1], all coercion target locales use surrogateescape
for stdin and stdout.
So, do you mean "UTF-8 mode is enabled at the flag level, but it has no
real effect"?

[1]: 
https://www.python.org/dev/peps/pep-0538/#changes-to-the-default-error-handling-on-the-standard-streams

Since coercion target locales and UTF-8 mode do the same thing,
I think this is not a big issue.
But I want it to be clarified in the PEP.

Regards,
---
INADA Naoki  

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-08 Thread Victor Stinner

2017-12-08 15:01 GMT+01:00 INADA Naoki :
>> In short, locale coercion and UTF-8 mode will be both enabled by the
>> POSIX locale.
>
> Hm, it is bit surprising because I thought UTF-8 mode is fallback
> of locale coercion when coercion is failed or disabled.

I rewrote the "differences between the PEP 538 and the PEP 540" as a
new section "Relationship with the locale coercion (PEP 538)".

https://www.python.org/dev/peps/pep-0540/#relationship-with-the-locale-coercion-pep-538

"""
Relationship with the locale coercion (PEP 538)
===============================================

The POSIX locale enables the locale coercion (PEP 538) and the UTF-8
mode (PEP 540). When the locale coercion is enabled, enabling the UTF-8
mode has no (additional) effect.

Locale coercion only impacts non-Python code like C libraries, whereas
the Python UTF-8 Mode only impacts Python code: the two PEPs are
complementary.

On platforms where locale coercion is not supported like Centos 7, the
POSIX locale only enables the UTF-8 Mode. In this case, Python code uses
the UTF-8 encoding and ignores the locale encoding, whereas non-Python
code uses the locale encoding which is usually ASCII for the POSIX
locale.

While the UTF-8 Mode is supported on all platforms and can be enabled
with any locale, the locale coercion is not supported by all platforms
and is restricted to the POSIX locale.

The UTF-8 Mode only has an impact on Python child processes when the
``PYTHONUTF8`` environment variable is set to ``1``, whereas the locale
coercion sets the ``LC_CTYPE`` environment variable, which impacts all
child processes.

The benefit of the locale coercion approach is that it helps ensure that
encoding handling in binary extension modules and child processes is
consistent with Python's encoding handling. The upside of the UTF-8 Mode
approach is that it allows an embedding application to change the
interpreter's behaviour without having to change the process global
locale settings.
"""

I hope that it's now better explained.

In short, the two PEPs are really complementary.
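
The point about child processes can be sketched like this: a child is
started with ``PYTHONUTF8=1`` in its environment and in turn starts a
grandchild without passing any option. The snippet assumes an
interpreter that implements the PEP (Python 3.7 or later); the variable
names are invented for the example:

```python
import os
import subprocess
import sys

# The grandchild only reports its own flag.
GRANDCHILD = "import sys; print(sys.flags.utf8_mode)"
# The child starts the grandchild without any -X option.
CHILD = (
    "import subprocess, sys; "
    "subprocess.run([sys.executable, '-c', %r], check=True)" % GRANDCHILD
)

# PYTHONUTF8 lives in the environment, so it is inherited down the chain.
env = dict(os.environ, PYTHONUTF8="1")
grandchild_flag = subprocess.run(
    [sys.executable, "-c", CHILD],
    env=env, stdout=subprocess.PIPE, universal_newlines=True, check=True,
).stdout.strip()

print(grandchild_flag)  # 1
```

A ``-X utf8`` command line option, by contrast, applies only to the
process it is given to.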

> As PEP 538 [1], all coercion target locales uses surrogateescape
> for stdin and stdout.
> So, do you mean "UTF-8 mode enabled as flag level, but it has no
> real effects"?

Right and it was a deliberate choice of Nick Coghlan when he designed
the PEP 538, to make sure that the two PEPs are complementary and
"compatible".

Victor

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-08 Thread Victor Stinner

I updated my PEP: in the 4th version, locale.getpreferredencoding()
now returns 'UTF-8' in the UTF-8 Mode.

https://www.python.org/dev/peps/pep-0540/

I also clarified the direct effects of the UTF-8 Mode, but also listed
the most user visible changes as "Side effects".

"""
Effects of the UTF-8 Mode:

* ``sys.getfilesystemencoding()`` returns ``'UTF-8'``.
* ``locale.getpreferredencoding()`` returns ``UTF-8``, its
  *do_setlocale* argument and the locale encoding are ignored.
* ``sys.stdin`` and ``sys.stdout`` error handler is set to
  ``surrogateescape``

Side effects:

* ``open()`` uses the UTF-8 encoding by default.
* ``os.fsdecode()`` and ``os.fsencode()`` use the UTF-8 encoding.
* Command line arguments, environment variables and filenames use the
  UTF-8 encoding.
"""

Thank you Naoki INADA for your quick feedback, it was very helpful
and I really like how the PEP evolves!

IMHO the PEP 540 version 4 is just perfect and ready for
pronouncement! (... until someone finds another flaw, obviously!)
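
The effects listed above can be observed from a child interpreter
started with ``-X utf8``. This is only a sketch and assumes an
interpreter that implements the PEP (Python 3.7 or later); encoding
names are normalised with codecs.lookup() because their exact spelling
("UTF-8" or "utf-8") may differ between versions:

```python
import codecs
import json
import os
import subprocess
import sys

REPORT = (
    "import json, locale, sys; "
    "print(json.dumps({"
    "'fs': sys.getfilesystemencoding(), "
    "'preferred': locale.getpreferredencoding(), "
    "'stdin_errors': sys.stdin.errors, "
    "'stdout_errors': sys.stdout.errors}))"
)

# Drop PYTHONIOENCODING so it cannot override the stream settings.
env = {k: v for k, v in os.environ.items() if k != "PYTHONIOENCODING"}
proc = subprocess.run(
    [sys.executable, "-X", "utf8", "-c", REPORT],
    env=env, stdin=subprocess.DEVNULL, stdout=subprocess.PIPE,
    universal_newlines=True, check=True,
)
report = json.loads(proc.stdout)

# Normalise spellings such as "UTF-8" and "utf-8" before comparing.
fs_name = codecs.lookup(report["fs"]).name
preferred_name = codecs.lookup(report["preferred"]).name

print(fs_name, preferred_name)
print(report["stdin_errors"], report["stdout_errors"])
```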

Victor


2017-12-08 13:58 GMT+01:00 Victor Stinner :
> 2017-12-08 6:11 GMT+01:00 INADA Naoki :
>> Or should we change loale.getpreferredencoding() to return UTF-8
>> instead of ASCII always, regardless of PEP 538 and 540?
>
> On the POSIX locale, if the locale coercion works (PEP 538),
> locale.getpreferredencoding() returns UTF-8. We are good.
>
> The question is for platforms like CentOS 7 where the locale coercion
> (PEP 538) doesn't work and so Python uses UTF-8 (PEP 540), whereas the
> locale probably uses ASCII (or maybe Latin1).
>
> My current implementation of the PEP 540 is cheating for open(): if
> sys.flags.utf8_mode is non-zero, use the UTF-8 encoding rather than
> calling locale.getpreferredencoding().
>
> I checked the stdlib, and I found many places where
> locale.getpreferredencoding() is used to get the user preferred
> encoding:
>
> * builtin open(): default encoding
> * cgi.FieldStorage: encode the query string
> * encodings._alias_mbcs(): check if the requested encoding is the ANSI
>   code page
> * gettext.GNUTranslations: lgettext() and lngettext() methods
> * xml.etree.ElementTree: ElementTree.write(encoding='unicode')
>
> In the UTF-8 mode, I would expect that cgi, gettext and xml.etree all
> use the UTF-8 encoding by default. So locale.getpreferredencoding()
> should return UTF-8 if the UTF-8 mode is enabled.
>
> The private _alias_mbcs() method can be modified to call directly
> _locale._getdefaultlocale()[1] to get the ANSI code page.
>
> Question: do we need to add an option to getpreferredencoding() to
> return the locale encoding even if the UTF-8 mode is enabled? If yes,
> what should be the API? locale.getpreferredencoding(utf8_mode=False)?
>
> Victor
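
The question above (how to get the real locale encoding when the UTF-8 Mode
is enabled) is left open in this message. As a hedged sketch only: later
Python releases (3.11 and newer) provide ``locale.getencoding()``, which is
documented to ignore the UTF-8 Mode. The wrapper name ``locale_encoding``
below is hypothetical, and the fallbacks are assumptions for older versions,
not something proposed in this thread.

```python
import locale


def locale_encoding() -> str:
    """Best-effort name of the locale encoding, ignoring the UTF-8 Mode."""
    getencoding = getattr(locale, "getencoding", None)
    if getencoding is not None:
        # Python 3.11+: documented to ignore the UTF-8 Mode.
        return getencoding()
    if hasattr(locale, "nl_langinfo"):
        # POSIX fallback: encoding of the current LC_CTYPE locale.
        return locale.nl_langinfo(locale.CODESET)
    # Last resort; on Python 3.7+ this one does follow the UTF-8 Mode.
    return locale.getpreferredencoding(False)


if __name__ == "__main__":
    print("locale encoding:", locale_encoding())
    print("preferred encoding:", locale.getpreferredencoding(False))
```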

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-08 Thread Victor Stinner

2017-12-08 16:22 GMT+01:00 Victor Stinner :
> I updated my PEP: in the 4th version, locale.getpreferredencoding()
> now returns 'UTF-8' in the UTF-8 Mode.

Sorry, I forgot to mention that I already updated the implementation
to the latest version of the PEP:
https://github.com/python/cpython/pull/855

Victor

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-08 Thread Ethan Furman


There were some concerns about open() earlier:

On Wed, 6 Dec 2017 at 06:10 INADA Naoki wrote:
> I think PEP 538 and PEP 540 should behave almost identical except
> changing locale or not.  So I need very strong reason if PEP 540
> changes default error handler of open().

Brett replied:
> I don't have enough locale experience to weigh in as an expert,
> but I already was leaning towards INADA-san's logic of not wanting
> to change open() and this makes me really not want to change it.

On 12/08/2017 07:22 AM, Victor Stinner wrote:

"""
Effects of the UTF-8 Mode:

[...]

Side effects:

* ``open()`` uses the UTF-8 encoding by default.
"""

For those of us trying to follow along, is this change to open() the one that
Inada-san was worried about?  Has something else changed?


--
~Ethan~

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-08 Thread Victor Stinner

2017-12-08 17:29 GMT+01:00 Ethan Furman :
> For those of us trying to follow along, is this change to open() one that
> Inada-san was worried about?  Has something else changed?

I agree that my PEP is evolving quickly, that's why I added a "Version
History" at the end:
https://www.python.org/dev/peps/pep-0540/#version-history

"""
Version History
===

* Version 4: ``locale.getpreferredencoding()`` now returns ``'UTF-8'``
  in the UTF-8 Mode.
* Version 3: The UTF-8 Mode does not change the ``open()`` default error
  handler (``strict``) anymore, and the Strict UTF-8 Mode has been
  removed.
* Version 2: Rewrite the PEP from scratch to make it much shorter and
  easier to understand.
* Version 1: First version posted to python-dev.
"""

Naoki disliked the usage of the surrogateescape error handler for
open(). I "fixed" this in the PEP version 3: open() error handler is
not modified by the PEP.

Victor
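
To make the difference between the two error handlers discussed above
concrete, here is a small sketch (the sample bytes and helper names are
chosen for illustration only). With ``strict``, undecodable bytes raise an
exception; with ``surrogateescape``, they are smuggled through as lone
surrogates and can be written back unchanged.

```python
RAW = b"caf\xe9"  # Latin-1 encoded bytes, not valid UTF-8


def decode_strict(data: bytes) -> str:
    """Decode like open() does by default: invalid bytes raise an error."""
    return data.decode("utf-8", errors="strict")


def decode_escaped(data: bytes) -> str:
    """Decode like stdin in the UTF-8 Mode: invalid bytes become surrogates."""
    return data.decode("utf-8", errors="surrogateescape")


try:
    decode_strict(RAW)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

text = decode_escaped(RAW)
# Encoding with the same handler restores the original bytes exactly.
round_trip = text.encode("utf-8", errors="surrogateescape")

if __name__ == "__main__":
    print("strict raised:", strict_failed)
    print("escaped text:", ascii(text))
    print("round trip ok:", round_trip == RAW)
```

This is the trade-off behind version 3 of the PEP: surrogateescape avoids
errors on the standard streams, while open() keeps ``strict`` so that
invalid data in files is reported instead of silently propagated.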

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-08 Thread Nick Coghlan

On 9 December 2017 at 01:22, Victor Stinner  wrote:
> I updated my PEP: in the 4th version, locale.getpreferredencoding()
> now returns 'UTF-8' in the UTF-8 Mode.

+1, that's a good change, since it brings the "locale coercion failed"
case even closer to the "locale coercion succeeded" behaviour.

To continue with the CentOS 7 example: that actually does use a UTF-8
based locale by default, it's just en_US.UTF-8 rather than C.UTF-8.

Earlier versions of PEP 538 thus included "en_US.UTF-8" on the
candidate target locale list, but that turned out to cause assorted
problems due to the "C -> en_US" part of the coercion.

Cheers,
Nick.

P.S. Thinking back on the history of the changes though, it may be
worth revisiting the idea of "en_US.UTF-8" as a potential coercion
locale: it was dropped as a potential coercion target back when the
PEP still set both LANG & LC_ALL, whereas it now changes only
LC_CTYPE. That means setting it won't mess with LC_COLLATE, or any of
the other locale categories. That said, I'm not sure if there are
behavioural differences between "LC_CTYPE=C.UTF-8" and
"LC_CTYPE=en_US.UTF-8", so I'm inclined to leave that alone for now.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-09 Thread INADA Naoki

Now I'm OK to accept the PEP, except one nitpick.

>
> Locale coercion only impacts non-Python code like C libraries, whereas
> the Python UTF-8 Mode only impacts Python code: the two PEPs are
> complementary.
>

This sentence seems a bit misleading.
If UTF-8 mode is disabled explicitly, locale coercion affects Python code too.
locale.getpreferredencoding() is UTF-8, open()'s default encoding is UTF-8,
and stdio is UTF-8/surrogateescape.

So shouldn't this sentence be: "Locale coercion impacts both Python code
and non-Python code like C libraries, whereas ..."?

INADA Naoki  

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-09 Thread INADA Naoki

> Earlier versions of PEP 538 thus included "en_US.UTF-8" on the
> candidate target locale list, but that turned out to cause assorted
> problems due to the "C -> en_US" part of the coercion.

Hm, but PEP 538 says:

> this PEP instead proposes to extend the "surrogateescape" default for
> stdin and stderr error handling to also apply to the three potential
> coercion target locales.

https://www.python.org/dev/peps/pep-0538/#defaulting-to-surrogateescape-error-handling-on-the-standard-io-streams

I don't think en_US.UTF-8 should use the surrogateescape error handler.

Regards,

INADA Naoki  

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-10 Thread Victor Stinner

Hi,

On 10 Dec. 2017 at 05:48, "INADA Naoki" wrote:

> Now I'm OK to accept the PEP, except one nitpick.

I got a private email about the same issue. I don't think that it's
nitpicking, since many people were confused about the relationship between
PEP 538 and PEP 540. So it seems like I was confused as well :-) I was
also confused because my PEP evolved quickly. With the additional
locale.getpreferredencoding() change in my PEP, the two PEPs became even
more similar.

>> Locale coercion only impacts non-Python code like C libraries, whereas
>> the Python UTF-8 Mode only impacts Python code: the two PEPs are
>> complementary.
>
> This sentence seems a bit misleading.
> If UTF-8 mode is disabled explicitly, locale coercion affects Python code
> too.
> locale.getpreferredencoding() is UTF-8, open()'s default encoding is UTF-8,
> and stdio is UTF-8/surrogateescape.
>
> So shouldn't this sentence be: "Locale coercion impacts both Python code
> and non-Python code like C libraries, whereas ..."?


Right. I will rephrase it.

Victor

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-10 Thread Victor Stinner

Ok, I fixed the description of the effects of the locale coercion
(PEP 538). Does it now look good to you, Naoki?

https://www.python.org/dev/peps/pep-0540/#relationship-with-the-locale-coercion-pep-538

The commit:

https://github.com/python/peps/commit/71cda51fbb622ece63f7a9d3c8fa6cd33ce06b58

diff --git a/pep-0540.txt b/pep-0540.txt
index 0a9cbc1e..c163916d 100644
--- a/pep-0540.txt
+++ b/pep-0540.txt
@@ -144,9 +144,15 @@ The POSIX locale enables the locale coercion (PEP 538) and the UTF-8
 mode (PEP 540). When the locale coercion is enabled, enabling the UTF-8
 mode has no (additional) effect.

-Locale coercion only impacts non-Python code like C libraries, whereas
-the Python UTF-8 Mode only impacts Python code: the two PEPs are
-complementary.
+The UTF-8 has the same effect than locale coercion:
+``sys.getfilesystemencoding()`` returns ``'UTF-8'``,
+``locale.getpreferredencoding()`` returns ``UTF-8``, ``sys.stdin`` and
+``sys.stdout`` error handler set to ``surrogateescape``. These changes
+only affect Python code. But the locale coercion has addiditonal
+effects: the ``LC_CTYPE`` environment variable and the ``LC_CTYPE``
+locale are set to a UTF-8 locale like ``C.UTF-8``. The side effect is
+that non-Python code is also impacted by the locale coercion. The two
+PEPs are complementary.

 On platforms where locale coercion is not supported like Centos 7, the
 POSIX locale only enables the UTF-8 Mode. In this case, Python code uses

Victor
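
The distinction drawn in the diff above (both features change Python's own
encodings, but only locale coercion also sets ``LC_CTYPE`` for non-Python
code) can be observed with a sketch like the following. It assumes Python
3.7+, and the helper name ``probe_posix_locale`` is hypothetical. The result
depends on the platform: whether coercion succeeds depends on which locales
are installed, so no particular output is claimed here.

```python
import json
import os
import subprocess
import sys

# Code run in the child: report what the child sees after startup.
CHILD_CODE = r"""
import json, locale, os, sys
print(json.dumps({
    "LC_CTYPE_env": os.environ.get("LC_CTYPE"),
    "utf8_mode": sys.flags.utf8_mode,
    "preferred_encoding": locale.getpreferredencoding(),
}))
"""


def probe_posix_locale() -> dict:
    """Run a child interpreter in the C (POSIX) locale and report its state."""
    env = dict(os.environ)
    # Drop settings that would override the locale or either PEP's defaults.
    for name in ("LC_ALL", "LC_CTYPE", "PYTHONUTF8",
                 "PYTHONCOERCECLOCALE", "PYTHONIOENCODING"):
        env.pop(name, None)
    env["LANG"] = "C"
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env,
        stdin=subprocess.DEVNULL,
        capture_output=True,
        check=True,
        text=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    print(probe_posix_locale())
```

If ``LC_CTYPE_env`` comes back set to a UTF-8 locale, coercion took place
and child processes and C libraries see it too; if it is unset while
``utf8_mode`` is 1, only Python code is affected.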



Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-10 Thread INADA Naoki

Except for one typo I commented on at GitHub,
I accept PEP 540.

Well done, Victor and Nick, for PEP 540 and 538.
Python 3.7 will be the most UTF-8 friendly Python 3 ever.

INADA Naoki  



Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-10 Thread Victor Stinner

2017-12-10 18:46 GMT+01:00 INADA Naoki :
> Except one typo I commented on Github,

Fixed: 
https://github.com/python/peps/commit/08224bf6bdf16b539fb6f8136061877e5924476d

> I accept PEP 540.

Wow, thank you :-) Again, thank you for your very useful feedback,
which helped to make PEP 540 much better than its initial version.

> Well done, Victor and Nick, for PEP 540 and 538.
> Python 3.7 will be the most UTF-8 friendly Python 3 ever.

Yep. Once PEP 540 is implemented, we will need to test both PEPs
as much as possible before 3.7 final!

https://bugs.python.org/issue29240
https://github.com/python/cpython/pull/855

Victor

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-10 Thread Toshio Kuratomi

On Dec 9, 2017 8:53 PM, "INADA Naoki"  wrote:

>> Earlier versions of PEP 538 thus included "en_US.UTF-8" on the
>> candidate target locale list, but that turned out to cause assorted
>> problems due to the "C -> en_US" part of the coercion.
>
> Hm, but PEP 538 says:
>
>> this PEP instead proposes to extend the "surrogateescape" default for
>> stdin and stderr error handling to also apply to the three potential
>> coercion target locales.
>
> https://www.python.org/dev/peps/pep-0538/#defaulting-to-surrogateescape-error-handling-on-the-standard-io-streams
>
> I don't think en_US.UTF-8 should use surrogateescape error handler.


Could you explain why not? utf-8 seems like the common thread for using
surrogateescape so I'm not sure what would make en_US.UTF-8 different than
C.UTF-8.

-Toshio

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-10 Thread INADA Naoki

>
> Could you explain why not? utf-8 seems like the common thread for using
> surrogateescape so I'm not sure what would make en_US.UTF-8 different than
> C.UTF-8.
>

Because there are many lang_COUNTRY.UTF-8 locales:
ja_JP.UTF-8, zh_TW.UTF-8, fr_FR.UTF-8, etc...

If only en_US.UTF-8 used surrogateescape, it could create confusing
situations like: "This script works on an English Linux desktop, but
doesn't work on a Japanese Linux desktop!"

I accepted PEP 540.  So even if locale coercion fails, the behavior is
better than in Python 3.6.

Regards,

INADA Naoki  

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-11 Thread Guido van Rossum

Congrats Victor! Thanks Mr. Inada for reviewing this PEP (and 538). Thanks
to everyone else who participated in the lively discussion!

On Sun, Dec 10, 2017 at 4:00 PM, INADA Naoki  wrote:

> > Could you explain why not? utf-8 seems like the common thread for using
> > surrogateescape so I'm not sure what would make en_US.UTF-8 different
> > than C.UTF-8.
>
> Because there are many lang_COUNTRY.UTF-8 locales:
> ja_JP.UTF-8, zh_TW.UTF-8, fr_FR.UTF-8, etc...
>
> If only en_US.UTF-8 used surrogateescape, it could create confusing
> situations like: "This script works on an English Linux desktop, but
> doesn't work on a Japanese Linux desktop!"
>
> I accepted PEP 540.  So even if locale coercion fails, the behavior is
> better than in Python 3.6.
>
> Regards,
>
> INADA Naoki



-- 
--Guido van Rossum (python.org/~guido)

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-13 Thread Nick Coghlan

On 11 Dec. 2017 6:50 am, "INADA Naoki"  wrote:

> Except for one typo I commented on at GitHub,
> I accept PEP 540.
>
> Well done, Victor and Nick, for PEP 540 and 538.
> Python 3.7 will be the most UTF-8 friendly Python 3 ever.


And thank you for all of your work on reviewing them! The appropriate
trade-offs between ease of use in common scenarios and an increased chance
of emitting mojibake are hard to figure out, but I like where we've ended
up :)

Cheers,
Nick.
