Re: [Python-ideas] PEP 538: Coercing the legacy C locale to C.UTF-8

2017-01-18 Thread Xavier de Gaye

> On Android, the locale settings are of limited relevance (due to most
> applications running in the UTF-16-LE based Dalvik environment) and there's
> limited value in preserving backwards compatibility with other locale aware
> C/C++ components in the same process (since it's a relatively new target
> platform for CPython), so CPython bypasses the operating system provided APIs
> and hardcodes the use of UTF-8 (similar to its behaviour on Apple platforms).

FWIW the default charset seems to be UTF-8 for Java applications: the Android
documentation for the public abstract class Charset [1] says of the
defaultCharset() method:

"Android note: The Android platform default is always UTF-8."

Also, wide character functions in the NDK use the UTF-8 encoding regardless of
the locale set by setlocale(); see the test run by Chi Hsuan Yen in [2].

Xavier

[1] https://developer.android.com/reference/java/nio/charset/Charset.html
[2] http://bugs.python.org/issue26928#msg281110
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] PEP 538: Coercing the legacy C locale to C.UTF-8

2017-01-07 Thread Nick Coghlan
Hi folks,

Many of you will have seen Victor's recent PEP proposing the
introduction of a new "UTF-8" mode that tells Python to use UTF-8 by
default in the legacy C locale (similar to the way CPython behaves on
Mac OS X, Android and iOS), as well as allowing explicit selection of
that mode regardless of the current locale settings.

That was prompted by my proposal in PEP 538 to start coercing the
legacy C locale to C.UTF-8 (when we have the ability and opportunity
to do so), and otherwise at least warn that we don't expect the legacy
C locale to work properly. That PEP has now been through its initial
round of review on the Python Linux SIG, and has been updated to address
the feedback received there, as well as some of the points Victor
raised in PEP 540.

The rendered version is available at
https://www.python.org/dev/peps/pep-0538/ and the plain text version
is included inline below.

Folks that have already read PEP 540 may want to start with the new
section that looks at the way the two PEPs are potentially
complementary to each other rather than competitive:
https://www.python.org/dev/peps/pep-0538/#relationship-with-other-peps

In particular, the approach in PEP 540 may be a better last-resort
alternative than setting "LC_CTYPE=en_US.UTF-8" (which is what the
current draft of PEP 538 proposes) on platforms that don't provide
either C.UTF-8 or C.utf8.

Cheers,
Nick.

===
PEP: 538
Title: Coercing the legacy C locale to C.UTF-8
Version: $Revision$
Last-Modified: $Date$
Author: Nick Coghlan 
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 28-Dec-2016
Python-Version: 3.7


Abstract
========

An ongoing challenge with Python 3 on \*nix systems is the conflict between
needing to use the configured locale encoding by default for consistency with
other C/C++ components in the same process and those invoked in subprocesses,
and the fact that the standard C locale (as defined in POSIX:2001) specifies
a default text encoding of ASCII, which is entirely inadequate for the
development of networked services and client applications in a multilingual
world.

This PEP proposes that the way the CPython implementation handles the default
C locale be changed such that:

* the standalone CPython binary will automatically attempt to coerce the ``C``
  locale to ``C.UTF-8`` (preferred), ``C.utf8`` or ``en_US.UTF-8`` unless the
  new ``PYTHONCOERCECLOCALE`` environment variable is set to ``0``
* if the subsequent runtime initialization process detects that the legacy
  ``C`` locale remains active (e.g. locale coercion is disabled, or the runtime
  is embedded in an application other than the main CPython binary), it will
  emit a warning on stderr that use of the legacy ``C`` locale's default ASCII
  text encoding may cause various Unicode compatibility issues
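
As a rough illustration of the first bullet, the selection logic could be
sketched in Python as follows. This is not the CPython implementation (the
real coercion has to happen in C, before the interpreter is initialized); the
helper names and the environment-variable check for the legacy locale are
assumptions made purely for this example.

```python
import locale

# Candidate targets, in the order the PEP lists them.
COERCION_TARGETS = ("C.UTF-8", "C.utf8", "en_US.UTF-8")


def legacy_c_locale_active(env):
    """Simplified check: does the environment select the legacy C locale?

    Assumption: the first non-empty value among LC_ALL, LC_CTYPE and LANG
    decides the LC_CTYPE setting, and an empty environment means "C".
    """
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            return value in ("C", "POSIX")
    return True


def locale_is_available(name):
    """Probe a locale by trying to set LC_CTYPE, then restoring the old value."""
    saved = locale.setlocale(locale.LC_CTYPE)
    try:
        locale.setlocale(locale.LC_CTYPE, name)
    except locale.Error:
        return False
    else:
        locale.setlocale(locale.LC_CTYPE, saved)
        return True


def choose_coercion_target(env, is_available=locale_is_available):
    """Return the locale name to coerce to, or None if no coercion applies."""
    if env.get("PYTHONCOERCECLOCALE") == "0":
        return None  # coercion explicitly disabled
    if not legacy_c_locale_active(env):
        return None  # a non-C locale is already configured
    for target in COERCION_TARGETS:
        if is_available(target):
            return target
    return None  # no suitable locale: the legacy C locale stays active
```

Calling ``choose_coercion_target(os.environ)`` on a given system would report
which of the three candidates (if any) this sketch would pick there; a result
of ``None`` while the ``C`` locale is active corresponds to the case in the
second bullet where a warning would be emitted.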

Explicitly configuring the ``C.UTF-8`` or ``en_US.UTF-8`` locales has already
been used successfully for a number of years (including by the PEP author) to
get Python 3 running reliably in environments where no locale is otherwise
configured (such as Docker containers).
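
For readers who have not used that workaround, a minimal sketch of explicit
configuration when launching a child process from Python might look like the
following. The function name is hypothetical, and the sketch assumes the
target system actually provides the named locale.

```python
def with_utf8_locale(base_env, locale_name="C.UTF-8"):
    """Return a copy of base_env that explicitly selects a UTF-8 locale."""
    env = dict(base_env)
    # LC_ALL takes precedence over LC_CTYPE and LANG, so setting it is
    # enough to override whatever the parent environment specified.
    env["LC_ALL"] = locale_name
    # LANG is set as well so that tools which only consult LANG agree.
    env["LANG"] = locale_name
    return env
```

The result can be passed as the ``env`` argument of ``subprocess.run()``, for
example ``subprocess.run([sys.executable, "script.py"],
env=with_utf8_locale(os.environ))``, where ``script.py`` is a placeholder.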

With this change, any \*nix platform that does *not* offer at least one of the
``C.UTF-8``, ``C.utf8`` or ``en_US.UTF-8`` locales as part of its standard
configuration would only be considered a fully supported platform for CPython
3.7+ deployments when a locale other than the default ``C`` locale is
configured explicitly.

Redistributors (such as Linux distributions) with a narrower target audience
than the upstream CPython development team may also choose to opt in to this
behaviour for the Python 3.6.x series by applying the necessary changes as a
downstream patch when first introducing Python 3.6.0.


Background
==========

While the CPython interpreter is starting up, it may need to convert from
the ``char *`` format to the ``wchar_t *`` format, or from one of those formats
to ``PyUnicodeObject *``, before its own text encoding handling machinery is
fully configured. It handles these cases by relying on the operating system to
do the conversion and then ensuring that the text encoding name reported by
``sys.getfilesystemencoding()`` matches the encoding used during this early
bootstrapping process.
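
The effect of that early choice can be inspected from Python code. The sketch
below reports the encodings the running interpreter settled on, and shows why
an ASCII default is a problem for UTF-8 encoded data such as filenames. The
function names and the sample filename are invented for the example; the
``surrogateescape`` error handler shown is the one defined in PEP 383.

```python
import locale
import sys


def describe_text_encodings():
    """Report the encodings the running interpreter selected at startup."""
    return {
        "filesystem": sys.getfilesystemencoding(),
        "preferred": locale.getpreferredencoding(False),
    }


# A UTF-8 encoded filename containing one non-ASCII character (U+00E9).
RAW_NAME = "caf\u00e9.txt".encode("utf-8")


def ascii_decode_fails(data):
    """Return True if strict ASCII decoding rejects the given bytes."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return True
    return False


# With surrogateescape, undecodable bytes are smuggled through as lone
# surrogates, so the original bytes can be recovered, but the resulting
# string is not the text a user would expect to see.
escaped = RAW_NAME.decode("ascii", "surrogateescape")
restored = escaped.encode("ascii", "surrogateescape")
```

Here ``restored`` equals ``RAW_NAME``, while ``escaped`` differs from the
intended text; ``describe_text_encodings()`` will vary with the platform and
locale it is run under.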

On Apple platforms (including both Mac OS X and iOS), this is straightforward,
as Apple guarantees that these operations will always use UTF-8 to do the
conversion.

On Windows, the limitations of the ``mbcs`` format used by default in these
conversions proved sufficiently problematic that PEP 528 and PEP 529 were
implemented to bypass the operating system supplied interfaces for binary data
handling and force the use of UTF-8 instead.

On Android, the locale settings are of limited relevance (due to most
applications running in the UTF-16-LE based Dalvik environment) and there's
limited value in preserving backwards compatibility with other locale aware
C/C++ components in the same process (since it's a relatively new target
platform