Re: [Python-ideas] PEP 540: Add a new UTF-8 mode

2017-01-07 Thread Nick Coghlan
On 8 January 2017 at 02:47, Stephen J. Turnbull wrote:
> I agree that people around me mostly know only two encodings: "works
> for me" and "mojibake", but they also use locales configured for them
> by technical staff.  On top of that, international students (the most
> likely victims of "UTF-8 by default" because students are the biggest
> Python users) typically have non-Japanese locales set on their
> imported computers.
>
> I'm not going to say my experience is typical enough to block "UTF-8
> by default", but let's do this very carefully with thought.

Unsurprisingly (given where I work [1]), one of my key concerns is to
enable large Python-using institutions to keep moving forward,
regardless of whether they've fully standardised their internal
environments on UTF-8 or not. As such, while I'm entirely in
favour of pushing people towards UTF-8 as the default choice
everywhere, I also want to make sure that system and application
integrators, including the folks responsible for defining the Standard
Operating Environments in large organisations, get warnings of
potential problems when they arise, and continue to get encoding
errors when we have definitive evidence of a compatibility problem.

For me, that boils down to:

- if a locale is properly configured, we'll continue to respect it
- if we're ignoring or changing the locale setting without an explicit
config option, we'll emit a warning on stderr that we're doing so
(*without* using the warnings system, so there's no way to turn it
into an exception)
- if a UTF-8 based Linux container is run on a
GB-18030/ISO-2022/Shift-JIS/etc host and tries to exchange locally
encoded data with that host (rather than exchanging UTF-8 encoded data
over a network connection), getting an exception is preferable to
silently corrupting the data stream
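
As a rough illustration of the second and third points, here is a
minimal Python sketch. The helper names (notify_locale_override and
decode_container_input) are hypothetical and are not part of either
PEP; the only things relied on are sys.stderr and the default strict
error handler of bytes.decode().

```python
import sys


def notify_locale_override(message, stream=None):
    """Report a locale override directly on stderr.

    Hypothetical helper: it deliberately avoids the warnings module, so
    a filter such as ``-W error`` cannot turn the notice into an exception.
    """
    stream = sys.stderr if stream is None else stream
    stream.write("Python locale notice: " + message + "\n")
    stream.flush()


def decode_container_input(data):
    """Decode bytes received from the host as UTF-8, failing loudly.

    Hypothetical helper: strict decoding raises UnicodeDecodeError on
    data in another encoding instead of silently corrupting it.
    """
    return data.decode("utf-8", errors="strict")


if __name__ == "__main__":
    # Assumed scenario: the host hands over Shift-JIS encoded text.
    host_bytes = "\u65e5\u672c\u8a9e".encode("shift_jis")
    try:
        decode_container_input(host_bytes)
    except UnicodeDecodeError:
        notify_locale_override("host data is not valid UTF-8")
```

The point of the sketch is only the division of labour: notices that
cannot be escalated go straight to stderr, while genuine data
incompatibilities still surface as exceptions.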

(I think I'll add something along those lines to PEP 538 as a new
"Core Design Principles" section)

Cheers,
Nick.

[1] https://docs.python.org/devguide/motivations.html#published-entries

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] PEP 538: Coercing the legacy C locale to C.UTF-8

2017-01-07 Thread Nick Coghlan
Hi folks,

Many of you would have seen Victor's recent PEP proposing the
introduction of a new "UTF-8" mode that tells Python to use UTF-8 by
default in the legacy C locale (similar to the way CPython behaves on
Mac OS X, Android and iOS), as well as allowing explicit selection of
that mode regardless of the current locale settings.

That was prompted by my proposal in PEP 538 to start coercing the
legacy C locale to C.UTF-8 (when we have the ability and opportunity
to do so), and otherwise at least warn that we don't expect the legacy
C locale to work properly. That PEP has now been through its initial
round of review on the Python Linux SIG, and updated to address both
the feedback received there, as well as some of the points Victor
raised in PEP 540.

The rendered version is available at
https://www.python.org/dev/peps/pep-0538/ and the plain text version
is included inline below.

Folks that have already read PEP 540 may want to start with the new
section that looks at the way the two PEPs are potentially
complementary to each other rather than competitive:
https://www.python.org/dev/peps/pep-0538/#relationship-with-other-peps

In particular, the approach in PEP 540 may be a better last-resort
alternative than setting "LC_CTYPE=en_US.UTF-8" on platforms that
don't provide either C.UTF-8 or C.utf8 (which is what the current
draft of PEP 538 proposes).

Cheers,
Nick.

===
PEP: 538
Title: Coercing the legacy C locale to C.UTF-8
Version: $Revision$
Last-Modified: $Date$
Author: Nick Coghlan 
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 28-Dec-2016
Python-Version: 3.7


Abstract
========

An ongoing challenge with Python 3 on \*nix systems is the conflict between
needing to use the configured locale encoding by default for consistency with
other C/C++ components in the same process and those invoked in subprocesses,
and the fact that the standard C locale (as defined in POSIX:2001) specifies
a default text encoding of ASCII, which is entirely inadequate for the
development of networked services and client applications in a multilingual
world.

This PEP proposes that the way the CPython implementation handles the default
C locale be changed such that:

* the standalone CPython binary will automatically attempt to coerce the ``C``
  locale to ``C.UTF-8`` (preferred), ``C.utf8`` or ``en_US.UTF-8`` unless the
  new ``PYTHONCOERCECLOCALE`` environment variable is set to ``0``
* if the subsequent runtime initialization process detects that the legacy
  ``C`` locale remains active (e.g. locale coercion is disabled, or the runtime
  is embedded in an application other than the main CPython binary), it will
  emit a warning on stderr that use of the legacy ``C`` locale's default ASCII
  text encoding may cause various Unicode compatibility issues
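
The decision described by the two points above could be modelled roughly as
follows. This is a simplified Python sketch, not the C implementation: the
function name ``choose_locale``, the ``available`` parameter and the warning
text are assumptions made for illustration, and only the exact name ``C`` is
treated as the legacy locale.

```python
# Coercion targets in the order of preference listed in the abstract.
COERCION_TARGETS = ("C.UTF-8", "C.utf8", "en_US.UTF-8")


def choose_locale(environ, available):
    """Return ``(locale_name, warning)`` for a hypothetical startup check.

    ``environ`` is a mapping of environment variables and ``available`` is
    the set of locale names the platform is assumed to provide.
    """
    # POSIX precedence: LC_ALL, then LC_CTYPE, then LANG; empty means unset.
    current = (
        environ.get("LC_ALL")
        or environ.get("LC_CTYPE")
        or environ.get("LANG")
        or "C"
    )
    if current != "C":
        # A properly configured locale is respected as-is.
        return current, None
    if environ.get("PYTHONCOERCECLOCALE") != "0":
        for target in COERCION_TARGETS:
            if target in available:
                return target, None
    # Coercion was disabled or impossible: the legacy locale stays active.
    return "C", "legacy C locale active; ASCII default may cause Unicode issues"
```

For example, ``choose_locale({"LANG": "C"}, {"en_US.UTF-8"})`` picks the
``en_US.UTF-8`` fallback, while setting ``PYTHONCOERCECLOCALE=0`` leaves ``C``
in place and yields the warning text.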

Explicitly configuring the ``C.UTF-8`` or ``en_US.UTF-8`` locales has already
been used successfully for a number of years (including by the PEP author) to
get Python 3 running reliably in environments where no locale is otherwise
configured (such as Docker containers).

With this change, any \*nix platform that does *not* offer at least one of the
``C.UTF-8``, ``C.utf8`` or ``en_US.UTF-8`` locales as part of its standard
configuration would only be considered a fully supported platform for CPython
3.7+ deployments when a locale other than the default ``C`` locale is
configured explicitly.

Redistributors (such as Linux distributions) with a narrower target audience
than the upstream CPython development team may also choose to opt in to this
behaviour for the Python 3.6.x series by applying the necessary changes as a
downstream patch when first introducing Python 3.6.0.


Background
==========

While the CPython interpreter is starting up, it may need to convert from
the ``char *`` format to the ``wchar_t *`` format, or from one of those formats
to ``PyUnicodeObject *``, before its own text encoding handling machinery is
fully configured. It handles these cases by relying on the operating system to
do the conversion and then ensuring that the text encoding name reported by
``sys.getfilesystemencoding()`` matches the encoding used during this early
bootstrapping process.
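
To see what a given interpreter settled on after this bootstrapping, the
reported encodings can be inspected from Python code. The sketch below is only
an illustration; the helper name ``report_encodings`` is hypothetical, while
``sys.getfilesystemencoding()`` and ``locale.getpreferredencoding()`` are
standard library calls.

```python
import locale
import sys


def report_encodings():
    """Collect the text encodings the running interpreter reports."""
    return {
        # Encoding used to convert OS-level names (paths, argv, environ).
        "filesystem": sys.getfilesystemencoding(),
        # Encoding implied by the locale, without changing the locale.
        "locale_preferred": locale.getpreferredencoding(False),
        # Encoding of standard output, if the stream exposes one.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in report_encodings().items():
        print(name, "=", value)
```

Running this under ``LC_ALL=C`` and again under a UTF-8 locale is a quick way
to observe how the locale influences the values.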

On Apple platforms (including both Mac OS X and iOS), this is straightforward,
as Apple guarantees that these operations will always use UTF-8 to do the
conversion.

On Windows, the limitations of the ``mbcs`` format used by default in these
conversions proved sufficiently problematic that PEP 528 and PEP 529 were
implemented to bypass the operating system supplied interfaces for binary data
handling and force the use of UTF-8 instead.

On Android, the locale settings are of limited relevance (due to most
applications running in the UTF-16-LE based Dalvik environment) and there's
limited value in preserving backwards compatibility with other locale aware
C/C++ components in the same process (since it's a relatively new target
platform 

Re: [Python-ideas] New PyThread_tss_ C-API for CPython

2017-01-07 Thread Masayuki YAMAMOTO
2016-12-31 16:42 GMT+09:00 Nick Coghlan:

> On 31 December 2016 at 08:24, Masayuki YAMAMOTO wrote:
>
>> I have read the discussion and I agree that we should use a structure,
>> Py_tss_t, instead of a platform-specific data type. As Steve said,
>> Py_tss_t should be treated as a genuinely opaque type, so key state
>> checking should be provided through macros or inline functions with a
>> name like PyThread_tss_is_created. I'd like to pin down the
>> specification a bit more :)
>>
>> If PyThread_tss_create is called with an already created key, it is a
>> no-op, but should the function succeed or fail? In my opinion, it is
>> better to return a failure, because multiple calls to
>> PyThread_tss_create for one key most likely indicate incorrect code.
>>
>
> That's not what we currently do for the EnsureGIL autoTLS key and the
> tracemalloc key though - the reentrant key creation is part of
> "create-if-needed" flows where the key creation is silently skipped if the
> key already exists.
>
> Changing that would require some further research into how we ended up
> with the current approach, while carrying it over into the new API design
> would be the default option.
>

Yes, as you pointed out, my suggestion changes the API semantics and does
not inherit "create-if-needed". I checked the code again: the current
approach works well enough, and I have not found a strong benefit in
changing the semantics. So I agree with you and withdraw my suggestion.
I'm going to update the patch based on this result.
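
To make the agreed "create-if-needed" behaviour concrete, here is a small
Python model of it. This is only a sketch of the semantics, not the C API:
the class TssKey and its methods are hypothetical stand-ins for Py_tss_t,
PyThread_tss_create and PyThread_tss_is_created.

```python
import threading


class TssKey:
    """Hypothetical Python model of an opaque thread-specific storage key."""

    def __init__(self):
        self._storage = None
        self._lock = threading.Lock()

    def is_created(self):
        """Model of the key state check (cf. PyThread_tss_is_created)."""
        return self._storage is not None

    def create(self):
        """Create the key if needed; return 0 for success in both cases."""
        with self._lock:
            if self._storage is None:
                self._storage = threading.local()
        # Re-creating an existing key is silently skipped, not an error.
        return 0

    def set(self, value):
        if self._storage is None:
            raise RuntimeError("key has not been created")
        self._storage.value = value

    def get(self):
        if self._storage is None:
            raise RuntimeError("key has not been created")
        # Threads that never stored a value see None.
        return getattr(self._storage, "value", None)
```

With this model, calling create() a second time leaves any value already
stored by the calling thread untouched, which is what the existing
create-if-needed flows rely on.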

Best regards,
Masayuki