On 7 May 2017 at 15:22, INADA Naoki <songofaca...@gmail.com> wrote:
> Hi, Nick.
>
> After thinking about relationship between PEP 538 and 540 in two days,
> I came up with idea which removes locale coercion by default from PEP 538,
> it does just enables UTF-8 mode and show warning about C locale.
>
> Of course, this idea is based on PEP 540.  There are no "If PEP 540 is
> rejected".
>
> How do you think?

The main problems I see with this approach are:

1. There's no way to configure earlier Python versions to emulate PEP
   540. It's a completely new mode of operation.
2. PEP 540 isn't actually defined yet (Victor is still working on it)
3. Due to 1&2, PEP 540 isn't something 3.6 redistributors can
   experiment with backporting to a narrower target audience

By contrast, you can emulate PEP 538 all the way back to Python 3.1 by
setting the following environment variables:

    LC_ALL=C.UTF-8
    LANG=C.UTF-8
    PYTHONIOENCODING=utf-8:surrogateescape

(assuming your platform provides a C.UTF-8 locale and you don't need
to run any Python 2.x components in that same environment)

I think the specific concerns you raise below are valid though, and
I'd be happy to amend PEP 538 to address them all.

> If it make sense, I want to postpone PEP 538 until PEP 540 is
> accepted or rejected, or merge PEP 538 into PEP 540.
>
>
> ## Background
>
> Locale coercion in current PEP 538 has some downsides:
>
> * If user set `LANG=C LC_DATE=ja_JP.UTF-8`, locale coercion may
>   overrides LC_DATE.

The fact it sets "LC_ALL" has previously been raised as a concern with
PEP 538, so it probably makes sense to drop that aspect and just
override "LANG". The scenarios where it makes a difference are
incredibly obscure (involving non-default SSH locale forwarding
settings for folks using SSH on Mac OS X to connect to remote Linux
systems), while just setting "LANG" will be sufficient to address the
"LANG=C" case that is the main driver for the PEP.

That means in the case above, the specific LC_DATE setting would still
take precedence.

> * It makes behavior divergence between standalone and embedded
>   Python.

Such divergence already exists, only in the other direction: embedding
applications may override the runtime's default settings, either by
setting a particular locale, or by using Py_SetStandardStreamEncoding
(which was added specifically to make it easy for Blender to force the
use of UTF-8 on the embedded Python's standard streams, regardless of
the current locale)

That said, this is also the rationale for my suggestion that we expose
locale coercion as a public API:

    if (Py_LegacyLocaleDetected()) {
        Py_CoerceLegacyLocale();
    }

That would make it straightforward for any embedding application that
wanted to do so to replicate the behaviour of the standard CLI.

The level of divergence is also mitigated by the point in the next
section.

> * Parent Python process may use utf-8:surrogateescape, but child process
>   Python may use utf-8:strict.  (Python 3.6 uses ascii:surrogateescape in
>   both of parent and children).

This discrepancy is gone now thanks to your suggestion of making
"surrogateescape" the default standard stream handler when one of the
coercion target locales is explicitly configured - both parent
processes and child processes end up with "utf-8:surrogateescape"
configured on the standard streams.

> On the other hand, benefits from locale coercion is restricted:
>
> * When locale coercion succeeds, warning is always shown.
>   To hide the warning, user must disable coercion in some way.
>   (e.g. use UTF-8 locale explicitly, or set PYTHONCOERCECLOCALE=0).

The current warning is based on what we think is appropriate for
Fedora downstream, but that doesn't necessarily mean it's the right
approach for Python upstream, especially if the LC_ALL override is
dropped.

We could also opt for a model where Python 3.7 emits the coercion
warning, but Python 3.8 just does the coercion silently (that
rationale would then also apply to PEP 540 - we'd warn on stderr about
the change in default behaviour in 3.7, but take the new behaviour for
granted in 3.8).
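
For anyone wanting to check what a child process actually ends up with
on its standard streams, here is a minimal sketch. It relies only on
the PYTHONIOENCODING setting from the emulation recipe earlier in this
message (so it doesn't depend on the platform providing a C.UTF-8
locale); the helper name child_stream_settings is purely illustrative
and not part of either PEP.

```python
import os
import subprocess
import sys


def child_stream_settings(extra_env):
    """Return (encoding, errors) of sys.stdout as seen by a child interpreter.

    Hypothetical helper for illustration only: it starts a fresh copy of
    the running interpreter with extra environment variables applied.
    """
    env = dict(os.environ)
    env.update(extra_env)
    probe = "import sys; print(sys.stdout.encoding, sys.stdout.errors)"
    result = subprocess.run(
        [sys.executable, "-c", probe],
        env=env,
        stdin=subprocess.DEVNULL,
        capture_output=True,
        text=True,
        check=True,
    )
    encoding, errors = result.stdout.split()
    return encoding.lower(), errors


# Explicit stream configuration, as in the environment variable recipe.
settings = child_stream_settings({"PYTHONIOENCODING": "utf-8:surrogateescape"})
print(settings)
```

Passing a locale setting such as LANG instead of PYTHONIOENCODING
through the same helper would show what a given interpreter version
picks by default, which will vary by version and platform.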

The change to make the standard stream error handler setting depend
solely on the currently configured locale also helps here, since it
means it doesn't matter how a process reached the state of having the
locale set to "C.UTF-8". CPython will behave the same way regardless,
so it makes it less important to provide an explicit notice that
coercion took place.

> So I feel benefit / complexity ratio of locale coercion is less than
> UTF-8 mode.

It isn't an either/or though - we're entirely free to do both, one
based solely on the existing configuration options that have been
around since 3.1, and the other going beyond those to also adjust the
default behaviour of other interfaces (like "open()").

> But locale coercion works nice on Android.  And there are some Android-like
> Unix systems (container or small device) that C.UTF-8 is always proper locale.
>
> ## Rough spec
>
> * Make Android-style locale coercion (forced, no warning) is now
>   build option.  Some users who build Python for container or small device
>   may like it.

But do we *want* to support the legacy C locale in 3.7+? I don't think
we do, because it will never work properly for our purposes as long as
it assumes ASCII as the default text encoding.

Part of the motivation for making locale coercion the default is so we
can update PEP 11 to make it clear that running in the legacy C locale
is no longer an officially supported configuration.

> * Normal Python build doesn't change locale.  When python executable is
>   run in C locale, show locale warning.  locale warning can be disabled
>   as current PEP 538.

That still pushes the problem back on end users to fix, though, rather
than just automatically making things like GNU readline integration
work.

> * User can disable automatic UTF-8 mode by setting PYTHONUTF8=0
>   environment variables.  User can hide warning by setting
>   PYTHONUTF8=1 too.

I think I need to better explain in the PEP why PEP 540's UTF-8 mode
on its own won't be enough, as it doesn't necessarily handle
locale-aware extension modules like GNU readline (this came up in the
draft PR review, but I never added anything specifically to the PEP
about it), and also doesn't help at all with invocation of older 3.x
releases in a subprocess.

Here's an interactive session from a PEP 538 enabled CPython, where
each line after the first is executed by doing "up-arrow, 4xleft-arrow,
delete, enter"

    $ LANG=C ./python
    Python detected LC_CTYPE=C: LC_ALL & LANG coerced to C.UTF-8 (set
    another locale or PYTHONCOERCECLOCALE=0 to disable this locale
    coercion behavior).
    Python 3.7.0a0 (heads/pep538-coerce-c-locale:188e780, May 7 2017, 00:21:13)
    [GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print("ℙƴ☂ℌøἤ")
    ℙƴ☂ℌøἤ
    >>> print("ℙƴ☂ℌἤ")
    ℙƴ☂ℌἤ
    >>> print("ℙƴ☂ἤ")
    ℙƴ☂ἤ
    >>> print("ℙƴἤ")
    ℙƴἤ
    >>> print("ℙἤ")
    ℙἤ
    >>> print("ἤ")
    ἤ
    >>>

Not exactly exciting, but this is what currently happens on an older
release if you only change the Python level stream encoding settings
without updating the locale settings:

    $ LANG=C PYTHONIOENCODING=utf-8:surrogateescape python3
    Python 3.5.3 (default, Apr 24 2017, 13:32:13)
    [GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print("ℙƴ☂ℌøἤ")
    ℙƴ☂ℌøἤ
    >>> print("ℙƴ☂ℌ�")
      File "<stdin>", line 0

        ^
    SyntaxError: 'utf-8' codec can't decode bytes in position 20-21:
    invalid continuation byte

That particular misbehaviour is coming from GNU readline, *not*
CPython - because the editing wasn't UTF-8 aware, it corrupted the
history buffer and fed such nonsense to stdin that even the
surrogateescape error handler was bypassed. While PEP 540's UTF-8 mode
could technically be updated to also reconfigure readline, that's
*one* extension module, and only when it's running directly as part of
Python 3.7.
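
As a rough illustration of the kind of damage a byte-oriented line
editor can do to UTF-8 text, here is a small sketch. It is not a
reproduction of what readline actually did in the session above: it
simply assumes, for illustration, that one byte gets dropped from the
middle of a multi-byte character, and then compares strict decoding
with the surrogateescape error handler.

```python
line = 'print("ℙƴ☂ℌøἤ")'
raw = line.encode("utf-8")

# Assumption for illustration: the editor treats every byte as one
# character, so a single "delete" removes only the lead byte of "ø",
# leaving its continuation byte stranded in the buffer.
idx = raw.index("ø".encode("utf-8"))
damaged = raw[:idx] + raw[idx + 1:]

# Strict UTF-8 decoding rejects the stranded byte.
try:
    damaged.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps the undecodable byte to a lone surrogate, so
# the original bytes can be recovered exactly on re-encoding.
recovered = damaged.decode("utf-8", errors="surrogateescape")
roundtrip = recovered.encode("utf-8", errors="surrogateescape")

print(strict_ok, roundtrip == damaged)
```

The error handler only helps on the code paths that actually use it,
which is why keeping the line editor itself UTF-8 aware matters.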

By contrast, using a more appropriate locale setting already gets
readline to play nice, even when it's running inside Python 3.5:

    $ LANG=C.UTF-8 python3
    Python 3.5.3 (default, Apr 24 2017, 13:32:13)
    [GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print("ℙƴ☂ℌøἤ")
    ℙƴ☂ℌøἤ
    >>> print("ℙƴ☂ℌἤ")
    ℙƴ☂ℌἤ
    >>> print("ℙƴ☂ἤ")
    ℙƴ☂ἤ
    >>> print("ℙƴἤ")
    ℙƴἤ
    >>> print("ℙἤ")
    ℙἤ
    >>> print("ἤ")
    ἤ
    >>>

Don't get me wrong, I'm definitely a fan of PEP 540, as it extends
much of what PEP 538 covers beyond the standard streams and also
applies it to other operating system interfaces without relying on the
underlying operating system to provide a UTF-8 based locale. However,
I also expect it to be plagued by extension module compatibility
issues if folks attempt to use it standalone, without locale coercion
to reconfigure the behaviour of extension modules appropriately.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev