Eryk Sun <[email protected]> added the comment:
> In my experience, most applications use the ANSI code page because
> they use the ANSI flavor of the Windows API.
The default encoding at startup and in the "C" locale wouldn't change. It would
only differ from the default if setlocale(LC_CTYPE, locale_name) sets it
otherwise. The suggestion is to match the behavior of nl_langinfo(CODESET) in
Linux and many other POSIX systems.
When I say the default encoding won't change, I mean that the Universal C
Runtime (ucrt) system component uses the process ANSI code page as the default
locale encoding for setlocale(LC_CTYPE, ""). This agrees with what Python has
always done, but it disagrees with previous versions of the CRT in Windows.
Personally, I think it's a misstep because the user locale isn't necessarily
compatible with the process code page, but I'm not looking to change this
decision. For example, if the user locale is "el_GR" (Greek, Greece) but the
process code page is 1252 (Latin) instead of 1253 (Greek), I get the following
result in Python 3.4 (VC++ 10) vs Python 3.5 (ucrt):
>py -3.4 -c "from locale import *; print(setlocale(LC_CTYPE, ''))"
Greek_Greece.1253
>py -3.5 -c "from locale import *; print(setlocale(LC_CTYPE, ''))"
Greek_Greece.1252
The result from VC++ 10 is consistent with the user locale. It's also
consistent with multilingual user interface (MUI) text, such as error messages,
or at least it should be, because the user locale and user preferred language
(i.e. Windows display language) should be consistent. (The control panel dialog
to set the user locale in Windows 10 has an option to match the display
language, which is the recommended and default setting.) For example, Python
uses system error messages that are localized to the user's preferred language:
>py -c "import os; os.stat('spam')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
FileNotFoundError: [WinError 2] Δεν είναι δυνατή η εύρεση του καθορισμένου
αρχείου από το σύστημα: 'spam'
This example is on a system where the process (system) ANSI code page is 1252
(Latin), which cannot encode the user's preferred Greek text. Thankfully Python
3.6+ uses the console's Unicode API, so neither the console session's output
code page nor the process code page gets in the way. On the other hand, if this
Greek text is written to a file or piped to a child process using
subprocess.Popen(), Python's choice of locale encoding based on the process
code page (Latin) is incompatible with Greek text, and thus it's incompatible
with the current user's preferred locale and language settings.
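
As a rough illustration of the mismatch, the following sketch checks whether the Greek message text above can be encoded with Python's built-in cp1252 and cp1253 codecs, which stand in here for the process code page and the Greek user locale. The helper name can_encode is made up for this example.

```python
def can_encode(text, encoding):
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# The Greek part of the localized [WinError 2] message quoted above.
message = "Δεν είναι δυνατή η εύρεση του καθορισμένου αρχείου από το σύστημα"

print("cp1252 (Latin):", can_encode(message, "cp1252"))
print("cp1253 (Greek):", can_encode(message, "cp1253"))
print("utf-8:", can_encode(message, "utf-8"))
```

With the default "strict" error handler, writing this text to a file opened with a cp1252 encoding would raise UnicodeEncodeError instead of producing output.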
The process ANSI code page from GetACP() has its uses, which are important.
It's a system setting that's independent of the current user locale and thus
useful when interacting with the legacy system API and as a common encoding for
inter-process data exchange when applications do not use Unicode and may be
operating in different locales. So if you're writing to a legacy-encoded text
file that's shared by multiple users or piping text to an arbitrary program,
then using the ANSI code page is probably okay. Though, especially for IPC,
there's a good chance that it's wrong since Windows has never set, let alone
enforced, a standard in that case.
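
For reference, one way to read the process ANSI code page from Python is to call GetACP() through ctypes. This is only a sketch: the wrapper name get_ansi_code_page is hypothetical, and returning None on other platforms is a choice made for this example, not something Python does.

```python
import ctypes
import sys


def get_ansi_code_page():
    """Return the process ANSI code page as an int, or None if not on Windows."""
    if sys.platform != "win32":
        # ctypes.windll only exists on Windows.
        return None
    # GetACP() takes no arguments and returns the code page identifier.
    return ctypes.windll.kernel32.GetACP()


acp = get_ansi_code_page()
if acp is None:
    print("not running on Windows")
else:
    print("process ANSI code page: cp%d" % acp)
```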
Using the process ANSI code page in the "C" locale makes sense to me.
> What is the use case for using ___lc_codepage()? Is it a different
> encoding?
I always forget the "_func" suffix in the name; it's ___lc_codepage_func() [1].
The lc_codepage value is the current LC_CTYPE codeset as an integer code page.
It's the equivalent of nl_langinfo(CODESET) in POSIX. For UTF-8, the code page
is CP_UTF8 (65001), but this gets displayed in locale strings as "UTF-8" (or
variants such as "utf8"). It could be the LC_CTYPE encoding of just the current
thread, but Python does not enable per-thread locales.
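
To show how an integer lc_codepage value could map to a Python codec name, here is a minimal sketch. The function name code_page_to_encoding is hypothetical, and the "cpNNNN" naming simply follows Python's usual codec names for Windows code pages. It does not check that a codec with that name actually exists.

```python
CP_UTF8 = 65001  # Windows code page identifier for UTF-8


def code_page_to_encoding(code_page):
    """Map an integer Windows code page to a Python-style encoding name."""
    if code_page == CP_UTF8:
        # Locale strings show this code page as "UTF-8" rather than "65001".
        return "utf-8"
    return "cp%d" % code_page


print(code_page_to_encoding(1253))
print(code_page_to_encoding(CP_UTF8))
```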
The CRT has exported ___lc_codepage_func() since VC++ 7.0 (2002), and before
that the current lc_codepage value itself was directly exported as
__lc_codepage. However, this triple-dundered function is documented as internal
and not recommended for use. That's why the code snippet I showed uses
_get_current_locale() with locinfo cast to __crt_locale_data_public *. This
takes "public" in the struct name at face value. Anything that's declared
public should be safe to use, but the locale_t type is frustratingly
undocumented even for this public data [2].
If neither approach is supported, locale.get_current_locale_encoding() could
instead parse the current locale encoding from setlocale(LC_CTYPE, NULL). The
resulting locale string usually includes the codeset (e.g.
"Greek_Greece.1253"). The exceptions are the "C" locale and BCP-47 (RFC 5646)
locales that do not explicitly use UTF-8 (e.g. "el_GR" or "el" instead of
"el_GR.UTF-8"), but these cases can be handled reliably.
---
[1] https://docs.microsoft.com/en-us/cpp/c-runtime-library/lc-codepage-func
[2] https://docs.microsoft.com/en-us/cpp/c-runtime-library/locale
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue43552>
_______________________________________