[issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

Eryk Sun Sat, 20 Mar 2021 16:05:41 -0700

Eryk Sun <[email protected]> added the comment:

> In my experience, most applications use the ANSI code page because 
> they use the ANSI flavor of the Windows API.


The default encoding at startup and in the "C" locale wouldn't change. It would 
only differ from the default if setlocale(LC_CTYPE, locale_name) sets it 
otherwise. The suggestion is to match the behavior of nl_langinfo(CODESET) in 
Linux and many other POSIX systems.
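
For reference, the POSIX behavior being matched can be observed from Python itself. The following is only a minimal sketch: the helper name current_codeset is made up for illustration, and it returns None on builds (such as Windows) where locale.nl_langinfo is not available.

```python
import locale

def current_codeset():
    """Return the codeset of the current LC_CTYPE locale, or None if unsupported."""
    # locale.nl_langinfo and locale.CODESET only exist on POSIX builds of Python.
    if not hasattr(locale, "nl_langinfo") or not hasattr(locale, "CODESET"):
        return None
    return locale.nl_langinfo(locale.CODESET)
```

Calling it before and after locale.setlocale(locale.LC_CTYPE, locale_name) shows the codeset following the LC_CTYPE locale on such systems.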

When I say the default encoding won't change, I mean that the Universal C 
Runtime (ucrt) system component uses the process ANSI code page as the default 
locale encoding for setlocale(LC_CTYPE, ""). This agrees with what Python has 
always done, but it disagrees with previous versions of the CRT in Windows. 
Personally, I think it's a misstep because the user locale isn't necessarily 
compatible with the process code page, but I'm not looking to change this 
decision. For example, if the user locale is "el_GR" (Greek, Greece) but the 
process code page is 1252 (Latin) instead of 1253 (Greek), I get the following 
result in Python 3.4 (VC++ 10) vs Python 3.5 (ucrt):

    >py -3.4 -c "from locale import *; print(setlocale(LC_CTYPE, ''))"
    Greek_Greece.1253

    >py -3.5 -c "from locale import *; print(setlocale(LC_CTYPE, ''))"
    Greek_Greece.1252

The result from VC++ 10 is consistent with the user locale. It's also 
consistent with multilingual user interface (MUI) text, such as error messages, 
or at least it should be, because the user locale and user preferred language 
(i.e. Windows display language) should be consistent. (The control panel dialog 
to set the user locale in Windows 10 has an option to match the display 
language, which is the recommended and default setting.)  For example, Python 
uses system error messages that are localized to the user's preferred language:

    >py -c "import os; os.stat('spam')"
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    FileNotFoundError: [WinError 2] Δεν είναι δυνατή η εύρεση του καθορισμένου αρχείου από το σύστημα: 'spam'

This example is on a system where the process (system) ANSI code page is 1252 
(Latin), which cannot encode the user's preferred Greek text. Thankfully Python 
3.6+ uses the console's Unicode API, so neither the console session's output 
code page nor the process code page gets in the way. On the other hand, if this 
Greek text is written to a file or piped to a child process using 
subprocess.Popen(), Python's choice of locale encoding based on the process 
code page (Latin) is incompatible with Greek text, and thus it's incompatible 
with the current user's preferred locale and language settings.
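
The incompatibility is easy to reproduce on any platform with Python's built-in codecs. This is just an illustrative sketch; the helper can_encode and the constant GREEK_MESSAGE are names introduced here, with the message text copied from the traceback above.

```python
# Error text copied from the traceback above (without the file name).
GREEK_MESSAGE = "Δεν είναι δυνατή η εύρεση του καθορισμένου αρχείου από το σύστημα"

def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec, else False."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

print("cp1252:", can_encode(GREEK_MESSAGE, "cp1252"))
print("cp1253:", can_encode(GREEK_MESSAGE, "cp1253"))
```

Code page 1252 fails on the first Greek letter, while code page 1253 encodes the whole message.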

The process ANSI code page from GetACP() has its uses, which are important. 
It's a system setting that's independent of the current user locale and thus 
useful when interacting with the legacy system API and as a common encoding for 
inter-process data exchange when applications do not use Unicode and may be 
operating in different locales. So if you're writing to a legacy-encoded text 
file that's shared by multiple users or piping text to an arbitrary program, 
then using the ANSI code page is probably okay. Though, especially for IPC, 
there's a good chance that it's wrong since Windows has never set, let alone 
enforced, a standard in that case. 
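
For anyone who wants to check the process ANSI code page directly, here is a small ctypes sketch. The wrapper name get_ansi_code_page is hypothetical; it returns None when not running on Windows.

```python
import sys

def get_ansi_code_page():
    """Return the process ANSI code page from GetACP(), or None off Windows."""
    if sys.platform != "win32":
        return None
    import ctypes
    # GetACP() takes no arguments and returns the code page as an unsigned int.
    return ctypes.windll.kernel32.GetACP()

print(get_ansi_code_page())
```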

Using the process ANSI code page in the "C" locale makes sense to me. 

> What is the use case for using ___lc_codepage()? Is it a different 
> encoding?

I always forget the "_func" suffix in the name; it's ___lc_codepage_func() [1]. 
The lc_codepage value is the current LC_CTYPE codeset as an integer code page. 
It's the equivalent of nl_langinfo(CODESET) in POSIX. For UTF-8, the code page 
is CP_UTF8 (65001), but this gets displayed in locale strings as "UTF-8" (or 
variants such as "utf8"). It could be the LC_CTYPE encoding of just the current 
thread, but Python does not enable per-thread locales.

The CRT has exported ___lc_codepage_func() since VC++ 7.0 (2002), and before 
that the current lc_codepage value itself was directly exported as 
__lc_codepage. However, this triple-dundered function is documented as internal 
and not recommended for use. That's why the code snippet I showed uses 
_get_current_locale() with locinfo cast to __crt_locale_data_public *. This 
takes "public" in the struct name at face value. Anything that's declared 
public should be safe to use, but the locale_t type is frustratingly 
undocumented even for this public data [2].
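
For experimentation only, the lc_codepage value can be read from Python with ctypes. This sketch assumes that ucrtbase.dll exports ___lc_codepage_func() and that the running Python uses that CRT; the wrapper name get_lc_codepage is made up, and it returns None off Windows. Given that the function is documented as internal, this is not meant as a supported approach.

```python
import sys

def get_lc_codepage():
    """Return ucrt's current LC_CTYPE code page as an int, or None off Windows."""
    if sys.platform != "win32":
        return None
    import ctypes
    # Assumption: the Universal CRT DLL is loadable under the name "ucrtbase".
    ucrt = ctypes.CDLL("ucrtbase")
    # getattr() keeps the triple-underscore name literal (no name mangling).
    func = getattr(ucrt, "___lc_codepage_func")
    func.restype = ctypes.c_uint
    func.argtypes = []
    return func()

print(get_lc_codepage())
```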

If neither approach is supported, locale.get_current_locale_encoding() could 
instead parse the current locale encoding from setlocale(LC_CTYPE, NULL). The 
resulting locale string usually includes the codeset (e.g. 
"Greek_Greece.1253"). The exceptions are the "C" locale and BCP-47 (RFC 5646) 
locales that do not explicitly use UTF-8 (e.g. "el_GR" or "el" instead of 
"el_GR.UTF-8"), but these cases can be handled reliably.

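A minimal sketch of that parsing fallback follows. The function name parse_locale_codeset and its fallback parameter are hypothetical; the fallback is simply whatever the caller wants returned for strings that carry no explicit codeset, i.e. the "C" locale and names such as "el_GR". How those cases should be resolved is left open here.

```python
def parse_locale_codeset(locale_string, fallback=None):
    """Parse the codeset from a setlocale(LC_CTYPE, NULL) style string.

    Returns a Python codec name such as "cp1253" or "utf-8", or `fallback`
    when the string has no explicit codeset.
    """
    # The codeset, if present, is whatever follows the last "." in the string.
    _, dot, codeset = locale_string.rpartition(".")
    if not dot or not codeset:
        return fallback
    # Accept "UTF-8" and variants such as "utf8".
    if codeset.lower().replace("-", "").replace("_", "") == "utf8":
        return "utf-8"
    # A purely numeric codeset is a Windows code page number.
    if codeset.isdigit():
        return "cp" + codeset
    return fallback

print(parse_locale_codeset("Greek_Greece.1253"))
print(parse_locale_codeset("el_GR.UTF-8"))
print(parse_locale_codeset("C", fallback="(no explicit codeset)"))
```
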
---

[1] https://docs.microsoft.com/en-us/cpp/c-runtime-library/lc-codepage-func
[2] https://docs.microsoft.com/en-us/cpp/c-runtime-library/locale

----------

_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue43552>
_______________________________________