Eryk Sun <[email protected]> added the comment:
Here's some additional background information for work on this issue.
A Unix locale identifier has the following form:
"language[_territory][.codeset][@modifier]"
| "POSIX"
| "C"
| ""
| NULL
(X/Open Portability Guide, Issue 4, 1992 -- aka XPG4)
Some systems also implement "C.UTF-8".
The language and territory should use ISO 639 and ISO 3166 alpha-2 codes. The
"@" modifier may indicate an alternate script such as "sr_RS@latin" or an
alternate currency such as "de_DE@euro". For the optional codeset, IANA
publishes the following table of character sets:
http://www.iana.org/assignments/character-sets/character-sets.xhtml
In Debian Linux, the available encodings are defined by mapping files in
"/usr/share/i18n/charmaps". But encodings can't be arbitrarily used in locales
at run time. A locale has to be generated (see "/etc/locale.gen") before it's
available.
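
As a rough illustration of the Unix form above, here is a minimal sketch that splits an identifier into its four components. The function name parse_unix_locale and the character classes in the pattern are assumptions made for this example; XPG4 does not restrict the components this narrowly, and matching the pattern says nothing about whether the locale has actually been generated on the system.

```python
import re

# Pattern for language[_territory][.codeset][@modifier].
# The character classes are illustrative assumptions, not part of XPG4.
_XPG4_RE = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:_(?P<territory>[A-Za-z]{2}))?"
    r"(?:\.(?P<codeset>[A-Za-z0-9_-]+))?"
    r"(?:@(?P<modifier>[A-Za-z0-9]+))?$"
)


def parse_unix_locale(name):
    """Split a Unix locale identifier into its components.

    Returns a dict for the structured form, or None for the special
    names ("C", "POSIX", "") and for anything the pattern rejects,
    which includes "C.UTF-8".
    """
    if name in ("C", "POSIX", ""):
        return None
    match = _XPG4_RE.match(name)
    return match.groupdict() if match else None
```

For example, parse_unix_locale("sr_RS@latin") yields the language "sr", the territory "RS", no codeset, and the modifier "latin".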
A Windows (not ucrt) locale name has the following form:
"ISO639Language[-ISO15924Script][-ISO3166Region][SubTag][_SortOrder]"
| "" | LOCALE_NAME_INVARIANT
| "!x-sys-default-locale" | LOCALE_NAME_SYSTEM_DEFAULT
| NULL | LOCALE_NAME_USER_DEFAULT
The invariant locale provides stable data. The system and user default locales
vary according to the Control Panel "Region" settings.
A locale name is based on BCP 47 language tags, with the form
"<language>-<script>-<region>" (e.g. "en-Latn-GB"), for which the script and
region codes are optional. The language is an ISO 639 alpha-2 or alpha-3 code,
with alpha-2 preferred. The script is an initial-uppercase ISO 15924 code. The
region is an ISO 3166-1 alpha-2 or numeric-3 code, with alpha-2 preferred.
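
The "<language>-<script>-<region>" shape can be sketched with a small pattern. This is only an illustration: the name parse_locale_name is hypothetical, and the strict letter case used here (lowercase language, initial-uppercase script, uppercase region) is an assumption taken from the conventions described above, not a statement about what Windows itself accepts.

```python
import re

# language: ISO 639 alpha-2 or alpha-3
# script:   ISO 15924, initial uppercase (e.g. "Latn")
# region:   ISO 3166-1 alpha-2 or numeric-3
# Strict letter case is an assumption made for this sketch.
_TAG_RE = re.compile(
    r"^(?P<language>[a-z]{2,3})"
    r"(?:-(?P<script>[A-Z][a-z]{3}))?"
    r"(?:-(?P<region>[A-Z]{2}|[0-9]{3}))?$"
)


def parse_locale_name(name):
    """Return the language, script and region of a locale name as a dict.

    Missing optional parts are None. Returns None if the name does not
    fit the simple three-part shape (sort orders are not handled).
    """
    match = _TAG_RE.match(name)
    return match.groupdict() if match else None
```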
As specified, the sort-order code should be delimited by an underscore, but
Windows 10 (and possibly older versions) accepts a hyphen instead. Here's a list
of the sort-order codes that I've seen:
* mathan - Math Alphanumerics (x-IV_mathan)
* phoneb - Phone Book (de-DE_phoneb)
* modern - Modern (ka-GE_modern)
* tradnl - Traditional (es-ES_tradnl)
* technl - Technical (hu-HU_technl)
* radstr - Radical/Stroke (ja-JP_radstr)
* stroke - Stroke Count (zh-CN_stroke)
* pronun - Pronunciation (Bopomofo) (zh-TW_pronun)
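
To make the delimiter behavior concrete, here is a hedged sketch that separates a sort-order code from the rest of a locale name, accepting either the underscore or the hyphen. The helper split_sort_order and the KNOWN_SORT_ORDERS set are hypothetical names for this example, and the set contains only the codes listed above, so it should not be read as exhaustive.

```python
# Only the sort-order codes listed above; other codes may exist.
KNOWN_SORT_ORDERS = frozenset({
    "mathan", "phoneb", "modern", "tradnl",
    "technl", "radstr", "stroke", "pronun",
})


def split_sort_order(name):
    """Return (base_name, sort_order) for a Windows locale name.

    The underscore delimiter is tried first, then the hyphen.
    sort_order is None when the final component is not a known code.
    """
    for delimiter in ("_", "-"):
        base, sep, suffix = name.rpartition(delimiter)
        if sep and suffix in KNOWN_SORT_ORDERS:
            return base, suffix
    return name, None
```

For instance, both "de-DE_phoneb" and "de-DE-phoneb" split into "de-DE" and "phoneb", while "en-Latn-GB" is returned unchanged with no sort order.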
One final note of interest about Windows locales is that the user-interface
language has been functionally isolated from the locale. The display language
is handled by the Multilingual User Interface (MUI) API, which depends on .mui
files in locale-named subdirectories of a binary, such as "kernel32.dll" ->
"en-US\kernel32.dll.mui". Windows 10 has an option to configure the user locale
to match the preferred display language. This helps to keep the two in sync,
but they're still functionally independent.
The Universal CRT (ucrt) in Windows supports the following syntax for a locale
identifier:
"ISO639Language[-ISO15924Script][-ISO3166Region][.utf8|.utf-8]"
| "ISO639Language[-ISO15924Script][-ISO3166Region][SubTag][_SortOrder]"
| "language[_region][.codepage|.utf8|.utf-8]"
| ".codepage" | ".utf8" | ".utf-8"
| "C"
| ""
| NULL
NULL is used with setlocale to query the current value of a category. The empty
string "" is the current-user locale. "C" is a minimal locale. For LC_CTYPE,
"C" uses Latin-1, but for LC_TIME it uses the system ANSI codepage (possibly
multi-byte), which can lead to mojibake. The "POSIX" locale is not supported,
nor is "C.UTF-8".
Note that UTF-8 support is relatively new, as is the ability to set the
encoding without also specifying a region (e.g. "english.utf8").
Recent versions of ucrt extend BCP-47 support in a couple of ways. Underscore
is allowed in addition to hyphen as the tag delimiter (e.g. "en_GB" instead of
"en-GB"), and specifying UTF-8 as the encoding (and only UTF-8) is supported.
If UTF-8 isn't specified, internally the locale defaults to the language's ANSI
codepage. ucrt has to parse BCP 47 locales manually if they include an
encoding, and also in some cases when underscore is used. Currently this fails
to handle a sort-order tag, so we can't use, for example, "de_DE_phoneb.utf8".
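
One way to check which identifiers a given C runtime accepts is to try them from Python. The following is a minimal sketch using only the standard locale module; the helper name locale_is_accepted is hypothetical. It queries the current setting by passing None (the NULL case described above), attempts the change, and restores the saved value afterwards. Results depend entirely on the platform and runtime version, so no particular outcome is implied for names such as "de_DE_phoneb.utf8".

```python
import locale


def locale_is_accepted(name, category=locale.LC_CTYPE):
    """Report whether setlocale accepts `name` for `category`.

    The previous setting is queried first (locale=None changes
    nothing) and is restored before returning.
    """
    saved = locale.setlocale(category, None)
    try:
        locale.setlocale(category, name)
    except locale.Error:
        return False
    else:
        return True
    finally:
        # Put the original locale back whether or not the change worked.
        locale.setlocale(category, saved)
```

A call such as locale_is_accepted("en_GB.utf8") can then be compared against locale_is_accepted("de_DE_phoneb.utf8") on the system under test.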
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue37945>
_______________________________________