Hello, this is an informative message which summarizes the behavior of the
setlocale/_wsetlocale functions across all CRTs.
As I'm working on a library which implements POSIX functions (in the context of
this topic, the relevant ones are setlocale, newlocale and uselocale) for native
Windows, I will sometimes mention how I've implemented some of the things
described here. However, I will omit details on how the library handles parsing
of locale strings and construction of Windows locale objects (`LCID` or locale
names, depending on configuration).
1. Relevant information about Windows locales
The Windows function GetLocaleInfo[Ex] allows us to obtain important information
about a specific locale:
- LOCALE_SENGLANGUAGE: language name in English (such as "English")
- LOCALE_SENGCOUNTRY: country/region name in English (such as "United States")
- LOCALE_SISO639LANGNAME: ISO 639 language code (such as "en")
- LOCALE_SISO3166CTRYNAME: ISO 3166 country code (such as "US")
- LOCALE_IDEFAULTCODEPAGE: locale's default OEM code page
- LOCALE_IDEFAULTANSICODEPAGE: locale's default ANSI code page
When it comes to a locale's default code pages, we need to distinguish one
special case: Unicode locales. For such locales the default ANSI code page is
returned as CP_ACP (0) and the default OEM code page as CP_OEMCP (1).
Locales also have default MAC (returned as CP_MACCP for Unicode locales) and
EBCDIC (always returned as 500 for Unicode locales) code pages, but they are
mostly irrelevant in the context of this topic.
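To make the Unicode-locale special case concrete, here is a minimal sketch of
how the two default code pages might be queried and checked. The names
`locale_code_pages`, `is_unicode_locale` and `query_code_pages` are hypothetical
and exist only for this illustration; they are not part of any CRT or of my
library. The Windows-only part is guarded so the rest builds anywhere.

```c
#include <stdbool.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Numeric values from <winnls.h>, repeated so the sketch builds anywhere. */
#ifndef CP_ACP
#define CP_ACP 0
#endif
#ifndef CP_OEMCP
#define CP_OEMCP 1
#endif

/* Hypothetical container for a locale's default code pages. */
struct locale_code_pages {
    unsigned ansi; /* value reported for LOCALE_IDEFAULTANSICODEPAGE */
    unsigned oem;  /* value reported for LOCALE_IDEFAULTCODEPAGE */
};

/* Hypothetical helper: true when both values are the CP_ACP/CP_OEMCP
   placeholders which are reported for Unicode locales. */
static bool is_unicode_locale(const struct locale_code_pages *cp)
{
    return cp->ansi == CP_ACP && cp->oem == CP_OEMCP;
}

#ifdef _WIN32
/* Queries both defaults by locale name (needs Windows Vista or later). */
static bool query_code_pages(const wchar_t *name, struct locale_code_pages *out)
{
    DWORD ansi = 0, oem = 0;
    /* With LOCALE_RETURN_NUMBER the buffer is a DWORD, sized in WCHARs. */
    int cch = (int)(sizeof(DWORD) / sizeof(WCHAR));

    if (!GetLocaleInfoEx(name, LOCALE_IDEFAULTANSICODEPAGE | LOCALE_RETURN_NUMBER,
                         (LPWSTR)&ansi, cch))
        return false;
    if (!GetLocaleInfoEx(name, LOCALE_IDEFAULTCODEPAGE | LOCALE_RETURN_NUMBER,
                         (LPWSTR)&oem, cch))
        return false;

    out->ansi = (unsigned)ansi;
    out->oem = (unsigned)oem;
    return true;
}
#endif
```

On Windows, something like `query_code_pages(L"en-US", &cp)` followed by
`is_unicode_locale(&cp)` would be the intended usage.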
2. Locale string formats used by CRT
The following describes the string formats accepted by the CRT's [_w]setlocale
and _[w]create_locale functions.
Important notes: _wsetlocale is available since msvcrt20.dll. _wcreate_locale
is available since msvcr110.dll.
Microsoft documents the supported formats here[1] and in the UTF-8 support
section of setlocale's documentation[2].
2.1. Language_Country.CodePage
This is probably the most well-known and commonly used format. Such a string
can be easily constructed from information returned by GetLocaleInfo[Ex]. There
are a few important points to keep in mind when you use this format.
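As an illustration of the construction, the following is a minimal sketch which
joins the three pieces into a wide string. The function name
`build_locale_string` is hypothetical; the language and country arguments are
assumed to be the strings reported for LOCALE_SENGLANGUAGE and
LOCALE_SENGCOUNTRY. Whether a given CRT accepts the result is a separate
question.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: writes "Language_Country.CodePage" into out.
   Returns 0 on success, -1 if out is too small. */
static int build_locale_string(wchar_t *out, size_t out_len,
                               const wchar_t *language,
                               const wchar_t *country,
                               unsigned code_page)
{
    wchar_t digits[16];
    size_t n = 0;

    /* Render the code page in decimal, least significant digit first. */
    do {
        digits[n++] = (wchar_t)(L'0' + code_page % 10);
        code_page /= 10;
    } while (code_page != 0);

    size_t lang_len = wcslen(language);
    size_t ctry_len = wcslen(country);

    /* language + '_' + country + '.' + digits + terminating NUL */
    if (lang_len + 1 + ctry_len + 1 + n + 1 > out_len)
        return -1;

    wchar_t *p = out;
    wmemcpy(p, language, lang_len);
    p += lang_len;
    *p++ = L'_';
    wmemcpy(p, country, ctry_len);
    p += ctry_len;
    *p++ = L'.';
    while (n > 0)
        *p++ = digits[--n]; /* emit digits most significant first */
    *p = L'\0';
    return 0;
}
```

For example, passing L"English", L"United States" and 1252 produces
L"English_United States.1252".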
First, one unfortunate thing about this format is that GetLocaleInfo[Ex] may
return strings which contain non-ASCII characters. For example, for locale
"quc-Latn-GT" the returned language name is "Kʼicheʼ (Latin)", which contains
U+02BC.
One could think that using _wsetlocale would suffice. That is not true for some
CRTs (to be more precise, the ones which do not support locale names, which
includes msvcr100.dll and all older CRTs). One way to set such a locale with
older CRTs is to convert the string using the locale's default ANSI code page
(in this case 1252, where best-fit conversion will replace U+02BC with U+0027)
and pass the converted string to setlocale. The funny part is that if you
convert this best-fit converted string back to wchar_t, _wsetlocale will accept
it *sigh*.
Note that in the case of Unicode locales, you want to use the active ANSI code
page (GetACP), not code page 65001 (CP_UTF8) explicitly.
For anything older than UCRT (I may end up doing the same for
msvcr{110,120}.dll), I use _wsetlocale with fallback to setlocale. For UCRT I
only use _wsetlocale.
Second, this format has limited support in ancient CRTs (crtdll.dll,
msvcrt{10,20,40}.dll, historical msvcrt.dll versions 4.*, 5.*). This format is
fully supported starting with 6.* versions of msvcrt.dll.
Pali, if you're reading: does it mean we can expect this format to be fully
supported since msvcrt.dll on Windows 2000? msvcrt.dll version 6.0 seems to
have been released in 1998, while Windows 2000 was released in 1999.
Third, not all locales can be set using this format :).
Out of all locales which can be represented by an LCID locale, the following
locales cannot be set:
- Edo_Nigeria (bin-NG)
Out of the locales which can be represented only by locale names, the following
locales cannot be set:
- Asu_Tanzania (asa-TZ)
- Ewe_Ghana (ee-GH)
- Ewe_Togo (ee-TG)
- English_St. Kitts & Nevis (en-KN)
- English_St. Lucia (en-LC)
- English_St Helena, Ascension, Tristan da Cunha (en-SH)
- English_U.S. Outlying Islands (en-UM)
- English_St. Vincent & Grenadines (en-VC)
- English_U.S. Virgin Islands (en-VI)
- French_St. Barthélemy (fr-BL)
- French_St. Martin (fr-MF)
- French_St. Pierre & Miquelon (fr-PM)
- Luo_Kenya (luo-KE)
- Dutch_Bonaire, Sint Eustatius and Saba (nl-BQ)
- Rwa_Tanzania (rwk-TZ)
2.2. Windows Locale Names
Windows introduced locale names as replacements for LCID locales in Windows
Vista. CRT support for locale names was added in msvcr110.dll (same CRT that
added _wcreate_locale, now it's clear to me why it was added so late).
This format has one significant limitation: there is no way to set the code
page explicitly. When you pass a locale name to setlocale/_create_locale, it
will always use the locale's default ANSI code page.
Time to remember Unicode locales. When you pass the name of such a locale to
setlocale/_create_locale, it will use the active ANSI code page (as returned by
GetACP). You may get lucky and it will be 65001 (CP_UTF8), or you may get
unlucky and set/create a broken locale. Also keep in mind that only UCRT
supports UTF-8.
This logic also applies to using the Language_Country format without explicitly
specifying the code page to use, with a few extra details mentioned later in
2.4.
However, using locale names is one of two ways to set locales which cannot be
set using the Language_Country.CodePage format. Another important detail about
locale names is that they allow you to set locales which specify a sorting
order[3]. There is no way to supply this information to
setlocale/_create_locale using any other format.
In my library, when it is configured to use locale names, I use this format
when a non-Unicode locale is created with its default ANSI code page.
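The code page behavior described in this section can be summarized in a small
sketch. Both function names and the `SKETCH_CP_ACP` constant are hypothetical,
and this is only my reading of the rules above written down as code, not the
actual CRT logic or my library's implementation. `active_acp` stands for
whatever GetACP would return.

```c
#include <stdbool.h>

#define SKETCH_CP_ACP 0u /* placeholder reported for Unicode locales */

/* Code page the CRT is described as picking when given a bare locale name:
   the locale's default ANSI code page, or the active ANSI code page when the
   locale is a Unicode locale. */
static unsigned code_page_for_locale_name(unsigned default_ansi_cp,
                                          unsigned active_acp)
{
    return default_ansi_cp == SKETCH_CP_ACP ? active_acp : default_ansi_cp;
}

/* Hypothetical predicate for the rule above: a locale name is only a safe
   choice for a non-Unicode locale requested with its default ANSI code page. */
static bool can_use_locale_name(unsigned default_ansi_cp, unsigned requested_cp)
{
    return default_ansi_cp != SKETCH_CP_ACP && requested_cp == default_ansi_cp;
}
```

Note how a Unicode locale never passes the second check: the result would
depend on the active ANSI code page of the machine, not on the request.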
2.3. ll_CC.UTF-8
UCRT supports UTF-8. Microsoft documentation mentions that it is possible to
set a UTF-8 locale using a string like "en_US.UTF-8". It is important to mention
that this format is only accepted for UTF-8 locales. (why? why not support
specifying any code page? I have so many questions and no answers.)
This format can be constructed from the language and country codes returned by
GetLocaleInfo[Ex]. This is the second way to set locales which cannot be set
using the Language_Country.CodePage format. And look, we can use UTF-8! (yay?)
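A minimal sketch of that construction follows. The function name
`build_utf8_locale_string` is hypothetical, and the two codes are assumed to
be the values reported for LOCALE_SISO639LANGNAME and LOCALE_SISO3166CTRYNAME,
already converted to narrow strings (they are plain ASCII).

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes "ll_CC.UTF-8" into out.
   Returns 0 on success, -1 on error or if the result did not fit. */
static int build_utf8_locale_string(char *out, size_t out_len,
                                    const char *iso639_lang,
                                    const char *iso3166_country)
{
    int n = snprintf(out, out_len, "%s_%s.UTF-8", iso639_lang, iso3166_country);

    /* snprintf reports the length it wanted to write, excluding the NUL. */
    return (n >= 0 && (size_t)n < out_len) ? 0 : -1;
}
```

For example, "ee" and "GH" give "ee_GH.UTF-8", a string for one of the locales
from the list in 2.1.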
I would like to point out that this format should be used with care. Take
the Serbian language as an example, which has two variants:
- Serbian (Cyrillic)_Serbia (sr-Cyrl-RS)
- Serbian (Latin)_Serbia (sr-Latn-RS)
When you use "sr_RS.UTF-8", you cannot be sure which variant will actually be
used.
Another locale deserving mention is ca-ES-valencia (Valencian_Spain) (with
glibc, you would use "ca_ES@valencia"). The language code for this locale is
"ca" and the country code is "ES"; using only this information you would end up
with "ca_ES.UTF-8", which is effectively Catalan_Spain.65001. Handling this one
locale is particularly painful.
In my library I use this format for UTF-8 locales unless either:
- a "script" (such as "Latn" in sr-Latn-RS) was applied during construction of
the Windows locale object
- a "modifier" (such as "@valencia" in ca_ES@valencia) is required to properly
distinguish such a locale (currently, this is only ca_ES@valencia)
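The two exceptions above can be written as a simple predicate. This is only a
sketch of the rule as stated, not code from my library; the `locale_request`
structure, its fields and the function name are hypothetical.

```c
#include <stdbool.h>

#define SKETCH_CP_UTF8 65001u

/* Hypothetical description of the locale being created. */
struct locale_request {
    unsigned code_page;     /* requested code page */
    bool script_applied;    /* e.g. "Latn" in sr-Latn-RS */
    bool modifier_required; /* e.g. "@valencia" in ca_ES@valencia */
};

/* True when "ll_CC.UTF-8" identifies the requested locale unambiguously. */
static bool should_use_ll_cc_utf8(const struct locale_request *req)
{
    if (req->code_page != SKETCH_CP_UTF8)
        return false; /* the format is only accepted for UTF-8 locales */
    return !req->script_applied && !req->modifier_required;
}
```

With this rule, en-US with UTF-8 qualifies, while sr-Latn-RS and
ca-ES-valencia have to be expressed in some other format.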
2.4. "language strings"
As mentioned in 2.1, ancient CRTs provide only partial support for the
Language_Country.CodePage format. What this means is that some locales can be
set with such a string, while others cannot; some of those may be set using a
"language string"[4].
When you use such a "language string", similarly to locale names, you cannot
explicitly specify the code page to use; that's where the interesting part
begins. crtdll.dll and msvcrt10.dll, unlike the rest of the CRTs, will set the
code page to the locale's default OEM code page instead of the locale's default
ANSI code page. If you're gonna use "language strings", this detail must be
taken into account.
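To show how this difference could be accounted for, here is a minimal sketch
which predicts the code page a "language string" ends up with. The `crt_target`
enumeration and the function name are hypothetical; the mapping itself is just
the observation above.

```c
/* Hypothetical identifiers for the CRT a program is built against. */
enum crt_target {
    CRT_CRTDLL,
    CRT_MSVCRT10,
    CRT_MSVCRT20,
    CRT_MSVCRT40,
    CRT_MSVCRT,
    CRT_OTHER
};

/* Code page expected after setting a locale with a "language string":
   crtdll.dll and msvcrt10.dll pick the OEM default, the rest pick ANSI. */
static unsigned language_string_code_page(enum crt_target crt,
                                          unsigned default_ansi_cp,
                                          unsigned default_oem_cp)
{
    switch (crt) {
    case CRT_CRTDLL:
    case CRT_MSVCRT10:
        return default_oem_cp;
    default:
        return default_ansi_cp;
    }
}
```

For a locale whose defaults are 1252 (ANSI) and 437 (OEM), this gives 437 for
crtdll.dll and 1252 for msvcrt20.dll.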
There are a few important details I would like to mention based on my tests
with those ancient CRTs:
First, do not use "swedish-finland". None of the CRTs of interest recognize this
"language string", but they seem to recognize "Swedish_Finland" constructed
from information returned by GetLocaleInfoW.
Second, with CRTs which provide limited support for the
Language_Country.CodePage format, the string "portuguese" will set the locale to
"Portuguese_Portugal". With CRTs which provide full support for this format, it
will set the locale to "Portuguese_Brazil". This issue is relevant only to
msvcrt.dll. All CRTs seem to recognize both "Portuguese_Portugal" and
"Portuguese_Brazil". When building for msvcrt.dll, it is possible to avoid
using "portuguese" and always use the Language_Country format.
In my library, the current condition for using "language strings" is either:
- building for crtdll.dll or msvcrt{10,20,40}.dll
- building for msvcrt.dll and targeting anything older than Windows 2000, in
which case "portuguese" is not used
Currently, the second condition is never true, since the library cannot be
configured for anything older than Windows XP. I'm not sure if I'll add support
for older Windows versions.
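Written as code, the condition could look roughly like the sketch below. The
enumerations `sketch_crt` and `sketch_os` and both function names are
hypothetical and exist only for this illustration; the real configuration
logic in the library is not shown here.

```c
#include <stdbool.h>

/* Hypothetical build-configuration values. */
enum sketch_crt {
    SKETCH_CRTDLL,
    SKETCH_MSVCRT10,
    SKETCH_MSVCRT20,
    SKETCH_MSVCRT40,
    SKETCH_MSVCRT,
    SKETCH_NEWER_CRT
};

enum sketch_os {
    SKETCH_OS_PRE_WIN2000,
    SKETCH_OS_WIN2000_OR_LATER
};

/* True when "language strings" would be used at all. */
static bool use_language_strings(enum sketch_crt crt, enum sketch_os os)
{
    switch (crt) {
    case SKETCH_CRTDLL:
    case SKETCH_MSVCRT10:
    case SKETCH_MSVCRT20:
    case SKETCH_MSVCRT40:
        return true;
    case SKETCH_MSVCRT:
        /* Only old msvcrt.dll versions have limited format support. */
        return os == SKETCH_OS_PRE_WIN2000;
    default:
        return false;
    }
}

/* "portuguese" changes meaning between msvcrt.dll versions, so it is
   avoided there in favor of the Language_Country format. */
static bool may_use_portuguese_string(enum sketch_crt crt)
{
    return crt != SKETCH_MSVCRT;
}
```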
- Kirill Makurin
[1] https://learn.microsoft.com/en-us/cpp/c-runtime-library/locale-names-languages-and-country-region-strings
[2] https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale
[3] https://learn.microsoft.com/en-us/windows/win32/intl/sort-order-identifiers
[4] https://learn.microsoft.com/en-us/cpp/c-runtime-library/language-strings