Hello, this is an informative message which summarizes the behavior of the 
setlocale/_wsetlocale functions across all CRTs.

As I'm working on a library which implements POSIX functions (in the context 
of this topic, the relevant ones are setlocale, newlocale and uselocale) for 
native Windows, I will sometimes mention how I've implemented some of the 
things described here. However, I will omit details on how the library handles 
parsing of locale strings and construction of Windows locale objects (`LCID` 
or locale names, depending on configuration).

1. Relevant information about Windows locales

The Windows function GetLocaleInfo[Ex] allows us to obtain important 
information about a specific locale:

- LOCALE_SENGLANGUAGE: language name in English (such as "English")
- LOCALE_SENGCOUNTRY: country/region name in English (such as "United States")
- LOCALE_SISO639LANGNAME: ISO 639 language code (such as "en")
- LOCALE_SISO3166CTRYNAME: ISO 3166 country code (such as "US")
- LOCALE_IDEFAULTCODEPAGE: locale's default OEM code page
- LOCALE_IDEFAULTANSICODEPAGE: locale's default ANSI code page

When it comes to a locale's default code pages, we need to distinguish one 
special case: Unicode locales. For such locales, the default ANSI code page is 
returned as CP_ACP (0) and the default OEM code page as CP_OEMCP (1).

Locales also have default MAC (returned as CP_MACCP for Unicode locales) and 
EBCDIC code pages (always returned as 500 for Unicode locales), but they are 
mostly irrelevant in context of this topic.

2. Locale string formats used by CRT

The following describes the locale string formats accepted by the CRT's 
[_w]setlocale and _[w]create_locale functions.

Important notes: _wsetlocale is available since msvcrt20.dll. _wcreate_locale 
is available since msvcr110.dll.

Microsoft documents supported formats here[1] and in UTF-8 support section of 
setlocale's documentation[2].

2.1. Language_Country.CodePage

This is probably the most well-known and commonly used format. Such a string 
can be easily constructed from information returned by GetLocaleInfo[Ex]. There 
are some important points to keep in mind when you use this format.
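
To illustrate, here is a minimal sketch of constructing such a string in C. 
The helper name build_language_country_codepage is made up for this example, 
and it assumes the caller has already obtained the LOCALE_SENGLANGUAGE and 
LOCALE_SENGCOUNTRY strings (converted to char) and has chosen a code page:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not a CRT or Windows API): joins the pieces into
 * "Language_Country.CodePage".
 *
 * `language` and `country` are assumed to be the English names reported
 * for LOCALE_SENGLANGUAGE and LOCALE_SENGCOUNTRY.
 *
 * Returns 0 on success, -1 if the result does not fit into `out`.
 */
int build_language_country_codepage(char *out, size_t out_size,
                                    const char *language,
                                    const char *country,
                                    unsigned codepage)
{
    int n;

    assert(out != NULL && language != NULL && country != NULL);

    n = snprintf(out, out_size, "%s_%s.%u", language, country, codepage);

    /* snprintf reports the length it wanted to write; detect truncation. */
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

For example, passing "English", "United States" and 1252 produces 
"English_United States.1252".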

First, one unfortunate thing about this format is that GetLocaleInfo[Ex] may 
return strings which contain non-ASCII characters. For example, for locale 
"quc-Latn-GT", the returned language name is "Kʼicheʼ (Latin)", which contains 
U+02BC.

One could think that using _wsetlocale would suffice. That is not true for some 
CRTs (to be more precise, the ones which do not support locale names, which 
includes msvcr100.dll and all older CRTs). One way to set such a locale with 
older CRTs is to convert the string using the locale's default ANSI code page 
(in this case 1252, where best-fit conversion will replace U+02BC with U+0027) 
and pass the converted string to setlocale. The funny part is that if you 
convert this best-fit converted string back to wchar_t, _wsetlocale will accept 
it *sigh*.

Note that in the case of Unicode locales, you want to use the active ANSI code 
page (GetACP), not code page 65001 (CP_UTF8) explicitly.

For anything older than UCRT (I may end up doing the same for 
msvcr{110,120}.dll), I use _wsetlocale with fallback to setlocale. For UCRT I 
only use _wsetlocale.

Second, this format has limited support in ancient CRTs (crtdll.dll, 
msvcrt{10,20,40}.dll, historical msvcrt.dll versions 4.*, 5.*). This format is 
fully supported starting with 6.* versions of msvcrt.dll.

Pali, if you're reading: does this mean we can expect this format to be fully 
supported starting with msvcrt.dll on Windows 2000? msvcrt.dll version 6.0 
seems to have been released in 1998, while Windows 2000 was released in 1999.

Third, not all locales can be set using this format :).

Out of all locales which can be represented by an LCID, the following locales 
cannot be set:

- Edo_Nigeria (bin-NG)

Out of the locales which can be represented only by locale names, the 
following locales cannot be set:

- Asu_Tanzania (asa-TZ)
- Ewe_Ghana (ee-GH)
- Ewe_Togo (ee-TG)
- English_St. Kitts & Nevis (en-KN)
- English_St. Lucia (en-LC)
- English_St Helena, Ascension, Tristan da Cunha (en-SH)
- English_U.S. Outlying Islands (en-UM)
- English_St. Vincent & Grenadines (en-VC)
- English_U.S. Virgin Islands (en-VI)
- French_St. Barthélemy (fr-BL)
- French_St. Martin (fr-MF)
- French_St. Pierre & Miquelon (fr-PM)
- Luo_Kenya (luo-KE)
- Dutch_Bonaire, Sint Eustatius and Saba (nl-BQ)
- Rwa_Tanzania (rwk-TZ)

2.2. Windows Locale Names

Windows introduced locale names as replacements for LCID locales in Windows 
Vista. CRT support for locale names was added in msvcr110.dll (same CRT that 
added _wcreate_locale, now it's clear to me why it was added so late).

This format has one significant limitation - there is no way to set code page 
explicitly. When you pass locale name to setlocale/_create_locale, it will 
always use locale's default ANSI code page.

Time to remember Unicode locales. When you pass name of such locale to 
setlocale/_create_locale, it will use active ANSI code page (as returned by 
GetACP). You may get lucky and it will be 65001 (CP_UTF8) or you may get 
unlucky and set/create broken locale. Also keep in mind that only UCRT supports 
UTF-8.

This logic also applies to using Language_Country format without explicitly 
specifying code page to use, with a few extra details mentioned later in 2.4.
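
The code page selection described above can be sketched as follows. This is 
only an illustration: the function names are hypothetical, the SKETCH_ 
constants repeat the value of CP_ACP so that the example compiles without 
<windows.h>, and the caller is assumed to pass in the values obtained from 
LOCALE_IDEFAULTANSICODEPAGE and GetACP. The OEM code page special case from 
2.4 is deliberately ignored here:

```c
#include <assert.h>

/* Same value as CP_ACP in the Windows headers. */
#define SKETCH_CP_ACP 0u

/*
 * Hypothetical helper: a locale is treated as a Unicode locale when its
 * default ANSI code page is reported as CP_ACP (0).
 */
int is_unicode_locale(unsigned locale_default_ansi_cp)
{
    return locale_default_ansi_cp == SKETCH_CP_ACP;
}

/*
 * Hypothetical helper: the code page that ends up being used when a
 * locale is set without an explicit code page.
 *
 * locale_default_ansi_cp: value reported for LOCALE_IDEFAULTANSICODEPAGE
 * active_acp:             value returned by GetACP
 */
unsigned effective_ansi_codepage(unsigned locale_default_ansi_cp,
                                 unsigned active_acp)
{
    /* The active code page must be a real identifier, not the CP_ACP alias. */
    assert(active_acp != SKETCH_CP_ACP);

    if (is_unicode_locale(locale_default_ansi_cp))
        return active_acp;             /* Unicode locale: whatever is active */

    return locale_default_ansi_cp;     /* regular locale: its own default */
}
```

With these helpers, a Unicode locale on a system whose active ANSI code page 
is 1250 resolves to 1250, which is the "unlucky" case mentioned above.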

However, using locale names is one of two ways to set locales which cannot be 
set using Language_Country.CodePage format. Another important detail about 
using locale names is that they allow you to set locales which specify sorting 
order[3]. There is no way to supply this information to 
setlocale/_create_locale using any other format.

In my library, when it is configured to use locale names, I'm using this format 
when a non-Unicode locale is created with its default ANSI code page.

2.3. ll_CC.UTF-8

UCRT supports UTF-8. Microsoft documentation mentions that it is possible to 
set UTF-8 locale using string like "en_US.UTF-8". It is important to mention 
that this format is only accepted for UTF-8 locales. (why? why not support 
specifying any code page? I have so many questions and no answers.)

This format can be constructed from language and country codes returned by 
GetLocaleInfo[Ex]. This is second way to set locales which cannot be set using 
Language_Country.CodePage format. And look, we can use UTF-8! (yay?)
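
A minimal sketch of constructing this format, assuming the caller already has 
the LOCALE_SISO639LANGNAME and LOCALE_SISO3166CTRYNAME strings as char. The 
helper name build_ll_cc_utf8 is made up for this example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not a CRT or Windows API): builds "ll_CC.UTF-8"
 * from an ISO 639 language code and an ISO 3166 country code.
 *
 * Returns 0 on success, -1 if the result does not fit into `out`.
 */
int build_ll_cc_utf8(char *out, size_t out_size,
                     const char *lang_code, const char *country_code)
{
    int n;

    assert(out != NULL && lang_code != NULL && country_code != NULL);

    n = snprintf(out, out_size, "%s_%s.UTF-8", lang_code, country_code);

    /* A return value >= out_size means the string was truncated. */
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

For example, "ee" and "GH" give "ee_GH.UTF-8", which covers one of the locales 
listed in 2.1 as impossible to set with Language_Country.CodePage.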

I would like to point out that this format should be used with care. Take 
Serbian language as an example, which has two variants:

- Serbian (Cyrillic)_Serbia (sr-Cyrl-RS)
- Serbian (Latin)_Serbia (sr-Latn-RS)

When you use "sr_RS.UTF-8", you cannot be sure which variant will be actually 
used.

Another locale deserving mention is ca-ES-valencia (Valencian_Spain) (with 
glibc, you would use "ca_ES@valencia"). The language code for this locale is 
"ca" and the country code is "ES". Using only this information, you would end 
up with "ca_ES.UTF-8", which is effectively Catalan_Spain.65001. Handling this 
one locale is particularly painful.

In my library I use this format for UTF-8 locale unless either:

- "script" (such as "Latn" in sr-Latn-RS) was applied during construction of 
Windows locale object
- "modifier" (such as "@valencia" in ca_ES@valencia) is required to properly 
distinguish such locale (currently, this is only ca_ES@valencia)
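
The two conditions above can be expressed as a small predicate. This is a 
sketch, not the library's actual code: struct locale_request and its fields 
are hypothetical stand-ins for whatever the parsing step records about the 
requested locale.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of a parsed locale request. */
struct locale_request {
    unsigned codepage;       /* requested code page, 65001 for UTF-8 */
    bool     has_script;     /* a script such as "Latn" was applied */
    bool     needs_modifier; /* a modifier such as "@valencia" is required */
};

/*
 * Returns true when "ll_CC.UTF-8" is enough to identify the locale,
 * i.e. it is a UTF-8 locale and neither a script nor a modifier is
 * needed to tell it apart from its siblings.
 */
bool can_use_ll_cc_utf8(const struct locale_request *req)
{
    assert(req != NULL);

    if (req->codepage != 65001u)
        return false;   /* this format is only accepted for UTF-8 */

    return !req->has_script && !req->needs_modifier;
}
```

A request for sr-Latn-RS or ca_ES@valencia makes the predicate return false, 
so some other format has to be used for those locales.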

2.4. "language strings"

As mentioned in 2.1, ancient CRTs provide only partial support for 
Language_Country.CodePage format. What this means is that some locales can be 
set with such a string, while others cannot - some of those may be set using 
"language string"[4].

When you use such "language string", similarly to locale names, you cannot 
explicitly specify code page to use; that's where the interesting part begins. 
crtdll.dll and msvcrt10.dll, unlike the rest of CRTs, will set code page to 
locale's default OEM code page instead of locale's default ANSI code page. If 
you're gonna use "language strings", this detail must be taken into account.
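
As a sketch of how this difference could be accounted for, the following 
picks the code page a "language string" is expected to select for a given CRT. 
The enum and the function name are hypothetical; the two code page arguments 
are assumed to come from LOCALE_IDEFAULTANSICODEPAGE and 
LOCALE_IDEFAULTCODEPAGE:

```c
#include <assert.h>

/* Hypothetical identifiers for the CRT being targeted. */
enum crt_kind {
    CRT_CRTDLL,
    CRT_MSVCRT10,
    CRT_MSVCRT20,
    CRT_MSVCRT40,
    CRT_MSVCRT,
    CRT_OTHER
};

/*
 * Code page expected after setting a locale with a "language string":
 * crtdll.dll and msvcrt10.dll pick the default OEM code page, all other
 * CRTs pick the default ANSI code page.
 */
unsigned language_string_codepage(enum crt_kind crt,
                                  unsigned default_ansi_cp,
                                  unsigned default_oem_cp)
{
    assert(crt <= CRT_OTHER);  /* reject out-of-range values */

    switch (crt) {
    case CRT_CRTDLL:
    case CRT_MSVCRT10:
        return default_oem_cp;
    default:
        return default_ansi_cp;
    }
}
```

For a locale with ANSI code page 1252 and OEM code page 437, this gives 437 
for crtdll.dll and 1252 for msvcrt20.dll.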

There are a few important details I would like to mention based on my tests 
with those ancient CRTs:

First, do not use "swedish-finland". None of the CRTs of interest recognize 
this "language string", but they seem to recognize "Swedish_Finland" 
constructed from information returned by GetLocaleInfoW.

Second, with CRTs which provide limited support for Language_Country.CodePage 
format, string "portuguese" will set locale to "Portuguese_Portugal". With CRTs 
which provide full support for this format, it will set locale to 
"Portuguese_Brazil". This issue is relevant only to msvcrt.dll. All CRTs seem 
to recognize both "Portuguese_Portugal" and "Portuguese_Brazil". When building 
for msvcrt.dll, it is possible to avoid usage of "portuguese" and always use 
Language_Country format.

In my library, current condition for using "language strings" is either:

- building for crtdll.dll or msvcrt{10,20,40}.dll
- building for msvcrt.dll and targeting anything older than Windows 2000, in 
which case "portuguese" is not used

Currently, the second condition is never true, since the library cannot be 
configured for anything older than Windows XP. I'm not sure if I'll add support 
for older Windows versions.
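
The condition for using "language strings" can be sketched like this. It is 
an illustration only, not the library's actual code: the enum, the function 
names and the SKETCH_ constant are hypothetical, and the Windows version is 
assumed to be given in _WIN32_WINNT style (0x0500 for Windows 2000):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical identifiers for the CRT being targeted. */
enum target_crt {
    TARGET_CRTDLL,
    TARGET_MSVCRT10,
    TARGET_MSVCRT20,
    TARGET_MSVCRT40,
    TARGET_MSVCRT,
    TARGET_NEWER
};

/* _WIN32_WINNT-style value for Windows 2000. */
#define SKETCH_WINNT_WIN2K 0x0500u

/* True when "language strings" should be used at all. */
bool use_language_strings(enum target_crt crt, unsigned target_winnt)
{
    assert(crt <= TARGET_NEWER);  /* reject out-of-range values */

    switch (crt) {
    case TARGET_CRTDLL:
    case TARGET_MSVCRT10:
    case TARGET_MSVCRT20:
    case TARGET_MSVCRT40:
        return true;
    case TARGET_MSVCRT:
        /* Only when targeting something older than Windows 2000. */
        return target_winnt < SKETCH_WINNT_WIN2K;
    default:
        return false;
    }
}

/* "portuguese" is ambiguous with msvcrt.dll, so it is avoided there. */
bool may_use_portuguese(enum target_crt crt)
{
    return crt != TARGET_MSVCRT;
}
```

With a Windows XP target (0x0501) and msvcrt.dll, use_language_strings 
returns false, which matches the "never true" remark above.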

- Kirill Makurin

[1] https://learn.microsoft.com/en-us/cpp/c-runtime-library/locale-names-languages-and-country-region-strings
[2] https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale
[3] https://learn.microsoft.com/en-us/windows/win32/intl/sort-order-identifiers
[4] https://learn.microsoft.com/en-us/cpp/c-runtime-library/language-strings
