Eryk Sun <eryk...@gmail.com> added the comment:

> The APIs were written at a time where locale modifiers 
> simply did not exist. 

Technically, locale modifiers did exist circa 2000, but I suppose you mean that 
they were uncommon to the point of being unheard of at the time.

The modifier field was specified in the X/Open Portability Guide Issue 3 (XPG3) 
in 1989, and again in XPG4 in 1992 as 
"language[_territory][.codeset][@modifier]". I can't provide links to the 
specifications (they're not freely available), but here's a link to X/Open 
"Internationalisation Guide Version 2" (1993), which defines the modifier field 
in section 5.1.2 (pages 88-89):

https://pubs.opengroup.org/onlinepubs/009269599/toc.pdf
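
To make the field layout concrete, here is a minimal sketch that splits a name 
of the form "language[_territory][.codeset][@modifier]" into its four fields. 
The names XPGLocale and split_xpg_locale are hypothetical, and the regular 
expression is my own loose reading of the format, not taken from the 
specification or from the locale module.

```python
import re
from typing import NamedTuple, Optional

# Loose pattern for "language[_territory][.codeset][@modifier]".
# This is an assumption for illustration, not the XPG grammar itself.
_XPG_LOCALE = re.compile(
    r"^(?P<language>[A-Za-z]+)"
    r"(?:_(?P<territory>[A-Za-z0-9]+))?"
    r"(?:\.(?P<codeset>[^@]+))?"
    r"(?:@(?P<modifier>.+))?$"
)


class XPGLocale(NamedTuple):
    """Hypothetical container for the four fields of a locale name."""
    language: str
    territory: Optional[str]
    codeset: Optional[str]
    modifier: Optional[str]


def split_xpg_locale(name: str) -> XPGLocale:
    """Split a locale name into fields; missing optional fields are None."""
    match = _XPG_LOCALE.match(name)
    if match is None:
        raise ValueError(f"not an XPG-style locale name: {name!r}")
    return XPGLocale(**match.groupdict())
```

For example, split_xpg_locale("sr_RS.UTF-8@latin") keeps "latin" as a separate 
modifier field instead of dropping it.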

> Support could be added via a special locale tuple return
> object, which looks like 2-tuple, but comes with extra attributes
> to store the modifier

That's a good idea and worth implementing. But the _strptime and calendar 
modules have no need to call getlocale(LC_TIME). IMO, it adds fragility for no 
benefit. All they need to save is the result of setlocale(LC_TIME). 
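
The tuple-like return object suggested in the quote above could look roughly 
like this. It is only a sketch: the class name LocaleTuple and the attribute 
name "modifier" are hypothetical, and nothing like it exists in the locale 
module today.

```python
class LocaleTuple(tuple):
    """Hypothetical 2-tuple (language code, encoding) with a modifier."""

    def __new__(cls, code, encoding, modifier=None):
        # The tuple itself stays a plain 2-tuple, so existing callers that
        # unpack or compare the getlocale() result keep working.
        self = super().__new__(cls, (code, encoding))
        # The modifier is carried as an extra attribute only.
        self.modifier = modifier
        return self
```

Existing code such as "code, encoding = result" is unaffected, while new code 
can read result.modifier.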

Also, the default locale for calendar.LocaleTextCalendar has no need to use 
getdefaultlocale() instead of using an empty string, i.e. setlocale(LC_TIME, 
""). The latter is simpler and more reliable.

---

> support for Windows is only partial, due to the
> completely different approach Windows' CRT took to locales.

Using the same implementation for POSIX and Windows is needlessly complicated, 
and it makes the behavior difficult to reason about in all cases.

I suggest implementing separate versions of normalize() and _parse_localename() 
for Windows, making use of direct queries via _winapi.GetLocaleInfoEx() (to be 
added). 

The mapping in encodings.aliases also needs comprehensive coverage for Windows 
code pages (e.g. cp20127 -> ascii, cp28591 -> latin_1). A poor match, such as 
code page 20932 and euc_JP, should not be aliased. (For every 3-byte sequence 
in standard euc-JP, code page 20932 uses a 2-byte sequence instead, dropping 
the lead byte and masking the third byte into the ASCII range.)

If the locale string doesn't include a codeset, then normalize() shouldn't do 
anything to obtain one. It's not necessary in Windows. If there's a codeset, 
normalize() should ensure it's "UTF-8", "OCP", "ACP", or a Windows code page in 
the right form, e.g. "ascii" -> "20127". ucrt supports "ACP" and "OCP" codesets 
for the locale's ANSI and OEM code pages. These must be in uppercase, e.g. 
"hindi.acp" -> "hindi.ACP". ucrt will set the latter as "Hindi_India.utf8" 
(it's a Unicode-only locale), which should parse as ("Hindi_India", "UTF-8").
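
A minimal sketch of the codeset rule above, assuming a small hand-written 
table from Python codec names to Windows code pages. The function names and 
the table are hypothetical, and a real implementation would need far more 
complete coverage.

```python
import codecs

# Assumed, deliberately tiny table: Python codec name -> Windows code page.
_CODEC_TO_CODEPAGE = {"ascii": "20127", "iso8859-1": "28591"}


def normalize_codeset(codeset):
    """Return "UTF-8", "ACP", "OCP", or a numeric Windows code page."""
    upper = codeset.upper()
    if upper in ("ACP", "OCP"):
        return upper  # ucrt requires these in uppercase
    # Accept "1250" and "cp1250" alike and return just the digits.
    digits = upper[2:] if upper.startswith("CP") else upper
    if digits.isdigit():
        return digits
    try:
        name = codecs.lookup(codeset).name
    except LookupError:
        raise ValueError(f"unknown codeset: {codeset!r}") from None
    if name == "utf-8":
        return "UTF-8"
    try:
        return _CODEC_TO_CODEPAGE[name]
    except KeyError:
        raise ValueError(f"no code page known for {codeset!r}") from None


def normalize_windows_locale(name):
    """Normalize only the codeset part; never invent a missing codeset."""
    base, sep, codeset = name.partition(".")
    if not sep:
        return name
    return f"{base}.{normalize_codeset(codeset)}"
```

With this sketch, "hindi.acp" becomes "hindi.ACP" and a name without a codeset 
is returned unchanged.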

If the locale without the codeset isn't a valid Windows BCP-47 locale name, as 
determined by the NLS API, then normalize() should only care about 
case-insensitive normalization of "C" and "POSIX" as "C", e.g. "c" -> "C". No 
other normalization is necessary. ucrt supports case-insensitive 
"language[_country[.codepage]]" and ".codepage" forms, where language and 
country are either the full English names, LOCALE_SENGLISHLANGUAGENAME and 
LOCALE_SENGLISHCOUNTRYNAME, or 3-letter abbreviations, LOCALE_SABBREVLANGNAME 
and LOCALE_SABBREVCTRYNAME, such as "enu_USA". It also supports locale aliases 
such as "american[.codeset]". If the result isn't "C" or a BCP-47 locale name, 
ucrt setlocale() always returns the "language_country.codepage" form with full 
English names.

A BCP-47 locale name such as "en" or "en_US" cannot be used with a codeset 
other than UTF-8. If no codeset is specified, ucrt implicitly uses the locale's 
ANSI code page. 

If a BCP-47 locale name is paired with a codeset that's neither the given 
locale's ANSI codepage nor UTF-8, then normalize it to the 
"language_country.codepage" form. For example, "fr_FR.latin-1" -> 
"French_France.28591". Parse the latter as ("French_France", "ISO-8859-1"). 

If a BCP-47 locale name is paired with the locale's ANSI code page, then 
normalize it without the code page, e.g. "sr_Latn_RS.cp1250" -> "sr_Latn_RS". 
Look up the locale's ANSI code page when parsing the latter, e.g. "sr_Latn_RS" 
-> ("sr_Latn_RS", "cp1250"). 

If a BCP-47 locale name is paired with UTF-8, then there isn't much to do other 
than normalize the locale name and encoding name, e.g. "en_us.utf8" -> 
"en_US.UTF-8".

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue43115>
_______________________________________