Eryk Sun <eryk...@gmail.com> added the comment:
> The APIs were written at a time where locale modifiers
> simply did not exist.

Technically, locale modifiers did exist circa 2000, but I suppose you mean that they were uncommon to the point of being unheard of at the time. The modifier field was specified in the X/Open Portability Guide Issue 3 (XPG3) in 1989, and again in XPG4 in 1992 as "language[_territory][.codeset][@modifier]". I can't provide links to the specifications (they're not freely available), but here's a link to X/Open "Internationalisation Guide Version 2" (1993), which defines the modifier field in section 5.1.2 (pages 88-89):

https://pubs.opengroup.org/onlinepubs/009269599/toc.pdf

> Support could be added via a special locale tuple return
> object, which looks like 2-tuple, but comes with extra attributes
> to store the modifier

That's a good idea and worth implementing. But the _strptime and calendar modules have no need to call getlocale(LC_TIME). IMO, it adds fragility for no benefit. All they need to save is the result of setlocale(LC_TIME). Also, the default locale for calendar.LocaleTextCalendar has no need to use getdefaultlocale() instead of using an empty string, i.e. setlocale(LC_TIME, ""). The latter is simpler and more reliable.

---

> support for Windows is only partial, due to the
> completely different approach Windows' CRT took to locales.

Using the same implementation for POSIX and Windows is needlessly complicated, and it makes it difficult to reason about how the code behaves in all cases. I suggest implementing separate versions of normalize() and _parse_localename() for Windows, making use of direct queries via _winapi.GetLocaleInfoEx() (to be added).

The mapping in encodings.aliases also needs comprehensive coverage for Windows code pages (e.g. cp20127 -> ascii, cp28591 -> latin_1, etc). A poor match should not be aliased, such as code page 20932 and euc_JP.
(For all 3-byte sequences in standard euc-JP, code page 20932 encodes 2-byte sequences by dropping the lead byte and masking the third byte as ASCII.)

If the locale string doesn't include a codeset, then normalize() shouldn't do anything to obtain one. It's not necessary in Windows. If there's a codeset, normalize() should ensure it's "UTF-8", "OCP", "ACP", or a Windows code page in the right form, e.g. "ascii" -> "20127". ucrt supports "ACP" and "OCP" codesets for the locale's ANSI and OEM code pages. These must be in uppercase, e.g. "hindi.acp" -> "hindi.ACP". ucrt will set the latter as "Hindi_India.utf8" (it's a Unicode-only locale), which should parse as ("Hindi_India", "UTF-8").

If the locale without the codeset isn't a valid Windows BCP-47 locale name, as determined by the NLS API, then normalize() should only care about case-insensitive normalization of "C" and "POSIX" as "C", e.g. "c" -> "C". No other normalization is necessary. ucrt supports case-insensitive "language[_country[.codepage]]" and ".codepage" forms, where language and country are either the full English names, LOCALE_SENGLISHLANGUAGENAME and LOCALE_SENGLISHCOUNTRYNAME, or 3-letter abbreviations, LOCALE_SABBREVLANGNAME and LOCALE_SABBREVCTRYNAME, such as "enu_USA". It also supports locale aliases such as "american[.codeset]". If the result isn't "C" or a BCP-47 locale name, ucrt setlocale() always returns the "language_country.codepage" form with full English names.

A BCP-47 locale name such as "en" or "en_US" cannot be used with a codeset other than UTF-8. If no codeset is specified, ucrt implicitly uses the locale's ANSI code page. If a BCP-47 locale name is paired with a codeset that's neither the given locale's ANSI codepage nor UTF-8, then normalize it to the "language_country.codepage" form. For example, "fr_FR.latin-1" -> "French_France.28591". Parse the latter as ("French_France", "ISO-8859-1").

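As a rough sketch of the codeset rules above (this is not a proposed patch; the function names and the tiny codec-to-code-page table are made up for illustration, and treating a "cpNNNN" spelling as code page NNNN is my assumption), the parts that need no NLS query could look something like this. Validating BCP-47 names and fixing the case of the locale name itself would need GetLocaleInfoEx() and is left out:

```python
import codecs

# Hypothetical, deliberately incomplete table mapping canonical Python
# codec names to Windows code page numbers. A real table would need
# comprehensive coverage.
_CODEC_TO_CODE_PAGE = {
    "ascii": "20127",
    "iso8859-1": "28591",
}


def normalize_codeset_sketch(codeset):
    """Return the codeset in a form ucrt is described as accepting."""
    upper = codeset.upper()
    if upper in ("ACP", "OCP"):
        # ucrt requires these two in uppercase.
        return upper
    if upper in ("UTF8", "UTF-8"):
        return "UTF-8"
    if codeset.isdigit():
        # Already a bare code page number.
        return codeset
    if upper.startswith("CP") and codeset[2:].isdigit():
        # Assumption: "cp1252" means Windows code page 1252.
        return codeset[2:]
    try:
        name = codecs.lookup(codeset).name
    except LookupError:
        # Unknown codec: leave it alone and let setlocale() reject it.
        return codeset
    return _CODEC_TO_CODE_PAGE.get(name, codeset)


def normalize_windows_sketch(localename):
    """Normalize only what can be decided without asking the NLS API."""
    name, dot, codeset = localename.partition(".")
    if not dot:
        # No codeset: do not try to obtain one.
        if name.upper() in ("C", "POSIX"):
            return "C"
        return localename
    return f"{name}.{normalize_codeset_sketch(codeset)}"
```

For instance, normalize_windows_sketch("hindi.acp") gives "hindi.ACP", and a string with no codeset such as "French_France" is returned unchanged.
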
If a BCP-47 locale name is paired with the locale's ANSI code page, then normalize it without the code page, e.g. "sr_Latn_RS.cp1250" -> "sr_Latn_RS". Look up the locale's ANSI code page when parsing the latter, e.g. "sr_Latn_RS" -> ("sr_Latn_RS", "cp1250").

If a BCP-47 locale name is paired with UTF-8, then there isn't much to do other than normalize the locale name and encoding name, e.g. "en_us.utf8" -> "en_US.UTF-8".

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue43115>
_______________________________________