Eryk Sun <[email protected]> added the comment:

If normalize() is implemented for Windows, then the tests should be split out
into POSIX and Windows versions. Currently, most of the tests in NormalizeTest
are not checking a result that's properly normalized for ucrt.

A useful implementation of locale.normalize should allow a script to use
("en_US", "iso8859_1") in Windows without having to know that Latin-1 is
Windows codepage 28591, or that ucrt requires a classic locale name if the
encoding isn't UTF-8. The required result for setlocale() is "English_United
States.28591".

As far as aliases are concerned, at a minimum, we need to map "posix" and "c"
to "C". We can also support "C.UTF-8" as "en_US.UTF-8". Do we need to support
the Unix locale_alias mappings from X.org? If so, I suppose we could use a
double mapping. First try the Unix locale_alias mapping. Then try that result
in a windows_locale_alias mapping that includes additional mappings from Unix
to Windows. For example:

    sr_CS.UTF-8          -> sr_Cyrl_CS.UTF-8
    sr_CS.UTF-8@latin    -> sr_Latn_CS.UTF-8
    ca_ES.UTF-8@valencia -> ca_ES_valencia.UTF-8

Note that the last one doesn't currently work. "ca-ES-valencia" is a valid
Windows locale name for the Valencian variant of Catalan (ca), which lacks an
ISO 639 code of its own since it's officially (and somewhat controversially)
designated as a dialect of Catalan. This is an unusual case that has a subtag
after the region, which ucrt's manual BCP-47 parsing cannot handle. (It tries
to parse "ES" as the script and "valencia" as an ISO 3166-1 country code.)

After mapping aliases, if the result still has "@" in it, normalize() should
fail. We don't know what the "@" modifier means.
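A rough sketch of the double mapping and the "@" check, to make the idea
concrete. The table names, the map_aliases() helper, and the choice of
ValueError for the failure case are all hypothetical; the entries are only
the examples mentioned in this message, and the real Unix table would be the
existing locale.locale_alias.

```python
# Hypothetical stand-in for the Unix alias table (keys are lowercase).
UNIX_LOCALE_ALIAS = {
    "posix": "C",
    "c": "C",
}

# Hypothetical second-stage table: Unix-style names -> Windows-style names.
WINDOWS_LOCALE_ALIAS = {
    "C.UTF-8": "en_US.UTF-8",
    "sr_CS.UTF-8": "sr_Cyrl_CS.UTF-8",
    "sr_CS.UTF-8@latin": "sr_Latn_CS.UTF-8",
    "ca_ES.UTF-8@valencia": "ca_ES_valencia.UTF-8",
}


def map_aliases(name):
    """Apply the Unix alias mapping, then the Windows one."""
    # First try the Unix mapping; keep the name unchanged if there is no entry.
    name = UNIX_LOCALE_ALIAS.get(name.lower(), name)
    # Then try that result in the Unix-to-Windows mapping.
    name = WINDOWS_LOCALE_ALIAS.get(name, name)
    # A leftover "@" modifier has no known meaning for ucrt, so fail.
    if "@" in name:
        raise ValueError(f"unsupported locale modifier in {name!r}")
    return name
```

With these tables, map_aliases("sr_CS.UTF-8@latin") gives "sr_Latn_CS.UTF-8",
while a name with an unmapped modifier such as "de_DE.UTF-8@euro" raises.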
Otherwise, split the locale name and encoding parts. If the encoding isn't
UTF-8, try to map it to a codepage. For this we need a windows_codepage_alias
dict that maps IANA official and Python-specific encoding names to Windows
codepages. Next, check the locale name via WINAPI IsValidLocaleName. If it's
not valid, try replacing underscores with hyphens and check again. If it's
still not valid, assume it's a classic ucrt locale name. (It may not be valid,
but implementing all of the work ucrt does to parse a classic locale name is
too much, I think.)
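The split and validity check could look roughly like the following. This is
only a sketch: WINDOWS_CODEPAGE_ALIAS here is a tiny hypothetical subset, the
split_and_check() helper and its return shape are made up for illustration,
and the validator is passed in as a parameter so the logic can be exercised
without Windows. On Windows the default calls kernel32's IsValidLocaleName
through ctypes.

```python
import sys

# Hypothetical partial table; a real one would cover IANA and Python names.
WINDOWS_CODEPAGE_ALIAS = {
    "iso8859_1": 28591,
    "iso-8859-1": 28591,
    "latin_1": 28591,
    "cp1252": 1252,
    "windows-1252": 1252,
}


def _is_valid_locale_name(name):
    """Ask Windows whether name is a valid locale name (Windows only)."""
    import ctypes
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.IsValidLocaleName.argtypes = [ctypes.c_wchar_p]
    return bool(kernel32.IsValidLocaleName(name))


def _is_utf8(encoding):
    return encoding.lower().replace("-", "").replace("_", "") == "utf8"


def split_and_check(name, is_valid=None):
    """Return (locale_name, codepage or None, is_windows_locale_name)."""
    if is_valid is None:
        if sys.platform != "win32":
            raise OSError("IsValidLocaleName is only available on Windows")
        is_valid = _is_valid_locale_name
    locale_name, _, encoding = name.partition(".")
    codepage = None
    if encoding and not _is_utf8(encoding):
        # Unknown encodings are left as None in this sketch.
        codepage = WINDOWS_CODEPAGE_ALIAS.get(encoding.lower())
    if is_valid(locale_name):
        return locale_name, codepage, True
    hyphenated = locale_name.replace("_", "-")
    if is_valid(hyphenated):
        return hyphenated, codepage, True
    # Assume a classic ucrt locale name; it is not verified here.
    return locale_name, codepage, False
```

How to handle an encoding that is missing from the table is not decided in
this message, so the sketch just leaves the codepage unset in that case.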
If it's a valid Windows locale name, and we have a codepage encoding, then try
to translate it as a classic ucrt locale name. This requires two WINAPI
GetLocaleInfoEx calls to look up the English versions of the language and
country name.
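For the translation step, something along these lines might work. The helper
names (get_locale_info, classic_locale_name) are hypothetical, and the lookup
function is injectable so the string assembly can be tested off Windows. The
two LCTYPE constants are LOCALE_SENGLISHLANGUAGENAME and
LOCALE_SENGLISHCOUNTRYNAME from the Windows SDK headers.

```python
LOCALE_SENGLISHLANGUAGENAME = 0x00001001
LOCALE_SENGLISHCOUNTRYNAME = 0x00001002


def get_locale_info(locale_name, lctype):
    """Call WINAPI GetLocaleInfoEx via ctypes (Windows only)."""
    import ctypes
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetLocaleInfoEx.argtypes = [
        ctypes.c_wchar_p, ctypes.c_uint32, ctypes.c_wchar_p, ctypes.c_int]
    kernel32.GetLocaleInfoEx.restype = ctypes.c_int
    buf = ctypes.create_unicode_buffer(256)
    # GetLocaleInfoEx returns 0 on failure.
    if not kernel32.GetLocaleInfoEx(locale_name, lctype, buf, len(buf)):
        raise ctypes.WinError(ctypes.get_last_error())
    return buf.value


def classic_locale_name(locale_name, codepage, get_info=get_locale_info):
    """Build a classic ucrt name such as 'English_United States.28591'."""
    # Two lookups: the English language name and the English country name.
    language = get_info(locale_name, LOCALE_SENGLISHLANGUAGENAME)
    country = get_info(locale_name, LOCALE_SENGLISHCOUNTRYNAME)
    return f"{language}_{country}.{codepage}"
```

On Windows, classic_locale_name("en-US", 28591) is intended to produce the
"English_United States.28591" form that setlocale() requires, assuming the
system reports "English" and "United States" for that locale.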
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue37945>
_______________________________________