Eryk Sun <eryk...@gmail.com> added the comment:

> All getlocale is used for in _strptime.py is comparing the value 
> returned to the previous value returned.

Which is why _strptime should be calling setlocale(LC_TIME), the same as the 
calendar module does. That's not to say that getlocale() and normalize() don't 
need to be fixed. But returning None for the encoding when there's no codeset, 
while it works in a few cases, doesn't address many others.
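
To illustrate the first point: a query-only setlocale() call returns the platform's raw locale string with no parsing or alias lookup, which is all a changed-locale check needs. A minimal sketch:

```python
import locale

# Query-only call: omitting the second argument returns the current
# LC_TIME setting as an opaque string, with no parsing or guessing.
saved = locale.setlocale(locale.LC_TIME)

# A change check can compare raw strings, as the calendar module does,
# instead of comparing getlocale()'s parsed (and possibly wrong) tuples.
assert locale.setlocale(locale.LC_TIME) == saved
```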

For example, normalize() and getlocale() are often wrong when an encoding has 
to be guessed, such as normalize('en_US') -> 'en_US.ISO8859-1' and 
normalize('ja_JP') -> 'ja_JP.eucJP'. The guessed encoding is wrong, and since 
no encoding except UTF-8 is allowed in a BCP-47 locale, setlocale() will fail.
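
These guesses can be reproduced on any platform, since normalize() only consults the static locale_alias table:

```python
import locale

# normalize() guesses an encoding from the locale_alias table; on
# Windows these guesses are wrong, and setlocale() rejects any
# non-UTF-8 encoding in a BCP-47 locale name.
print(locale.normalize('en_US'))  # -> 'en_US.ISO8859-1'
print(locale.normalize('ja_JP'))  # -> 'ja_JP.eucJP'
```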

In all but four cases, classic ucrt "language_country.codepage" locales such as 
"Japanese_Japan.932" are parsed in a 'benignly' incorrect way (i.e. not as RFC 
1766 language tags), which at least roundtrips with setlocale(). getlocale() 
simply splits out the codeset, e.g. ('Japanese_Japan', '932'). The four 
misbehaving cases are actually the ones for which getlocale() works as 
documented, because the locale_alias mapping has an entry for them.

    * "French_France.1252" -> ('fr_FR', 'cp1252')
    * "German_Germany.1252" -> ('de_DE', 'cp1252')
    * "Portuguese_Brazil.1252" -> ('pt_BR', 'cp1252')
    * "Spanish_Spain.1252" -> ('es_ES', 'cp1252')
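
The benign split versus the alias rewrite can be seen with the private helper that getlocale() uses (shown only to illustrate; _parse_localename() is an implementation detail):

```python
import locale

# No locale_alias entry for 'japanese_japan.932', so the string is
# merely split at the codeset and roundtrips with setlocale().
print(locale._parse_localename('Japanese_Japan.932'))  # -> ('Japanese_Japan', '932')

# 'french_france.1252' does have an alias entry, so the tuple is
# rewritten to ('fr_FR', 'cp1252'), which no longer roundtrips.
print(locale._parse_localename('French_France.1252'))
```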

The problem is that the parsed tuples don't roundtrip because normalize() keeps 
the encoding, complete with the 'cp' prefix, and only UTF-8 is allowed in a 
BCP-47 locale. For example:

    >>> locale.setlocale(locale.LC_CTYPE, 'French_France.1252')
    'French_France.1252'
    >>> locale.getlocale()
    ('fr_FR', 'cp1252')
    >>> locale.normalize(locale._build_localename(locale.getlocale()))
    'fr_FR.cp1252'
    >>> try: locale.setlocale(locale.LC_CTYPE, locale.getlocale())
    ... except locale.Error as e: print(e)
    ...
    unsupported locale setting

I suppose normalize() could be special-cased on Windows to look for a BCP-47 
locale and omit the encoding if it isn't UTF-8. I suppose _parse_localename() 
could be special-cased to use None for the encoding if there's no codeset. But 
this just leaves me feeling unsettled and disappointed, because we could be 
doing a better job of providing the documented behavior of getlocale() and 
normalize() by implementing them separately for Windows using the tools that 
the OS provides.

FYI, I've commented on this problem across a few issues, including bpo-20088 
and bpo-23425 in early 2015, and then extensively in bpo-37945 in mid 2019. 
Plus my latest comments in msg387256 in this issue. The latter suggestions 
could be combined with something like the mapping that's generated by the code 
in msg235937 in bpo-23425, in order to parse a classic ucrt locale string such 
as "Japanese_Japan.932" properly as ("ja_JP", "cp932"), and then build and 
normalize it back as "Japanese_Japan.932".

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue43115>
_______________________________________