Eryk Sun <eryk...@gmail.com> added the comment:

That the CRT caches the tzname strings as ANSI multibyte strings is frustrating 
-- whether or not it's buggy. I would expect there to be a _wtzname cache of 
the native OS strings that wcsftime uses directly, with no potential for failed 
encodings (e.g. empty strings or mojibake).

It's also strange that it encodes the time-zone name using the system ANSI 
codepage in the C locale. Normally LC_CTYPE in the C locale uses Latin-1, due 
to simple casting between WCHAR and CHAR. This leads to mojibake when the ANSI 
time-zone name gets decoded as Latin-1 by an internal mbstowcs call in 
wcsftime. I'm not saying one or the other is necessarily right, but more care 
should have gone into this. At the very least, if we're stuck with system ANSI 
tzname strings in the C locale, then a flag should be set that tells wcsftime 
to decode them as system ANSI strings instead of via mbstowcs.
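
As a rough illustration of that mismatch, here is a pure-Python sketch. It 
assumes a system ANSI codepage of 1251 and a made-up time-zone name; the codecs 
only stand in for what tzset and mbstowcs do, and none of this is CRT code:

```python
# Hypothetical setup: system ANSI codepage 1251 (Cyrillic), LC_CTYPE "C".
name = 'Москва (зима)'  # illustrative name, not a real Windows string

ansi_bytes = name.encode('cp1251')       # stand-in for tzset's ANSI encoding
mojibake = ansi_bytes.decode('latin-1')  # stand-in for mbstowcs in the C locale

# Decoding with the codepage that was actually used recovers the name.
recovered = ansi_bytes.decode('cp1251')
print(ascii(mojibake), recovered == name)
```

The proposed flag would amount to picking the second decode instead of the 
first.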

Also, the timezone name is determined by the preferred UI language of the 
current user, which is not necessarily compatible with the system ANSI 
codepage. It's not even necessarily compatible with the user-locale ANSI 
codepage, as used by setlocale(LC_CTYPE, ""). Windows 10 at least provides an 
option to sync the user locale with the user preferred UI language. IMO, this 
is a strong argument in favor of using _wtzname wide-character strings. UI 
Language (MUI) and locale are not tightly coupled in Windows NLS.
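
For comparison, the native wide-character names can be read straight from the 
OS. The following is only a sketch using ctypes and GetTimeZoneInformation; the 
helper name native_tz_names is mine, and it returns None on anything other than 
Windows:

```python
import ctypes
import sys

class SYSTEMTIME(ctypes.Structure):
    _fields_ = [(field, ctypes.c_uint16) for field in (
        'wYear', 'wMonth', 'wDayOfWeek', 'wDay',
        'wHour', 'wMinute', 'wSecond', 'wMilliseconds')]

class TIME_ZONE_INFORMATION(ctypes.Structure):
    _fields_ = [
        ('Bias', ctypes.c_int32),
        ('StandardName', ctypes.c_wchar * 32),
        ('StandardDate', SYSTEMTIME),
        ('StandardBias', ctypes.c_int32),
        ('DaylightName', ctypes.c_wchar * 32),
        ('DaylightDate', SYSTEMTIME),
        ('DaylightBias', ctypes.c_int32),
    ]

TIME_ZONE_ID_INVALID = 0xFFFFFFFF

def native_tz_names():
    """Return (standard, daylight) names as str, or None off Windows."""
    if sys.platform != 'win32':
        return None
    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    kernel32.GetTimeZoneInformation.restype = ctypes.c_uint32
    kernel32.GetTimeZoneInformation.argtypes = [
        ctypes.POINTER(TIME_ZONE_INFORMATION)]
    info = TIME_ZONE_INFORMATION()
    if kernel32.GetTimeZoneInformation(ctypes.byref(info)) == TIME_ZONE_ID_INVALID:
        raise ctypes.WinError(ctypes.get_last_error())
    # WCHAR arrays come back as str, so no ANSI codepage is involved.
    return info.StandardName, info.DaylightName

print(native_tz_names())
```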

Here's an example where the user's preferred language is Hindi, and the time 
zone name is "समन्वित वैश्विक समय" (i.e. Coordinated Universal Time), but the 
system locale is English with codepage 1252 (for Western European languages). 
This is a normal configuration if the system locale doesn't have beta UTF-8 
support enabled, or if the process ANSI codepage isn't overridden to UTF-8 via 
the "activeCodePage" manifest setting.

The tzname strings normally get set by a one-time _tzset call, and they're only 
reset if tzset is called manually. tzset uses the system ANSI encoding if 
LC_CTYPE is the "C" locale (again, normally ucrt uses Latin-1 in the "C" 
locale). Since the encoding of the Hindi timezone name to codepage 1252 
contains the default character ("?"), which is not allowed, tzset sets the 
tzname strings to empty strings.
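
The effect can be modeled in pure Python. This is a sketch of the behavior 
described above, not the CRT's implementation; encode_like_tzset is a 
hypothetical helper, and a strict codec error stands in for the CRT detecting 
the default character:

```python
def encode_like_tzset(name: str, codepage: str) -> bytes:
    """Model only: an unencodable name yields an empty byte string."""
    try:
        return name.encode(codepage)  # strict: any unmappable character fails
    except UnicodeEncodeError:
        return b''                    # the cached tzname ends up empty

hindi = 'समन्वित वैश्विक समय'
print(encode_like_tzset(hindi, 'cp1252'))
print(encode_like_tzset(hindi, 'utf-8') == hindi.encode('utf-8'))
```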

    import ctypes, locale, time
    ucrt = ctypes.CDLL('ucrtbase', use_errno=True)
    ucrt.__tzname.restype = ctypes.POINTER(ctypes.c_char_p)
    tzname = ucrt.__tzname()

    >>> locale.setlocale(locale.LC_CTYPE, 'C')
    'C'
    >>> ucrt._tzset()
    0
    >>> tzname[0], tzname[1]
    (b'', b'')
    >>> time.strftime('%Z')
    ''

If we update the LC_CTYPE category to use UTF-8, the cached tzname value 
doesn't get automatically updated, and strftime still returns an empty string:

    >>> locale.setlocale(locale.LC_CTYPE, '.utf8')
    'Hindi_India.utf8'

    >>> tzname[0], tzname[1]
    (b'', b'')
    >>> time.strftime('%Z')
    ''

The tzname values get updated if we manually call tzset:

    >>> ucrt._tzset()
    0
    >>> tzname[0].decode('utf-8'), tzname[1].decode('utf-8')
    ('समन्वित वैश्विक समय', 'समन्वित वैश्विक समय')

However, LC_TIME is still in the "C" locale. strftime uses system ANSI (1252) 
in this case, so the encoded result from the CRT strftime call ends up using 
the default character (?):

    >>> time.strftime('%Z')
    '??????? ??????? ???'
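
The question-mark pattern can be approximated without the CRT. In this sketch, 
Python's errors='replace' handler stands in for the default-character 
substitution; that equivalence is an assumption for illustration:

```python
name = 'समन्वित वैश्विक समय'

# errors='replace' substitutes b'?' for each code point that codepage 1252
# cannot represent; the ASCII spaces pass through unchanged.
replaced = name.encode('cp1252', errors='replace')
print(replaced.decode('cp1252'))
```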

If we set LC_TIME to UTF-8, we finally get a valid result:

    >>> locale.setlocale(locale.LC_TIME, '.utf8')
    'Hindi_India.utf8'
    >>> time.strftime('%Z')
    'समन्वित वैश्विक समय'

We wouldn't have to worry about LC_TIME here if Python called C wcsftime 
instead of C strftime. The problem that bpo-10653 was trying to work around is 
a design flaw in the C runtime library, and calling strftime is not a solution. 

Here's a variation on my example in msg243660, continuing with the current 
Hindi example. The setup in this example uses UTF-8 as the system ANSI codepage 
(via python_utf8.exe) and sets LC_CTYPE to the "C" locale. This yields the 
following monstrosity:

    >>> time.strftime('%Z')
    'Ã\xa0Â¤Â¸Ã\xa0Â¤Â®Ã\xa0Â¤Â¨Ã\xa0Â¥Â\x8dÃ\xa0Â¤ÂµÃ\xa0Â¤Â¿Ã\xa0Â¤Â¤ Ã\xa0Â¤ÂµÃ\xa0Â¥Â\x88Ã\xa0Â¤Â¶Ã\xa0Â¥Â\x8dÃ\xa0Â¤ÂµÃ\xa0Â¤Â¿Ã\xa0Â¤Â\x95 Ã\xa0Â¤Â¸Ã\xa0Â¤Â®Ã\xa0Â¤Â¯'

It's due to the following sequence of encoding and decoding operations:

    >>> mbs_lcctype_utf8 = 'समन्वित वैश्विक समय'.encode('utf-8')
    >>> wcs_lcctype_latin1 = mbs_lcctype_utf8.decode('latin-1')
    >>> mbs_lctime_utf8 = wcs_lcctype_latin1.encode('utf-8')

The final step is the decoding done by PyUnicode_DecodeLocaleAndSize via 
mbstowcs:

    >>> py_str_lcctype_latin1 = mbs_lctime_utf8.decode('latin-1')
    >>> py_str_lcctype_latin1 == time.strftime('%Z')
    True
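
Because every step in that chain is lossless, the damage is reversible. The 
following sketch wraps the chain and its inverse in two helpers; the names 
garble and ungarble are mine, purely for illustration:

```python
name = 'समन्वित वैश्विक समय'

def garble(s: str) -> str:
    """Apply the UTF-8 encode / Latin-1 decode round twice, as above."""
    for _ in range(2):
        s = s.encode('utf-8').decode('latin-1')
    return s

def ungarble(s: str) -> str:
    """Undo garble(); raises UnicodeError if s is not double mojibake."""
    for _ in range(2):
        s = s.encode('latin-1').decode('utf-8')
    return s

print(ungarble(garble(name)) == name)
```

This is a diagnostic aid only; it doesn't help in the earlier cases where the 
name was already reduced to an empty string or to question marks.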

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue36792>
_______________________________________