[Python-ideas] Re: Make UTF-8 mode more accessible for Windows users.

M.-A. Lemburg Thu, 11 Feb 2021 05:23:54 -0800

On 11.02.2021 13:49, Eryk Sun wrote:
> On 2/11/21, M.-A. Lemburg <m...@egenix.com> wrote:
> 
>> I think the main problem here is that open() doesn't use
>> locale.getlocale()[1] as default for the encoding parameter,
>> but instead locale.getpreferredencoding(False).
> 
> Currently, locale.getpreferredencoding(False) is implemented as
> locale._get_locale_encoding(). This ultimately calls
> _Py_GetLocaleEncoding(), defined in "Python/fileutils.c".
> TextIOWrapper() calls this C function to get the encoding to use when
> encoding=None is passed.


All that seems to be new in Python 3.10. This is not what's
happening in Python 3.9. The _get_locale_encoding() function
doesn't even exist.
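
For anyone who wants to check this on their own interpreter, here is a minimal sketch that uses only public APIs. The helper names default_open_encoding() and same_codec() are made up for illustration; the assumption is simply that a file opened without an encoding argument reports the default that open() chose.

```python
import codecs
import locale
import os
import tempfile


def default_open_encoding():
    """Return the encoding open() picks when no encoding is passed."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "probe.txt")
        with open(path, "w") as f:  # encoding deliberately omitted
            return f.encoding


def same_codec(a, b):
    """Compare two encoding names after resolving codec aliases."""
    return codecs.lookup(a).name == codecs.lookup(b).name


opened = default_open_encoding()
preferred = locale.getpreferredencoding(False)
print(opened, preferred, same_codec(opened, preferred))
```

The comparison goes through codecs.lookup() because the two calls may spell the same codec differently (e.g. 'UTF-8' vs. 'utf-8').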

> In POSIX, _Py_GetLocaleEncoding() calls nl_langinfo(CODESET), which
> returns the current LC_CTYPE encoding, not the default LC_CTYPE
> encoding. For example, in Linux:
> 
>     >>> setlocale(LC_CTYPE, 'en_US.UTF-8')
>     'en_US.UTF-8'
>     >>> _get_locale_encoding()
>     'UTF-8'
>     >>> open('test.txt').encoding
>     'UTF-8'
> 
>     >>> setlocale(LC_CTYPE, 'en_US.ISO-8859-1')
>     'en_US.ISO-8859-1'
>     >>> _get_locale_encoding()
>     'ISO-8859-1'
>     >>> open('test.txt').encoding
>     'ISO-8859-1'
> 
> In Windows, _Py_GetLocaleEncoding() just uses GetACP(), which returns
> the process ANSI code page. This is based on the CRT's default locale
> set by setlocale(LC_CTYPE, ""), which combines the user's default
> locale with the process ANSI code page. I'm not overjoyed about this
> combination in the default locale, since it's potentially inconsistent
> (e.g. Korean user locale with Latin 1252 process code page), but that
> ship sailed a long time ago. I'm not arguing to change
> locale.getdefaultlocale().
> 
> The problem is that locale._get_locale_encoding() in Windows is not
> returning the current LC_CTYPE locale encoding, in contrast to how it
> behaves in POSIX. I'd like an environment variable and/or -X option to
> fix this flaw. If enabled, and if the C runtime supports UTF-8 locales
> (as it has for the past 3 years in Windows 10), and the application
> warrants it (e.g. many open calls across many modules), then
> convenient use of UTF-8 would be one setlocale() call away.

Why an env variable? You could simply open a ticket to get this
fixed, since 3.10 has not been released yet.

> It's not for packages. Frankly, I don't see why it's a problem for a
> package developer to use encoding='utf-8' for files that need to use
> UTF-8. Developing libraries that are designed to work in arbitrary
> applications on multiple platforms is tedious work. Having to
> explicitly pass encoding='utf-8' goes with the territory, and it's a
> minor annoyance in the grand scheme of things.
> 
>> That's what getlocale(LC_CTYPE) is intended for, unless I'm
>> missing something.
> 
> getlocale() can't be relied on to parse the correct codeset from the
> locale name, and it can even raise ValueError (more likely in Windows,
> e.g. with the native locale name "en-US"). The codeset should be
> queried directly using an API call, such as nl_langinfo(CODESET) in
> POSIX.
> 
> In Windows, the C runtime's POSIX locale implementation doesn't
> include nl_langinfo(). There's ___lc_codepage_func(), but it's
> documented as an internal function. A ucrt locale record, however,
> does expose the code page as a public field, as documented in the
> public header "corecrt.h". Here's a prototype using ctypes:
> 
>     import os
>     import ctypes
> 
>     ucrt = ctypes.CDLL('ucrtbase', use_errno=True)
> 
>     class _crt_locale_data_public(ctypes.Structure):
>         _fields_ = (('_locale_pctype', ctypes.POINTER(ctypes.c_ushort)),
>                     ('_locale_mb_cur_max', ctypes.c_int),
>                     ('_locale_lc_codepage', ctypes.c_uint))
> 
>     class _crt_locale_pointers(ctypes.Structure):
>         _fields_ = (('locinfo', ctypes.POINTER(_crt_locale_data_public)),
>                     ('mbcinfo', ctypes.c_void_p))
> 
>     ucrt._get_current_locale.restype = ctypes.POINTER(_crt_locale_pointers)
> 
>     CP_UTF8 = 65001
> 
>     def _get_locale_encoding():
>         locale = ucrt._get_current_locale()
>         if not locale:
>             errno = ctypes.get_errno()
>             raise OSError(errno, os.strerror(errno))
>         try:
>             codepage = locale[0].locinfo[0]._locale_lc_codepage
>         finally:
>             ucrt._free_locale(locale)
>         if codepage == 0:
>             return 'latin-1' # "C" locale
>         if codepage == CP_UTF8:
>             return 'utf-8'
>         return f'cp{codepage}'
> 
> Examples with Python 3.9 in Windows 10:
> 
>     >>> setlocale(LC_CTYPE, 'C')
>     'C'
>     >>> _get_locale_encoding()
>     'latin-1'
>     >>> setlocale(LC_CTYPE, 'en_US')
>     'en_US'
>     >>> _get_locale_encoding()
>     'cp1252'
>     >>> setlocale(LC_CTYPE, 'el_GR')
>     'el_GR'
>     >>> _get_locale_encoding()
>     'cp1253'
>     >>> setlocale(LC_CTYPE, 'en_US.utf-8')
>     'en_US.utf-8'
>     >>> _get_locale_encoding()
>     'utf-8'

Here's what I get with Python 3.9 on Windows 10:

>>> from locale import *
>>> setlocale(LC_CTYPE, 'C')
'C'
>>> _get_locale_encoding()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name '_get_locale_encoding' is not defined
>>> getlocale(LC_CTYPE)
(None, None)
>>> setlocale(LC_CTYPE, 'en_US')
'en_US'
>>> getlocale(LC_CTYPE)
('en_US', 'ISO8859-1')
>>> setlocale(LC_CTYPE, 'el_GR')
'el_GR'
>>> getlocale(LC_CTYPE)
('el_GR', 'ISO8859-7')
>>> setlocale(LC_CTYPE, 'en_US.utf-8')
'en_US.utf-8'
>>> getlocale(LC_CTYPE)
('en_US', 'UTF-8')
>>>

Note that _get_locale_encoding() is not available in 3.9, so I am
using getlocale() instead.

The returned values for the encoding look mostly correct to
me, except the one for the 'C' locale which should be 'ascii'.

getpreferredencoding() doesn't honor those changes, though.
It returns 'cp1252' for me, or 'UTF-8' when using UTF-8
mode.

Now, if I explicitly set the locale, I'd expect this to be
used by Python for I/O as well. This currently doesn't happen
and that's confusing:

>>> setlocale(LC_CTYPE, 'de_DE.utf-8')
'de_DE.utf-8'
>>> getpreferredencoding()
'cp1252'
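
The same comparison can be repeated for other locales with a small helper. This is only a sketch: compare_encodings() is a hypothetical name, the locale names in the loop are examples that may not be installed on a given system, and the results depend on the platform and Python version.

```python
import locale


def compare_encodings(locale_name):
    """Set LC_CTYPE temporarily and report what each API says.

    Returns (encoding from getlocale(), getpreferredencoding(False)),
    or None if the locale is not available on this system.
    """
    saved = locale.setlocale(locale.LC_CTYPE)  # query only, no change
    try:
        try:
            locale.setlocale(locale.LC_CTYPE, locale_name)
        except locale.Error:
            return None
        try:
            from_getlocale = locale.getlocale(locale.LC_CTYPE)[1]
        except ValueError:
            # getlocale() can fail to parse some locale names
            from_getlocale = None
        return from_getlocale, locale.getpreferredencoding(False)
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # restore previous locale


for name in ("C", "en_US.utf-8", "de_DE.utf-8"):
    print(name, compare_encodings(name))
```

The previous LC_CTYPE setting is restored in the finally block, so running the helper should not affect the rest of the process.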

Anyway, UTF-8 mode is the way to go these days, esp. if you want
to write applications which are portable across platforms and
behave the same on all.
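
As a quick way to see what UTF-8 mode changes, the sketch below starts a child interpreter with the -X utf8 option (setting PYTHONUTF8=1 in the environment should have the same effect) and reports what it sees. The helper name preferred_encoding_in_utf8_mode() is made up for this example, and it assumes Python 3.7 or later.

```python
import subprocess
import sys


def preferred_encoding_in_utf8_mode():
    """Run a child interpreter with -X utf8 and return
    (sys.flags.utf8_mode, locale.getpreferredencoding(False))."""
    code = (
        "import locale, sys; "
        "print(sys.flags.utf8_mode, locale.getpreferredencoding(False))"
    )
    result = subprocess.run(
        [sys.executable, "-X", "utf8", "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    flag, encoding = result.stdout.split()
    return int(flag), encoding


print(preferred_encoding_in_utf8_mode())
```

The exact spelling of the encoding name ('UTF-8' or 'utf-8') differs between Python versions, so compare it case-insensitively.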

-- 
Marc-Andre Lemburg
eGenix.com

_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/X4EIMLTBBSYKCUDJ65GBJIGFOECJPOZD/
Code of Conduct: http://python.org/psf/codeofconduct/
