[Python-ideas] Re: Add a couple of options to open()'s mode parameter to deal with common text encodings

Eryk Sun Thu, 04 Feb 2021 18:58:37 -0800

On 2/4/21, Ben Rudiak-Gould <benrud...@gmail.com> wrote:
>
> My proposal is to add a couple of single-character options to open()'s mode
> parameter. 'b' and 't' already exist, and the encoding parameter
> essentially selects subcategories of 't', but it's annoyingly verbose and
> so people often omit it.
>
> If '8' was equivalent to specifying encoding='UTF-8', and 'L' was
> equivalent to specifying encoding=(the real locale encoding, ignoring UTF-8
> mode), that would go a long way toward making open more convenient in the
> common cases on Windows, and I bet it would encourage at least some of
> those developing on Unixy platforms to write more portable code also.


A precedent for using the mode parameter is [_w]fopen in MSVC, which
supports a "ccs=<encoding>" flag, where "<encoding>" can be "UTF-8",
"UTF-16LE", or "UNICODE".
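
For comparison, here is a rough Python sketch of how the quoted '8' and 'L' flags might behave if they were layered on top of the existing open(). The wrapper name open_with_flags is hypothetical and not part of any proposal text. It assumes 'L' means the locale encoding with UTF-8 mode ignored, which locale.getencoding() provides in Python 3.11+. On older versions the fallback, locale.getpreferredencoding(False), still honors UTF-8 mode.

```python
import locale


def open_with_flags(path, mode="r", **kwargs):
    """Hypothetical wrapper: translate '8' / 'L' mode flags into encoding=."""
    flags = [c for c in mode if c in "8L"]
    if len(flags) > 1:
        raise ValueError("at most one of '8' and 'L' may be given")
    if flags:
        if "b" in mode:
            raise ValueError("encoding flags are not valid in binary mode")
        if "encoding" in kwargs:
            raise ValueError("encoding specified both in mode and as keyword")
        if flags[0] == "8":
            kwargs["encoding"] = "utf-8"
        else:
            # locale.getencoding() (3.11+) ignores UTF-8 mode; the fallback
            # for older versions does not, so it is only an approximation.
            get = getattr(locale, "getencoding", None)
            kwargs["encoding"] = get() if get else locale.getpreferredencoding(False)
        # Strip the extra flags so the built-in open() accepts the mode.
        mode = mode.replace("8", "").replace("L", "")
    return open(path, mode, **kwargs)
```

With this sketch, open_with_flags(path, "w8") would behave like open(path, "w", encoding="utf-8").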

---

In terms of using the 'locale', keep in mind that the implementation
in Windows doesn't use the current LC_CTYPE locale. It only uses the
default locale, which in turn uses the process active (ANSI) code
page. The latter is a system setting, unless overridden to UTF-8 in
the application manifest (e.g. the manifest that's embedded in
"python.exe").

I'd like to see support for a -X option and/or environment variable to
make Python in Windows actually use the current locale to get the
locale encoding (a real shocker, I know). For example,
setlocale(LC_CTYPE, "el_GR") would select "cp1253" (Greek) as the
locale encoding, while setlocale(LC_CTYPE, "el_GR.utf-8") would select
"utf-8" as the locale encoding.
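
To see the gap described above on a given machine, a small sketch like the following prints the current LC_CTYPE setting next to the encoding Python reports as preferred. The helper name describe_locale_encoding is made up for illustration. Calling locale.setlocale() with only a category queries the current setting and does not change it.

```python
import locale


def describe_locale_encoding():
    """Return the current LC_CTYPE name and Python's preferred encoding."""
    return {
        # Query only: no locale argument means "return the current setting".
        "lc_ctype": locale.setlocale(locale.LC_CTYPE),
        # The encoding open() falls back to when encoding= is omitted.
        "preferred": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for key, value in describe_locale_encoding().items():
        print(f"{key}: {value}")
```

In Windows, after calling locale.setlocale(locale.LC_CTYPE, "el_GR"), the first value would be expected to change while the second stays at the ANSI code page, which is the behavior the proposed option would alter.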

(The CRT supports UTF-8 in locales starting with Windows 10, build
17134, released on 2018-04-30.)

At startup, Python 3.8+ calls setlocale(LC_CTYPE, "") to use the
default locale, for use with C functions such as mbstowcs(). This
allows the default behavior to remain the same, unless the new option
also entails attempting locale coercion to UTF-8 via
setlocale(LC_CTYPE, ".utf-8").

The following gets the current locale's code page in C:

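    #include <locale.h>
    // ...
    _locale_t loc = _get_current_locale();
    __crt_locale_data_public *locinfo =
        (__crt_locale_data_public *)loc->locinfo;
    unsigned int cp = locinfo->_locale_lc_codepage;
    _free_locale(loc);  // _get_current_locale() returns a new reference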

The "C" locale uses code page 0. C mbstowcs() and wcstombs() handle
this case as Latin-1. locale._get_locale_encoding() could instead map
it to the process ANSI code page, GetACP(). Also, the CRT displays
CP_UTF8 (65001) as "utf8". _get_locale_encoding() should map it to
"utf-8" instead of "cp65001".
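
The mapping suggested above could be sketched in Python roughly as follows. The function name encoding_from_codepage and its ansi_cp parameter are hypothetical. In a real implementation the ANSI code page would come from GetACP() and the locale code page from the CRT, neither of which is called here.

```python
CP_UTF8 = 65001


def encoding_from_codepage(cp, ansi_cp):
    """Map a CRT locale code page to a Python encoding name (sketch).

    cp      -- code page of the current LC_CTYPE locale (0 for the "C" locale)
    ansi_cp -- process ANSI code page, i.e. what GetACP() would return
    """
    if cp == 0:
        # "C" locale: fall back to the process ANSI code page.
        cp = ansi_cp
    if cp == CP_UTF8:
        # Prefer the standard name over "cp65001".
        return "utf-8"
    return f"cp{cp}"
```

For example, encoding_from_codepage(1253, 1252) gives "cp1253" and encoding_from_codepage(0, 1252) gives "cp1252".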
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/MZC4DDCTMOX25ZQVUGBNLE6VPVXHXNKU/
Code of Conduct: http://python.org/psf/codeofconduct/
