STINNER Victor <[email protected]> added the comment:
> RE making UnixMain public, I'd rather the core runtime require a known
> encoding, rather than trying to detect it. We should move the call into the
> detection logic into Programs/python.c so that embedders have to opt-in to
> detection (many embedding scenarios will prefer to do their own encoding).
Unix is a very complex beast and Python makes it worse by adding more options
(PEP 538 and PEP 540). Py_UnixMain() works "as expected": it uses the LC_CTYPE
locale encoding.
If you want to force the use of UTF-8, you can opt in to UTF-8 mode: for
example, call putenv("PYTHONUTF8=1") before Py_UnixMain().
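
A minimal sketch of that opt-in, for illustration only. The helper name enable_utf8_mode() is hypothetical and not part of CPython; the only real calls are the standard putenv() and the PYTHONUTF8 variable mentioned above. It assumes a POSIX system.

```c
#define _XOPEN_SOURCE 700  /* expose putenv() under strict -std modes */
#include <stdlib.h>

/* Hypothetical helper (not a CPython API): request UTF-8 mode through the
   environment. It has to run before the interpreter reads its
   configuration, so call it first thing in main(). */
static int
enable_utf8_mode(void)
{
    /* putenv() keeps a pointer to the string instead of copying it, so the
       buffer must outlive the call: use static storage. */
    static char utf8_mode[] = "PYTHONUTF8=1";

    return putenv(utf8_mode);  /* 0 on success, non-zero on failure */
}
```

An embedding main() would call enable_utf8_mode() and only then hand argc/argv over to Py_UnixMain(), the entry point discussed in this issue.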
You cannot pass an encoding to Py_UnixMain() because the implementation of
Python relies heavily on the LC_CTYPE locale: see the Py_DecodeLocale() and
Py_EncodeLocale() functions. In any case, Python must use the locale encoding
to avoid mojibake. To be able to load its own codecs, Python must first use the
codec from the C library: mbstowcs() and wcstombs(). Python has a few codecs
implemented in C, such as ASCII, UTF-8 and Latin1, but locales are far more
diverse than that. For example, ISO-8859-15 is used for "euro" locale variants.
Example:
$ LANG=fr_FR.iso885915@euro python3 -c 'import sys; print(sys.getfilesystemencoding())'
iso8859-15
Python has an ISO-8859-15 codec, but it's implemented in pure Python. Python
uses importlib to load the codec, but how does Python decode and encode
filenames to import Lib/encodings/iso8859_15.py? That's why
mbstowcs()/wcstombs() and Py_DecodeLocale()/Py_EncodeLocale() come into
play :-) Enjoy:
PyObject*
PyUnicode_DecodeFSDefaultAndSize(const char *s, Py_ssize_t size)
{
    PyInterpreterState *interp = _PyInterpreterState_GET_UNSAFE();
    const _PyCoreConfig *config = &interp->core_config;
#if defined(__APPLE__)
    return PyUnicode_DecodeUTF8Stateful(s, size, config->filesystem_errors,
                                        NULL);
#else
    /* Bootstrap check: if the filesystem codec is implemented in Python, we
       cannot use it to encode and decode filenames before it is loaded. Load
       the Python codec requires to encode at least its own filename. Use the C
       implementation of the locale codec until the codec registry is
       initialized and the Python codec is loaded. See initfsencoding(). */
    if (interp->fscodec_initialized) {
        return PyUnicode_Decode(s, size,
                                config->filesystem_encoding,
                                config->filesystem_errors);
    }
    else {
        return unicode_decode_locale(s, size,
                                     config->filesystem_errors, 0);
    }
#endif
}
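
To make the C library side of this bootstrap concrete, here is a rough standalone sketch of decoding bytes with mbstowcs(). It is not the CPython implementation: Py_DecodeLocale() does more (error handlers, for one), and decode_with_locale() is a hypothetical name used only for this illustration.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: decode a NUL-terminated byte string using the C
   library codec of the current LC_CTYPE locale. Returns the number of wide
   characters written (terminator excluded), or (size_t)-1 if the bytes are
   invalid in that locale or the output buffer is too small. */
static size_t
decode_with_locale(const char *bytes, wchar_t *out, size_t out_len)
{
    size_t n;

    if (out_len == 0) {
        return (size_t)-1;
    }
    n = mbstowcs(out, bytes, out_len);
    if (n == (size_t)-1) {
        return n;              /* invalid multibyte sequence */
    }
    if (n == out_len) {
        return (size_t)-1;     /* no room left for the terminator */
    }
    return n;                  /* out[n] is already L'\0' */
}
```

A C program starts in the "C" locale, so the user's encoding (ISO-8859-15 in the example above) only applies after a call such as setlocale(LC_CTYPE, ""). No Python codec and no import is needed for this step, which is why it can be used before the codec registry is initialized.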
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue36204>
_______________________________________