[issue19846] Python 3 raises Unicode errors with the C locale

STINNER Victor Mon, 09 Dec 2013 02:42:49 -0800

STINNER Victor added the comment:

I'm closing the issue as invalid, because Python 3 behaviour is correct and 
must not be changed.



Standard streams (sys.stdin, sys.stdout, sys.stderr) use the locale encoding.
sys.stdin and sys.stdout use the strict error handler, sys.stderr uses the
backslashreplace error handler. These encodings and error handlers can be
overridden with the PYTHONIOENCODING environment variable. Since Python 3.3,
it's possible to set only the error handler using the ":errors" syntax
(ex: PYTHONIOENCODING=":replace").
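To check these settings on a given machine, here is a minimal sketch that starts a child interpreter and reports the encoding and error handler of its three standard streams. The helper name stream_settings and the CHILD_CODE constant are made up for this illustration; the encodings it reports depend on the platform, locale, and Python version.

```python
import os
import subprocess
import sys

# Small program run in a child interpreter: one "encoding errors" line per stream.
CHILD_CODE = (
    "import sys\n"
    "for stream in (sys.stdin, sys.stdout, sys.stderr):\n"
    "    print(stream.encoding, stream.errors)\n"
)


def stream_settings(extra_env=None):
    """Return {stream name: (encoding, errors)} as seen by a child Python.

    Hypothetical helper, written for illustration only.
    """
    env = dict(os.environ)
    env.pop("PYTHONIOENCODING", None)  # start from a known state
    env.update(extra_env or {})
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    lines = result.stdout.decode("ascii").splitlines()
    names = ("stdin", "stdout", "stderr")
    return {name: tuple(line.split()) for name, line in zip(names, lines)}


if __name__ == "__main__":
    print("default :", stream_settings())
    # Only the error handler is overridden; the encoding is left alone.
    print("override:", stream_settings({"PYTHONIOENCODING": ":replace"}))
```

With the ":replace" override, sys.stdout should report the replace handler, while sys.stderr keeps backslashreplace because the error handler part of PYTHONIOENCODING does not apply to it.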


Python uses sys.getfilesystemencoding() to decode data from / encode data to 
the operating system. Example of operating system data: command line arguments, 
environment variables, host names, filenames, user names, etc.
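As a small illustration of that boundary, the sketch below reports the codec Python uses for operating system data and passes a filename through os.fsencode() and os.fsdecode(), the two standard functions built on it. The helper names os_boundary_codec and roundtrip are invented for this example, and sys.getfilesystemencodeerrors() requires Python 3.6 or newer.

```python
import os
import sys


def os_boundary_codec():
    """Return the (encoding, error handler) used for operating system data."""
    # sys.getfilesystemencodeerrors() was added in Python 3.6.
    return sys.getfilesystemencoding(), sys.getfilesystemencodeerrors()


def roundtrip(name):
    """Encode a filename to bytes for the OS, then decode it back to str."""
    encoded = os.fsencode(name)
    return encoded, os.fsdecode(encoded)


if __name__ == "__main__":
    print(os_boundary_codec())
    print(roundtrip("example.txt"))
```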

On Windows, Python tries to use the wide character (Unicode) API of Windows
everywhere to avoid any conversion, so as not to lose data. The MBCS codec
(ANSI code page) of Windows uses a replace error handler by default, so it
loses data. Try for example os.listdir() in a directory containing filenames
not encodable to the ANSI code page in Python 2 (or os.listdir(b'.') in
Python 3).
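The data loss can be reproduced on any platform with an explicit codec. This is only a sketch: it uses cp1252 as a stand-in for the ANSI code page (the real one depends on the Windows configuration), and lossy_ansi_name is a name invented for this example.

```python
def lossy_ansi_name(name, code_page="cp1252"):
    """Encode a filename with the replace handler, then decode it back.

    cp1252 stands in for the ANSI code page; this is an assumption made
    for the illustration, not what a given Windows system necessarily uses.
    """
    raw = name.encode(code_page, "replace")
    return raw, raw.decode(code_page)


if __name__ == "__main__":
    # Every character missing from the code page becomes "?", so the
    # original name cannot be recovered from the bytes.
    print(lossy_ansi_name("\u65e5\u672c\u8a9e.txt"))
    # A name that fits in the code page survives unchanged.
    print(lossy_ansi_name("caf\u00e9.txt"))
```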

On Mac OS X, Python always uses UTF-8 for sys.getfilesystemencoding() (with the
surrogateescape error handler, see PEP 383). The locale encoding is ignored
for sys.getfilesystemencoding() (the locale encoding is still used in some 
functions).

On other operating systems... it's more complex. Python uses the locale
encoding for sys.getfilesystemencoding() (with the surrogateescape error
handler, see PEP 383). For the POSIX locale (aka the "C" locale), you may
get the ASCII encoding on Linux, ASCII on FreeBSD and Solaris (these operating
systems announce an alias of the ISO 8859-1 encoding, but use ASCII in
practice), ISO 8859-1 on AIX, etc. Using the locale encoding is the best
choice for interoperability with other applications (which also use the locale
encoding).
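To make the surrogateescape handler concrete, here is a minimal sketch with the ASCII codec, as you would get in the POSIX locale on Linux. The helper names decode_os_bytes and encode_os_text are hypothetical; Python applies the same handler internally when it decodes and encodes operating system data.

```python
def decode_os_bytes(raw, encoding="ascii"):
    """Decode bytes from the OS; undecodable bytes become lone surrogates."""
    return raw.decode(encoding, "surrogateescape")


def encode_os_text(text, encoding="ascii"):
    """Encode text for the OS; lone surrogates turn back into the raw bytes."""
    return text.encode(encoding, "surrogateescape")


if __name__ == "__main__":
    raw = b"caf\xe9.txt"            # ISO 8859-1 bytes, not valid ASCII
    text = decode_os_bytes(raw)     # byte 0xE9 is kept as U+DCE9
    print(ascii(text))
    print(encode_os_text(text) == raw)  # the round trip is lossless

    try:
        text.encode("utf-8")        # strict encoding rejects the surrogate
    except UnicodeEncodeError as exc:
        print("cannot encode:", exc.reason)
```

The round trip preserves the filename byte for byte, but the decoded string cannot be written to a stream that uses the strict error handler, which is where the Unicode errors reported in this issue come from.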

Even if an application uses "raw bytes" (like Python 2), these bytes are still 
"locale aware". For example, when "raw bytes" are written to the standard 
output, bytes are decoded to find the appropriate character in the font of the 
terminal. When "raw bytes" are written into a socket to generate an HTML
document (ex: listing of a directory, so a list of filenames), the web browser
will decode them from the encoding announced in the HTML page. Even if the
encoding is not explicit, it does still exist. Read the other comments on this
issue for other examples.
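A short sketch of why "raw bytes" always carry an implicit encoding: the same bytes give different text depending on which encoding the consumer assumes. The function name interpretations and the list of encodings are chosen only for this example.

```python
def interpretations(raw, encodings=("utf-8", "iso-8859-1", "cp1251")):
    """Decode the same bytes with several encodings.

    The value is None when the bytes are not valid in that encoding.
    """
    result = {}
    for encoding in encodings:
        try:
            result[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            result[encoding] = None
    return result


if __name__ == "__main__":
    # A filename written by a program running in a UTF-8 locale.
    raw = "caf\u00e9.txt".encode("utf-8")
    for encoding, text in interpretations(raw).items():
        print(encoding, ascii(text))
```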

Forcing the POSIX locale to get a user interface in English is wrong if you
also expect your application to still generate valid "raw bytes" in your
"system" encoding (ISO 8859-1, ShiftJIS, UTF-8, whatever). To change the 
language, the correct environment variable is LC_MESSAGES: use LC_MESSAGES=C.
Or better, use a real English locale, which will probably handle currency,
numbers, etc. better. Example: LC_MESSAGES=en_US.utf8 (on Fedora, the "en_US"
locale uses the ISO 8859-1 encoding).

----------
resolution:  -> invalid
status: open -> closed
title: Setting LANG=C breaks Python 3 on Linux -> Python 3 raises Unicode 
errors with the C locale
versions: +Python 3.3, Python 3.4 -Python 3.5

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue19846>
_______________________________________