[issue19846] Setting LANG=C breaks Python 3 on Linux
Serhiy Storchaka added the comment:

> And yet, in Python 2, people could do that, and Python didn't care.
> *That's* the regression I'm worried about. If it hadn't round-tripped
> cleanly in Python 2, I wouldn't care here either.

$ python2.7 -c "print u'\u20ac'"
€
$ LANG=C python2.7 -c "print u'\u20ac'"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)

And even worse:

$ python2.7 -c "print u'\u20ac'" > /dev/null
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)

What a wart!

Other programs can produce wrong (or even absolutely senseless) output with the C locale.

$ LANG=C ls
?? ?? ?? ?? ?? ?? ?? ??

Which is better: silently producing corrupted output, or raising an exception? If the former, then let's just set the "replace" or "backslashreplace" error handler for sys.stdout by default.

-- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue19846 ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
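To make the trade-off in the comment above concrete, here is a minimal sketch comparing the three error handlers when a character cannot be represented in ASCII. The helper name encode_ascii is made up for illustration; only str.encode() and its standard error handlers are used.

```python
# Compare error handlers for a character that ASCII cannot represent.
# encode_ascii is a hypothetical helper, not part of any patch in this issue.
EURO = "\u20ac"


def encode_ascii(text, errors):
    """Encode text to ASCII using the given error handler."""
    return text.encode("ascii", errors)


# "replace" silently loses information.
replaced = encode_ascii(EURO, "replace")

# "backslashreplace" keeps the code point visible as an escape sequence.
escaped = encode_ascii(EURO, "backslashreplace")

# "strict" (the default) raises instead of producing altered output.
try:
    encode_ascii(EURO, "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(replaced, escaped, strict_failed)
```

With "replace" the original character cannot be recovered from the output, while "backslashreplace" at least leaves a readable trace of what was lost.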
[issue19846] Setting LANG=C breaks Python 3 on Linux
Serhiy Storchaka added the comment:

> sworddragon@ubuntu:~$ LANG=C
> sworddragon@ubuntu:~$ ä
> bash: $'\303\244': command not found
>
> - The terminal doesn't pseudo-crash with an exception because it doesn't care about encodings.
> - It allows changing the encoding at runtime.

That is not the locale of your terminal. Try `LANG=C xterm`.
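The $'\303\244' in the bash error message is simply the two UTF-8 bytes of "ä" written as octal escapes. A small sketch (bash_octal_escape is a made-up name, and it only imitates the escape notation, not bash itself):

```python
def bash_octal_escape(text, encoding="utf-8"):
    """Render each encoded byte as a backslash followed by three octal digits."""
    return "".join("\\{:03o}".format(byte) for byte in text.encode(encoding))


# U+00E4 is the character typed in the session quoted above.
print(bash_octal_escape("\u00e4"))
```

The shell passed the raw bytes through unchanged and only escaped them for display, which is why no encoding error appears.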
[issue19846] Setting LANG=C breaks Python 3 on Linux
Marc-Andre Lemburg added the comment:

The C locale is part of the ANSI C standard. The POSIX locale is an alias for the C locale and is defined by the POSIX standard, so we cannot just replace the ASCII encoding with UTF-8 as we wish; Antoine's patch won't work. See e.g. http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html

The C and POSIX locales are the only locale settings that are guaranteed to always exist in C libraries. Python 3 should work with such locale settings. It doesn't have to be able to output non-ASCII code points, but it should run with ASCII data.

AFAIK, Python 3 does work with ASCII data in the C locale, so I'm not sure whether this is a bug at all.
[issue19846] Setting LANG=C breaks Python 3 on Linux
STINNER Victor added the comment:

I didn't understand Serhiy's ls example. I tried:

$ mkdir unicode
$ cd unicode
$ python3 -c 'open("ab\xe9.txt", "w").close()'
$ python3 -c 'open("euro\u20ac.txt", "w").close()'
$ ls
abé.txt  euro€.txt
$ LANG=C ls
ab??.txt  euro???.txt

Ah yes, I didn't remember that ls is aware of the locale encoding.

printf() and wprintf() behave differently on unencodable/undecodable characters:
http://unicodebook.readthedocs.org/en/latest/programming_languages.html#printf-functions-family

Again, the issue is not specific to Python. So it's time to learn how to configure your locales correctly.

About the interoperability point I mentioned in my first message ("This encoding is the best choice for interoperability with other (python2 or non python) programs."): if you work around the annoying ASCII encoding by forcing the UTF-8 encoding, Python may produce data which is incompatible with other applications that follow POSIX and so use the ASCII encoding.
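The number of question marks in the ls output above matches the number of UTF-8 bytes in each non-ASCII character (two for é, three for €). The sketch below reproduces that count under the assumption that the names are stored as UTF-8 and that every byte outside ASCII is shown as one "?". It only imitates the visible result and is not how ls is implemented; c_locale_display is a made-up name.

```python
def c_locale_display(name, fs_encoding="utf-8"):
    """Show one '?' for every byte that is not ASCII (assumed display rule)."""
    raw = name.encode(fs_encoding)
    return "".join(chr(byte) if byte < 0x80 else "?" for byte in raw)


# The two file names created in the session above.
for filename in ("ab\xe9.txt", "euro\u20ac.txt"):
    print(c_locale_display(filename))
```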
[issue19846] Setting LANG=C breaks Python 3 on Linux
STINNER Victor added the comment:

Nick wrote:
> testing applications for POSIX compliance

Sorry, but what do you mean by "POSIX compliance"? The POSIX standard only specifies the ASCII encoding.

http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html

"The tables in Locale Definition describe the characteristics and behavior of the POSIX locale for data consisting entirely of characters from the portable character set and the control character set. For other characters, the behavior is unspecified. For C-language programs, the POSIX locale shall be the default locale when the setlocale() function is not called."

http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap06.html#tagtcjh_3

Portable character set = ASCII
[issue19846] Setting LANG=C breaks Python 3 on Linux
STINNER Victor added the comment:

Marc-Andre wrote:
> AFAIK, Python 3 does work with ASCII data in the C locale, so I'm not sure whether this is a bug at all.

What do you mean? Python has used the surrogateescape error handler since Python 3.1: undecodable bytes are stored as surrogate characters. Many bugs related to locales were fixed in Python 3.2, 3.3 and 3.4. Are there remaining bugs?
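As a minimal illustration of the surrogateescape behaviour mentioned above: each undecodable byte 0xNN becomes the lone surrogate U+DCNN, and encoding with the same handler restores the original bytes. The variable names are only for this sketch.

```python
raw = b"\xc3\xa4"  # UTF-8 bytes of U+00E4, undecodable as ASCII

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
text = raw.decode("ascii", "surrogateescape")

# Encoding with the same handler gives back exactly the original bytes.
restored = text.encode("ascii", "surrogateescape")

# A strict UTF-8 encode refuses lone surrogates ("surrogates not allowed").
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(ascii(text), restored, strict_ok)
```

The round trip only works while both directions use the same handler. Handing such a string to a strict encoder fails.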
[issue19846] Setting LANG=C breaks Python 3 on Linux
Marc-Andre Lemburg added the comment:

On 09.12.2013 11:19, STINNER Victor wrote:
> Marc-Andre wrote:
>> AFAIK, Python 3 does work with ASCII data in the C locale, so I'm not sure whether this is a bug at all.
>
> What do you mean? Python has used the surrogateescape error handler since Python 3.1: undecodable bytes are stored as surrogate characters. Many bugs related to locales were fixed in Python 3.2, 3.3 and 3.4. Are there remaining bugs?

I was referring to the original bug report on this ticket.

FWIW: I don't think you can expect Python to work without exceptions if you use a C locale and write non-ASCII data to stdout.

I also don't think that the new ticket title is correct - or at least, I fail to see which aspect of Python breaks with LANG=C :-)
[issue19846] Setting LANG=C breaks Python 3
Changes by Nick Coghlan ncogh...@gmail.com:

--
title: print() and write() are relying on sys.getfilesystemencoding() instead of sys.getdefaultencoding() -> Setting LANG=C breaks Python 3
[issue19846] Setting LANG=C breaks Python 3
Changes by STINNER Victor victor.stin...@gmail.com:

--
title: print() and write() are relying on sys.getfilesystemencoding() instead of sys.getdefaultencoding() -> Setting LANG=C breaks Python 3
[issue19846] Setting LANG=C breaks Python 3
STINNER Victor added the comment:

> Or said differently, the filesystem encoding is different than the locale encoding.
>
> Indeed, but the FS encoding and the IO encoding are the same. locale encoding doesn't really matter here, as we are assuming that it's wrong.

Oh, I realized that the term "FS encoding" is not clear. When I wrote "FS encoding", I meant sys.getfilesystemencoding(), which is mbcs on Windows, UTF-8 on Mac OS X and (currently) the locale encoding on other platforms (UNIX, e.g. Linux/FreeBSD/Solaris/AIX).

--

IMO there are two different points in this issue:

(a) which encoding should be used when the C locale is used: the encoding announced by the OS via nl_langinfo(CODESET) (the current choice), or an arbitrary, optimistic UTF-8 encoding?

(b) for technical reasons, Python reuses the C codec during Python initialization to decode and encode OS data, and so currently Python *must* use the locale encoding for its filesystem encoding.

Before I can give my opinion on point (a), I would like to see a patch fixing point (b).

I'm not against fixing point (b). I'm just saying that it's not trivial, and obviously it must be fixed before the status of point (a) can change. I even gave clues on how to fix point (b).

--

asciilocale.patch has many issues. Try running the Python test suite with this patch applied to see what I mean.

Example of failures:

======================================================================
FAIL: test_non_ascii (test.test_cmd_line.CmdLineTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/haypo/prog/python/default/Lib/test/test_cmd_line.py", line 140, in test_non_ascii
    assert_python_ok('-c', command)
  File "/home/haypo/prog/python/default/Lib/test/script_helper.py", line 69, in assert_python_ok
    return _assert_python(True, *args, **env_vars)
  File "/home/haypo/prog/python/default/Lib/test/script_helper.py", line 55, in _assert_python
    "stderr follows:\n%s" % (rc, err.decode('ascii', 'ignore')))
AssertionError: Process return code is 1, stderr follows:
Unable to decode the command from the command line:
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 12: surrogates not allowed

======================================================================
FAIL: test_ioencoding_nonascii (test.test_sys.SysModuleTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/haypo/prog/python/default/Lib/test/test_sys.py", line 603, in test_ioencoding_nonascii
    self.assertEqual(out, os.fsencode(test.support.FS_NONASCII))
AssertionError: b'' != b'\xc3\xa6'

======================================================================
FAIL: test_nonascii (test.test_warnings.CEnvironmentVariableTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/haypo/prog/python/default/Lib/test/test_warnings.py", line 774, in test_nonascii
    "['ignore:Deprecaci\xf3nWarning']".encode('utf-8'))
AssertionError: b"['ignore:Deprecaci\\udcc3\\udcb3nWarning']" != b"['ignore:Deprecaci\xc3\xb3nWarning']"

======================================================================
FAIL: test_nonascii (test.test_warnings.PyEnvironmentVariableTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/haypo/prog/python/default/Lib/test/test_warnings.py", line 774, in test_nonascii
    "['ignore:Deprecaci\xf3nWarning']".encode('utf-8'))
AssertionError: b"['ignore:Deprecaci\\udcc3\\udcb3nWarning']" != b"['ignore:Deprecaci\xc3\xb3nWarning']"

The test_warnings failure is probably #9988; the test_cmd_line failure is maybe #9992.

There may be other issues: the Python test suite only has a few tests for non-ASCII characters.

--

If anything is changed, I would prefer to have more than a few months of testing to make sure that it doesn't break anything. So I set the version field to Python 3.5.

--
versions: +Python 3.5 -Python 3.4
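For readers who want to see which encodings are in play for points (a) and (b) on their own system, the following sketch prints them. The encoding_report helper is made up for illustration. locale.nl_langinfo() is only available on some platforms, so it is guarded, and the values describe the locale as the running interpreter currently sees it.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings discussed in this issue for the running interpreter."""
    report = {
        "filesystem": sys.getfilesystemencoding(),
        "locale_preferred": locale.getpreferredencoding(False),
        "stdout": getattr(sys.stdout, "encoding", None),
    }
    # nl_langinfo() does not exist on every platform (for example Windows).
    if hasattr(locale, "nl_langinfo"):
        report["nl_langinfo(CODESET)"] = locale.nl_langinfo(locale.CODESET)
    return report


for name, value in encoding_report().items():
    print(name, "=", value)
```

Running it once normally and once with LANG=C (or LC_ALL=C) shows how the reported values depend on the environment. The exact names printed vary between Python versions and platforms.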
[issue19846] Setting LANG=C breaks Python 3
Antoine Pitrou added the comment:

On Sun., 2013-12-08 at 22:22, STINNER Victor wrote:
> (b) for technical reasons, Python reuses the C codec during Python initialization to decode and encode OS data, and so currently Python *must* use the locale encoding for its filesystem encoding

Ahhh! Well indeed that's a bummer :-)

> asciilocale.patch has many issues. Try to run the Python test suite using this patch to see what I mean.

I'm assuming much of this is due to (b) (all those tests seem to spawn external processes). It seems there is more work to do to get this right, but I'm not terribly interested either. Feel free to take over.
[issue19846] Setting LANG=C breaks Python 3
STINNER Victor added the comment:

> It seems there is more work to do to get this right, but I'm not terribly interested either. Feel free to take over.

If you are talking to me: I'm currently opposed to changing anything, so I'm not interested in working on a patch. IMO Python works fine and you should try to work around the current limitations :-)

If someone is interested in writing a huge patch fixing all these issues, I would be able to reconsider my opinion on point (a).
[issue19846] Setting LANG=C breaks Python 3 on Linux
Nick Coghlan added the comment:

End users tripping over this by setting LANG=C is one of the pain points of Python 3 relative to Python 2 for Fedora, so I've added a couple of Fedora folks to the nosy list.

My current understanding of the situation:

- we should leave Windows and Mac OS X alone, since they ignore the locale when choosing the OS API encoding anyway
- the main problem is on Linux (but potentially other *nix systems as well), where people set LANG=C for a variety of reasons, but this has the side effect of Python 3 choosing an inappropriate encoding (ASCII rather than UTF-8) when talking to the OS APIs.

Given the initialisation problems, this may be something that PEP 432 (the initialisation process rewrite) can help with, since it changes the initialisation order to create a more complete Python runtime before it starts to configure the OS interfaces.

Tangentially related, we may want to consider aliasing sys.getfilesystemencoding, os.fsencode and os.fsdecode as something like sys.getosapiencoding, os.apiencode and os.apidecode, since the current naming is misleading: the value is based on the platform and environment, not any particular filesystem, and is used for almost all bytes-based OS APIs, not just filesystem metadata.

--
nosy: +a.badger, bkabrda
title: Setting LANG=C breaks Python 3 -> Setting LANG=C breaks Python 3 on Linux
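The existing functions named in the last paragraph can be exercised as follows. This is only a sketch of the current API (the proposed aliases do not exist), and roundtrip is a made-up helper name.

```python
import os
import sys


def roundtrip(raw_name):
    """Decode an OS-level byte string to str and encode it back to bytes."""
    return os.fsencode(os.fsdecode(raw_name))


# The encoding used for these conversions depends on platform and environment.
print(sys.getfilesystemencoding())
print(roundtrip(b"plain.txt"))
```

On POSIX platforms these functions use the surrogateescape error handler, so byte strings that are not valid in the filesystem encoding are expected to survive the round trip as well. On Windows the behaviour differs.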
[issue19846] Setting LANG=C breaks Python 3 on Linux
STINNER Victor added the comment:

> End users tripping over this by setting LANG=C is one of the pain points of Python 3 relative to Python 2 for Fedora, so I've added a couple of Fedora folks to the nosy list.

Sorry, I'm not aware of any such issue. Do you have examples?

> - the main problem is on Linux (but potentially other *nix systems as well), where people set LANG=C for a variety of reasons, but this has the side effect of Python 3 choosing an inappropriate encoding (ASCII rather than UTF-8) when talking to the OS APIs.

Why do you think that the issue is specific to Python 3? Try opening a terminal with LC_ALL=C and typing non-ASCII characters with your keyboard. You can't, because your terminal uses ASCII.

Did you try applications written in another language that handles Unicode, like Perl? (Perl with Unicode support correctly enabled: that's "use utf8;", if I remember correctly.)

Can you explain the various reasons why users explicitly force the encoding to ASCII?

I use LANG=C to get manual pages and error messages in English. But "LANG=en_US man ls" would be more correct, or "LC_MESSAGES=en_US man ls" to be pedantic. (Env var priority: LC_ALL > LC_xxx > LANG.)

IMO if you use LANG=C, you must not complain that Unicode stopped working; you should learn how to configure locales instead. Trivial examples like the one in the initial message (msg204849) are wrong: why would you force all locales to C and then use non-ASCII characters?

> Given the initialisation problems, this may be something that PEP 432 (the initialisation process rewrite) can help with (since it changes the initialisation order to create a more complete Python runtime before it starts to configure the OS interfaces).

I don't see how it would help to solve my point (b).

Technically, this issue cannot be fixed. Or to be more specific, I don't want to fix it; it's a waste of time. So I don't understand what you expect from this open issue. I would prefer to close it as invalid or wontfix to be clear.

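POSIX resolves each locale category by checking LC_ALL first, then the category's own LC_xxx variable, then LANG. The sketch below mirrors that lookup for a dictionary of environment variables. effective_locale is a made-up helper, and the final fallback to "C" is an assumption (POSIX leaves the default implementation-defined).

```python
def effective_locale(category, env):
    """Return the locale name that applies to one LC_* category.

    Lookup order: LC_ALL, then the category variable, then LANG.
    """
    for variable in ("LC_ALL", category, "LANG"):
        value = env.get(variable)
        if value:  # unset and empty variables are skipped
            return value
    return "C"  # assumed fallback; the real default is implementation-defined


# LC_MESSAGES overrides LANG, so messages are English while LANG stays C.
print(effective_locale("LC_MESSAGES", {"LANG": "C", "LC_MESSAGES": "en_US"}))

# LC_ALL overrides everything else.
print(effective_locale("LC_MESSAGES", {"LC_ALL": "C", "LC_MESSAGES": "en_US"}))
```

This is why setting only LC_MESSAGES changes the language of messages without touching the character encoding, which is governed by LC_CTYPE.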
[issue19846] Setting LANG=C breaks Python 3 on Linux
Nick Coghlan added the comment:

On 9 December 2013 12:08, STINNER Victor rep...@bugs.python.org wrote:
>> End users tripping over this by setting LANG=C is one of the pain points of Python 3 relative to Python 2 for Fedora, so I've added a couple of Fedora folks to the nosy list.
>
> Sorry, I'm not aware of such issue. Do you have examples?

Armin's travails with remote shell access and Python 3 are just as likely today as they were a couple of years ago: http://lucumr.pocoo.org/2011/12/7/thoughts-on-python3/ (although technically that was a terminal ending up with the POSIX locale, rather than specifically LANG=C)

>> - the main problem is on Linux (but potentially other *nix systems as well), where people set LANG=C for a variety of reasons, but this has the side effect of Python 3 choosing an inappropriate encoding (ASCII rather than UTF-8) when talking to the OS APIs.
>
> Why do you think that the issue is specific to Python 3? Try to open a terminal with LC_ALL=C and try to type non-ASCII characters with your keyboard. You can't because your terminal uses ASCII.
>
> Did you try applications written in another language handling Unicode, like Perl? (Perl with Unicode support correctly enabled: that's "use utf8;", if I remember correctly.)

It's the fact this used to work transparently in Python 2 (since all these interfaces were just bytes based on the Python side as well) that's a problem. That makes the new sensitivity to the locale encoding a usability regression, and that's a concern for distros that are considering switching their default Python version.

> Can you explain the various reasons why users explicitly force the encoding to ASCII?

- testing applications for POSIX compliance
- default settings on servers where you don't control the environment
- because they never previously had to care, and it's only Python 3 deciding to pay attention to it which makes it relevant for them

> I use LANG=C to get manual pages and error messages in English. But "LANG=en_US man ls" would be more correct, or "LC_MESSAGES=en_US man ls" to be pedantic. (Env var priority: LC_ALL > LC_xxx > LANG.)
>
> IMO if you use LANG=C, you must not complain that Unicode stopped working, but you should learn how to configure locales. Trivial examples like the one which can be found in the initial message (msg204849) are wrong: why would you force all locales to C and use non-ASCII characters?

And yet, in Python 2, people could do that, and Python didn't care. *That's* the regression I'm worried about. If it hadn't round-tripped cleanly in Python 2, I wouldn't care here either.

>> Given the initialisation problems, this may be something that PEP 432 (the initialisation process rewrite) can help with (since it changes the initialisation order to create a more complete Python runtime before it starts to configure the OS interfaces).
>
> I don't see how it would help to solve my point (b).

Having a Python runtime available makes things that are currently tediously painful to deal with during startup easier to tweak. I'm not sure it *will* help in this particular case, but it's now one I'm going to keep an eye on.

> Technically, this issue cannot be fixed. Or to be more specific, I don't want to fix it, it's a waste of time. So I don't understand what do you expect from this open issue?

A way to get Python 3 to cope as well with a misconfigured OS environment as Python 2 did.

> I would prefer to close it as invalid or wontfix to be clear.

It's a usability regression from Python 2, so I don't want to give up on it. It may be that we just implement an "ignore what the OS claims, it's misconfigured, just use UTF-8 for everything" flag. But OS configuration errors shouldn't cripple the Python runtime.
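As a rough application-level illustration of "just use UTF-8" for a single output stream, the sketch below wraps a binary stream in a text layer with a fixed encoding. This is not the interpreter flag discussed above and does not touch the filesystem encoding. utf8_writer is a made-up helper, and the BytesIO object merely stands in for something like sys.stdout.buffer.

```python
import io


def utf8_writer(binary_stream):
    """Wrap a binary stream in a text layer that always encodes as UTF-8."""
    return io.TextIOWrapper(
        binary_stream,
        encoding="utf-8",
        errors="strict",
        newline="\n",        # keep newlines untranslated on every platform
        write_through=True,  # pass writes straight to the binary stream
    )


buffer = io.BytesIO()  # stand-in for sys.stdout.buffer in this sketch
writer = utf8_writer(buffer)
writer.write("\u20ac\n")
writer.flush()
print(buffer.getvalue())
```

The output bytes are the same regardless of LANG, which is exactly the interoperability trade-off debated in this thread: consumers that expect the locale encoding may not agree with the choice.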
[issue19846] Setting LANG=C breaks Python 3 on Linux
Sworddragon added the comment:

You should keep things more simple:

- Python and the operating system/filesystem are in a client-server relationship, and Python should validate everything.
- It doesn't matter what you finally decide the default encoding to be in the various places: every choice will lead to race conditions, without exception.
- The easiest way to fix this is to give the developer the ability to make the decision (like sys.use_strict_encoding(), sys.setfilesystemencoding(), sys.setdefaultencoding() etc.).
  * For example, giving the developer control is especially needed if they want to handle multiple different filesystems.

> Why do you think that the issue is specific to Python 3? Try to open a terminal with LC_ALL=C and try to type non-ASCII characters with your keyboard. You can't because your terminal uses ASCII.

sworddragon@ubuntu:~$ LANG=C
sworddragon@ubuntu:~$ ä
bash: $'\303\244': command not found

- The terminal doesn't pseudo-crash with an exception because it doesn't care about encodings.
- It allows changing the encoding at runtime.

> Did you applications written in another language handling Unicode, like Perl?

Compare C: like the terminal, it wouldn't care. For example, fopen() will simply return NULL if it can't open the file 'ä' because the filesystem is encoded with ISO-8859-1 and we tried to open the UTF-8 counterpart.

> Can you explain the various reasons why users explictly force the encoding to ASCII?

For example, I use this in test cases as an uncomplicated way to set the language to English.
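The closest Python analogue to the fopen() example is to pass file names as bytes, in which case no text encoding step is involved and a mismatch simply shows up as a missing file. The sketch assumes a POSIX filesystem that accepts arbitrary byte sequences in names (as typical Linux filesystems do). try_open and the variable names are made up for illustration.

```python
import os
import tempfile


def try_open(directory, raw_name):
    """Open a file by its raw byte name; return True if it could be opened."""
    path = os.path.join(os.fsencode(directory), raw_name)
    try:
        with open(path, "rb"):
            return True
    except FileNotFoundError:
        return False


with tempfile.TemporaryDirectory() as tmp:
    latin1_name = "\u00e4".encode("iso-8859-1")  # name as stored on disk
    utf8_name = "\u00e4".encode("utf-8")         # name the caller asks for

    # Create the file under its ISO-8859-1 byte name.
    with open(os.path.join(os.fsencode(tmp), latin1_name), "wb"):
        pass

    found_latin1 = try_open(tmp, latin1_name)
    found_utf8 = try_open(tmp, utf8_name)

print(found_latin1, found_utf8)
```

As with fopen() returning NULL, the mismatch is reported as an ordinary "file not found" condition rather than as an encoding error.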