[issue19846] Setting LANG=C breaks Python 3 on Linux

2013-12-09 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

> And yet, in Python 2, people could do that, and Python didn't care.
> *That's* the regression I'm worried about. If it hadn't round-tripped
> cleanly in Python 2, I wouldn't care here either.

$ python2.7 -c "print u'\u20ac'"
€
$ LANG=C python2.7 -c "print u'\u20ac'"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)

And even worse:

$ python2.7 -c "print u'\u20ac'" >/dev/null
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)

What a wart!

Other programs can produce wrong (or even absolutely senseless) output with the
C locale.

$ LANG=C ls
   
 ??
?? ??  
?? 
??  
??
?? ??   


Which is better: silently producing corrupted output, or raising an exception?
If the former, then let's just set the replace or backslashreplace error
handler for sys.stdout by default.
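
To make the trade-off concrete, here is a minimal sketch of what the three error handlers do when text has to go through an ASCII codec. The helper name encode_for_ascii_stream is only for illustration; it is not something sys.stdout exposes.

```python
def encode_for_ascii_stream(text, errors):
    """Encode text the way an ASCII-only stream would, with the given handler."""
    return text.encode("ascii", errors=errors)


text = "price: \u20ac5"

# "strict" is what raises the UnicodeEncodeError shown above.
try:
    encode_for_ascii_stream(text, "strict")
except UnicodeEncodeError as exc:
    print("strict failed:", exc.reason)

# "replace" silently loses the character; "backslashreplace" keeps it readable.
print(encode_for_ascii_stream(text, "replace"))           # b'price: ?5'
print(encode_for_ascii_stream(text, "backslashreplace"))  # b'price: \\u20ac5'
```

With backslashreplace the output is at least lossless, which is why it is probably the less harmful of the two non-strict handlers.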

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue19846
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue19846] Setting LANG=C breaks Python 3 on Linux

2013-12-09 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

> sworddragon@ubuntu:~$ LANG=C
> sworddragon@ubuntu:~$ ä
> bash: $'\303\244': command not found
>
> - The terminal doesn't pseudo-crash with an exception because it doesn't
>   care about encodings.
> - It allows changing the encoding at runtime.

That is not the locale of your terminal. Try `LANG=C xterm`.




[issue19846] Setting LANG=C breaks Python 3 on Linux

2013-12-09 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

The C locale is part of the ANSI C standard. The POSIX locale is an alias 
for the C locale and a POSIX standard, so we cannot just replace the ASCII 
encoding with UTF-8 as we wish, so Antoine's patch won't work.

See e.g. http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html

The C and POSIX locale settings are the only locale settings that are 
guaranteed to always exist in C libraries. Python 3 should work with such 
locale settings. It doesn't have to be able to output non-ASCII code points, 
but it should run with ASCII data.

AFAIK, Python 3 does work with ASCII data in the C locale, so I'm not sure 
whether this is a bug at all.




[issue19846] Setting LANG=C breaks Python 3 on Linux

2013-12-09 Thread STINNER Victor

STINNER Victor added the comment:

I didn't understand Serhiy's ls example. I tried:

$ mkdir unicode
$ cd unicode
$ python3 -c 'open("ab\xe9.txt", "w").close()'
$ python3 -c 'open("euro\u20ac.txt", "w").close()'
$ ls
abé.txt  euro€.txt
$ LANG=C ls
ab??.txt  euro???.txt


Ah yes, I didn't remember that ls is aware of the locale encoding.

printf() and wprintf() behave differently on unencodable/undecodable characters:
http://unicodebook.readthedocs.org/en/latest/programming_languages.html#printf-functions-family

Again, the issue is not specific to Python. So it's time to learn how to
configure your locales correctly.

About the interoperability point I mentioned in my first message ("This
encoding is the best choice for interoperability with other (python2 or non
python) programs."): if you work around the annoying ASCII encoding by forcing
UTF-8 encoding, Python may produce data which would be incompatible with other
applications following POSIX and so using the ASCII encoding.




[issue19846] Setting LANG=C breaks Python 3 on Linux

2013-12-09 Thread STINNER Victor

STINNER Victor added the comment:

Nick: "testing applications for POSIX compliance"

Sorry, but what do you mean by POSIX compliance? The POSIX standard only
specifies the ASCII encoding.

http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html
"The tables in Locale Definition describe the characteristics and behavior of
the POSIX locale for data consisting entirely of characters from the portable
character set and the control character set. For other characters, the behavior
is unspecified. For C-language programs, the POSIX locale shall be the default
locale when the setlocale() function is not called."

http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap06.html#tagtcjh_3
Portable character set = ASCII




[issue19846] Setting LANG=C breaks Python 3 on Linux

2013-12-09 Thread STINNER Victor

STINNER Victor added the comment:

Marc-Andre: "AFAIK, Python 3 does work with ASCII data in the C locale, so I'm
not sure whether this is a bug at all."

What do you mean? Python has used the surrogateescape error handler since
Python 3.1: undecodable bytes are stored as surrogate characters.
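
As a small illustration of that behaviour (a sketch using the plain codec API, not the exact code path Python takes at startup), bytes that are not valid ASCII survive a decode/encode round trip through surrogateescape:

```python
raw = b"ab\xe9.txt"  # a Latin-1 encoded name; 0xE9 is not valid ASCII

# The undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
decoded = raw.decode("ascii", errors="surrogateescape")
print(ascii(decoded))  # 'ab\udce9.txt'

# Encoding with the same handler restores the original bytes.
restored = decoded.encode("ascii", errors="surrogateescape")
print(restored == raw)  # True

# A strict encoder refuses lone surrogates.
try:
    decoded.encode("utf-8")
except UnicodeEncodeError as exc:
    print("strict utf-8 failed:", exc.reason)  # surrogates not allowed
```

The last step is the same "surrogates not allowed" error that shows up when one side uses surrogateescape and the other side does not.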

Many bugs related to locales were fixed in Python 3.2, 3.3 and 3.4.

Are there remaining bugs?




[issue19846] Setting LANG=C breaks Python 3 on Linux

2013-12-09 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 09.12.2013 11:19, STINNER Victor wrote:
 
> STINNER Victor added the comment:
>
> Marc-Andre: "AFAIK, Python 3 does work with ASCII data in the C locale, so I'm
> not sure whether this is a bug at all."
>
> What do you mean? Python has used the surrogateescape error handler since
> Python 3.1: undecodable bytes are stored as surrogate characters.
>
> Many bugs related to locales were fixed in Python 3.2, 3.3 and 3.4.
>
> Are there remaining bugs?

I was referring to the original bug report on this ticket.

FWIW: I don't think you can expect Python to work without exceptions
if you use a C locale and write non-ASCII data to stdout. I also
don't think that the new ticket title is correct - or at least,
I fail to see which aspect of Python breaks with LANG=C :-)




[issue19846] Setting LANG=C breaks Python 3

2013-12-08 Thread Nick Coghlan

Changes by Nick Coghlan <ncogh...@gmail.com>:


--
title: print() and write() are relying on sys.getfilesystemencoding() instead
of sys.getdefaultencoding() -> Setting LANG=C breaks Python 3




[issue19846] Setting LANG=C breaks Python 3

2013-12-08 Thread STINNER Victor

Changes by STINNER Victor <victor.stin...@gmail.com>:


--
title: print() and write() are relying on sys.getfilesystemencoding() instead
of sys.getdefaultencoding() -> Setting LANG=C breaks Python 3




[issue19846] Setting LANG=C breaks Python 3

2013-12-08 Thread STINNER Victor

STINNER Victor added the comment:

>> Or said differently, the filesystem encoding is different than the
>> locale encoding.
>
> Indeed, but the FS encoding and the IO encoding are the same.
> locale encoding doesn't really matter here, as we are assuming that
> it's wrong.

Oh, I realize that the term "FS encoding" is not clear. When I wrote "FS
encoding", I meant sys.getfilesystemencoding(), which is mbcs on Windows, UTF-8
on Mac OS X and (currently) the locale encoding on other platforms (UNIX, e.g.
Linux/FreeBSD/Solaris/AIX).
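
To see which of these encodings is in play on a given system, something like the following sketch can be run under different LANG/LC_ALL settings. The function name encoding_report is made up for this example, and the values it prints depend on the platform and Python version.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings discussed in this issue into one dict."""
    return {
        # Used for file names, command line arguments, environment variables.
        "filesystem": sys.getfilesystemencoding(),
        # What the C library announces for the current locale.
        "locale": locale.getpreferredencoding(False),
        # Encoding of the standard output text stream (may be None).
        "stdout": getattr(sys.stdout, "encoding", None),
        # Default for str.encode()/bytes.decode(); always 'utf-8' on Python 3.
        "default": sys.getdefaultencoding(),
    }


for name, value in encoding_report().items():
    print(f"{name:10} {value}")
```

Comparing the output of `python3 script.py` and `LANG=C python3 script.py` shows which values follow the locale on your platform.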

--

IMO there are two different points in this issue:

(a) which encoding should be used when the C locale is used: the encoding
announced by the OS using nl_langinfo(CODESET) (current choice), or an
arbitrary, optimistic UTF-8 encoding?

(b) for technical reasons, Python reuses the C codec during Python
initialization to decode and encode OS data, and so currently Python *must* use
the locale encoding for its filesystem encoding

Before I can give an opinion on point (a), I would like to see a patch fixing
point (b). I'm not against fixing point (b). I'm just saying that it's not
trivial, and obviously it must be fixed before the status of point (a) can
change. I even gave clues on how to fix point (b).

--

asciilocale.patch has many issues. Try to run the Python test suite using this 
patch to see what I mean. Example of failures:

======================================================================
FAIL: test_non_ascii (test.test_cmd_line.CmdLineTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/haypo/prog/python/default/Lib/test/test_cmd_line.py", line 140, in test_non_ascii
    assert_python_ok('-c', command)
  File "/home/haypo/prog/python/default/Lib/test/script_helper.py", line 69, in assert_python_ok
    return _assert_python(True, *args, **env_vars)
  File "/home/haypo/prog/python/default/Lib/test/script_helper.py", line 55, in _assert_python
    "stderr follows:\n%s" % (rc, err.decode('ascii', 'ignore')))
AssertionError: Process return code is 1, stderr follows:
Unable to decode the command from the command line:
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 12: surrogates not allowed

======================================================================
FAIL: test_ioencoding_nonascii (test.test_sys.SysModuleTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/haypo/prog/python/default/Lib/test/test_sys.py", line 603, in test_ioencoding_nonascii
    self.assertEqual(out, os.fsencode(test.support.FS_NONASCII))
AssertionError: b'' != b'\xc3\xa6'

======================================================================
FAIL: test_nonascii (test.test_warnings.CEnvironmentVariableTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/haypo/prog/python/default/Lib/test/test_warnings.py", line 774, in test_nonascii
    "['ignore:Deprecaci\xf3nWarning']".encode('utf-8'))
AssertionError: b"['ignore:Deprecaci\\udcc3\\udcb3nWarning']" != b"['ignore:Deprecaci\xc3\xb3nWarning']"

======================================================================
FAIL: test_nonascii (test.test_warnings.PyEnvironmentVariableTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/haypo/prog/python/default/Lib/test/test_warnings.py", line 774, in test_nonascii
    "['ignore:Deprecaci\xf3nWarning']".encode('utf-8'))
AssertionError: b"['ignore:Deprecaci\\udcc3\\udcb3nWarning']" != b"['ignore:Deprecaci\xc3\xb3nWarning']"


test_warnings is probably #9988, test_cmd_line failure is maybe #9992.

There may be other issues; the Python test suite has only a few tests for
non-ASCII characters.

--

If anything is changed, I would prefer to have more than a few months of
testing to make sure that it doesn't break anything. So I set the version field
to Python 3.5.

--
versions: +Python 3.5 -Python 3.4




[issue19846] Setting LANG=C breaks Python 3

2013-12-08 Thread Antoine Pitrou

Antoine Pitrou added the comment:

On Sun., 2013-12-08 at 22:22 +, STINNER Victor wrote:
> (b) for technical reasons, Python reuses the C codec during Python
> initialization to decode and encode OS data, and so currently Python
> *must* use the locale encoding for its filesystem encoding

Ahhh! Well indeed that's a bummer :-)

> asciilocale.patch has many issues. Try to run the Python test suite
> using this patch to see what I mean.

I'm assuming much of this is due to (b) (all those tests seem to spawn
external processes).

It seems there is more work to do to get this right, but I'm not
terribly interested either. Feel free to take over.




[issue19846] Setting LANG=C breaks Python 3

2013-12-08 Thread STINNER Victor

STINNER Victor added the comment:

> It seems there is more work to do to get this right, but I'm not
> terribly interested either. Feel free to take over.

If you are talking to me: I'm currently opposed to changing anything, so I'm
not interested in working on a patch. IMO Python works fine and you should try
to work around the current limitations :-)

If someone is interested in writing a huge patch fixing all these issues, I
would be able to reconsider my opinion on point (a).




[issue19846] Setting LANG=C breaks Python 3 on Linux

2013-12-08 Thread Nick Coghlan

Nick Coghlan added the comment:

End users tripping over this by setting LANG=C is one of the pain points of 
Python 3 relative to Python 2 for Fedora, so I've added a couple of Fedora 
folks to the nosy list.

My current understanding of the situation:

- we should leave Windows and Mac OS X alone, since they ignore the locale when 
choosing the OS API encoding anyway

- the main problem is on Linux (but potentially other *nix systems as well), 
where people set LANG=C for a variety of reasons, but this has the side 
effect of Python 3 choosing an inappropriate encoding (ASCII rather than UTF-8) 
when talking to the OS APIs.

Given the initialisation problems, this may be something that PEP 432 (the 
initialisation process rewrite) can help with (since it changes the 
initialisation order to create a more complete Python runtime before it starts 
to configure the OS interfaces).

Tangentially related, we may want to consider aliasing 
sys.getfilesystemencoding, os.fsencode and os.fsdecode as something like 
sys.getosapiencoding, os.apiencode and os.apidecode, since the current naming 
is misleading (the value is based on the platform and environment, not any 
particular filesystem, and is used for almost all bytes-based OS APIs, not just 
filesystem metadata)
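
For readers unfamiliar with the existing functions, here is a minimal sketch of what os.fsdecode and os.fsencode do with a byte string that is not valid in the OS API encoding. The helper name roundtrip is only for this example, and the behaviour described assumes a POSIX system, where these functions use the surrogateescape error handler.

```python
import os


def roundtrip(name: bytes) -> bytes:
    """Decode an OS-level byte string to str and encode it back."""
    return os.fsencode(os.fsdecode(name))


# A Latin-1 encoded file name: not valid ASCII, and not valid UTF-8 either.
raw = b"caf\xe9.txt"

as_text = os.fsdecode(raw)
print(ascii(as_text))         # the exact str depends on the active encoding
print(roundtrip(raw) == raw)  # the bytes survive the round trip on POSIX
```

Nothing here is specific to file names, which is the point about the naming being misleading: the same pair is what you would use for environment variables or command line arguments received as bytes.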

--
nosy: +a.badger, bkabrda
title: Setting LANG=C breaks Python 3 -> Setting LANG=C breaks Python 3 on Linux




[issue19846] Setting LANG=C breaks Python 3 on Linux

2013-12-08 Thread STINNER Victor

STINNER Victor added the comment:

> End users tripping over this by setting LANG=C is one of the pain points of
> Python 3 relative to Python 2 for Fedora, so I've added a couple of Fedora
> folks to the nosy list.

Sorry, I'm not aware of such an issue. Do you have examples?

> - the main problem is on Linux (but potentially other *nix systems as well),
> where people set LANG=C for a variety of reasons, but this has the side
> effect of Python 3 choosing an inappropriate encoding (ASCII rather than
> UTF-8) when talking to the OS APIs.

Why do you think that the issue is specific to Python 3? Try to open a
terminal with LC_ALL=C and try to type non-ASCII characters with your
keyboard. You can't because your terminal uses ASCII. Did you try
applications written in another language handling Unicode, like Perl?
(Perl with Unicode support correctly enabled: it's "use utf8;" if I
remember correctly.)

Can you explain the various reasons why users explicitly force the
encoding to ASCII?

I use LANG=C to get manual pages and error messages in English. But
"LANG=en_US man ls" would be more correct, or "LC_MESSAGES=en_US man
ls" to be pedantic. (Env var priority: LC_ALL > LC_xxx > LANG.)
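
The lookup order POSIX specifies for a single locale category can be sketched as follows: LC_ALL wins, then the category's own variable, then LANG, and empty values count as unset. The function effective_locale is hypothetical and only models the environment lookup; it does not call setlocale() and ignores the GNU-specific LANGUAGE variable.

```python
def effective_locale(category, environ):
    """Return the locale name a category such as 'LC_MESSAGES' resolves to.

    Assumes the POSIX precedence LC_ALL > LC_xxx > LANG and falls back to
    "C" when none of the variables is set to a non-empty value.
    """
    for variable in ("LC_ALL", category, "LANG"):
        value = environ.get(variable)
        if value:  # unset and empty are treated the same way
            return value
    return "C"


# LANG=C with English messages only:
print(effective_locale("LC_MESSAGES", {"LANG": "C", "LC_MESSAGES": "en_US"}))
# LC_ALL overrides everything else:
print(effective_locale("LC_MESSAGES", {"LC_ALL": "C", "LC_MESSAGES": "en_US"}))
```

This is why setting only LC_MESSAGES is the narrowest way to get English messages without touching the character encoding chosen through LC_CTYPE.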

IMO if you use LANG=C, you must not complain that Unicode stopped
working, but you should learn how to configure locales. Trivial
examples like the one which can be found in the initial message
(msg204849) are wrong: why would you force all locales to C and use
non-ASCII characters?

> Given the initialisation problems, this may be something that PEP 432 (the
> initialisation process rewrite) can help with (since it changes the
> initialisation order to create a more complete Python runtime before it
> starts to configure the OS interfaces).

I don't see how it would help to solve my point (b).

Technically, this issue cannot be fixed. Or to be more specific, I
don't want to fix it, it's a waste of time. So I don't understand what
you expect from this open issue.

I would prefer to close it as invalid or wontfix to be clear.




[issue19846] Setting LANG=C breaks Python 3 on Linux

2013-12-08 Thread Nick Coghlan

Nick Coghlan added the comment:

On 9 December 2013 12:08, STINNER Victor <rep...@bugs.python.org> wrote:
>
> STINNER Victor added the comment:
>
>> End users tripping over this by setting LANG=C is one of the pain points of
>> Python 3 relative to Python 2 for Fedora, so I've added a couple of Fedora
>> folks to the nosy list.
>
> Sorry, I'm not aware of such an issue. Do you have examples?

Armin's travails with remote shell access and Python 3 are just as
likely today as they were a couple of years ago:
http://lucumr.pocoo.org/2011/12/7/thoughts-on-python3/

(although technically that was a terminal ending up with the POSIX
locale, rather than specifically LANG=C)

>> - the main problem is on Linux (but potentially other *nix systems as well),
>> where people set LANG=C for a variety of reasons, but this has the side
>> effect of Python 3 choosing an inappropriate encoding (ASCII rather than
>> UTF-8) when talking to the OS APIs.
>
> Why do you think that the issue is specific to Python 3? Try to open a
> terminal with LC_ALL=C and try to type non-ASCII characters with your
> keyboard. You can't because your terminal uses ASCII. Did you try
> applications written in another language handling Unicode, like Perl?
> (Perl with Unicode support correctly enabled: it's "use utf8;" if I
> remember correctly.)

It's the fact this used to work transparently in Python 2 (since all
these interfaces were just bytes based on the Python side as well)
that's a problem. That makes the new sensitivity to the locale
encoding a usability regression, and that's a concern for distros that
are considering switching their default Python version.

> Can you explain the various reasons why users explicitly force the
> encoding to ASCII?

- testing applications for POSIX compliance
- default settings on servers where you don't control the environment
- because they never previously had to care, and it's only Python 3
deciding to pay attention to it which makes it relevant for them

> I use LANG=C to get manual pages and error messages in English. But
> "LANG=en_US man ls" would be more correct, or "LC_MESSAGES=en_US man
> ls" to be pedantic. (Env var priority: LC_ALL > LC_xxx > LANG.)
>
> IMO if you use LANG=C, you must not complain that Unicode stopped
> working, but you should learn how to configure locales. Trivial
> examples like the one which can be found in the initial message
> (msg204849) are wrong: why would you force all locales to C and use
> non-ASCII characters?

And yet, in Python 2, people could do that, and Python didn't care.
*That's* the regression I'm worried about. If it hadn't round-tripped
cleanly in Python 2, I wouldn't care here either.

>> Given the initialisation problems, this may be something that PEP 432 (the
>> initialisation process rewrite) can help with (since it changes the
>> initialisation order to create a more complete Python runtime before it
>> starts to configure the OS interfaces).
>
> I don't see how it would help to solve my point (b).

Having a Python runtime available makes things that are currently
tediously painful to deal with during startup easier to tweak. I'm not
sure it *will* help in this particular case, but it's now one I'm
going to keep an eye on.

> Technically, this issue cannot be fixed. Or to be more specific, I
> don't want to fix it, it's a waste of time. So I don't understand what
> you expect from this open issue.

A way to get Python 3 to cope as well with a misconfigured OS
environment as Python 2 did.

> I would prefer to close it as invalid or wontfix to be clear.

It's a usability regression from Python 2, so I don't want to give up
on it. It may be that we just implement a ignore what the OS claims,
it's misconfigured, just use UTF-8 for everything flag. But OS
configuration errors shouldn't cripple the Python runtime.
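
For the standard streams at least, part of this is already possible through the existing PYTHONIOENCODING environment variable. The sketch below (the helper name run_with_c_locale is just for illustration) starts a child interpreter under the C locale and shows that the override lets it print a non-ASCII character; it does nothing for file names or command line arguments, which is the harder part of this issue.

```python
import os
import subprocess
import sys


def run_with_c_locale(code, io_encoding=None):
    """Run `code` in a child Python with LANG=C and LC_ALL=C.

    If io_encoding is given, it is passed through PYTHONIOENCODING.
    Returns (return code, raw stdout bytes).
    """
    env = dict(os.environ, LANG="C", LC_ALL="C")
    if io_encoding is None:
        env.pop("PYTHONIOENCODING", None)
    else:
        env["PYTHONIOENCODING"] = io_encoding
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return result.returncode, result.stdout


# The child source is pure ASCII; the escape is interpreted by the child.
rc, out = run_with_c_locale(r"print('\u20ac')", io_encoding="utf-8")
print(rc, out.strip())  # 0 b'\xe2\x82\xac'
```

Calling the helper without io_encoding shows what the interpreter picks on its own under the C locale, which varies between Python versions.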




[issue19846] Setting LANG=C breaks Python 3 on Linux

2013-12-08 Thread Sworddragon

Sworddragon added the comment:

You should keep things simpler:

- Python and the operating system/filesystem are in a client-server
relationship and Python should validate everything.
- It doesn't matter what you finally decide to be the default encoding in
various places: every choice will produce race conditions, without exception.
- The easiest way to fix this is to give the developer the ability to make a
decision (like sys.use_strict_encoding(), sys.setfilesystemencoding(),
sys.setdefaultencoding() etc.).
* For example, giving the developer control is especially needed if they want
to handle multiple different filesystems.


> Why do you think that the issue is specific to Python 3? Try to open a
> terminal with LC_ALL=C and try to type non-ASCII characters with your
> keyboard. You can't because your terminal uses ASCII.

sworddragon@ubuntu:~$ LANG=C
sworddragon@ubuntu:~$ ä
bash: $'\303\244': command not found

- The terminal doesn't pseudo-crash with an exception because it doesn't
  care about encodings.
- It allows changing the encoding at runtime.


> Did you try
> applications written in another language handling Unicode, like Perl?

Compare C: like the terminal, it wouldn't care. For example, fopen will simply
return NULL if it can't open the file 'ä' because the filesystem is encoded
with ISO-8859-1 and we wanted to open the UTF-8 counterpart.


> Can you explain the various reasons why users explicitly force the
> encoding to ASCII?

For example, I use this in test cases as an uncomplicated way to set the
language to English.
