Re: Yet another unicode WTF
Gabriel Genellina <gagsl-...@yahoo.com.ar> writes:

> Python knows the terminal encoding (or at least can make a good
> guess), but a file may use *any* encoding you want, completely
> unrelated to your terminal settings.

It may, yes, and the programmer is free to specify any encoding.

> So when stdout is redirected, Python refuses to guess its encoding;

But Python doesn't have to guess; the terminal encoding is as specified in either case, no?

> see the PYTHONIOENCODING environment variable.

For the standard streams, the specified terminal encoding available to every program makes the most sense; certainly more sense than a Python-specific variable, or the “default to ASCII” of the current behaviour.

-- 
Ben Finney
-- http://mail.python.org/mailman/listinfo/python-list
Re: Yet another unicode WTF
In article <8763fbmk5a@benfinney.id.au>, Ben Finney <ben+pyt...@benfinney.id.au> wrote:

> Ned Deily <n...@acm.org> writes:
> > $ python2.6 -c 'import sys; print sys.stdout.encoding, \
> >   sys.stdout.isatty()'
> > UTF-8 True
> > $ python2.6 -c 'import sys; print sys.stdout.encoding, \
> >   sys.stdout.isatty()' > foo ; cat foo
> > None False
>
> So shouldn't the second case also detect UTF-8? The filesystem knows
> it's UTF-8, the shell knows it too. Why doesn't Python know it?

The filesystem knows what is UTF-8? While the setting of the locale environment variables may influence how the file system interprets the *name* of a file, it has no direct influence on what the *contents* of a file are or are supposed to be. Remember that in python 2.x, a file is just a sequence of bytes. If you want to write encoded Unicode to the file, you need to use something like codecs.open to wrap the file object with the proper streamwriter encoder.

What confuses matters in 2.x is the print statement's under-the-covers implicit Unicode encoding for files connected to a terminal:

http://bugs.python.org/issue612627
http://bugs.python.org/issue4947
http://wiki.python.org/moin/PrintFails

    >>> x = u'\u0430\u0431\u0432'
    >>> print x
    [nice looking characters here]
    >>> sys.stdout.write(x)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
    >>> sys.stdout.encoding
    'UTF-8'

In python 3.x, of course, the encoding happens automatically, but you still have to tell python, via the encoding argument to open, what the encoding of the file's content is (or accept python's default, which may not be very useful):

    >>> open('foo1','w').encoding
    'mac-roman'

WTF, indeed.

-- 
Ned Deily, n...@acm.org
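The advice about passing an explicit encoding to open can be sketched roughly as below. This is a minimal illustration, assuming Python 3 (on 2.6 and later, io.open accepts the same encoding argument); the helper names write_text and read_bytes and the scratch file name foo1.txt are hypothetical and do not come from the thread.

```python
import os
import tempfile


def write_text(path, text, encoding="utf-8"):
    """Write text to path with an explicit encoding instead of the platform default."""
    with open(path, "w", encoding=encoding) as handle:
        handle.write(text)


def read_bytes(path):
    """Return the raw bytes stored on disk, without any decoding."""
    with open(path, "rb") as handle:
        return handle.read()


# Hypothetical scratch file inside a fresh temporary directory.
scratch_dir = tempfile.mkdtemp()
scratch_path = os.path.join(scratch_dir, "foo1.txt")

# The same three Cyrillic characters used in the session above.
write_text(scratch_path, "\u0430\u0431\u0432")
raw = read_bytes(scratch_path)
```

Because the encoding is named at the call site, the bytes on disk no longer depend on whatever default the platform happens to pick.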
Re: Yet another unicode WTF
On 5 Jun, 03:18, Ron Garret <rnospa...@flownet.com> wrote:

> According to what I thought I knew about unix (and I had fancied
> myself a bit of an expert until just now) this is impossible. Python
> is obviously picking up a different default encoding when its output
> is being piped to a file, but I always thought one of the fundamental
> invariants of unix processes was that there's no way for a process to
> know what's on the other end of its stdout.

The only way to think about this (in Python 2.x, at least) is to consider stream and file objects as things which only understand plain byte strings. Consequently, use of the codecs module is required if receiving/sending Unicode objects from/to streams and files.

Paul
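A small sketch of the codecs approach Paul describes, under the assumption that an in-memory io.BytesIO is an acceptable stand-in for a byte-oriented stream such as a redirected stdout. The variable names are illustrative only; codecs.getwriter itself is part of the standard library in both Python 2 and 3.

```python
import codecs
import io

# Stand-in for a stream that only understands bytes.
byte_stream = io.BytesIO()

# codecs.getwriter returns a StreamWriter class; calling it wraps the byte stream
# so that text written to the wrapper is encoded before reaching the stream.
utf8_writer = codecs.getwriter("utf-8")(byte_stream)
utf8_writer.write("\u03bb\n")

encoded = byte_stream.getvalue()
```

The wrapper keeps the encoding decision in one place, so the rest of the program can pass text around without encoding at every write.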
Re: Yet another unicode WTF
On 5 Jun, 11:51, Ben Finney <ben+pyt...@benfinney.id.au> wrote:

> Actually strings in Python 2.4 or later have the ‘encode’ method,
> with no need for importing extra modules:
>
> =====
> $ python -c 'import sys; sys.stdout.write(u"\u03bb\n".encode("utf-8"))'
> λ
> $ python -c 'import sys; sys.stdout.write(u"\u03bb\n".encode("utf-8"))' > foo ; cat foo
> λ
> =====

Those are Unicode objects, not traditional Python strings. Although strings do have decode and encode methods, even in Python 2.3, the former is shorthand for the construction of a Unicode object using the stated encoding, whereas the latter seems to rely on the error-prone automatic encoding detection in order to create a Unicode object and then encode the result; in effect, recoding the string.

As I noted, if one wants to remain sane and not think about encoding everything everywhere, creating a stream using a codecs module function or class will permit the construction of something which deals with Unicode objects satisfactorily.

Paul
Re: Yet another unicode WTF
In article <nad-31678a.00033005062...@ger.gmane.org>, Ned Deily <n...@acm.org> wrote:

> In python 3.x, of course, the encoding happens automatically but you
> still have to tell python, via the encoding argument to open, what
> the encoding of the file's content is (or accept python's default
> which may not be very useful):
>
> >>> open('foo1','w').encoding
> 'mac-roman'
>
> WTF, indeed.

BTW, I've opened a 3.1 release blocker issue about 'mac-roman' as a default on OS X. Hard to believe none of us has noticed this up to now!

http://bugs.python.org/issue6202

-- 
Ned Deily, n...@acm.org
Yet another unicode WTF
Python 2.6.2 on OS X 10.5.7:

[...@mickey:~]$ echo $LANG
en_US.UTF-8
[...@mickey:~]$ cat frob.py
#!/usr/bin/env python
print u'\u03BB'
[...@mickey:~]$ ./frob.py
ª
[...@mickey:~]$ ./frob.py > foo
Traceback (most recent call last):
  File "./frob.py", line 2, in <module>
    print u'\u03BB'
UnicodeEncodeError: 'ascii' codec can't encode character u'\u03bb' in position 0: ordinal not in range(128)

(That's supposed to be a small greek lambda, but I'm using a brain-damaged news reader that won't let me set the character encoding. It shows up correctly in my terminal.)

According to what I thought I knew about unix (and I had fancied myself a bit of an expert until just now) this is impossible. Python is obviously picking up a different default encoding when its output is being piped to a file, but I always thought one of the fundamental invariants of unix processes was that there's no way for a process to know what's on the other end of its stdout.

Clues appreciated. Thanks.

rg
Re: Yet another unicode WTF
In message <rnospamon-e7e08b.18181804062...@news.gha.chartermi.net>, Ron Garret wrote:

> Python 2.6.2 on OS X 10.5.7:

Same result, Python 2.6.1-3 on Debian Unstable. My $LANG is en_NZ.UTF-8.

> ... I always thought one of the fundamental invariants of unix
> processes was that there's no way for a process to know what's on the
> other end of its stdout.

Well, there have long been functions like isatty(3). That's probably what's involved here.
Re: Yet another unicode WTF
Ron Garret <rnospa...@flownet.com> writes:

> According to what I thought I knew about unix (and I had fancied
> myself a bit of an expert until just now) this is impossible. Python
> is obviously picking up a different default encoding when its output
> is being piped to a file, but I always thought one of the fundamental
> invariants of unix processes was that there's no way for a process to
> know what's on the other end of its stdout.

It certainly can. If you're using GNU and a terminal that declares support for colour, examine the difference between these two:

    $ ls --color=auto
    $ ls --color=auto > foo ; cat foo

> Clues appreciated. Thanks.

Research ‘man 3 isatty’ for the function most commonly used to determine whether a file descriptor represents a terminal.

-- 
Ben Finney
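The same check is available from Python through the isatty method on file objects. The following is only a sketch of how a program might branch on it; the helper is_terminal and the FakeTerminal class are hypothetical names introduced for illustration.

```python
import io


def is_terminal(stream):
    """Return True if the stream reports that it is attached to a terminal."""
    isatty = getattr(stream, "isatty", None)
    return bool(isatty and isatty())


class FakeTerminal(object):
    """Hypothetical stand-in for a stream connected to a tty."""

    def isatty(self):
        return True


# An in-memory buffer behaves like redirected output: it is not a terminal.
redirected_like = is_terminal(io.StringIO())
terminal_like = is_terminal(FakeTerminal())
```

Calling is_terminal(sys.stdout) in a script, once directly and once with output redirected to a file, should show the same split that the two ls commands demonstrate.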
Re: Yet another unicode WTF
On Thu, 04 Jun 2009 22:18:24 -0300, Ron Garret <rnospa...@flownet.com> wrote:

> Python 2.6.2 on OS X 10.5.7:
>
> [...@mickey:~]$ echo $LANG
> en_US.UTF-8
> [...@mickey:~]$ cat frob.py
> #!/usr/bin/env python
> print u'\u03BB'
> [...@mickey:~]$ ./frob.py
> ª
> [...@mickey:~]$ ./frob.py > foo
> Traceback (most recent call last):
>   File "./frob.py", line 2, in <module>
>     print u'\u03BB'
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u03bb' in
> position 0: ordinal not in range(128)
>
> (That's supposed to be a small greek lambda, but I'm using a
> brain-damaged news reader that won't let me set the character
> encoding. It shows up correctly in my terminal.)
>
> According to what I thought I knew about unix (and I had fancied
> myself a bit of an expert until just now) this is impossible. Python
> is obviously picking up a different default encoding when its output
> is being piped to a file, but I always thought one of the fundamental
> invariants of unix processes was that there's no way for a process to
> know what's on the other end of its stdout.

It may be hard to know *who* is at the other end of the pipe, but it's easy to know *what* kind of file it is. Lots of programs detect whether stdout is a tty or not (using isatty(3)) and adapt their output accordingly; ls is one example.

Python knows the terminal encoding (or at least can make a good guess), but a file may use *any* encoding you want, completely unrelated to your terminal settings. So when stdout is redirected, Python refuses to guess its encoding; see the PYTHONIOENCODING environment variable.

-- 
Gabriel Genellina
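To see the effect of PYTHONIOENCODING on a redirected stdout, one can run a child interpreter with its output captured through a pipe. This is a rough sketch assuming a Python 3 interpreter is the one running it; the helper run_with_io_encoding is a hypothetical name, while subprocess.check_output and the PYTHONIOENCODING variable are real.

```python
import os
import subprocess
import sys


def run_with_io_encoding(code, encoding):
    """Run code in a child interpreter with stdout piped and PYTHONIOENCODING set."""
    env = dict(os.environ)
    env["PYTHONIOENCODING"] = encoding
    return subprocess.check_output([sys.executable, "-c", code], env=env)


# The child prints a small Greek lambda; its stdout is a pipe, not a terminal.
child_code = 'print("\\u03bb")'
piped_output = run_with_io_encoding(child_code, "utf-8")
```

With the variable set, the child encodes its output as requested even though stdout is not a terminal, which is the situation that raised UnicodeEncodeError in the original report.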
Re: Yet another unicode WTF
In article <h09ten$5q...@lust.ihug.co.nz>, Lawrence D'Oliveiro <l...@geek-central.gen.new_zealand> wrote:

> In message <rnospamon-e7e08b.18181804062...@news.gha.chartermi.net>,
> Ron Garret wrote:
>
> > Python 2.6.2 on OS X 10.5.7:
>
> Same result, Python 2.6.1-3 on Debian Unstable. My $LANG is
> en_NZ.UTF-8.
>
> > ... I always thought one of the fundamental invariants of unix
> > processes was that there's no way for a process to know what's on
> > the other end of its stdout.
>
> Well, there have long been functions like isatty(3). That's probably
> what's involved here.

Oh. Right. Duh. I am having an unbelievably bad day involving lawyers. (And not language lawyers, real ones.)

Found the answer here:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=415968

rg
Re: Yet another unicode WTF
Ron Garret <rnospa...@flownet.com> writes:

> Python 2.6.2 on OS X 10.5.7:
>
> [...@mickey:~]$ echo $LANG
> en_US.UTF-8
> [...@mickey:~]$ cat frob.py
> #!/usr/bin/env python
> print u'\u03BB'
> [...@mickey:~]$ ./frob.py
> ª
> [...@mickey:~]$ ./frob.py > foo
> Traceback (most recent call last):
>   File "./frob.py", line 2, in <module>
>     print u'\u03BB'
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u03bb' in
> position 0: ordinal not in range(128)

I get the same behaviour on Debian GNU/Linux, python 2.5.2. It's certainly not desirable; the terminal, the shell, and the filesystem are all using UTF-8 so it should work fine.

You might be best advised to report this as a bug to the Python bug tracker <URL:http://bugs.python.org/>.

-- 
Ben Finney
Re: Yet another unicode WTF
In article <rnospamon-e7e08b.18181804062...@news.gha.chartermi.net>, Ron Garret <rnospa...@flownet.com> wrote:

> Python 2.6.2 on OS X 10.5.7:
>
> [...@mickey:~]$ echo $LANG
> en_US.UTF-8
> [...@mickey:~]$ cat frob.py
> #!/usr/bin/env python
> print u'\u03BB'
> [...@mickey:~]$ ./frob.py
> ª
> [...@mickey:~]$ ./frob.py > foo
> Traceback (most recent call last):
>   File "./frob.py", line 2, in <module>
>     print u'\u03BB'
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u03bb' in
> position 0: ordinal not in range(128)
>
> (That's supposed to be a small greek lambda, but I'm using a
> brain-damaged news reader that won't let me set the character
> encoding. It shows up correctly in my terminal.)
>
> According to what I thought I knew about unix (and I had fancied
> myself a bit of an expert until just now) this is impossible. Python
> is obviously picking up a different default encoding when its output
> is being piped to a file, but I always thought one of the fundamental
> invariants of unix processes was that there's no way for a process to
> know what's on the other end of its stdout.
>
> Clues appreciated. Thanks.

$ python2.6 -c 'import sys; print sys.stdout.encoding, \
  sys.stdout.isatty()'
UTF-8 True
$ python2.6 -c 'import sys; print sys.stdout.encoding, \
  sys.stdout.isatty()' > foo ; cat foo
None False

-- 
Ned Deily, n...@acm.org
Re: Yet another unicode WTF
Ned Deily <n...@acm.org> writes:

> $ python2.6 -c 'import sys; print sys.stdout.encoding, \
>   sys.stdout.isatty()'
> UTF-8 True
> $ python2.6 -c 'import sys; print sys.stdout.encoding, \
>   sys.stdout.isatty()' > foo ; cat foo
> None False

So shouldn't the second case also detect UTF-8? The filesystem knows it's UTF-8, the shell knows it too. Why doesn't Python know it?

-- 
Ben Finney
Re: Yet another unicode WTF
In message <mailman.1149.1244167714.8015.python-l...@python.org>, Gabriel Genellina wrote:

> Python knows the terminal encoding (or at least can make a good
> guess), but a file may use *any* encoding you want, completely
> unrelated to your terminal settings.

It should still respect your localization settings, though.
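One way a program could honour the localization settings itself is to fall back to the locale's preferred encoding whenever a stream declares none. This sketch illustrates the behaviour being asked for, not what Python 2.6 actually does; the helper effective_encoding and both stand-in classes are hypothetical.

```python
import locale


def effective_encoding(stream):
    """Return the stream's declared encoding, or the locale's preferred one if it has none."""
    declared = getattr(stream, "encoding", None)
    return declared or locale.getpreferredencoding(False)


class RedirectedStream(object):
    """Hypothetical stand-in for a redirected Python 2 stdout, whose encoding is None."""
    encoding = None


class TerminalStream(object):
    """Hypothetical stand-in for a stdout attached to a UTF-8 terminal."""
    encoding = "UTF-8"


fallback_choice = effective_encoding(RedirectedStream())
declared_choice = effective_encoding(TerminalStream())
```

The result could then be passed to codecs.getwriter to wrap the stream, so that redirected output is encoded according to the user's locale rather than ASCII.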