Re: Yet another unicode WTF

2009-06-05 Thread Ben Finney

Gabriel Genellina gagsl-...@yahoo.com.ar writes:

 Python knows the terminal encoding (or at least can make a good
 guess), but a file may use *any* encoding you want, completely
 unrelated to your terminal settings.

It may, yes, and the programmer is free to specify any encoding.

 So when stdout is redirected, Python refuses to guess its encoding;

But Python doesn't have to guess; the terminal encoding is as specified
in either case, no?

 see the PYTHONIOENCODING environment variable.

For the standard streams, the specified terminal encoding available to
every program makes the most sense — certainly more sense than a
Python-specific variable, or the “default to ASCII” of the current
behaviour.

-- 
 \“Holy uncanny photographic mental processes, Batman!” —Robin |
  `\   |
_o__)  |
Ben Finney
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Yet another unicode WTF

2009-06-05 Thread Ned Deily

In article 8763fbmk5a@benfinney.id.au,
 Ben Finney ben+pyt...@benfinney.id.au wrote:
 Ned Deily n...@acm.org writes:
  $ python2.6 -c 'import sys; print sys.stdout.encoding, \
   sys.stdout.isatty()'
  UTF-8 True
  $ python2.6 -c 'import sys; print sys.stdout.encoding, \
   sys.stdout.isatty()' > foo ; cat foo
  None False
 
 So shouldn't the second case also detect UTF-8? The filesystem knows
 it's UTF-8, the shell knows it too. Why doesn't Python know it?

The filesystem knows what is UTF-8?  While the setting of the locale 
environment variables may influence how the file system interprets the 
*name* of a file, it has no direct influence on what the *contents* of a 
file are or are supposed to be.  Remember that in Python 2.x, a file is 
just a sequence of bytes.  If you want to write Unicode to the file, you 
need to use something like codecs.open to wrap the file object with the 
proper StreamWriter encoder.

What confuses matters in 2.x is the print statement's under-the-covers 
implicit Unicode encoding for files connected to a terminal:

http://bugs.python.org/issue612627
http://bugs.python.org/issue4947
http://wiki.python.org/moin/PrintFails

>>> import sys
>>> x = u'\u0430\u0431\u0432'
>>> print x
[nice looking characters here]
>>> sys.stdout.write(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
>>> sys.stdout.encoding
'UTF-8'

In python 3.x, of course, the encoding happens automatically but you 
still have to tell python, via the encoding argument to open, what the 
encoding of the file's content is (or accept python's default which may 
not be very useful):

>>> open('foo1','w').encoding
'mac-roman'

WTF, indeed.
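
For illustration, a minimal sketch of naming the encoding explicitly in 3.x instead of accepting the platform default. It assumes Python 3; the scratch file is a throwaway temporary path, not the 'foo1' file from the session above.

```python
import os
import tempfile

text = "\u03bb\n"  # U+03BB (small Greek lambda) plus a newline

# Hypothetical scratch file; any writable path would do.
fd, path = tempfile.mkstemp()
os.close(fd)

# Name the encoding explicitly rather than relying on the default.
# newline="\n" keeps the line ending untranslated on every platform.
with open(path, "w", encoding="utf-8", newline="\n") as f:
    f.write(text)

# Reading in binary mode shows the bytes that actually reached the disk.
with open(path, "rb") as f:
    raw = f.read()

os.remove(path)
print(repr(raw))
```

With the encoding spelled out, the bytes on disk no longer depend on which default a given platform happens to pick.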

-- 
 Ned Deily,
 n...@acm.org


Re: Yet another unicode WTF

2009-06-05 Thread Paul Boddie

On 5 Jun, 03:18, Ron Garret rnospa...@flownet.com wrote:

 According to what I thought I knew about unix (and I had fancied myself
 a bit of an expert until just now) this is impossible.  Python is
 obviously picking up a different default encoding when its output is
 being piped to a file, but I always thought one of the fundamental
 invariants of unix processes was that there's no way for a process to
 know what's on the other end of its stdout.

The only way to think about this (in Python 2.x, at least) is to
consider stream and file objects as things which only understand plain
byte strings. Consequently, use of the codecs module is required if
receiving/sending Unicode objects from/to streams and files.
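
A small sketch of that approach, assuming an in-memory io.BytesIO object as a stand-in for a real byte-oriented stream such as a redirected stdout (the variable names are only illustrative):

```python
import codecs
import io

# Stand-in for a stream that only understands bytes.
raw = io.BytesIO()

# codecs.getwriter returns the StreamWriter class for the named codec;
# calling it wraps the byte stream in an encoding writer.
writer = codecs.getwriter("utf-8")(raw)

# The wrapper accepts Unicode text and emits encoded bytes underneath.
writer.write(u"\u03bb\n")

encoded = raw.getvalue()
print(repr(encoded))
```

The same wrapper can be placed around any object with a write method that accepts bytes.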

Paul

Re: Yet another unicode WTF

2009-06-05 Thread Paul Boddie

On 5 Jun, 11:51, Ben Finney ben+pyt...@benfinney.id.au wrote:

 Actually strings in Python 2.4 or later have the ‘encode’ method, with
 no need for importing extra modules:

 =
 $ python -c 'import sys; sys.stdout.write(u"\u03bb\n".encode("utf-8"))'
 λ

 $ python -c 'import sys; sys.stdout.write(u"\u03bb\n".encode("utf-8"))' > foo ; cat foo
 λ
 =

Those are Unicode objects, not traditional Python strings. Although
strings do have decode and encode methods, even in Python 2.3, the
former is shorthand for the construction of a Unicode object using the
stated encoding whereas the latter seems to rely on the error-prone
automatic encoding detection in order to create a Unicode object and
then encode the result - in effect, recoding the string.

As I noted, if one wants to remain sane and not think about encoding
everything everywhere, creating a stream using a codecs module
function or class will permit the construction of something which
deals with Unicode objects satisfactorily.
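
The reading direction can be sketched the same way. This assumes the incoming bytes are UTF-8, and again uses io.BytesIO as a hypothetical stand-in for a file or pipe:

```python
import codecs
import io

# Bytes as they might arrive from a file or pipe: UTF-8 for U+03BB, newline.
incoming = io.BytesIO(b"\xce\xbb\n")

# codecs.getreader returns the StreamReader class for the named codec;
# the resulting object decodes bytes to Unicode as it reads.
reader = codecs.getreader("utf-8")(incoming)

decoded = reader.read()
print(repr(decoded))
```

Decoding once at the boundary means the rest of the program handles only Unicode objects.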

Paul

Re: Yet another unicode WTF

2009-06-05 Thread Ned Deily

In article nad-31678a.00033005062...@ger.gmane.org,
 Ned Deily n...@acm.org wrote:
 In python 3.x, of course, the encoding happens automatically but you 
 still have to tell python, via the encoding argument to open, what the 
 encoding of the file's content is (or accept python's default which may 
 not be very useful):
 
  >>> open('foo1','w').encoding
 'mac-roman'
 
 WTF, indeed.

BTW, I've opened a 3.1 release blocker issue about 'mac-roman' as a 
default on OS X.  Hard to believe none of us has noticed this up to now!

http://bugs.python.org/issue6202

-- 
 Ned Deily,
 n...@acm.org


Yet another unicode WTF

2009-06-04 Thread Ron Garret

Python 2.6.2 on OS X 10.5.7:

[...@mickey:~]$ echo $LANG
en_US.UTF-8
[...@mickey:~]$ cat frob.py 
#!/usr/bin/env python
print u'\u03BB'

[...@mickey:~]$ ./frob.py 
ª
[...@mickey:~]$ ./frob.py > foo
Traceback (most recent call last):
  File "./frob.py", line 2, in <module>
    print u'\u03BB'
UnicodeEncodeError: 'ascii' codec can't encode character u'\u03bb' in 
position 0: ordinal not in range(128)


(That's supposed to be a small greek lambda, but I'm using a 
brain-damaged news reader that won't let me set the character encoding.  
It shows up correctly in my terminal.)

According to what I thought I knew about unix (and I had fancied myself 
a bit of an expert until just now) this is impossible.  Python is 
obviously picking up a different default encoding when its output is 
being piped to a file, but I always thought one of the fundamental 
invariants of unix processes was that there's no way for a process to 
know what's on the other end of its stdout.

Clues appreciated.  Thanks.

rg

Re: Yet another unicode WTF

2009-06-04 Thread Lawrence D'Oliveiro

In message rnospamon-e7e08b.18181804062...@news.gha.chartermi.net, Ron 
Garret wrote:

 Python 2.6.2 on OS X 10.5.7:

Same result, Python 2.6.1-3 on Debian Unstable. My $LANG is en_NZ.UTF-8.

 ... I always thought one of the fundamental
 invariants of unix processes was that there's no way for a process to
 know what's on the other end of its stdout.

Well, there have long been functions like isatty(3). That's probably what's 
involved here.


Re: Yet another unicode WTF

2009-06-04 Thread Ben Finney

Ron Garret rnospa...@flownet.com writes:

 According to what I thought I knew about unix (and I had fancied myself 
 a bit of an expert until just now) this is impossible.  Python is 
 obviously picking up a different default encoding when its output is 
 being piped to a file, but I always thought one of the fundamental 
 invariants of unix processes was that there's no way for a process to 
 know what's on the other end of its stdout.

It certainly can. If you're using GNU and a terminal that declares
support for colour, examine the difference between these two:

$ ls --color=auto
$ ls --color=auto > foo ; cat foo

 Clues appreciated.  Thanks.

Research ‘man 3 isatty’ for the function most commonly used to determine
whether a file descriptor represents a terminal.

-- 
 \   “If you are unable to leave your room, expose yourself in the |
  `\window.” —instructions in case of fire, hotel, Finland |
_o__)  |
Ben Finney

Re: Yet another unicode WTF

2009-06-04 Thread Gabriel Genellina

On Thu, 04 Jun 2009 22:18:24 -0300, Ron Garret rnospa...@flownet.com  
wrote:



Python 2.6.2 on OS X 10.5.7:

[...@mickey:~]$ echo $LANG
en_US.UTF-8
[...@mickey:~]$ cat frob.py
#!/usr/bin/env python
print u'\u03BB'

[...@mickey:~]$ ./frob.py
ª
[...@mickey:~]$ ./frob.py > foo
Traceback (most recent call last):
  File "./frob.py", line 2, in <module>
    print u'\u03BB'
UnicodeEncodeError: 'ascii' codec can't encode character u'\u03bb' in
position 0: ordinal not in range(128)


(That's supposed to be a small greek lambda, but I'm using a
brain-damaged news reader that won't let me set the character encoding.
It shows up correctly in my terminal.)

According to what I thought I knew about unix (and I had fancied myself
a bit of an expert until just now) this is impossible.  Python is
obviously picking up a different default encoding when its output is
being piped to a file, but I always thought one of the fundamental
invariants of unix processes was that there's no way for a process to
know what's on the other end of its stdout.


It may be hard to know *who* is at the other end of the pipe, but it's  
easy to know *what* kind of file it is.
Lots of programs detect whether stdout is a tty or not (using isatty(3))  
and adapt their output accordingly; ls is one example.
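
As a rough sketch of that check from Python (the helper name is_terminal is hypothetical; os.isatty is the standard-library wrapper around isatty(3)):

```python
import os
import sys

def is_terminal(stream):
    """Return True if the stream's file descriptor refers to a terminal."""
    try:
        fd = stream.fileno()
    except (AttributeError, OSError, ValueError):
        # Objects such as io.StringIO have no usable descriptor.
        return False
    return os.isatty(fd)

# Where the output goes decides the answer: run this once normally and
# once with stdout redirected to a file to see both results.
print("stdout is a terminal:", is_terminal(sys.stdout))
```

The process learns nothing about who reads the other end, only what kind of file its descriptor refers to.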


Python knows the terminal encoding (or at least can make a good guess),  
but a file may use *any* encoding you want, completely unrelated to your  
terminal settings. So when stdout is redirected, Python refuses to guess  
its encoding; see the PYTHONIOENCODING environment variable.
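
A hedged sketch of the variable's effect, assuming Python 3 for the child interpreter. The helper run_child is hypothetical, and ISO-8859-7 is chosen only because it happens to contain the lambda; stdout is a pipe here, so no terminal is involved:

```python
import os
import subprocess
import sys

def run_child(encoding):
    """Run a child interpreter with stdout piped and return its raw output."""
    # Copy the current environment and set PYTHONIOENCODING for the child only.
    env = dict(os.environ, PYTHONIOENCODING=encoding)
    code = "print(u'\\u03bb')"
    output = subprocess.check_output([sys.executable, "-c", code], env=env)
    return output.strip()  # drop the trailing line ending

utf8_bytes = run_child("utf-8")
greek_bytes = run_child("iso-8859-7")

print(repr(utf8_bytes))
print(repr(greek_bytes))
```

The same character comes out as different bytes depending on the variable, which is the decision Python declines to make on its own for a redirected stream.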


--
Gabriel Genellina


Re: Yet another unicode WTF

2009-06-04 Thread Ron Garret

In article h09ten$5q...@lust.ihug.co.nz,
 Lawrence D'Oliveiro l...@geek-central.gen.new_zealand wrote:

 In message rnospamon-e7e08b.18181804062...@news.gha.chartermi.net, Ron 
 Garret wrote:
 
  Python 2.6.2 on OS X 10.5.7:
 
 Same result, Python 2.6.1-3 on Debian Unstable. My $LANG is en_NZ.UTF-8.
 
  ... I always thought one of the fundamental
  invariants of unix processes was that there's no way for a process to
  know what's on the other end of its stdout.
 
 Well, there have long been functions like isatty(3). That's probably what's 
 involved here.

Oh.  Right.  Duh.

I am having an unbelievably bad day involving lawyers.  (And not 
language lawyers, real ones.)

Found the answer here:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=415968

rg

Re: Yet another unicode WTF

2009-06-04 Thread Ben Finney

Ron Garret rnospa...@flownet.com writes:

 Python 2.6.2 on OS X 10.5.7:
 
 [...@mickey:~]$ echo $LANG
 en_US.UTF-8
 [...@mickey:~]$ cat frob.py 
 #!/usr/bin/env python
 print u'\u03BB'
 
 [...@mickey:~]$ ./frob.py 
 ª
 [...@mickey:~]$ ./frob.py > foo
 Traceback (most recent call last):
   File "./frob.py", line 2, in <module>
     print u'\u03BB'
 UnicodeEncodeError: 'ascii' codec can't encode character u'\u03bb' in 
 position 0: ordinal not in range(128)

I get the same behaviour on Debian GNU/Linux, python 2.5.2. It's
certainly not desirable; the terminal, the shell, and the filesystem are
all using UTF-8 so it should work fine.

You might be best advised to report this as a bug to the Python bug
tracker <URL:http://bugs.python.org/>.

-- 
 \“I fly Air Bizarre. You buy a combination one-way round-trip |
  `\ticket. Leave any Monday, and they bring you back the previous |
_o__) Friday. That way you still have the weekend.” —Steven Wright |
Ben Finney

Re: Yet another unicode WTF

2009-06-04 Thread Ned Deily

In article rnospamon-e7e08b.18181804062...@news.gha.chartermi.net,
 Ron Garret rnospa...@flownet.com wrote:
 Python 2.6.2 on OS X 10.5.7:
 
 [...@mickey:~]$ echo $LANG
 en_US.UTF-8
 [...@mickey:~]$ cat frob.py 
 #!/usr/bin/env python
 print u'\u03BB'
 
 [...@mickey:~]$ ./frob.py 
 ª
 [...@mickey:~]$ ./frob.py > foo
 Traceback (most recent call last):
   File "./frob.py", line 2, in <module>
     print u'\u03BB'
 UnicodeEncodeError: 'ascii' codec can't encode character u'\u03bb' in 
 position 0: ordinal not in range(128)
 
 
 (That's supposed to be a small greek lambda, but I'm using a 
 brain-damaged news reader that won't let me set the character encoding.  
 It shows up correctly in my terminal.)
 
 According to what I thought I knew about unix (and I had fancied myself 
 a bit of an expert until just now) this is impossible.  Python is 
 obviously picking up a different default encoding when its output is 
 being piped to a file, but I always thought one of the fundamental 
 invariants of unix processes was that there's no way for a process to 
 know what's on the other end of its stdout.
 
 Clues appreciated.  Thanks.

$ python2.6 -c 'import sys; print sys.stdout.encoding, \
 sys.stdout.isatty()'
UTF-8 True
$ python2.6 -c 'import sys; print sys.stdout.encoding, \
 sys.stdout.isatty()' > foo ; cat foo
None False

-- 
 Ned Deily,
 n...@acm.org


Re: Yet another unicode WTF

2009-06-04 Thread Ben Finney

Ned Deily n...@acm.org writes:

 $ python2.6 -c 'import sys; print sys.stdout.encoding, \
  sys.stdout.isatty()'
 UTF-8 True
 $ python2.6 -c 'import sys; print sys.stdout.encoding, \
  sys.stdout.isatty()' > foo ; cat foo
 None False

So shouldn't the second case also detect UTF-8? The filesystem knows
it's UTF-8, the shell knows it too. Why doesn't Python know it?

-- 
 \  “When I was born I was so surprised I couldn't talk for a year |
  `\and a half.” —Gracie Allen |
_o__)  |
Ben Finney

Re: Yet another unicode WTF

2009-06-04 Thread Lawrence D'Oliveiro

In message mailman.1149.1244167714.8015.python-l...@python.org, Gabriel 
Genellina wrote:

 Python knows the terminal encoding (or at least can make a good guess),
 but a file may use *any* encoding you want, completely unrelated to your
 terminal settings.

It should still respect your localization settings, though.
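
For reference, a minimal sketch of how a program can ask for the encoding implied by the user's locale settings, using the standard locale module (the variable name preferred is only illustrative, and the result depends entirely on the environment it runs in):

```python
import locale

# Adopt the user's locale settings (LANG, LC_ALL, LC_CTYPE) for this process.
try:
    locale.setlocale(locale.LC_CTYPE, "")
except locale.Error:
    pass  # unsupported locale name; the default C locale stays in effect

# Ask which encoding the current locale prefers for text data.
preferred = locale.getpreferredencoding(False)
print(preferred)
```

With LANG set to something like en_NZ.UTF-8 this would typically report a UTF-8 codec name, though the exact spelling varies by platform.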

