Re: the stupid encoding problem to stdout

2011-06-13 Thread Sérgio Monteiro Basto
Ian Kelly wrote:

 If you want your output to behave that way, then all you have to do is
 specify that with an explicit encode step.

ok 

 If we want we change default for whatever we want, but without this
 default change Python should not change his behavior depending on
 output. yeah I prefer strange output for a different platform, to a
 decode errors.
 
 Sorry, I disagree.  If your program is going to fail, it's better that
 it fail noisily (with an error) than silently (with no notice that
 anything is wrong).

Hi,
OK, a short summary: I got the solution, which is to set
PYTHONIOENCODING=utf-8 in the environment. If that were the default on
modern GNU/Linux, it would have saved me a lot of time.
My practical problem is simple: I write a script that I run in a shell
for testing, and that logs to a file when used with a configuration.
Everything runs well in a shell, and sometimes (later) it fails when logging
to a file, with a UnicodeEncodeError: 'ascii' codec can't encode character
u'\xe7' in position.
So, to make it work in both cases (tty and files), I filled all my code with
string.encode('utf-8') as a workaround, when what I wanted all along was
PYTHONIOENCODING=utf-8. I have everything in UTF-8: the database is in UTF-8,
I code in UTF-8, my OS is in UTF-8. In about 3 years of learning Python
I have lost many, many hours trying to understand this problem.
And see, I can send ASCII and UTF-8 to a UTF-8 output and never have
problems, but if I send ASCII and UTF-8 to ASCII files I sometimes get
encode errors.
So please consider, at least on Linux, making UTF-8 the default encoding
(because we would have fewer problems), or making it clearer that piping to
a file is different from writing to a tty and that the problem is that files
default to ASCII. Or base the default of PYTHONIOENCODING on the LANG
environment variable.

Anyway, many thanks for your time and for helping me out.
I don't know how things work in Python 3. In Python 3, are the defaults
UTF-8?

Thanks, 
--
Sérgio M. B. 
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: the stupid encoding problem to stdout

2011-06-13 Thread Chris Angelico
2011/6/14 Sérgio Monteiro Basto sergi...@sapo.pt:
 And see, I can send ascii and utf-8 to utf-8 output and never have problems,
 but if I send ascii and utf-8 to ascii files sometimes got encode errors.


If something fits inside 7-bit ASCII, it is by definition valid UTF-8.
This is not a coincidence.
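
A small sketch of that point, not taken from the original posts: pure-ASCII text yields the same bytes under both codecs, while a non-ASCII character only has a UTF-8 form. The variable names are illustrative.

```python
# Pure-ASCII text gives identical bytes under the ASCII and UTF-8 codecs.
plain = u"mocambique"
same_bytes = plain.encode("ascii") == plain.encode("utf-8")

# A non-ASCII character takes two bytes in UTF-8 and has no ASCII form.
accented = u"mo\xe7ambique"  # the word with c-cedilla, written with an escape
utf8_bytes = accented.encode("utf-8")

try:
    accented.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

This is why sending ASCII-only data to a UTF-8 output never fails, while the reverse sometimes does.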

Those hours you've spent grokking this are not wasted, if you now have
a comprehension of characters vs encodings. More people in the world
need to understand that difference! :)

Chris Angelico


Re: the stupid encoding problem to stdout

2011-06-10 Thread Laurent Claessens

On 09/06/2011 04:18, Sérgio Monteiro Basto wrote:
 hi,
 cat test.py
 #!/usr/bin/env python
 #-*- coding: utf-8 -*-
 u = u'moçambique'
 print u.encode("utf-8")
 print u

 chmod +x test.py
 ./test.py
 moçambique
 moçambique


The following tries to encode before printing. If you pass an object that
is already UTF-8 encoded, it just prints it; if not, it encodes it. All
print statements go through MyPrint.write


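#!/usr/bin/env python
#-*- coding: utf-8 -*-

import sys

class MyPrint(object):
    """Replace sys.stdout with a wrapper that encodes unicode as UTF-8."""
    def __init__(self):
        # Keep the real stdout so write() can forward to it.
        self.old_stdout=sys.stdout
        sys.stdout=self
    def write(self,text):
        try:
            # unicode objects are encoded to UTF-8 bytes here.
            encoded=text.encode("utf8")
        except UnicodeDecodeError:
            # A byte string holding non-ASCII data is passed through as-is.
            encoded=text
        self.old_stdout.write(encoded)


MyPrint()

u = u'moçambique'
print u.encode("utf-8")
print u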

TEST :

$ ./test.py
moçambique
moçambique

$ ./test.py > test.txt
$ cat test.txt
moçambique
moçambique


By the way, my code will not help with error messages. I think that
errors are printed by sys.stderr.write. So if you want to do

raise moçambique
you should think about adding stderr to the class MyPrint
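
For comparison, here is a minimal sketch of the same idea using the standard library's codecs.getwriter instead of a hand-written class. It is an alternative, not the code from this message, and the in-memory io.BytesIO stream is only a stand-in (an assumption for the demo) for a redirected stdout.

```python
import codecs
import io

# Stand-in for a byte stream such as stdout redirected to a file.
raw = io.BytesIO()

# codecs.getwriter returns a StreamWriter class; its instances encode
# text to the chosen encoding on every write().
writer = codecs.getwriter("utf-8")(raw)
writer.write(u"mo\xe7ambique\n")

written = raw.getvalue()
```

In Python 2 the same wrapper could be applied with sys.stdout = codecs.getwriter("utf-8")(sys.stdout). Unlike MyPrint, it does not pass already-encoded non-ASCII byte strings through, so mixing both kinds of print would still fail there.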


If you know French, I strongly recommend "Comprendre les erreurs
unicode" by Victor Stinner:

http://dl.afpy.org/pycon-fr-09/Comprendre_les_erreurs_unicode.pdf

Have a nice day
Laurent


Re: the stupid encoding problem to stdout

2011-06-10 Thread Sérgio Monteiro Basto
Ben Finney wrote:

  What should it decode to, then?

 UTF-8, as in tty
 
 But when you explicitly redirect to a file, it's not going to a TTY.
 It's going to a file whose encoding isn't known unless you specify it.

OK, after thinking about this: the problem exists because Python wants to be
smart with ttys, which from my point of view is wrong. It should not encode
to UTF-8 just because the tty is in UTF-8. Python should always encode to
the same thing. If the default is ASCII, it should always encode to ASCII.
Yes, it should send ASCII to the tty too. If I send my code to a guy on
Windows who uses a tty with cp1000whatever, it shouldn't give decoding
errors; it should send ASCII.
We can change the default to whatever we want, but without such a change of
default Python should not change its behaviour depending on the output.
Yes, I prefer strange output on a different platform to decode errors.
And I have /usr/bin/iconv.
 
Thanks for attention, sorry about my very limited English.
--
Sérgio M. B.


Re: the stupid encoding problem to stdout

2011-06-10 Thread Ian Kelly
2011/6/10 Sérgio Monteiro Basto sergi...@sapo.pt:
 ok after thinking about this, this problem exist because Python want be
 smart with ttys, which is in my point of view is wrong, should not encode to
 utf-8, because tty is in utf-8. Python should always encode to the same
 thing. If the default is ascii, should always encode to ascii.
 yeah should send to tty in ascii, if I send my code to a guy in windows
 which use tty with cp1000whatever , shouldn't give decoding errors and
 should send in ascii .

You can't have your cake and eat it too.  If Python needs to output a
string in ascii, and that string can't be represented in ascii, then
raising an exception is the only reasonable thing to do.  You seem to
be suggesting that Python should do an implicit output.encode('ascii',
'replace') on all Unicode output, which might be okay for a TTY, but
you wouldn't want that for file output; it would allow Python to
silently create garbage data.
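
To make the trade-off concrete, a short sketch (illustrative names only) of what the 'replace' error handler does compared with the default strict behaviour:

```python
text = u"mo\xe7ambique"  # the word with c-cedilla

# 'replace' never fails, but the original character is lost for good.
lossy = text.encode("ascii", "replace")

# The default ('strict') handler refuses instead of writing bad data.
try:
    text.encode("ascii")
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True
```

Nothing in the lossy bytes records that a substitution happened, which is the silent failure described above.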

And what if you send your code to somebody with a UTF-16 terminal?
You try to output ASCII to that, and you're just going to get complete
garbage.

If you want your output to behave that way, then all you have to do is
specify that with an explicit encode step.

 If we want we change default for whatever we want, but without this default
 change Python should not change his behavior depending on output.
 yeah I prefer strange output for a different platform, to a decode errors.

Sorry, I disagree.  If your program is going to fail, it's better that
it fail noisily (with an error) than silently (with no notice that
anything is wrong).


Re: the stupid encoding problem to stdout

2011-06-10 Thread Chris Angelico
2011/6/11 Sérgio Monteiro Basto sergi...@sapo.pt:
 ok after thinking about this, this problem exist because Python want be
 smart with ttys

The *anomaly* (not problem) exists because Python has a way of being
told a target encoding. If two parties agree on an encoding, they can
send characters to each other. I had this discussion at work a while
ago; my boss was talking about being binary-safe (which really meant
8-bit safe), while I was saying that we should support, verify, and
demand properly-formed UTF-8. The main significance is that agreeing
on an encoding means we can change the encoding any time it's
convenient, without having to document that we've changed the data -
because we haven't. I can take the number twelve thousand three
hundred and forty-five and render that as a string of decimal digits
as "12345", or as hexadecimal digits as "3039", but I haven't changed
the number. If you know that I'm giving you a string of decimal
digits, and I give you "12345", you will get the same number at the
far side.
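
The number analogy can be checked directly. This is only an illustration with made-up variable names:

```python
number = 12345

# Two different renderings of the same number.
as_decimal = str(number)
as_hex = format(number, "x")

# Reading each rendering with the agreed base recovers the number.
from_decimal = int(as_decimal, 10)
from_hex = int(as_hex, 16)

# Reading the hex digits as if they were decimal gives a different number.
misread = int(as_hex, 10)
```

The last line is the case where the two sides have not agreed on the representation.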

Python has agreed with stdout that it will send it characters encoded
in UTF-8. Having made that agreement, Python and stdout can happily
communicate in characters, not bytes. You don't need to explicitly
encode your characters into bytes - and in fact, this would be a very
bad thing to do, because you don't know _what_ encoding stdout is
using. If it's expecting UTF-16, you'll get a whole lot of rubbish if
you send it UTF-8 - but it'll look fine if you send it Unicode.
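
A rough sketch of what goes wrong when the sender hard-codes one encoding and the receiver expects another. Latin-1 is used for the mismatched reader purely as an assumption for the demo, because it accepts any byte sequence and so shows the garbage rather than raising.

```python
text = u"mo\xe7ambique"  # the word with c-cedilla

# The same ten characters become different byte sequences.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# A reader that assumes the wrong encoding sees different characters.
misread = utf8_bytes.decode("latin-1")

# A reader that uses the agreed encoding gets the text back.
roundtrip = utf8_bytes.decode("utf-8")
```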

Chris Angelico


Re: the stupid encoding problem to stdout

2011-06-09 Thread Sérgio Monteiro Basto
Benjamin Kaplan wrote:

 2011/6/8 Sérgio Monteiro Basto sergi...@sapo.pt:
 hi,
 cat test.py
 #!/usr/bin/env python
 #-*- coding: utf-8 -*-
 u = u'moçambique'
 print u.encode("utf-8")
 print u

 chmod +x test.py
 ./test.py
 moçambique
 moçambique

 ./test.py > output.txt
 Traceback (most recent call last):
   File "./test.py", line 5, in <module>
     print u
 UnicodeEncodeError: 'ascii' codec can't encode character
 u'\xe7' in position 2: ordinal not in range(128)

 This is in Python 2.7.
 How do I tell Python to send the same thing to stdout and
 to the file output.txt?

 It doesn't seem logical that the behaviour changes when sending
 things to a file.

 Thanks,
 Sérgio M. B.
 
 That's not a terminal vs file thing. It's a file that declares its
 encoding vs a file that doesn't declare its encoding thing. Your
 terminal declares that it is UTF-8. So when you print a Unicode string
 to your terminal, Python knows that it's supposed to turn it into
 UTF-8. When you pipe the output to a file, that file doesn't declare
 an encoding. So rather than guess which encoding you want, Python
 defaults to the lowest common denominator: ASCII. If you want
 something to be a particular encoding, you have to encode it yourself.

Exactly the opposite: if Python doesn't know the encoding, it should not try
to decode to ASCII.

 
 You have a couple of choices on how to make it work:
 1) Play dumb and always encode as UTF-8. This would look really weird
 if someone tried running your program in a terminal with a CP437
 encoding (like cmd.exe on at least the US version of Windows), but it
 would never crash.

I want Python not to care about the terminal's encoding and to send
characters as they are, to a terminal or to a file.

 2) Check sys.stdout.encoding. If it's ascii, then encode your unicode
 string in the unicode-escape encoding, which substitutes the escape
 sequence in for all non-ASCII characters.

How do I change sys.stdout.encoding to always be UTF-8? At least I would
have a consistent sys.stdout.encoding.

 3) Check to see if sys.stdout.isatty() and have different behavior for
 terminals vs files. If you're on a terminal that doesn't declare its
 encoding, encoding it as UTF-8 probably won't help. If you're writing
 to a file, that might be what you want to do.


Thanks,




Re: the stupid encoding problem to stdout

2011-06-09 Thread Sérgio Monteiro Basto
Ben Finney wrote:

 Sérgio Monteiro Basto sergi...@sapo.pt writes:
 
 ./test.py
 moçambique
 moçambique
 
 In this case your terminal is reporting its encoding to Python, and it's
 capable of taking the UTF-8 data that you send to it in both cases.
 
 ./test.py > output.txt
 Traceback (most recent call last):
   File "./test.py", line 5, in <module>
     print u
 UnicodeEncodeError: 'ascii' codec can't encode character
 u'\xe7' in position 2: ordinal not in range(128)
 
 In this case your shell has no preference for the encoding (since you're
 redirecting output to a file).
 

How do I tell Python that I want it to write UTF-8 to files?


 In the first print statement you specify the encoding UTF-8, which is
 capable of encoding the characters.
 
 In the second print statement you haven't specified any encoding, so the
 default ASCII encoding is used.
 
 
 Moral of the tale: Make sure an encoding is specified whenever data
 steps between bytes and characters.
 
 It doesn't seem logical that the behaviour changes when sending things to a file.
 
 They're different files, which have been opened with different
 encodings. If you want a different encoding, you need to specify that.
 



Re: the stupid encoding problem to stdout

2011-06-09 Thread Nobody
On Thu, 09 Jun 2011 22:14:17 +0100, Sérgio Monteiro Basto wrote:

 Exactly the opposite , if python don't know the encoding should not try 
 decode to ASCII.

What should it decode to, then?

You can't write characters to a stream, only bytes.

 I want python don't care about encoding terminal and send characters as they 
 are or for a file . 

You can't write characters to a stream, only bytes.
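
A tiny demonstration of that statement, using an in-memory byte stream (io.BytesIO, chosen here only for illustration) in place of a real file or pipe:

```python
import io

stream = io.BytesIO()

# A byte stream refuses text outright; something has to pick an encoding.
try:
    stream.write(u"mo\xe7ambique")
    rejected = False
except TypeError:
    rejected = True

# Once the text is encoded, the resulting bytes can be written.
stream.write(u"mo\xe7ambique".encode("utf-8"))
stored = stream.getvalue()
```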



Re: the stupid encoding problem to stdout

2011-06-09 Thread Ben Finney
Sérgio Monteiro Basto sergi...@sapo.pt writes:

 Ben Finney wrote:

  In this case your shell has no preference for the encoding (since
  you're redirecting output to a file).

 How I say to python that I want that write in utf-8 to files ? 

You already did:

  In the first print statement you specify the encoding UTF-8, which
  is capable of encoding the characters.

If you want UTF-8 on the byte stream for a file, specify it when opening
the file, or when reading or writing the file.
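
One way to do that, sketched with io.open (available in both Python 2.6+ and Python 3). The temporary path and variable names are assumptions made for the example.

```python
import io
import os
import tempfile

# Hypothetical output location, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "output.txt")

# The encoding is stated when the file is opened, so writes take text.
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(u"mo\xe7ambique")

# Reading the file back in binary mode shows the bytes that were stored.
with io.open(path, "rb") as handle:
    raw = handle.read()
```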

-- 
 \   “But Marge, what if we chose the wrong religion? Each week we |
  `\  just make God madder and madder.” —Homer, _The Simpsons_ |
_o__)  |
Ben Finney


Re: the stupid encoding problem to stdout

2011-06-09 Thread Terry Reedy

On 6/9/2011 5:46 PM, Nobody wrote:

On Thu, 09 Jun 2011 22:14:17 +0100, Sérgio Monteiro Basto wrote:


Exactly the opposite , if python don't know the encoding should not try
decode to ASCII.


What should it decode to, then?

You can't write characters to a stream, only bytes.


I want python don't care about encoding terminal and send characters as they
are or for a file .


You can't write characters to a stream, only bytes.


Character representations are for people; byte representations are for
computers.


--
Terry Jan Reedy




Re: the stupid encoding problem to stdout

2011-06-09 Thread Mark Tolonen


Sérgio Monteiro Basto sergi...@sapo.pt wrote in message 
news:4df137a7$0$30580$a729d...@news.telepac.pt...



How I change sys.stdout.encoding always to UTF-8 ? at least have a
consistent sys.stdout.encoding


There is an environment variable that can force Python I/O to be a specific
encoding:


   PYTHONIOENCODING=utf-8
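
A sketch of the effect, assuming a Python 3 interpreter runs the example: a child interpreter is started with its stdout connected to a pipe (not a terminal) and with the variable set, and the bytes it emits are captured. The child_code string is made up for the demo.

```python
import os
import subprocess
import sys

# Copy the current environment and force the I/O encoding for the child.
env = dict(os.environ, PYTHONIOENCODING="utf-8")

# The child writes one non-ASCII word to its stdout, which is a pipe here.
child_code = "import sys; sys.stdout.write(u'mo\\xe7ambique')"

output = subprocess.check_output([sys.executable, "-c", child_code], env=env)
```

From a shell, the equivalent would be to run the script as PYTHONIOENCODING=utf-8 ./test.py > output.txt.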

-Mark




Re: the stupid encoding problem to stdout

2011-06-09 Thread Sérgio Monteiro Basto
Nobody wrote:

 Exactly the opposite , if python don't know the encoding should not try
 decode to ASCII.
 
 What should it decode to, then?

UTF-8, as in the tty. How do I change this default?

 You can't write characters to a stream, only bytes.
 
OK, got the point.
Thanks, 


Re: the stupid encoding problem to stdout

2011-06-09 Thread Sérgio Monteiro Basto
Mark Tolonen wrote:

 
 Sérgio Monteiro Basto sergi...@sapo.pt wrote in message
 news:4df137a7$0$30580$a729d...@news.telepac.pt...
 
 How I change sys.stdout.encoding always to UTF-8 ? at least have a
 consistent sys.stdout.encoding
 
 There is an environment variable that can force Python I/O to be a specific
 encoding:
 
 PYTHONIOENCODING=utf-8

Excellent, thanks, double thanks.

BTW: this should be set by default on UTF-8 systems like Fedora, Ubuntu,
Debian, Red Hat, and all Linuxes. For sure I will put this in the startup of
my systems.
 
 -Mark
--
Sérgio M. B.


Re: the stupid encoding problem to stdout

2011-06-09 Thread Ben Finney
Sérgio Monteiro Basto sergi...@sapo.pt writes:

 Nobody wrote:

  Exactly the opposite , if python don't know the encoding should not
  try decode to ASCII.

Are you advocating that Python should refuse to write characters unless
the encoding is specified? I could sympathise with that, but currently
that's not what Python does; instead it defaults to the ASCII codec.

  What should it decode to, then?

 UTF-8, as in tty

But when you explicitly redirect to a file, it's *not* going to a TTY.
It's going to a file whose encoding isn't known unless you specify it.

-- 
 \ “Reality must take precedence over public relations, for nature |
  `\cannot be fooled.” —Richard P. Feynman |
_o__)  |
Ben Finney


Re: the stupid encoding problem to stdout

2011-06-09 Thread Sérgio Monteiro Basto
Ben Finney wrote:

  Exactly the opposite , if python don't know the encoding should not
  try decode to ASCII.
 
 Are you advocating that Python should refuse to write characters unless
 the encoding is specified? I could sympathise with that, but currently
 that's not what Python does; instead it defaults to the ASCII codec.

That could be a solution ;) or a smarter default based on LANG, for example
(as many GNU tools do).

--
Sérgio M. B.


Re: the stupid encoding problem to stdout

2011-06-08 Thread Ben Finney
Sérgio Monteiro Basto sergi...@sapo.pt writes:

 ./test.py
 moçambique
 moçambique

In this case your terminal is reporting its encoding to Python, and it's
capable of taking the UTF-8 data that you send to it in both cases.

 ./test.py > output.txt
 Traceback (most recent call last):
   File "./test.py", line 5, in <module>
     print u
 UnicodeEncodeError: 'ascii' codec can't encode character
 u'\xe7' in position 2: ordinal not in range(128)

In this case your shell has no preference for the encoding (since you're
redirecting output to a file).

In the first print statement you specify the encoding UTF-8, which is
capable of encoding the characters.

In the second print statement you haven't specified any encoding, so the
default ASCII encoding is used.


Moral of the tale: Make sure an encoding is specified whenever data
steps between bytes and characters.
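
As a sketch of that rule (names are illustrative): decode with a named encoding on the way in, work on characters in between, and encode with a named encoding on the way out.

```python
# Bytes as they might arrive from a file or a socket.
incoming = b"mo\xc3\xa7ambique"

# Bytes to characters: the encoding is named explicitly.
text = incoming.decode("utf-8")

# Work happens on characters, so the accented letter is handled as one unit.
shouted = text.upper()

# Characters to bytes: the encoding is named explicitly again.
outgoing = shouted.encode("utf-8")
```

The byte string is eleven bytes long while the text is ten characters, which is a quick way to tell the two apart when debugging.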

 It doesn't seem logical that the behaviour changes when sending things to a file.

They're different files, which have been opened with different
encodings. If you want a different encoding, you need to specify that.

-- 
 \   “There's no excuse to be bored. Sad, yes. Angry, yes. |
  `\Depressed, yes. Crazy, yes. But there's no excuse for boredom, |
_o__)  ever.” —Viggo Mortensen |
Ben Finney


Re: the stupid encoding problem to stdout

2011-06-08 Thread Benjamin Kaplan
2011/6/8 Sérgio Monteiro Basto sergi...@sapo.pt:
 hi,
 cat test.py
 #!/usr/bin/env python
 #-*- coding: utf-8 -*-
 u = u'moçambique'
 print u.encode("utf-8")
 print u

 chmod +x test.py
 ./test.py
 moçambique
 moçambique

 ./test.py > output.txt
 Traceback (most recent call last):
   File "./test.py", line 5, in <module>
     print u
 UnicodeEncodeError: 'ascii' codec can't encode character
 u'\xe7' in position 2: ordinal not in range(128)

 This is in Python 2.7.
 How do I tell Python to send the same thing to stdout and
 to the file output.txt?

 It doesn't seem logical that the behaviour changes when sending
 things to a file.

 Thanks,
 Sérgio M. B.

That's not a terminal vs file thing. It's a file that declares its
encoding vs a file that doesn't declare its encoding thing. Your
terminal declares that it is UTF-8. So when you print a Unicode string
to your terminal, Python knows that it's supposed to turn it into
UTF-8. When you pipe the output to a file, that file doesn't declare
an encoding. So rather than guess which encoding you want, Python
defaults to the lowest common denominator: ASCII. If you want
something to be a particular encoding, you have to encode it yourself.

You have a couple of choices on how to make it work:
1) Play dumb and always encode as UTF-8. This would look really weird
if someone tried running your program in a terminal with a CP437
encoding (like cmd.exe on at least the US version of Windows), but it
would never crash.
2) Check sys.stdout.encoding. If it's ascii, then encode your unicode
string in the unicode-escape encoding, which substitutes the escape
sequence in for all non-ASCII characters.
3) Check to see if sys.stdout.isatty() and have different behavior for
terminals vs files. If you're on a terminal that doesn't declare its
encoding, encoding it as UTF-8 probably won't help. If you're writing
to a file, that might be what you want to do.
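
The three choices could be combined roughly as follows. This is a sketch, not code from the thread: encode_for and is_terminal are hypothetical helper names, the UTF-8 fallback is an assumption, and the 'backslashreplace' error handler stands in for the escape encoding mentioned in option 2.

```python
import codecs
import io

def is_terminal(stream):
    """Option 3: report whether the stream claims to be a terminal."""
    isatty = getattr(stream, "isatty", None)
    return bool(isatty and isatty())

def encode_for(stream, text, fallback="utf-8"):
    """Encode text for stream, escaping what an ASCII target cannot hold."""
    # Option 1: fall back to UTF-8 when the stream declares no encoding.
    encoding = getattr(stream, "encoding", None) or fallback
    # Option 2: an ASCII target gets escape sequences instead of an error.
    if codecs.lookup(encoding).name == "ascii":
        return text.encode("ascii", "backslashreplace")
    return text.encode(encoding)

# Stand-ins for real streams: one declares ASCII, one declares nothing.
ascii_stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
undeclared_stream = io.BytesIO()

escaped = encode_for(ascii_stream, u"mo\xe7ambique")
encoded = encode_for(undeclared_stream, u"mo\xe7ambique")
```

Whether escaping or failing is the better behaviour for an ASCII target depends on the program; the thread above argues both sides.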