[ python-Bugs-1668295 ] Strange unicode behaviour

SourceForge.net Sun, 25 Feb 2007 14:27:47 -0800

Bugs item #1668295, was opened at 2007-02-25 12:10
Message generated for change (Comment added) made by sgala
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1668295&group_id=5470


Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
>Status: Open
Resolution: Invalid
Priority: 5
Private: No
Submitted By: Santiago Gala (sgala)
Assigned to: Nobody/Anonymous (nobody)
Summary: Strange unicode behaviour

Initial Comment:

I know that python is very funny WRT unicode processing, but this defies all my 
knowledge.

I use the es_ES.UTF-8 encoding on linux. The script:


python -c "print unicode('á %s' % 'éí','utf8') " works, i.e., prints á éí in 
the next line.

However, if I redirect it to less or to a file, like

python -c "print unicode('á %s' % 'éí','utf8') " >test
Traceback (most recent call last):
  File "<string>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: 
ordinal not in range(128)


Why is the behaviour different when stdout is redirected? How can I get it to 
do "the right thing" in both cases?

----------------------------------------------------------------------

>Comment By: Santiago Gala (sgala)
Date: 2007-02-25 23:27

Message:
Logged In: YES 
user_id=178886
Originator: YES

re: consistent, my experience is that python unicode handling is
consistently stupid, almost always doing the wrong thing. It reminds me
of the defaults of WordPerfect, which were always exactly the opposite of
what the user wanted 99% of the time. I hope python 3000 comes fast and
stops that real pain.

I love the language, but the way it handles unicode provokes hundreds of
bugs.

>Python could correctly find out your terminal
>encoding, the Unicode string is automatically encoded in that encoding.
>
>If you output to a file, Python does not know which encoding you want to
>have, so all Unicode strings are converted to ascii only.

>>> sys.getfilesystemencoding()
'UTF-8'

so python is really dumb if print does not know my filesystemencoding, but
knows my terminal encoding.

I thought breaking the least surprising behaviour was not considered
pythonic, and now you tell me that having a program running on console but
issuing an exception when redirected is intended. I would prefer an
exception in both cases. Or, even better, using
sys.getfilesystemencoding(), or allowing me to set defaultencoding()

>Please direct further questions to the Python mailing list or newsgroup.

I would if I didn't consider this behaviour a bug, and a serious one. 

>The basic rule when handling Unicode is: use Unicode everywhere inside the
>program, and byte strings for input and output.
>So, your code is exactly the other way round: it takes a byte string,
>decodes it to unicode and *then* prints it.
>
>You should do it the other way: use Unicode literals in your code, and
>when you write something to a file, *encode* them in utf-8.

Do you mean that I need to say print unicode(whatever).encode('utf8'),
like:

>>> # instead of 'á', easy to read and understand, even in files encoded
>>> # as utf8. Assume this is a literal or input
>>> a = unicode('\xc3\xa1','utf8')
...
>>> # because a could be a number, or a different object
>>> print unicode(a).encode('utf8')

every time, instead of "a='á'; print a"

Cool, I'm starting to really love it. Concise and pythonic

Do you seriously mean that there is no way to tell print to use a
default encoding, and that it will magically try to find one and fail for
everything that is not a terminal?


Are you seriously telling me that this is not a bug? Even worse, that it
is "intended behaviour". BTW, jython acts differently about this, in all
the versions I tried.

And with -S I am allowed to change the encoding, which is crippled in site
for no known good reason. 

python -S -c "import sys; sys.setdefaultencoding('utf8'); print unicode('\xc3\xa1','utf8')" >test
(works, test contains an accented a as intended)


>use Unicode everywhere inside the
>program, and byte strings for input and output.

Have you ever considered that, to use unicode everywhere inside the
program, one needs to decode literals (or input) to unicode (the very thing
your next sentence complains about)?

>So, your code is exactly the other way round: it takes a byte string,
>decodes it to unicode and *then* prints it.

I have followed this principle in my programming for about 6 years, so I'm
not a novice. I'm playing by the rules:
a) "decodes it to unicode" is the first step to get it into processing.
This is just a test case, so processing is zero.
b) I refuse to believe that the only way to ensure something is printed
right is to wrap every item in unicode(var).encode('utf8') [The
redundant unicode call is because the var could be a number, or a different
object]
c) or making my code non portable by patching site.py to get a real
encoding instead of ascii.
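
One possibility that is not raised in this thread, and that avoids both
per-item wrapping and patching site.py, is to wrap the output stream once
in a codec writer. The sketch below uses an in-memory byte stream as a
stand-in for a redirected stdout, and is written so that it runs on a
current Python 3 interpreter. With the Python 2 interpreters discussed
here, the usual form of the idiom was, as far as I know,
sys.stdout = codecs.getwriter('utf8')(sys.stdout). The names raw, out and
result are made up for the illustration.

```python
import codecs
import io

# Stand-in for a redirected stdout: a byte stream with no known encoding.
raw = io.BytesIO()

# codecs.getwriter returns the StreamWriter class for the codec. Wrapping
# the byte stream once means every later write is encoded as UTF-8.
out = codecs.getwriter("utf-8")(raw)

out.write(u"\xe1 %s\n" % u"\xe9\xed")  # the test case from this report
out.write(u"%s\n" % 42)                # non-string values are formatted first

result = raw.getvalue()
```

After this, result holds the UTF-8 bytes for both lines, and no call site
needed its own .encode('utf8').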

----------------------------------------------------------------------

Comment By: Georg Brandl (gbrandl)
Date: 2007-02-25 20:43

Message:
Logged In: YES 
user_id=849994
Originator: NO

First of all: Python's Unicode handling is very consistent and
straightforward, if you know the basics. Sadly, most people don't know the
difference between Unicode and encoded strings.

What you're seeing is not a bug, it is due to the fact that if you print
Unicode to the console, and Python could correctly find out your terminal
encoding, the Unicode string is automatically encoded in that encoding.

If you output to a file, Python does not know which encoding you want to
have, so all Unicode strings are converted to ascii only.
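
The two cases above can be imitated with a small sketch. It is only an
emulation of the implicit encode step, not the interpreter's actual code,
and encode_like_print with its fallback argument is a made-up helper. It
assumes that the stream reports an encoding on a terminal and reports None
when redirected, in which case the ASCII default is used.

```python
# Sketch only: emulates the implicit encode step described above.
# encode_like_print and its fallback argument are hypothetical names.

def encode_like_print(text, stream_encoding, fallback="ascii"):
    """Encode text with the stream encoding if known, else the fallback."""
    return text.encode(stream_encoding or fallback)

text = u"\xe1 \xe9\xed"  # the string from the report

# Terminal case: the stream reports an encoding, so encoding succeeds.
on_terminal = encode_like_print(text, "utf-8")

# Redirected case: no encoding is known (None), so ASCII is tried and fails.
try:
    encode_like_print(text, None)
    redirected_error = None
except UnicodeEncodeError as exc:
    redirected_error = exc
```

The error captured in redirected_error is the same kind of
UnicodeEncodeError as in the traceback of the initial report: the 'ascii'
codec fails on the first character.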

Please direct further questions to the Python mailing list or newsgroup.

The basic rule when handling Unicode is: use Unicode everywhere inside the
program, and byte strings for input and output.
So, your code is exactly the other way round: it takes a byte string,
decodes it to unicode and *then* prints it.

You should do it the other way: use Unicode literals in your code, and
when you write something to a file, *encode* them in utf-8.
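
A minimal sketch of that rule, assuming UTF-8 is the encoding wanted in
the file: the text stays Unicode inside the program and is encoded only
where it is written out. The temporary directory and the file name test
are placeholders chosen for the illustration.

```python
import os
import tempfile

text = u"\xe1 \xe9\xed"  # Unicode inside the program

# Hypothetical output path in a temporary directory, for illustration only.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "test")

# Encode explicitly at the output boundary; the file itself holds bytes.
with open(out_path, "wb") as f:
    f.write((text + u"\n").encode("utf-8"))

# Read the bytes back to check what actually landed in the file.
with open(out_path, "rb") as f:
    written = f.read()
```

Because the encoding is named at the point of writing, the result does not
depend on whether the output goes to a terminal or to a file.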

----------------------------------------------------------------------

Comment By: Santiago Gala (sgala)
Date: 2007-02-25 12:17

Message:
Logged In: YES 
user_id=178886
Originator: YES

Forgot to say that it happens consistently with 2.4.3, 2.5-svn and svn
trunk

Also, some people ask for the repr of strings (I guess to reproduce if they
can't read the characters). Those are printed in utf-8:

$ python -c "print repr('á %s')"
'\xc3\xa1 %s'
$ python -c "print repr('éi')"
'\xc3\xa9i'

----------------------------------------------------------------------
