Diez B. Roggisch wrote:
> Frank Stajano wrote:
>
>> A simple unicode question. How do I print?
>>
>> Sample code:
>>
>> # -*- coding: utf-8 -*-
>> s1 = u"héllô wórld"
>> print s1
>> # Gives UnicodeEncodeError: 'ascii' codec can't encode character
>> # u'\xe9' in position 1: ordinal not in range(128)
>>
>>
>> What I actually want to do is slightly more elaborate: read from a text
>> file which is in utf-8, do some manipulations of the text and print the
>> result on stdout. I understand I must open the file with
>>
>> f = codecs.open("input.txt", "r", "utf-8")
>>
>> but then I get stuck as above.
>>
>> I tried
>>
>> s2 = s1.encode("utf-8")
>> print s2
>>
>> but got
>>
>> hÃ©llÃ´ wÃ³rld
>
> Which is perfectly alright - it's just that your terminal isn't prepared to
> decode UTF-8, but some other encoding, like latin1.
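
The garbling described in the quoted reply can be reproduced without a terminal. What follows is a minimal sketch, assuming the terminal interprets incoming bytes as Latin-1 (the thread only names latin1 as a likely example). The helper name show_as_latin1 is made up for illustration.

```python
# -*- coding: utf-8 -*-
# Sketch only: reproduces the garbling by decoding UTF-8 bytes as Latin-1.
# show_as_latin1 is a hypothetical helper, not something from the thread.

def show_as_latin1(text):
    """Return what a Latin-1 terminal would display for UTF-8 output."""
    utf8_bytes = text.encode("utf-8")      # the bytes the program sends to stdout
    return utf8_bytes.decode("latin-1")    # how a Latin-1 terminal reads them

s1 = u"héllô wórld"
garbled = show_as_latin1(s1)

# Each accented letter is two bytes in UTF-8, so it shows up as two characters.
extra_chars = len(garbled) - len(s1)
```

Here garbled holds the same "hÃ©llÃ´ wÃ³rld" text that was printed, which suggests the bytes themselves were fine and only their interpretation was off.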

Aha! Thanks for spotting this. You are right about the terminal
(rxvt/cygwin) not being ready to handle utf-8, as I can now confirm with
a cat t2.py (t2.py being the program above), which displays the source
code garbled in the same way.

If I do

s1 = u"héllô wórld"
print s1

at the interactive prompt of Idle, I get the proper output

héllô wórld

So why is it that in the first case I got UnicodeEncodeError: 'ascii'
codec can't encode? It seems as if, within Idle, a utf-8 codec is being
selected automagically... why should that be so there and not in the
first case?

>> Then, in the hope of being able to write the string to a file if not to
>> stdout, I also tried
>>
>>
>> import codecs
>> f = codecs.open("out.txt", "w", "utf-8")
>> f.write(s2)
>>
>> but got
>>
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
>> ordinal not in range(128)
>
> Instead of writing s2 (which is a byte-string!!!), write s1. It will work.

OK, many thanks, I got this to work!

> The error you get stems from f.write wanting a unicode-object, but s2 is a
> bytestring (you explicitly converted it before), so python tries to decode
> the bytestring with the default encoding - ascii - to a unicode string.
> This of course fails.

I think I have a better understanding of it now. If the terminal hadn't
fooled me, I probably wouldn't have assumed that the code I originally
wrote (following the first examples I found) was wrong!

I assume that when you say "bytestring" you mean "a string of bytes in a
certain encoding (here utf-8) that can be used as an external
representation for the unicode string, which is instead a sequence of
code points".

Thanks again
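
The distinction between the unicode string and the bytestring, and the advice to write s1 rather than s2, can be sketched as below. This is an illustration under assumptions, not the code from the thread: it uses io.open instead of codecs.open so that it also runs on Python 3, it writes out.txt into a temporary directory, and it forces the ascii codec explicitly to reproduce the first error. Why the interpreter chose ascii in the original run is not settled in this thread. One plausible but unconfirmed explanation is that it could not determine the terminal's encoding.

```python
# -*- coding: utf-8 -*-
# Sketch only, written to run on Python 3 as well as Python 2.
# out.txt is the name used in the thread; the temp directory is an assumption.
import io
import os
import tempfile

s1 = u"héllô wórld"          # unicode string: a sequence of code points
s2 = s1.encode("utf-8")      # bytestring: the UTF-8 representation of s1

# 11 code points but 14 bytes, since each accented letter takes two bytes.
lengths = (len(s1), len(s2))

# Forcing the ascii codec reproduces the first error reported in the thread.
try:
    s1.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# Write the unicode string (not the bytestring) and let the file object encode.
path = os.path.join(tempfile.mkdtemp(), "out.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(s1)

# Reading the raw bytes back yields exactly the bytestring s2.
with io.open(path, "rb") as f:
    raw = f.read()
```

The design point is to keep text as unicode inside the program and to encode only at the boundary, here by handing the encoding to the file object.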