[Tutor] unicode issue?

2006-03-21 Thread Matt Dempsey
I'm having a new problem with my House vote script. It's returning the following error:

Traceback (most recent call last):
  File "C:/Python24/evenmorevotes", line 20, in -toplevel-
    f.write(nm+'*'+pt+'*'+vt+'*'+md['vote-result'][0]+'*'+md['vote-desc'][0]+'*'+'\n')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 172: ordinal not in range(128)

Here's the code: http://pastebin.com/615240

What am I doing wrong?


___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] unicode issue?

2006-03-21 Thread Danny Yoo


On Tue, 21 Mar 2006, Matt Dempsey wrote:

> I'm having a new problem with my House vote script. It's returning the
> following error:
>
> Traceback (most recent call last):
>   File "C:/Python24/evenmorevotes", line 20, in -toplevel-
>     f.write(nm+'*'+pt+'*'+vt+'*'+md['vote-result'][0]+'*'+md['vote-desc'][0]+'*'+'\n')
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in
> position 172: ordinal not in range(128)


Hi Matt,

Just wondering: how familiar are you with Unicode?  What's going on is
that one of the strings in the string concatenation above contains a
Unicode string.


It's like an infection: anything that touches Unicode turns Unicode.
*grin*

##
>>> 'hello' + u'world'
u'helloworld'
##


This has repercussions: when we're writing these strings back to files,
because we have a Unicode string, we must now be more explicit about how
Unicode is written, since files are really full of bytes, not unicode
characters.  That is, we need to specify an encoding.
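To see where the error in the traceback comes from, here is a minimal sketch. The sample text is made up for illustration, and the snippet assumes a Python new enough to accept the "except ... as" syntax (2.6 or later). When no encoding is given, the write falls back to ASCII, and ASCII has no byte for the curly quote u'\u201c':

```python
# Hypothetical sample text; any character outside the ASCII range will do.
text = u'He said \u201chello\u201d'

try:
    # Roughly what happens behind the scenes when a Unicode string is
    # written to a plain file without naming an encoding.
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError as exc:
    ascii_ok = False
    # Index of the first character that ASCII cannot represent.
    bad_position = exc.start

print(ascii_ok)
```

The position stored in bad_position plays the same role as the "position 172" in the traceback: it points at the first offending character.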


'utf-8' is a popular encoding that turns Unicode reliably into a bunch of
bytes:

##
>>> u'\u201c'.encode('utf8')
'\xe2\x80\x9c'
##

and this can be written to a file.  Recovering Unicode from bytes can be
done by going the other way, by decoding:

##
>>> '\xe2\x80\x9c'.decode('utf8')
u'\u201c'
##



The codecs.open() function in the Standard Library is useful for handling
this encode/decode thing so that all we need to do is concentrate on
Unicode:

http://www.python.org/doc/lib/module-codecs.html#l2h-991

For example:

##
>>> import codecs
>>>
>>> f = codecs.open('foo.txt', 'wb', 'utf8')
>>> f.write(u'\u201c')
>>> f.close()
>>>
>>> open('foo.txt', 'rb').read()
'\xe2\x80\x9c'
>>>
>>> codecs.open('foo.txt', 'rb', 'utf-8').read()
u'\u201c'
##


We can see that if we read and write to a codec-opened file, it'll
transparently do the encoding/decoding step for us as we write() and
read() the file.


You may also find Joel Spolsky's post "The Absolute Minimum Every
Software Developer Absolutely, Positively Must Know About Unicode And
Character Sets (No Excuses!)" useful in clarifying the basic concepts of
Unicode:

http://www.joelonsoftware.com/articles/Unicode.html


I hope this helps!



Re: [Tutor] unicode issue? (fwd)

2006-03-21 Thread Danny Yoo

> A friend of mine showed me where the unicode is showing up but we still
> can't get the script to work right. We tried encoding the appropriate
> variable but it is still spitting back the error. Would I just have
> u'\u201c'.encode('utf8') in my script, or should it be
> md['vote-desc'][0].encode('utf8')?


[Keeping tutor in CC.  Please do not send replies only to me; you need to
give other people on Tutor the opportunity to answer as well.]


Hi Matt,


I should clarify: when I said:

> What's going on is that one of the strings in the string concatenation
> above contains a Unicode string.

I was a bit imprecise.  What I should really have said was:

What's going on is that at least one of the strings in the string
concatenation (there could be more!) is a Unicode string.

I'm positive you have more than one Unicode string in there.  *grin*


Rather than hunt-and-peck for all the places where Unicode is coming from,
it's probably a better approach to just wholesale encode() the whole
string after you do the concatenation and before passing it off to
write().
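As a rough sketch of that first approach: the field values and the file name below are invented stand-ins (I don't know what your real data looks like), and I'm assuming the output file is opened in binary mode. The point is only that encode() is applied once, to the finished line:

```python
import os
import tempfile

# Hypothetical stand-ins for the script's variables.
nm, pt, vt = u'Smith', u'R', u'Yea'
md = {'vote-result': [u'Passed'],
      'vote-desc': [u'\u201cExample bill\u201d']}

# Build the whole line first; it is one Unicode string at this point.
line = (nm + u'*' + pt + u'*' + vt + u'*' +
        md['vote-result'][0] + u'*' + md['vote-desc'][0] + u'*' + u'\n')

# Hypothetical output path in a scratch directory.
path = os.path.join(tempfile.mkdtemp(), 'votes.txt')

f = open(path, 'wb')
try:
    # Encode once, right before the bytes go to disk.
    f.write(line.encode('utf-8'))
finally:
    f.close()
```

This way it doesn't matter which of the pieces were Unicode to begin with.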

Alternatively, take a close look at the codecs example I showed before near
the bottom of the last message: it handles encoding and decoding for you.
Using the codecs module is probably the better approach here, since
otherwise you have to look at every file write(), and that can be a bit
tiring.
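
Here is a sketch of how the codecs approach might look for one record. The field values and the file name are hypothetical; only codecs.open() itself comes from the Standard Library:

```python
import codecs
import os
import tempfile

# Hypothetical field values standing in for nm, pt, vt and the md[...] entries.
fields = [u'Smith', u'R', u'Yea', u'Passed', u'\u201cExample bill\u201d']
line = u'*'.join(fields) + u'*\n'

# Hypothetical output path in a scratch directory.
path = os.path.join(tempfile.mkdtemp(), 'votes.txt')

# The file object encodes to UTF-8 on every write(), so the rest of the
# script can pass Unicode strings around without calling encode().
f = codecs.open(path, 'w', 'utf-8')
try:
    f.write(line)
finally:
    f.close()

# Reading through codecs.open() decodes the bytes back into Unicode.
f = codecs.open(path, 'r', 'utf-8')
try:
    read_back = f.read()
finally:
    f.close()
```

Only the open() call changes; every write() after it stays the same.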
