hi all, i have a newbie problem arising from writing-then-reading a 
unicode file, and i can't work out what syntax i need to read it in.

the syntax i'm using now (just using quick hack tmp files):
BEGIN
import codecs

f=codecs.open("tt.xml","r","utf8")
# EncodedFile(file, data_encoding, file_encoding): data as ascii, file as utf8
fwrap=codecs.EncodedFile(f,"ascii","utf8")
try:
    ss=u''
    ss=fwrap.read()
    print ss
    ## rrr=xml.dom.minidom.parseString(f.read()) # originally
finally:
    f.close()
END

barfs with this error:
BEGIN
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 5092: ordinal not in range(128)
END
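
for what it's worth, the same kind of error can be reproduced without any file at all, which suggests it comes from an encode-to-ascii step rather than from the reading as such. a minimal sketch, assuming a made-up sample string (not the real blogger data) that contains the same u'\u2019' character:

```python
# made-up sample containing the character named in the error message
text = u'it\u2019s'

# encoding to ascii fails, because u'\u2019' has no ascii equivalent
try:
    text.encode("ascii")
    failed = False
    message = ""
except UnicodeEncodeError as exc:
    failed = True
    message = str(exc)

# encoding the same string to utf-8 goes through without complaint
encoded = text.encode("utf-8")
```

here `failed` ends up true and `message` holds the codec's complaint, while the utf-8 encode succeeds.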

any ideas?


--
Context (if interested):
had a look at the blogger api, downloaded the "15 most recent posts" 
into a minidom document, then decided to learn how to traverse the xml 
object in python.  getting annoyed with the time taken to reconnect each 
time i tried a new bit of syntax, i wrote the xml object to a file.  my 
first attempt (plain open/print) barfed with a similar sort of encoding 
error. sure enough, there in the debug output coming back from blogger: 
"charset=utf-8".  my python book said i needed to switch from 
"open/print" to "codecs.open/write", so i did this:
BEGIN
import codecs
import xml.dom.minidom

# get xml doc (from blogger: atom format)
# conn and headers are set up earlier (not shown)
conn.request("GET","/atom/1234",None,headers)
response=conn.getresponse()
rrr=xml.dom.minidom.parseString(response.read())
print rrr

# dump to disk
f=codecs.open("ttt.xml","w","utf8")
try:
##    print  >> f, rrr.toxml()
    f.write(rrr.toxml())
finally:
    f.close()
END

this works fine and the resulting file looks like good xml to the naked eye.
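
one variation that might be worth trying is the plain round trip: write through a utf-8 aware file object, read back through another one with no second wrapper on top, and hand the parser utf-8 bytes. this is only a sketch under assumptions: the sample document and the temp path are made up, and io.open is swapped in for codecs.open (it takes the encoding as a keyword argument instead).

```python
import io
import os
import tempfile
import xml.dom.minidom

# made-up sample standing in for the blogger feed; note the u'\u2019'
sample = u'<feed><title>it\u2019s a test</title></feed>'
doc = xml.dom.minidom.parseString(sample.encode("utf-8"))

# hypothetical scratch location for the dump
path = os.path.join(tempfile.mkdtemp(), "ttt.xml")

# write: unicode goes in, utf-8 bytes end up on disk
f = io.open(path, "w", encoding="utf-8")
try:
    f.write(doc.toxml())
finally:
    f.close()

# read: utf-8 bytes on disk come back as unicode, no extra wrapper
f = io.open(path, "r", encoding="utf-8")
try:
    ss = f.read()
finally:
    f.close()

# give the parser utf-8 bytes rather than a unicode string
rrr = xml.dom.minidom.parseString(ss.encode("utf-8"))
title = rrr.getElementsByTagName("title")[0].firstChild.data
```

in this sketch the u'\u2019' survives the trip to disk and back, and `title` comes out of the re-parsed document as unicode.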

oh, and i have tried both "utf8" and "utf-8" as the encoding name: no 
change.
ditto with explicitly initialising "ss" as unicode: same error as when 
it was not initialised at all.
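
as far as i can tell the two spellings are just aliases for the same codec, which would explain why switching between them changes nothing. a small sketch to check that, assuming a python recent enough that codecs.lookup returns an object with a `name` attribute:

```python
import codecs

# look up both spellings and compare the canonical codec names
name_a = codecs.lookup("utf8").name
name_b = codecs.lookup("utf-8").name
same_codec = (name_a == name_b)

# both spellings also produce identical bytes for the same text
sample = u'\u2019'  # made-up one-character sample
same_bytes = (sample.encode("utf8") == sample.encode("utf-8"))
```

if both `same_codec` and `same_bytes` come out true, the choice of spelling can be ruled out as the cause.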

-- 
http://mail.python.org/mailman/listinfo/python-list
