In short what I'm trying to do is read a document using an xml parser and then upload that data back into a database. I've got the code more or less completed using xml.etree.ElementTree for the parser and dbi/ odbc for my db connection.
To fix problems with unicode I built a work-around by mapping unicode characters to equivalent ascii characters and then encoding everything to ascii. That allowed me to build the application and debug it without running into problems printing to file or stdout to screen. However, now that I've got all that working I'd like to simply take the unicode data from the xml parser and then pass it directly into the database (which is currently set up for unicode data). I've run into problems and just can't figure why this isn't working. The breakdown is occurring when I try to execute the db query: cur.execute( query ) Fairly straightforward. I get the following error: File "atp_alt.py", line 273, in dbWrite cur.execute( query ) UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 3 79: ordinal not in range(128) I've verified that query is of type unicode by checking the type a statement or two earlier (output: <type 'unicode'>). So then I thought maybe the odbc execute just can't handle unicode data. But when I do the following command: query = query.encode('utf-8') It actually works. So apparently execute can handle unicode data. The problem now is that basically the data has been encoded twice and is in the wrong format when I pull it from the database: >>> a u'+CMGL: (\xe2\u20ac\u0153REC UNREAD\xe2\u20ac\x9d,\xe2\u20ac\x9dREC READ\xe2\u20ac\x9d,\xe2\u20ac\x9dSTO UNSENT\xe2\u20ac\x9d,\xe2\u20ac \x9dSTO SENT\xe2\u20ac\x9d,\xe2\u20ac\x9dALL\xe2\u20ac\x9d) OK' >>> print a +CMGL: (“REC UNREADâ€�,â€�REC READâ€�,â€�STO UNSENTâ€�,â€�STO SENTâ €�,â€�ALLâ€�) OK The non-alpha characters should be double-quotes. It works correctly if I copy/paste into the editor: >>> d u'\u201cREC' >>> print d “REC >>> d.encode('utf-8') '\xe2\x80\x9cREC' >>> type(d.encode('utf-8')) <type 'str'> I can then decode the string to get back the proper unicode data. I can't do that with the data out of the db because it's of the wrong type for the data that it has. I think the problem is that I'm having to encode data again to force it into the database, but how can I use the odbc.execute() function without having to do that?
-- http://mail.python.org/mailman/listinfo/python-list