Hello, I am trying to process an XML file that contains Unicode characters (see http://vyakarnam.wordpress.com/). WordPress allows exporting the entire content of the website into an XML file. Using xml.dom.minidom, I wrote a few lines of Python code to parse the XML file, but am stuck with the following error:

>>> import xml.dom.minidom
>>> dom = xml.dom.minidom.parse("wordpress.2009-02-19.xml")
>>> titles = dom.getElementsByTagName("title")
>>> for title in titles:
...     print "childNode = ", title.childNodes
...
childNode = [<DOM Text node "Sanskrit N...">]
childNode = [<DOM Text node "Sanskrit N...">]
childNode = []
childNode = []
childNode = [<DOM Text node "1-1-1">]
childNode =
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 16-18: ordinal not in range(128)
>>>

The loop stopped at the following node:

<title>अन् </title>

The XML header tells me that the document is UTF-8:

<?xml version="1.0" encoding="UTF-8"?>

I am running Python 2.5.1 on Mac OS X 10.5.6, and my locale settings are as below:

$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

I googled around for similar errors and tried using unicode(), but that didn't help either:

>>> foo = unicode(titles[5].childNodes)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 16-18: ordinal not in range(128)

I'm a novice with Unicode, and am not sure how best to handle the Unicode text I'm dealing with (Devanagari). Any suggestions will be helpful.

Thanks
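
In case it helps anyone reproduce this without the full export, here is a minimal sketch of what I am after. The SAMPLE string and the helper name title_texts are hypothetical stand-ins of mine, not part of the WordPress file. It also rests on an assumption I have not confirmed: that the error comes from printing the node list (which goes through repr() and the ascii codec) rather than from the parse itself. So it reads the data attribute of each text node and encodes to UTF-8 explicitly.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import xml.dom.minidom

# Hypothetical stand-in for the WordPress export: UTF-8 bytes with a
# Devanagari title, an ASCII title, and an empty title.
SAMPLE = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<rss><title>1-1-1</title>'
    u'<title>\u0905\u0928\u094d </title>'
    u'<title></title></rss>'
).encode("utf-8")


def title_texts(xml_bytes):
    """Return the text of every <title> element as a unicode string."""
    dom = xml.dom.minidom.parseString(xml_bytes)
    texts = []
    for title in dom.getElementsByTagName("title"):
        # Read the character data directly instead of printing the node
        # list, so no implicit ascii conversion of a repr is involved.
        parts = [
            node.data
            for node in title.childNodes
            if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE)
        ]
        texts.append(u"".join(parts))
    return texts


if __name__ == "__main__":
    for text in title_texts(SAMPLE):
        # Encode explicitly; repr() of the bytes keeps the output
        # ASCII-safe whatever the terminal encoding is.
        print(repr(text.encode("utf-8")))
```

If the assumption holds, the same loop over the real file would use xml.dom.minidom.parse("wordpress.2009-02-19.xml") in place of parseString, and the encoded bytes could be written straight to a UTF-8 terminal or file.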