Vanilla (this works fine):
#!/usr/bin/python
from elementtree import ElementTree as etree
eg = """<seuss><fish>red</fish><fish>blue</fish></seuss>"""
xml = etree.fromstring(eg)
If I change the example string to this:
<seuss><fish>red</fish><fish>blué</fish></seuss>
I get the following error:
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 32
According to:
http://mail.python.org/pipermail/xml-sig/2006-May/011513.html
the XML content must itself declare which encoding it uses. For example:
#####################################################################
text = """<?xml version='1.0' encoding='utf-8'?>
<p>\xed\x95\x98\xeb\xa3\xa8\xeb\x8f\x99\xec\x95\x88
IDLE\xea\xb0\x80\xec\xa7\x80\xea\xb3\xa0 \xeb\x86\x80\xea\xb8\xb0</p>
"""
#####################################################################
Note that the encoding declaration must be at the top of the document.
Then it's OK to use fromstring() on it:
##################################################
doc = elementtree.ElementTree.fromstring(text)
doc.text
u'\ud558\ub8e8\ub3d9\uc548\nIDLE\uac00\uc9c0\uace0 \ub180\uae30'
##################################################
If I use the wrong encoding declaration, or if I'm missing the declaration
altogether, then yes, I see the same errors that you have seen.
Okay, the default encoding for my program (and thus my example string)
is US-ASCII, so I'll use 8859-1 instead, adding this line:
# coding: iso-8859-1
I get the same error. Just for laughs I'll change the encoding to
utf-8. Oops, I get the same error.
The XML encoding has to be explicitly declared as part of the XML
document text. It's the difference between:
###########################################################################
text = '<seuss><fish>red</fish><fish>blu\xe9</fish></seuss>'
import elementtree.ElementTree
elementtree.ElementTree.fromstring(text)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/site-packages/elementtree/ElementTree.py", line 960, in XML
    parser.feed(text)
  File "/usr/lib/python2.4/site-packages/elementtree/ElementTree.py", line 1242, in feed
    self._parser.Parse(data, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 32
##########################################################################
and:
##########################################################
text = '''<?xml version="1.0" encoding="iso-8859-1"?>
... <seuss><fish>red</fish><fish>blu\xe9</fish></seuss>'''
doc = elementtree.ElementTree.fromstring(text)
##########################################################
which does work.
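For anyone trying this on a current Python, here is a rough sketch of the same comparison. It assumes Python 3 and the standard-library xml.etree.ElementTree module rather than the separate elementtree package used above, and try_parse is just a made-up helper name for illustration. The input is passed as bytes so the parser has to rely on the declaration to decode it:

```python
import xml.etree.ElementTree as ET

# 0xE9 is "é" in ISO-8859-1, but a lone 0xE9 byte is not valid UTF-8.
body = b'<seuss><fish>red</fish><fish>blu\xe9</fish></seuss>'


def try_parse(data):
    """Return the text of the last <fish>, or None if the parser rejects data.

    Hypothetical helper, only here to keep the three cases side by side.
    """
    try:
        return ET.fromstring(data).findall("fish")[-1].text
    except ET.ParseError:
        return None


# No declaration: the parser assumes UTF-8 and rejects the 0xE9 byte.
no_declaration = try_parse(body)

# Wrong declaration: the bytes still are not valid UTF-8, so it fails too.
wrong_declaration = try_parse(
    b'<?xml version="1.0" encoding="utf-8"?>\n' + body)

# Matching declaration: the parser decodes 0xE9 as "é".
right_declaration = try_parse(
    b'<?xml version="1.0" encoding="iso-8859-1"?>\n' + body)

print(no_declaration, wrong_declaration, repr(right_declaration))
```

Only the third call should return a value. The first two come back as None because the parser raised a parse error, which is the modern counterpart of the ExpatError shown above.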
If you're dealing with XML content, make sure that your XML documents have
that encoding declaration, or else you're bound to run into these kinds of
errors.
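If the documents come from somewhere you don't control and arrive without a declaration, one possible workaround is to add one before parsing. This is only a sketch under a few assumptions: Python 3 with the standard-library xml.etree.ElementTree, an encoding you already know from some other source, and an ASCII-compatible encoding (so not UTF-16). The name ensure_declaration is made up for this example and is not part of any library:

```python
import xml.etree.ElementTree as ET


def ensure_declaration(data, encoding):
    """Prepend an XML declaration naming `encoding` if `data` lacks one.

    Hypothetical helper. `data` is the raw document as bytes, and
    `encoding` must be the encoding those bytes are really in; this
    function cannot detect it for you.
    """
    if data.lstrip().startswith(b"<?xml"):
        # Already has a declaration; leave the document untouched.
        return data
    header = '<?xml version="1.0" encoding="%s"?>\n' % encoding
    return header.encode("ascii") + data


raw = b'<seuss><fish>red</fish><fish>blu\xe9</fish></seuss>'
root = ET.fromstring(ensure_declaration(raw, "iso-8859-1"))
fish = [element.text for element in root.findall("fish")]
print(fish)
```

Fixing the documents at the source is still the better option. A helper like this just papers over the missing declaration, and it will produce wrong text without complaint if the encoding you pass in is a single-byte one that doesn't match the data.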
Good luck!