Vanilla (this works fine):
#!/usr/bin/python

from elementtree import ElementTree as etree

eg = """<seuss><fish>red</fish><fish>blue</fish></seuss>"""

xml = etree.fromstring(eg)

If I change the example string to this:
<seuss><fish>red</fish><fish>blué</fish></seuss>

I get the following error:
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 32


According to:

    http://mail.python.org/pipermail/xml-sig/2006-May/011513.html

the XML content must itself declare what encoding it uses.  For example:

#####################################################################
text = """<?xml version='1.0' encoding='utf-8'?>
<p>\xed\x95\x98\xeb\xa3\xa8\xeb\x8f\x99\xec\x95\x88 IDLE\xea\xb0\x80\xec\xa7\x80\xea\xb3\xa0 \xeb\x86\x80\xea\xb8\xb0</p>
"""
#####################################################################

Note that the encoding declaration must be at the very top of the document.


Then it's ok to use fromstring() on it:

##################################################
import elementtree.ElementTree
doc = elementtree.ElementTree.fromstring(text)
doc.text
u'\ud558\ub8e8\ub3d9\uc548\nIDLE\uac00\uc9c0\uace0 \ub180\uae30'
##################################################

If I use the wrong encoding declaration, or if the declaration is missing altogether, then yes, I see the same errors that you saw.


Okay, the default encoding for my program (and thus my example string) is US-ASCII, so I'll use 8859-1 instead, adding this line: # coding: iso-8859-1

I get the same error. Just for laughs I'll change the encoding to utf-8. Oops, I get the same error.


The XML encoding has to be explicitly declared as part of the XML document text itself; a coding line in the Python source file doesn't affect what the XML parser sees. It's the difference between:

###########################################################################
text = '<seuss><fish>red</fish><fish>blu\xe9</fish></seuss>'
import elementtree.ElementTree
elementtree.ElementTree.fromstring(text)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/site-packages/elementtree/ElementTree.py", line 960, in XML
    parser.feed(text)
  File "/usr/lib/python2.4/site-packages/elementtree/ElementTree.py", line 1242, in feed
    self._parser.Parse(data, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 32
##########################################################################

and:

##########################################################
text = '''<?xml version="1.0" encoding="iso-8859-1"?>
<seuss><fish>red</fish><fish>blu\xe9</fish></seuss>'''
doc = elementtree.ElementTree.fromstring(text)
##########################################################

which does work.
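
For anyone trying this on a newer Python, here is a rough sketch of the same comparison. It assumes the standard-library xml.etree.ElementTree module (the descendant of the elementtree package used above) and byte-string input; try_parse is just a helper name made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The element content as ISO-8859-1 bytes: e-acute is the single byte 0xE9.
body = b"<seuss><fish>red</fish><fish>blu\xe9</fish></seuss>"


def try_parse(data):
    """Hypothetical helper: text of the last <fish>, or None if parsing fails."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        # Raised for "not well-formed (invalid token)" and similar errors.
        return None
    return root.findall("fish")[-1].text


# No declaration: the parser assumes UTF-8, and a lone 0xE9 is not valid UTF-8.
no_decl = try_parse(body)

# A declaration that doesn't match the actual bytes fails the same way.
wrong_decl = try_parse(b'<?xml version="1.0" encoding="utf-8"?>\n' + body)

# A declaration that matches the bytes lets the parser decode them.
right_decl = try_parse(b'<?xml version="1.0" encoding="iso-8859-1"?>\n' + body)
```

Only the last call should hand back the fish text; the first two should come back as None because the parser rejects the 0xE9 byte.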

If you're dealing with XML content, make sure that your XML documents have that encoding declaration, or else you're bound to run into these kinds of errors.
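
If the XML comes from somewhere that leaves the declaration out, and you already know what encoding the bytes are in, one possible workaround is to prepend the declaration yourself before parsing. This is only a sketch: with_declaration is a hypothetical helper (not part of ElementTree), it assumes an ASCII-compatible encoding such as iso-8859-1 or utf-8, and its check for an existing declaration is deliberately simple.

```python
import xml.etree.ElementTree as ET


def with_declaration(raw, encoding):
    """Hypothetical helper: add an XML declaration naming `encoding` if absent.

    `raw` is the document as bytes. Assumes an ASCII-compatible encoding;
    this would not be right for UTF-16, for example.
    """
    # Simplistic check: treat a leading "<?xml " as an existing declaration.
    if raw.lstrip().startswith(b"<?xml "):
        return raw
    header = '<?xml version="1.0" encoding="%s"?>\n' % encoding
    return header.encode("ascii") + raw


raw = b"<seuss><fish>red</fish><fish>blu\xe9</fish></seuss>"
fixed = with_declaration(raw, "iso-8859-1")
root = ET.fromstring(fixed)
last_fish = root.findall("fish")[-1].text
```

The better fix is still to have whoever produces the XML include the declaration, since guessing the encoding wrong here just moves the problem.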

Good luck!