On Wed, 2008-08-20 at 15:36 -0700, George Sakkis wrote:
> It seems xml.etree.cElementTree.iterparse() is not unicode aware:
>
> >>> from StringIO import StringIO
> >>> from xml.etree.cElementTree import iterparse
> >>> s = u'<name>\u03a0\u03b1\u03bd\u03b1\u03b3\u03b9\u03ce\u03c4\u03b7\u03c2</name>'
> >>> for event,elem in iterparse(StringIO(s)):
> ...     print elem.text
> ...
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "<string>", line 64, in __iter__
> UnicodeEncodeError: 'ascii' codec can't encode characters in position 6-15: ordinal not in range(128)
>
> Am I using it incorrectly or it doesn't currently support unicode ?
>
> George
> --
> http://mail.python.org/mailman/listinfo/python-list

As iterparse expects an actual file (that is, a stream of bytes) as input, passing it a unicode string is problematic. If you want to use iterparse, the simplest way is to encode your string before putting it into the StringIO object, like so:

>>> for event,elem in iterparse(StringIO(s.encode('UTF8'))):
...     print elem.text
...

If you encode using UTF-8, you don't need to worry about the <?xml header bit as suggested previously, since UTF-8 is the default encoding for XML.

If you're using unicode extensively, you should consider using lxml, which implements the same interface as ElementTree but handles unicode better (though it also won't run your example above without first encoding the string):

http://codespeak.net/lxml/parsing.html#python-unicode-strings

You may also find the target parser interface to be more accepting of unicode than iterparse, though it requires a different style of parsing:

http://codespeak.net/lxml/parsing.html#the-target-parser-interface

--
John Krukoff <[EMAIL PROTECTED]>
Land Title Guarantee Company
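
For anyone trying this on a current Python 3, here is a minimal sketch of the same encode-first approach. It rests on a few assumptions that are not part of the original exchange: it uses xml.etree.ElementTree (cElementTree was removed in Python 3.9), it uses io.BytesIO in place of Python 2's StringIO, and the helper name element_texts is made up purely for illustration.

```python
from io import BytesIO
from xml.etree.ElementTree import iterparse

# The same Greek name as in the original example.
s = u'<name>\u03a0\u03b1\u03bd\u03b1\u03b3\u03b9\u03ce\u03c4\u03b7\u03c2</name>'


def element_texts(xml_text):
    """Hypothetical helper: encode the text to UTF-8, then run iterparse on the bytes.

    No <?xml ...?> declaration is needed because UTF-8 is the XML default.
    """
    source = BytesIO(xml_text.encode('utf-8'))
    # iterparse yields (event, element) pairs; by default only "end" events.
    return [elem.text for event, elem in iterparse(source)]


texts = element_texts(s)
print(len(texts[0]))  # 10 characters, decoded back to text
```

The point is the same as above: the parser gets bytes, and elem.text comes back as a decoded string, so no implicit ASCII conversion ever takes place.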