Steven Bethard wrote:
> I'm having trouble using elementtree with an XML file that has some
> gbk-encoded text. (I can't read Chinese, so I'm taking their word for
> it that it's gbk-encoded.) I always have trouble with encodings, so I'm
> sure I'm just screwing something simple up. Can anyone help me?
>
> Here's the interactive session. Sorry it's a little verbose, but I
> figured it would be better to include too much than not enough. I
> basically expected et.ElementTree(file=...) to fail since no encoding
> was specified, but I don't know what I'm doing wrong when I use
> codecs.open(...)

The first and most important lesson to learn here is that an XML document in any encoding other than UTF-8 or UTF-16 must start with an XML declaration that names the encoding used. This has two consequences for you:

1) XML parsers expect byte strings, as they first have to read the declaration to know what encoding awaits them. So there is no use reading the XML file through a codec, even if it is the right one. The resulting unicode string gets converted back to a byte string when it is fed to the parser, with the default codec being used, and that produces the well-known UnicodeError.

2) Your XML is _not_ well-formed, as it doesn't contain an XML declaration! You need to ask these guys to deliver the XML with one.

Of course, for now it is OK to just prepend something like <?xml version="1.0" encoding="gbk"?> to the text. But I'd still ask them to deliver it with that declaration. Otherwise it is _not_ XML, just something that happens to look similar, with no guarantee that it is well-formed and can safely be fed to a parser.

HTH
Diez
--
http://mail.python.org/mailman/listinfo/python-list
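
P.S. A minimal sketch of the workaround, assuming the data really is gbk and arrives without any XML declaration. Instead of prepending a gbk declaration, it transcodes the bytes to UTF-8, which every XML parser is required to understand and which needs no declaration. Some parsers may not accept a declared multi-byte encoding such as gbk, so this variant avoids depending on that. It uses the standard-library xml.etree.ElementTree rather than the standalone elementtree package, and the function name and sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET


def parse_undeclared_xml(raw, encoding="gbk"):
    """Parse XML bytes that lack an encoding declaration.

    Hypothetical helper: `encoding` is what the supplier claims the
    bytes are encoded in; it is an assumption, not something detected.
    """
    # Decode with the claimed codec, then hand the parser UTF-8 bytes.
    # UTF-8 is the XML default, so no declaration has to be added.
    text = raw.decode(encoding)
    return ET.fromstring(text.encode("utf-8"))


# Made-up sample: a tiny document stored as gbk bytes with no declaration.
sample = "<doc>\u4e2d\u6587</doc>".encode("gbk")
root = parse_undeclared_xml(sample)
```

If the claimed encoding is wrong, raw.decode() will usually raise UnicodeDecodeError right away, which is easier to diagnose than a failure inside the parser.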