Diez B. Roggisch wrote:
> Steven Bethard wrote:
>> I'm having trouble using elementtree with an XML file that has some
>> gbk-encoded text. (I can't read Chinese, so I'm taking their word for
>> it that it's gbk-encoded.) I always have trouble with encodings, so
>> I'm sure I'm just screwing something simple up. Can anyone help me?
>>
>> Here's the interactive session. Sorry it's a little verbose, but I
>> figured it would be better to include too much than not enough. I
>> basically expected et.ElementTree(file=...) to fail since no encoding
>> was specified, but I don't know what I'm doing wrong when I use
>> codecs.open(...)
>
> The first and most important lesson to learn here is that well-formed
> XML must contain a xml-header that specifies the used encoding. This has
> two consequences for you:
>
> 1) all xml-parsers expect byte-strings, as they have to first read the
> header to know what encoding awaits them. So no use reading the xml-file
> with a codec - even if it is the right one. It will get converted back
> to a string when fed to the parser, with the default codec being used -
> resulting in the well-known unicode error.
>
> 2) your xml is _not_ well-formed, as it doesn't contain a xml-header!
> You need to ask these guys to deliver the xml with header. Of course for
> now it is ok to just prepend the text with something like <?xml
> version="1.0" encoding="gbk"?>. But I'd still request them to deliver it
> with that header - otherwise it is _not_ XML, but just something that
> happens to look similar and doesn't guarantee to be well-formed and thus
> can be safely fed to a parser.

Thanks, that's very helpful. I'll definitely harass the people
producing these files to make sure they put encoding declarations in
them.

Here's what I get with the prepending hack:

>>> et.fromstring('<?xml version="1.0" encoding="gbk"?>\n' + open(filename).read())
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
  File "C:\Program Files\Python\lib\site-packages\elementtree\ElementTree.py", line 960, in XML
    parser.feed(text)
  File "C:\Program Files\Python\lib\site-packages\elementtree\ElementTree.py", line 1242, in feed
    self._parser.Parse(data, 0)
ExpatError: unknown encoding: line 1, column 30

Are the XML encoding names different from the Python ones? The "gbk"
encoding seems to work okay from Python:

>>> open(filename).read().decode('gbk')
u'<DOC>\n<DOCID>ART242</DOCID>\n<HEADER>\n <DATE></DATE>\n</HEADER>\n<BODY>\n<HEADLINE>\n<S ID=2566>\n( (IP-HLN (LCP-TMP (IP (NP-PN-SBJ (NR \u4f0f\u660e\u971e)) \n\t\t (VP (VV \u83b7\u5f97) \n\t\t\t (NP-OBJ (NN \u5973\u5b50) \n\t\t\t\t (NN \u8df3\u53f0) \n\t\t\t\t (NN \u8df3\u6c34) \n\t\t\t\t (NN \u51a0\u519b)))) \n\t\t (LC \u540e)) \n (PU \uff0c) \n (NP-SBJ (NP-PN (NR \u82cf\u8054\u961f)) \n (NP (NN \u6559\u7ec3))) \n (VP (ADVP (AD \u70ed\u60c5)) \n (PP-DIR (P \u5411) \n\t\t (NP (PN \u5979))) \n (VP (VV \u795d\u8d3a))) \n (PU \u3002)) ) \n</S>\n<S ID=2567>\n( (FRAG (NR \u65b0\u534e\u793e) \n (NN \u8bb0\u8005) \n (NR \u7a0b\u81f3\u5584) \n (VV \u6444) )) \n</S>\n</HEADLINE>\n<TEXT>\n</TEXT>\n</BODY>\n</DOC>\n'

STeve
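
One possible workaround, offered as a minimal sketch rather than a definitive fix: since Python can decode the bytes as gbk while the parser rejects that encoding name, the text could be decoded in Python, re-encoded as UTF-8 (which expat handles natively), and given a UTF-8 declaration before parsing. The sketch is written for Python 3's xml.etree.ElementTree, not the standalone elementtree package from the session above. The helper name parse_with_codec and the sample document are hypothetical, and the input is assumed to carry no XML declaration of its own. The real file's unquoted attributes (<S ID=2566>) would still be rejected by an XML parser, so the sample avoids them.

```python
import xml.etree.ElementTree as ET

# Declaration that matches the bytes actually handed to the parser.
UTF8_DECLARATION = b'<?xml version="1.0" encoding="utf-8"?>\n'


def parse_with_codec(raw_bytes, source_encoding="gbk"):
    """Hypothetical helper: transcode to UTF-8, then parse.

    Assumes raw_bytes has no XML declaration of its own; prepending a
    second declaration would make the document ill-formed.
    """
    # Let Python's codec do the decoding the parser cannot do itself.
    text = raw_bytes.decode(source_encoding)
    # Hand the parser bytes in an encoding it supports natively.
    return ET.fromstring(UTF8_DECLARATION + text.encode("utf-8"))


# Made-up sample standing in for the real file: gbk bytes, no declaration.
sample_bytes = (
    "<DOC><DOCID>ART242</DOCID><NR>\u4f0f\u660e\u971e</NR></DOC>"
).encode("gbk")

root = parse_with_codec(sample_bytes)
print(root.find("DOCID").text)
```

With a real file, the bytes would come from open(filename, "rb").read() so that no decoding happens before the helper runs.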