Hello,
I received a bunch of XML files that contain HTML entities (so far, I’ve seen
only `&nbsp;` used). I can’t parse these files with an XML parser because of
these HTML entities:
>>> import lxml.etree
>>> parser = lxml.etree.XMLParser(huge_tree=True, remove_comments=True,
...     schema=None, dtd_validation=False)
>>> xml = lxml.etree.parse("test.xml", parser)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "src/lxml/etree.pyx", line 3521, in lxml.etree.parse
File "src/lxml/parser.pxi", line 1859, in lxml.etree._parseDocument
File "src/lxml/parser.pxi", line 1885, in lxml.etree._parseDocumentFromURL
File "src/lxml/parser.pxi", line 1789, in lxml.etree._parseDocFromFile
File "src/lxml/parser.pxi", line 1177, in lxml.etree._BaseParser._parseDocFromFile
File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
File "test.xml", line 66
lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 66, column 25
The `resolve_entities` parameter for XMLParser unfortunately doesn’t seem to
resolve HTML entities.
If I parse the file using an HTMLParser it works:
>>> parser = lxml.etree.HTMLParser(huge_tree=True, remove_comments=True)
>>> xml = lxml.etree.parse("test.xml", parser)
>>> xml
<lxml.etree._ElementTree object at 0x10589baa0>
but then the case of all tag names is lost, because HTML is case-insensitive
(XML is not) and the HTML parser seems to turn all tag names to lower case:
>>> xml.getroot().find("body/*")
<Element docxml at 0x1059cb730>
This should be a `DocXML` tag name. Now my original XML file is broken and
fails schema validation…
So, what now? I feel very hesitant to treat the original XML file as a string
and replace HTML entities (except `&amp;`, `&lt;`, and `&gt;`) on a string
level. I think a better approach would be to make the XML parser aware of HTML
entities, but that may be a libxml2 issue rather than an lxml one? (Haven’t
looked at the source yet.)
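To illustrate what I mean by making the parser aware of the entities: XML lets a document declare its own entities in the internal DTD subset, so the declarations could be generated from Python’s `html.entities` table and placed in front of the root element. The following is only a rough sketch under that assumption. The helper names and the sample document are made up, it assumes the file has no DOCTYPE of its own, and the demo uses the standard library’s ElementTree; I would expect lxml’s XMLParser to handle the internal subset the same way, but that is an assumption as well.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Entities that XML already predefines; these are not redeclared.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def html_entity_declarations():
    """Build <!ENTITY ...> declarations for the HTML 4 named entities."""
    return "\n".join(
        f'<!ENTITY {name} "&#{codepoint};">'
        for name, codepoint in sorted(name2codepoint.items())
        if name not in XML_PREDEFINED
    )


def add_html_entity_doctype(xml_text, root_name):
    """Insert a DOCTYPE declaring the HTML entities (hypothetical helper).

    Assumes xml_text has no DOCTYPE yet. The DOCTYPE is placed right after
    the XML declaration if there is one, otherwise at the very start.
    """
    doctype = f"<!DOCTYPE {root_name} [\n{html_entity_declarations()}\n]>\n"
    match = re.match(r"<\?xml[^>]*\?>\s*", xml_text)
    pos = match.end() if match else 0
    return xml_text[:pos] + doctype + xml_text[pos:]


# Made-up sample document that uses &nbsp; and mixed-case tag names.
sample = '<?xml version="1.0"?>\n<DocXML><Para>a&nbsp;b</Para></DocXML>'
root = ET.fromstring(add_html_entity_doctype(sample, "DocXML"))
print(root.tag)                      # tag name as parsed by an XML parser
print(repr(root.find("Para").text))  # text with the entity expanded
```

This still touches the file as a string, but only to prepend declarations; the entity references themselves are left for the parser to expand.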
Would you have any other recommendations? How else could I work with this issue?
Many thanks!
Jens
--
Jens Tröger
https://savage.light-speed.de/