Barak, Ron, 04.05.2010 09:01:
I'm parsing XML files using ElementTree from xml.etree (see the code below
and the attached xml_parse_example.py).

However, I'm coming across input XML files (an example, tmp.xml, is
attached) that include invalid characters and produce the following
traceback:

$ python xml_parse_example.py
Traceback (most recent call last):
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 6, column 34

I hope you are aware that this means the input you are parsing is not XML. It's best to reject the file and tell the producers that they are writing broken output files. You should always fix the source instead of trying to make sense of broken input in fragile ways.
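
A minimal sketch of rejecting such a file with a message the producers can act on. It assumes a current Python, where ElementTree raises xml.etree.ElementTree.ParseError (the traceback above shows the older ExpatError); load_or_reject is a hypothetical helper name, not part of any library.

```python
import xml.etree.ElementTree as ET


def load_or_reject(data):
    """Parse XML bytes; raise ValueError with the error location if not well-formed."""
    try:
        return ET.fromstring(data)
    except ET.ParseError as err:
        # ParseError carries the (line, column) where the parser gave up.
        line, column = err.position
        raise ValueError(
            f"input is not well-formed XML (line {line}, column {column}): {err}"
        ) from err
```

The caller can log the ValueError text and pass it back to whoever generates the files, rather than guessing at what the broken input meant.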


I read the documentation for xml.etree.ElementTree and see that it may
take an optional parser parameter, but I don't know what this parser
should be - to ignore the invalid characters.

Could you suggest a way to call ElementTree so it won't bomb on these
invalid characters?

No. The parser in lxml.etree has a 'recover' option that lets it try to recover from input errors, but in general, XML parsers are required to reject input that is not well-formed.
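
For illustration only, a sketch of the 'recover' option. It assumes the third-party lxml package is installed; parse_leniently is a hypothetical helper name. What the parser keeps or drops from broken input is up to lxml, so the recorded errors are returned alongside the tree.

```python
try:
    from lxml import etree  # third-party package, not in the standard library
except ImportError:
    etree = None


def parse_leniently(data):
    """Parse XML bytes with lxml's recovering parser; return (root, errors)."""
    if etree is None:
        raise RuntimeError("lxml is not installed")
    parser = etree.XMLParser(recover=True)
    root = etree.fromstring(data, parser)
    # error_log lists the problems the parser recovered from.
    return root, list(parser.error_log)
```

The returned tree may be missing data or be structured differently from what the producer intended, so this is a stopgap and not a substitute for fixing the source.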

Stefan

--
http://mail.python.org/mailman/listinfo/python-list
