"Martin v. Löwis" <[EMAIL PROTECTED]> writes: > If they contain such things, and do not contain a document type > definition, they are not well-formed XML files (i.e. can't be > called "XML" in a meaningful sense).
The documents do have a DTD; however, the DTD file doesn't say
anything about these entities.

> It would have been helpful if you had given an example of such
> a document.

I can't post a whole document because these docs are very large and
I'm not sure that the data is public.  It does look like the DTD is
public: the document begins with

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <!DOCTYPE ONIXmessage SYSTEM "http://www.editeur.org/onix/2.1/short/onix-international.dtd">
    <ONIXmessage release="2.1">
    ...

and that URL points to the DTD, which is online.  Basically the doc has
elements like

    <b036>Diana Montan&eacute;</b036>

and both ElementTree and xmllint complain about the character entities
(and there are a lot of them).

> If there is a document type declaration in the document, the best
> way is to parse it in a mode where the parser downloads the DTD
> when parsing it, and resolves the entity references itself.

Hmm, ok, I see there are a lot of <!ENTITY ...> directives in the DTD
but nothing about those character entities.  Am I looking in the right
place?

> In ElementTree, the XMLTreeBuilder has an attribute entity
> which is a dictionary used to map entity names in entity references
> to their definitions. Whether you can make the parser download
> the DTD itself, I don't know.

Chuck Rhode posted some code for something like this, so I'll try it on
Monday.

Thanks!
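
For reference, here is a minimal sketch of the entity-dictionary approach described in the quoted text. It rests on several assumptions that are not from the thread: it uses `xml.etree.ElementTree.XMLParser` (the current name for what was `XMLTreeBuilder`), it fills the parser's `entity` dictionary from the standard HTML entity names in `html.entities.name2codepoint` rather than from the ONIX DTD, and the sample document is a made-up fragment shaped like the one above. The helper name `parse_with_html_entities` is hypothetical. The DTD is not downloaded; the lookup only applies because the document carries a DOCTYPE declaration.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Made-up sample shaped like the document described in the post.
SAMPLE = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    b'<!DOCTYPE ONIXmessage SYSTEM '
    b'"http://www.editeur.org/onix/2.1/short/onix-international.dtd">\n'
    b'<ONIXmessage release="2.1">'
    b'<b036>Diana Montan&eacute;</b036>'
    b'</ONIXmessage>'
)


def parse_with_html_entities(data):
    """Parse XML bytes, resolving HTML-style named entities locally.

    Hypothetical helper: the entity table is an assumption (HTML names),
    not something read from the document's DTD.
    """
    parser = ET.XMLParser()
    # Map entity names (e.g. "eacute") to their replacement characters.
    parser.entity.update(
        (name, chr(codepoint)) for name, codepoint in name2codepoint.items()
    )
    parser.feed(data)
    return parser.close()  # returns the root element


root = parse_with_html_entities(SAMPLE)
name = root.findtext("b036")
print(name)
```

An entity name that is missing from the dictionary still raises a parse error, so this only helps if the documents stick to names that the table covers.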