What you suggested solved my problem, but unfortunately it also revealed that 
the HTML I was parsing does not comply with the DTD it is supposed to follow. 
There were a lot of missing end tags.

In light of this frustrating problem I've gone back to the source DocBook code. 
There are many isolated XML files with content that I want to parse out. One 
example that I am focussing on starts with this…

<?xml version='1.0' encoding='iso-8859-1'?>
<appendix xmlns="http://docbook.org/ns/docbook" xml:id="indexes">
 <title>&FunctionIndex;</title>

My xml.sax parser fails with…

phpdoc/doc-base/funcindex.xml:3:8: undefined entity
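For reference, here is a minimal sketch that should reproduce the failure 
without the phpdoc checkout. The DOC bytes are a trimmed stand-in for 
funcindex.xml (I added the closing tag so the entity is the only problem 
left), and try_parse is just a helper name I made up.

```python
import io
import xml.sax

# Trimmed stand-in for funcindex.xml (assumption: the closing tag is added
# here so that the undeclared entity is the only remaining problem).
DOC = b"""<?xml version='1.0' encoding='iso-8859-1'?>
<appendix xmlns="http://docbook.org/ns/docbook" xml:id="indexes">
 <title>&FunctionIndex;</title>
</appendix>
"""


def try_parse(data):
    """Return None on success, or the SAXParseException that was raised."""
    try:
        # A bare ContentHandler ignores every event; we only care about errors.
        xml.sax.parse(io.BytesIO(data), xml.sax.ContentHandler())
    except xml.sax.SAXParseException as exc:
        return exc
    return None


error = try_parse(DOC)
if error is not None:
    # Line, column and message, in the same order as the error shown above.
    print(error.getLineNumber(), error.getColumnNumber(), error.getMessage())
```

Nothing in the file declares FunctionIndex and there is no DOCTYPE pointing 
anywhere else, so the parser has no way of knowing what the entity stands for.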

I went looking for "FunctionIndex" and grep told me…

phpdoc/en/language-defs.ent:<!ENTITY FunctionIndex     "Function Index">

…and so I looked for pages that explicitly reference "language-defs.ent", and 
grep tells me…

phpdoc/doc-base/install-unix.xml:<!ENTITY % language-defs     SYSTEM "./en/language-defs.ent">

By this stage, I'm a bit tangled up. Is a SAX parser the right way to parse 
DocBook files when they have locally defined external entities? I am hoping to 
extract structured information from this documentation to present in another 
format.
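
One workaround I am considering, as a rough sketch only: splice the entity 
declarations into an internal DTD subset before handing the bytes to xml.sax. 
This assumes the .ent file contains nothing but plain declarations that are 
legal inside an internal subset, which I have not checked for the whole of 
language-defs.ent. The names with_internal_subset and TitleCollector are my 
own, and DOC is again a trimmed stand-in for funcindex.xml.

```python
import io
import xml.sax

# Declaration copied from phpdoc/en/language-defs.ent as reported by grep.
ENTITY_DECLS = b'<!ENTITY FunctionIndex     "Function Index">'

# Trimmed stand-in for funcindex.xml (closing tag added for the sketch).
DOC = b"""<?xml version='1.0' encoding='iso-8859-1'?>
<appendix xmlns="http://docbook.org/ns/docbook" xml:id="indexes">
 <title>&FunctionIndex;</title>
</appendix>
"""


def with_internal_subset(data, root, decls):
    """Insert a DOCTYPE carrying decls right after the XML declaration."""
    doctype = b"<!DOCTYPE " + root + b" [\n" + decls + b"\n]>\n"
    if data.startswith(b"<?xml"):
        end = data.index(b"?>") + 2
        return data[:end] + b"\n" + doctype + data[end:]
    return doctype + data


class TitleCollector(xml.sax.ContentHandler):
    """Hypothetical handler: gathers the text inside <title> elements."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # list of text chunks while inside a <title>

    def startElement(self, name, attrs):
        if name == "title":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect and join later.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title" and self._buffer is not None:
            self.titles.append("".join(self._buffer))
            self._buffer = None


handler = TitleCollector()
patched = with_internal_subset(DOC, b"appendix", ENTITY_DECLS)
xml.sax.parse(io.BytesIO(patched), handler)
print(handler.titles)
```

With the declaration in place the parser expands the entity itself and the 
handler only ever sees ordinary character data. Whether this scales to the 
real entity files is exactly what I am unsure about.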
-- 
http://mail.python.org/mailman/listinfo/python-list
