What you suggested solved my problem, but unfortunately it also revealed that the HTML I was parsing did not comply with the DTD it was supposed to follow. There were a lot of missing end tags.
In light of this frustrating problem I've gone back to the source DocBook code. There are many isolated XML files with content that I want to parse out. One example that I am focussing on starts with this:

    <?xml version='1.0' encoding='iso-8859-1'?>
    <appendix xmlns="http://docbook.org/ns/docbook" xml:id="indexes">
    <title>&FunctionIndex;</title>

My xml.sax parser fails with:

    phpdoc/doc-base/funcindex.xml:3:8: undefined entity

I went looking for "FunctionIndex" and grep told me:

    phpdoc/en/language-defs.ent:<!ENTITY FunctionIndex "Function Index">

So I looked for pages that explicitly reference "language-defs.ent", and grep told me:

    phpdoc/doc-base/install-unix.xml:<!ENTITY % language-defs SYSTEM "./en/language-defs.ent">

By this stage, I'm a bit tangled up. Is a SAX parser the right way to parse DocBook files when their entities are defined in separate external files? I am hoping to extract structured information from this documentation to present in another format.
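
For what it's worth, here is a minimal sketch of one possible workaround, not a tested fix for the whole phpdoc tree. The idea is to copy the simple entity declarations out of the .ent file into an internal DOCTYPE subset, splice that in after the XML declaration, and then parse as usual, so that the parser sees `FunctionIndex` as a declared entity. The helper name `with_inline_entities`, the regular expression, and the sample byte strings are all made up for illustration. The sketch also assumes the .ent file uses the same encoding as the XML file, and that its entity values are plain quoted strings with no parameter-entity references in them.

```python
import re
import xml.etree.ElementTree as ET

# Matches simple internal entity declarations such as
#   <!ENTITY FunctionIndex "Function Index">
# Parameter entities ("<!ENTITY % name ...>") and SYSTEM entities do not match
# and are skipped.
ENTITY_DECL = re.compile(rb'<!ENTITY\s+[\w.:-]+\s+(?:"[^"]*"|\'[^\']*\')\s*>')


def with_inline_entities(xml_bytes, ent_bytes, root_name):
    """Return xml_bytes with a DOCTYPE whose internal subset declares the
    simple entities found in ent_bytes (hypothetical helper)."""
    decls = b"\n".join(ENTITY_DECL.findall(ent_bytes))
    doctype = (b"<!DOCTYPE " + root_name.encode("ascii")
               + b" [\n" + decls + b"\n]>\n")
    if xml_bytes.startswith(b"<?xml"):
        # The XML declaration has to stay first, so insert right after it.
        end = xml_bytes.index(b"?>") + 2
        return xml_bytes[:end] + b"\n" + doctype + xml_bytes[end:]
    return doctype + xml_bytes


# Stand-ins for en/language-defs.ent and doc-base/funcindex.xml.
ENT = b'<!ENTITY FunctionIndex "Function Index">\n'
DOC = (b"<?xml version='1.0' encoding='iso-8859-1'?>\n"
       b'<appendix xmlns="http://docbook.org/ns/docbook" xml:id="indexes">\n'
       b"<title>&FunctionIndex;</title>\n"
       b"</appendix>\n")

root = ET.fromstring(with_inline_entities(DOC, ENT, "appendix"))
title = root.find("{http://docbook.org/ns/docbook}title")
print(title.text)
```

The same rewritten bytes could presumably be handed to `xml.sax.parseString` with an existing ContentHandler, so the choice between SAX and ElementTree would not have to change. Entities whose values refer to other files or to parameter entities would need more than this.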