[EMAIL PROTECTED] wrote:
> How do we get an HTMLDocument for an HTML file? The following doesn't seem to work:
>
> DOMParser domp = new DOMParser();
> domp.parse(new InputSource(htmlfile));
> Document d = domp.getDocument();

The parser does not assume you want an HTML document instance
just because the root element is "html". If you know that the
HTML documents you are parsing are well-formed according to
the XML specification, set the following property *before*
calling "parse":

domp.setProperty("http://apache.org/xml/properties/dom/document-class-name",
                 "org.apache.html.dom.HTMLDocumentImpl");

However, if your documents are *not* well-formed XML (and
most HTML documents are not), you need to "tidy" them before
parsing them with Xerces. You can use JTidy or NekoHTML for
this (there are probably other tools available as well).
Here are the links:
http://www.sourceforge.net/projects/jtidy
http://www.apache.org/~andyc/
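
If you are not sure which case applies to your files, a quick
well-formedness check can tell you. The following is only a minimal
sketch, and it uses the standard JAXP API from the JDK rather than
the Xerces DOMParser class directly. The class name WellFormedCheck
and the method isWellFormed are hypothetical names chosen for
illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether markup parses as well-formed XML.
class WellFormedCheck {

    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // A fatal parse error means the input is not well-formed XML.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            // Not a well-formedness problem; surface it to the caller.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Every element is closed, so this can be parsed as XML.
        System.out.println(isWellFormed("<html><body><p>ok</p></body></html>"));
        // The <br> is never closed, which is typical of HTML in the wild.
        System.out.println(isWellFormed("<html><body><p>one<br>two</p></body></html>"));
    }
}
```

Documents that pass this check can go straight to the parser with
the property set; documents that fail need JTidy or NekoHTML first.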
--
Andy Clark * [EMAIL PROTECTED]