I asked this same question some months ago, and someone (A. Clark?) pointed
me to JTidy at http://lempinen.net/sami/jtidy/
It is an HTML parser, so it does not require well-formed input.
I too thought that the org.w3c.dom.html package was an HTML parser, but it
is only an HTML DOM implementation.
Anyway, JTidy worked for me. However, it uses a DOM 1.0 implementation, and
therefore the Document it produces cannot be used for much.
What I ended up doing was to run an empty Xalan transform that takes the old
DOM in and outputs a new DOM (2.0).
Here is what the code looks like. It probably won't compile as such: I just
cut and pasted bits and pieces together, and my Java skills are not that
good anyway.
Also, please note that I have problems compiling JTidy code against Xerces
1.4.3...
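// Identity transformer: no stylesheet, so it just copies the input tree.
javax.xml.transform.TransformerFactory tf =
    javax.xml.transform.TransformerFactory.newInstance();
javax.xml.transform.Transformer iTransformer = tf.newTransformer();

// Let JTidy parse the (possibly malformed) HTML held in the String html.
org.w3c.tidy.Tidy tidy = new org.w3c.tidy.Tidy();
org.w3c.dom.Document tidyDoc = tidy.parseDOM(
    new java.io.ByteArrayInputStream(html.getBytes()), null);

// An empty DOMResult makes the transformer create the output Document,
// which is then read back with getNode().
javax.xml.transform.dom.DOMResult iResult =
    new javax.xml.transform.dom.DOMResult();
iTransformer.transform(
    new javax.xml.transform.dom.DOMSource(tidyDoc), iResult);
org.w3c.dom.Document iDocumentOut =
    (org.w3c.dom.Document) iResult.getNode();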
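
For what it's worth, the copy step can be tried on its own with nothing but
the JAXP classes, leaving JTidy out of the picture. This is only a minimal
sketch: the class name DomCopy and the method copy are made up for
illustration, the input document is built by hand instead of coming from
JTidy, and which DOM implementation (and level) you get back depends on the
parser and transformer found on your classpath.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper for illustration; not part of JTidy, Xalan or Xerces.
public class DomCopy {

    // Runs an identity transform (no stylesheet) so that the input tree is
    // copied into a new Document created by the transformer.
    public static Document copy(Document in) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        DOMResult result = new DOMResult(); // no node set: transformer creates one
        identity.transform(new DOMSource(in), result);
        return (Document) result.getNode();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the Document that tidy.parseDOM(...) would return.
        Document in = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element html = in.createElement("html");
        Element body = in.createElement("body");
        body.appendChild(in.createTextNode("hello"));
        html.appendChild(body);
        in.appendChild(html);

        Document out = copy(in);
        System.out.println(out.getDocumentElement().getNodeName());
        System.out.println(out != in);
    }
}
```

The point of the empty DOMResult is that the output tree is built by the
transformer rather than by JTidy, so the result no longer depends on JTidy's
own DOM classes.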
atta ur-rehman <[EMAIL PROTECTED]> on 17/10/2001 06:03:58
Please respond to [EMAIL PROTECTED]
To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
cc:
Subject: A question about HTML Parser in Xerces 1.4.3
Dear All,
Excuse me if I'm posting my question to an inappropriate mailing list, though
I believe there has to be someone on this list who could answer my question.
The question is about Xerces and I have been trying my best on the web to
find an answer to it, to no avail. I hope you may help and would appreciate
the help.
What I'm trying to do is very simple. I have an HTML document (HTML, not
XHTML), which may or may not be well-formed, and I need to parse this
document to get an instance of org.w3c.dom.html.HTMLDocument from it. How
can I do that? I'm sure it's already implemented in Xerces-J, I just can't
figure out how.
The nearest I could get from the Xerces API documentation was to obtain an
org.apache.html.dom.HTMLDocumentImpl (or org.w3c.dom.html.HTMLDocument, for
that matter) instance from an org.apache.html.dom.HTMLBuilder set as the
DocumentHandler for the SAXParser. But that didn't help either: the
SAXParser throws an exception during parsing.
I would really appreciate any help in this regard.
Thanks and regards,
ATTA
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------