I asked this question some months ago, and someone (A. Clark?) pointed me
to JTidy at http://lempinen.net/sami/jtidy/
It's an HTML parser, so it doesn't require the input to be well-formed.

I too thought that the org.w3c.dom.html package was an HTML parser, but it
only defines the HTML DOM interfaces; it does no parsing.

Anyway, JTidy worked for me. However, it uses a DOM 1.0 implementation, and
therefore the Document it produces cannot be used for much.

What I ended up doing was to run an empty (identity) Xalan transform that
takes the old DOM in and puts a new DOM (2.0) out.

Here is what the code looks like. It probably won't compile as is, since I
just cut and pasted bits and pieces together, and my Java skills are not
that good.

Also, please note that I have problems compiling the JTidy stuff with
Xerces 1.4.3...

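// Identity ("empty") transform: no stylesheet is supplied.
javax.xml.transform.TransformerFactory tf =
          javax.xml.transform.TransformerFactory.newInstance();
javax.xml.transform.Transformer iTransformer = tf.newTransformer();

// Let JTidy parse the (possibly malformed) HTML into its own DOM.
// "html" is a String holding the page source.
org.w3c.tidy.Tidy tidy = new org.w3c.tidy.Tidy();
org.w3c.dom.Document tidyDoc = tidy.parseDOM(
          new java.io.ByteArrayInputStream(html.getBytes()), null);

// Copy the JTidy DOM into a new DOM. Passing a null Document to DOMResult
// would leave iDocumentOut null, so use an empty DOMResult and read the
// node back after the transform.
javax.xml.transform.dom.DOMResult iResult =
          new javax.xml.transform.dom.DOMResult();
iTransformer.transform(
          new javax.xml.transform.dom.DOMSource(tidyDoc), iResult);
org.w3c.dom.Document iDocumentOut =
          (org.w3c.dom.Document) iResult.getNode();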
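For anyone who wants to try the DOM-to-DOM copy step without JTidy on the
classpath, here is a minimal sketch that uses only the standard JAXP classes.
The class name DomCopy and the method sampleInput() are made up for this
example; sampleInput() just builds a tiny document to stand in for whatever
tidy.parseDOM(...) would return. I have not checked this against Xerces 1.4.3
specifically, so treat it as an illustration of the idea rather than the
definitive way to do it.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class; the name DomCopy is not part of JTidy or Xalan.
final class DomCopy {

    // Copies a DOM tree into a new Document created by the transformer,
    // using a stylesheet-less (identity) transform.
    static Document copy(Document in) {
        try {
            Transformer identity =
                    TransformerFactory.newInstance().newTransformer();
            // Empty DOMResult: the transformer creates the output node itself.
            DOMResult result = new DOMResult();
            identity.transform(new DOMSource(in), result);
            return (Document) result.getNode();
        } catch (TransformerException e) {
            throw new IllegalStateException("identity transform failed", e);
        }
    }

    // Stand-in (assumption) for the Document that JTidy would hand back.
    static Document sampleInput() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            Element body = doc.createElement("body");
            body.appendChild(doc.createTextNode("hello"));
            html.appendChild(body);
            doc.appendChild(html);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM builder available", e);
        }
    }

    public static void main(String[] args) {
        Document in = sampleInput();
        Document out = copy(in);
        // Print the root element name of the copy and whether it is a new object.
        System.out.println(out.getDocumentElement().getNodeName());
        System.out.println(out != in);
    }
}
```

The point of the empty DOMResult is that the output Document comes from
whichever DOM implementation the transformer picks, not from the one that
produced the input tree.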




atta ur-rehman <[EMAIL PROTECTED]> on 17/10/2001 06:03:58

Please respond to [EMAIL PROTECTED]

To:   "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
cc:
Subject:  A question about HTML Parser in Xerces 1.4.3


Dear All,

Excuse me if I'm posting my question to an inappropriate mailing list, though
I believe it has to be someone from this list who could answer my question.
The question is about Xerces and I have been trying my best on the web to
find an answer to it, to no avail. I hope you may help and would appreciate
the help.

What I'm trying to do is very simple. I have an HTML document (HTML, not
XHTML), which may or may not be well-formed, and I need to parse this
document to get an org.w3c.dom.html.HTMLDocument instance from it. How can
I do that? I'm sure it's already implemented in Xerces-J, I just can't
figure out how.

Getting an org.apache.html.dom.HTMLDocumentImpl (or
org.w3c.dom.html.HTMLDocument, for that matter) instance from
org.apache.html.dom.HTMLBuilder set as the DocumentHandler for the
SAXParser seems to be the nearest I could get to it through the Xerces API
documentation. But that didn't help either; the SAXParser throws an
exception during parse.

I would really appreciate any help in this regard.

Thanks and regards,


ATTA



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]







