Re: Parsing HTML

Andy Clark Thu, 06 Oct 2005 23:39:08 -0700

Paul Green wrote:

I read recently (in Elliotte Rusty Harold's "Processing XML with
Java") that Xerces-J is capable of parsing an HTML document into a
DOM tree. Xerces-J 1.4.4 does indeed contain an "html" package with
all the required interfaces to represent an HTML document in DOM
form. However, I have been unable to determine how to set up the DOM
parser to create such a document, despite an extensive search. I
would be grateful if someone could point me at any documentation, and
particularly code examples, describing how to do this. Alternatively,
if I'm barking up the wrong tree, which tree should I go and bark up?


I haven't read the Man-With-Three-First-Names's book,
but the core Xerces package does not support HTML parsing
out of the box. The HTML DOM packages are just an
implementation of the HTML interfaces of the W3C DOM spec.
That said, there is still a way to parse HTML using
Xerces.
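
To make the "out of the box" point concrete, here is a minimal
sketch using the JAXP DOM parser bundled with the JDK, which is
derived from Xerces. The class and method names are made up for
this illustration. It only shows that a plain XML parser accepts
well-formed markup and rejects typical tag soup; it does not
parse HTML.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of Xerces or any library.
class StrictXmlCheck {

    /** Returns true if the markup parses as well-formed XML into a DOM tree. */
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            return doc.getDocumentElement() != null;
        } catch (SAXException e) {
            // Not well-formed XML (e.g. unclosed or mismatched tags).
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Every element is closed, so an XML parser is happy with this.
        System.out.println(parsesAsXml("<html><body><p>Hello</p><br/></body></html>"));
        // Typical HTML: <p> and <br> are never closed, so XML parsing fails.
        System.out.println(parsesAsXml("<html><body><p>Hello<br></body></html>"));
    }
}
```

The second input is the kind of document an HTML-aware scanner
has to repair before it can hand a tree to DOM or SAX code.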

I have a side project called NekoHTML that builds on the
Xerces Native Interface (XNI) and implements an HTML
scanner and tag-balancer. This allows you to parse HTML
documents and treat the results as XML, with all of the
parsing APIs available in Xerces (e.g. DOM and SAX).

Here's the link where you can download and evaluate it:

  http://www.apache.org/~andyc/neko/doc/html/

Another option for parsing HTML and using XML interfaces
is JTidy. It's primarily used for cleaning up HTML and
saving the result back to a file, but it also supports a
DOM result.

There are a number of other HTML parsers available but
I have less experience with them and, from what I've
seen, most have custom programming interfaces.
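
As one illustration of what a custom programming interface looks
like, here is a small sketch using the HTML parser that ships
with the JDK's Swing text package. This is my choice of example,
not one of the parsers discussed above, and the class and method
names are made up. The parser targets HTML 3.2 and reports
content through its own callback class rather than DOM or SAX.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical demo class showing a callback-style HTML parser API.
class CallbackHtmlDemo {

    /** Collects the text chunks the Swing HTML parser reports for the input. */
    static List<String> collectText(String html) {
        List<String> chunks = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of character data between tags.
                chunks.add(new String(data));
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Unclosed <p> and <br> are tolerated by this parser.
        System.out.println(collectText("<html><body><p>Hello<br>world</body></html>"));
    }
}
```

Code written against a callback interface like this cannot be
reused with DOM or SAX tools without an adapter, which is the
trade-off to weigh when comparing parsers.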

So evaluate a few of the available options and choose
the one that works best for you and what you need to do.

--
Andy Clark * [EMAIL PROTECTED]

