Paul Green wrote:
I read recently (in Elliotte Rusty Harold's "Processing XML with Java") that Xerces-J is capable of parsing an HTML document into a DOM tree. Xerces-J 1.4.4 does indeed contain an "html" package with all the required interfaces to represent an HTML document in DOM form. However, despite an extensive search, I have been unable to determine how to set up the DOM parser to create such a document. I would be grateful if someone could point me to any documentation, and particularly code examples, describing how to do this. Alternatively, if I'm barking up the wrong tree, which tree should I go and bark up?
I haven't read the Man-With-Three-First-Names's book, but the core Xerces package does not support HTML parsing out of the box. The HTML DOM packages are just an implementation of the HTML interfaces of the W3C DOM spec.

That being said, there is still a way to parse HTML using Xerces. I have a side project called NekoHTML that builds on the Xerces Native Interface (XNI) and implements an HTML scanner and tag balancer. This allows you to parse HTML documents and treat the results as XML, with all of the parsing APIs available in Xerces (e.g. DOM and SAX). Here's the link where you can download and evaluate it:

http://www.apache.org/~andyc/neko/doc/html/

Another option for parsing HTML and using XML interfaces is JTidy. It's primarily used for cleaning up HTML and saving the result back to a file, but it can also return the result as a DOM tree.

There are a number of other HTML parsers available, but I have less experience with them and, from what I've seen, most have custom programming interfaces. So evaluate a few of the available options and choose the one that works best for what you need to do.

-- 
Andy Clark
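
To illustrate why a plain XML parser is not enough on its own, here is a minimal sketch using only the JAXP classes bundled with the JDK (not NekoHTML or JTidy). The class name StrictParseDemo and the helper parsesAsXml are made up for this example, and the two sample strings are assumptions, not taken from any real page. It shows that well-formed markup parses into a DOM tree, while typical "tag soup" with unclosed elements is rejected, which is the gap a tag balancer is meant to fill.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical demo class: checks whether markup is accepted by a strict XML parser.
class StrictParseDemo {

    // Returns true if the markup parses into a DOM tree, false if the parser rejects it.
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Rethrow errors instead of letting the default handler print to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignore warnings */ }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            return doc.getDocumentElement() != null;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            // Not well-formed XML (or the parser could not be created).
            return false;
        }
    }

    public static void main(String[] args) {
        // Well-formed: every element is closed and properly nested.
        String wellFormed = "<html><body><p>one</p><p>two</p></body></html>";
        // Tag soup: <br> and both <p> elements are never closed.
        String tagSoup = "<html><body><p>one<br><p>two</body></html>";

        System.out.println("well-formed accepted: " + parsesAsXml(wellFormed));
        System.out.println("tag soup accepted:    " + parsesAsXml(tagSoup));
    }
}
```

Markup like the second string is common in real HTML pages, so a strict parser needs something in front of it that closes and reorders tags before the DOM is built.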
