How to parse HTML with dom4j (was Re: [dom4j-user] validating using a DOMReader)

James Strachan Fri, 19 Apr 2002 19:37:04 -0700

----- Original Message -----
From: "James Strachan" <[EMAIL PROTECTED]>
>
> Also Andy Clark from the Xerces team has put together an HTML parser called
> NekoHTML which looks really cool (and could well be a great event-based
> replacement for JTidy).
>
> http://www.apache.org/~andyc/
>
> I think it's moving into the Xerces codebase soon.


If you want to parse HTML straight into a dom4j Document then I recommend
you take a good look at NekoHTML from Andy Clark of the Xerces team. It can
be used as a regular SAX parser (an org.xml.sax.XMLReader), so it plugs
right into the dom4j SAXReader.

You can instantiate NekoHTML directly, which is the quickest way:

  XMLReader xmlReader = new org.cyberneko.html.parsers.SAXParser();

or through the SAX helper class if you like:

  String className = "org.cyberneko.html.parsers.SAXParser";
  XMLReader xmlReader = XMLReaderFactory.createXMLReader(className);

Then use it with dom4j as follows:

  SAXReader reader = new SAXReader( xmlReader );
  Document doc = reader.read( "foo.html" );
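
To make "regular SAX parser" a bit more concrete, here is a rough sketch that
needs nothing beyond the JDK. Two things in it are assumptions for the sake of
a runnable example: the JDK's built-in XML parser stands in for NekoHTML, and a
tiny ContentHandler stands in for the dom4j SAXReader. The class and method
names (SaxPlugSketch, elementNames, defaultReader) are made up for
illustration. The point is only that the consumer talks to the
org.xml.sax.XMLReader interface, so any implementation can be swapped in.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of dom4j or NekoHTML.
class SaxPlugSketch {

    // Collects element names in document order from whatever XMLReader is passed in.
    // A consumer such as dom4j's SAXReader relies on the same interface.
    static List<String> elementNames(XMLReader xmlReader, String markup) throws Exception {
        final List<String> names = new ArrayList<>();
        xmlReader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                names.add(qName);
            }
        });
        xmlReader.parse(new InputSource(new StringReader(markup)));
        return names;
    }

    // Stand-in parser: the JDK's built-in XML parser (an assumption, so the
    // sketch runs without extra jars). With NekoHTML on the classpath this is
    // where new org.cyberneko.html.parsers.SAXParser() would go instead.
    static XMLReader defaultReader() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        // Well-formed input, because the stand-in is a strict XML parser.
        String markup = "<html><body><p>Hello</p></body></html>";
        System.out.println(elementNames(defaultReader(), markup));
    }
}
```

The stand-in only accepts well-formed markup. Coping with real-world HTML is
exactly what swapping in NekoHTML is meant to buy you.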

Right now NekoHTML is fast and doesn't require any W3C DOM. JTidy might
still do a few more clever things to fix really weird HTML, but I'm sure
Neko will catch up over time.

James





