Re: Parsing HTML

Garrett Rooney Mon, 07 Aug 2006 15:09:21 -0700

On 8/7/06, James M Snell <[EMAIL PROTECTED]> wrote:

I've put together a fairly simple HTML->Abdera/Axiom impl based on the
Tagsoup parser [1].  It implements the Abdera Parser interface and
creates a Document<Element> model that represents HTML as well-formed
XHTML content.  Further, it supports the ParseFilter mechanism so we
can filter out unsafe HTML content (e.g. script tags).


For example:

    Parser parser = new HtmlParser();
    ParserOptions options = parser.getDefaultParserOptions();
    options.setParseFilter(new SafeContentWhiteListParseFilter());

    String h = "foo<p style='background-color:blue'>This "
        + "<script>alert('foo');</script> <a href='this is foo'>is</a> foo "
        + "<b>bar</b> &nbsp;&raquo;&lt;foo&gt; hello";

    ByteArrayInputStream in = new ByteArrayInputStream(h.getBytes());

    Document<Element> doc = parser.parse(in, (URI)null, options);

    doc.getRoot().writeTo(System.out);

// Outputs
<xhtml:div xmlns:xhtml="http://www.w3.org/1999/xhtml">foo<xhtml:p>This
alert('foo'); <xhtml:a href="this is foo" shape="rect">is</xhtml:a> foo
<xhtml:b>bar</xhtml:b>  »&lt;foo&gt; hello</xhtml:p></xhtml:div>
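
As a rough illustration of the whitelist idea behind that output, here is a
minimal, self-contained sketch.  It is not the SafeContentWhiteListParseFilter
used above and it does not implement Abdera's ParseFilter interface; the class
name WhitelistSketch and the methods allow, acceptElement and acceptAttribute
are hypothetical names chosen only for this example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a whitelist filter; NOT Abdera's ParseFilter API.
public class WhitelistSketch {

    // Allowed attribute names, keyed by the lower-cased element name.
    // An element is accepted only if it has an entry in this map.
    private final Map<String, Set<String>> allowed =
        new HashMap<String, Set<String>>();

    // Register an element and, optionally, the attributes it may carry.
    public WhitelistSketch allow(String element, String... attrs) {
        String key = element.toLowerCase(Locale.ROOT);
        Set<String> names = allowed.get(key);
        if (names == null) {
            names = new HashSet<String>();
            allowed.put(key, names);
        }
        for (String attr : attrs) {
            names.add(attr.toLowerCase(Locale.ROOT));
        }
        return this;
    }

    // True if the element itself may appear in the output.
    public boolean acceptElement(String element) {
        return allowed.containsKey(element.toLowerCase(Locale.ROOT));
    }

    // True if the attribute may appear on the given element.
    public boolean acceptAttribute(String element, String attr) {
        Set<String> names = allowed.get(element.toLowerCase(Locale.ROOT));
        return names != null && names.contains(attr.toLowerCase(Locale.ROOT));
    }
}
```

With a filter built as new WhitelistSketch().allow("p").allow("a", "href").allow("b"),
the script element and the style attribute on p would be rejected while
href on a is kept, which matches the shape of the output above.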

There are still little bits of weirdness, but for the most part it seems
to work really well.  On the downside, I'm not sure if the Tagsoup
license is compatible with the Apache license, otherwise I'd check this
in to the extensions module.

(and oh, btw, so far this has been implemented as a single class with
only 189 lines of code, most of which are formatting :-) ....)


That's pretty slick.  I'll look into the licensing issue.

-garrett
