On 8/7/06, James M Snell <[EMAIL PROTECTED]> wrote:
I've put together a fairly simple HTML->Abdera/Axiom impl based on the Tagsoup parser [1]. It implements the Abdera Parser interface and creates a Document<Element> model that represents HTML as well-formed XHTML content. Further, it supports the ParseFilter mechanism so we can filter out unsafe HTML content (e.g. script tags).

For example:

    Parser parser = new HtmlParser();
    ParserOptions options = parser.getDefaultParserOptions();
    options.setParseFilter(new SafeContentWhiteListParseFilter());
    String h = "foo<p style='background-color:blue'>This <script>alert('foo');</script> <a href='this is foo'>is</a> foo <b>bar</b> »<foo> hello";
    ByteArrayInputStream in = new ByteArrayInputStream(h.getBytes());
    Document<Element> doc = parser.parse(in, (URI)null, options);
    doc.getRoot().writeTo(System.out);

    // Outputs
    <xhtml:div xmlns:xhtml="http://www.w3.org/1999/xhtml">foo<xhtml:p>This alert('foo'); <xhtml:a href="this is foo" shape="rect">is</xhtml:a> foo <xhtml:b>bar</xhtml:b> »<foo> hello</xhtml:p></xhtml:div>

There are still little bits of weirdness, but for the most part it seems to work really well. On the downside, I'm not sure if the Tagsoup license is compatible with the Apache license; otherwise I'd check this in to the extensions module.

(and oh, btw, so far this has been implemented as a single class with only 189 lines of code, most of which are formatting :-) ....)
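For readers unfamiliar with whitelist filtering, here is a minimal standalone sketch of the idea behind a filter like SafeContentWhiteListParseFilter. It is not the Abdera code: the class name WhitelistSketch, the two acceptable(...) methods, and the element and attribute lists are all hypothetical and chosen only for illustration. The point is that anything not explicitly listed is dropped, which is why the script element and the style attribute do not appear in the output above.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a whitelist parse filter; not the Abdera API.
class WhitelistSketch {
    // Elements allowed through (illustrative list only).
    static final Set<String> ELEMENTS = Set.of("div", "p", "a", "b");

    // Attributes allowed per element (illustrative list only).
    static final Map<String, Set<String>> ATTRIBUTES = Map.of("a", Set.of("href"));

    // An element passes only if it is explicitly listed.
    static boolean acceptable(String element) {
        return ELEMENTS.contains(element.toLowerCase(Locale.ROOT));
    }

    // An attribute passes only if its element passes and the attribute
    // is explicitly listed for that element.
    static boolean acceptable(String element, String attribute) {
        String name = element.toLowerCase(Locale.ROOT);
        return acceptable(name)
            && ATTRIBUTES.getOrDefault(name, Set.of())
                         .contains(attribute.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(acceptable("script"));     // false
        System.out.println(acceptable("p", "style")); // false
        System.out.println(acceptable("a", "href"));  // true
    }
}
```

A whitelist fails closed: an element or attribute nobody thought about is rejected by default, whereas a blacklist would let it through.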
That's pretty slick. I'll look into the licensing issue.

-garrett
