HTML Parser

James M Snell Mon, 14 Jan 2008 12:27:38 -0800

All,

I have some code based on Henri Sivonen's HTML5 parser that adds HTML parsing capabilities to the Abdera API. For instance:


  URL url = new URL("http://www.snellspace.com");
  Abdera abdera = Abdera.getInstance();
  Parser parser = abdera.getParserFactory().getParser("html");
  Document doc = parser.parse(url.openStream());
  doc.writeTo(System.out);

The parser will repair broken markup and allow it to be accessed using the Abdera Element objects. The two cases where this becomes particularly useful are:


a) Performing autodiscovery of feeds and atompub service docs
b) Converting HTML content to XHTML content and protecting feeds against
   accidental breakage.

For example,

  List<Element> list =
    HtmlHelper.discoverLinks(
      "http://www.snellspace.com/wp",
      "application/atom+xml",
      "alternate");
  for (Element el : list) {
    String href = el.getAttributeValue("href");
    String title = el.getAttributeValue("title");
    String type = el.getAttributeValue("type");
    System.out.println(type + ", " + title + ", " + href);
  }
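
To make concrete what autodiscovery looks for, here is a minimal sketch that uses only the JDK. It is not how HtmlHelper works: it swaps the HTML parser for a plain regular-expression scan, and the class and method names (LinkDiscoverySketch, attr, discover) are hypothetical. It assumes quoted attribute values and no ">" inside them, so it will miss links in sufficiently broken markup, which is the case a repairing parser is meant to handle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Abdera.
final class LinkDiscoverySketch {

  // Matches a whole <link ...> start tag, case-insensitively.
  private static final Pattern LINK_TAG =
      Pattern.compile("<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);

  // Matches name="value" or name='value' attribute pairs.
  private static final Pattern ATTR =
      Pattern.compile("([a-zA-Z-]+)\\s*=\\s*(\"([^\"]*)\"|'([^']*)')");

  /** Returns the value of the named attribute in a tag, or null if absent. */
  static String attr(String tag, String name) {
    Matcher m = ATTR.matcher(tag);
    while (m.find()) {
      if (m.group(1).equalsIgnoreCase(name)) {
        // Group 3 holds a double-quoted value, group 4 a single-quoted one.
        return m.group(3) != null ? m.group(3) : m.group(4);
      }
    }
    return null;
  }

  /** Returns the href of every link tag whose rel and type both match. */
  static List<String> discover(String html, String type, String rel) {
    List<String> hrefs = new ArrayList<>();
    Matcher m = LINK_TAG.matcher(html);
    while (m.find()) {
      String tag = m.group();
      if (rel.equalsIgnoreCase(attr(tag, "rel"))
          && type.equalsIgnoreCase(attr(tag, "type"))) {
        String href = attr(tag, "href");
        if (href != null) {
          hrefs.add(href);
        }
      }
    }
    return hrefs;
  }

  public static void main(String[] args) {
    String html = "<html><head>"
        + "<link rel=\"stylesheet\" type=\"text/css\" href=\"/style.css\">"
        + "<link rel=\"alternate\" type=\"application/atom+xml\""
        + " title=\"Feed\" href=\"/feed.atom\">"
        + "</head><body></body></html>";
    // Only the second link has both the requested rel and type.
    System.out.println(discover(html, "application/atom+xml", "alternate"));
  }
}
```

The sketch skips the stylesheet link and reports only the Atom alternate link.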

And another:

  Abdera abdera = Abdera.getInstance();
  Entry entry = abdera.newEntry();
  entry.setContentAsXhtml(HtmlCleaner.parse("<p>test<br>foo"));
  System.out.println(entry);

Which outputs:

  <entry xmlns="http://www.w3.org/2005/Atom">
    <content type="xhtml">
      <div xmlns="http://www.w3.org/1999/xhtml">
        <p>test<br />foo</p>
      </div>
    </content>
  </entry>

Note that the HTML fragment is fixed by the HtmlCleaner: the unclosed <p> is closed and the bare <br> becomes <br />.
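
To illustrate why the repair matters for feeds, here is a small sketch that uses only the JDK's XML parser (no Abdera) to test whether a fragment is well-formed XML. The class name WellFormedCheck is hypothetical. The raw fragment fails the check, so embedding it unchanged as XHTML content would break the feed, while the repaired fragment passes.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not part of Abdera.
final class WellFormedCheck {

  /** Returns true if the fragment parses as a well-formed XML document. */
  static boolean isWellFormed(String fragment) {
    try {
      DocumentBuilder builder =
          DocumentBuilderFactory.newInstance().newDocumentBuilder();
      // DefaultHandler rethrows fatal errors without logging them to stderr.
      builder.setErrorHandler(new DefaultHandler());
      builder.parse(new InputSource(new StringReader(fragment)));
      return true;
    } catch (SAXException e) {
      // Well-formedness violations surface as SAX parse exceptions.
      return false;
    } catch (ParserConfigurationException | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(isWellFormed("<p>test<br>foo"));        // prints false
    System.out.println(isWellFormed("<p>test<br />foo</p>"));  // prints true
  }
}
```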

I could commit this, but doing so means adding two new optional dependency jars. I think the function is valuable enough to justify the addition, but I wanted to run it past the rest of you first.


- James
