I've submitted a revised patch (https://issues.apache.org/jira/browse/TIKA-420)
and have one key question.
Currently the BoilerpipeContentHandler calls a delegate
ContentHandler, but it only makes the following calls to the delegate:

  startDocument();

  then for each text block...
    startElement("p");
    characters(...);
    endElement("p");

  endDocument();

This means that you don't get valid XHTML from the handler, which I
think is OK (versus parsers, which must generate valid XHTML).
But I could easily add dummy tags for html and body - would that be
better?
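
To make the question concrete, here is a rough sketch of what the wrapped call sequence could look like. It is not the code from the patch: the class name WrappedBlockEmitter and its emit and trace methods are hypothetical, and it relies only on the standard org.xml.sax interfaces rather than on anything from Tika or Boilerpipe. It adds only the html and body wrappers and makes no attempt at full XHTML validity (there is no head or title, for example).

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
public class WrappedBlockEmitter {

    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Sends each text block to the delegate as a <p> element,
    // wrapped in dummy <html> and <body> elements.
    public static void emit(List<String> blocks, ContentHandler delegate)
            throws SAXException {
        Attributes noAttrs = new AttributesImpl();

        delegate.startDocument();
        delegate.startElement(XHTML, "html", "html", noAttrs);
        delegate.startElement(XHTML, "body", "body", noAttrs);

        for (String block : blocks) {
            char[] chars = block.toCharArray();
            delegate.startElement(XHTML, "p", "p", noAttrs);
            delegate.characters(chars, 0, chars.length);
            delegate.endElement(XHTML, "p", "p");
        }

        delegate.endElement(XHTML, "body", "body");
        delegate.endElement(XHTML, "html", "html");
        delegate.endDocument();
    }

    // Runs emit() against a recording handler and returns the element
    // and text events in order, so the sequence can be inspected.
    public static List<String> trace(List<String> blocks) {
        final List<String> events = new ArrayList<String>();
        ContentHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                events.add("<" + localName + ">");
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("</" + localName + ">");
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                events.add(new String(ch, start, length));
            }
        };
        try {
            emit(blocks, recorder);
        } catch (SAXException e) {
            // The recording handler never throws, so this is unexpected.
            throw new IllegalStateException(e);
        }
        return events;
    }
}
```

With two blocks "a" and "b", trace would record html and body opening once, then a p element per block, then body and html closing. The delegate sees a single root element either way, which is the main practical difference from the current behavior.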
Thanks,
-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g