I've submitted a revised patch (https://issues.apache.org/jira/browse/TIKA-420), and have one key question.

Currently the BoilerpipeContentHandler calls a delegate ContentHandler, but it only makes the following calls to the delegate:

startDocument();

then for each text block...

        startElement("p");
        characters(...);
        endElement("p");

endDocument();

This means that you don't get valid XHTML from the handler, which I think is OK (versus parsers, which must generate valid XHTML).

But I could easily add dummy tags for html and body - would that be better?
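
For concreteness, here is a rough sketch of what the wrapped variant could look like, using only the standard org.xml.sax API. The class name WrappedBlockEmitter and its emit() method are made up for illustration and are not taken from the patch; the real handler gets its text blocks from Boilerpipe rather than from a List.

```java
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of the TIKA-420 patch.
class WrappedBlockEmitter {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Emits each text block as a <p>, wrapped in dummy <html> and <body> elements.
    static void emit(List<String> blocks, ContentHandler delegate) throws SAXException {
        Attributes noAttrs = new AttributesImpl();
        delegate.startDocument();
        delegate.startPrefixMapping("", XHTML);
        delegate.startElement(XHTML, "html", "html", noAttrs);
        delegate.startElement(XHTML, "body", "body", noAttrs);
        for (String block : blocks) {
            char[] chars = block.toCharArray();
            delegate.startElement(XHTML, "p", "p", noAttrs);
            delegate.characters(chars, 0, chars.length);
            delegate.endElement(XHTML, "p", "p");
        }
        delegate.endElement(XHTML, "body", "body");
        delegate.endElement(XHTML, "html", "html");
        delegate.endPrefixMapping("");
        delegate.endDocument();
    }

    public static void main(String[] args) throws SAXException {
        // Print the element events the delegate would receive.
        ContentHandler printer = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                System.out.println("start " + local);
            }
            @Override
            public void endElement(String uri, String local, String qName) {
                System.out.println("end " + local);
            }
        };
        emit(List.of("first block", "second block"), printer);
    }
}
```

The only difference from the current behavior is the extra html/body start and end calls (plus the namespace mapping) around the per-block loop.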

Thanks,

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



