Hi Arno,
On Jan 14, 2011, at 3:57am, arnaud gaudinat wrote:
> Hello,
> I would like to use BoilerPipe (a very good program which strips
> surplus "clutter" from HTML content).
> I saw that BoilerPipe is included in Tika 0.8 and so should be
> accessible from Solr, am I right?
> How can I activate BoilerPipe in Solr? Do I need to change
> solrconfig.xml (with
> org.apache.solr.handler.extraction.ExtractingRequestHandler)?
> Or do I need to modify some code inside Solr?
> I saw something like TikaCLI -F in the Tika forum
> (http://www.lucidimagination.com/search/document/242ce3a17f30f466/boilerpipe_integration).
> Is that the right way?

You need to add the BoilerpipeContentHandler to Tika's content
handler chain, which I'm pretty sure means you'd need to modify Solr,
e.g. (in trunk) the TikaEntityProcessor.getHtmlHandler() method.
I'd try something like:

    return new BoilerpipeContentHandler(new ContentHandlerDecorator(....

Though from a quick look at that code, I'm curious why it uses the
current ContentHandlerDecorator rather than BodyContentHandler.
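
To illustrate what "adding a handler into the chain" means, here is a minimal sketch that uses only the JDK's SAX classes, not Tika or Solr code. The class names (HandlerChainSketch, ClutterFilter, TextCollector) are hypothetical, and ClutterFilter is only a stand-in for BoilerpipeContentHandler: it drops everything inside <nav> elements by a fixed rule, whereas the real handler relies on Boilerpipe's own heuristics. The point is just the wrapping pattern: the outer handler receives the parser's events and decides what reaches the inner one.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class HandlerChainSketch {

    /** Innermost handler: collects whatever character data reaches it. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /**
     * Hypothetical stand-in for BoilerpipeContentHandler: wraps another
     * handler and suppresses all events inside <nav> elements.
     */
    static class ClutterFilter extends DefaultHandler {
        private final ContentHandler next;
        private int skipDepth = 0; // > 0 while inside a skipped subtree

        ClutterFilter(ContentHandler next) {
            this.next = next;
        }

        @Override
        public void startElement(String uri, String local, String qName,
                                 Attributes atts) throws SAXException {
            if (skipDepth > 0 || "nav".equals(qName)) {
                skipDepth++;
                return;
            }
            next.startElement(uri, local, qName, atts);
        }

        @Override
        public void endElement(String uri, String local, String qName)
                throws SAXException {
            if (skipDepth > 0) {
                skipDepth--;
                return;
            }
            next.endElement(uri, local, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            if (skipDepth == 0) {
                next.characters(ch, start, length);
            }
        }
    }

    /** Parses the markup through the chain: parser -> filter -> collector. */
    static String extract(String xml) throws Exception {
        TextCollector collector = new TextCollector();
        ContentHandler chain = new ClutterFilter(collector);
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(chain);
        reader.parse(new InputSource(new StringReader(xml)));
        return collector.text.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><nav>Home | About</nav>"
                    + "<p>Actual article text.</p></html>";
        System.out.println(extract(page)); // prints "Actual article text."
    }
}
```

In the Solr case the outer handler would be Tika's BoilerpipeContentHandler, and the inner one whatever getHtmlHandler() currently returns.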
-- Ken
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g