Thanks Ken,
this is what I wanted to know. I'm not very familiar with this kind of modification, but I will try it and ask you for more information if needed.
Regards,

Arno

On 14.01.2011 18:04, Ken Krugler wrote:
Hi Arno,

On Jan 14, 2011, at 3:57am, arnaud gaudinat wrote:

Hello,

I would like to use BoilerPipe (a very good program that strips surplus "clutter" from HTML content). I saw that BoilerPipe is included in Tika 0.8, so it should be accessible from Solr. Am I right?

How can I activate BoilerPipe in Solr? Do I need to change solrconfig.xml (with org.apache.solr.handler.extraction.ExtractingRequestHandler)?

Or do I need to modify some code inside Solr?

I saw something like TikaCLI -F on the Tika forum (http://www.lucidimagination.com/search/document/242ce3a17f30f466/boilerpipe_integration). Is that the right way?

You need to add the BoilerpipeContentHandler into Tika's content handler chain.

Which I'm pretty sure means you'd need to modify Solr, e.g. (in trunk) the TikaEntityProcessor.getHtmlHandler() method. I'd try something like:

    return new BoilerpipeContentHandler(new ContentHandlerDecorator(....

Though from a quick look at that code, I'm curious why it doesn't use BodyContentHandler, versus the current ContentHandlerDecorator.
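
To illustrate what "adding a handler into the chain" means, here is a minimal sketch that uses only the SAX classes shipped with the JDK, so it runs without Tika or Solr. SkipElementHandler is a hypothetical stand-in, not Boilerpipe's algorithm: it simply drops everything inside one named element and forwards the rest to the handler it wraps. TextCollector plays the role of the inner handler. In Solr, the outer wrapper would be Tika's BoilerpipeContentHandler around whatever getHtmlHandler() currently returns.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class HandlerChainSketch {

    /** Innermost handler: collects whatever character data reaches it. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /**
     * Hypothetical stand-in for a filtering handler such as
     * BoilerpipeContentHandler: wraps a delegate and suppresses all
     * events that occur inside one named element.
     */
    static class SkipElementHandler extends DefaultHandler {
        private final ContentHandler delegate;
        private final String skipped;
        private int depth = 0; // > 0 while inside the skipped element

        SkipElementHandler(ContentHandler delegate, String skipped) {
            this.delegate = delegate;
            this.skipped = skipped;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) throws SAXException {
            if (depth > 0 || qName.equals(skipped)) {
                depth++;
                return;
            }
            delegate.startElement(uri, localName, qName, atts);
        }

        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException {
            if (depth > 0) {
                depth--;
                return;
            }
            delegate.endElement(uri, localName, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            if (depth == 0) {
                delegate.characters(ch, start, length);
            }
        }
    }

    /** Parses well-formed XHTML through the chain: filter -> collector. */
    public static String extract(String xhtml) throws Exception {
        TextCollector collector = new TextCollector();
        ContentHandler chain = new SkipElementHandler(collector, "nav");
        XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(chain);
        reader.parse(new InputSource(new StringReader(xhtml)));
        return collector.text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract(
                "<html><body><nav>Home | About</nav>"
                + "<p>Main article text.</p></body></html>"));
    }
}
```

The parser only ever sees the outermost handler, which decides what the inner one receives. That is why the change belongs where Solr builds its handler rather than in solrconfig.xml.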

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






