I just looked at TagSoup, and it seems to clean up bad HTML tags to produce a well-formed HTML file. What BoilerPipe does is different: it tries to eliminate HTML content that is not part of the useful content for a human reader (e.g. navigation, ads, comments...). Take a look here: http://boilerpipe-web.appspot.com/ and try it with one of your URLs.

Another application of this type is 'Readability', which is aimed more at end users (http://lab.arc90.com/experiments/readability/).
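
To make the difference from TagSoup concrete, here is a toy sketch of the general idea behind boilerplate removal: split the page into text blocks and drop the ones that are very short or consist mostly of link text. This is not BoilerPipe's code or its actual classifier; the names BlockCollector and main_content, the tag list, and the thresholds are all made up for illustration, and only the Python standard library is assumed.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an illustrative choice, not a standard list).
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "ul", "nav", "body"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks and count the words inside <a> per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # list of (total_words, link_words, text)
        self._words = []       # words of the block currently being read
        self._link_words = 0   # how many of those words sit inside <a>
        self._in_link = 0      # nesting depth of open <a> tags

    def flush(self):
        """Close the current block, if it contains any words."""
        if self._words:
            self.blocks.append(
                (len(self._words), self._link_words, " ".join(self._words))
            )
        self._words = []
        self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)


def main_content(html, max_link_density=0.3, min_words=5):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    return [
        text
        for total, links, text in parser.blocks
        if total >= min_words and links / total <= max_link_density
    ]


# Hypothetical sample page: a navigation bar, one real paragraph, one ad link.
SAMPLE = """<html><body>
<div><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About us</a> <a href="/contact">Contact</a></div>
<p>Solr can index rich documents through Tika, which extracts the text for each field.</p>
<p><a href="/ads">Buy cheap widgets now</a></p>
</body></html>"""

if __name__ == "__main__":
    for block in main_content(SAMPLE):
        print(block)
```

On the sample page, the navigation bar is dropped because all of its words are link text, the ad is dropped because it is too short, and only the paragraph about Solr and Tika is kept. TagSoup, by contrast, would keep all three and only repair the markup.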


On 14.01.2011 16:51, Adam Estrada wrote:
Is there a drastic difference between this and TagSoup which is already
included in Solr?

On Fri, Jan 14, 2011 at 6:57 AM, arnaud gaudinat
<arnaud.gaudi...@gmail.com> wrote:

Hello,

I would like to use BoilerPipe (a very good program which strips surplus
"clutter" from HTML content).
I saw that BoilerPipe is included in Tika 0.8 and so should be accessible
from Solr, am I right?

How can I activate BoilerPipe in Solr? Do I need to change solrconfig.xml
(with org.apache.solr.handler.extraction.ExtractingRequestHandler)?

Or do I need to modify some code inside Solr?

I saw something like TikaCLI -F in the Tika forum (
http://www.lucidimagination.com/search/document/242ce3a17f30f466/boilerpipe_integration).
Is that the right way?

Thanks in advance,

Arno.


