I just looked at TagSoup, and it seems to repair malformed HTML tags to
produce a well-formed HTML file.
BoilerPipe does something different: it tries to remove the HTML content
that is not part of the useful content for a human reader (i.e.
navigation, ads, comments...).
Take a look here: http://boilerpipe-web.appspot.com/ and try it with one
of your URLs.
Another application of this type is 'Readability', which is aimed more at
end users (http://lab.arc90.com/experiments/readability/).
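To make the idea of removing non-content blocks concrete, here is a rough
Java sketch of that kind of filtering. It is not BoilerPipe's actual
algorithm or API. The class name LinkDensitySketch, its methods, and both
thresholds are made up for illustration, and it assumes flat, non-nested
<p> and <div> blocks. It keeps a block only if it has enough words and few
of those words sit inside links, on the assumption that navigation and ads
are mostly short, link-heavy blocks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a toy boilerplate filter, not BoilerPipe itself.
class LinkDensitySketch {
    // Hypothetical thresholds, chosen for this example only.
    static final int MIN_WORDS = 10;
    static final double MAX_LINK_DENSITY = 0.33;

    // Assumes flat (non-nested) <p> and <div> blocks.
    private static final Pattern BLOCK = Pattern.compile(
            "<(p|div)\\b[^>]*>(.*?)</\\1>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\b[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Replace every tag with a space so adjacent words stay separated.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll(" ");
    }

    static int countWords(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Fraction of the block's words that appear inside <a> elements.
    static double linkDensity(String blockHtml) {
        int total = countWords(stripTags(blockHtml));
        if (total == 0) {
            return 1.0; // treat empty blocks as pure boilerplate
        }
        int linked = 0;
        Matcher m = ANCHOR.matcher(blockHtml);
        while (m.find()) {
            linked += countWords(stripTags(m.group(1)));
        }
        return (double) linked / total;
    }

    // A block counts as content if it is long enough and not link-heavy.
    static boolean isContent(String blockHtml) {
        return countWords(stripTags(blockHtml)) >= MIN_WORDS
                && linkDensity(blockHtml) <= MAX_LINK_DENSITY;
    }

    // Return the plain text of every block classified as content.
    static List<String> extractContent(String html) {
        List<String> kept = new ArrayList<>();
        Matcher m = BLOCK.matcher(html);
        while (m.find()) {
            String inner = m.group(2);
            if (isContent(inner)) {
                kept.add(stripTags(inner).trim().replaceAll("\\s+", " "));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String html =
                "<div><a href=\"/\">Home</a> <a href=\"/news\">News</a> "
              + "<a href=\"/about\">About</a></div>"
              + "<p>BoilerPipe tries to keep the main article text of a page "
              + "and drop navigation, ads and comments around it.</p>";
        for (String block : extractContent(html)) {
            System.out.println(block);
        }
    }
}
```

With the sample input in main, the link-only <div> is dropped and only the
paragraph text is printed. TagSoup, by contrast, would keep both blocks and
only fix the markup.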
On 14.01.2011 16:51, Adam Estrada wrote:
Is there a drastic difference between this and TagSoup which is already
included in Solr?
On Fri, Jan 14, 2011 at 6:57 AM, arnaud gaudinat
<arnaud.gaudi...@gmail.com> wrote:
Hello,
I would like to use BoilerPipe (a very good program which strips surplus
"clutter" from HTML content).
I saw that BoilerPipe is included in Tika 0.8 and so should be accessible
from Solr, am I right?
How can I activate BoilerPipe in Solr? Do I need to change solrconfig.xml
(with org.apache.solr.handler.extraction.ExtractingRequestHandler)?
Or do I need to modify some code inside Solr?
I saw something like TikaCLI -F on the Tika forum
(http://www.lucidimagination.com/search/document/242ce3a17f30f466/boilerpipe_integration).
Is that the right way?
Thanks in advance,
Arno.