Re: boilerpipe solr tika howto please
Thanks Ken, this is what I wanted to know. I'm not very familiar with this kind of modification, but I will try it and ask you for more information if needed.

Regards,
Arno
Re: boilerpipe solr tika howto please
Hi Arno,

On Jan 14, 2011, at 3:57am, arnaud gaudinat wrote:

> Hello,
>
> I would like to use BoilerPipe (a very good program which cleans surplus "clutter" from HTML content). I saw that BoilerPipe is inside Tika 0.8 and so should be accessible from Solr, am I right?
>
> How can I activate BoilerPipe in Solr? Do I need to change solrconfig.xml (with org.apache.solr.handler.extraction.ExtractingRequestHandler)?
>
> Or do I need to modify some code inside Solr?
>
> I saw something like TikaCLI -F in the Tika forum (http://www.lucidimagination.com/search/document/242ce3a17f30f466/boilerpipe_integration). Is that the right way?

You need to add the BoilerpipeContentHandler into Tika's content handler chain. Which I'm pretty sure means you'd need to modify Solr, e.g. (in trunk) the TikaEntityProcessor.getHtmlHandler() method.

I'd try something like:

    return new BoilerpipeContentHandler(new ContentHandlerDecorator(

Though from a quick look at that code, I'm curious why it doesn't use BodyContentHandler, versus the current ContentHandlerDecorator.

-- Ken

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g
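For readers unfamiliar with what "adding a handler into the content handler chain" means, here is a minimal sketch using only the SAX classes that ship with the JDK. It is not Tika or Solr code: the class names (HandlerChainSketch, TextCollector, NavDroppingHandler) are made up for illustration, and the "drop everything inside <nav>" rule is only a crude stand-in for BoilerPipe's real boilerplate detection. The point is the wrapping pattern: the outer handler sees every parse event first and decides what the inner handler receives.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class HandlerChainSketch {

    /** Innermost handler: collects whatever character data reaches it. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /**
     * Hypothetical filtering handler (stand-in for a boilerplate filter):
     * swallows all events inside <nav> elements, forwards everything else.
     */
    static class NavDroppingHandler extends DefaultHandler {
        private final ContentHandler delegate;
        private int navDepth = 0; // > 0 while inside a <nav> subtree

        NavDroppingHandler(ContentHandler delegate) {
            this.delegate = delegate;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) throws SAXException {
            if (navDepth > 0 || "nav".equals(qName)) {
                navDepth++;
                return;
            }
            delegate.startElement(uri, localName, qName, atts);
        }

        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException {
            if (navDepth > 0) {
                navDepth--;
                return;
            }
            delegate.endElement(uri, localName, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            if (navDepth == 0) {
                delegate.characters(ch, start, length);
            }
        }
    }

    /** Parses well-formed XHTML through the chain: filter -> collector. */
    static String extract(String xhtml) throws Exception {
        TextCollector collector = new TextCollector();
        NavDroppingHandler chain = new NavDroppingHandler(collector);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), chain);
        return collector.text.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<nav><a href=\"/\">Home</a> | About</nav>"
                + "<p>Main article text.</p>"
                + "</body></html>";
        System.out.println(extract(page)); // prints "Main article text."
    }
}
```

With Tika on the classpath, the analogous wrap would presumably put BoilerpipeContentHandler around an inner handler such as BodyContentHandler, as suggested in the message above. Whether that drops cleanly into TikaEntityProcessor.getHtmlHandler() has not been checked here.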
Re: boilerpipe solr tika howto please
There is another way to ingest data, using DIH. Check out the HTMLStripTransformer:

    <entity ... url="http://www2c.cdc.gov/podcasts/createrss.asp?t=r&amp;c=19" processor="XPathEntityProcessor" forEach="/rss/channel | /rss/channel/item" transformer="DateFormatTransformer,HTMLStripTransformer">
Re: boilerpipe solr tika howto please
I just looked at TagSoup, and it seems to clean up bad HTML tags to produce a well-formed HTML file.

What BoilerPipe does is different: it tries to eliminate HTML content that is not part of the useful content for a human reader (i.e. navigation, ads, comments...). Take a look here: http://boilerpipe-web.appspot.com/ and try it with one of your URLs.

Another application of this type is 'Readability', which is aimed more at end users (http://lab.arc90.com/experiments/readability/).

On 14.01.2011 at 16:51, Adam Estrada wrote:

> Is there a drastic difference between this and TagSoup which is already included in Solr?
Re: boilerpipe solr tika howto please
Is there a drastic difference between this and TagSoup, which is already included in Solr?

On Fri, Jan 14, 2011 at 6:57 AM, arnaud gaudinat wrote:

> Hello,
>
> I would like to use BoilerPipe (a very good program which cleans surplus "clutter" from HTML content).
> I saw that BoilerPipe is inside Tika 0.8 and so should be accessible from Solr, am I right?
>
> How can I activate BoilerPipe in Solr? Do I need to change solrconfig.xml (with org.apache.solr.handler.extraction.ExtractingRequestHandler)?
>
> Or do I need to modify some code inside Solr?
>
> I saw something like TikaCLI -F in the Tika forum (http://www.lucidimagination.com/search/document/242ce3a17f30f466/boilerpipe_integration). Is that the right way?
>
> Thanks in advance,
>
> Arno.