Re: boilerpipe solr tika howto please

2011-01-17 Thread arnaud gaudinat

Thanks Ken,
this is what I wanted to know. I'm not very familiar with this kind of
modification, but I will try it and ask you for more information
if needed.

regards,

Arno


Re: boilerpipe solr tika howto please

2011-01-14 Thread Ken Krugler

Hi Arno,

On Jan 14, 2011, at 3:57am, arnaud gaudinat wrote:

> Hello,
>
> I would like to use BoilerPipe (a very good program which cleans surplus
> "clutter" from HTML content).
> I saw that BoilerPipe is included in Tika 0.8 and so should be accessible
> from Solr, am I right?
>
> How can I activate BoilerPipe in Solr? Do I need to change solrconfig.xml
> (with org.apache.solr.handler.extraction.ExtractingRequestHandler)?
>
> Or do I need to modify some code inside Solr?
>
> I saw something like TikaCLI -F in the Tika forum
> (http://www.lucidimagination.com/search/document/242ce3a17f30f466/boilerpipe_integration).
> Is that the right way?


You need to add the BoilerpipeContentHandler into Tika's content
handler chain.

I'm pretty sure that means you'd need to modify Solr, e.g. (in trunk)
the TikaEntityProcessor.getHtmlHandler() method. I'd try something like:

    return new BoilerpipeContentHandler(new ContentHandlerDecorator(...));

Though from a quick look at that code, I'm curious why it doesn't use
BodyContentHandler, versus the current ContentHandlerDecorator.
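
To illustrate what "adding a handler into the content handler chain" means,
here is a minimal sketch that uses only the JDK's SAX classes, so it runs
without Tika or Solr. The names HandlerChainSketch, TextCollector and
ClutterFilter are hypothetical: TextCollector stands in for a text-collecting
handler such as BodyContentHandler, and ClutterFilter stands in for
BoilerpipeContentHandler. The filter simply drops everything inside <nav>,
which is far cruder than what Boilerpipe does; only the wrapping pattern is
the point.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class HandlerChainSketch {

    /** Innermost handler: collects character data (stand-in for a body text handler). */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            // Collapse runs of whitespace so the result is easy to compare.
            return text.toString().trim().replaceAll("\\s+", " ");
        }
    }

    /** Hypothetical outer handler: forwards SAX events, except those inside <nav>. */
    static class ClutterFilter extends DefaultHandler {
        private final ContentHandler delegate;
        private int skipDepth = 0; // > 0 while we are inside a skipped element

        ClutterFilter(ContentHandler delegate) {
            this.delegate = delegate;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            if (skipDepth > 0 || "nav".equals(qName)) {
                skipDepth++;
                return;
            }
            delegate.startElement(uri, local, qName, atts);
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            if (skipDepth > 0) {
                skipDepth--;
                return;
            }
            delegate.endElement(uri, local, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (skipDepth == 0) {
                delegate.characters(ch, start, length);
            }
        }
    }

    /** Parses well-formed XHTML through the chain: parser -> ClutterFilter -> TextCollector. */
    static String extract(String xhtml) throws Exception {
        TextCollector collector = new TextCollector();
        ContentHandler chain = new ClutterFilter(collector); // the wrapping step
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(chain);
        reader.parse(new InputSource(new StringReader(xhtml)));
        return collector.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><nav>Home | About</nav>"
                + "<p>Actual article text.</p></body></html>";
        System.out.println(extract(page));
    }
}
```

In Solr the parser would be Tika's HTML parser and the outer handler would
be BoilerpipeContentHandler wrapping whatever handler getHtmlHandler()
currently returns; the exact constructor arguments depend on the Tika
version in use.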


-- Ken

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







Re: boilerpipe solr tika howto please

2011-01-14 Thread Adam Estrada
There is another way to ingest data using DIH. Check out the
HTMLStripTransformer

  <entity ...
          url="http://www2c.cdc.gov/podcasts/createrss.asp?t=r&amp;c=19"
          processor="XPathEntityProcessor"
          forEach="/rss/channel | /rss/channel/item"
          transformer="DateFormatTransformer,HTMLStripTransformer">
    ...
  </entity>



Re: boilerpipe solr tika howto please

2011-01-14 Thread arnaud gaudinat
I just looked at TagSoup, and it seems to clean up bad HTML tags to
produce a well-formed HTML file.
BoilerPipe does something different: it tries to eliminate HTML content
that is not part of the useful content for a human reader (i.e.
navigation, ads, comments...).
Take a look here: http://boilerpipe-web.appspot.com/ and try it with one
of your URLs.

Another application of this type is 'Readability', which is aimed more
at end users (http://lab.arc90.com/experiments/readability/).
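
To make the difference concrete, here is a toy sketch of the kind of
decision a boilerplate remover has to make. It is not Boilerpipe's actual
algorithm: it uses a single feature, the share of a block's words that sit
inside links, and the ClutterSketch and Block names as well as the 0.5
threshold are assumptions chosen only for illustration. A tag cleaner such
as TagSoup would keep all three blocks below, since they are all valid
text; a boilerplate remover tries to keep only the article sentence.

```java
import java.util.ArrayList;
import java.util.List;

public class ClutterSketch {

    /** A text block plus the number of its words that are inside links (hypothetical model). */
    static final class Block {
        final String text;
        final int linkedWords;

        Block(String text, int linkedWords) {
            this.text = text;
            this.linkedWords = linkedWords;
        }

        int wordCount() {
            String t = text.trim();
            return t.isEmpty() ? 0 : t.split("\\s+").length;
        }

        /** Fraction of words that are link text; empty blocks count as pure clutter. */
        double linkDensity() {
            int words = wordCount();
            return words == 0 ? 1.0 : (double) linkedWords / words;
        }
    }

    /** Keeps the text of blocks whose link density is below the given threshold. */
    static List<String> mainContent(List<Block> blocks, double maxLinkDensity) {
        List<String> kept = new ArrayList<>();
        for (Block b : blocks) {
            if (b.linkDensity() < maxLinkDensity) {
                kept.add(b.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Block> page = new ArrayList<>();
        page.add(new Block("Home News Sports Contact", 4)); // navigation bar
        page.add(new Block("The council voted on Tuesday to approve the new budget.", 0));
        page.add(new Block("Buy now Click here", 4)); // advertisement links
        System.out.println(mainContent(page, 0.5));
    }
}
```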




Re: boilerpipe solr tika howto please

2011-01-14 Thread Adam Estrada
Is there a drastic difference between this and TagSoup which is already
included in Solr?

On Fri, Jan 14, 2011 at 6:57 AM, arnaud gaudinat
wrote:

> Hello,
>
> I would like to use BoilerPipe (a very good program which cleans surplus
> "clutter" from HTML content).
> I saw that BoilerPipe is included in Tika 0.8 and so should be accessible
> from Solr, am I right?
>
> How can I activate BoilerPipe in Solr? Do I need to change solrconfig.xml
> (with org.apache.solr.handler.extraction.ExtractingRequestHandler)?
>
> Or do I need to modify some code inside Solr?
>
> I saw something like TikaCLI -F in the Tika forum
> (http://www.lucidimagination.com/search/document/242ce3a17f30f466/boilerpipe_integration).
> Is that the right way?
>
> Thanks in advance,
>
> Arno.
>
>