Hello - Apple's homepage should not yield any text with ArticleExtractor. But 
you have both parse-html and parse-tika enabled, so I suspect parse-tika is not 
configured to run for HTML or XHTML pages. Check your parse-plugins.xml 
configuration file.
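
For illustration, here is a minimal sketch of the mapping to check in 
parse-plugins.xml. It assumes the stock parse-tika alias from the default file 
is still present in the <aliases> section. The two mimeType entries are an 
assumption about what would route HTML and XHTML to parse-tika instead of 
parse-html, so adapt them to whatever your file currently contains.

```xml
<!-- Sketch only: map HTML and XHTML to parse-tika so that the Boilerpipe
     extractor selected via tika.extractor can be applied to these pages. -->
<mimeType name="text/html">
  <plugin id="parse-tika" />
</mimeType>

<mimeType name="application/xhtml+xml">
  <plugin id="parse-tika" />
</mimeType>
```

If these MIME types still point at parse-html, the Tika parser and its 
Boilerpipe settings would likely never be used for HTML pages.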

Don't expect too much from Boilerpipe, although it's the best open source 
solution right now. It usually works reasonably well for typical article-like pages.

Markus
 
 
-----Original message-----
> From:Manish Verma <m_ve...@apple.com>
> Sent: Wednesday 29th June 2016 19:36
> To: user@nutch.apache.org
> Subject: Re: Remove Header from content
> 
> It does not seem to be working for me; I tried all three Boilerpipe algorithms.
> 
> I tried with apple.com <http://apple.com/> but the content still contains the header; 
> my header starts with this: <nav id="ac-globalnav"
> 
> I added the following to my nutch-site.xml, with the default plugins included:
>  
> <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> 
> 
> <property>
>   <name>tika.extractor</name>
>   <value>boilerpipe</value>
>   <description>
>   Which text extraction algorithm to use. Valid values are: boilerpipe or
>   none.
>   </description>
> </property>
>  
> <property> 
>   <name>tika.extractor.boilerpipe.algorithm</name>
>   <value>CanolaExtractor</value>
>   <description> 
>   Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor,
>   ArticleExtractor or CanolaExtractor.
>   </description>
> </property>
> 
> Am I missing something here?
> 
> 
> Regards,
> Manish Verma
> AML Search
> 
> > On Jun 29, 2016, at 3:06 AM, Markus Jelsma <markus.jel...@openindex.io> 
> > wrote:
> > 
> > Manish - you're in luck. Nutch 1.12 was released and has Boilerpipe 
> > support. Check:
> > https://issues.apache.org/jira/browse/NUTCH-961
> > 
> > Markus
> > 
> > 
> > 
> > -----Original message-----
> >> From:Manish Verma <m_ve...@apple.com>
> >> Sent: Tuesday 28th June 2016 23:46
> >> To: user@nutch.apache.org
> >> Subject: Remove Header from content
> >> 
> >> Hi,
> >> 
> >> I don’t want to index the header and footer of the content. I know we can make 
> >> changes in HtmlParser.java, but I don’t want to change Nutch core code. Is 
> >> there any other way (e.g. a plugin) to eliminate the header div from the content?
> >> 
> >> Thanks MV
> >> 
> >> 
> 
> 
