Hello - Apple's homepage should not yield any text with ArticleExtractor. But you have both parse-html and parse-tika enabled, so I suspect parse-tika is not configured to run for HTML or XHTML pages. Check your parse-plugins.xml configuration file.
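For reference, a minimal sketch of what the relevant mappings in conf/parse-plugins.xml might look like. It assumes the stock alias for parse-tika, so check the extension-id against your own copy of the file. The idea is to route text/html and application/xhtml+xml to parse-tika instead of parse-html, so that the tika.extractor settings are actually applied:

```xml
<parse-plugins>
  <!-- Send HTML to parse-tika so the Boilerpipe extractor is used;
       the default configuration normally lists parse-html here. -->
  <mimeType name="text/html">
    <plugin id="parse-tika" />
  </mimeType>

  <!-- Same for XHTML pages. -->
  <mimeType name="application/xhtml+xml">
    <plugin id="parse-tika" />
  </mimeType>

  <aliases>
    <!-- Assumed to match the alias shipped with Nutch; verify locally. -->
    <alias name="parse-tika"
           extension-id="org.apache.nutch.parse.tika.TikaParser" />
  </aliases>
</parse-plugins>
```

Only the affected entries are shown here; any other mimeType mappings and aliases already in your file should stay as they are.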
Don't expect too much from Boilerpipe, although it's the best open source solution right now. It usually works, more or less, for typical article-like pages.

Markus

-----Original message-----
> From:Manish Verma <m_ve...@apple.com>
> Sent: Wednesday 29th June 2016 19:36
> To: user@nutch.apache.org
> Subject: Re: Remove Header from content
>
> It does not seem to be working for me; I tried with all three Boilerpipe algorithms.
>
> Tried with apple.com <http://apple.com/> but content still has header stuff,
> my header starts with this: <nav id="ac-globalnav"
>
> Added below in my nutch-site.xml with default plugin included
>
> <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>
> <property>
>   <name>tika.extractor</name>
>   <value>boilerpipe</value>
>   <description>
>   Which text extraction algorithm to use. Valid values are: boilerpipe or
>   none.
>   </description>
> </property>
>
> <property>
>   <name>tika.extractor.boilerpipe.algorithm</name>
>   <value>CanolaExtractor</value>
>   <description>
>   Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor,
>   ArticleExtractor
>   or CanolaExtractor.
>   </description>
> </property>
>
> Am I missing something here?
>
> Regards,
> Manish Verma
> AML Search
>
> > On Jun 29, 2016, at 3:06 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >
> > Manish - you're in luck. Nutch 1.12 was released and has Boilerpipe
> > support. Check:
> > https://issues.apache.org/jira/browse/NUTCH-961
> >
> > Markus
> >
> > -----Original message-----
> >> From:Manish Verma <m_ve...@apple.com>
> >> Sent: Tuesday 28th June 2016 23:46
> >> To: user@nutch.apache.org
> >> Subject: Remove Header from content
> >>
> >> Hi,
> >>
> >> I don’t want to index the header and footer of content. I know we can make
> >> changes in HtmlParser.java, but I don’t want to change Nutch core code. Is
> >> there any other way (plugin) to eliminate the header div from content?
> >>
> >> Thanks MV