Hello,

Check out NUTCH-961, which adds Boilerpipe support to Nutch's Tika parser. It's crude, but it works reasonably well.

https://issues.apache.org/jira/browse/NUTCH-961
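
For illustration, a rough sketch of what enabling it might look like in conf/nutch-site.xml. The property names below are the ones I would expect from nutch-default.xml in builds that include this change; treat them as assumptions and check them against your own nutch-default.xml, since they may differ or be absent in your version. It also assumes that parse-tika, rather than parse-html, is the plugin handling text/html (see plugin.includes and conf/parse-plugins.xml).

```xml
<!-- Sketch only: goes inside <configuration> in conf/nutch-site.xml.
     Property names are assumed; verify them in nutch-default.xml. -->
<property>
  <name>tika.extractor</name>
  <!-- assumed values: none or boilerpipe -->
  <value>boilerpipe</value>
</property>
<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <!-- assumed values: DefaultExtractor, ArticleExtractor, CanolaExtractor -->
  <value>ArticleExtractor</value>
</property>
```

If your build does not know these properties, it probably does not include the change, and you would need a source build with the patch applied.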
Markus

-----Original message-----
> From:Richardson, Jacquelyn F. <fluke...@ornl.gov>
> Sent: Thursday 26th March 2015 16:20
> To: user@nutch.apache.org
> Subject: Ignore navigation during index
>
> Hi,
>
> Is there a way to tell nutch to ignore the navigation or footer parts of an
> html page during the crawl process? Specifically I do not want the
> information in the navigation or footer to be indexed. My environment is
> Windows 7 with Cygwin, Java 1.7, nutch 1.9 (binary not source) and solr 4.7.
>
> Any assistance will be greatly appreciated.
>
> Thanks,
> Jackie