Hello Megha,

Upgrade to 1.12 and try again.

Markus

-----Original message-----
> From:Megha Bhandari <mbhanda...@sapient.com>
> Sent: Friday 8th July 2016 16:28
> To: user@nutch.apache.org
> Subject: Nutch 1.11 | Ignoring content header and footer content while
> parsing HTML
>
> Hi
>
> Read a couple of threads that suggest that we can use Tika's boilerplate
> content handler to ignore content like header and footer in Nutch.
>
> Tried the below configurations in nutch-site.xml (Nutch 1.11). However we
> can still see header and footer content getting extracted.
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-(http|httpclient)|urlfilter-regex|headings|parse-(html|tika|metatags)|index-(basic|metadata)|indexer-solr|urlnormalizer-(pass|regex|basic)|language-identifier</value>
> </property>
>
> <property>
>   <name>parser.html.NodesToExclude</name>
>   <value>div;class;navigation-wrapper|footer;class;main-footer|div;class;header|div;id;uhc-top-nav-menu</value>
> </property>
> <property>
>   <name>tika.use_boilerpipe</name>
>   <value>true</value>
> </property>
> <property>
>   <name>tika.boilerpipe.extractor</name>
>   <value>ArticleExtractor</value>
> </property>
>
> Anything we are missing here?
>
> Regards
> Megha