RE: Nutch 1.11 | Ignoring content header and footer content while parsing HTML

Markus Jelsma Fri, 08 Jul 2016 08:08:13 -0700

Hello Megha - upgrade to 1.12 and try again.
Markus

-----Original message-----
> From: Megha Bhandari <mbhanda...@sapient.com>
> Sent: Friday 8th July 2016 16:28
> To: user@nutch.apache.org
> Subject: Nutch 1.11 | Ignoring content header and footer content while parsing HTML
> 
> Hi,
> 
> I read a couple of threads suggesting that Tika's boilerplate content 
> handler can be used to ignore content such as headers and footers in Nutch.
> 
> We tried the configuration below in nutch-site.xml (Nutch 1.11). However, 
> header and footer content is still being extracted.
> 
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-(http|httpclient)|urlfilter-regex|headings|parse-(html|tika|metatags)|index-(basic|metadata)|indexer-solr|urlnormalizer-(pass|regex|basic)|language-identifier</value>
> </property>
> 
> <property>
>   <name>parser.html.NodesToExclude</name>
>   <value>div;class;navigation-wrapper|footer;class;main-footer|div;class;header|div;id;uhc-top-nav-menu</value>
> </property>
> <property>
>   <name>tika.use_boilerpipe</name>
>   <value>true</value>
> </property>
> <property>
>   <name>tika.boilerpipe.extractor</name>
>   <value>ArticleExtractor</value>
> </property>
> 
> Is there anything we are missing here?
> 
> Regards
> Megha
>
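
A side note on the quoted configuration: the parser.html.NodesToExclude value looks like a "|"-separated list of tag;attribute;value triples. The Python sketch below only illustrates that assumed format so each entry can be checked for typos before crawling. It is not how Nutch itself reads the property, and the names parse_nodes_to_exclude and NodeRule are hypothetical helpers, not part of Nutch or Tika.

```python
from typing import List, NamedTuple


class NodeRule(NamedTuple):
    """One exclusion entry (hypothetical helper type, not a Nutch class)."""
    tag: str
    attribute: str
    value: str


def parse_nodes_to_exclude(raw: str) -> List[NodeRule]:
    """Split a value of the assumed form 'tag;attr;value|tag;attr;value|...'."""
    rules = []
    for entry in raw.split("|"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing or doubled '|'
        parts = [p.strip() for p in entry.split(";")]
        if len(parts) != 3:
            raise ValueError(f"expected tag;attribute;value, got {entry!r}")
        rules.append(NodeRule(*parts))
    return rules


# The value from the quoted nutch-site.xml.
value = (
    "div;class;navigation-wrapper|footer;class;main-footer"
    "|div;class;header|div;id;uhc-top-nav-menu"
)

for rule in parse_nodes_to_exclude(value):
    print(f"<{rule.tag} {rule.attribute}=\"{rule.value}\">")
```

Each printed line can then be compared against the actual markup of the crawled pages; an entry whose tag, attribute, or value does not match the page exactly would not be expected to exclude anything.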
