Re: Can Nutch index/parse targeted sections of a web page?

Andrzej Bialecki Thu, 28 Oct 2010 04:02:22 -0700

On 2010-10-28 12:41, Andrew McCombe wrote:
> Hi
> 
> I've successfully set up Nutch with Solr to index a client's site.  However,
> it has indexed all the text content on every page, so the Solr search results
> are getting polluted by terms from the site navigation, footer, etc.
> 
> What is the correct way to index the content from a particular div on a
> page?


You need to implement an HtmlParseFilter plugin (take a look at the
creativecommons plugin for inspiration). In its filter(..) method,
traverse the DOM tree, extract only the portions that you are
interested in, and then replace the text in ParseResult with your version.
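
As a rough illustration of the traversal step only, here is a minimal
sketch that uses nothing but the JDK's org.w3c.dom classes. The class
name DivTextExtractor, its method names, and the div id "content" are
all made up for this example and are not part of Nutch. The sketch
assumes the div you want can be identified by its id attribute. The
extractFromXhtml helper exists only so the sketch can be run on its own
against a well-formed XHTML string; inside a real filter(..) you would
pass the DOM node that Nutch hands you to extract(..) instead.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not a Nutch class: finds one <div> by id and
// returns only the text beneath it.
public class DivTextExtractor {

    // Depth-first search for the first <div> whose id equals divId.
    // The tag comparison ignores case because some HTML parsers report
    // element names in upper case.
    static Element findDiv(Node node, String divId) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            if ("div".equalsIgnoreCase(el.getNodeName())
                    && divId.equals(el.getAttribute("id"))) {
                return el;
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            Element hit = findDiv(c, divId);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    // Appends every non-blank text node under 'node' to 'out',
    // separating the pieces with single spaces.
    static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String t = node.getNodeValue().trim();
            if (!t.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(t);
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectText(c, out);
        }
    }

    // Returns the text of the div with the given id, or null if the
    // document has no such div.
    static String extract(Node root, String divId) {
        Element div = findDiv(root, divId);
        if (div == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        collectText(div, sb);
        return sb.toString();
    }

    // Demo-only convenience: parses a well-formed XHTML string with the
    // JDK's XML parser. Real-world HTML is usually not well-formed XML.
    static String extractFromXhtml(String xhtml, String divId) {
        try {
            Node root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            return extract(root, divId);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<div id=\"nav\">Home About</div>"
                + "<div id=\"content\"><p>Real <b>article</b> text</p></div>"
                + "<div id=\"footer\">Copyright</div>"
                + "</body></html>";
        System.out.println(extractFromXhtml(page, "content"));
    }
}
```

The string returned by extract(..) is what you would then put into
ParseResult in place of the original text. That wiring depends on the
Nutch classes of the version you run, so it is left out of the sketch.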

Hopefully this kind of functionality will be improved soon through the
use of Boilerpipe in Tika, but this has yet to be integrated into both
Tika and Nutch.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
