On 2010-10-28 12:41, Andrew McCombe wrote:
> Hi
> 
> I've successfully set up Nutch with Solr to index a client's site.  However,
> it has indexed all text content on every page, so Solr search results are
> getting polluted by terms in the site navigation, footer, etc.
> 
> What is the correct way to index the content from a particular div on a
> page?

You need to implement an HtmlParseFilter plugin (take a look at the
creativecommons plugin for inspiration). In its filter(..) method,
traverse the DOM tree, extract only the portions that you are
interested in, and then replace the text in ParseResult with your version.
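
The traversal step could look roughly like the sketch below. It uses only
the JDK's org.w3c.dom classes so that it runs on its own; the class name
DivTextExtractor, its method names, and the "content" div id are made up
for illustration and are not part of Nutch. The plugin wiring (the
filter(..) method itself and putting the extracted text back into
ParseResult) is left out, since it depends on your Nutch version.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: not a Nutch class, just the DOM-walking part.
public class DivTextExtractor {

    // Depth-first search for the first <div> whose id attribute equals divId.
    // The tag name is compared case-insensitively because HTML parsers
    // may report element names in upper case.
    public static Node findDiv(Node node, String divId) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            if ("div".equalsIgnoreCase(el.getNodeName())
                    && divId.equals(el.getAttribute("id"))) {
                return el;
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            Node hit = findDiv(c, divId);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    // Appends the trimmed, non-empty text nodes below 'node' to 'out',
    // separated by single spaces.
    public static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String t = node.getNodeValue().trim();
            if (!t.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(t);
            }
            return;
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectText(c, out);
        }
    }

    // Returns the text inside the div with the given id, or null if absent.
    public static String extractDivText(Node root, String divId) {
        Node div = findDiv(root, divId);
        if (div == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        collectText(div, sb);
        return sb.toString();
    }

    // Test helper only: parses well-formed XHTML with the JDK XML parser.
    // Inside a plugin the DOM is handed to you already parsed.
    public static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<div id=\"nav\">Home About</div>"
                + "<div id=\"content\"><p>Real article text.</p></div>"
                + "<div id=\"footer\">Copyright</div>"
                + "</body></html>";
        // prints: Real article text.
        System.out.println(extractDivText(parse(page), "content"));
    }
}
```

Inside a filter you would pass the DOM node you are given as 'root'
instead of calling parse(..), and use the returned string as the
replacement text.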

Hopefully this will get easier soon through the use of Boilerpipe
in Tika, but that still has to be integrated into both Tika and
Nutch.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
