On 2010-10-28 12:41, Andrew McCombe wrote:
> Hi
>
> I've successfully set up nutch with solr to index a clients site. However,
> it has indexed all text content on every page and so solr search results are
> getting polluted by terms in the site navigation, footer etc.
>
> What is the correct way to index the content from a particular div on a
> page?

You need to implement an HtmlParseFilter plugin (take a look at the
creativecommons plugin for inspiration). In the filter(..) method, traverse
the DOM tree, extract only the portions that you are interested in, and then
replace the text in ParseResult with your version.

Hopefully this kind of functionality will be improved soon through the use
of Boilerpipe in Tika, but this has yet to be integrated into both Tika and
Nutch.

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
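
As a rough illustration of the DOM traversal step described in the reply, here is a minimal sketch that uses only the JDK's org.w3c.dom classes. It is not Nutch code: the class name DivTextExtractor, its methods, and the div id "content" are all hypothetical names chosen for the example. In a real plugin, the DOM node that Nutch hands to filter(..) would presumably be passed to a helper like extract(), and writing the result back into ParseResult is left out here because it depends on the Nutch version in use.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: pulls the text out of one <div>.
public class DivTextExtractor {

    // Depth-first search for the first <div> whose id attribute equals divId.
    // The tag name is compared case-insensitively because HTML parsers differ
    // in whether they report element names in upper or lower case.
    static Node findDiv(Node node, String divId) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            if ("div".equalsIgnoreCase(el.getNodeName())
                    && divId.equals(el.getAttribute("id"))) {
                return el;
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            Node hit = findDiv(c, divId);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    // Appends every non-blank text node below 'node', separated by single spaces.
    static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String t = node.getNodeValue().trim();
            if (!t.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(t);
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectText(c, out);
        }
    }

    // Returns the text inside the div with the given id, or null if no such div exists.
    static String extract(Node root, String divId) {
        Node div = findDiv(root, divId);
        if (div == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        collectText(div, sb);
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Well-formed sample page so the JDK XML parser can stand in for an HTML parser.
        String page = "<html><body>"
                + "<div id=\"nav\">Home About</div>"
                + "<div id=\"content\"><p>Real <b>page</b> text</p></div>"
                + "<div id=\"footer\">Copyright</div>"
                + "</body></html>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(page)));
        System.out.println(extract(doc, "content"));
    }
}
```

The search and the text collection are kept as two separate passes so that the selection rule (here, a div id) can be swapped for a class name or any other test without touching the text-gathering code.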

