Hi Markus,

in short, you have to write a parse filter plugin whose filter(...) method does the following:

1. Traverse the DOM tree and construct a "clean" text by skipping certain content. See o.a.n.util.NodeWalker and o.a.n.parse.html.DOMContentUtils.getTextHelper(...) (part of the parse-html plugin).
2. Replace the old plain text in the ParseResult with the new "clean" text.
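
To make step 1 more concrete, here is a rough, standalone sketch of the traversal using only the JDK's org.w3c.dom classes. It is not taken from Nutch or from the patch mentioned in this thread: the class name CleanTextExtractor, its methods, and the example ids and classes ("nav", "ads") are all made up for illustration. In a real plugin the root node would presumably be the DocumentFragment handed to filter(...), and the returned string would then replace the text in the ParseResult (step 2), which is not shown here.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: collects text from a DOM tree
// while skipping elements with blacklisted ids or CSS classes.
class CleanTextExtractor {

    /** True if the element's id, or any of its CSS classes, is blacklisted. */
    static boolean isSkipped(Element el, Set<String> skipIds, Set<String> skipClasses) {
        // getAttribute returns "" (never null) when the attribute is absent.
        if (skipIds.contains(el.getAttribute("id"))) {
            return true;
        }
        String classAttr = el.getAttribute("class").trim();
        if (classAttr.isEmpty()) {
            return false;
        }
        return Arrays.stream(classAttr.split("\\s+")).anyMatch(skipClasses::contains);
    }

    /** Depth-first walk; a skipped element drops its whole subtree. */
    static void collect(Node node, Set<String> skipIds, Set<String> skipClasses,
                        StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && isSkipped((Element) node, skipIds, skipClasses)) {
            return;
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' '); // single space between text chunks
                }
                out.append(text);
            }
            return;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, skipIds, skipClasses, out);
        }
    }

    /** Builds the "clean" text for the tree rooted at the given node. */
    static String extract(Node root, Set<String> skipIds, Set<String> skipClasses) {
        StringBuilder out = new StringBuilder();
        collect(root, skipIds, skipClasses, out);
        return out.toString();
    }

    /** Demo-only helper: parses well-formed XHTML so the sketch runs without Nutch. */
    static Document parseXml(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseXml("<html><body><div id=\"nav\">Home | About</div>"
                + "<p>Main article text.</p>"
                + "<div class=\"ads sidebar\">Buy now</div></body></html>");
        // Skips the element with id "nav" and any element carrying the class "ads".
        System.out.println(extract(doc, Set.of("nav"), Set.of("ads")));
    }
}
```

The sketch recurses by hand instead of using NodeWalker so that it has no Nutch dependency; inside a plugin, NodeWalker would probably be the more natural choice for the traversal.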

Maybe this issue can help (there is also a patch, but I'm not sure whether it works and fulfills your needs):
https://issues.apache.org/jira/browse/NUTCH-585

Sebastian

On 02/11/2014 04:24 PM, Markus Källander wrote:
> Hi,
>
> How do I skip indexing of HTML tags with certain ids or CSS classes? I am
> using Nutch 1.7.
>
> Thanks
> Markus