Hi,

The patch seems to fulfil my needs, but how do I use it with Nutch 1.7? Has the patch not been released yet?
Markus Källander
Mobile +46 73 622 0547

-----Original Message-----
From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
Sent: 11 February 2014 17:44
To: user@nutch.apache.org
Subject: Re: HTML tag filtering

Hi Markus,

in short, you have to write a parse filter plugin which does the
following in its filter(...) method:
1. traverse the DOM tree and construct a "clean" text
   by skipping certain content. See
     o.a.n.utils.NodeWalker
     o.a.n.parse.html.DOMContentUtils.getTextHelper(...)
       (part of parse-html plugin)
2. then replace the old plain text in ParseResult
   with the new "clean" text

Maybe this issue can help (there is also a patch, but I'm not sure
whether it works and fulfills your needs):
  https://issues.apache.org/jira/browse/NUTCH-585

Sebastian

On 02/11/2014 04:24 PM, Markus Källander wrote:
> Hi,
>
> How do I skip indexing of HTML tags with certain ids or CSS classes? I am
> using Nutch 1.7.
>
> Thanks
> Markus
>
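
For reference, step 1 of Sebastian's outline (walk the DOM, skip unwanted elements, collect the remaining text) could look roughly like the sketch below. It uses only the JDK's org.w3c.dom classes and does not touch the Nutch API. The class name CleanTextSketch and its method names are made up for illustration, and the input is assumed to be well-formed markup. Inside a real plugin the DOM would come from the filter(...) call, and the resulting string would then replace the plain text in ParseResult (step 2, not shown).

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: builds "clean" text from a DOM
// while skipping elements with certain ids or CSS classes.
class CleanTextSketch {

    /** True if the element has a blocked id or at least one blocked CSS class. */
    static boolean isSkipped(Element el, Set<String> skipIds, Set<String> skipClasses) {
        // getAttribute returns "" when the attribute is absent
        if (skipIds.contains(el.getAttribute("id"))) {
            return true;
        }
        for (String cls : el.getAttribute("class").trim().split("\\s+")) {
            if (skipClasses.contains(cls)) {
                return true;
            }
        }
        return false;
    }

    /** Depth-first walk that appends text nodes and prunes skipped subtrees. */
    static void collectText(Node node, Set<String> skipIds, Set<String> skipClasses,
                            StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && isSkipped((Element) node, skipIds, skipClasses)) {
            return; // drop this element and everything below it
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, skipIds, skipClasses, out);
        }
    }

    /** Parses well-formed markup and returns the text outside skipped elements. */
    static String cleanText(String xhtml, Set<String> skipIds, Set<String> skipClasses)
            throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        StringBuilder out = new StringBuilder();
        collectText(doc.getDocumentElement(), skipIds, skipClasses, out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><div id=\"nav\">Menu</div><p>Real content</p>"
                + "<div class=\"ad box\">Buy now</div><p>More text</p></body></html>";
        // prints: Real content More text
        System.out.println(cleanText(page, Set.of("nav"), Set.of("ad")));
    }
}
```

The walk prunes a whole subtree as soon as it meets a blocked id or class, so nested content inside such an element never reaches the output. Set.of requires Java 9 or later; on older JVMs a HashSet would do the same job.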