Re: HTML tag filtering

Sebastian Nagel Tue, 11 Feb 2014 08:48:25 -0800

Hi Markus,

In short, you have to write a parse filter plugin whose filter(...) method
does the following:
1. Traverse the DOM tree and construct a "clean" text
by skipping certain content. See
 o.a.n.util.NodeWalker
 o.a.n.parse.html.DOMContentUtils.getTextHelper(...) (part of the parse-html plugin)
2. Replace the old plain text in the ParseResult with the new "clean" text.
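
Step 1 could look roughly like the sketch below. It is not Nutch code: it
uses only the JDK's org.w3c.dom API, and the class name CleanTextExtractor,
its methods, and the skip sets are made up for illustration. In a real plugin
the DOM would be the one passed to filter(...), and the returned string would
be the "clean" text for step 2 (not shown here).

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: collects text from a DOM tree
// while skipping elements with certain ids or CSS classes.
final class CleanTextExtractor {

  // Parses well-formed XHTML into a DOM. This only stands in for the DOM
  // that a parse filter would receive; it is here to make the sketch runnable.
  static Document parseXml(String xml) throws Exception {
    return DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml)));
  }

  // Returns the text below root, with skipped subtrees left out and
  // text nodes joined by single spaces.
  static String extractText(Node root, Set<String> skipIds, Set<String> skipClasses) {
    StringBuilder sb = new StringBuilder();
    collect(root, skipIds, skipClasses, sb);
    return sb.toString();
  }

  private static void collect(Node node, Set<String> skipIds,
                              Set<String> skipClasses, StringBuilder sb) {
    if (node.getNodeType() == Node.ELEMENT_NODE) {
      Element el = (Element) node;
      // getAttribute returns "" when the attribute is absent.
      if (skipIds.contains(el.getAttribute("id"))) {
        return; // skip the whole subtree
      }
      String cls = el.getAttribute("class").trim();
      if (!cls.isEmpty()
          && Arrays.stream(cls.split("\\s+")).anyMatch(skipClasses::contains)) {
        return; // one of the element's classes is on the skip list
      }
    } else if (node.getNodeType() == Node.TEXT_NODE) {
      // Collapse whitespace and append, separated by a single space.
      String text = node.getNodeValue().replaceAll("\\s+", " ").trim();
      if (!text.isEmpty()) {
        if (sb.length() > 0) {
          sb.append(' ');
        }
        sb.append(text);
      }
      return;
    }
    // Recurse into children (documents, fragments, and kept elements).
    for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
      collect(child, skipIds, skipClasses, sb);
    }
  }
}
```

For example, extractText(doc, Set.of("nav"), Set.of("ad")) would leave out
everything inside an element with id="nav" or with "ad" among its classes.
Joining text nodes with a space is a simplification; it may split words that
are broken up by inline tags.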


Maybe this issue can help (there is also a patch, but I'm not sure whether it
works and fulfills your needs):
 https://issues.apache.org/jira/browse/NUTCH-585

Sebastian

On 02/11/2014 04:24 PM, Markus Källander wrote:
> Hi,
> 
> How do I skip indexing of HTML tags with certain ids or CSS classes? I am 
> using Nutch 1.7.
> 
> Thanks
> Markus
>
