RE: HTML tag filtering

Markus Källander Wed, 12 Feb 2014 06:05:46 -0800

Hi,

The patch seems to fulfil my needs, but how do I use it with Nutch 1.7? Has the 
patch not been released yet?

Markus Källander

Mobile +46 73 622 0547

-----Original Message-----
From: Sebastian Nagel [mailto:wastl.na...@googlemail.com] 
Sent: 11 February 2014 17:44
To: user@nutch.apache.org
Subject: Re: HTML tag filtering

Hi Markus,

in short, you have to write a parse filter plugin whose filter(...) method 
does the following:
1. Traverse the DOM tree and construct a "clean" text by skipping certain 
content. See
 o.a.n.util.NodeWalker
 o.a.n.parse.html.DOMContentUtils.getTextHelper(...) (part of the parse-html 
plugin)
2. Then replace the old plain text in the ParseResult with the new "clean" text.
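
A minimal sketch of the traversal in step 1, using only the JDK's org.w3c.dom API so that it runs without Nutch on the classpath. The class name CleanTextExtractor, its skip sets, and the demo markup are assumptions made for illustration; they are not part of Nutch or of the NUTCH-585 patch. Wiring the result into a parse filter plugin (step 2) is not shown here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical helper: collects text from a DOM tree, skipping chosen elements. */
public class CleanTextExtractor {
    private final Set<String> skipIds;
    private final Set<String> skipClasses;

    public CleanTextExtractor(Set<String> skipIds, Set<String> skipClasses) {
        this.skipIds = skipIds;
        this.skipClasses = skipClasses;
    }

    /** Returns the text below root, with whitespace runs collapsed to one space. */
    public String extract(Node root) {
        StringBuilder sb = new StringBuilder();
        walk(root, sb);
        return sb.toString().trim().replaceAll("\\s+", " ");
    }

    // Depth-first walk; an excluded element is dropped together with its subtree.
    private void walk(Node node, StringBuilder sb) {
        if (node.getNodeType() == Node.ELEMENT_NODE && isSkipped((Element) node)) {
            return;
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            sb.append(node.getNodeValue()).append(' ');
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, sb);
        }
    }

    // getAttribute returns "" when the attribute is absent, so no null checks are needed.
    private boolean isSkipped(Element e) {
        if (skipIds.contains(e.getAttribute("id"))) {
            return true;
        }
        for (String cls : e.getAttribute("class").trim().split("\\s+")) {
            if (skipClasses.contains(cls)) {
                return true;
            }
        }
        return false;
    }

    /** Demo on a small well-formed snippet (assumed markup, parsed as XML). */
    public static String demo() {
        String xml = "<div><p>Keep this</p><div id=\"nav\">Menu</div>"
                + "<span class=\"ad big\">Buy</span><p>and this</p></div>";
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            CleanTextExtractor extractor = new CleanTextExtractor(
                    new HashSet<>(Arrays.asList("nav")),
                    new HashSet<>(Arrays.asList("ad")));
            return extractor.extract(doc.getDocumentElement());
        } catch (Exception ex) {
            throw new RuntimeException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "Keep this and this"
    }
}
```

In a real plugin the same walk would presumably run on the DOM that Nutch passes to the filter(...) method, and the returned string would replace the plain text in the ParseResult.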

Maybe this issue can help (there is also a patch but I'm not sure whether it's 
working and fulfills your needs):
 https://issues.apache.org/jira/browse/NUTCH-585

Sebastian

On 02/11/2014 04:24 PM, Markus Källander wrote:
> Hi,
> 
> How do I skip indexing of HTML tags with certain ids or CSS classes? I am 
> using Nutch 1.7.
> 
> Thanks
> Markus
>
