Hi Jorge - perhaps I am missing something, but the LinkDB cannot hold content-derived
information such as similarity hashes, nor does it cluster similar
URLs as you would want when detecting spider traps.
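
To illustrate what I mean by a content-derived similarity hash: something like a
64-bit simhash computed from the fetched page text, so that near-duplicate pages
end up a small Hamming distance apart. A rough, self-contained sketch (illustrative
only, not a Nutch API):

// Illustrative 64-bit simhash over whitespace tokens; near-duplicate
// pages yield hashes with a small Hamming distance.
public class SimHash {

  public static long simhash(String text) {
    int[] v = new int[64];
    for (String token : text.toLowerCase().split("\\s+")) {
      long h = fnv1a64(token);
      for (int i = 0; i < 64; i++) {
        v[i] += ((h >>> i) & 1L) == 1L ? 1 : -1;
      }
    }
    long hash = 0L;
    for (int i = 0; i < 64; i++) {
      if (v[i] > 0) {
        hash |= 1L << i;
      }
    }
    return hash;
  }

  // FNV-1a, a simple 64-bit string hash
  private static long fnv1a64(String s) {
    long h = 0xcbf29ce484222325L;
    for (int i = 0; i < s.length(); i++) {
      h ^= s.charAt(i);
      h *= 0x100000001b3L;
    }
    return h;
  }

  // A small distance (e.g. <= 3) suggests near-duplicate content.
  public static int hammingDistance(long a, long b) {
    return Long.bitCount(a ^ b);
  }
}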

What do you think?

Markus 
 
-----Original message-----
> From: Jorge Luis Betancourt González <jlbetanco...@uci.cu>
> Sent: Wednesday 18th February 2015 23:05
> To: user@nutch.apache.org
> Subject: Re: [MASSMAIL]URL filter plugins for nutch
> 
> The idea behind the URL filter plugins is to decide whether the current URL
> (string) should be allowed to be fetched or not. In your particular case I
> think you could try to read the LinkDB and then decide whether or not to
> fetch this particular URL. Keep in mind that this logic should be fast,
> because it is going to be executed many times (once for each URL).
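>
> To make the contract concrete, here is a minimal skeleton of a URLFilter
> implementation (a sketch: the duplicate check below is a hypothetical
> placeholder; returning null rejects a URL, returning the string accepts it):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.nutch.net.URLFilter;
>
> public class DedupUrlFilter implements URLFilter {
>
>   private Configuration conf;
>
>   // Called once per URL, so keep this fast.
>   @Override
>   public String filter(String urlString) {
>     if (looksLikeDuplicate(urlString)) {
>       return null;        // reject: the URL will not be fetched
>     }
>     return urlString;     // accept: pass it on to the next filter
>   }
>
>   // Hypothetical placeholder; a real check might consult a set of
>   // URL hashes loaded once in setConf(), not a per-URL LinkDB scan.
>   private boolean looksLikeDuplicate(String urlString) {
>     return false;
>   }
>
>   @Override
>   public void setConf(Configuration conf) {
>     this.conf = conf;
>   }
>
>   @Override
>   public Configuration getConf() {
>     return conf;
>   }
> }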
> 
> I don't know of any plugin that does this; typically this kind of thing is
> hard to do right (if possible at all), but you can check out the LinkDbReader
> for a way to read from the LinkDB to do your check. One more detail: if you
> filter only by the URL, you can run into resources on the Web whose content
> has changed, and in that case you would wrongly discard the fetch of the
> resource.
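>
> For a quick look at what the LinkDB stores for a single URL there is also
> the command line tool: bin/nutch readlinkdb <linkdb_dir> -url <url>.
> Programmatically, a sketch along these lines should work in Nutch 1.x
> (the "crawl/linkdb" path and the example URL here are just assumptions):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.Text;
> import org.apache.nutch.crawl.Inlinks;
> import org.apache.nutch.crawl.LinkDbReader;
> import org.apache.nutch.util.NutchConfiguration;
>
> public class LinkDbLookup {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = NutchConfiguration.create();
>     // "crawl/linkdb" is a hypothetical path to an existing LinkDB
>     LinkDbReader reader = new LinkDbReader(conf, new Path("crawl/linkdb"));
>     try {
>       Inlinks inlinks = reader.getInlinks(new Text("http://example.org/page"));
>       System.out.println(inlinks == null ? "no inlinks recorded"
>                                          : "inlinks: " + inlinks.size());
>     } finally {
>       reader.close();
>     }
>   }
> }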
> 
> Regards,
> 
> ----- Original Message -----
> From: "Madan Patil" <madan...@usc.edu>
> To: "user" <user@nutch.apache.org>
> Sent: Wednesday, February 18, 2015 3:09:00 PM
> Subject: [MASSMAIL]URL filter plugins for nutch
> 
> Hi,
> 
> I am working on an assignment where I am supposed to use Nutch to crawl
> Antarctic data.
> I am writing a plugin which extends URLFilter to avoid crawling duplicate
> (exact and near-duplicate) URLs. All the plugins, the default ones and
> others on the web, see only one URL at a time: they decide whether or not
> to allow a URL based on that single URL string.
> 
> Could anyone point me to resources that would help me compare the content
> of one URL with the content of the URLs already crawled?
> 
> Thanks in advance.
> 
> Regards,
> Madan Patil
> 
