Hi - this is not going to work. The URLFilter interface operates on single 
URLs only; it is not aware of content, nor of any metadata (such as a 
simhash) attached to the CrawlDatum. It would be more straightforward to 
implement Signature and calculate the simhash there.
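
Something like this, roughly (an untested sketch; SimHash.compute() stands 
in for your own Charikar implementation, and I am assuming the Signature 
API is still calculate(Content, Parse)):

import org.apache.nutch.crawl.Signature;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.protocol.Content;

// Minimal Signature that stores a 64-bit simhash as the page signature.
public class SimHashSignature extends Signature {

  @Override
  public byte[] calculate(Content content, Parse parse) {
    // Hash the extracted text, not the raw bytes, so markup and
    // boilerplate do not dominate the fingerprint.
    long simhash = SimHash.compute(parse.getText());  // your Charikar impl
    // Pack the 64-bit hash into the byte[] Nutch expects.
    byte[] bytes = new byte[8];
    for (int i = 0; i < 8; i++) {
      bytes[i] = (byte) (simhash >>> (56 - 8 * i));
    }
    return bytes;
  }
}

You would then point db.signature.class at this class in nutch-site.xml, if 
I remember the property name right.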
Now, Nutch does have a DeduplicationJob, but it operates on exactly equal 
signatures as the MapReduce key, and that is not going to work with 
simhashes. I remember there was a trick to get similar hashes into the same 
key buckets by emitting them to multiple buckets from the mapper, so that in 
the reducer you can do a Sørensen similarity on the hashes.
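
The trick is basically LSH banding, if I remember it correctly: split the 
64-bit simhash into a few bands and emit one record per band, so that two 
hashes that agree on any band meet in the same reducer. Very rough sketch 
(untested; plain Hadoop MapReduce, all names made up):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit each simhash under 4 band keys of 16 bits each. By the
// pigeonhole principle, two hashes within Hamming distance 3 agree on at
// least one band, so they are guaranteed to share a reducer key.
public class BandMapper
    extends Mapper<Text, LongWritable, Text, LongWritable> {
  private static final int BANDS = 4;
  private final Text bandKey = new Text();

  @Override
  protected void map(Text url, LongWritable simhash, Context ctx)
      throws IOException, InterruptedException {
    long h = simhash.get();
    for (int b = 0; b < BANDS; b++) {
      long band = (h >>> (b * 16)) & 0xFFFFL;
      bandKey.set(b + ":" + band);  // band index + band value
      ctx.write(bandKey, simhash);
    }
  }
}

// Reducer: everything sharing a band lands here; compare pairwise. I am
// using Hamming distance here; Sørensen on the set bits works the same way.
public class DedupReducer
    extends Reducer<Text, LongWritable, LongWritable, LongWritable> {
  @Override
  protected void reduce(Text key, Iterable<LongWritable> hashes, Context ctx)
      throws IOException, InterruptedException {
    List<Long> bucket = new ArrayList<Long>();
    for (LongWritable h : hashes) bucket.add(h.get());
    for (int i = 0; i < bucket.size(); i++) {
      for (int j = i + 1; j < bucket.size(); j++) {
        if (Long.bitCount(bucket.get(i) ^ bucket.get(j)) <= 3) {
          // near-duplicate pair found
          ctx.write(new LongWritable(bucket.get(i)),
                    new LongWritable(bucket.get(j)));
        }
      }
    }
  }
}

Note that the same pair can be emitted from more than one band, so you still 
have to deduplicate the output, and buckets can get large on big crawls.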

This is really tricky stuff, especially getting the hashes in the same bucket.

Are you doing this to remove duplicates from search results? Then it might 
be easier to implement the Sørensen similarity in a custom Lucene collector. 
The top docs containing the duplicates all pass through the same collector 
implementation, so you get a single point at which to remove them. The 
problem then is that it won't really work with distributed search, unless 
you hash similar URLs to the same shard, but then the cluster becomes 
unbalanced and difficult to manage, and on top of that IDF and norms become 
skewed.
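
The collector API details depend on the Lucene version, but the core of it 
is just this kind of check over the candidate docs. Rough sketch (untested; 
assumes each document stores its 64-bit simhash in a stored numeric field I 
am calling "simhash"):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;

// Keep the first (highest scoring) representative of each simhash cluster
// and drop later hits that are too similar to one already kept.
public class SimHashDeduper {

  // Sørensen-Dice similarity over the set bits of two 64-bit hashes.
  static double sorensen(long a, long b) {
    int total = Long.bitCount(a) + Long.bitCount(b);
    return total == 0 ? 1.0 : (2.0 * Long.bitCount(a & b)) / total;
  }

  static List<ScoreDoc> dedupe(IndexSearcher searcher, ScoreDoc[] hits,
      double threshold) throws IOException {
    List<ScoreDoc> kept = new ArrayList<ScoreDoc>();
    List<Long> keptHashes = new ArrayList<Long>();
    for (ScoreDoc hit : hits) {
      Document doc = searcher.doc(hit.doc);
      long h = doc.getField("simhash").numericValue().longValue();
      boolean duplicate = false;
      for (long seen : keptHashes) {
        if (sorensen(h, seen) >= threshold) {
          duplicate = true;
          break;
        }
      }
      if (!duplicate) {
        kept.add(hit);
        keptHashes.add(h);
      }
    }
    return kept;
  }
}

This is quadratic in the number of hits, but a page of top docs is small, so 
in practice it does not matter much.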

Good luck - we have tried many different approaches to this problem, 
especially online deduplication. But offline is also hard, because of the 
reducer keys.

Markus

-----Original message-----
> From:Madan Patil <madan...@usc.edu>
> Sent: Wednesday 18th February 2015 22:16
> To: user <user@nutch.apache.org>
> Subject: Re: URL filter plugins for nutch
> 
> I am not sure if I understand you right, but here is what I am trying to
> implement:
> 
> I have implemented Charikar's simhash and now want to use it to detect
> near-duplicates/duplicates.
> I would like to make it a plugin (which implements the URLFilter interface),
> and hence filter all those URLs whose content is nearly the same as ones
> which have already been fetched. Would this be possible, or am I heading in
> the wrong direction?
> 
> Thanks for your patience Markus.
> 
> 
> Regards,
> Madan Patil
> 
> On Wed, Feb 18, 2015 at 1:05 PM, Markus Jelsma <markus.jel...@openindex.io>
> wrote:
> 
> > Easiest is to set the signature to TextProfileSignature and delete
> > duplicates from the index, but you will still crawl and waste resources on
> > them. Or are you by any chance trying to prevent spider traps from being
> > crawled?
> >
> >
> >
> > -----Original message-----
> > > From:Madan Patil <madan...@usc.edu>
> > > Sent: Wednesday 18th February 2015 21:58
> > > To: user <user@nutch.apache.org>
> > > Subject: Re: URL filter plugins for nutch
> > >
> > > Hi Markus,
> > >
> > > I am looking for the ones with similar content.
> > >
> > > Regards,
> > > Madan Patil
> > >
> > > On Wed, Feb 18, 2015 at 12:53 PM, Markus Jelsma <
> > markus.jel...@openindex.io>
> > > wrote:
> > >
> > > > By near-duplicate you mean similar URLs, or URLs with similar
> > > > content?
> > > >
> > > > -----Original message-----
> > > > > From:Madan Patil <madan...@usc.edu>
> > > > > Sent: Wednesday 18th February 2015 21:10
> > > > > To: user <user@nutch.apache.org>
> > > > > Subject: URL filter plugins for nutch
> > > > >
> > > > > Hi,
> > > > >
> > > > > I am working on an assignment where I am supposed to use Nutch to
> > > > > crawl antarctic data.
> > > > > I am writing a plugin which extends URLFilter to not crawl duplicate
> > > > > (exact and near-duplicate) URLs. All the plugins, the default ones
> > > > > and others on the web, take only one URL. They decide what to do or
> > > > > not to do based on the content of one URL.
> > > > >
> > > > > Could anyone point me to resources which would help me compare the
> > > > > content of one URL with the ones already crawled?
> > > > >
> > > > > Thanks in advance.
> > > > >
> > > > > Regards,
> > > > > Madan Patil
> > > > >
> > > >
> > >
> >
> 
