Thanks for your comments. Please consider adding them to the issue so we can 
keep track of them.
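
For reference, the IP-based check suggested in the quoted message below could 
be sketched roughly as follows. This is only an illustration under stated 
assumptions: it does not use any Nutch or HostDB API, the names PARKING_IPS, 
resolve_ipv4, is_parked and filter_active are hypothetical, and the addresses 
are documentation-range placeholders rather than real parking hosts.

```python
import socket
from typing import Callable, Iterable, List, Optional, Set

# Hypothetical blacklist of "graveyard"/parking IPs. These are placeholder
# addresses from the documentation ranges, not real parking hosts.
PARKING_IPS: Set[str] = {"192.0.2.10", "198.51.100.7"}


def resolve_ipv4(host: str) -> Optional[str]:
    """Return one IPv4 address for host, or None if it does not resolve."""
    try:
        return socket.gethostbyname(host)
    except (OSError, UnicodeError):
        # socket.gaierror is a subclass of OSError.
        return None


def is_parked(
    host: str,
    parking_ips: Set[str] = PARKING_IPS,
    resolver: Callable[[str], Optional[str]] = resolve_ipv4,
) -> bool:
    """True if host resolves to an IP on the parking blacklist."""
    ip = resolver(host)
    return ip is not None and ip in parking_ips


def filter_active(
    hosts: Iterable[str],
    parking_ips: Set[str] = PARKING_IPS,
    resolver: Callable[[str], Optional[str]] = resolve_ipv4,
) -> List[str]:
    """Keep hosts that do not resolve to a blacklisted IP.

    Hosts that fail to resolve are kept here; whether to drop them
    is a separate decision.
    """
    return [h for h in hosts if not is_parked(h, parking_ips, resolver)]
```

The resolver is passed in as a parameter so that the lookup result could come 
from a stored field instead of a live DNS query, which is closer to what the 
quoted message proposes.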

 
 
-----Original message-----
> From: John McCormac <j...@hackwatch.com>
> Sent: Sat 23-Jun-2012 16:36
> To: user@nutch.apache.org
> Subject: Re: Near Duplicate Detection in nutch /Solr
> 
> On 23/06/2012 14:21, Markus Jelsma wrote:
> > Keep an eye on these open issues:
> >
> > https://issues.apache.org/jira/browse/NUTCH-1324
> > https://issues.apache.org/jira/browse/NUTCH-1325
> > https://issues.apache.org/jira/browse/NUTCH-1326
> >
> > They are a set of tools capable of deduplicating the various databases via 
> > the HostNormalizer. They collect information on hosts, most importantly the 
> > link score. They also collect information on duplicates within a host and 
> > then produce deduplication rules for the HostNormalizer based on host and 
> > duplicate information.
> >
> > It's limited to domain because that's a larger problem in terms of 
> > resources and a bit easier to deal with.
> 
> The HostDB patch looks interesting. (I'm still very much a novice as 
> regards Nutch and Java.) It might be a good thing to add a DNS lookup 
> field and an IP lookup field. Some hosters have domain graveyard IPs 
> (and PPC parking pages) where they point undeveloped or unrenewed 
> domains. This would help with the blacklisting process by allowing 
> unrenewed sites to be identified simply by IP. In DNS terms, if a domain 
> moves to a PPC (sedoparking.com etc) or auction hoster (afternic.com 
> etc) then it is no longer worth including in an active index.
> 
> Regards...jmcc
> -- 
> **********************************************************
> John McCormac  *  e-mail: j...@hosterstats.com
> MC2            *  web: http://www.hosterstats.com/
> 22 Viewmount   *  Domain Registrations Statistics
> Waterford      *  And Historical DNS Database.
> Ireland        *  Over 275 Million Domains Tracked.
> IE             *  http://www.hosterstats.com/blog
> **********************************************************
