Thanks for your comments. Please consider adding them to the issue so we can keep track of them.
-----Original message-----
> From: John McCormac <j...@hackwatch.com>
> Sent: Sat 23-Jun-2012 16:36
> To: user@nutch.apache.org
> Subject: Re: Near Duplicate Detection in nutch /Solr
>
> On 23/06/2012 14:21, Markus Jelsma wrote:
> > Keep an eye on these open issues:
> >
> > https://issues.apache.org/jira/browse/NUTCH-1324
> > https://issues.apache.org/jira/browse/NUTCH-1325
> > https://issues.apache.org/jira/browse/NUTCH-1326
> >
> > They are a set of tools capable of deduplicating the various databases via
> > the HostNormalizer. They collect information on hosts, most importantly the
> > link score. It also collects information on duplicates within a host and
> > then produce deduplication rules for the HostNormalizer based on host and
> > duplicate information.
> >
> > It's limited to domain because that's a larger problem in terms of
> > resources and a bit easier to deal with.
>
> The HostDB patch looks interesting. (I'm still very much a novice as
> regards Nutch and Java.) It might be a good thing to add a DNS lookup
> field and an IP lookup field. Some hosters have domain graveyard IPs
> (and PPC parking pages) where they point undeveloped or unrenewed
> domains. This would help with the blacklisting process by allowing
> unrenewed sites to be identified simply by IP. In DNS terms, if a domain
> moves to a PPC (sedoparking.com etc) or auction hoster (afternic.com
> etc) then it is no longer worth including in an active index.
>
> Regards...jmcc
> --
> **********************************************************
> John McCormac  * e-mail: j...@hosterstats.com
> MC2            * web: http://www.hosterstats.com/
> 22 Viewmount   * Domain Registrations Statistics
> Waterford      * And Historical DNS Database.
> Ireland        * Over 275 Million Domains Tracked.
> IE             * http://www.hosterstats.com/blog
> **********************************************************
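
For reference, the IP-based check suggested in the quoted message could be sketched roughly as below. This is only an illustration and not part of the HostDB patch or any Nutch API: the class name ParkedHostCheck, the Resolver hook, and the set of parking IPs are all hypothetical, and the addresses would have to come from a list maintained elsewhere. The only library call relied on is the JDK's InetAddress.getAllByName.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: flags hosts whose DNS records point at known parking IPs. */
public class ParkedHostCheck {

    /** Hypothetical lookup hook so the DNS step can be replaced or stubbed. */
    public interface Resolver {
        List<String> resolve(String host) throws UnknownHostException;
    }

    /** Default resolver backed by the JDK's own DNS lookup. */
    public static final Resolver SYSTEM_DNS = host -> {
        List<String> ips = new ArrayList<>();
        for (InetAddress addr : InetAddress.getAllByName(host)) {
            ips.add(addr.getHostAddress());
        }
        return ips;
    };

    private final Resolver resolver;
    private final Set<String> parkingIps; // assumed to be maintained by hand or imported

    public ParkedHostCheck(Resolver resolver, Set<String> parkingIps) {
        this.resolver = resolver;
        this.parkingIps = parkingIps;
    }

    /** Returns true if any address the host resolves to is on the parking list. */
    public boolean isParked(String host) {
        try {
            for (String ip : resolver.resolve(host)) {
                if (parkingIps.contains(ip)) {
                    return true;
                }
            }
        } catch (UnknownHostException e) {
            // An unresolvable host is not treated as parked in this sketch;
            // it would need separate handling.
        }
        return false;
    }
}
```

With a real list in hand, usage would be something like new ParkedHostCheck(ParkedHostCheck.SYSTEM_DNS, parkingIps).isParked(host). A host that matches could then be written to the blacklist instead of being fetched. How that would hook into the HostDB is left open here.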