On 7/30/07, Matthew A. Bockol <[EMAIL PROTECTED]> wrote:
>
> Would it be possible to remove duplicates once the index is complete?
Yes, that is what DeleteDuplicates does. You can create a new Signature
function (see TextProfileSignature for details) to redefine what being a
duplicate means.

> ----- Original Message -----
> From: "Kai_testing Middleton" <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Friday, July 27, 2007 12:27:57 AM (GMT-0600) America/Chicago
> Subject: Re: eliminating almost duplicate URLs
>
> It seems that the regular expression rules work one URL at a time ... I
> don't know how you'd maintain some state to know a similar URL had already
> been seen.
>
> ----- Original Message ----
> From: Matthew A. Bockol <[EMAIL PROTECTED]>
> To: nutch user <[EMAIL PROTECTED]>
> Sent: Thursday, July 26, 2007 8:58:50 PM
> Subject: eliminating almost duplicate URLs
>
> Hi Folks,
>
> The site we're crawling serves up pages both via http and https. There are
> links switching from one to the other depending on the page. When this
> happens, I'll see two results which are almost identical except one page is
> http and the next is https. Is there any way to remove those duplicates
> through normal nutch config? There are some pages that only show up via
> https, so I can't just exclude those.
>
> Thanks,
> Matt

--
Doğacan Güney
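
To illustrate the signature idea from the reply above: deduplication keys each page on a signature computed from its content rather than on its URL, so an http page and its https twin collapse to one entry when their signatures match. The sketch below is a language-neutral illustration in Python and does not use the Nutch API; `text_signature`, `delete_duplicates`, and the sample URLs are hypothetical names invented for this example. In Nutch itself the equivalent would be a Java Signature implementation, and TextProfileSignature uses a more tolerant scheme than the plain normalized hash assumed here.

```python
import hashlib
import re


def text_signature(text: str) -> str:
    """Hypothetical signature: MD5 over lower-cased, whitespace-collapsed text.

    Two pages whose text differs only in case or spacing get the same value.
    """
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def delete_duplicates(pages):
    """Keep the first URL seen for each signature.

    `pages` is a list of (url, text) pairs; the URL scheme plays no part
    in deciding what counts as a duplicate.
    """
    first_seen = {}
    for url, text in pages:
        # setdefault leaves the earlier URL in place when the signature repeats
        first_seen.setdefault(text_signature(text), url)
    return list(first_seen.values())


pages = [
    ("http://example.com/a", "Welcome to  the site"),
    ("https://example.com/a", "welcome to the site\n"),
    ("https://example.com/secure", "Account login"),
]

print(delete_duplicates(pages))
```

Because the https-only page has a different signature, it survives, which matches the constraint in the original question that https pages cannot simply be excluded.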
