On 7/30/07, Matthew A. Bockol <[EMAIL PROTECTED]> wrote:
>
> Would it be possible to remove duplicates once the index is complete?

Yes, that is what DeleteDuplicates does. You can write a new
Signature implementation (see TextProfileSignature for an example) to
redefine what counts as a duplicate.
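
To make the idea concrete, here is a rough sketch of what "redefining
a duplicate" could look like for the http/https case. It is plain
Python for illustration only, not Nutch code and not the Nutch
Signature API: the function names, the normalization steps, and the
example.com pages are all assumptions of mine. The signature hashes
the page text after rewriting https:// links to http://, so two
copies that differ only in link scheme get the same value.

```python
import hashlib
import re


def scheme_insensitive_signature(text: str) -> str:
    """Hypothetical signature: MD5 of the page text after normalization."""
    normalized = text.lower()
    # Treat https:// and http:// links inside the page as the same.
    normalized = normalized.replace("https://", "http://")
    # Collapse runs of whitespace so layout differences do not matter.
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def delete_duplicates(pages):
    """Keep the first URL seen for each signature.

    `pages` is a list of (url, text) pairs. Keeping the first copy is an
    arbitrary choice for this sketch.
    """
    seen = set()
    kept = []
    for url, text in pages:
        sig = scheme_insensitive_signature(text)
        if sig not in seen:
            seen.add(sig)
            kept.append(url)
    return kept


# Made-up pages: the first two differ only in link scheme and spacing.
pages = [
    ("http://example.com/a", "Welcome! See http://example.com/b"),
    ("https://example.com/a", "Welcome!  See https://example.com/b"),
    ("https://example.com/login", "Please log in"),
]
print(delete_duplicates(pages))
```

Pages that exist only under https keep their own signature, so they
survive the pass, which is the behavior Matt asked for.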

>
>
> ----- Original Message -----
> From: "Kai_testing Middleton" <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Friday, July 27, 2007 12:27:57 AM (GMT-0600) America/Chicago
> Subject: Re: eliminating almost duplicate URLs
>
> It seems that the regular expression rules work one URL at a time. I 
> don't know how you'd maintain some state to know a similar URL had already 
> been seen.
>
> ----- Original Message ----
> From: Matthew A. Bockol <[EMAIL PROTECTED]>
> To: nutch user <[EMAIL PROTECTED]>
> Sent: Thursday, July 26, 2007 8:58:50 PM
> Subject: eliminating almost duplicate URLs
>
> Hi Folks,
>
> The site we're crawling serves up pages both via http and https. There are 
> links switching from one to the other depending on the page. When this 
> happens, I'll see two results which are almost identical except one page is 
> http and the next is https. Is there any way to remove those duplicates 
> through normal nutch config? There are some pages that only show up via 
> https, so I can't just exclude those.
>
> Thanks,
> Matt
>


-- 
Doğacan Güney
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
