It seems that the regular expression rules work one URL at a time... I don't
know how you'd maintain state to know that a similar URL had already been seen.
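
One thing that might help, though: Nutch's urlnormalizer-regex plugin rewrites
each URL before it is filtered and fetched, so instead of tracking which
variants have already been seen you can canonicalize the scheme up front. If
the http and https pages really are the same content, a rule in
conf/regex-normalize.xml along these lines should collapse the two variants.
This is only a sketch -- the /secure/ prefix is a made-up placeholder for
whatever paths on your site are actually https-only:

    <!-- conf/regex-normalize.xml, applied by the urlnormalizer-regex
         plugin (it must be listed in plugin.includes in nutch-site.xml) -->
    <regex-normalize>
      <regex>
        <!-- Rewrite https URLs to http so scheme-only duplicates
             collapse to a single entry, but skip the hypothetical
             /secure/ area, which we assume exists only over https. -->
        <pattern>^https://([^/]+)/(?!secure/)</pattern>
        <substitution>http://$1/</substitution>
      </regex>
    </regex-normalize>

The patterns use Java regex syntax, so the negative lookahead and the $1
back-reference behave as in java.util.regex. For pages whose fetched content
is byte-identical across the two schemes, the index-time dedup step
(bin/nutch dedup) should also drop one copy based on the content digest, but
it won't catch "almost identical" pages whose bodies differ.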

----- Original Message ----
From: Matthew A. Bockol <[EMAIL PROTECTED]>
To: nutch user <[EMAIL PROTECTED]>
Sent: Thursday, July 26, 2007 8:58:50 PM
Subject: eliminating almost duplicate URLs

Hi Folks,

The site we're crawling serves pages via both http and https, and links switch
from one scheme to the other depending on the page. When this happens, I'll see
two results that are almost identical except that one page is http and the
other is https. Is there any way to remove those duplicates through normal
Nutch config? Some pages only show up via https, so I can't just exclude the
https URLs.

Thanks,
Matt
