Hi folks,

The site we're crawling serves pages over both http and https, and its links switch from one to the other depending on the page. When that happens, I see two results that are almost identical, except that one URL is http and the other is https. Is there any way to remove those duplicates through normal Nutch config? Some pages are only available via https, so I can't just exclude all https URLs.
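To make the behaviour I'm after concrete, here is a rough Python sketch. It is not Nutch code, and the helper names (`scheme_insensitive_key`, `drop_scheme_duplicates`) and the example.com URLs are made up for illustration. It assumes that two URLs differing only in scheme are the same page, and it ignores port numbers.

```python
from urllib.parse import urlsplit


def scheme_insensitive_key(url):
    """Return (host, path, query), ignoring scheme and port.

    Assumption: http://host/x and https://host/x are the same page.
    """
    parts = urlsplit(url)
    return (parts.hostname or "", parts.path or "/", parts.query)


def drop_scheme_duplicates(urls):
    """Keep one URL per key, preferring the https variant when both exist."""
    chosen = {}
    for url in urls:
        key = scheme_insensitive_key(url)
        current = chosen.get(key)
        # Take the first URL seen, or upgrade an http entry to https.
        if current is None or (
            url.startswith("https://") and not current.startswith("https://")
        ):
            chosen[key] = url
    return list(chosen.values())


if __name__ == "__main__":
    crawled = [
        "http://example.com/a",
        "https://example.com/a",
        "https://example.com/secure",
    ]
    for url in drop_scheme_duplicates(crawled):
        print(url)
```

Pages that exist only under https survive because nothing is excluded by scheme; only the http twin of an https URL is dropped.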
Thanks,
Matt
