This is the behaviour I am noticing with pages that have a server redirect (a 3xx status code):
Say page A redirects to page B. A is in the fetchlist created by generate. When A is fetched, the redirect is followed and B is fetched. At the next updatedb, both A and B go into the crawldb. For some reason, at the next generate, page B is listed to be fetched again, and again at the generate after that, and so on.

An example is http://www.selecthotels.com, which redirects to http://203.210.113.143/ ('page B'). This page always seems to be in the fetchlist no matter how many times it gets fetched. (To make matters more complicated, it also redirects to yet another URL.) How do I fix this behaviour?

Also, other URLs whose fetch fails for some reason stay in the crawldb and are tried again and again. For a 'deep' search using topN=1000, each fetchlist generated after a number of runs contains many hundreds of these failed URLs that it tries to refetch. How do I fix this behaviour too?

End of the day for me (NZST). I'll try again tomorrow....

Cheers,
Carl.

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
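For anyone hitting the same behaviour: both symptoms are usually governed by fetcher/crawldb properties that can be overridden in conf/nutch-site.xml. A sketch follows; the property names come from nutch-default.xml, but the exact defaults and semantics vary by Nutch version, so verify them against your installation's nutch-default.xml before relying on this.

```xml
<!-- Sketch of conf/nutch-site.xml overrides relevant to the behaviour
     described above. Check your version's nutch-default.xml for the
     authoritative names, defaults, and descriptions. -->
<configuration>
  <!-- Maximum number of redirects the fetcher follows during a fetch.
       If 0 (or negative), redirect targets are not fetched immediately
       but recorded for a later generate/fetch round, which can keep a
       redirect target reappearing in fetchlists. -->
  <property>
    <name>http.redirect.max</name>
    <value>3</value>
  </property>
  <!-- Maximum number of times a URL that failed with recoverable
       errors is generated for fetch before it stops being retried. -->
  <property>
    <name>db.fetch.retry.max</name>
    <value>3</value>
  </property>
</configuration>
```

Lowering db.fetch.retry.max should shrink the tail of repeatedly failing URLs in each generated fetchlist; raising http.redirect.max above 0 should stop redirect targets from being deferred to the next round.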
