Hi, I just had to solve the same problem. Luckily, I found a patch for it here: http://issues.apache.org/jira/browse/NUTCH-273
I applied it to Nutch 0.8.1, and it works. No more infinite loops. Just be sure to put the line into the right block (redirect).

greetings from Berlin (CET),
Rüdiger

Carl Cerecke-3 wrote:
>
> This is the behaviour I am noticing with pages that have a server
> redirect (300-range code):
>
> Say page A redirects to page B. A is in the fetchlist created by
> generate. When A is fetched, the redirect is followed and B is fetched.
> At the next updatedb, both A and B go into the crawldb. For some reason,
> at the next generate, page B is listed to be fetched. And again at the
> next generate, and so on.
>
> An example is:
>
> http://www.selecthotels.com
>
> which redirects to http://203.210.113.143/ ('page B').
> This page always seems to be in the fetchlist no matter how many times
> it gets fetched. (To make matter more complicated, it also redirects to
> yet another URL.)
>
> How do I fix this behaviour?
>
> Also, other URLs whose fetch fails for some reason stay in the crawldb
> and are tried again and again. For a 'deep' search using topN=1000, each
> fetchlist generated after a number of runs has many hundreds of these
> failed URLs that it tries to refetch.
>
> How do I fix this behaviour too?
>
> End of the day for me (NZST). I'll try again tomorrow....
>
> Cheers,
> Carl.
>
