This is the behaviour I am noticing with pages that have a server 
redirect (300-range code):

Say page A redirects to page B. A is in the fetchlist created by 
generate. When A is fetched, the redirect is followed and B is fetched. 
At the next updatedb, both A and B go into the crawldb. For some reason, 
page B is listed to be fetched again at the next generate, and at the 
generate after that, and so on.


An example is:

http://www.selecthotels.com

which redirects to http://203.210.113.143/ ('page B').
This page always seems to be in the fetchlist no matter how many times 
it gets fetched. (To make matters more complicated, it also redirects to 
yet another URL.)

How do I fix this behaviour?
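Is this a configuration issue? My untested guess is that http.redirect.max in nutch-site.xml is the relevant property — with the default of 0, I believe the redirect target is recorded as a new URL for a later round rather than followed within the same fetch. Something like this sketch, perhaps:

```xml
<!-- nutch-site.xml: a sketch, assuming http.redirect.max is the right knob -->
<property>
  <name>http.redirect.max</name>
  <value>3</value>
  <description>Follow up to 3 redirects inside the same fetch, instead of
  queueing the redirect target as a separate crawldb entry.</description>
</property>
```

Can anyone confirm whether that is the intended behaviour?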

Also, other URLs whose fetch fails for some reason stay in the crawldb 
and are retried again and again. For a 'deep' crawl using topN=1000, 
after a number of runs each generated fetchlist contains many hundreds 
of these failed URLs, which it tries to refetch.

How do I fix this behaviour too?
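For this one, is db.fetch.retry.max the property to look at? My understanding (possibly wrong) is that it caps how many times a transiently failed URL is retried before being marked gone; the value below is just an illustration:

```xml
<!-- nutch-site.xml: a sketch, assuming db.fetch.retry.max governs retries -->
<property>
  <name>db.fetch.retry.max</name>
  <value>1</value>
  <description>Give up on a failing URL after one retry, so it stops
  reappearing in every generated fetchlist.</description>
</property>
```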



End of the day for me (NZST). I'll try again tomorrow....

Cheers,
Carl.

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
