Richard Braman wrote:
When you get an error while fetching, and you get org.apache.nutch.protocol.RetryLater because the maximum number of retries has been reached, Nutch says it has given up and will retry later. When does that retry occur? How would you make a fetchlist of all URLs that have failed? Is this information maintained somewhere?
Each URL in the crawldb has a retry count: the number of times it has been tried without a conclusive result. When the maximum (db.fetch.retry.max) is reached, the page is considered gone. Until then it will be generated for fetch along with other pages. There is no command that generates a fetchlist of only those pages whose retry count is greater than zero.
Doug
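
[Editor's note: One way to inspect that state is to dump the crawldb with bin/nutch readdb <crawldb> -dump <dir> and grep the output for non-zero retry values. If you want it programmatically, below is a minimal sketch, not a Nutch command. It assumes the usual crawldb layout of SequenceFiles under current/part-*/data and uses the hypothetical class name RetriedUrls; adjust paths for your installation.]

// Scan a Nutch crawldb and print every URL whose retry count is
// greater than zero, i.e. pages tried without a conclusive result.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class RetriedUrls {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // The crawldb keeps its current state as <Text, CrawlDatum> entries
    // in SequenceFiles under <crawldb>/current/part-*/data (assumed layout).
    for (FileStatus part : fs.globStatus(new Path(args[0], "current/part-*/data"))) {
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, part.getPath(), conf);
      Text url = new Text();
      CrawlDatum datum = new CrawlDatum();
      while (reader.next(url, datum)) {
        // Retry count > 0 means the page failed at least once but is
        // not yet past db.fetch.retry.max.
        if (datum.getRetriesSinceFetch() > 0) {
          System.out.println(url + "\t" + datum.getRetriesSinceFetch());
        }
      }
      reader.close();
    }
  }
}

[Run it with the crawldb directory as the only argument, e.g. "java RetriedUrls crawl/crawldb" with the Nutch and Hadoop jars on the classpath; the output is one URL and its retry count per line, which you could feed into your own fetchlist if needed.]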