In your original message on this topic you mentioned that you appear to be using a custom crawl script, and that you were getting some 2/5 of the seed URLs. If the fetch intervals shown in the log below have not been exceeded, it is unlikely that you will get a re-fetch of those exact URLs, but you would be very likely to get outlink (or inlink) URLs, depending on the configuration in nutch-site. Is it possible that you have been experimenting with the default options and have asked Nutch to behave in an irregular way?
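
To put the two interval values from the log below into perspective, here is a rough Python sketch that converts them to days and checks whether a URL would be due again. The `is_due` helper and the example dates are hypothetical and only illustrate the idea; this is not Nutch's actual scheduling code, which takes more into account than a single fixed interval.

```python
from datetime import datetime, timedelta

# Values copied from the AbstractFetchSchedule log lines (in seconds).
DEFAULT_INTERVAL_S = 2_592_000
MAX_INTERVAL_S = 7_776_000

SECONDS_PER_DAY = 86_400


def interval_in_days(seconds: int) -> float:
    """Convert an interval in seconds to days."""
    return seconds / SECONDS_PER_DAY


def is_due(last_fetch: datetime, now: datetime,
           interval_s: int = DEFAULT_INTERVAL_S) -> bool:
    """Simplified assumption: a URL is due again once the full
    interval has elapsed since it was last fetched."""
    return now >= last_fetch + timedelta(seconds=interval_s)


if __name__ == "__main__":
    print(interval_in_days(DEFAULT_INTERVAL_S))  # 30.0
    print(interval_in_days(MAX_INTERVAL_S))      # 90.0
    # A page fetched the day before the logged run is not due yet.
    print(is_due(datetime(2011, 5, 29), datetime(2011, 5, 30)))  # False
```

Under that simplification, a seed URL that was already fetched within the last 30 days would not be selected again, which would be consistent with the "0 records selected for fetching" warning.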
Personally, I would try your crawl script with a fresh nutch-site/nutch-default file; after that you can go back and fine-tune it the way you wish. Finally, and this is merely a suggestion, is it possible that the URLs you are fetching have page redirects? If so, this will need to be taken into account in your nutch-site configuration.

HTH
Lewis

---------------------------------------------------------------------------------------
2011-05-30 11:21:25,148 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2011-05-30 11:21:25,149 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2011-05-30 11:21:25,589 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ...

I have started a new crawl and, despite there being 5 domains in the "seed file", Nutch hasn't discovered any URLs to fetch.

-----
Regards,
Jotta

PS. Sorry for my English :)

