Dear Nutch Community,

Thanks for the help you have provided so far. More recently, we have been struggling with adaptive recrawl.
My goal is to run a large-scale crawl adaptively: starting from a small number of injected sites, within a few cycles I want to see lots of different sites, and even within those sites different articles, rather than the same ones fetched again and again. What I'm experiencing, however, is that the URLs chosen barely ever change between my crawls.

From what I can see, the crawldb is growing steadily round by round. Since the number of URLs grows rapidly, I have to limit how many URLs are fetched in each crawl cycle, and for that I use the generate command's topN parameter. Based on the documentation, I think this is where my problem resides: for some reason the same topN URLs are chosen as the top candidates to fetch, even though there should be plenty of others available.

So my main question is: how can I influence the rules that determine which URLs will be chosen for fetching? Of course, if there is some other way to limit the number of URLs fetched each round, that would be great too!

My configuration seems fine (but probably isn't), based on this recrawl tutorial: https://www.pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/

*db.fetch.schedule.class* is set to *org.apache.nutch.crawl.AdaptiveFetchSchedule*, and I have tried playing around with the other settings such as *min_interval*, *inc_rate* and *dec_rate* (0-0.5), *sync_delta* and *sync_delta_rate*, but with no luck.

Thanks in advance!

Best,
Zoltán
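P.S. For reference, a minimal sketch of the relevant nutch-site.xml fragment I am working from, along the lines of that tutorial. The property names are the adaptive-schedule ones from Nutch's conf/nutch-default.xml; the values shown are just illustrative, not my exact production settings:

```xml
<!-- nutch-site.xml fragment: adaptive fetch scheduling.
     Property names as in conf/nutch-default.xml; values are illustrative. -->
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<property>
  <name>db.fetch.schedule.adaptive.inc_rate</name>
  <value>0.4</value>
  <description>Factor by which the fetch interval grows when a page is unmodified.</description>
</property>
<property>
  <name>db.fetch.schedule.adaptive.dec_rate</name>
  <value>0.2</value>
  <description>Factor by which the fetch interval shrinks when a page has changed.</description>
</property>
<property>
  <name>db.fetch.schedule.adaptive.min_interval</name>
  <value>60.0</value>
  <description>Minimum fetch interval, in seconds.</description>
</property>
<property>
  <name>db.fetch.schedule.adaptive.sync_delta</name>
  <value>true</value>
</property>
<property>
  <name>db.fetch.schedule.adaptive.sync_delta_rate</name>
  <value>0.3</value>
</property>
```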