Dear Nutch Community,

Thanks for all the help you have provided so far. More recently, we have
been struggling with adaptive recrawling.

My goal is to run a large-scale crawl job adaptively: starting from a
small number of injected seed sites, within a few cycles I want the crawl
to reach lots of different sites, and within those sites lots of different
articles, rather than fetching the same URLs again and again.

What I'm experiencing, however, is that the set of URLs chosen for
fetching barely ever changes between crawls, even though, from what I can
see, the crawldb is growing steadily round by round.

Since the number of URLs grows rapidly, I have to limit how many URLs are
fetched in each crawl cycle, and for that I use the generate command's
-topN parameter (my crawl cycle is sketched below). Based on the
documentation, I think this is where my problem resides: for some reason,
the same x URLs are chosen as the top candidates to fetch, even though
there should be plenty of others available. So my main question is: how
can I influence the rules that determine which URLs will be chosen for
fetching? Of course, if there is some other way to limit the number of
URLs fetched each round, that would be great too!
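
For context, each cycle is driven by roughly the following commands (the
paths and the topN value here are only illustrative, not my exact setup):

  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $s
  bin/nutch parse $s
  bin/nutch updatedb crawl/crawldb $s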

My configuration seems fine (but probably isn't); it is based on this
recrawl tutorial:
https://www.pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/

*db.fetch.schedule.class* is set to
*org.apache.nutch.crawl.AdaptiveFetchSchedule*, and I have also tried
playing around with the other adaptive settings such as *min_interval*,
*inc_rate* and *dec_rate* (values between 0 and 0.5), *sync_delta* and
*sync_delta_rate*, but with no luck.
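
To give a concrete picture, the relevant part of my nutch-site.xml looks
roughly like this (the values shown are examples of what I have tried,
not my exact settings):

  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>
  <property>
    <!-- never refetch more often than this many seconds; example value -->
    <name>db.fetch.schedule.adaptive.min_interval</name>
    <value>3600.0</value>
  </property>
  <property>
    <!-- grow the fetch interval by this rate when a page is unmodified -->
    <name>db.fetch.schedule.adaptive.inc_rate</name>
    <value>0.4</value>
  </property>
  <property>
    <!-- shrink the fetch interval by this rate when a page has changed -->
    <name>db.fetch.schedule.adaptive.dec_rate</name>
    <value>0.2</value>
  </property>
  <property>
    <name>db.fetch.schedule.adaptive.sync_delta</name>
    <value>true</value>
  </property>
  <property>
    <name>db.fetch.schedule.adaptive.sync_delta_rate</name>
    <value>0.3</value>
  </property>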

Thanks in advance!

Best,

Zoltán
