On 07.10.24 14:18, Markus Jelsma wrote:
Oh this sounds great! It would mean Nutch learns on its own which URLs
to fetch more or less often and adapts the intervals accordingly.
Now I am interested in the inner mechanics. Where can I learn more about
this?
Look at the DefaultFetchSchedule and its parent classes for the basics and
the API. Then you can check AdaptiveFetchSchedule to see how it can change
dynamically based on whether the page's signature has changed. The default
signature class will work just fine, but only for simple sites.
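
As a rough illustration of the adaptive behaviour described above, here is a
simplified sketch, not the actual Nutch source. The class and method names are
made up for this example, and the constants are assumed to mirror the
db.fetch.schedule.adaptive.* settings; check nutch-default.xml for the real
values. The real AdaptiveFetchSchedule also has an optional "sync delta"
adjustment that is left out here.

```java
public class AdaptiveIntervalSketch {
    // Assumed rates and bounds (hypothetical here; see nutch-default.xml).
    static final float INC_RATE = 0.4f;            // grow interval when page is unchanged
    static final float DEC_RATE = 0.2f;            // shrink interval when page has changed
    static final float MIN_INTERVAL = 60f;         // seconds
    static final float MAX_INTERVAL = 31_536_000f; // seconds (365 days)

    // Returns the next fetch interval in seconds, given whether the
    // page's signature changed since the previous fetch.
    static float nextInterval(float current, boolean changed) {
        float next = changed
                ? current * (1.0f - DEC_RATE)
                : current * (1.0f + INC_RATE);
        // Keep the result inside the configured bounds.
        return Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, next));
    }

    public static void main(String[] args) {
        float interval = 86_400f; // start at one day
        boolean[] observed = {false, false, true};
        for (boolean changed : observed) {
            interval = nextInterval(interval, changed);
            System.out.printf("changed=%b -> next interval %.0f s%n", changed, interval);
        }
    }
}
```

Pages that keep changing drift towards the minimum interval, while static
pages drift towards the maximum.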
Excellent hint! Via DefaultFetchSchedule I found the
AdaptiveFetchSchedule, and with that keyword I found this page on the
internet:
https://www.pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/
That page answers all my questions, including how to activate the
AdaptiveFetchSchedule.
The content on that page should go into the standard Nutch documentation.
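
For the archive, activation seems to come down to overriding one property in
conf/nutch-site.xml. The second property is optional and only my assumption
about how to follow the hint on signatures for less simple sites; please
verify both names against nutch-default.xml for your Nutch version.

```xml
<!-- Use the adaptive schedule instead of the default one -->
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>

<!-- Optional (assumption): a signature that tolerates small page changes -->
<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
</property>
```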