On 07.10.24 13:55, Markus Jelsma wrote:
Hello, the refetch interval is mainly controlled by the configured FetchSchedule class, but it can also be controlled in a custom ScoringFilter in updateDbScore(). We use both. Nutch comes with an AdaptiveFetchSchedule that shortens the interval for records that change more often, usually link/hub/overview pages, and lengthens it for unchanging pages. This works well for simple sites.
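To make the adaptive idea concrete, here is a minimal sketch of how such a schedule could adjust an interval after each fetch. It is not the actual Nutch class: the class name AdaptiveIntervalSketch and the method nextInterval are hypothetical, and the sketch ignores everything the real AdaptiveFetchSchedule does beyond the basic shrink/grow/clamp step. The rates and bounds are meant to mirror the db.fetch.schedule.adaptive.inc_rate, dec_rate, min_interval and max_interval properties; the values in main() are what I believe the defaults in nutch-default.xml to be, so please check them against your version.

```java
/** Hypothetical sketch of an adaptive refetch interval; not the Nutch implementation. */
final class AdaptiveIntervalSketch {

    /**
     * Returns the next refetch interval in seconds.
     * A modified page gets a shorter interval, an unmodified page a longer one,
     * and the result is clamped to [minInterval, maxInterval].
     */
    static double nextInterval(double interval, boolean modified,
                               double incRate, double decRate,
                               double minInterval, double maxInterval) {
        double next = modified
                ? interval * (1.0 - decRate)   // page changed: refetch sooner
                : interval * (1.0 + incRate);  // page unchanged: refetch later
        return Math.max(minInterval, Math.min(maxInterval, next));
    }

    public static void main(String[] args) {
        // Assumed values: 30-day starting interval, inc_rate 0.4, dec_rate 0.2,
        // bounds of 60 seconds and 365 days.
        double interval = 2592000.0;
        for (int fetch = 1; fetch <= 5; fetch++) {
            interval = nextInterval(interval, true, 0.4, 0.2, 60.0, 31536000.0);
            System.out.printf("after changed fetch #%d: %.0f s%n", fetch, interval);
        }
    }
}
```

With rates like these, each detected change shortens the interval by a fixed fraction, so a page has to be seen changing (or not changing) over several fetch cycles before its interval moves far from the starting value.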
Oh, this sounds great! It would mean Nutch learns on its own which URLs to fetch more or less often and adapts the intervals accordingly. Now I am interested in the inner mechanics. Where can I learn more about this? How long does the algorithm take to reach one decision or the other? What are the boundaries (minimum and maximum fetch intervals)? Is there a prerequisite that plugins need to fulfill for Nutch to recognize whether the content has changed?

