> Oh this sounds great! It would mean Nutch learns on its own which URLs to fetch more or less often and adapts the intervals accordingly.
> Now I am interested in the inner mechanics. Where can I learn more about this?

Look at DefaultFetchSchedule and its parent classes for the basics and the API. Then check AdaptiveFetchSchedule to see how the interval changes dynamically based on whether the page's signature has changed. The default signature class will work just fine, but only for simple sites.

> How long would it take the algorithm for one or the other decision? What are the boundaries (minimum or maximum fetch intervals)?

That is configurable.

> Is there a prerequisite that plugins need to fulfill for Nutch to realize the content has changed or not?

Detecting whether content has changed is actually a pretty difficult task, but the Signature class controls this. It uses the extracted text as a source to make a signature.

On Mon, 7 Oct 2024 at 14:08, Hiran Chaudhuri <[email protected]> wrote:

> On 07.10.24 13:55, Markus Jelsma wrote:
> > Hello,
> >
> > Refetch interval is mainly controlled by the configured FetchSchedule
> > class. But it can also be controlled in a custom ScoringFilter in
> > updateDbScore(). We use both.
> >
> > Nutch comes with an AdaptiveFetchSchedule that shortens the interval for
> > records that change more often, usually link/hub/overview pages, and does
> > the opposite for unchanging pages. This works well for simple sites.
> >
> Oh this sounds great! It would mean Nutch learns on its own which URLs
> to fetch more or less often and adapts the intervals accordingly.
>
> Now I am interested in the inner mechanics. Where can I learn more about
> this?
> How long would it take the algorithm for one or the other decision? What
> are the boundaries (minimum or maximum fetch intervals)?
> Is there a prerequisite that plugins need to fulfill for Nutch to
> realize the content has changed or not?
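
To make the adaptive adjustment and its boundaries a bit more concrete, here is a minimal standalone sketch of the idea. It is not the Nutch source: the class and method names are hypothetical, and the constants are assumed to mirror the db.fetch.schedule.adaptive.* defaults (inc_rate, dec_rate, min_interval, max_interval), so please verify them against your nutch-default.xml. The real AdaptiveFetchSchedule has more to it (for example the sync_delta handling), which this sketch leaves out.

```java
/**
 * Simplified, standalone illustration of an adaptive refetch interval.
 * NOT Nutch code: names are hypothetical and the constants are assumed
 * to mirror the db.fetch.schedule.adaptive.* defaults.
 */
final class AdaptiveIntervalSketch {

    // Assumed defaults; check nutch-default.xml for the real values.
    static final double INC_RATE = 0.4;               // growth when page is unchanged
    static final double DEC_RATE = 0.2;               // shrink when page has changed
    static final long MIN_INTERVAL_SEC = 60L;         // lower boundary
    static final long MAX_INTERVAL_SEC = 31_536_000L; // upper boundary (365 days)

    /**
     * Returns the next refetch interval in seconds.
     *
     * @param currentSec current interval in seconds
     * @param modified   true if the page signature differs from the last fetch
     */
    static long nextInterval(long currentSec, boolean modified) {
        double next = modified
                ? currentSec * (1.0 - DEC_RATE)   // changed: fetch sooner
                : currentSec * (1.0 + INC_RATE);  // unchanged: fetch later
        long rounded = Math.round(next);
        // Clamp the result to the configured boundaries.
        return Math.max(MIN_INTERVAL_SEC, Math.min(MAX_INTERVAL_SEC, rounded));
    }

    public static void main(String[] args) {
        long interval = 30L * 24 * 3600; // start at 30 days
        boolean[] observed = {false, false, true, true, true};
        for (boolean modified : observed) {
            interval = nextInterval(interval, modified);
            System.out.printf("modified=%b -> next interval %.1f days%n",
                    modified, interval / 86400.0);
        }
    }
}
```

With rates like these, each fetch moves the interval by a fixed percentage, so how quickly a URL converges depends on the starting interval and on how often the signature actually changes between fetches.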

