Hello,

The refetch interval is mainly controlled by the configured FetchSchedule class, but it can also be adjusted from a custom ScoringFilter in updateDbScore(). We use both.
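For illustration, here is a rough sketch of the ScoringFilter approach. It assumes the Nutch 1.x scoring API and that org.apache.nutch.scoring.AbstractScoringFilter is available (newer 1.x releases; otherwise implement ScoringFilter directly). The class name, the trailing-slash test and the one-hour value are made-up examples, not anything taken from this setup:

// Rough sketch only - assumes the Nutch 1.x scoring API; the class name,
// the trailing-slash test and the one-hour interval are illustrative.
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.scoring.AbstractScoringFilter;
import org.apache.nutch.scoring.ScoringFilterException;

public class FolderIntervalScoringFilter extends AbstractScoringFilter {

  private static final int ONE_HOUR = 60 * 60; // fetch intervals are in seconds

  @Override
  public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
      List<CrawlDatum> inlinked) throws ScoringFilterException {
    // Re-check 'directory'-style URLs (folders, IMAP mailboxes) every hour;
    // plain documents keep whatever interval the FetchSchedule assigned.
    if (url.toString().endsWith("/")) {
      datum.setFetchInterval(ONE_HOUR);
    }
  }
}

Like any scoring filter, it has to be packaged as a plugin and activated via plugin.includes.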
Nutch comes with an AdaptiveFetchSchedule that shortens the interval for records that change more often, usually link/hub/overview pages, and lengthens it for unchanging pages. This works well for simple sites; for per-URL-type intervals you can subclass it (rough sketch below the quoted mail).

Regards,
Markus

On Mon, 7 Oct 2024 at 13:43, Hiran Chaudhuri <[email protected]> wrote:
> Going through the different URLs that are handled in protocol plugins
> there are 'directory' links and 'document' links. Assume a filesystem to
> consist of folders and files. Or an IMAP server that organizes emails
> (=documents) in folders.
>
> Now Nutch seems to crawl URLs every 30 days. This may be good enough for
> big remote sites that change only every now and then.
> In my case the documents do not change too often, so 30 days would be
> good enough. But if new files are added I would not like to wait up to a
> month for them to be indexed. Especially for emails I'd like to check
> every hour or so - although emails do not change that often, so fetching
> them every 30 days or longer is absolutely OK.
>
> How can a protocol plugin define which delay should be applied to
> recrawl a URL?
> How can a protocol plugin know when the URL was fetched last time and
> prevent a new fetch if the resource was not modified since?
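As mentioned above, one way to combine the adaptive schedule with per-URL-type limits is a small FetchSchedule subclass. This is only a rough sketch assuming the Nutch 1.x FetchSchedule API; the class name, the trailing-slash heuristic and the one-hour cap are illustrative assumptions, not something from this thread:

// Rough sketch only - assumes the Nutch 1.x FetchSchedule API; the class
// name, the URL heuristic and the one-hour cap are illustrative.
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.AdaptiveFetchSchedule;
import org.apache.nutch.crawl.CrawlDatum;

public class FolderAwareFetchSchedule extends AdaptiveFetchSchedule {

  private static final int ONE_HOUR = 60 * 60; // seconds

  @Override
  public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
      long prevFetchTime, long prevModifiedTime,
      long fetchTime, long modifiedTime, int state) {
    // Let the adaptive logic shorten or lengthen the interval first ...
    datum = super.setFetchSchedule(url, datum, prevFetchTime, prevModifiedTime,
        fetchTime, modifiedTime, state);
    // ... then never let 'directory'-style URLs drift beyond one hour, so
    // newly added files or mails are picked up quickly.
    if (url.toString().endsWith("/") && datum.getFetchInterval() > ONE_HOUR) {
      datum.setFetchInterval(ONE_HOUR);
    }
    return datum;
  }
}

It would be activated by pointing db.fetch.schedule.class at the class in nutch-site.xml; the adaptive behaviour itself is tuned with the db.fetch.schedule.adaptive.* properties.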

