Hello,

The refetch interval is mainly controlled by the configured FetchSchedule class, but it can also be adjusted from a custom ScoringFilter in updateDbScore(). We use both.
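For illustration, here is a rough sketch of the ScoringFilter approach. It assumes the Nutch 1.x scoring API and that org.apache.nutch.scoring.AbstractScoringFilter is available (newer 1.x releases; otherwise implement ScoringFilter directly). The class name, the trailing-slash test and the one-hour value are made-up examples, not anything taken from this setup:

// Rough sketch only - assumes the Nutch 1.x scoring API; the class name,
// the trailing-slash test and the one-hour interval are illustrative.
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.scoring.AbstractScoringFilter;
import org.apache.nutch.scoring.ScoringFilterException;

public class FolderIntervalScoringFilter extends AbstractScoringFilter {

  private static final int ONE_HOUR = 60 * 60; // fetch intervals are in seconds

  @Override
  public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
      List<CrawlDatum> inlinked) throws ScoringFilterException {
    // Re-check 'directory'-style URLs (folders, IMAP mailboxes) every hour;
    // plain documents keep whatever interval the FetchSchedule assigned.
    if (url.toString().endsWith("/")) {
      datum.setFetchInterval(ONE_HOUR);
    }
  }
}

Like any scoring filter, it has to be packaged as a plugin and activated via plugin.includes.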
Nutch comes with an AdaptiveFetchSchedule that shortens the interval for records that change more often, usually link/hub/overview pages, and lengthens it for unchanging pages. This works well for simple sites; for per-URL-type intervals you can subclass it (rough sketch below the quoted mail).

Regards,
Markus

On Mon, 7 Oct 2024 at 13:43, Hiran Chaudhuri <[email protected]> wrote:
> Going through the different URLs that are handled in protocol plugins
> there are 'directory' links and 'document' links. Assume a filesystem to
> consist of folders and files. Or an IMAP server that organizes emails
> (=documents) in folders.
>
> Now Nutch seems to crawl URLs every 30 days. This may be good enough for
> big remote sites that change only every now and then.
> In my case the documents do not change too often, so 30 days would be
> good enough. But if new files are added I would not like to wait up to a
> month for them to be indexed. Especially for emails I'd like to check
> every hour or so - although emails do not change that often, so fetching
> them every 30 days or longer is absolutely OK.
>
> How can a protocol plugin define which delay should be applied to
> recrawl a URL?
> How can a protocol plugin know when the URL was fetched last time and
> prevent a new fetch if the resource was not modified since?
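As mentioned above, one way to combine the adaptive schedule with per-URL-type limits is a small FetchSchedule subclass. This is only a rough sketch assuming the Nutch 1.x FetchSchedule API; the class name, the trailing-slash heuristic and the one-hour cap are illustrative assumptions, not something from this thread:

// Rough sketch only - assumes the Nutch 1.x FetchSchedule API; the class
// name, the URL heuristic and the one-hour cap are illustrative.
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.AdaptiveFetchSchedule;
import org.apache.nutch.crawl.CrawlDatum;

public class FolderAwareFetchSchedule extends AdaptiveFetchSchedule {

  private static final int ONE_HOUR = 60 * 60; // seconds

  @Override
  public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
      long prevFetchTime, long prevModifiedTime,
      long fetchTime, long modifiedTime, int state) {
    // Let the adaptive logic shorten or lengthen the interval first ...
    datum = super.setFetchSchedule(url, datum, prevFetchTime, prevModifiedTime,
        fetchTime, modifiedTime, state);
    // ... then never let 'directory'-style URLs drift beyond one hour, so
    // newly added files or mails are picked up quickly.
    if (url.toString().endsWith("/") && datum.getFetchInterval() > ONE_HOUR) {
      datum.setFetchInterval(ONE_HOUR);
    }
    return datum;
  }
}

It would be activated by pointing db.fetch.schedule.class at the class in nutch-site.xml; the adaptive behaviour itself is tuned with the db.fetch.schedule.adaptive.* properties.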

