Looking at the kinds of URLs that protocol plugins handle, there are
'directory' links and 'document' links. Think of a filesystem that
consists of folders and files, or an IMAP server that organizes emails
(= documents) in folders.

Nutch seems to recrawl URLs every 30 days. That may be good enough for
big remote sites that change only every now and then.
In my case the documents themselves rarely change, so 30 days would be
fine for them. But when new files are added, I do not want to wait up to
a month for them to be indexed. For email in particular I would like to
check the folders every hour or so, even though an individual email
hardly ever changes, so refetching it every 30 days or even less often
is perfectly OK.

How can a protocol plugin define the delay after which a URL should be
recrawled?
How can a protocol plugin find out when a URL was last fetched, and
skip a new fetch if the resource has not been modified since then?
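
To make the two questions concrete, here is a rough sketch of the
behaviour I am after, in plain Java. It is not Nutch API: RecrawlPolicy,
Kind, intervalFor, isDue and needsFetch are made-up names for
illustration, and the one-hour and 30-day intervals are simply the
values from my use case.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical illustration only; none of these names exist in Nutch. */
final class RecrawlPolicy {

    /** The two kinds of link a protocol plugin sees. */
    enum Kind { DIRECTORY, DOCUMENT }

    // Assumed intervals: folders are checked often, documents rarely.
    static final Duration DIRECTORY_INTERVAL = Duration.ofHours(1);
    static final Duration DOCUMENT_INTERVAL = Duration.ofDays(30);

    /** Recrawl delay the plugin would like to request for a URL. */
    static Duration intervalFor(Kind kind) {
        return kind == Kind.DIRECTORY ? DIRECTORY_INTERVAL : DOCUMENT_INTERVAL;
    }

    /** True once the interval for this kind of URL has elapsed. */
    static boolean isDue(Kind kind, Instant lastFetch, Instant now) {
        return !now.isBefore(lastFetch.plus(intervalFor(kind)));
    }

    /** True only if the resource changed after it was last fetched. */
    static boolean needsFetch(Instant lastFetch, Instant lastModified) {
        return lastModified.isAfter(lastFetch);
    }

    public static void main(String[] args) {
        Instant lastFetch = Instant.parse("2024-01-01T00:00:00Z");
        Instant now = lastFetch.plus(Duration.ofHours(2));

        // Two hours later the folder is due again, the document is not.
        System.out.println(isDue(Kind.DIRECTORY, lastFetch, now)); // true
        System.out.println(isDue(Kind.DOCUMENT, lastFetch, now));  // false

        // A document last modified before the last fetch can be skipped.
        System.out.println(needsFetch(lastFetch, lastFetch.minusSeconds(60))); // false
    }
}
```

What I cannot figure out is where a protocol plugin would get lastFetch
from, and how it would hand the desired interval back to the crawler.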
