I am indexing sites where some pages change every hour and some pages
change every year. The pages that change often mostly seem to be the
same pages. I would imagine that this is common for most of the world
wide web, and we should use that knowledge to optimize the Period (the
time until the next re-indexing).
So what I am proposing is using exponential backoff to dynamically
determine the "right" period for every page. It could be done as follows:
MinPeriod = 1h
MaxPeriod = 30d

At first visit:
    Index page.
    Set last_changed = date
    Set reindex_at = date +
        max(MinPeriod, min(MaxPeriod, date - last_changed))

At time reindex_at:
    Re-index page.
    If page changed: Set last_changed = date
    Set reindex_at = date +
        max(MinPeriod, min(MaxPeriod, date - last_changed))
This will double the time between re-indexings as long as the page is
unchanged (capped at MaxPeriod). If the page has changed, the interval
drops back to MinPeriod.
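A minimal sketch of the scheme in Python (the names next_reindex, MIN_PERIOD
and MAX_PERIOD are mine, not from any existing crawler):

```python
from datetime import datetime, timedelta

MIN_PERIOD = timedelta(hours=1)   # assumed MinPeriod = 1h
MAX_PERIOD = timedelta(days=30)   # assumed MaxPeriod = 30d

def next_reindex(now, last_changed):
    """Clamp the time since the last observed change into
    [MIN_PERIOD, MAX_PERIOD] and use it as the next interval."""
    interval = now - last_changed
    interval = max(MIN_PERIOD, min(MAX_PERIOD, interval))
    return now + interval

# Simulate a page that never changes: the wait grows like backoff.
now = datetime(2024, 1, 1)
last_changed = now
for _ in range(5):
    reindex_at = next_reindex(now, last_changed)
    print(reindex_at - now)   # intervals: 1h, 1h, 2h, 4h, 8h
    now = reindex_at
```

Note that the first two intervals are both MinPeriod; from then on the time
since last_changed (and hence each interval) doubles per visit until it
hits MAX_PERIOD.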
If the page has not been changed for the last 2 minutes/hours/days/weeks
chances are that it will not be changed in the next 2
minutes/hours/days/weeks either.
A minor problem is when a URL that has been unmodified for a year suddenly
starts to get modified a lot. This is where MaxPeriod kicks in, so the page
_will_ still be indexed once in a while; and as soon as a change is
detected, the period drops back to MinPeriod.
/Ole