Sounds nice. Anybody care to implement it? It seems easy enough.

PS I have always been thinking about implementing some kind of learning
automaton with a penalty system: a penalty would be given when a
re-indexed document turns out to have changed long ago. The algorithm
described below is similar to that, but more straightforward.

Ole Tange wrote:
> 
> I am indexing sites that have some pages that change every hour and
> some that change every year. Mostly, the pages that change often
> seem to be the same pages. I would imagine this is common across most
> of the world wide web, and we should use that knowledge to optimize the
> Period (time to next re-indexing).
> 
> So what I am proposing is using exponentially backoff to dynamically
> determine the "right" period for every page. It could be done as follows:
> 
> MinPeriod 1h
> MaxPeriod 30d
> 
> Time:
> 1:           Index page.
>              Set reindex_at = date+
>                  max(MinPeriod,min(MaxPeriod, date-last_changed))
> reindex_at:  Re-index page.
>              If page changed: Set last_changed = date
>              Set reindex_at = date+
>                  max(MinPeriod,min(MaxPeriod, date-last_changed))
> 
> This will double the time between reindexings as long as the page is not
> changed (though at most MaxPeriod). If the page is changed, the time
> until the next reindexing will be MinPeriod.
> 
> If the page has not been changed for the last 2 minutes/hours/days/weeks
> chances are that it will not be changed in the next 2
> minutes/hours/days/weeks either.
> 
> A minor problem is when a URL that has been unmodified for a year starts
> to get modified a lot. This is where MaxPeriod will kick in, so the page
> _will_ be indexed once in a while.
> 
> /Ole
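Ole's scheme could be sketched roughly like this in Python. This is only
an illustration, not indexer code: the `next_reindex_at` helper and the
`MIN_PERIOD`/`MAX_PERIOD` constants are hypothetical names standing in for
the MinPeriod/MaxPeriod settings above.

```python
from datetime import datetime, timedelta

MIN_PERIOD = timedelta(hours=1)   # MinPeriod 1h
MAX_PERIOD = timedelta(days=30)   # MaxPeriod 30d

def next_reindex_at(now, last_changed):
    """Schedule the next re-index as far into the future as the page
    has already gone unchanged, clamped to [MIN_PERIOD, MAX_PERIOD]."""
    period = max(MIN_PERIOD, min(MAX_PERIOD, now - last_changed))
    return now + period

# A page unchanged for 4 days gets checked again 4 days from now, so the
# interval roughly doubles each time the page is found unchanged; a page
# that just changed drops back to the 1-hour minimum.
print(next_reindex_at(datetime(2001, 1, 10), datetime(2001, 1, 6)))
# 2001-01-14 00:00:00
```

If the page is found changed at re-index time, `last_changed` is reset to
the current date, which makes `now - last_changed` zero and the clamp
returns MinPeriod, exactly as in the pseudocode above.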

-- 
[EMAIL PROTECTED]  http://kir.vtx.ru/    ICQ 7551596  Phone +7 903 6722750
Hi, I'm a signature virus: copy me to your .signature to help me spread!
--