> 
> What I mean by expired pages is those pages whose Last-Modified date has
> changed since the last fetch.
> Whole-web crawling fetches all pages that are due to be fetched (e.g.,
> every 30 days). These pages may not have actually changed in content. I
> would like to know if there is any way to tell Nutch to compare the last
> modified date and fetch the page only if the date is different from what
> is there in the index. I think this way we can save time by fetching and
> indexing only the modified pages while re-crawling the same site after
> some time.


A long time ago I suggested using the HEAD method, or a GET request with
the If-Modified-Since header (as suggested by Otis), in order to fetch
only changed documents.
The discussion is here: 
http://www.mail-archive.com/[email protected]/msg00091.html
But so far I haven't found the time to implement this feature...
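
To make the idea concrete, here is a minimal sketch of a conditional GET. It is written in Python for brevity and is not Nutch code; the function names (conditional_request, fetch_if_modified) and the opener parameter are hypothetical and only there for illustration. It assumes the crawler has stored the time of the previous fetch as a timezone-aware datetime, and that the server honours If-Modified-Since by answering 304 Not Modified.

```python
from datetime import timezone
from email.utils import format_datetime
import urllib.error
import urllib.request


def conditional_request(url, last_fetch):
    """Build a GET request carrying an If-Modified-Since header.

    last_fetch must be a timezone-aware datetime (assumption of this sketch).
    """
    req = urllib.request.Request(url)
    # HTTP dates are always expressed in GMT, so convert to UTC first.
    http_date = format_datetime(last_fetch.astimezone(timezone.utc), usegmt=True)
    req.add_header("If-Modified-Since", http_date)
    return req


def fetch_if_modified(url, last_fetch, opener=urllib.request.urlopen):
    """Return the page body, or None if the server reports it is unchanged.

    opener is injectable so the logic can be exercised without a network.
    """
    req = conditional_request(url, last_fetch)
    try:
        with opener(req) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        # urllib reports 304 Not Modified as an HTTPError.
        if err.code == 304:
            return None
        raise
```

A crawler would call fetch_if_modified(url, stored_fetch_time) and skip parsing and indexing whenever it gets None back. Servers that ignore the header simply return the full page, so the worst case is the same as an unconditional fetch.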

Jerome

-- 
http://motrech.free.fr/
http://frutch.free.fr/
