> > What I mean by expired pages is those pages whose last Modified date has
> changed since last fetch.
> Whole-web crawling fetches all pages that are due to be fetched (e.g.,
> every 30 days). These pages may not have actually changed in content. I
> would like to know if there is any way to tell Nutch to compare the last
> modified date and fetch the page only if the date is different from what
> is there in the index. I think this way we can save time by fetching and
> indexing only the modified pages while re-crawling the same site after
> some time.

A long time ago I suggested using the HEAD method, or a GET with the
If-Modified-Since header (as suggested by Otis), in order to fetch only
changed documents. The discussion is here:
http://www.mail-archive.com/[email protected]/msg00091.html

But so far I have not found the time to implement this feature...

Jerome

-- 
http://motrech.free.fr/
http://frutch.free.fr/
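
To make the idea concrete, here is a minimal sketch of a conditional request in plain Java, using only java.net.HttpURLConnection. It is not Nutch code and not an existing Nutch feature: the class name ConditionalFetch and the methods needsRefetch and hasChangedSince are hypothetical names made up for this illustration, and the time of the last fetch is assumed to be stored somewhere by the crawler.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class ConditionalFetch {

    // Decision step, kept separate from the network call.
    // status: HTTP status code returned by the server.
    // lastModifiedMillis: value of the Last-Modified header, 0 if absent.
    // lastFetchMillis: when we last fetched the page (assumed to be stored by the crawler).
    static boolean needsRefetch(int status, long lastModifiedMillis, long lastFetchMillis) {
        if (status == HttpURLConnection.HTTP_NOT_MODIFIED) {
            return false; // 304: the server says nothing changed
        }
        if (lastModifiedMillis > 0 && lastModifiedMillis <= lastFetchMillis) {
            return false; // server ignored the condition, but the date is not newer
        }
        return true; // changed, unknown date, or any other status: fetch to be safe
    }

    // Sends a HEAD request carrying If-Modified-Since and applies the rule above.
    static boolean hasChangedSince(String url, long lastFetchMillis) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod("HEAD");            // headers only, no body
            conn.setIfModifiedSince(lastFetchMillis); // sets the If-Modified-Since header
            int status = conn.getResponseCode();
            return needsRefetch(status, conn.getLastModified(), lastFetchMillis);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: ConditionalFetch <url> <lastFetchMillis>");
            return;
        }
        boolean changed = hasChangedSince(args[0], Long.parseLong(args[1]));
        System.out.println(changed ? "fetch again" : "skip, not modified");
    }
}
```

Note that not every server sends Last-Modified or honours If-Modified-Since, so this sketch refetches whenever the date is unknown. With GET instead of HEAD, a changed page comes back in the same request, which saves one round trip.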
