> Can you please suggest how to go about implementing this? I would like
> to add this check.
In the HttpResponse class, just add something like the following (it uses the
If-Modified-Since header, not the HEAD method):
reqStr.append("If-Modified-Since: ");
reqStr.append(TheDateToCheck);
reqStr.append("\r\n");
just before the last line:
reqStr.append("\r\n");
Then you must format the content of the TheDateToCheck variable correctly (see
http://www.zvon.org/tmRFC/RFC2068/Output/chapter14.html#sub24 for a
description of the If-Modified-Since header specification).
Then the question is: where does the TheDateToCheck value come from?
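For the date format, here is a minimal sketch, assuming the date is available as milliseconds since the epoch. The class and method names (HttpDateSketch, format, ifModifiedSinceHeader) are made up for the example and are not part of Nutch. It produces the fixed-length RFC 1123 style date, always in GMT and with English day and month names:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not an existing Nutch class.
class HttpDateSketch {

    // Formats a timestamp (milliseconds since the epoch) as an HTTP date,
    // e.g. "Sun, 06 Nov 1994 08:49:37 GMT".
    static String format(long epochMillis) {
        // SimpleDateFormat is not thread-safe, so a new instance is created per call.
        SimpleDateFormat fmt =
            new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US);
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        return fmt.format(new Date(epochMillis));
    }

    // Builds the complete header line, terminated by CRLF.
    static String ifModifiedSinceHeader(long epochMillis) {
        return "If-Modified-Since: " + format(epochMillis) + "\r\n";
    }

    public static void main(String[] args) {
        // prints "If-Modified-Since: Sun, 06 Nov 1994 08:49:37 GMT"
        System.out.print(ifModifiedSinceHeader(784111777000L));
    }
}
```

Locale.US matters here: with the default locale of the machine, the day and month names could come out in another language, which servers are not required to understand.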
1. From the previously indexed document (I know that this information is
stored): this certainly consumes more processing time than the second solution.
My knowledge of Nutch internals is not enough to know how to quickly retrieve
this information from the document's URL... can someone help us on
this point?
2. From the last time you launched the fetch: a global date that
you store in a file and retrieve in the HttpResponse class to perform the
date check...
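For the second option, a minimal sketch of the file storage could look like this. Everything here is an assumption for illustration: the class name LastFetchStoreSketch, the idea of storing the date as plain-text milliseconds, and the choice of -1 to mean "no previous fetch" (in which case the header should simply not be sent):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not an existing Nutch class.
class LastFetchStoreSketch {

    // Returns the stored fetch time in milliseconds since the epoch,
    // or -1 if the file does not exist or is empty.
    static long read(Path file) throws IOException {
        if (!Files.exists(file)) {
            return -1L;
        }
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
        return text.isEmpty() ? -1L : Long.parseLong(text);
    }

    // Overwrites the file with the given fetch time.
    static void write(Path file, long epochMillis) throws IOException {
        Files.write(file, Long.toString(epochMillis).getBytes(StandardCharsets.UTF_8));
    }
}
```

The date written should probably be the time the fetch started, not the time it ended, so that pages modified while the fetch was running are not missed on the next run.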
You must change the Http.java class too, in order to correctly handle the
304 response code:
After the lines
} else if (code == 410) { // page is gone
throw new ResourceGone(url, "Http: " + code);
Just add something like:
} else if (code == 304) {
throw new HttpException("Not modified: " + urlString);
(I hope that throwing the exception will not remove the previously indexed
copy. I think it will not: it is the ResourceGone exception that is used to
indicate that a resource no longer exists, but if someone can confirm...)
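If the generic exception turns out to be a problem, one possible way to keep the two cases apart is a dedicated exception type for 304, so that the calling code can tell "unchanged" from "gone". This is only a sketch with made-up names (StatusCheckSketch, NotModifiedException, GoneException); it does not use the real Nutch classes, and how the caller should react is still the open question above:

```java
// Hypothetical classes, not the real Nutch Http.java or its exceptions.
class StatusCheckSketch {

    // Signals that the server answered 304: the stored copy is still current.
    static class NotModifiedException extends Exception {
        NotModifiedException(String url) {
            super("Not modified: " + url);
        }
    }

    // Signals that the server answered 410: the resource no longer exists.
    static class GoneException extends Exception {
        GoneException(String url) {
            super("Gone: " + url);
        }
    }

    // Throws a distinct exception per status code; returns normally otherwise.
    static void check(int code, String url)
            throws NotModifiedException, GoneException {
        if (code == 410) {
            throw new GoneException(url);
        } else if (code == 304) {
            throw new NotModifiedException(url);
        }
    }
}
```

A caller can then catch NotModifiedException separately and keep the existing copy, while treating GoneException as a deletion.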
Jerome
--
http://motrech.free.fr/
http://frutch.free.fr/