Re: [Nutch-dev] Suggested changes to Protocol & Fetcher to support variable fetch intervals

Olaf Thiele Tue, 06 Jul 2004 09:12:10 -0700

Hi Andrzej,
I really like your ideas for saving bandwidth
and as you requested comments, here they are:

1. Exceptions, as far as I know, should not be used
to exchange regular information, only exceptional states.
So I think this would be all right for a quick patch, but
some other way of returning the information would probably
make for more stable and longer-lasting software.

2. The proposed algorithm sounds a lot like TCP window resizing,
so the approach is widely tested and accepted. But I recall
problems with this sawtooth behaviour: it may use more bandwidth,
because pages are crawled more often. Maybe fetchInterval *= 0.7
(instead of fetchInterval /= 2) would be better. Furthermore, there
should be a minimum interval, otherwise a page could be fetched
continuously.

3. Last-modified header information can be used for static pages, but
for scripts (like PHP) the authors have to send this header information
themselves. I don't know what happens if they omit it, but this could
screw up the algorithm. Some pages would be crawled pretty often without
getting new information.

4. One could use HTTP 1.1 for fetching pages. That would save some
time and bandwidth. But the servers have to be 1.1 compatible, which
is not very common today.
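
To make points 1 and 2 concrete, here is a rough sketch of what I mean. All names are made up for illustration and none of them exist in Nutch. The fetch outcome is returned as a plain value rather than thrown as an exception, and the interval is never allowed to drop below a floor. The one-day minimum and the use of days as the unit are my assumptions; the factors 0.7 and 1.2 are the ones discussed in this thread.

```java
// Hypothetical sketch; these class and method names are not part of Nutch.
final class FetchIntervalSketch {

    /** Outcome of a fetch, returned as a value instead of thrown as an exception. */
    enum FetchStatus { MODIFIED, NOT_MODIFIED }

    // Assumed floor, in days. The thread only says that some minimum is needed.
    static final double MIN_INTERVAL_DAYS = 1.0;

    // Factors taken from the thread: shrink by 0.7 on change, grow by 1.2 otherwise.
    static final double SHRINK_FACTOR = 0.7;
    static final double GROW_FACTOR = 1.2;

    /** Returns the new fetch interval in days, never below MIN_INTERVAL_DAYS. */
    static double nextInterval(double currentDays, FetchStatus status) {
        double factor = (status == FetchStatus.MODIFIED) ? SHRINK_FACTOR : GROW_FACTOR;
        return Math.max(MIN_INTERVAL_DAYS, currentDays * factor);
    }

    public static void main(String[] args) {
        double interval = 30.0;
        interval = nextInterval(interval, FetchStatus.MODIFIED);
        System.out.println("after a change: " + interval + " days");
        interval = nextInterval(interval, FetchStatus.NOT_MODIFIED);
        System.out.println("after no change: " + interval + " days");
    }
}
```

With a floor in place, a page that changes on every visit settles at the minimum interval instead of shrinking towards zero.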

Regards,
Olaf

Andrzej Bialecki wrote:

Hi,
Reading the searchenginewatch forum the other day, I came to the conclusion that Nutch is currently rather careless about bandwidth: it always fetches pages once their getNextFetchTime() has arrived, whether or not the pages have really changed.

What it should do instead is to put an "If-Modified-Since" header (or perform an equivalent check for other protocols), and use the time of the last update to check if it needs to fetch the new content. For local files this could be the last modification time.

The benefits are obvious: it saves bandwidth and CPU for parsing, and it also gives important information about how quickly the resource is changing.

In order to implement this, plugins should support a slightly extended API, and this API should be used by Fetcher. I suggest the following:

* change the method signature of Protocol.getContent(String url) to Protocol.getContent(Page page). The method should throw a new exception type, e.g. ResourceNotModified.
* use this new method in Fetcher.java:88.
* what would be the action if the content is not modified? Well, I guess the code in Fetcher.java should follow the path to handleNoFetch(), using a modified FetchListEntry to adjust the fetch interval and the next fetch time. A simple algorithm was suggested by someone on this list (if changed: fetchInterval /= 2; if not changed: fetchInterval *= 1.2). These values will then be propagated to the WebDB during the database update, and eventually they will self-adjust to the frequency of updates of the original resource.

* in the protocol-http plugin add the "If-Modified-Since" header, based on (page.getNextFetchTime() - page.getFetchInterval()), which is the last time the fetch occurred. If the resource is not modified, throw ResourceNotModified.

* in the protocol-file, modify FileResponse.getFileAsHttpResponse() to check the last modification time using the same formula as above.

* in the protocol-ftp, right after FtpResponse.java:285 add logic to check the last modification time of the file (taken from the listing attributes), again using the same formula.
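
The formula used in the last three points could look roughly like the sketch below. This is only an illustration under assumptions: the class and method names are hypothetical, and I assume the next fetch time is in epoch milliseconds and the fetch interval is in whole days, so the interval has to be converted before subtracting. The date is formatted in the fixed GMT form that HTTP expects for "If-Modified-Since".

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; these names are not part of Nutch or its plugins.
final class ConditionalFetchSketch {

    // HTTP date format, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
    private static final DateTimeFormatter HTTP_DATE =
        DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
                         .withZone(ZoneOffset.UTC);

    /** Last fetch time = next fetch time minus the interval (assumed to be in days). */
    static long lastFetchMillis(long nextFetchTimeMillis, int fetchIntervalDays) {
        return nextFetchTimeMillis - TimeUnit.DAYS.toMillis(fetchIntervalDays);
    }

    /** Value for the "If-Modified-Since" request header in protocol-http. */
    static String ifModifiedSinceValue(long lastFetchMillis) {
        return HTTP_DATE.format(Instant.ofEpochMilli(lastFetchMillis));
    }

    /** Equivalent check for file and ftp: unchanged if not modified after the last fetch. */
    static boolean isUnmodified(long resourceLastModifiedMillis, long lastFetchMillis) {
        return resourceLastModifiedMillis <= lastFetchMillis;
    }

    public static void main(String[] args) {
        long nextFetch = System.currentTimeMillis();
        long lastFetch = lastFetchMillis(nextFetch, 30);
        System.out.println("If-Modified-Since: " + ifModifiedSinceValue(lastFetch));
    }
}
```

For the http case the server does the comparison and answers 304 when nothing has changed; for file and ftp the plugin would do the comparison itself against the file's modification time.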

Any comments? If you think it's a good idea, I may try to provide the patches...

