Hi Andrzej, I really like your ideas for saving bandwidth. Since you asked for comments, here they are:
1. Exceptions, as far as I know, should not be used to exchange regular information, only to signal exceptional states. So I think this would be all right for a quick patch, but some other way of passing the information back might produce more stable and longer-lasting software.
2. The proposed algorithm sounds a lot like TCP window resizing, so the approach is widely tested and accepted. But I recall problems with this chainsaw algorithm: it may use more bandwidth, because pages are crawled more often. Maybe fetchInterval *= 0.7 would be better. Furthermore, there should be a minimum interval, otherwise a page could end up being fetched continuously.
3. The Last-Modified header can be used for static pages, but for scripts (like PHP) the authors have to send this header themselves. I don't know what happens if they omit it, but this could screw up the algorithm. Some pages would be crawled pretty often without yielding any new information.
4. One could use HTTP 1.1 for fetching pages. That would save some time and bandwidth. But the servers have to be 1.1 compatible, which is not very common today.
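To make point 2 concrete, here is a minimal sketch of the interval adjustment with a lower bound. The class name, the use of milliseconds, and the concrete bounds are assumptions made up for this example and are not part of Nutch; the upper bound is also an addition that the discussion above does not mention.

```java
/** Hypothetical helper, not part of Nutch. All intervals are in milliseconds (an assumption). */
final class FetchIntervalSketch {
    // Assumed bounds, chosen only for illustration.
    static final long MIN_INTERVAL_MS = 60L * 60 * 1000;            // 1 hour
    static final long MAX_INTERVAL_MS = 90L * 24 * 60 * 60 * 1000;  // 90 days

    static final double SHRINK_ON_CHANGE = 0.7;   // gentler than halving the interval
    static final double GROW_ON_NO_CHANGE = 1.2;  // growth factor from the original proposal

    /** Returns the next fetch interval, clamped to [MIN_INTERVAL_MS, MAX_INTERVAL_MS]. */
    static long adjust(long currentIntervalMs, boolean changed) {
        double factor = changed ? SHRINK_ON_CHANGE : GROW_ON_NO_CHANGE;
        long next = Math.round(currentIntervalMs * factor);
        // The lower bound keeps a frequently changing page from being fetched continuously.
        return Math.max(MIN_INTERVAL_MS, Math.min(MAX_INTERVAL_MS, next));
    }
}
```

With these assumed values, a 10-day interval shrinks to 7 days after a change and grows to 12 days when nothing changed.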
Regards, Olaf
Andrzej Bialecki wrote:
Hi,
While reading the searchenginewatch forum the other day, I came to the conclusion that Nutch is currently rather careless about bandwidth: it always fetches pages once their getNextFetchTime() has arrived, no matter whether the pages have really changed or not.
What it should do instead is to put an "If-Modified-Since" header (or perform an equivalent check for other protocols), and use the time of the last update to check if it needs to fetch the new content. For local files this could be the last modification time.
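As a rough illustration of the HTTP case, the sketch below uses the standard java.net classes rather than the Nutch protocol plugins. The class and method names are hypothetical, and the last fetch time is assumed to be given in milliseconds since the epoch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Hypothetical helper, not part of Nutch. */
final class ConditionalGetSketch {
    /**
     * Opens a connection object with the If-Modified-Since time set.
     * No request is sent until the caller connects, e.g. via getResponseCode().
     */
    static HttpURLConnection prepare(String url, long lastFetchMs) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setIfModifiedSince(lastFetchMs);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True if the server answered 304 Not Modified, so there is no new body to parse. */
    static boolean isNotModified(int responseCode) {
        return responseCode == HttpURLConnection.HTTP_NOT_MODIFIED;
    }
}
```

A caller would then do something like isNotModified(conn.getResponseCode()) and skip parsing when it returns true.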
Benefits are obvious: it saves bandwidth and CPU for parsing, and it also gives important information about how quickly the resource is changing.
In order to implement this, plugins should support a slightly extended API, and this API should be used by Fetcher. I suggest the following:
* change the method signature of Protocol.getContent(String url) to Protocol.getContent(Page page). The method should throw a new exception type, e.g. ResourceNotModified.
* use this new method in Fetcher.java:88.
* what should happen if the content is not modified? Well, I guess the code in Fetcher.java should follow the path to handleNoFetch(), using a modified FetchListEntry to adjust the fetch interval and the next fetch time. A simple algorithm was suggested by someone on this list (if changed: fetchInterval /= 2; if not changed: fetchInterval *= 1.2). These values will then be propagated to the WebDB during the database update, and eventually they will self-adjust to the update frequency of the original resource.
* in the protocol-http plugin add the "If-Modified-Since" header, based on the (page.getNextFetchTime() - page.getFetchInterval()) which is the last time the fetch occurred. If the resource is not modified, throw ResourceNotModified.
* in the protocol-file, modify FileResponse.getFileAsHttpResponse() to check the last modification time using the same formula as above.
* in the protocol-ftp, right after FtpResponse.java:285 add logic to check the last modification time of the file (taken from the listing attributes), again using the same formula.
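The check shared by the file and ftp cases could look roughly like the sketch below. Only the name ResourceNotModified and the formula (next fetch time minus fetch interval) come from the list above; the class body, the helper names, and the assumption that both values are in milliseconds are made up for this illustration.

```java
import java.io.File;

/** Exception type named in the proposal; its body here is an assumption. */
class ResourceNotModified extends Exception {
    ResourceNotModified(String message) { super(message); }
}

/** Hypothetical helper, not part of Nutch. All times are in milliseconds (an assumption). */
final class ModificationCheckSketch {
    /** Last fetch time, derived as next fetch time minus fetch interval. */
    static long lastFetchTime(long nextFetchTimeMs, long fetchIntervalMs) {
        return nextFetchTimeMs - fetchIntervalMs;
    }

    /** Throws ResourceNotModified if the resource has not changed since the last fetch. */
    static void requireModified(long lastModifiedMs, long nextFetchTimeMs, long fetchIntervalMs)
            throws ResourceNotModified {
        long lastFetch = lastFetchTime(nextFetchTimeMs, fetchIntervalMs);
        if (lastModifiedMs <= lastFetch) {
            throw new ResourceNotModified("unchanged since " + lastFetch);
        }
    }

    /** Variant for local files. Note that File.lastModified() returns 0 if the file does not exist. */
    static void requireModified(File file, long nextFetchTimeMs, long fetchIntervalMs)
            throws ResourceNotModified {
        requireModified(file.lastModified(), nextFetchTimeMs, fetchIntervalMs);
    }
}
```

For ftp, the same long-based method could be fed the modification time taken from the listing attributes.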
Any comments? If you think it's a good idea, I may try to provide the patches...
_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
