Andrzej:

Since a crawler is a form of web browser, it should try to mimic most browser behaviors (only the good ones, not the bugs). In that light, your idea seems very good.

You may also want to investigate the use of HTTP ETag headers (sent back to the server in an If-None-Match request header) to see whether a page has changed before fetching it.
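
For what it's worth, here is a rough sketch of such an ETag check using the JDK's own HttpURLConnection rather than anything from the Nutch code base. The class name ETagCheck and the idea of storing the previous ETag alongside the page are just assumptions for illustration:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper, not part of Nutch.
final class ETagCheck {

    /**
     * Sends a conditional GET carrying the ETag remembered from the previous
     * fetch. Returns true when the server answers 304 Not Modified.
     */
    static boolean unchanged(String url, String storedETag) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod("GET");
            if (storedETag != null) {
                // The server compares this value with the resource's current ETag.
                conn.setRequestProperty("If-None-Match", storedETag);
            }
            return isNotModified(conn.getResponseCode());
        } finally {
            conn.disconnect();
        }
    }

    /** True for HTTP status 304 (Not Modified). */
    static boolean isNotModified(int status) {
        return status == HttpURLConnection.HTTP_NOT_MODIFIED;
    }
}
```

When the answer is not 304, the fetcher would read the body as usual and remember the new ETag response header for the next round.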

shiraz


Andrzej Bialecki wrote:

Hi,

While reading the searchenginewatch forum the other day, I came to the conclusion that Nutch is currently rather careless about bandwidth: it always fetches pages once their getNextFetchTime() has arrived, regardless of whether the pages have actually changed.

What it should do instead is to put an "If-Modified-Since" header (or perform an equivalent check for other protocols), and use the time of the last update to check if it needs to fetch the new content. For local files this could be the last modification time.
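
As a rough illustration only, the header value could be built from the time of the last fetch like this. The class name IfModifiedSince is made up for the example, and the timestamp is assumed to be in milliseconds since the epoch:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper, not part of Nutch.
final class IfModifiedSince {

    // Fixed-length HTTP date, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
    private static final DateTimeFormatter HTTP_DATE =
        DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
                         .withZone(ZoneOffset.UTC);

    /** Formats the last fetch time (ms since epoch, an assumption) for the header. */
    static String headerValue(long lastFetchMs) {
        return HTTP_DATE.format(Instant.ofEpochMilli(lastFetchMs));
    }
}
```

A request would then carry a line such as "If-Modified-Since: " + IfModifiedSince.headerValue(lastFetchMs), and a 304 response means the content does not need to be downloaded again.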

The benefits are obvious: it saves bandwidth and the CPU time spent on parsing, and it also gives important information about how quickly the resource is changing.

In order to implement this, plugins should support a slightly extended API, and this API should be used by Fetcher. I suggest the following:

* change the method signature of Protocol.getContent(String url) to Protocol.getContent(Page page). The method should throw a new exception type, e.g. ResourceNotModified.

* use this new method in Fetcher.java:88.

* what would be the action if the content is not modified? Well, I guess the code in Fetcher.java should follow the path to handleNoFetch(), using a modified FetchListEntry to adjust the fetch interval and the next fetch time. A simple algorithm was suggested by someone on this list (if changed: fetchInterval /= 2; if not changed: fetchInterval *= 1.2). These values will then be propagated to the WebDB during the database update, and eventually they will self-adjust to the update frequency of the original resource.

* in the protocol-http plugin add the "If-Modified-Since" header, based on (page.getNextFetchTime() - page.getFetchInterval()), which is the last time the fetch occurred. If the resource is not modified, throw ResourceNotModified.

* in the protocol-file, modify FileResponse.getFileAsHttpResponse() to check the last modification time using the same formula as above.

* in the protocol-ftp, right after FtpResponse.java:285 add logic to check the last modification time of the file (taken from the listing attributes), again using the same formula.
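
To make the points above a bit more concrete, here is a minimal sketch. It is not a patch against the real classes: RefetchSketch and its method names are invented for the example, both time values are assumed to be in milliseconds, and ResourceNotModified is the new exception type proposed above.

```java
import java.io.File;

/** The exception type proposed above; it does not exist in Nutch yet. */
class ResourceNotModified extends Exception {
    ResourceNotModified(String url) {
        super("Not modified: " + url);
    }
}

// Hypothetical helper, not part of Nutch.
final class RefetchSketch {

    /** Last fetch time per the formula above; both arguments assumed to be ms. */
    static long lastFetchTime(long nextFetchTimeMs, long fetchIntervalMs) {
        return nextFetchTimeMs - fetchIntervalMs;
    }

    /** Local-file check: throws if the file was not modified after the last fetch. */
    static void checkFile(File f, long lastFetchMs) throws ResourceNotModified {
        long modified = f.lastModified(); // returns 0 if the file does not exist
        if (modified != 0L && modified <= lastFetchMs) {
            throw new ResourceNotModified(f.getPath());
        }
    }

    /** Interval rule quoted above: halve on change, grow by 20% otherwise. */
    static double adjustInterval(double fetchInterval, boolean changed) {
        return changed ? fetchInterval / 2.0 : fetchInterval * 1.2;
    }
}
```

With these numbers a page that keeps changing has its interval halved on every fetch, while a static page drifts upward by 20% each time, so some minimum and maximum bounds would probably be needed in practice.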

Any comments? If you think it's a good idea, I may try to provide the patches...




_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
