Olaf Thiele wrote:
> Hi Andrzej, I really like your ideas for saving bandwidth and as you requested comments, here they are:
Hello Olaf,
My comments inline below.
> 1. Exceptions, as far as I know, should not be used to exchange regular information, just exceptional states. So I think this would be all right for a quick patch, but some other way of submitting the information would probably produce more stable and longer-lasting software.
True. However, if you look at Fetcher.FetcherThread.run() you'll see that that's precisely the method it uses now, I was just following the trend... erhm, minimizing the patches, that is... ;-)
> 2. The proposed algorithm sounds a lot like TCP window resizing, so it is widely tested and accepted. But I recall problems with this sawtooth algorithm, as it may consume more bandwidth because pages are crawled more often. Maybe fetchInterval *= 0.7
I agree - however, the more I think about it, the more I realize that the true benefit of this algorithm is not the absolute reduction in bandwidth consumption - that's perhaps a useful by-product - but fairer treatment of sites with quickly vs. slowly changing content. This results in less traffic for static sites (and more traffic for dynamic ones - but they can presumably take it), and at the same time in a higher-quality index, more up-to-date for URLs with frequently changing content.
The multipliers and divisors' values should be taken from the NutchConf, so that you can tune them for your installation.
> would be better. Furthermore, there should be a minimum, otherwise a page could be fetched continuously.
Yes, that's true - 1 day would probably be a sensible default minimum... Of course, there is still an "operational minimum" determined by how often you run the FetchListTool/Fetcher.
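To make the discussion concrete, here is a minimal sketch of such an adaptive interval. The class name, constants, and default values below are all hypothetical, chosen just for illustration - as noted above, in a real patch the multiplier, divisor, and bounds would be read from NutchConf:

```java
/**
 * Illustrative sketch of an adaptive re-fetch interval (not Nutch's actual
 * API). Unchanged pages back off; changed pages are fetched more eagerly,
 * clamped between a floor and a ceiling.
 */
public class AdaptiveFetchInterval {

    // Assumed tunables - in a real patch these would come from NutchConf.
    static final float UNCHANGED_MULTIPLIER = 1.5f; // back off for static pages
    static final float CHANGED_MULTIPLIER = 0.7f;   // Olaf's suggested factor
    static final long MIN_INTERVAL = 24L * 60 * 60 * 1000;      // 1 day floor
    static final long MAX_INTERVAL = 90L * 24 * 60 * 60 * 1000; // 90 day cap

    /** Compute the next interval (ms) from the current one and the fetch outcome. */
    static long nextInterval(long current, boolean changed) {
        float factor = changed ? CHANGED_MULTIPLIER : UNCHANGED_MULTIPLIER;
        long next = (long) (current * factor);
        if (next < MIN_INTERVAL) next = MIN_INTERVAL; // never fetch continuously
        if (next > MAX_INTERVAL) next = MAX_INTERVAL; // never forget a page
        return next;
    }

    public static void main(String[] args) {
        long day = 24L * 60 * 60 * 1000;
        long interval = 7 * day;
        System.out.println("unchanged: " + nextInterval(interval, false) / day + " days");
        System.out.println("changed:   " + nextInterval(interval, true) / day + " days");
    }
}
```

Over repeated fetch cycles, static pages drift toward the ceiling and frequently changing pages toward the floor, which is exactly the fairness effect described above.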
> 3. Last-modified header information can be used for static pages, but for scripts (like PHP) the authors have to send this header information themselves. I don't know what happens if they omit it, but this could screw up the algorithm. Some pages would be crawled pretty often without getting new information.
Well, I believe it's somewhat different from what you suggest. If the server properly handles "If-Modified-Since" and the content is unchanged, then it does NOT respond with 200 OK, but with 304 Not Modified. So the difference is pretty obvious - if you get 200, you have to get the content anyway; if you get 304, you don't. From this POV it doesn't matter what generated the response.
Now, what can you do next? If you got the content because of 200 response, you can apply a checksum to it, so that you can discover if it's really unchanged... Of course, you already wasted some bandwidth, because you had to get the page anyway, but now that you know it's not changed, then you can apply the same algorithm as before.
This approach requires that you store the content checksum in the index - fairly small cost for the gain it gives.
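The two-step decision described above can be sketched as follows. The class and method names are hypothetical, not Nutch's actual API; only the MD5 digest via `java.security.MessageDigest` is standard:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/**
 * Illustrative sketch (not Nutch's actual API) of the 200-vs-304 logic:
 * a 304 answers "unchanged" for free; on a 200 the bandwidth for the body
 * is already spent, so we fall back to comparing content checksums.
 */
public class ChangeDetector {

    /** Hex MD5 of the fetched content - the value to store in the index. */
    static String checksum(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) sb.append(String.format("%02x", b & 0xff));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 is always available
        }
    }

    /** Decide whether the page changed, given the HTTP status and body. */
    static boolean changed(int status, byte[] body, String storedChecksum) {
        if (status == 304) return false; // server says: not modified, no body sent
        // 200: body already transferred, so compare against the stored checksum
        return !checksum(body).equals(storedChecksum);
    }
}
```

Either branch then feeds the same interval-adjustment algorithm: `changed == false` backs off, `changed == true` shortens the interval.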
> 4. One could use HTTP 1.1 for fetching pages. That would save some time and bandwidth, but the servers have to be 1.1 compatible, which is not very common today.
I remember there were some problems with 1.1 interoperability, so it was disabled for now.
Thank you for these comments - very useful!
-- Best regards, Andrzej Bialecki
------------------------------------------------- Software Architect, System Integration Specialist CEN/ISSS EC Workshop, ECIMF project chair EU FP6 E-Commerce Expert/Evaluator ------------------------------------------------- FreeBSD developer (http://www.freebsd.org)
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
