Shawn wrote:
Hi, I have been trying to figure out a way to limit the massive amount of bandwidth that search bots (Googlebot/2.1) consume from my website every day. My problem is that I am running Apache::ASP and about 90% of the site is dynamic content, with links such as product.htm?id=100. The dynamic content changes quite a bit, so I don't want any caching for regular users, but it would be fine for the bots to use a cached copy for a month or so. The solution I came up with is manually modifying the headers to keep sending back 304 HTTP_NOT_MODIFIED for a month before serving fresh content, and to do this only for search bots, not for regular web browsers. Can anyone tell me if there are problems you foresee with doing something like this? I have only tested it on a dev server and was wondering if anyone else has had this problem, or has any suggestions.
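A minimal Apache::ASP sketch of the 304 scheme Shawn describes could look like the following; the bot pattern, the one-month window, and the Last-Modified bookkeeping are assumptions for illustration, not code from his site:

    # Hypothetical include at the top of product.htm (Apache::ASP).
    # Sends 304 Not Modified to known search bots for up to a month,
    # while regular browsers always get fresh dynamic content.
    use HTTP::Date qw(time2str str2time);

    my $ua  = $Request->ServerVariables('HTTP_USER_AGENT') || '';
    my $ims = $Request->ServerVariables('HTTP_IF_MODIFIED_SINCE');

    if ($ua =~ /Googlebot|Slurp|msnbot/i) {
        my $since = $ims ? str2time($ims) : undef;
        if ($since && (time - $since) < 30 * 24 * 60 * 60) {
            # Bot already has a copy newer than a month: short-circuit.
            $Response->{Status} = 304;   # HTTP_NOT_MODIFIED
            $Response->End;
        }
        # First visit, or the month is up: stamp the response so the
        # bot sends If-Modified-Since on its next request.
        $Response->AddHeader('Last-Modified', time2str(time));
    }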


You could also try compressing your content with the CompressGzip setting. You could set the Expires header to one month in the future. You could set up a /robots.txt file to disallow Google from crawling portions of your site that might be excludable and high bandwidth. You could also sleep(N) seconds when Google makes a request; I wonder if that would slow their spiders down across their cluster(s).
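A rough Apache::ASP fragment for the Expires and sleep(N) ideas might look like this; the bot pattern, the 30-day window, and the 3-second delay are placeholders, and it would probably live in a shared include or global.asa Script_OnStart:

    # Hypothetical fragment run at the start of each bot request.
    my $ua = $Request->ServerVariables('HTTP_USER_AGENT') || '';

    if ($ua =~ /Googlebot|Slurp|msnbot/i) {
        # Ask the bot to treat this page as fresh for 30 days.
        $Response->{Expires} = 30 * 24 * 60 * 60;   # seconds from now

        # Crude throttle: make each bot request cost a few extra seconds.
        sleep(3);
    }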

Just ideas, I have not tried to throttle search bots before.

Oh, you might write your own custom mod_perl module that keeps track
of bandwidth for search bots and sends a 503 "server busy" error code
if the bandwidth is exceeded.  This might tell Google to back off for
a while (?).
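A rough mod_perl 1.x sketch of that idea follows; the module name, the byte limit, and the per-process counter are all assumptions, and a real version would need storage shared across Apache children (a dbm file, for example):

    package My::BotThrottle;
    # Hypothetical PerlAccessHandler: returns 503 to search bots once
    # they have pulled more than a set number of bytes from this child.
    use strict;
    use Apache::Constants qw(DECLINED HTTP_SERVICE_UNAVAILABLE);

    my $LIMIT = 50 * 1024 * 1024;   # 50 MB per child per day (placeholder)
    my %bytes;                      # per-process counter only

    sub handler {
        my $r  = shift;
        my $ua = $r->header_in('User-Agent') || '';
        return DECLINED unless $ua =~ /Googlebot/i;

        my $day = int(time / 86400);
        if (($bytes{$day} || 0) > $LIMIT) {
            $r->header_out('Retry-After' => '3600');  # hint: come back later
            return HTTP_SERVICE_UNAVAILABLE;          # 503
        }

        # Record what this request actually sent, once it has been served.
        $r->register_cleanup(sub { $bytes{$day} += $r->bytes_sent; 0 });
        return DECLINED;   # let the normal handlers serve the page
    }

    1;

It would be wired in with something like "PerlAccessHandler My::BotThrottle" in httpd.conf for the relevant location.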

Regards,

Josh
________________________________________________________________
Josh Chamas, Founder                   phone:925-552-0128
Chamas Enterprises Inc.                http://www.chamas.com
NodeWorks Link Checker                 http://www.nodeworks.com




