On Feb 8, 2012 10:57 PM, "Michael Mol" <mike...@gmail.com> wrote:
>
> On Wed, Feb 8, 2012 at 10:46 AM, Paul Hartman
> <paul.hartman+gen...@gmail.com> wrote:
> > On Wed, Feb 8, 2012 at 2:55 AM, Pandu Poluan <pa...@poluan.info> wrote:
> >>
> >> On Jan 27, 2012 11:18 PM, "Paul Hartman" <paul.hartman+gen...@gmail.com>
> >> wrote:
> >>>
> >>
> >> ---- >8 snippage
> >>
> >>>
> >>> BTW, the Baidu spider hits my site more than all of the others combined...
> >>>
> >>
> >> Somewhat anecdotal, and definitely veering way off-topic, but Baidu was
> >> the reason why my company decided to change our webhosting company: its
> >> spidering brought our previous webhost to its knees...
> >>
> >> Rgds,
> >
> > I wonder if the Baidu crawler honors the Crawl-delay directive in robots.txt?
> >
> > Or I wonder if Baidu crawler IPs need to be covered by firewall tarpit rules. ;)
>
> I don't remember if it respects Crawl-delay, but it respects forbidden
> paths, etc. I've never been DDoS'd by Baidu crawlers, but I did get
> DDoS'd by Yahoo a number of times. It turned out the solution was to
> disallow access to expensive-to-render pages. If you're using
> MediaWiki with prettified URLs, this works great:
>
> User-agent: *
> Allow: /mw/images/
> Allow: /mw/skins/
> Allow: /mw/title.png
> Disallow: /w/
> Disallow: /mw/
> Disallow: /wiki/Special:
>
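For crawlers that do honor it, a Crawl-delay line can sit alongside the path rules quoted above. It's a non-standard extension (not part of the original robots.txt spec, and support varies by crawler), so this is only an untested sketch; the paths are assumed from the MediaWiki example above, and the delay value is arbitrary:

```
# Sketch only -- Crawl-delay is a non-standard robots.txt extension;
# crawlers that don't recognize it will simply ignore the line.
User-agent: Baiduspider
Crawl-delay: 10
Disallow: /w/
Disallow: /mw/
Disallow: /wiki/Special:
```

Note that agent-specific groups like this take precedence over the `User-agent: *` group for matching crawlers, so the Disallow rules are repeated here rather than inherited.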
*slaps forehead* Now why didn't I think of that before?! Thanks for reminding me!

Rgds,