On Feb 8, 2012 10:57 PM, "Michael Mol" <mike...@gmail.com> wrote:
>
> On Wed, Feb 8, 2012 at 10:46 AM, Paul Hartman
> <paul.hartman+gen...@gmail.com> wrote:
> > On Wed, Feb 8, 2012 at 2:55 AM, Pandu Poluan <pa...@poluan.info> wrote:
> >>
> >> On Jan 27, 2012 11:18 PM, "Paul Hartman" <paul.hartman+gen...@gmail.com>
> >> wrote:
> >>>
> >>
> >> ---- >8 snippage
> >>
> >>>
> >>> BTW, the Baidu spider hits my site more than all of the others combined...
> >>>
> >>
> >> Somewhat anecdotal, and definitely veering way off-topic, but Baidu was the
> >> reason why my company decided to change our webhosting company: Its
> >> spidering brought our previous webhosting to its knees...
> >>
> >> Rgds,
> >
> > I wonder if the Baidu crawler honors the Crawl-delay directive in robots.txt?
> >
> > Or I wonder if Baidu crawler IPs need to be covered by firewall tarpit rules. ;)
>
> I don't remember if it respects Crawl-Delay, but it respects forbidden
> paths, etc. I've never been DDOS'd by Baidu crawlers, but I did get
> DDOS'd by Yahoo a number of times. Turned out the solution was to
> disallow access to expensive-to-render pages. If you're using
> MediaWiki with prettified URLs, this works great:
>
> User-agent: *
> Allow: /mw/images/
> Allow: /mw/skins/
> Allow: /mw/title.png
> Disallow: /w/
> Disallow: /mw/
> Disallow: /wiki/Special:
>

*slaps forehead*

Now why didn't I think of that before?!

Thanks for reminding me!
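I may still try the Crawl-delay idea on top of that, just in case. The directive itself is only a couple of lines; something like the below, assuming Baiduspider is still the user-agent string Baidu's crawler sends, and with the 30-second value picked arbitrarily as an example. No idea whether Baidu actually honors it, though.

User-agent: Baiduspider
Crawl-delay: 30

And if even that doesn't help, the firewall tarpit stays on the table as a last resort.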

Rgds,
