On Feb 9, 2012 1:35 AM, "Michael Mol" <mike...@gmail.com> wrote:
>
> On Wed, Feb 8, 2012 at 12:17 PM, Pandu Poluan <pa...@poluan.info> wrote:
> >
> > On Feb 8, 2012 10:57 PM, "Michael Mol" <mike...@gmail.com> wrote:
> >>
> >> On Wed, Feb 8, 2012 at 10:46 AM, Paul Hartman
> >> <paul.hartman+gen...@gmail.com> wrote:
> >> > On Wed, Feb 8, 2012 at 2:55 AM, Pandu Poluan <pa...@poluan.info> wrote:
> >> >>
> >> >> On Jan 27, 2012 11:18 PM, "Paul Hartman"
> >> >> <paul.hartman+gen...@gmail.com>
> >> >> wrote:
> >> >>>
> >> >>
> >> >> ---- >8 snippage
> >> >>
> >> >>>
> >> >>> BTW, the Baidu spider hits my site more than all of the others
> >> >>> combined...
> >> >>>
> >> >>
> >> >> Somewhat anecdotal, and definitely veering way off-topic, but Baidu was
> >> >> the reason why my company decided to change our webhosting company: Its
> >> >> spidering brought our previous webhosting to its knees...
> >> >>
> >> >> Rgds,
> >> >
> >> > I wonder if Baidu crawler honors the Crawl-delay directive in
> >> > robots.txt?
> >> >
> >> > Or I wonder if Baidu crawler IPs need to be covered by firewall
> >> > tarpit rules. ;)
> >>
> >> I don't remember if it respects Crawl-Delay, but it respects forbidden
> >> paths, etc. I've never been DDOS'd by Baidu crawlers, but I did get
> >> DDOS'd by Yahoo a number of times. Turned out the solution was to
> >> disallow access to expensive-to-render pages. If you're using
> >> MediaWiki with prettified URLs, this works great:
> >>
> >> User-agent: *
> >> Allow: /mw/images/
> >> Allow: /mw/skins/
> >> Allow: /mw/title.png
> >> Disallow: /w/
> >> Disallow: /mw/
> >> Disallow: /wiki/Special:
> >>
> >
> > *slaps forehead*
> >
> > Now why didn't I think of that before?!
> >
> > Thanks for reminding me!
>
> I didn't think of it until I watched the logs live and saw it crawling
> through page histories during one of the events. MediaWiki stores page
> histories as a series of diffs from the current version, so it has to
> assemble an old version by reverse-applying the diffs of all the edits
> made between the current version and the version you're asking for. If
> a bot retrieves all ten versions of a ten-revision page, the current
> version needs no diffs and the oldest needs nine, so that's 45
> reverse-diff operations in total; grabbing every version of a page with
> 20 revisions works out to 190, and it keeps growing quadratically from
> there. My 'hello world' page has over five hundred revisions.
>
> So the page history crawling was pretty quickly obvious...
>

Although my website is not a wiki, I can already guess which part of the
site brought the server to its knees...

My company's "research" division everyday selects important economic and
financial news to be republished in the corporate website. We have news
from 3-4 years ago. To make visitors easier to find any news, the website
designer provided a nice "calendar" interface.

The problems:

- The calendar interface is dynamically generated; only days with
interesting news get hyperlinks.

- Every page on the website has a sidebar that provides a summary of the
stock market for the day (5-minute delay). The sidebar is "pre-generated"
by server-side PHP before being handed over to the AJAX framework.

- Someone had a flash of 'brilliance' and set up a URL rewrite (roughly the
kind of rule sketched below) that hides the telltale '?' query indicator,
misleading the spiders: they probably think the hundreds of news pages are
static pages that get magically updated by unicorns every 10-20 seconds.
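
Just to illustrate what I mean -- this is a made-up Apache mod_rewrite rule,
not the actual one on our server, with hypothetical paths and parameters:

# Hypothetical example: /news/2012/02/some-headline
# gets served by /news.php?date=2012-02&slug=some-headline
RewriteEngine On
RewriteRule ^news/([0-9]{4})/([0-9]{2})/(.+)$ /news.php?date=$1-$2&slug=$3 [L,QSA]

To a crawler, the rewritten URL looks like an ordinary static page, so
nothing hints that every hit costs a database query and a PHP render.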

I'm going to disallow spidering for the news pages. I'm almost certain that
this will result in a much lighter load on the poor webserver.
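
Something along these lines should do it -- a minimal sketch, assuming the
news pages live under /news/ and the calendar under /calendar/ (hypothetical
paths, not necessarily the real ones on our site):

User-agent: *
Crawl-delay: 10
Disallow: /news/
Disallow: /calendar/

Well-behaved crawlers would then skip the expensive pages entirely (and slow
down elsewhere, if they honor Crawl-delay at all), while human visitors can
still browse the news through the calendar as before.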

Rgds,
