On Sun, Aug 9, 2009 at 7:55 AM, Cameron Kaiser <spec...@floodgap.com> wrote:

>
> > I guess if you crawl like a crawler it should be okay; otherwise, how
> > would a newly developed spider work? But if you crawl like a scraper
> > you'll be banned.
> > But I am not sure whether Twitter can differentiate between them by
> > the request pattern. Does Twitter check it?
>
> I'm sure they have ways of determining it internally, and I'm sure they
> won't reveal what those ways are.


They already have, to a certain extent.  It appears that they want up to a
10-second delay between requests, which is an eternity in web crawling.  That
limits a crawler to 8,640 requests a day (86,400 seconds in a day divided by
10), which is peanuts, as I'm sure most people here would realize immediately.

http://twitter.com/robots.txt

#Google Search Engine Robot
User-agent: Googlebot
# Crawl-delay: 10 -- Googlebot ignores crawl-delay ftl
Disallow: /*?
Disallow: /*/with_friends

#Yahoo! Search Engine Robot
User-Agent: Slurp
Crawl-delay: 1
Disallow: /*?
Disallow: /*/with_friends

#Microsoft Search Engine Robot
User-Agent: msnbot
Crawl-delay: 10
Disallow: /*?
Disallow: /*/with_friends

# Every bot that might possibly read and respect this file.
User-agent: *
Disallow: /*?
Disallow: /*/with_friends


What you don't see here is whatever internal list of user-agents and IP
addresses they might block.
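
For what it's worth, here's a minimal sketch of what honoring that file looks
like from the crawler's side, using Python 3's standard urllib.robotparser.
The "ExampleBot" user-agent and the URL list are placeholders of my own, and
crawl_delay() needs a reasonably recent Python:

import time
import urllib.request
import urllib.robotparser

AGENT = "ExampleBot"  # hypothetical user-agent, purely for illustration

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://twitter.com/robots.txt")
rp.read()

# crawl_delay() returns the Crawl-delay declared for this agent,
# or None if the file doesn't specify one.
delay = rp.crawl_delay(AGENT) or 10  # fall back to the 10 s asked of msnbot

for url in ["http://twitter.com/twitterapi"]:  # placeholder URL list
    if rp.can_fetch(AGENT, url):
        page = urllib.request.urlopen(url).read()
        time.sleep(delay)  # 10 s per request caps you at 8,640 requests/day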

Having spent countless days identifying robots on dozens of big consumer and
business sites, I know that it's very hard to eliminate the low-volume ones
that spoof normal web clients.  But any one that starts grabbing a
significant number of pages, or the same page more often than every 24 hours
or so, stands out from the data quickly... and probably gets blocked.  Of
course, that's essentially what the DDoS is doing, except that it probably
also probes for and uses specific vulnerabilities, rather than just making
page requests.
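
To make that concrete, here's a back-of-the-envelope sketch of the kind of
pattern check I'm describing, assuming you already have (timestamp, client
IP, URL) tuples out of your access logs; the thresholds are made up purely
for illustration:

from collections import defaultdict

DAY = 24 * 60 * 60            # seconds
MAX_PAGES_PER_DAY = 1000      # arbitrary illustrative threshold
MIN_REVISIT_INTERVAL = DAY    # refetching the same URL sooner looks robotic

def suspicious_clients(log_records):
    """log_records: iterable of (timestamp_seconds, client_ip, url)."""
    last_seen = {}                    # (ip, url) -> time of last fetch
    pages_per_day = defaultdict(int)  # (ip, day_number) -> request count
    flagged = set()

    for ts, ip, url in log_records:
        day = int(ts // DAY)
        pages_per_day[(ip, day)] += 1
        if pages_per_day[(ip, day)] > MAX_PAGES_PER_DAY:
            flagged.add(ip)           # grabbing a significant number of pages

        prev = last_seen.get((ip, url))
        if prev is not None and ts - prev < MIN_REVISIT_INTERVAL:
            flagged.add(ip)           # same page more often than every ~24 h
        last_seen[(ip, url)] = ts

    return flagged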

Nick
(who, among other things, was the product manager for the first commercial
web crawler and helped set the robots.txt standard)
