On Sun, Aug 9, 2009 at 7:55 AM, Cameron Kaiser <spec...@floodgap.com> wrote:
> > I guess if you crawl like a crawler it should be okay; otherwise, how
> > would a newly developed spider work? But if you crawl like a scraper
> > you'll be banned. I am not sure, though, whether Twitter can
> > differentiate between them. I mean the request pattern. Does Twitter
> > check it?
>
> I'm sure they have ways of determining it internally, and I'm sure they
> won't reveal what those ways are.

They already have, to a certain extent. It appears that they want up to a
10-second delay, which is an eternity in web crawling. It limits a crawler
to 8,640 requests a day, which is peanuts, as I'm sure most people here
would realize immediately.

http://twitter.com/robots.txt

#Google Search Engine Robot
User-agent: Googlebot
# Crawl-delay: 10 -- Googlebot ignores crawl-delay ftl
Disallow: /*?
Disallow: /*/with_friends

#Yahoo! Search Engine Robot
User-Agent: Slurp
Crawl-delay: 1
Disallow: /*?
Disallow: /*/with_friends

#Microsoft Search Engine Robot
User-Agent: msnbot
Crawl-delay: 10
Disallow: /*?
Disallow: /*/with_friends

# Every bot that might possibly read and respect this file.
User-agent: *
Disallow: /*?
Disallow: /*/with_friends

What you don't see here is whatever internal list of user agents and IP
addresses they might block. Having spent countless days identifying robots
on dozens of big consumer and business sites, I know that it's very hard to
eliminate the low-volume ones that spoof normal web clients. But any one
that starts grabbing a significant number of pages, or the same page more
often than every 24 hours or so, stands out from the data quickly... and
probably gets blocked.

Of course, that's essentially what the DDoS is doing, except that it
probably also probes for and uses specific vulnerabilities, rather than
just making page requests.

Nick
(who, among other things, was the product manager for the first commercial
web crawler and helped set the robots.txt standard)
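As a footnote, the Crawl-delay arithmetic above is easy to sanity-check in code. This is just my own sketch using Python's standard-library `urllib.robotparser` (which grew `crawl_delay()` support in 3.6); the robots.txt text is a trimmed copy of the rules quoted above, not a live fetch:

```python
# Sketch: read Crawl-delay values from a robots.txt and convert each one
# into a daily request budget (86,400 seconds per day / delay).
import urllib.robotparser

# Trimmed copy of the twitter.com/robots.txt rules quoted above.
ROBOTS_TXT = """\
User-agent: Googlebot
# Crawl-delay: 10 -- Googlebot ignores crawl-delay ftl
Disallow: /*?

User-Agent: Slurp
Crawl-delay: 1
Disallow: /*?

User-Agent: msnbot
Crawl-delay: 10
Disallow: /*?
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())
parser.modified()  # stamp a fetch time; crawl_delay() can return None otherwise

for agent in ("Googlebot", "Slurp", "msnbot"):
    delay = parser.crawl_delay(agent)  # seconds between requests, or None
    if delay:
        print(f"{agent}: {delay}s delay -> {86400 // delay} requests/day")
    else:
        print(f"{agent}: no Crawl-delay in effect (commented out or absent)")
```

At a 10-second delay (msnbot's group) that's 86400 / 10 = 8,640 requests a day, the figure above; Googlebot's commented-out line parses as no delay at all.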