Hi, I'm trying to do some basic calculations to figure out what it would take, in terms of time, resources, and cost, to crawl 500M URLs. The obvious environment for this is EC2, so I'm wondering what fetch rates people are seeing there these days. 50 pages/second? 100? 200?
Here's a time calculation that assumes 100 pages/second per EC2 instance:

    100 * 60 * 60 * 12 = 4,320,000 URLs/day per EC2 instance

That 12 means 12 hours, because the last time I used Nutch I recall about half of the time being spent in updatedb, generate, and other non-fetching steps.

If I have 20 servers fetching URLs, that's:

    100 * 60 * 60 * 12 * 20 = 86,400,000 URLs/day -- this is starting to sound too good to be true

Then to crawl 500M URLs:

    500,000,000 / (100 * 60 * 60 * 12 * 20) = 5.78 days -- that's less than 1 week

Suspiciously short, isn't it?

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
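
P.S. For anyone who wants to plug in their own numbers, here is the same back-of-envelope arithmetic as a tiny Python sketch. The fetch rate, effective fetching hours, and instance count are just the assumptions from above, not measured figures:

    # Back-of-envelope crawl-time estimate (assumptions, not measurements):
    PAGES_PER_SEC = 100        # assumed per-instance fetch rate
    FETCH_HOURS_PER_DAY = 12   # ~half the day lost to updatedb, generate, etc.
    INSTANCES = 20             # number of EC2 instances fetching
    TOTAL_URLS = 500_000_000   # target crawl size

    urls_per_day = PAGES_PER_SEC * 3600 * FETCH_HOURS_PER_DAY * INSTANCES
    days = TOTAL_URLS / urls_per_day

    print(f"{urls_per_day:,} URLs/day across {INSTANCES} instances")  # 86,400,000
    print(f"{days:.2f} days to crawl {TOTAL_URLS:,} URLs")            # ~5.79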

