Hi,

I'm trying to do some basic calculations to figure out what it would take, in terms of
time, resources, and cost, to crawl 500M URLs.
The obvious environment for this is EC2, so I'm wondering what fetch rates people are
seeing there these days. 50 pages/second? 100? 200?


Here's a time calculation that assumes 100 pages/second per EC2 instance:

  100*60*60*12 = 4,320,000 URLs/day per EC2 instance

The 12 means 12 hours, because the last time I used Nutch I recall about half of
the time being spent in updatedb, generate, and other non-fetching steps.

If I have 20 servers fetching URLs, that's:

  100*60*60*12*20 = 86,400,000 URLs/day  -- this is starting to sound too good to be true

Then to crawl 500M URLs:

  500,000,000/(100*60*60*12*20) = 5.78 days  -- that's less than 1 week

Suspiciously short, isn't it?
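
Just to make the assumptions explicit, here's the same back-of-envelope math as a small
Python sketch (the function and its default parameters are mine, purely illustrative --
nothing Nutch- or EC2-specific):

  # Rough crawl-time estimate using the assumptions above:
  # 100 pages/sec per instance, 12 fetching hours/day, 20 instances.
  def crawl_days(total_urls, pages_per_sec=100, fetch_hours_per_day=12, instances=20):
      urls_per_day = pages_per_sec * 60 * 60 * fetch_hours_per_day * instances
      return total_urls / urls_per_day

  print(crawl_days(500_000_000))                     # ~5.8 days with the defaults
  print(crawl_days(500_000_000, pages_per_sec=50))   # ~11.6 days if the fetch rate halves

Even halving the fetch rate only doubles the wall-clock time, so the real question is
whether 100 pages/second per instance is realistic in practice.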

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
