Hi Otis,
I'm trying to do some basic calculations to figure out what, in terms
of time, resources, and cost, it would take to crawl 500M URLs.
I can't directly comment on Nutch, but we recently did something
similar to this (563M pages via EC2) using Bixo.
Since we're also using a sequence file to store the crawldb, the
update time should be comparable, but we ran only one loop (since we
started with a large set of known URLs).
Some parameters that obviously impact crawl performance:
* A default crawl delay of 15 seconds
* Batching (via keep-alive) 50 URLs per connection per IP address
* 500 fetch threads/server (250 in each of the two reducers per server)
* Crawling 1.7M domains
* Starting with about 1.2B known links
* Running in "efficient" mode, which skips batches of URLs that can't
be fetched due to politeness
* Fetching text, HTML, and image files
* Cluster size of 50 slaves, using m1.large instances (with spot
pricing)
The results were:
* CPU cost was only $250
* Data-in cost was $2100 ($0.10/GB, and we fetched 21TB)
* The major performance issue was too few domains with lots of URLs to
fetch (power curve for URLs/domain)
* Total cluster time was 22 hours, with a fetch time of about 12 hours
We didn't parse the fetched pages, which would have added
significant CPU cost.
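As a rough sanity check on the figures above, here is a small Python sketch that derives the data-in cost, the average fetched size per page, and the effective fetch rate per server. It assumes 1 TB = 1000 GB and that all 50 slaves were fetching for the full 12 hours; neither assumption is stated in the numbers above, and the function names are only illustrative.

```python
# Back-of-envelope check of the crawl figures quoted above.
# Assumptions (not stated in the mail): 1 TB = 1000 GB, and all
# 50 slaves were fetching for the whole 12-hour fetch phase.

PAGES = 563_000_000      # pages fetched
DATA_TB = 21             # data transferred in, TB
PRICE_PER_GB = 0.10      # data-in price, $/GB
FETCH_HOURS = 12         # approximate fetch time
SERVERS = 50             # slave instances


def data_in_cost(tb, price_per_gb):
    """Dollar cost of transferring `tb` terabytes in."""
    return tb * 1000 * price_per_gb


def avg_page_kb(tb, pages):
    """Average fetched size per page in KB (1 TB = 1e9 KB)."""
    return tb * 1e9 / pages


def pages_per_sec_per_server(pages, hours, servers):
    """Average fetch rate per server over the fetch phase."""
    return pages / (hours * 3600 * servers)


if __name__ == "__main__":
    print(f"data-in cost:  ${data_in_cost(DATA_TB, PRICE_PER_GB):,.0f}")
    print(f"avg page size: {avg_page_kb(DATA_TB, PAGES):.1f} KB")
    rate = pages_per_sec_per_server(PAGES, FETCH_HOURS, SERVERS)
    print(f"fetch rate:    {rate:.0f} pages/sec per server")
```

Under those assumptions this works out to $2100 for data-in, about 37 KB per page, and roughly 260 pages/second per server averaged over the fetch phase.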
HTH,
-- Ken
The obvious environment for this is EC2, so I'm wondering what people
are seeing in terms of fetch rate there these days? 50 pages/second?
100? 200?
Here's a time calculation that assumes 100 pages/second per EC2
instance:
100*60*60*12 = 4,320,000 URLs/day per EC2 instance
The 12 is 12 hours of fetching per day: last time I used Nutch, I
recall about half of the time being spent in updatedb, generate, and
other non-fetching steps.
If I have 20 servers fetching URLs, that's:
100*60*60*12*20 = 86,400,000 URLs/day -- this is starting to sound
too good to be true
Then to crawl 500M URLs:
500000000/(100*60*60*12*20) = 5.79 days -- that's less than 1 week
Suspiciously short, isn't it?
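The estimate above can be written as a small Python sketch so the inputs are easy to vary. The function and parameter names are just illustrative; the 100 pages/second rate, the 12 fetching hours per day, and the 20 servers are the assumptions from the calculation above, not measured values.

```python
# Crawl-time estimate: fetch rate x fetching hours per day x servers.
# All inputs are guesses from the calculation above, not measurements.

def urls_per_day(pages_per_sec, fetch_hours_per_day, servers=1):
    """URLs fetched per day, counting only the hours spent fetching."""
    return pages_per_sec * 3600 * fetch_hours_per_day * servers


def days_to_crawl(total_urls, pages_per_sec, fetch_hours_per_day, servers):
    """Days needed to fetch `total_urls` at the given sustained rate."""
    return total_urls / urls_per_day(pages_per_sec, fetch_hours_per_day, servers)


if __name__ == "__main__":
    print(urls_per_day(100, 12))                 # one instance
    print(urls_per_day(100, 12, servers=20))     # 20 instances
    print(round(days_to_crawl(500_000_000, 100, 12, 20), 2))
```

Halving the fetch rate or the fetching hours doubles the result, so the answer is only as good as the pages/second guess.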
Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g