Hi Otis,

I'm trying to do some basic calculations to figure out what it would take,
in terms of time, resources, and cost, to crawl 500M URLs.

I can't directly comment on Nutch, but we recently did something similar to this (563M pages via EC2) using Bixo.

Since we're also using a sequence file to store the crawldb, the update time should be comparable, but we ran only one loop (since we started with a large set of known URLs).

Some parameters that obviously impact crawl performance:

* A default crawl delay of 15 seconds
* Batching (via keep-alive) 50 URLs per connection per IP address
* 500 fetch threads/server (250 in each of two reducers per server)
* Crawling 1.7M domains
* Starting with about 1.2B known links
* Running in "efficient" mode - skip batches of URLs that can't be fetched due to politeness.
* Fetching text, HTML, and image files
* Cluster size of 50 slaves, using m1.large instances (with spot pricing)
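
To give a feel for how much the crawl delay alone can constrain things, here is a rough sketch. It assumes one request per domain per crawl-delay interval and ignores keep-alive batching and fetch latency, so it is not how Bixo actually schedules fetches; the function names and the 10,000-URL domain are made up for illustration.

```python
# Rough bound on what politeness alone allows for a single domain.
# Assumption (illustrative only): one request per domain per crawl-delay
# interval, ignoring keep-alive batching and network latency.

def max_domain_urls(hours, crawl_delay_s=15.0):
    """Most URLs a single domain can yield in the given time window."""
    return int(hours * 3600.0 / crawl_delay_s)

def min_hours_for_domain(url_count, crawl_delay_s=15.0):
    """Hours needed to fetch url_count URLs from one domain."""
    return url_count * crawl_delay_s / 3600.0

if __name__ == "__main__":
    # URLs obtainable from one domain during a 12-hour fetch phase
    print(max_domain_urls(12))
    # Hours needed for a hypothetical domain with 10,000 URLs
    print(round(min_hours_for_domain(10_000), 1))
```

Under that simplified model, how the URLs are distributed across domains matters at least as much as how many fetch threads you have.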

The results were:

* CPU cost was only $250
* Data-in cost was $2100 ($0.10/GB, and we fetched 21TB)
* The major performance issue was too few domains with lots of URLs to fetch (URLs per domain follow a power-law curve)
* Total cluster time was 22 hours, with fetch time of about 12 hours

We didn't parse the fetched pages, which would have added significant CPU cost.
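
For comparison with a pages/second estimate, the reported numbers can be turned into averages with a quick sketch like the one below. It assumes decimal units (1 TB = 1000 GB) and an even spread of work across all 50 slaves for the whole 12-hour fetch phase, which is a simplification; the function names are just for illustration.

```python
# Back-of-the-envelope averages derived from the reported crawl numbers.
# Assumptions (not measured): decimal units (1 TB = 1000 GB) and load
# spread evenly across all slaves for the whole fetch phase.

def data_in_cost(terabytes, price_per_gb):
    """Inbound transfer cost in dollars."""
    return terabytes * 1000 * price_per_gb

def pages_per_second(pages, fetch_hours, servers=1):
    """Average fetch rate per server over the fetch phase."""
    return pages / (fetch_hours * 3600) / servers

def avg_page_kb(terabytes, pages):
    """Average fetched size per page in KB (1 TB = 1e9 KB)."""
    return terabytes * 1e9 / pages

if __name__ == "__main__":
    print(round(data_in_cost(21, 0.10), 2))              # dollars
    print(round(pages_per_second(563_000_000, 12, 50)))  # pages/sec per slave
    print(round(avg_page_kb(21, 563_000_000), 1))        # KB per page
```

These are averages only; the actual rate varied over the run as the supply of fetchable domains thinned out.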

HTH,

-- Ken

The obvious environment for this is EC2, so I'm wondering what people are seeing
in terms of fetch rate there these days? 50 pages/second? 100? 200?


Here's a time calculation that assumes 100 pages/second per EC2 instance:

 100*60*60*12 = 4,320,000 URLs/day per EC2 instance

The 12 means 12 hours of fetching per day: the last time I used Nutch, I recall about half of the time being spent in updatedb, generate, and other non-fetching steps.

If I have 20 servers fetching URLs, that's:

100*60*60*12*20 = 86,400,000 URLs/day -- this is starting to sound too good
to be true

Then to crawl 500M URLs:

 500000000/(100*60*60*12*20) = 5.79 days  -- that's less than 1 week

Suspiciously short, isn't it?
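
The arithmetic above can be written as a small sketch so the assumptions are easy to vary. The function name and parameter names are only for illustration; the 12 fetching hours per day and the 50/100/200 pages/second rates are the guesses from this message, not measurements.

```python
# Estimate crawl duration from an assumed per-server fetch rate.
# All inputs are guesses from the message above, not measurements.

def crawl_days(total_urls, pages_per_sec, servers, fetch_hours_per_day=12):
    """Days needed to fetch total_urls at the given per-server rate."""
    urls_per_day = pages_per_sec * 3600 * fetch_hours_per_day * servers
    return total_urls / urls_per_day

if __name__ == "__main__":
    for rate in (50, 100, 200):
        days = crawl_days(500_000_000, rate, servers=20)
        print(rate, "pages/sec ->", round(days, 2), "days")
```

This only models raw fetch throughput; it says nothing about politeness limits or how many domains the URLs are spread across.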

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




