Hi Otis,
I'm trying to do some basic calculations to figure out what, in terms of time, resources, and cost, it would take to crawl 500M URLs.
I can't directly comment on Nutch, but we recently did something similar to this (563M pages via EC2) using Bixo.
Eh, I was going to go over to Bixo's list and ask whether Bixo is suitable for such wide crawls or whether it's meant more for vertical crawls. For some reason I had it in the "for vertical crawls" bucket in my head, but it seems I was wrong, huh?
Well, it's a toolkit - so hooking it up for a wide/big crawl means
writing some code, but there's nothing in the Bixo architecture (after
a few revs) that precludes using it in this manner.
Since we're also using a sequence file to store the crawldb, the update time should be comparable, but we ran only one loop (since we started with a large set of known URLs).
Some parameters that obviously impact crawl performance:
* A default crawl delay of 15 seconds
That's very polite. Is 3 seconds delay acceptable?
It depends on the site. For somebody big like (say) CNN, 3 seconds would probably be OK. For smaller sites, 30 seconds is actually a better value.
We actually modified the fetch policy we're using so that for low-traffic sites (based on Alexa/Quantcast data) we use a 60 second delay, down to about 5 seconds for top sites.
Even more important is to limit pages/day for smaller sites to <
1000...unless you enjoy getting angry emails from irate webmasters :)
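As a rough Python sketch, the rank-based politeness policy described above might look like the following. The rank thresholds here are invented for illustration; the thread only specifies the 5-second and 60-second endpoints and the 1000 pages/day cap.

```python
def crawl_delay_seconds(traffic_rank):
    """Map a site's traffic rank (1 = most traffic, from e.g. Alexa or
    Quantcast data) to a fetch delay. Thresholds are hypothetical."""
    if traffic_rank is None:        # unknown site: be maximally polite
        return 60
    if traffic_rank <= 10_000:      # top sites can handle frequent hits
        return 5
    if traffic_rank <= 1_000_000:   # mid-tier sites get a middle value
        return 15
    return 60                       # long tail: low traffic, long delay

# Cap suggested above for small sites, to avoid irate-webmaster email.
MAX_PAGES_PER_DAY_SMALL_SITE = 1000
```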
* Batching (via keep-alive) 50 URLs per connection per IP address.
Does Nutch automatically do this? I don't recall seeing this setting in Nutch, but it's been a while...
No, I don't believe Nutch has this implemented.
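The batching idea is just: group the fetch list by host (or, as in the thread, by resolved IP address) and hand each persistent connection up to 50 URLs. A minimal sketch of the grouping step, using host as a stand-in for IP to avoid DNS lookups:

```python
from collections import defaultdict
from itertools import islice
from urllib.parse import urlparse

BATCH_SIZE = 50  # URLs fetched over one keep-alive connection

def batches_by_host(urls, batch_size=BATCH_SIZE):
    """Group URLs by host and yield (host, batch) pairs of up to
    batch_size URLs, each batch intended for a single persistent
    (keep-alive) HTTP connection. Grouping by resolved IP, as
    described above, would need a DNS lookup per host."""
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlparse(url).netloc].append(url)
    for host, host_urls in by_host.items():
        it = iter(host_urls)
        while True:
            batch = list(islice(it, batch_size))
            if not batch:
                break
            yield host, batch
```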
* 500 fetch threads/server (250 per each of two reducers per server)
* Crawling 1.7M domains
Is this because you restricted it to 1.7M domains, or is that how many distinct domains were in your seed list, or is that how many domains you've discovered while crawling?
We restricted it, based on top domains (where top == most US-based traffic).
* Starting with about 1.2B known links
Where did you get that many of them?
From a previous crawl, of roughly the same size.
Also, if you start with 1.2B known links, how do you end up with just 563M pages fetched? Maybe out of 1.2B you simply got to only 563M before you stopped crawling?
Because we're running with Bixo's "efficient" mode (see below).
* Running in "efficient" mode - skip batches of URLs that can't be fetched due to politeness.
Doesn't Nutch (and Bixo) do this automatically?
Nutch will block and not fetch a URL until sufficient time has passed.
Bixo can do the same thing, but when you run a crawl like this, you
often wind up blocked on a few slow sites.
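The skip-vs-block distinction can be sketched in a few lines. This is a hypothetical illustration, not Bixo's or Nutch's actual API: "efficient" mode returns immediately when a domain isn't ready, while blocking mode sleeps until enough politeness time has passed.

```python
import time

class PolitenessGate:
    """Hypothetical sketch of the skip-vs-block choice described above."""

    def __init__(self, efficient=True):
        self.next_ok = {}          # domain -> earliest allowed fetch time
        self.efficient = efficient

    def acquire(self, domain, delay, now=None, sleep=time.sleep):
        """Return True if a fetch from `domain` may proceed now.
        In efficient mode, return False (skip the batch) if the
        politeness delay hasn't elapsed; otherwise block until it has."""
        now = time.time() if now is None else now
        ready_at = self.next_ok.get(domain, 0.0)
        if now < ready_at:
            if self.efficient:
                return False       # skip this batch; retry next loop
            sleep(ready_at - now)  # blocking style: wait until allowed
            now = ready_at
        self.next_ok[domain] = now + delay
        return True
```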
* Fetching text, HTML, and image files
* Cluster size of 50 slaves, using m1.large instances (with spot pricing)
I've never used spot instances. Isn't it the case that you can use them only as long as your bid meets the current price? When the price goes up due to demand, don't you get kicked off (because your bid no longer meets the price)? If that's so, what happens to the cluster? Do you keep adding new spot instances (at new/higher bids) to keep the cluster at a more or less consistent size?
We run the master without spot pricing, and use a very high max bid to
help ensure we rarely (if ever) lose servers.
The results were:
* CPU cost was only $250
* data-in was $2100 ($0.10/GB, and we fetched 21TB)
That's fast!
* Major performance issue was not enough domains with lots of URLs to fetch (power curve for URLs/domain)
Why is this a problem? Isn't this actually good? Isn't it better to have 100 hosts/domains with 10 pages each than 10 hosts/domains with 100 each? Wouldn't fetching the former complete faster?
The problem is that there are too many domains with only a handful of
pages.
So very quickly, the set of available domains to fetch from is reduced
down to a fraction of that initial 1.7M, and then politeness starts
causing either (a) very inefficient utilization of resources, as most
threads are spinning w/o doing any work, or (b) you start skipping
lots of URLs for domains that aren't ready yet (not enough time has
elapsed since the prior batch of URLs were fetched).
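A toy calculation makes the effect vivid. With a power-law distribution of URLs per domain, most domains run dry within a batch or two, so the pool of fetchable domains collapses quickly. The counts below are invented purely for illustration:

```python
def domains_still_active(urls_per_domain, batch_size, rounds):
    """How many domains still have unfetched URLs after `rounds`
    batches of up to `batch_size` URLs each have been taken?"""
    return sum(1 for n in urls_per_domain if n > batch_size * rounds)

# Zipf-ish tail: a few domains with many URLs, a huge number with a handful.
urls_per_domain = [100_000] * 10 + [1_000] * 1_000 + [5] * 100_000

print(domains_still_active(urls_per_domain, 50, 1))   # 1010 of 101,010 left
print(domains_still_active(urls_per_domain, 50, 25))  # only 10 left
```

After a single 50-URL batch per domain, the 100,000 five-page domains are already exhausted, and politeness delays on the survivors dominate.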
* total cluster time of 22 hours, fetch time of about 12 hours
That's fast. Where did the delta of 10h go?
Jobs to extract links from the crawldb, partition by IP address, fetch
robots.txt, etc.
We didn't parse the fetched pages, which would have added some significant CPU cost.
Yeah. Would you dare to guess how much that would add in terms of time/servers/cost?
I've got some data, but I'd need to dig it up after we finish a
deliverable that's due by 5pm :(
-- Ken
The obvious environment for this is EC2, so I'm wondering what people are seeing in terms of fetch rate there these days? 50 pages/second? 100? 200?
Here's a time calculation that assumes 100 pages/second per EC2 instance:
100*60*60*12 = 4,320,000 URLs/day per EC2 instance
That 12 means 12 hours, because last time I used Nutch I recall about half of the time being spent in updatedb, generate, and other non-fetching steps.
If I have 20 servers fetching URLs, that's:
100*60*60*12*20 = 86,400,000 URLs/day -- this is starting to sound too good to be true
Then to crawl 500M URLs:
500000000/(100*60*60*12*20) = 5.78 days -- that's less than 1 week
Suspiciously short, isn't it?
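The arithmetic above checks out; here it is as a small Python sanity check (the 5.78 in the text is the same number truncated rather than rounded):

```python
# Back-of-the-envelope crawl-time estimate from the thread.
pages_per_sec = 100
fetch_hours_per_day = 12          # ~half the day lost to updatedb/generate
servers = 20
target_urls = 500_000_000

per_instance_per_day = pages_per_sec * 3600 * fetch_hours_per_day
cluster_per_day = per_instance_per_day * servers
days = target_urls / cluster_per_day

print(per_instance_per_day)   # 4,320,000 URLs/day per instance
print(cluster_per_day)        # 86,400,000 URLs/day for the cluster
print(round(days, 2))         # 5.79 days
```

Of course, this assumes the fetch rate holds up, which (per the power-curve discussion above) is exactly what breaks down late in a wide crawl.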
Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g
--------------------------