Hi Otis,
More input, though mostly from recent experience w/Bixo...
I'm trying to do some basic calculations to figure out what, in terms
of time, resources, and cost, it would take to crawl 500M URLs.
The obvious environment for this is EC2, so I'm wondering what people
are seeing in terms of fetch rate there these days? 50 pages/second?
100? 200?
Depends mostly on the distribution of URLs per host, whether you have
a DNS cache, etc. Using large instances, you can start with a
conservative estimate of 125K URLs fetched per node and per hour.
In my case the crawl would be wide, which is good for URL
distribution, but bad for DNS.
What's recommended for DNS caching? I do see
http://wiki.apache.org/nutch/OptimizingCrawls -- does that mean
setting up a local DNS server (e.g. bind), something like pdnsd, or
something else?
We fire up nscd on every server in the cluster - check out the Bixo
remote-init.sh script. We also tweak the config so that negative
lookups (for example) have a longer TTL than the default.
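For reference, that nscd tweak lives in /etc/nscd.conf. The attribute names below are real nscd options, but the TTL values are only illustrative - they're not necessarily what Bixo's remote-init.sh sets:

```
# /etc/nscd.conf -- host-lookup caching on a crawler node
enable-cache            hosts   yes
positive-time-to-live   hosts   3600   # cache successful lookups for an hour
negative-time-to-live   hosts   300    # keep failed lookups well past the short default
```

The point of raising the negative TTL is that a wide crawl keeps re-hitting dead hostnames, and without caching each one costs a fresh (slow) failed resolution.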
125K URLs per node per hour... so again assuming I'm fetching only
12h out of 24h and with 50 machines (to match Ken's example):

125,000 * 12 * 50 = 75M URLs/day

That means 525M in 7 days. That is close to Ken's number, good. :)
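The arithmetic above can be sanity-checked in a few lines of Python; the node count, per-node rate, and 12h duty cycle are the assumptions from this thread:

```python
import math

URLS_PER_NODE_PER_HOUR = 125_000   # conservative per-node estimate from above
NODES = 50                         # cluster size in the example
FETCH_HOURS_PER_DAY = 12           # assume half the day goes to generate/updatedb
TARGET_URLS = 500_000_000          # the 500M-URL crawl being sized

urls_per_day = URLS_PER_NODE_PER_HOUR * FETCH_HOURS_PER_DAY * NODES
days_needed = math.ceil(TARGET_URLS / urls_per_day)

print(urls_per_day)   # 75000000 -- i.e. 75M URLs/day
print(days_needed)    # 7 -- days to clear 500M at that rate
```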
Here's a time calculation that assumes 100 pages/second per EC2
instance:

100 * 60 * 60 * 12 = 4,320,000 URLs/day per EC2 instance
That 12 means 12 hours, because last time I used Nutch I recall about
half of the time being spent in updatedb, generate, and other
non-fetching steps.
The time spent in generate and update is proportional to the size of
the crawldb. It might be half the total time at one point, but as the
crawldb grows it will take more than that.
The best option would probably be to generate multiple segments in
one go (see the options for the Generator), fetch all the segments
one by one, then merge them with the crawldb in a single call to
update.
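With the Nutch 1.x command-line tools, that workflow looks roughly like the sketch below. The -maxNumSegments option is what "generate multiple segments in one go" refers to, but option names vary between versions, so check the usage output of bin/nutch generate and bin/nutch updatedb on your install; the paths and -topN/-threads values are just placeholders:

```shell
# Generate several fetchlists in one pass over the crawldb
bin/nutch generate crawl/crawldb crawl/segments -topN 500000 -maxNumSegments 4

# Fetch (and parse) each generated segment in turn
for seg in crawl/segments/*; do
  bin/nutch fetch "$seg" -threads 100
  bin/nutch parse "$seg"
done

# Merge all fetched segments back into the crawldb with a single update
bin/nutch updatedb crawl/crawldb -dir crawl/segments
```

This amortizes the cost of the generate and update passes over several fetch runs instead of paying it once per segment.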
Right.
But with time (or, more precisely, as the crawldb grows) this
generation will start taking more and more time, and there is no way
around that, right?
Correct.
How does Bixo deal with that?
We don't. Or rather, we live with the pain of having to update the
crawldb, via building a new version from old + fetch loop results.
One solution is to use something like HBase, which we could easily do,
since there's a Cascading Tap for it.
We could partition the DB by pending/processed, which would reduce
the time for later fetch phases. Early on, though, most entries are
"new", so this doesn't save much.
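Here's a toy model of that "build a new version from old + fetch loop results" update, just to illustrate why its cost tracks crawldb size rather than fetch-batch size. The dict-based crawldb and status names are invented for this sketch - this is not Bixo's actual data model:

```python
def update_crawldb(old_db, fetch_results):
    """Build a new crawldb from the old one plus one fetch loop's results.

    Every entry of old_db is copied, so the cost is O(len(old_db)) even
    when fetch_results is tiny -- which is why the update step keeps
    getting slower as the crawl ages.
    """
    new_db = {}
    for url, entry in old_db.items():           # full pass over the old db
        new_db[url] = dict(entry)
    for url, status in fetch_results.items():   # overlay this loop's outcomes
        new_db.setdefault(url, {})["status"] = status
    return new_db

db = {"http://a.example/": {"status": "pending"},
      "http://b.example/": {"status": "pending"}}
db = update_crawldb(db, {"http://a.example/": "fetched",
                         "http://c.example/": "fetched"})
# b.example keeps its "pending" state, so it is picked up in the next loop
```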
[snip]
You will also inevitably hit slow servers, which will impact the
fetch rate - although not as badly as before the introduction of the
timeout on fetching.
Right, I remember this problem. So now one can specify how long each
fetch should last, and fetching will stop when that time is reached?
You can in Bixo, don't know about the current version of Nutch.
How does one pick that time limit, especially since fetch runs can
vary in speed depending on which hosts are in them? Wouldn't it be
better to express this in requests/second instead of time, so that
you can say "when fetching goes below N requests per second and stays
like that for M minutes, abort the fetch"?
We implemented something like that in Nutch back in 2006, but at the
time the Nutch fetching architecture was such that this felt very flaky.
In Bixo we have support for aborting requests if the response rate is
less than some specified limit, which seems to work well to avoid
problems with slow sites.
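Otis's "below N requests/second for M minutes" rule can be sketched as a small monitor over per-minute completion counts. This is a hypothetical standalone illustration, not Bixo's actual implementation (which, per the above, aborts individual slow requests rather than the whole run):

```python
from collections import deque

class FetchRateMonitor:
    """Signal an abort when throughput stays below a floor for too long."""

    def __init__(self, min_requests_per_sec, window_minutes):
        self.min_rps = min_requests_per_sec
        self.window = deque(maxlen=window_minutes)  # last M per-minute counts

    def record_minute(self, requests_completed):
        """Feed one minute's completed-request count; True means abort."""
        self.window.append(requests_completed)
        window_full = len(self.window) == self.window.maxlen
        below_floor = all(n < self.min_rps * 60 for n in self.window)
        return window_full and below_floor

monitor = FetchRateMonitor(min_requests_per_sec=50, window_minutes=3)
rates = [4000, 3500, 900, 800, 700]   # requests completed in each minute
aborted = [monitor.record_minute(n) for n in rates]
# only fires once three consecutive minutes fall below 50 req/s (3000/min)
```

The advantage over a fixed wall-clock limit is exactly the one raised above: a fast run is never cut short, while a run that degenerates into a tail of slow hosts gets aborted promptly.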
What if you have a really fast fetch run going on, but the time limit
is still reached and the fetch aborted? What do you do? Restart the
fetch with the same list of generated URLs as before? Somehow restart
with only the unfetched URLs? Generate a whole new fetchlist (which
ends up being slow)?
In Bixo, at least, what happens is that fetched URLs are then
processed, and unfetched URLs have their state unchanged - so they'll
get fetched in the next loop. I assume something similar is possible
in Nutch.
[snip]
-- Ken
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g