Hi Otis,

More input, though mostly from recent experience w/Bixo...

I'm trying to do some basic calculations to figure out what, in terms of
time, resources, and cost, it would take to crawl 500M URLs.
The obvious environment for this is EC2, so I'm wondering what fetch rates
people are seeing there these days? 50 pages/second? 100? 200?


Depends mostly on the distribution of URLs per host, whether you have a DNS cache, etc. Using large instances, you can start with a conservative estimate of 125K URLs fetched per node per hour.

In my case the crawl would be wide, which is good for URL distribution, but bad
for DNS.
What's recommended for DNS caching?  I do see
http://wiki.apache.org/nutch/OptimizingCrawls -- does that mean setting up a local DNS server (e.g. bind), or something like pdnsd, or something else?

We fire up nscd on every server in the cluster - check out the Bixo remote-init.sh script.

And we tweak the config, so that negative lookups (for example) have a longer TTL than by default.
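For reference, the relevant knobs live in /etc/nscd.conf and look something like the following. The values here are illustrative, not what Bixo actually ships:

```
# /etc/nscd.conf (illustrative values)
enable-cache          hosts  yes
positive-time-to-live hosts  3600   # cache successful lookups for an hour
negative-time-to-live hosts  300    # keep failed lookups much longer than the 20s default
```

Raising the negative TTL matters for a wide crawl, since dead hostnames would otherwise be re-resolved over and over.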

a conservative estimate of 125K URLs fetched per node per hour

125K URLs per node per hour... so again, assuming I'm fetching only 12h out of
24h, and with 50 machines (to match Ken's example):

125,000 * 12 * 50 = 75M URLs/day

That means 525M in 7 days.  That is close to Ken's number, good. :)

Here's a time calculation that assumes 100 pages/second per EC2 instance:

100 * 60 * 60 * 12 = 4,320,000 URLs/day per EC2 instance

That 12 means 12 hours, because last time I used Nutch I recall about half
of the time being spent in updatedb, generate, and other non-fetching steps.
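Both back-of-the-envelope estimates above can be checked in a couple of lines (assuming, as stated, 12 fetch hours per day and 50 nodes):

```python
# Otis's estimate: 125K URLs per node per hour, 12 fetch hours/day, 50 nodes
urls_per_day = 125_000 * 12 * 50
print(urls_per_day)        # 75,000,000 URLs/day across the cluster
print(urls_per_day * 7)    # 525,000,000 URLs in a week

# Ken's estimate: 100 pages/second per instance, 12 fetch hours/day
per_instance_per_day = 100 * 60 * 60 * 12
print(per_instance_per_day)  # 4,320,000 URLs/day per EC2 instance
```

At 100 pages/second, 50 instances would give 216M URLs/day of raw fetch capacity, so the 125K/node/hour figure is indeed the conservative end of the range.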


The time spent in generate and update is proportional to the size of the crawldb. It might account for half the total time at some point, but eventually it will take more than that. The best option would probably be to generate multiple segments in one go (see the options for the Generator), fetch all the segments one by one, then
merge them with the crawldb in a single call to updatedb.
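With the Nutch 1.x command-line tools, that workflow would look roughly like this (option names such as -maxNumSegments may vary between versions, so check bin/nutch generate's usage output for yours):

```
# generate several segments in one go
bin/nutch generate crawl/crawldb crawl/segments -maxNumSegments 5

# fetch each segment in turn
for seg in crawl/segments/*; do
  bin/nutch fetch "$seg"
done

# fold all fetched segments back into the crawldb with a single updatedb
bin/nutch updatedb crawl/crawldb crawl/segments/*
```

The point is that the crawldb is scanned once per updatedb call, not once per segment.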

Right.
But with time (or, more precisely, as the crawldb grows) this generation will
start taking more and more time, and there is no way around that, right?

Correct.

How does Bixo deal with that?

We don't. Or rather, we live with the pain of having to update the crawldb, via building a new version from old + fetch loop results.

One solution is to use something like HBase, which we could easily do, since there's a Cascading Tap for it.

We could partition the DB by pending/processed, which would reduce time for later fetch phases. Early on, though, most entries are "new" thus this doesn't save much.

[snip]

You will also inevitably hit slow servers, which will impact the fetch rate -- although not as badly as before the introduction of the timeout on
fetching.

Right, I remember this problem. So now one can specify how long each fetch
should last and fetching will stop when that time is reached?

You can in Bixo, don't know about the current version of Nutch.

How does one pick that time limit, especially since fetch runs can vary in
speed depending on which hosts are in them?

Wouldn't it be better to express this in requests/second instead of time, so that you can say "when fetching goes below N requests per second and stays like
that for M minutes, abort fetch"?

We implemented something like that in Nutch back in 2006, but at the time the Nutch fetching architecture was such that this felt very flaky.

In Bixo we have support for aborting requests if the response rate is less than some specified limit, which seems to work well to avoid problems with slow sites.
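The idea of aborting a request when its response rate drops below a floor can be sketched as follows. This is a hypothetical illustration of the technique, not Bixo's actual code; all names here are made up:

```python
import time

class SlowResponseError(Exception):
    """Raised when a response's throughput falls below the floor."""

def read_with_min_rate(chunks, min_bytes_per_sec, grace_secs=1.0,
                       clock=time.monotonic):
    """Consume an iterable of byte chunks, aborting if average throughput
    drops below min_bytes_per_sec after an initial grace period.

    The clock parameter is injectable so the logic can be tested without
    real network delays.
    """
    received = bytearray()
    start = clock()
    for chunk in chunks:
        received.extend(chunk)
        elapsed = clock() - start
        # Only enforce the floor once past the grace period, so a slow
        # first packet doesn't abort the whole request.
        if elapsed > grace_secs and len(received) / elapsed < min_bytes_per_sec:
            raise SlowResponseError(
                "%d bytes in %.1fs is below %d bytes/sec"
                % (len(received), elapsed, min_bytes_per_sec))
    return bytes(received)
```

A fetcher thread would catch SlowResponseError, record the URL as failed-slow, and move on, so one tar pit can't stall the whole run.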

What if you have a really fast fetch run going on, but the time is still reached and fetch aborted? What do you do? Restart the fetch with the same list of generated URLs as before? Somehow restart with only unfetched URLs? Generate a
whole new fetchlist (which ends up being slow)?

In Bixo, at least, what happens is that fetched URLs are then processed, and unfetched URLs have their state unchanged - so they'll get fetched in the next loop. I assume something similar is possible in Nutch.

[snip]

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
