Hi Otis,
More input, though mostly from recent experience w/Bixo...
I'm trying to do some basic calculations to figure out what, in terms
of time, resources, and cost, it would take to crawl 500M URLs.
The obvious environment for this is EC2, so I'm wondering what people
are seeing in terms of fetch rate there these days? 50 pages/second?
100? 200?
Depends mostly on the distribution of URLs per host, whether you have
a DNS cache, etc. Using large instances, you can start with a
conservative estimate of 125K URLs fetched per node and per hour.
In my case the crawl would be wide, which is good for URL
distribution, but bad for DNS.
What's recommended for DNS caching? I do see
http://wiki.apache.org/nutch/OptimizingCrawls -- does that mean
setting up a local DNS server (e.g. bind), something like pdnsd, or
something else?
We fire up nscd on every server in the cluster - check out the Bixo
remote-init.sh script. We also tweak the config so that negative
lookups (for example) have a longer TTL than the default.
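For reference, that nscd tweak lives in /etc/nscd.conf. The attribute names below are real nscd options, but the TTL values are only illustrative - they're not necessarily what Bixo's remote-init.sh sets:

```
# /etc/nscd.conf -- host-lookup caching on a crawler node
enable-cache            hosts   yes
positive-time-to-live   hosts   3600   # cache successful lookups for an hour
negative-time-to-live   hosts   300    # keep failed lookups well past the short default
```

The point of raising the negative TTL is that a wide crawl keeps re-hitting dead hostnames, and without caching each one costs a fresh (slow) failed resolution.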
125K URLs per node per hour... so again assuming I'm fetching only
12h out of 24h and with 50 machines (to match Ken's example):

125,000 * 12 * 50 = 75M URLs/day

That means 525M in 7 days. That is close to Ken's number, good. :)
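The arithmetic above can be sanity-checked in a few lines of Python; the node count, per-node rate, and 12h duty cycle are the assumptions from this thread:

```python
import math

URLS_PER_NODE_PER_HOUR = 125_000   # conservative per-node estimate from above
NODES = 50                         # cluster size in the example
FETCH_HOURS_PER_DAY = 12           # assume half the day goes to generate/updatedb
TARGET_URLS = 500_000_000          # the 500M-URL crawl being sized

urls_per_day = URLS_PER_NODE_PER_HOUR * FETCH_HOURS_PER_DAY * NODES
days_needed = math.ceil(TARGET_URLS / urls_per_day)

print(urls_per_day)   # 75000000 -- i.e. 75M URLs/day
print(days_needed)    # 7 -- days to clear 500M at that rate
```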
Here's a time calculation that assumes 100 pages/second per EC2
instance:

100 * 60 * 60 * 12 = 4,320,000 URLs/day per EC2 instance
That 12 means 12 hours, because last time I used Nutch I recall about
half of the time being spent in updatedb, generate, and other
non-fetching steps.
The time spent in generate and update is proportional to the size of
the crawldb. It might be half the total time at one point, but as the
crawldb grows it will take more than that.
The best option would probably be to generate multiple segments in
one go (see the options for the Generator), fetch all the segments
one by one, then merge them with the crawldb in a single call to
update.
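With the Nutch 1.x command-line tools, that workflow looks roughly like the sketch below. The -maxNumSegments option is what "generate multiple segments in one go" refers to, but option names vary between versions, so check the usage output of bin/nutch generate and bin/nutch updatedb on your install; the paths and -topN/-threads values are just placeholders:

```shell
# Generate several fetchlists in one pass over the crawldb
bin/nutch generate crawl/crawldb crawl/segments -topN 500000 -maxNumSegments 4

# Fetch (and parse) each generated segment in turn
for seg in crawl/segments/*; do
  bin/nutch fetch "$seg" -threads 100
  bin/nutch parse "$seg"
done

# Merge all fetched segments back into the crawldb with a single update
bin/nutch updatedb crawl/crawldb -dir crawl/segments
```

This amortizes the cost of the generate and update passes over several fetch runs instead of paying it once per segment.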
Right.
But with time (or, more precisely, as the crawldb grows) this
generation will start taking more and more time, and there is no way
around that, right?
Correct.
How does Bixo deal with that?
We don't. Or rather, we live with the pain of having to update the
crawldb, via building a new version from old + fetch loop results.
One solution is to use something like HBase, which we could easily do,
since there's a Cascading Tap for it.
We could partition the DB by pending/processed, which would reduce
the time for later fetch phases. Early on, though, most entries are
"new", so this doesn't save much.
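Here's a toy model of that "build a new version from old + fetch loop results" update, just to illustrate why its cost tracks crawldb size rather than fetch-batch size. The dict-based crawldb and status names are invented for this sketch - this is not Bixo's actual data model:

```python
def update_crawldb(old_db, fetch_results):
    """Build a new crawldb from the old one plus one fetch loop's results.

    Every entry of old_db is copied, so the cost is O(len(old_db)) even
    when fetch_results is tiny -- which is why the update step keeps
    getting slower as the crawl ages.
    """
    new_db = {}
    for url, entry in old_db.items():           # full pass over the old db
        new_db[url] = dict(entry)
    for url, status in fetch_results.items():   # overlay this loop's outcomes
        new_db.setdefault(url, {})["status"] = status
    return new_db

db = {"http://a.example/": {"status": "pending"},
      "http://b.example/": {"status": "pending"}}
db = update_crawldb(db, {"http://a.example/": "fetched",
                         "http://c.example/": "fetched"})
# b.example keeps its "pending" state, so it is picked up in the next loop
```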
[snip]
You will also inevitably hit slow servers, which will impact the
fetch rate - although not as badly as before the introduction of the
timeout on fetching.
Right, I remember this problem. So now one can specify how long each
fetch should last, and fetching will stop when that time is reached?
You can in Bixo, don't know about the current version of Nutch.
How does one pick that time limit, especially since fetch runs can
vary in speed depending on which hosts are in them? Wouldn't it be
better to express this in requests/second instead of time, so that
you can say "when fetching goes below N requests per second and stays
like that for M minutes, abort the fetch"?
We implemented something like that in Nutch back in 2006, but at the
time the Nutch fetching architecture was such that this felt very flaky.
In Bixo we have support for aborting requests if the response rate is
less than some specified limit, which seems to work well to avoid
problems with slow sites.
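Otis's "below N requests/second for M minutes" rule can be sketched as a small monitor over per-minute completion counts. This is a hypothetical standalone illustration, not Bixo's actual implementation (which, per the above, aborts individual slow requests rather than the whole run):

```python
from collections import deque

class FetchRateMonitor:
    """Signal an abort when throughput stays below a floor for too long."""

    def __init__(self, min_requests_per_sec, window_minutes):
        self.min_rps = min_requests_per_sec
        self.window = deque(maxlen=window_minutes)  # last M per-minute counts

    def record_minute(self, requests_completed):
        """Feed one minute's completed-request count; True means abort."""
        self.window.append(requests_completed)
        window_full = len(self.window) == self.window.maxlen
        below_floor = all(n < self.min_rps * 60 for n in self.window)
        return window_full and below_floor

monitor = FetchRateMonitor(min_requests_per_sec=50, window_minutes=3)
rates = [4000, 3500, 900, 800, 700]   # requests completed in each minute
aborted = [monitor.record_minute(n) for n in rates]
# only fires once three consecutive minutes fall below 50 req/s (3000/min)
```

The advantage over a fixed wall-clock limit is exactly the one raised above: a fast run is never cut short, while a run that degenerates into a tail of slow hosts gets aborted promptly.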
What if you have a really fast fetch run going on, but the time limit
is still reached and the fetch aborted? What do you do? Restart the
fetch with the same list of generated URLs as before? Somehow restart
with only the unfetched URLs? Generate a whole new fetchlist (which
ends up being slow)?
In Bixo, at least, what happens is that fetched URLs are then
processed, and unfetched URLs have their state unchanged - so they'll
get fetched in the next loop. I assume something similar is possible
in Nutch.
[snip]
-- Ken
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g