Hi Otis,
I'm trying to do some basic calculations to figure out what, in terms of time, resources, and cost, it would take to crawl 500M URLs.
I can't directly comment on Nutch, but we recently did something similar to this (563M pages via EC2) using Bixo.
Eh, I was going to go over to Bixo's list and ask whether Bixo is suitable for such wide crawls or whether it's meant more for vertical crawls. For some reason I had it in the "for vertical crawls" bucket in my head, but it seems I was wrong, huh?
Well, it's a toolkit - so hooking it up for a wide/big crawl means
writing some code, but there's nothing in the Bixo architecture (after
a few revs) that precludes using it in this manner.
Since we're also using a sequence file to store the crawldb, the update time should be comparable, but we ran only one loop (since we started with a large set of known URLs).
Some parameters that obviously impact crawl performance:
* A default crawl delay of 15 seconds
That's very polite. Is 3 seconds delay acceptable?
It depends on the site. For somebody big like (say) CNN, 3 seconds would probably be OK. For smaller sites, 30 seconds is actually a better value.
We actually modified the fetch policy we're using so that for low-traffic sites (based on Alexa/Quantcast data) we use a 60 second delay, down to about 5 seconds for top sites.
Even more important is to limit pages/day for smaller sites to <
1000...unless you enjoy getting angry emails from irate webmasters :)
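As a rough Python sketch, the rank-based politeness policy described above might look like the following. The rank thresholds here are invented for illustration; the thread only specifies the 5-second and 60-second endpoints and the 1000 pages/day cap.

```python
def crawl_delay_seconds(traffic_rank):
    """Map a site's traffic rank (1 = most traffic, from e.g. Alexa or
    Quantcast data) to a fetch delay. Thresholds are hypothetical."""
    if traffic_rank is None:        # unknown site: be maximally polite
        return 60
    if traffic_rank <= 10_000:      # top sites can handle frequent hits
        return 5
    if traffic_rank <= 1_000_000:   # mid-tier sites get a middle value
        return 15
    return 60                       # long tail: low traffic, long delay

# Cap suggested above for small sites, to avoid irate-webmaster email.
MAX_PAGES_PER_DAY_SMALL_SITE = 1000
```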
* Batching (via keep-alive) 50 URLs per connection per IP address.
Does Nutch automatically do this? I don't recall seeing this setting in Nutch, but it's been a while...
No, I don't believe Nutch has this implemented.
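The batching idea is just: group the fetch list by host (or, as in the thread, by resolved IP address) and hand each persistent connection up to 50 URLs. A minimal sketch of the grouping step, using host as a stand-in for IP to avoid DNS lookups:

```python
from collections import defaultdict
from itertools import islice
from urllib.parse import urlparse

BATCH_SIZE = 50  # URLs fetched over one keep-alive connection

def batches_by_host(urls, batch_size=BATCH_SIZE):
    """Group URLs by host and yield (host, batch) pairs of up to
    batch_size URLs, each batch intended for a single persistent
    (keep-alive) HTTP connection. Grouping by resolved IP, as
    described above, would need a DNS lookup per host."""
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlparse(url).netloc].append(url)
    for host, host_urls in by_host.items():
        it = iter(host_urls)
        while True:
            batch = list(islice(it, batch_size))
            if not batch:
                break
            yield host, batch
```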
* 500 fetch threads/server (250 per each of two reducers per server)
* Crawling 1.7M domains
Is this because you restricted it to 1.7M domains, or is that how many distinct domains were in your seed list, or is that how many domains you've discovered while crawling?
We restricted it, based on top domains (where top == most US-based traffic).
* Starting with about 1.2B known links
Where did you get that many of them?
From a previous crawl, of roughly the same size.
Also, if you start with 1.2B known links, how do you end up with just 563M pages fetched? Maybe out of 1.2B you simply got to only 563M before you stopped crawling?
Because we're running with Bixo's "efficient" mode (see below).
* Running in "efficient" mode - skip batches of URLs that can't be fetched due to politeness.
Doesn't Nutch (and Bixo) do this automatically?
Nutch will block and not fetch a URL until sufficient time has passed.
Bixo can do the same thing, but when you run a crawl like this, you
often wind up blocked on a few slow sites.
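The skip-vs-block distinction can be sketched in a few lines. This is a hypothetical illustration, not Bixo's or Nutch's actual API: "efficient" mode returns immediately when a domain isn't ready, while blocking mode sleeps until enough politeness time has passed.

```python
import time

class PolitenessGate:
    """Hypothetical sketch of the skip-vs-block choice described above."""

    def __init__(self, efficient=True):
        self.next_ok = {}          # domain -> earliest allowed fetch time
        self.efficient = efficient

    def acquire(self, domain, delay, now=None, sleep=time.sleep):
        """Return True if a fetch from `domain` may proceed now.
        In efficient mode, return False (skip the batch) if the
        politeness delay hasn't elapsed; otherwise block until it has."""
        now = time.time() if now is None else now
        ready_at = self.next_ok.get(domain, 0.0)
        if now < ready_at:
            if self.efficient:
                return False       # skip this batch; retry next loop
            sleep(ready_at - now)  # blocking style: wait until allowed
            now = ready_at
        self.next_ok[domain] = now + delay
        return True
```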
* Fetching text, HTML, and image files
* Cluster size of 50 slaves, using m1.large instances (with spot pricing)
I've never used spot instances. Isn't it the case that you can use them only as long as your bid meets the current price? When the price goes up due to demand, don't you get kicked off (because your bid no longer meets the price)? If that's so, what happens to the cluster? Do you keep adding new spot instances (at new/higher bids) to keep the cluster at a more or less consistent size?
We run the master without spot pricing, and use a very high max bid to
help ensure we rarely (if ever) lose servers.
The results were:
* CPU cost was only $250
* data-in was $2100 ($0.10/GB, and we fetched 21TB)
That's fast!
* Major performance issue was not enough domains with lots of URLs to fetch (power curve for URLs/domain)
Why is this a problem? Isn't this actually good? Isn't it better to have 100 hosts/domains with 10 pages each than 10 hosts/domains with 100 each? Wouldn't fetching the former complete faster?
The problem is that there are too many domains with only a handful of
pages.
So very quickly, the set of available domains to fetch from is reduced
down to a fraction of that initial 1.7M, and then politeness starts
causing either (a) very inefficient utilization of resources, as most
threads are spinning w/o doing any work, or (b) you start skipping
lots of URLs for domains that aren't ready yet (not enough time has
elapsed since the prior batch of URLs were fetched).
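A toy calculation makes the effect vivid. With a power-law distribution of URLs per domain, most domains run dry within a batch or two, so the pool of fetchable domains collapses quickly. The counts below are invented purely for illustration:

```python
def domains_still_active(urls_per_domain, batch_size, rounds):
    """How many domains still have unfetched URLs after `rounds`
    batches of up to `batch_size` URLs each have been taken?"""
    return sum(1 for n in urls_per_domain if n > batch_size * rounds)

# Zipf-ish tail: a few domains with many URLs, a huge number with a handful.
urls_per_domain = [100_000] * 10 + [1_000] * 1_000 + [5] * 100_000

print(domains_still_active(urls_per_domain, 50, 1))   # 1010 of 101,010 left
print(domains_still_active(urls_per_domain, 50, 25))  # only 10 left
```

After a single 50-URL batch per domain, the 100,000 five-page domains are already exhausted, and politeness delays on the survivors dominate.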
* total cluster time of 22 hours, fetch time of about 12 hours
That's fast. Where did the delta of 10h go?
Jobs to extract links from the crawldb, partition by IP address, fetch
robots.txt, etc.
We didn't parse the fetched pages, which would have added some significant CPU cost.
Yeah. Would you dare to guess how much that would add in terms of time/servers/cost?
I've got some data, but I'd need to dig it up after we finish a
deliverable that's due by 5pm :(
-- Ken
The obvious environment for this is EC2, so I'm wondering what people are seeing in terms of fetch rate there these days? 50 pages/second? 100? 200?
Here's a time calculation that assumes 100 pages/second per EC2 instance:
100*60*60*12 = 4,320,000 URLs/day per EC2 instance
That 12 means 12 hours, because last time I used Nutch I recall about half of the time being spent in updatedb, generate, and other non-fetching steps.
If I have 20 servers fetching URLs, that's:
100*60*60*12*20 = 86,400,000 URLs/day -- this is starting to sound too good to be true
Then to crawl 500M URLs:
500000000/(100*60*60*12*20) = 5.78 days -- that's less than 1 week
Suspiciously short, isn't it?
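The arithmetic above checks out; here it is as a small Python sanity check (the 5.78 in the text is the same number truncated rather than rounded):

```python
# Back-of-the-envelope crawl-time estimate from the thread.
pages_per_sec = 100
fetch_hours_per_day = 12          # ~half the day lost to updatedb/generate
servers = 20
target_urls = 500_000_000

per_instance_per_day = pages_per_sec * 3600 * fetch_hours_per_day
cluster_per_day = per_instance_per_day * servers
days = target_urls / cluster_per_day

print(per_instance_per_day)   # 4,320,000 URLs/day per instance
print(cluster_per_day)        # 86,400,000 URLs/day for the cluster
print(round(days, 2))         # 5.79 days
```

Of course, this assumes the fetch rate holds up, which (per the power-curve discussion above) is exactly what breaks down late in a wide crawl.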
Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g
--------------------------