Hi Otis,

I'm trying to do some basic calculations to figure out what, in terms of
time, resources, and cost, it would take to crawl 500M URLs.

I can't directly comment on Nutch, but we recently did something similar to
this (563M pages via EC2) using Bixo.

Eh, I was going to go over to Bixo's list and ask whether Bixo is suitable for such wide crawls or whether it's meant more for vertical crawls. For some reason I had it in the "for vertical crawls" bucket, but it seems I was wrong,
huh?

Well, it's a toolkit - so hooking it up for a wide/big crawl means writing some code, but there's nothing in the Bixo architecture (after a few revs) that precludes using it in this manner.

Since we're also using a sequence file to store the crawldb, the update time should be comparable, but we ran only one loop (since we started with a large
set of known URLs).

Some parameters that obviously impact crawl performance:

* A default crawl delay of 15 seconds

That's very polite.  Is 3 seconds delay acceptable?

It depends on the site. For somebody big like (say) CNN, 3 seconds would probably be OK.

For smaller sites, 30 seconds is actually a better value.

We actually modified the fetch policy we're using so that for low-traffic sites (based on Alexa/Quantcast data) we use a 60 second delay, down to about 5 seconds for top sites.

Even more important is to limit pages/day for smaller sites to < 1000...unless you enjoy getting angry emails from irate webmasters :)

* Batching (via keep-alive) 50 URLs per connection per IP address.

Does Nutch automatically do this? I don't recall seeing this setting in Nutch,
but it's been a while...

No, I don't believe Nutch has this implemented.
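The batching idea -- group URLs by server IP, then fetch up to 50 of them over one keep-alive connection -- could be sketched like this (illustrative only; `batch_by_ip` and the batch-size handling are my assumptions, not Bixo's implementation):

```python
from collections import defaultdict

def batch_by_ip(url_ip_pairs, batch_size=50):
    """Group URLs by the server IP they resolve to, then yield
    (ip, urls) batches of at most batch_size URLs each -- one batch
    per keep-alive connection."""
    by_ip = defaultdict(list)
    for url, ip in url_ip_pairs:
        by_ip[ip].append(url)
    for ip, urls in by_ip.items():
        for i in range(0, len(urls), batch_size):
            yield ip, urls[i:i + batch_size]
```

Keying on IP rather than hostname matters because many small domains share a server, and politeness should protect the server, not just the name.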


* 500 fetch threads/server (250 per each of two reducers per server)
* Crawling 1.7M domains

Is this because you restricted it to 1.7M domains, or is that how many distinct domains were in your seed list, or is that how many domains you've discovered
while crawling?

We restricted it, based on top domains (where top == most US-based traffic).

* Starting with about 1.2B known links

Where did you get that many of them?

From a previous crawl, of roughly the same size.

Also, if you start with 1.2B known links, how do you end up with just 563M pages fetched? Maybe out of 1.2B you simply got to only 563M before you stopped
crawling?

Because we're running with Bixo's "efficient" mode (see below).

* Running in "efficient" mode - skip batches of URLs that can't be fetched due
to politeness.

Doesn't Nutch (and Bixo) do this automatically?

Nutch will block and not fetch a URL until sufficient time has passed.

Bixo can do the same thing, but when you run a crawl like this, you often wind up blocked on a few slow sites.

* Fetching text, HTML, and image files
* Cluster size of 50 slaves, using m1.large instances (with spot pricing)

I've never used spot instances. Isn't it the case that you can use a spot instance only as long as your bid meets the going price? When the price goes up due to demand, don't you get kicked off (because you are no longer paying enough to meet the price)? If that's so, what happens to the cluster? Do you keep adding new spot instances (at new/higher prices) to keep the cluster at a more or less consistent size?

We run the master without spot pricing, and use a very high max bid to help ensure we rarely (if ever) lose servers.

The results were:

* CPU cost was only $250
* data-in  was $2100 ($0.10/GB, and we fetched 21TB)

That's cheap!

* Major performance issue was not enough domains with lots of URLs to fetch
(power curve for URLs/domain)

Why is this a problem? Isn't this actually good? Isn't it better to have 100 hosts/domains with 10 pages each than 10 hosts/domains with 100 each? Wouldn't
fetching of the former complete faster?

The problem is that there are too many domains with only a handful of pages.

So very quickly, the set of available domains to fetch from is reduced to a fraction of that initial 1.7M. Then politeness starts causing either (a) very inefficient utilization of resources, as most threads are spinning without doing any work, or (b) lots of skipped URLs for domains that aren't ready yet (not enough time has elapsed since the prior batch of URLs was fetched).
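The skip-vs-block distinction can be sketched as a scheduling loop (names and the 15s default are my assumptions, not Nutch or Bixo internals):

```python
import time

def next_batch(domain_queues, last_fetch, delays, now=None, skip=True):
    """Pick the next domain whose URLs we may fetch.

    skip=True is the 'efficient' mode: domains whose politeness delay
    hasn't elapsed are passed over, which is how a crawl can finish
    with far fewer pages fetched than links known.
    skip=False is the blocking (Nutch-style) behavior: wait until the
    first domain becomes ready.
    """
    now = time.time() if now is None else now
    for domain, urls in domain_queues.items():
        if not urls:
            continue
        ready_at = last_fetch.get(domain, 0.0) + delays.get(domain, 15.0)
        if now >= ready_at:
            return domain, urls
        if not skip:
            time.sleep(ready_at - now)  # block until this domain is ready
            return domain, urls
    return None  # nothing ready; efficient mode drops these URLs this loop
```

With a power-curve distribution of URLs/domain, most domains hit the "not ready" branch most of the time, which is exactly the utilization problem Ken describes.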

* Total cluster time of 22 hours, fetch time of about 12 hours

That's fast.  Where did the delta of 10h go?

Jobs to extract links from the crawldb, partition by IP address, fetch robots.txt, etc.

We didn't parse the fetched pages, which would have added some significant CPU
cost.

Yeah.  Would you dare to guess how much that would add in terms of
time/servers/cost?

I've got some data, but I'd need to dig it up after we finish a deliverable that's due by 5pm :(

-- Ken

The obvious environment for this is EC2, so I'm wondering what people are seeing
in terms of fetch rate there these days? 50 pages/second? 100? 200?


Here's a time calculation that assumes 100 pages/second per EC2 instance:

100*60*60*12 = 4,320,000 URLs/day per EC2 instance

That 12 means 12 hours, because last time I used Nutch I recall about half of
the time being spent in updatedb, generate, and other non-fetching steps.

If I have 20 servers fetching URLs, that's:

100*60*60*12*20 = 86,400,000 URLs/day -- this is starting to sound too good
to be true

Then to crawl 500M URLs:

500000000/(100*60*60*12*20) = 5.79 days -- that's less than 1 week

Suspiciously short, isn't it?
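The arithmetic above checks out; written out with the same assumed rates:

```python
pages_per_sec = 100       # assumed per-instance fetch rate
fetch_hours_per_day = 12  # the other half goes to updatedb, generate, etc.
servers = 20
total_urls = 500_000_000

urls_per_instance_day = pages_per_sec * 60 * 60 * fetch_hours_per_day
urls_per_cluster_day = urls_per_instance_day * servers
days = total_urls / urls_per_cluster_day

print(urls_per_instance_day)        # 4320000
print(urls_per_cluster_day)         # 86400000
print(round(days, 2))               # 5.79
```

Of course the result is only as good as the 100 pages/second and 12 fetch-hours/day assumptions.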

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


--------------------------
Ken  Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






