Hi,
> Hi Otis,
>
> > Hi Ken :)
> >
> > I'm trying to do some basic calculations trying to figure out what, in
> > terms of time, resources, and cost, it would take to crawl 500M URLs.
>
> I can't directly comment on Nutch, but we recently did something similar to
> this (563M pages via EC2) using Bixo.

Eh, I was going to go over to Bixo's list and ask whether Bixo is suitable for
such wide crawls or whether it's meant more for vertical crawls. For some
reason I had it in the "for vertical crawls" bucket in my head, but it seems I
was wrong, ha?

> Since we're also using a sequence file to store the crawldb, the update time
> should be comparable, but we ran only one loop (since we started with a
> large set of known URLs).
>
> Some parameters that obviously impact crawl performance:
>
> * A default crawl delay of 15 seconds

That's very polite. Is a 3-second delay acceptable? (Some quick per-domain
numbers below, after your list.)

> * Batching (via keep-alive) 50 URLs per connection per IP address.

Does Nutch automatically do this? I don't recall seeing this setting in Nutch,
but it's been a while...

> * 500 fetch threads/server (250 per each of two reducers per server)
> * Crawling 1.7M domains

Is that because you restricted the crawl to 1.7M domains, or is that how many
distinct domains were in your seed list, or is that how many domains you
discovered while crawling?

> * Starting with about 1.2B known links

Where did you get that many? Also, if you start with 1.2B known links, how do
you end up with just 563M pages fetched? Maybe out of the 1.2B you simply got
to 563M before you stopped crawling?

> * Running in "efficient" mode - skip batches of URLs that can't be fetched
>   due to politeness.

Doesn't Nutch (and Bixo) do this automatically?

> * Fetching text, HTML, and image files
> * Cluster size of 50 slaves, using m1.large instances (with spot pricing)

I've never used spot instances. Isn't it the case that you can use a spot
instance only as long as your bid price is adequate? When the price goes up
due to demand, don't you get kicked off (because you are no longer paying
enough to meet the price)? If so, what happens to the cluster? Do you keep
adding new spot instances (at new/higher prices) to keep the cluster at a
more or less consistent size?
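
To make the delay question concrete, here's a tiny back-of-the-envelope in
plain Python. The 15-second figure is yours, the 3-second one is just my
hypothetical alternative, and the 12-hour window matches the fetch time in
your results further down:

    # Max pages one domain can yield under a fixed per-domain crawl delay:
    # a polite fetcher hits a given domain at most once every `delay` seconds,
    # no matter how many fetch threads it runs.
    FETCH_HOURS = 12  # rough fetch window, from the numbers in this thread

    for delay in (15, 3):  # 15 s = your default, 3 s = my hypothetical
        pages_per_hour = 3600 / delay
        pages_per_window = pages_per_hour * FETCH_HOURS
        print(f"{delay:>2}s delay: {pages_per_hour:>4.0f} pages/hour/domain, "
              f"{pages_per_window:>5.0f} pages/domain in {FETCH_HOURS}h")

    # Prints:
    # 15s delay:  240 pages/hour/domain,  2880 pages/domain in 12h
    #  3s delay: 1200 pages/hour/domain, 14400 pages/domain in 12h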

> The results were:
>
> * CPU cost was only $250
> * data-in was $2100 ($0.10/GB, and we fetched 21TB)

That's fast!

> * Major performance issue was not enough domains with lots of URLs to fetch
>   (power curve for URLs/domain)

Why is this a problem? Isn't this actually good? Isn't it better to have 100
hosts/domains with 10 pages each than 10 hosts/domains with 100 each?
Wouldn't fetching the former complete faster?

> * total cluster time of 22 hours, fetch time of about 12 hours

That's fast. Where did the 10-hour delta go?

> We didn't parse the fetched pages, which would have added some significant
> CPU cost.

Yeah. Would you dare to guess how much that would add in terms of
time/servers/cost?

Many thanks,
Otis

> HTH,
>
> -- Ken
>
> > The obvious environment for this is EC2, so I'm wondering what people are
> > seeing in terms of fetch rate there these days? 50 pages/second? 100? 200?
> >
> > Here's a time calculation that assumes 100 pages/second per EC2 instance:
> >
> > 100*60*60*12 = 4,320,000 URLs/day per EC2 instance
> >
> > That 12 means 12 hours because last time I used Nutch I recall about half
> > of the time being spent in updatedb, generate, and other non-fetching
> > steps.
> >
> > If I have 20 servers fetching URLs, that's:
> >
> > 100*60*60*12*20 = 86,400,000 URLs/day -- this is starting to sound too
> > good to be true
> >
> > Then to crawl 500M URLs:
> >
> > 500000000/(100*60*60*12*20) = 5.78 days -- that's less than 1 week
> >
> > Suspiciously short, isn't it?
> >
> > Thanks,
> > Otis
> > ----
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Lucene ecosystem search :: http://search-lucene.com/
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c w e b m i n i n g
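
P.S. For anyone who wants to poke at the numbers, here's the
back-of-the-envelope from my quoted mail above as a small Python snippet
(100 pages/second per instance and 12 fetch-hours/day are assumptions, not
measurements):

    # Rough time estimate for crawling 500M URLs on EC2.
    PAGES_PER_SEC = 100          # assumed per-instance fetch rate
    FETCH_HOURS_PER_DAY = 12     # the other ~12h go to updatedb/generate/etc.
    SERVERS = 20
    TARGET_URLS = 500_000_000

    per_instance_per_day = PAGES_PER_SEC * 3600 * FETCH_HOURS_PER_DAY
    cluster_per_day = per_instance_per_day * SERVERS
    days = TARGET_URLS / cluster_per_day

    print(f"{per_instance_per_day:,} URLs/day per instance")        # 4,320,000
    print(f"{cluster_per_day:,} URLs/day for {SERVERS} instances")  # 86,400,000
    print(f"{days:.2f} days to reach {TARGET_URLS:,} URLs")         # 5.79 days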

