Hi,

> Hi Otis,

Hi Ken :)

> > I'm trying to do some basic calculations to figure out what, in terms of
> > time, resources, and cost, it would take to crawl 500M URLs.
> 
> I can't directly comment on Nutch, but we recently did something similar to
> this (563M pages via EC2) using Bixo.

Eh, I was going to go over to Bixo's list and ask if Bixo is suitable for such 
wide crawls or whether it's meant more for vertical crawls.  For some reason in 
my head I had it in the "for vertical crawl" bucket, but it seems I was wrong,
huh?

> Since we're also using a sequence file to store the crawldb, the update time
> should be comparable, but we ran only one loop (since we started with a large
> set of known URLs).
> 
> Some parameters that obviously impact crawl performance:
> 
> *  A default crawl delay of 15 seconds

That's very polite.  Is a 3-second delay acceptable?

> * Batching (via keep-alive) 50 URLs per  connection per IP address.

Does Nutch automatically do this?  I don't recall seeing this setting in Nutch, 
but it's been a while...

> * 500 fetch threads/server (250 per each of two  reducers per server)
> * Crawling 1.7M domains

Is this because you restricted it to 1.7M domains, or is that how many distinct 
domains were in your seed list, or is that how many domains you've discovered 
while crawling?

> * Starting with about 1.2B  known links

Where did you get that many of them?
Also, if you start with 1.2B known links, how do you end up with just 563M
pages fetched?  Maybe out of 1.2B you simply got to only 563M before you
stopped crawling?

> * Running in "efficient" mode - skip batches of URLs that can't be fetched
> due to politeness.

Doesn't Nutch (and Bixo) do this automatically?

> * Fetching text, HTML, and image files
> *  Cluster size of 50 slaves, using m1.large instances (with spot  pricing)

I've never used spot instances.  Isn't it the case that with spot instances you
can keep using them only as long as the price you bid is adequate?  When the
price goes up due to demand, don't you get kicked off (because you are no
longer paying enough to meet the price)?  If that's so, what happens with the
cluster?  Do you keep adding new spot instances (at new/higher prices) to keep
the cluster at a more or less consistent size?

> The results were:
> 
> * CPU cost was only $250
> * data-in  was $2100 ($0.10/GB, and we fetched 21TB)

That's fast!
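
Just to sanity-check the data-in figure, here is a rough sketch of the arithmetic. It uses only the numbers quoted above; treating 1 TB as 1000 GB is my assumption, and the function names are made up for illustration.

```python
# Figures taken from the quoted results.
DATA_IN_TB = 21
PRICE_PER_GB_USD = 0.10
FETCHED = 563_000_000  # pages (text, HTML, and image files)


def data_in_cost_usd(terabytes, price_per_gb):
    """Inbound transfer cost, assuming decimal units (1 TB = 1000 GB)."""
    return terabytes * 1000 * price_per_gb


def avg_bytes_per_fetch(terabytes, fetched):
    """Average size of one fetched document, assuming 1 TB = 1e12 bytes."""
    return terabytes * 1e12 / fetched


cost = data_in_cost_usd(DATA_IN_TB, PRICE_PER_GB_USD)
avg_size = avg_bytes_per_fetch(DATA_IN_TB, FETCHED)

print(f"data-in cost: ${cost:,.0f}")
print(f"average fetch size: {avg_size / 1000:.1f} KB")
```

So the $2100 matches 21TB at $0.10/GB, and the average fetched document works out to roughly 37 KB under those unit assumptions.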

> * Major performance issue was not enough domains with lots of URLs to fetch
> (power curve for URLs/domain)

Why is this a problem?  Isn't this actually good?  Isn't it better to have 100 
hosts/domains with 10 pages each than 10 hosts/domains with 100 each?  Wouldn't 
fetching of the former complete faster?
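
Here is how I picture it, as a minimal sketch rather than a description of what Bixo or Nutch actually does. It assumes one request per domain per crawl delay, ignores keep-alive batching and fetch latency, and assumes there are always enough threads to keep every domain busy. The function name and the "skewed" distribution are hypothetical.

```python
CRAWL_DELAY_S = 15  # the default crawl delay mentioned above


def min_politeness_seconds(urls_per_domain, crawl_delay_s):
    """Lower bound on fetch time imposed by politeness alone.

    Domains are fetched in parallel, so the domain with the most URLs
    sets the pace: n URLs need n - 1 waits of one crawl delay each.
    """
    busiest = max(urls_per_domain)
    return max(busiest - 1, 0) * crawl_delay_s


many_small = [10] * 100      # 100 domains with 10 pages each
few_large = [100] * 10       # 10 domains with 100 pages each
skewed = [910] + [10] * 9    # hypothetical: same 1000 URLs, one dominant domain

for name, dist in [("many_small", many_small),
                   ("few_large", few_large),
                   ("skewed", skewed)]:
    secs = min_politeness_seconds(dist, CRAWL_DELAY_S)
    print(f"{name}: {sum(dist)} URLs, at least {secs} s")
```

Under those assumptions the same 1000 URLs need at least 135 s, 1485 s, or 13635 s depending only on how they are spread across domains, which is why I would have expected many small domains to be the easy case.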

> * Total cluster time of 22 hours, fetch time of about 12 hours

That's fast.  Where did the delta of 10h go?
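
For comparison with the 100 pages/second and "half the time is fetching" guesses in my original estimate (quoted further down), here is a rough sketch of the per-server rates implied by your numbers. It assumes the 563M fetches were spread evenly over all 50 slaves, which is my simplification; the function name is made up.

```python
# Figures taken from the quoted results.
FETCHED = 563_000_000
SERVERS = 50
FETCH_HOURS = 12      # "about 12 hours" of fetching
CLUSTER_HOURS = 22    # total cluster time


def pages_per_second_per_server(pages, servers, hours):
    """Average fetch rate per server, assuming an even spread across servers."""
    return pages / (hours * 3600) / servers


fetch_rate = pages_per_second_per_server(FETCHED, SERVERS, FETCH_HOURS)
overall_rate = pages_per_second_per_server(FETCHED, SERVERS, CLUSTER_HOURS)
fetch_share = FETCH_HOURS / CLUSTER_HOURS

print(f"while fetching: {fetch_rate:.0f} pages/s per server")
print(f"over the whole run: {overall_rate:.0f} pages/s per server")
print(f"share of cluster time spent fetching: {fetch_share:.0%}")
```

That comes out to roughly 261 pages/second per server during the fetch phase and about 142 pages/second per server averaged over the full 22 hours, with a bit over half of the cluster time spent fetching.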

> We didn't parse the fetched pages, which would have added some significant
> CPU cost.

Yeah.  Would you dare to guess how much that would add in terms of 
time/servers/cost?

Many thanks!

Thanks,
Otis


> HTH,
> 
> -- Ken
> 
> > The obvious environment for this is EC2, so I'm wondering what people are
> > seeing in terms of fetch rate there these days? 50 pages/second? 100? 200?
> > 
> > 
> > Here's a  time calculation that assumes 100 pages/second per EC2 instance:
> > 
> >  100*60*60*12 = 4,320,000 URLs/day per EC2 instance
> > 
> > That 12 means 12 hours because last time I used Nutch I recall about half
> > of the time being spent in updatedb, generate, and other non-fetching steps.
> > 
> > If I have 20 servers fetching  URLs, that's:
> > 
> >  100*60*60*12*20 = 86,400,000 URLs/day    -- this is starting to sound too
> > good to be true
> > 
> > Then  to crawl 500M URLs:
> > 
> >  500000000/(100*60*60*12*20) = 5.78  days  -- that's less than 1 week
> > 
> > Suspiciously short, isn't  it?
> > 
> > Thanks,
> > Otis
> > ----
> > Sematext :: http://sematext.com/ :: Solr -  Lucene - Nutch
> > Lucene ecosystem search :: http://search-lucene.com/
> > 
> 
> --------------------------
> Ken  Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
> 
> 
> 
> 
> 
> 
