Hi

> > I'm trying to do some basic calculations trying to figure out what, in terms
> > of time, resources, and cost, it would take to crawl 500M URLs.
> > The obvious environment for this is EC2, so I'm wondering what people are
> > seeing in terms of fetch rate there these days? 50 pages/second? 100? 200?
> >
> 
> Depends mostly on the distribution of URLs / host, whether you have a DNS
> cache etc... Using large instances, you can start with a conservative

In my case the crawl would be wide, which is good for URL distribution, but bad
for DNS.
What's recommended for DNS caching?  I do see
http://wiki.apache.org/nutch/OptimizingCrawls -- does that mean setting up a
local DNS server (e.g. bind), something like pdnsd, or something else?
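Whichever option it is, the win is the same: resolve each host once and serve repeats from memory.  A minimal in-process sketch of the idea (the stub resolver and its counter are made up for illustration; a real lookup would be something like socket.gethostbyname):

```python
import functools

lookups = 0  # counts how many times the "real" resolver is hit

def dns_lookup(host):
    """Stub standing in for a real resolver call (e.g. socket.gethostbyname)."""
    global lookups
    lookups += 1
    return "192.0.2.1"  # placeholder address (TEST-NET-1)

@functools.lru_cache(maxsize=100_000)
def resolve(host):
    # A host-local caching DNS server (bind, dnsmasq, pdnsd) gives the same
    # effect across all fetcher threads and JVMs, not just one process.
    return dns_lookup(host)

for _ in range(1000):
    resolve("example.com")
print(lookups)  # only the first call reached the resolver -> 1
```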

> estimate at 125K URLs fetched per node and per hour

125K URLs per node per hour... so again assuming I'm fetching only 12h out of
24h and with 50 machines (to match Ken's example):

125,000 * 12 * 50 = 75M URLs/day

That means 525M in 7 days.  That is close to Ken's number, good. :)
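That arithmetic checks out:

```python
urls_per_node_hour = 125_000
fetch_hours_per_day = 12
nodes = 50

per_day = urls_per_node_hour * fetch_hours_per_day * nodes
print(per_day)      # 75,000,000 URLs/day
print(per_day * 7)  # 525,000,000 in a week
```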

> > Here's a time calculation that assumes 100 pages/second per EC2 instance:
> >
> > 100*60*60*12 = 4,320,000 URLs/day per EC2 instance
> >
> > That 12 means 12 hours because last time I used Nutch I recall about half
> > of the time being spent in updatedb, generate, and other non-fetching steps.
> >
> 
> The time spent in generate and update is proportional to the size of the
> crawldb. Might take half the time at one point but will take more than that.
> The best option would probably be to generate multiple segments in one go
> (see options for the Generator), fetch all the segments one by one, then
> merge them with the crawldb in a single call to update

Right.
But with time (or, more precisely, as crawldb grows) this generation will start 
taking more and more time, and there is no way around that, right?

How does Bixo deal with that?

> > If I have 20 servers fetching URLs, that's:
> >
> > 100*60*60*12*20 = 86,400,000 URLs/day -- this is starting to sound too
> > good to be true
> >
> > Then to crawl 500M URLs:
> >
> > 500000000/(100*60*60*12*20) = 5.78 days -- that's less than 1 week
> >
> > Suspiciously short, isn't it?
> >
> 
> it also depends on the rate at which new URLs are discovered and hence on
> your seedlist.

Yeah.  I want Ken's seed list! :)
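For what it's worth, the arithmetic in the quoted calculation also holds up:

```python
pages_per_sec = 100
fetch_hours = 12
instances = 20
target = 500_000_000

per_instance_day = pages_per_sec * 60 * 60 * fetch_hours  # 4,320,000
fleet_per_day = per_instance_day * instances              # 86,400,000
days = target / fleet_per_day
print(round(days, 2))  # ~5.79 days, i.e. under a week
```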

> You will also inevitably hit slow servers which will have an impact on the
> fetch rate - although not as bad as before the introduction of the timeout
> on fetching.

Right, I remember this problem.  So now one can specify how long each fetch
should last and fetching will stop when that time is reached?

How does one pick that time limit, especially since fetch runs can vary in
terms of how fast they are depending on what hosts are in them?

Wouldn't it be better to express this in requests/second instead of time, so 
that you can say "when fetching goes below N requests per second and stays like 
that for M minutes, abort fetch"?
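A sketch of that rate-based heuristic, just to illustrate the idea (this is not something Nutch implements; the FetchMonitor name and the thresholds are made up):

```python
import collections

class FetchMonitor:
    """Signal an abort when throughput stays below min_rate for window_secs.

    Hypothetical sketch of the rate-based abort proposed above; not Nutch code.
    """

    def __init__(self, min_rate, window_secs):
        self.min_rate = min_rate            # N requests/second threshold
        self.window_secs = window_secs      # M, how long the slump must last
        self.samples = collections.deque()  # (timestamp, rate) slow samples

    def record(self, timestamp, rate):
        """Feed one throughput sample; return True if the fetch should abort."""
        if rate >= self.min_rate:
            self.samples.clear()            # any fast sample resets the slump
            return False
        self.samples.append((timestamp, rate))
        slump = self.samples[-1][0] - self.samples[0][0]
        return slump >= self.window_secs

# Example: abort below 50 req/s sustained for 60 seconds.
mon = FetchMonitor(min_rate=50, window_secs=60)
aborted = [mon.record(t, rate) for t, rate in
           [(0, 200), (30, 40), (60, 30), (95, 20)]]
print(aborted)  # [False, False, False, True]
```

A fast fetch run would then never be cut off early, which sidesteps the "time limit reached mid-run" problem below.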

What if you have a really fast fetch run going on, but the time limit is still
reached and the fetch aborted?  What do you do?  Restart the fetch with the
same list of generated URLs as before?  Somehow restart with only unfetched
URLs?  Generate a whole new fetchlist (which ends up being slow)?

A ton of questions, I know. :(

> The main point being that you will definitely get plenty of new
> URLs to fetch but will need to pay attention to the *quality* of what is
> fetched. Unless you are dealing with a limited number of target hosts, you
> will inevitably get loads of porn if you crawl in the open, and adult URLs
> (mostly redirections to other porn sites) will quickly take over your
> crawldb. As a result your crawl will just be churning URLs generated
> automatically from adult sites, and despite the fact that your crawldb will
> contain loads of URLs there will be very few useful ones.

One man's trash is another man's...
But this is very good to know, thanks!
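One place to fight that battle in Nutch is the regex URL filter
(conf/regex-urlfilter.txt), where a line starting with `-` rejects matching
URLs and `+` accepts them.  The patterns below are purely illustrative, not a
workable adult-content blocklist:

```
# reject URLs containing obvious adult keywords (illustrative only)
-(?i)\b(porn|xxx)\b
# reject anything with repeated query strings (often auto-generated spam)
-.*\?.*\?.*
# accept everything else
+.
```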

> Anyway, it's not just a matter of pages / seconds. Doing large, open crawls
> brings up a lot of interesting challenges :-)

Yup.  Thanks Julien!

Otis
