Hi Otis,

> I'm trying to do some basic calculations to figure out what, in terms of
> time, resources, and cost, it would take to crawl 500M URLs.
> The obvious environment for this is EC2, so I'm wondering what people are
> seeing in terms of fetch rate there these days? 50 pages/second? 100? 200?
>

Depends mostly on the distribution of URLs per host, whether you have a DNS
cache, etc. Using large instances, you can start with a conservative
estimate of 125K URLs fetched per node per hour.
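
To compare that with the per-second figures you mention, a quick
back-of-envelope (plain Python, just the arithmetic):

```python
# Conservative estimate from above: 125K URLs fetched per node per hour.
urls_per_node_per_hour = 125_000

# Express it as pages/second to compare with the 100 pages/sec assumption.
pages_per_second = urls_per_node_per_hour / 3600
print(f"~{pages_per_second:.0f} pages/second per node")  # ~35
```

So roughly 35 pages/second per node, about a third of the 100 pages/second
you assumed.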


>
> Here's a time calculation that assumes 100 pages/second per EC2 instance:
>
>  100*60*60*12 = 4,320,000 URLs/day per EC2 instance
>
> That 12 means 12 hours, because last time I used Nutch I recall about half
> of the time being spent in updatedb, generate, and other non-fetching steps.
>

The time spent in generate and updatedb is proportional to the size of the
crawldb. It might account for half the time at one point, but it will take
more than that as the crawldb grows. The best option would probably be to
generate multiple segments in one go (see the options for the Generator),
fetch all the segments one by one, then merge them into the crawldb with a
single call to updatedb, as in the sketch below.
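
A rough sketch of that loop, assuming a Nutch 1.x install with bin/nutch
available and placeholder paths (generate's -maxNumSegments option and
updatedb's -dir form are the key pieces; adjust -topN and the segment count
to taste):

```python
import subprocess
from pathlib import Path

CRAWLDB = "crawl/crawldb"        # placeholder paths
SEGMENTS_DIR = "crawl/segments"

def nutch(*args):
    # Thin wrapper around the bin/nutch command-line script.
    subprocess.run(["bin/nutch", *args], check=True)

# 1. Generate several segments in one go instead of one per cycle.
nutch("generate", CRAWLDB, SEGMENTS_DIR,
      "-topN", "1000000", "-maxNumSegments", "10")

# 2. Fetch (and parse, if the fetcher doesn't parse inline) each segment.
for seg in sorted(Path(SEGMENTS_DIR).iterdir()):
    nutch("fetch", str(seg))
    nutch("parse", str(seg))

# 3. Merge all fetched segments into the crawldb in a single updatedb call.
nutch("updatedb", CRAWLDB, "-dir", SEGMENTS_DIR)
```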


>
> If I have 20 servers fetching URLs, that's:
>
>  100*60*60*12*20 = 86,400,000 URLs/day   -- this is starting to sound too
> good to be true
>
> Then to crawl 500M URLs:
>
>  500000000/(100*60*60*12*20) = 5.78 days  -- that's less than 1 week
>
> Suspiciously short, isn't it?
>

It also depends on the rate at which new URLs are discovered, and hence on
your seed list.
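
For what it's worth, redoing your estimate with the conservative per-node
figure from above (and keeping your 12-hour fetching assumption):

```python
TOTAL_URLS = 500_000_000
NODES = 20
URLS_PER_NODE_PER_HOUR = 125_000  # conservative estimate from above
FETCH_HOURS_PER_DAY = 12          # assumes half the day goes to generate/updatedb

urls_per_day = NODES * URLS_PER_NODE_PER_HOUR * FETCH_HOURS_PER_DAY
print(f"{urls_per_day:,} URLs/day")             # 30,000,000 URLs/day
print(f"{TOTAL_URLS / urls_per_day:.1f} days")  # ~16.7 days
```

And since the generate/updatedb share grows with the crawldb, even that is
on the optimistic side.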

You will also inevitably hit slow servers, which will have an impact on the
fetch rate, although not as bad as before the introduction of the timeout on
fetching. The main point is that you will definitely get plenty of new URLs
to fetch, but you will need to pay attention to the *quality* of what is
fetched. Unless you are dealing with a limited number of target hosts, you
will inevitably get loads of porn if you crawl in the open, and adult URLs
(mostly redirections to other porn sites) will quickly take over your
crawldb. As a result, your crawl will just be churning through URLs
generated automatically by adult sites: your crawldb will contain loads of
URLs, but very few useful ones. URL filtering is the main defence here (see
the sketch below).
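
In Nutch this kind of filtering normally lives in conf/regex-urlfilter.txt;
here is the idea as a minimal Python sketch, with purely illustrative
patterns (not a vetted blocklist):

```python
import re

# Illustrative exclusion patterns -- a real deployment would maintain a much
# larger, curated list (in Nutch this goes in conf/regex-urlfilter.txt).
EXCLUDE = [
    re.compile(r"\.(jpg|png|gif|zip|exe)$", re.IGNORECASE),  # non-HTML content
    re.compile(r"(porn|xxx|adult)", re.IGNORECASE),          # crude adult-URL filter
]

def accept(url: str) -> bool:
    # Reject a URL if any exclusion pattern matches; accept otherwise.
    return not any(p.search(url) for p in EXCLUDE)

urls = ["http://example.com/page.html", "http://xxx-site.example/redirect"]
print([u for u in urls if accept(u)])  # ['http://example.com/page.html']
```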

Anyway, it's not just a matter of pages per second. Doing large, open crawls
brings up a lot of interesting challenges :-)

HTH

Julien


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
