Hi
> > I'm trying to do some basic calculations trying to figure out what, in
> > terms of time, resources, and cost, it would take to crawl 500M URLs.
> > The obvious environment for this is EC2, so I'm wondering what people
> > are seeing in terms of fetch rate there these days? 50 pages/second?
> > 100? 200?
>
> Depends mostly on the distribution of URLs / host, whether you have a DNS
> cache etc... Using large instances, you can start with a conservative
> estimate of 125K URLs fetched per node and per hour.

In my case the crawl would be wide, which is good for URL distribution, but
bad for DNS. What's recommended for DNS caching? I do see
http://wiki.apache.org/nutch/OptimizingCrawls -- does that mean setting up a
local DNS server (e.g. bind), something like pdnsd, or something else?

125K URLs per node per hour... so again assuming I'm fetching only 12h out
of 24h, and with 50 machines (to match Ken's example):

125000*12*50 = 75M/day

That means 525M in 7 days. That is close to Ken's number, good. :)

> > Here's a time calculation that assumes 100 pages/second per EC2 instance:
> >
> > 100*60*60*12 = 4,320,000 URLs/day per EC2 instance
> >
> > That 12 means 12 hours, because last time I used Nutch I recall about
> > half of the time being spent in updatedb, generate, and other
> > non-fetching steps.
>
> The time spent in generate and update is proportional to the size of the
> crawldb. It might take half the time at one point, but it will take more
> than that later. The best option would probably be to generate multiple
> segments in one go (see the options for the Generator), fetch all the
> segments one by one, then merge them with the crawldb in a single call to
> update.

Right. But with time (or, more precisely, as the crawldb grows) this
generation will start taking more and more time, and there is no way around
that, right? How does Bixo deal with that?
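For the record, both sets of throughput numbers above check out. A quick
Python sanity check (the per-node figures are the estimates quoted in this
thread, not my own measurements):

```python
# Julien's conservative estimate: 125K URLs per node per hour,
# fetching only 12h out of every 24h, on 50 machines.
urls_per_node_hour = 125_000
fetch_hours_per_day = 12
nodes = 50

urls_per_day = urls_per_node_hour * fetch_hours_per_day * nodes
print(urls_per_day)            # 75,000,000 URLs/day
print(urls_per_day * 7)        # 525,000,000 in a week

# Ken's estimate: 100 pages/second per instance, 20 instances.
pages_per_sec = 100
instances = 20
per_day = pages_per_sec * 60 * 60 * fetch_hours_per_day * instances
print(per_day)                 # 86,400,000 URLs/day
print(500_000_000 / per_day)   # ~5.79 days to reach 500M URLs
```

So the two back-of-the-envelope estimates really do land within about a day
of each other for 500M URLs.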
> > If I have 20 servers fetching URLs, that's:
> >
> > 100*60*60*12*20 = 86,400,000 URLs/day -- this is starting to sound too
> > good to be true
> >
> > Then to crawl 500M URLs:
> >
> > 500000000/(100*60*60*12*20) = 5.78 days -- that's less than 1 week
> >
> > Suspiciously short, isn't it?
>
> It also depends on the rate at which new URLs are discovered and hence on
> your seed list.

Yeah. I want Ken's seed list! :)

> You will also inevitably hit slow servers, which will have an impact on
> the fetch rate -- although not as bad as before the introduction of the
> timeout on fetching.

Right, I remember this problem. So now one can specify how long each fetch
should last, and fetching will stop when that time is reached? How does one
guess what time limit to pick, especially since fetch runs can vary in how
fast they are depending on which hosts are in them? Wouldn't it be better to
express this in requests/second instead of time, so that you can say "when
fetching goes below N requests per second and stays like that for M minutes,
abort the fetch"?

And what if you have a really fast fetch run going, but the time limit is
still reached and the fetch aborted? What do you do then? Restart the fetch
with the same list of generated URLs as before? Somehow restart with only
the unfetched URLs? Generate a whole new fetchlist (which ends up being
slow)? A ton of questions, I know. :(

> The main point being that you will definitely get plenty of new URLs to
> fetch, but you will need to pay attention to the *quality* of what is
> fetched. Unless you are dealing with a limited number of target hosts, you
> will inevitably get loads of porn if you crawl in the open, and adult URLs
> (mostly redirections to other porn sites) will quickly take over your
> crawldb. As a result, your crawl will just be churning URLs generated
> automatically from adult sites, and despite the fact that your crawldb
> will contain loads of URLs, very few of them will be useful.
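If it's the mechanism I think it is, the fetch time limit is set in
nutch-site.xml. The property name below is what I believe recent Nutch 1.x
versions use -- worth double-checking against your version's
nutch-default.xml before relying on it:

```xml
<!-- Assumption: Nutch 1.x property name; verify in nutch-default.xml. -->
<property>
  <name>fetcher.timelimit.mins</name>
  <value>180</value>
  <description>Stop fetching this many minutes after the fetch starts;
  -1 disables the limit.</description>
</property>
```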
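To make the requests/second idea concrete, here's a toy sketch of the abort
rule I'm suggesting -- abort only when throughput stays below N req/s for M
minutes. This is purely hypothetical (the class and method names are made
up, and Nutch's actual limit is time-based, not rate-based):

```python
class ThroughputMonitor:
    """Toy sketch: abort when the fetch rate stays below `min_rate`
    requests/second for `window_secs` consecutive seconds."""

    def __init__(self, min_rate, window_secs):
        self.min_rate = min_rate
        self.window_secs = window_secs
        self.below_since = None  # timestamp when the rate first dropped

    def should_abort(self, now, rate):
        """Feed the current fetch rate (req/s); returns True to abort."""
        if rate >= self.min_rate:
            self.below_since = None  # rate recovered, reset the timer
            return False
        if self.below_since is None:
            self.below_since = now   # rate just dropped below the floor
        return now - self.below_since >= self.window_secs
```

Driven once a second from the fetcher's status loop, a run that only dips
briefly below N req/s survives, while one that stays slow for M minutes gets
aborted regardless of how long it has been running -- which would sidestep
the "fast run killed by the wall clock" problem above.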
One man's trash is another man's... But this is very good to know, thanks!

> Anyway, it's not just a matter of pages / second. Doing large, open
> crawls brings up a lot of interesting challenges :-)

Yup. Thanks Julien!

Otis

