So I indeed have a local DNS server running; it does not seem to help much. I just finished a run of 80K URLs: at the beginning the speed was around 15 fetches/s, and at the end, due to the long-tail effect, it drops below 1 fetch/s. The average over a 12h34m run was 1.7 fetches/s, which is pretty slow really.
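Worth checking too: even with a local DNS server, the JVM keeps its own lookup cache, and on Sun-era JREs successful lookups are only cached for about 30 seconds by default (unless a security manager is installed). A sketch of the relevant knobs, assuming a Sun/Oracle JRE, set in $JAVA_HOME/jre/lib/security/java.security:

    # Cache successful DNS lookups for one hour instead of the ~30s default
    networkaddress.cache.ttl=3600
    # Expire failed lookups quickly so transient resolver errors get retried
    networkaddress.cache.negative.ttl=10

With a long TTL the fetcher pays the resolver round-trip only once per host per run, which matters most when a segment keeps revisiting the same long-tail hosts.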
I limit URLs per site to 1000 to limit the long-tail effect... however, I have half a dozen sites which I need to index that have around 60K URLs each, so it means I will need 60 runs to get them indexed. So I would like to increase the limit, which will also increase the long-tail effect. Interesting dilemma.

2009/11/24 MilleBii <[email protected]>

> Why would local DNS caching work? It only works if you are going to
> crawl the same site often... in which case you are hit by the
> politeness delay.
>
> If you have segments with only/mainly different sites, it is not really
> going to help.
>
> So far I have not seen my quad core + 100 Mb/s + pseudo-distributed
> Hadoop going faster than 10 fetches/s... Let me check the DNS and I
> will tell you.
>
> I vote for 100 fetches/s, not sure how to get it though.
>
>
> 2009/11/24, Dennis Kubes <[email protected]>:
> > Hi Mark,
> >
> > I just put this up on the wiki. Hope it helps:
> >
> > http://wiki.apache.org/nutch/OptimizingCrawls
> >
> > Dennis
> >
> >
> > Mark Kerzner wrote:
> >> Hi, guys,
> >>
> >> my goal is to do my crawls at 100 fetches per second, observing, of
> >> course, polite crawling. But when URLs are all from different domains,
> >> what theoretically would stop some software from downloading from 100
> >> domains at once, achieving the desired speed?
> >>
> >> But whatever I do, I can't make Nutch crawl at that speed. Even if it
> >> starts at a few dozen URLs/second, it slows down at the end (as
> >> discussed by many and by Krugler).
> >>
> >> Should I write something of my own, or are there fast crawlers?
> >>
> >> Thanks!
> >>
> >> Mark
> >>
>
> --
> Sent from my mobile
>
> -MilleBii-

--
-MilleBii-
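For anyone landing here from the archives: the limits discussed above map to standard Nutch 1.x properties, overridable in conf/nutch-site.xml. The values below are only illustrative (the 1000-per-site cap and a one-second politeness delay are assumptions taken from this thread); check nutch-default.xml for the exact names in your version:

    <!-- Cap the number of URLs generated per host in each segment
         (the "1000 per site" limit discussed above). -->
    <property>
      <name>generate.max.per.host</name>
      <value>1000</value>
    </property>

    <!-- Overall fetcher parallelism; more threads only help when the
         fetch list spans many distinct hosts. -->
    <property>
      <name>fetcher.threads.fetch</name>
      <value>100</value>
    </property>

    <!-- Politeness: concurrent threads allowed on one host, and the
         delay between successive requests to the same host. -->
    <property>
      <name>fetcher.threads.per.host</name>
      <value>1</value>
    </property>
    <property>
      <name>fetcher.server.delay</name>
      <value>1.0</value>
    </property>

Raising generate.max.per.host cuts the number of runs needed for the big sites but deepens the long tail within each run, since each host's queue drains at one request per fetcher.server.delay; that is exactly the dilemma described above.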
