I looked at the bandwidth profile of my last two runs and they have the same shape:
starts at 5 MBytes/s and decreases to below 500 kBytes/s with a 1/x kind of curve shape. Fairly even distribution of URLs, local DNS is running... I can't find a good explanation for this behavior... It looks to me that when the fetch queue is full it is a lot less effective??? I use 100 threads.

2009/11/24, MilleBii <[email protected]>:
> So I indeed have a local DNS server running; it does not seem to help so much.
>
> Just finished a run of 80K URLs. At the beginning the speed can be around 15
> Fetch/s and at the end, due to long-tail effects, I get below 1 Fetch/s...
> Average on a 12h34 run: 1.7 Fetch/s, pretty slow really.
>
> I limit URLs per site to 1000 to limit the long-tail effect... however I
> have half-a-dozen sites which I need to index and they have around 60k URLs,
> so it means I will need 60 runs to get them indexed. So I would like to
> increase the limit, which will increase the long-tail effect. Interesting
> dilemma.
>
> 2009/11/24 MilleBii <[email protected]>
>
>> Why would local DNS caching work? It only helps if you are going to
>> crawl the same site often... in which case you are hit by the
>> politeness delay.
>>
>> If you have segments with only/mainly different sites it is not really
>> going to help.
>>
>> So far I have not seen my quad core + 100 Mb/s + pseudo-distributed
>> Hadoop going faster than 10 fetch/s... Let me check the DNS and I
>> will tell you.
>>
>> I vote for 100 Fetch/s; not sure how to get it, though.
>>
>> 2009/11/24, Dennis Kubes <[email protected]>:
>> > Hi Mark,
>> >
>> > I just put this up on the wiki. Hope it helps:
>> >
>> > http://wiki.apache.org/nutch/OptimizingCrawls
>> >
>> > Dennis
>> >
>> > Mark Kerzner wrote:
>> >> Hi, guys,
>> >>
>> >> my goal is to do my crawls at 100 fetches per second, observing, of
>> >> course, polite crawling. But when URLs are all on different domains,
>> >> what theoretically would stop some software from downloading from
>> >> 100 domains at once, achieving the desired speed?
>> >>
>> >> But whatever I do, I can't make Nutch crawl at that speed. Even if it
>> >> starts at a few dozen URLs/second, it slows down at the end (as
>> >> discussed by many and by Krugler).
>> >>
>> >> Should I write something of my own, or are there fast crawlers?
>> >>
>> >> Thanks!
>> >>
>> >> Mark
>> >
>>
>> --
>> Sent from my mobile
>>
>> -MilleBii-
>
>
> --
> -MilleBii-

--
Sent from my mobile

-MilleBii-
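
A back-of-envelope sketch of why the tail gets so slow regardless of thread
count. It assumes politeness allows at most one request per host per server
delay (Nutch's fetcher.server.delay, which defaults to 5.0 s in
nutch-default.xml); the host counts below are illustrative, not measured:

    # Politeness caps the aggregate rate by the number of distinct hosts
    # still in the fetch queue, no matter how many fetcher threads run.
    def max_fetch_rate(distinct_hosts, server_delay_s=5.0):
        """Upper bound on aggregate fetches/s under per-host politeness."""
        return distinct_hosts / server_delay_s

    # Early in a segment: many distinct hosts queued.
    print(max_fetch_rate(100))  # 20.0 fetch/s ceiling -> threads/bandwidth limit
    # End of the segment: only the half-dozen big sites remain.
    print(max_fetch_rate(6))    # 1.2 fetch/s ceiling -> the observed < 1 fetch/s

The same ceiling would explain the 1/x-shaped bandwidth curve: as the small
hosts finish, the number of distinct hosts shrinks and the per-host politeness
delay of the few big sites dominates, so raising the thread count past that
point buys nothing.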
