The logs show that my fetch queue is full and my 100 threads are mostly spin-waiting towards the end.
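For reference, the settings that govern this are in conf/nutch-site.xml; a minimal sketch, with property names as in the Nutch 1.x defaults and values only illustrative of my setup:

  <!-- Fetcher concurrency and politeness (Nutch 1.x property names; values illustrative). -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>100</value>
    <description>Total number of fetcher threads per fetch task.</description>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>1</value>
    <description>Concurrent requests allowed to a single host.</description>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
    <description>Seconds to wait between successive requests to the same host.</description>
  </property>

With only a handful of slow or capped hosts left near the end of a segment, the per-host limit and the server delay mean most of the 100 threads have nothing they are allowed to fetch, which would explain the spin-waiting.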
Now on the very last run (150k URLs) I can clearly see 4 phases:
+ very high speed: 3 MB/s for a few minutes
+ a sudden speed drop to around 1 MB/s, flat for several hours
+ another speed drop to around 400 kB/s for several hours
+ another speed drop to around 200 kB/s for a few hours, too.

So probably it is just a consequence of the URL mix, which isn't that good.
Note: I have limited it to 1000 URLs per host, and there are about 20-30 hosts in the mix which get limited that way. Maybe a better mix of URLs is possible?
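That cap is also just a property in conf/nutch-site.xml; a minimal sketch, again with the older Nutch 1.x property names (later releases replace generate.max.per.host with generate.max.count plus generate.count.mode), values mirroring the limit described above:

  <!-- Per-host cap applied when generating a fetch list (Nutch 1.x names; values illustrative). -->
  <property>
    <name>generate.max.per.host</name>
    <value>1000</value>
    <description>Maximum number of URLs per host in a single fetch list; -1 means unlimited.</description>
  </property>
  <property>
    <name>generate.max.per.host.by.ip</name>
    <value>false</value>
    <description>If true, apply the cap per resolved IP instead of per hostname.</description>
  </property>

It only applies at generate time, so changing it shapes the next segment rather than the one currently being fetched.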
2009/11/25 Julien Nioche <[email protected]>

> or it is stuck on a couple of hosts which time out? The logs should have a
> trace with the number of active threads, which should give some indication
> of what's happening.
>
> Julien
>
>
> 2009/11/25 Dennis Kubes <[email protected]>
>
> > If it is waiting and the box is idle, my first thought is not DNS. I just
> > put that up as one of the things people will run into. Most likely it is
> > an uneven distribution of URLs or something like that.
> >
> > Dennis
> >
> >
> > MilleBii wrote:
> >
> >> Get your point... Although I thought a high number of threads would do
> >> exactly the same. Maybe I am missing something.
> >>
> >> During my fetcher runs the used bandwidth gets low pretty quickly, disk
> >> I/O is low, the CPU is low... So it must be waiting for something, but
> >> what?
> >>
> >> It could be that the DNS cache is full and any new request gets
> >> forwarded to the master DNS of my ISP.
> >> Any idea how to check that? I'm not familiar with BIND myself... What is
> >> the typical rate you can get, how many DNS requests/s?
> >>
> >>
> >> 2009/11/25, Dennis Kubes <[email protected]>:
> >>
> >>> It is not about the local DNS caching as much as having local DNS
> >>> servers. Too many fetchers hitting a centralized DNS server can act as
> >>> a DOS attack and slow down the entire fetching system.
> >>>
> >>> For example, say I have a single centralized DNS server for my network,
> >>> and say I have 2 map tasks per machine, 50 machines, 20 threads per
> >>> task. That would be 50 * 2 * 20 = 2000 fetchers, meaning a possibility
> >>> of 2000 DNS requests / sec. Most local DNS servers for smaller networks
> >>> can't handle that. If everything is hitting a centralized DNS and that
> >>> DNS takes 1-3 sec per request because of too many requests, the entire
> >>> fetching system stalls.
> >>>
> >>> Hitting a secondary larger cache, such as OpenDNS, can have an effect
> >>> because you are making one hop to get the name versus multiple hops to
> >>> root servers then domain servers.
> >>>
> >>> Working off of a single server these issues don't show up as much
> >>> because there aren't enough fetchers.
> >>>
> >>> Dennis Kubes
> >>>
> >>> MilleBii wrote:
> >>>
> >>>> Why would DNS local caching work... It only works if you are going to
> >>>> crawl the same site often... in which case you are hit by the
> >>>> politeness delay.
> >>>>
> >>>> If you have segments with only/mainly different sites it is not really
> >>>> going to help.
> >>>>
> >>>> So far I have not seen my quad core + 100 Mb/s + pseudo-distributed
> >>>> Hadoop going faster than 10 fetches/s... Let me check the DNS and I
> >>>> will tell you.
> >>>>
> >>>> I vote for 100 fetches/s, not sure how to get it though.
> >>>>
> >>>>
> >>>> 2009/11/24, Dennis Kubes <[email protected]>:
> >>>>
> >>>>> Hi Mark,
> >>>>>
> >>>>> I just put this up on the wiki. Hope it helps:
> >>>>>
> >>>>> http://wiki.apache.org/nutch/OptimizingCrawls
> >>>>>
> >>>>> Dennis
> >>>>>
> >>>>>
> >>>>> Mark Kerzner wrote:
> >>>>>
> >>>>>> Hi, guys,
> >>>>>>
> >>>>>> my goal is to do my crawls at 100 fetches per second, observing, of
> >>>>>> course, polite crawling. But, when URLs are all different domains,
> >>>>>> what theoretically would stop some software from downloading from
> >>>>>> 100 domains at once, achieving the desired speed?
> >>>>>>
> >>>>>> But, whatever I do, I can't make Nutch crawl at that speed. Even if
> >>>>>> it starts at a few dozen URLs/second, it slows down at the end (as
> >>>>>> discussed by many and by Krugler).
> >>>>>>
> >>>>>> Should I write something of my own, or are there fast crawlers?
> >>>>>>
> >>>>>> Thanks!
> >>>>>>
> >>>>>> Mark
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com


--
-MilleBii-
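PS: on the question in the thread about how to check whether DNS is the limiting factor, here is a rough sketch using only the standard java.net API, meant to be run on the same box as the fetchers. The hostnames are placeholders (use a sample of hosts from the actual segment), and the JVM caches successful lookups, so only the first run against fresh hosts is meaningful.

  import java.net.InetAddress;
  import java.net.UnknownHostException;

  /** Rough measurement of DNS lookup throughput as seen from this machine/JVM. */
  public class DnsCheck {
    public static void main(String[] args) {
      // Placeholder hostnames; pass real hosts from the segment on the command line.
      String[] hosts = args.length > 0 ? args
          : new String[] { "example.org", "example.net", "example.com" };
      long start = System.nanoTime();
      int ok = 0;
      for (String host : hosts) {
        try {
          InetAddress.getByName(host); // forces a lookup (subject to JVM/OS caching)
          ok++;
        } catch (UnknownHostException e) {
          System.err.println("lookup failed: " + host);
        }
      }
      double secs = (System.nanoTime() - start) / 1e9;
      System.out.printf("%d/%d lookups in %.2f s (%.1f req/s)%n",
          ok, hosts.length, secs, secs > 0 ? ok / secs : 0.0);
    }
  }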
