Judging by how this discussion is going, there may be a need for a URL mix optimizer and for a fast crawler based on it. Is this something worth pursuing? MilleBii, what do you think?
Mark

On Wed, Nov 25, 2009 at 3:44 PM, MilleBii <[email protected]> wrote:

> The logs show that my fetch queue is full and my 100 threads are mostly
> spin-waiting towards the end.
>
> Now in the very last run (150k URLs) I can clearly see 4 phases:
> + very high speed: 3 MB/s for a few minutes
> + sudden speed drop to around 1 MB/s, flat for several hours
> + another speed drop to around 400 kB/s for several hours
> + another speed drop to around 200 kB/s for a few hours, too.
>
> So probably it is just a consequence of the URL mix, which isn't that good.
> Note: I have limited to 1000 URLs per host, and there are about 20-30 hosts
> in the mix which get limited that way.
>
> Maybe there is a better mix of URLs possible?
>
> 2009/11/25 Julien Nioche <[email protected]>
>
> > Or is it stuck on a couple of hosts which time out? The logs should have a
> > trace with the number of active threads, which should give some indication
> > of what's happening.
> >
> > Julien
> >
> > 2009/11/25 Dennis Kubes <[email protected]>
> >
> > > If it is waiting and the box is idle, my first thought is not DNS. I just
> > > put that up as one of the things people will run into. Most likely it is
> > > uneven distribution of URLs or something like that.
> > >
> > > Dennis
> > >
> > > MilleBii wrote:
> > >
> > >> Get your point... Although I thought a high number of threads would do
> > >> exactly the same. Maybe I am missing something.
> > >>
> > >> During my fetcher runs the used bandwidth gets low pretty quickly, disk
> > >> I/O is low, the CPU is low... So it must be waiting for something, but
> > >> what?
> > >>
> > >> It could be the DNS cache, which is full, so any new request gets
> > >> forwarded to the master DNS of my ISP.
> > >> Any idea how to check that? I'm not familiar with Bind myself... What
> > >> is the typical rate you can get -- how many DNS requests/s?
> > >>
> > >> 2009/11/25, Dennis Kubes <[email protected]>:
> > >>
> > >>> It is not about local DNS caching as much as having local DNS
> > >>> servers. Too many fetchers hitting a centralized DNS server can act as
> > >>> a DOS attack and slow down the entire fetching system.
> > >>>
> > >>> For example, say I have a single centralized DNS server for my network,
> > >>> and say I have 2 map tasks per machine, 50 machines, 20 threads per task.
> > >>> That would be 50 * 2 * 20 = 2000 fetchers, meaning a possibility of
> > >>> 2000 DNS requests/sec. Most local DNS servers for smaller networks
> > >>> can't handle that. If everything is hitting a centralized DNS and that
> > >>> DNS takes 1-3 sec per request because of too many requests, the entire
> > >>> fetching system stalls.
> > >>>
> > >>> Hitting a secondary larger cache, such as OpenDNS, can have an effect
> > >>> because you are making one hop to get the name versus multiple hops to
> > >>> root servers and then domain servers.
> > >>>
> > >>> Working off of a single server these issues don't show up as much
> > >>> because there aren't enough fetchers.
> > >>>
> > >>> Dennis Kubes
> > >>>
> > >>> MilleBii wrote:
> > >>>
> > >>>> Why would DNS local caching work... It only works if you are
> > >>>> going to crawl the same site often, in which case you are hit by
> > >>>> the politeness delay.
> > >>>>
> > >>>> If you have segments with only/mainly different sites it is not really
> > >>>> going to help.
> > >>>>
> > >>>> So far I have not seen my quad core + 100 Mb/s + pseudo-distributed
> > >>>> Hadoop going faster than 10 fetches/s... Let me check the DNS and I
> > >>>> will tell you.
> > >>>>
> > >>>> I vote for 100 fetches/s; not sure how to get it, though.
> > >>>>
> > >>>> 2009/11/24, Dennis Kubes <[email protected]>:
> > >>>>
> > >>>>> Hi Mark,
> > >>>>>
> > >>>>> I just put this up on the wiki.
> > >>>>> Hope it helps:
> > >>>>>
> > >>>>> http://wiki.apache.org/nutch/OptimizingCrawls
> > >>>>>
> > >>>>> Dennis
> > >>>>>
> > >>>>> Mark Kerzner wrote:
> > >>>>>
> > >>>>>> Hi, guys,
> > >>>>>>
> > >>>>>> My goal is to do my crawls at 100 fetches per second, observing, of
> > >>>>>> course, polite crawling. But when the URLs are all different domains,
> > >>>>>> what theoretically would stop some software from downloading from
> > >>>>>> 100 domains at once, achieving the desired speed?
> > >>>>>>
> > >>>>>> But whatever I do, I can't make Nutch crawl at that speed. Even if it
> > >>>>>> starts at a few dozen URLs/second, it slows down at the end (as
> > >>>>>> discussed by many, and by Krugler).
> > >>>>>>
> > >>>>>> Should I write something of my own, or are there fast crawlers?
> > >>>>>>
> > >>>>>> Thanks!
> > >>>>>>
> > >>>>>> Mark
> >
> > --
> > DigitalPebble Ltd
> > http://www.digitalpebble.com
>
> --
> -MilleBii-
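[Editor's note] The per-host cap and thread counts discussed in the thread map onto a handful of Nutch configuration properties. A minimal nutch-site.xml sketch, assuming Nutch 1.x property names (the values are illustrative only, and property names have changed between versions — check your nutch-default.xml):

```xml
<!-- nutch-site.xml: hypothetical values, for illustration only -->
<property>
  <name>generate.max.per.host</name>
  <value>1000</value>
  <description>Cap URLs per host per segment, so a few slow hosts
  don't dominate the tail of the fetch.</description>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>100</value>
  <description>Total number of fetcher threads.</description>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>1.0</value>
  <description>Politeness delay in seconds between successive
  requests to the same host.</description>
</property>
```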
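[Editor's note] Dennis's fetcher arithmetic, and MilleBii's question about how to measure DNS throughput, can be sketched in a few lines of Python. This is a rough standalone sketch against the system resolver, not part of Nutch; the hostnames and worker count you pass in are up to you:

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor

def fetcher_count(machines, map_tasks_per_machine, threads_per_task):
    """Dennis's estimate: total concurrent fetchers, and hence the
    worst-case DNS requests/sec that can hit a single resolver."""
    return machines * map_tasks_per_machine * threads_per_task

def dns_rate(hosts, workers=20):
    """Rough resolutions/sec against the system resolver.
    Failed lookups still count as attempts."""
    def resolve(host):
        try:
            socket.gethostbyname(host)
        except socket.gaierror:
            pass  # NXDOMAIN etc. -- we only care about throughput
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(resolve, hosts))
    return len(hosts) / (time.perf_counter() - start)

# 50 machines x 2 map tasks x 20 threads, as in Dennis's example
print(fetcher_count(50, 2, 20))  # → 2000
```

Feeding `dns_rate` a few hundred distinct hostnames from a recent segment gives a quick sanity check of whether the resolver can keep up with the fetcher count.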
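[Editor's note] Mark's reasoning — that 100 distinct domains should permit roughly 100 fetches/s despite per-host politeness — can be checked with a toy scheduler. A minimal sketch (not Nutch's actual fetch-queue code) that assigns each URL the earliest fetch time respecting a per-host delay:

```python
from collections import defaultdict

def schedule(urls, crawl_delay=1.0):
    """Assign each (host, url) pair the earliest fetch time that
    respects a per-host politeness delay of crawl_delay seconds.
    Returns (fetch_time, url) pairs sorted by time."""
    next_free = defaultdict(float)  # host -> earliest permitted time
    plan = []
    for host, url in urls:
        t = next_free[host]
        plan.append((t, url))
        next_free[host] = t + crawl_delay
    return sorted(plan)

# 100 distinct hosts, 3 URLs each: all 100 first requests are
# eligible at t=0, so the aggregate rate is ~hosts/delay.
urls = [(f"host{i}.example", f"http://host{i}.example/page{j}")
        for i in range(100) for j in range(3)]
plan = schedule(urls)
```

With 100 hosts and a 1 s delay, 100 URLs are eligible at t=0 and the 300-URL batch drains in about 2 s; with a single host the same batch would serialize to ~300 s. That serialization on the last few capped hosts is exactly the long tail MilleBii observes at the end of his fetch runs.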
