The logs show that my fetch queue is full and my 100 threads are mostly
spin-waiting towards the end.

In the very last run (150k URLs) I can clearly see 4 phases:
+ very high speed: 3 MB/s for a few minutes
+ sudden speed drop to around 1 MB/s, flat for several hours
+ another speed drop to around 400 kB/s for several hours
+ another speed drop to around 200 kB/s for a few hours too.

So probably it is just a consequence of the URL mix, which isn't that good.
Note: I have limited it to 1000 URLs per host, and there are about 20-30 hosts
in the mix which get limited that way.

Maybe a better mix of URLs is possible?
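The stepped slowdown is consistent with per-host politeness limits: once the queues drain down to the last few capped hosts, the overall rate is bounded by (active hosts) / (per-host delay), no matter how many threads are spin-waiting. A back-of-envelope sketch (the 5 s delay and the host counts are made-up illustrative values, not taken from my config):

```java
// Back-of-envelope: with per-host politeness, the fetch rate is capped by
// (active hosts) / (per-host delay), regardless of how many threads exist.
public class LongTail {
    // hypothetical politeness delay per host, in seconds (illustrative)
    static final double DELAY_S = 5.0;

    static double maxFetchesPerSec(int activeHosts) {
        return activeHosts / DELAY_S;
    }

    public static void main(String[] args) {
        // As queues drain to the 20-30 capped hosts, the ceiling steps down:
        for (int hosts : new int[] {100, 30, 10, 5}) {
            System.out.printf("%d hosts -> at most %.1f fetches/s%n",
                    hosts, maxFetchesPerSec(hosts));
        }
    }
}
```

So a phase where only 5 slow hosts remain caps out at 1 fetch/s under these assumed numbers, which would explain the flat low-speed tails.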

2009/11/25 Julien Nioche <[email protected]>

> Or is it stuck on a couple of hosts which time out? The logs should have a
> trace with the number of active threads, which should give some indication
> of what's happening.
>
> Julien
>
>
> 2009/11/25 Dennis Kubes <[email protected]>
>
> > If it is waiting and the box is idle, my first thought is not DNS.  I just
> > put that up as one of the things people will run into.  Most likely it is
> > an uneven distribution of URLs or something like that.
> >
> > Dennis
> >
> >
> > MilleBii wrote:
> >
> >> Get your point... Although I thought a high number of threads would do
> >> exactly the same. Maybe I'm missing something.
> >>
> >> During my fetcher runs the used bandwidth gets low pretty quickly, disk
> >> I/O is low, the CPU is low... So it must be waiting for something, but
> >> what?
> >>
> >> Could it be the DNS cache, which is full, so any new request gets
> >> forwarded to the master DNS of my ISP?
> >> Any idea how to check that? I'm not familiar with BIND myself... What
> >> is the typical rate you can get, how many DNS requests/s?
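One crude way to check resolver throughput from the same JVM is to time a batch of lookups. A sketch (it resolves localhost so it runs offline; swap in real hostnames from the crawl to exercise the actual resolver, and note the JVM caches successful lookups, so repeated lookups of one name mostly measure the cache, not BIND):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Crude DNS throughput probe: time n lookups of one host, report lookups/s.
// "localhost" is used so this runs offline; use real crawl hostnames to hit
// your actual resolver. The JVM caches positive results after the first call.
public class DnsProbe {

    static double lookupsPerSec(String host, int n) throws UnknownHostException {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            InetAddress.getByName(host);
        }
        double secs = (System.nanoTime() - start) / 1e9;
        return n / secs;
    }

    public static void main(String[] args) throws UnknownHostException {
        System.out.printf("~%.0f lookups/s%n", lookupsPerSec("localhost", 1000));
    }
}
```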
> >>
> >>
> >>
> >> 2009/11/25, Dennis Kubes <[email protected]>:
> >>
> >>> It is not about local DNS caching as much as having local DNS
> >>> servers.  Too many fetchers hitting a centralized DNS server can act as
> >>> a DoS attack and slow down the entire fetching system.
> >>>
> >>> For example, say I have a single centralized DNS server for my network,
> >>> and say I have 2 map tasks per machine, 50 machines, and 20 threads per
> >>> task.  That would be 50 * 2 * 20 = 2000 fetchers, meaning a possibility
> >>> of 2000 DNS requests/sec.  Most local DNS servers for smaller networks
> >>> can't handle that.  If everything is hitting a centralized DNS and that
> >>> DNS takes 1-3 sec per request because of too many requests, the entire
> >>> fetching system stalls.
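The arithmetic above, as a trivial sketch (the machine/task/thread counts are the example's hypothetical numbers, not a measured cluster):

```java
// Fetcher count and worst-case DNS load from the example above
// (50 machines, 2 map tasks each, 20 threads per task - illustrative values).
public class DnsLoad {
    static int fetchers(int machines, int tasksPerMachine, int threadsPerTask) {
        return machines * tasksPerMachine * threadsPerTask;
    }

    public static void main(String[] args) {
        int f = fetchers(50, 2, 20);
        // If every fetcher can issue one lookup per second, this is the
        // peak rate the single DNS server would have to absorb.
        System.out.println(f + " fetchers -> up to " + f + " DNS req/s");
    }
}
```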
> >>>
> >>> Hitting a secondary larger cache, such as OpenDNS, can have an effect
> >>> because you are making one hop to get the name versus multiple hops to
> >>> root servers then domain servers.
> >>>
> >>> Working off of a single server these issues don't show up as much
> >>> because there aren't enough fetchers.
> >>>
> >>> Dennis Kubes
> >>>
> >>> MilleBii wrote:
> >>>
> >>>> Why would local DNS caching work... It only works if you are
> >>>> going to crawl the same site often... in which case you are hit by
> >>>> the politeness delay.
> >>>>
> >>>> If you have segments with only/mainly different sites it is not really
> >>>> going to help.
> >>>>
> >>>> So far I have not seen my quad core + 100 Mb/s + pseudo-distributed
> >>>> Hadoop going faster than 10 fetches/s... Let me check the DNS and I
> >>>> will tell you.
> >>>>
> >>>> I vote for 100 fetches/s, not sure how to get it though.
> >>>>
> >>>>
> >>>>
> >>>> 2009/11/24, Dennis Kubes <[email protected]>:
> >>>>
> >>>>> Hi Mark,
> >>>>>
> >>>>> I just put this up on the wiki.  Hope it helps:
> >>>>>
> >>>>> http://wiki.apache.org/nutch/OptimizingCrawls
> >>>>>
> >>>>> Dennis
> >>>>>
> >>>>>
> >>>>> Mark Kerzner wrote:
> >>>>>
> >>>>>> Hi, guys,
> >>>>>>
> >>>>>> my goal is to do my crawls at 100 fetches per second, observing, of
> >>>>>> course, polite crawling. But when the URLs are all from different
> >>>>>> domains, what theoretically would stop some software from downloading
> >>>>>> from 100 domains at once, achieving the desired speed?
> >>>>>>
> >>>>>> But whatever I do, I can't make Nutch crawl at that speed. Even if it
> >>>>>> starts at a few dozen URLs/second, it slows down at the end (as
> >>>>>> discussed by many, and by Krugler).
> >>>>>>
> >>>>>> Should I write something of my own, or are there fast crawlers?
> >>>>>>
> >>>>>> Thanks!
> >>>>>>
> >>>>>> Mark
> >>>>>>
> >>>>>>
> >>
>
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>



-- 
-MilleBii-
