Judging by how this discussion goes, there may be a need for a URL mix
optimizer and for a fast crawler based on it. Is this something worth
pursuing? MilleBii, what do you think?

Mark

On Wed, Nov 25, 2009 at 3:44 PM, MilleBii <[email protected]> wrote:

> The logs show that my fetch queue is full and my 100 threads are mostly
> spin-waiting towards the end.
>
> Now in the very last run (150k URLs) I can clearly see 4 phases:
> + very high speed: 3 MB/s for a few minutes
> + sudden speed drop to around 1 MB/s, flat for several hours
> + another speed drop to around 400 kB/s for several hours
> + another speed drop to around 200 kB/s for a few hours too.
>
> So probably it is just a consequence of the URL mix, which isn't that good.
> Note: I have limited the crawl to 1000 URLs per host, and there are about
> 20-30 hosts in the mix which get limited that way.
>
> Maybe a better mix of URLs is possible?
>
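The four speed phases described above are consistent with per-host politeness becoming the bottleneck as the fetch queue drains. A back-of-envelope sketch (the delay and host counts below are illustrative assumptions, not figures from this thread):

```python
# Rough model: with per-host politeness, throughput is capped by the
# number of hosts still active in the fetch queue divided by the
# politeness delay, no matter how many fetcher threads are running.
def max_fetch_rate(active_hosts: int, politeness_delay_s: float) -> float:
    """Upper bound on fetches/sec when each host enforces the delay."""
    return active_hosts / politeness_delay_s

# Early in the run many hosts are active; at the tail only the 20-30
# capped hosts remain, so the rate collapses regardless of thread count.
print(max_fetch_rate(500, 1.0))  # 500.0 fetches/s possible early on
print(max_fetch_rate(25, 1.0))   # 25.0 fetches/s ceiling at the tail
```

This is why the slowdown shows up as distinct plateaus: each time a batch of hosts finishes its 1000-URL quota, the ceiling drops.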
> 2009/11/25 Julien Nioche <[email protected]>
>
> > Or is it stuck on a couple of hosts which time out? The logs should have
> > a trace with the number of active threads, which should give some
> > indication of what's happening.
> >
> > Julien
> >
> >
> > 2009/11/25 Dennis Kubes <[email protected]>
> >
> > > If it is waiting and the box is idle, my first thought is not DNS.  I
> > > just put that up as one of the things people will run into.  Most
> > > likely it is an uneven distribution of URLs or something like that.
> > >
> > > Dennis
> > >
> > >
> > > MilleBii wrote:
> > >
> > >> I get your point... although I thought a high number of threads would
> > >> do exactly the same. Maybe I am missing something.
> > >>
> > >> During my fetcher runs the used bandwidth gets low pretty quickly,
> > >> disk I/O is low, the CPU is low... So it must be waiting for
> > >> something, but what?
> > >>
> > >> It could be that the DNS cache is full and any new request gets
> > >> forwarded to my ISP's upstream DNS server. Any idea how to check that?
> > >> I'm not familiar with BIND myself... What is the typical rate you can
> > >> get, in DNS requests/s?
> > >>
> > >>
> > >>
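One crude way to answer the "how to check that" question above is to time raw lookups through the resolver the crawler actually uses. A minimal sketch using only the Python standard library (the hostnames passed in are placeholders; substitute names from your own crawl list):

```python
import socket
import time

def dns_lookup_rate(hostnames):
    """Time plain gethostbyname() calls through the system resolver
    and return successful lookups per second."""
    start = time.monotonic()
    ok = 0
    for name in hostnames:
        try:
            socket.gethostbyname(name)
            ok += 1
        except socket.gaierror:
            pass  # unresolvable names are skipped, not counted
    elapsed = time.monotonic() - start
    return ok / elapsed if elapsed > 0 else 0.0

# Compare repeated lookups of one name (should hit the cache) against
# a list of fresh names (forces upstream queries) to see where the
# bottleneck is.
rate = dns_lookup_rate(["localhost"] * 100)
```

If repeated lookups of the same name are fast but fresh names resolve at only a few per second, requests are falling through to the upstream server rather than being answered from a local cache.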
> > >> 2009/11/25, Dennis Kubes <[email protected]>:
> > >>
> > >>> It is not about local DNS caching as much as having local DNS
> > >>> servers.  Too many fetchers hitting a centralized DNS server can act
> > >>> as a DOS attack and slow down the entire fetching system.
> > >>>
> > >>> For example, say I have a single centralized DNS server for my
> > >>> network, and say I have 2 map tasks per machine, 50 machines, and 20
> > >>> threads per task.  That would be 50 * 2 * 20 = 2000 fetchers, meaning
> > >>> a possibility of 2000 DNS requests / sec.  Most local DNS servers for
> > >>> smaller networks can't handle that.  If everything is hitting a
> > >>> centralized DNS and that DNS takes 1-3 sec per request because of too
> > >>> many requests, the entire fetching system stalls.
> > >>>
> > >>> Hitting a secondary larger cache, such as OpenDNS, can have an
> > >>> effect because you are making one hop to get the name versus
> > >>> multiple hops to the root servers and then the domain servers.
> > >>>
> > >>> Working off of a single server these issues don't show up as much
> > >>> because there aren't enough fetchers.
> > >>>
> > >>> Dennis Kubes
> > >>>
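Dennis's arithmetic above, plus the stall it implies, as a quick sketch (the 1-3 s lookup latency is his estimate; the 0.5 s fetch time below is an assumption added for illustration):

```python
# Cluster-wide fetcher count from Dennis's example.
machines, map_tasks, threads = 50, 2, 20
fetchers = machines * map_tasks * threads
assert fetchers == 2000  # up to 2000 DNS requests/sec in the worst case

# If the overloaded server takes seconds per lookup, each thread spends
# most of its time blocked on name resolution instead of fetching pages.
def fetches_per_sec(threads: int, dns_latency_s: float,
                    fetch_time_s: float) -> float:
    """Throughput when every fetch is serialized behind a DNS lookup."""
    return threads / (dns_latency_s + fetch_time_s)

print(fetches_per_sec(2000, 2.0, 0.5))  # 800.0, vs. 4000.0 at 0.0s lookup
```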
> > >>> MilleBii wrote:
> > >>>
> > >>>> Why would local DNS caching work? It only works if you are going to
> > >>>> crawl the same site often... in which case you are hit by the
> > >>>> politeness delay.
> > >>>>
> > >>>> If you have segments with only/mainly different sites, it is not
> > >>>> really going to help.
> > >>>>
> > >>>> So far I have not seen my quad core + 100 Mb/s + pseudo-distributed
> > >>>> Hadoop setup going faster than 10 fetches/s... Let me check the DNS
> > >>>> and I will tell you.
> > >>>>
> > >>>> I vote for 100 fetches/s; not sure how to get there though.
> > >>>>
> > >>>>
> > >>>>
> > >>>> 2009/11/24, Dennis Kubes <[email protected]>:
> > >>>>
> > >>>>> Hi Mark,
> > >>>>>
> > >>>>> I just put this up on the wiki.  Hope it helps:
> > >>>>>
> > >>>>> http://wiki.apache.org/nutch/OptimizingCrawls
> > >>>>>
> > >>>>> Dennis
> > >>>>>
> > >>>>>
> > >>>>> Mark Kerzner wrote:
> > >>>>>
> > >>>>>> Hi, guys,
> > >>>>>>
> > >>>>>> my goal is to do my crawls at 100 fetches per second, observing,
> > >>>>>> of course, polite crawling. But when the URLs are all on different
> > >>>>>> domains, what theoretically would stop some software from
> > >>>>>> downloading from 100 domains at once, achieving the desired speed?
> > >>>>>>
> > >>>>>> But whatever I do, I can't make Nutch crawl at that speed. Even
> > >>>>>> if it starts at a few dozen URLs/second, it slows down at the end
> > >>>>>> (as discussed by many, and by Krugler).
> > >>>>>>
> > >>>>>> Should I write something of my own, or are there fast crawlers?
> > >>>>>>
> > >>>>>> Thanks!
> > >>>>>>
> > >>>>>> Mark
> > >>>>>>
> > >>>>>>
> > >>
> >
> >
> > --
> > DigitalPebble Ltd
> > http://www.digitalpebble.com
> >
>
>
>
> --
> -MilleBii-
>
