Get your point... Although I thought a high number of threads would do
exactly the same. Maybe I'm missing something.

During my fetcher runs the used bandwidth drops pretty quickly, disk
I/O is low, the CPU is low... So it must be waiting for something, but
what?

Could be the DNS cache, which is full, so any new request gets forwarded
to the master DNS of my ISP.
Any idea how to check that? I'm not familiar with BIND myself... What
is the typical rate you can get, in DNS requests/s?
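One quick way to get a feel for resolver latency (a rough sketch, not Nutch code; the class name and the sample host list are placeholders) is to time lookups through the JVM's resolver, which goes through the OS and any local cache:

```java
import java.net.InetAddress;

public class DnsTimer {
    // Time one resolution per host through the OS resolver (and any
    // local DNS cache) and return the average latency in milliseconds.
    static double averageLookupMs(String[] hosts) throws Exception {
        long start = System.nanoTime();
        for (String h : hosts) {
            InetAddress.getByName(h); // blocks until the name resolves
        }
        return (System.nanoTime() - start) / 1e6 / hosts.length;
    }

    public static void main(String[] args) throws Exception {
        // Replace with a sample of hosts from your own fetch list.
        String[] sample = { "localhost" };
        double ms = averageLookupMs(sample);
        System.out.printf("avg lookup: %.2f ms%n", ms);
    }
}
```

If uncached names average tens of milliseconds or more, the fetcher threads are likely spending most of their time in DNS; repeated lookups through a local caching resolver should come back well under a millisecond.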



2009/11/25, Dennis Kubes <[email protected]>:
> It is not about the local DNS caching as much as having local DNS
> servers.  Too many fetchers hitting a centralized DNS server can act as
> a DOS attack and slow down the entire fetching system.
>
> For example say I have a single centralized DNS server for my network.
> And say I have 2 map task per machine, 50 machines, 20 threads per task.
>   That would be 50 * 2 * 20 = 2000 fetchers, meaning a possibility of
> 2000 DNS requests / sec.  Most local DNS servers for smaller networks
> can't handle that.  If everything is hitting a centralized DNS, and that
> DNS takes 1-3 sec per request because of too many requests, the entire
> fetching system stalls.
>
> Hitting a secondary larger cache, such as OpenDNS, can have an effect
> because you are making one hop to get the name versus multiple hops to
> root servers then domain servers.
>
> Working off of a single server these issues don't show up as much
> because there aren't enough fetchers.
>
> Dennis Kubes
>
> MilleBii wrote:
>> Why would DNS local caching work... It only works if you are going
>> to crawl the same site often... in which case you are hit by the
>> politeness limit.
>>
>> If you have segments with only/mainly different sites, it is not really
>> going to help.
>>
>> So far I have not seen my quad core + 100 Mb/s + pseudo-distributed
>> Hadoop going faster than 10 fetches/s... Let me check the DNS and I
>> will tell you.
>>
>> I vote for 100 fetches/s, not sure how to get it though.
>>
>>
>>
>> 2009/11/24, Dennis Kubes <[email protected]>:
>>> Hi Mark,
>>>
>>> I just put this up on the wiki.  Hope it helps:
>>>
>>> http://wiki.apache.org/nutch/OptimizingCrawls
>>>
>>> Dennis
>>>
>>>
>>> Mark Kerzner wrote:
>>>> Hi, guys,
>>>>
>>>> my goal is to do my crawls at 100 fetches per second, observing, of
>>>> course, polite crawling. But when the URLs are all from different
>>>> domains, what theoretically would stop some software from downloading
>>>> from 100 domains at once, achieving the desired speed?
>>>>
>>>> But whatever I do, I can't make Nutch crawl at that speed. Even if it
>>>> starts at a few dozen URLs/second, it slows down at the end (as
>>>> discussed by many, and by Krugler).
>>>>
>>>> Should I write something of my own, or are there fast crawlers?
>>>>
>>>> Thanks!
>>>>
>>>> Mark
>>>>
>>
>

-- 
Sent from my mobile

-MilleBii-
