It is less about local DNS caching than about having local DNS servers. Too many fetchers hitting a centralized DNS server can act like a DoS attack and slow down the entire fetching system.

For example, say I have a single centralized DNS server for my network, with 50 machines, 2 map tasks per machine, and 20 threads per task. That would be 50 * 2 * 20 = 2000 fetchers, meaning a possible 2000 DNS requests/sec. Most local DNS servers for smaller networks can't handle that. If everything is hitting a centralized DNS server and that server takes 1-3 sec per request because it is overloaded, the entire fetching system stalls.
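The arithmetic above can be written out as a rough sketch. The function names and the 500 requests/sec server capacity are made up for illustration; only the 50 / 2 / 20 figures come from the example, and the worst case assumes every fetcher thread issues one DNS lookup in the same second.

```python
def concurrent_fetchers(machines, tasks_per_machine, threads_per_task):
    """Total fetcher threads across the cluster."""
    return machines * tasks_per_machine * threads_per_task


def seconds_to_drain(requests, server_capacity_per_sec):
    """Time a DNS server needs to answer a burst of requests at a given capacity."""
    return requests / server_capacity_per_sec


# Figures from the example: 50 machines, 2 map tasks each, 20 threads per task.
fetchers = concurrent_fetchers(50, 2, 20)

# Hypothetical capacity of a small-network DNS server (an assumption, not a measurement).
ASSUMED_CAPACITY_PER_SEC = 500

# Worst case: every fetcher asks for a name in the same second.
backlog_seconds = seconds_to_drain(fetchers, ASSUMED_CAPACITY_PER_SEC)

print(f"{fetchers} fetchers, up to {fetchers} DNS requests/sec")
print(f"one burst takes about {backlog_seconds:.1f} s to answer at {ASSUMED_CAPACITY_PER_SEC} req/s")
```

With a different assumed capacity the backlog changes proportionally; the point is only that the burst size grows with machines * tasks * threads.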

Hitting a secondary, larger cache such as OpenDNS can help because you are making one hop to get the name versus multiple hops to root servers and then domain servers.

When working off a single server these issues don't show up as much because there aren't enough fetchers.

Dennis Kubes

MilleBii wrote:
Why would local DNS caching work? It only works if you are
going to crawl the same site often, in which case you are limited by
politeness.

If you have segments with only or mainly different sites, it is not really
going to help.

So far I have not seen my quad core + 100mb/s + pseudo-distributed
Hadoop setup go faster than 10 fetches/s. Let me check the DNS and I
will tell you.

I vote for 100 fetches/s; not sure how to get it, though.



2009/11/24, Dennis Kubes <[email protected]>:
Hi Mark,

I just put this up on the wiki.  Hope it helps:

http://wiki.apache.org/nutch/OptimizingCrawls

Dennis


Mark Kerzner wrote:
Hi, guys,

my goal is to do my crawls at 100 fetches per second, observing, of course,
polite crawling. But when URLs are all from different domains, what
theoretically would stop some software from downloading from 100 domains at
once, achieving the desired speed?

But, whatever I do, I can't make Nutch crawl at that speed. Even if it
starts at a few dozen URLs/second, it slows down at the end (as discussed by
many and by Krugler).

Should I write something of my own, or are there fast crawlers?

Thanks!

Mark

