It is less about local DNS caching than about having local DNS
servers. Too many fetchers hitting a centralized DNS server can act
like a DoS attack and slow down the entire fetching system.
For example, say I have a single centralized DNS server for my network,
with 2 map tasks per machine, 50 machines, and 20 threads per task.
That would be 50 * 2 * 20 = 2000 fetchers, meaning a possibility of
2000 DNS requests/sec. Most local DNS servers for smaller networks
can't handle that. If everything is hitting a centralized DNS server
and it takes 1-3 sec per request because of too many requests, the
entire fetching system stalls.
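The arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope model: the function names are made up for illustration, and the DNS server capacity of 500 requests/sec is a hypothetical figure, not a measurement of any real server.

```python
def total_fetchers(machines, tasks_per_machine, threads_per_task):
    """Number of fetcher threads across the whole cluster."""
    return machines * tasks_per_machine * threads_per_task


def overload_ratio(peak_requests_per_sec, server_capacity_per_sec):
    """How many times the peak DNS demand exceeds what the server can answer."""
    return peak_requests_per_sec / server_capacity_per_sec


fetchers = total_fetchers(machines=50, tasks_per_machine=2, threads_per_task=20)
print(fetchers)  # 2000, i.e. up to 2000 DNS requests/sec in the worst case

# ASSUMPTION: a small-network DNS server that handles 500 requests/sec.
print(overload_ratio(fetchers, 500))  # 4.0
```

Any ratio above 1 means requests queue up at the DNS server, which is where the multi-second lookups come from.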
Hitting a larger secondary cache, such as OpenDNS, can help
because you are making one hop to get the name versus multiple hops to
the root servers and then the domain servers.
Working off of a single server, these issues don't show up as much
because there aren't enough fetchers.
Dennis Kubes
MilleBii wrote:
Why would local DNS caching work? It only works if you are
going to crawl the same site often, in which case you are limited by
the politeness rules.
If you have segments with only or mainly different sites, it is not really
going to help.
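The caching argument above can be illustrated with a rough sketch. It only counts how many lookups a per-host cache could answer; it does no real DNS resolution, and the function name and example.* URLs are invented for the illustration.

```python
from urllib.parse import urlsplit


def cache_hit_rate(urls):
    """Fraction of lookups a per-host cache could answer without asking DNS."""
    seen_hosts = set()
    hits = 0
    for url in urls:
        host = urlsplit(url).hostname
        if host in seen_hosts:
            hits += 1          # host already resolved earlier in the segment
        else:
            seen_hosts.add(host)  # first visit: a real lookup is needed
    return hits / len(urls) if urls else 0.0


# Hypothetical segments: 100 different sites vs. 100 pages of one site.
distinct_sites = [f"http://site{i}.example/" for i in range(100)]
same_site = [f"http://site0.example/page{i}" for i in range(100)]

print(cache_hit_rate(distinct_sites))  # 0.0
print(cache_hit_rate(same_site))       # 0.99
```

A segment of all-different hosts gets no benefit from the cache, while the single-site segment hits the cache almost every time but is the case where politeness delays dominate.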
So far I have not seen my quad core + 100mb/s + pseudo-distributed
Hadoop going faster than 10 fetches/s. Let me check the DNS and I
will tell you.
I vote for 100 fetches/s; not sure how to get it, though.
2009/11/24, Dennis Kubes wrote:
Hi Mark,
I just put this up on the wiki. Hope it helps:
http://wiki.apache.org/nutch/OptimizingCrawls
Dennis
Mark Kerzner wrote:
Hi, guys,
my goal is to do my crawls at 100 fetches per second, observing, of
course, polite crawling. But when the URLs are all from different domains,
what theoretically would stop some software from downloading from 100
domains at once, achieving the desired speed?
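As a rough illustration of the question above, the number of distinct hosts that must be in flight follows from the target rate and the per-host politeness delay. The helper name is made up, and the delay values of 1 and 5 seconds are assumptions chosen for the example, not settings taken from this thread.

```python
import math


def domains_needed(target_fetches_per_sec, per_host_delay_sec):
    """Minimum distinct hosts in flight if each host is hit at most once per delay."""
    return math.ceil(target_fetches_per_sec * per_host_delay_sec)


# ASSUMPTION: politeness means at most one request per host every N seconds.
print(domains_needed(100, 1.0))  # 100
print(domains_needed(100, 5.0))  # 500
```

So 100 domains at once only reaches 100 fetches per second if each host may be hit once per second; a longer delay needs proportionally more hosts.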
But whatever I do, I can't make Nutch crawl at that speed. Even if it
starts at a few dozen URLs/second, it slows down at the end (as discussed
by many and by Krugler).
Should I write something of my own, or are there fast crawlers?
Thanks!
Mark