If it is waiting and the box is idle, my first thought is not DNS. I
just put that up as one of the things people will run into. Most likely
it is an uneven distribution of URLs or something like that.
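One rough way to check for an uneven distribution is to count how many URLs in the fetch list belong to each host. This is only a sketch: it assumes you have already dumped the fetch list into a plain list of URL strings, and the helper names host_histogram and top_share are made up for illustration, not part of Nutch.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_histogram(urls):
    """Count how many URLs in a fetch list belong to each host."""
    counts = Counter()
    for url in urls:
        # urlsplit(...).hostname is already lower-cased, or None if missing
        host = urlsplit(url).hostname
        if host:
            counts[host] += 1
    return counts


def top_share(counts, n=1):
    """Fraction of all URLs owned by the n busiest hosts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for _, c in counts.most_common(n)) / total
```

If a handful of hosts own most of the list, the politeness delay on those hosts will leave most threads idle near the end of the run, whatever the DNS setup is.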
Dennis
MilleBii wrote:
I get your point, although I thought a high number of threads would do
exactly the same. Maybe I'm missing something.
During my fetcher runs the used bandwidth gets low pretty quickly, disk
I/O is low, the CPU is low... So it must be waiting for something, but
what?
It could be the DNS cache, which is full, so that any new request gets
forwarded to the master DNS of my ISP.
Any idea how to check that? I'm not familiar with BIND myself... What
is the typical rate you can get, i.e. how many DNS requests/s?
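A crude way to get a first number is to time a batch of lookups from the crawl machine itself. The sketch below is an assumption about how one might measure it, not a Nutch or BIND tool: time_lookups and lookups_per_second are hypothetical helpers, the lookups run one after another (so this shows per-lookup latency rather than the load 2000 threads would generate), and names already cached by the OS or the resolver will look much faster than cold ones.

```python
import socket
import time


def time_lookups(hostnames, resolve=socket.gethostbyname):
    """Resolve each hostname once; return (elapsed_seconds, failures)."""
    failures = 0
    start = time.perf_counter()
    for name in hostnames:
        try:
            resolve(name)
        except OSError:
            # socket.gaierror and socket.herror are subclasses of OSError
            failures += 1
    elapsed = time.perf_counter() - start
    return elapsed, failures


def lookups_per_second(count, elapsed):
    """Average lookup rate for a timed batch."""
    return count / elapsed if elapsed > 0 else float("inf")
```

Feeding it a few hundred distinct hostnames taken from a segment, once against the ISP resolver and once against a local caching server, should show whether resolution time is in the same range as the stalls.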
2009/11/25, Dennis Kubes:
It is not about the local DNS caching as much as having local DNS
servers. Too many fetchers hitting a centralized DNS server can act as
a DoS attack and slow down the entire fetching system.
For example, say I have a single centralized DNS server for my network,
and say I have 2 map tasks per machine, 50 machines, and 20 threads per
task. That would be 50 * 2 * 20 = 2000 fetchers, meaning a possibility of
2000 DNS requests/sec. Most local DNS servers for smaller networks
can't handle that. If everything is hitting a centralized DNS server and
that server takes 1-3 sec per request because of too many requests, the
entire fetching system stalls.
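The arithmetic above can be written out as a small sketch. The function names are made up for illustration, and the 0.5 sec download time used in the note below is an assumed figure, not a measurement. The second function assumes every fetch waits on one uncached lookup before downloading.

```python
def total_fetchers(machines, tasks_per_machine, threads_per_task):
    """Number of fetcher threads that can issue DNS requests at once."""
    return machines * tasks_per_machine * threads_per_task


def dns_wait_fraction(dns_seconds, fetch_seconds):
    """Share of a thread's time spent blocked on DNS.

    Assumes each fetch does one uncached lookup, then one download.
    """
    return dns_seconds / (dns_seconds + fetch_seconds)
```

With the numbers from the example, total_fetchers(50, 2, 20) gives 2000. If a lookup takes 2 sec and the download itself is assumed to take 0.5 sec, dns_wait_fraction(2.0, 0.5) says a thread spends about 80% of its time waiting on the DNS server.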
Hitting a secondary larger cache, such as OpenDNS, can have an effect
because you are making one hop to get the name versus multiple hops to
root servers then domain servers.
Working off of a single server these issues don't show up as much
because there aren't enough fetchers.
Dennis Kubes
MilleBii wrote:
Why would local DNS caching work? It only works if you are going to
crawl the same site often, in which case you are limited by
politeness.
If you have segments with only or mainly different sites, it is not
really going to help.
So far I have not seen my quad core + 100 Mb/s + pseudo-distributed
Hadoop going faster than 10 fetches/s... Let me check the DNS and I
will tell you.
I vote for 100 fetches/s, though I am not sure how to get there.
2009/11/24, Dennis Kubes:
Hi Mark,
I just put this up on the wiki. Hope it helps:
http://wiki.apache.org/nutch/OptimizingCrawls
Dennis
Mark Kerzner wrote:
Hi, guys,
my goal is to do my crawls at 100 fetches per second, observing, of
course, polite crawling. But when the URLs are all from different
domains, what theoretically would stop some software from downloading
from 100 domains at once, achieving the desired speed?
But whatever I do, I can't make Nutch crawl at that speed. Even if it
starts at a few dozen URLs/second, it slows down at the end (as
discussed by many and by Krugler).
Should I write something of my own, or are there fast crawlers?
Thanks!
Mark