I get your point... although I thought a high number of threads would do exactly the same. Maybe I'm missing something.
During my fetcher runs the bandwidth used drops pretty quickly, disk I/O is low, CPU is low... so it must be waiting for something, but what? It could be the DNS cache, which fills up so that any new request gets forwarded to my ISP's upstream DNS. Any idea how to check that? I'm not familiar with BIND myself... What is the typical rate you can get, in DNS requests/s?

2009/11/25, Dennis Kubes <[email protected]>:
> It is not about local DNS caching as much as having local DNS
> servers. Too many fetchers hitting a centralized DNS server can act as
> a DoS attack and slow down the entire fetching system.
>
> For example, say I have a single centralized DNS server for my network,
> and say I have 2 map tasks per machine, 50 machines, and 20 threads per
> task. That would be 50 * 2 * 20 = 2000 fetchers, meaning a possibility of
> 2000 DNS requests/sec. Most local DNS servers for smaller networks
> can't handle that. If everything is hitting a centralized DNS and that
> DNS takes 1-3 sec per request because of too many requests, the entire
> fetching system stalls.
>
> Hitting a secondary larger cache, such as OpenDNS, can help
> because you make one hop to get the name versus multiple hops to
> root servers and then domain servers.
>
> Working off a single server these issues don't show up as much
> because there aren't enough fetchers.
>
> Dennis Kubes
>
> MilleBii wrote:
>> Why would local DNS caching work... It only helps if you are
>> going to crawl the same site often, in which case you are hit by
>> politeness anyway.
>>
>> If you have segments with only/mainly different sites it is not really
>> going to help.
>>
>> So far I have not seen my quad core + 100 Mb/s + pseudo-distributed
>> Hadoop going faster than 10 fetches/s... Let me check the DNS and I
>> will tell you.
>>
>> I vote for 100 fetches/s, not sure how to get there though.
>>
>>
>>
>> 2009/11/24, Dennis Kubes <[email protected]>:
>>> Hi Mark,
>>>
>>> I just put this up on the wiki.
>>> Hope it helps:
>>>
>>> http://wiki.apache.org/nutch/OptimizingCrawls
>>>
>>> Dennis
>>>
>>>
>>> Mark Kerzner wrote:
>>>> Hi guys,
>>>>
>>>> My goal is to crawl at 100 fetches per second while observing, of
>>>> course, polite crawling. When the URLs are all on different domains,
>>>> what theoretically would stop some software from downloading from
>>>> 100 domains at once, achieving the desired speed?
>>>>
>>>> But whatever I do, I can't make Nutch crawl at that speed. Even if it
>>>> starts at a few dozen URLs/second, it slows down at the end (as
>>>> discussed by many, and by Krugler).
>>>>
>>>> Should I write something of my own, or are there fast crawlers?
>>>>
>>>> Thanks!
>>>>
>>>> Mark
>>>>
>>
-- 
Sent from my mobile
-MilleBii-
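[Editor's note: to answer the "any idea how to check that?" question above, here is a minimal sketch of how one might time DNS lookups from the fetcher machine. The host list is a placeholder; substitute hosts sampled from your own crawl segment. If lookups regularly take a second or more, the resolver is a plausible cause of the stall Dennis describes.]

```python
import socket
import time

# Placeholder hosts -- replace with domains from your crawl segment.
hosts = ["localhost", "example.com", "example.org"]

for host in hosts:
    start = time.monotonic()
    try:
        addr = socket.gethostbyname(host)
        status = addr
    except socket.gaierror:
        status = "lookup failed"
    elapsed = time.monotonic() - start
    # Resolutions taking >1 s suggest a saturated or distant resolver.
    print(f"{host}: {status} in {elapsed:.3f}s")
```

Running this across a few hundred hosts from a segment, and comparing against a second run (which should hit the cache), gives a rough picture of both resolver latency and cache effectiveness.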
