Hello Alexander - for Nutch and other JVM crawlers, crawl speed does not really 
matter: the JVM caches DNS lookups. The only thing we ever had to worry about 
(when crawling at large scale and high speed) was whether we'd overwhelm the 
DNS server in our local DC.
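
To illustrate the JVM-level caching mentioned above, here is a minimal Java 
sketch. It assumes the standard security properties networkaddress.cache.ttl 
and networkaddress.cache.negative.ttl; the class name, method names and the 
TTL values (300 and 10 seconds) are hypothetical and chosen only for 
illustration. The thread does not say which settings, if any, Nutch itself 
uses.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;

public class DnsCacheSketch {

    // Set the JVM-wide DNS cache TTLs, in seconds. This should run before
    // the first lookup, because the JVM may read these properties only once.
    static void configureDnsCache(int positiveTtlSeconds, int negativeTtlSeconds) {
        Security.setProperty("networkaddress.cache.ttl",
                Integer.toString(positiveTtlSeconds));
        Security.setProperty("networkaddress.cache.negative.ttl",
                Integer.toString(negativeTtlSeconds));
    }

    // Resolve a host name and return how long the call took, in nanoseconds.
    static long timedLookupNanos(String host) throws UnknownHostException {
        long start = System.nanoTime();
        InetAddress.getByName(host);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical values: keep successful answers for 5 minutes,
        // failed lookups for 10 seconds.
        configureDnsCache(300, 10);

        String host = args.length > 0 ? args[0] : "localhost";
        long first = timedLookupNanos(host);   // may go out to the resolver
        long second = timedLookupNanos(host);  // normally answered from the JVM cache
        System.out.println("first lookup:  " + first + " ns");
        System.out.println("second lookup: " + second + " ns");
    }
}
```

Run it with a host name as the only argument. The second lookup is normally 
served from the in-process cache, so it does not reach the DNS server at all, 
which is why a crawler that keeps returning to the same hosts puts less load 
on the resolver than its fetch rate suggests.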

In the case of Nutch, I would not worry about having a machine-local DNS 
cache. It eats memory and, in Nutch's case, won't have a very high hit rate. 
I'd prefer a close but central, powerful DNS server that can dedicate its 
memory to DNS.

Markus 
 
-----Original message-----
> From:Alexander Sibiryakov <sixty-...@yandex.ru>
> Sent: Tuesday 16th February 2016 13:57
> To: user@nutch.apache.org
> Subject: Re: DNS caching best practices
> 
> Otis, Markus,
> it depends on the speed at which you operate your crawler. If it’s 
> relatively slow, then using the ISP's general-purpose DNS for it is OK.
> 
> I think the information below could be useful, just to realize what kind of 
> problems we cause for internet infrastructure.
> 
> I was talking with one of the guys from https://selectel.ru/ (a huge cloud 
> and hosting provider) who is responsible for the DNS service, and he said 
> they built a dedicated DNS cache for various crawlers and bots, to help 
> preserve the cache in their main DNS server. Before that, during the night 
> (the crawlers' time!) the cache was changing significantly and causing 
> slowdowns for typical users the next day.
> 
> His recommendation was to use http://unbound.net/ as a local caching DNS 
> service, configured without an upstream, so that it resolves DNS recursively 
> on its own. It even provides a way to dump the cache to disk and load it 
> back.
> 
> Linux has no internal DNS cache, so a local cache makes sense if your 
> crawler makes repeated requests to the same website.
> 
> A.
> 
> > On 1 Feb 2016, at 11:18, Markus Jelsma <markus.jel...@openindex.io> 
> > wrote:
> > 
> > Otis - we tried local DNS caching when we did very large scale crawls but 
> > decided to get rid of it as soon as possible because it caused too much 
> > overhead. Instead, we relied on an apparently powerful DNS server made 
> > available by the ISP in the network center. If the server is fast and has 
> > a lot of RAM, the mapper won't quickly overwhelm it.
> > 
> > Markus
> > 
> > 
> > -----Original message-----
> >> From:Otis Gospodnetić <otis.gospodne...@gmail.com>
> >> Sent: Sunday 31st January 2016 23:36
> >> To: Nutch User List <nutch-u...@lucene.apache.org>
> >> Subject: DNS caching best practices
> >> 
> >> Hi,
> >> 
> >> The first item on http://wiki.apache.org/nutch/OptimizingCrawls is DNS
> >> caching.  Is this still something people regularly do?  Even when running
> >> in EC2, which I assume has nameservers that are relatively close to
> >> instances doing crawling and nameserver lookups?
> >> 
> >> If so, are there any recommendations for the best DNS caching server/config
> >> to use?
> >> 
> >> Thanks,
> >> Otis
> >> --
> >> Monitoring - Log Management - Alerting - Anomaly Detection
> >> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >> 
> 
> 
