Hi - you have to get rid of those URLs via URL filters. If you cannot filter them out, you can set the fetcher time limit (see nutch-default.xml) to cap how long the fetcher runs, or set the fetcher minimum throughput (also in nutch-default.xml). The latter aborts the fetcher if fewer than N pages per second are being fetched. The unfetched records will be fetched later on, together with other queues. A configuration sketch follows below the quoted message.

-----Original message-----
> From: manubharghav <manubharg...@gmail.com>
> Sent: Fri 14-Dec-2012 07:39
> To: user@nutch.apache.org
> Subject: identify domains from fetch lists taking lot of time.
>
> Hi,
>
> I initiated a crawl on 200 domains to a depth of 5 with a topN of 1
> million. A single domain extended my fetch time by a day because it kept
> generating outlinks to the same page with different URLs (the parameters
> change, but the content stays the same), e.g.
> http://www.awex.com.au/about-awex.html?s=___________. Is there any way
> to run the content dedup during the fetch itself, or are there other steps
> to avoid such cases? The problem is that as the size of the fetch list
> grows, the fetcher waits, say, 3 seconds between hits to the same server.
> This is causing the delay on the node and hence increasing the effective
> time of the crawl.
>
> Thanks in advance.
> Manu Reddy.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/identify-domains-from-fetch-lists-taking-lot-of-time-tp4026942.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
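
For reference, here is a minimal sketch of both options. The regex is only an illustration for the awex.com.au case from the thread, and the property names are the ones I see in nutch-default.xml of recent 1.x releases - please verify both against your own Nutch version and pick values that fit your crawl.

1) URL filter (conf/regex-urlfilter.txt) - rules are applied top-down, first match wins:

  # hypothetical rule: drop any awex.com.au URL that carries a query string
  -^http://www\.awex\.com\.au/.*\?
  # the stock filter file ships a broader rule like "-[?*!@=]" that drops all
  # URLs containing query characters; keep it enabled if you want this globally

2) Fetcher limits (conf/nutch-site.xml, overriding nutch-default.xml) - example values only:

  <property>
    <name>fetcher.timelimit.mins</name>
    <value>180</value>
    <description>Abort the fetch after 180 minutes; -1 (the default) disables the limit.</description>
  </property>
  <property>
    <name>fetcher.throughput.threshold.pages</name>
    <value>10</value>
    <description>Abort the fetch once throughput drops below 10 pages/second; -1 disables the check.</description>
  </property>
  <property>
    <name>fetcher.throughput.threshold.check.after</name>
    <value>5</value>
    <description>Number of minutes to fetch before the throughput threshold is enforced.</description>
  </property>

Either way the abort only ends the current fetch cycle: the unfetched records stay in the CrawlDb and are generated again into a later segment.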