Hi - you have to get rid of those URLs via URL filters. If you cannot filter 
them out, you can set the fetcher time limit (fetcher.timelimit.mins, see 
nutch-default.xml) to cap how long the fetcher runs, or set the fetcher 
minimum throughput (fetcher.throughput.threshold.pages, also in 
nutch-default.xml). The latter aborts the fetcher if fewer than N 
pages/second are fetched. The unfetched records will be fetched later on 
together with other queues. 
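
As a sketch (the values here are just examples, tune them to your crawl): a 
line like -^http://www\.awex\.com\.au/about-awex\.html\? in 
conf/regex-urlfilter.txt would drop the parameterised duplicates from that 
host, and the two abort properties can be overridden in nutch-site.xml:

```xml
<!-- nutch-site.xml fragment: abort long-running or stalled fetch rounds.
     Both properties default to -1 (disabled) in nutch-default.xml. -->
<property>
  <name>fetcher.timelimit.mins</name>
  <value>180</value>
  <description>Abort the fetch after 3 hours; the remaining records
  stay unfetched and are regenerated in later segments.</description>
</property>
<property>
  <name>fetcher.throughput.threshold.pages</name>
  <value>1</value>
  <description>Abort the fetch when throughput drops below
  1 page/second.</description>
</property>
```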
 
-----Original message-----
> From:manubharghav <manubharg...@gmail.com>
> Sent: Fri 14-Dec-2012 07:39
> To: user@nutch.apache.org
> Subject: identify domains from fetch lists taking lot of time.
> 
> Hi,
> 
> I initiated a crawl on 200 domains to a depth of 5 with a topN of 1
> million. A single domain extended my fetch time by a day, as it kept
> generating outlinks to the same page under different URLs (the parameters
> change, but the content remains the same), e.g.
> http://www.awex.com.au/about-awex.html?s=___________. So is there any way
> to run content dedup during fetching itself, or are there other steps to
> avoid such cases? The problem is that as the size of the fetch list
> grows, the fetcher waits, say, 3 seconds between hits to the same server.
> This stalls the node and hence extends the effective time of the crawl.
> 
> 
> Thanks in advance.
> Manu Reddy.
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/identify-domains-from-fetch-lists-taking-lot-of-time-tp4026942.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 
