On Thu, Jul 16, 2009 at 16:11, Hrishikesh Agashe wrote:
> Hi,
>
> We are running Nutch 1.0 on a 4-VM Hadoop cluster (each VM: 2 CPU, quad
> core, 3 GB RAM) located in a data center, with data storage on a common NAS.
> The bandwidth available to us is 150 Mb/sec. A theoretical calculation tells
> me that I can download 1.62 TB of data per day:
> (150 * 60 * 60 * 24) / 8 = 1,620,000 MB.
>
> Now my aim is to tune Nutch to get as close as possible to this figure.
> I played a lot with different Nutch params (num of maps=17, num of reduces=7,
> num of threads=800, fairness=3 sec, etc.), but the max I could get is
> 3 GB/hour, which is 72 GB per day, far less than 1.62 TB. I have set filters
> to download only HTML text: no images, no videos, etc.
>
> So I wanted to know which params contribute to the speed of Nutch data
> download.
> Am I missing something obvious? Are there too few machines? Is the
> hardware configuration not powerful enough?
>

What are you downloading? Remember that Nutch waits between successive
requests to the same host, so you may simply be running out of hosts to
fetch from, in which case the fetcher just sits idle.

However, several people have suggested that the Fetcher in Nutch 1.0 _is_
slower:

https://issues.apache.org/jira/browse/NUTCH-721

My recommendation would be to try the OldFetcher class in trunk and see if
it makes a difference.

> TIA,
> -Hrishi
>

-- 
Doğacan Güney
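
For a rough sense of the two limits discussed in this thread, here is a back-of-the-envelope sketch. It reproduces the link-capacity arithmetic from the question and compares it with the ceiling imposed by the per-host wait. The function names, the 20 KB average page size and the host counts are assumptions made for illustration only; they are not Nutch settings or measurements, and the model ignores request latency and the non-fetch phases of a crawl cycle.

```python
# Back-of-the-envelope fetch ceilings. All inputs below are illustrative
# assumptions, not values read from a Nutch configuration.

SECONDS_PER_DAY = 60 * 60 * 24


def link_ceiling_mb_per_day(bandwidth_mbit_per_s: float) -> float:
    """Megabytes per day the link can carry at full, sustained utilisation."""
    return bandwidth_mbit_per_s * SECONDS_PER_DAY / 8  # 8 bits per byte


def politeness_ceiling_mb_per_day(hosts: int, delay_s: float, page_kb: float) -> float:
    """Upper bound when each host is hit at most once every delay_s seconds."""
    pages_per_s = hosts / delay_s
    return pages_per_s * page_kb / 1000 * SECONDS_PER_DAY


def hosts_needed(bandwidth_mbit_per_s: float, delay_s: float, page_kb: float) -> float:
    """Distinct hosts that must be fetched concurrently to fill the link."""
    bytes_per_s = bandwidth_mbit_per_s * 1_000_000 / 8
    return bytes_per_s * delay_s / (page_kb * 1000)


if __name__ == "__main__":
    # 150 Mb/sec, as in the question: 1620000.0 MB, i.e. 1.62 TB per day.
    print(link_ceiling_mb_per_day(150))
    # Hypothetical crawl: 1000 distinct hosts, 3 s delay, 20 KB average page.
    print(politeness_ceiling_mb_per_day(1000, 3.0, 20.0))
    # Hosts required to saturate the link under the same assumptions.
    print(hosts_needed(150, 3.0, 20.0))
```

Under this simplified model, the number of distinct hosts available per fetch cycle bounds throughput no matter how many threads are configured, which is one way to read the "running out of hosts" remark above.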
