Hi, We are running Nutch 1.0 on a 4-VM Hadoop cluster (each VM: 2 CPUs, quad core, 3 GB RAM) located in a data center, with data storage on a common NAS. The bandwidth available to us is 150 Mb/sec. A theoretical calculation tells me that I can download 1.62 TB of data per day: (150 * 60 * 60 * 24) / 8 = 1,620,000 MB.
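As a quick sanity check of that arithmetic, here is a minimal Python sketch. It assumes decimal units (1 TB = 1,000,000 MB) and a fully saturated link with no protocol overhead, so the result is only an upper bound; the function name is just illustrative.

```python
def daily_capacity_mb(link_mbit_per_sec: float) -> float:
    """Upper bound on data transferred in one day, in megabytes.

    Assumes the link is saturated for the whole day and ignores
    protocol overhead (an idealisation, not a measurement).
    """
    seconds_per_day = 60 * 60 * 24
    bits_per_byte = 8
    return link_mbit_per_sec * seconds_per_day / bits_per_byte


capacity_mb = daily_capacity_mb(150)
# Decimal units assumed: 1 TB = 1,000,000 MB
print(f"{capacity_mb:,.0f} MB/day = {capacity_mb / 1_000_000:.2f} TB/day")
```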
Now my aim is to tune Nutch to get as close as possible to this figure. I have experimented a lot with different Nutch parameters (number of maps = 17, number of reduces = 7, number of threads = 800, fairness = 3 sec, etc.), but the maximum I could get is 3 GB/hour, which is 72 GB per day and far less than 1.62 TB. I have set filters to download only HTML text: no images, no videos, etc. So I wanted to know which parameters contribute to the speed of Nutch data download. Am I missing something very obvious? Are there too few machines? Is the hardware configuration not powerful enough?

TIA,
-Hrishi
