Hi,

We are running Nutch 1.0 on a 4-VM Hadoop cluster (each VM: 2 CPUs, quad-core,
3 GB RAM) located in a data center, with data storage on a common NAS. The
bandwidth available to us is 150 Mb/sec. A theoretical calculation tells me
that I can download 1.62 TB of data per day: (150 * 60 * 60 * 24) / 8 =
1,620,000 MB.
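
For reference, here is that calculation as a small Python sketch. It assumes
decimal units (8 Mb = 1 MB, 1,000,000 MB = 1 TB) and ignores protocol
overhead; the function and constant names are only illustrative.

```python
# Theoretical daily download ceiling for a link of a given speed.
# Assumptions: decimal units (8 Mb = 1 MB, 1,000,000 MB = 1 TB),
# the link is saturated for the whole day, no protocol overhead.
SECONDS_PER_DAY = 60 * 60 * 24


def max_mb_per_day(link_mbit_per_sec):
    """Return the maximum number of megabytes transferable in one day."""
    return link_mbit_per_sec * SECONDS_PER_DAY / 8


ceiling_mb = max_mb_per_day(150)
ceiling_tb = ceiling_mb / 1_000_000

print(f"{ceiling_mb:,.0f} MB/day = {ceiling_tb:.2f} TB/day")
```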

Now my aim is to tune Nutch to get as close as possible to this figure.
I have experimented a lot with different Nutch parameters (number of maps =
17, number of reduces = 7, number of threads = 800, fairness = 3 sec, etc.),
but the most I could get is 3 GB/hour, which is 72 GB per day, far less than
1.62 TB. I have set filters to download only HTML text: no images, no videos,
etc.
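
To put the gap in numbers, here is a rough Python sketch that compares the
observed rate with the link capacity. It assumes decimal units (1 GB = 1000
MB, 1 MB = 8 Mb) and that the 3 GB/hour rate is sustained for 24 hours; the
variable names are only illustrative.

```python
# Compare the observed crawl rate with the theoretical link capacity.
# Assumptions: decimal units (1 GB = 1000 MB, 1 MB = 8 Mb) and a
# constant 3 GB/hour fetch rate over the whole day.
LINK_MBIT_PER_SEC = 150
OBSERVED_GB_PER_HOUR = 3

observed_gb_per_day = OBSERVED_GB_PER_HOUR * 24
ceiling_gb_per_day = LINK_MBIT_PER_SEC * 60 * 60 * 24 / 8 / 1000

# Average bandwidth actually used, in megabits per second.
observed_mbit_per_sec = OBSERVED_GB_PER_HOUR * 1000 * 8 / 3600

# Fraction of the link capacity that the crawl is using.
utilization = observed_gb_per_day / ceiling_gb_per_day

print(f"observed: {observed_gb_per_day} GB/day "
      f"(~{observed_mbit_per_sec:.2f} Mb/sec)")
print(f"ceiling:  {ceiling_gb_per_day:.0f} GB/day")
print(f"link utilization: {utilization:.1%}")
```

In other words, the crawl is averaging under 7 Mb/sec of the 150 Mb/sec
available, i.e. less than 5% of the link.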

So I wanted to know which parameters contribute to the speed of Nutch data
download.
Am I missing something very obvious? Is the number of machines too small? Is
the hardware configuration not powerful enough?

TIA,
-Hrishi
