I am using Nutch to crawl about 10M pages, and it looks like the machines
in my cluster are heavily under-utilized. I wanted to know what a good
configuration would be to improve performance.

I have set up my cluster on Amazon EC2, and currently I have 3 EC2
instances of type High-CPU Medium (dual-core, with about 1.7 GB RAM
each) running CentOS. I have configured the mapred-default.xml file to
define 30 map tasks and 7 reduce tasks. For Nutch fetch jobs, I have
set generate.max.per.host to 50 and the number of fetcher threads to
about 300. However, I noticed that whenever I run a Nutch job, it runs
only 1 reduce task. Also, for jobs like fetch, it starts only 2 map
tasks, one on each of two slaves, and a reduce task on the third
machine. As a result, CPU and memory utilization on each of the boxes
is very low; is that normal? A fetch job of 300K records takes over
10 hours to finish, and sometimes jobs like mergesegs run on only one
of the boxes and fail to finish with an OutOfMemory exception.
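
In case it helps, the relevant pieces of my configuration look roughly
like this (I'm writing the property names from memory, so they may not
match my files exactly; the task counts are in mapred-default.xml and
the fetch settings are in my Nutch config):

  <!-- mapred-default.xml: overall task counts -->
  <property>
    <name>mapred.map.tasks</name>
    <value>30</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>7</value>
  </property>

  <!-- Nutch config: fetch settings -->
  <property>
    <name>generate.max.per.host</name>
    <value>50</value>
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>300</value>
  </property>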

What would be a good map/reduce task configuration for these boxes? Or
does Nutch override the configuration defined in the mapred-default
file and set its own number of tasks for each job? Is the above set of
machines enough for me to crawl up to 10M pages?
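
For example, would something along these lines be more reasonable for
dual-core boxes with 1.7 GB of RAM? This is just a guess on my part
(I have not tried these values yet):

  <!-- per-node task limits I was considering; values are guesses -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
  </property>
  <!-- larger child heap in the hope of avoiding the OutOfMemory
       errors; also a guess -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>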


Cheers,
