I am using Nutch to crawl about 10M pages, and it looks like the machines in my cluster are heavily under-utilized. I wanted to know what a good configuration would be to improve performance.
I have set up my cluster on Amazon EC2, and currently I have three EC2 High-CPU Medium instances (two cores and about 1.7 GB of RAM each) running CentOS. In mapred-default.xml I have defined 30 map tasks and 7 reduce tasks, and for the Nutch fetch jobs I have set generate.max.per.host to 50 and the number of fetcher threads to about 300 (the relevant entries are pasted below for reference).

However, I have noticed that whenever I run a Nutch job, it launches only one reduce task. For jobs like fetch, it also starts only two map tasks, one on each of two slaves, with the single reduce task on the third machine. As a result, CPU and memory utilization on each box is very low. Is that normal? A fetch job over 300K records takes more than 10 hours to finish, and sometimes jobs like mergesegs run on only one of the boxes and fail with an OutOfMemory exception.

What would be a good configuration of map and reduce tasks for these boxes? Or does Nutch override the configuration defined in mapred-default.xml and set its own number of tasks for each job? Is the above set of machines enough to crawl up to 10M pages?

Cheers,
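
For reference, the relevant entries look roughly like this (paraphrased from memory, so treat the exact property names and file placement, e.g. nutch-site.xml for the Nutch side, as approximate):

<!-- mapred-default.xml: per-job task count hints -->
<property>
  <name>mapred.map.tasks</name>
  <value>30</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>7</value>
</property>

<!-- nutch-site.xml: generate/fetch settings -->
<property>
  <name>generate.max.per.host</name>
  <value>50</value>  <!-- max URLs per host in a generated fetch list -->
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>300</value> <!-- number of fetcher threads -->
</property>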
