Maybe more map tasks, and a higher replication factor? Also, what is the network speed between the three instances, and what internet bandwidth do they have?
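
For concreteness, something along these lines in conf/hadoop-site.xml (which, if I remember right, is where site overrides belong in the Hadoop bundled with Nutch, rather than mapred-default.xml). The values below are only guesses for three dual-core, 1.7 GB nodes and would need tuning:

<!-- hadoop-site.xml: per-node task slots plus job-level defaults;
     values are guesses for three dual-core, 1.7 GB nodes -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>  <!-- concurrent map tasks per TaskTracker -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>  <!-- concurrent reduce tasks per TaskTracker -->
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>6</value>  <!-- default maps per job (a hint, not a hard limit) -->
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>  <!-- default reduces per job; the stock default of 1
                         would explain seeing a single reduce -->
</property>
<property>
  <name>dfs.replication</name>
  <value>2</value>  <!-- HDFS block replication across the 3 nodes -->
</property>

With 2 reduce slots per node across three nodes you have 6 slots total, so 4-6 reduces per job should keep all the boxes busy.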
On Fri, Apr 3, 2009 at 6:39 AM, DS jha <[email protected]> wrote:
> I am using Nutch to crawl about 10M pages, and it looks like the machines
> in my cluster are heavily under-utilized – I wanted to know what
> configuration I could use to improve performance.
>
> I have set up my cluster on Amazon EC2, and currently I have 3 EC2
> instances of type High-CPU Medium (dual processors with about 1.7 GB
> RAM each) running CentOS. I have configured the mapred-default.xml
> file to define 30 map tasks and 7 reduce tasks. For Nutch fetch jobs,
> I have set generate.max.per.host to 50 and the number of fetcher threads
> to about 300. However, I noticed that whenever I run any Nutch job,
> it runs only 1 reduce task. Also, for tasks like fetch, it starts only
> 2 map tasks, one each on two slaves, and a reduce task on the third
> machine. As a result, CPU and memory utilization on each of the boxes is
> very low – is that normal? Fetch jobs of 300K records take
> over 10 hours to finish, and sometimes jobs like mergesegs run on
> only one of the boxes and fail to finish because of an OutOfMemory
> exception.
>
> What would be a good configuration of map/reduce tasks for
> these boxes? Or does Nutch override the configuration defined in the
> mapred-default file and set its own number of tasks for each job? Is
> the above set of machines enough to crawl up to 10M
> pages?
>
>
> Cheers,
>
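
Regarding the Nutch side of the quoted setup: if I recall correctly, the number of fetch map tasks is fixed when the segment is generated (one fetch list per map task, controlled by the -numFetchers option to the generate command), so seeing only 2 maps at fetch time suggests only 2 fetch lists were created. The thread settings would go in conf/nutch-site.xml; a sketch with illustrative values, not recommendations:

<!-- nutch-site.xml: fetch parallelism and politeness; values are
     illustrative only -->
<property>
  <name>generate.max.per.host</name>
  <value>50</value>   <!-- cap on URLs per host in each fetch list -->
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>100</value>  <!-- fetcher threads per fetch map task -->
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>2</value>    <!-- concurrent requests to any single host -->
</property>

Then generate with something like (paths here are just placeholders):

  bin/nutch generate crawl/crawldb crawl/segments -topN 300000 -numFetchers 6

so the fetch job gets 6 map tasks, two per node, instead of 2.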
