Where do you set your task properties (mapred.map.tasks and mapred.reduce.tasks)? I had these set in mapred-default.xml, but that didn't seem to have any impact on Nutch jobs. When I set these properties in hadoop-site.xml or nutch-site.xml, Nutch jobs do pick them up. Where should these config entries be defined?
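For reference, a minimal sketch of what these entries look like in hadoop-site.xml (which overrides hadoop-default.xml; nutch-site.xml is layered on top of that). The values shown are just the ones discussed in this thread, not recommendations:

    <!-- hadoop-site.xml: site-specific overrides read by Hadoop and Nutch -->
    <configuration>
      <property>
        <name>mapred.map.tasks</name>
        <value>30</value>
        <description>Hint for the number of map tasks per job.</description>
      </property>
      <property>
        <name>mapred.reduce.tasks</name>
        <value>7</value>
        <description>Number of reduce tasks per job.</description>
      </property>
    </configuration>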
Thank you,

On Thu, Apr 2, 2009 at 9:54 PM, Jack Yu <[email protected]> wrote:
> On my 3 local machines, with a setting of 36 maps (should be more) and
> 16 reduces, there are 2 map tasks and 2 reduce tasks running on each
> machine.
>
>
> On Fri, Apr 3, 2009 at 6:39 AM, DS jha <[email protected]> wrote:
>
>> I am using Nutch to crawl about 10M pages, and it looks like the machines
>> in my cluster are heavily under-utilized – I wanted to know what would
>> be a good configuration to improve performance.
>>
>> I have set up my cluster on Amazon EC2, and currently I have 3 EC2
>> instances of type High-CPU Medium (dual processors with about 1.7 GB
>> RAM each) running CentOS. I have configured the mapred-default.xml
>> file to define 30 map tasks and 7 reduce tasks. For Nutch fetch jobs,
>> I have set generate.max.per.host to 50 and the number of fetcher threads
>> to about 300. However, I noticed that whenever I run any Nutch job,
>> it runs only 1 reduce task. Also, for jobs like fetch, it starts only
>> 2 map tasks, one each on two slaves, and a reduce task on the third
>> machine. As a result, CPU and memory utilization on each of the boxes is
>> very low – is that normal? Fetch jobs of 300K records take
>> over 10 hrs to finish, and sometimes jobs like mergesegs run on
>> only one of the boxes and fail to finish because of an OutOfMemory
>> exception.
>>
>> What would be a good configuration of map/reduce tasks for
>> these boxes? Or does Nutch override the configuration defined in
>> mapred-default.xml and set its own number of tasks for each job? Is the
>> above set of machines enough to crawl up to 10M
>> pages?
>>
>>
>> Cheers,
>>
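For context, the fetch settings described above would look roughly like this in nutch-site.xml. This is a sketch under the assumption that fetcher.threads.fetch is the property behind "number of fetcher threads"; generate.max.per.host is the standard Nutch generator property:

    <!-- nutch-site.xml: sketch of the fetch settings mentioned above -->
    <configuration>
      <property>
        <name>generate.max.per.host</name>
        <value>50</value>
        <description>Max URLs per host in a single fetch list.</description>
      </property>
      <property>
        <name>fetcher.threads.fetch</name>
        <value>300</value>
        <description>Assumed property name for the fetcher thread count.</description>
      </property>
    </configuration>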
