Re: nutch/hadoop performance and optimal configuration

DS jha Fri, 03 Apr 2009 08:05:31 -0700

Replication is set to 2, and network speed seems pretty decent. My
main concern is the number of map/reduce tasks that get generated for
Nutch jobs. In my mapred-default file, I have set 30 map tasks and 7
reduce tasks, and I have tried several other values as well (high and
low), but it seems that Nutch jobs override those values. For most
Nutch jobs, including parse, fetch, updatedb, etc., it always creates
only 1 reduce task. On the jobtracker, the job configuration file also
has mapred.reduce.tasks set to 1. For fetch jobs, even though there
are 3 slaves, it creates only 2 map tasks and 1 reduce task (one on
each of the slaves). Am I missing a config step or parameter entry?
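
For reference, the settings described above correspond to entries
roughly like the following. This is only a sketch: mapred.reduce.tasks
is the property name shown in the job configuration on the jobtracker,
mapred.map.tasks is assumed to be its map-side counterpart, and the
values are the ones mentioned above. Which file Hadoop actually reads
these from depends on the Hadoop version bundled with Nutch, so the
file placement is an assumption as well.

```xml
<!-- Sketch of the entries described above; values taken from this message. -->
<configuration>
  <property>
    <!-- Requested number of map tasks per job (assumed property name). -->
    <name>mapred.map.tasks</name>
    <value>30</value>
  </property>
  <property>
    <!-- Requested number of reduce tasks per job; the jobtracker shows 1 instead. -->
    <name>mapred.reduce.tasks</name>
    <value>7</value>
  </property>
</configuration>
```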




Cheers,


On Thu, Apr 2, 2009 at 9:45 PM, Jack Yu <[email protected]> wrote:
> maybe more map,replication?
> what is the network speed between 3?
> what is the internet speed they have?
>
> On Fri, Apr 3, 2009 at 6:39 AM, DS jha <[email protected]> wrote:
>
>> I am using Nutch to crawl about 10M pages and it looks like the machines
>> on my cluster are heavily under-utilized – I wanted to know what would
>> be a good configuration to improve performance.
>>
>> I have set up my cluster on Amazon EC2 and currently I have 3 EC2
>> instances of type High-CPU Medium (dual processors with about 1.7G
>> RAM each) running on CentOS. I have configured the mapred-default.xml
>> file to define 30 map tasks and 7 reduce tasks. For nutch fetch jobs,
>> I have set generate.max.per.host to 50 and the number of fetcher
>> threads to about 300. However, I noticed that whenever I run any nutch
>> job, it runs only 1 reduce task. Also, for tasks like fetch, it starts
>> only 2 map tasks, one each on two slaves, and a reduce task on the
>> third machine. As a result, CPU and memory utilization on each of the
>> boxes is very low – is that normal? Fetch jobs of 300K records take
>> over 10 hrs to finish, and sometimes jobs like mergesegs run on
>> only one of the boxes and fail to finish because of an OutOfMemory
>> exception.
>>
>> What would be a good configuration of map/reduce tasks for
>> these boxes? Or does nutch override the configuration defined in the
>> mapred-default file and set its own number of tasks for each job? Is
>> the above set of machines enough for me to be able to crawl up to 10M
>> pages?
>>
>>
>> Cheers,
>>
>
