About that....

I have a question, please: why do I need to set the number of map and
reduce tasks to a prime number in hadoop-site.xml? Is that necessary,
or is it only for performance? Do you know of any link that explains it?
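
For example, the kind of advice I am asking about looks like this in
hadoop-site.xml -- a sketch only; the prime values 37 and 11 below are
just illustrative, not a recommendation:

  <property>
    <name>mapred.map.tasks</name>
    <value>37</value>  <!-- illustrative prime value -->
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>11</value>  <!-- illustrative prime value -->
  </property>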

Thank you


DS jha wrote:
> 
> Where do you set your task properties (mapred.map.tasks and
> mapred.reduce.tasks)? I had these set in mapred-default.xml, but that
> didn't seem to have any impact on Nutch jobs. When I set these
> properties in hadoop-site.xml or nutch-site.xml, Nutch jobs do use
> them. Where should these config entries be defined?
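>
> For reference, the entry itself has the same form in each of those
> files -- a sketch, using the reduce count from my earlier message:
>
>   <property>
>     <name>mapred.reduce.tasks</name>
>     <value>7</value>
>   </property>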
> 
> Thank you,
> 
> 
> 
> 
> On Thu, Apr 2, 2009 at 9:54 PM, Jack Yu <[email protected]> wrote:
>> On my 3 local machines, with a setting of 36 map tasks (it should
>> probably be more) and 16 reduce tasks, there are 2 map tasks and
>> 2 reduce tasks running on each machine.
>>
>>
>> On Fri, Apr 3, 2009 at 6:39 AM, DS jha <[email protected]> wrote:
>>
>>> I am using Nutch to crawl about 10M pages, and it looks like the
>>> machines in my cluster are heavily under-utilized – I wanted to know
>>> what would be a good configuration to improve performance.
>>>
>>> I have set up my cluster on Amazon EC2; currently I have 3 EC2
>>> instances of type High-CPU Medium (dual processors, with about
>>> 1.7 GB RAM each) running CentOS. I have configured the
>>> mapred-default.xml file to define 30 map tasks and 7 reduce tasks.
>>> For Nutch fetch jobs, I have set generate.max.per.host to 50 and
>>> the number of fetcher threads to about 300. However, I noticed that
>>> whenever I run any Nutch job, it runs only 1 reduce task. Also, for
>>> jobs like fetch, it starts only 2 map tasks, one each on two
>>> slaves, and a reduce task on the third machine. As a result, CPU &
>>> memory utilization on each of the boxes is very low – is that
>>> normal? A fetch job of 300K records takes over 10 hrs to finish,
>>> and sometimes jobs like mergesegs run on only one of the boxes and
>>> fail to finish with an OutOfMemory exception.
>>>
>>> What would be a good map/reduce task configuration for these boxes?
>>> Or does Nutch override the configuration defined in the
>>> mapred-default file and set its own number of tasks for each job?
>>> Are the above machines enough for me to crawl up to 10M pages?
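>>>
>>> For reference, here is a sketch of the relevant entries as I
>>> currently have them (the fetcher thread property I am using is
>>> fetcher.threads.fetch; the values are the ones mentioned above):
>>>
>>>   <property>
>>>     <name>mapred.map.tasks</name>
>>>     <value>30</value>
>>>   </property>
>>>   <property>
>>>     <name>mapred.reduce.tasks</name>
>>>     <value>7</value>
>>>   </property>
>>>   <property>
>>>     <name>generate.max.per.host</name>
>>>     <value>50</value>
>>>   </property>
>>>   <property>
>>>     <name>fetcher.threads.fetch</name>
>>>     <value>300</value>
>>>   </property>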
>>>
>>>
>>> Cheers,
>>>
>>
> 
> 
