Aren't EC2 instances virtual hosts? I had problems with speed using virtual
hosts on my local Linux box.
Which is preferable, a dedicated server or EC2?
-----Original Message-----
From: Jack Yu <[email protected]>
To: [email protected]
Sent: Thu, 2 Apr 2009 6:54 pm
Subject: Re: nutch/hadoop performance and optimal configuration
On my 3 local machines, I use a setting of 36 map tasks (it should probably be
more) and 16 reduce tasks; there are 2 map tasks and 2 reduce tasks running on
each machine.
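In case it helps, the per-node concurrency is set by the tasktracker slot
properties. A minimal sketch of the relevant entries in conf/hadoop-site.xml
(0.19-era property names; later releases moved them to mapred-site.xml, so
verify against your version's defaults file):

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>   <!-- concurrent map tasks per node -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>   <!-- concurrent reduce tasks per node -->
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>36</value>  <!-- hint for total maps per job -->
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>16</value>  <!-- total reduces per job -->
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>  <!-- raise this if tasks die with OutOfMemory -->
  </property>

The child JVM heap is worth checking too; the mergesegs OutOfMemory failures
you describe usually mean the default -Xmx is too small for the merge.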
On Fri, Apr 3, 2009 at 6:39 AM, DS jha <[email protected]> wrote:
> I am using Nutch to crawl about 10M pages, and it looks like the machines
> on my cluster are heavily under-utilized – I wanted to know what would
> be a good configuration to improve performance.
>
> I have set up my cluster on Amazon EC2, and currently I have 3 EC2
> instances of type High-CPU Medium (dual core with about 1.7 GB
> RAM each) running CentOS. I have configured the mapred-default.xml
> file to define 30 map tasks and 7 reduce tasks. For Nutch fetch jobs,
> I have set generate.max.per.host to 50 and the number of fetcher threads
> to about 300. However, I noticed that whenever I run any Nutch job,
> it runs only 1 reduce task. Also, for jobs like fetch, it starts only
> 2 map tasks, one each on two slaves, and a reduce task on the third
> machine. And so, CPU & memory utilization on each of the boxes is
> very low – is that normal? For fetch jobs of 300K records, it is
> taking over 10 hrs to finish, and sometimes jobs like mergesegs run on
> only one of the boxes and fail to finish because of an OutOfMemory
> exception.
>
> What would be a good configuration of map/reduce tasks for these boxes?
> Or does Nutch override the configuration defined in the mapred-default
> file and set its own number of tasks for each job? Is the above set of
> machines enough for me to crawl up to 10M pages?
>
>
> Cheers,
>
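Coming back to the Nutch-side settings: generate.max.per.host and the fetcher
thread count go in conf/nutch-site.xml. A minimal sketch, assuming Nutch
1.0-era property names (check nutch-default.xml in your release for the exact
names and defaults):

  <property>
    <name>generate.max.per.host</name>
    <value>50</value>   <!-- cap on URLs generated per host per segment -->
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>300</value>  <!-- total fetcher threads per fetch task -->
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>1</value>    <!-- polite default; a low value throttles crawls
                             that hit only a few hosts -->
  </property>

Note that a fetch throttled per host will leave CPUs idle no matter how many
map slots you configure, which can look exactly like the under-utilization
described above.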