Avi,
First, why a 32-bit OS?
You have a 64-bit processor with 4 hyper-threaded cores, so it looks like 8 CPUs.

With only 1.7 GB of memory you're going to be limited in the number of slots you can 
configure. 
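For example, on a 1.7 GB box you would cap the per-TaskTracker slots in 
mapred-site.xml along these lines (the property names are the standard 
0.20/CDH3 ones; the values here are only an illustration, not a tuned 
recommendation):

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>          <!-- map slots per TaskTracker -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>          <!-- reduce slots per TaskTracker -->
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx200m</value>   <!-- heap per task JVM (this is the stock default) -->
  </property>
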
I'd say run Ganglia, but that would take resources away from you.  It sounds 
like the default parameters are a pretty good fit.
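
If you do decide to hook up Ganglia later, the stock hook lives in 
conf/hadoop-metrics.properties, roughly along these lines (the gmond host 
below is a placeholder, 8649 is Ganglia's default gmond port, and if I 
remember right CDH3 also ships GangliaContext31 for Ganglia 3.1.x):

  dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
  dfs.period=10
  dfs.servers=gmond-host:8649
  mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
  mapred.period=10
  mapred.servers=gmond-host:8649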


Sent from a remote device. Please excuse any typos...

Mike Segel

On Aug 21, 2011, at 6:57 AM, "Avi Vaknin" <avivakni...@gmail.com> wrote:

> Hi all!
> How are you?
> 
> My name is Avi and I have been fascinated by Apache Hadoop for the last few
> months.
> I have spent the last two weeks trying to optimize my configuration files
> and environment.
> I have gone through many of Hadoop's configuration properties, and it seems
> that none of them makes a big difference (+/- 3 minutes out of the total job
> run time).
> 
> By Hadoop's standards my cluster is considered extremely small (260 GB of
> text files, while every job goes through only about 8 GB).
> I have one server acting as NameNode and JobTracker, and another 5 servers
> acting as DataNodes and TaskTrackers.
> Right now Hadoop's configuration is set to the defaults, apart from the DFS
> block size, which is set to 256 MB since every file in my cluster is 155 - 250
> MB.
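> (Concretely, that is the one property I have overridden in hdfs-site.xml;
> as far as I know dfs.block.size is the right name on CDH3 and it takes the
> value in bytes:)
> 
>   <property>
>     <name>dfs.block.size</name>
>     <value>268435456</value>   <!-- 256 MB, in bytes -->
>   </property>
> 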
> 
> All of the above servers are exactly the same and have the following
> hardware and software:
> 1.7 GB memory
> 1 Intel(R) Xeon(R) CPU E5507 @ 2.27GHz
> Ubuntu Server 10.10 , 32-bit platform
> Cloudera CDH3 Manual Hadoop Installation
> (for the ones who are familiar with Amazon Web Services, I am talking about
> Small EC2 Instances/Servers)
> 
> Total job run time is about 15 minutes (about 50 files/blocks/map tasks of up
> to 250 MB, and 10 reduce tasks).
> 
> Based on the above information, can anyone recommend a best-practice
> configuration?
> Do you think that when dealing with such a small cluster, and processing such
> a small amount of data,
> it is even possible to optimize jobs so they run much faster? 
> 
> By the way, it seems like none of the nodes has a hardware performance
> issue (CPU/memory) while running the job.
> That is true unless I have a bottleneck somewhere else (it seems like
> network bandwidth is not the issue).
> This is a little confusing, because the NameNode process and the JobTracker
> process should each allocate 1 GB of memory,
> which means that my hardware starting point is insufficient; in that case,
> why am I not seeing full memory utilization with the 'top'
> command on the NameNode & JobTracker server? 
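> (For reference, I have not touched hadoop-env.sh, so if I understand the
> stock file correctly each daemon runs with the default heap ceiling:)
> 
>   # conf/hadoop-env.sh -- I have not set HADOOP_HEAPSIZE, so each daemon
>   # should use the built-in maximum of 1000 MB, as far as I understand
>   # export HADOOP_HEAPSIZE=1000
> 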
> How would you recommend measuring/monitoring the various Hadoop
> metrics/properties to find out where the bottleneck is?
> 
> Thanks for your help!!
> 
> Avi
> 
> 
> 
