We have a weird performance problem with a Hadoop job on our cluster. It is a 32-node experimental cluster of blades (two hex-core CPUs each), with one dedicated job tracker and one dedicated namenode, running Cloudera's CDH3 (0.20.2-cdh3u3, 03b655719d13929bd68bb2c2f9cee615b389cea9). All nodes were provisioned together with the same kickstart script and run Red Hat 6.1 (Linux he3lxvd607 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux).
When we run our job (~300 tasks), all tasks fire off at once, so roughly 10 tasks per node. We observe that the upper half of the nodes (nodes 17-32) has an average load close to 10, with CPU about 50% utilized. The lower half (nodes 1-16), however, does not utilize the CPU fully: load is about 1-3 and CPU usage is below 10%. In the final metrics, a map task in the lower half shows about the same "CPU time spent (ms)" count as one in the upper half, so it looks as if something is throttling the tasks in the lower half (1-16). We compared the two sets of nodes in every aspect we could think of and found no difference.

Our job uses the old mapred API. It has a quite modest input (<1 GB for 300 maps) and very tiny output; the intermediate output from the maps is larger (maybe 10x the input). The slow part is actually within the map, where we convert the input format into some classes before we can do the real calculation.

We then physically swapped the blades in slots 1-16 with the blades in slots 17-32, and we still see the under-utilization in the new 1-16. So it looks more like some configuration issue in Hadoop or the system. When we run TeraSort or word count, they seem to use all nodes equally.

We have run out of ideas. Any suggestions are highly appreciated.

Zhu, Guojun
Modeling Sr Graduate
571-3824370
guojun_...@freddiemac.com
Financial Engineering
Freddie Mac
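P.S. In case the job structure matters: the map side is organized roughly like the sketch below, using the old mapred API. This is only an illustrative skeleton; the class name, the comma-split "conversion" step, and the emitted keys are placeholders, not our actual code.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Skeleton of the map task: convert the raw record, then compute and emit.
public class ConvertAndComputeMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        // Step 1 (the CPU-heavy part in our job): convert the raw text
        // record into the objects the calculation needs.
        String[] fields = line.toString().split(",");

        // Step 2: the real calculation, emitting intermediate output that
        // is roughly 10x the size of the input.
        for (String field : fields) {
            out.collect(new Text(fields[0]), new Text(field));
        }
        reporter.progress();
    }
}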