Did you try using Hadoop Vaidya or Karmasphere to diagnose the problem?
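Another quick OS-level check worth running on one slow node and one fast node: compare the CPU frequency governors. A mismatched governor (e.g. "ondemand" on half the nodes, "performance" on the other half) is a common cause of exactly this kind of half-cluster throttling. This is only a hedged sketch; the sysfs path is standard on that 2.6.32 kernel, but `check_governors` is just an illustrative helper name.

```shell
check_governors() {
    # Print a count of per-core frequency governors on this node.
    # A node stuck on "ondemand" or "powersave" while others run
    # "performance" would show low CPU% under the same load.
    out=$(cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor 2>/dev/null \
          | sort | uniq -c)
    if [ -n "$out" ]; then
        echo "$out"
    else
        echo "cpufreq not exposed on this kernel"
    fi
}

check_governors
```

You could run it over ssh against one node from each half (e.g. `ssh node01 ... ; ssh node17 ...`) and diff the output.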
Jie

On Fri, Mar 16, 2012 at 5:23 PM, GUOJUN Zhu <guojun_...@freddiemac.com> wrote:
>
> We have a weird performance problem with a Hadoop job on our cluster. We
> have a 32-node experimental cluster of blades (2 hex-core CPUs each), one
> dedicated job tracker, and one dedicated namenode, running Cloudera's CDH3
> (0.20.2-cdh3u3, 03b655719d13929bd68bb2c2f9cee615b389cea9). All nodes were
> built with the same kickstart script, all on Red Hat 6.1 (Linux he3lxvd607
> 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64
> x86_64 x86_64 GNU/Linux).
>
> When we run our job (~300 tasks), all tasks fire off at once, so about 10
> tasks per node. We observe that the upper half of the nodes (nodes 17-32)
> have an average load close to 10, with the CPU about 50% utilized.
> However, the lower half (nodes 1-16) does not utilize the CPU fully: load
> is about 1-3 and CPU is <10%. In the final metrics, a map task in the
> lower half shows about the same "CPU time spent (ms)" count as one in the
> upper half. So it seems that something throttles the tasks in the lower
> half (1-16). We checked for differences between the two sets of nodes in
> every aspect we could think of and found none.
>
> Our job uses the old mapred API. It has a quite modest input (<1G for 300
> maps) and very tiny output. The intermediate output from the maps is
> larger (maybe 10x the input). The slow part is actually within the map,
> where we convert the input format into some classes before we can do the
> real calculation.
>
> We then physically swapped the blades in 1-16 with the blades in 17-32.
> We still see the under-utilization in the (new) nodes 1-16. So it looks
> more like some configuration issue in Hadoop or the system.
>
> We have run out of ideas. Any suggestions are highly appreciated.
>
> When we run terasort or word-count, they seem to use all nodes equally.
>
> Zhu, Guojun
> Modeling Sr Graduate
> 571-3824370
> guojun_...@freddiemac.com
> Financial Engineering
> Freddie Mac