Hi John,

Which JVM version are you using (JDK 1.6.0_2xx?), and what JVM arguments do
you use for spawning the map/reduce slots?

Also check whether the JVM is stuck on the machine. I have sometimes seen a
task JVM launch, get into a spinning mode, and occupy 100% CPU. Can you
check if this is the case?
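A quick way to check on the node (the <pid> below is a placeholder, and I am
assuming a Sun JDK so that jps and jstack are on the path):

    # list the task JVMs; on 0.20.x child tasks run as org.apache.hadoop.mapred.Child
    jps -l | grep Child

    # watch per-thread CPU usage of a suspect task JVM
    top -H -p <pid>

    # take two thread dumps about ten seconds apart
    jstack <pid> > /tmp/task-<pid>-1.txt
    sleep 10
    jstack <pid> > /tmp/task-<pid>-2.txt

If the same RUNNABLE thread is burning a core in both dumps, the task is
spinning rather than making progress. If jstack itself hangs, that also
tells you the JVM is in a bad state (jstack -F may still work).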
~Rajesh Balamohan

On Fri, Dec 16, 2011 at 2:26 AM, John Miller <jmil...@mybuys.com> wrote:

> Hello Arun,
>
> Thanks for the quick reply. I totally understand the CDH issue, but I
> figured I'd ask the broader community as well in case there was any known
> upstream issue, as I've noticed some patches relating to "somewhat
> similar" issues.
>
> The jstack was already on my radar, but I hadn't even thought about using
> tcpdump to catch whether the tasks were heartbeating, so thanks for the
> tip; I'll make sure to check that out! We are also planning our release
> update from CDH 3u2 vs. our current 3u0, which will give us the updated
> hadoop 0.20.2+923.142 vs. our current 0.20.2+923.21 and may inadvertently
> fix the issue as well, in which case I'll at least let everyone here know.
>
> Any further ideas are welcome, and if anyone else has experienced a
> similar issue, my ears are open. Thanks again Arun! :)
>
> John Miller | Sr. Linux Systems Administrator
> 530 E. Liberty St.
> Ann Arbor, MI 48104
> Direct: 734.922.7007
> http://mybuys.com/
>
> From: Arun C Murthy [mailto:a...@hortonworks.com]
> Sent: Thursday, December 15, 2011 2:03 PM
> To: mapreduce-user@hadoop.apache.org
> Subject: Re: Tasktracker Task Attempts Stuck (mapreduce.task.timeout not
> working)
>
> Hi John,
>
> It's hard for folks on this list to diagnose CDH (you might have to ask
> their lists). However, I haven't seen similar issues with hadoop-0.20.2xx
> in a while.
>
> One thing to check would be to grab a stack trace (jstack) on the tasks
> to see what they are up to. Next, try to get a tcpdump to see if the tasks
> are indeed sending heartbeats to the TT, which might be the reason the TTs
> aren't timing them out.
>
> hth,
> Arun
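(On the tcpdump suggestion: in 0.20.x the task heartbeats to the TT over a
loopback "umbilical" RPC connection whose port is chosen dynamically
(mapred.task.tracker.report.address defaults to 127.0.0.1:0). If I remember
right, the umbilical host:port is passed on the child JVM's command line,
so you can fish it out with ps; the <port> below is a placeholder:

    # the umbilical address should appear among the Child process arguments
    ps -ef | grep [o]rg.apache.hadoop.mapred.Child

    # then check whether the stuck task is still talking to the TT
    tcpdump -i lo -nn port <port>

If you see periodic traffic, the task is heartbeating even though it
reports no progress, which would explain why the TT never times it out.)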
> On Dec 15, 2011, at 7:58 AM, John Miller wrote:
>
> I've recently come across some interesting behavior within a 50-node
> cluster regarding the tasktrackers and task attempts. Essentially, tasks
> are being created but they stick at 0.0%; 'mapreduce.task.timeout' doesn't
> seem to take effect, so the tasks just sit there (for days if we let them)
> and the jobs have to be killed. It's interesting to note that the HDFS
> datanode service and the HBase regionserver running on these nodes work
> fine, and we've simply been shutting down the tasktracker service on them
> to keep jobs from stalling forever.
>
> Some historical information: we're running Cloudera's CDH3u0 release, and
> so far this has happened only on a handful of random tasktracker nodes. It
> seems to affect only nodes that were taken down for maintenance and then
> brought back into the cluster (in one case, a node was added to the
> cluster after it had been running for a while, and we ran into the same
> issue). After the nodes rejoin the cluster, the tasktracker service starts
> getting these stalls. Note that this has not happened to every node that
> was taken out of service for a time and re-added; I'd say roughly a third
> of them have run into the issue after maintenance. The particular
> maintenance issues on the affected nodes were NOT the same (one was bad
> RAM, another a bad sector on a disk, etc.); it was never the same initial
> problem, only the same outcome after rejoining the cluster.
>
> It's also never the same mapred job that sticks, nor is there any
> time-related evidence tying the stalls to a specific time of day. Rather,
> a node will run fine for many jobs and then, all of a sudden, some tasks
> will stall and stick at 0.0%. There are no visible errors in the log
> output, yet nothing moves forward, and the stalled job won't release its
> mappers for other jobs to use until it is killed. It seems that the
> default 'mapreduce.task.timeout' just isn't working for some reason.
>
> Has anyone come across anything similar? I can provide more details/data
> as needed.
>
> John Miller | Sr. Linux Systems Administrator
> 530 E. Liberty St.
> Ann Arbor, MI 48104
> Direct: 734.922.7007
> http://mybuys.com/
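One more thing worth double-checking: on 0.20.x (which CDH3 is based on),
the property the framework actually reads is mapred.task.timeout;
mapreduce.task.timeout is the newer name. The default is 600000 ms (ten
minutes), but it is worth verifying what a stuck job's job.xml really
contains, or setting the old name explicitly in the job/cluster conf (the
value below is just the default, shown as an example):

    <property>
      <name>mapred.task.timeout</name>
      <value>600000</value>
    </property>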