Hi,

We have a Hadoop (hadoop-0.20.1) cluster of 14 nodes, and jobs run on it daily.
Recently we hit an issue where the JobTracker loses track of all TaskTrackers and, as a result, job execution stops. Killing running jobs or submitting new ones does not work; the only remedy is to restart MapReduce, after which everything works fine. Below is a detailed description of the errors.

The following errors started appearing in the job execution status:

    Error launching task
    11/08/13 08:28:36 WARN mapred.JobClient: Error reading task outputhttp://MLDataNode3009:50060/tasklog?plaintext=true&taskid=attempt_201108121116_0059_m_000201_0&filter=
    11/08/13 08:28:36 WARN mapred.JobClient: Error reading task outputhttp://MLDataNode3009:50060/tasklog?plaintext=true&taskid=attempt_201108121116_0059_m_000201_0&filter=
    11/08/13 08:28:36 INFO mapred.JobClient: Task Id : attempt_201108121116_0059_m_000224_0, Status : FAILED
    Error launching task
    11/08/13 08:28:36 WARN mapred.JobClient: Error reading task outputhttp://MLDataNode3009:50060/tasklog?plaintext=true&taskid=attempt_201108121116_0059_m_000224_0&filter=
    11/08/13 08:28:37 WARN mapred.JobClient: Error reading task outputhttp://MLDataNode3009:50060/tasklog?plaintext=true&taskid=attempt_201108121116_0059_m_000224_0&filter=
    11/08/13 08:28:37 INFO mapred.JobClient: Task Id : attempt_201108121116_0059_m_000230_0, Status : FAILED
    Error launching task

Tracking it back to the JobTracker, the following errors were present in the JobTracker logs at the same time:

    2011-08-13 08:15:21,702 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@f09b96, false, false, true, 24565) from 10.0.9.132:59314: output error
    2011-08-13 08:15:21,702 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 54311 caught: java.nio.channels.ClosedByInterruptException
        at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:184)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:341)
        at org.apache.hadoop.ipc.Server.channelWrite(Server.java:1195)
        at org.apache.hadoop.ipc.Server.access$1900(Server.java:77)
        at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:613)
        at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:677)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:981)
    2011-08-13 08:15:21,702 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 54311 caught: java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1135)
        at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:312)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:354)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)

Here 10.0.9.132 is one of the TaskTracker + DataNode machines. One occurrence of this error was present in the log for each of the TaskTrackers.
The corresponding error in the TaskTracker log:

    2011-08-13 08:15:36,073 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.io.IOException: Call to MLNameNode3002/10.0.9.205:54311 failed on local exception: java.io.EOFException
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:774)
        at org.apache.hadoop.ipc.Client.call(Client.java:742)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
        at org.apache.hadoop.mapred.$Proxy4.heartbeat(Unknown Source)
        at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1215)
        at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1037)
        at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1720)
        at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2833)
    Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:375)
        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
    2011-08-13 08:15:36,074 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to 'MLNameNode3002' with reponseId '24565

We had increased the value of the mapred.map/reduce.tasks.maximum properties to accommodate increased data inflow. The issue above has occurred twice since making that change; everything was stable before.

Can somebody please help with this? What might be the cause of this error, and can it be handled by setting some configuration property?

Thanks,
Ajit.
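For reference, this is roughly what the change looked like in mapred-site.xml — a minimal sketch, assuming the properties referred to above are the per-TaskTracker slot limits mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum; the values shown are illustrative, not our exact numbers:

    <!-- mapred-site.xml (sketch): per-TaskTracker slot limits in Hadoop 0.20.x.
         The values below are illustrative, not our actual settings. -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>8</value>   <!-- raised from the default of 2 -->
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>8</value>   <!-- raised together with the map slots -->
    </property>

If something on the JobTracker side needs to be raised in step with the slot counts (for example mapred.job.tracker.handler.count, which controls the number of JobTracker RPC handler threads), a pointer would be much appreciated.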