Hi,

We have a Hadoop (hadoop-0.20.1) cluster of 14 nodes, and jobs run on it daily.
Recently we hit an issue where the JobTracker loses track of all TaskTrackers and, as a result, job execution stops. Killing running jobs or submitting new ones does not work; the only remedy is to restart MapReduce, after which everything works fine. Below is a detailed description of the errors.

The following errors started appearing in the job execution status:

    Error launching task
    11/08/13 08:28:36 WARN mapred.JobClient: Error reading task outputhttp://MLDataNode3009:50060/tasklog?plaintext=true&taskid=attempt_201108121116_0059_m_000201_0&filter=
    11/08/13 08:28:36 WARN mapred.JobClient: Error reading task outputhttp://MLDataNode3009:50060/tasklog?plaintext=true&taskid=attempt_201108121116_0059_m_000201_0&filter=
    11/08/13 08:28:36 INFO mapred.JobClient: Task Id : attempt_201108121116_0059_m_000224_0, Status : FAILED
    Error launching task
    11/08/13 08:28:36 WARN mapred.JobClient: Error reading task outputhttp://MLDataNode3009:50060/tasklog?plaintext=true&taskid=attempt_201108121116_0059_m_000224_0&filter=
    11/08/13 08:28:37 WARN mapred.JobClient: Error reading task outputhttp://MLDataNode3009:50060/tasklog?plaintext=true&taskid=attempt_201108121116_0059_m_000224_0&filter=
    11/08/13 08:28:37 INFO mapred.JobClient: Task Id : attempt_201108121116_0059_m_000230_0, Status : FAILED
    Error launching task

Tracking it back to the JobTracker, the following errors were present in the JobTracker logs at the same time:

    2011-08-13 08:15:21,702 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@f09b96, false, false, true, 24565) from 10.0.9.132:59314: output error
    2011-08-13 08:15:21,702 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 54311 caught: java.nio.channels.ClosedByInterruptException
        at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:184)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:341)
        at org.apache.hadoop.ipc.Server.channelWrite(Server.java:1195)
        at org.apache.hadoop.ipc.Server.access$1900(Server.java:77)
        at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:613)
        at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:677)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:981)
    2011-08-13 08:15:21,702 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 54311 caught: java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1135)
        at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:312)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:354)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)

Here 10.0.9.132 is one of the TaskTracker + DataNode machines. One occurrence of this error was present in the log for each of the TaskTrackers.
The corresponding error in the TaskTracker log:

    2011-08-13 08:15:36,073 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.io.IOException: Call to MLNameNode3002/10.0.9.205:54311 failed on local exception: java.io.EOFException
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:774)
        at org.apache.hadoop.ipc.Client.call(Client.java:742)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
        at org.apache.hadoop.mapred.$Proxy4.heartbeat(Unknown Source)
        at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1215)
        at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1037)
        at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1720)
        at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2833)
    Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:375)
        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
    2011-08-13 08:15:36,074 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to 'MLNameNode3002' with reponseId '24565

We had increased the value of the mapred.map/reduce.tasks.maximum properties to accommodate increased data inflow. The issue above has occurred twice since making that change; everything was stable before.

Can somebody please help with this? What might be the cause of this error, and can it be handled by setting some configuration property?

Thanks,
Ajit.
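For reference, this is roughly what the change looked like in mapred-site.xml — a minimal sketch, assuming the properties referred to above are the per-TaskTracker slot limits mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum; the values shown are illustrative, not our exact numbers:

    <!-- mapred-site.xml (sketch): per-TaskTracker slot limits in Hadoop 0.20.x.
         The values below are illustrative, not our actual settings. -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>8</value>   <!-- raised from the default of 2 -->
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>8</value>   <!-- raised together with the map slots -->
    </property>

If something on the JobTracker side needs to be raised in step with the slot counts (for example mapred.job.tracker.handler.count, which controls the number of JobTracker RPC handler threads), a pointer would be much appreciated.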