One possible cause that comes to mind is a version mismatch. It might be worth checking that the job in question isn't carrying its own copy of the Hadoop jars that differs from what the cluster is running.
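A quick way to check that (just a sketch, assuming the job ships as a single jar; "yourjob.jar" is a placeholder for your actual job jar) is to compare the cluster's version with whatever Hadoop classes the jar bundles:

  hadoop version
  jar tf yourjob.jar | grep -i hadoop

If the job jar or its lib/ directory pulls in hadoop-core classes from a release other than 0.20.2, the child task's RPC to the namenode could plausibly fail with EOFExceptions like the ones you pasted.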
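On the settings you dug up from the archives: those are server-side HDFS properties, so if you do decide to try them they would go into hdfs-site.xml on the namenode and datanodes respectively, and need a daemon restart to take effect. Purely as an illustration of the format (the values are the ones from your mail, not a recommendation):

  <!-- hdfs-site.xml on the namenode -->
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>128</value>
  </property>

  <!-- hdfs-site.xml on the datanodes -->
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>

Whether bumping them helps with this particular failure I can't say without knowing more about the load on the namenode.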
On Fri, May 13, 2011 at 12:42 AM, Sidney Simmons <ssimm...@nmitconsulting.co.uk> wrote:
> Hi there,
>
> I'm experiencing some unusual behaviour on our 0.20.2 hadoop cluster.
> Randomly (periodically), we're getting "Call to namenode" failures on
> tasktrackers causing tasks to fail:
>
> 2011-05-12 14:36:37,462 WARN org.apache.hadoop.mapred.TaskRunner:
> attempt_201105090819_059_m_0038_0Child Error
> java.io.IOException: Call to namenode/10.10.10.10:9000 failed on local
> exception: java.io.EOFException
>     at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
>     at org.apache.hadoop.ipc.Client.call(Client.java:743)
>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
>     at $Proxy5.getFileInfo(Unknown Source)
>     at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>     at java.lang.reflect.Method.invoke(Unknown Source)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>     at $Proxy5.getFileInfo(Unknown Source)
>     at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:615)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
>     at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:210)
> Caused by: java.io.EOFException
>     at java.io.DataInputStream.readInt(Unknown Source)
>     at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
>     at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
>
> The namenode log (logging level = INFO) shows the following a few seconds
> either side of the above timestamps. Could be relevant or it could be a
> coincidence:
>
> 2011-05-12 14:36:40,005 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 57 on 9000 caught: java.nio.channels.ClosedChannelException
>     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(Unknown Source)
>     at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
>     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:1213)
>     at org.apache.hadoop.ipc.Server.access$1900(Server.java:77)
>     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:622)
>     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:686)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:997)
>
> The jobtracker does however have an entry that correlates with the
> tasktracker:
>
> 2011-05-12 14:36:39,781 INFO org.apache.hadoop.mapred.TaskInProgress: Error
> from attempt_201105090819_059_m_0038_0: java.io.IOException: Call to
> namenode/10.10.10.10:9000 failed on local exception: java.io.EOFException
>     at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
>     at org.apache.hadoop.ipc.Client.call(Client.java:743)
>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
>     at $Proxy1.getProtocolVersion(Unknown Source)
>     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
>     at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:105)
>     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:208)
>     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:169)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
>     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
>     at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
>     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
>     at org.apache.hadoop.mapred.Child.main(Child.java:157)
> Caused by: java.io.EOFException
>     at java.io.DataInputStream.readInt(Unknown Source)
>     at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
>     at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
>
> Can anyone give me any pointers on how to start troubleshooting this issue?
> It's very sporadic and we haven't been able to reproduce the issue yet in
> our lab. After looking through the mailing list archives, some of the
> suggestions revolve around the following settings:
>
> dfs.namenode.handler.count 128 (existing 64)
> dfs.datanode.handler.count 10 (existing 3)
> dfs.datanode.max.xcievers 4096 (existing 256)
>
> Any pointers?
>
> Thanks in advance
>
> Sid Simmons
> Infrastructure Support Specialist

--
Harsh J