Hi there,

I'm experiencing some unusual behaviour on our Hadoop 0.20.2 cluster. Intermittently, we're getting "Call to namenode" failures on the tasktrackers, which cause tasks to fail:

2011-05-12 14:36:37,462 WARN org.apache.hadoop.mapred.TaskRunner: attempt_201105090819_059_m_0038_0Child Error
java.io.IOException: Call to namenode/10.10.10.10:9000 failed on local exception: java.io.EOFException
    at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
    at org.apache.hadoop.ipc.Client.call(Client.java:743)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
    at $Proxy5.getFileInfo(Unknown Source)
    at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at $Proxy5.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:615)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:210)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(Unknown Source)
    at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)

The namenode log (logging level = INFO) shows the following a few seconds either side of the above timestamps. It could be relevant or it could be a coincidence:

2011-05-12 14:36:40,005 INFO org.apache.hadoop.ipc.Server: IPC Server handler 57 on 9000 caught: java.nio.channels.ClosedChannelException
    at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(Unknown Source)
    at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
    at org.apache.hadoop.ipc.Server.channelWrite(Server.java:1213)
    at org.apache.hadoop.ipc.Server.access$1900(Server.java:77)
    at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:622)
    at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:686)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:997)

The jobtracker does, however, have an entry that correlates with the tasktracker:

2011-05-12 14:36:39,781 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201105090819_059_m_0038_0: java.io.IOException: Call to namenode/10.10.10.10:9000 failed on local exception: java.io.EOFException
    at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
    at org.apache.hadoop.ipc.Client.call(Client.java:743)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
    at $Proxy1.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
    at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:105)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:208)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:169)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
    at org.apache.hadoop.mapred.Child.main(Child.java:157)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(Unknown Source)
    at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)

Can anyone give me any pointers on how to start troubleshooting this issue? It's very sporadic and we haven't been able to reproduce the issue yet in our lab.

After looking through the mailing list archives, some of the suggestions revolve around the following settings:

    dfs.namenode.handler.count    128   (existing 64)
    dfs.datanode.handler.count    10    (existing 3)
    dfs.datanode.max.xcievers     4096  (existing 256)

Any pointers?

Thanks in advance

Sid Simmons
Infrastructure Support Specialist
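
P.S. For reference, here is a rough sketch of how the suggested values would be expressed in hdfs-site.xml, assuming the usual property layout. The values are only the ones suggested in the archives; nothing here is confirmed to fix the EOFException above.

```xml
<?xml version="1.0"?>
<!-- hdfs-site.xml: sketch only, values taken from the mailing-list suggestions -->
<configuration>
  <!-- NameNode RPC handler threads (existing value: 64) -->
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>128</value>
  </property>
  <!-- DataNode RPC handler threads (existing value: 3) -->
  <property>
    <name>dfs.datanode.handler.count</name>
    <value>10</value>
  </property>
  <!-- Max concurrent DataNode transfer threads (existing value: 256);
       "xcievers" is the property's actual spelling -->
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>
</configuration>
```

As far as I understand, the namenode setting only matters on the namenode and the two datanode settings only on the datanodes, and the affected daemons would need a restart to pick up the changes.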