It's not a single node. It occurs on multiple nodes at (seemingly) random points throughout the day. Should we be performing periodic restarts of the processes / datanode servers?
On 13 May 2011 07:02, highpointe <highpoint...@gmail.com> wrote:
> Bounce mapred and TT on the node
>
> Sent from my iPhone
>
> On May 12, 2011, at 3:56 PM, Sidney Simmons <ssimm...@nmitconsulting.co.uk> wrote:
>
> > Hi there,
> >
> > Apologies if this comes through twice but I sent the mail a few hours
> > ago and haven't seen it on the mailing list.
> >
> > I'm experiencing some unusual behaviour on our 0.20.2 Hadoop cluster.
> > Randomly (periodically), we're getting "Call to namenode" failures on
> > tasktrackers, causing tasks to fail:
> >
> > 2011-05-12 14:36:37,462 WARN org.apache.hadoop.mapred.TaskRunner: attempt_201105090819_059_m_0038_0 Child Error
> > java.io.IOException: Call to namenode/10.10.10.10:9000 failed on local exception: java.io.EOFException
> >         at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
> >         at org.apache.hadoop.ipc.Client.call(Client.java:743)
> >         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
> >         at $Proxy5.getFileInfo(Unknown Source)
> >         at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> >         at java.lang.reflect.Method.invoke(Unknown Source)
> >         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
> >         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
> >         at $Proxy5.getFileInfo(Unknown Source)
> >         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:615)
> >         at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
> >         at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:210)
> > Caused by: java.io.EOFException
> >         at java.io.DataInputStream.readInt(Unknown Source)
> >         at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
> >         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
> >
> > The namenode log (logging level = INFO) shows the following a few seconds
> > either side of the above timestamps.
> > Could be relevant or it could be a coincidence:
> >
> > 2011-05-12 14:36:40,005 INFO org.apache.hadoop.ipc.Server: IPC Server handler 57 on 9000 caught: java.nio.channels.ClosedChannelException
> >         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(Unknown Source)
> >         at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
> >         at org.apache.hadoop.ipc.Server.channelWrite(Server.java:1213)
> >         at org.apache.hadoop.ipc.Server.access$1900(Server.java:77)
> >         at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:622)
> >         at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:686)
> >         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:997)
> >
> > The jobtracker does however have an entry that correlates with the
> > tasktracker:
> >
> > 2011-05-12 14:36:39,781 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201105090819_059_m_0038_0: java.io.IOException: Call to namenode/10.10.10.10:9000 failed on local exception: java.io.EOFException
> >         at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
> >         at org.apache.hadoop.ipc.Client.call(Client.java:743)
> >         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
> >         at $Proxy1.getProtocolVersion(Unknown Source)
> >         at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
> >         at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:105)
> >         at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:208)
> >         at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:169)
> >         at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
> >         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
> >         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
> >         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
> >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
> >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
> >         at org.apache.hadoop.mapred.Child.main(Child.java:157)
> > Caused by: java.io.EOFException
> >         at java.io.DataInputStream.readInt(Unknown Source)
> >         at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
> >         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
> >
> > Can anyone give me any pointers on how to start troubleshooting this issue?
> > It's very sporadic and we haven't been able to reproduce the issue yet in
> > our lab. After looking through the mailing list archives, some of the
> > suggestions revolve around the following settings:
> >
> > dfs.namenode.handler.count  128  (existing 64)
> > dfs.datanode.handler.count  10   (existing 3)
> > dfs.datanode.max.xcievers   4096 (existing 256)
> >
> > Any pointers?
> >
> > Thanks in advance
> >
> > Sid Simmons
> > Infrastructure Support Specialist
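For reference, here's a minimal sketch of how those three settings would look in hdfs-site.xml, assuming the 0.20-era property names quoted above and the proposed values. The handler-count and xciever changes only take effect after restarting the NameNode and DataNode daemons, and note that dfs.datanode.max.xcievers keeps its historical misspelling:

    <!-- hdfs-site.xml: proposed values from the thread above -->
    <property>
      <name>dfs.namenode.handler.count</name>
      <value>128</value>   <!-- NameNode RPC handler threads (currently 64) -->
    </property>
    <property>
      <name>dfs.datanode.handler.count</name>
      <value>10</value>    <!-- DataNode RPC handler threads (currently 3) -->
    </property>
    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>  <!-- max concurrent block transfer threads per DataNode (currently 256) -->
    </property>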