It's not a single node. It occurs on multiple nodes at (seemingly) random
points throughout the day. Should we be performing periodic restarts of the
processes / datanode servers?
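
Incidentally, just so I bounce things the way you meant: by "bounce mapred
and TT" I'm assuming you mean stopping and starting the TaskTracker (and
DataNode?) on the affected node with the stock daemon scripts, something
like:

  # paths assume the standard 0.20 tarball layout under $HADOOP_HOME
  $HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker
  $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker

  $HADOOP_HOME/bin/hadoop-daemon.sh stop datanode
  $HADOOP_HOME/bin/hadoop-daemon.sh start datanode

rather than restarting the whole box?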



On 13 May 2011 07:02, highpointe <highpoint...@gmail.com> wrote:

> Bounce mapred and TT on the node
>
> On May 12, 2011, at 3:56 PM, Sidney Simmons <ssimm...@nmitconsulting.co.uk> wrote:
>
> > Hi there,
> >
> > Apologies if this comes through twice, but I sent the mail a few hours
> > ago and haven't seen it appear on the mailing list.
> >
> > I'm experiencing some unusual behaviour on our 0.20.2 Hadoop cluster.
> > At seemingly random intervals, we're getting "Call to namenode" failures
> > on tasktrackers, causing tasks to fail:
> >
> > 2011-05-12 14:36:37,462 WARN org.apache.hadoop.mapred.TaskRunner: attempt_201105090819_059_m_0038_0 Child Error
> > java.io.IOException: Call to namenode/10.10.10.10:9000 failed on local exception: java.io.EOFException
> >       at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
> >       at org.apache.hadoop.ipc.Client.call(Client.java:743)
> >       at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
> >       at $Proxy5.getFileInfo(Unknown Source)
> >       at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> >       at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> >       at java.lang.reflect.Method.invoke(Unknown Source)
> >       at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
> >       at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
> >       at $Proxy5.getFileInfo(Unknown Source)
> >       at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:615)
> >       at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
> >       at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:210)
> > Caused by: java.io.EOFException
> >       at java.io.DataInputStream.readInt(Unknown Source)
> >       at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
> >       at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
> >
> > The namenode log (logging level = INFO) shows the following within a few
> > seconds either side of the above timestamp. It could be relevant, or it
> > could be a coincidence:
> >
> > 2011-05-12 14:36:40,005 INFO org.apache.hadoop.ipc.Server: IPC Server handler 57 on 9000 caught: java.nio.channels.ClosedChannelException
> >       at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(Unknown Source)
> >       at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
> >       at org.apache.hadoop.ipc.Server.channelWrite(Server.java:1213)
> >       at org.apache.hadoop.ipc.Server.access$1900(Server.java:77)
> >       at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:622)
> >       at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:686)
> >       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:997)
> >
> > The jobtracker does, however, have an entry that correlates with the
> > tasktracker error:
> >
> > 2011-05-12 14:36:39,781 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201105090819_059_m_0038_0: java.io.IOException: Call to namenode/10.10.10.10:9000 failed on local exception: java.io.EOFException
> >       at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
> >       at org.apache.hadoop.ipc.Client.call(Client.java:743)
> >       at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
> >       at $Proxy1.getProtocolVersion(Unknown Source)
> >       at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
> >       at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:105)
> >       at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:208)
> >       at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:169)
> >       at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
> >       at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
> >       at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
> >       at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
> >       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
> >       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
> >       at org.apache.hadoop.mapred.Child.main(Child.java:157)
> > Caused by: java.io.EOFException
> >       at java.io.DataInputStream.readInt(Unknown Source)
> >       at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
> >       at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
> >
> > Can anyone give me some pointers on how to start troubleshooting this
> > issue? It's very sporadic and we haven't been able to reproduce it in our
> > lab yet. After looking through the mailing list archives, some of the
> > suggestions revolve around increasing the following settings:
> >
> > dfs.namenode.handler.count = 128  (currently 64)
> > dfs.datanode.handler.count = 10   (currently 3)
> > dfs.datanode.max.xcievers  = 4096 (currently 256)
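> >
> > For what it's worth, I'm assuming these would all go into conf/hdfs-site.xml
> > on the relevant nodes (please correct me if any of them belong elsewhere),
> > along the lines of:
> >
> >   <!-- assuming hdfs-site.xml; values taken from the list above -->
> >   <property>
> >     <name>dfs.namenode.handler.count</name>
> >     <value>128</value>
> >   </property>
> >   <property>
> >     <name>dfs.datanode.handler.count</name>
> >     <value>10</value>
> >   </property>
> >   <property>
> >     <name>dfs.datanode.max.xcievers</name>
> >     <value>4096</value>
> >   </property>
> >
> > with a restart of the namenode/datanodes afterwards to pick them up.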
> >
> > Any pointers?
> >
> > Thanks in advance
> >
> > Sid Simmons
> > Infrastructure Support Specialist
>
