One cause I can think of is a version mismatch. The EOFException means
the IPC connection to the namenode was dropped mid-call, and the
ClosedChannelException in your namenode log looks like the other half of
the same dropped connection. You may want to check whether the job in
question is carrying its own, different copy of the Hadoop jars inside
the job jar.
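
A quick way to rule that out is to print which Hadoop version the task's
classpath actually resolves, from inside the job itself. A minimal
sketch (the class name is mine; VersionInfo is the stock
org.apache.hadoop.util helper):

    import org.apache.hadoop.util.VersionInfo;

    // Hypothetical check: run via `hadoop VersionCheck` on a worker, or
    // call the same two println lines from your mapper's configure() so
    // they execute with the task's classpath. If a stale Hadoop jar is
    // bundled inside the job jar, this reports that jar's version, not
    // the cluster's 0.20.2.
    public class VersionCheck {
      public static void main(String[] args) {
        System.out.println("Hadoop version: " + VersionInfo.getVersion());
        // Which jar the class was actually loaded from:
        System.out.println("Loaded from: "
            + VersionInfo.class.getProtectionDomain()
                  .getCodeSource().getLocation());
      }
    }

If that disagrees with what `hadoop version` prints on the cluster
nodes, you have found your mismatch.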

On Fri, May 13, 2011 at 12:42 AM, Sidney Simmons
<ssimm...@nmitconsulting.co.uk> wrote:
> Hi there,
>
> I'm experiencing some unusual behaviour on our 0.20.2 Hadoop cluster.
> Sporadically, we're getting "Call to namenode" failures on tasktrackers,
> causing tasks to fail:
>
> 2011-05-12 14:36:37,462 WARN org.apache.hadoop.mapred.TaskRunner:
> attempt_201105090819_059_m_0038_0 Child Error
> java.io.IOException: Call to namenode/10.10.10.10:9000 failed on local
> exception: java.io.EOFException
>       at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
>       at org.apache.hadoop.ipc.Client.call(Client.java:743)
>       at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
>       at $Proxy5.getFileInfo(Unknown Source)
>       at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>       at java.lang.reflect.Method.invoke(Unknown Source)
>       at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>       at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>       at $Proxy5.getFileInfo(Unknown Source)
>       at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:615)
>       at
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
>       at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:210)
> Caused by: java.io.EOFException
>       at java.io.DataInputStream.readInt(Unknown Source)
>       at
> org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
>       at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
>
> The namenode log (logging level = INFO) shows the following a few seconds
> either side of the above timestamps. It could be relevant, or it could
> just be a coincidence:
>
> 2011-05-12 14:36:40,005 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 57 on 9000 caught: java.nio.channels.ClosedChannelException
>       at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(Unknown Source)
>       at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
>       at org.apache.hadoop.ipc.Server.channelWrite(Server.java:1213)
>       at org.apache.hadoop.ipc.Server.access$1900(Server.java:77)
>       at
> org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:622)
>       at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:686)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:997)
>
> The jobtracker does, however, have an entry that correlates with the
> tasktracker's:
>
> 2011-05-12 14:36:39,781 INFO org.apache.hadoop.mapred.TaskInProgress: Error
> from attempt_201105090819_059_m_0038_0: java.io.IOException: Call to
> namenode/10.10.10.10:9000 failed on local exception: java.io.EOFException
>       at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
>       at org.apache.hadoop.ipc.Client.call(Client.java:743)
>       at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
>       at $Proxy1.getProtocolVersion(Unknown Source)
>       at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
>       at
> org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:105)
>       at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:208)
>       at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:169)
>       at
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
>       at
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
>       at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
>       at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
>       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
>       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
>       at org.apache.hadoop.mapred.Child.main(Child.java:157)
> Caused by: java.io.EOFException
>       at java.io.DataInputStream.readInt(Unknown Source)
>       at
> org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
>       at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
>
> Can anyone give me any pointers on how to start troubleshooting this issue?
> It's very sporadic, and we haven't yet been able to reproduce it in our
> lab. After looking through the mailing list archives, some of the
> suggestions revolve around raising the following settings:
>
> dfs.namenode.handler.count = 128 (currently 64)
> dfs.datanode.handler.count = 10 (currently 3)
> dfs.datanode.max.xcievers = 4096 (currently 256)
>
> Any pointers?
>
> Thanks in advance
>
> Sid Simmons
> Infrastructure Support Specialist
>
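
On the handler-count settings quoted above: before raising them, it may
be worth confirming what values the daemons actually loaded, since an
hdfs-site.xml that is not on the daemon's classpath silently falls back
to the defaults. A minimal sketch (again the class name is mine, and the
fallback values passed to getInt() are only my recollection of the 0.20
defaults, so treat them as assumptions):

    import org.apache.hadoop.conf.Configuration;

    // Prints the effective values of the three settings in question.
    // Run on the namenode host with the daemon's conf directory on the
    // classpath (e.g. via `hadoop ConfCheck`).
    public class ConfCheck {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml"); // picked up from the classpath
        // The second argument is only a fallback if the key is unset.
        System.out.println("dfs.namenode.handler.count = "
            + conf.getInt("dfs.namenode.handler.count", 10));
        System.out.println("dfs.datanode.handler.count = "
            + conf.getInt("dfs.datanode.handler.count", 3));
        System.out.println("dfs.datanode.max.xcievers = "
            + conf.getInt("dfs.datanode.max.xcievers", 256));
      }
    }

If the printed values do not match your hdfs-site.xml, the tuning never
took effect, which is worth knowing before you interpret any further
failures.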



-- 
Harsh J
