[ 
https://issues.apache.org/jira/browse/HADOOP-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485678
 ] 

dhruba borthakur commented on HADOOP-1182:
------------------------------------------

That's what I thought earlier. But when I investigated, it appeared that most of 
the dropped calls are very recent; they are being discarded because there is no 
free space left in the incoming call queue. The message "Call queue overflow 
discarding oldest call" is logged when the queue overflows.
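
To illustrate the behaviour described above, here is a minimal sketch of a 
bounded queue that discards its oldest entry on overflow. It is not the actual 
Hadoop IPC server code; the class name DropOldestCallQueue and its methods are 
hypothetical and exist only for this example. It shows why, when calls arrive 
faster than they are handled and the queue is small, even the "oldest" call 
being discarded may have arrived only moments earlier.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical illustration only; not taken from Hadoop's ipc package. */
class DropOldestCallQueue<T> {
    private final Deque<T> calls = new ArrayDeque<>();
    private final int capacity;
    private long dropped = 0;

    DropOldestCallQueue(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /**
     * Enqueues a call. If the queue is already full, the oldest queued call
     * is discarded to make room. Returns the discarded call, or null if
     * nothing was discarded.
     */
    synchronized T offer(T call) {
        T discarded = null;
        if (calls.size() >= capacity) {
            discarded = calls.pollFirst(); // the head is the oldest entry
            dropped++;
        }
        calls.addLast(call);
        return discarded;
    }

    /** Removes and returns the oldest queued call, or null if empty. */
    synchronized T poll() {
        return calls.pollFirst();
    }

    synchronized int size() {
        return calls.size();
    }

    synchronized long droppedCount() {
        return dropped;
    }

    public static void main(String[] args) {
        DropOldestCallQueue<String> queue = new DropOldestCallQueue<>(2);
        queue.offer("call-1");
        queue.offer("call-2");
        // The queue is full, so this offer pushes out the oldest entry.
        String lost = queue.offer("call-3");
        System.out.println("discarded=" + lost
                + " dropped=" + queue.droppedCount());
    }
}
```

With a capacity of 2, the third offer evicts the first call even though it was 
queued only two operations earlier. A caller whose request is discarded this 
way never receives a response and would eventually see a timeout on its side.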


> DFS Scalability issue with filecache in large clusters
> ------------------------------------------------------
>
>                 Key: HADOOP-1182
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1182
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.12.1
>            Reporter: Christian Kunz
>
> When using filecache to distribute supporting files for map/reduce 
> applications in a 1000 node cluster, many map tasks fail because of 
> timeouts. There was no such problem using a 200 node cluster for the same 
> applications with comparable input data. Either the whole job fails because 
> of too many map failures, or even worse, some map tasks hang indefinitely.
> java.net.SocketTimeoutException: timed out waiting for rpc response
>       at org.apache.hadoop.ipc.Client.call(Client.java:473)
>       at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
>       at org.apache.hadoop.dfs.$Proxy1.exists(Unknown Source)
>       at org.apache.hadoop.dfs.DFSClient.exists(DFSClient.java:320)
>       at 
> org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.exists(DistributedFileSystem.java:170)
>       at 
> org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:125)
>       at 
> org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
>       at 
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
>       at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
>       at 
> org.apache.hadoop.filecache.DistributedCache.createMD5(DistributedCache.java:327)
>       at 
> org.apache.hadoop.filecache.DistributedCache.ifExistsAndFresh(DistributedCache.java:253)
>       at 
> org.apache.hadoop.filecache.DistributedCache.localizeCache(DistributedCache.java:169)
>       at 
> org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:86)
>       at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:117)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
