[ 
https://issues.apache.org/jira/browse/HADOOP-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491792
 ] 

Hairong Kuang commented on HADOOP-1263:
---------------------------------------

On second thought, I feel that simply adding a retry mechanism to ipc may not 
work because not all server operations are idempotent.  One option is to 
add an additional parameter to each call indicating whether the call should be 
retried when it times out. But it seems hard for our current rpc framework 
to support this feature.
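
To make the per-call flag idea concrete, here is a minimal sketch. It is not the existing ipc code: the class RetrySketch, the method callWithRetry, and the parameters idempotent and maxAttempts are hypothetical names chosen for illustration. It assumes a timeout surfaces as java.net.SocketTimeoutException, as in the stack traces quoted below, and it retries only when the caller has declared the operation idempotent.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of org.apache.hadoop.ipc.
final class RetrySketch {

    /**
     * Runs the call once. If it times out and the caller marked it as
     * idempotent, retries until maxAttempts attempts have been made.
     * Non-idempotent calls are never retried, so a timeout is rethrown
     * after the first attempt.
     */
    static <T> T callWithRetry(Callable<T> call, boolean idempotent, int maxAttempts)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        // Assumption: only timeouts are retried; other exceptions propagate.
        int allowedAttempts = idempotent ? maxAttempts : 1;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (SocketTimeoutException e) {
                if (attempt >= allowedAttempts) {
                    throw e; // out of attempts, or retry not permitted
                }
            }
        }
    }
}
```

Under this sketch a read-only operation such as exists or open could be wrapped with idempotent set to true, while an operation that changes namenode state would pass false and keep today's fail-on-timeout behaviour.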

> retry logic when dfs exist or open fails temporarily, e.g because of timeout
> ----------------------------------------------------------------------------
>
>                 Key: HADOOP-1263
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1263
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Christian Kunz
>         Assigned To: Hairong Kuang
>
> Sometimes, when many (e.g. 1000+) map jobs start at about the same time and 
> require supporting files from filecache, it happens that some map tasks fail 
> because of rpc timeouts. With only the default number of 10 handlers on the 
> namenode, the probability is high that the whole job fails (see HADOOP-1182). 
> It is much better with a higher number of handlers, but some map tasks still 
> fail.
> This could be avoided if rpc clients retried when encountering a timeout 
> before throwing an exception.
> Examples of exceptions:
> java.net.SocketTimeoutException: timed out waiting for rpc response
> at org.apache.hadoop.ipc.Client.call(Client.java:473)
> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
> at org.apache.hadoop.dfs.$Proxy1.exists(Unknown Source)
> at org.apache.hadoop.dfs.DFSClient.exists(DFSClient.java:320)
> at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.exists(DistributedFileSystem.java:170)
> at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:125)
> at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
> at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
> at org.apache.hadoop.filecache.DistributedCache.createMD5(DistributedCache.java:327)
> at org.apache.hadoop.filecache.DistributedCache.ifExistsAndFresh(DistributedCache.java:253)
> at org.apache.hadoop.filecache.DistributedCache.localizeCache(DistributedCache.java:169)
> at org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:86)
> at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:117)
> java.net.SocketTimeoutException: timed out waiting for rpc response
>         at org.apache.hadoop.ipc.Client.call(Client.java:473)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
>         at org.apache.hadoop.dfs.$Proxy1.open(Unknown Source)
>         at org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:511)
>         at org.apache.hadoop.dfs.DFSClient$DFSInputStream.<init>(DFSClient.java:498)
>         at org.apache.hadoop.dfs.DFSClient.open(DFSClient.java:207)
>         at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:129)
>         at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
>         at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:82)
>         at org.apache.hadoop.fs.ChecksumFileSystem.copyToLocalFile(ChecksumFileSystem.java:577)
>         at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:766)
>         at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:370)
>         at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:877)
>         at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:545)
>         at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:913)
>         at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:1603)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
