[ https://issues.apache.org/jira/browse/HADOOP-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491839 ]
Hairong Kuang commented on HADOOP-1263:
---------------------------------------
The annotation approach proposed in HADOOP-601 to provide a general retry
framework in RPC seems like a simple solution, but since it has not been
implemented, for this jira I plan to implement the retry mechanism only for
ClientProtocol, using the retry framework from HADOOP-997. Here is what I plan
to do (rough sketches follow the list):
1. Add an exponential backoff policy to RetryPolicies.
2. Create a retry proxy for the dfs client using the following
method-to-RetryPolicy hashmap:
* TRY-ONCE-THEN-FAIL: create, addBlock, complete
* EXPONENTIAL-BACKOFF: open, setReplication, abandonBlock,
abandonFileInProgress, reportBadBlocks, exists, isDir, getListing, getHints,
renewLease, getStats, getDatanodeReport, getBlockSize, getEditLogSize
* I still have not decided which retry policy to use for:
(1) rename, delete, and mkdirs, because these are not idempotent: if the
first call succeeds at the server but its response is lost to a timeout, a
retry will return false even though the operation actually succeeded;
(2) setSafeMode, refreshNodes, rollEditLog, rollFsImage, finalizeUpgrade,
and metaSave, because I still need time to read the code for these methods.
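To make 1 and 2 concrete, here are two rough sketches. They assume the
HADOOP-997 interface is org.apache.hadoop.io.retry.RetryPolicy with a method
boolean shouldRetry(Exception e, int retries), and that RetryProxy.create(Class,
Object, Map) exists and defaults unmapped methods to try-once-then-fail; the
class names, parameter values, and jitter detail below are illustrations on my
part, not committed code.

{code}
import java.util.Random;
import org.apache.hadoop.io.retry.RetryPolicy;

/**
 * Sketch of the exponential backoff policy for RetryPolicies (item 1).
 * Assumes the HADOOP-997 interface is boolean shouldRetry(Exception, int);
 * the actual interface may differ.
 */
class ExponentialBackoffRetry implements RetryPolicy {
  private final int maxRetries;        // give up after this many attempts
  private final long baseSleepMillis;  // first sleep interval, must be > 0
  private final Random random = new Random();

  ExponentialBackoffRetry(int maxRetries, long baseSleepMillis) {
    this.maxRetries = maxRetries;
    this.baseSleepMillis = baseSleepMillis;
  }

  public boolean shouldRetry(Exception e, int retries) throws Exception {
    if (retries >= maxRetries) {
      throw e;  // out of attempts: surface the original failure
    }
    // Double the sleep on each attempt, plus jitter so that 1000+ tasks
    // that timed out together do not all hit the namenode again at once.
    long sleep = baseSleepMillis * (1L << retries)
        + random.nextInt((int) baseSleepMillis);
    Thread.sleep(sleep);
    return true;
  }
}
{code}

For item 2, the proxy wiring might then look roughly like this:

{code}
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.dfs.ClientProtocol;
import org.apache.hadoop.io.retry.RetryPolicy;
import org.apache.hadoop.io.retry.RetryProxy;

class NamenodeRetrySketch {
  /**
   * Wraps the raw namenode RPC proxy so the idempotent calls back off and
   * retry; create, addBlock, and complete are deliberately left unmapped so
   * they keep the (assumed) default try-once-then-fail behavior.
   */
  static ClientProtocol withRetries(ClientProtocol rawNamenode) {
    RetryPolicy backoff = new ExponentialBackoffRetry(5, 1000L);
    Map<String, RetryPolicy> methodNameToPolicyMap =
        new HashMap<String, RetryPolicy>();
    for (String method : new String[] {
        "open", "setReplication", "abandonBlock", "abandonFileInProgress",
        "reportBadBlocks", "exists", "isDir", "getListing", "getHints",
        "renewLease", "getStats", "getDatanodeReport", "getBlockSize",
        "getEditLogSize"}) {
      methodNameToPolicyMap.put(method, backoff);
    }
    return (ClientProtocol) RetryProxy.create(
        ClientProtocol.class, rawNamenode, methodNameToPolicyMap);
  }
}
{code}

The undecided methods would stay out of the map until the idempotence
questions above are settled.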
Any suggestion is welcome!
> retry logic when dfs exists or open fails temporarily, e.g. because of timeout
> -------------------------------------------------------------------------------
>
> Key: HADOOP-1263
> URL: https://issues.apache.org/jira/browse/HADOOP-1263
> Project: Hadoop
> Issue Type: Improvement
> Components: dfs
> Affects Versions: 0.12.3
> Reporter: Christian Kunz
> Assigned To: Hairong Kuang
>
> Sometimes, when many (e.g. 1000+) map tasks start at about the same time and
> require supporting files from the filecache, some of them fail because of rpc
> timeouts. With only the default number of 10 handlers on the namenode, the
> probability is high that the whole job fails (see HADOOP-1182). It is much
> better with a higher number of handlers, but some map tasks still fail.
> This could be avoided if rpc clients retried after a timeout instead of
> immediately throwing an exception.
> Examples of exceptions:
> java.net.SocketTimeoutException: timed out waiting for rpc response
> at org.apache.hadoop.ipc.Client.call(Client.java:473)
> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
> at org.apache.hadoop.dfs.$Proxy1.exists(Unknown Source)
> at org.apache.hadoop.dfs.DFSClient.exists(DFSClient.java:320)
> at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.exists(DistributedFileSystem.java:170)
> at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:125)
> at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
> at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
> at org.apache.hadoop.filecache.DistributedCache.createMD5(DistributedCache.java:327)
> at org.apache.hadoop.filecache.DistributedCache.ifExistsAndFresh(DistributedCache.java:253)
> at org.apache.hadoop.filecache.DistributedCache.localizeCache(DistributedCache.java:169)
> at org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:86)
> at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:117)
> java.net.SocketTimeoutException: timed out waiting for rpc response
> at org.apache.hadoop.ipc.Client.call(Client.java:473)
> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
> at org.apache.hadoop.dfs.$Proxy1.open(Unknown Source)
> at org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:511)
> at org.apache.hadoop.dfs.DFSClient$DFSInputStream.<init>(DFSClient.java:498)
> at org.apache.hadoop.dfs.DFSClient.open(DFSClient.java:207)
> at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:129)
> at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
> at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:82)
> at org.apache.hadoop.fs.ChecksumFileSystem.copyToLocalFile(ChecksumFileSystem.java:577)
> at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:766)
> at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:370)
> at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:877)
> at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:545)
> at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:913)
> at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:1603)