[ https://issues.apache.org/jira/browse/HADOOP-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491839 ]
Hairong Kuang commented on HADOOP-1263:
---------------------------------------
The annotation approach proposed in HADOOP-601 to provide a general retry
framework in RPC seems like a simple solution, but since it has not been
implemented, for this jira I plan to implement the retry mechanism only for
ClientProtocol, using the retry framework from HADOOP-997. Here is what I plan
to do (rough sketches follow the list):
1. Add an exponential backoff policy to RetryPolicies.
2. Create a retry proxy for the dfs client using the following
method-to-RetryPolicy hashmap:
* TRY-ONCE-THEN-FAIL: create, addBlock, complete
* EXPONENTIAL-BACKOFF: open, setReplication, abandonBlock,
abandonFileInProgress, reportBadBlocks, exists, isDir, getListing, getHints,
renewLease, getStats, getDatanodeReport, getBlockSize, getEditLogSize
* I still have not decided which retry policy to use for:
(1) rename, delete, and mkdirs, because these are not idempotent: if the
first call succeeds at the server but its response is lost to a timeout, a
retry will return false even though the operation actually succeeded;
(2) setSafeMode, refreshNodes, rollEditLog, rollFsImage, finalizeUpgrade,
and metaSave, because I still need time to read the code for these methods.
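To make 1 and 2 concrete, here are two rough sketches. They assume the
HADOOP-997 interface is org.apache.hadoop.io.retry.RetryPolicy with a method
boolean shouldRetry(Exception e, int retries), and that RetryProxy.create(Class,
Object, Map) exists and defaults unmapped methods to try-once-then-fail; the
class names, parameter values, and jitter detail below are illustrations on my
part, not committed code.

{code}
import java.util.Random;
import org.apache.hadoop.io.retry.RetryPolicy;

/**
 * Sketch of the exponential backoff policy for RetryPolicies (item 1).
 * Assumes the HADOOP-997 interface is boolean shouldRetry(Exception, int);
 * the actual interface may differ.
 */
class ExponentialBackoffRetry implements RetryPolicy {
  private final int maxRetries;        // give up after this many attempts
  private final long baseSleepMillis;  // first sleep interval, must be > 0
  private final Random random = new Random();

  ExponentialBackoffRetry(int maxRetries, long baseSleepMillis) {
    this.maxRetries = maxRetries;
    this.baseSleepMillis = baseSleepMillis;
  }

  public boolean shouldRetry(Exception e, int retries) throws Exception {
    if (retries >= maxRetries) {
      throw e;  // out of attempts: surface the original failure
    }
    // Double the sleep on each attempt, plus jitter so that 1000+ tasks
    // that timed out together do not all hit the namenode again at once.
    long sleep = baseSleepMillis * (1L << retries)
        + random.nextInt((int) baseSleepMillis);
    Thread.sleep(sleep);
    return true;
  }
}
{code}

For item 2, the proxy wiring might then look roughly like this:

{code}
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.dfs.ClientProtocol;
import org.apache.hadoop.io.retry.RetryPolicy;
import org.apache.hadoop.io.retry.RetryProxy;

class NamenodeRetrySketch {
  /**
   * Wraps the raw namenode RPC proxy so the idempotent calls back off and
   * retry; create, addBlock, and complete are deliberately left unmapped so
   * they keep the (assumed) default try-once-then-fail behavior.
   */
  static ClientProtocol withRetries(ClientProtocol rawNamenode) {
    RetryPolicy backoff = new ExponentialBackoffRetry(5, 1000L);
    Map<String, RetryPolicy> methodNameToPolicyMap =
        new HashMap<String, RetryPolicy>();
    for (String method : new String[] {
        "open", "setReplication", "abandonBlock", "abandonFileInProgress",
        "reportBadBlocks", "exists", "isDir", "getListing", "getHints",
        "renewLease", "getStats", "getDatanodeReport", "getBlockSize",
        "getEditLogSize"}) {
      methodNameToPolicyMap.put(method, backoff);
    }
    return (ClientProtocol) RetryProxy.create(
        ClientProtocol.class, rawNamenode, methodNameToPolicyMap);
  }
}
{code}

The undecided methods would stay out of the map until the idempotence
questions above are settled.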
Any suggestion is welcome!
> retry logic when dfs exists or open fails temporarily, e.g. because of timeout
> -------------------------------------------------------------------------------
>
> Key: HADOOP-1263
> URL: https://issues.apache.org/jira/browse/HADOOP-1263
> Project: Hadoop
> Issue Type: Improvement
> Components: dfs
> Affects Versions: 0.12.3
> Reporter: Christian Kunz
> Assigned To: Hairong Kuang
>
> Sometimes, when many (e.g. 1000+) map tasks start at about the same time and
> require supporting files from the filecache, some of them fail because of rpc
> timeouts. With only the default number of 10 handlers on the namenode, the
> probability is high that the whole job fails (see HADOOP-1182). It is much
> better with a higher number of handlers, but some map tasks still fail.
> This could be avoided if rpc clients retried after a timeout instead of
> immediately throwing an exception.
> Examples of exceptions:
> java.net.SocketTimeoutException: timed out waiting for rpc response
> at org.apache.hadoop.ipc.Client.call(Client.java:473)
> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
> at org.apache.hadoop.dfs.$Proxy1.exists(Unknown Source)
> at org.apache.hadoop.dfs.DFSClient.exists(DFSClient.java:320)
> at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.exists(DistributedFileSystem.java:170)
> at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:125)
> at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
> at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
> at org.apache.hadoop.filecache.DistributedCache.createMD5(DistributedCache.java:327)
> at org.apache.hadoop.filecache.DistributedCache.ifExistsAndFresh(DistributedCache.java:253)
> at org.apache.hadoop.filecache.DistributedCache.localizeCache(DistributedCache.java:169)
> at org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:86)
> at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:117)
> java.net.SocketTimeoutException: timed out waiting for rpc response
> at org.apache.hadoop.ipc.Client.call(Client.java:473)
> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
> at org.apache.hadoop.dfs.$Proxy1.open(Unknown Source)
> at org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:511)
> at org.apache.hadoop.dfs.DFSClient$DFSInputStream.<init>(DFSClient.java:498)
> at org.apache.hadoop.dfs.DFSClient.open(DFSClient.java:207)
> at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:129)
> at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
> at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:82)
> at org.apache.hadoop.fs.ChecksumFileSystem.copyToLocalFile(ChecksumFileSystem.java:577)
> at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:766)
> at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:370)
> at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:877)
> at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:545)
> at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:913)
> at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:1603)