retry logic when DFS exists() or open() fails temporarily, e.g. because of a timeout
----------------------------------------------------------------------------

                 Key: HADOOP-1263
                 URL: https://issues.apache.org/jira/browse/HADOOP-1263
             Project: Hadoop
          Issue Type: Improvement
          Components: dfs
    Affects Versions: 0.12.3
            Reporter: Christian Kunz


Sometimes, when many (e.g. 1000+) map tasks start at about the same time and 
require supporting files from the filecache, some of them fail because of RPC 
timeouts. With only the default of 10 handler threads on the namenode, the 
probability is high that the whole job fails (see HADOOP-1182). A higher number 
of handlers helps considerably, but some map tasks still fail.

This could be avoided if RPC clients retried after encountering a timeout 
instead of immediately throwing an exception.
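
As an illustration only (the class and method names below are hypothetical and 
not part of the Hadoop code base), a client-side wrapper along these lines could 
retry a DFS call such as exists() or open() a bounded number of times before 
giving up:

import java.io.IOException;
import java.net.SocketTimeoutException;

public class RetryOnTimeout {

  // A DFS call that may throw IOException (e.g. fs.exists(path) or fs.open(path)).
  interface RpcCall<T> {
    T run() throws IOException;
  }

  // Runs the call, retrying up to maxRetries times when a SocketTimeoutException
  // is raised; any other IOException is rethrown immediately.
  static <T> T withRetries(RpcCall<T> call, int maxRetries, long sleepMillis)
      throws IOException {
    SocketTimeoutException last = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return call.run();
      } catch (SocketTimeoutException e) {
        last = e;  // remember the failure and wait before the next attempt
        try {
          Thread.sleep(sleepMillis);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw new IOException("interrupted while waiting to retry");
        }
      }
    }
    throw last;  // all attempts timed out
  }
}

The caller (for instance the DistributedCache or TaskTracker code paths in the 
traces below) would wrap the failing exists()/open() invocation in an anonymous 
RpcCall, so a transient namenode overload no longer kills the task on the first 
timeout.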

Examples of exceptions:

java.net.SocketTimeoutException: timed out waiting for rpc response
        at org.apache.hadoop.ipc.Client.call(Client.java:473)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
        at org.apache.hadoop.dfs.$Proxy1.exists(Unknown Source)
        at org.apache.hadoop.dfs.DFSClient.exists(DFSClient.java:320)
        at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.exists(DistributedFileSystem.java:170)
        at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:125)
        at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
        at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
        at org.apache.hadoop.filecache.DistributedCache.createMD5(DistributedCache.java:327)
        at org.apache.hadoop.filecache.DistributedCache.ifExistsAndFresh(DistributedCache.java:253)
        at org.apache.hadoop.filecache.DistributedCache.localizeCache(DistributedCache.java:169)
        at org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:86)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:117)

java.net.SocketTimeoutException: timed out waiting for rpc response
        at org.apache.hadoop.ipc.Client.call(Client.java:473)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
        at org.apache.hadoop.dfs.$Proxy1.open(Unknown Source)
        at org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:511)
        at org.apache.hadoop.dfs.DFSClient$DFSInputStream.<init>(DFSClient.java:498)
        at org.apache.hadoop.dfs.DFSClient.open(DFSClient.java:207)
        at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:129)
        at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
        at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:82)
        at org.apache.hadoop.fs.ChecksumFileSystem.copyToLocalFile(ChecksumFileSystem.java:577)
        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:766)
        at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:370)
        at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:877)
        at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:545)
        at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:913)
        at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:1603)


