[
https://issues.apache.org/jira/browse/HADOOP-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492422
]
dhruba borthakur commented on HADOOP-1263:
------------------------------------------
I think the create() and cleanup() RPCs can be retried too. In fact, the
DFSClient, in its current incarnation, does retry the create() RPC three times
and the cleanup() RPC indefinitely. I believe that it is safe to retry them
because one of two things must have happened:
1. The first attempt never reached the namenode. The retried second attempt
can reach the namenode and complete successfully.
2. The first attempt was processed by the namenode successfully, but the
response did not reach the DFSClient. The DFSClient can retry, and the retries
will fail, so the entire operation fails anyway.
Either way, it is safe to retry the create() and cleanup() RPCs.
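To make the argument concrete, here is a minimal sketch of what such
client-side retry logic could look like. The RetriableCall interface, the
retryOnTimeout name, and the attempt count are illustrative assumptions, not
the actual DFSClient code:
{code:java}
import java.io.IOException;
import java.net.SocketTimeoutException;

// Illustrative sketch only: not the actual DFSClient implementation;
// the interface and retry policy here are hypothetical.
public class RpcRetry {

    /** One RPC attempt that may time out. */
    interface RetriableCall<T> {
        T call() throws IOException;
    }

    /**
     * Retries a call up to maxAttempts times, but only when the failure
     * is a socket timeout; any other IOException propagates immediately.
     * Per the argument above, this is safe for the create()/cleanup()
     * RPCs: either the first attempt never reached the namenode (so a
     * retry can succeed), or it did and only the response was lost (so
     * the retries fail and the whole operation fails anyway).
     */
    static <T> T retryOnTimeout(RetriableCall<T> rpc, int maxAttempts)
            throws IOException {
        SocketTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return rpc.call();
            } catch (SocketTimeoutException e) {
                last = e; // timed out: fall through and try again
            }
        }
        throw last; // requires maxAttempts >= 1, so last is non-null here
    }
}
{code}
A caller would wrap each flaky RPC in a RetriableCall, e.g.
retryOnTimeout(existsCall, 3); the key design point is that only timeouts are
retried, so genuine namenode errors still surface on the first attempt.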
> retry logic when dfs exist or open fails temporarily, e.g because of timeout
> ----------------------------------------------------------------------------
>
> Key: HADOOP-1263
> URL: https://issues.apache.org/jira/browse/HADOOP-1263
> Project: Hadoop
> Issue Type: Improvement
> Components: dfs
> Affects Versions: 0.12.3
> Reporter: Christian Kunz
> Assigned To: Hairong Kuang
>
> Sometimes, when many (e.g. 1000+) map tasks start at about the same time and
> require supporting files from the file cache, some of them fail because of
> RPC timeouts. With only the default number of 10 handlers on the namenode,
> the probability is high that the whole job fails (see HADOOP-1182). It is
> much better with a higher number of handlers, but some map tasks still fail.
> This could be avoided if RPC clients retried on a timeout before throwing an
> exception.
> Examples of exceptions:
> java.net.SocketTimeoutException: timed out waiting for rpc response
> at org.apache.hadoop.ipc.Client.call(Client.java:473)
> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
> at org.apache.hadoop.dfs.$Proxy1.exists(Unknown Source)
> at org.apache.hadoop.dfs.DFSClient.exists(DFSClient.java:320)
> at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.exists(DistributedFileSystem.java:170)
> at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:125)
> at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
> at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
> at org.apache.hadoop.filecache.DistributedCache.createMD5(DistributedCache.java:327)
> at org.apache.hadoop.filecache.DistributedCache.ifExistsAndFresh(DistributedCache.java:253)
> at org.apache.hadoop.filecache.DistributedCache.localizeCache(DistributedCache.java:169)
> at org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:86)
> at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:117)
> java.net.SocketTimeoutException: timed out waiting for rpc response
> at org.apache.hadoop.ipc.Client.call(Client.java:473)
> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
> at org.apache.hadoop.dfs.$Proxy1.open(Unknown Source)
> at org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:511)
> at org.apache.hadoop.dfs.DFSClient$DFSInputStream.<init>(DFSClient.java:498)
> at org.apache.hadoop.dfs.DFSClient.open(DFSClient.java:207)
> at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:129)
> at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
> at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:82)
> at org.apache.hadoop.fs.ChecksumFileSystem.copyToLocalFile(ChecksumFileSystem.java:577)
> at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:766)
> at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:370)
> at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:877)
> at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:545)
> at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:913)
> at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:1603)