[ 
https://issues.apache.org/jira/browse/HADOOP-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542848
 ] 

Milind Bhandarkar commented on HADOOP-2141:
-------------------------------------------

As I mentioned earlier, we have been seeing this problem more and more over the 
last few weeks. One of the bugs Koji discovered is that DFSClient-to-datanode 
writes have no timeout, so if a datanode goes bad during a write, the entire 
task gets stuck. There are always going to be such corner cases, and proper 
speculative execution is the right solution to such problems.

I am a bit ambivalent about targeting this fix at 0.16 rather than 0.15.1, 
because

1. This will make us postpone debugging the *real* issues
2. Critical projects will have to wait for a long time for the 0.16 release to 
be deployed.

Weighing these two against each other, I would rather have this fix than 
debug *all* the issues that can possibly cause a task to get stuck. 
Thoughts?

> speculative execution start up condition based on completion time
> -----------------------------------------------------------------
>
>                 Key: HADOOP-2141
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2141
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.15.0
>            Reporter: Koji Noguchi
>            Assignee: Arun C Murthy
>             Fix For: 0.16.0
>
>
> We had one job with speculative execution hang.
> 4 reduce tasks were stuck at 95% completion because of a bad disk. 
> Devaraj pointed out 
> bq. One of the conditions that must be met for launching a speculative 
> instance of a task is that it must be at least 20% behind the average 
> progress, and this is not true here.
> It would be nice if speculative execution also starts up when tasks stop 
> making progress.
> Devaraj suggested 
> bq. Maybe, we should introduce a condition for average completion time for 
> tasks in the speculative execution check. 
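
To make the two conditions in the description concrete, here is a minimal 
sketch in Java. It is not the JobTracker code: the class name, the method 
names, and the SLOWDOWN_FACTOR of 2.0 are all hypothetical, and reading 
"20% behind the average progress" as an absolute gap of 0.20 is an 
assumption. It only illustrates why a task stuck at 95% never meets the 
progress-based condition, while a completion-time condition would catch it.

```java
import java.util.List;

/** Hypothetical sketch of the two speculation checks; not Hadoop's actual code. */
class SpeculationSketch {
    // From the issue description: a task must be at least 20% behind the average progress.
    // Treating this as an absolute gap in progress (0.0 to 1.0) is an assumption.
    static final double PROGRESS_GAP = 0.20;

    // Assumed threshold: speculate once a task has run twice as long as the
    // average completed task. The value 2.0 is illustrative only.
    static final double SLOWDOWN_FACTOR = 2.0;

    /** Progress-based condition as quoted in the description. */
    static boolean lagsBehindAverage(double taskProgress, double averageProgress) {
        return taskProgress <= averageProgress - PROGRESS_GAP;
    }

    /** Suggested condition: running time compared with average completion time. */
    static boolean exceedsAverageCompletionTime(long runningMillis, List<Long> completedMillis) {
        if (completedMillis.isEmpty()) {
            return false; // no finished tasks yet, so nothing to compare against
        }
        double sum = 0;
        for (long millis : completedMillis) {
            sum += millis;
        }
        double average = sum / completedMillis.size();
        return runningMillis > SLOWDOWN_FACTOR * average;
    }

    /** Speculate if either condition holds. */
    static boolean shouldSpeculate(double taskProgress, double averageProgress,
                                   long runningMillis, List<Long> completedMillis) {
        return lagsBehindAverage(taskProgress, averageProgress)
                || exceedsAverageCompletionTime(runningMillis, completedMillis);
    }
}
```

With a task at 0.95 progress and an average progress of 0.97, 
lagsBehindAverage returns false, which matches the hang reported above. If 
that task has been running for 60 minutes while finished tasks took about 10 
minutes each, exceedsAverageCompletionTime returns true and a speculative 
instance would be launched.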

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
