[ https://issues.apache.org/jira/browse/MAPREDUCE-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892397#action_12892397 ]

Dick King commented on MAPREDUCE-1967:
--------------------------------------

I certainly agree with [Doug's comment of 7/26/10 12:12|https://issues.apache.org/jira/browse/MAPREDUCE-1967?focusedCommentId=12892342&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12892342].
I invite him and others to submit proposed exceptions.

Having said that, a DFS quota overflow is worse than most failures.  Mapreduce 
prefers to reschedule failed tasks, so if a task fails because of a bug in its 
code that is triggered by its split, its doomed retries will happen sooner 
rather than later [although non-locally, on at least one of the four tries].  
In the case of DFS quotas the underlying cause of the failure doesn't go away 
either, but a retry is likely to succeed whenever some other task blows quota 
first and releases its space, so the failures spread across tasks instead of 
quickly exhausting the attempts of any one of them.
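For context, the quota at issue is the per-directory HDFS space quota set via 
dfsadmin; a sketch with an illustrative directory and limit (the names and 
sizes here are hypothetical, not taken from the issue):

```shell
# Hypothetical example: give /user/alice a 10 GB space quota.
# Space quotas are charged in raw bytes, so replication multiplies
# usage: with 3x replication, ~3.3 GB of job output fills this quota.
hadoop dfsadmin -setSpaceQuota 10g /user/alice

# Inspect the quota and current consumption for that directory.
hadoop fs -count -q /user/alice
```

Once the job's reducers collectively approach that limit, any reducer's next 
block allocation can be the one that trips the quota exception.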


Do people out there think we should have a one-strike policy, or maybe a small 
number of strikes, like five?
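For concreteness, the knob this effectively overrides is the per-task attempt 
limit, which a job can already lower itself (property name as in 0.20-era 
Hadoop; the default is 4 attempts):

```xml
<!-- In the job configuration: fail the job after a single failed
     attempt of any reduce task, instead of the default four. -->
<property>
  <name>mapred.reduce.max.attempts</name>
  <value>1</value>
</property>
```

The proposal here differs in that only quota failures, not all failures, would 
trigger the early job kill.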

> When a reducer fails on DFS quota, the job should fail immediately
> ------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1967
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1967
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Dick King
>
> Suppose an M/R job has so much output that the user is certain to exceed 
> their quota.  Then some of the reducers will succeed, but the job will get 
> into a state where the remaining reducers squabble over what space is left.  
> They will nibble at it until one reducer fails on quota.  Its output file 
> will be erased, and the other reducers will collectively consume the freed 
> space until one of _them_ fails on quota.  Since the incomplete reducer that 
> fails on quota is "chosen" randomly, the tasks accumulate their failures at 
> similar rates, and the system will have made a substantial, futile 
> investment.
> I would like to say that if a single reducer fails on DFS quota, the job 
> should be failed.  There may be a corner case that induces us to be less 
> stringent than this, but at the very least we shouldn't have to await four 
> failures by one task before shutting the job down.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
