[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900283#comment-14900283
 ] 

Xianyin Xin commented on MAPREDUCE-6485:
----------------------------------------

Hi [~kasha], the two are similar but not the same. In MAPREDUCE-6302, a map is 
marked as failed, so it can raise another resource request with higher priority 
than the reduces and thus consume the preempted resource. In this case, the map 
is killed due to a timeout. In the current implementation, the killed map is not 
marked as failed, so it won't raise a new resource request to retry the map. 
The retry therefore never happens, and the whole job hangs.
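
To make the difference concrete, here is a minimal sketch of the priority 
reasoning. It is a toy model, not the ApplicationMaster code: the class name, 
the method names and the numeric priority values are assumptions chosen for 
illustration (a smaller number means a higher priority), and the real constants 
may differ between Hadoop versions.

```java
// Toy model only; the names and values are hypothetical, not Hadoop APIs.
class PreemptionSketch {
    // Assumed priorities: a smaller number means a higher priority.
    static final int FAST_FAIL_MAP = 5;  // request raised for a map marked as failed
    static final int REDUCE = 10;        // request for a reduce
    static final int MAP = 20;           // request for an ordinary map attempt

    /** Priority of the pending map request, given how the earlier attempt ended. */
    static int pendingMapPriority(boolean previousAttemptMarkedFailed) {
        return previousAttemptMarkedFailed ? FAST_FAIL_MAP : MAP;
    }

    /** A pending map can take a preempted resource only if it outranks the reduces. */
    static boolean canClaimPreemptedResource(boolean previousAttemptMarkedFailed) {
        return pendingMapPriority(previousAttemptMarkedFailed) < REDUCE;
    }

    public static void main(String[] args) {
        // MAPREDUCE-6302 case: the attempt is marked as failed.
        System.out.println("marked failed: " + canClaimPreemptedResource(true));
        // This case: the attempt is killed on timeout and not marked as failed.
        System.out.println("killed only:   " + canClaimPreemptedResource(false));
    }
}
```

Under these assumptions the first call returns true and the second returns 
false, which matches the hang described in this issue: the pending map never 
outranks the running reduces.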

Reopening the issue.

> MR job hanged forever because all resources are taken up by reducers and the 
> last map attempt never get resource to run
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6485
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6485
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster
>    Affects Versions: 3.0.0, 2.4.1, 2.6.0, 2.7.1
>            Reporter: Bob
>            Priority: Critical
>
> The scenario is as follows:
> With mapreduce.job.reduce.slowstart.completedmaps=0.8 configured, reduces 
> take resources and start to run before all the maps have finished. 
> It can happen that all the resources are taken up by running reduces 
> while one map is still unfinished. 
> Under this condition, the last map has two task attempts.
> The first attempt was killed due to a timeout (mapreduce.task.timeout), 
> and its state transitioned from RUNNING to FAIL_CONTAINER_CLEANUP, so the 
> failed map attempt would not be restarted. 
> The second attempt, which was started because map task speculation is 
> enabled, is pending in the UNASSIGNED state because no resources are available. 
> But the second map attempt's request has lower priority than the reduces, so 
> preemption does not happen.
> As a result, the reduces cannot finish because one map is left, 
> and the last map hangs because no resources are available, so the job 
> never finishes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
