[jira] [Commented] (MAPREDUCE-6513) MR job got hanged forever when one NM unstable for some time

chong chen (JIRA) Tue, 20 Oct 2015 12:20:14 -0700

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965589#comment-14965589 ]


chong chen commented on MAPREDUCE-6513:
---------------------------------------

How to account for task failures and how to reschedule tasks are two different 
things. I don't understand why we have to tie the two together; this seems to 
be a design limitation. Clearly, for this case, raising the priority is the 
best solution. Since the AM has already ramped up the reducers once (651 
reducers), repeating that process means ramping down the whole thing and 
gradually ramping up again, which generates another round of communication 
overhead between the AM and the RM/scheduler. 
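
To make the priority argument concrete, here is a toy sketch of the idea, not 
the actual AM or RM code. It assumes that container asks are served in order 
of a numeric priority where a lower number is served first; the class 
ToyAllocator, the function simulate, the task names, and the three priority 
values are all hypothetical and chosen only for illustration.

```python
import heapq
from itertools import count

# Illustrative priority values (assumptions, not taken from the Hadoop source).
# A lower number is served first in this toy model.
PRIORITY_RESCHEDULED_MAP = 5
PRIORITY_REDUCE = 10
PRIORITY_MAP = 20


class ToyAllocator:
    """Hypothetical stand-in for a scheduler that serves asks by priority."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker: keeps FIFO order within a priority

    def ask(self, task_id, priority):
        """Queue a container request for a task at the given priority."""
        heapq.heappush(self._heap, (priority, next(self._seq), task_id))

    def assign(self, free_containers):
        """Hand out up to free_containers containers, best priority first."""
        assigned = []
        while self._heap and len(assigned) < free_containers:
            _, _, task_id = heapq.heappop(self._heap)
            assigned.append(task_id)
        return assigned


def simulate(rescheduled_map_priority):
    """Two reducers are already asking; one killed map is rescheduled."""
    alloc = ToyAllocator()
    alloc.ask("reduce_1", PRIORITY_REDUCE)
    alloc.ask("reduce_2", PRIORITY_REDUCE)
    alloc.ask("map_7_retry", rescheduled_map_priority)
    return alloc.assign(free_containers=1)


# Rescheduled at the normal map priority, the map waits behind the reducers.
print(simulate(PRIORITY_MAP))
# Rescheduled at a raised priority, the map gets the next free container.
print(simulate(PRIORITY_RESCHEDULED_MAP))
```

In this model the rescheduled map is served ahead of the waiting reducers as 
soon as its priority is raised, with no need to ramp the reducers down first. 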

> MR job got hanged forever when one NM unstable for some time
> ------------------------------------------------------------
>
>                 Key: MAPREDUCE-6513
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6513
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, resourcemanager
>    Affects Versions: 2.7.0
>            Reporter: Bob
>            Assignee: Varun Saxena
>            Priority: Critical
>
> When a job with many tasks was in progress, one node became unstable due to 
> an OS issue. After the node became unstable, the maps on this node changed 
> to the KILLED state. 
> The maps that were running on the unstable node were rescheduled; all of 
> them are in the scheduled state, waiting for the RM to assign containers. 
> Ask requests for maps were seen until the node was good again (all of those 
> failed), and there were no ask requests after that. But the AM keeps 
> preempting the reducers (it's recycling).
> In the end, the reducers are waiting for the mappers to complete, and the 
> mappers didn't get containers.
> My question is:
> ===============
> Why were map requests not sent by the AM after the node recovered?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
