[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940453#comment-14940453
 ] 

Anubhav Dhoot commented on MAPREDUCE-6302:
------------------------------------------

In the old code we do not preempt if either the headroom or the assigned 
maps are enough to run a mapper, so the early out is consistent with the old 
preemption. But the new preemption does not have to use the same conditions. 
Since we are using it as a way to get out of deadlocks, I would think 
preempting irrespective of how many mappers are running is 
(a) safer and simpler to reason about, since it is purely time based - we do 
not have to second-guess whether we are missing some other cause of deadlock 
apart from incorrect headroom.
(b) better for overall throughput in cases like the one Jason mentioned. 
Having a large timeout is the safety lever for controlling the aggressiveness 
of the preemption.
Factoring in slow start in a subsequent jira seems like a good idea to me. I 
can also see reasons not to factor it in and to leave slow start purely as a 
heuristic for starting reducers.
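
To make the distinction concrete, here is a minimal sketch (hypothetical 
method and parameter names; this is not the actual patch) of the 
headroom-gated early out versus a purely time-based check:

{code:java}
// Minimal sketch only -- hypothetical names, not the MAPREDUCE-6302 patch.
public class ReducePreemptionSketch {

  /**
   * Old behavior: skip preemption if the (possibly stale) headroom or an
   * already-assigned map suggests the pending map can still be satisfied.
   * If the headroom is incorrect, this early out can mask a deadlock.
   */
  static boolean oldShouldPreempt(long headroomMb, long mapRequestMb,
                                  int assignedMaps, int pendingMaps) {
    if (pendingMaps == 0) {
      return false;
    }
    // Early out: trust headroom or running maps, even if headroom is stale.
    if (headroomMb >= mapRequestMb || assignedMaps > 0) {
      return false;
    }
    return true;
  }

  /**
   * New behavior argued for above: once a map has waited longer than the
   * (large) timeout, preempt a reducer regardless of headroom or how many
   * mappers are currently running.
   */
  static boolean newShouldPreempt(int pendingMaps, long mapWaitingSinceMs,
                                  long nowMs, long preemptionTimeoutMs) {
    return pendingMaps > 0
        && nowMs - mapWaitingSinceMs >= preemptionTimeoutMs;
  }
}
{code}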

> Incorrect headroom can lead to a deadlock between map and reduce allocations 
> -----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6302
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6302
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: mai shurong
>            Assignee: Karthik Kambatla
>            Priority: Critical
>         Attachments: AM_log_head100000.txt.gz, AM_log_tail100000.txt.gz, 
> log.txt, mr-6302-1.patch, mr-6302-2.patch, mr-6302-3.patch, mr-6302-4.patch, 
> mr-6302-prelim.patch, queue_with_max163cores.png, queue_with_max263cores.png, 
> queue_with_max333cores.png
>
>
> I submitted a big job with 500 maps and 350 reduces to a queue (fair 
> scheduler) with a maximum of 300 cores. Once the job had run 100% of its 
> maps, 300 reduces occupied all 300 cores in the queue. Then a map failed 
> and was retried, waiting for a core, while the 300 reduces were waiting 
> for the failed map to finish. So a deadlock occurred: the job was blocked, 
> and later jobs in the queue could not run because no cores were available 
> in the queue.
> I think there is a similar issue for the memory of a queue.


