[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939999#comment-14939999
 ] 

Karthik Kambatla commented on MAPREDUCE-6302:
---------------------------------------------

Thanks for taking a look so quickly, Jason.

bq. Do we really want to avoid any kind of preemption if there's a map running?
Fair question. Anubhav had the same comment as well. The other thing to
consider here is slowstart: with slowstart set to a low value (say 0.5),
reducers shouldn't be preempted unless more than half the mappers are still
pending. We could factor slowstart into the calculations here. We need to
decide whether that is worth the additional complication, given we are just
trying to avoid a deadlock here. Maybe file a follow-up and address it there?
While looking at this code, I noticed a few other things that could be
simplified/fixed, e.g. {{preemptReducer}} in my patches.
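
For concreteness, a rough sketch of how slowstart could factor into the
decision (illustrative only; these names are not taken from the attached
patches):

{code:java}
// Illustrative sketch, not from the attached patches. With a slowstart
// threshold s, reducers are first scheduled once s * totalMaps maps have
// completed, i.e. once at most (1 - s) * totalMaps maps remain pending.
// Preempting reducers would then only be justified once that condition no
// longer holds, e.g. because failed maps were re-added to the pending set.
static boolean shouldPreemptReducers(int totalMaps, int pendingMaps,
    float slowstart) {
  return pendingMaps > (1.0f - slowstart) * totalMaps;
}
{code}

With slowstart = 0.5 this reduces to "preempt only when more than half the
maps are pending", matching the example above.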

Will address your comments on the patch once we decide how to proceed with
the discussion above.


> Incorrect headroom can lead to a deadlock between map and reduce allocations 
> -----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6302
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6302
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: mai shurong
>            Assignee: Karthik Kambatla
>            Priority: Critical
>         Attachments: AM_log_head100000.txt.gz, AM_log_tail100000.txt.gz, 
> log.txt, mr-6302-1.patch, mr-6302-2.patch, mr-6302-3.patch, mr-6302-4.patch, 
> mr-6302-prelim.patch, queue_with_max163cores.png, queue_with_max263cores.png, 
> queue_with_max333cores.png
>
>
> I submitted a big job with 500 maps and 350 reduces to a fair-scheduler
> queue with a 300-core maximum. Once the job's maps had all completed, the
> 300 reduces occupied all 300 cores in the queue. Then a map failed and was
> retried, waiting for a core, while the 300 reduces were waiting for the
> failed map to finish. So a deadlock occurred: the job was blocked, and
> later jobs in the queue could not run because no cores were available in
> the queue.
> I think there is a similar issue for the memory limit of a queue.
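
For reference, the AM-side check that incorrect headroom defeats looks
roughly like the following (simplified sketch; only {{preemptReducer}} is
mentioned in the comments above, the other names are illustrative):

{code:java}
// Simplified sketch of the allocator's ramp-down decision. If the RM
// reports incorrect headroom (e.g. positive while the queue's 300-core
// max is fully held by reducers), this branch is never taken, no reducer
// is preempted, and the retried map waits forever: the deadlock above.
if (pendingMaps > 0 && headroomCores < mapRequestCores) {
  preemptReducer();  // free a core so the failed map attempt can run
}
{code}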



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)