[jira] [Commented] (MAPREDUCE-6302) Incorrect headroom can lead to a deadlock between map and reduce allocations

2015-09-30 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14938321#comment-14938321
 ] 

Anubhav Dhoot commented on MAPREDUCE-6302:
--

The patch looks mostly good.

Why does availableResourceForMap not consider assignedRequests.maps after the
patch?

The earlier comments had some more description that would be useful to
preserve, perhaps as a heading for both sets of values describing when
preemption kicks in. For example, the earlier description: "The threshold in
terms of seconds after which an unsatisfied mapper request triggers reducer
preemption to free space."

Would UNCONDITIONAL be better than FORCE, given that the other one isn't an
optional preemption once it kicks in?

Consider reverting:
duration -> allocationDelayThresholdMs
forcePreemptThreshold -> forcePreemptThresholdSec
reducerPreemptionHoldMs -> reducerNoHeadroomPreemptionMs

resourceLimit is a weird name for the headroom in the Allocation. Consider
another JIRA for fixing that.


> Incorrect headroom can lead to a deadlock between map and reduce allocations 
> -
>
> Key: MAPREDUCE-6302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6302
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: mai shurong
>Assignee: Karthik Kambatla
>Priority: Critical
> Attachments: AM_log_head10.txt.gz, AM_log_tail10.txt.gz, 
> log.txt, mr-6302-1.patch, mr-6302-2.patch, mr-6302-prelim.patch, 
> queue_with_max163cores.png, queue_with_max263cores.png, 
> queue_with_max333cores.png
>
>
> I submit a big job, which has 500 maps and 350 reduces, to a
> queue (FairScheduler) with a maximum of 300 cores. Once the big MapReduce job
> has run 100% of its maps, the 300 reduces occupy all 300 cores in the queue.
> Then a map fails and retries, waiting for a core, while the 300 reduces are
> waiting for the failed map to finish, so a deadlock occurs. As a result, the
> job is blocked, and later jobs in the queue cannot run because no cores are
> available in the queue.
> I think there is a similar issue for the memory of a queue.





[jira] [Commented] (MAPREDUCE-6302) Incorrect headroom can lead to a deadlock between map and reduce allocations

2015-09-30 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14938799#comment-14938799
 ] 

Karthik Kambatla commented on MAPREDUCE-6302:
-

In the previous version, {{availableResourceForMap}} was calculated as
{{resourceLimit (= headroom + map_resources + reduce_resources) - map_resources
- reduce_resources + reduce_resources_being_preempted}}. That was unnecessarily
roundabout; this patch updates it to {{headroom +
reduce_resources_being_preempted}}.
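
For illustration only, here is a minimal sketch of the two calculations using
YARN's {{Resources}} helpers. This is not the actual {{RMContainerAllocator}}
code; the class, method, and parameter names are made up.

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class HeadroomSketch {
  // Old calculation: resourceLimit already folds in the assigned map and reduce
  // resources, so subtracting them back out only recovers the headroom.
  static Resource availableForMapsOld(Resource resourceLimit, Resource assignedMaps,
      Resource assignedReduces, Resource preemptingReduces) {
    Resource headroom = Resources.subtract(
        Resources.subtract(resourceLimit, assignedMaps), assignedReduces);
    return Resources.add(headroom, preemptingReduces);
  }

  // New calculation: use the headroom reported by the RM directly, plus the
  // reduce resources that are already being preempted.
  static Resource availableForMapsNew(Resource headroom, Resource preemptingReduces) {
    return Resources.add(headroom, preemptingReduces);
  }
}
{code}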

bq. Would UNCONDITIONAL be better than FORCE, given that the other one isn't an
optional preemption once it kicks in?
Yep, unconditional is more descriptive. Will update the patch.




[jira] [Commented] (MAPREDUCE-6302) Incorrect headroom can lead to a deadlock between map and reduce allocations

2015-09-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14938989#comment-14938989
 ] 

Hadoop QA commented on MAPREDUCE-6302:
--

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  23m 22s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |  11m 52s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  10m 53s | There were no new javadoc 
warning messages. |
| {color:red}-1{color} | release audit |   0m 15s | The applied patch generated 
1 release audit warnings. |
| {color:red}-1{color} | checkstyle |   2m  0s | The applied patch generated  2 
new checkstyle issues (total was 517, now 518). |
| {color:red}-1{color} | checkstyle |   2m 24s | The applied patch generated  1 
new checkstyle issues (total was 12, now 12). |
| {color:red}-1{color} | whitespace |   0m  2s | The patch has 1  line(s) that 
end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 30s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   4m  5s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | mapreduce tests |   9m 42s | Tests passed in 
hadoop-mapreduce-client-app. |
| {color:green}+1{color} | mapreduce tests |   1m 48s | Tests passed in 
hadoop-mapreduce-client-core. |
| {color:green}+1{color} | yarn tests |  58m 43s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | | 125m 14s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12764478/mr-6302-3.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 6c17d31 |
| Release Audit | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6044/artifact/patchprocess/patchReleaseAuditProblems.txt
 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6044/artifact/patchprocess/diffcheckstylehadoop-mapreduce-client-core.txt
 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6044/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt
 |
| whitespace | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6044/artifact/patchprocess/whitespace.txt
 |
| hadoop-mapreduce-client-app test log | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6044/artifact/patchprocess/testrun_hadoop-mapreduce-client-app.txt
 |
| hadoop-mapreduce-client-core test log | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6044/artifact/patchprocess/testrun_hadoop-mapreduce-client-core.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6044/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6044/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6044/console |


This message was automatically generated.



[jira] [Commented] (MAPREDUCE-6302) Incorrect headroom can lead to a deadlock between map and reduce allocations

2015-09-30 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939127#comment-14939127
 ] 

Karthik Kambatla commented on MAPREDUCE-6302:
-

Discussed this with [~adhoot] offline. My latest patch (v3) would lead to 
preempting reducers even if there are running mappers. The surrounding code 
looks more complicated than it should be. 



[jira] [Commented] (MAPREDUCE-6302) Incorrect headroom can lead to a deadlock between map and reduce allocations

2015-10-01 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939976#comment-14939976
 ] 

Jason Lowe commented on MAPREDUCE-6302:
---

I think it's reasonable. There were a number of separate bugs in this area
because it was complicated; it would be nice to see it simplified and made
easier to understand.

Do we really want to avoid any kind of preemption if there's a map running?
Think of a case where a node failure causes 20 maps to line up for scheduling
due to fetch failures and we only have one map running. Do we really want to
feed those 20 maps through that one map hole? Hope they don't run very long.
;-) I haven't studied what the original code did in this case, but I noticed it
did not early-out if maps were running, hence the question. I think the
preemption logic could benefit from knowing whether reducers have reported that
they're past the SHUFFLE phase, and exempting those from preemption. It seems
we would want to preempt as many reducers in the SHUFFLE phase as necessary to
run most or all pending maps in parallel, if possible, to minimize job latency
in most cases.

Other minor comments on the patch:
- docs for mapreduce.job.reducer.unconditional-preempt.delay.sec should be 
clear on how to disable the functionality if desired, since setting it to zero 
does some pretty bad things.
- preemtping s/b preempting




[jira] [Commented] (MAPREDUCE-6302) Incorrect headroom can lead to a deadlock between map and reduce allocations

2015-10-01 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1493#comment-1493
 ] 

Karthik Kambatla commented on MAPREDUCE-6302:
-

Thanks for taking a look so quickly, Jason.

bq. Do we really want to avoid any kind of preemption if there's a map running?
Fair question. Anubhav had the same comment as well. The other thing to
consider here is slowstart: with slowstart set to a low value (say 0.5),
reducers shouldn't be preempted unless more than half the mappers are still
pending to be run. We could factor slowstart into the calculations here. We
need to decide whether it is worth the additional complication, given we are
just trying to avoid a deadlock here. Maybe file a follow-up and work on it
there? When looking at this code, I noticed a few other things that could be
simplified/fixed, e.g. {{preemptReducer}} in my patches.
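
Purely to illustrate the kind of check that would mean, a hypothetical sketch
(the class, method, and parameter names are made up; this is not part of any
patch on this JIRA):

{code:java}
public class SlowstartPreemptionSketch {
  // Hypothetical guard: only consider preempting reducers while the fraction of
  // completed maps is still below the slowstart threshold, i.e. fewer maps have
  // completed than slowstart would require before starting reducers.
  static boolean reducersEligibleForPreemption(
      int completedMaps, int totalMaps, float slowstartFraction) {
    return totalMaps > 0
        && ((float) completedMaps / totalMaps) < slowstartFraction;
  }
}
{code}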

Will address your comments on the patch once we decide on how to proceed on the 
above discussion. 




[jira] [Commented] (MAPREDUCE-6302) Incorrect headroom can lead to a deadlock between map and reduce allocations

2015-10-01 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940002#comment-14940002
 ] 

Karthik Kambatla commented on MAPREDUCE-6302:
-

I might have forgotten the specifics. Aren't all reducers in SHUFFLE phase 
until all the mappers are done?



[jira] [Commented] (MAPREDUCE-6302) Incorrect headroom can lead to a deadlock between map and reduce allocations

2015-10-01 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940173#comment-14940173
 ] 

Jason Lowe commented on MAPREDUCE-6302:
---

bq. Aren't all reducers in SHUFFLE phase until all the mappers are done?
No, here's an example scenario:

# All maps complete, all reducers scheduled and some (or all) started
# Some of the reducers, but not all, finish shuffling and proceed to the MERGE 
or REDUCE phases
# Node with some map outputs goes down
# Remaining reducers in the SHUFFLE phase (or not yet assigned) cannot
complete; the maps get retroactively failed for fetch failures, and new map
attempts need to be launched
# At this point we do _not_ want to kill any reducers past the SHUFFLE phase,
as they can progress and complete without needing any further map outputs




[jira] [Commented] (MAPREDUCE-6302) Incorrect headroom can lead to a deadlock between map and reduce allocations

2015-10-01 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940453#comment-14940453
 ] 

Anubhav Dhoot commented on MAPREDUCE-6302:
--

In the old code, we do not preempt if either the headroom or the assigned maps
are enough to run a mapper, so the early-out is consistent with the old
preemption. But the new preemption does not have to have the same conditions.
Since we are using it as a way to come out of deadlocks, I would think
preempting irrespective of how many mappers are running is
(a) safer and simpler to reason about, since it is only time based - we do not
have to second-guess whether we are missing some other reason for deadlock
apart from incorrect headroom.
(b) better in terms of overall throughput for cases like the one Jason
mentioned.
Having a large timeout is the safety lever for controlling the aggressiveness
of the preemption.
Factoring in slowstart in a subsequent JIRA seems like a good idea to me. I can
think of reasons not to factor it in and to leave it only as a heuristic for
starting reducers.



[jira] [Commented] (MAPREDUCE-6302) Incorrect headroom can lead to a deadlock between map and reduce allocations

2015-10-01 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940517#comment-14940517
 ] 

Karthik Kambatla commented on MAPREDUCE-6302:
-

Preempting reducers to run mappers doesn't always lead to higher throughput:
the preempted reducer might spend more time re-copying map outputs from every
mapper than the mappers in question take to run. I understand that preempting
will likely make sense for the vast majority of cases.

I propose we do the following:
# In this JIRA, let us just fix starvation. Stick to the logic of preempting
enough resources to run one mapper.
# In follow-up JIRA(s), let us improve this preemption to
## preempt reducers until we are able to meet the slowstart threshold
## prioritize preempting reducers that are still in SHUFFLE phase as Jason 
mentioned
## add an option to not preempt reducers that are past SHUFFLE phase 
irrespective of slowstart as long as one mapper can run



[jira] [Commented] (MAPREDUCE-6302) Incorrect headroom can lead to a deadlock between map and reduce allocations

2015-10-02 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14941135#comment-14941135
 ] 

Jason Lowe commented on MAPREDUCE-6302:
---

Since the old code also doesn't preempt if there's room for one map, I'm OK
with the current logic. I just didn't want a regression. As for SHUFFLE phase
awareness, I agree that's best left for a follow-up JIRA.




[jira] [Commented] (MAPREDUCE-6302) Incorrect headroom can lead to a deadlock between map and reduce allocations

2015-10-05 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942989#comment-14942989
 ] 

Karthik Kambatla commented on MAPREDUCE-6302:
-

Filed MAPREDUCE-6501 to track the follow-up work. 



[jira] [Commented] (MAPREDUCE-6302) Incorrect headroom can lead to a deadlock between map and reduce allocations

2015-10-05 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14943104#comment-14943104
 ] 

Hadoop QA commented on MAPREDUCE-6302:
--

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  31m 18s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 46s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  10m 51s | There were no new javadoc 
warning messages. |
| {color:red}-1{color} | release audit |   0m 15s | The applied patch generated 
1 release audit warnings. |
| {color:red}-1{color} | checkstyle |   2m 45s | The applied patch generated  2 
new checkstyle issues (total was 517, now 518). |
| {color:red}-1{color} | checkstyle |   3m 14s | The applied patch generated  1 
new checkstyle issues (total was 12, now 12). |
| {color:red}-1{color} | whitespace |   0m  3s | The patch has 1  line(s) that 
end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 38s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 35s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   4m 24s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | mapreduce tests |   9m 52s | Tests passed in 
hadoop-mapreduce-client-app. |
| {color:green}+1{color} | mapreduce tests |   1m 51s | Tests passed in 
hadoop-mapreduce-client-core. |
| {color:green}+1{color} | yarn tests |  59m 53s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | | 131m 45s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12764974/mr-6302-5.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 30e2f83 |
| Release Audit | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6055/artifact/patchprocess/patchReleaseAuditProblems.txt
 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6055/artifact/patchprocess/diffcheckstylehadoop-mapreduce-client-core.txt
 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6055/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt
 |
| whitespace | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6055/artifact/patchprocess/whitespace.txt
 |
| hadoop-mapreduce-client-app test log | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6055/artifact/patchprocess/testrun_hadoop-mapreduce-client-app.txt
 |
| hadoop-mapreduce-client-core test log | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6055/artifact/patchprocess/testrun_hadoop-mapreduce-client-core.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6055/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6055/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6055/console |


This message was automatically generated.


[jira] [Commented] (MAPREDUCE-6302) Incorrect headroom can lead to a deadlock between map and reduce allocations

2015-10-06 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945760#comment-14945760
 ] 

Anubhav Dhoot commented on MAPREDUCE-6302:
--

Should we make MR_JOB_REDUCER_UNCONDITIONAL_PREEMPT_DELAY_SEC and
MR_JOB_REDUCER_PREEMPT_DELAY_SEC consistent in the way they treat negative
values?
Today, MR_JOB_REDUCER_PREEMPT_DELAY_SEC treats a negative value the same as
zero, which does not allow you to turn it off, while the newly proposed
MR_JOB_REDUCER_UNCONDITIONAL_PREEMPT_DELAY_SEC uses a negative value to turn
preemption off. The latter seems preferable, and since the default is zero and
the doc does not mention negative values, I think it should be OK to change
this behavior. Thoughts?

It's better to reword
 // Duration to wait before forcibly preempting a reducer when there is room
to
 // Duration to wait before forcibly preempting a reducer irrespective of
whether there is room
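
To make the negative-value semantics proposed above concrete, a rough sketch.
The property key is the one from the docs discussion earlier in this thread and
the default of zero is as stated above; the class and method names are made up,
and this is not the committed code.

{code:java}
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;

public class UnconditionalPreemptConfigSketch {
  static final String UNCONDITIONAL_PREEMPT_DELAY_SEC =
      "mapreduce.job.reducer.unconditional-preempt.delay.sec";

  // Proposed semantics: a negative value turns unconditional preemption off;
  // zero or a positive value is how long to wait for map resources before
  // unconditionally preempting a reducer.
  static boolean enabled(Configuration conf) {
    return conf.getInt(UNCONDITIONAL_PREEMPT_DELAY_SEC, 0) >= 0;
  }

  static long delayMillis(Configuration conf) {
    return TimeUnit.SECONDS.toMillis(
        conf.getInt(UNCONDITIONAL_PREEMPT_DELAY_SEC, 0));
  }
}
{code}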




[jira] [Commented] (MAPREDUCE-6302) Incorrect headroom can lead to a deadlock between map and reduce allocations

2015-10-06 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14946023#comment-14946023
 ] 

Karthik Kambatla commented on MAPREDUCE-6302:
-

bq. Should we make MR_JOB_REDUCER_UNCONDITIONAL_PREEMPT_DELAY_SEC and
MR_JOB_REDUCER_PREEMPT_DELAY_SEC consistent in the way they treat negative
values?
Good point. Maybe I should name the unconditional preemption config with
"timeout" instead of "delay". {{MR_JOB_REDUCER_PREEMPT_DELAY_SEC}} delays the
preemption; a positive value means we wait that long before preempting. The
config we are adding here is more of a timeout: if we don't get resources by
this time, we preempt.

That way, for both configs, a negative value would mean disable.



[jira] [Commented] (MAPREDUCE-6302) Incorrect headroom can lead to a deadlock between map and reduce allocations

2015-10-06 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14946051#comment-14946051
 ] 

Karthik Kambatla commented on MAPREDUCE-6302:
-

bq. Maybe I should name the unconditional preemption config with "timeout"
instead of "delay".
I tried this out, but it doesn't read well. Sorry for vacillating on this; long
day. Given the slight difference in nature of the two configs, I am fine with
leaving the negative values inconsistent. I am willing to change the config
name to something that better describes this. Suggestions are very welcome.




[jira] [Commented] (MAPREDUCE-6302) Incorrect headroom can lead to a deadlock between map and reduce allocations

2015-10-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14946204#comment-14946204
 ] 

Hadoop QA commented on MAPREDUCE-6302:
--

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  19m 15s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 56s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  10m 19s | There were no new javadoc 
warning messages. |
| {color:red}-1{color} | release audit |   0m 15s | The applied patch generated 
1 release audit warnings. |
| {color:red}-1{color} | checkstyle |   1m 34s | The applied patch generated  2 
new checkstyle issues (total was 517, now 518). |
| {color:red}-1{color} | whitespace |   0m  3s | The patch has 1  line(s) that 
end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 32s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 34s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   4m  1s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | mapreduce tests |   9m 42s | Tests passed in 
hadoop-mapreduce-client-app. |
| {color:green}+1{color} | mapreduce tests |   1m 49s | Tests passed in 
hadoop-mapreduce-client-core. |
| {color:green}+1{color} | yarn tests |  56m 13s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | | 113m 40s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12765298/mr-6302-6.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 1bca1bb |
| Release Audit | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6061/artifact/patchprocess/patchReleaseAuditProblems.txt
 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6061/artifact/patchprocess/diffcheckstylehadoop-mapreduce-client-core.txt
 |
| whitespace | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6061/artifact/patchprocess/whitespace.txt
 |
| hadoop-mapreduce-client-app test log | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6061/artifact/patchprocess/testrun_hadoop-mapreduce-client-app.txt
 |
| hadoop-mapreduce-client-core test log | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6061/artifact/patchprocess/testrun_hadoop-mapreduce-client-core.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6061/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6061/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6061/console |


This message was automatically generated.



[jira] [Commented] (MAPREDUCE-6302) Incorrect headroom can lead to a deadlock between map and reduce allocations

2015-10-08 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14948179#comment-14948179
 ] 

Anubhav Dhoot commented on MAPREDUCE-6302:
--

Looks like the whitespace error is genuine, while the checkstyle and release
audit warnings can be ignored.
bq. MR_JOB_REDUCER_PREEMPT_DELAY_SEC delays the preemption; a positive value
means we wait that long before preempting. The config we are adding here is
more of a timeout: if we don't get resources by this time, we preempt.
Don't both configs wait to get resources by the configured time, and do no
preemption if the resources arrive by then? Even
MR_JOB_REDUCER_PREEMPT_DELAY_SEC will not preempt if the resources were
obtained before the timeout. I am concerned we are introducing an inconsistency
in this patch that will burden administrators. It would be good to at least
update the doc comments in yarn-default to indicate the effect of negative
values and zero for both configs.



[jira] [Commented] (MAPREDUCE-6302) Incorrect headroom can lead to a deadlock between map and reduce allocations

2015-10-08 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949034#comment-14949034
 ] 

Karthik Kambatla commented on MAPREDUCE-6302:
-

I see your point. 

I agree that it would be nice for the two configs to be consistent; -1 could
mean "disable the feature" for both. Unfortunately, one of the configs has
already shipped in a release, and changing it would be backwards-incompatible.

Now, I see the following alternatives for the new config:
# What the latest patch does: -1 disables it, and >= 0 is the wait time before
preemption.
# The value provided is the wait time before preemption, and any negative value
is interpreted as zero. If folks want to disable this, they will have to pass
Long.MAX_VALUE.

Personally, I find the first one simpler to use. I do see the inconsistency
between two very similar but different configs.



[jira] [Commented] (MAPREDUCE-6302) Incorrect headroom can lead to a deadlock between map and reduce allocations

2015-10-08 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949600#comment-14949600
 ] 

Karthik Kambatla commented on MAPREDUCE-6302:
-

Filed MAPREDUCE-6506



[jira] [Commented] (MAPREDUCE-6302) Incorrect headroom can lead to a deadlock between map and reduce allocations

2015-10-08 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949607#comment-14949607
 ] 

Anubhav Dhoot commented on MAPREDUCE-6302:
--

+1



[jira] [Commented] (MAPREDUCE-6302) Incorrect headroom can lead to a deadlock between map and reduce allocations

2015-10-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949761#comment-14949761
 ] 

Hadoop QA commented on MAPREDUCE-6302:
--

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  19m 23s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   8m  3s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  10m 20s | There were no new javadoc 
warning messages. |
| {color:red}-1{color} | release audit |   0m 20s | The applied patch generated 
1 release audit warnings. |
| {color:red}-1{color} | checkstyle |   1m 33s | The applied patch generated  2 
new checkstyle issues (total was 517, now 518). |
| {color:red}-1{color} | whitespace |   0m  2s | The patch has 1  line(s) that 
end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 33s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 35s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   4m  5s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | mapreduce tests |   9m 43s | Tests passed in 
hadoop-mapreduce-client-app. |
| {color:green}+1{color} | mapreduce tests |   1m 48s | Tests passed in 
hadoop-mapreduce-client-core. |
| {color:green}+1{color} | yarn tests |  56m 20s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | | 114m 15s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12765709/mr-6302-7.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 8d22622 |
| Release Audit | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6063/artifact/patchprocess/patchReleaseAuditProblems.txt
 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6063/artifact/patchprocess/diffcheckstylehadoop-mapreduce-client-core.txt
 |
| whitespace | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6063/artifact/patchprocess/whitespace.txt
 |
| hadoop-mapreduce-client-app test log | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6063/artifact/patchprocess/testrun_hadoop-mapreduce-client-app.txt
 |
| hadoop-mapreduce-client-core test log | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6063/artifact/patchprocess/testrun_hadoop-mapreduce-client-core.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6063/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6063/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6063/console |


This message was automatically generated.
