[jira] [Commented] (YARN-4685) AM blacklisting result in application to get hanged

2016-08-19 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15428449#comment-15428449
 ] 

Wangda Tan commented on YARN-4685:
--

+1 to latest patch, will commit shortly.

> AM blacklisting result in application to get hanged
> ---
>
> Key: YARN-4685
> URL: https://issues.apache.org/jira/browse/YARN-4685
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
>Priority: Critical
> Attachments: YARN-4685-workaround.patch, YARN-4685.patch
>
>
> AM blacklist additions and removals are applied only when the RMAppAttempt is 
> scheduled, i.e. in {{RMAppAttemptImpl#ScheduleTransition#transition}}. Once the 
> attempt is scheduled, any removeNode/addNode in the cluster is not propagated 
> to {{BlackListManager#refreshNodeHostCount}}, so the BlackListManager operates 
> on a stale NM count. The application then stays in the ACCEPTED state and 
> waits forever, even if the blacklisted nodes reconnect after their disk space 
> is cleared.
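
The stale-count problem in the description can be illustrated with a small, 
self-contained model. This is only a sketch, not the YARN code: the class 
StaleCountBlacklistSketch and its methods are hypothetical, and the rule that 
a blacklist is enforced only while its size is below threshold * node count is 
an assumption made for illustration.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified stand-in for a per-attempt AM blacklist manager. */
class StaleCountBlacklistSketch {
    private final double disableThreshold;          // e.g. 0.8, the default discussed in this issue
    private final Set<String> failedHosts = new HashSet<>();
    private int nodeHostCount;                      // changes only when refreshNodeHostCount is called

    StaleCountBlacklistSketch(double disableThreshold, int nodeHostCount) {
        this.disableThreshold = disableThreshold;
        this.nodeHostCount = nodeHostCount;
    }

    /** Record a host on which an AM attempt failed. */
    void addNode(String host) {
        failedHosts.add(host);
    }

    /** Must be called after every node add/remove, otherwise the count goes stale. */
    void refreshNodeHostCount(int count) {
        this.nodeHostCount = count;
    }

    /** Assumed rule: the blacklist applies only while it is smaller than threshold * hosts. */
    boolean isBlacklistEnforced() {
        return failedHosts.size() < disableThreshold * nodeHostCount;
    }
}
```

With a threshold of 0.8 and two hosts, one blacklisted host keeps the blacklist 
enforced. If the other host then leaves the cluster and the count is never 
refreshed, the blacklist stays enforced although the only remaining host is 
the blacklisted one, so the AM request cannot be satisfied.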



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4685) AM blacklisting result in application to get hanged

2016-08-19 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427795#comment-15427795
 ] 

Hadoop QA commented on YARN-4685:
-

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 22s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| +1 | mvninstall | 6m 44s | trunk passed |
| +1 | compile | 0m 23s | trunk passed |
| +1 | checkstyle | 0m 15s | trunk passed |
| +1 | mvnsite | 0m 26s | trunk passed |
| +1 | mvneclipse | 0m 12s | trunk passed |
| +1 | findbugs | 1m 0s | trunk passed |
| +1 | javadoc | 0m 16s | trunk passed |
| +1 | mvninstall | 0m 21s | the patch passed |
| +1 | compile | 0m 20s | the patch passed |
| +1 | javac | 0m 20s | the patch passed |
| +1 | checkstyle | 0m 13s | the patch passed |
| +1 | mvnsite | 0m 23s | the patch passed |
| +1 | mvneclipse | 0m 9s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | findbugs | 1m 4s | the patch passed |
| +1 | javadoc | 0m 14s | the patch passed |
| +1 | unit | 0m 22s | hadoop-yarn-api in the patch passed. |
| +1 | asflicense | 0m 15s | The patch does not generate ASF License warnings. |
| | | 13m 38s | |

|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:9560f25 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12824510/YARN-4685.patch |
| JIRA Issue | YARN-4685 |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
| uname | Linux 5b685bd44176 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 8179f9a |
| Default Java | 1.8.0_101 |
| findbugs | v3.0.0 |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/12830/testReport/ |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/12830/console |
| Powered by | Apache Yetus 0.3.0 http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (YARN-4685) AM blacklisting result in application to get hanged

2016-08-19 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427761#comment-15427761
 ] 

Rohith Sharma K S commented on YARN-4685:
-

OK. I will upload a patch with changes.




[jira] [Commented] (YARN-4685) AM blacklisting result in application to get hanged

2016-08-18 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427298#comment-15427298
 ] 

Wangda Tan commented on YARN-4685:
--

[~rohithsharma],

I discussed this with [~vinodkv]. One solution is to set 
DEFAULT_AM_BLACKLIST_ENABLED to false and lower the default threshold from 0.8 
to 0.2. We can open a separate JIRA for a longer-term fix for this issue. 
Sounds like a plan?




[jira] [Commented] (YARN-4685) AM blacklisting result in application to get hanged

2016-05-12 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15282415#comment-15282415
 ] 

Rohith Sharma K S commented on YARN-4685:
-

Since this issue was introduced by YARN-2005, which is committed to branch-2.8, 
should YARN-2005 be reverted until the right solution is decided? The biggest 
challenge with a revert is that many patches have been committed on top of 
YARN-2005. Or should we go ahead and change the default threshold to 0.2 for 
the 2.8 release? Any thoughts?




[jira] [Commented] (YARN-4685) AM blacklisting result in application to get hanged

2016-05-12 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15282145#comment-15282145
 ] 

Wangda Tan commented on YARN-4685:
--

[~rohithsharma], it seems to me that there is no consensus yet on how to fix 
this problem. Could we move this to 2.9?




[jira] [Commented] (YARN-4685) AM blacklisting result in application to get hanged

2016-03-29 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217225#comment-15217225
 ] 

Rohith Sharma K S commented on YARN-4685:
-

Some of the points raised in an offline discussion with [~sunilg] and 
[~vvasudev] are:
# The default maximum threshold is 0.8. It should be reduced to 0.1 (10%) or 
0.2 (20%). As Vinod 
[commented|https://issues.apache.org/jira/browse/YARN-4685?focusedCommentId=15201117&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15201117]
 previously in this JIRA, in a real production cluster, blacklisting 80% of the 
nodes for one app is very likely to cause problems if the other 20% of nodes 
are always busy.
# Once an attempt is scheduled, there is no way to notify the scheduler of 
blacklist additions or removals. The existing *allocate* API is what carries 
the AM's blacklisted nodes today, but reusing it from RMAppAttempt for later 
AM blacklist add/remove updates is risky: because allocate returns an 
Allocation object, many RMAppAttempt state-machine transitions would have to 
be handled and many race conditions would appear. Instead, RMAppAttempt can 
trigger an update event that informs the scheduler of the AM blacklist 
changes. This keeps the YarnScheduler interface compatible.
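
To make the effect of the proposed threshold change in point 1 concrete, here 
is a rough arithmetic sketch. It is not YARN code: the helper class and method 
names are hypothetical, and it assumes that a blacklist is enforced only while 
its size is strictly below threshold * node count.

```java
/** Hypothetical helper, not part of YARN: illustrates the threshold arithmetic. */
final class BlacklistThresholdMath {
    private BlacklistThresholdMath() {}

    /**
     * Largest number of blacklisted hosts that is still enforced, assuming the
     * blacklist is enforced only while size < threshold * nodeCount.
     */
    static int maxEnforcedBlacklistSize(int nodeCount, double threshold) {
        double limit = threshold * nodeCount;
        // Largest integer strictly below the limit, never negative.
        int largestBelowLimit = (int) Math.ceil(limit) - 1;
        return Math.max(largestBelowLimit, 0);
    }

    /** Hosts still usable for the AM when the blacklist is at its largest enforced size. */
    static int hostsLeftForAm(int nodeCount, double threshold) {
        return nodeCount - maxEnforcedBlacklistSize(nodeCount, threshold);
    }
}
```

Under these assumptions, on a 10-node cluster a threshold of 0.8 lets up to 7 
hosts be excluded for one app, leaving 3, while a threshold of 0.2 excludes at 
most 1 and leaves 9.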

> AM blacklisting result in application to get hanged
> ---
>
> Key: YARN-4685
> URL: https://issues.apache.org/jira/browse/YARN-4685
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
>
> AM blacklist additions and removals are applied only when the RMAppAttempt is 
> scheduled, i.e. in {{RMAppAttemptImpl#ScheduleTransition#transition}}. Once the 
> attempt is scheduled, any removeNode/addNode in the cluster is not propagated 
> to {{BlackListManager#refreshNodeHostCount}}, so the BlackListManager operates 
> on a stale NM count. The application then stays in the ACCEPTED state and 
> waits forever, even if more nodes are added to the cluster.
> The solution is to update the BlacklistManager on every 
> {{RMAppAttemptImpl#AMContainerAllocatedTransition#transition}} call. This 
> ensures that any node addition or removal is reflected in the 
> BlacklistManager.





[jira] [Commented] (YARN-4685) AM blacklisting result in application to get hanged

2016-03-21 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204712#comment-15204712
 ] 

Sunil G commented on YARN-4685:
---

I agree with your point, [~rohithsharma].

We have one {{blacklistManager}} per {{RMAppAttempt}}, so to do anything with 
the {{blacklistManager}} we have to pass a reference to the scheduler. Suppose 
we take your second approach. On each heartbeat, we check for pending AM 
container resource requests. For each such request, we re-compute the 
blacklist threshold in {{blacklistManager}} if needed (that is, if nodes were 
added or removed recently). If the threshold has changed, we remove the 
blacklist for this ResourceRequest.

But this requires changing many interface API signatures. A common 
BlackListManager that keeps track of the blacklist information for all apps 
would have been cleaner.
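
The per-heartbeat re-check described above could look roughly like the 
following. This is a minimal, self-contained sketch and not the YARN 
implementation: the class HeartbeatBlacklistRecheck, its methods, and the 
enforcement rule (size below threshold * node count) are assumptions made for 
illustration.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: re-evaluate the AM blacklist when the node count changes. */
class HeartbeatBlacklistRecheck {
    private final double threshold;
    private final Set<String> blacklist = new HashSet<>();
    private int lastSeenNodeCount;
    private boolean enforced;

    HeartbeatBlacklistRecheck(double threshold, int initialNodeCount) {
        this.threshold = threshold;
        this.lastSeenNodeCount = initialNodeCount;
        this.enforced = computeEnforced();
    }

    /** Record a host on which the AM failed and re-evaluate the decision. */
    void blacklistHost(String host) {
        blacklist.add(host);
        enforced = computeEnforced();
    }

    /**
     * Called on each heartbeat while the AM container request is pending.
     * Returns true only if the enforcement decision changed, which is the
     * point where the scheduler's AM blacklist would need updating.
     */
    boolean onHeartbeat(int currentNodeCount) {
        if (currentNodeCount == lastSeenNodeCount) {
            return false; // no node was added or removed since the last check
        }
        lastSeenNodeCount = currentNodeCount;
        boolean before = enforced;
        enforced = computeEnforced();
        return before != enforced;
    }

    boolean isEnforced() {
        return enforced;
    }

    // Assumed rule: the blacklist applies only while it is below threshold * hosts.
    private boolean computeEnforced() {
        return blacklist.size() < threshold * lastSeenNodeCount;
    }
}
```

Skipping the recomputation when the node count is unchanged keeps the 
per-heartbeat cost small, which matters if the check runs for every app that 
is waiting for its AM container.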



[jira] [Commented] (YARN-4685) AM blacklisting result in application to get hanged

2016-03-21 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204685#comment-15204685
 ] 

Rohith Sharma K S commented on YARN-4685:
-

Initially I thought of fixing this by issuing another allocate call whenever a 
node update event reaches {{RMApp->RMAppImpl}}. But the new allocate call 
could receive the master container before RMAppAttemptImpl gets the 
container-allocated event, and RMAppAttemptImpl would need a mechanism to 
handle that. Many similar cases can occur, so this option does not work.

The other approaches are to recompute the blacklist threshold either on 
node-added and node-removed events or on every heartbeat, for *all* apps that 
are waiting for AM container allocation, and to update appschedulinginfo with 
the {{amBlacklist}}.



[jira] [Commented] (YARN-4685) AM blacklisting result in application to get hanged

2016-03-19 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201117#comment-15201117
 ] 

Vinod Kumar Vavilapalli commented on YARN-4685:
---

There are simpler cases that are broken too. For example, if an AM fails on a 
node, that node will *never* be considered again for launching this app's AM 
as long as the blacklist is within the threshold. In a busy cluster where this 
node continues to be the only free one for a while, we will keep skipping it.
