[jira] [Commented] (YARN-4947) Test timeout is happening for TestRMWebServicesNodes

2016-04-21 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253434#comment-15253434
 ] 

Sunil G commented on YARN-4947:
---

Thanks [~rohithsharma], I very much agree with the approach of fixing these 
test cases rather than adding complicated flags and overrides. I do not see 
much of a problem if we start the auxiliary services. 

> Test timeout is happening for TestRMWebServicesNodes
> 
>
> Key: YARN-4947
> URL: https://issues.apache.org/jira/browse/YARN-4947
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4947.patch, 0002-YARN-4947.patch
>
>
> Testcase timeout for TestRMWebServicesNodes is happening after YARN-4893 
> [timeout|https://builds.apache.org/job/PreCommit-YARN-Build/11044/testReport/]





[jira] [Commented] (YARN-4846) Random failures for TestCapacitySchedulerPreemption#testPreemptionPolicyShouldRespectAlreadyMarkedKillableContainers

2016-04-21 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253414#comment-15253414
 ] 

Bibin A Chundatt commented on YARN-4846:


Test failures are tracked as part of HADOOP-13049.

> Random failures for 
> TestCapacitySchedulerPreemption#testPreemptionPolicyShouldRespectAlreadyMarkedKillableContainers
> 
>
> Key: YARN-4846
> URL: https://issues.apache.org/jira/browse/YARN-4846
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4846.patch, 0002-YARN-4846.patch, 
> 0003-YARN-4846.patch, 0004-YARN-4846.patch, YARN-4846-update-PCPP.patch
>
>
> {noformat}
> java.lang.AssertionError: expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerPreemption.testPreemptionPolicyShouldRespectAlreadyMarkedKillableContainers(TestCapacitySchedulerPreemption.java:473)
> {noformat}
> https://builds.apache.org/job/PreCommit-YARN-Build/10826/testReport/org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity/TestCapacitySchedulerPreemption/testPreemptionPolicyShouldRespectAlreadyMarkedKillableContainers/





[jira] [Commented] (YARN-4846) Random failures for TestCapacitySchedulerPreemption#testPreemptionPolicyShouldRespectAlreadyMarkedKillableContainers

2016-04-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253373#comment-15253373
 ] 

Hadoop QA commented on YARN-4846:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 10s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 
24s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s 
{color} | {color:green} trunk passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
18s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 33s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
14s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 4s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s 
{color} | {color:green} trunk passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
29s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s 
{color} | {color:green} the patch passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 25s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
15s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 31s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
12s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
12s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s 
{color} | {color:green} the patch passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 73m 7s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_77. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 74m 12s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
16s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 163m 9s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_77 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestAMAuthorization |
|   | hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart |
|   | 
hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification |
|   | hadoop.yarn.server.resourcemanager.TestApplicationCleanup |
|   | hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart |
|   | hadoop.yarn.server.resourcemanager.TestApplicationMasterLauncher |
|   | hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA |
|   | hadoop.yarn.server.resourcemanager.webapp.TestRMWebappAuthentication |
| 

[jira] [Commented] (YARN-4676) Automatic and Asynchronous Decommissioning Nodes Status Tracking

2016-04-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253368#comment-15253368
 ] 

Hadoop QA commented on YARN-4676:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 10s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 8 new or modified test 
files. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 14s 
{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 
45s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 5m 54s 
{color} | {color:green} trunk passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 54s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 
11s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 4m 9s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 
55s {color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s 
{color} | {color:blue} Skipped branch modules with no Java source: 
hadoop-project hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 6m 
47s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 21s 
{color} | {color:green} trunk passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 6m 11s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 14s 
{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 
22s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 5m 57s 
{color} | {color:green} the patch passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 5m 57s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 5m 57s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 50s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 6m 50s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 6m 50s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 
8s {color} | {color:green} root: patch generated 0 new + 520 unchanged - 6 
fixed = 520 total (was 526) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 4m 9s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 
47s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s 
{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s 
{color} | {color:blue} Skipped patch modules with no Java source: 
hadoop-project hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 8m 
12s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 35s 
{color} | {color:green} the patch passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 6m 34s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 8s 
{color} | {c

[jira] [Commented] (YARN-1297) Miscellaneous Fair Scheduler speedups

2016-04-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253356#comment-15253356
 ] 

Hadoop QA commented on YARN-1297:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s 
{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 
40s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s 
{color} | {color:green} trunk passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
18s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 33s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
13s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 4s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s 
{color} | {color:green} trunk passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
28s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 23s 
{color} | {color:green} the patch passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 23s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
16s {color} | {color:green} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 patch generated 0 new + 142 unchanged - 2 fixed = 142 total (was 144) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 31s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
12s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
12s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s 
{color} | {color:green} the patch passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 73m 30s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_77. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 74m 32s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
18s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 164m 3s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_77 Failed junit tests | 
hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart |
|   | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.webapp.TestRMWebappAuthentication |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestWorkPreservingRMRestartForNodeLab

[jira] [Commented] (YARN-4947) Test timeout is happening for TestRMWebServicesNodes

2016-04-21 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253348#comment-15253348
 ] 

Rohith Sharma K S commented on YARN-4947:
-

Practically, NodeManagers are not allowed to register with the RM unless 
ResourceTrackerService is started. So I feel the test case should be changed to 
start the RM instead of only doing init. I agree that a web service unit test 
does not need to start the services, but since NodeManagers are registering, 
the service start should ideally happen.
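
For illustration only, a minimal sketch of that direction, assuming a 
MockRM-based setup like the one TestRMWebServicesNodes uses (the class, test 
names, and assertions below are placeholders, not the actual 0002-YARN-4947 
patch):

{code}
// Sketch: start the RM in setup rather than only initializing it, so that
// ResourceTrackerService is running before nodes try to register.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.resourcemanager.MockRM;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class TestRMWebServicesNodesSketch {
  private MockRM rm;

  @Before
  public void setUp() throws Exception {
    Configuration conf = new YarnConfiguration();
    rm = new MockRM(conf);
    // Starting (not just initializing) the RM brings up ResourceTrackerService,
    // so node registration does not block until the test times out.
    rm.start();
  }

  @Test(timeout = 30000)
  public void testNodesEndpointSeesRegisteredNode() throws Exception {
    rm.registerNode("h1:1234", 8192); // allowed now that the tracker is up
    // ... query /ws/v1/cluster/nodes and assert on the returned node info
  }

  @After
  public void tearDown() {
    if (rm != null) {
      rm.stop();
    }
  }
}
{code}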

> Test timeout is happening for TestRMWebServicesNodes
> 
>
> Key: YARN-4947
> URL: https://issues.apache.org/jira/browse/YARN-4947
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4947.patch, 0002-YARN-4947.patch
>
>
> Testcase timeout for TestRMWebServicesNodes is happening after YARN-4893 
> [timeout|https://builds.apache.org/job/PreCommit-YARN-Build/11044/testReport/]





[jira] [Commented] (YARN-4390) Consider container request size during CS preemption

2016-04-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253307#comment-15253307
 ] 

Hadoop QA commented on YARN-4390:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 12 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 
38s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s 
{color} | {color:green} trunk passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
23s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 34s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
15s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 6s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s 
{color} | {color:green} trunk passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
30s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s 
{color} | {color:green} the patch passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 20s 
{color} | {color:red} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 patch generated 28 new + 506 unchanged - 15 fixed = 534 total (was 521) 
{color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 31s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
12s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
15s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s 
{color} | {color:green} the patch passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 74m 34s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_77. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 74m 57s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
16s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 166m 2s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_77 Failed junit tests | 
hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart |
|   | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.webapp.TestRMWebappAuthentication |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestWorkPreservingRMRestartForNodeLabel
 |
|   | 
hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokenAuthentication
 |
|   | hadoop.yarn.server.re

[jira] [Commented] (YARN-3215) Respect labels in CapacityScheduler when computing headroom

2016-04-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253277#comment-15253277
 ] 

Hadoop QA commented on YARN-3215:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 8m 44s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 5 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 9m 
19s {color} | {color:green} branch-2.8 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s 
{color} | {color:green} branch-2.8 passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s 
{color} | {color:green} branch-2.8 passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
26s {color} | {color:green} branch-2.8 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 35s 
{color} | {color:green} branch-2.8 passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
21s {color} | {color:green} branch-2.8 passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
25s {color} | {color:green} branch-2.8 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s 
{color} | {color:green} branch-2.8 passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s 
{color} | {color:green} branch-2.8 passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
34s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 31s 
{color} | {color:green} the patch passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 31s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 30s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 20s 
{color} | {color:red} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 patch generated 1 new + 217 unchanged - 6 fixed = 218 total (was 223) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 35s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
13s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
21s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s 
{color} | {color:green} the patch passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 81m 40s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_77. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 81m 26s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
17s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 192m 0s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_77 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps |
|   | hadoop.yarn.server.resourcemanager.scheduler.fifo.TestFifoScheduler |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| JDK v1.8.0_77 Timed out junit tests | 
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes |
| JDK v1.7.0

[jira] [Updated] (YARN-4983) JVM and UGI metrics disappear after RM is once transitioned to standby mode

2016-04-21 Thread Li Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Lu updated YARN-4983:

Attachment: YARN-4983-trunk.001.patch

The large number of UT failures is caused by a wrong JVM metrics API call. The 
new patch calls initSingleton instead of create, to avoid registering twice 
with the same metrics system. 
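
For illustration, a minimal sketch of that difference (the JvmMetrics method 
signatures are quoted from memory of the metrics2 API, and the surrounding 
wiring is an assumption, not the actual YARN-4983 patch):

{code}
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.metrics2.source.JvmMetrics;

public class RmJvmMetricsSketch {
  // Called on every transition-to-active, after the metrics system is
  // (re)initialized.
  static void registerJvmMetrics() {
    DefaultMetricsSystem.initialize("ResourceManager");
    // initSingleton() reuses the already-created JvmMetrics instance, so a
    // repeated HA transition does not try to register the same source twice.
    JvmMetrics.initSingleton("ResourceManager", null);
    // By contrast, JvmMetrics.create("ResourceManager", null,
    // DefaultMetricsSystem.instance()) registers a new source each time and
    // can fail once a source with that name already exists.
  }
}
{code}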

> JVM and UGI metrics disappear after RM is once transitioned to standby mode
> ---
>
> Key: YARN-4983
> URL: https://issues.apache.org/jira/browse/YARN-4983
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Li Lu
>Assignee: Li Lu
> Attachments: YARN-4983-trunk.000.patch, YARN-4983-trunk.001.patch
>
>
> When transitioned to standby, the RM shuts down the existing metrics 
> system and relaunches a new one. This causes the JVM metrics and UGI 
> metrics to be missing from the new metrics system. 





[jira] [Updated] (YARN-3971) Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel recovery

2016-04-21 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-3971:

Hadoop Flags:   (was: Reviewed)

> Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel 
> recovery
> --
>
> Key: YARN-3971
> URL: https://issues.apache.org/jira/browse/YARN-3971
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-3971.patch, 0002-YARN-3971.patch, 
> 0003-YARN-3971.patch, 0004-YARN-3971.patch, 
> 0005-YARN-3971.001.addendum.patch, 0005-YARN-3971.addendum.patch, 
> 0005-YARN-3971.patch
>
>
> Steps to reproduce 
> # Create label x,y
> # Delete label x,y
> # Create labels x,y again and also add labels x and y to the capacity-scheduler xml
> # Restart RM 
>  
> Both RMs become Standby,
> since the exception below is thrown in {{FileSystemNodeLabelsStore#recover}}:
> {code}
> 2015-07-23 14:03:33,627 INFO org.apache.hadoop.service.AbstractService: 
> Service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager failed in 
> state STARTED; cause: java.io.IOException: Cannot remove label=x, because 
> queue=a1 is using this label. Please remove label on queue before remove the 
> label
> java.io.IOException: Cannot remove label=x, because queue=a1 is using this 
> label. Please remove label on queue before remove the label
> at 
> org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.checkRemoveFromClusterNodeLabelsOfQueue(RMNodeLabelsManager.java:104)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.removeFromClusterNodeLabels(RMNodeLabelsManager.java:118)
> at 
> org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:221)
> at 
> org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:232)
> at 
> org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:245)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:587)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:964)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1005)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1001)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1001)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:312)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:832)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:422)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> {code}





[jira] [Updated] (YARN-3971) Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel recovery

2016-04-21 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-3971:

Target Version/s: 2.9.0

> Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel 
> recovery
> --
>
> Key: YARN-3971
> URL: https://issues.apache.org/jira/browse/YARN-3971
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-3971.patch, 0002-YARN-3971.patch, 
> 0003-YARN-3971.patch, 0004-YARN-3971.patch, 
> 0005-YARN-3971.001.addendum.patch, 0005-YARN-3971.addendum.patch, 
> 0005-YARN-3971.patch
>
>
> Steps to reproduce 
> # Create label x,y
> # Delete label x,y
> # Create labels x,y again and also add labels x and y to the capacity-scheduler xml
> # Restart RM 
>  
> Both RMs become Standby,
> since the exception below is thrown in {{FileSystemNodeLabelsStore#recover}}:
> {code}
> 2015-07-23 14:03:33,627 INFO org.apache.hadoop.service.AbstractService: 
> Service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager failed in 
> state STARTED; cause: java.io.IOException: Cannot remove label=x, because 
> queue=a1 is using this label. Please remove label on queue before remove the 
> label
> java.io.IOException: Cannot remove label=x, because queue=a1 is using this 
> label. Please remove label on queue before remove the label
> at 
> org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.checkRemoveFromClusterNodeLabelsOfQueue(RMNodeLabelsManager.java:104)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.removeFromClusterNodeLabels(RMNodeLabelsManager.java:118)
> at 
> org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:221)
> at 
> org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:232)
> at 
> org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:245)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:587)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:964)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1005)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1001)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1001)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:312)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:832)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:422)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> {code}





[jira] [Updated] (YARN-3971) Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel recovery

2016-04-21 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-3971:

Fix Version/s: (was: 2.8.0)

> Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel 
> recovery
> --
>
> Key: YARN-3971
> URL: https://issues.apache.org/jira/browse/YARN-3971
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-3971.patch, 0002-YARN-3971.patch, 
> 0003-YARN-3971.patch, 0004-YARN-3971.patch, 
> 0005-YARN-3971.001.addendum.patch, 0005-YARN-3971.addendum.patch, 
> 0005-YARN-3971.patch
>
>
> Steps to reproduce 
> # Create label x,y
> # Delete label x,y
> # Create labels x,y again and also add labels x and y to the capacity-scheduler xml
> # Restart RM 
>  
> Both RMs become Standby,
> since the exception below is thrown in {{FileSystemNodeLabelsStore#recover}}:
> {code}
> 2015-07-23 14:03:33,627 INFO org.apache.hadoop.service.AbstractService: 
> Service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager failed in 
> state STARTED; cause: java.io.IOException: Cannot remove label=x, because 
> queue=a1 is using this label. Please remove label on queue before remove the 
> label
> java.io.IOException: Cannot remove label=x, because queue=a1 is using this 
> label. Please remove label on queue before remove the label
> at 
> org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.checkRemoveFromClusterNodeLabelsOfQueue(RMNodeLabelsManager.java:104)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.removeFromClusterNodeLabels(RMNodeLabelsManager.java:118)
> at 
> org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:221)
> at 
> org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:232)
> at 
> org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:245)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:587)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:964)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1005)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1001)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1001)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:312)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:832)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:422)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> {code}





[jira] [Commented] (YARN-4981) limit the max containers to assign to one application per heartbeart

2016-04-21 Thread ChenFolin (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253193#comment-15253193
 ] 

ChenFolin commented on YARN-4981:
-

I set mapreduce.map|reduce.cpu.vcores=1.
As we know, vcores do not strictly limit the CPU.
So if I set maxAssign = 10 and the job is CPU-intensive, one node may end up 
running 10 YarnChild processes serving a single application, and that node may 
hit 100% CPU.
If I limit how many containers one application gets per heartbeat, I can spread 
the load more evenly.
For example: an application has 100 tasks, and 10 tasks are enough to drive a 
node to 100% CPU. Without a per-application limit per heartbeat, 10 nodes may 
run at 100% CPU. With a limit of 4 containers per application per heartbeat, 
the work spreads over 25 nodes and each node runs at only about 40% CPU.
I think this may be more efficient for the job.
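
To make the proposal concrete, here is an illustrative sketch of a 
per-application cap inside a FairScheduler-style node-update assignment loop 
(the config name and all identifiers are hypothetical, not the attached 
YARN-4981.patch):

{code}
import java.util.HashMap;
import java.util.Map;

class PerAppAssignLimitSketch {
  // Hypothetical knob, e.g. "yarn.scheduler.fair.max-assign-per-app"
  private final int maxAssignPerApp = 4;
  // Existing node-level cap (maxAssign semantics)
  private final int maxAssign = 10;

  int assignOnNodeHeartbeat(Iterable<String> sortedApps) {
    int assignedOnNode = 0;
    Map<String, Integer> assignedPerApp = new HashMap<>();
    for (String appId : sortedApps) {
      // Keep assigning to this app until the node cap, the per-app cap,
      // or the app's outstanding requests are exhausted.
      while (assignedOnNode < maxAssign
          && assignedPerApp.getOrDefault(appId, 0) < maxAssignPerApp
          && tryAssignContainer(appId)) {
        assignedOnNode++;
        assignedPerApp.merge(appId, 1, Integer::sum);
      }
      if (assignedOnNode >= maxAssign) {
        break; // node-level cap reached
      }
    }
    return assignedOnNode;
  }

  // Placeholder for the real scheduler logic that matches an outstanding
  // request from appId against this node's free resources.
  private boolean tryAssignContainer(String appId) {
    return false;
  }
}
{code}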



> limit the max containers to assign to one application per heartbeart
> 
>
> Key: YARN-4981
> URL: https://issues.apache.org/jira/browse/YARN-4981
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.5.0, 2.6.4
>Reporter: ChenFolin
> Attachments: YARN-4981.patch
>
>
> When assignMultiple is used and, for example, maxAssign=10:
> if a job has high CPU utilization or high CPU load, it may leave a subset of 
> nodes very busy, and the job may take a long time.
> So I want to limit the max containers assigned to one application per 
> heartbeat; it may help the nodes to be used more uniformly.





[jira] [Commented] (YARN-4983) JVM and UGI metrics disappear after RM is once transitioned to standby mode

2016-04-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253188#comment-15253188
 ] 

Hadoop QA commented on YARN-4983:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s 
{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 15s 
{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 
37s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 1s 
{color} | {color:green} trunk passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 49s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 
6s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 32s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
27s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 
40s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 15s 
{color} | {color:green} trunk passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 32s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 14s 
{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
12s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 7s 
{color} | {color:green} the patch passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 6m 7s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 43s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 6m 43s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 1m 4s 
{color} | {color:red} root: patch generated 2 new + 172 unchanged - 1 fixed = 
174 total (was 173) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 30s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
28s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 5s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 22s 
{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed 
with JDK v1.8.0_77. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 32s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 7m 43s {color} 
| {color:red} hadoop-common in the patch failed with JDK v1.8.0_77. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 68m 27s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_77. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 34s 
{color} | {color:green} hadoop-common in the patch passed with JDK v1.7.0_95. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 69m 22s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
21s {color} 

[jira] [Assigned] (YARN-1134) Add support for zipping/unzipping logs while in transit for the NM logs web-service

2016-04-21 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong reassigned YARN-1134:
---

Assignee: Xuan Gong

> Add support for zipping/unzipping logs while in transit for the NM logs 
> web-service
> ---
>
> Key: YARN-1134
> URL: https://issues.apache.org/jira/browse/YARN-1134
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Xuan Gong
>
> As [~zjshen] pointed out at 
> [YARN-649|https://issues.apache.org/jira/browse/YARN-649?focusedCommentId=13698415&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13698415],
> {quote}
> For the long running applications, they may have a big log file, such that it 
> will take a long time to download the log file via the RESTful API. 
> Consequently, the HTTP connection may time out before downloading a complete 
> log file. Maybe it is good to zip the log file before sending it, and unzip it 
> after receiving it.
> {quote}





[jira] [Updated] (YARN-4846) Random failures for TestCapacitySchedulerPreemption#testPreemptionPolicyShouldRespectAlreadyMarkedKillableContainers

2016-04-21 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4846:
---
Attachment: 0004-YARN-4846.patch

Uploading a patch after fixing the checkstyle issue.

> Random failures for 
> TestCapacitySchedulerPreemption#testPreemptionPolicyShouldRespectAlreadyMarkedKillableContainers
> 
>
> Key: YARN-4846
> URL: https://issues.apache.org/jira/browse/YARN-4846
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4846.patch, 0002-YARN-4846.patch, 
> 0003-YARN-4846.patch, 0004-YARN-4846.patch, YARN-4846-update-PCPP.patch
>
>
> {noformat}
> java.lang.AssertionError: expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerPreemption.testPreemptionPolicyShouldRespectAlreadyMarkedKillableContainers(TestCapacitySchedulerPreemption.java:473)
> {noformat}
> https://builds.apache.org/job/PreCommit-YARN-Build/10826/testReport/org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity/TestCapacitySchedulerPreemption/testPreemptionPolicyShouldRespectAlreadyMarkedKillableContainers/





[jira] [Updated] (YARN-4982) Test timeout :TestAMRMProxy testcase timeout always

2016-04-21 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4982:
---
Issue Type: Sub-task  (was: Bug)
Parent: YARN-4478

> Test timeout :TestAMRMProxy testcase timeout always
> ---
>
> Key: YARN-4982
> URL: https://issues.apache.org/jira/browse/YARN-4982
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>
> https://builds.apache.org/job/PreCommit-YARN-Build/11088/testReport/junit/org.apache.hadoop.yarn.client.api.impl/TestAMRMProxy/testAMRMProxyE2E/
> In the hadoop-yarn-client package, the {{TestAMRMProxy}} test case always times out:
> {noformat}
> java.lang.Exception: test timed out after 6 milliseconds
>   at java.lang.Thread.sleep(Native Method)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:160)
>   at com.sun.proxy.$Proxy85.getNewApplication(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getNewApplication(YarnClientImpl.java:227)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.createApplication(YarnClientImpl.java:235)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestAMRMProxy.createApp(TestAMRMProxy.java:367)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestAMRMProxy.testAMRMProxyE2E(TestAMRMProxy.java:110)
> {noformat}





[jira] [Resolved] (YARN-4919) Yarn logs should support a option to output logs as compressed archive

2016-04-21 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-4919.
---
Resolution: Duplicate

This is a duplicate of YARN-1134; closing it as such.

> Yarn logs should support a option to output logs as compressed archive
> --
>
> Key: YARN-4919
> URL: https://issues.apache.org/jira/browse/YARN-4919
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Xuan Gong
>






[jira] [Commented] (YARN-4577) Enable aux services to have their own custom classpath/jar file

2016-04-21 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253112#comment-15253112
 ] 

Sangjin Lee commented on YARN-4577:
---

Yes I think the POC patch is pretty close to what I had in mind too.

A couple more minor suggestions:
- I probably wouldn't make {{AuxServiceWithCustomClassLoader}} public. It 
should really be visible only to {{AuxServices}}; package scope should be fine.
- I understand {{callWithCustomClassLoader()}} is a bit complicated because it 
has to support methods with different signatures. I would simply inline the 
code (as you are doing with the {{service*()}} methods); then you don't need 
any reflection to make this work (a minimal sketch of the classloader-switch 
idea is included below).

Don't forget to test it with a real-life use case! Thanks.
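
For reference, a minimal sketch of the classloader-switch idea (the class names 
and wiring are assumptions for illustration, not the YARN-4577 POC patch):

{code}
import java.net.URL;
import java.net.URLClassLoader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.server.api.AuxiliaryService;

final class AuxServiceClassLoaderSketch {

  // Load the aux service implementation from its own jars, isolated from the
  // NM system classpath.
  static AuxiliaryService load(String className, URL[] serviceJars)
      throws Exception {
    ClassLoader custom =
        new URLClassLoader(serviceJars, AuxiliaryService.class.getClassLoader());
    return Class.forName(className, true, custom)
        .asSubclass(AuxiliaryService.class)
        .getDeclaredConstructor()
        .newInstance();
  }

  // Run one lifecycle call with the custom loader as the thread context
  // classloader; inlined per lifecycle method, so no reflection helper is needed.
  static void initWithCustomLoader(AuxiliaryService service, Configuration conf) {
    ClassLoader original = Thread.currentThread().getContextClassLoader();
    Thread.currentThread().setContextClassLoader(
        service.getClass().getClassLoader());
    try {
      service.init(conf);
    } finally {
      Thread.currentThread().setContextClassLoader(original);
    }
  }
}
{code}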

> Enable aux services to have their own custom classpath/jar file
> ---
>
> Key: YARN-4577
> URL: https://issues.apache.org/jira/browse/YARN-4577
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.8.0
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-4577.1.patch, YARN-4577.2.patch, 
> YARN-4577.20160119.1.patch, YARN-4577.20160204.patch, YARN-4577.3.patch, 
> YARN-4577.3.rebase.patch, YARN-4577.4.patch, YARN-4577.poc.patch
>
>
> Right now, users have to add their jars to the NM classpath directly, thus 
> putting them on the system classloader. But if multiple versions of the plugin 
> are present on the classpath, there is no control over which version actually 
> gets loaded. Or if there are any conflicts between the dependencies 
> introduced by the auxiliary service and the NM itself, they can break the NM, 
> the auxiliary service, or both.
> The solution could be to instantiate aux services using a classloader that is 
> different from the system classloader.





[jira] [Updated] (YARN-4987) EntityGroupFS timeline store needs to handle null storage gracefully

2016-04-21 Thread Li Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Lu updated YARN-4987:

Priority: Minor  (was: Major)

> EntityGroupFS timeline store needs to handle null storage gracefully
> 
>
> Key: YARN-4987
> URL: https://issues.apache.org/jira/browse/YARN-4987
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Li Lu
>Assignee: Li Lu
>Priority: Minor
>
> To handle concurrency issues, key value based timeline storage may return 
> null on reads that are concurrent to service stop. EntityGroupFS timeline 
> store needs to handle this case gracefully. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4766) NM should not aggregate logs older than the retention policy

2016-04-21 Thread Robert Kanter (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253086#comment-15253086
 ] 

Robert Kanter commented on YARN-4766:
-

Looks good overall.  Here are a few things:
# {{AggregatedLogFormat#getPendingLogFilesToUploadForThisContainer}} should be marked {{@VisibleForTesting}}.
# It's typically easier to maintain multiple constructors when the one with the most arguments does the "real" work and the others just delegate to it with default values for the missing arguments (see the sketch after this list). Can you make {{ApplicationImpl}} do that?
# In {{ApplicationImpl#buildAppProto}}, it's catching an {{IOException}} (which shouldn't occur) and simply logging it.  If this does somehow occur, execution will continue, which is probably bad.  Given that the only caller of this method already calls it from a try-catch block, I think we'd be better off throwing the {{IOException}}.  Also, the caller should log the {{Exception}} so that we get a stack trace.
# There's an extra newline above {{ApplicationImpl#ApplicationImpl}}.
# {{AppLogAggregatorImpl#uploadLogsForContainers}} creates a new array named {{paths}} that's never used.
# In {{TestAppLogAggregatorImpl}}, {{testAggregatorWithRetentionPolicyDisabled_shouldUploadAllFiles}} and {{testAggregatorWhenNoFileOlderThanRetentionPolicy_ShouldUploadAll}} are identical other than a config property or two.  Can we make a helper that has most of the code and pass it the config properties so we can combine the code here?
# There's an extra newline in the {{DeletionServiceDeleteTaskProto}} proto message.
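
A minimal sketch of the constructor-chaining pattern suggested in item 2 (the class and field names are illustrative, not the actual {{ApplicationImpl}} signature):
{code}
public class ApplicationImplSketch {
  private final String user;
  private final long logRetentionSeconds;

  // Narrow constructor: just fills in a default and delegates.
  public ApplicationImplSketch(String user) {
    this(user, -1L /* default: no retention limit */);
  }

  // The "real" constructor: all initialization logic lives in one place.
  public ApplicationImplSketch(String user, long logRetentionSeconds) {
    this.user = user;
    this.logRetentionSeconds = logRetentionSeconds;
  }
}
{code}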

> NM should not aggregate logs older than the retention policy
> 
>
> Key: YARN-4766
> URL: https://issues.apache.org/jira/browse/YARN-4766
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation, nodemanager
>Reporter: Haibo Chen
>Assignee: Haibo Chen
> Attachments: yarn4766.001.patch, yarn4766.002.patch, 
> yarn4766.003.patch
>
>
> When a log aggregation fails on the NM the information is for the attempt is 
> kept in the recovery DB. Log aggregation can fail for multiple reasons which 
> are often related to HDFS space or permissions.
> On restart the recovery DB is read and if an application attempt needs its 
> logs aggregated, the files are scheduled for aggregation without any checks. 
> The log files could be older than the retention limit in which case we should 
> not aggregate them but immediately mark them for deletion from the local file 
> system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4844) Upgrade fields of o.a.h.y.api.records.Resource from int32 to int64

2016-04-21 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253084#comment-15253084
 ] 

Hitesh Shah commented on YARN-4844:
---

bq. It is not a very hard thing to drop, so we'd better do it close to the first branch-3 release.

I believe a recent comment on the mailing list was targeting a 3.0 release within the next few weeks, so I guess that means we should make this change now?

> Upgrade fields of o.a.h.y.api.records.Resource from int32 to int64
> --
>
> Key: YARN-4844
> URL: https://issues.apache.org/jira/browse/YARN-4844
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Blocker
> Attachments: YARN-4844.1.patch, YARN-4844.2.patch
>
>
> We use int32 for memory now, if a cluster has 10k nodes, each node has 210G 
> memory, we will get a negative total cluster memory.
> And another case that easier overflows int32 is: we added all pending 
> resources of running apps to cluster's total pending resources. If a 
> problematic app requires too much resources (let's say 1M+ containers, each 
> of them has 3G containers), int32 will be not enough.
> Even if we can cap each app's pending request, we cannot handle the case that 
> there're many running apps, each of them has capped but still significant 
> numbers of pending resources.
> So we may possibly need to upgrade int32 memory field (could include v-cores 
> as well) to int64 to avoid integer overflow. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (YARN-4090) Make Collections.sort() more efficient in FSParentQueue.java

2016-04-21 Thread Yufei Gu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yufei Gu reopened YARN-4090:


> Make Collections.sort() more efficient in FSParentQueue.java
> 
>
> Key: YARN-4090
> URL: https://issues.apache.org/jira/browse/YARN-4090
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Reporter: Xianyin Xin
>Assignee: Xianyin Xin
> Attachments: YARN-4090-TestResult.pdf, YARN-4090-preview.patch, 
> sampling1.jpg, sampling2.jpg
>
>
> Collections.sort() consumes too much time in a scheduling round.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1297) Miscellaneous Fair Scheduler speedups

2016-04-21 Thread Yufei Gu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yufei Gu updated YARN-1297:
---
Attachment: YARN-1297.006.patch

> Miscellaneous Fair Scheduler speedups
> -
>
> Key: YARN-1297
> URL: https://issues.apache.org/jira/browse/YARN-1297
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Reporter: Sandy Ryza
>Assignee: Yufei Gu
> Attachments: YARN-1297-1.patch, YARN-1297-2.patch, 
> YARN-1297.005.patch, YARN-1297.006.patch, YARN-1297.3.patch, 
> YARN-1297.4.patch, YARN-1297.4.patch, YARN-1297.patch, YARN-1297.patch
>
>
> I ran the Fair Scheduler's core scheduling loop through a profiler tool and 
> identified a bunch of minimally invasive changes that can shave off a few 
> milliseconds.
> The main one is demoting a couple INFO log messages to DEBUG, which brought 
> my benchmark down from 16000 ms to 6000.
> A few others (which had way less of an impact) were
> * Most of the time in comparisons was being spent in Math.signum.  I switched 
> this to direct ifs and elses and it halved the percent of time spent in 
> comparisons.
> * I removed some unnecessary instantiations of Resource objects
> * I made it so that queues' usage wasn't calculated from the applications up 
> each time getResourceUsage was called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4846) Random failures for TestCapacitySchedulerPreemption#testPreemptionPolicyShouldRespectAlreadyMarkedKillableContainers

2016-04-21 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253058#comment-15253058
 ] 

Wangda Tan commented on YARN-4846:
--

Looks good to me, thanks! Will commit the patch tomorrow.

> Random failures for 
> TestCapacitySchedulerPreemption#testPreemptionPolicyShouldRespectAlreadyMarkedKillableContainers
> 
>
> Key: YARN-4846
> URL: https://issues.apache.org/jira/browse/YARN-4846
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4846.patch, 0002-YARN-4846.patch, 
> 0003-YARN-4846.patch, YARN-4846-update-PCPP.patch
>
>
> {noformat}
> java.lang.AssertionError: expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerPreemption.testPreemptionPolicyShouldRespectAlreadyMarkedKillableContainers(TestCapacitySchedulerPreemption.java:473)
> {noformat}
> https://builds.apache.org/job/PreCommit-YARN-Build/10826/testReport/org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity/TestCapacitySchedulerPreemption/testPreemptionPolicyShouldRespectAlreadyMarkedKillableContainers/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4844) Upgrade fields of o.a.h.y.api.records.Resource from int32 to int64

2016-04-21 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-4844:
-
Attachment: YARN-4844.2.patch

Rebased to latest trunk. (ver. 2)

> Upgrade fields of o.a.h.y.api.records.Resource from int32 to int64
> --
>
> Key: YARN-4844
> URL: https://issues.apache.org/jira/browse/YARN-4844
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Blocker
> Attachments: YARN-4844.1.patch, YARN-4844.2.patch
>
>
> We use int32 for memory now, if a cluster has 10k nodes, each node has 210G 
> memory, we will get a negative total cluster memory.
> And another case that easier overflows int32 is: we added all pending 
> resources of running apps to cluster's total pending resources. If a 
> problematic app requires too much resources (let's say 1M+ containers, each 
> of them has 3G containers), int32 will be not enough.
> Even if we can cap each app's pending request, we cannot handle the case that 
> there're many running apps, each of them has capped but still significant 
> numbers of pending resources.
> So we may possibly need to upgrade int32 memory field (could include v-cores 
> as well) to int64 to avoid integer overflow. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4844) Upgrade fields of o.a.h.y.api.records.Resource from int32 to int64

2016-04-21 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253048#comment-15253048
 ] 

Wangda Tan commented on YARN-4844:
--

[~hitesh],

Agreed, it is very messy, but there seems to be no other way to do it. :(
I also tried adding a new Resource object (like a YarnServerResource) that would be used only internally by YARN and would keep the user-facing API clean. But there are too many interactions between applications and services that use api.Resource; we would need to handle all of those cases separately, so it doesn't look like a doable plan to me.

For the branch-3 release, I prefer to keep only {{long getMemory/etc}}. However, since getMemory is used 1000+ times inside the resource manager project, if we drop {{Resource#int getMemory}} from trunk now, we would need to write two versions of patches for almost every RM fix, which would be a HUGE headache for YARN RM contributors. It is not a very hard thing to drop, so we'd better do it close to the first branch-3 release.

> Upgrade fields of o.a.h.y.api.records.Resource from int32 to int64
> --
>
> Key: YARN-4844
> URL: https://issues.apache.org/jira/browse/YARN-4844
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Blocker
> Attachments: YARN-4844.1.patch
>
>
> We use int32 for memory now, if a cluster has 10k nodes, each node has 210G 
> memory, we will get a negative total cluster memory.
> And another case that easier overflows int32 is: we added all pending 
> resources of running apps to cluster's total pending resources. If a 
> problematic app requires too much resources (let's say 1M+ containers, each 
> of them has 3G containers), int32 will be not enough.
> Even if we can cap each app's pending request, we cannot handle the case that 
> there're many running apps, each of them has capped but still significant 
> numbers of pending resources.
> So we may possibly need to upgrade int32 memory field (could include v-cores 
> as well) to int64 to avoid integer overflow. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1297) Miscellaneous Fair Scheduler speedups

2016-04-21 Thread Yufei Gu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253042#comment-15253042
 ] 

Yufei Gu commented on YARN-1297:


I agree. Let me upload a new patch soon and reopen YARN-4090.

> Miscellaneous Fair Scheduler speedups
> -
>
> Key: YARN-1297
> URL: https://issues.apache.org/jira/browse/YARN-1297
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Reporter: Sandy Ryza
>Assignee: Yufei Gu
> Attachments: YARN-1297-1.patch, YARN-1297-2.patch, 
> YARN-1297.005.patch, YARN-1297.3.patch, YARN-1297.4.patch, YARN-1297.4.patch, 
> YARN-1297.patch, YARN-1297.patch
>
>
> I ran the Fair Scheduler's core scheduling loop through a profiler tool and 
> identified a bunch of minimally invasive changes that can shave off a few 
> milliseconds.
> The main one is demoting a couple INFO log messages to DEBUG, which brought 
> my benchmark down from 16000 ms to 6000.
> A few others (which had way less of an impact) were
> * Most of the time in comparisons was being spent in Math.signum.  I switched 
> this to direct ifs and elses and it halved the percent of time spent in 
> comparisons.
> * I removed some unnecessary instantiations of Resource objects
> * I made it so that queues' usage wasn't calculated from the applications up 
> each time getResourceUsage was called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4987) EntityGroupFS timeline store needs to handle null storage gracefully

2016-04-21 Thread Li Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Lu updated YARN-4987:

Issue Type: Sub-task  (was: Bug)
Parent: YARN-4233

> EntityGroupFS timeline store needs to handle null storage gracefully
> 
>
> Key: YARN-4987
> URL: https://issues.apache.org/jira/browse/YARN-4987
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Li Lu
>Assignee: Li Lu
>
> To handle concurrency issues, key value based timeline storage may return 
> null on reads that are concurrent to service stop. EntityGroupFS timeline 
> store needs to handle this case gracefully. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4987) EntityGroupFS timeline store needs to handle null storage gracefully

2016-04-21 Thread Li Lu (JIRA)
Li Lu created YARN-4987:
---

 Summary: EntityGroupFS timeline store needs to handle null storage 
gracefully
 Key: YARN-4987
 URL: https://issues.apache.org/jira/browse/YARN-4987
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Li Lu
Assignee: Li Lu


To handle concurrency issues, key value based timeline storage may return null 
on reads that are concurrent to service stop. EntityGroupFS timeline store 
needs to handle this case gracefully. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4390) Consider container request size during CS preemption

2016-04-21 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-4390:
-
Attachment: YARN-4390.4.patch

> Consider container request size during CS preemption
> 
>
> Key: YARN-4390
> URL: https://issues.apache.org/jira/browse/YARN-4390
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Affects Versions: 3.0.0, 2.8.0, 2.7.3
>Reporter: Eric Payne
>Assignee: Wangda Tan
> Attachments: YARN-4390-design.1.pdf, YARN-4390-test-results.pdf, 
> YARN-4390.1.patch, YARN-4390.2.patch, YARN-4390.3.branch-2.patch, 
> YARN-4390.3.patch, YARN-4390.4.patch
>
>
> There are multiple reasons why preemption could unnecessarily preempt 
> containers. One is that an app could be requesting a large container (say 
> 8-GB), and the preemption monitor could conceivably preempt multiple 
> containers (say 8, 1-GB containers) in order to fill the large container 
> request. These smaller containers would then be rejected by the requesting AM 
> and potentially given right back to the preempted app.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4390) Consider container request size during CS preemption

2016-04-21 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253032#comment-15253032
 ] 

Wangda Tan commented on YARN-4390:
--

[~eepayne], thanks for the review!

bq. I think this JIRA gets us closer to that goal, but there may be a possibility for the killed container to go someplace else. Is that right?
Yes, that is true, and the reservation needs to happen before we can correctly preempt resources for large containers. For example, if YARN-4280 occurs, we cannot reserve and preempt containers correctly.

Addressed most of your comments, except:

bq. Even though killableContainers is an unmodifiableMap, I think it can still change, can't it?
Yes, it can change. And actually, all of the existing preemption logic assumes changes can happen:
- In the micro view: we clone queue metrics at the beginning of editSchedule, but the queue metrics can change while the preemption logic runs.
- In the macro view: selected candidates can flip between invalid and valid before max-kill-wait is reached (since a queue's resource usage can be updated during the max-kill-wait period).
So, back to your question: if killableContainers is modified during editSchedule, we can correct for it in the next (and subsequent) editSchedule runs.

bq. I am a little concerned about calling preemptionContext.getScheduler().getAllNodes()) to get the list of all of the nodes on every iteration of the preemption monitor...
This is a valid concern. However, as far as I know, the Fair Scheduler already uses this method for async scheduling, and async scheduling is widely used by Fair Scheduler users. See the logic:
{code}
  void continuousSchedulingAttempt() throws InterruptedException {
    long start = getClock().getTime();
    List<FSSchedulerNode> nodeIdList =
        nodeTracker.sortedNodeList(nodeAvailableResourceComparator);

    // iterate all nodes
    for (FSSchedulerNode node : nodeIdList) {
      try {
        if (Resources.fitsIn(minimumAllocation,
            node.getUnallocatedResource())) {
          attemptScheduling(node);
        }
      }
      ...
    }
{code}
I haven't seen any JIRA complaining about the performance impact of this approach.
And since it uses an R/W lock, the write lock is acquired only on node add/remove or node resource update, so in most cases nobody holds the write lock. I agree to cache the node list inside PCPP if we do see performance issues.

Attaching the ver.4 patch, please kindly review.

> Consider container request size during CS preemption
> 
>
> Key: YARN-4390
> URL: https://issues.apache.org/jira/browse/YARN-4390
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Affects Versions: 3.0.0, 2.8.0, 2.7.3
>Reporter: Eric Payne
>Assignee: Wangda Tan
> Attachments: YARN-4390-design.1.pdf, YARN-4390-test-results.pdf, 
> YARN-4390.1.patch, YARN-4390.2.patch, YARN-4390.3.branch-2.patch, 
> YARN-4390.3.patch
>
>
> There are multiple reasons why preemption could unnecessarily preempt 
> containers. One is that an app could be requesting a large container (say 
> 8-GB), and the preemption monitor could conceivably preempt multiple 
> containers (say 8, 1-GB containers) in order to fill the large container 
> request. These smaller containers would then be rejected by the requesting AM 
> and potentially given right back to the preempted app.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4962) support filling up containers on node one by one

2016-04-21 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253030#comment-15253030
 ] 

sandflee commented on YARN-4962:


Thanks [~templedf]. Node labels don't seem to solve our problem, because:
1. our nodes mostly have the same resources, so it's hard to label them A/B;
2. a type B job couldn't run on a type A node, which wastes resources when type A nodes are free.

>  support filling up containers on node one by one
> -
>
> Key: YARN-4962
> URL: https://issues.apache.org/jira/browse/YARN-4962
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: sandflee
>
> we had a gpu cluster, jobs with bigger resource request couldn't be satisfied 
> for node is running the jobs with smaller resource request.  we didn't open 
> reserve system because gpu jobs may run days or weeks. we expect scheduler 
> allocate containers to fill the node , then there will be resource to run   
> jobs with big resource request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4844) Upgrade fields of o.a.h.y.api.records.Resource from int32 to int64

2016-04-21 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253028#comment-15253028
 ] 

Hitesh Shah commented on YARN-4844:
---

getMemoryLong(), etc. just seems messy. I can understand why this is needed on branch-2 if we need to support long, but for trunk it seems better to change getMemory() to return a long.

> Upgrade fields of o.a.h.y.api.records.Resource from int32 to int64
> --
>
> Key: YARN-4844
> URL: https://issues.apache.org/jira/browse/YARN-4844
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Blocker
> Attachments: YARN-4844.1.patch
>
>
> We use int32 for memory now, if a cluster has 10k nodes, each node has 210G 
> memory, we will get a negative total cluster memory.
> And another case that easier overflows int32 is: we added all pending 
> resources of running apps to cluster's total pending resources. If a 
> problematic app requires too much resources (let's say 1M+ containers, each 
> of them has 3G containers), int32 will be not enough.
> Even if we can cap each app's pending request, we cannot handle the case that 
> there're many running apps, each of them has capped but still significant 
> numbers of pending resources.
> So we may possibly need to upgrade int32 memory field (could include v-cores 
> as well) to int64 to avoid integer overflow. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4862) Handle duplicate completed containers in RMNodeImpl

2016-04-21 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253020#comment-15253020
 ] 

Jian He commented on YARN-4862:
---

I think RMNodeImpl#completedContainers could be leaked, e.g. if the container is completed on the RM side (preempted, etc.):
1. the container may first be added into RMNodeImpl#containersToBeRemovedFromNM;
2. at this point completedContainers does not yet contain the container, so {{completedContainers.removeAll(this.containersToBeRemovedFromNM);}} does nothing;
3. later on, the container finishes at the NM and gets added into completedContainers, and it will remain there forever.



> Handle duplicate completed containers in RMNodeImpl
> ---
>
> Key: YARN-4862
> URL: https://issues.apache.org/jira/browse/YARN-4862
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
> Attachments: 0001-YARN-4862.patch, 0002-YARN-4862.patch
>
>
> As per 
> [comment|https://issues.apache.org/jira/browse/YARN-4852?focusedCommentId=15209689&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15209689]
>  from [~sharadag], there should be safe guard for duplicated container status 
> in RMNodeImpl before creating UpdatedContainerInfo. 
> Or else in heavily loaded cluster where event processing is gradually slow, 
> if any duplicated container are sent to RM(may be bug in NM also), there is 
> significant impact that RMNodImpl always create UpdatedContainerInfo for 
> duplicated containers. This result in increase in the heap memory and causes 
> problem like YARN-4852.
> This is an optimization for issue kind YARN-4852



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4556) TestFifoScheduler.testResourceOverCommit fails

2016-04-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253011#comment-15253011
 ] 

Hadoop QA commented on YARN-4556:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 21s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 
40s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s 
{color} | {color:green} trunk passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
18s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 35s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
15s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 5s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s 
{color} | {color:green} trunk passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
30s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s 
{color} | {color:green} the patch passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 24s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 27s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
15s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 33s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
11s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
14s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s 
{color} | {color:green} the patch passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 74m 8s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_77. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 74m 43s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
17s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 165m 22s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_77 Failed junit tests | 
hadoop.yarn.server.resourcemanager.webapp.TestRMWebappAuthentication |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestApplicationPriority |
|   | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA |
|   | hadoop.yarn.server.resourcemanager.TestApplicationMasterLauncher |
|   | 
hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokenAuthentication
 |
|   | hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart |
|   | hadoop.yarn.server.resourcemanager.Tes

[jira] [Commented] (YARN-4844) Upgrade fields of o.a.h.y.api.records.Resource from int32 to int64

2016-04-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253005#comment-15253005
 ] 

Hadoop QA commented on YARN-4844:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 4s {color} 
| {color:red} YARN-4844 does not apply to trunk. Rebase required? Wrong Branch? 
See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12800125/YARN-4844.1.patch |
| JIRA Issue | YARN-4844 |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/11170/console |
| Powered by | Apache Yetus 0.2.0   http://yetus.apache.org |


This message was automatically generated.



> Upgrade fields of o.a.h.y.api.records.Resource from int32 to int64
> --
>
> Key: YARN-4844
> URL: https://issues.apache.org/jira/browse/YARN-4844
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Blocker
> Attachments: YARN-4844.1.patch
>
>
> We use int32 for memory now, if a cluster has 10k nodes, each node has 210G 
> memory, we will get a negative total cluster memory.
> And another case that easier overflows int32 is: we added all pending 
> resources of running apps to cluster's total pending resources. If a 
> problematic app requires too much resources (let's say 1M+ containers, each 
> of them has 3G containers), int32 will be not enough.
> Even if we can cap each app's pending request, we cannot handle the case that 
> there're many running apps, each of them has capped but still significant 
> numbers of pending resources.
> So we may possibly need to upgrade int32 memory field (could include v-cores 
> as well) to int64 to avoid integer overflow. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4844) Upgrade fields of o.a.h.y.api.records.Resource from int32 to int64

2016-04-21 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253004#comment-15253004
 ] 

Wangda Tan commented on YARN-4844:
--

An additional note about why this is a 2.8 blocker:

Currently the Capacity Scheduler relies on the total pending resource: when trying to assign a container for each node heartbeat, starting from the root queue, it skips any queue that has <= 0 pending resources.

So if the pending memory resource overflows, no more containers can be allocated. A quick check of the overflow described in the issue summary is sketched below.

branch-2.7 will not be affected since the new logic is only in branch-2.8.
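
A minimal arithmetic check of that overflow, using the numbers from the issue summary and assuming memory is tracked in MB:
{code}
public class Int32MemoryOverflowCheck {
  public static void main(String[] args) {
    int perNodeMb = 210 * 1024;                   // 215,040 MB per node
    int nodes = 10_000;
    int totalAsInt = perNodeMb * nodes;           // overflows to -2,144,567,296
    long totalAsLong = (long) perNodeMb * nodes;  // 2,150,400,000 MB
    System.out.println(totalAsInt + " vs " + totalAsLong);
  }
}
{code}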

> Upgrade fields of o.a.h.y.api.records.Resource from int32 to int64
> --
>
> Key: YARN-4844
> URL: https://issues.apache.org/jira/browse/YARN-4844
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Blocker
> Attachments: YARN-4844.1.patch
>
>
> We use int32 for memory now, if a cluster has 10k nodes, each node has 210G 
> memory, we will get a negative total cluster memory.
> And another case that easier overflows int32 is: we added all pending 
> resources of running apps to cluster's total pending resources. If a 
> problematic app requires too much resources (let's say 1M+ containers, each 
> of them has 3G containers), int32 will be not enough.
> Even if we can cap each app's pending request, we cannot handle the case that 
> there're many running apps, each of them has capped but still significant 
> numbers of pending resources.
> So we may possibly need to upgrade int32 memory field (could include v-cores 
> as well) to int64 to avoid integer overflow. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4971) RM fails to re-bind to wildcard IP after failover in multi homed clusters

2016-04-21 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-4971:
---
Affects Version/s: (was: 3.0.0)
   2.7.2

> RM fails to re-bind to wildcard IP after failover in multi homed clusters
> -
>
> Key: YARN-4971
> URL: https://issues.apache.org/jira/browse/YARN-4971
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.2
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
> Attachments: YARN-4971.1.patch
>
>
> If the RM has the {{yarn.resourcemanager.bind-host}} set to 0.0.0.0 the first 
> time the service becomes active binding to the wildcard works as expected. If 
> the service has transitioned from active to standby and then becomes active 
> again after failovers the service only binds to one of the ip addresses.
> There is a difference between the services inside the RM: it only seem to 
> happen for the services listening on ports: 8030 and 8032



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4971) RM fails to re-bind to wildcard IP after failover in multi homed clusters

2016-04-21 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-4971:
---
Target Version/s: 2.9.0

> RM fails to re-bind to wildcard IP after failover in multi homed clusters
> -
>
> Key: YARN-4971
> URL: https://issues.apache.org/jira/browse/YARN-4971
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.2
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
> Attachments: YARN-4971.1.patch
>
>
> If the RM has the {{yarn.resourcemanager.bind-host}} set to 0.0.0.0 the first 
> time the service becomes active binding to the wildcard works as expected. If 
> the service has transitioned from active to standby and then becomes active 
> again after failovers the service only binds to one of the ip addresses.
> There is a difference between the services inside the RM: it only seem to 
> happen for the services listening on ports: 8030 and 8032



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1297) Miscellaneous Fair Scheduler speedups

2016-04-21 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252984#comment-15252984
 ] 

Karthik Kambatla commented on YARN-1297:


Looks like the test failures are because of the preemption and runnability logic and the tests' reliance on the timing of the queue-usage update. I remember Sandy measured that the logging changes by themselves led to a good improvement.

What do you think of doing only the logging changes here and driving the resources change as part of YARN-4090?

> Miscellaneous Fair Scheduler speedups
> -
>
> Key: YARN-1297
> URL: https://issues.apache.org/jira/browse/YARN-1297
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Reporter: Sandy Ryza
>Assignee: Yufei Gu
> Attachments: YARN-1297-1.patch, YARN-1297-2.patch, 
> YARN-1297.005.patch, YARN-1297.3.patch, YARN-1297.4.patch, YARN-1297.4.patch, 
> YARN-1297.patch, YARN-1297.patch
>
>
> I ran the Fair Scheduler's core scheduling loop through a profiler tool and 
> identified a bunch of minimally invasive changes that can shave off a few 
> milliseconds.
> The main one is demoting a couple INFO log messages to DEBUG, which brought 
> my benchmark down from 16000 ms to 6000.
> A few others (which had way less of an impact) were
> * Most of the time in comparisons was being spent in Math.signum.  I switched 
> this to direct ifs and elses and it halved the percent of time spent in 
> comparisons.
> * I removed some unnecessary instantiations of Resource objects
> * I made it so that queues' usage wasn't calculated from the applications up 
> each time getResourceUsage was called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4844) Upgrade fields of o.a.h.y.api.records.Resource from int32 to int64

2016-04-21 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252976#comment-15252976
 ] 

Wangda Tan commented on YARN-4844:
--

Discussed this issue with [~vinodkv], [~jianhe], and [~hitesh].

The good news is that Google PB has backward/forward compatibility across the integer field types, see: https://developers.google.com/protocol-buffers/docs/proto#updating:
bq. int32, uint32, int64, uint64, and bool are all compatible – this means you can change a field from one of these types to another without breaking forwards- or backwards-compatibility. If a number is parsed from the wire which doesn't fit in the corresponding type, you will get the same effect as if you had cast the number to that type in C++ (e.g. if a 64-bit number is read as an int32, it will be truncated to 32 bits).
So we can change ResourceProto from int32 to int64 without compatibility problems.

In addition to the .proto change, the following changes are required for the API record Resource (see the sketch below):
- Update {{set_(int ...)}} to {{set_(long ...)}}; there is no compatibility issue for setters.
- Add {{getMemoryLong}} and {{getVirtualCoresLong}} methods.

We also need to update the Metrics objects related to Resources, such as QueueMetrics; AFAIK there is no compatibility issue there.

The last part is scheduler and test fixes.

Attached the ver.1 patch for review.
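
A rough sketch of what the compatible accessors could look like (illustrative only; the attached ver.1 patch is authoritative and the exact signatures may differ):
{code}
public abstract class Resource {
  // Existing int getters kept for source compatibility; values above
  // Integer.MAX_VALUE are truncated, so callers should move to the long variants.
  public int getMemory() {
    return (int) getMemoryLong();
  }

  public int getVirtualCores() {
    return (int) getVirtualCoresLong();
  }

  // New 64-bit accessors proposed above.
  public abstract long getMemoryLong();
  public abstract long getVirtualCoresLong();

  // Setters widened from int to long; passing an int still compiles.
  public abstract void setMemory(long memory);
  public abstract void setVirtualCores(long vCores);
}
{code}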

> Upgrade fields of o.a.h.y.api.records.Resource from int32 to int64
> --
>
> Key: YARN-4844
> URL: https://issues.apache.org/jira/browse/YARN-4844
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Blocker
> Attachments: YARN-4844.1.patch
>
>
> We use int32 for memory now, if a cluster has 10k nodes, each node has 210G 
> memory, we will get a negative total cluster memory.
> And another case that easier overflows int32 is: we added all pending 
> resources of running apps to cluster's total pending resources. If a 
> problematic app requires too much resources (let's say 1M+ containers, each 
> of them has 3G containers), int32 will be not enough.
> Even if we can cap each app's pending request, we cannot handle the case that 
> there're many running apps, each of them has capped but still significant 
> numbers of pending resources.
> So we may possibly need to upgrade int32 memory field (could include v-cores 
> as well) to int64 to avoid integer overflow. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4844) Upgrade fields of o.a.h.y.api.records.Resource from int32 to int64

2016-04-21 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-4844:
-
Attachment: YARN-4844.1.patch

> Upgrade fields of o.a.h.y.api.records.Resource from int32 to int64
> --
>
> Key: YARN-4844
> URL: https://issues.apache.org/jira/browse/YARN-4844
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Blocker
> Attachments: YARN-4844.1.patch
>
>
> We use int32 for memory now, if a cluster has 10k nodes, each node has 210G 
> memory, we will get a negative total cluster memory.
> And another case that easier overflows int32 is: we added all pending 
> resources of running apps to cluster's total pending resources. If a 
> problematic app requires too much resources (let's say 1M+ containers, each 
> of them has 3G containers), int32 will be not enough.
> Even if we can cap each app's pending request, we cannot handle the case that 
> there're many running apps, each of them has capped but still significant 
> numbers of pending resources.
> So we may possibly need to upgrade int32 memory field (could include v-cores 
> as well) to int64 to avoid integer overflow. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4844) Upgrade fields of o.a.h.y.api.records.Resource from int32 to int64

2016-04-21 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-4844:
-
Target Version/s: 2.8.0

> Upgrade fields of o.a.h.y.api.records.Resource from int32 to int64
> --
>
> Key: YARN-4844
> URL: https://issues.apache.org/jira/browse/YARN-4844
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Critical
>
> We use int32 for memory now, if a cluster has 10k nodes, each node has 210G 
> memory, we will get a negative total cluster memory.
> And another case that easier overflows int32 is: we added all pending 
> resources of running apps to cluster's total pending resources. If a 
> problematic app requires too much resources (let's say 1M+ containers, each 
> of them has 3G containers), int32 will be not enough.
> Even if we can cap each app's pending request, we cannot handle the case that 
> there're many running apps, each of them has capped but still significant 
> numbers of pending resources.
> So we may possibly need to upgrade int32 memory field (could include v-cores 
> as well) to int64 to avoid integer overflow. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4844) Upgrade fields of o.a.h.y.api.records.Resource from int32 to int64

2016-04-21 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-4844:
-
Priority: Blocker  (was: Critical)

> Upgrade fields of o.a.h.y.api.records.Resource from int32 to int64
> --
>
> Key: YARN-4844
> URL: https://issues.apache.org/jira/browse/YARN-4844
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Blocker
>
> We use int32 for memory now, if a cluster has 10k nodes, each node has 210G 
> memory, we will get a negative total cluster memory.
> And another case that easier overflows int32 is: we added all pending 
> resources of running apps to cluster's total pending resources. If a 
> problematic app requires too much resources (let's say 1M+ containers, each 
> of them has 3G containers), int32 will be not enough.
> Even if we can cap each app's pending request, we cannot handle the case that 
> there're many running apps, each of them has capped but still significant 
> numbers of pending resources.
> So we may possibly need to upgrade int32 memory field (could include v-cores 
> as well) to int64 to avoid integer overflow. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-4844) Upgrade fields of o.a.h.y.api.records.Resource from int32 to int64

2016-04-21 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reassigned YARN-4844:


Assignee: Wangda Tan

> Upgrade fields of o.a.h.y.api.records.Resource from int32 to int64
> --
>
> Key: YARN-4844
> URL: https://issues.apache.org/jira/browse/YARN-4844
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Critical
>
> We use int32 for memory now, if a cluster has 10k nodes, each node has 210G 
> memory, we will get a negative total cluster memory.
> And another case that easier overflows int32 is: we added all pending 
> resources of running apps to cluster's total pending resources. If a 
> problematic app requires too much resources (let's say 1M+ containers, each 
> of them has 3G containers), int32 will be not enough.
> Even if we can cap each app's pending request, we cannot handle the case that 
> there're many running apps, each of them has capped but still significant 
> numbers of pending resources.
> So we may possibly need to upgrade int32 memory field (could include v-cores 
> as well) to int64 to avoid integer overflow. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api

2016-04-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252916#comment-15252916
 ] 

Hadoop QA commented on YARN-3931:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 4s {color} 
| {color:red} YARN-3931 does not apply to trunk. Rebase required? Wrong Branch? 
See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12745729/YARN-3931.001.patch |
| JIRA Issue | YARN-3931 |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/11167/console |
| Powered by | Apache Yetus 0.2.0   http://yetus.apache.org |


This message was automatically generated.



> default-node-label-expression doesn’t apply when an application is submitted 
> by RM rest api
> ---
>
> Key: YARN-3931
> URL: https://issues.apache.org/jira/browse/YARN-3931
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0
> Environment: hadoop-2.6.0
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>  Labels: patch
> Attachments: YARN-3931.001.patch
>
>
> * 
> yarn.scheduler.capacity..default-node-label-expression=large_disk
> * submit an application using rest api without "app-node-label-expression”, 
> "am-container-node-label-expression”
> * RM doesn’t allocate containers to the hosts associated with large_disk node 
> label



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3816) [Aggregation] App-level aggregation and accumulation for YARN system metrics

2016-04-21 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252880#comment-15252880
 ] 

Sangjin Lee commented on YARN-3816:
---

Understood. I commented because I thought the set of entities that should skip this aggregation would be pretty generic, and I couldn't think of why it wouldn't be.

I'm +1 on the latest patch. I'll wait a little while so others can also look at it and chime in. Thanks!

> [Aggregation] App-level aggregation and accumulation for YARN system metrics
> 
>
> Key: YARN-3816
> URL: https://issues.apache.org/jira/browse/YARN-3816
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Junping Du
>Assignee: Li Lu
>  Labels: yarn-2928-1st-milestone
> Attachments: Application Level Aggregation of Timeline Data.pdf, 
> YARN-3816-YARN-2928-v1.patch, YARN-3816-YARN-2928-v2.1.patch, 
> YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch, 
> YARN-3816-YARN-2928-v2.patch, YARN-3816-YARN-2928-v3.1.patch, 
> YARN-3816-YARN-2928-v3.patch, YARN-3816-YARN-2928-v4.patch, 
> YARN-3816-YARN-2928-v5.patch, YARN-3816-YARN-2928-v6.patch, 
> YARN-3816-YARN-2928-v7.patch, YARN-3816-YARN-2928-v8.patch, 
> YARN-3816-YARN-2928-v9.patch, YARN-3816-feature-YARN-2928.v4.1.patch, 
> YARN-3816-poc-v1.patch, YARN-3816-poc-v2.patch
>
>
> We need application level aggregation of Timeline data:
> - To present end user aggregated states for each application, include: 
> resource (CPU, Memory) consumption across all containers, number of 
> containers launched/completed/failed, etc. We need this for apps while they 
> are running as well as when they are done.
> - Also, framework specific metrics, e.g. HDFS_BYTES_READ, should be 
> aggregated to show details of states in framework level.
> - Other level (Flow/User/Queue) aggregation can be more efficient to be based 
> on Application-level aggregations rather than raw entity-level data as much 
> less raws need to scan (with filter out non-aggregated entities, like: 
> events, configurations, etc.).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4556) TestFifoScheduler.testResourceOverCommit fails

2016-04-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252845#comment-15252845
 ] 

Hudson commented on YARN-4556:
--

FAILURE: Integrated in Hadoop-trunk-Commit #9649 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/9649/])
YARN-4556. TestFifoScheduler.testResourceOverCommit fails. Contributed (epayne: 
rev 3dce486d88895dcdf443f4d0064d1fb6e9116045)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/TestFifoScheduler.java


>  TestFifoScheduler.testResourceOverCommit fails
> ---
>
> Key: YARN-4556
> URL: https://issues.apache.org/jira/browse/YARN-4556
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Reporter: Akihiro Suda
> Attachments: YARN-4556-1.patch
>
>
> From YARN-4548 Jenkins log: 
> https://builds.apache.org/job/PreCommit-YARN-Build/10181/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_66.txt
> {code}
> Running 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.TestFifoScheduler
> Tests run: 16, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 31.004 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.TestFifoScheduler
> testResourceOverCommit(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.TestFifoScheduler)
>   Time elapsed: 4.746 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<-2048> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.TestFifoScheduler.testResourceOverCommit(TestFifoScheduler.java:1142)
> {code}
> https://github.com/apache/hadoop/blob/8676a118a12165ae5a8b80a2a4596c133471ebc1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/TestFifoScheduler.java#L1142
> It seems that Jenkins has been hitting this intermittently since April 2015
> https://www.google.com/search?q=TestFifoScheduler.testResourceOverCommit



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4676) Automatic and Asynchronous Decommissioning Nodes Status Tracking

2016-04-21 Thread Daniel Zhi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Zhi updated YARN-4676:
-
Attachment: YARN-4676.013.patch

Review comments update:
1. Added the GracefulDecommision link in hadoop-project/src/site/site.xml;
2. Updated GracefulDecommision.md to be more end-user oriented;
3. Added a newline to TestDecommissioningNodesWatcher.java as suggested. (This however led to "warning: 1 line adds whitespace errors" during "git apply --verbose YARN-4676.013.patch".)

> Automatic and Asynchronous Decommissioning Nodes Status Tracking
> 
>
> Key: YARN-4676
> URL: https://issues.apache.org/jira/browse/YARN-4676
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Daniel Zhi
>Assignee: Daniel Zhi
>  Labels: features
> Attachments: GracefulDecommissionYarnNode.pdf, YARN-4676.004.patch, 
> YARN-4676.005.patch, YARN-4676.006.patch, YARN-4676.007.patch, 
> YARN-4676.008.patch, YARN-4676.009.patch, YARN-4676.010.patch, 
> YARN-4676.011.patch, YARN-4676.012.patch, YARN-4676.013.patch
>
>
> DecommissioningNodeWatcher inside ResourceTrackingService tracks 
> DECOMMISSIONING nodes status automatically and asynchronously after 
> client/admin made the graceful decommission request. It tracks 
> DECOMMISSIONING nodes status to decide when, after all running containers on 
> the node have completed, will be transitioned into DECOMMISSIONED state. 
> NodesListManager detect and handle include and exclude list changes to kick 
> out decommission or recommission as necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4390) Consider container request size during CS preemption

2016-04-21 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252803#comment-15252803
 ] 

Eric Payne commented on YARN-4390:
--

Thanks, [~leftnoteasy], for the detailed work on this issue. Overall, I think 
the approach looks good. One thing I still wonder about is even if the 
preemption policy kills the perfectly sized container, will the scheduler know 
that it needs to assign those freed resources to the same app that requested 
them? I think this JIRA gets us closer to that goal, but there may be a 
possibility for the killed container to go someplace else. Is that right? Even 
if that's the case, I still like this approach.

Here are a few comments about the patch.
* I would rename "select_candidates_for_reserved_containers" to 
"select_based_on_reserved_containers"
* 
{{CapacitySchedulerPreemptionUtils#deductPreemptableResourcesBasedSelectedCandidates}}
** {{res}} is never used. Was it intended to be passed to 
{{tq.deductActuallyToBePreempted}}, or is it needed only to test if 
{{c.getReservedResource()}} and {{c.getAllocatedResource}} are not null?
* Here's just a very minor thing: {{FifoCandidatesSelector#selectCandidates}}:
   * {{... could already select containers ...}} could be changed to {{... 
could have already selected containers ...}}, for clarity.
* {{ReservedContainerCandidateSelector#getPreemptionCandidatesOnNode}}:
{code}
Map killableContainers =
node.getKillableContainers();
{code}
**  Even though killableContainers is an {{unmodifiableMap}}, I think it can 
still change, can't it? It looks like {{killableContainers}} is only used once 
in this method. Would it make more sense to wait until {{killableContainers}} 
is ready to use before calling  {{node.getKillableContainers()}}?
* {{ReservedContainerCandidateSelector#getNodesForPreemption}}:
** I am a little concerned about calling 
{{preemptionContext.getScheduler().getAllNodes()}} to get the list of all of 
the nodes on every iteration of the preemption monitor. I can't think of a 
better way to handle this, but it could be expensive on the RM, since 
{{getAllNodes()}} will lock the {{ClusterNodeTracker}} while it creates the 
list of nodes, and anything trying to use the {{ClusterNodeTracker}}'s 
resources will have to wait. It may not be a problem, but I know that we 
sometimes see the RM getting bogged down, and I am concerned about adding 
another long wait every 15 seconds (the default), or however long the 
preemption monitor interval is configured for. The nodes list doesn't change 
all that often. I wonder if it would make sense to cache it and only update it 
periodically (every {{n-th}} iteration?). A rough sketch of that idea follows 
below.
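
A minimal sketch of that caching idea (not part of the attached patch): the 
class and interface names below (NodeSnapshotCache, NodeFetcher) are made up 
for illustration, and the fetcher stands in for whatever path ends up calling 
{{preemptionContext.getScheduler().getAllNodes()}}.
{code}
import java.util.Collections;
import java.util.List;

public class NodeSnapshotCache<N> {
  /** Abstraction over the expensive call, e.g. scheduler.getAllNodes(). */
  public interface NodeFetcher<N> {
    List<N> fetchAllNodes();
  }

  private final NodeFetcher<N> fetcher;
  private final int refreshEvery;        // refresh every n-th call
  private int callsSinceRefresh;
  private List<N> cachedNodes = Collections.emptyList();

  public NodeSnapshotCache(NodeFetcher<N> fetcher, int refreshEvery) {
    this.fetcher = fetcher;
    this.refreshEvery = refreshEvery;
    this.callsSinceRefresh = refreshEvery; // force a fetch on first use
  }

  /** Returns the cached node list, hitting the scheduler only every n-th call. */
  public synchronized List<N> getNodes() {
    if (callsSinceRefresh >= refreshEvery) {
      cachedNodes = fetcher.fetchAllNodes(); // the only call that locks the tracker
      callsSinceRefresh = 0;
    }
    callsSinceRefresh++;
    return cachedNodes;
  }
}
{code}
The obvious trade-off is staleness: nodes added or removed between refreshes 
stay invisible to the preemption monitor until the next refresh, which may be 
acceptable given how rarely the node list changes.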


> Consider container request size during CS preemption
> 
>
> Key: YARN-4390
> URL: https://issues.apache.org/jira/browse/YARN-4390
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Affects Versions: 3.0.0, 2.8.0, 2.7.3
>Reporter: Eric Payne
>Assignee: Wangda Tan
> Attachments: YARN-4390-design.1.pdf, YARN-4390-test-results.pdf, 
> YARN-4390.1.patch, YARN-4390.2.patch, YARN-4390.3.branch-2.patch, 
> YARN-4390.3.patch
>
>
> There are multiple reasons why preemption could unnecessarily preempt 
> containers. One is that an app could be requesting a large container (say 
> 8-GB), and the preemption monitor could conceivably preempt multiple 
> containers (say 8, 1-GB containers) in order to fill the large container 
> request. These smaller containers would then be rejected by the requesting AM 
> and potentially given right back to the preempted app.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4981) limit the max containers to assign to one application per heartbeart

2016-04-21 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252787#comment-15252787
 ] 

Karthik Kambatla commented on YARN-4981:


Does limiting to one application per heartbeat solve this problem? What happens 
if the scheduler assigns multiple CPU-intensive tasks (from different 
applications) on the same node? If the job is CPU-intensive, maybe it should 
ask for more vcores so the CPU is not oversubscribed?

Also, [~ywskycn] and I were looking into improving this in the past. We were 
considering determining maxAssign dynamically based on the delta between the 
node's allocation and the overall average allocation per node in the cluster. 
What do you think of that? 
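
To make that concrete, here is a rough, hypothetical sketch of such a dynamic 
maxAssign; none of the names below exist in the FairScheduler, and looking 
only at memory is a simplification.
{code}
public final class DynamicMaxAssign {
  private DynamicMaxAssign() {
  }

  public static int compute(long clusterAllocatedMb, int numNodes,
      long nodeAllocatedMb, long containerMb, int hardCap) {
    if (numNodes <= 0 || containerMb <= 0) {
      return 1;
    }
    long avgPerNodeMb = clusterAllocatedMb / numNodes;
    long headroomMb = avgPerNodeMb - nodeAllocatedMb;
    if (headroomMb <= 0) {
      // Node is already at or above the cluster average: hand out at most one
      // container per heartbeat so load spreads to emptier nodes.
      return 1;
    }
    // Under-loaded nodes may take several containers, capped by the static limit.
    return Math.max(1, Math.min(hardCap, (int) (headroomMb / containerMb)));
  }
}
{code}
The idea is that a node far below the cluster average gets a larger 
per-heartbeat budget, while a node at or above the average falls back to one 
container.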

> limit the max containers to assign to one application per heartbeart
> 
>
> Key: YARN-4981
> URL: https://issues.apache.org/jira/browse/YARN-4981
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.5.0, 2.6.4
>Reporter: ChenFolin
> Attachments: YARN-4981.patch
>
>
> When assignMultiple is enabled and, for example, maxAssign=10:
> if a job has high CPU utilization or high CPU load, it may leave some nodes 
> very busy, and the job may take a long time.
> So I want to limit the max containers assigned to one application per 
> heartbeat; it may help spread usage across nodes more uniformly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4983) JVM and UGI metrics disappear after RM is once transitioned to standby mode

2016-04-21 Thread Li Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Lu updated YARN-4983:

Attachment: YARN-4983-trunk.000.patch

This is a patch to quickly fix the problem. I'm reattaching the JvmMetrics 
after the RM gets transitioned to the standby state. UgiMetrics is a different 
story since it fails when it gets reattached; as a quick fix I recreate it.

Right now this patch is just a demo of a quick fix. I think the root 
contradiction here is that our metrics system does not quite consider the case 
where the metrics system gets restarted (and metrics get reattached), while the 
RM HA code relaunches the metrics system. If there are more elegant ways to fix 
this, please do let me know. Thanks!
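
For readers following along, a minimal sketch of the "reattach" idea only; this 
is not the attached patch, and it glosses over exactly the re-registration 
corner cases (e.g. UgiMetrics) that this JIRA is about.
{code}
import org.apache.hadoop.metrics2.MetricsSystem;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.metrics2.source.JvmMetrics;

public final class JvmMetricsReattach {
  private JvmMetricsReattach() {
  }

  /**
   * Re-register the JVM metrics source against the current (freshly
   * re-initialized) metrics system after an RM HA transition.
   */
  public static JvmMetrics reattach() {
    MetricsSystem ms = DefaultMetricsSystem.instance();
    // JvmMetrics.create registers a JvmMetrics source with the given system;
    // running it against the relaunched system makes the JVM gauges visible
    // again in that system.
    return JvmMetrics.create("ResourceManager", null, ms);
  }
}
{code}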

> JVM and UGI metrics disappear after RM is once transitioned to standby mode
> ---
>
> Key: YARN-4983
> URL: https://issues.apache.org/jira/browse/YARN-4983
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Li Lu
>Assignee: Li Lu
> Attachments: YARN-4983-trunk.000.patch
>
>
> When get transitioned to standby, the RM will shutdown the existing metric 
> system and relaunch a new one. This will cause the jvm metrics and ugi 
> metrics to miss in the new metric system. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api

2016-04-21 Thread Raul Gutierrez Segales (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252745#comment-15252745
 ] 

Raul Gutierrez Segales commented on YARN-3931:
--

[~wangda], [~kyungwan nam] - ping, can we apply this?

> default-node-label-expression doesn’t apply when an application is submitted 
> by RM rest api
> ---
>
> Key: YARN-3931
> URL: https://issues.apache.org/jira/browse/YARN-3931
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0
> Environment: hadoop-2.6.0
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>  Labels: patch
> Attachments: YARN-3931.001.patch
>
>
> * 
> yarn.scheduler.capacity..default-node-label-expression=large_disk
> * submit an application using rest api without "app-node-label-expression”, 
> "am-container-node-label-expression”
> * RM doesn’t allocate containers to the hosts associated with large_disk node 
> label



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4556) TestFifoScheduler.testResourceOverCommit fails

2016-04-21 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252732#comment-15252732
 ] 

Eric Payne commented on YARN-4556:
--

+1

>  TestFifoScheduler.testResourceOverCommit fails
> ---
>
> Key: YARN-4556
> URL: https://issues.apache.org/jira/browse/YARN-4556
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Reporter: Akihiro Suda
> Attachments: YARN-4556-1.patch
>
>
> From YARN-4548 Jenkins log: 
> https://builds.apache.org/job/PreCommit-YARN-Build/10181/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_66.txt
> {code}
> Running 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.TestFifoScheduler
> Tests run: 16, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 31.004 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.TestFifoScheduler
> testResourceOverCommit(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.TestFifoScheduler)
>   Time elapsed: 4.746 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<-2048> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.TestFifoScheduler.testResourceOverCommit(TestFifoScheduler.java:1142)
> {code}
> https://github.com/apache/hadoop/blob/8676a118a12165ae5a8b80a2a4596c133471ebc1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/TestFifoScheduler.java#L1142
> It seems that Jenkins has been hitting this intermittently since April 2015
> https://www.google.com/search?q=TestFifoScheduler.testResourceOverCommit



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3816) [Aggregation] App-level aggregation and accumulation for YARN system metrics

2016-04-21 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252716#comment-15252716
 ] 

Li Lu commented on YARN-3816:
-

Thanks [~sjlee0]. A quick note on the design: the goal here is to let each 
concrete type of timeline collector define its own skip types. The challenge is 
that updateAggregateStatus is static (it has to be, so that we can provide the 
static aggregateEntities method for offline aggregation). This limits the 
ability to customize the "skip set" for each TimelineCollector subclass, since 
static methods cannot be overridden. One solution is to use the strategy 
pattern and let each timeline collector decide its own set of skipped types, 
but I do not want every instance of a timeline collector to hold its own skip 
set, since the set should be the same for all instances of the same class. 
Therefore, I'm making the getEntityTypesSkipAggregation method an instance 
method, but both TimelineCollector and AppLevelTimelineCollector can simply 
return their class-level skip sets. The two static sets 
(entityTypesSkipAggregation) just happen to have the same name in the two 
classes, but they're not interfering with each other.

Not sure if this is clear enough, but any suggestions would be helpful. Thanks!
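
A stripped-down illustration of that shape (class names shortened and bodies 
hypothetical; the entity-type string is only a placeholder):
{code}
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

abstract class CollectorSketch {
  // Class-level skip set for the base collector: nothing is skipped by default.
  private static final Set<String> SKIP_TYPES = Collections.emptySet();

  // Instance method purely so subclasses can override it (statics cannot be
  // overridden); each class still returns its own class-level constant.
  protected Set<String> getEntityTypesSkipAggregation() {
    return SKIP_TYPES;
  }

  boolean shouldAggregate(String entityType) {
    return !getEntityTypesSkipAggregation().contains(entityType);
  }
}

class AppLevelCollectorSketch extends CollectorSketch {
  private static final Set<String> SKIP_TYPES = new HashSet<String>();
  static {
    SKIP_TYPES.add("YARN_APPLICATION"); // placeholder entity type
  }

  @Override
  protected Set<String> getEntityTypesSkipAggregation() {
    return SKIP_TYPES;
  }
}
{code}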

> [Aggregation] App-level aggregation and accumulation for YARN system metrics
> 
>
> Key: YARN-3816
> URL: https://issues.apache.org/jira/browse/YARN-3816
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Junping Du
>Assignee: Li Lu
>  Labels: yarn-2928-1st-milestone
> Attachments: Application Level Aggregation of Timeline Data.pdf, 
> YARN-3816-YARN-2928-v1.patch, YARN-3816-YARN-2928-v2.1.patch, 
> YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch, 
> YARN-3816-YARN-2928-v2.patch, YARN-3816-YARN-2928-v3.1.patch, 
> YARN-3816-YARN-2928-v3.patch, YARN-3816-YARN-2928-v4.patch, 
> YARN-3816-YARN-2928-v5.patch, YARN-3816-YARN-2928-v6.patch, 
> YARN-3816-YARN-2928-v7.patch, YARN-3816-YARN-2928-v8.patch, 
> YARN-3816-YARN-2928-v9.patch, YARN-3816-feature-YARN-2928.v4.1.patch, 
> YARN-3816-poc-v1.patch, YARN-3816-poc-v2.patch
>
>
> We need application level aggregation of Timeline data:
> - To present end users aggregated states for each application, including 
> resource (CPU, Memory) consumption across all containers, number of 
> containers launched/completed/failed, etc. We need this for apps while they 
> are running as well as when they are done.
> - Also, framework-specific metrics, e.g. HDFS_BYTES_READ, should be 
> aggregated to show details of states at the framework level.
> - Other levels of aggregation (Flow/User/Queue) can be computed more 
> efficiently from application-level aggregations than from raw entity-level 
> data, as far fewer raw rows need to be scanned (after filtering out 
> non-aggregated entities such as events, configurations, etc.).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4968) A couple of AM retry unit tests need to wait SchedulerApplicationAttempt stopped.

2016-04-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252713#comment-15252713
 ] 

Hudson commented on YARN-4968:
--

FAILURE: Integrated in Hadoop-trunk-Commit #9648 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/9648/])
YARN-4968. A couple of AM retry unit tests need to wait (gtcarrera9: rev 
7c6339f66ac301406504be28841bc3f3bfebc8ae)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/TestAMRestart.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/TestSchedulerUtils.java


> A couple of AM retry unit tests need to wait SchedulerApplicationAttempt 
> stopped.
> -
>
> Key: YARN-4968
> URL: https://issues.apache.org/jira/browse/YARN-4968
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Fix For: 2.9.0
>
> Attachments: YARN-4968.1.patch
>
>
> Noticed some unit tests, for example:
> TestRMRestart#testRMRestartAfterPreemption
> TestAMRestart#testShouldNotCountFailureToMaxAttemptRetry
> These sometimes fail because the retrying app attempt registers before the 
> previous scheduler-application-attempt has completely finished in the 
> scheduler.
> We need to wait for the scheduler-application-attempt to stop before retrying 
> the following attempts. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4957) Add getNewReservation in ApplicationClientProtocol

2016-04-21 Thread Sean Po (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Po updated YARN-4957:
--
Attachment: YARN-4957.v5.patch

Thanks [~subru] for your comments. V5 of the patch addresses each of them.

> Add getNewReservation in ApplicationClientProtocol
> --
>
> Key: YARN-4957
> URL: https://issues.apache.org/jira/browse/YARN-4957
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: applications, client, resourcemanager
>Affects Versions: 2.8.0
>Reporter: Subru Krishnan
>Assignee: Sean Po
>  Labels: api-breaking
> Attachments: YARN-4957.v0.patch, YARN-4957.v1.patch, 
> YARN-4957.v2.patch, YARN-4957.v3.patch, YARN-4957.v4.patch, YARN-4957.v5.patch
>
>
> Currently submitReservation returns a ReservationId if successful. This JIRA 
> proposes adding a getNewReservation in ApplicationClientProtocol for the 
> following reasons:
>   * Prevent zombie reservations in the face of client and/or network failures 
> post submitReservation
>   * Align reservation submission with application submission



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4846) Random failures for TestCapacitySchedulerPreemption#testPreemptionPolicyShouldRespectAlreadyMarkedKillableContainers

2016-04-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252689#comment-15252689
 ] 

Hadoop QA commented on YARN-4846:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 
42s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s 
{color} | {color:green} trunk passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
20s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 37s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
15s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 5s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s 
{color} | {color:green} trunk passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
29s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s 
{color} | {color:green} the patch passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 25s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 17s 
{color} | {color:red} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 patch generated 1 new + 21 unchanged - 0 fixed = 22 total (was 21) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 33s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
12s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
18s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s 
{color} | {color:green} the patch passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 75m 36s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_77. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 77m 10s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
20s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 169m 25s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_77 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| JDK v1.8.0_77 Timed out junit tests | 
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes |
| JDK v1.7.0_95 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| JDK v1.7.0_95 Timed out junit tests | 
org.apache

[jira] [Updated] (YARN-4968) A couple of AM retry unit tests need to wait SchedulerApplicationAttempt stopped.

2016-04-21 Thread Li Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Lu updated YARN-4968:

Fix Version/s: (was: 3.0.0)

> A couple of AM retry unit tests need to wait SchedulerApplicationAttempt 
> stopped.
> -
>
> Key: YARN-4968
> URL: https://issues.apache.org/jira/browse/YARN-4968
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Fix For: 2.9.0
>
> Attachments: YARN-4968.1.patch
>
>
> Noticed some unit tests, for example:
> TestRMRestart#testRMRestartAfterPreemption
> TestAMRestart#testShouldNotCountFailureToMaxAttemptRetry
> These sometimes fail because the retrying app attempt registers before the 
> previous scheduler-application-attempt has completely finished in the 
> scheduler.
> We need to wait for the scheduler-application-attempt to stop before retrying 
> the following attempts. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4968) A couple of AM retry unit tests need to wait SchedulerApplicationAttempt stopped.

2016-04-21 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252640#comment-15252640
 ] 

Li Lu commented on YARN-4968:
-

Oh BTW, thanks [~wangda] for the work! 

> A couple of AM retry unit tests need to wait SchedulerApplicationAttempt 
> stopped.
> -
>
> Key: YARN-4968
> URL: https://issues.apache.org/jira/browse/YARN-4968
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Fix For: 3.0.0, 2.9.0
>
> Attachments: YARN-4968.1.patch
>
>
> Noticed some unit tests, for example:
> TestRMRestart#testRMRestartAfterPreemption
> TestAMRestart#testShouldNotCountFailureToMaxAttemptRetry
> These sometimes fail because the retrying app attempt registers before the 
> previous scheduler-application-attempt has completely finished in the 
> scheduler.
> We need to wait for the scheduler-application-attempt to stop before retrying 
> the following attempts. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4968) A couple of AM retry unit tests need to wait SchedulerApplicationAttempt stopped.

2016-04-21 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252594#comment-15252594
 ] 

Li Lu commented on YARN-4968:
-

No further concerns raised. I'll commit this patch shortly. 

> A couple of AM retry unit tests need to wait SchedulerApplicationAttempt 
> stopped.
> -
>
> Key: YARN-4968
> URL: https://issues.apache.org/jira/browse/YARN-4968
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-4968.1.patch
>
>
> Noticed some unit tests, for example:
> TestRMRestart#testRMRestartAfterPreemption
> TestAMRestart#testShouldNotCountFailureToMaxAttemptRetry
> These sometimes fail because the retrying app attempt registers before the 
> previous scheduler-application-attempt has completely finished in the 
> scheduler.
> We need to wait for the scheduler-application-attempt to stop before retrying 
> the following attempts. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4556) TestFifoScheduler.testResourceOverCommit fails

2016-04-21 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252593#comment-15252593
 ] 

Nathan Roberts commented on YARN-4556:
--

Patch seems like a reasonable test improvement. +1 non-binding


>  TestFifoScheduler.testResourceOverCommit fails
> ---
>
> Key: YARN-4556
> URL: https://issues.apache.org/jira/browse/YARN-4556
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Reporter: Akihiro Suda
> Attachments: YARN-4556-1.patch
>
>
> From YARN-4548 Jenkins log: 
> https://builds.apache.org/job/PreCommit-YARN-Build/10181/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_66.txt
> {code}
> Running 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.TestFifoScheduler
> Tests run: 16, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 31.004 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.TestFifoScheduler
> testResourceOverCommit(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.TestFifoScheduler)
>   Time elapsed: 4.746 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<-2048> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.TestFifoScheduler.testResourceOverCommit(TestFifoScheduler.java:1142)
> {code}
> https://github.com/apache/hadoop/blob/8676a118a12165ae5a8b80a2a4596c133471ebc1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/TestFifoScheduler.java#L1142
> It seems that Jenkins has been hitting this intermittently since April 2015
> https://www.google.com/search?q=TestFifoScheduler.testResourceOverCommit



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4957) Add getNewReservation in ApplicationClientProtocol

2016-04-21 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252540#comment-15252540
 ] 

Subru Krishnan commented on YARN-4957:
--

Thanks [~seanpo03] for addressing my comments. The latest patch LGTM, +1 from 
my side.

Just a couple of minor comments on the documentation:
  * Move the part about getting the reservationId to either a pre-step or step 
0 so that the text aligns with the diagram.
  * The link to get a new reservationId is broken.

[~curino], can you take a look?

> Add getNewReservation in ApplicationClientProtocol
> --
>
> Key: YARN-4957
> URL: https://issues.apache.org/jira/browse/YARN-4957
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: applications, client, resourcemanager
>Affects Versions: 2.8.0
>Reporter: Subru Krishnan
>Assignee: Sean Po
>  Labels: api-breaking
> Attachments: YARN-4957.v0.patch, YARN-4957.v1.patch, 
> YARN-4957.v2.patch, YARN-4957.v3.patch, YARN-4957.v4.patch
>
>
> Currently submitReservation returns a ReservationId if successful. This JIRA 
> proposes adding a getNewReservation in ApplicationClientProtocol for the 
> following reasons:
>   * Prevent zombie reservations in the face of client and/or network failures 
> post submitReservation
>   * Align reservation submission with application submission



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4954) TestYarnClient.testReservationAPIs fails on machines with less than 4 GB available memory

2016-04-21 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252523#comment-15252523
 ] 

Junping Du commented on YARN-4954:
--

Hi [~vinodkv], I saw you moved this one into a subtask of YARN-4478. 
Here we noticed timeout issues for 4 test classes: 
{noformat}
org.apache.hadoop.yarn.client.cli.TestYarnCLI
org.apache.hadoop.yarn.client.api.impl.TestYarnClient
org.apache.hadoop.yarn.client.api.impl.TestAMRMClient
org.apache.hadoop.yarn.client.api.impl.TestNMClient
{noformat}
Shall we file JIRAs under YARN-4478 to track them? If so, as a whole or 
separately?

> TestYarnClient.testReservationAPIs fails on machines with less than 4 GB 
> available memory
> -
>
> Key: YARN-4954
> URL: https://issues.apache.org/jira/browse/YARN-4954
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: test
>Affects Versions: 3.0.0
>Reporter: Gergely Novák
>Assignee: Gergely Novák
>Priority: Minor
> Attachments: YARN-4954.001.patch
>
>
> TestYarnClient.testReservationAPIs sometimes fails with this error:
> {noformat}
> java.lang.AssertionError: 
> org.apache.hadoop.yarn.server.resourcemanager.reservation.exceptions.PlanningException:
>  The request cannot be satisfied
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitReservation(ClientRMService.java:1254)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitReservation(ApplicationClientProtocolPBServiceImpl.java:457)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:515)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2422)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2418)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1742)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2416)
> Caused by: 
> org.apache.hadoop.yarn.server.resourcemanager.reservation.exceptions.PlanningException:
>  The request cannot be satisfied
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.reservation.planning.IterativePlanner.computeJobAllocation(IterativePlanner.java:151)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.reservation.planning.PlanningAlgorithm.allocateUser(PlanningAlgorithm.java:64)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.reservation.planning.PlanningAlgorithm.createReservation(PlanningAlgorithm.java:140)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.reservation.planning.TryManyReservationAgents.createReservation(TryManyReservationAgents.java:55)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.reservation.planning.AlignedPlannerWithGreedy.createReservation(AlignedPlannerWithGreedy.java:84)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitReservation(ClientRMService.java:1237)
>   ... 10 more
>   at org.junit.Assert.fail(Assert.java:88)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestYarnClient.testReservationAPIs(TestYarnClient.java:1227)
> {noformat}
> This is caused by really not having enough available memory to complete the 
> reservation (4 * 1024 MB). In my opinion lowering the required memory (either 
> by lowering the number of containers to 2, or the memory to 512 MB) would 
> make the test more stable. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4697) NM aggregation thread pool is not bound by limits

2016-04-21 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252503#comment-15252503
 ] 

Junping Du commented on YARN-4697:
--

After investigating our cluster, which also hit the same issue recently, I 
think there are two root causes here:
1. Due to YARN-4325, too many stale applications don't get purged from the NM 
state store. So the NM recovers too many stale applications, each of which 
initializes log aggregation for its app.
2. Because these applications are stale, some operations, like createAppDir(), 
fail with token issues, but we swallow the exception there and continue to 
create an invalid aggregator. I just filed YARN-4984 to fix this issue.

> NM aggregation thread pool is not bound by limits
> -
>
> Key: YARN-4697
> URL: https://issues.apache.org/jira/browse/YARN-4697
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Critical
> Fix For: 2.9.0
>
> Attachments: yarn4697.001.patch, yarn4697.002.patch, 
> yarn4697.003.patch, yarn4697.004.patch
>
>
> In LogAggregationService.java we create a threadpool to upload logs from 
> the nodemanager to HDFS if log aggregation is turned on. This is a cached 
> threadpool which, based on the javadoc, is an unlimited pool of threads.
> In the case that we have had a problem with log aggregation, this could cause 
> a problem on restart. The number of threads created at that point could be 
> huge and will put a large load on the NameNode; in the worst case it could 
> even bring it down due to file descriptor issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4986) Add a check in the coprocessor for table to operated on

2016-04-21 Thread Vrushali C (JIRA)
Vrushali C created YARN-4986:


 Summary: Add a check in the coprocessor for table to operated on
 Key: YARN-4986
 URL: https://issues.apache.org/jira/browse/YARN-4986
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vrushali C
Assignee: Vrushali C


As a precautionary measure, it will be a good idea to have the coprocessor code 
check which table it needs to be working on and return/proceed accordingly. 
This is more of a safety check so that we are sure we are not inadvertently 
executing the coprocessor code on some other table.
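
A small sketch of such a check, assuming a RegionObserver-style coprocessor; 
the expected table name and the helper class below are hypothetical.
{code}
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;

final class CoprocessorTableCheck {
  // Hypothetical table name; the real name would come from configuration.
  private static final String EXPECTED_TABLE = "timelineservice.flowrun";

  private CoprocessorTableCheck() {
  }

  /** True only when the coprocessor is serving a region of the expected table. */
  static boolean isExpectedTable(RegionCoprocessorEnvironment env) {
    TableName table = env.getRegion().getRegionInfo().getTable();
    return EXPECTED_TABLE.equals(table.getNameAsString());
  }
}
{code}
Each hook (prePut, preGetOp, etc.) would first call isExpectedTable() and 
simply fall back to default behavior when it returns false.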





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4985) Refactor the coprocessor code & other definition classes into independent packages

2016-04-21 Thread Vrushali C (JIRA)
Vrushali C created YARN-4985:


 Summary: Refactor the coprocessor code & other definition classes 
into independent packages
 Key: YARN-4985
 URL: https://issues.apache.org/jira/browse/YARN-4985
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vrushali C
Assignee: Vrushali C



As part of the coprocessor deployment, we have realized that it will be much 
cleaner to have the coprocessor code sit in a package which does not depend on 
hadoop-yarn-server classes. It only needs hbase and other util classes.

These util classes and the tag-definition related classes can be refactored 
into their own independent "definition" package, so that making changes to the 
coprocessor code, upgrading hbase, deploying hbase on a cluster with a 
different hadoop version, etc. all become operationally much easier and less 
error prone with respect to mismatched library jars.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-4786) Enhance hbase coprocessor aggregation operations:GLOBAL_MIN, LATEST_MIN etc and FINAL attributes

2016-04-21 Thread Vrushali C (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vrushali C reassigned YARN-4786:


Assignee: Vrushali C

> Enhance hbase coprocessor aggregation operations:GLOBAL_MIN, LATEST_MIN etc 
> and FINAL attributes
> 
>
> Key: YARN-4786
> URL: https://issues.apache.org/jira/browse/YARN-4786
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Vrushali C
>Assignee: Vrushali C
>
> As part of YARN-4062, Joep and I had been discussing about min, max 
> operations and the final attributes. 
> YARN-4062 has GLOBAL_MIN, GLOBAL_MAX and SUM operations. It presently 
> indicates SUM_FINAL for a cell that contains a metric that is the final value 
> for the metric.
> We should enhance this such that the aggregation dimensions SUM, MIN, MAX, 
> etc. are really set at a per-column level and shouldn't be passed from the 
> client, but be instrumented by the ColumnHelper infrastructure instead. We 
> should probably use a different tag value for that.
> Both the aggregation dimension and this "FINAL_VALUE" (or whatever 
> abbreviation we use) are needed to determine the right thing to do for 
> compaction. Only one value needs to have this final-value bit / tag set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3816) [Aggregation] App-level aggregation and accumulation for YARN system metrics

2016-04-21 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252492#comment-15252492
 ] 

Sangjin Lee commented on YARN-3816:
---

We're almost there...

It appears that {{entityTypesSkipAggregation}} is in two places: 
{{TimelineCollector}} and {{AppLevelTimelineCollector}}. And in 
{{TimelineCollector}} it is not being populated, whereas it is populated in 
{{AppLevelTimelineCollector}}. This is rather confusing. What I would suggest 
is to keep it only in {{TimelineCollector}} (I don't think this is dependent on 
the app-level timeline collector?). Then we could remove the 
{{getEntityTypesSkipAggregation()}} method and directly reference it at the 
places where we need it.

> [Aggregation] App-level aggregation and accumulation for YARN system metrics
> 
>
> Key: YARN-3816
> URL: https://issues.apache.org/jira/browse/YARN-3816
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Junping Du
>Assignee: Li Lu
>  Labels: yarn-2928-1st-milestone
> Attachments: Application Level Aggregation of Timeline Data.pdf, 
> YARN-3816-YARN-2928-v1.patch, YARN-3816-YARN-2928-v2.1.patch, 
> YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch, 
> YARN-3816-YARN-2928-v2.patch, YARN-3816-YARN-2928-v3.1.patch, 
> YARN-3816-YARN-2928-v3.patch, YARN-3816-YARN-2928-v4.patch, 
> YARN-3816-YARN-2928-v5.patch, YARN-3816-YARN-2928-v6.patch, 
> YARN-3816-YARN-2928-v7.patch, YARN-3816-YARN-2928-v8.patch, 
> YARN-3816-YARN-2928-v9.patch, YARN-3816-feature-YARN-2928.v4.1.patch, 
> YARN-3816-poc-v1.patch, YARN-3816-poc-v2.patch
>
>
> We need application level aggregation of Timeline data:
> - To present end users aggregated states for each application, including 
> resource (CPU, Memory) consumption across all containers, number of 
> containers launched/completed/failed, etc. We need this for apps while they 
> are running as well as when they are done.
> - Also, framework-specific metrics, e.g. HDFS_BYTES_READ, should be 
> aggregated to show details of states at the framework level.
> - Other levels of aggregation (Flow/User/Queue) can be computed more 
> efficiently from application-level aggregations than from raw entity-level 
> data, as far fewer raw rows need to be scanned (after filtering out 
> non-aggregated entities such as events, configurations, etc.).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4984) LogAggregationService shouldn't swallow exception in handling createAppDir() which cause thread leak.

2016-04-21 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252490#comment-15252490
 ] 

Junping Du commented on YARN-4984:
--

The exception swallowing happens at LogAggregationService.initAppAggregator()

{noformat}
  // wait until check for existing aggregator to create dirs
YarnRuntimeException appDirException = null;
try {
  // Create the app dir
  createAppDir(user, appId, userUgi);
} catch (Exception e) {
  appLogAggregator.disableLogAggregation();
  if (!(e instanceof YarnRuntimeException)) {
appDirException = new YarnRuntimeException(e);
  } else {
appDirException = (YarnRuntimeException)e;
  }
}
... 
// creating aggregator thread
{noformat}
We should throw the exception out when createAppDir() fails, instead of going 
on to create the aggregator.
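
A hedged sketch of that direction (not the committed fix; it only mirrors the 
variables in the snippet above):
{code}
// Sketch only: surface the createAppDir() failure to the caller instead of
// falling through and spawning an aggregator thread for the broken app.
try {
  createAppDir(user, appId, userUgi);
} catch (Exception e) {
  appLogAggregator.disableLogAggregation();
  throw (e instanceof YarnRuntimeException)
      ? (YarnRuntimeException) e
      : new YarnRuntimeException(e);
}
// ... only now create and start the aggregator thread ...
{code}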

> LogAggregationService shouldn't swallow exception in handling createAppDir() 
> which cause thread leak.
> -
>
> Key: YARN-4984
> URL: https://issues.apache.org/jira/browse/YARN-4984
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 2.7.2
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
>
> Due to YARN-4325, many stale applications still exist in the NM state store 
> and get recovered after NM restart. The app initiation fails due to an 
> invalid token, but the exception is swallowed and an aggregator thread is 
> still created for the invalid app.
> Exception is:
> {noformat}
> 158 2016-04-19 23:38:33,039 ERROR logaggregation.LogAggregationService 
> (LogAggregationService.java:run(300)) - Failed to setup application log 
> directory for application_1448060878692_11842
> 159 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 1380589 for hdfswrite) can't be fo
> und in cache
> 160 at org.apache.hadoop.ipc.Client.call(Client.java:1427)
> 161 at org.apache.hadoop.ipc.Client.call(Client.java:1358)
> 162 at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> 163 at com.sun.proxy.$Proxy13.getFileInfo(Unknown Source)
> 164 at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
> 165 at sun.reflect.GeneratedMethodAccessor76.invoke(Unknown 
> Source)
> 166 at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 167 at java.lang.reflect.Method.invoke(Method.java:606)
> 168 at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
> 169 at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
> 170 at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
> 171 at 
> org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2116)
> 172 at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1315)
> 173 at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1311)
> 174 at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> 175 at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1311)
> 176 at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.checkExists(LogAggregationService.java:248)
> 177 at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.access$100(LogAggregationService.java:67)
> 178 at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:276)
> 179 at java.security.AccessController.doPrivileged(Native Method)
> 180 at javax.security.auth.Subject.doAs(Subject.java:415)
> 181 at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> 182 at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.createAppDir(LogAggregationService.java:261)
> 183 at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:367)
> 184 at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:320)
> 185   

[jira] [Created] (YARN-4984) LogAggregationService shouldn't swallow exception in handling createAppDir() which cause thread leak.

2016-04-21 Thread Junping Du (JIRA)
Junping Du created YARN-4984:


 Summary: LogAggregationService shouldn't swallow exception in 
handling createAppDir() which cause thread leak.
 Key: YARN-4984
 URL: https://issues.apache.org/jira/browse/YARN-4984
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation
Affects Versions: 2.7.2
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical


Due to YARN-4325, many stale applications still exist in the NM state store and 
get recovered after NM restart. The app initiation fails due to an invalid 
token, but the exception is swallowed and an aggregator thread is still created 
for the invalid app.

Exception is:
{noformat}
158 2016-04-19 23:38:33,039 ERROR logaggregation.LogAggregationService 
(LogAggregationService.java:run(300)) - Failed to setup application log 
directory for application_1448060878692_11842
159 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
 token (HDFS_DELEGATION_TOKEN token 1380589 for hdfswrite) can't be fo
und in cache
160 at org.apache.hadoop.ipc.Client.call(Client.java:1427)
161 at org.apache.hadoop.ipc.Client.call(Client.java:1358)
162 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
163 at com.sun.proxy.$Proxy13.getFileInfo(Unknown Source)
164 at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
165 at sun.reflect.GeneratedMethodAccessor76.invoke(Unknown Source)
166 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
167 at java.lang.reflect.Method.invoke(Method.java:606)
168 at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
169 at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
170 at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
171 at 
org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2116)
172 at 
org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1315)
173 at 
org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1311)
174 at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
175 at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1311)
176 at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.checkExists(LogAggregationService.java:248)
177 at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.access$100(LogAggregationService.java:67)
178 at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:276)
179 at java.security.AccessController.doPrivileged(Native Method)
180 at javax.security.auth.Subject.doAs(Subject.java:415)
181 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
182 at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.createAppDir(LogAggregationService.java:261)
183 at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:367)
184 at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:320)
185 at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:447)
186 at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67)

{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4954) TestYarnClient.testReservationAPIs fails on machines with less than 4 GB available memory

2016-04-21 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-4954:
--
Issue Type: Sub-task  (was: Test)
Parent: YARN-4478

> TestYarnClient.testReservationAPIs fails on machines with less than 4 GB 
> available memory
> -
>
> Key: YARN-4954
> URL: https://issues.apache.org/jira/browse/YARN-4954
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: test
>Affects Versions: 3.0.0
>Reporter: Gergely Novák
>Assignee: Gergely Novák
>Priority: Minor
> Attachments: YARN-4954.001.patch
>
>
> TestYarnClient.testReservationAPIs sometimes fails with this error:
> {noformat}
> java.lang.AssertionError: 
> org.apache.hadoop.yarn.server.resourcemanager.reservation.exceptions.PlanningException:
>  The request cannot be satisfied
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitReservation(ClientRMService.java:1254)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitReservation(ApplicationClientProtocolPBServiceImpl.java:457)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:515)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2422)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2418)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1742)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2416)
> Caused by: 
> org.apache.hadoop.yarn.server.resourcemanager.reservation.exceptions.PlanningException:
>  The request cannot be satisfied
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.reservation.planning.IterativePlanner.computeJobAllocation(IterativePlanner.java:151)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.reservation.planning.PlanningAlgorithm.allocateUser(PlanningAlgorithm.java:64)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.reservation.planning.PlanningAlgorithm.createReservation(PlanningAlgorithm.java:140)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.reservation.planning.TryManyReservationAgents.createReservation(TryManyReservationAgents.java:55)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.reservation.planning.AlignedPlannerWithGreedy.createReservation(AlignedPlannerWithGreedy.java:84)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitReservation(ClientRMService.java:1237)
>   ... 10 more
>   at org.junit.Assert.fail(Assert.java:88)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestYarnClient.testReservationAPIs(TestYarnClient.java:1227)
> {noformat}
> This is caused by really not having enough available memory to complete the 
> reservation (4 * 1024 MB). In my opinion lowering the required memory (either 
> by lowering the number of containers to 2, or the memory to 512 MB) would 
> make the test more stable. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list

2016-04-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252464#comment-15252464
 ] 

Hadoop QA commented on YARN-4311:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 13s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 3 new or modified test 
files. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 17s 
{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 
27s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 52s 
{color} | {color:green} trunk passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 50s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 
10s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 57s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
55s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 
53s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 36s 
{color} | {color:green} trunk passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 12s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 14s 
{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
37s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 5m 38s 
{color} | {color:green} the patch passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 5m 38s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 32s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 6m 32s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 
8s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 52s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
54s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 0s 
{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 
41s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 40s 
{color} | {color:green} the patch passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 11s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 24s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_77. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 8s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_77. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 76m 7s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_77. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 50s 
{color} | {color:green} hadoop-sls in the patch passed with JDK v1.8.0_77. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 22s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_

[jira] [Commented] (YARN-3816) [Aggregation] App-level aggregation and accumulation for YARN system metrics

2016-04-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252458#comment-15252458
 ] 

Hadoop QA commented on YARN-3816:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 18s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 5 new or modified test 
files. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 2m 59s 
{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 9m 
21s {color} | {color:green} YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 4s 
{color} | {color:green} YARN-2928 passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 20s 
{color} | {color:green} YARN-2928 passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
42s {color} | {color:green} YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 59s 
{color} | {color:green} YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 
4s {color} | {color:green} YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 
21s {color} | {color:green} YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 43s 
{color} | {color:green} YARN-2928 passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 9s 
{color} | {color:green} YARN-2928 passed with JDK v1.7.0_95 {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 12s 
{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
37s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 2s 
{color} | {color:green} the patch passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 2s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 18s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 18s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
32s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 51s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
48s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 
50s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 30s 
{color} | {color:green} the patch passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 0s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 23s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_77. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 1s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_77. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 8m 42s {color} 
| {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK 
v1.8.0_77. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 4m 36s 
{color} | {color:green} hadoop-yarn-server-timelineservice in the patch passed 
with JDK v1.8.0_77. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 23s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_95. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 15s 
{col

[jira] [Created] (YARN-4983) JVM and UGI metrics disappear after RM is once transitioned to standby mode

2016-04-21 Thread Li Lu (JIRA)
Li Lu created YARN-4983:
---

 Summary: JVM and UGI metrics disappear after RM is once 
transitioned to standby mode
 Key: YARN-4983
 URL: https://issues.apache.org/jira/browse/YARN-4983
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Li Lu
Assignee: Li Lu


When transitioned to standby, the RM will shut down the existing metrics 
system and launch a new one. This causes the JVM metrics and UGI metrics 
to be missing from the new metrics system. 
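A standalone Java model of that failure mode may help (this is not the RM's metrics code; the registry, gauge names, and values below are all illustrative): gauges registered only once at process start disappear when the registry is torn down and rebuilt, so the rebuild path has to register them again.

{code}
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Standalone model of the failure mode, not Hadoop's MetricsSystem API:
// process-level gauges registered only at startup are lost when the
// registry is rebuilt, so the rebuild path must register them again.
public class MetricsReinitModel {

  static Map<String, Supplier<Long>> registry = new LinkedHashMap<>();

  static void registerProcessGauges() {
    registry.put("JvmHeapUsed",
        () -> Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory());
    registry.put("UgiLoginSuccesses", () -> 0L); // stand-in for a UGI-style counter
  }

  public static void main(String[] args) {
    registerProcessGauges();           // done once at startup today
    System.out.println("before re-init: " + registry.keySet());

    registry = new LinkedHashMap<>();  // what "shutdown and relaunch" amounts to
    System.out.println("after re-init: " + registry.keySet()); // gauges are gone

    registerProcessGauges();           // the fix direction: re-register on re-init
    System.out.println("after re-register: " + registry.keySet());
  }
}
{code}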



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4976) Missing NullPointer check in ContainerLaunchContextPBImpl causes RM to die

2016-04-21 Thread Giovanni Matteo Fumarola (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252394#comment-15252394
 ] 

Giovanni Matteo Fumarola commented on YARN-4976:


Thanks [~chris.douglas] and [~templedf] for reviewing and committing it. 
Thanks [~ellenfkh] for finding the issue.


> Missing NullPointer check in ContainerLaunchContextPBImpl causes RM to die
> --
>
> Key: YARN-4976
> URL: https://issues.apache.org/jira/browse/YARN-4976
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Giovanni Matteo Fumarola
>Assignee: Giovanni Matteo Fumarola
> Fix For: 2.9.0
>
> Attachments: YARN-4976.v0.patch, YARN-4976.v1.patch, 
> YARN-4976.v2.patch, YARN-4976.v3.patch
>
>
> The client can set a null value for any env variable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4150) Failure in TestNMClient because nodereports were not available

2016-04-21 Thread Ray Chiang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252380#comment-15252380
 ] 

Ray Chiang commented on YARN-4150:
--

I can't replicate the issue.  Is this still a problem?

> Failure in TestNMClient because nodereports were not available
> --
>
> Key: YARN-4150
> URL: https://issues.apache.org/jira/browse/YARN-4150
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-4150.001.patch
>
>
> Saw a failure in a test run
> https://builds.apache.org/job/PreCommit-YARN-Build/9010/testReport/
> java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>   at java.util.ArrayList.rangeCheck(ArrayList.java:635)
>   at java.util.ArrayList.get(ArrayList.java:411)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestNMClient.allocateContainers(TestNMClient.java:244)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClientNoCleanupOnStop(TestNMClient.java:210)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4976) Missing NullPointer check in ContainerLaunchContextPBImpl causes RM to die

2016-04-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252354#comment-15252354
 ] 

Hudson commented on YARN-4976:
--

FAILURE: Integrated in Hadoop-trunk-Commit #9645 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/9645/])
YARN-4976. Missing NullPointer check in ContainerLaunchContextPBImpl. 
(cdouglas: rev 95a50466075c28110fa7c297e9c5246892076ca8)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerLaunchContextPBImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/impl/pb/TestApplicationClientProtocolRecords.java


> Missing NullPointer check in ContainerLaunchContextPBImpl causes RM to die
> --
>
> Key: YARN-4976
> URL: https://issues.apache.org/jira/browse/YARN-4976
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Giovanni Matteo Fumarola
>Assignee: Giovanni Matteo Fumarola
> Fix For: 2.9.0
>
> Attachments: YARN-4976.v0.patch, YARN-4976.v1.patch, 
> YARN-4976.v2.patch, YARN-4976.v3.patch
>
>
> The client can set a null value for any env variable.
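
For context, a hedged sketch of the kind of guard such a fix adds; the method name and message are illustrative, not the committed ContainerLaunchContextPBImpl code. Protobuf string fields reject null, so a null environment value has to be rejected early with a clear error instead of surfacing as an NPE inside the RM.

{code}
import java.util.HashMap;
import java.util.Map;

// Illustrative guard (assumed shape, not the committed patch): reject null
// keys/values before they reach the protobuf builder, which cannot hold null.
public class EnvNullCheckSketch {

  static void validateEnv(Map<String, String> environment) {
    for (Map.Entry<String, String> e : environment.entrySet()) {
      if (e.getKey() == null || e.getValue() == null) {
        throw new IllegalArgumentException(
            "Null key or value for environment entry: " + e.getKey());
      }
    }
  }

  public static void main(String[] args) {
    Map<String, String> env = new HashMap<>();
    env.put("JAVA_HOME", null);   // what a misbehaving client can submit
    try {
      validateEnv(env);
    } catch (IllegalArgumentException ex) {
      System.out.println("rejected early: " + ex.getMessage());
    }
  }
}
{code}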



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4846) Random failures for TestCapacitySchedulerPreemption#testPreemptionPolicyShouldRespectAlreadyMarkedKillableContainers

2016-04-21 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4846:
---
Attachment: 0003-YARN-4846.patch

> Random failures for 
> TestCapacitySchedulerPreemption#testPreemptionPolicyShouldRespectAlreadyMarkedKillableContainers
> 
>
> Key: YARN-4846
> URL: https://issues.apache.org/jira/browse/YARN-4846
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4846.patch, 0002-YARN-4846.patch, 
> 0003-YARN-4846.patch, YARN-4846-update-PCPP.patch
>
>
> {noformat}
> java.lang.AssertionError: expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerPreemption.testPreemptionPolicyShouldRespectAlreadyMarkedKillableContainers(TestCapacitySchedulerPreemption.java:473)
> {noformat}
> https://builds.apache.org/job/PreCommit-YARN-Build/10826/testReport/org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity/TestCapacitySchedulerPreemption/testPreemptionPolicyShouldRespectAlreadyMarkedKillableContainers/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4846) Random failures for TestCapacitySchedulerPreemption#testPreemptionPolicyShouldRespectAlreadyMarkedKillableContainers

2016-04-21 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4846:
---
Attachment: (was: 0003-YARN-4846.patch)

> Random failures for 
> TestCapacitySchedulerPreemption#testPreemptionPolicyShouldRespectAlreadyMarkedKillableContainers
> 
>
> Key: YARN-4846
> URL: https://issues.apache.org/jira/browse/YARN-4846
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4846.patch, 0002-YARN-4846.patch, 
> 0003-YARN-4846.patch, YARN-4846-update-PCPP.patch
>
>
> {noformat}
> java.lang.AssertionError: expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerPreemption.testPreemptionPolicyShouldRespectAlreadyMarkedKillableContainers(TestCapacitySchedulerPreemption.java:473)
> {noformat}
> https://builds.apache.org/job/PreCommit-YARN-Build/10826/testReport/org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity/TestCapacitySchedulerPreemption/testPreemptionPolicyShouldRespectAlreadyMarkedKillableContainers/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4957) Add getNewReservation in ApplicationClientProtocol

2016-04-21 Thread Sean Po (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252304#comment-15252304
 ] 

Sean Po commented on YARN-4957:
---

All other test failures are accounted for above.

> Add getNewReservation in ApplicationClientProtocol
> --
>
> Key: YARN-4957
> URL: https://issues.apache.org/jira/browse/YARN-4957
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: applications, client, resourcemanager
>Affects Versions: 2.8.0
>Reporter: Subru Krishnan
>Assignee: Sean Po
>  Labels: api-breaking
> Attachments: YARN-4957.v0.patch, YARN-4957.v1.patch, 
> YARN-4957.v2.patch, YARN-4957.v3.patch, YARN-4957.v4.patch
>
>
> Currently submitReservation returns a ReservationId if successful. This JIRA 
> proposes adding a getNewReservation in ApplicationClientProtocol for the 
> following reasons:
>   * Prevent zombie reservations in the face of client and/or network failures 
> post submitReservation
>   * Align reservation submission with application submission
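
To make the two bullets above concrete, here is a hedged sketch of the intended client flow; the interface and names are placeholders rather than the patch's actual ApplicationClientProtocol additions. Obtaining the id first makes submitReservation safe to retry with the same id, which is what prevents zombie reservations.

{code}
// Placeholder interface and names, not the real ApplicationClientProtocol.
public class ReservationFlowSketch {

  interface ReservationClient {
    String getNewReservation();                      // step 1: RM hands out an id
    boolean submitReservation(String reservationId); // step 2: idempotent per id
  }

  static String reserve(ReservationClient client) {
    String id = client.getNewReservation();
    // If this call times out, the client can retry with the same id instead
    // of leaving a zombie reservation behind.
    client.submitReservation(id);
    return id;
  }

  public static void main(String[] args) {
    ReservationClient stub = new ReservationClient() {
      @Override public String getNewReservation() { return "reservation_001"; }
      @Override public boolean submitReservation(String id) { return true; }
    };
    System.out.println("submitted " + reserve(stub));
  }
}
{code}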



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4957) Add getNewReservation in ApplicationClientProtocol

2016-04-21 Thread Sean Po (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252303#comment-15252303
 ] 

Sean Po commented on YARN-4957:
---

In YARN-4957.v4.patch I fixed the checkstyle and Javadoc issues. 

+*Test Failures:*+
Test: [hadoop.mapred.TestMRCJCFileOutputCommitter]
Existing JIRA: 
https://issues.apache.org/jira/browse/MAPREDUCE-6682?jql=text%20~%20%22TestMRCJCFileOutputCommitter%22
This test passes locally.

+*Test Failures:*+
Test: [hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore]
There are no existing JIRAs for TestZKRMStateStore. This test passes locally.


> Add getNewReservation in ApplicationClientProtocol
> --
>
> Key: YARN-4957
> URL: https://issues.apache.org/jira/browse/YARN-4957
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: applications, client, resourcemanager
>Affects Versions: 2.8.0
>Reporter: Subru Krishnan
>Assignee: Sean Po
>  Labels: api-breaking
> Attachments: YARN-4957.v0.patch, YARN-4957.v1.patch, 
> YARN-4957.v2.patch, YARN-4957.v3.patch, YARN-4957.v4.patch
>
>
> Currently submitReservation returns a ReservationId if successful. This JIRA 
> proposes adding a getNewReservation in ApplicationClientProtocol for the 
> following reasons:
>   * Prevent zombie reservations in the face of client and/or network failures 
> post submitReservation
>   * Align reservation submission with application submission



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3816) [Aggregation] App-level aggregation and accumulation for YARN system metrics

2016-04-21 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252263#comment-15252263
 ] 

Li Lu commented on YARN-3816:
-

Thanks [~sjlee0]! Took a look at it. The test failure happened when we read 
something out from the entity table. The write related to this failure was not 
performed through timeline collectors IIUC. 

I'm kicking another Jenkins run. 

> [Aggregation] App-level aggregation and accumulation for YARN system metrics
> 
>
> Key: YARN-3816
> URL: https://issues.apache.org/jira/browse/YARN-3816
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Junping Du
>Assignee: Li Lu
>  Labels: yarn-2928-1st-milestone
> Attachments: Application Level Aggregation of Timeline Data.pdf, 
> YARN-3816-YARN-2928-v1.patch, YARN-3816-YARN-2928-v2.1.patch, 
> YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch, 
> YARN-3816-YARN-2928-v2.patch, YARN-3816-YARN-2928-v3.1.patch, 
> YARN-3816-YARN-2928-v3.patch, YARN-3816-YARN-2928-v4.patch, 
> YARN-3816-YARN-2928-v5.patch, YARN-3816-YARN-2928-v6.patch, 
> YARN-3816-YARN-2928-v7.patch, YARN-3816-YARN-2928-v8.patch, 
> YARN-3816-YARN-2928-v9.patch, YARN-3816-feature-YARN-2928.v4.1.patch, 
> YARN-3816-poc-v1.patch, YARN-3816-poc-v2.patch
>
>
> We need application level aggregation of Timeline data:
> - To present end users with aggregated states for each application, including: 
> resource (CPU, Memory) consumption across all containers, number of 
> containers launched/completed/failed, etc. We need this for apps while they 
> are running as well as when they are done.
> - Also, framework-specific metrics, e.g. HDFS_BYTES_READ, should be 
> aggregated to show details of states at the framework level.
> - Other-level (Flow/User/Queue) aggregation can be more efficient when based 
> on Application-level aggregations rather than raw entity-level data, since 
> far fewer rows need to be scanned (after filtering out non-aggregated 
> entities such as events, configurations, etc.).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3816) [Aggregation] App-level aggregation and accumulation for YARN system metrics

2016-04-21 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252202#comment-15252202
 ] 

Sangjin Lee commented on YARN-3816:
---

This is somewhat unrelated to the TestHBaseTimelineStorage failure observed 
above, but did you have a chance to go over our unit tests to see if this patch 
may change the behavior? I'm thinking about unit tests that write the YARN 
container entities. Now it may or may not (depending on the timing) write the 
application row too. I just want to make sure it does not introduce any 
timing-dependent unit test failures. Have you made a pass on the unit tests to 
see if we have such a situation?

> [Aggregation] App-level aggregation and accumulation for YARN system metrics
> 
>
> Key: YARN-3816
> URL: https://issues.apache.org/jira/browse/YARN-3816
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Junping Du
>Assignee: Li Lu
>  Labels: yarn-2928-1st-milestone
> Attachments: Application Level Aggregation of Timeline Data.pdf, 
> YARN-3816-YARN-2928-v1.patch, YARN-3816-YARN-2928-v2.1.patch, 
> YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch, 
> YARN-3816-YARN-2928-v2.patch, YARN-3816-YARN-2928-v3.1.patch, 
> YARN-3816-YARN-2928-v3.patch, YARN-3816-YARN-2928-v4.patch, 
> YARN-3816-YARN-2928-v5.patch, YARN-3816-YARN-2928-v6.patch, 
> YARN-3816-YARN-2928-v7.patch, YARN-3816-YARN-2928-v8.patch, 
> YARN-3816-YARN-2928-v9.patch, YARN-3816-feature-YARN-2928.v4.1.patch, 
> YARN-3816-poc-v1.patch, YARN-3816-poc-v2.patch
>
>
> We need application level aggregation of Timeline data:
> - To present end users with aggregated states for each application, including: 
> resource (CPU, Memory) consumption across all containers, number of 
> containers launched/completed/failed, etc. We need this for apps while they 
> are running as well as when they are done.
> - Also, framework-specific metrics, e.g. HDFS_BYTES_READ, should be 
> aggregated to show details of states at the framework level.
> - Other-level (Flow/User/Queue) aggregation can be more efficient when based 
> on Application-level aggregations rather than raw entity-level data, since 
> far fewer rows need to be scanned (after filtering out non-aggregated 
> entities such as events, configurations, etc.).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4697) NM aggregation thread pool is not bound by limits

2016-04-21 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252164#comment-15252164
 ] 

Junping Du commented on YARN-4697:
--

bq. My concern is that if don't fix the root-cause, though we've protected 
ourselves from crashes, we'd just be queueing a lot of aggregation processes 
and causing long waiting times.
Agree. We do see the NM log aggregation service launch many active threads that 
keep a large number of TCP connections to the DNs, using up the system's file 
limit. We can cap the shared thread pool size here, but the TCP connection 
problem may not be solved by this patch.

bq. Upon NM restart, NM will try to recover all applications and submit a log 
aggregation task to the thread pool for each application recovered. Therefore, 
a large number of recovered applications plus concurrent applications can cause 
the thread pool to increase without a bound.
Are all these applications active ones, or already finished? I suspect we are 
leaking finished applications in the NM state store during the recovery process. 
I noticed this issue when filing YARN-4325 but lost my progress as the previous 
long-running cluster is gone. [~haibochen], could you check if your case is the 
same here?

In general, I think the fix in this JIRA is OK, but I agree with Vinod that we 
should dig more into the root cause, or there could be other holes (like the TCP 
connection leak mentioned above).

> NM aggregation thread pool is not bound by limits
> -
>
> Key: YARN-4697
> URL: https://issues.apache.org/jira/browse/YARN-4697
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Critical
> Fix For: 2.9.0
>
> Attachments: yarn4697.001.patch, yarn4697.002.patch, 
> yarn4697.003.patch, yarn4697.004.patch
>
>
> In LogAggregationService.java we create a thread pool to upload logs from 
> the nodemanager to HDFS if log aggregation is turned on. This is a cached 
> thread pool which, based on the javadoc, is an unlimited pool of threads.
> If we have had a problem with log aggregation, this could cause a problem 
> on restart: the number of threads created at that point could be huge and 
> will put a large load on the NameNode, and in the worst case could even 
> bring it down due to file descriptor issues.
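
For illustration, a hedged sketch of the general fix shape under discussion: replace the unbounded cached pool with one whose thread count is capped. The limit below is an assumed example value, not the patch's configuration key or default.

{code}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch only: a bounded pool in place of Executors.newCachedThreadPool().
public class BoundedAggregationPool {

  static final int MAX_AGGREGATION_THREADS = 100; // assumed example limit

  static ThreadPoolExecutor createPool() {
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        MAX_AGGREGATION_THREADS, MAX_AGGREGATION_THREADS,
        60L, TimeUnit.SECONDS,
        new LinkedBlockingQueue<Runnable>()); // extra apps wait here instead of spawning threads
    pool.allowCoreThreadTimeOut(true);        // idle threads exit between aggregations
    return pool;
  }

  public static void main(String[] args) {
    ThreadPoolExecutor pool = createPool();
    for (int i = 0; i < 1000; i++) {          // e.g. apps recovered after an NM restart
      pool.execute(() -> { /* upload logs for one app (no-op in this sketch) */ });
    }
    System.out.println("threads in use: " + pool.getPoolSize()); // never above the cap
    pool.shutdown();
  }
}
{code}

Capping the pool keeps the thread count bounded, but as noted in the comment above it does not by itself address the TCP connections held open to the DNs.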



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4976) Missing NullPointer check in ContainerLaunchContextPBImpl causes RM to die

2016-04-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252142#comment-15252142
 ] 

Hadoop QA commented on YARN-4976:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 15s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 
44s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 22s 
{color} | {color:green} trunk passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
20s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 30s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
13s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 8s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s 
{color} | {color:green} trunk passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 33s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
26s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 20s 
{color} | {color:green} the patch passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 20s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 24s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
17s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 28s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
10s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
16s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s 
{color} | {color:green} the patch passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 30s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 55s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_77. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 11s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
18s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 20m 36s {color} 
| {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:fbe3e86 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/1270/YARN-4976.v3.patch |
| JIRA Issue | YARN-4976 |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  findbugs  checkstyle  |
| uname | Linux ad198258ccb0 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed 
Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh 
|
| git revision | trunk / 7da5847 |
| Default Java | 1.7.0_95 |
| Multi-JDK versions |  /usr/lib/jvm/java-

[jira] [Commented] (YARN-4935) TestYarnClient#testSubmitIncorrectQueue fails with FairScheduler

2016-04-21 Thread Yufei Gu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252141#comment-15252141
 ] 

Yufei Gu commented on YARN-4935:


Thanks for the review and commit, [~kasha]!

> TestYarnClient#testSubmitIncorrectQueue fails with FairScheduler
> 
>
> Key: YARN-4935
> URL: https://issues.apache.org/jira/browse/YARN-4935
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.8.0
>Reporter: Yufei Gu
>Assignee: Yufei Gu
> Fix For: 2.9.0
>
> Attachments: YARN-4935.001.patch, YARN-4935.002.patch
>
>
> This test case introduced by YARN-3131 works well on CapacityScheduler but 
> not on FairScheduler, since CS doesn't allow dynamically creating a queue, 
> while FS supports it. So if you give a random queue name, CS will reject it, 
> but FS will create a new queue for it by default. 
> One simple solution is to specify CS in this test case. /cc [~lichangleo]. I 
> was thinking about creating another test case for FS, but for the code 
> introduced by YARN-3131 it may not be necessary.
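
A hedged sketch of that "specify CS" approach: the configuration key and scheduler class are real, but the helper is illustrative and not the test's actual code (it assumes hadoop-common on the classpath).

{code}
import org.apache.hadoop.conf.Configuration;

// Sketch only: pin the scheduler so FairScheduler's queue auto-creation
// never applies to this test.
public class PinCapacitySchedulerSketch {

  static Configuration testConf() {
    Configuration conf = new Configuration();
    conf.set("yarn.resourcemanager.scheduler.class",
        "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler");
    return conf;
  }

  public static void main(String[] args) {
    System.out.println(testConf().get("yarn.resourcemanager.scheduler.class"));
  }
}
{code}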



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4784) Fairscheduler: defaultQueueSchedulingPolicy should not accept FIFO

2016-04-21 Thread Yufei Gu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252136#comment-15252136
 ] 

Yufei Gu commented on YARN-4784:


[~kasha], Thanks a lot for the review and commit!

> Fairscheduler: defaultQueueSchedulingPolicy should not accept FIFO
> --
>
> Key: YARN-4784
> URL: https://issues.apache.org/jira/browse/YARN-4784
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.7.0
>Reporter: Yufei Gu
>Assignee: Yufei Gu
> Fix For: 2.9.0
>
> Attachments: YARN-4784.001.patch, YARN-4784.002.patch
>
>
> The configuration item defaultQueueSchedulingPolicy should not accept fifo as a 
> value since it is an invalid value for non-leaf queues.
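
A hedged sketch of the validation being proposed; the method and message are illustrative, not the committed FairScheduler configuration-loading code.

{code}
// Sketch only: reject fifo for the default queue scheduling policy with a
// clear message instead of accepting a value that non-leaf queues cannot use.
public class DefaultPolicyCheckSketch {

  static void checkDefaultQueuePolicy(String policyName) {
    if ("fifo".equalsIgnoreCase(policyName)) {
      throw new IllegalArgumentException(
          "defaultQueueSchedulingPolicy cannot be 'fifo': it is invalid for non-leaf queues");
    }
  }

  public static void main(String[] args) {
    checkDefaultQueuePolicy("fair");   // accepted
    try {
      checkDefaultQueuePolicy("fifo"); // rejected
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
{code}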



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4954) TestYarnClient.testReservationAPIs fails on machines with less than 4 GB available memory

2016-04-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252115#comment-15252115
 ] 

Hadoop QA commented on YARN-4954:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 18s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 
25s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 20s 
{color} | {color:green} trunk passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 20s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
14s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 24s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
12s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 
35s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 16s 
{color} | {color:green} trunk passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 17s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
19s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 14s 
{color} | {color:green} the patch passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 14s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 17s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 17s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
11s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 22s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
11s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 
45s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 13s 
{color} | {color:green} the patch passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 14s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 66m 11s {color} 
| {color:red} hadoop-yarn-client in the patch failed with JDK v1.8.0_77. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 66m 26s {color} 
| {color:red} hadoop-yarn-client in the patch failed with JDK v1.7.0_95. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
18s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 147m 7s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_77 Failed junit tests | hadoop.yarn.client.api.impl.TestAMRMProxy |
|   | hadoop.yarn.client.TestGetGroups |
| JDK v1.8.0_77 Timed out junit tests | 
org.apache.hadoop.yarn.client.cli.TestYarnCLI |
|   | org.apache.hadoop.yarn.client.api.impl.TestYarnClient |
|   | org.apache.hadoop.yarn.client.api.impl.TestAMRMClient |
|   | org.apache.hadoop.yarn.client.api.impl.TestNMClient |
| JDK v1.7.0_95 Failed junit tests | hadoop.yarn.client.api.impl.TestAMRMProxy |
|   | hadoop.yarn.client.TestGetGroups |
| JDK v1.7.0_95 Timed out junit tests | 
org.apache.hadoop.yarn.client.cli.TestYarnCLI |
|   | org.apache.hadoop.yarn.client.api.i

[jira] [Updated] (YARN-4976) Missing NullPointer check in ContainerLaunchContextPBImpl causes RM to die

2016-04-21 Thread Giovanni Matteo Fumarola (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giovanni Matteo Fumarola updated YARN-4976:
---
Attachment: YARN-4976.v3.patch

> Missing NullPointer check in ContainerLaunchContextPBImpl causes RM to die
> --
>
> Key: YARN-4976
> URL: https://issues.apache.org/jira/browse/YARN-4976
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Giovanni Matteo Fumarola
>Assignee: Giovanni Matteo Fumarola
> Attachments: YARN-4976.v0.patch, YARN-4976.v1.patch, 
> YARN-4976.v2.patch, YARN-4976.v3.patch
>
>
> The client can set a null value for any env variable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4976) Missing NullPointer check in ContainerLaunchContextPBImpl causes RM to die

2016-04-21 Thread Giovanni Matteo Fumarola (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252091#comment-15252091
 ] 

Giovanni Matteo Fumarola commented on YARN-4976:


Uploading a new patch without a whitespace in the comment.

> Missing NullPointer check in ContainerLaunchContextPBImpl causes RM to die
> --
>
> Key: YARN-4976
> URL: https://issues.apache.org/jira/browse/YARN-4976
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Giovanni Matteo Fumarola
>Assignee: Giovanni Matteo Fumarola
> Attachments: YARN-4976.v0.patch, YARN-4976.v1.patch, 
> YARN-4976.v2.patch
>
>
> The client can set a null value for any env variable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1458) FairScheduler: Zero weight can lead to livelock

2016-04-21 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252087#comment-15252087
 ] 

zhihai xu commented on YARN-1458:
-

Ok, no problem, you can try it at your convenience. Thanks for finding this 
issue!

> FairScheduler: Zero weight can lead to livelock
> ---
>
> Key: YARN-1458
> URL: https://issues.apache.org/jira/browse/YARN-1458
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.2.0
> Environment: Centos 2.6.18-238.19.1.el5 X86_64
> hadoop2.2.0
>Reporter: qingwu.fu
>Assignee: zhihai xu
> Fix For: 2.6.0
>
> Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
> YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, 
> YARN-1458.addendum.patch, YARN-1458.alternative0.patch, 
> YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, 
> yarn-1458-5.patch, yarn-1458-7.patch, yarn-1458-8.patch
>
>   Original Estimate: 408h
>  Remaining Estimate: 408h
>
> The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when 
> clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster 
> for days to reproduce it. The output of the jstack command on the resourcemanager pid:
> {code}
>  "ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
> waiting for monitor entry [0x43aa9000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
> - waiting to lock <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
> at java.lang.Thread.run(Thread.java:744)
> ……
> "FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
> runnable [0x433a2000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
> - locked <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
> - locked <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
> at java.lang.Thread.run(Thread.java:744)
> {code}
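
For readers skimming the jstack above, a standalone model (not the FairScheduler code) of why a zero weight can keep the update thread spinning inside the share computation while it holds the scheduler lock, together with the general guard shape; all numbers and names are illustrative.

{code}
import java.util.Arrays;
import java.util.List;

// Standalone model: the share computation doubles a trial ratio until the
// resource implied by the weights reaches the cluster total. With all-zero
// weights the implied resource stays at zero, so the loop would never stop
// and the thread would keep holding the scheduler lock. The guard bails out
// when the total weight is zero.
public class ZeroWeightShareModel {

  static long resourceUsedWithRatio(double ratio, List<Double> weights) {
    long sum = 0;
    for (double w : weights) {
      sum += (long) (w * ratio);   // each schedulable's share at this trial ratio
    }
    return sum;
  }

  static double findRatio(List<Double> weights, long totalResource) {
    double totalWeight = weights.stream().mapToDouble(Double::doubleValue).sum();
    if (totalWeight <= 0) {
      return 0;                    // guard: nothing to divide, skip the search
    }
    double ratio = 1.0;
    while (resourceUsedWithRatio(ratio, weights) < totalResource) {
      ratio *= 2;                  // with all-zero weights this never terminates
    }
    return ratio;
  }

  public static void main(String[] args) {
    System.out.println(findRatio(Arrays.asList(1.0, 2.0), 1000)); // terminates
    System.out.println(findRatio(Arrays.asList(0.0, 0.0), 1000)); // guard returns 0
  }
}
{code}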



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list

2016-04-21 Thread Kuhu Shukla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuhu Shukla updated YARN-4311:
--
Attachment: YARN-4311-v16.patch

> Removing nodes from include and exclude lists will not remove them from 
> decommissioned nodes list
> -
>
> Key: YARN-4311
> URL: https://issues.apache.org/jira/browse/YARN-4311
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.1
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
> Attachments: YARN-4311-branch-2.7.001.patch, 
> YARN-4311-branch-2.7.002.patch, YARN-4311-branch-2.7.003.patch, 
> YARN-4311-branch-2.7.004.patch, YARN-4311-v1.patch, YARN-4311-v10.patch, 
> YARN-4311-v11.patch, YARN-4311-v11.patch, YARN-4311-v12.patch, 
> YARN-4311-v13.patch, YARN-4311-v13.patch, YARN-4311-v14.patch, 
> YARN-4311-v15.patch, YARN-4311-v16.patch, YARN-4311-v2.patch, 
> YARN-4311-v3.patch, YARN-4311-v4.patch, YARN-4311-v5.patch, 
> YARN-4311-v6.patch, YARN-4311-v7.patch, YARN-4311-v8.patch, YARN-4311-v9.patch
>
>
> In order to fully forget about a node, removing the node from include and 
> exclude list is not sufficient. The RM lists it under Decomm-ed nodes. The 
> tricky part that [~jlowe] pointed out was the case when include lists are not 
> used; in that case we don't want the nodes to fall off if they are not active.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list

2016-04-21 Thread Kuhu Shukla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuhu Shukla updated YARN-4311:
--
Attachment: (was: YARN-4311-v16.patch)

> Removing nodes from include and exclude lists will not remove them from 
> decommissioned nodes list
> -
>
> Key: YARN-4311
> URL: https://issues.apache.org/jira/browse/YARN-4311
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.1
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
> Attachments: YARN-4311-branch-2.7.001.patch, 
> YARN-4311-branch-2.7.002.patch, YARN-4311-branch-2.7.003.patch, 
> YARN-4311-branch-2.7.004.patch, YARN-4311-v1.patch, YARN-4311-v10.patch, 
> YARN-4311-v11.patch, YARN-4311-v11.patch, YARN-4311-v12.patch, 
> YARN-4311-v13.patch, YARN-4311-v13.patch, YARN-4311-v14.patch, 
> YARN-4311-v15.patch, YARN-4311-v16.patch, YARN-4311-v2.patch, 
> YARN-4311-v3.patch, YARN-4311-v4.patch, YARN-4311-v5.patch, 
> YARN-4311-v6.patch, YARN-4311-v7.patch, YARN-4311-v8.patch, YARN-4311-v9.patch
>
>
> In order to fully forget about a node, removing the node from include and 
> exclude list is not sufficient. The RM lists it under Decomm-ed nodes. The 
> tricky part that [~jlowe] pointed out was the case when include lists are not 
> used; in that case we don't want the nodes to fall off if they are not active.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list

2016-04-21 Thread Kuhu Shukla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuhu Shukla updated YARN-4311:
--
Attachment: YARN-4311-v16.patch

Fixing checkstyle and findbugs issues. The failing tests pass locally and are 
unrelated.

> Removing nodes from include and exclude lists will not remove them from 
> decommissioned nodes list
> -
>
> Key: YARN-4311
> URL: https://issues.apache.org/jira/browse/YARN-4311
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.1
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
> Attachments: YARN-4311-branch-2.7.001.patch, 
> YARN-4311-branch-2.7.002.patch, YARN-4311-branch-2.7.003.patch, 
> YARN-4311-branch-2.7.004.patch, YARN-4311-v1.patch, YARN-4311-v10.patch, 
> YARN-4311-v11.patch, YARN-4311-v11.patch, YARN-4311-v12.patch, 
> YARN-4311-v13.patch, YARN-4311-v13.patch, YARN-4311-v14.patch, 
> YARN-4311-v15.patch, YARN-4311-v16.patch, YARN-4311-v2.patch, 
> YARN-4311-v3.patch, YARN-4311-v4.patch, YARN-4311-v5.patch, 
> YARN-4311-v6.patch, YARN-4311-v7.patch, YARN-4311-v8.patch, YARN-4311-v9.patch
>
>
> In order to fully forget about a node, removing the node from include and 
> exclude list is not sufficient. The RM lists it under Decomm-ed nodes. The 
> tricky part that [~jlowe] pointed out was the case when include lists are not 
> used; in that case we don't want the nodes to fall off if they are not active.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4890) Unit test intermittent failure: TestNodeLabelContainerAllocation#testQueueUsedCapacitiesUpdate

2016-04-21 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15251951#comment-15251951
 ] 

Sunil G commented on YARN-4890:
---

Thanks [~leftnoteasy] for the review and commit, and thanks [~bibinchundatt] 
for the review.

> Unit test intermittent failure: 
> TestNodeLabelContainerAllocation#testQueueUsedCapacitiesUpdate
> --
>
> Key: YARN-4890
> URL: https://issues.apache.org/jira/browse/YARN-4890
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Sunil G
> Fix For: 2.9.0
>
> Attachments: 0001-YARN-4890.patch
>
>
> Message:
> {code}
> Tests run: 16, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 314.062 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation
> testQueueUsedCapacitiesUpdate(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation)
>   Time elapsed: 12.426 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<0.3> but was:<0.6>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:519)
>   at org.junit.Assert.assertEquals(Assert.java:609)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation.checkQueueUsedCapacity(TestNodeLabelContainerAllocation.java:1163)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation.testQueueUsedCapacitiesUpdate(TestNodeLabelContainerAllocation.java:1382)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4958) The file localization process should allow for wildcards to reduce the application footprint in the state store

2016-04-21 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15251934#comment-15251934
 ] 

Daniel Templeton commented on YARN-4958:


On closer examination, HADOOP-12747 may be solving a slightly different 
problem.  I believe it's solving the problem of reducing the end user pain in 
specifying a large number of JARs in -libjar.  This JIRA is solving the problem 
of reducing the state store impact of specifying a large number of JARs in 
-libjar.  Based on looking at the implementations, the two JIRAs are related 
but orthogonal.

> The file localization process should allow for wildcards to reduce the 
> application footprint in the state store
> ---
>
> Key: YARN-4958
> URL: https://issues.apache.org/jira/browse/YARN-4958
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.8.0
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>Priority: Critical
> Attachments: YARN-4958.001.patch
>
>
> When using the -libjars option to add classes to the classpath, every library 
> so added is explicitly listed in the {{ContainerLaunchContext}}'s local 
> resources even though they're all uploaded to the same directory in HDFS.  
> When using tools like Crunch without an uber JAR or when trying to take 
> advantage of the shared cache, the number of libraries can be quite large.  
> We've seen many cases where we had to turn down the max number of 
> applications to prevent ZK from running out of heap because of the size of 
> the state store entries.
> Rather than listing all files independently, this JIRA proposes to have the 
> NM allow wildcards in the resource localization paths.  Specifically, we 
> propose to allow a path to have a final component (name) set to "*", which is 
> interpreted by the NM as "download the full directory and link to every file 
> in it from the job's working directory."  This behavior is the same as the 
> current behavior when using -libjars, but avoids explicitly listing every 
> file.
> This JIRA does not attempt to provide more general purpose wildcards, such as 
> "\*.jar" or "file\*", as having multiple entries for a single directory 
> presents numerous logistical issues.
> This JIRA also does not attempt to integrate with the shared cache.  That 
> work will be left to a future JIRA.  Specifically, this JIRA only applies 
> when a full directory is uploaded.  Currently the shared cache does not 
> handle directory uploads.
> This JIRA proposes to allow for wildcards both in the internal processing of 
> the -libjars switch and in paths added through the {{Job}} and 
> {{DistributedCache}} classes.
> The proposed approach is to treat a path, "dir/\*", as "dir" for purposes of 
> all file verification and localization.  In the final step, the NM will query 
> the localized directory to get a list of the files in "dir" such that each 
> can be linked from the job's working directory.  Since $PWD/\* is always 
> included on the classpath, all JAR files in "dir" will be in the classpath.
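
A hedged sketch of that final step (not the NM's localization code; the paths and helper name are placeholders): once "dir/*" has been localized as the directory "dir", list it and link every file from the container's working directory so that $PWD/* picks the JARs up on the classpath.

{code}
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch only: expand a trailing "*" by linking every file in the localized
// directory from the container's working directory.
public class WildcardLinkSketch {

  static void linkAll(Path localizedDir, Path containerWorkDir) throws IOException {
    try (DirectoryStream<Path> entries = Files.newDirectoryStream(localizedDir)) {
      for (Path entry : entries) {
        Path link = containerWorkDir.resolve(entry.getFileName());
        Files.createSymbolicLink(link, entry); // one link per file, no per-file state store entry
      }
    }
  }

  public static void main(String[] args) throws IOException {
    // Placeholder paths; both directories must exist before running this sketch.
    linkAll(Paths.get("/tmp/localized/libjars"), Paths.get("/tmp/container_work"));
  }
}
{code}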



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4958) The file localization process should allow for wildcards to reduce the application footprint in the state store

2016-04-21 Thread Daniel Templeton (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Templeton updated YARN-4958:
---
Description: 
When using the -libjars option to add classes to the classpath, every library 
so added is explicitly listed in the {{ContainerLaunchContext}}'s local 
resources even though they're all uploaded to the same directory in HDFS.  When 
using tools like Crunch without an uber JAR or when trying to take advantage of 
the shared cache, the number of libraries can be quite large.  We've seen many 
cases where we had to turn down the max number of applications to prevent ZK 
from running out of heap because of the size of the state store entries.

Rather than listing all files independently, this JIRA proposes to have the NM 
allow wildcards in the resource localization paths.  Specifically, we propose 
to allow a path to have a final component (name) set to "*", which is 
interpreted by the NM as "download the full directory and link to every file in 
it from the job's working directory."  This behavior is the same as the current 
behavior when using -libjars, but avoids explicitly listing every file.

This JIRA does not attempt to provide more general purpose wildcards, such as 
"\*.jar" or "file\*", as having multiple entries for a single directory 
presents numerous logistical issues.

This JIRA also does not attempt to integrate with the shared cache.  That work 
will be left to a future JIRA.  Specifically, this JIRA only applies when a 
full directory is uploaded.  Currently the shared cache does not handle 
directory uploads.

This JIRA proposes to allow for wildcards both in the internal processing of 
the -libjars switch and in paths added through the {{Job}} and 
{{DistributedCache}} classes.

The proposed approach is to treat a path, "dir/\*", as "dir" for purposes of 
all file verification and localization.  In the final step, the NM will query 
the localized directory to get a list of the files in "dir" such that each can 
be linked from the job's working directory.  Since $PWD/\* is always included 
on the classpath, all JAR files in "dir" will be in the classpath.

  was:
When using the -libjars option to add classes to the classpath, every library 
so added is explicitly listed in the {{ContainerLaunchContext}}'s local 
resources even though they're all uploaded to the same directory in HDFS.  When 
using tools like Crunch without an uber JAR or when trying to take advantage of 
the shared cache, the number of libraries can be quite large.  We've seen many 
cases where we had to turn down the max number of applications to prevent ZK 
from running out of heap because of the size of the state store entries.

Rather than listing all files independently, this JIRA proposes to have the NM 
allow wildcards in the resource localization paths.  Specifically, we propose 
to allow a path to have a final component (name) set to "*", which is 
interpreted by the NM as "download the full directory and link to every file in 
it from the job's working directory."  This behavior is the same as the current 
behavior when using -libjars, but avoids explicitly listing every file.

This JIRA does not attempt to provide more general purpose wildcards, such as 
"*.jar" or "file*", as having multiple entries for a single directory presents 
numerous logistical issues.

This JIRA also does not attempt to integrate with the shared cache.  That work 
will be left to a future JIRA.  Specifically, this JIRA only applies when a 
full directory is uploaded.  Currently the shared cache does not handle 
directory uploads.

This JIRA proposes to allow for wildcards both in the internal processing of 
the -libjars switch and in paths added through the {{Job}} and 
{{DistributedCache}} classes.

The proposed approach is to treat a path, "dir/*", as "dir" for purposes of all 
file verification and localization.  In the final step, the NM will query the 
localized directory to get a list of the files in "dir" such that each can be 
linked from the job's working directory.  Since $PWD/* is always included on 
the classpath, all JAR files in "dir" will be in the classpath.


> The file localization process should allow for wildcards to reduce the 
> application footprint in the state store
> ---
>
> Key: YARN-4958
> URL: https://issues.apache.org/jira/browse/YARN-4958
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.8.0
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>Priority: Critical
> Attachments: YARN-4958.001.patch
>
>
> When using the -libjars option to add classes to the classpath, every library 
> so added is explicitly listed in the {{ContainerL

[jira] [Commented] (YARN-4958) The file localization process should allow for wildcards to reduce the application footprint in the state store

2016-04-21 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15251925#comment-15251925
 ] 

Daniel Templeton commented on YARN-4958:


[~jira.shegalov], I just noticed your comments on HADOOP-12747.  I'd be curious 
to have your opinion on this one.

> The file localization process should allow for wildcards to reduce the 
> application footprint in the state store
> ---
>
> Key: YARN-4958
> URL: https://issues.apache.org/jira/browse/YARN-4958
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.8.0
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>Priority: Critical
> Attachments: YARN-4958.001.patch
>
>
> When using the -libjars option to add classes to the classpath, every library 
> so added is explicitly listed in the {{ContainerLaunchContext}}'s local 
> resources even though they're all uploaded to the same directory in HDFS.  
> When using tools like Crunch without an uber JAR or when trying to take 
> advantage of the shared cache, the number of libraries can be quite large.  
> We've seen many cases where we had to turn down the max number of 
> applications to prevent ZK from running out of heap because of the size of 
> the state store entries.
> Rather than listing all files independently, this JIRA proposes to have the 
> NM allow wildcards in the resource localization paths.  Specifically, we 
> propose to allow a path to have a final component (name) set to "*", which is 
> interpreted by the NM as "download the full directory and link to every file 
> in it from the job's working directory."  This behavior is the same as the 
> current behavior when using -libjars, but avoids explicitly listing every 
> file.
> This JIRA does not attempt to provide more general purpose wildcards, such as 
> "\*.jar" or "file\*", as having multiple entries for a single directory 
> presents numerous logistical issues.
> This JIRA also does not attempt to integrate with the shared cache.  That 
> work will be left to a future JIRA.  Specifically, this JIRA only applies 
> when a full directory is uploaded.  Currently the shared cache does not 
> handle directory uploads.
> This JIRA proposes to allow for wildcards both in the internal processing of 
> the -libjars switch and in paths added through the {{Job}} and 
> {{DistributedCache}} classes.
> The proposed approach is to treat a path, "dir/\*", as "dir" for purposes of 
> all file verification and localization.  In the final step, the NM will query 
> the localized directory to get a list of the files in "dir" such that each 
> can be linked from the job's working directory.  Since $PWD/\* is always 
> included on the classpath, all JAR files in "dir" will be in the classpath.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

