[jira] [Commented] (YARN-8580) yarn.resourcemanager.am.max-attempts is not respected for yarn services

2018-07-25 Thread Gour Saha (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556935#comment-16556935
 ] 

Gour Saha commented on YARN-8580:
-

Actually, this is a Yarn Service-specific property, so the value 20 is getting 
applied because that is the default for Yarn Services. The reason 100 was not 
taking effect is that for Yarn Services the property name is 
yarn.service.am-restart.max-attempts, not 
yarn.resourcemanager.am.max-attempts.

Once the right property is set, the desired behavior will be seen.
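For illustration, a minimal sketch (using the standard Hadoop Configuration API; 
the surrounding client code is illustrative only, not part of this jira) of 
reading the effective service-level limit, with 20 as its Yarn Services default:
{code:java}
import org.apache.hadoop.conf.Configuration;

public class ServiceAmRestartLimit {
  // Property name discussed above; 20 is the Yarn Services default.
  static final String AM_RESTART_MAX_ATTEMPTS = "yarn.service.am-restart.max-attempts";

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // yarn.resourcemanager.am.max-attempts only sets the RM-side global cap;
    // the per-service limit comes from the property below.
    int maxAttempts = conf.getInt(AM_RESTART_MAX_ATTEMPTS, 20);
    System.out.println("Effective Yarn Service AM restart limit: " + maxAttempts);
  }
}
{code}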

It is still an Invalid JIRA, though.

> yarn.resourcemanager.am.max-attempts is not respected for yarn services
> ---
>
> Key: YARN-8580
> URL: https://issues.apache.org/jira/browse/YARN-8580
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Affects Versions: 3.1.1
>Reporter: Yesha Vora
>Priority: Major
>
> 1) Max AM attempts is set to 100 on all nodes (including the gateway).
> {code}
> <property>
>   <name>yarn.resourcemanager.am.max-attempts</name>
>   <value>100</value>
> </property>
> {code}
> 2) Start a Yarn service (HBase tarball) application
> 3) Kill the AM 20 times
> Here, the app fails with the diagnostics below.
> {code}
> bash-4.2$ /usr/hdp/current/hadoop-yarn-client/bin/yarn application -status 
> application_1532481557746_0001
> 18/07/25 18:43:34 INFO client.AHSProxy: Connecting to Application History 
> server at xxx/xxx:10200
> 18/07/25 18:43:34 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
> to rm2
> 18/07/25 18:43:34 INFO conf.Configuration: found resource resource-types.xml 
> at file:/etc/hadoop/3.0.0.0-1634/0/resource-types.xml
> Application Report : 
>   Application-Id : application_1532481557746_0001
>   Application-Name : hbase-tarball-lr
>   Application-Type : yarn-service
>   User : hbase
>   Queue : default
>   Application Priority : 0
>   Start-Time : 1532481864863
>   Finish-Time : 1532522943103
>   Progress : 100%
>   State : FAILED
>   Final-State : FAILED
>   Tracking-URL : 
> https://xxx:8090/cluster/app/application_1532481557746_0001
>   RPC Port : -1
>   AM Host : N/A
>   Aggregate Resource Allocation : 252150112 MB-seconds, 164141 
> vcore-seconds
>   Aggregate Resource Preempted : 0 MB-seconds, 0 vcore-seconds
>   Log Aggregation Status : SUCCEEDED
>   Diagnostics : Application application_1532481557746_0001 failed 20 
> times (global limit =100; local limit is =20) due to AM Container for 
> appattempt_1532481557746_0001_20 exited with  exitCode: 137
> Failing this attempt.Diagnostics: [2018-07-25 12:49:00.784]Container killed 
> on request. Exit code is 137
> [2018-07-25 12:49:03.045]Container exited with a non-zero exit code 137. 
> [2018-07-25 12:49:03.045]Killed by external signal
> For more detailed output, check the application tracking page: 
> https://xxx:8090/cluster/app/application_1532481557746_0001 Then click on 
> links to logs of each attempt.
> . Failing the application.
>   Unmanaged Application : false
>   Application Node Label Expression : 
>   AM container Node Label Expression : 
>   TimeoutType : LIFETIME  ExpiryTime : 2018-07-25T22:26:15.419+   
> RemainingTime : 0seconds
> {code}






[jira] [Commented] (YARN-8577) Fix the broken anchor in SLS site-doc

2018-07-25 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556926#comment-16556926
 ] 

Weiwei Yang commented on YARN-8577:
---

Thanks [~bibinchundatt] for the review and commit, I have cherry-picked this to 
branch-2.9 too.

> Fix the broken anchor in SLS site-doc
> -
>
> Key: YARN-8577
> URL: https://issues.apache.org/jira/browse/YARN-8577
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 2.9.0, 3.0.0, 3.1.0
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Minor
> Fix For: 3.2.0, 2.9.2, 3.0.4, 3.1.2
>
> Attachments: HADOOP-15630.001.patch
>
>
> The anchor for section "Synthetic Load Generator" is currently broken.






[jira] [Updated] (YARN-8577) Fix the broken anchor in SLS site-doc

2018-07-25 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-8577:
--
Fix Version/s: 2.9.2

> Fix the broken anchor in SLS site-doc
> -
>
> Key: YARN-8577
> URL: https://issues.apache.org/jira/browse/YARN-8577
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 2.9.0, 3.0.0, 3.1.0
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Minor
> Fix For: 3.2.0, 2.9.2, 3.0.4, 3.1.2
>
> Attachments: HADOOP-15630.001.patch
>
>
> The anchor for section "Synthetic Load Generator" is currently broken.






[jira] [Updated] (YARN-8546) Resource leak caused by a reserved container being released more than once under async scheduling

2018-07-25 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-8546:
--
Fix Version/s: (was: 3.1.1)
   3.1.2

> Resource leak caused by a reserved container being released more than once 
> under async scheduling
> -
>
> Key: YARN-8546
> URL: https://issues.apache.org/jira/browse/YARN-8546
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Weiwei Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: global-scheduling
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-8546.001.patch
>
>
> I was able to reproduce this issue by starting a job that keeps requesting 
> containers until it uses up the cluster's available resources. My cluster 
> has 70200 vcores, and each task asks for 100 vcores, so I was expecting a 
> total of 702 containers to be allocated, but eventually there were only 701. The 
> last container could not be allocated because the queue's used resource was 
> updated to be more than 100%.






[jira] [Commented] (YARN-8546) Resource leak caused by a reserved container being released more than once under async scheduling

2018-07-25 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556921#comment-16556921
 ] 

Weiwei Yang commented on YARN-8546:
---

Thanks [~bibinchundatt], I have corrected the fix version to 3.1.2.

> Resource leak caused by a reserved container being released more than once 
> under async scheduling
> -
>
> Key: YARN-8546
> URL: https://issues.apache.org/jira/browse/YARN-8546
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Weiwei Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: global-scheduling
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-8546.001.patch
>
>
> I was able to reproduce this issue by starting a job that keeps requesting 
> containers until it uses up the cluster's available resources. My cluster 
> has 70200 vcores, and each task asks for 100 vcores, so I was expecting a 
> total of 702 containers to be allocated, but eventually there were only 701. The 
> last container could not be allocated because the queue's used resource was 
> updated to be more than 100%.






[jira] [Commented] (YARN-7833) [PERF/TEST] Extend SLS to support simulation of a Federated Environment

2018-07-25 Thread genericqa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556916#comment-16556916
 ] 

genericqa commented on YARN-7833:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
31s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 4 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
26s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 27m 
11s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 32m 
39s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
31s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  4m 
11s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 56s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  5m 
42s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  3m 
10s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
24s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  2m 
51s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 28m 
11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 28m 
11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
31s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  4m  
6s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} shellcheck {color} | {color:green}  0m 
 0s{color} | {color:green} There were no new shellcheck issues. {color} |
| {color:green}+1{color} | {color:green} shelldocs {color} | {color:green}  0m 
36s{color} | {color:green} There were no new shelldocs issues. {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  
0s{color} | {color:red} The patch has 4 line(s) that end in whitespace. Use git 
apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply 
{color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  
0s{color} | {color:red} The patch 137 line(s) with tabs. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
2s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 25s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  1m 
36s{color} | {color:red} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  1m  
4s{color} | {color:red} hadoop-tools/hadoop-sls generated 2 new + 0 unchanged - 
0 fixed = 2 total (was 0) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  3m 
10s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
55s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  2m 
22s{color} | {color:green} hadoop-yarn-server-common in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 69m 
13s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch 
passed. {color} |

[jira] [Commented] (YARN-8407) Container launch exception in AM log should be printed in ERROR level

2018-07-25 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556897#comment-16556897
 ] 

Bibin A Chundatt commented on YARN-8407:


[~yeshavora]

Few minor comments

# Please fix the formatting.
# Use a StringBuilder for creating the message (see the sketch below).
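A possible shape for that, purely as an illustration (the field names are 
hypothetical and may differ from the actual variables in YARN-8407.001.patch):
{code:java}
// Hypothetical names; shown only to illustrate the StringBuilder suggestion.
static String buildLaunchFailureMessage(String containerId, int exitStatus,
    String diagnostics) {
  return new StringBuilder("Container ").append(containerId)
      .append(" failed to launch. exitStatus=").append(exitStatus)
      .append(", diagnostics=").append(diagnostics)
      .toString();
}
{code}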

> Container launch exception in AM log should be printed in ERROR level
> -
>
> Key: YARN-8407
> URL: https://issues.apache.org/jira/browse/YARN-8407
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yesha Vora
>Assignee: Yesha Vora
>Priority: Major
> Attachments: YARN-8407.001.patch
>
>
> When a container launch fails because the docker image is not available, it is 
> logged at INFO level in the AM log. 
> Container launch failures should be logged at ERROR level.
> Steps:
> launch httpd yarn-service application with invalid docker image
>  
> {code:java}
> 2018-06-07 01:51:32,966 [Component  dispatcher] INFO  
> instance.ComponentInstance - [COMPINSTANCE httpd-0 : 
> container_e05_1528335963594_0001_01_02]: 
> container_e05_1528335963594_0001_01_02 completed. Reinsert back to 
> pending list and requested a new container.
> exitStatus=-1, diagnostics=[2018-06-07 01:51:02.363]Exception from 
> container-launch.
> Container id: container_e05_1528335963594_0001_01_02
> Exit code: 7
> Exception message: Launch container failed
> Shell error output: Unable to find image 'xxx/httpd:0.1' locally
> Trying to pull repository xxx/httpd ...
> /usr/bin/docker-current: Get https://xxx/v1/_ping: dial tcp: lookup xxx on 
> yyy: no such host.
> See '/usr/bin/docker-current run --help'.
> Shell output: main : command provided 4
> main : run as user is hbase
> main : requested yarn user is hbase
> Creating script paths...
> Creating local dirs...
> Getting exit code file...
> Changing effective user to root...
> Wrote the exit code 7 to 
> /grid/0/hadoop/yarn/local/nmPrivate/application_1528335963594_0001/container_e05_1528335963594_0001_01_02/container_e05_1528335963594_0001_01_02.pid.exitcode
> [2018-06-07 01:51:02.393]Diagnostic message from attempt :
> [2018-06-07 01:51:02.394]Container exited with a non-zero exit code 7. Last 
> 4096 bytes of stderr.txt :
> [2018-06-07 01:51:32.428]Could not find 
> nmPrivate/application_1528335963594_0001/container_e05_1528335963594_0001_01_02//container_e05_1528335963594_0001_01_02.pid
>  in any of the directories
> 2018-06-07 01:51:32,966 [Component  dispatcher] INFO  
> instance.ComponentInstance - [COMPINSTANCE httpd-0 : 
> container_e05_1528335963594_0001_01_02] Transitioned from STARTED to INIT 
> on STOP event{code}






[jira] [Commented] (YARN-8252) Fix ServiceMaster main not found

2018-07-25 Thread Jaume M (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556890#comment-16556890
 ] 

Jaume M commented on YARN-8252:
---

I'm seeing this when trying to install LLAP with hadoop master. The container 
doesn't start and the only error line is:
{{Error: Could not find or load main class 
org.apache.hadoop.yarn.service.ServiceMaster}}

> Fix ServiceMaster main not found
> 
>
> Key: YARN-8252
> URL: https://issues.apache.org/jira/browse/YARN-8252
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Zoltan Haindrich
>Priority: Major
>
> I was looking into using yarn services; however, it seems that for some reason it 
> is not possible to run the {{ServiceMaster}} class from the jar... I might be 
> missing something fundamental... so I've put together a shell script to make it 
> easy for anyone to check. I would be happy with any exception beyond main not 
> found.
> [ServiceMaster.main 
> method|https://github.com/apache/hadoop/blob/67f239c42f676237290d18ddbbc9aec369267692/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/main/java/org/apache/hadoop/yarn/service/ServiceMaster.java#L305]
> {code:java}
> #!/bin/bash
> set -e
> wget -O core.jar  -nv 
> http://central.maven.org/maven2/org/apache/hadoop/hadoop-yarn-services-core/3.1.0/hadoop-yarn-services-core-3.1.0.jar
> unzip -qn core.jar
> cat > org/apache/hadoop/yarn/service/ServiceMaster2.java << EOF
> package org.apache.hadoop.yarn.service;
> public class ServiceMaster2 {
>   public static void main(String[] args) throws Exception {
> System.out.println("asd!");
>   }
> }
> EOF
> javac org/apache/hadoop/yarn/service/ServiceMaster2.java
> jar -cf a1.jar org
> find org -name ServiceMaster*
> # this will print "asd!"
> java -cp a1.jar org.apache.hadoop.yarn.service.ServiceMaster2
> #the following invocations result in:
> # Error: Could not find or load main class 
> org.apache.hadoop.yarn.service.ServiceMaster
> #
> set +e
> java -cp a1.jar org.apache.hadoop.yarn.service.ServiceMaster
> java -cp core.jar org.apache.hadoop.yarn.service.ServiceMaster
> {code}






[jira] [Updated] (YARN-7833) [PERF/TEST] Extend SLS to support simulation of a Federated Environment

2018-07-25 Thread Tanuj Nayak (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanuj Nayak updated YARN-7833:
--
Attachment: (was: YARN-7833.v1.patch)

> [PERF/TEST] Extend SLS to support simulation of a Federated Environment
> ---
>
> Key: YARN-7833
> URL: https://issues.apache.org/jira/browse/YARN-7833
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Carlo Curino
>Assignee: Tanuj Nayak
>Priority: Major
> Attachments: YARN-7833.v1.patch
>
>
> To develop algorithms for federation, it would be of great help to have a 
> version of SLS that supports multi RMs and GPG.






[jira] [Comment Edited] (YARN-7833) [PERF/TEST] Extend SLS to support simulation of a Federated Environment

2018-07-25 Thread Tanuj Nayak (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556750#comment-16556750
 ] 

Tanuj Nayak edited comment on YARN-7833 at 7/26/18 1:21 AM:


Added initial patch of Federated SLS. It does not separate the metrics of 
individual RM's in the ClusterMetrics and QueueMetrics classes yet.


was (Author: tanujnay):
Added initial patch of Federated SLS

> [PERF/TEST] Extend SLS to support simulation of a Federated Environment
> ---
>
> Key: YARN-7833
> URL: https://issues.apache.org/jira/browse/YARN-7833
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Carlo Curino
>Assignee: Tanuj Nayak
>Priority: Major
> Attachments: YARN-7833.v1.patch
>
>
> To develop algorithms for federation, it would be of great help to have a 
> version of SLS that supports multi RMs and GPG.






[jira] [Updated] (YARN-7833) [PERF/TEST] Extend SLS to support simulation of a Federated Environment

2018-07-25 Thread Tanuj Nayak (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanuj Nayak updated YARN-7833:
--
Attachment: YARN-7833.v1.patch

> [PERF/TEST] Extend SLS to support simulation of a Federated Environment
> ---
>
> Key: YARN-7833
> URL: https://issues.apache.org/jira/browse/YARN-7833
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Carlo Curino
>Assignee: Tanuj Nayak
>Priority: Major
> Attachments: YARN-7833.v1.patch
>
>
> To develop algorithms for federation, it would be of great help to have a 
> version of SLS that supports multi RMs and GPG.






[jira] [Updated] (YARN-7833) [PERF/TEST] Extend SLS to support simulation of a Federated Environment

2018-07-25 Thread Tanuj Nayak (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanuj Nayak updated YARN-7833:
--
Attachment: (was: YARN-7833.v1.patch)

> [PERF/TEST] Extend SLS to support simulation of a Federated Environment
> ---
>
> Key: YARN-7833
> URL: https://issues.apache.org/jira/browse/YARN-7833
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Carlo Curino
>Assignee: Tanuj Nayak
>Priority: Major
>
> To develop algorithms for federation, it would be of great help to have a 
> version of SLS that supports multi RMs and GPG.






[jira] [Updated] (YARN-7833) [PERF/TEST] Extend SLS to support simulation of a Federated Environment

2018-07-25 Thread Tanuj Nayak (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanuj Nayak updated YARN-7833:
--
Attachment: YARN-7833.v1.patch

> [PERF/TEST] Extend SLS to support simulation of a Federated Environment
> ---
>
> Key: YARN-7833
> URL: https://issues.apache.org/jira/browse/YARN-7833
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Carlo Curino
>Assignee: Tanuj Nayak
>Priority: Major
> Attachments: YARN-7833.v1.patch
>
>
> To develop algorithms for federation, it would be of great help to have a 
> version of SLS that supports multi RMs and GPG.






[jira] [Updated] (YARN-7833) [PERF/TEST] Extend SLS to support simulation of a Federated Environment

2018-07-25 Thread Tanuj Nayak (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanuj Nayak updated YARN-7833:
--
Attachment: (was: YARN-7433.v1.patch)

> [PERF/TEST] Extend SLS to support simulation of a Federated Environment
> ---
>
> Key: YARN-7833
> URL: https://issues.apache.org/jira/browse/YARN-7833
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Carlo Curino
>Assignee: Tanuj Nayak
>Priority: Major
>
> To develop algorithms for federation, it would be of great help to have a 
> version of SLS that supports multi RMs and GPG.






[jira] [Updated] (YARN-7833) [PERF/TEST] Extend SLS to support simulation of a Federated Environment

2018-07-25 Thread Tanuj Nayak (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanuj Nayak updated YARN-7833:
--
Attachment: YARN-7433.v1.patch

> [PERF/TEST] Extend SLS to support simulation of a Federated Environment
> ---
>
> Key: YARN-7833
> URL: https://issues.apache.org/jira/browse/YARN-7833
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Carlo Curino
>Assignee: Tanuj Nayak
>Priority: Major
> Attachments: YARN-7433.v1.patch
>
>
> To develop algorithms for federation, it would be of great help to have a 
> version of SLS that supports multi RMs and GPG.






[jira] [Commented] (YARN-8581) [AMRMProxy] Add sub-cluster timeout in LocalityMulticastAMRMProxyPolicy

2018-07-25 Thread genericqa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556464#comment-16556464
 ] 

genericqa commented on YARN-8581:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 17m 
17s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 3 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  3m 
22s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 30m 
 8s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  8m 
59s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
20s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
23s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m  5s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
39s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
15s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
32s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 8s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  8m 
21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  8m 
21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m  9s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
1s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
46s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  2m 
24s{color} | {color:green} hadoop-yarn-server-common in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
40s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}109m 53s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ba1ab08 |
| JIRA Issue | YARN-8581 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12933108/YARN-8581.v1.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 2752b4edc895 3.13.0-143-generic #192-Ubuntu SMP Tue Feb 27 
10:45:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / f93ecf5 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_171 |
| findbugs | v3.1.0-RC1 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/21369/testReport/ |
| Max. process+thread count | 301 (vs. ulimit of 1) |
| modules | C: 

[jira] [Commented] (YARN-8566) Add diagnostic message for unschedulable containers

2018-07-25 Thread Robert Kanter (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556449#comment-16556449
 ] 

Robert Kanter commented on YARN-8566:
-

Thanks for the patch.  A few comments:
# In the switch statement, the {{break}}'s should be indented one more level.
# I think we should make the log message and the diagnostic message say the 
same thing for consistency (the only difference would be that the log message 
would also have the App ID and stack trace).
# It looks like {{throwInvalidResourceException}} already has a message with 
details about the problem in it - why not simply push that message to the 
diagnostic message instead of adding {{InvalidResourceType}}?
#- Furthermore, it looks like the exception message is the same, regardless of 
the reason for being invalid, which makes it somewhat unclear (i.e. it says 
"...requested resource type=[X] < 0 or greater than maximum allowed 
allocation." - which doesn't tell you which case).  I'd suggest we make the 
exception message more dynamic based on what the actual problem is, and re-use 
it for the diagnostic message.
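A rough sketch of the kind of dynamic message meant in points 3 and 4 (all names 
here are hypothetical, not taken from the existing patch):
{code:java}
// Build one message that states the actual problem, and reuse it for both the
// thrown exception and the application diagnostics.
static String buildInvalidResourceMessage(String resourceType, long requested,
    long maxAllowed) {
  StringBuilder sb = new StringBuilder("Invalid resource request for type ")
      .append(resourceType).append(": ");
  if (requested < 0) {
    sb.append("requested value ").append(requested).append(" is negative");
  } else {
    sb.append("requested value ").append(requested)
        .append(" is greater than the maximum allowed allocation ")
        .append(maxAllowed);
  }
  return sb.toString();
}
{code}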

> Add diagnostic message for unschedulable containers
> ---
>
> Key: YARN-8566
> URL: https://issues.apache.org/jira/browse/YARN-8566
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-8566.001.patch, YARN-8566.002.patch, 
> YARN-8566.003.patch, YARN-8566.004.patch
>
>
> If a queue is configured with maxResources set to 0 for a resource, and an 
> application is submitted to that queue that requests that resource, that 
> application will remain pending until it is removed or moved to a different 
> queue. This behavior can be realized without extended resources, but it’s 
> unlikely a user will create a queue that allows 0 memory or CPU. As the 
> number of resources in the system increases, this scenario will become more 
> common, and it will become harder to recognize these cases. Therefore, the 
> scheduler should indicate in the diagnostic string for an application if it 
> was not scheduled because of a 0 maxResources setting.
> Example configuration (fair-scheduler.xml) : 
> {code:java}
> 
>   10
> 
> 1 mb,2vcores
> 9 mb,4vcores, 0gpu
> 50
> -1.0f
> 2.0
> fair
>   
> 
> {code}
> Command: 
> {code:java}
> yarn jar 
> "./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0-SNAPSHOT.jar" pi 
> -Dmapreduce.job.queuename=sample_queue -Dmapreduce.map.resource.gpu=1 1 1000;
> {code}
> The job hangs and the application diagnostic info is empty.
> Given that an exception is thrown before any mapper/reducer container is 
> created, the diagnostic message of the AM should be updated.






[jira] [Commented] (YARN-8330) An extra container got launched by RM for yarn-service

2018-07-25 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556386#comment-16556386
 ] 

Hudson commented on YARN-8330:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #14643 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/14643/])
YARN-8330.  Improved publishing ALLOCATED events to ATS. (eyang: 
rev f93ecf5c1e0b3db27424963814fc01ec43eb76e0)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/TestRMContainerImpl.java


> An extra container got launched by RM for yarn-service
> --
>
> Key: YARN-8330
> URL: https://issues.apache.org/jira/browse/YARN-8330
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: Yesha Vora
>Assignee: Suma Shivaprasad
>Priority: Critical
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-8330.1.patch, YARN-8330.2.patch, YARN-8330.3.patch, 
> YARN-8330.4.patch
>
>
> Steps:
> launch Hbase tarball app
> list containers for hbase tarball app
> {code}
> /usr/hdp/current/hadoop-yarn-client/bin/yarn container -list 
> appattempt_1525463491331_0006_01
> WARNING: YARN_LOG_DIR has been replaced by HADOOP_LOG_DIR. Using value of 
> YARN_LOG_DIR.
> WARNING: YARN_LOGFILE has been replaced by HADOOP_LOGFILE. Using value of 
> YARN_LOGFILE.
> WARNING: YARN_PID_DIR has been replaced by HADOOP_PID_DIR. Using value of 
> YARN_PID_DIR.
> WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
> 18/05/04 22:36:11 INFO client.AHSProxy: Connecting to Application History 
> server at xxx/xxx:10200
> 18/05/04 22:36:11 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
> to rm2
> Total number of containers :5
> Container-IdStart Time Finish Time   
> StateHost   Node Http Address 
>LOG-URL
> container_e06_1525463491331_0006_01_02Fri May 04 22:34:26 + 2018  
>  N/A RUNNINGxxx:25454  http://xxx:8042
> http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_02/hrt_qa
> 2018-05-04 22:36:11,216|INFO|MainThread|machine.py:167 - 
> run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_03
> Fri May 04 22:34:26 + 2018   N/A 
> RUNNINGxxx:25454  http://xxx:8042
> http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_03/hrt_qa
> 2018-05-04 22:36:11,217|INFO|MainThread|machine.py:167 - 
> run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_01
> Fri May 04 22:34:15 + 2018   N/A 
> RUNNINGxxx:25454  http://xxx:8042
> http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_01/hrt_qa
> 2018-05-04 22:36:11,217|INFO|MainThread|machine.py:167 - 
> run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_05
> Fri May 04 22:34:56 + 2018   N/A 
> RUNNINGxxx:25454  http://xxx:8042
> http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_05/hrt_qa
> 2018-05-04 22:36:11,218|INFO|MainThread|machine.py:167 - 
> run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_04
> Fri May 04 22:34:56 + 2018   N/A
> nullxxx:25454  http://xxx:8042
> http://xxx:8188/applicationhistory/logs/xxx:25454/container_e06_1525463491331_0006_01_04/container_e06_1525463491331_0006_01_04/hrt_qa{code}
> Total expected containers = 4 (3 component containers + 1 AM). Instead, the RM 
> is listing 5 containers. 
> container_e06_1525463491331_0006_01_04 is in a null state.
> The Yarn service utilized containers 02, 03 and 05 for components. There is no log 
> available in the NM or AM related to container 04. Only one line is printed in 
> the RM log:
> {code}
> 2018-05-04 22:34:56,618 INFO  rmcontainer.RMContainerImpl 
> (RMContainerImpl.java:handle(489)) - 
> container_e06_1525463491331_0006_01_04 Container Transitioned from NEW to 
> RESERVED{code}






[jira] [Created] (YARN-8583) Inconsistency in YARN status command

2018-07-25 Thread Eric Yang (JIRA)
Eric Yang created YARN-8583:
---

 Summary: Inconsistency in YARN status command
 Key: YARN-8583
 URL: https://issues.apache.org/jira/browse/YARN-8583
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Eric Yang


The YARN app -status command can report based on application ID or application 
name, with some usability limitations.  An application ID is globally unique, and 
it allows any user to query the status of any application.  An application name is 
not globally unique, and it only works for querying the user's own applications.  
This is somewhat restrictive for an application administrator, but allowing one 
user to query any other user's application could be considered a security hole as 
well.  There are two possible options to reduce the inconsistency:

Option 1.  Block other users from querying application status.  This may improve 
security in some sense, but it is an incompatible change.  It is the simpler 
change: match the owner of the application and decide whether or not to report.
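
A minimal sketch of the owner-matching check Option 1 describes (purely 
illustrative, not tied to any existing YARN class):
{code:java}
// Report status by name only when the caller owns the application;
// otherwise behave as if the application was not found.
static boolean canReportByName(String callerUser, String appOwner) {
  return callerUser.equals(appOwner);
}
{code}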

Option 2.  Add a --user parameter to allow an administrator to query an 
application name run by another user.  This is a bigger change because application 
metadata is stored in the user's own HDFS directory, and the security restrictions 
would need to be defined.






[jira] [Updated] (YARN-8581) [AMRMProxy] Add sub-cluster timeout in LocalityMulticastAMRMProxyPolicy

2018-07-25 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8581:
---
Attachment: YARN-8581.v1.patch

> [AMRMProxy] Add sub-cluster timeout in LocalityMulticastAMRMProxyPolicy
> ---
>
> Key: YARN-8581
> URL: https://issues.apache.org/jira/browse/YARN-8581
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8581.v1.patch
>
>
> In Federation, every time an AM heartbeat comes in, 
> LocalityMulticastAMRMProxyPolicy in AMRMProxy splits the asks according to 
> the list of active and enabled sub-clusters. However, if we haven't been able 
> to heartbeat to a sub-cluster for some time (network issues, or we keep 
> hitting some exception from YarnRM, or YarnRM master-slave switch is taking a 
> long time etc.), we should consider the sub-cluster as unhealthy and stop 
> routing asks there, until the heartbeat channel becomes healthy again. 






[jira] [Comment Edited] (YARN-8448) AM HTTPS Support

2018-07-25 Thread Robert Kanter (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556364#comment-16556364
 ] 

Robert Kanter edited comment on YARN-8448 at 7/25/18 10:42 PM:
---

I've finished up a patch that implements everything described in YARN-6586, 
other than the RM HA support (TODO in YARN-8449) and Documentation (just filed 
YARN-8582 for this).  I've put the bulk of the changes here 
(YARN-8448.001.patch), and the MapReduce changes in MAPREDUCE-4669.

Some notes on the patch:
- Updated BouncyCastle library to a newer version and had to also change the 
artifact from {{bcprov-jdk16}} to {{bcprov-jdk15on}}.  I know that sounds 
backwards, but jdk15on is actually newer and the one we should be using (see 
http://bouncy-castle.1462172.n4.nabble.com/Bouncycaslte-bcprov-jdk15-vs-bcprov-jdk16-td4656252.html).
- The {{yarn.resourcemanager.application-https.policy}} property controls how 
the RM should handle HTTPS when talking to AMs.  It can be {{OFF}}, 
{{OPTIONAL}} (default), or {{REQUIRED}}.  {{OFF}} makes it behave like today, 
where it does nothing special.  {{OPTIONAL}} makes it generate and provide the 
keystore and truststore to the AM when it sees an HTTPS tracking URL, but HTTP 
is also still allowed.  And {{REQUIRED}} is like {{OPTIONAL}}, but it won't 
follow HTTP tracking URLs.
- A lot of the code around the container executors is in providing/copying/etc 
the keystore and truststore files.  I've largely based this on the existing way 
we handle the credentials (delegation tokens) file.
- When provided a keystore file, the AM will get env vars 
{{KEYSTORE_FILE_LOCATION}} and {{KEYSTORE_PASSWORD}}; similarly, 
{{TRUSTSTORE_FILE_LOCATION}} and {{TRUSTSTORE_PASSWORD}} for the truststore 
file.
- Due to the (ugly) way we parse arguments in the LCE, I had to add an argument 
that's either {{\-\-http}} or {{\-\-https}} to indicate if we'll be providing 
it the keystore and truststore files.  Otherwise, there isn't a good way to 
have optional arguments.
- In order to keep things simple, I piggybacked passing the keystore and 
truststore files and passwords via secrets in the Credentials, which is already 
securely passed from the RM to the NM.
- {{ProxyCAManager}} is in charge of creating the certificates, keystores, and 
truststores.
- When writing the unit tests, I found a number of tests that were about 80% 
complete in what they were testing, which I completed in addition to adding 
tests for my changes.
-- I also tried to simplify some things (e.g. {{TestDockerContainerRuntime}} 
has ~30 tests that all duplicate the code for checking the arguments, and 
because I changed the number of arguments, they all failed - instead of 
updating them all, I created a helper method)
- I'm not sure what's up with {{test-container-executor}}, but unless my 
environment was messed up, it doesn't work when run as {{root}}; maybe people 
typically run it as a normal user?  The test talks about running as {{root}} as 
an option, and even has a few tests that only run when running as {{root}}.  I 
spent some time fixing this - it now runs in all 4 user configurations 
described in the existing comments.
- I've tested in a real cluster with the DefaultContainerExecutor and 
LinuxContainerExecutor using all combinations of 
{{yarn.resourcemanager.application-https.policy}}, 
{{yarn.app.mapreduce.am.webapp.https.enabled}}, and 
{{yarn.app.mapreduce.am.webapp.https.client.auth}} (see MAPREDUCE-4669), and 
everything behaved correctly.  I haven't tested out the 
DockerContainerExecutor.  
-- If you want to try this out yourself in a cluster, I'd recommend also 
applying the MAPREDUCE-4669 patch so you have an AM that supports the changes.  
You can then use {{openssl s_client -connect :}} to get 
SSL details.  You can also try {{curl}}.


was (Author: rkanter):
I've finished up a patch that implements everything described in YARN-6586, 
other than the RM HA support (TODO in YARN-8449) and Documentation (just filed 
YARN-8582 for this).  I've put the bulk of the changes here 
(YARN-8448.001.patch), and the MapReduce changes in MAPREDUCE-4669.

Some notes on the patch:
- Updated BouncyCastle library to a newer version and had to also change the 
artifact from {{bcprov-jdk16}} to {{bcprov-jdk15on}}.  I know that sounds 
backwards, but jdk15on is actually newer and the one we should be using (see 
http://bouncy-castle.1462172.n4.nabble.com/Bouncycaslte-bcprov-jdk15-vs-bcprov-jdk16-td4656252.html).
- The {{yarn.resourcemanager.application-https.policy}} property controls how 
the RM should handle HTTPS when talking to AMs.  It can be {{OFF}}, 
{{OPTIONAL}} (default), or {{REQUIRED}}.  {{OFF}} makes it behave like today, 
where it does nothing special.  {{OPTIONAL}} makes it generate and provide the 
keystore and truststore to the AM when it sees an HTTPS tracking URL, but HTTP 
is also still allowed.  And 

[jira] [Updated] (YARN-8448) AM HTTPS Support

2018-07-25 Thread Robert Kanter (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter updated YARN-8448:

Attachment: YARN-8448.001.patch

> AM HTTPS Support
> 
>
> Key: YARN-8448
> URL: https://issues.apache.org/jira/browse/YARN-8448
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Major
> Attachments: YARN-8448.001.patch
>
>







[jira] [Commented] (YARN-8448) AM HTTPS Support

2018-07-25 Thread Robert Kanter (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556364#comment-16556364
 ] 

Robert Kanter commented on YARN-8448:
-

I've finished up a patch that implements everything described in YARN-6586, 
other than the RM HA support (TODO in YARN-8449) and Documentation (just filed 
YARN-8582 for this).  I've put the bulk of the changes here 
(YARN-8448.001.patch), and the MapReduce changes in MAPREDUCE-4669.

Some notes on the patch:
- Updated BouncyCastle library to a newer version and had to also change the 
artifact from {{bcprov-jdk16}} to {{bcprov-jdk15on}}.  I know that sounds 
backwards, but jdk15on is actually newer and the one we should be using (see 
http://bouncy-castle.1462172.n4.nabble.com/Bouncycaslte-bcprov-jdk15-vs-bcprov-jdk16-td4656252.html).
- The {{yarn.resourcemanager.application-https.policy}} property controls how 
the RM should handle HTTPS when talking to AMs.  It can be {{OFF}}, 
{{OPTIONAL}} (default), or {{REQUIRED}}.  {{OFF}} makes it behave like today, 
where it does nothing special.  {{OPTIONAL}} makes it generate and provide the 
keystore and truststore to the AM when it sees an HTTPS tracking URL, but HTTP 
is also still allowed.  And {{REQUIRED}} is like {{OPTIONAL}}, but it won't 
follow HTTP tracking URLs.
- A lot of the code around the container executors is in providing/copying/etc 
the keystore and truststore files.  I've largely based this on the existing way 
we handle the credentials (delegation tokens) file.
- When provided a keystore file, the AM will get env vars 
{{KEYSTORE_FILE_LOCATION}} and {{KEYSTORE_PASSWORD}}; similarly, 
{{TRUSTSTORE_FILE_LOCATION}} and {{TRUSTSTORE_PASSWORD}} for the truststore 
file.
- Due to the (ugly) way we parse arguments in the LCE, I had to add an argument 
that's either {{--http}} or {{--https}} to indicate if we'll be providing it 
the keystore and truststore files.  Otherwise, there isn't a good way to have 
optional arguments.
- In order to keep things simple, I piggybacked passing the keystore and 
truststore files and passwords via secrets in the Credentials, which is already 
securely passed from the RM to the NM.
- {{ProxyCAManager}} is in charge of creating the certificates, keystores, and 
truststores.
- When writing the unit tests, I found a number of tests that were about 80% 
complete in what they were testing, which I completed in addition to adding 
tests for my changes.
-- I also tried to simplify some things (e.g. {{TestDockerContainerRuntime}} 
has ~30 tests that all duplicate the code for checking the arguments, and 
because I changed the number of arguments, they all failed - instead of 
updating them all, I created a helper method)
- I'm not sure what's up with {{test-container-executor}}, but unless my 
environment was messed up, it doesn't work when run as {{root}}; maybe people 
typically run it as a normal user?  The test talks about running as {{root}} as 
an option, and even has a few tests that only run when running as {{root}}.  I 
spent some time fixing this - it now runs in all 4 user configurations 
described in the existing comments.
- I've tested in a real cluster with the DefaultContainerExecutor and 
LinuxContainerExecutor using all combinations of 
{{yarn.resourcemanager.application-https.policy}}, 
{{yarn.app.mapreduce.am.webapp.https.enabled}}, and 
{{yarn.app.mapreduce.am.webapp.https.client.auth}} (see MAPREDUCE-4669), and 
everything behaved correctly.  I haven't tested out the 
DockerContainerExecutor.  
-- If you want to try this out yourself in a cluster, I'd recommend also 
applying the MAPREDUCE-4669 patch so you have an AM that supports the changes.  
You can then use {{openssl s_client -connect :}} to get 
SSL details.  You can also try {{curl}}.
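
As a purely illustrative sketch (not code from YARN-8448.001.patch; the keystore 
type is assumed to be JKS here), an AM could pick up the keystore handed to it 
through the environment variables named above roughly like this:
{code:java}
import java.io.FileInputStream;
import java.security.KeyStore;

public class AmKeystoreLoader {
  public static void main(String[] args) throws Exception {
    // Provided in the container environment when the RM supplies HTTPS material.
    String location = System.getenv("KEYSTORE_FILE_LOCATION");
    char[] password = System.getenv("KEYSTORE_PASSWORD").toCharArray();

    KeyStore keystore = KeyStore.getInstance("JKS");
    try (FileInputStream in = new FileInputStream(location)) {
      keystore.load(in, password);
    }
    // The web-server wiring that would actually serve HTTPS from this
    // keystore is omitted.
    System.out.println("Loaded AM keystore with " + keystore.size() + " entries");
  }
}
{code}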

> AM HTTPS Support
> 
>
> Key: YARN-8448
> URL: https://issues.apache.org/jira/browse/YARN-8448
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Major
> Attachments: YARN-8448.001.patch
>
>







[jira] [Created] (YARN-8582) Documentation for AM HTTPS Support

2018-07-25 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-8582:
---

 Summary: Documentation for AM HTTPS Support
 Key: YARN-8582
 URL: https://issues.apache.org/jira/browse/YARN-8582
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: docs
Reporter: Robert Kanter
Assignee: Robert Kanter


Documentation for YARN-6586.






[jira] [Updated] (YARN-8581) [AMRMProxy] Add sub-cluster timeout in LocalityMulticastAMRMProxyPolicy

2018-07-25 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8581:
---
Issue Type: Sub-task  (was: Task)
Parent: YARN-5597

> [AMRMProxy] Add sub-cluster timeout in LocalityMulticastAMRMProxyPolicy
> ---
>
> Key: YARN-8581
> URL: https://issues.apache.org/jira/browse/YARN-8581
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
>
> In Federation, every time an AM heartbeat comes in, 
> LocalityMulticastAMRMProxyPolicy in AMRMProxy splits the asks according to 
> the list of active and enabled sub-clusters. However, if we haven't been able 
> to heartbeat to a sub-cluster for some time (network issues, or we keep 
> hitting some exception from YarnRM, or YarnRM master-slave switch is taking a 
> long time etc.), we should consider the sub-cluster as unhealthy and stop 
> routing asks there, until the heartbeat channel becomes healthy again. 






[jira] [Created] (YARN-8581) [AMRMProxy] Add sub-cluster timeout in LocalityMulticastAMRMProxyPolicy

2018-07-25 Thread Botong Huang (JIRA)
Botong Huang created YARN-8581:
--

 Summary: [AMRMProxy] Add sub-cluster timeout in 
LocalityMulticastAMRMProxyPolicy
 Key: YARN-8581
 URL: https://issues.apache.org/jira/browse/YARN-8581
 Project: Hadoop YARN
  Issue Type: Task
  Components: amrmproxy, federation
Reporter: Botong Huang
Assignee: Botong Huang


In Federation, every time an AM heartbeat comes in, 
LocalityMulticastAMRMProxyPolicy in AMRMProxy splits the asks according to the 
list of active and enabled sub-clusters. However, if we haven't been able to 
heartbeat to a sub-cluster for some time (network issues, or we keep hitting 
some exception from YarnRM, or YarnRM master-slave switch is taking a long time 
etc.), we should consider the sub-cluster as unhealthy and stop routing asks 
there, until the heartbeat channel becomes healthy again. 
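
A hypothetical sketch of such a per-sub-cluster timeout check (class and method 
names are placeholders, not from an actual patch):
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SubClusterTimeoutTracker {
  private final long timeoutMs;
  private final Map<String, Long> lastSuccessfulHeartbeat = new ConcurrentHashMap<>();

  public SubClusterTimeoutTracker(long timeoutMs) {
    this.timeoutMs = timeoutMs;
  }

  // Record a successful heartbeat to the given sub-cluster.
  public void markHeartbeatSuccess(String subClusterId) {
    lastSuccessfulHeartbeat.put(subClusterId, System.currentTimeMillis());
  }

  // Only keep routing asks to sub-clusters heard from within the timeout.
  public boolean isHealthy(String subClusterId) {
    Long last = lastSuccessfulHeartbeat.get(subClusterId);
    return last != null && System.currentTimeMillis() - last <= timeoutMs;
  }
}
{code}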






[jira] [Assigned] (YARN-8579) New AM attempt could not retrieve previous attempt component data

2018-07-25 Thread Gour Saha (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gour Saha reassigned YARN-8579:
---

Assignee: Gour Saha

> New AM attempt could not retrieve previous attempt component data
> -
>
> Key: YARN-8579
> URL: https://issues.apache.org/jira/browse/YARN-8579
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.1
>Reporter: Yesha Vora
>Assignee: Gour Saha
>Priority: Critical
>
> Steps:
> 1) Launch httpd-docker
> 2) Wait for app to be in STABLE state
> 3) Run validation for app (It takes around 3 mins)
> 4) Stop all Zks 
> 5) Wait 60 sec
> 6) Kill AM
> 7) wait for 30 sec
> 8) Start all ZKs
> 9) Wait for application to finish
> 10) Validate expected containers of the app
> Expected behavior:
> A new AM attempt should start, and the docker containers launched by the 1st 
> attempt should be recovered by the new attempt.
> Actual behavior:
> A new AM attempt starts. It cannot recover the 1st attempt's docker containers. It 
> cannot read the component details from ZK. 
> Thus, it starts a new attempt for all containers.
> {code}
> 2018-07-19 22:42:47,595 [main] INFO  service.ServiceScheduler - Registering 
> appattempt_1531977563978_0015_02, fault-test-zkrm-httpd-docker into 
> registry
> 2018-07-19 22:42:47,611 [main] INFO  service.ServiceScheduler - Received 1 
> containers from previous attempt.
> 2018-07-19 22:42:47,642 [main] INFO  service.ServiceScheduler - Could not 
> read component paths: 
> `/users/hrt-qa/services/yarn-service/fault-test-zkrm-httpd-docker/components':
>  No such file or directory: KeeperErrorCode = NoNode for 
> /registry/users/hrt-qa/services/yarn-service/fault-test-zkrm-httpd-docker/components
> 2018-07-19 22:42:47,643 [main] INFO  service.ServiceScheduler - Handling 
> container_e08_1531977563978_0015_01_03 from previous attempt
> 2018-07-19 22:42:47,643 [main] INFO  service.ServiceScheduler - Record not 
> found in registry for container container_e08_1531977563978_0015_01_03 
> from previous attempt, releasing
> 2018-07-19 22:42:47,649 [AMRM Callback Handler Thread] INFO  
> impl.TimelineV2ClientImpl - Updated timeline service address to xxx:33019
> 2018-07-19 22:42:47,651 [main] INFO  service.ServiceScheduler - Triggering 
> initial evaluation of component httpd
> 2018-07-19 22:42:47,652 [main] INFO  component.Component - [INIT COMPONENT 
> httpd]: 2 instances.
> 2018-07-19 22:42:47,652 [main] INFO  component.Component - [COMPONENT httpd] 
> Requesting for 2 container(s){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8569) Create an interface to provide cluster information to application

2018-07-25 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556348#comment-16556348
 ] 

Eric Yang commented on YARN-8569:
-

[~oliverhuh...@gmail.com] Your approach works fine for applications that carry a 
Hadoop client or ZooKeeper client.  This proposed interface is to lower the bar of 
entry for obtaining cluster information in non-Hadoop native applications.  
This is the main reason to offer a file-based interface for nodes.

The high level view of the design looks like this:
# The Application Master receives the YARN service JSON from the yarn cli.
# The Application Master writes the hostname information to the YARN service JSON 
that resides in /user/${USER}/.yarn/services/[service]/[service].json
# The file is added to the distributed cache and localized during container launch.
# The file is bind-mounted into the docker container for consumption at a predefined 
location.
# A flex operation triggers an update of [service].json and repopulates the 
distributed cache when the nodes involved in the cluster have changed.

The user application can poll for file changes from inside the docker container to be 
notified of cluster information changes.
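
As a rough illustration of the consumer side of step 5 (the mount path below is a 
placeholder, not a documented location), an application inside the container could 
poll the bind-mounted file's modification time and re-read it when it changes:

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative consumer: re-read the mounted service JSON whenever it changes.
public class ClusterInfoWatcher {
  public static void main(String[] args) throws IOException, InterruptedException {
    Path serviceJson = Paths.get("/etc/yarn-service/service.json"); // placeholder mount point
    long lastModified = 0L;
    while (true) {
      long current = Files.getLastModifiedTime(serviceJson).toMillis();
      if (current != lastModified) {
        lastModified = current;
        String content = new String(Files.readAllBytes(serviceJson));
        // Parse the JSON here and refresh the worker/ps host lists accordingly.
        System.out.println("cluster info updated: " + content.length() + " bytes");
      }
      Thread.sleep(5000L); // poll every 5 seconds
    }
  }
}
{code}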

> Create an interface to provide cluster information to application
> -
>
> Key: YARN-8569
> URL: https://issues.apache.org/jira/browse/YARN-8569
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Priority: Major
>  Labels: Docker
>
> Some program requires container hostnames to be known for application to run. 
>  For example, distributed tensorflow requires launch_command that looks like:
> {code}
> # On ps0.example.com:
> $ python trainer.py \
>  --ps_hosts=ps0.example.com:,ps1.example.com: \
>  --worker_hosts=worker0.example.com:,worker1.example.com: \
>  --job_name=ps --task_index=0
> # On ps1.example.com:
> $ python trainer.py \
>  --ps_hosts=ps0.example.com:,ps1.example.com: \
>  --worker_hosts=worker0.example.com:,worker1.example.com: \
>  --job_name=ps --task_index=1
> # On worker0.example.com:
> $ python trainer.py \
>  --ps_hosts=ps0.example.com:,ps1.example.com: \
>  --worker_hosts=worker0.example.com:,worker1.example.com: \
>  --job_name=worker --task_index=0
> # On worker1.example.com:
> $ python trainer.py \
>  --ps_hosts=ps0.example.com:,ps1.example.com: \
>  --worker_hosts=worker0.example.com:,worker1.example.com: \
>  --job_name=worker --task_index=1
> {code}
> This is a bit cumbersome to orchestrate via Distributed Shell, or YARN 
> services launch_command.  In addition, the dynamic parameters do not work 
> with YARN flex command.  This is the classic pain point for application 
> developer attempt to automate system environment settings as parameter to end 
> user application.
> It would be great if YARN Docker integration can provide a simple option to 
> expose hostnames of the yarn service via a mounted file.  The file content 
> gets updated when flex command is performed.  This allows application 
> developer to consume system environment settings via a standard interface.  
> It is like /proc/devices for Linux, but for Hadoop.  This may involve 
> updating a file in distributed cache, and allow mounting of the file via 
> container-executor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8418) App local logs could leaked if log aggregation fails to initialize for the app

2018-07-25 Thread Suma Shivaprasad (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556347#comment-16556347
 ] 

Suma Shivaprasad commented on YARN-8418:


Thanks for updating the patch and clarifying, [~bibinchundatt]. LGTM. [~sunilg], 
can you please take a look?

> App local logs could leaked if log aggregation fails to initialize for the app
> --
>
> Key: YARN-8418
> URL: https://issues.apache.org/jira/browse/YARN-8418
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0, 3.0.0-alpha1
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-8418.001.patch, YARN-8418.002.patch, 
> YARN-8418.003.patch, YARN-8418.004.patch, YARN-8418.005.patch, 
> YARN-8418.006.patch, YARN-8418.007.patch
>
>
> If log aggregation fails to init the createApp directory, container logs could get 
> leaked in the NM directory.
> For a long-running application, this case is possible when the NM restarts after 
> token renewal, or when an application is submitted with an invalid delegation token.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8407) Container launch exception in AM log should be printed in ERROR level

2018-07-25 Thread Yesha Vora (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yesha Vora updated YARN-8407:
-
Attachment: YARN-8407.001.patch

> Container launch exception in AM log should be printed in ERROR level
> -
>
> Key: YARN-8407
> URL: https://issues.apache.org/jira/browse/YARN-8407
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yesha Vora
>Assignee: Yesha Vora
>Priority: Major
> Attachments: YARN-8407.001.patch
>
>
> When a container launch fails because the docker image is not available, it is 
> logged at INFO level in the AM log. 
> Container launch failures should be logged at ERROR level.
> Steps:
> launch httpd yarn-service application with invalid docker image
>  
> {code:java}
> 2018-06-07 01:51:32,966 [Component  dispatcher] INFO  
> instance.ComponentInstance - [COMPINSTANCE httpd-0 : 
> container_e05_1528335963594_0001_01_02]: 
> container_e05_1528335963594_0001_01_02 completed. Reinsert back to 
> pending list and requested a new container.
> exitStatus=-1, diagnostics=[2018-06-07 01:51:02.363]Exception from 
> container-launch.
> Container id: container_e05_1528335963594_0001_01_02
> Exit code: 7
> Exception message: Launch container failed
> Shell error output: Unable to find image 'xxx/httpd:0.1' locally
> Trying to pull repository xxx/httpd ...
> /usr/bin/docker-current: Get https://xxx/v1/_ping: dial tcp: lookup xxx on 
> yyy: no such host.
> See '/usr/bin/docker-current run --help'.
> Shell output: main : command provided 4
> main : run as user is hbase
> main : requested yarn user is hbase
> Creating script paths...
> Creating local dirs...
> Getting exit code file...
> Changing effective user to root...
> Wrote the exit code 7 to 
> /grid/0/hadoop/yarn/local/nmPrivate/application_1528335963594_0001/container_e05_1528335963594_0001_01_02/container_e05_1528335963594_0001_01_02.pid.exitcode
> [2018-06-07 01:51:02.393]Diagnostic message from attempt :
> [2018-06-07 01:51:02.394]Container exited with a non-zero exit code 7. Last 
> 4096 bytes of stderr.txt :
> [2018-06-07 01:51:32.428]Could not find 
> nmPrivate/application_1528335963594_0001/container_e05_1528335963594_0001_01_02//container_e05_1528335963594_0001_01_02.pid
>  in any of the directories
> 2018-06-07 01:51:32,966 [Component  dispatcher] INFO  
> instance.ComponentInstance - [COMPINSTANCE httpd-0 : 
> container_e05_1528335963594_0001_01_02] Transitioned from STARTED to INIT 
> on STOP event{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8580) yarn.resourcemanager.am.max-attempts is not respected for yarn services

2018-07-25 Thread Giovanni Matteo Fumarola (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giovanni Matteo Fumarola resolved YARN-8580.

Resolution: Invalid

> yarn.resourcemanager.am.max-attempts is not respected for yarn services
> ---
>
> Key: YARN-8580
> URL: https://issues.apache.org/jira/browse/YARN-8580
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Affects Versions: 3.1.1
>Reporter: Yesha Vora
>Priority: Major
>
> 1) Max am attempt is set to 100 on all nodes. ( including gateway)
> {code}
>  
>   yarn.resourcemanager.am.max-attempts
>   100
> {code}
> 2) Start a Yarn service ( Hbase tarball ) application
> 3) Kill AM 20 times
> Here, App fails with below diagnostics.
> {code}
> bash-4.2$ /usr/hdp/current/hadoop-yarn-client/bin/yarn application -status 
> application_1532481557746_0001
> 18/07/25 18:43:34 INFO client.AHSProxy: Connecting to Application History 
> server at xxx/xxx:10200
> 18/07/25 18:43:34 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
> to rm2
> 18/07/25 18:43:34 INFO conf.Configuration: found resource resource-types.xml 
> at file:/etc/hadoop/3.0.0.0-1634/0/resource-types.xml
> Application Report : 
>   Application-Id : application_1532481557746_0001
>   Application-Name : hbase-tarball-lr
>   Application-Type : yarn-service
>   User : hbase
>   Queue : default
>   Application Priority : 0
>   Start-Time : 1532481864863
>   Finish-Time : 1532522943103
>   Progress : 100%
>   State : FAILED
>   Final-State : FAILED
>   Tracking-URL : 
> https://xxx:8090/cluster/app/application_1532481557746_0001
>   RPC Port : -1
>   AM Host : N/A
>   Aggregate Resource Allocation : 252150112 MB-seconds, 164141 
> vcore-seconds
>   Aggregate Resource Preempted : 0 MB-seconds, 0 vcore-seconds
>   Log Aggregation Status : SUCCEEDED
>   Diagnostics : Application application_1532481557746_0001 failed 20 
> times (global limit =100; local limit is =20) due to AM Container for 
> appattempt_1532481557746_0001_20 exited with  exitCode: 137
> Failing this attempt.Diagnostics: [2018-07-25 12:49:00.784]Container killed 
> on request. Exit code is 137
> [2018-07-25 12:49:03.045]Container exited with a non-zero exit code 137. 
> [2018-07-25 12:49:03.045]Killed by external signal
> For more detailed output, check the application tracking page: 
> https://xxx:8090/cluster/app/application_1532481557746_0001 Then click on 
> links to logs of each attempt.
> . Failing the application.
>   Unmanaged Application : false
>   Application Node Label Expression : 
>   AM container Node Label Expression : 
>   TimeoutType : LIFETIME  ExpiryTime : 2018-07-25T22:26:15.419+   
> RemainingTime : 0seconds
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8580) yarn.resourcemanager.am.max-attempts is not respected for yarn services

2018-07-25 Thread Giovanni Matteo Fumarola (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556321#comment-16556321
 ] 

Giovanni Matteo Fumarola commented on YARN-8580:


Hi [~yeshavora],

YARN takes the minimum of the global and the local limit. Your global limit is set 
to 100 (yarn.resourcemanager.am.max-attempts), while the per-application AM limit 
is set to 20, so the effective limit is 20.

Closing this Jira as invalid.
Diagnostics : Application application_1532481557746_0001 failed 20 times 
(global limit =100; local limit is =20)
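
A simplified sketch of that rule (not the actual RM code) is just a min():

{code:java}
// Simplified sketch: the application's own AM attempt limit is capped by the
// cluster-wide yarn.resourcemanager.am.max-attempts setting.
public class EffectiveAmAttempts {
  public static void main(String[] args) {
    int globalLimit = 100;   // yarn.resourcemanager.am.max-attempts
    int appLocalLimit = 20;  // limit carried in the application's submission context
    int effective = Math.min(globalLimit, appLocalLimit);
    // Prints 20, matching "failed 20 times (global limit =100; local limit is =20)".
    System.out.println("effective AM attempt limit = " + effective);
  }
}
{code}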

> yarn.resourcemanager.am.max-attempts is not respected for yarn services
> ---
>
> Key: YARN-8580
> URL: https://issues.apache.org/jira/browse/YARN-8580
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Affects Versions: 3.1.1
>Reporter: Yesha Vora
>Priority: Major
>
> 1) Max am attempt is set to 100 on all nodes. ( including gateway)
> {code}
>  
>   yarn.resourcemanager.am.max-attempts
>   100
> {code}
> 2) Start a Yarn service ( Hbase tarball ) application
> 3) Kill AM 20 times
> Here, App fails with below diagnostics.
> {code}
> bash-4.2$ /usr/hdp/current/hadoop-yarn-client/bin/yarn application -status 
> application_1532481557746_0001
> 18/07/25 18:43:34 INFO client.AHSProxy: Connecting to Application History 
> server at xxx/xxx:10200
> 18/07/25 18:43:34 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
> to rm2
> 18/07/25 18:43:34 INFO conf.Configuration: found resource resource-types.xml 
> at file:/etc/hadoop/3.0.0.0-1634/0/resource-types.xml
> Application Report : 
>   Application-Id : application_1532481557746_0001
>   Application-Name : hbase-tarball-lr
>   Application-Type : yarn-service
>   User : hbase
>   Queue : default
>   Application Priority : 0
>   Start-Time : 1532481864863
>   Finish-Time : 1532522943103
>   Progress : 100%
>   State : FAILED
>   Final-State : FAILED
>   Tracking-URL : 
> https://xxx:8090/cluster/app/application_1532481557746_0001
>   RPC Port : -1
>   AM Host : N/A
>   Aggregate Resource Allocation : 252150112 MB-seconds, 164141 
> vcore-seconds
>   Aggregate Resource Preempted : 0 MB-seconds, 0 vcore-seconds
>   Log Aggregation Status : SUCCEEDED
>   Diagnostics : Application application_1532481557746_0001 failed 20 
> times (global limit =100; local limit is =20) due to AM Container for 
> appattempt_1532481557746_0001_20 exited with  exitCode: 137
> Failing this attempt.Diagnostics: [2018-07-25 12:49:00.784]Container killed 
> on request. Exit code is 137
> [2018-07-25 12:49:03.045]Container exited with a non-zero exit code 137. 
> [2018-07-25 12:49:03.045]Killed by external signal
> For more detailed output, check the application tracking page: 
> https://xxx:8090/cluster/app/application_1532481557746_0001 Then click on 
> links to logs of each attempt.
> . Failing the application.
>   Unmanaged Application : false
>   Application Node Label Expression : 
>   AM container Node Label Expression : 
>   TimeoutType : LIFETIME  ExpiryTime : 2018-07-25T22:26:15.419+   
> RemainingTime : 0seconds
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8580) yarn.resourcemanager.am.max-attempts is not respected for yarn services

2018-07-25 Thread Yesha Vora (JIRA)
Yesha Vora created YARN-8580:


 Summary: yarn.resourcemanager.am.max-attempts is not respected for 
yarn services
 Key: YARN-8580
 URL: https://issues.apache.org/jira/browse/YARN-8580
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn-native-services
Affects Versions: 3.1.1
Reporter: Yesha Vora


1) Max am attempt is set to 100 on all nodes. ( including gateway)
{code}
 
  yarn.resourcemanager.am.max-attempts
  100
{code}
2) Start a Yarn service ( Hbase tarball ) application
3) Kill AM 20 times

Here, App fails with below diagnostics.

{code}
bash-4.2$ /usr/hdp/current/hadoop-yarn-client/bin/yarn application -status 
application_1532481557746_0001
18/07/25 18:43:34 INFO client.AHSProxy: Connecting to Application History 
server at xxx/xxx:10200
18/07/25 18:43:34 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
to rm2
18/07/25 18:43:34 INFO conf.Configuration: found resource resource-types.xml at 
file:/etc/hadoop/3.0.0.0-1634/0/resource-types.xml
Application Report : 
Application-Id : application_1532481557746_0001
Application-Name : hbase-tarball-lr
Application-Type : yarn-service
User : hbase
Queue : default
Application Priority : 0
Start-Time : 1532481864863
Finish-Time : 1532522943103
Progress : 100%
State : FAILED
Final-State : FAILED
Tracking-URL : 
https://xxx:8090/cluster/app/application_1532481557746_0001
RPC Port : -1
AM Host : N/A
Aggregate Resource Allocation : 252150112 MB-seconds, 164141 
vcore-seconds
Aggregate Resource Preempted : 0 MB-seconds, 0 vcore-seconds
Log Aggregation Status : SUCCEEDED
Diagnostics : Application application_1532481557746_0001 failed 20 
times (global limit =100; local limit is =20) due to AM Container for 
appattempt_1532481557746_0001_20 exited with  exitCode: 137
Failing this attempt.Diagnostics: [2018-07-25 12:49:00.784]Container killed on 
request. Exit code is 137
[2018-07-25 12:49:03.045]Container exited with a non-zero exit code 137. 
[2018-07-25 12:49:03.045]Killed by external signal
For more detailed output, check the application tracking page: 
https://xxx:8090/cluster/app/application_1532481557746_0001 Then click on links 
to logs of each attempt.
. Failing the application.
Unmanaged Application : false
Application Node Label Expression : 
AM container Node Label Expression : 
TimeoutType : LIFETIME  ExpiryTime : 2018-07-25T22:26:15.419+   
RemainingTime : 0seconds
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8579) New AM attempt could not retrieve previous attempt component data

2018-07-25 Thread Yesha Vora (JIRA)
Yesha Vora created YARN-8579:


 Summary: New AM attempt could not retrieve previous attempt 
component data
 Key: YARN-8579
 URL: https://issues.apache.org/jira/browse/YARN-8579
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.1.1
Reporter: Yesha Vora


Steps:
1) Launch httpd-docker
2) Wait for app to be in STABLE state
3) Run validation for app (It takes around 3 mins)
4) Stop all ZKs
5) Wait 60 sec
6) Kill AM
7) wait for 30 sec
8) Start all ZKs
9) Wait for application to finish
10) Validate expected containers of the app

Expected behavior:
A new AM attempt should start, and docker containers launched by the 1st attempt 
should be recovered by the new attempt.

Actual behavior:
A new AM attempt starts, but it cannot recover the 1st attempt's docker containers 
because it cannot read the component details from ZK. 
Thus, it starts new attempts for all containers.

{code}
2018-07-19 22:42:47,595 [main] INFO  service.ServiceScheduler - Registering 
appattempt_1531977563978_0015_02, fault-test-zkrm-httpd-docker into registry
2018-07-19 22:42:47,611 [main] INFO  service.ServiceScheduler - Received 1 
containers from previous attempt.
2018-07-19 22:42:47,642 [main] INFO  service.ServiceScheduler - Could not read 
component paths: 
`/users/hrt-qa/services/yarn-service/fault-test-zkrm-httpd-docker/components': 
No such file or directory: KeeperErrorCode = NoNode for 
/registry/users/hrt-qa/services/yarn-service/fault-test-zkrm-httpd-docker/components
2018-07-19 22:42:47,643 [main] INFO  service.ServiceScheduler - Handling 
container_e08_1531977563978_0015_01_03 from previous attempt
2018-07-19 22:42:47,643 [main] INFO  service.ServiceScheduler - Record not 
found in registry for container container_e08_1531977563978_0015_01_03 from 
previous attempt, releasing
2018-07-19 22:42:47,649 [AMRM Callback Handler Thread] INFO  
impl.TimelineV2ClientImpl - Updated timeline service address to xxx:33019
2018-07-19 22:42:47,651 [main] INFO  service.ServiceScheduler - Triggering 
initial evaluation of component httpd
2018-07-19 22:42:47,652 [main] INFO  component.Component - [INIT COMPONENT 
httpd]: 2 instances.
2018-07-19 22:42:47,652 [main] INFO  component.Component - [COMPONENT httpd] 
Requesting for 2 container(s){code}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8545) YARN native service should return container if launch failed

2018-07-25 Thread Gour Saha (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556267#comment-16556267
 ] 

Gour Saha commented on YARN-8545:
-

[~csingh] patch 001 looks good to me. +1.

> YARN native service should return container if launch failed
> 
>
> Key: YARN-8545
> URL: https://issues.apache.org/jira/browse/YARN-8545
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Wangda Tan
>Assignee: Chandni Singh
>Priority: Critical
> Attachments: YARN-8545.001.patch
>
>
> In some cases, a container launch may fail but the container will not be properly 
> returned to the RM. 
> This can happen when the AM tries to prepare the container launch context but 
> fails without sending it to the NM (once the container launch 
> context is sent to the NM, the NM will report the failed container to the RM).
> Exception like: 
> {code:java}
> java.io.FileNotFoundException: File does not exist: 
> hdfs://ns1/user/wtan/.yarn/services/tf-job-001/components/1531852429056/primary-worker/primary-worker-0/run-PRIMARY_WORKER.sh
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591)
>   at 
> org.apache.hadoop.yarn.service.utils.CoreFileSystem.createAmResource(CoreFileSystem.java:388)
>   at 
> org.apache.hadoop.yarn.service.provider.ProviderUtils.createConfigFileAndAddLocalResource(ProviderUtils.java:253)
>   at 
> org.apache.hadoop.yarn.service.provider.AbstractProviderService.buildContainerLaunchContext(AbstractProviderService.java:152)
>   at 
> org.apache.hadoop.yarn.service.containerlaunch.ContainerLaunchService$ContainerLauncher.run(ContainerLaunchService.java:105)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745){code}
> And even after container launch context preparation has failed, the AM keeps 
> trying to monitor the container's readiness:
> {code:java}
> 2018-07-17 18:42:57,518 [pool-7-thread-1] INFO  monitor.ServiceMonitor - 
> Readiness check failed for primary-worker-0: Probe Status, time="Tue Jul 17 
> 18:42:57 UTC 2018", outcome="failure", message="Failure in Default probe: IP 
> presence", exception="java.io.IOException: primary-worker-0: IP is not 
> available yet"
> ...{code}
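
A hedged sketch of the remedy being discussed (this is not the actual patch; it only 
uses the standard AMRMClientAsync call for handing a container back): if building the 
launch context throws before anything reaches the NM, the AM can release the 
container itself so the RM can reallocate it.

{code:java}
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

// Illustrative only: return a container to the RM when the AM fails to build
// its launch context, because the NM will never report this container.
public class LaunchFailureHandler {
  private final AMRMClientAsync<?> amRMClient;

  public LaunchFailureHandler(AMRMClientAsync<?> amRMClient) {
    this.amRMClient = amRMClient;
  }

  public void launchOrRelease(Container container, Runnable buildAndLaunch) {
    try {
      buildAndLaunch.run(); // prepare the launch context and start the container
    } catch (RuntimeException e) {
      // Preparation failed before reaching the NM; release the container so the
      // RM can reallocate it, then re-request a replacement if still needed.
      amRMClient.releaseAssignedContainer(container.getId());
    }
  }
}
{code}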



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8569) Create an interface to provide cluster information to application

2018-07-25 Thread Keqiu Hu (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556244#comment-16556244
 ] 

Keqiu Hu commented on YARN-8569:


I can see the value of sharing some information between cluster nodes with 
this; I just want to share how we tackled the problem. We did this by storing the 
information in the application master; each worker node has a TaskExecutor that 
heartbeats with the AM to get the latest cluster information.

How do you ensure the file update is atomic, for example, when multiple nodes can 
modify the mounted file at the same time?

> Create an interface to provide cluster information to application
> -
>
> Key: YARN-8569
> URL: https://issues.apache.org/jira/browse/YARN-8569
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Priority: Major
>  Labels: Docker
>
> Some program requires container hostnames to be known for application to run. 
>  For example, distributed tensorflow requires launch_command that looks like:
> {code}
> # On ps0.example.com:
> $ python trainer.py \
>  --ps_hosts=ps0.example.com:,ps1.example.com: \
>  --worker_hosts=worker0.example.com:,worker1.example.com: \
>  --job_name=ps --task_index=0
> # On ps1.example.com:
> $ python trainer.py \
>  --ps_hosts=ps0.example.com:,ps1.example.com: \
>  --worker_hosts=worker0.example.com:,worker1.example.com: \
>  --job_name=ps --task_index=1
> # On worker0.example.com:
> $ python trainer.py \
>  --ps_hosts=ps0.example.com:,ps1.example.com: \
>  --worker_hosts=worker0.example.com:,worker1.example.com: \
>  --job_name=worker --task_index=0
> # On worker1.example.com:
> $ python trainer.py \
>  --ps_hosts=ps0.example.com:,ps1.example.com: \
>  --worker_hosts=worker0.example.com:,worker1.example.com: \
>  --job_name=worker --task_index=1
> {code}
> This is a bit cumbersome to orchestrate via Distributed Shell, or YARN 
> services launch_command.  In addition, the dynamic parameters do not work 
> with YARN flex command.  This is the classic pain point for application 
> developer attempt to automate system environment settings as parameter to end 
> user application.
> It would be great if YARN Docker integration can provide a simple option to 
> expose hostnames of the yarn service via a mounted file.  The file content 
> gets updated when flex command is performed.  This allows application 
> developer to consume system environment settings via a standard interface.  
> It is like /proc/devices for Linux, but for Hadoop.  This may involve 
> updating a file in distributed cache, and allow mounting of the file via 
> container-executor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6966) NodeManager metrics may return wrong negative values when NM restart

2018-07-25 Thread genericqa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556179#comment-16556179
 ] 

genericqa commented on YARN-6966:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} docker {color} | {color:red} 12m  
7s{color} | {color:red} Docker failed to build yetus/hadoop:f667ef1. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-6966 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12933031/YARN-6966-branch-2.001.patch
 |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/21368/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> NodeManager metrics may return wrong negative values when NM restart
> 
>
> Key: YARN-6966
> URL: https://issues.apache.org/jira/browse/YARN-6966
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yang Wang
>Assignee: Szilard Nemeth
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-6966-branch-2.001.patch, YARN-6966.001.patch, 
> YARN-6966.002.patch, YARN-6966.003.patch, YARN-6966.004.patch, 
> YARN-6966.005.patch, YARN-6966.005.patch, YARN-6966.006.patch
>
>
> Just as YARN-6212. However, I think it is not a duplicate of YARN-3933.
> The primary cause of the negative values is that metrics do not recover properly 
> when the NM restarts.
> AllocatedContainers,ContainersLaunched,AllocatedGB,AvailableGB,AllocatedVCores,AvailableVCores
>  in metrics also need to be recovered when the NM restarts.
> This should be done in ContainerManagerImpl#recoverContainer.
> The scenario can be reproduced by the following steps:
> # Make sure 
> YarnConfiguration.NM_RECOVERY_ENABLED=true,YarnConfiguration.NM_RECOVERY_SUPERVISED=true
>  in NM
> # Submit an application and keep running
> # Restart NM
> # Stop the application
> # Now you get the negative values
> {code}
> /jmx?qry=Hadoop:service=NodeManager,name=NodeManagerMetrics
> {code}
> {code}
> {
> name: "Hadoop:service=NodeManager,name=NodeManagerMetrics",
> modelerType: "NodeManagerMetrics",
> tag.Context: "yarn",
> tag.Hostname: "hadoop.com",
> ContainersLaunched: 0,
> ContainersCompleted: 0,
> ContainersFailed: 2,
> ContainersKilled: 0,
> ContainersIniting: 0,
> ContainersRunning: 0,
> AllocatedGB: 0,
> AllocatedContainers: -2,
> AvailableGB: 160,
> AllocatedVCores: -11,
> AvailableVCores: 3611,
> ContainerLaunchDurationNumOps: 2,
> ContainerLaunchDurationAvgTime: 6,
> BadLocalDirs: 0,
> BadLogDirs: 0,
> GoodLocalDirsDiskUtilizationPerc: 2,
> GoodLogDirsDiskUtilizationPerc: 2
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6966) NodeManager metrics may return wrong negative values when NM restart

2018-07-25 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556155#comment-16556155
 ] 

Haibo Chen commented on YARN-6966:
--

Yes, please add a patch for branch-3.0.

> NodeManager metrics may return wrong negative values when NM restart
> 
>
> Key: YARN-6966
> URL: https://issues.apache.org/jira/browse/YARN-6966
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yang Wang
>Assignee: Szilard Nemeth
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-6966-branch-2.001.patch, YARN-6966.001.patch, 
> YARN-6966.002.patch, YARN-6966.003.patch, YARN-6966.004.patch, 
> YARN-6966.005.patch, YARN-6966.005.patch, YARN-6966.006.patch
>
>
> Just as YARN-6212. However, I think it is not a duplicate of YARN-3933.
> The primary cause of the negative values is that metrics do not recover properly 
> when the NM restarts.
> AllocatedContainers,ContainersLaunched,AllocatedGB,AvailableGB,AllocatedVCores,AvailableVCores
>  in metrics also need to be recovered when the NM restarts.
> This should be done in ContainerManagerImpl#recoverContainer.
> The scenario can be reproduced by the following steps:
> # Make sure 
> YarnConfiguration.NM_RECOVERY_ENABLED=true,YarnConfiguration.NM_RECOVERY_SUPERVISED=true
>  in NM
> # Submit an application and keep running
> # Restart NM
> # Stop the application
> # Now you get the negative values
> {code}
> /jmx?qry=Hadoop:service=NodeManager,name=NodeManagerMetrics
> {code}
> {code}
> {
> name: "Hadoop:service=NodeManager,name=NodeManagerMetrics",
> modelerType: "NodeManagerMetrics",
> tag.Context: "yarn",
> tag.Hostname: "hadoop.com",
> ContainersLaunched: 0,
> ContainersCompleted: 0,
> ContainersFailed: 2,
> ContainersKilled: 0,
> ContainersIniting: 0,
> ContainersRunning: 0,
> AllocatedGB: 0,
> AllocatedContainers: -2,
> AvailableGB: 160,
> AllocatedVCores: -11,
> AvailableVCores: 3611,
> ContainerLaunchDurationNumOps: 2,
> ContainerLaunchDurationAvgTime: 6,
> BadLocalDirs: 0,
> BadLogDirs: 0,
> GoodLocalDirsDiskUtilizationPerc: 2,
> GoodLogDirsDiskUtilizationPerc: 2
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8517) getContainer and getContainers ResourceManager REST API methods are not documented

2018-07-25 Thread Robert Kanter (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556069#comment-16556069
 ] 

Robert Kanter commented on YARN-8517:
-

Thanks [~bsteinbach] for the patch.  One minor suggestion:
- We should have a short description of the API before the URI for both new 
sections.  You can see that other sections have a description, usually 
something like "With the  API, you can..."

> getContainer and getContainers ResourceManager REST API methods are not 
> documented
> --
>
> Key: YARN-8517
> URL: https://issues.apache.org/jira/browse/YARN-8517
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Szilard Nemeth
>Assignee: Antal Bálint Steinbach
>Priority: Major
>  Labels: newbie, newbie++
> Attachments: YARN-8517.001.patch, YARN-8517.002.patch, 
> YARN-8517.003.patch, YARN-8517.004.patch
>
>
> Looking at the documentation here: 
> https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html
> I cannot find documentation for 2 RM REST endpoints: 
> - /apps/\{appid\}/appattempts/\{appattemptid\}/containers
> - /apps/\{appid\}/appattempts/\{appattemptid\}/containers/\{containerid\}
> I suppose they are not intentionally undocumented.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8578) Failed while trying to construct the redirect url to the log server for Samza applications

2018-07-25 Thread Yuriy Malygin (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuriy Malygin updated YARN-8578:

Description: 
In Timeline web interface I see miscellaneous behavior when clicking on a link 
in columns _ID_ and _Tracking UI_ of one row:
 * from _ID_ go to 
[http://ts-hostname:8188/applicationhistory/app/application_1532439541520_0104] 
where I can see logs
 * from _Tracking UI_ go to 
[http://nn-hostname:8088/cluster/app/application_1532439541520_0104] where I 
redirecting to 
[http://nm-hostname:8042/node/containerlogs/container_e51_1532439541520_0104_01_01/username]
 and see error: 
 *Failed redirect for container_e51_1532439541520_0104_01_01*
 _Failed while trying to construct the redirect url to the log server. Log 
Server url may not be configured_
 _java.lang.Exception: Unknown container. Container either has not started or 
has already completed or doesn't belong to this node at all._

 

Application type of application_1532439541520_0104 is a Samza.

If type is MapReduce both URI works fine and redirects to logs.

 

  was:
In Timeline web interface I see miscellaneous behavior when clicking on a link 
in columns _ID_ and _Tracking UI_ of one row:
 * from _ID_ go to 
[http://ts-hostname:8188/applicationhistory/app/application_1532439541520_0104] 
where I can see logs
 * from _Tracking UI_ go to 
[http://nn-hostname:8088/cluster/app/application_1532439541520_0104] where I 
redirecting to 
[http://nm-hostname:8042/node/containerlogs/container_e51_1532439541520_0104_01_01/username]
 and see error: 
*Failed redirect for container_e51_1532439541520_0104_01_01*
_Failed while trying to construct the redirect url to the log server. Log 
Server url may not be configured_
_java.lang.Exception: Unknown container. Container either has not started or 
has already completed or doesn't belong to this node at all._

 

Application type of application_1532439541520_0104 is Samza.

If type is MapReduce both URI works fine and redirects to logs.

 


> Failed while trying to construct the redirect url to the log server for Samza 
> applications
> --
>
> Key: YARN-8578
> URL: https://issues.apache.org/jira/browse/YARN-8578
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.7.3
>Reporter: Yuriy Malygin
>Priority: Major
>
> In Timeline web interface I see miscellaneous behavior when clicking on a 
> link in columns _ID_ and _Tracking UI_ of one row:
>  * from _ID_ go to 
> [http://ts-hostname:8188/applicationhistory/app/application_1532439541520_0104]
>  where I can see logs
>  * from _Tracking UI_ go to 
> [http://nn-hostname:8088/cluster/app/application_1532439541520_0104] where I 
> redirecting to 
> [http://nm-hostname:8042/node/containerlogs/container_e51_1532439541520_0104_01_01/username]
>  and see error: 
>  *Failed redirect for container_e51_1532439541520_0104_01_01*
>  _Failed while trying to construct the redirect url to the log server. Log 
> Server url may not be configured_
>  _java.lang.Exception: Unknown container. Container either has not started or 
> has already completed or doesn't belong to this node at all._
>  
> Application type of application_1532439541520_0104 is a Samza.
> If type is MapReduce both URI works fine and redirects to logs.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8571) Validate service principal format prior to launching yarn service

2018-07-25 Thread Billie Rinaldi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555853#comment-16555853
 ] 

Billie Rinaldi commented on YARN-8571:
--

Thanks for the patch, [~eyang]. I think this patch would NPE when the principal 
is null, so we should check for that. Otherwise it looks good.
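
A minimal sketch of the kind of guard being suggested (illustrative only, not the 
actual patch code; the format check is just an example of rejecting a user principal):

{code:java}
// Illustrative null-safe check: only validate the principal format when the
// service spec actually supplied a kerberos_principal.principal_name.
public final class PrincipalValidator {
  public static void validate(String principalName) {
    if (principalName == null || principalName.isEmpty()) {
      return; // no principal supplied; skip validation instead of hitting an NPE
    }
    if (!principalName.contains("/")) {
      throw new IllegalArgumentException(
          "kerberos_principal.principal_name should be a service principal of the "
          + "form name/host@REALM, but was: " + principalName);
    }
  }
}
{code}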

> Validate service principal format prior to launching yarn service
> -
>
> Key: YARN-8571
> URL: https://issues.apache.org/jira/browse/YARN-8571
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: security, yarn
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-8571.001.patch
>
>
> Hadoop client and server interaction is designed to validate the service 
> principal before an RPC request is permitted.  In YARN service, the same 
> security model is enforced to prevent replay attacks.  However, an end user 
> might submit JSON that looks like this to the YARN service REST API:
> {code}
> {
>   "name": "sleeper-service",
>   "version": "1.0.0",
>   "components" :
>   [
> {
>   "name": "sleeper",
>   "number_of_containers": 2,
>   "launch_command": "sleep 90",
>   "resource": {
> "cpus": 1,
> "memory": "256"
>   }
> }
>   ],
>   "kerberos_principal" : {
> "principal_name" : "ambari...@example.com",
> "keytab" : "file:///etc/security/keytabs/smokeuser.headless.keytab"
>   }
> }
> {code}
> The kerberos principal here is the end user's kerberos principal instead of a 
> service principal.  This does not work properly because the YARN service application 
> master needs to run with a service principal to communicate with the YARN CLI 
> client via Hadoop RPC.  Without breaking the Hadoop security design in this JIRA, 
> it might be in our best interest to validate principal_name during 
> submission and report an error message when someone tries to run a YARN service 
> with a user principal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps

2018-07-25 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555962#comment-16555962
 ] 

Hudson commented on YARN-4606:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #14641 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/14641/])
YARN-4606. CapacityScheduler: applications could get starved because (ericp: 
rev 9485c9aee6e9bb935c3e6ae4da81d70b621781de)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/UsersManager.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestContainerAllocation.java


> CapacityScheduler: applications could get starved because computation of 
> #activeUsers considers pending apps 
> -
>
> Key: YARN-4606
> URL: https://issues.apache.org/jira/browse/YARN-4606
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, capacityscheduler
>Affects Versions: 2.8.0, 2.7.1
>Reporter: Karam Singh
>Assignee: Manikandan R
>Priority: Critical
> Attachments: YARN-4606.001.patch, YARN-4606.002.patch, 
> YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.005.patch, 
> YARN-4606.006.patch, YARN-4606.007.patch, YARN-4606.1.poc.patch, 
> YARN-4606.POC.2.patch, YARN-4606.POC.3.patch, YARN-4606.POC.patch
>
>
> Currently, if all applications belonging to the same user in a LeafQueue are pending 
> (caused by max-am-percent, etc.), ActiveUsersManager still considers that user 
> an active user. This could lead to starvation of active applications, for 
> example:
> - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to 
> user3)/app4(belongs to user4) are pending
> - ActiveUsersManager returns #active-users=4
> - However, only two users (user1/user2) are able to allocate new 
> resources, so the computed user-limit-resource could be lower than expected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8578) Failed while trying to construct the redirect url to the log server for Samza applications

2018-07-25 Thread Yuriy Malygin (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuriy Malygin updated YARN-8578:

Description: 
In Timeline web interface I see miscellaneous behavior when clicking on a link 
in columns _ID_ and _Tracking UI_ of one row:
 * from _ID_ go to 
[http://ts-hostname:8188/applicationhistory/app/application_1532439541520_0104] 
where I can see logs
 * from _Tracking UI_ go to 
[http://nn-hostname:8088/cluster/app/application_1532439541520_0104] where I 
redirecting to 
[http://nm-hostname:8042/node/containerlogs/container_e51_1532439541520_0104_01_01/username]
 and see error: 
 *Failed redirect for container_e51_1532439541520_0104_01_01*
 _Failed while trying to construct the redirect url to the log server. Log 
Server url may not be configured_
 _java.lang.Exception: Unknown container. Container either has not started or 
has already completed or doesn't belong to this node at all._

 

Application type of application_1532439541520_0104 is a Samza.

If type is MapReduce both URI works fine and redirects to logs - to TS and JHS.

 

  was:
In Timeline web interface I see miscellaneous behavior when clicking on a link 
in columns _ID_ and _Tracking UI_ of one row:
 * from _ID_ go to 
[http://ts-hostname:8188/applicationhistory/app/application_1532439541520_0104] 
where I can see logs
 * from _Tracking UI_ go to 
[http://nn-hostname:8088/cluster/app/application_1532439541520_0104] where I 
redirecting to 
[http://nm-hostname:8042/node/containerlogs/container_e51_1532439541520_0104_01_01/username]
 and see error: 
 *Failed redirect for container_e51_1532439541520_0104_01_01*
 _Failed while trying to construct the redirect url to the log server. Log 
Server url may not be configured_
 _java.lang.Exception: Unknown container. Container either has not started or 
has already completed or doesn't belong to this node at all._

 

Application type of application_1532439541520_0104 is a Samza.

If type is MapReduce both URI works fine and redirects to logs.

 


> Failed while trying to construct the redirect url to the log server for Samza 
> applications
> --
>
> Key: YARN-8578
> URL: https://issues.apache.org/jira/browse/YARN-8578
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.7.3
>Reporter: Yuriy Malygin
>Priority: Major
>
> In Timeline web interface I see miscellaneous behavior when clicking on a 
> link in columns _ID_ and _Tracking UI_ of one row:
>  * from _ID_ go to 
> [http://ts-hostname:8188/applicationhistory/app/application_1532439541520_0104]
>  where I can see logs
>  * from _Tracking UI_ go to 
> [http://nn-hostname:8088/cluster/app/application_1532439541520_0104] where I 
> redirecting to 
> [http://nm-hostname:8042/node/containerlogs/container_e51_1532439541520_0104_01_01/username]
>  and see error: 
>  *Failed redirect for container_e51_1532439541520_0104_01_01*
>  _Failed while trying to construct the redirect url to the log server. Log 
> Server url may not be configured_
>  _java.lang.Exception: Unknown container. Container either has not started or 
> has already completed or doesn't belong to this node at all._
>  
> Application type of application_1532439541520_0104 is a Samza.
> If type is MapReduce both URI works fine and redirects to logs - to TS and 
> JHS.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8578) Failed while trying to construct the redirect url to the log server for Samza applications

2018-07-25 Thread Yuriy Malygin (JIRA)
Yuriy Malygin created YARN-8578:
---

 Summary: Failed while trying to construct the redirect url to the 
log server for Samza applications
 Key: YARN-8578
 URL: https://issues.apache.org/jira/browse/YARN-8578
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Affects Versions: 2.7.3
Reporter: Yuriy Malygin


In the Timeline web interface I see inconsistent behavior when clicking the links in 
the _ID_ and _Tracking UI_ columns of one row:
 * from _ID_ I go to 
[http://ts-hostname:8188/applicationhistory/app/application_1532439541520_0104] 
where I can see the logs
 * from _Tracking UI_ I go to 
[http://nn-hostname:8088/cluster/app/application_1532439541520_0104], which 
redirects me to 
[http://nm-hostname:8042/node/containerlogs/container_e51_1532439541520_0104_01_01/username]
 where I see the error: 
*Failed redirect for container_e51_1532439541520_0104_01_01*
_Failed while trying to construct the redirect url to the log server. Log 
Server url may not be configured_
_java.lang.Exception: Unknown container. Container either has not started or 
has already completed or doesn't belong to this node at all._

The application type of application_1532439541520_0104 is Samza.

If the type is MapReduce, both URIs work fine and redirect to the logs.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4946) RM should write out Aggregated Log Completion file flag next to logs

2018-07-25 Thread Robert Kanter (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555893#comment-16555893
 ] 

Robert Kanter commented on YARN-4946:
-

AFAIK, nothing has changed in this area.  However, I think the flag file is 
going to be a no-go.  I've gotten a _lot_ of pushback in the past when trying 
to have the RM write information to HDFS.  So I think we need to come up with a 
different approach.

The RM remembers X number of applications in order to save on memory and 
RMStateStore space.  This is controlled by 
{{yarn.resourcemanager.max-completed-applications}} and 
{{yarn.resourcemanager.state-store.max-completed-applications}}, respectively; 
and you usually would set them to the same value (in fact, I believe the 
state-store one is set to the other one by default).  For example, if set to 
1000, then when you run 1001 applications, the RM will forget the oldest 
application that is no longer running (i.e. completed, failed), so that it 
never remembers more than 1000 applications - that's what I mean about 
"forgetting."  Those applications can be looked up in the JHS, Spark HS, or etc.

No need to do a failover or HA (though we should test that once at the end to 
be thorough).  You can test this with 
{{yarn.resourcemanager.max-completed-applications}} by setting it to a low 
value like 3 or something.  The RM should not remember more than 3 completed 
applications, so simply run 4 jobs, wait for them to complete, and you'll see 
it.
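
For example, a test-only configuration sketch (property names as above, values chosen 
just to make the behavior easy to observe):

{code:java}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Test-only sketch: cap the RM's completed-application memory at 3 so the
// "forgetting" behavior shows up as soon as a 4th job finishes.
public class LowRetentionConf {
  public static YarnConfiguration build() {
    YarnConfiguration conf = new YarnConfiguration();
    conf.setInt("yarn.resourcemanager.max-completed-applications", 3);
    conf.setInt("yarn.resourcemanager.state-store.max-completed-applications", 3);
    return conf;
  }
}
{code}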

The issue this JIRA is trying to solve is that when you run the tool from 
MAPREDUCE-6415, if it can't find the App in the RM (because the RM forgot it) 
when getting the log aggregation status, it assumes that the aggregation 
completed successfully 
(https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-archive-logs/src/main/java/org/apache/hadoop/tools/HadoopArchiveLogs.java#L350).
  Assuming your cluster and job are working correctly, that's a good assumption, 
but if not, it'll be wrong.  IIRC, that's actually okay if log aggregation has 
reached a terminal state like succeeded or even failed, but it is more of a 
problem if aggregation is still in progress, because we would then 
process partial logs.  So I think we can leave that as-is if we can ensure that the 
RM only forgets apps once they've reached a terminal log aggregation status.  
In other words, the RM should not consider the App truly finished (and thus 
remove it from its history) until the aggregation status has reached a 
terminal state (i.e. DISABLED, SUCCEEDED, FAILED, TIME_OUT).  This should be a 
simpler fix and doesn't require writing anything to HDFS.
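
A rough sketch of the terminal-state test that approach implies (using the 
LogAggregationStatus values named above; the predicate below is an illustration, not 
the actual RM change):

{code:java}
import java.util.EnumSet;
import org.apache.hadoop.yarn.api.records.LogAggregationStatus;

// Illustrative predicate: an application is safe for the RM to "forget" only
// once its log aggregation status is terminal.
public class LogAggregationTerminalCheck {
  private static final EnumSet<LogAggregationStatus> TERMINAL = EnumSet.of(
      LogAggregationStatus.DISABLED,
      LogAggregationStatus.SUCCEEDED,
      LogAggregationStatus.FAILED,
      LogAggregationStatus.TIME_OUT);

  public static boolean safeToForget(LogAggregationStatus status) {
    return status != null && TERMINAL.contains(status);
  }
}
{code}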

> RM should write out Aggregated Log Completion file flag next to logs
> 
>
> Key: YARN-4946
> URL: https://issues.apache.org/jira/browse/YARN-4946
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 2.8.0
>Reporter: Robert Kanter
>Assignee: Szilard Nemeth
>Priority: Major
>
> MAPREDUCE-6415 added a tool that combines the aggregated log files for each 
> Yarn App into a HAR file.  When run, it seeds the list by looking at the 
> aggregated logs directory, and then filters out ineligible apps.  One of the 
> criteria involves checking with the RM that an Application's log aggregation 
> status is not still running and has not failed.  When the RM "forgets" about 
> an older completed Application (e.g. RM failover, enough time has passed, 
> etc), the tool won't find the Application in the RM and will just assume that 
> its log aggregation succeeded, even if it actually failed or is still running.
> We can solve this problem by doing the following:
> # When the RM sees that an Application has successfully finished aggregation 
> its logs, it will write a flag file next to that Application's log files
> # The tool no longer talks to the RM at all.  When looking at the FileSystem, 
> it now uses that flag file to determine if it should process those log files. 
>  If the file is there, it archives, otherwise it does not.
> # As part of the archiving process, it will delete the flag file
> # (If you don't run the tool, the flag file will eventually be cleaned up by 
> the JHS when it cleans up the aggregated logs because it's in the same 
> directory)
> This improvement has several advantages:
> # The edge case about "forgotten" Applications is fixed
> # The tool no longer has to talk to the RM; it only has to consult HDFS.  
> This is simpler



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8574) Allow dot in attribute values

2018-07-25 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1614#comment-1614
 ] 

Bibin A Chundatt edited comment on YARN-8574 at 7/25/18 11:47 AM:
--

Thank you [~Naganarasimha] for the review.

Both the prefix and the value should allow ".", right? Currently the prefix also 
allows yarn.rm.io, yarn.nm.io.
Did you mean namespace?



was (Author: bibinchundatt):
Both prefix and value should allow ".". rt . Currently also prefix allows  
yarn.rm.io, yarn.nm.io.


> Allow dot in attribute values 
> --
>
> Key: YARN-8574
> URL: https://issues.apache.org/jira/browse/YARN-8574
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Attachments: YARN-8574-YARN-3409.001.patch
>
>
> Currently "." is considered as invalid value. Enable  the same;



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8577) Fix the broken anchor in SLS site-doc

2018-07-25 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555735#comment-16555735
 ] 

Hudson commented on YARN-8577:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #14638 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/14638/])
YARN-8577. Fix the broken anchor in SLS site-doc. Contributed by Weiwei 
(bibinchundatt: rev 3d3158cea4580eb2e3b2af635c3a7d30f4dbb873)
* (edit) hadoop-tools/hadoop-sls/src/site/markdown/SchedulerLoadSimulator.md


> Fix the broken anchor in SLS site-doc
> -
>
> Key: YARN-8577
> URL: https://issues.apache.org/jira/browse/YARN-8577
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 2.9.0, 3.0.0, 3.1.0
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Minor
> Fix For: 3.2.0, 3.0.4, 3.1.2
>
> Attachments: HADOOP-15630.001.patch
>
>
> The anchor for section "Synthetic Load Generator" is currently broken.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8575) CapacityScheduler should check node state before committing reserve/allocate proposals

2018-07-25 Thread genericqa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1622#comment-1622
 ] 

genericqa commented on YARN-8575:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
23s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 4 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 27m 
44s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
14s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
47s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 14s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
24s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
28s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
43s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 18s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
23s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
27s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 70m 17s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
23s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}131m 10s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.fair.policies.TestDominantResourceFairnessPolicy
 |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ba1ab08 |
| JIRA Issue | YARN-8575 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12933019/YARN-8575.001.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  shadedclient  findbugs  checkstyle  |
| uname | Linux fadae5e6991e 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 
08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 955f795 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_171 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-YARN-Build/21364/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/21364/testReport/ |
| Max. process+thread count | 877 (vs. ulimit of 1) |
| modules | C: 

[jira] [Commented] (YARN-8558) NM recovery level db not cleaned up properly on container finish

2018-07-25 Thread genericqa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1618#comment-1618
 ] 

genericqa commented on YARN-8558:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
21s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 
14s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
16s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
32s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
10m 32s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
56s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
23s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
 8s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
10m 52s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
20s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 18m 
42s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
25s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 72m 45s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ba1ab08 |
| JIRA Issue | YARN-8558 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12933027/YARN-8558.002.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 98dac162c806 4.4.0-89-generic #112-Ubuntu SMP Mon Jul 31 
19:38:41 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 955f795 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_171 |
| findbugs | v3.1.0-RC1 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/21366/testReport/ |
| Max. process+thread count | 410 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/21366/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> NM recovery level db not cleaned up properly on container 

[jira] [Commented] (YARN-8574) Allow dot in attribute values

2018-07-25 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1614#comment-1614
 ] 

Bibin A Chundatt commented on YARN-8574:


Both prefix and value should allow ".", right? Currently the prefix already 
allows dotted values such as yarn.rm.io and yarn.nm.io.


> Allow dot in attribute values 
> --
>
> Key: YARN-8574
> URL: https://issues.apache.org/jira/browse/YARN-8574
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Attachments: YARN-8574-YARN-3409.001.patch
>
>
> Currently "." is considered an invalid value. Enable the same.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8574) Allow dot in attribute values

2018-07-25 Thread Naganarasimha G R (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555499#comment-16555499
 ] 

Naganarasimha G R commented on YARN-8574:
-

Thanks for the patch [~bibinchundatt], my bad, I had almost forgotten to work 
on it.

One concern I have: I agree that the value should be allowed to contain a dot, 
but not the prefix, so if we use the same pattern for the prefix it will be a 
problem.
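
A small sketch of the separation being discussed; the class name and regexes 
below are illustrative assumptions, not the committed validation rules. Keeping 
independent patterns lets the value rule accept "." without changing whatever 
rule the prefix uses:

{code}
import java.util.regex.Pattern;

// Illustrative only: independent validation for attribute prefix and value.
final class NodeAttributeValidation {
  // prefix keeps a dotted-domain form, e.g. yarn.nm.io / yarn.rm.io (assumed rule)
  private static final Pattern PREFIX =
      Pattern.compile("^[a-z0-9]+(\\.[a-z0-9_-]+)*$");
  // value rule widened so "." is accepted after the first character
  private static final Pattern VALUE =
      Pattern.compile("^[a-zA-Z0-9_][a-zA-Z0-9_.\\-]*$");

  static boolean isValidPrefix(String prefix) {
    return PREFIX.matcher(prefix).matches();
  }

  static boolean isValidValue(String value) {
    return VALUE.matcher(value).matches();
  }
}
{code}

With such a split, a value like "spark-2.3.1" passes isValidValue while the 
prefix check stays untouched.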

 

 

 

> Allow dot in attribute values 
> --
>
> Key: YARN-8574
> URL: https://issues.apache.org/jira/browse/YARN-8574
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Attachments: YARN-8574-YARN-3409.001.patch
>
>
> Currently "." is considered an invalid value. Enable the same.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8577) Fix the broken anchor in SLS site-doc

2018-07-25 Thread genericqa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555489#comment-16555489
 ] 

genericqa commented on YARN-8577:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
12s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 
57s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
26s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
37m 27s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 29s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
25s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 51m 24s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ba1ab08 |
| JIRA Issue | HADOOP-15630 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12933026/HADOOP-15630.001.patch
 |
| Optional Tests |  asflicense  mvnsite  |
| uname | Linux bb9265a7bf90 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 
08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 955f795 |
| maven | version: Apache Maven 3.3.9 |
| Max. process+thread count | 334 (vs. ulimit of 1) |
| modules | C: hadoop-tools/hadoop-sls U: hadoop-tools/hadoop-sls |
| Console output | 
https://builds.apache.org/job/PreCommit-HADOOP-Build/14941/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> Fix the broken anchor in SLS site-doc
> -
>
> Key: YARN-8577
> URL: https://issues.apache.org/jira/browse/YARN-8577
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 2.9.0, 3.0.0, 3.1.0
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Minor
> Attachments: HADOOP-15630.001.patch
>
>
> The anchor for section "Synthetic Load Generator" is currently broken.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-4946) RM should write out Aggregated Log Completion file flag next to logs

2018-07-25 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555607#comment-16555607
 ] 

Szilard Nemeth edited comment on YARN-4946 at 7/25/18 12:58 PM:


Hi [~rkanter]!
I'm trying to pick this one up.
Since this was created a long time ago, I suppose the RM could now behave 
differently regarding "forgotten" applications, e.g. maybe it was improved over 
time.
Could you please give me some hints on how to test whether the RM still forgets 
applications?
Would it be enough to set up the RM with HA, start one application, wait for 
its completion and do an RM failover, or does this involve more complex steps?

Could you also give me some insight into how an application can be "forgotten" 
if enough time passes, or into any other cases that can lead to the same 
situation?
Thanks!


was (Author: snemeth):
Hi [~rkanter]!
I'm trying to pick this one up.
Since this was created a long time ago, I suppose the RM could now behave 
differently regarding "forgotten" applications, e.g. maybe it was improved over 
time.
Could you please give me some hints on how to test whether the RM still forgets 
applications?
Would it be enough to set up the RM with HA, start one application, wait for 
its completion and do an RM failover, or does this involve more complex steps?
Thanks!

> RM should write out Aggregated Log Completion file flag next to logs
> 
>
> Key: YARN-4946
> URL: https://issues.apache.org/jira/browse/YARN-4946
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 2.8.0
>Reporter: Robert Kanter
>Assignee: Szilard Nemeth
>Priority: Major
>
> MAPREDUCE-6415 added a tool that combines the aggregated log files for each 
> Yarn App into a HAR file.  When run, it seeds the list by looking at the 
> aggregated logs directory, and then filters out ineligible apps.  One of the 
> criteria involves checking with the RM that an Application's log aggregation 
> status is not still running and has not failed.  When the RM "forgets" about 
> an older completed Application (e.g. RM failover, enough time has passed, 
> etc), the tool won't find the Application in the RM and will just assume that 
> its log aggregation succeeded, even if it actually failed or is still running.
> We can solve this problem by doing the following:
> # When the RM sees that an Application has successfully finished aggregating 
> its logs, it will write a flag file next to that Application's log files
> # The tool no longer talks to the RM at all.  When looking at the FileSystem, 
> it now uses that flag file to determine if it should process those log files. 
>  If the file is there, it archives, otherwise it does not.
> # As part of the archiving process, it will delete the flag file
> # (If you don't run the tool, the flag file will eventually be cleaned up by 
> the JHS when it cleans up the aggregated logs because it's in the same 
> directory)
> This improvement has several advantages:
> # The edge case about "forgotten" Applications is fixed
> # The tool no longer has to talk to the RM; it only has to consult HDFS.  
> This is simpler
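
A minimal sketch of step 2 of the list above, assuming a marker file name and 
helper class that are purely illustrative (the eventual patch may use different 
names and paths):

{code}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: decide eligibility from a flag file that the RM is
// assumed to write next to the application's aggregated logs (step 1).
public final class AggregationFlagChecker {
  // marker file name is an assumption for illustration only
  private static final String FLAG_FILE = "_AGGREGATION_SUCCEEDED";

  private AggregationFlagChecker() {
  }

  public static boolean shouldArchive(FileSystem fs, Path appLogDir)
      throws IOException {
    // archive only when the RM has marked log aggregation as complete
    return fs.exists(new Path(appLogDir, FLAG_FILE));
  }
}
{code}

The archiving step would then delete the flag file once the HAR has been 
written, matching step 3 of the list.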



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4946) RM should write out Aggregated Log Completion file flag next to logs

2018-07-25 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555607#comment-16555607
 ] 

Szilard Nemeth commented on YARN-4946:
--

Hi [~rkanter]!
I'm trying to pick this one up.
Since this was created a long time ago, I suppose the RM could now behave 
differently regarding "forgotten" applications, e.g. maybe it was improved over 
time.
Could you please give me some hints on how to test whether the RM still forgets 
applications?
Would it be enough to set up the RM with HA, start one application, wait for 
its completion and do an RM failover, or does this involve more complex steps?
Thanks!

> RM should write out Aggregated Log Completion file flag next to logs
> 
>
> Key: YARN-4946
> URL: https://issues.apache.org/jira/browse/YARN-4946
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 2.8.0
>Reporter: Robert Kanter
>Assignee: Szilard Nemeth
>Priority: Major
>
> MAPREDUCE-6415 added a tool that combines the aggregated log files for each 
> Yarn App into a HAR file.  When run, it seeds the list by looking at the 
> aggregated logs directory, and then filters out ineligible apps.  One of the 
> criteria involves checking with the RM that an Application's log aggregation 
> status is not still running and has not failed.  When the RM "forgets" about 
> an older completed Application (e.g. RM failover, enough time has passed, 
> etc), the tool won't find the Application in the RM and will just assume that 
> its log aggregation succeeded, even if it actually failed or is still running.
> We can solve this problem by doing the following:
> # When the RM sees that an Application has successfully finished aggregating 
> its logs, it will write a flag file next to that Application's log files
> # The tool no longer talks to the RM at all.  When looking at the FileSystem, 
> it now uses that flag file to determine if it should process those log files. 
>  If the file is there, it archives, otherwise it does not.
> # As part of the archiving process, it will delete the flag file
> # (If you don't run the tool, the flag file will eventually be cleaned up by 
> the JHS when it cleans up the aggregated logs because it's in the same 
> directory)
> This improvement has several advantages:
> # The edge case about "forgotten" Applications is fixed
> # The tool no longer has to talk to the RM; it only has to consult HDFS.  
> This is simpler



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-4946) RM should write out Aggregated Log Completion file flag next to logs

2018-07-25 Thread Szilard Nemeth (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth reassigned YARN-4946:


Assignee: Szilard Nemeth

> RM should write out Aggregated Log Completion file flag next to logs
> 
>
> Key: YARN-4946
> URL: https://issues.apache.org/jira/browse/YARN-4946
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 2.8.0
>Reporter: Robert Kanter
>Assignee: Szilard Nemeth
>Priority: Major
>
> MAPREDUCE-6415 added a tool that combines the aggregated log files for each 
> Yarn App into a HAR file.  When run, it seeds the list by looking at the 
> aggregated logs directory, and then filters out ineligible apps.  One of the 
> criteria involves checking with the RM that an Application's log aggregation 
> status is not still running and has not failed.  When the RM "forgets" about 
> an older completed Application (e.g. RM failover, enough time has passed, 
> etc), the tool won't find the Application in the RM and will just assume that 
> its log aggregation succeeded, even if it actually failed or is still running.
> We can solve this problem by doing the following:
> # When the RM sees that an Application has successfully finished aggregating 
> its logs, it will write a flag file next to that Application's log files
> # The tool no longer talks to the RM at all.  When looking at the FileSystem, 
> it now uses that flag file to determine if it should process those log files. 
>  If the file is there, it archives, otherwise it does not.
> # As part of the archiving process, it will delete the flag file
> # (If you don't run the tool, the flag file will eventually be cleaned up by 
> the JHS when it cleans up the aggregated logs because it's in the same 
> directory)
> This improvement has several advantages:
> # The edge case about "forgotten" Applications is fixed
> # The tool no longer has to talk to the RM; it only has to consult HDFS.  
> This is simpler



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6966) NodeManager metrics may return wrong negative values when NM restart

2018-07-25 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555480#comment-16555480
 ] 

Szilard Nemeth commented on YARN-6966:
--

Hi [~haibochen]!
Reopened and moved this issue to Patch Available as I think Yetus won't pick 
this up otherwise.
Added the patch for branch-2.
Should I add another patch to branch-3.0?
Thanks!

> NodeManager metrics may return wrong negative values when NM restart
> 
>
> Key: YARN-6966
> URL: https://issues.apache.org/jira/browse/YARN-6966
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yang Wang
>Assignee: Szilard Nemeth
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-6966-branch-2.001.patch, YARN-6966.001.patch, 
> YARN-6966.002.patch, YARN-6966.003.patch, YARN-6966.004.patch, 
> YARN-6966.005.patch, YARN-6966.005.patch, YARN-6966.006.patch
>
>
> Just as YARN-6212. However, I think it is not a duplicate of YARN-3933.
> The primary cause of negative values is that metrics do not recover properly 
> when NM restart.
> AllocatedContainers,ContainersLaunched,AllocatedGB,AvailableGB,AllocatedVCores,AvailableVCores
>  in metrics also need to recover when NM restart.
> This should be done in ContainerManagerImpl#recoverContainer.
> The scenario can be reproduced with the following steps:
> # Make sure 
> YarnConfiguration.NM_RECOVERY_ENABLED=true,YarnConfiguration.NM_RECOVERY_SUPERVISED=true
>  in NM
> # Submit an application and keep running
> # Restart NM
> # Stop the application
> # Now you get the negative values
> {code}
> /jmx?qry=Hadoop:service=NodeManager,name=NodeManagerMetrics
> {code}
> {code}
> {
> name: "Hadoop:service=NodeManager,name=NodeManagerMetrics",
> modelerType: "NodeManagerMetrics",
> tag.Context: "yarn",
> tag.Hostname: "hadoop.com",
> ContainersLaunched: 0,
> ContainersCompleted: 0,
> ContainersFailed: 2,
> ContainersKilled: 0,
> ContainersIniting: 0,
> ContainersRunning: 0,
> AllocatedGB: 0,
> AllocatedContainers: -2,
> AvailableGB: 160,
> AllocatedVCores: -11,
> AvailableVCores: 3611,
> ContainerLaunchDurationNumOps: 2,
> ContainerLaunchDurationAvgTime: 6,
> BadLocalDirs: 0,
> BadLogDirs: 0,
> GoodLocalDirsDiskUtilizationPerc: 2,
> GoodLogDirsDiskUtilizationPerc: 2
> }
> {code}
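
A rough sketch of the recovery hook the description calls for in 
ContainerManagerImpl#recoverContainer; the class below and the exact metrics 
helpers are assumptions for illustration rather than the real patch:

{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.server.nodemanager.metrics.NodeManagerMetrics;

// Hypothetical sketch: while replaying a live container from the NM state
// store, add its resources back into the NodeManager metrics so the later
// release path has matching values to subtract (avoiding negative gauges).
final class RecoveredContainerMetrics {
  private RecoveredContainerMetrics() {
  }

  static void applyTo(NodeManagerMetrics metrics, Resource resource) {
    metrics.launchedContainer();          // assumed counter helper
    metrics.allocateContainer(resource);  // assumed gauge helper
  }
}
{code}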



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6966) NodeManager metrics may return wrong negative values when NM restart

2018-07-25 Thread Szilard Nemeth (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-6966:
-
Target Version/s: 2.10.0, 3.2.0, 3.1.2

> NodeManager metrics may return wrong negative values when NM restart
> 
>
> Key: YARN-6966
> URL: https://issues.apache.org/jira/browse/YARN-6966
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yang Wang
>Assignee: Szilard Nemeth
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-6966-branch-2.001.patch, YARN-6966.001.patch, 
> YARN-6966.002.patch, YARN-6966.003.patch, YARN-6966.004.patch, 
> YARN-6966.005.patch, YARN-6966.005.patch, YARN-6966.006.patch
>
>
> Just as YARN-6212. However, I think it is not a duplicate of YARN-3933.
> The primary cause of negative values is that metrics do not recover properly 
> when NM restart.
> AllocatedContainers,ContainersLaunched,AllocatedGB,AvailableGB,AllocatedVCores,AvailableVCores
>  in metrics also need to recover when NM restart.
> This should be done in ContainerManagerImpl#recoverContainer.
> The scenario can be reproduced with the following steps:
> # Make sure 
> YarnConfiguration.NM_RECOVERY_ENABLED=true,YarnConfiguration.NM_RECOVERY_SUPERVISED=true
>  in NM
> # Submit an application and keep running
> # Restart NM
> # Stop the application
> # Now you get the negative values
> {code}
> /jmx?qry=Hadoop:service=NodeManager,name=NodeManagerMetrics
> {code}
> {code}
> {
> name: "Hadoop:service=NodeManager,name=NodeManagerMetrics",
> modelerType: "NodeManagerMetrics",
> tag.Context: "yarn",
> tag.Hostname: "hadoop.com",
> ContainersLaunched: 0,
> ContainersCompleted: 0,
> ContainersFailed: 2,
> ContainersKilled: 0,
> ContainersIniting: 0,
> ContainersRunning: 0,
> AllocatedGB: 0,
> AllocatedContainers: -2,
> AvailableGB: 160,
> AllocatedVCores: -11,
> AvailableVCores: 3611,
> ContainerLaunchDurationNumOps: 2,
> ContainerLaunchDurationAvgTime: 6,
> BadLocalDirs: 0,
> BadLogDirs: 0,
> GoodLocalDirsDiskUtilizationPerc: 2,
> GoodLogDirsDiskUtilizationPerc: 2
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6307) Refactor FairShareComparator#compare

2018-07-25 Thread stefanlee (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1626#comment-1626
 ] 

stefanlee commented on YARN-6307:
-

Thanks for this jira, [~yufeigu] [~templedf]. I have a doubt: what is the 
difference between *fair share* in _FairSharePolicy#compare_ and *fair share* 
in _FairSharePolicy#computeShares_? I think the latter is related to 
*preemption*; the distinction is hard to follow.

> Refactor FairShareComparator#compare
> 
>
> Key: YARN-6307
> URL: https://issues.apache.org/jira/browse/YARN-6307
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: Yufei Gu
>Assignee: Yufei Gu
>Priority: Major
> Fix For: 2.9.0, 3.0.0-beta1
>
> Attachments: YARN-6307.001.patch, YARN-6307.002.patch, 
> YARN-6307.003.patch
>
>
> The method does three things: compare min share usage, compare fair share 
> usage by checking the weight ratio, and break ties by submit time and name. 
> These steps are mixed together, which makes the method hard to read and 
> maintain. Additionally, there are potential performance issues: for example, 
> there is no need to check the weight ratio if the min share usage comparison 
> already determines the order. This is worth improving given the huge number 
> of invocations in the scheduler.
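
A hedged sketch of the split the description suggests; the extracted helper 
names are illustrative and not necessarily the committed patch. Each step is 
isolated, and a later, more expensive comparison runs only when the earlier 
ones tie:

{code}
// Illustrative refactor only: min-share usage first, then weighted fair-share
// usage, and finally the tie-breakers on start time and name.
public int compare(Schedulable s1, Schedulable s2) {
  int res = compareMinShareUsage(s1, s2);   // assumed extracted helper
  if (res == 0) {
    res = compareFairShareUsage(s1, s2);    // weight-ratio check, assumed helper
  }
  if (res == 0) {
    res = Long.compare(s1.getStartTime(), s2.getStartTime());
  }
  if (res == 0) {
    res = s1.getName().compareTo(s2.getName());
  }
  return res;
}
{code}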



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8558) NM recovery level db not cleaned up properly on container finish

2018-07-25 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555396#comment-16555396
 ] 

Sunil Govindan commented on YARN-8558:
--

Yes, that makes sense. However, the naming convention for 
CONTAINER_TOKENS_CURRENT_MASTER_KEY etc. is confusing, as it looks like a 
per-container value. In theory, it is common to all containers managed by the 
manager for a day. So could we rename this to avoid the confusion?

> NM recovery level db not cleaned up properly on container finish
> 
>
> Key: YARN-8558
> URL: https://issues.apache.org/jira/browse/YARN-8558
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-8558.001.patch
>
>
> {code}
> 2018-07-20 16:49:23,117 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
>  Application application_1531994217928_0054 transitioned from NEW to INITING
> 2018-07-20 16:49:23,204 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_18 with incomplete 
> records
> 2018-07-20 16:49:23,204 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_19 with incomplete 
> records
> 2018-07-20 16:49:23,204 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_20 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_21 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_22 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_23 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_24 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_25 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_38 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_39 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_41 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_44 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_46 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_49 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_52 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_54 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_73 with incomplete 
> records
> 2018-07-20 16:49:23,207 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_74 with incomplete 
> records
> 2018-07-20 16:49:23,207 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_75 with incomplete 
> 

[jira] [Commented] (YARN-8577) Fix the broken anchor in SLS site-doc

2018-07-25 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1605#comment-1605
 ] 

Bibin A Chundatt commented on YARN-8577:


+1, LGTM.

Will commit it soon.

> Fix the broken anchor in SLS site-doc
> -
>
> Key: YARN-8577
> URL: https://issues.apache.org/jira/browse/YARN-8577
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 2.9.0, 3.0.0, 3.1.0
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Minor
> Attachments: HADOOP-15630.001.patch
>
>
> The anchor for section "Synthetic Load Generator" is currently broken.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8546) Resource leak caused by a reserved container being released more than once under async scheduling

2018-07-25 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1601#comment-1601
 ] 

Bibin A Chundatt commented on YARN-8546:


[~cheersyang]

branch-3.1.1 has been created, so the fix version should be 3.1.2.

> Resource leak caused by a reserved container being released more than once 
> under async scheduling
> -
>
> Key: YARN-8546
> URL: https://issues.apache.org/jira/browse/YARN-8546
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Weiwei Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: global-scheduling
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8546.001.patch
>
>
> I was able to reproduce this issue by starting a job that keeps requesting 
> containers until it uses up the cluster's available resources. My cluster has 
> 70200 vcores and each task requests 100 vcores, so I expected a total of 702 
> containers to be allocated, but eventually there were only 701. The last 
> container could not be allocated because the queue's used resource was 
> updated to more than 100%.
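
The summary names the mechanism rather than the symptom: under async scheduling 
the same reserved container can be released more than once, which corrupts the 
queue's resource accounting. A minimal, hypothetical sketch of an idempotence 
guard that lets a release be applied at most once (illustrative only, not the 
actual FiCaSchedulerApp change):

{code}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.hadoop.yarn.api.records.ContainerId;

// Hypothetical guard: remember reserved containers whose release has already
// been applied, so a stale asynchronous proposal cannot apply it again.
final class ReservedReleaseGuard {
  private final Set<ContainerId> released = ConcurrentHashMap.newKeySet();

  /** Returns true only for the first release attempt of a given container. */
  boolean markReleased(ContainerId containerId) {
    return released.add(containerId);
  }
}
{code}

Callers would skip the resource-accounting update whenever markReleased returns 
false.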



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8577) Fix the broken anchor in SLS site-doc

2018-07-25 Thread genericqa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555495#comment-16555495
 ] 

genericqa commented on YARN-8577:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
20s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 24m 
36s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
26s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
35m 37s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
23s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m  4s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
24s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 49m 20s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ba1ab08 |
| JIRA Issue | YARN-8577 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12933026/HADOOP-15630.001.patch
 |
| Optional Tests |  asflicense  mvnsite  |
| uname | Linux 0aefa8e5efa7 4.4.0-130-generic #156-Ubuntu SMP Thu Jun 14 
08:53:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 955f795 |
| maven | version: Apache Maven 3.3.9 |
| Max. process+thread count | 399 (vs. ulimit of 1) |
| modules | C: hadoop-tools/hadoop-sls U: hadoop-tools/hadoop-sls |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/21365/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> Fix the broken anchor in SLS site-doc
> -
>
> Key: YARN-8577
> URL: https://issues.apache.org/jira/browse/YARN-8577
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 2.9.0, 3.0.0, 3.1.0
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Minor
> Attachments: HADOOP-15630.001.patch
>
>
> The anchor for section "Synthetic Load Generator" is currently broken.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8541) RM startup failure on recovery after user deletion

2018-07-25 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555494#comment-16555494
 ] 

Bibin A Chundatt commented on YARN-8541:


Thank you [~sunilg] and [~suma.shivaprasad] for the review, and [~jj336013] for 
reporting the issue.

> RM startup failure on recovery after user deletion
> --
>
> Key: YARN-8541
> URL: https://issues.apache.org/jira/browse/YARN-8541
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.0
>Reporter: yimeng
>Assignee: Bibin A Chundatt
>Priority: Blocker
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8541-branch-3.1.003.patch, YARN-8541.001.patch, 
> YARN-8541.002.patch, YARN-8541.003.patch
>
>
> My hadoop version is 3.1.0. I found a problem where RM startup fails on 
> recovery, with the following test steps:
> 1. Create a user "user1" that has permission to submit apps.
> 2. Use user1 to submit a job and wait for the job to finish.
> 3. Delete user "user1".
> 4. Restart YARN.
> 5. The RM restart fails.
> RM logs:
> 2018-07-16 16:24:59,708 | INFO | main-EventThread | Initialized root queue 
> root: numChildQueue= 3, capacity=1.0, absoluteCapacity=1.0, 
> usedResources=usedCapacity=0.0, numApps=0, 
> numContainers=0 | CapacitySchedulerQueueManager.java:163
> 2018-07-16 16:24:59,708 | INFO | main-EventThread | Initialized queue 
> mappings, override: false | UserGroupMappingPlacementRule.java:232
> 2018-07-16 16:24:59,708 | INFO | main-EventThread | Initialized 
> CapacityScheduler with calculator=class 
> org.apache.hadoop.yarn.util.resource.DominantResourceCalculator, 
> minimumAllocation=<>, maximumAllocation=< vCores:32>>, asynchronousScheduling=false, asyncScheduleInterval=5ms | 
> CapacityScheduler.java:392
> 2018-07-16 16:24:59,709 | INFO | main-EventThread | dynamic-resources.xml not 
> found | Configuration.java:2767
> 2018-07-16 16:24:59,709 | INFO | main-EventThread | Initializing AMS 
> Processing chain. Root 
> Processor=[org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor].
>  | AMSProcessingChain.java:62
> 2018-07-16 16:24:59,709 | INFO | main-EventThread | disabled placement 
> handler will be used, all scheduling requests will be rejected. | 
> ApplicationMasterService.java:130
> 2018-07-16 16:24:59,709 | INFO | main-EventThread | Adding 
> [org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor]
>  tp top of AMS Processing chain. | AMSProcessingChain.java:75
> 2018-07-16 16:24:59,713 | WARN | main-EventThread | Exception handling the 
> winning of election | ActiveStandbyElector.java:897
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
>  at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:893)
>  at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:473)
>  at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:728)
>  at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:600)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when 
> transitioning to Active mode
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:325)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
>  ... 4 more
> Caused by: org.apache.hadoop.service.ServiceStateException: 
> org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application 
> application_1531624956005_0001 submitted by user super reason: No groups 
> found for user super
>  at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1204)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1245)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1241)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1686)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1241)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:320)
>  ... 5 more
> Caused 

[jira] [Updated] (YARN-8575) CapacityScheduler should check node state before committing reserve/allocate proposals

2018-07-25 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-8575:
---
Attachment: YARN-8575.001.patch

> CapacityScheduler should check node state before committing reserve/allocate 
> proposals
> --
>
> Key: YARN-8575
> URL: https://issues.apache.org/jira/browse/YARN-8575
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.0, 3.1.2
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-8575.001.patch
>
>
> Recently we found a new error as follows: 
> {noformat}
> ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  node to unreserve doesn't exist, nodeid: host1:45454
> {noformat}
> Reproduce this problem:
> (1) Create a reserve proposal for app1 on node1
> (2) node1 is successfully decommissioned and removed from node tracker
> (3) Try to commit this outdated reserve proposal, it will be accepted and 
> applied.
> This error may occur after decommissioning some NMs. The application that 
> prints the error log will always have a reserved container on a non-existent 
> (decommissioned) NM, and its pending request will never be satisfied.
> To solve this problem, scheduler should check node state in 
> FiCaSchedulerApp#accept to avoid committing outdated proposals on unusable 
> nodes. 
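
A minimal sketch of the proposed guard; the lookup helper and its exact 
placement are assumptions for illustration (per the description, the real check 
would live in FiCaSchedulerApp#accept):

{code}
// Hypothetical guard in the commit path: a reserve/allocate proposal that was
// built before its node left the node tracker must be rejected, not applied.
private boolean nodeStillUsable(NodeId nodeId) {
  FiCaSchedulerNode node = scheduler.getNode(nodeId);  // assumed lookup helper
  // null means the node was removed (e.g. decommissioned) after the proposal
  // was created, so committing it would leave a dangling reservation
  return node != null;
}
{code}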



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8575) CapacityScheduler should check node state before committing reserve/allocate proposals

2018-07-25 Thread Tao Yang (JIRA)
Tao Yang created YARN-8575:
--

 Summary: CapacityScheduler should check node state before 
committing reserve/allocate proposals
 Key: YARN-8575
 URL: https://issues.apache.org/jira/browse/YARN-8575
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 3.2.0, 3.1.2
Reporter: Tao Yang
Assignee: Tao Yang


Recently we found a new error as follows: 
{noformat}
ERROR 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
 node to unreserve doesn't exist, nodeid: host1:45454
{noformat}
Reproduce this problem:
(1) Create a reserve proposal for app1 on node1
(2) node1 is successfully decommissioned and removed from node tracker
(3) Try to commit this outdated reserve proposal, it will be accepted and 
applied.
This error may occur after decommissioning some NMs. The application that 
prints the error log will always have a reserved container on a non-existent 
(decommissioned) NM, and its pending request will never be satisfied.
To solve this problem, scheduler should check node state in 
FiCaSchedulerApp#accept to avoid committing outdated proposals on unusable 
nodes. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8574) Allow dot in attribute values

2018-07-25 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555360#comment-16555360
 ] 

Bibin A Chundatt commented on YARN-8574:


[~sunil.gov...@gmail.com] 

Please review.

> Allow dot in attribute values 
> --
>
> Key: YARN-8574
> URL: https://issues.apache.org/jira/browse/YARN-8574
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Attachments: YARN-8574-YARN-3409.001.patch
>
>
> Currently "." is considered an invalid value. Enable the same.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8574) Allow dot in attribute values

2018-07-25 Thread Bibin A Chundatt (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-8574:
---
Attachment: YARN-8574-YARN-3409.001.patch

> Allow dot in attribute values 
> --
>
> Key: YARN-8574
> URL: https://issues.apache.org/jira/browse/YARN-8574
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Attachments: YARN-8574-YARN-3409.001.patch
>
>
> Currently "." is considered an invalid value. Enable the same.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Reopened] (YARN-6966) NodeManager metrics may return wrong negative values when NM restart

2018-07-25 Thread Szilard Nemeth (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth reopened YARN-6966:
--

> NodeManager metrics may return wrong negative values when NM restart
> 
>
> Key: YARN-6966
> URL: https://issues.apache.org/jira/browse/YARN-6966
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yang Wang
>Assignee: Szilard Nemeth
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-6966-branch-2.001.patch, YARN-6966.001.patch, 
> YARN-6966.002.patch, YARN-6966.003.patch, YARN-6966.004.patch, 
> YARN-6966.005.patch, YARN-6966.005.patch, YARN-6966.006.patch
>
>
> Just as YARN-6212. However, I think it is not a duplicate of YARN-3933.
> The primary cause of negative values is that metrics do not recover properly 
> when NM restart.
> AllocatedContainers,ContainersLaunched,AllocatedGB,AvailableGB,AllocatedVCores,AvailableVCores
>  in metrics also need to recover when NM restart.
> This should be done in ContainerManagerImpl#recoverContainer.
> The scenario can be reproduced with the following steps:
> # Make sure 
> YarnConfiguration.NM_RECOVERY_ENABLED=true,YarnConfiguration.NM_RECOVERY_SUPERVISED=true
>  in NM
> # Submit an application and keep running
> # Restart NM
> # Stop the application
> # Now you get the negative values
> {code}
> /jmx?qry=Hadoop:service=NodeManager,name=NodeManagerMetrics
> {code}
> {code}
> {
> name: "Hadoop:service=NodeManager,name=NodeManagerMetrics",
> modelerType: "NodeManagerMetrics",
> tag.Context: "yarn",
> tag.Hostname: "hadoop.com",
> ContainersLaunched: 0,
> ContainersCompleted: 0,
> ContainersFailed: 2,
> ContainersKilled: 0,
> ContainersIniting: 0,
> ContainersRunning: 0,
> AllocatedGB: 0,
> AllocatedContainers: -2,
> AvailableGB: 160,
> AllocatedVCores: -11,
> AvailableVCores: 3611,
> ContainerLaunchDurationNumOps: 2,
> ContainerLaunchDurationAvgTime: 6,
> BadLocalDirs: 0,
> BadLogDirs: 0,
> GoodLocalDirsDiskUtilizationPerc: 2,
> GoodLogDirsDiskUtilizationPerc: 2
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6966) NodeManager metrics may return wrong negative values when NM restart

2018-07-25 Thread Szilard Nemeth (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-6966:
-
Attachment: YARN-6966-branch-2.001.patch

> NodeManager metrics may return wrong negative values when NM restart
> 
>
> Key: YARN-6966
> URL: https://issues.apache.org/jira/browse/YARN-6966
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yang Wang
>Assignee: Szilard Nemeth
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-6966-branch-2.001.patch, YARN-6966.001.patch, 
> YARN-6966.002.patch, YARN-6966.003.patch, YARN-6966.004.patch, 
> YARN-6966.005.patch, YARN-6966.005.patch, YARN-6966.006.patch
>
>
> Just as YARN-6212. However, I think it is not a duplicate of YARN-3933.
> The primary cause of negative values is that metrics do not recover properly 
> when NM restart.
> AllocatedContainers,ContainersLaunched,AllocatedGB,AvailableGB,AllocatedVCores,AvailableVCores
>  in metrics also need to recover when NM restart.
> This should be done in ContainerManagerImpl#recoverContainer.
> The scenario can be reproduced with the following steps:
> # Make sure 
> YarnConfiguration.NM_RECOVERY_ENABLED=true,YarnConfiguration.NM_RECOVERY_SUPERVISED=true
>  in NM
> # Submit an application and keep running
> # Restart NM
> # Stop the application
> # Now you get the negative values
> {code}
> /jmx?qry=Hadoop:service=NodeManager,name=NodeManagerMetrics
> {code}
> {code}
> {
> name: "Hadoop:service=NodeManager,name=NodeManagerMetrics",
> modelerType: "NodeManagerMetrics",
> tag.Context: "yarn",
> tag.Hostname: "hadoop.com",
> ContainersLaunched: 0,
> ContainersCompleted: 0,
> ContainersFailed: 2,
> ContainersKilled: 0,
> ContainersIniting: 0,
> ContainersRunning: 0,
> AllocatedGB: 0,
> AllocatedContainers: -2,
> AvailableGB: 160,
> AllocatedVCores: -11,
> AvailableVCores: 3611,
> ContainerLaunchDurationNumOps: 2,
> ContainerLaunchDurationAvgTime: 6,
> BadLocalDirs: 0,
> BadLogDirs: 0,
> GoodLocalDirsDiskUtilizationPerc: 2,
> GoodLogDirsDiskUtilizationPerc: 2
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8546) Resource leak caused by a reserved container being released more than once under async scheduling

2018-07-25 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555469#comment-16555469
 ] 

Weiwei Yang commented on YARN-8546:
---

Thanks [~Tao Yang] for the contribution, I've committed this to trunk and 
branch-3.1.

> Resource leak caused by a reserved container being released more than once 
> under async scheduling
> -
>
> Key: YARN-8546
> URL: https://issues.apache.org/jira/browse/YARN-8546
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Weiwei Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: global-scheduling
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8546.001.patch
>
>
> I was able to reproduce this issue by starting a job that keeps requesting 
> containers until it uses up the cluster's available resources. My cluster has 
> 70200 vcores and each task requests 100 vcores, so I expected a total of 702 
> containers to be allocated, but eventually there were only 701. The last 
> container could not be allocated because the queue's used resource was 
> updated to more than 100%.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8546) Resource leak caused by a reserved container being released more than once under async scheduling

2018-07-25 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555461#comment-16555461
 ] 

Hudson commented on YARN-8546:
--

FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #14635 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/14635/])
YARN-8546. Resource leak caused by a reserved container being released (wwei: 
rev 5be9f4a5d05c9cb99348719fe35626b1de3055db)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerApp.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacitySchedulerAsyncScheduling.java


> Resource leak caused by a reserved container being released more than once 
> under async scheduling
> -
>
> Key: YARN-8546
> URL: https://issues.apache.org/jira/browse/YARN-8546
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Weiwei Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: global-scheduling
> Attachments: YARN-8546.001.patch
>
>
> I was able to reproduce this issue by starting a job that keeps requesting 
> containers until it uses up the cluster's available resources. My cluster has 
> 70200 vcores and each task requests 100 vcores, so I expected a total of 702 
> containers to be allocated, but eventually there were only 701. The last 
> container could not be allocated because the queue's used resource was updated 
> to more than 100%.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Moved] (YARN-8577) Fix the broken anchor in SLS site-doc

2018-07-25 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang moved HADOOP-15630 to YARN-8577:


Affects Version/s: (was: 3.1.0)
   (was: 3.0.0)
   (was: 2.9.0)
   2.9.0
   3.0.0
   3.1.0
  Component/s: (was: documentation)
   documentation
  Key: YARN-8577  (was: HADOOP-15630)
  Project: Hadoop YARN  (was: Hadoop Common)

> Fix the broken anchor in SLS site-doc
> -
>
> Key: YARN-8577
> URL: https://issues.apache.org/jira/browse/YARN-8577
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 3.1.0, 3.0.0, 2.9.0
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Minor
> Attachments: HADOOP-15630.001.patch
>
>
> The anchor for section "Synthetic Load Generator" is currently broken.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8546) Resource leak caused by a reserved container being released more than once under async scheduling

2018-07-25 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-8546:
--
Summary: Resource leak caused by a reserved container being released more 
than once under async scheduling  (was: A reserved container might be released 
multiple times under async scheduling)

> Resource leak caused by a reserved container being released more than once 
> under async scheduling
> -
>
> Key: YARN-8546
> URL: https://issues.apache.org/jira/browse/YARN-8546
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Weiwei Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: global-scheduling
> Attachments: YARN-8546.001.patch
>
>
> I was able to reproduce this issue by starting a job that keeps requesting 
> containers until it uses up the cluster's available resources. My cluster has 
> 70200 vcores and each task requests 100 vcores, so I expected a total of 702 
> containers to be allocated, but eventually there were only 701. The last 
> container could not be allocated because the queue's used resource was updated 
> to more than 100%.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8558) NM recovery level db not cleaned up properly on container finish

2018-07-25 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555436#comment-16555436
 ] 

Bibin A Chundatt commented on YARN-8558:


[~sunilg]

Makes sense. Renamed the variable and uploaded a patch handling the same.

> NM recovery level db not cleaned up properly on container finish
> 
>
> Key: YARN-8558
> URL: https://issues.apache.org/jira/browse/YARN-8558
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-8558.001.patch, YARN-8558.002.patch
>
>
> {code}
> 2018-07-20 16:49:23,117 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
>  Application application_1531994217928_0054 transitioned from NEW to INITING
> 2018-07-20 16:49:23,204 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_18 with incomplete 
> records
> 2018-07-20 16:49:23,204 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_19 with incomplete 
> records
> 2018-07-20 16:49:23,204 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_20 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_21 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_22 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_23 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_24 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_25 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_38 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_39 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_41 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_44 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_46 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_49 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_52 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_54 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_73 with incomplete 
> records
> 2018-07-20 16:49:23,207 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_74 with incomplete 
> records
> 2018-07-20 16:49:23,207 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_75 with incomplete 
> records
> 2018-07-20 16:49:23,207 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container 

[jira] [Updated] (YARN-8558) NM recovery level db not cleaned up properly on container finish

2018-07-25 Thread Bibin A Chundatt (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-8558:
---
Attachment: YARN-8558.002.patch

> NM recovery level db not cleaned up properly on container finish
> 
>
> Key: YARN-8558
> URL: https://issues.apache.org/jira/browse/YARN-8558
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-8558.001.patch, YARN-8558.002.patch
>
>
> {code}
> 2018-07-20 16:49:23,117 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
>  Application application_1531994217928_0054 transitioned from NEW to INITING
> 2018-07-20 16:49:23,204 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_18 with incomplete 
> records
> 2018-07-20 16:49:23,204 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_19 with incomplete 
> records
> 2018-07-20 16:49:23,204 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_20 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_21 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_22 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_23 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_24 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_25 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_38 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_39 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_41 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_44 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_46 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_49 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_52 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_54 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_73 with incomplete 
> records
> 2018-07-20 16:49:23,207 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_74 with incomplete 
> records
> 2018-07-20 16:49:23,207 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_75 with incomplete 
> records
> 2018-07-20 16:49:23,207 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_78 with incomplete 
> records
> 2018-07-20 16:49:23,207 WARN 
> 

[jira] [Commented] (YARN-8418) App local logs could leaked if log aggregation fails to initialize for the app

2018-07-25 Thread genericqa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555429#comment-16555429
 ] 

genericqa commented on YARN-8418:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
28s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m  
0s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 28m 
 0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  8m 
32s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
24s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
34s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m  7s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
22s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
12s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  7m 
32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  7m 
32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 14s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
20s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 
25s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 18m 24s{color} 
| {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
39s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}106m 34s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.nodemanager.containermanager.TestContainerManager |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ba1ab08 |
| JIRA Issue | YARN-8418 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12933012/YARN-8418.007.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 86144f4f8ba2 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 
08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 81d5950 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_171 |
| findbugs | v3.1.0-RC1 |
| unit | 

[jira] [Created] (YARN-8576) Fix the broken anchor in SLS site-doc

2018-07-25 Thread Weiwei Yang (JIRA)
Weiwei Yang created YARN-8576:
-

 Summary: Fix the broken anchor in SLS site-doc
 Key: YARN-8576
 URL: https://issues.apache.org/jira/browse/YARN-8576
 Project: Hadoop YARN
  Issue Type: Bug
  Components: docs
Reporter: Weiwei Yang
Assignee: Weiwei Yang


The anchor for section "Synthetic Load Generator" is currently broken.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8546) A reserved container might be released multiple times under async scheduling

2018-07-25 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555413#comment-16555413
 ] 

Weiwei Yang commented on YARN-8546:
---

Thanks [~Tao Yang] for that patch, the fix looks good. +1.

There is a minor typo in the log message, I will fix it during the commit.

Thanks

> A reserved container might be released multiple times under async scheduling
> 
>
> Key: YARN-8546
> URL: https://issues.apache.org/jira/browse/YARN-8546
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Weiwei Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: global-scheduling
> Attachments: YARN-8546.001.patch
>
>
> I was able to reproduce this issue by starting a job that keeps requesting 
> containers until it uses up the cluster's available resources. My cluster has 
> 70200 vcores and each task requests 100 vcores, so I expected a total of 702 
> containers to be allocated, but eventually there were only 701. The last 
> container could not be allocated because the queue's used resource was updated 
> to more than 100%.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8541) RM startup failure on recovery after user deletion

2018-07-25 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555388#comment-16555388
 ] 

Sunil Govindan commented on YARN-8541:
--

Thanks [~bibinchundatt]. TestPlacementManager is not in 3.1, hence it makes 
sense to remove it for 3.1. +1

> RM startup failure on recovery after user deletion
> --
>
> Key: YARN-8541
> URL: https://issues.apache.org/jira/browse/YARN-8541
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.0
>Reporter: yimeng
>Assignee: Bibin A Chundatt
>Priority: Blocker
> Attachments: YARN-8541-branch-3.1.003.patch, YARN-8541.001.patch, 
> YARN-8541.002.patch, YARN-8541.003.patch
>
>
> My Hadoop version is 3.1.0. I found a problem where RM startup fails on 
> recovery, with the following test steps:
> 1. Create a user "user1" that has permission to submit apps.
> 2. Use user1 to submit a job and wait for the job to finish.
> 3. Delete user "user1".
> 4. Restart YARN.
> 5. The RM restart fails.
> RM logs:
> 2018-07-16 16:24:59,708 | INFO | main-EventThread | Initialized root queue 
> root: numChildQueue= 3, capacity=1.0, absoluteCapacity=1.0, 
> usedResources=usedCapacity=0.0, numApps=0, 
> numContainers=0 | CapacitySchedulerQueueManager.java:163
> 2018-07-16 16:24:59,708 | INFO | main-EventThread | Initialized queue 
> mappings, override: false | UserGroupMappingPlacementRule.java:232
> 2018-07-16 16:24:59,708 | INFO | main-EventThread | Initialized 
> CapacityScheduler with calculator=class 
> org.apache.hadoop.yarn.util.resource.DominantResourceCalculator, 
> minimumAllocation=<>, maximumAllocation=< vCores:32>>, asynchronousScheduling=false, asyncScheduleInterval=5ms | 
> CapacityScheduler.java:392
> 2018-07-16 16:24:59,709 | INFO | main-EventThread | dynamic-resources.xml not 
> found | Configuration.java:2767
> 2018-07-16 16:24:59,709 | INFO | main-EventThread | Initializing AMS 
> Processing chain. Root 
> Processor=[org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor].
>  | AMSProcessingChain.java:62
> 2018-07-16 16:24:59,709 | INFO | main-EventThread | disabled placement 
> handler will be used, all scheduling requests will be rejected. | 
> ApplicationMasterService.java:130
> 2018-07-16 16:24:59,709 | INFO | main-EventThread | Adding 
> [org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor]
>  tp top of AMS Processing chain. | AMSProcessingChain.java:75
> 2018-07-16 16:24:59,713 | WARN | main-EventThread | Exception handling the 
> winning of election | ActiveStandbyElector.java:897
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
>  at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:893)
>  at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:473)
>  at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:728)
>  at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:600)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when 
> transitioning to Active mode
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:325)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
>  ... 4 more
> Caused by: org.apache.hadoop.service.ServiceStateException: 
> org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application 
> application_1531624956005_0001 submitted by user super reason: No groups 
> found for user super
>  at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1204)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1245)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1241)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1686)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1241)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:320)
>  ... 5 more
> Caused by: 

[jira] [Commented] (YARN-8574) Allow dot in attribute values

2018-07-25 Thread genericqa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555373#comment-16555373
 ] 

genericqa commented on YARN-8574:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} docker {color} | {color:red}  0m 
11s{color} | {color:red} Docker failed to build yetus/hadoop:abb62dd. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-8574 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12933017/YARN-8574-YARN-3409.001.patch
 |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/21363/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> Allow dot in attribute values 
> --
>
> Key: YARN-8574
> URL: https://issues.apache.org/jira/browse/YARN-8574
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Attachments: YARN-8574-YARN-3409.001.patch
>
>
> Currently "." is considered as invalid value. Enable  the same;



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8418) App local logs could leaked if log aggregation fails to initialize for the app

2018-07-25 Thread Bibin A Chundatt (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-8418:
---
Attachment: YARN-8418.007.patch

> App local logs could leaked if log aggregation fails to initialize for the app
> --
>
> Key: YARN-8418
> URL: https://issues.apache.org/jira/browse/YARN-8418
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0, 3.0.0-alpha1
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-8418.001.patch, YARN-8418.002.patch, 
> YARN-8418.003.patch, YARN-8418.004.patch, YARN-8418.005.patch, 
> YARN-8418.006.patch, YARN-8418.007.patch
>
>
> If log aggregation fails to init the createApp directory, container logs 
> could get leaked in the NM directory.
> For a long-running application this can happen on NM restart after token 
> renewal, or on application submission with an invalid delegation token.
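
In other words, when initializing aggregation for an app fails, the NM should 
still hand the app's local log directory over for deletion. A self-contained 
sketch of that cleanup idea (the names below are hypothetical stand-ins, not 
the actual NodeManager log aggregation classes):

{code:java}
// Hypothetical sketch only; it does not use the real NodeManager API.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class AppLogInitSketch {

  // Stand-in for initializing app log aggregation (the "createApp" step).
  static void initAppLogAggregation(Path localAppLogDir, boolean failInit)
      throws IOException {
    try {
      if (failInit) {
        throw new IOException("failed to create remote app-log directory");
      }
    } catch (IOException e) {
      // Without this cleanup the local container logs would be left behind.
      deleteRecursively(localAppLogDir);
      throw e;
    }
  }

  static void deleteRecursively(Path dir) throws IOException {
    if (!Files.exists(dir)) {
      return;
    }
    try (Stream<Path> walk = Files.walk(dir)) {
      walk.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("application_0000000000000_0001");
    Files.createFile(dir.resolve("container.log"));
    try {
      initAppLogAggregation(dir, true);
    } catch (IOException expected) {
      System.out.println("init failed, local dir removed: " + !Files.exists(dir));
    }
  }
}
{code}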



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8574) Allow dot in attribute values

2018-07-25 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-8574:
--

 Summary: Allow dot in attribute values 
 Key: YARN-8574
 URL: https://issues.apache.org/jira/browse/YARN-8574
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt


Currently "." is considered as invalid value. Enable  the same;



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8521) NPE in AllocationTagsManager when a container is removed more than once

2018-07-25 Thread genericqa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555354#comment-16555354
 ] 

genericqa commented on YARN-8521:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
36s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 28m 
25s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
14s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
49s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 32s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
28s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
31s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 46s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
24s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
27s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 69m 38s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
29s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}132m 52s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ba1ab08 |
| JIRA Issue | YARN-8521 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12932999/YARN-8521.003.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  shadedclient  findbugs  checkstyle  |
| uname | Linux bd3ed19bca42 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 
08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 81d5950 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_171 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-YARN-Build/21360/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/21360/testReport/ |
| Max. process+thread count | 941 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 U: 

[jira] [Commented] (YARN-8418) App local logs could leaked if log aggregation fails to initialize for the app

2018-07-25 Thread genericqa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555304#comment-16555304
 ] 

genericqa commented on YARN-8418:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
19s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m  
0s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 
 0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  8m 
12s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
23s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
21s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 25s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
28s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
23s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
13s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  0m 
22s{color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed. 
{color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  1m 
10s{color} | {color:red} hadoop-yarn in the patch failed. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  1m 10s{color} 
| {color:red} hadoop-yarn in the patch failed. {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
11s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} mvnsite {color} | {color:red}  0m 
24s{color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed. 
{color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} shadedclient {color} | {color:red}  4m  
8s{color} | {color:red} patch has errors when building and testing our client 
artifacts. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  0m 
23s{color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed. 
{color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red}  0m 
22s{color} | {color:red} 
hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager
 generated 2 new + 9 unchanged - 0 fixed = 11 total (was 9) {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 
19s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  0m 24s{color} 
| {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
24s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 66m 49s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ba1ab08 |
| JIRA Issue | YARN-8418 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12933000/YARN-8418.006.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 30002f1ec09b 4.4.0-130-generic #156-Ubuntu SMP Thu Jun 14 
08:53:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 81d5950 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 

[jira] [Commented] (YARN-8418) App local logs could leaked if log aggregation fails to initialize for the app

2018-07-25 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555313#comment-16555313
 ] 

Bibin A Chundatt commented on YARN-8418:


Missed adding the event class. Attaching the patch again.

> App local logs could leaked if log aggregation fails to initialize for the app
> --
>
> Key: YARN-8418
> URL: https://issues.apache.org/jira/browse/YARN-8418
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0, 3.0.0-alpha1
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-8418.001.patch, YARN-8418.002.patch, 
> YARN-8418.003.patch, YARN-8418.004.patch, YARN-8418.005.patch, 
> YARN-8418.006.patch, YARN-8418.007.patch
>
>
> If log aggregation fails to init the createApp directory, container logs 
> could get leaked in the NM directory.
> For a long-running application this can happen on NM restart after token 
> renewal, or on application submission with an invalid delegation token.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8566) Add diagnostic message for unschedulable containers

2018-07-25 Thread JIRA


[ 
https://issues.apache.org/jira/browse/YARN-8566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553032#comment-16553032
 ] 

Antal Bálint Steinbach edited comment on YARN-8566 at 7/25/18 6:51 AM:
---

Hi [~snemeth] !

 

Thanks for the patch. I only have some minor comments:
 * Maybe it would be good to add diagnostic text for the 3rd case (UNKNOWN)
 * Using a switch for enums can be less verbose
 * You can extract 
app.getRMAppAttempt(appAttemptId).updateAMLaunchDiagnostics(...

 
{code:java}
// String errorMsg = "";
switch (e.getInvalidResourceType()) {
  case GREATER_THEN_MAX_ALLOCATION:
    errorMsg = "Cannot allocate containers as resource request is " +
        "greater than the maximum allowed allocation!";
    break;
  case LESS_THAN_ZERO:
    errorMsg = "Cannot allocate containers as resource request is " +
        "less than zero!";
    break;
  case UNKNOWN:
  default:
    errorMsg = "Cannot allocate containers for some unknown reasons!";
}
app.getRMAppAttempt(appAttemptId).updateAMLaunchDiagnostics(errorMsg);

{code}
 


was (Author: bsteinbach):
Hi [~snemeth] !

 

Thanks for the patch. I only have some minor comments:
 * Maybe it would be good to add diagnostic text for the 3rd case (UNKNOWN)
 * Using a switch for enums can be less verbose
 * You can extract 
app.getRMAppAttempt(appAttemptId).updateAMLaunchDiagnostics(...

 
{code:java}
// String errorMsg = "";
switch (e.getInvalidResourceType()) {
  case GREATER_THEN_MAX_ALLOCATION:
    errorMsg = "Cannot allocate containers as resource request is " +
        "greater than the maximum allowed allocation!";
    break;
  case LESS_THAN_ZERO:
    errorMsg = "Cannot allocate containers as resource request is " +
        "less than zero!";
  case UNKNOWN:
  default:
    errorMsg = "Cannot allocate containers for some unknown reasons!";
}
app.getRMAppAttempt(appAttemptId).updateAMLaunchDiagnostics(errorMsg);

{code}
 

> Add diagnostic message for unschedulable containers
> ---
>
> Key: YARN-8566
> URL: https://issues.apache.org/jira/browse/YARN-8566
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-8566.001.patch, YARN-8566.002.patch, 
> YARN-8566.003.patch, YARN-8566.004.patch
>
>
> If a queue is configured with maxResources set to 0 for a resource, and an 
> application is submitted to that queue that requests that resource, that 
> application will remain pending until it is removed or moved to a different 
> queue. This behavior can be realized without extended resources, but it’s 
> unlikely a user will create a queue that allows 0 memory or CPU. As the 
> number of resources in the system increases, this scenario will become more 
> common, and it will become harder to recognize these cases. Therefore, the 
> scheduler should indicate in the diagnostic string for an application if it 
> was not scheduled because of a 0 maxResources setting.
> Example configuration (fair-scheduler.xml) : 
> {code:java}
> 
>   10
> 
> 1 mb,2vcores
> 9 mb,4vcores, 0gpu
> 50
> -1.0f
> 2.0
> fair
>   
> 
> {code}
> Command: 
> {code:java}
> yarn jar 
> "./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0-SNAPSHOT.jar" pi 
> -Dmapreduce.job.queuename=sample_queue -Dmapreduce.map.resource.gpu=1 1 1000;
> {code}
> The job hangs and the application diagnostic info is empty.
> Given that an exception is thrown before any mapper/reducer container is 
> created, the diagnostic message of the AM should be updated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8566) Add diagnostic message for unschedulable containers

2018-07-25 Thread JIRA


[ 
https://issues.apache.org/jira/browse/YARN-8566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553917#comment-16553917
 ] 

Antal Bálint Steinbach edited comment on YARN-8566 at 7/25/18 6:50 AM:
---

Hi [~snemeth]
+1 LGTM (Non-binding) Thanks for the fix. 


was (Author: bsteinbach):
Hi [~snemeth]
+1 Thanks for the fix. 

> Add diagnostic message for unschedulable containers
> ---
>
> Key: YARN-8566
> URL: https://issues.apache.org/jira/browse/YARN-8566
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-8566.001.patch, YARN-8566.002.patch, 
> YARN-8566.003.patch, YARN-8566.004.patch
>
>
> If a queue is configured with maxResources set to 0 for a resource, and an 
> application is submitted to that queue that requests that resource, that 
> application will remain pending until it is removed or moved to a different 
> queue. This behavior can be realized without extended resources, but it’s 
> unlikely a user will create a queue that allows 0 memory or CPU. As the 
> number of resources in the system increases, this scenario will become more 
> common, and it will become harder to recognize these cases. Therefore, the 
> scheduler should indicate in the diagnostic string for an application if it 
> was not scheduled because of a 0 maxResources setting.
> Example configuration (fair-scheduler.xml) : 
> {code:java}
> 
>   10
> 
> 1 mb,2vcores
> 9 mb,4vcores, 0gpu
> 50
> -1.0f
> 2.0
> fair
>   
> 
> {code}
> Command: 
> {code:java}
> yarn jar 
> "./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0-SNAPSHOT.jar" pi 
> -Dmapreduce.job.queuename=sample_queue -Dmapreduce.map.resource.gpu=1 1 1000;
> {code}
> The job hangs and the application diagnostic info is empty.
> Given that an exception is thrown before any mapper/reducer container is 
> created, the diagnostic message of the AM should be updated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8553) Reduce complexity of AHSWebService getApps method

2018-07-25 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555216#comment-16555216
 ] 

Szilard Nemeth commented on YARN-8553:
--

Thanks [~sunilg] for jumping in for the review.

> Reduce complexity of AHSWebService getApps method
> -
>
> Key: YARN-8553
> URL: https://issues.apache.org/jira/browse/YARN-8553
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-8553.001.patch
>
>
> YARN-8501 refactored RMWebService#getApp. Similar refactoring is required in 
> AHSWebservice.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8572) YarnClient getContainers API should support filtering by container status

2018-07-25 Thread Szilard Nemeth (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-8572:
-
Description: YarnClient.getContainers should support filtering containers 
by their status - RUNNING, COMPLETED etc . This may require corresponding 
changes in ATS to filter by container status for a given application attempt  
(was: YarnClient.getContainers should support filtering containers by their 
status - RUNNING, COMPLETED etc . This may require corresponding changes in ATS 
to filter by container status for a given application attemopt)

> YarnClient getContainers API should support filtering by container status
> -
>
> Key: YARN-8572
> URL: https://issues.apache.org/jira/browse/YARN-8572
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Suma Shivaprasad
>Priority: Major
>
> YarnClient.getContainers should support filtering containers by their status 
> - RUNNING, COMPLETED etc . This may require corresponding changes in ATS to 
> filter by container status for a given application attempt
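
Until such a filter exists in the API, callers can filter on the client side 
over the existing getContainers result. A minimal sketch (the application 
attempt id is read from the command line as a placeholder):

{code:java}
// Client-side workaround sketch: fetch all ContainerReports for an attempt
// and keep only the RUNNING ones.
import java.util.List;
import java.util.stream.Collectors;

import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;
import org.apache.hadoop.yarn.api.records.ContainerReport;
import org.apache.hadoop.yarn.api.records.ContainerState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RunningContainersSketch {
  public static void main(String[] args) throws Exception {
    YarnClient client = YarnClient.createYarnClient();
    client.init(new YarnConfiguration());
    client.start();
    try {
      ApplicationAttemptId attemptId = ApplicationAttemptId.fromString(args[0]);
      List<ContainerReport> running = client.getContainers(attemptId).stream()
          .filter(c -> c.getContainerState() == ContainerState.RUNNING)
          .collect(Collectors.toList());
      running.forEach(c -> System.out.println(c.getContainerId()));
    } finally {
      client.stop();
    }
  }
}
{code}

This pulls every report for the attempt and filters locally; the ask in this 
JIRA is to push that filter into the API (and ATS) so the server can do it.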



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8418) App local logs could leaked if log aggregation fails to initialize for the app

2018-07-25 Thread Bibin A Chundatt (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-8418:
---
Attachment: YARN-8418.006.patch

> App local logs could leaked if log aggregation fails to initialize for the app
> --
>
> Key: YARN-8418
> URL: https://issues.apache.org/jira/browse/YARN-8418
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0, 3.0.0-alpha1
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-8418.001.patch, YARN-8418.002.patch, 
> YARN-8418.003.patch, YARN-8418.004.patch, YARN-8418.005.patch, 
> YARN-8418.006.patch
>
>
> If log aggregation fails to init the createApp directory, container logs 
> could get leaked in the NM directory.
> For a long-running application this can happen on NM restart after token 
> renewal, or on application submission with an invalid delegation token.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8521) NPE in AllocationTagsManager when a container is removed more than once

2018-07-25 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-8521:
--
Attachment: YARN-8521.003.patch

> NPE in AllocationTagsManager when a container is removed more than once
> ---
>
> Key: YARN-8521
> URL: https://issues.apache.org/jira/browse/YARN-8521
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.0
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
> Attachments: YARN-8521.001.patch, YARN-8521.002.patch, 
> YARN-8521.003.patch
>
>
> We've sometimes seen an NPE in AllocationTagsManager
> {code:java}
> private void removeTagFromInnerMap(Map<String, Long> innerMap, String tag) {
>   Long count = innerMap.get(tag);
>   if (count > 1) { // NPE!!
>   ...
> {code}
> It seems {{AllocationTagsManager#removeContainer}} somehow gets called more 
> than once for the same container.
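
Whatever causes the duplicate removal, the NPE itself comes from unboxing a 
null count. A minimal defensive sketch of a null-safe decrement (a standalone 
illustration, not the actual AllocationTagsManager patch):

{code:java}
import java.util.HashMap;
import java.util.Map;

public class TagCountSketch {

  // Decrement the per-tag count, treating a missing entry as already removed.
  static void removeTag(Map<String, Long> innerMap, String tag) {
    Long count = innerMap.get(tag);
    if (count == null) {
      return; // tag already gone, e.g. the container was removed twice
    }
    if (count > 1) {
      innerMap.put(tag, count - 1);
    } else {
      innerMap.remove(tag);
    }
  }

  public static void main(String[] args) {
    Map<String, Long> tags = new HashMap<>();
    tags.put("web-server", 1L);
    removeTag(tags, "web-server");
    removeTag(tags, "web-server"); // second removal no longer throws an NPE
    System.out.println(tags);      // {}
  }
}
{code}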



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org