[jira] [Commented] (YARN-8580) yarn.resourcemanager.am.max-attempts is not respected for yarn services
[ https://issues.apache.org/jira/browse/YARN-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556935#comment-16556935 ]

Gour Saha commented on YARN-8580:
---------------------------------

Actually, this is a Yarn Service-specific property, so the value 20 is getting set because that is the default for Yarn Services. The reason 100 was not taking effect is that for Yarn Services the property name is yarn.service.am-restart.max-attempts, not yarn.resourcemanager.am.max-attempts. Once the right property is set, the desired behavior will be seen. It is still an Invalid jira, though.

> yarn.resourcemanager.am.max-attempts is not respected for yarn services
> -----------------------------------------------------------------------
>
>                 Key: YARN-8580
>                 URL: https://issues.apache.org/jira/browse/YARN-8580
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn-native-services
>    Affects Versions: 3.1.1
>            Reporter: Yesha Vora
>            Priority: Major
>
> 1) Max am attempt is set to 100 on all nodes (including gateway).
> {code}
> <property>
>   <name>yarn.resourcemanager.am.max-attempts</name>
>   <value>100</value>
> </property>
> {code}
> 2) Start a Yarn service (HBase tarball) application.
> 3) Kill the AM 20 times.
> Here, the app fails with the below diagnostics.
> {code}
> bash-4.2$ /usr/hdp/current/hadoop-yarn-client/bin/yarn application -status application_1532481557746_0001
> 18/07/25 18:43:34 INFO client.AHSProxy: Connecting to Application History server at xxx/xxx:10200
> 18/07/25 18:43:34 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
> 18/07/25 18:43:34 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/3.0.0.0-1634/0/resource-types.xml
> Application Report :
>         Application-Id : application_1532481557746_0001
>         Application-Name : hbase-tarball-lr
>         Application-Type : yarn-service
>         User : hbase
>         Queue : default
>         Application Priority : 0
>         Start-Time : 1532481864863
>         Finish-Time : 1532522943103
>         Progress : 100%
>         State : FAILED
>         Final-State : FAILED
>         Tracking-URL : https://xxx:8090/cluster/app/application_1532481557746_0001
>         RPC Port : -1
>         AM Host : N/A
>         Aggregate Resource Allocation : 252150112 MB-seconds, 164141 vcore-seconds
>         Aggregate Resource Preempted : 0 MB-seconds, 0 vcore-seconds
>         Log Aggregation Status : SUCCEEDED
>         Diagnostics : Application application_1532481557746_0001 failed 20 times (global limit =100; local limit is =20) due to AM Container for appattempt_1532481557746_0001_20 exited with exitCode: 137
> Failing this attempt.Diagnostics: [2018-07-25 12:49:00.784]Container killed on request. Exit code is 137
> [2018-07-25 12:49:03.045]Container exited with a non-zero exit code 137.
> [2018-07-25 12:49:03.045]Killed by external signal
> For more detailed output, check the application tracking page: https://xxx:8090/cluster/app/application_1532481557746_0001 Then click on links to logs of each attempt.
> . Failing the application.
>         Unmanaged Application : false
>         Application Node Label Expression :
>         AM container Node Label Expression :
>         TimeoutType : LIFETIME  ExpiryTime : 2018-07-25T22:26:15.419+  RemainingTime : 0seconds
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
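Per Gour's comment, the property that actually governs AM restarts for YARN services is yarn.service.am-restart.max-attempts. A minimal sketch of the corrected configuration; placing it in yarn-site.xml is an assumption for illustration (the value 100 matches the reporter's intent):

```xml
<!-- Sketch: YARN service AM restart limit (service-specific property).
     yarn-site.xml placement is assumed; 100 is the value the reporter wanted. -->
<property>
  <name>yarn.service.am-restart.max-attempts</name>
  <value>100</value>
</property>
```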
[jira] [Commented] (YARN-8577) Fix the broken anchor in SLS site-doc
[ https://issues.apache.org/jira/browse/YARN-8577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556926#comment-16556926 ]

Weiwei Yang commented on YARN-8577:
-----------------------------------

Thanks [~bibinchundatt] for the review and commit. I have cherry-picked this to branch-2.9 too.

> Fix the broken anchor in SLS site-doc
> -------------------------------------
>
>                 Key: YARN-8577
>                 URL: https://issues.apache.org/jira/browse/YARN-8577
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: documentation
>    Affects Versions: 2.9.0, 3.0.0, 3.1.0
>            Reporter: Weiwei Yang
>            Assignee: Weiwei Yang
>            Priority: Minor
>             Fix For: 3.2.0, 2.9.2, 3.0.4, 3.1.2
>
>         Attachments: HADOOP-15630.001.patch
>
>
> The anchor for section "Synthetic Load Generator" is currently broken.
[jira] [Updated] (YARN-8577) Fix the broken anchor in SLS site-doc
[ https://issues.apache.org/jira/browse/YARN-8577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weiwei Yang updated YARN-8577:
------------------------------
    Fix Version/s: 2.9.2
[jira] [Updated] (YARN-8546) Resource leak caused by a reserved container being released more than once under async scheduling
[ https://issues.apache.org/jira/browse/YARN-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weiwei Yang updated YARN-8546:
------------------------------
    Fix Version/s: (was: 3.1.1)
                   3.1.2

> Resource leak caused by a reserved container being released more than once under async scheduling
> -------------------------------------------------------------------------------------------------
>
>                 Key: YARN-8546
>                 URL: https://issues.apache.org/jira/browse/YARN-8546
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacity scheduler
>    Affects Versions: 3.1.0
>            Reporter: Weiwei Yang
>            Assignee: Tao Yang
>            Priority: Major
>              Labels: global-scheduling
>             Fix For: 3.2.0, 3.1.2
>
>         Attachments: YARN-8546.001.patch
>
>
> I was able to reproduce this issue by starting a job that keeps requesting
> containers until it uses up the cluster's available resources. My cluster
> has 70200 vcores, and each task applies for 100 vcores; I was expecting a
> total of 702 containers to be allocated, but eventually there were only 701.
> The last container could not be allocated because the queue's used resource
> was updated to be more than 100%.
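The expected container count in the description above is simple arithmetic, sketched here only to make the 702-vs-701 gap explicit (the cluster and task sizes are the numbers from the report):

```java
// Sanity-check the container math from the YARN-8546 report: a 70200-vcore
// cluster with 100 vcores per task should admit exactly 702 containers;
// the double-release bug left the queue at 701 with used resource over 100%.
public class ContainerMath {
  static long expectedContainers(long clusterVcores, long vcoresPerTask) {
    // Integer division: only full allocations count.
    return clusterVcores / vcoresPerTask;
  }

  public static void main(String[] args) {
    System.out.println(expectedContainers(70200, 100)); // prints 702
  }
}
```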
[jira] [Commented] (YARN-8546) Resource leak caused by a reserved container being released more than once under async scheduling
[ https://issues.apache.org/jira/browse/YARN-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556921#comment-16556921 ]

Weiwei Yang commented on YARN-8546:
-----------------------------------

Thanks [~bibinchundatt], I have corrected the fix version to 3.1.2.
[jira] [Commented] (YARN-7833) [PERF/TEST] Extend SLS to support simulation of a Federated Environment
[ https://issues.apache.org/jira/browse/YARN-7833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556916#comment-16556916 ]

genericqa commented on YARN-7833:
---------------------------------

| (x) *{color:red}-1 overall{color}* |
\\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 31s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 4 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 26s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 27m 11s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 32m 39s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 31s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 4m 11s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 56s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 42s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 10s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 24s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 51s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 28m 11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 28m 11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 31s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 4m 6s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} shellcheck {color} | {color:green} 0m 0s{color} | {color:green} There were no new shellcheck issues. {color} |
| {color:green}+1{color} | {color:green} shelldocs {color} | {color:green} 0m 36s{color} | {color:green} There were no new shelldocs issues. {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch has 4 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch 137 line(s) with tabs. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 2s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 25s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 36s{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 4s{color} | {color:red} hadoop-tools/hadoop-sls generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 10s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 55s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 22s{color} | {color:green} hadoop-yarn-server-common in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 69m 13s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch passed. {color} |
[jira] [Commented] (YARN-8407) Container launch exception in AM log should be printed in ERROR level
[ https://issues.apache.org/jira/browse/YARN-8407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556897#comment-16556897 ]

Bibin A Chundatt commented on YARN-8407:
----------------------------------------

[~yeshavora] Few minor comments:
# Please handle formatting.
# Use StringBuilder for creating the message.

> Container launch exception in AM log should be printed in ERROR level
> ---------------------------------------------------------------------
>
>                 Key: YARN-8407
>                 URL: https://issues.apache.org/jira/browse/YARN-8407
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Yesha Vora
>            Assignee: Yesha Vora
>            Priority: Major
>         Attachments: YARN-8407.001.patch
>
>
> When a container launch fails because a docker image is not available, it is
> logged at INFO level in the AM log. Container launch failure should be
> logged as ERROR.
> Steps:
> Launch an httpd yarn-service application with an invalid docker image.
> {code:java}
> 2018-06-07 01:51:32,966 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE httpd-0 : container_e05_1528335963594_0001_01_02]: container_e05_1528335963594_0001_01_02 completed. Reinsert back to pending list and requested a new container.
> exitStatus=-1, diagnostics=[2018-06-07 01:51:02.363]Exception from container-launch.
> Container id: container_e05_1528335963594_0001_01_02
> Exit code: 7
> Exception message: Launch container failed
> Shell error output: Unable to find image 'xxx/httpd:0.1' locally
> Trying to pull repository xxx/httpd ...
> /usr/bin/docker-current: Get https://xxx/v1/_ping: dial tcp: lookup xxx on yyy: no such host.
> See '/usr/bin/docker-current run --help'.
> Shell output: main : command provided 4
> main : run as user is hbase
> main : requested yarn user is hbase
> Creating script paths...
> Creating local dirs...
> Getting exit code file...
> Changing effective user to root...
> Wrote the exit code 7 to /grid/0/hadoop/yarn/local/nmPrivate/application_1528335963594_0001/container_e05_1528335963594_0001_01_02/container_e05_1528335963594_0001_01_02.pid.exitcode
> [2018-06-07 01:51:02.393]Diagnostic message from attempt :
> [2018-06-07 01:51:02.394]Container exited with a non-zero exit code 7. Last 4096 bytes of stderr.txt :
> [2018-06-07 01:51:32.428]Could not find nmPrivate/application_1528335963594_0001/container_e05_1528335963594_0001_01_02//container_e05_1528335963594_0001_01_02.pid in any of the directories
> 2018-06-07 01:51:32,966 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE httpd-0 : container_e05_1528335963594_0001_01_02] Transitioned from STARTED to INIT on STOP event{code}
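Bibin's second review point (use StringBuilder for the message) could be sketched roughly like this; the class, method, and message wording below are hypothetical illustrations, not code from the actual patch:

```java
// Hypothetical sketch: assemble a multi-part diagnostic message with
// StringBuilder instead of repeated String concatenation in a chain of
// statements. The names and message format are illustrative only.
public class DiagnosticsExample {
  static String buildDiagnostics(String containerId, int exitCode, String reason) {
    StringBuilder sb = new StringBuilder();
    sb.append("Container ").append(containerId)
      .append(" failed with exit code ").append(exitCode)
      .append(": ").append(reason);
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(buildDiagnostics(
        "container_e05_1528335963594_0001_01_02", 7, "Launch container failed"));
  }
}
```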
[jira] [Commented] (YARN-8252) Fix ServiceMaster main not found
[ https://issues.apache.org/jira/browse/YARN-8252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556890#comment-16556890 ]

Jaume M commented on YARN-8252:
-------------------------------

I'm seeing this when trying to install LLAP with hadoop master. The container doesn't start, and the only error line is:
{{Error: Could not find or load main class org.apache.hadoop.yarn.service.ServiceMaster}}

> Fix ServiceMaster main not found
> --------------------------------
>
>                 Key: YARN-8252
>                 URL: https://issues.apache.org/jira/browse/YARN-8252
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.1.0
>            Reporter: Zoltan Haindrich
>            Priority: Major
>
> I was looking into using yarn services; however, it seems that for some reason
> it is not possible to run the {{ServiceMaster}} class from the jar... I might
> be missing something fundamental... so I've put together a shellscript to make
> it easy for anyone to check. I would be happy with any exception beyond "main
> not found".
> [ServiceMaster.main method|https://github.com/apache/hadoop/blob/67f239c42f676237290d18ddbbc9aec369267692/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/main/java/org/apache/hadoop/yarn/service/ServiceMaster.java#L305]
> {code:java}
> #!/bin/bash
> set -e
> wget -O core.jar -nv http://central.maven.org/maven2/org/apache/hadoop/hadoop-yarn-services-core/3.1.0/hadoop-yarn-services-core-3.1.0.jar
> unzip -qn core.jar
> cat > org/apache/hadoop/yarn/service/ServiceMaster2.java << EOF
> package org.apache.hadoop.yarn.service;
> public class ServiceMaster2 {
>   public static void main(String[] args) throws Exception {
>     System.out.println("asd!");
>   }
> }
> EOF
> javac org/apache/hadoop/yarn/service/ServiceMaster2.java
> jar -cf a1.jar org
> find org -name ServiceMaster*
> # this will print "asd!"
> java -cp a1.jar org.apache.hadoop.yarn.service.ServiceMaster2
> # the following invocations result in:
> # Error: Could not find or load main class org.apache.hadoop.yarn.service.ServiceMaster
> set +e
> java -cp a1.jar org.apache.hadoop.yarn.service.ServiceMaster
> java -cp core.jar org.apache.hadoop.yarn.service.ServiceMaster
> {code}
[jira] [Updated] (YARN-7833) [PERF/TEST] Extend SLS to support simulation of a Federated Environment
[ https://issues.apache.org/jira/browse/YARN-7833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tanuj Nayak updated YARN-7833:
------------------------------
    Attachment: (was: YARN-7833.v1.patch)

> [PERF/TEST] Extend SLS to support simulation of a Federated Environment
> -----------------------------------------------------------------------
>
>                 Key: YARN-7833
>                 URL: https://issues.apache.org/jira/browse/YARN-7833
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Carlo Curino
>            Assignee: Tanuj Nayak
>            Priority: Major
>         Attachments: YARN-7833.v1.patch
>
>
> To develop algorithms for federation, it would be of great help to have a
> version of SLS that supports multi RMs and GPG.
[jira] [Comment Edited] (YARN-7833) [PERF/TEST] Extend SLS to support simulation of a Federated Environment
[ https://issues.apache.org/jira/browse/YARN-7833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556750#comment-16556750 ]

Tanuj Nayak edited comment on YARN-7833 at 7/26/18 1:21 AM:
------------------------------------------------------------

Added initial patch of Federated SLS. It does not yet separate the metrics of individual RMs in the ClusterMetrics and QueueMetrics classes.

was (Author: tanujnay):
Added initial patch of Federated SLS
[jira] [Updated] (YARN-7833) [PERF/TEST] Extend SLS to support simulation of a Federated Environment
[ https://issues.apache.org/jira/browse/YARN-7833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tanuj Nayak updated YARN-7833:
------------------------------
    Attachment: YARN-7833.v1.patch
[jira] [Updated] (YARN-7833) [PERF/TEST] Extend SLS to support simulation of a Federated Environment
[ https://issues.apache.org/jira/browse/YARN-7833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tanuj Nayak updated YARN-7833:
------------------------------
    Attachment: (was: YARN-7833.v1.patch)
[jira] [Updated] (YARN-7833) [PERF/TEST] Extend SLS to support simulation of a Federated Environment
[ https://issues.apache.org/jira/browse/YARN-7833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tanuj Nayak updated YARN-7833:
------------------------------
    Attachment: YARN-7833.v1.patch
[jira] [Updated] (YARN-7833) [PERF/TEST] Extend SLS to support simulation of a Federated Environment
[ https://issues.apache.org/jira/browse/YARN-7833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tanuj Nayak updated YARN-7833:
------------------------------
    Attachment: (was: YARN-7433.v1.patch)
[jira] [Updated] (YARN-7833) [PERF/TEST] Extend SLS to support simulation of a Federated Environment
[ https://issues.apache.org/jira/browse/YARN-7833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tanuj Nayak updated YARN-7833:
------------------------------
    Attachment: YARN-7433.v1.patch
[jira] [Commented] (YARN-8581) [AMRMProxy] Add sub-cluster timeout in LocalityMulticastAMRMProxyPolicy
[ https://issues.apache.org/jira/browse/YARN-8581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556464#comment-16556464 ]

genericqa commented on YARN-8581:
---------------------------------

| (/) *{color:green}+1 overall{color}* |
\\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 17m 17s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 3 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 3m 22s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 30m 8s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 59s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 20s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 23s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 5s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 39s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 15s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 32s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 8s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 8m 21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 9s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 1s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 46s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 24s{color} | {color:green} hadoop-yarn-server-common in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 40s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}109m 53s{color} | {color:black} {color} |
\\ \\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ba1ab08 |
| JIRA Issue | YARN-8581 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12933108/YARN-8581.v1.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 2752b4edc895 3.13.0-143-generic #192-Ubuntu SMP Tue Feb 27 10:45:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / f93ecf5 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_171 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/21369/testReport/ |
| Max. process+thread count | 301 (vs. ulimit of 1) |
| modules | C:
[jira] [Commented] (YARN-8566) Add diagnostic message for unschedulable containers
[ https://issues.apache.org/jira/browse/YARN-8566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556449#comment-16556449 ]

Robert Kanter commented on YARN-8566:
-------------------------------------

Thanks for the patch. A few comments:
# In the switch statement, the {{break}}s should be indented one more level.
# I think we should make the log message and the diagnostic message say the same thing for consistency (the only difference would be that the log message would also have the App ID and stack trace).
# It looks like {{throwInvalidResourceException}} already has a message with details about the problem in it - why not simply push that message to the diagnostic message instead of adding {{InvalidResourceType}}?
#- Furthermore, it looks like the exception message is the same regardless of the reason for being invalid, which makes it somewhat unclear (i.e. it says "...requested resource type=[X] < 0 or greater than maximum allowed allocation." - which doesn't tell you which case). I'd suggest we make the exception message more dynamic based on what the actual problem is, and re-use it for the diagnostic message.

> Add diagnostic message for unschedulable containers
> ---------------------------------------------------
>
>                 Key: YARN-8566
>                 URL: https://issues.apache.org/jira/browse/YARN-8566
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: Szilard Nemeth
>            Assignee: Szilard Nemeth
>            Priority: Major
>         Attachments: YARN-8566.001.patch, YARN-8566.002.patch, YARN-8566.003.patch, YARN-8566.004.patch
>
>
> If a queue is configured with maxResources set to 0 for a resource, and an
> application is submitted to that queue that requests that resource, that
> application will remain pending until it is removed or moved to a different
> queue. This behavior can be realized without extended resources, but it's
> unlikely a user will create a queue that allows 0 memory or CPU.
> As the number of resources in the system increases, this scenario will become
> more common, and it will become harder to recognize these cases. Therefore,
> the scheduler should indicate in the diagnostic string for an application if
> it was not scheduled because of a 0 maxResources setting.
> Example configuration (fair-scheduler.xml); the element tags were lost in
> transit, only the values remain:
> {code:java}
> 10
> 1 mb,2vcores
> 9 mb,4vcores, 0gpu
> 50
> -1.0f
> 2.0
> fair
> {code}
> Command:
> {code:java}
> yarn jar "./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0-SNAPSHOT.jar" pi -Dmapreduce.job.queuename=sample_queue -Dmapreduce.map.resource.gpu=1 1 1000;
> {code}
> The job hangs and the application diagnostic info is empty.
> Given that an exception is thrown before any mapper/reducer container is
> created, the diagnostic message of the AM should be updated.
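Robert's last suggestion (a reason-specific exception message that can be reused as the diagnostic) could look roughly like this sketch; the method name, parameters, and message wording are hypothetical illustrations, not code from the actual patch or from {{throwInvalidResourceException}}:

```java
// Hypothetical sketch: build a message that names the actual problem
// (negative request vs. over-maximum request) instead of one generic
// "< 0 or greater than maximum allowed allocation" string, so the same
// text can serve both the exception and the application diagnostics.
public class InvalidResourceMessage {
  static String invalidResourceMessage(String type, long requested, long max) {
    if (requested < 0) {
      return "Invalid resource request: requested resource type=[" + type
          + "] value " + requested + " is negative";
    }
    if (requested > max) {
      return "Invalid resource request: requested resource type=[" + type
          + "] value " + requested
          + " is greater than maximum allowed allocation " + max;
    }
    return "Valid resource request for type=[" + type + "]";
  }

  public static void main(String[] args) {
    // The 0-maxResources gpu case from the description above.
    System.out.println(invalidResourceMessage("gpu", 1, 0));
  }
}
```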
[jira] [Commented] (YARN-8330) An extra container got launched by RM for yarn-service
[ https://issues.apache.org/jira/browse/YARN-8330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556386#comment-16556386 ] Hudson commented on YARN-8330: -- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #14643 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/14643/]) YARN-8330. Improved publishing ALLOCATED events to ATS. (eyang: rev f93ecf5c1e0b3db27424963814fc01ec43eb76e0) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/TestRMContainerImpl.java > An extra container got launched by RM for yarn-service > -- > > Key: YARN-8330 > URL: https://issues.apache.org/jira/browse/YARN-8330 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Reporter: Yesha Vora >Assignee: Suma Shivaprasad >Priority: Critical > Fix For: 3.2.0, 3.1.2 > > Attachments: YARN-8330.1.patch, YARN-8330.2.patch, YARN-8330.3.patch, > YARN-8330.4.patch > > > Steps: > launch Hbase tarball app > list containers for hbase tarball app > {code} > /usr/hdp/current/hadoop-yarn-client/bin/yarn container -list > appattempt_1525463491331_0006_01 > WARNING: YARN_LOG_DIR has been replaced by HADOOP_LOG_DIR. Using value of > YARN_LOG_DIR. > WARNING: YARN_LOGFILE has been replaced by HADOOP_LOGFILE. Using value of > YARN_LOGFILE. > WARNING: YARN_PID_DIR has been replaced by HADOOP_PID_DIR. Using value of > YARN_PID_DIR. > WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. 
> 18/05/04 22:36:11 INFO client.AHSProxy: Connecting to Application History > server at xxx/xxx:10200 > 18/05/04 22:36:11 INFO client.ConfiguredRMFailoverProxyProvider: Failing over > to rm2 > Total number of containers :5 > Container-IdStart Time Finish Time > StateHost Node Http Address >LOG-URL > container_e06_1525463491331_0006_01_02Fri May 04 22:34:26 + 2018 > N/A RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_02/hrt_qa > 2018-05-04 22:36:11,216|INFO|MainThread|machine.py:167 - > run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_03 > Fri May 04 22:34:26 + 2018 N/A > RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_03/hrt_qa > 2018-05-04 22:36:11,217|INFO|MainThread|machine.py:167 - > run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_01 > Fri May 04 22:34:15 + 2018 N/A > RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_01/hrt_qa > 2018-05-04 22:36:11,217|INFO|MainThread|machine.py:167 - > run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_05 > Fri May 04 22:34:56 + 2018 N/A > RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_05/hrt_qa > 2018-05-04 22:36:11,218|INFO|MainThread|machine.py:167 - > run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_04 > Fri May 04 22:34:56 + 2018 N/A > nullxxx:25454 http://xxx:8042 > http://xxx:8188/applicationhistory/logs/xxx:25454/container_e06_1525463491331_0006_01_04/container_e06_1525463491331_0006_01_04/hrt_qa{code} > Total expected containers = 4 ( 3 components container + 1 am). Instead, RM > is listing 5 containers. > container_e06_1525463491331_0006_01_04 is in null state. > Yarn service utilized container 02, 03, 05 for component. 
There is no log > available in NM & AM related to container 04. Only one line in RM log is > printed > {code} > 2018-05-04 22:34:56,618 INFO rmcontainer.RMContainerImpl > (RMContainerImpl.java:handle(489)) - > container_e06_1525463491331_0006_01_04 Container Transitioned from NEW to > RESERVED{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-8583) Inconsistency in YARN status command
Eric Yang created YARN-8583: --- Summary: Inconsistency in YARN status command Key: YARN-8583 URL: https://issues.apache.org/jira/browse/YARN-8583 Project: Hadoop YARN Issue Type: Improvement Reporter: Eric Yang The YARN app -status command can report based on application ID or application name, with some usability limitations. Application ID is globally unique, and it allows any user to query the application status of any application. Application name is not globally unique, and it will only work for querying the user's own applications. This is somewhat restrictive for an application administrator, but allowing one user to query any other user's application could be considered a security hole as well. There are two possible options to reduce the inconsistency: Option 1. Block other users from querying application status. This may improve security in some sense, but it is an incompatible change. It is the simpler change: match the owner of the application and decide whether or not to report. Option 2. Add a --user parameter to allow an administrator to query an application name run by another user. This is a bigger change because application metadata is stored in the user's own hdfs directory. There are security restrictions that need to be defined. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
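Option 1 above boils down to an owner check before reporting status by name. A minimal sketch, assuming a caller/owner comparison plus an admin bypass (all names are illustrative, not the actual YARN API):

```java
// Minimal sketch of Option 1 (owner matching) from YARN-8583.
// Method and parameter names are illustrative, not from YARN.
public class OwnerFilterSketch {
    static boolean mayReport(String callerUser, String appOwner, boolean callerIsAdmin) {
        // Report status by application name only to the owner (or an administrator).
        return callerIsAdmin || callerUser.equals(appOwner);
    }

    public static void main(String[] args) {
        System.out.println(mayReport("alice", "alice", false)); // owner: allowed
        System.out.println(mayReport("bob", "alice", false));   // other user: blocked
        System.out.println(mayReport("bob", "alice", true));    // admin: allowed
    }
}
```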
[jira] [Updated] (YARN-8581) [AMRMProxy] Add sub-cluster timeout in LocalityMulticastAMRMProxyPolicy
[ https://issues.apache.org/jira/browse/YARN-8581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8581: --- Attachment: YARN-8581.v1.patch > [AMRMProxy] Add sub-cluster timeout in LocalityMulticastAMRMProxyPolicy > --- > > Key: YARN-8581 > URL: https://issues.apache.org/jira/browse/YARN-8581 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8581.v1.patch > > > In Federation, every time an AM heartbeat comes in, > LocalityMulticastAMRMProxyPolicy in AMRMProxy splits the asks according to > the list of active and enabled sub-clusters. However, if we haven't been able > to heartbeat to a sub-cluster for some time (network issues, or we keep > hitting some exception from YarnRM, or YarnRM master-slave switch is taking a > long time etc.), we should consider the sub-cluster as unhealthy and stop > routing asks there, until the heartbeat channel becomes healthy again. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-8448) AM HTTPS Support
[ https://issues.apache.org/jira/browse/YARN-8448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556364#comment-16556364 ] Robert Kanter edited comment on YARN-8448 at 7/25/18 10:42 PM: --- I've finished up a patch that implements everything described in YARN-6586, other than the RM HA support (TODO in YARN-8449) and Documentation (just filed YARN-8582 for this). I've put the bulk of the changes here (YARN-8448.001.patch), and the MapReduce changes in MAPREDUCE-4669. Some notes on the patch: - Updated BouncyCastle library to a newer version and had to also change the artifact from {{bcprov-jdk16}} to {{bcprov-jdk15on}}. I know that sounds backwards, but jdk15on is actually newer and the one we should be using (see http://bouncy-castle.1462172.n4.nabble.com/Bouncycaslte-bcprov-jdk15-vs-bcprov-jdk16-td4656252.html). - The {{yarn.resourcemanager.application-https.policy}} property controls how the RM should handle HTTPS when talking to AMs. It can be {{OFF}}, {{OPTIONAL}} (default), or {{REQUIRED}}. {{OFF}} makes it behave like today, where it does nothing special. {{OPTIONAL}} makes it generate and provide the keystore and truststore to the AM when it sees an HTTPS tracking URL, but HTTP is also still allowed. And {{REQUIRED}} is like {{OPTIONAL}}, but it won't follow HTTP tracking URLs. - A lot of the code around the container executors is in providing/copying/etc the keystore and truststore files. I've largely based this on the existing way we handle the credentials (delegation tokens) file. - When provided a keystore file, the AM will get env vars {{KEYSTORE_FILE_LOCATION}} and {{KEYSTORE_PASSWORD}}; similarly, {{TRUSTSTORE_FILE_LOCATION}} and {{TRUSTSTORE_PASSWORD}} for the truststore file. - Due to the (ugly) way we parse arguments in the LCE, I had to add an argument that's either {{\-\-http}} or {{\-\-https}} to indicate if we'll be providing it the keystore and truststore files. 
Otherwise, there isn't a good way to have optional arguments. - In order to keep things simple, I piggybacked passing the keystore and truststore files and passwords via secrets in the Credentials, which is already securely passed from the RM to the NM. - {{ProxyCAManager}} is in charge of creating the certificates, keystores, and truststores. - When writing the unit tests, I found a number of tests that were about 80% complete in what they were testing, which I completed in addition to adding tests for my changes. -- I also tried to simplify some things (e.g. {{TestDockerContainerRuntime}} has ~30 tests that all duplicate the code for checking the arguments, and because I changed the number of arguments, they all failed - instead of updating them all, I created a helper method) - I'm not sure what's up with {{test-container-executor}}, but unless my environment was messed up, it doesn't work when run as {{root}}; maybe people typically run it as a normal user? The test talks about running as {{root}} as an option, and even has a few tests that only run when running as {{root}}. I spent some time fixing this - it now runs in all 4 user configurations described in the existing comments. - I've tested in a real cluster with the DefaultContainerExecutor and LinuxContainerExecutor using all combinations of {{yarn.resourcemanager.application-https.policy}}, {{yarn.app.mapreduce.am.webapp.https.enabled}}, and {{yarn.app.mapreduce.am.webapp.https.client.auth}} (see MAPREDUCE-4669), and everything behaved correctly. I haven't tested out the DockerContainerExecutor. -- If you want to try this out yourself in a cluster, I'd recommend also applying the MAPREDUCE-4669 patch so you have an AM that supports the changes. You can then use {{openssl s_client -connect :}} to get SSL details. You can also try {{curl}}. 
was (Author: rkanter): I've finished up a patch that implements everything described in YARN-6586, other than the RM HA support (TODO in YARN-8449) and Documentation (just filed YARN-8582 for this). I've put the bulk of the changes here (YARN-8448.001.patch), and the MapReduce changes in MAPREDUCE-4669. Some notes on the patch: - Updated BouncyCastle library to a newer version and had to also change the artifact from {{bcprov-jdk16}} to {{bcprov-jdk15on}}. I know that sounds backwards, but jdk15on is actually newer and the one we should be using (see http://bouncy-castle.1462172.n4.nabble.com/Bouncycaslte-bcprov-jdk15-vs-bcprov-jdk16-td4656252.html). - The {{yarn.resourcemanager.application-https.policy}} property controls how the RM should handle HTTPS when talking to AMs. It can be {{OFF}}, {{OPTIONAL}} (default), or {{REQUIRED}}. {{OFF}} makes it behave like today, where it does nothing special. {{OPTIONAL}} makes it generate and provide the keystore and truststore to the AM when it sees an HTTPS tracking URL, but HTTP is also still allowed. And
[jira] [Updated] (YARN-8448) AM HTTPS Support
[ https://issues.apache.org/jira/browse/YARN-8448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter updated YARN-8448: Attachment: YARN-8448.001.patch > AM HTTPS Support > > > Key: YARN-8448 > URL: https://issues.apache.org/jira/browse/YARN-8448 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Robert Kanter >Assignee: Robert Kanter >Priority: Major > Attachments: YARN-8448.001.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8448) AM HTTPS Support
[ https://issues.apache.org/jira/browse/YARN-8448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556364#comment-16556364 ] Robert Kanter commented on YARN-8448: - I've finished up a patch that implements everything described in YARN-6586, other than the RM HA support (TODO in YARN-8449) and Documentation (just filed YARN-8582 for this). I've put the bulk of the changes here (YARN-8448.001.patch), and the MapReduce changes in MAPREDUCE-4669. Some notes on the patch: - Updated BouncyCastle library to a newer version and had to also change the artifact from {{bcprov-jdk16}} to {{bcprov-jdk15on}}. I know that sounds backwards, but jdk15on is actually newer and the one we should be using (see http://bouncy-castle.1462172.n4.nabble.com/Bouncycaslte-bcprov-jdk15-vs-bcprov-jdk16-td4656252.html). - The {{yarn.resourcemanager.application-https.policy}} property controls how the RM should handle HTTPS when talking to AMs. It can be {{OFF}}, {{OPTIONAL}} (default), or {{REQUIRED}}. {{OFF}} makes it behave like today, where it does nothing special. {{OPTIONAL}} makes it generate and provide the keystore and truststore to the AM when it sees an HTTPS tracking URL, but HTTP is also still allowed. And {{REQUIRED}} is like {{OPTIONAL}}, but it won't follow HTTP tracking URLs. - A lot of the code around the container executors is in providing/copying/etc the keystore and truststore files. I've largely based this on the existing way we handle the credentials (delegation tokens) file. - When provided a keystore file, the AM will get env vars {{KEYSTORE_FILE_LOCATION}} and {{KEYSTORE_PASSWORD}}; similarly, {{TRUSTSTORE_FILE_LOCATION}} and {{TRUSTSTORE_PASSWORD}} for the truststore file. - Due to the (ugly) way we parse arguments in the LCE, I had to add an argument that's either {{--http}} or {{--https}} to indicate if we'll be providing it the keystore and truststore files. Otherwise, there isn't a good way to have optional arguments. 
- In order to keep things simple, I piggybacked passing the keystore and truststore files and passwords via secrets in the Credentials, which is already securely passed from the RM to the NM. - {{ProxyCAManager}} is in charge of creating the certificates, keystores, and truststores. - When writing the unit tests, I found a number of tests that were about 80% complete in what they were testing, which I completed in addition to adding tests for my changes. -- I also tried to simplify some things (e.g. {{TestDockerContainerRuntime}} has ~30 tests that all duplicate the code for checking the arguments, and because I changed the number of arguments, they all failed - instead of updating them all, I created a helper method) - I'm not sure what's up with {{test-container-executor}}, but unless my environment was messed up, it doesn't work when run as {{root}}; maybe people typically run it as a normal user? The test talks about running as {{root}} as an option, and even has a few tests that only run when running as {{root}}. I spent some time fixing this - it now runs in all 4 user configurations described in the existing comments. - I've tested in a real cluster with the DefaultContainerExecutor and LinuxContainerExecutor using all combinations of {{yarn.resourcemanager.application-https.policy}}, {{yarn.app.mapreduce.am.webapp.https.enabled}}, and {{yarn.app.mapreduce.am.webapp.https.client.auth}} (see MAPREDUCE-4669), and everything behaved correctly. I haven't tested out the DockerContainerExecutor. -- If you want to try this out yourself in a cluster, I'd recommend also applying the MAPREDUCE-4669 patch so you have an AM that supports the changes. You can then use {{openssl s_client -connect :}} to get SSL details. You can also try {{curl}}. 
> AM HTTPS Support > > > Key: YARN-8448 > URL: https://issues.apache.org/jira/browse/YARN-8448 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Robert Kanter >Assignee: Robert Kanter >Priority: Major > Attachments: YARN-8448.001.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
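Given the env vars described in the comment above ({{KEYSTORE_FILE_LOCATION}}, {{KEYSTORE_PASSWORD}}, and their truststore counterparts), an AM could consume the provided SSL material roughly as follows. This is a sketch using generic JSSE boilerplate; only the env var names come from the patch notes, and the surrounding web server wiring is omitted:

```java
import java.io.FileInputStream;
import java.security.KeyStore;

// Sketch of how an AM might consume the keystore/truststore env vars
// provided by the YARN-8448 patch. Only the env var names come from the
// comment above; the loading code is standard JSSE, not from the patch.
public class AmSslSketch {
    static KeyStore load(String fileEnv, String passwordEnv) throws Exception {
        String file = System.getenv(fileEnv);
        String password = System.getenv(passwordEnv);
        if (file == null || password == null) {
            // RM did not provide SSL material (e.g. policy OFF, or HTTP tracking URL).
            return null;
        }
        KeyStore ks = KeyStore.getInstance(KeyStore.getDefaultType());
        try (FileInputStream in = new FileInputStream(file)) {
            ks.load(in, password.toCharArray());
        }
        return ks;
    }

    public static void main(String[] args) throws Exception {
        KeyStore keystore = load("KEYSTORE_FILE_LOCATION", "KEYSTORE_PASSWORD");
        KeyStore truststore = load("TRUSTSTORE_FILE_LOCATION", "TRUSTSTORE_PASSWORD");
        System.out.println(keystore == null ? "no keystore provided" : "keystore loaded");
        System.out.println(truststore == null ? "no truststore provided" : "truststore loaded");
    }
}
```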
[jira] [Created] (YARN-8582) Documentation for AM HTTPS Support
Robert Kanter created YARN-8582: --- Summary: Documentation for AM HTTPS Support Key: YARN-8582 URL: https://issues.apache.org/jira/browse/YARN-8582 Project: Hadoop YARN Issue Type: Sub-task Components: docs Reporter: Robert Kanter Assignee: Robert Kanter Documentation for YARN-6586. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8581) [AMRMProxy] Add sub-cluster timeout in LocalityMulticastAMRMProxyPolicy
[ https://issues.apache.org/jira/browse/YARN-8581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8581: --- Issue Type: Sub-task (was: Task) Parent: YARN-5597 > [AMRMProxy] Add sub-cluster timeout in LocalityMulticastAMRMProxyPolicy > --- > > Key: YARN-8581 > URL: https://issues.apache.org/jira/browse/YARN-8581 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > > In Federation, every time an AM heartbeat comes in, > LocalityMulticastAMRMProxyPolicy in AMRMProxy splits the asks according to > the list of active and enabled sub-clusters. However, if we haven't been able > to heartbeat to a sub-cluster for some time (network issues, or we keep > hitting some exception from YarnRM, or YarnRM master-slave switch is taking a > long time etc.), we should consider the sub-cluster as unhealthy and stop > routing asks there, until the heartbeat channel becomes healthy again. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-8581) [AMRMProxy] Add sub-cluster timeout in LocalityMulticastAMRMProxyPolicy
Botong Huang created YARN-8581: -- Summary: [AMRMProxy] Add sub-cluster timeout in LocalityMulticastAMRMProxyPolicy Key: YARN-8581 URL: https://issues.apache.org/jira/browse/YARN-8581 Project: Hadoop YARN Issue Type: Task Components: amrmproxy, federation Reporter: Botong Huang Assignee: Botong Huang In Federation, every time an AM heartbeat comes in, LocalityMulticastAMRMProxyPolicy in AMRMProxy splits the asks according to the list of active and enabled sub-clusters. However, if we haven't been able to heartbeat to a sub-cluster for some time (network issues, or we keep hitting some exception from YarnRM, or YarnRM master-slave switch is taking a long time etc.), we should consider the sub-cluster as unhealthy and stop routing asks there, until the heartbeat channel becomes healthy again. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
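The behavior described above — stop routing asks to a sub-cluster that has not heartbeated successfully within some timeout — could be tracked with a per-sub-cluster timestamp map, sketched below. Class and method names are illustrative, not from the eventual patch:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the sub-cluster timeout idea from YARN-8581: record the last
// successful heartbeat per sub-cluster and treat silent ones as unhealthy
// when splitting asks. Names are illustrative, not from the patch.
public class SubClusterTimeoutSketch {
    private final long timeoutMs;
    private final Map<String, Long> lastSuccessfulHeartbeat = new HashMap<>();

    SubClusterTimeoutSketch(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    void recordHeartbeat(String subCluster, long nowMs) {
        lastSuccessfulHeartbeat.put(subCluster, nowMs);
    }

    boolean isHealthy(String subCluster, long nowMs) {
        Long last = lastSuccessfulHeartbeat.get(subCluster);
        // Never heard back, or silent longer than the timeout: skip it when
        // splitting asks, until the heartbeat channel recovers.
        return last != null && nowMs - last <= timeoutMs;
    }

    public static void main(String[] args) {
        SubClusterTimeoutSketch policy = new SubClusterTimeoutSketch(60_000);
        policy.recordHeartbeat("sc1", 0);
        System.out.println(policy.isHealthy("sc1", 30_000));  // within timeout
        System.out.println(policy.isHealthy("sc1", 120_000)); // timed out
        System.out.println(policy.isHealthy("sc2", 30_000));  // never heartbeated
    }
}
```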
[jira] [Assigned] (YARN-8579) New AM attempt could not retrieve previous attempt component data
[ https://issues.apache.org/jira/browse/YARN-8579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gour Saha reassigned YARN-8579: --- Assignee: Gour Saha > New AM attempt could not retrieve previous attempt component data > - > > Key: YARN-8579 > URL: https://issues.apache.org/jira/browse/YARN-8579 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.1 >Reporter: Yesha Vora >Assignee: Gour Saha >Priority: Critical > > Steps: > 1) Launch httpd-docker > 2) Wait for app to be in STABLE state > 3) Run validation for app (It takes around 3 mins) > 4) Stop all Zks > 5) Wait 60 sec > 6) Kill AM > 7) wait for 30 sec > 8) Start all ZKs > 9) Wait for application to finish > 10) Validate expected containers of the app > Expected behavior: > New attempt of AM should start and docker containers launched by 1st attempt > should be recovered by new attempt. > Actual behavior: > New AM attempt starts. It can not recover 1st attempt docker containers. It > can not read component details from ZK. > Thus, it starts new attempt for all containers. > {code} > 2018-07-19 22:42:47,595 [main] INFO service.ServiceScheduler - Registering > appattempt_1531977563978_0015_02, fault-test-zkrm-httpd-docker into > registry > 2018-07-19 22:42:47,611 [main] INFO service.ServiceScheduler - Received 1 > containers from previous attempt. 
> 2018-07-19 22:42:47,642 [main] INFO service.ServiceScheduler - Could not > read component paths: > `/users/hrt-qa/services/yarn-service/fault-test-zkrm-httpd-docker/components': > No such file or directory: KeeperErrorCode = NoNode for > /registry/users/hrt-qa/services/yarn-service/fault-test-zkrm-httpd-docker/components > 2018-07-19 22:42:47,643 [main] INFO service.ServiceScheduler - Handling > container_e08_1531977563978_0015_01_03 from previous attempt > 2018-07-19 22:42:47,643 [main] INFO service.ServiceScheduler - Record not > found in registry for container container_e08_1531977563978_0015_01_03 > from previous attempt, releasing > 2018-07-19 22:42:47,649 [AMRM Callback Handler Thread] INFO > impl.TimelineV2ClientImpl - Updated timeline service address to xxx:33019 > 2018-07-19 22:42:47,651 [main] INFO service.ServiceScheduler - Triggering > initial evaluation of component httpd > 2018-07-19 22:42:47,652 [main] INFO component.Component - [INIT COMPONENT > httpd]: 2 instances. > 2018-07-19 22:42:47,652 [main] INFO component.Component - [COMPONENT httpd] > Requesting for 2 container(s){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8569) Create an interface to provide cluster information to application
[ https://issues.apache.org/jira/browse/YARN-8569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556348#comment-16556348 ] Eric Yang commented on YARN-8569: - [~oliverhuh...@gmail.com] Your approach works fine for applications that carry a Hadoop client or ZooKeeper client. This proposed interface is to lower the bar of entry for obtaining cluster information for non-Hadoop native applications. This is the main reason to offer a file based interface for nodes. The high level view of the design looks like this: # Application Master receives the YARN service JSON from the yarn cli. # Application Master writes the hostname information to the YARN service JSON residing in /user/${USER}/.yarn/services/[service]/[service].json # The file is added to the distributed cache and localized during container launch. # The file is bind-mounted into the docker container for consumption at a predefined location. # A flex operation will trigger an update of [service].json and repopulate the distributed cache when the nodes involved in the cluster have changed. The user application can poll for file changes from the docker container to be notified of cluster information changes. > Create an interface to provide cluster information to application > - > > Key: YARN-8569 > URL: https://issues.apache.org/jira/browse/YARN-8569 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Yang >Priority: Major > Labels: Docker > > Some programs require container hostnames to be known for the application to run. 
> For example, distributed tensorflow requires launch_command that looks like: > {code} > # On ps0.example.com: > $ python trainer.py \ > --ps_hosts=ps0.example.com:,ps1.example.com: \ > --worker_hosts=worker0.example.com:,worker1.example.com: \ > --job_name=ps --task_index=0 > # On ps1.example.com: > $ python trainer.py \ > --ps_hosts=ps0.example.com:,ps1.example.com: \ > --worker_hosts=worker0.example.com:,worker1.example.com: \ > --job_name=ps --task_index=1 > # On worker0.example.com: > $ python trainer.py \ > --ps_hosts=ps0.example.com:,ps1.example.com: \ > --worker_hosts=worker0.example.com:,worker1.example.com: \ > --job_name=worker --task_index=0 > # On worker1.example.com: > $ python trainer.py \ > --ps_hosts=ps0.example.com:,ps1.example.com: \ > --worker_hosts=worker0.example.com:,worker1.example.com: \ > --job_name=worker --task_index=1 > {code} > This is a bit cumbersome to orchestrate via Distributed Shell or the YARN > services launch_command. In addition, the dynamic parameters do not work > with the YARN flex command. This is the classic pain point for application > developers attempting to automate system environment settings as parameters to > the end user application. > It would be great if YARN Docker integration could provide a simple option to > expose hostnames of the yarn service via a mounted file. The file content > gets updated when a flex command is performed. This allows application > developers to consume system environment settings via a standard interface. > It is like /proc/devices for Linux, but for Hadoop. This may involve > updating a file in the distributed cache, and allowing the file to be mounted via > container-executor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
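On the consumption side, the proposal amounts to the application polling the bind-mounted service JSON for changes after a flex. A minimal sketch, assuming plain mtime polling; the mount path and the change-detection mechanism are not specified by this JIRA, so both are illustrative:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the consumption side of the YARN-8569 proposal: the application
// inside the container polls the bind-mounted [service].json for changes
// after a flex. Plain mtime polling is an assumption; the JIRA does not fix
// a mount path or notification mechanism.
public class ClusterInfoWatcher {
    private long lastModified = -1;

    // Returns true when the service JSON appeared or changed since the last check.
    boolean membershipChanged(Path serviceJson) throws IOException {
        if (!Files.exists(serviceJson)) {
            return false;
        }
        long mtime = Files.getLastModifiedTime(serviceJson).toMillis();
        if (mtime == lastModified) {
            return false;
        }
        lastModified = mtime;
        return true; // real code would re-parse the JSON membership here
    }

    public static void main(String[] args) throws IOException {
        ClusterInfoWatcher watcher = new ClusterInfoWatcher();
        // A temp file stands in for the bind-mounted service JSON.
        Path json = Files.createTempFile("service", ".json");
        Files.writeString(json, "{\"components\": []}");
        System.out.println(watcher.membershipChanged(json)); // first sighting
        System.out.println(watcher.membershipChanged(json)); // unchanged
    }
}
```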
[jira] [Commented] (YARN-8418) App local logs could leaked if log aggregation fails to initialize for the app
[ https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556347#comment-16556347 ] Suma Shivaprasad commented on YARN-8418: Thanks for updating the patch and clarifying, [~bibinchundatt]. LGTM. [~sunilg] Can you please take a look ? > App local logs could leaked if log aggregation fails to initialize for the app > -- > > Key: YARN-8418 > URL: https://issues.apache.org/jira/browse/YARN-8418 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.0, 3.0.0-alpha1 >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Attachments: YARN-8418.001.patch, YARN-8418.002.patch, > YARN-8418.003.patch, YARN-8418.004.patch, YARN-8418.005.patch, > YARN-8418.006.patch, YARN-8418.007.patch > > > If log aggregation fails to init the createApp directory, container logs could > get leaked in the NM directory. > For a long-running application, this case is possible on restart of the NM > after token renewal, or on application submission with an invalid delegation token. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8407) Container launch exception in AM log should be printed in ERROR level
[ https://issues.apache.org/jira/browse/YARN-8407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yesha Vora updated YARN-8407: - Attachment: YARN-8407.001.patch > Container launch exception in AM log should be printed in ERROR level > - > > Key: YARN-8407 > URL: https://issues.apache.org/jira/browse/YARN-8407 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yesha Vora >Assignee: Yesha Vora >Priority: Major > Attachments: YARN-8407.001.patch > > > when a container launch is failing due to docker image not available is > logged as INFO level in AM log. > Container launch failure should be logged as ERROR. > Steps: > launch httpd yarn-service application with invalid docker image > > {code:java} > 2018-06-07 01:51:32,966 [Component dispatcher] INFO > instance.ComponentInstance - [COMPINSTANCE httpd-0 : > container_e05_1528335963594_0001_01_02]: > container_e05_1528335963594_0001_01_02 completed. Reinsert back to > pending list and requested a new container. > exitStatus=-1, diagnostics=[2018-06-07 01:51:02.363]Exception from > container-launch. > Container id: container_e05_1528335963594_0001_01_02 > Exit code: 7 > Exception message: Launch container failed > Shell error output: Unable to find image 'xxx/httpd:0.1' locally > Trying to pull repository xxx/httpd ... > /usr/bin/docker-current: Get https://xxx/v1/_ping: dial tcp: lookup xxx on > yyy: no such host. > See '/usr/bin/docker-current run --help'. > Shell output: main : command provided 4 > main : run as user is hbase > main : requested yarn user is hbase > Creating script paths... > Creating local dirs... > Getting exit code file... > Changing effective user to root... > Wrote the exit code 7 to > /grid/0/hadoop/yarn/local/nmPrivate/application_1528335963594_0001/container_e05_1528335963594_0001_01_02/container_e05_1528335963594_0001_01_02.pid.exitcode > [2018-06-07 01:51:02.393]Diagnostic message from attempt : > [2018-06-07 01:51:02.394]Container exited with a non-zero exit code 7. 
Last > 4096 bytes of stderr.txt : > [2018-06-07 01:51:32.428]Could not find > nmPrivate/application_1528335963594_0001/container_e05_1528335963594_0001_01_02//container_e05_1528335963594_0001_01_02.pid > in any of the directories > 2018-06-07 01:51:32,966 [Component dispatcher] INFO > instance.ComponentInstance - [COMPINSTANCE httpd-0 : > container_e05_1528335963594_0001_01_02] Transitioned from STARTED to INIT > on STOP event{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-8580) yarn.resourcemanager.am.max-attempts is not respected for yarn services
[ https://issues.apache.org/jira/browse/YARN-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giovanni Matteo Fumarola resolved YARN-8580. Resolution: Invalid > yarn.resourcemanager.am.max-attempts is not respected for yarn services > --- > > Key: YARN-8580 > URL: https://issues.apache.org/jira/browse/YARN-8580 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Affects Versions: 3.1.1 >Reporter: Yesha Vora >Priority: Major > > 1) Max am attempt is set to 100 on all nodes. ( including gateway) > {code} > > yarn.resourcemanager.am.max-attempts > 100 > {code} > 2) Start a Yarn service ( Hbase tarball ) application > 3) Kill AM 20 times > Here, App fails with below diagnostics. > {code} > bash-4.2$ /usr/hdp/current/hadoop-yarn-client/bin/yarn application -status > application_1532481557746_0001 > 18/07/25 18:43:34 INFO client.AHSProxy: Connecting to Application History > server at xxx/xxx:10200 > 18/07/25 18:43:34 INFO client.ConfiguredRMFailoverProxyProvider: Failing over > to rm2 > 18/07/25 18:43:34 INFO conf.Configuration: found resource resource-types.xml > at file:/etc/hadoop/3.0.0.0-1634/0/resource-types.xml > Application Report : > Application-Id : application_1532481557746_0001 > Application-Name : hbase-tarball-lr > Application-Type : yarn-service > User : hbase > Queue : default > Application Priority : 0 > Start-Time : 1532481864863 > Finish-Time : 1532522943103 > Progress : 100% > State : FAILED > Final-State : FAILED > Tracking-URL : > https://xxx:8090/cluster/app/application_1532481557746_0001 > RPC Port : -1 > AM Host : N/A > Aggregate Resource Allocation : 252150112 MB-seconds, 164141 > vcore-seconds > Aggregate Resource Preempted : 0 MB-seconds, 0 vcore-seconds > Log Aggregation Status : SUCCEEDED > Diagnostics : Application application_1532481557746_0001 failed 20 > times (global limit =100; local limit is =20) due to AM Container for > appattempt_1532481557746_0001_20 exited with exitCode: 137 > 
Failing this attempt.Diagnostics: [2018-07-25 12:49:00.784]Container killed > on request. Exit code is 137 > [2018-07-25 12:49:03.045]Container exited with a non-zero exit code 137. > [2018-07-25 12:49:03.045]Killed by external signal > For more detailed output, check the application tracking page: > https://xxx:8090/cluster/app/application_1532481557746_0001 Then click on > links to logs of each attempt. > . Failing the application. > Unmanaged Application : false > Application Node Label Expression : > AM container Node Label Expression : > TimeoutType : LIFETIME ExpiryTime : 2018-07-25T22:26:15.419+ > RemainingTime : 0seconds > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8580) yarn.resourcemanager.am.max-attempts is not respected for yarn services
[ https://issues.apache.org/jira/browse/YARN-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556321#comment-16556321 ] Giovanni Matteo Fumarola commented on YARN-8580: Hi [~yeshavora], YARN takes the minimum of the global limit and the local (per-application) limit. Your global limit is set to 100 (yarn.resourcemanager.am.max-attempts) while the AM limit is set to 20. Closing this Jira as invalid. Diagnostics : Application application_1532481557746_0001 failed 20 times (global limit =100; local limit is =20)
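For context on the resolution: as noted elsewhere in this thread, a YARN service takes its per-application cap from yarn.service.am-restart.max-attempts (default 20), and the RM enforces the minimum of that and the global yarn.resourcemanager.am.max-attempts. A toy sketch of the effective-limit rule, matching the "global limit =100; local limit is =20" diagnostics above (the function is illustrative, not actual Hadoop code):

```python
def effective_am_max_attempts(global_limit, service_limit=20):
    """The RM caps AM attempts at min(global limit, per-app limit).
    For a YARN service the per-app limit defaults to 20 via
    yarn.service.am-restart.max-attempts, so raising only
    yarn.resourcemanager.am.max-attempts (the global limit) has no
    effect once it exceeds the service-side value."""
    return min(global_limit, service_limit)
```

With the configuration from this report (global 100, service default 20) the application is failed after 20 AM attempts, exactly as the diagnostics show.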
[jira] [Created] (YARN-8580) yarn.resourcemanager.am.max-attempts is not respected for yarn services
Yesha Vora created YARN-8580: Summary: yarn.resourcemanager.am.max-attempts is not respected for yarn services Key: YARN-8580 URL: https://issues.apache.org/jira/browse/YARN-8580 Project: Hadoop YARN Issue Type: Bug Components: yarn-native-services Affects Versions: 3.1.1 Reporter: Yesha Vora
[jira] [Created] (YARN-8579) New AM attempt could not retrieve previous attempt component data
Yesha Vora created YARN-8579: Summary: New AM attempt could not retrieve previous attempt component data Key: YARN-8579 URL: https://issues.apache.org/jira/browse/YARN-8579 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.1.1 Reporter: Yesha Vora Steps: 1) Launch httpd-docker 2) Wait for the app to reach the STABLE state 3) Run validation for the app (it takes around 3 mins) 4) Stop all ZKs 5) Wait 60 sec 6) Kill the AM 7) Wait 30 sec 8) Start all ZKs 9) Wait for the application to finish 10) Validate the expected containers of the app Expected behavior: A new AM attempt should start, and the docker containers launched by the 1st attempt should be recovered by the new attempt. Actual behavior: A new AM attempt starts, but it cannot recover the 1st attempt's docker containers because it cannot read component details from ZK. Thus, it launches new containers for every component. {code} 2018-07-19 22:42:47,595 [main] INFO service.ServiceScheduler - Registering appattempt_1531977563978_0015_02, fault-test-zkrm-httpd-docker into registry 2018-07-19 22:42:47,611 [main] INFO service.ServiceScheduler - Received 1 containers from previous attempt. 
2018-07-19 22:42:47,642 [main] INFO service.ServiceScheduler - Could not read component paths: `/users/hrt-qa/services/yarn-service/fault-test-zkrm-httpd-docker/components': No such file or directory: KeeperErrorCode = NoNode for /registry/users/hrt-qa/services/yarn-service/fault-test-zkrm-httpd-docker/components 2018-07-19 22:42:47,643 [main] INFO service.ServiceScheduler - Handling container_e08_1531977563978_0015_01_03 from previous attempt 2018-07-19 22:42:47,643 [main] INFO service.ServiceScheduler - Record not found in registry for container container_e08_1531977563978_0015_01_03 from previous attempt, releasing 2018-07-19 22:42:47,649 [AMRM Callback Handler Thread] INFO impl.TimelineV2ClientImpl - Updated timeline service address to xxx:33019 2018-07-19 22:42:47,651 [main] INFO service.ServiceScheduler - Triggering initial evaluation of component httpd 2018-07-19 22:42:47,652 [main] INFO component.Component - [INIT COMPONENT httpd]: 2 instances. 2018-07-19 22:42:47,652 [main] INFO component.Component - [COMPONENT httpd] Requesting for 2 container(s){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
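The log above shows the failure mode: containers handed over from the previous attempt are released whenever their record cannot be read from the ZK-backed service registry. A rough sketch of that triage decision, with illustrative names rather than the actual ServiceScheduler API:

```python
def triage_recovered_containers(previous_containers, registry_records):
    """For each container handed over from the previous AM attempt,
    keep it only if its record can be read from the service registry;
    otherwise release it. Losing the ZK component paths therefore
    forces every container to be released and restarted, as in the
    log above. Names are illustrative, not the ServiceScheduler API."""
    recovered, released = [], []
    for container_id in previous_containers:
        if container_id in registry_records:
            recovered.append(container_id)   # reattach to the component
        else:
            released.append(container_id)    # "Record not found ... releasing"
    return recovered, released
```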
[jira] [Commented] (YARN-8545) YARN native service should return container if launch failed
[ https://issues.apache.org/jira/browse/YARN-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556267#comment-16556267 ] Gour Saha commented on YARN-8545: - [~csingh] patch 001 looks good to me. +1. > YARN native service should return container if launch failed > > > Key: YARN-8545 > URL: https://issues.apache.org/jira/browse/YARN-8545 > Project: Hadoop YARN > Issue Type: Task >Reporter: Wangda Tan >Assignee: Chandni Singh >Priority: Critical > Attachments: YARN-8545.001.patch > > > In some cases, container launch may fail but container will not be properly > returned to RM. > This could happen when AM trying to prepare container launch context but > failed w/o sending container launch context to NM (Once container launch > context is sent to NM, NM will report failed container to RM). > Exception like: > {code:java} > java.io.FileNotFoundException: File does not exist: > hdfs://ns1/user/wtan/.yarn/services/tf-job-001/components/1531852429056/primary-worker/primary-worker-0/run-PRIMARY_WORKER.sh > at > org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583) > at > org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591) > at > org.apache.hadoop.yarn.service.utils.CoreFileSystem.createAmResource(CoreFileSystem.java:388) > at > org.apache.hadoop.yarn.service.provider.ProviderUtils.createConfigFileAndAddLocalResource(ProviderUtils.java:253) > at > org.apache.hadoop.yarn.service.provider.AbstractProviderService.buildContainerLaunchContext(AbstractProviderService.java:152) > at > org.apache.hadoop.yarn.service.containerlaunch.ContainerLaunchService$ContainerLauncher.run(ContainerLaunchService.java:105) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745){code} > And even after container launch context prepare failed, AM still trying to > monitor container's readiness: > {code:java} > 2018-07-17 18:42:57,518 [pool-7-thread-1] INFO monitor.ServiceMonitor - > Readiness check failed for primary-worker-0: Probe Status, time="Tue Jul 17 > 18:42:57 UTC 2018", outcome="failure", message="Failure in Default probe: IP > presence", exception="java.io.IOException: primary-worker-0: IP is not > available yet" > ...{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
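A sketch of the failure path being fixed here: when building the container launch context throws before the NM is ever contacted, the NM never reports the container as failed, so the AM has to release it back to the RM itself. All callables below are stand-ins, not the ContainerLaunchService API:

```python
def launch_container(container_id, build_launch_context,
                     start_container, release_container):
    """If preparing the launch context fails before the NM sees the
    container, the NM cannot report it failed, so the AM must return
    the container to the RM itself instead of monitoring readiness
    forever. Callables are illustrative stand-ins."""
    try:
        ctx = build_launch_context(container_id)
    except Exception:
        release_container(container_id)     # give the container back to the RM
        return False
    start_container(container_id, ctx)      # from here the NM reports failures
    return True

# Demo: the FileNotFoundException in the stack trace above maps to
# build_launch_context raising during localization.
released = []

def failing_build(container_id):
    raise IOError("run-PRIMARY_WORKER.sh does not exist")

launched = launch_container("container_01", failing_build,
                            lambda cid, ctx: None, released.append)
```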
[jira] [Commented] (YARN-8569) Create an interface to provide cluster information to application
[ https://issues.apache.org/jira/browse/YARN-8569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556244#comment-16556244 ] Keqiu Hu commented on YARN-8569: I can see the value of sharing some information between cluster nodes with this; I just want to share how we tackled the same problem. We did it by storing the information in the application master; each worker node has a TaskExecutor that heartbeats with the AM to get the latest cluster information. How do you ensure access to the file is atomic when, for example, multiple nodes modify the mounted file at the same time? > Create an interface to provide cluster information to application > - > > Key: YARN-8569 > URL: https://issues.apache.org/jira/browse/YARN-8569 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Yang >Priority: Major > Labels: Docker > > Some programs require container hostnames to be known for the application to run. > For example, distributed tensorflow requires a launch_command that looks like: > {code} > # On ps0.example.com: > $ python trainer.py \ > --ps_hosts=ps0.example.com:,ps1.example.com: \ > --worker_hosts=worker0.example.com:,worker1.example.com: \ > --job_name=ps --task_index=0 > # On ps1.example.com: > $ python trainer.py \ > --ps_hosts=ps0.example.com:,ps1.example.com: \ > --worker_hosts=worker0.example.com:,worker1.example.com: \ > --job_name=ps --task_index=1 > # On worker0.example.com: > $ python trainer.py \ > --ps_hosts=ps0.example.com:,ps1.example.com: \ > --worker_hosts=worker0.example.com:,worker1.example.com: \ > --job_name=worker --task_index=0 > # On worker1.example.com: > $ python trainer.py \ > --ps_hosts=ps0.example.com:,ps1.example.com: \ > --worker_hosts=worker0.example.com:,worker1.example.com: \ > --job_name=worker --task_index=1 > {code} > This is a bit cumbersome to orchestrate via Distributed Shell or the YARN > services launch_command. In addition, the dynamic parameters do not work > with the YARN flex command. 
This is a classic pain point for application > developers attempting to automate system environment settings as parameters to the end > user application. > It would be great if the YARN Docker integration could provide a simple option to > expose the hostnames of the yarn service via a mounted file. The file content > gets updated when a flex command is performed. This allows application > developers to consume system environment settings via a standard interface. > It is like /proc/devices on Linux, but for Hadoop. This may involve > updating a file in the distributed cache and allowing the file to be mounted via > container-executor.
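To make the orchestration pain concrete, here is a sketch of how an application-side helper could assemble the flags above from a cluster-membership mapping such as one read from the proposed mounted file. The mapping shape, hostnames, and ports are assumptions for illustration (the ports are elided in the original example):

```python
def tf_args(role_hosts, job_name, task_index):
    """Build the distributed-TensorFlow launch flags shown above from
    a role -> [host:port, ...] mapping, e.g. parsed from the proposed
    mounted cluster-info file. The file/mapping format is an assumed
    placeholder, not part of the actual proposal."""
    return [
        "--ps_hosts=" + ",".join(role_hosts["ps"]),
        "--worker_hosts=" + ",".join(role_hosts["worker"]),
        "--job_name=" + job_name,
        "--task_index=" + str(task_index),
    ]

# Placeholder hosts/ports standing in for the elided ones above.
hosts = {"ps": ["ps0.example.com:2222", "ps1.example.com:2222"],
         "worker": ["worker0.example.com:2222", "worker1.example.com:2222"]}
```

If the mounted file were rewritten on each flex, re-reading it and rebuilding these flags would pick up membership changes without re-templating the launch_command.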
[jira] [Commented] (YARN-6966) NodeManager metrics may return wrong negative values when NM restart
[ https://issues.apache.org/jira/browse/YARN-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556179#comment-16556179 ] genericqa commented on YARN-6966: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} docker {color} | {color:red} 12m 7s{color} | {color:red} Docker failed to build yetus/hadoop:f667ef1. {color} | \\ \\ || Subsystem || Report/Notes || | JIRA Issue | YARN-6966 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12933031/YARN-6966-branch-2.001.patch | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/21368/console | | Powered by | Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org | This message was automatically generated. > NodeManager metrics may return wrong negative values when NM restart > > > Key: YARN-6966 > URL: https://issues.apache.org/jira/browse/YARN-6966 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang >Assignee: Szilard Nemeth >Priority: Major > Fix For: 3.2.0, 3.1.2 > > Attachments: YARN-6966-branch-2.001.patch, YARN-6966.001.patch, > YARN-6966.002.patch, YARN-6966.003.patch, YARN-6966.004.patch, > YARN-6966.005.patch, YARN-6966.005.patch, YARN-6966.006.patch > > > Just as YARN-6212. However, I think it is not a duplicate of YARN-3933. > The primary cause of negative values is that metrics do not recover properly > when NM restart. > AllocatedContainers,ContainersLaunched,AllocatedGB,AvailableGB,AllocatedVCores,AvailableVCores > in metrics also need to recover when NM restart. > This should be done in ContainerManagerImpl#recoverContainer. 
> The scenario can be reproduced by the following steps: > # Make sure > YarnConfiguration.NM_RECOVERY_ENABLED=true,YarnConfiguration.NM_RECOVERY_SUPERVISED=true > in the NM > # Submit an application and keep it running > # Restart the NM > # Stop the application > # Now you get the negative values > {code} > /jmx?qry=Hadoop:service=NodeManager,name=NodeManagerMetrics > {code} > {code} > { > name: "Hadoop:service=NodeManager,name=NodeManagerMetrics", > modelerType: "NodeManagerMetrics", > tag.Context: "yarn", > tag.Hostname: "hadoop.com", > ContainersLaunched: 0, > ContainersCompleted: 0, > ContainersFailed: 2, > ContainersKilled: 0, > ContainersIniting: 0, > ContainersRunning: 0, > AllocatedGB: 0, > AllocatedContainers: -2, > AvailableGB: 160, > AllocatedVCores: -11, > AvailableVCores: 3611, > ContainerLaunchDurationNumOps: 2, > ContainerLaunchDurationAvgTime: 6, > BadLocalDirs: 0, > BadLogDirs: 0, > GoodLocalDirsDiskUtilizationPerc: 2, > GoodLogDirsDiskUtilizationPerc: 2 > } > {code}
[jira] [Commented] (YARN-6966) NodeManager metrics may return wrong negative values when NM restart
[ https://issues.apache.org/jira/browse/YARN-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556155#comment-16556155 ] Haibo Chen commented on YARN-6966: -- Yes, please add a patch for branch-3.0. > NodeManager metrics may return wrong negative values when NM restart > > > Key: YARN-6966 > URL: https://issues.apache.org/jira/browse/YARN-6966 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang >Assignee: Szilard Nemeth >Priority: Major > Fix For: 3.2.0, 3.1.2
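The description says the fix belongs in ContainerManagerImpl#recoverContainer: each recovered container's footprint must be re-applied to the NM metrics, otherwise the release after restart decrements counters that were never re-incremented, which is exactly the -2 / -11 shown in the JMX output above. A hedged sketch of that re-application, with a plain dict standing in for NodeManagerMetrics:

```python
def recover_container_metrics(metrics, container):
    """Re-add a recovered container's allocation to the NM metrics on
    restart. Without this step, stopping the application decrements
    AllocatedContainers/AllocatedGB/AllocatedVCores below zero, as in
    the report. Field and parameter names are illustrative, not the
    actual NodeManagerMetrics API."""
    metrics["AllocatedContainers"] += 1
    metrics["AllocatedGB"] += container["gb"]
    metrics["AvailableGB"] -= container["gb"]
    metrics["AllocatedVCores"] += container["vcores"]
    metrics["AvailableVCores"] -= container["vcores"]
    return metrics
```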
[jira] [Commented] (YARN-8517) getContainer and getContainers ResourceManager REST API methods are not documented
[ https://issues.apache.org/jira/browse/YARN-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556069#comment-16556069 ] Robert Kanter commented on YARN-8517: - Thanks [~bsteinbach] for the patch. One minor suggestion: - We should have a short description of the API before the URI for both new sections. You can see that other sections have a description, usually something like "With the API, you can..." > getContainer and getContainers ResourceManager REST API methods are not > documented > -- > > Key: YARN-8517 > URL: https://issues.apache.org/jira/browse/YARN-8517 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Szilard Nemeth >Assignee: Antal Bálint Steinbach >Priority: Major > Labels: newbie, newbie++ > Attachments: YARN-8517.001.patch, YARN-8517.002.patch, > YARN-8517.003.patch, YARN-8517.004.patch > > > Looking at the documentation here: > https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html > I cannot find documentation for 2 RM REST endpoints: > - /apps/\{appid\}/appattempts/\{appattemptid\}/containers > - /apps/\{appid\}/appattempts/\{appattemptid\}/containers/\{containerid\} > I suppose they are not intentionally undocumented. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8578) Failed while trying to construct the redirect url to the log server for Samza applications
[ https://issues.apache.org/jira/browse/YARN-8578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuriy Malygin updated YARN-8578: Description: In Timeline web interface I see miscellaneous behavior when clicking on a link in columns _ID_ and _Tracking UI_ of one row: * from _ID_ go to [http://ts-hostname:8188/applicationhistory/app/application_1532439541520_0104] where I can see logs * from _Tracking UI_ go to [http://nn-hostname:8088/cluster/app/application_1532439541520_0104] where I redirecting to [http://nm-hostname:8042/node/containerlogs/container_e51_1532439541520_0104_01_01/username] and see error: *Failed redirect for container_e51_1532439541520_0104_01_01* _Failed while trying to construct the redirect url to the log server. Log Server url may not be configured_ _java.lang.Exception: Unknown container. Container either has not started or has already completed or doesn't belong to this node at all._ Application type of application_1532439541520_0104 is a Samza. If type is MapReduce both URI works fine and redirects to logs. was: In Timeline web interface I see miscellaneous behavior when clicking on a link in columns _ID_ and _Tracking UI_ of one row: * from _ID_ go to [http://ts-hostname:8188/applicationhistory/app/application_1532439541520_0104] where I can see logs * from _Tracking UI_ go to [http://nn-hostname:8088/cluster/app/application_1532439541520_0104] where I redirecting to [http://nm-hostname:8042/node/containerlogs/container_e51_1532439541520_0104_01_01/username] and see error: *Failed redirect for container_e51_1532439541520_0104_01_01* _Failed while trying to construct the redirect url to the log server. Log Server url may not be configured_ _java.lang.Exception: Unknown container. Container either has not started or has already completed or doesn't belong to this node at all._ Application type of application_1532439541520_0104 is Samza. If type is MapReduce both URI works fine and redirects to logs. 
> Failed while trying to construct the redirect url to the log server for Samza > applications > -- > > Key: YARN-8578 > URL: https://issues.apache.org/jira/browse/YARN-8578 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.7.3 >Reporter: Yuriy Malygin >Priority: Major
[jira] [Commented] (YARN-8571) Validate service principal format prior to launching yarn service
[ https://issues.apache.org/jira/browse/YARN-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555853#comment-16555853 ] Billie Rinaldi commented on YARN-8571: -- Thanks for the patch, [~eyang]. I think this patch would NPE when the principal is null, so we should check for that. Otherwise it looks good. > Validate service principal format prior to launching yarn service > - > > Key: YARN-8571 > URL: https://issues.apache.org/jira/browse/YARN-8571 > Project: Hadoop YARN > Issue Type: Bug > Components: security, yarn >Affects Versions: 3.1.0 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Attachments: YARN-8571.001.patch > > > Hadoop client and server interaction is designed to validate the service > principal before RPC request is permitted. In YARN service, the same > security model is enforced to prevent replay attack. However, end user > might submit JSON that looks like this to YARN service REST API: > {code} > { > "name": "sleeper-service", > "version": "1.0.0", > "components" : > [ > { > "name": "sleeper", > "number_of_containers": 2, > "launch_command": "sleep 90", > "resource": { > "cpus": 1, > "memory": "256" > } > } > ], > "kerberos_principal" : { > "principal_name" : "ambari...@example.com", > "keytab" : "file:///etc/security/keytabs/smokeuser.headless.keytab" > } > } > {code} > The kerberos principal is end user kerberos principal instead of service > principal. This does not work properly because YARN service application > master requires to run with a service principal to communicate with YARN CLI > client via Hadoop RPC. Without breaking Hadoop security design in this JIRA, > it might be in our best interest to validate principal_name during > submission, and report error message when someone tries to run YARN service > with user principal. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
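For background on the validation under review: a Kerberos service principal carries a host component (name/host@REALM), whereas the user principal in the example JSON does not, which is what the submission-time check would reject. A rough sketch of such a format check, including the null guard the review comment asks for; the regex is an illustration, not the actual patch:

```python
import re

# Service principals look like name/host@REALM; user principals lack
# the /host component. Illustrative only, not the YARN-8571 patch.
SERVICE_PRINCIPAL = re.compile(r"^[^/@]+/[^/@]+@[^/@]+$")

def is_service_principal(principal):
    """Return True only for name/host@REALM-shaped principals.
    The explicit None check mirrors the NPE concern raised above."""
    if principal is None:
        return False
    return SERVICE_PRINCIPAL.match(principal) is not None
```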
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555962#comment-16555962 ] Hudson commented on YARN-4606: -- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #14641 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/14641/]) YARN-4606. CapacityScheduler: applications could get starved because (ericp: rev 9485c9aee6e9bb935c3e6ae4da81d70b621781de) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/UsersManager.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestContainerAllocation.java > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.002.patch, > YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.005.patch, > YARN-4606.006.patch, YARN-4606.007.patch, YARN-4606.1.poc.patch, > YARN-4606.POC.2.patch, YARN-4606.POC.3.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. 
This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
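The starvation math in the example above can be modeled simply: with an even split of queue resources among #activeUsers, counting users whose applications are all pending shrinks every active user's share. This is a sketch under that even-split assumption, not the CapacityScheduler's actual user-limit formula:

```python
def user_limit(queue_resource, apps_by_user):
    """Per-user resource limit under an even split, counting only
    users with at least one *active* (non-pending) application, as
    the fix intends. Counting all four users in the example instead
    of the two active ones would halve the computed limit."""
    active_users = sum(
        1 for apps in apps_by_user.values()
        if any(state == "active" for state in apps)
    )
    return queue_resource // max(active_users, 1)
```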
[jira] [Updated] (YARN-8578) Failed while trying to construct the redirect url to the log server for Samza applications
[ https://issues.apache.org/jira/browse/YARN-8578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuriy Malygin updated YARN-8578: Description: In Timeline web interface I see miscellaneous behavior when clicking on a link in columns _ID_ and _Tracking UI_ of one row: * from _ID_ go to [http://ts-hostname:8188/applicationhistory/app/application_1532439541520_0104] where I can see logs * from _Tracking UI_ go to [http://nn-hostname:8088/cluster/app/application_1532439541520_0104] where I redirecting to [http://nm-hostname:8042/node/containerlogs/container_e51_1532439541520_0104_01_01/username] and see error: *Failed redirect for container_e51_1532439541520_0104_01_01* _Failed while trying to construct the redirect url to the log server. Log Server url may not be configured_ _java.lang.Exception: Unknown container. Container either has not started or has already completed or doesn't belong to this node at all._ Application type of application_1532439541520_0104 is a Samza. If type is MapReduce both URI works fine and redirects to logs - to TS and JHS. was: In Timeline web interface I see miscellaneous behavior when clicking on a link in columns _ID_ and _Tracking UI_ of one row: * from _ID_ go to [http://ts-hostname:8188/applicationhistory/app/application_1532439541520_0104] where I can see logs * from _Tracking UI_ go to [http://nn-hostname:8088/cluster/app/application_1532439541520_0104] where I redirecting to [http://nm-hostname:8042/node/containerlogs/container_e51_1532439541520_0104_01_01/username] and see error: *Failed redirect for container_e51_1532439541520_0104_01_01* _Failed while trying to construct the redirect url to the log server. Log Server url may not be configured_ _java.lang.Exception: Unknown container. Container either has not started or has already completed or doesn't belong to this node at all._ Application type of application_1532439541520_0104 is a Samza. If type is MapReduce both URI works fine and redirects to logs. 
> Failed while trying to construct the redirect url to the log server for Samza > applications > -- > > Key: YARN-8578 > URL: https://issues.apache.org/jira/browse/YARN-8578 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.7.3 >Reporter: Yuriy Malygin >Priority: Major > > In Timeline web interface I see miscellaneous behavior when clicking on a > link in columns _ID_ and _Tracking UI_ of one row: > * from _ID_ go to > [http://ts-hostname:8188/applicationhistory/app/application_1532439541520_0104] > where I can see logs > * from _Tracking UI_ go to > [http://nn-hostname:8088/cluster/app/application_1532439541520_0104] where I > redirecting to > [http://nm-hostname:8042/node/containerlogs/container_e51_1532439541520_0104_01_01/username] > and see error: > *Failed redirect for container_e51_1532439541520_0104_01_01* > _Failed while trying to construct the redirect url to the log server. Log > Server url may not be configured_ > _java.lang.Exception: Unknown container. Container either has not started or > has already completed or doesn't belong to this node at all._ > > Application type of application_1532439541520_0104 is a Samza. > If type is MapReduce both URI works fine and redirects to logs - to TS and > JHS. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-8578) Failed while trying to construct the redirect url to the log server for Samza applications
Yuriy Malygin created YARN-8578: --- Summary: Failed while trying to construct the redirect url to the log server for Samza applications Key: YARN-8578 URL: https://issues.apache.org/jira/browse/YARN-8578 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.7.3 Reporter: Yuriy Malygin In the Timeline web interface I see inconsistent behavior when clicking the links in the _ID_ and _Tracking UI_ columns of the same row: * from _ID_ I go to [http://ts-hostname:8188/applicationhistory/app/application_1532439541520_0104], where I can see the logs * from _Tracking UI_ I go to [http://nn-hostname:8088/cluster/app/application_1532439541520_0104], where I am redirected to [http://nm-hostname:8042/node/containerlogs/container_e51_1532439541520_0104_01_01/username] and see the error: *Failed redirect for container_e51_1532439541520_0104_01_01* _Failed while trying to construct the redirect url to the log server. Log Server url may not be configured_ _java.lang.Exception: Unknown container. Container either has not started or has already completed or doesn't belong to this node at all._ The application type of application_1532439541520_0104 is Samza. If the type is MapReduce, both URLs work fine and redirect to the logs.
[jira] [Commented] (YARN-4946) RM should write out Aggregated Log Completion file flag next to logs
[ https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555893#comment-16555893 ] Robert Kanter commented on YARN-4946: - AFAIK, nothing has changed in this area. However, I think the flag file is going to be a no-go. I've gotten a _lot_ of pushback in the past when trying to have the RM write information to HDFS. So I think we need to come up with a different approach. The RM remembers X number of applications in order to save on memory and RMStateStore space. This is controlled by {{yarn.resourcemanager.max-completed-applications}} and {{yarn.resourcemanager.state-store.max-completed-applications}}, respectively; you usually would set them to the same value (in fact, I believe the state-store one is set to the other one by default). For example, if set to 1000, then when you run 1001 applications, the RM will forget the oldest application that is no longer running (i.e. completed, failed), so that it never remembers more than 1000 applications - that's what I mean by "forgetting." Those applications can still be looked up in the JHS, Spark HS, etc. No need to do a failover or HA (though we should test that once at the end to be thorough). You can test this with {{yarn.resourcemanager.max-completed-applications}} by setting it to a low value like 3. The RM should not remember more than 3 completed applications, so simply run 4 jobs, wait for them to complete, and you'll see it. The issue this JIRA is trying to solve is that when you run the tool from MAPREDUCE-6415, if it can't find the App in the RM (because the RM forgot it) when getting the log aggregation status, it assumes that the aggregation completed successfully (https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-archive-logs/src/main/java/org/apache/hadoop/tools/HadoopArchiveLogs.java#L350). Assuming your cluster and job are working correctly, that's a good assumption, but if not, it'll be wrong. 
IIRC, that's actually okay if log aggregation has reached a terminal state like succeeded or even failed; it's more of a problem if aggregation is still in progress, because then we'd process partial logs. So I think we can leave that behavior if we can ensure that the RM only forgets apps once they've reached a terminal log aggregation status. In other words, the RM shouldn't consider an App truly finished (and thus remove it from its history) until the aggregation status has reached a terminal state (i.e. DISABLED, SUCCEEDED, FAILED, TIME_OUT). This should be a simpler fix and doesn't require writing anything to HDFS. > RM should write out Aggregated Log Completion file flag next to logs > > > Key: YARN-4946 > URL: https://issues.apache.org/jira/browse/YARN-4946 > Project: Hadoop YARN > Issue Type: Improvement > Components: log-aggregation >Affects Versions: 2.8.0 >Reporter: Robert Kanter >Assignee: Szilard Nemeth >Priority: Major > > MAPREDUCE-6415 added a tool that combines the aggregated log files for each > Yarn App into a HAR file. When run, it seeds the list by looking at the > aggregated logs directory, and then filters out ineligible apps. One of the > criteria involves checking with the RM that an Application's log aggregation > status is not still running and has not failed. When the RM "forgets" about > an older completed Application (e.g. RM failover, enough time has passed, > etc), the tool won't find the Application in the RM and will just assume that > its log aggregation succeeded, even if it actually failed or is still running. > We can solve this problem by doing the following: > # When the RM sees that an Application has successfully finished aggregation > its logs, it will write a flag file next to that Application's log files > # The tool no longer talks to the RM at all. When looking at the FileSystem, > it now uses that flag file to determine if it should process those log files. 
> If the file is there, it archives, otherwise it does not. > # As part of the archiving process, it will delete the flag file > # (If you don't run the tool, the flag file will eventually be cleaned up by > the JHS when it cleans up the aggregated logs because it's in the same > directory) > This improvement has several advantages: > # The edge case about "forgotten" Applications is fixed > # The tool no longer has to talk to the RM; it only has to consult HDFS. > This is simpler
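Robert's test recipe above corresponds to a small yarn-site.xml change along these lines (the property names come from his comment; the value 3 is only for the experiment, not a production setting):

```xml
<!-- Make the RM forget completed applications quickly, per the test described above. -->
<property>
  <name>yarn.resourcemanager.max-completed-applications</name>
  <value>3</value>
</property>
<property>
  <!-- Usually kept equal to the value above (and defaults to it, per the comment). -->
  <name>yarn.resourcemanager.state-store.max-completed-applications</name>
  <value>3</value>
</property>
```

Run four jobs to completion and the RM should drop the oldest one - exactly the state in which the MAPREDUCE-6415 tool falls back to assuming aggregation succeeded.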
[jira] [Comment Edited] (YARN-8574) Allow dot in attribute values
[ https://issues.apache.org/jira/browse/YARN-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1614#comment-1614 ] Bibin A Chundatt edited comment on YARN-8574 at 7/25/18 11:47 AM: -- Thank you [~Naganarasimha] for the review. Both the prefix and the value should allow ".", right? Currently the prefix also allows yarn.rm.io, yarn.nm.io. Did you mean namespace? was (Author: bibinchundatt): Both the prefix and the value should allow ".", right? Currently the prefix also allows yarn.rm.io, yarn.nm.io. > Allow dot in attribute values > -- > > Key: YARN-8574 > URL: https://issues.apache.org/jira/browse/YARN-8574 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Minor > Attachments: YARN-8574-YARN-3409.001.patch > > > Currently "." is considered as invalid value. Enable the same;
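For readers following the prefix-vs-value discussion: the actual validation regex from the YARN-3409 branch is not quoted in this thread, so the patterns below are only a hypothetical illustration of what "allow a dot in the value, but be careful about reusing the same pattern for the prefix" means:

```java
import java.util.regex.Pattern;

// Hypothetical patterns; the real YARN-3409 validation regex is not shown in this thread.
public class AttributeValueDot {
    static final Pattern VALUE_WITHOUT_DOT = Pattern.compile("^[a-zA-Z0-9][a-zA-Z0-9_-]*$");
    static final Pattern VALUE_WITH_DOT    = Pattern.compile("^[a-zA-Z0-9][a-zA-Z0-9_.-]*$");

    static boolean valid(Pattern p, String v) {
        return p.matcher(v).matches();
    }

    public static void main(String[] args) {
        // "." was previously rejected in attribute values; the patch enables it.
        System.out.println(valid(VALUE_WITHOUT_DOT, "yarn.rm.io")); // false
        System.out.println(valid(VALUE_WITH_DOT, "yarn.rm.io"));   // true
        // Naganarasimha's concern: reusing VALUE_WITH_DOT for the prefix would
        // admit dots there too, which may not be desired.
    }
}
```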
[jira] [Commented] (YARN-8577) Fix the broken anchor in SLS site-doc
[ https://issues.apache.org/jira/browse/YARN-8577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555735#comment-16555735 ] Hudson commented on YARN-8577: -- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #14638 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/14638/]) YARN-8577. Fix the broken anchor in SLS site-doc. Contributed by Weiwei (bibinchundatt: rev 3d3158cea4580eb2e3b2af635c3a7d30f4dbb873) * (edit) hadoop-tools/hadoop-sls/src/site/markdown/SchedulerLoadSimulator.md > Fix the broken anchor in SLS site-doc > - > > Key: YARN-8577 > URL: https://issues.apache.org/jira/browse/YARN-8577 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Affects Versions: 2.9.0, 3.0.0, 3.1.0 >Reporter: Weiwei Yang >Assignee: Weiwei Yang >Priority: Minor > Fix For: 3.2.0, 3.0.4, 3.1.2 > > Attachments: HADOOP-15630.001.patch > > > The anchor for section "Synthetic Load Generator" is currently broken.
[jira] [Commented] (YARN-8575) CapacityScheduler should check node state before committing reserve/allocate proposals
[ https://issues.apache.org/jira/browse/YARN-8575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1622#comment-1622 ] genericqa commented on YARN-8575: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 23s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 4 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 27m 44s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 43s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 14s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 47s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 14s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 24s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 45s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 42s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 42s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 10s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 43s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 18s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 23s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 70m 17s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 23s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black}131m 10s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart | | | hadoop.yarn.server.resourcemanager.scheduler.fair.policies.TestDominantResourceFairnessPolicy | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ba1ab08 | | JIRA Issue | YARN-8575 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12933019/YARN-8575.001.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux fadae5e6991e 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 955f795 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_171 | | findbugs | v3.1.0-RC1 | | unit | https://builds.apache.org/job/PreCommit-YARN-Build/21364/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/21364/testReport/ | | Max. process+thread count | 877 (vs. ulimit of 1) | | modules | C:
[jira] [Commented] (YARN-8558) NM recovery level db not cleaned up properly on container finish
[ https://issues.apache.org/jira/browse/YARN-8558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1618#comment-1618 ] genericqa commented on YARN-8558: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 21s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 14s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 0s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 16s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 10m 32s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 56s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 34s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 53s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 53s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 8s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 10m 52s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 57s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 18m 42s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 25s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 72m 45s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ba1ab08 | | JIRA Issue | YARN-8558 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12933027/YARN-8558.002.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 98dac162c806 4.4.0-89-generic #112-Ubuntu SMP Mon Jul 31 19:38:41 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 955f795 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_171 | | findbugs | v3.1.0-RC1 | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/21366/testReport/ | | Max. process+thread count | 410 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/21366/console | | Powered by | Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org | This message was automatically generated. > NM recovery level db not cleaned up properly on container
[jira] [Commented] (YARN-8574) Allow dot in attribute values
[ https://issues.apache.org/jira/browse/YARN-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1614#comment-1614 ] Bibin A Chundatt commented on YARN-8574: Both the prefix and the value should allow ".", right? Currently the prefix also allows yarn.rm.io, yarn.nm.io. > Allow dot in attribute values > -- > > Key: YARN-8574 > URL: https://issues.apache.org/jira/browse/YARN-8574 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Minor > Attachments: YARN-8574-YARN-3409.001.patch > > > Currently "." is considered as invalid value. Enable the same;
[jira] [Commented] (YARN-8574) Allow dot in attribute values
[ https://issues.apache.org/jira/browse/YARN-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555499#comment-16555499 ] Naganarasimha G R commented on YARN-8574: - Thanks for the patch [~bibinchundatt], my bad, I had almost forgotten to work on it. One concern I have: I agree that the value should be allowed to contain a dot, but not the prefix; so if we use the same pattern for the prefix, it will be a problem. > Allow dot in attribute values > -- > > Key: YARN-8574 > URL: https://issues.apache.org/jira/browse/YARN-8574 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Minor > Attachments: YARN-8574-YARN-3409.001.patch > > > Currently "." is considered as invalid value. Enable the same;
[jira] [Commented] (YARN-8577) Fix the broken anchor in SLS site-doc
[ https://issues.apache.org/jira/browse/YARN-8577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555489#comment-16555489 ] genericqa commented on YARN-8577: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 12s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 57s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 26s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 37m 27s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 22s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 29s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 25s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 51m 24s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ba1ab08 | | JIRA Issue | HADOOP-15630 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12933026/HADOOP-15630.001.patch | | Optional Tests | asflicense mvnsite | | uname | Linux bb9265a7bf90 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 955f795 | | maven | version: Apache Maven 3.3.9 | | Max. process+thread count | 334 (vs. ulimit of 1) | | modules | C: hadoop-tools/hadoop-sls U: hadoop-tools/hadoop-sls | | Console output | https://builds.apache.org/job/PreCommit-HADOOP-Build/14941/console | | Powered by | Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org | This message was automatically generated. > Fix the broken anchor in SLS site-doc > - > > Key: YARN-8577 > URL: https://issues.apache.org/jira/browse/YARN-8577 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Affects Versions: 2.9.0, 3.0.0, 3.1.0 >Reporter: Weiwei Yang >Assignee: Weiwei Yang >Priority: Minor > Attachments: HADOOP-15630.001.patch > > > The anchor for section "Synthetic Load Generator" is currently broken.
[jira] [Comment Edited] (YARN-4946) RM should write out Aggregated Log Completion file flag next to logs
[ https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555607#comment-16555607 ] Szilard Nemeth edited comment on YARN-4946 at 7/25/18 12:58 PM: Hi [~rkanter]! I'm trying to pick this one up. Since this was created a long time ago, I suppose the RM could behave differently now regarding "forgotten" applications, e.g. maybe it was improved over time. Could you please give me some hints on how to test whether the RM still forgets applications? Is the way to go to have an RM setup with HA, start one application, wait for its completion and do an RM failover, or does this involve more complex steps? Could you also give me some insight into how an application can be "forgotten" if enough time passes, or any other cases that can lead to the same situation? Thanks! was (Author: snemeth): Hi [~rkanter]! I'm trying to pick this one up. Since this was created a long time ago, I suppose the RM could behave differently now regarding "forgotten" applications, e.g. maybe it was improved over time. Could you please give me some hints on how to test whether the RM still forgets applications? Is the way to go to have an RM setup with HA, start one application, wait for its completion and do an RM failover, or does this involve more complex steps? Thanks! > RM should write out Aggregated Log Completion file flag next to logs > > > Key: YARN-4946 > URL: https://issues.apache.org/jira/browse/YARN-4946 > Project: Hadoop YARN > Issue Type: Improvement > Components: log-aggregation >Affects Versions: 2.8.0 >Reporter: Robert Kanter >Assignee: Szilard Nemeth >Priority: Major > > MAPREDUCE-6415 added a tool that combines the aggregated log files for each > Yarn App into a HAR file. When run, it seeds the list by looking at the > aggregated logs directory, and then filters out ineligible apps. One of the > criteria involves checking with the RM that an Application's log aggregation > status is not still running and has not failed. 
When the RM "forgets" about > an older completed Application (e.g. RM failover, enough time has passed, > etc), the tool won't find the Application in the RM and will just assume that > its log aggregation succeeded, even if it actually failed or is still running. > We can solve this problem by doing the following: > # When the RM sees that an Application has successfully finished aggregation > its logs, it will write a flag file next to that Application's log files > # The tool no longer talks to the RM at all. When looking at the FileSystem, > it now uses that flag file to determine if it should process those log files. > If the file is there, it archives, otherwise it does not. > # As part of the archiving process, it will delete the flag file > # (If you don't run the tool, the flag file will eventually be cleaned up by > the JHS when it cleans up the aggregated logs because it's in the same > directory) > This improvement has several advantages: > # The edge case about "forgotten" Applications is fixed > # The tool no longer has to talk to the RM; it only has to consult HDFS. > This is simpler
[jira] [Commented] (YARN-4946) RM should write out Aggregated Log Completion file flag next to logs
[ https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555607#comment-16555607 ] Szilard Nemeth commented on YARN-4946: -- Hi [~rkanter]! I'm trying to pick this one up. Since this was created a long time ago, I suppose the RM could behave differently now regarding "forgotten" applications, e.g. maybe it was improved over time. Could you please give me some hints on how to test whether the RM still forgets applications? Is the way to go to have an RM setup with HA, start one application, wait for its completion and do an RM failover, or does this involve more complex steps? Thanks! > RM should write out Aggregated Log Completion file flag next to logs > > > Key: YARN-4946 > URL: https://issues.apache.org/jira/browse/YARN-4946 > Project: Hadoop YARN > Issue Type: Improvement > Components: log-aggregation >Affects Versions: 2.8.0 >Reporter: Robert Kanter >Assignee: Szilard Nemeth >Priority: Major > > MAPREDUCE-6415 added a tool that combines the aggregated log files for each > Yarn App into a HAR file. When run, it seeds the list by looking at the > aggregated logs directory, and then filters out ineligible apps. One of the > criteria involves checking with the RM that an Application's log aggregation > status is not still running and has not failed. When the RM "forgets" about > an older completed Application (e.g. RM failover, enough time has passed, > etc), the tool won't find the Application in the RM and will just assume that > its log aggregation succeeded, even if it actually failed or is still running. > We can solve this problem by doing the following: > # When the RM sees that an Application has successfully finished aggregation > its logs, it will write a flag file next to that Application's log files > # The tool no longer talks to the RM at all. When looking at the FileSystem, > it now uses that flag file to determine if it should process those log files. 
> If the file is there, it archives, otherwise it does not. > # As part of the archiving process, it will delete the flag file > # (If you don't run the tool, the flag file will eventually be cleaned up by > the JHS when it cleans up the aggregated logs because it's in the same > directory) > This improvement has several advantages: > # The edge case about "forgotten" Applications is fixed > # The tool no longer has to talk to the RM; it only has to consult HDFS. > This is simpler
[jira] [Assigned] (YARN-4946) RM should write out Aggregated Log Completion file flag next to logs
[ https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-4946: Assignee: Szilard Nemeth > RM should write out Aggregated Log Completion file flag next to logs > > > Key: YARN-4946 > URL: https://issues.apache.org/jira/browse/YARN-4946 > Project: Hadoop YARN > Issue Type: Improvement > Components: log-aggregation >Affects Versions: 2.8.0 >Reporter: Robert Kanter >Assignee: Szilard Nemeth >Priority: Major > > MAPREDUCE-6415 added a tool that combines the aggregated log files for each > Yarn App into a HAR file. When run, it seeds the list by looking at the > aggregated logs directory, and then filters out ineligible apps. One of the > criteria involves checking with the RM that an Application's log aggregation > status is not still running and has not failed. When the RM "forgets" about > an older completed Application (e.g. RM failover, enough time has passed, > etc), the tool won't find the Application in the RM and will just assume that > its log aggregation succeeded, even if it actually failed or is still running. > We can solve this problem by doing the following: > # When the RM sees that an Application has successfully finished aggregation > its logs, it will write a flag file next to that Application's log files > # The tool no longer talks to the RM at all. When looking at the FileSystem, > it now uses that flag file to determine if it should process those log files. > If the file is there, it archives, otherwise it does not. > # As part of the archiving process, it will delete the flag file > # (If you don't run the tool, the flag file will eventually be cleaned up by > the JHS when it cleans up the aggregated logs because it's in the same > directory) > This improvement has several advantages: > # The edge case about "forgotten" Applications is fixed > # The tool no longer has to talk to the RM; it only has to consult HDFS. 
> This is simpler -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
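The "forgotten application" edge case and the proposed flag-file fix described above can be sketched with a minimal, hypothetical model. The `AggStatus` enum and the `eligibleOld`/`eligibleNew` names are illustrative only, not the actual Hadoop API; the point is that the old RM lookup treats an unknown application as a success, while the flag-file check does not:

```java
// Hypothetical model of the archiving tool's eligibility decision.
// AggStatus, eligibleOld and eligibleNew are illustrative names,
// not the real Hadoop API.
public class AggregatedLogEligibility {

    // Aggregation status as reported by the RM; UNKNOWN means the RM has
    // "forgotten" the application (failover, retention expiry, etc.).
    enum AggStatus { SUCCEEDED, FAILED, RUNNING, UNKNOWN }

    // Old behavior: filter out RUNNING and FAILED apps, which lets an
    // UNKNOWN (forgotten) app slip through as if aggregation had succeeded.
    static boolean eligibleOld(AggStatus rmStatus) {
        return rmStatus != AggStatus.RUNNING && rmStatus != AggStatus.FAILED;
    }

    // Proposed behavior: only the flag file the RM writes next to the
    // aggregated logs grants eligibility; no RM round-trip is needed.
    static boolean eligibleNew(boolean flagFilePresent) {
        return flagFilePresent;
    }

    public static void main(String[] args) {
        // A forgotten app whose aggregation actually failed:
        System.out.println(eligibleOld(AggStatus.UNKNOWN)); // archived anyway
        System.out.println(eligibleNew(false));             // correctly skipped
    }
}
```

The tool would delete the flag file after archiving, making the decision naturally one-shot per application.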
[jira] [Commented] (YARN-6966) NodeManager metrics may return wrong negative values when NM restart
[ https://issues.apache.org/jira/browse/YARN-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555480#comment-16555480 ] Szilard Nemeth commented on YARN-6966: -- Hi [~haibochen]! Reopened and moved this issue to Patch Available, as I think Yetus won't pick this up otherwise. Added the patch for branch-2. Should I add another patch for branch-3.0? Thanks! > NodeManager metrics may return wrong negative values when NM restart > > > Key: YARN-6966 > URL: https://issues.apache.org/jira/browse/YARN-6966 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang >Assignee: Szilard Nemeth >Priority: Major > Fix For: 3.2.0, 3.1.2 > > Attachments: YARN-6966-branch-2.001.patch, YARN-6966.001.patch, > YARN-6966.002.patch, YARN-6966.003.patch, YARN-6966.004.patch, > YARN-6966.005.patch, YARN-6966.005.patch, YARN-6966.006.patch > > > Just as in YARN-6212. However, I think it is not a duplicate of YARN-3933. > The primary cause of the negative values is that the metrics do not recover properly > when the NM restarts. > AllocatedContainers, ContainersLaunched, AllocatedGB, AvailableGB, AllocatedVCores, and AvailableVCores > also need to be recovered when the NM restarts. > This should be done in ContainerManagerImpl#recoverContainer. 
> The scenario can be reproduced by the following steps: > # Make sure > YarnConfiguration.NM_RECOVERY_ENABLED=true,YarnConfiguration.NM_RECOVERY_SUPERVISED=true > in the NM > # Submit an application and keep it running > # Restart the NM > # Stop the application > # Now you get the negative values > {code} > /jmx?qry=Hadoop:service=NodeManager,name=NodeManagerMetrics > {code} > {code} > { > name: "Hadoop:service=NodeManager,name=NodeManagerMetrics", > modelerType: "NodeManagerMetrics", > tag.Context: "yarn", > tag.Hostname: "hadoop.com", > ContainersLaunched: 0, > ContainersCompleted: 0, > ContainersFailed: 2, > ContainersKilled: 0, > ContainersIniting: 0, > ContainersRunning: 0, > AllocatedGB: 0, > AllocatedContainers: -2, > AvailableGB: 160, > AllocatedVCores: -11, > AvailableVCores: 3611, > ContainerLaunchDurationNumOps: 2, > ContainerLaunchDurationAvgTime: 6, > BadLocalDirs: 0, > BadLogDirs: 0, > GoodLocalDirsDiskUtilizationPerc: 2, > GoodLogDirsDiskUtilizationPerc: 2 > } > {code}
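The recovery gap behind the negative values above can be illustrated with a small model. The field and method names here are simplified stand-ins for the real NodeManagerMetrics/ContainerManagerImpl code, not the actual API: after a restart the metrics start from zero, so releasing a recovered container drives the counters negative unless recovery re-applies the container's allocation first.

```java
// Simplified model of NM metrics accounting across a restart. Names are
// illustrative stand-ins, not the actual NodeManagerMetrics API.
public class NodeManagerMetricsSketch {
    int allocatedContainers;
    int allocatedVCores;
    long allocatedGB;

    void allocateContainer(int vcores, long gb) {
        allocatedContainers++;
        allocatedVCores += vcores;
        allocatedGB += gb;
    }

    void releaseContainer(int vcores, long gb) {
        allocatedContainers--;
        allocatedVCores -= vcores;
        allocatedGB -= gb;
    }

    // Analogue of the proposed fix in ContainerManagerImpl#recoverContainer:
    // re-apply each still-running container's usage to the fresh counters.
    // Without this, releaseContainer() after a restart goes below zero.
    void recoverContainer(int vcores, long gb) {
        allocateContainer(vcores, gb);
    }

    public static void main(String[] args) {
        NodeManagerMetricsSketch beforeRestart = new NodeManagerMetricsSketch();
        beforeRestart.allocateContainer(2, 4); // container running pre-restart

        // NM restarts: a fresh metrics object starts at zero.
        NodeManagerMetricsSketch afterRestart = new NodeManagerMetricsSketch();
        afterRestart.recoverContainer(2, 4);   // the fix re-applies usage
        afterRestart.releaseContainer(2, 4);   // application stops
        System.out.println(afterRestart.allocatedContainers); // 0, not -1
    }
}
```

Skipping the `recoverContainer` call in this model reproduces exactly the `AllocatedContainers: -2, AllocatedVCores: -11` symptom shown in the JMX output.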
[jira] [Updated] (YARN-6966) NodeManager metrics may return wrong negative values when NM restart
[ https://issues.apache.org/jira/browse/YARN-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-6966: - Target Version/s: 2.10.0, 3.2.0, 3.1.2 > NodeManager metrics may return wrong negative values when NM restart > > > Key: YARN-6966 > URL: https://issues.apache.org/jira/browse/YARN-6966 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang >Assignee: Szilard Nemeth >Priority: Major > Fix For: 3.2.0, 3.1.2
[jira] [Commented] (YARN-6307) Refactor FairShareComparator#compare
[ https://issues.apache.org/jira/browse/YARN-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1626#comment-1626 ] stefanlee commented on YARN-6307: - Thanks for this jira, [~yufeigu] [~templedf]. I have a doubt: what is the difference between the *fair share* in _FairSharePolicy#compare_ and the *fair share* in _FairSharePolicy#computeShares_? I think the latter is related to *preemption*; the two are hard to tell apart. > Refactor FairShareComparator#compare > > > Key: YARN-6307 > URL: https://issues.apache.org/jira/browse/YARN-6307 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: Yufei Gu >Assignee: Yufei Gu >Priority: Major > Fix For: 2.9.0, 3.0.0-beta1 > > Attachments: YARN-6307.001.patch, YARN-6307.002.patch, > YARN-6307.003.patch > > > The method does three things: compare min share usage, compare fair share > usage by checking the weight ratio, and break ties by submit time and name. > These steps are mixed together, which makes the method hard to read and maintain. > Additionally, there are potential performance issues; for example, there is no need > to check the weight ratio if the min share usage comparison already indicates the > order. It is worth improving, considering the huge number of invocations in the scheduler.
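The refactoring direction described in YARN-6307 (three ordered, short-circuiting steps) can be sketched as follows. The `Schedulable` class here is a simplified stand-in for the real FairScheduler types, not the actual API; it only shows how splitting the comparator avoids computing the weight ratio when the min-share comparison already decides:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch of a decomposed FairShareComparator: min-share neediness first,
// then weighted usage ratio, then tiebreak by start time and name.
// Schedulable is a simplified stand-in, not the real FairScheduler type.
public class FairShareComparatorSketch {

    static class Schedulable {
        final double minShare, usage, weight;
        final long startTime;
        final String name;

        Schedulable(double minShare, double usage, double weight,
                    long startTime, String name) {
            this.minShare = minShare;
            this.usage = usage;
            this.weight = weight;
            this.startTime = startTime;
            this.name = name;
        }

        boolean isNeedy() { return usage < minShare; }
        double minShareRatio() { return usage / Math.max(minShare, 1.0); }
        double weightRatio() { return usage / Math.max(weight, 1e-9); }
    }

    // Step 1: anything below its min share schedules first; among two needy
    // schedulables, the lower min-share ratio wins.
    static Comparator<Schedulable> byNeediness() {
        return (a, b) -> {
            if (a.isNeedy() && b.isNeedy()) {
                return Double.compare(a.minShareRatio(), b.minShareRatio());
            }
            if (a.isNeedy()) return -1;
            if (b.isNeedy()) return 1;
            return 0; // neither is needy: fall through to the next step
        };
    }

    // Steps 2 and 3 only run when the previous step returns a tie, so the
    // weight ratio is never computed for min-share-decided pairs.
    static final Comparator<Schedulable> COMPARATOR = byNeediness()
        .thenComparingDouble(Schedulable::weightRatio)
        .thenComparingLong(s -> s.startTime)
        .thenComparing((Schedulable s) -> s.name);

    public static void main(String[] args) {
        List<Schedulable> apps = Arrays.asList(
            new Schedulable(1, 30, 1, 0, "heavy"),
            new Schedulable(10, 5, 1, 0, "needy"),
            new Schedulable(1, 10, 1, 0, "light"));
        apps.sort(COMPARATOR);
        apps.forEach(s -> System.out.println(s.name)); // needy, light, heavy
    }
}
```

Chaining with `thenComparing` makes the short-circuiting explicit, which is the performance point raised in the issue description.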
[jira] [Commented] (YARN-8558) NM recovery level db not cleaned up properly on container finish
[ https://issues.apache.org/jira/browse/YARN-8558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555396#comment-16555396 ] Sunil Govindan commented on YARN-8558: -- Yes, that makes sense. However, the naming convention for CONTAINER_TOKENS_CURRENT_MASTER_KEY etc. is confusing, as it looks like a per-container value. In practice, it is common to all containers managed by the manager for a day. So could we rename this to avoid the confusion? > NM recovery level db not cleaned up properly on container finish > > > Key: YARN-8558 > URL: https://issues.apache.org/jira/browse/YARN-8558 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0, 3.1.0 >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Attachments: YARN-8558.001.patch > > > {code} > 2018-07-20 16:49:23,117 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: > Application application_1531994217928_0054 transitioned from NEW to INITING > 2018-07-20 16:49:23,204 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_18 with incomplete > records > 2018-07-20 16:49:23,204 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_19 with incomplete > records > 2018-07-20 16:49:23,204 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_20 with incomplete > records > 2018-07-20 16:49:23,205 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_21 with incomplete > records > 2018-07-20 16:49:23,205 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_22 with incomplete > records > 2018-07-20 16:49:23,205 WARN > 
org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_23 with incomplete > records > 2018-07-20 16:49:23,205 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_24 with incomplete > records > 2018-07-20 16:49:23,205 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_25 with incomplete > records > 2018-07-20 16:49:23,205 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_38 with incomplete > records > 2018-07-20 16:49:23,205 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_39 with incomplete > records > 2018-07-20 16:49:23,206 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_41 with incomplete > records > 2018-07-20 16:49:23,206 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_44 with incomplete > records > 2018-07-20 16:49:23,206 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_46 with incomplete > records > 2018-07-20 16:49:23,206 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_49 with incomplete > records > 2018-07-20 16:49:23,206 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_52 with incomplete > records > 2018-07-20 16:49:23,206 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: 
> Remove container container_1531994217928_0001_01_54 with incomplete > records > 2018-07-20 16:49:23,206 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_73 with incomplete > records > 2018-07-20 16:49:23,207 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_74 with incomplete > records > 2018-07-20 16:49:23,207 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_75 with incomplete >
[jira] [Commented] (YARN-8577) Fix the broken anchor in SLS site-doc
[ https://issues.apache.org/jira/browse/YARN-8577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1605#comment-1605 ] Bibin A Chundatt commented on YARN-8577: +1 LGTM. Will commit it soon. > Fix the broken anchor in SLS site-doc > - > > Key: YARN-8577 > URL: https://issues.apache.org/jira/browse/YARN-8577 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Affects Versions: 2.9.0, 3.0.0, 3.1.0 >Reporter: Weiwei Yang >Assignee: Weiwei Yang >Priority: Minor > Attachments: HADOOP-15630.001.patch > > > The anchor for section "Synthetic Load Generator" is currently broken.
[jira] [Commented] (YARN-8546) Resource leak caused by a reserved container being released more than once under async scheduling
[ https://issues.apache.org/jira/browse/YARN-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1601#comment-1601 ] Bibin A Chundatt commented on YARN-8546: [~cheersyang] branch-3.1.1 has been created, so the fix version should be 3.1.2. > Resource leak caused by a reserved container being released more than once > under async scheduling > - > > Key: YARN-8546 > URL: https://issues.apache.org/jira/browse/YARN-8546 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Affects Versions: 3.1.0 >Reporter: Weiwei Yang >Assignee: Tao Yang >Priority: Major > Labels: global-scheduling > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8546.001.patch > > > I was able to reproduce this issue by starting a job, and this job keeps > requesting containers until it uses up the cluster's available resources. My cluster > has 70200 vcores, and each task applies for 100 vcores, so I was expecting a > total of 702 containers to be allocated, but eventually there were only 701. The > last container could not get allocated because the queue's used resource was updated > to be more than 100%.
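The invariant the YARN-8546 fix needs to restore can be modeled in a few lines: releasing the same reserved container twice must not update the queue's used-resource accounting twice. This is a simplified sketch of the bookkeeping only (class and method names are illustrative); the actual patch touches FiCaSchedulerApp and the async-scheduling commit path:

```java
import java.util.HashSet;
import java.util.Set;

// Simplified model of idempotent reserved-container release. Under async
// scheduling, duplicate release attempts for one container are possible;
// tracking live reservations keeps the used-resource total consistent.
// Names are illustrative, not the actual CapacityScheduler code.
public class ReservedReleaseSketch {
    long usedVCores;
    private final Set<String> liveReservations = new HashSet<>();

    void reserve(String containerId, long vcores) {
        if (liveReservations.add(containerId)) {
            usedVCores += vcores;
        }
    }

    // Idempotent release: only the first call for a container id updates
    // the used-resource total; duplicates are ignored.
    boolean release(String containerId, long vcores) {
        if (!liveReservations.remove(containerId)) {
            return false; // already released, or never reserved
        }
        usedVCores -= vcores;
        return true;
    }

    public static void main(String[] args) {
        ReservedReleaseSketch queue = new ReservedReleaseSketch();
        queue.reserve("container_01_000001", 100);
        queue.release("container_01_000001", 100); // first release: applied
        queue.release("container_01_000001", 100); // duplicate: ignored
        System.out.println(queue.usedVCores); // 0, accounting stays consistent
    }
}
```

Without the guard, the duplicate release corrupts the queue's used-resource total, which is how the description's 702nd container ends up unallocatable.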
[jira] [Commented] (YARN-8577) Fix the broken anchor in SLS site-doc
[ https://issues.apache.org/jira/browse/YARN-8577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555495#comment-16555495 ] genericqa commented on YARN-8577: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 20s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 24m 36s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 26s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 35m 37s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 23s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 4s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 24s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 49m 20s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ba1ab08 | | JIRA Issue | YARN-8577 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12933026/HADOOP-15630.001.patch | | Optional Tests | asflicense mvnsite | | uname | Linux 0aefa8e5efa7 4.4.0-130-generic #156-Ubuntu SMP Thu Jun 14 08:53:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 955f795 | | maven | version: Apache Maven 3.3.9 | | Max. process+thread count | 399 (vs. ulimit of 1) | | modules | C: hadoop-tools/hadoop-sls U: hadoop-tools/hadoop-sls | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/21365/console | | Powered by | Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org | This message was automatically generated. > Fix the broken anchor in SLS site-doc > - > > Key: YARN-8577 > URL: https://issues.apache.org/jira/browse/YARN-8577 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Affects Versions: 2.9.0, 3.0.0, 3.1.0 >Reporter: Weiwei Yang >Assignee: Weiwei Yang >Priority: Minor > Attachments: HADOOP-15630.001.patch > > > The anchor for section "Synthetic Load Generator" is currently broken. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8541) RM startup failure on recovery after user deletion
[ https://issues.apache.org/jira/browse/YARN-8541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555494#comment-16555494 ] Bibin A Chundatt commented on YARN-8541: Thank you [~sunilg] and [~suma.shivaprasad] for the review, and [~jj336013] for reporting the issue. > RM startup failure on recovery after user deletion > -- > > Key: YARN-8541 > URL: https://issues.apache.org/jira/browse/YARN-8541 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.1.0 >Reporter: yimeng >Assignee: Bibin A Chundatt >Priority: Blocker > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8541-branch-3.1.003.patch, YARN-8541.001.patch, > YARN-8541.002.patch, YARN-8541.003.patch > > > My Hadoop version is 3.1.0. I found a problem where RM startup fails on > recovery, with the following test steps: > 1. Create a user "user1" that has permission to submit apps. > 2. Use user1 to submit a job and wait for the job to finish. > 3. Delete user "user1". > 4. Restart YARN. > 5. The RM restart fails. > RM logs: > 2018-07-16 16:24:59,708 | INFO | main-EventThread | Initialized root queue > root: numChildQueue= 3, capacity=1.0, absoluteCapacity=1.0, > usedResources=usedCapacity=0.0, numApps=0, > numContainers=0 | CapacitySchedulerQueueManager.java:163 > 2018-07-16 16:24:59,708 | INFO | main-EventThread | Initialized queue > mappings, override: false | UserGroupMappingPlacementRule.java:232 > 2018-07-16 16:24:59,708 | INFO | main-EventThread | Initialized > CapacityScheduler with calculator=class > org.apache.hadoop.yarn.util.resource.DominantResourceCalculator, > minimumAllocation=<>, maximumAllocation=< vCores:32>>, asynchronousScheduling=false, asyncScheduleInterval=5ms | > CapacityScheduler.java:392 > 2018-07-16 16:24:59,709 | INFO | main-EventThread | dynamic-resources.xml not > found | Configuration.java:2767 > 2018-07-16 16:24:59,709 | INFO | main-EventThread | Initializing AMS > Processing chain. Root > Processor=[org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor]. 
> | AMSProcessingChain.java:62 > 2018-07-16 16:24:59,709 | INFO | main-EventThread | disabled placement > handler will be used, all scheduling requests will be rejected. | > ApplicationMasterService.java:130 > 2018-07-16 16:24:59,709 | INFO | main-EventThread | Adding > [org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor] > tp top of AMS Processing chain. | AMSProcessingChain.java:75 > 2018-07-16 16:24:59,713 | WARN | main-EventThread | Exception handling the > winning of election | ActiveStandbyElector.java:897 > org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active > at > org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146) > at > org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:893) > at > org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:473) > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:728) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:600) > Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when > transitioning to Active mode > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:325) > at > org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144) > ... 
4 more > Caused by: org.apache.hadoop.service.ServiceStateException: > org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application > application_1531624956005_0001 submitted by user super reason: No groups > found for user super > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105) > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1204) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1245) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1241) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1686) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1241) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:320) > ... 5 more > Caused
[jira] [Updated] (YARN-8575) CapacityScheduler should check node state before committing reserve/allocate proposals
[ https://issues.apache.org/jira/browse/YARN-8575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-8575: --- Attachment: YARN-8575.001.patch > CapacityScheduler should check node state before committing reserve/allocate > proposals > -- > > Key: YARN-8575 > URL: https://issues.apache.org/jira/browse/YARN-8575 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.2.0, 3.1.2 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-8575.001.patch > > > Recently we found a new error as follows: > {noformat} > ERROR > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: > node to unreserve doesn't exist, nodeid: host1:45454 > {noformat} > To reproduce this problem: > (1) Create a reserve proposal for app1 on node1 > (2) node1 is successfully decommissioned and removed from the node tracker > (3) Try to commit the outdated reserve proposal; it will be accepted and > applied. > This error may occur after decommissioning some NMs. The application > that prints the error log will always have a reserved container on a non-existent > (decommissioned) NM, and its pending request will never be satisfied. > To solve this problem, the scheduler should check the node state in > FiCaSchedulerApp#accept to avoid committing outdated proposals on unusable > nodes.
[jira] [Created] (YARN-8575) CapacityScheduler should check node state before committing reserve/allocate proposals
Tao Yang created YARN-8575: -- Summary: CapacityScheduler should check node state before committing reserve/allocate proposals Key: YARN-8575 URL: https://issues.apache.org/jira/browse/YARN-8575 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 3.2.0, 3.1.2 Reporter: Tao Yang Assignee: Tao Yang Recently we found a new error as follows: {noformat} ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: node to unreserve doesn't exist, nodeid: host1:45454 {noformat} To reproduce this problem: (1) Create a reserve proposal for app1 on node1 (2) node1 is successfully decommissioned and removed from the node tracker (3) Try to commit the outdated reserve proposal; it will be accepted and applied. This error may occur after decommissioning some NMs. The application that prints the error log will always have a reserved container on a non-existent (decommissioned) NM, and its pending request will never be satisfied. To solve this problem, the scheduler should check the node state in FiCaSchedulerApp#accept to avoid committing outdated proposals on unusable nodes.
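The guard proposed in YARN-8575 can be illustrated with a small model: before committing a reserve/allocate proposal, verify the target node is still tracked and usable. `NodeTracker` and `acceptProposal` are simplified stand-ins for the CapacityScheduler internals, not the real API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative model of the proposed node-state check before committing an
// asynchronously produced scheduling proposal. Names are simplified
// stand-ins, not the actual CapacityScheduler code.
public class ProposalCommitSketch {

    static class NodeTracker {
        private final Map<String, Boolean> usableNodes = new ConcurrentHashMap<>();

        void addNode(String nodeId) { usableNodes.put(nodeId, true); }

        // Decommissioning removes the node from the tracker entirely.
        void removeNode(String nodeId) { usableNodes.remove(nodeId); }

        boolean isUsable(String nodeId) {
            return usableNodes.getOrDefault(nodeId, false);
        }
    }

    // FiCaSchedulerApp#accept analogue: reject proposals whose node vanished
    // between the async allocation attempt and the commit.
    static boolean acceptProposal(NodeTracker tracker, String nodeId) {
        return tracker.isUsable(nodeId);
    }

    public static void main(String[] args) {
        NodeTracker tracker = new NodeTracker();
        tracker.addNode("host1:45454");
        System.out.println(acceptProposal(tracker, "host1:45454")); // accepted

        tracker.removeNode("host1:45454"); // node is decommissioned
        // The outdated proposal is now rejected instead of leaving a
        // permanent reservation on a non-existent node.
        System.out.println(acceptProposal(tracker, "host1:45454"));
    }
}
```

The key point is that under async scheduling the proposal is produced and committed at different times, so the node's existence must be re-validated at commit time.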
[jira] [Commented] (YARN-8574) Allow dot in attribute values
[ https://issues.apache.org/jira/browse/YARN-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555360#comment-16555360 ] Bibin A Chundatt commented on YARN-8574: [~sunil.gov...@gmail.com] Please review. > Allow dot in attribute values > -- > > Key: YARN-8574 > URL: https://issues.apache.org/jira/browse/YARN-8574 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Minor > Attachments: YARN-8574-YARN-3409.001.patch > > > Currently "." is considered an invalid value. Enable support for it.
[jira] [Updated] (YARN-8574) Allow dot in attribute values
[ https://issues.apache.org/jira/browse/YARN-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-8574: --- Attachment: YARN-8574-YARN-3409.001.patch > Allow dot in attribute values > -- > > Key: YARN-8574 > URL: https://issues.apache.org/jira/browse/YARN-8574 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Minor > Attachments: YARN-8574-YARN-3409.001.patch > > > Currently "." is considered an invalid value. Enable support for it.
[jira] [Reopened] (YARN-6966) NodeManager metrics may return wrong negative values when NM restart
[ https://issues.apache.org/jira/browse/YARN-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reopened YARN-6966: -- > NodeManager metrics may return wrong negative values when NM restart > > > Key: YARN-6966 > URL: https://issues.apache.org/jira/browse/YARN-6966 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang >Assignee: Szilard Nemeth >Priority: Major > Fix For: 3.2.0, 3.1.2
[jira] [Updated] (YARN-6966) NodeManager metrics may return wrong negative values when NM restart
[ https://issues.apache.org/jira/browse/YARN-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-6966: - Attachment: YARN-6966-branch-2.001.patch > NodeManager metrics may return wrong negative values when NM restart > > > Key: YARN-6966 > URL: https://issues.apache.org/jira/browse/YARN-6966 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang >Assignee: Szilard Nemeth >Priority: Major > Fix For: 3.2.0, 3.1.2
[jira] [Commented] (YARN-8546) Resource leak caused by a reserved container being released more than once under async scheduling
[ https://issues.apache.org/jira/browse/YARN-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555469#comment-16555469 ] Weiwei Yang commented on YARN-8546: --- Thanks [~Tao Yang] for the contribution, I've committed this to trunk and branch-3.1. > Resource leak caused by a reserved container being released more than once > under async scheduling > - > > Key: YARN-8546 > URL: https://issues.apache.org/jira/browse/YARN-8546 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Affects Versions: 3.1.0 >Reporter: Weiwei Yang >Assignee: Tao Yang >Priority: Major > Labels: global-scheduling > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8546.001.patch
[jira] [Commented] (YARN-8546) Resource leak caused by a reserved container being released more than once under async scheduling
[ https://issues.apache.org/jira/browse/YARN-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555461#comment-16555461 ] Hudson commented on YARN-8546: -- FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #14635 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/14635/]) YARN-8546. Resource leak caused by a reserved container being released (wwei: rev 5be9f4a5d05c9cb99348719fe35626b1de3055db) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerApp.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacitySchedulerAsyncScheduling.java > Resource leak caused by a reserved container being released more than once > under async scheduling > - > > Key: YARN-8546 > URL: https://issues.apache.org/jira/browse/YARN-8546 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Affects Versions: 3.1.0 >Reporter: Weiwei Yang >Assignee: Tao Yang >Priority: Major > Labels: global-scheduling > Attachments: YARN-8546.001.patch > > > I was able to reproduce this issue by starting a job, and this job keeps > requesting containers until it uses up cluster available resource. My cluster > has 70200 vcores, and each task it applies for 100 vcores, I was expecting > total 702 containers can be allocated but eventually there was only 701. The > last container could not get allocated because queue used resource is updated > to be more than 100%. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Moved] (YARN-8577) Fix the broken anchor in SLS site-doc
[ https://issues.apache.org/jira/browse/YARN-8577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang moved HADOOP-15630 to YARN-8577: Affects Version/s: (was: 3.1.0) (was: 3.0.0) (was: 2.9.0) 2.9.0 3.0.0 3.1.0 Component/s: (was: documentation) documentation Key: YARN-8577 (was: HADOOP-15630) Project: Hadoop YARN (was: Hadoop Common) > Fix the broken anchor in SLS site-doc > - > > Key: YARN-8577 > URL: https://issues.apache.org/jira/browse/YARN-8577 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Affects Versions: 3.1.0, 3.0.0, 2.9.0 >Reporter: Weiwei Yang >Assignee: Weiwei Yang >Priority: Minor > Attachments: HADOOP-15630.001.patch > > > The anchor for section "Synthetic Load Generator" is currently broken. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8546) Resource leak caused by a reserved container being released more than once under async scheduling
[ https://issues.apache.org/jira/browse/YARN-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-8546: -- Summary: Resource leak caused by a reserved container being released more than once under async scheduling (was: A reserved container might be released multiple times under async scheduling) > Resource leak caused by a reserved container being released more than once > under async scheduling > - > > Key: YARN-8546 > URL: https://issues.apache.org/jira/browse/YARN-8546 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Affects Versions: 3.1.0 >Reporter: Weiwei Yang >Assignee: Tao Yang >Priority: Major > Labels: global-scheduling > Attachments: YARN-8546.001.patch > > > I was able to reproduce this issue by starting a job, and this job keeps > requesting containers until it uses up cluster available resource. My cluster > has 70200 vcores, and each task it applies for 100 vcores, I was expecting > total 702 containers can be allocated but eventually there was only 701. The > last container could not get allocated because queue used resource is updated > to be more than 100%. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8558) NM recovery level db not cleaned up properly on container finish
[ https://issues.apache.org/jira/browse/YARN-8558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555436#comment-16555436 ] Bibin A Chundatt commented on YARN-8558: [~sunilg] Make sense. Renamed variable and uploaded patch handling the same. > NM recovery level db not cleaned up properly on container finish > > > Key: YARN-8558 > URL: https://issues.apache.org/jira/browse/YARN-8558 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0, 3.1.0 >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Attachments: YARN-8558.001.patch, YARN-8558.002.patch > > > {code} > 2018-07-20 16:49:23,117 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: > Application application_1531994217928_0054 transitioned from NEW to INITING > 2018-07-20 16:49:23,204 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_18 with incomplete > records > 2018-07-20 16:49:23,204 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_19 with incomplete > records > 2018-07-20 16:49:23,204 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_20 with incomplete > records > 2018-07-20 16:49:23,205 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_21 with incomplete > records > 2018-07-20 16:49:23,205 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_22 with incomplete > records > 2018-07-20 16:49:23,205 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_23 with incomplete > records > 2018-07-20 
16:49:23,205 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_24 with incomplete > records > 2018-07-20 16:49:23,205 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_25 with incomplete > records > 2018-07-20 16:49:23,205 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_38 with incomplete > records > 2018-07-20 16:49:23,205 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_39 with incomplete > records > 2018-07-20 16:49:23,206 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_41 with incomplete > records > 2018-07-20 16:49:23,206 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_44 with incomplete > records > 2018-07-20 16:49:23,206 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_46 with incomplete > records > 2018-07-20 16:49:23,206 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_49 with incomplete > records > 2018-07-20 16:49:23,206 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_52 with incomplete > records > 2018-07-20 16:49:23,206 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_54 with incomplete > records > 2018-07-20 16:49:23,206 WARN > 
org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_73 with incomplete > records > 2018-07-20 16:49:23,207 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_74 with incomplete > records > 2018-07-20 16:49:23,207 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_75 with incomplete > records > 2018-07-20 16:49:23,207 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container
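The repeated "Remove container ... with incomplete records" warnings above correspond to cleanup that deletes every key under a container's prefix from the recovery store. A stand-in sketch using a sorted map in place of the real LevelDB (the key layout here is an assumption for illustration, not the actual schema):

```java
import java.util.Iterator;
import java.util.NavigableMap;

// Stand-in for the NM LevelDB state store: a sorted map with the same
// prefix-scan access pattern. Key layout is assumed, not the real schema.
public class StateStoreSketch {
    // Deletes every key under the container's prefix and returns the count,
    // so no partial record of the container survives the next recovery.
    static int removeContainerKeys(NavigableMap<String, String> db, String containerId) {
        String prefix = "ContainerManager/containers/" + containerId + "/";
        Iterator<String> it = db.tailMap(prefix, true).keySet().iterator();
        int removed = 0;
        while (it.hasNext()) {
            String key = it.next();
            if (!key.startsWith(prefix)) {
                break; // past this container's key range
            }
            it.remove();
            removed++;
        }
        return removed;
    }
}
```

If the finish-time cleanup misses some of these keys, recovery later finds the "incomplete records" logged above, which is the symptom this JIRA addresses.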
[jira] [Updated] (YARN-8558) NM recovery level db not cleaned up properly on container finish
[ https://issues.apache.org/jira/browse/YARN-8558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-8558: --- Attachment: YARN-8558.002.patch > NM recovery level db not cleaned up properly on container finish > > > Key: YARN-8558 > URL: https://issues.apache.org/jira/browse/YARN-8558 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0, 3.1.0 >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Attachments: YARN-8558.001.patch, YARN-8558.002.patch > > > {code} > 2018-07-20 16:49:23,117 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: > Application application_1531994217928_0054 transitioned from NEW to INITING > 2018-07-20 16:49:23,204 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_18 with incomplete > records > 2018-07-20 16:49:23,204 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_19 with incomplete > records > 2018-07-20 16:49:23,204 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_20 with incomplete > records > 2018-07-20 16:49:23,205 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_21 with incomplete > records > 2018-07-20 16:49:23,205 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_22 with incomplete > records > 2018-07-20 16:49:23,205 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_23 with incomplete > records > 2018-07-20 16:49:23,205 WARN > 
org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_24 with incomplete > records > 2018-07-20 16:49:23,205 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_25 with incomplete > records > 2018-07-20 16:49:23,205 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_38 with incomplete > records > 2018-07-20 16:49:23,205 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_39 with incomplete > records > 2018-07-20 16:49:23,206 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_41 with incomplete > records > 2018-07-20 16:49:23,206 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_44 with incomplete > records > 2018-07-20 16:49:23,206 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_46 with incomplete > records > 2018-07-20 16:49:23,206 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_49 with incomplete > records > 2018-07-20 16:49:23,206 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_52 with incomplete > records > 2018-07-20 16:49:23,206 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_54 with incomplete > records > 2018-07-20 16:49:23,206 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: 
> Remove container container_1531994217928_0001_01_73 with incomplete > records > 2018-07-20 16:49:23,207 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_74 with incomplete > records > 2018-07-20 16:49:23,207 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_75 with incomplete > records > 2018-07-20 16:49:23,207 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_78 with incomplete > records > 2018-07-20 16:49:23,207 WARN >
[jira] [Commented] (YARN-8418) App local logs could be leaked if log aggregation fails to initialize for the app
[ https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555429#comment-16555429 ] genericqa commented on YARN-8418: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 28s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 0s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 28m 0s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 32s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 24s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 34s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 7s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 27s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 22s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 12s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 15s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 32s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 7m 32s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 20s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 28s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 14s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 58s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 20s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 25s{color} | {color:green} hadoop-yarn-common in the patch passed. 
{color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 18m 24s{color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 39s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}106m 34s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.nodemanager.containermanager.TestContainerManager | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ba1ab08 | | JIRA Issue | YARN-8418 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12933012/YARN-8418.007.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 86144f4f8ba2 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 81d5950 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_171 | | findbugs | v3.1.0-RC1 | | unit |
[jira] [Created] (YARN-8576) Fix the broken anchor in SLS site-doc
Weiwei Yang created YARN-8576: - Summary: Fix the broken anchor in SLS site-doc Key: YARN-8576 URL: https://issues.apache.org/jira/browse/YARN-8576 Project: Hadoop YARN Issue Type: Bug Components: docs Reporter: Weiwei Yang Assignee: Weiwei Yang The anchor for section "Synthetic Load Generator" is currently broken. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8546) A reserved container might be released multiple times under async scheduling
[ https://issues.apache.org/jira/browse/YARN-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555413#comment-16555413 ] Weiwei Yang commented on YARN-8546: --- Thanks [~Tao Yang] for that patch, the fix looks good. +1. There is a minor typo in the log message, I will fix it during the commit. Thanks > A reserved container might be released multiple times under async scheduling > > > Key: YARN-8546 > URL: https://issues.apache.org/jira/browse/YARN-8546 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Affects Versions: 3.1.0 >Reporter: Weiwei Yang >Assignee: Tao Yang >Priority: Major > Labels: global-scheduling > Attachments: YARN-8546.001.patch > > > I was able to reproduce this issue by starting a job that keeps > requesting containers until it uses up the cluster's available resource. My cluster > has 70200 vcores, and each task applies for 100 vcores, so I was expecting > a total of 702 containers to be allocated, but eventually there were only 701. The > last container could not get allocated because the queue's used resource was updated > to more than 100%. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8541) RM startup failure on recovery after user deletion
[ https://issues.apache.org/jira/browse/YARN-8541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555388#comment-16555388 ] Sunil Govindan commented on YARN-8541: -- Thanks [~bibinchundatt]. TestPlacementManager is not in 3.1 and hence makes sense to remove for 3.1. +1 > RM startup failure on recovery after user deletion > -- > > Key: YARN-8541 > URL: https://issues.apache.org/jira/browse/YARN-8541 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.1.0 >Reporter: yimeng >Assignee: Bibin A Chundatt >Priority: Blocker > Attachments: YARN-8541-branch-3.1.003.patch, YARN-8541.001.patch, > YARN-8541.002.patch, YARN-8541.003.patch > > > My Hadoop version is 3.1.0. I found a problem where RM startup fails on > recovery, reproducible with the following test steps: > 1. Create a user "user1" with permission to submit apps. > 2. Use user1 to submit a job and wait for the job to finish. > 3. Delete user "user1". > 4. Restart YARN. > 5. The RM restart fails. > RM logs: > 2018-07-16 16:24:59,708 | INFO | main-EventThread | Initialized root queue > root: numChildQueue= 3, capacity=1.0, absoluteCapacity=1.0, > usedResources=usedCapacity=0.0, numApps=0, > numContainers=0 | CapacitySchedulerQueueManager.java:163 > 2018-07-16 16:24:59,708 | INFO | main-EventThread | Initialized queue > mappings, override: false | UserGroupMappingPlacementRule.java:232 > 2018-07-16 16:24:59,708 | INFO | main-EventThread | Initialized > CapacityScheduler with calculator=class > org.apache.hadoop.yarn.util.resource.DominantResourceCalculator, > minimumAllocation=<>, maximumAllocation=< vCores:32>>, asynchronousScheduling=false, asyncScheduleInterval=5ms | > CapacityScheduler.java:392 > 2018-07-16 16:24:59,709 | INFO | main-EventThread | dynamic-resources.xml not > found | Configuration.java:2767 > 2018-07-16 16:24:59,709 | INFO | main-EventThread | Initializing AMS > Processing chain. Root > Processor=[org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor].
> | AMSProcessingChain.java:62 > 2018-07-16 16:24:59,709 | INFO | main-EventThread | disabled placement > handler will be used, all scheduling requests will be rejected. | > ApplicationMasterService.java:130 > 2018-07-16 16:24:59,709 | INFO | main-EventThread | Adding > [org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor] > tp top of AMS Processing chain. | AMSProcessingChain.java:75 > 2018-07-16 16:24:59,713 | WARN | main-EventThread | Exception handling the > winning of election | ActiveStandbyElector.java:897 > org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active > at > org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146) > at > org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:893) > at > org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:473) > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:728) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:600) > Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when > transitioning to Active mode > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:325) > at > org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144) > ... 
4 more > Caused by: org.apache.hadoop.service.ServiceStateException: > org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application > application_1531624956005_0001 submitted by user super reason: No groups > found for user super > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105) > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1204) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1245) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1241) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1686) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1241) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:320) > ... 5 more > Caused by:
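The stack trace above shows a single app with an unresolvable submitting user ("No groups found for user") aborting the whole transition to active. A defensive recovery loop of the following shape (purely illustrative, not the actual RM code) would skip such apps instead of failing RM startup:

```java
import java.util.ArrayList;
import java.util.List;

// Purely illustrative; not the actual RMAppManager recovery code.
public class RecoverySketch {
    interface GroupResolver {
        List<String> getGroups(String user) throws Exception;
    }

    // Recovers what it can: apps whose submitting user no longer resolves
    // (e.g. the user was deleted) are skipped instead of aborting the whole
    // transition to active.
    static List<String> recoverApps(List<String> appUsers, GroupResolver resolver) {
        List<String> recovered = new ArrayList<>();
        for (String user : appUsers) {
            try {
                resolver.getGroups(user); // throws if the user is gone
                recovered.add(user);
            } catch (Exception e) {
                // log "No groups found for user ..." and continue
            }
        }
        return recovered;
    }
}
```

The design point is that recovery of historical state should tolerate entities that no longer exist on the cluster, since the finished app itself no longer needs its user to be resolvable.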
[jira] [Commented] (YARN-8574) Allow dot in attribute values
[ https://issues.apache.org/jira/browse/YARN-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555373#comment-16555373 ] genericqa commented on YARN-8574: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} docker {color} | {color:red} 0m 11s{color} | {color:red} Docker failed to build yetus/hadoop:abb62dd. {color} | \\ \\ || Subsystem || Report/Notes || | JIRA Issue | YARN-8574 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12933017/YARN-8574-YARN-3409.001.patch | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/21363/console | | Powered by | Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org | This message was automatically generated. > Allow dot in attribute values > -- > > Key: YARN-8574 > URL: https://issues.apache.org/jira/browse/YARN-8574 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Minor > Attachments: YARN-8574-YARN-3409.001.patch > > > Currently "." is considered as invalid value. Enable the same; -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
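The change requested in YARN-8574 amounts to relaxing the attribute-value validation pattern by one character. The patterns below are assumptions for illustration, not the literal ones in the Hadoop source:

```java
import java.util.regex.Pattern;

// Hypothetical before/after validation patterns; the relaxed one simply
// adds '.' to the allowed character class.
public class AttributeValueCheck {
    static final Pattern OLD = Pattern.compile("^[a-zA-Z0-9_-]*$");
    static final Pattern RELAXED = Pattern.compile("^[a-zA-Z0-9_.-]*$");

    static boolean validOld(String v) { return OLD.matcher(v).matches(); }
    static boolean validRelaxed(String v) { return RELAXED.matcher(v).matches(); }
}
```

A value like "centos7.4" motivates the change: version-style attribute values routinely contain dots, so rejecting '.' makes common values unusable.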
[jira] [Updated] (YARN-8418) App local logs could be leaked if log aggregation fails to initialize for the app
[ https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-8418: --- Attachment: YARN-8418.007.patch > App local logs could be leaked if log aggregation fails to initialize for the app > -- > > Key: YARN-8418 > URL: https://issues.apache.org/jira/browse/YARN-8418 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.0, 3.0.0-alpha1 >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Attachments: YARN-8418.001.patch, YARN-8418.002.patch, > YARN-8418.003.patch, YARN-8418.004.patch, YARN-8418.005.patch, > YARN-8418.006.patch, YARN-8418.007.patch > > > If log aggregation fails to init the createApp directory, container logs could get > leaked in the NM directory. > For a long-running application, this case is possible on restart of the NM after token renewal, or on > application submission with an invalid delegation token. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
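The intent of the fix is that a failed log-aggregation init should still schedule deletion of the app's local log directory rather than leave it behind. A minimal sketch of that cleanup step (paths and names are illustrative; the real NM delegates this to its deletion service):

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative cleanup step; the real NM routes deletions through the
// DeletionService rather than deleting inline.
public class LogCleanupSketch {
    // Removes the app's local log dir (assumed empty in this sketch);
    // returns true if it is gone afterwards.
    static boolean cleanupAppLogs(Path appLogDir) {
        try {
            Files.deleteIfExists(appLogDir);
            return !Files.exists(appLogDir);
        } catch (Exception e) {
            return false; // leave for retry; do not crash the failure path
        }
    }

    // Helper so callers avoid checked IOException when setting up a dir.
    static Path makeTempDir() {
        try {
            return Files.createTempDirectory("applogs");
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

The key property is that the cleanup runs on the aggregation-init *failure* path too, not only on the success path, which is where the leak in this JIRA came from.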
[jira] [Created] (YARN-8574) Allow dot in attribute values
Bibin A Chundatt created YARN-8574: -- Summary: Allow dot in attribute values Key: YARN-8574 URL: https://issues.apache.org/jira/browse/YARN-8574 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Currently "." is considered an invalid value. Enable support for it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8521) NPE in AllocationTagsManager when a container is removed more than once
[ https://issues.apache.org/jira/browse/YARN-8521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555354#comment-16555354 ] genericqa commented on YARN-8521: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 36s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 28m 25s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 45s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 14s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 49s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 32s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 28s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 49s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 46s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 46s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 11s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 49s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 46s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 24s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 69m 38s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 29s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black}132m 52s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ba1ab08 | | JIRA Issue | YARN-8521 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12932999/YARN-8521.003.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux bd3ed19bca42 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 81d5950 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_171 | | findbugs | v3.1.0-RC1 | | unit | https://builds.apache.org/job/PreCommit-YARN-Build/21360/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/21360/testReport/ | | Max. process+thread count | 941 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U:
[jira] [Commented] (YARN-8418) App local logs could leaked if log aggregation fails to initialize for the app
[ https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555304#comment-16555304 ] genericqa commented on YARN-8418: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 19s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 0s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 0s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 12s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 23s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 21s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 25s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 28s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 23s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 13s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 0m 22s{color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} | | {color:red}-1{color} | {color:red} compile {color} | {color:red} 1m 10s{color} | {color:red} hadoop-yarn in the patch failed. {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 1m 10s{color} | {color:red} hadoop-yarn in the patch failed. {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 11s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} mvnsite {color} | {color:red} 0m 24s{color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:red}-1{color} | {color:red} shadedclient {color} | {color:red} 4m 8s{color} | {color:red} patch has errors when building and testing our client artifacts. {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 0m 23s{color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed. 
{color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 22s{color} | {color:red} hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager generated 2 new + 9 unchanged - 0 fixed = 11 total (was 9) {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 19s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 24s{color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 24s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 66m 49s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ba1ab08 | | JIRA Issue | YARN-8418 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12933000/YARN-8418.006.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 30002f1ec09b 4.4.0-130-generic #156-Ubuntu SMP Thu Jun 14 08:53:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 81d5950 | | maven | version: Apache Maven 3.3.9 | | Default Java |
[jira] [Commented] (YARN-8418) App local logs could leaked if log aggregation fails to initialize for the app
[ https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555313#comment-16555313 ] Bibin A Chundatt commented on YARN-8418: Missed to add the event class. Attaching the patch again. > App local logs could leaked if log aggregation fails to initialize for the app > -- > > Key: YARN-8418 > URL: https://issues.apache.org/jira/browse/YARN-8418 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.0, 3.0.0-alpha1 >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Attachments: YARN-8418.001.patch, YARN-8418.002.patch, > YARN-8418.003.patch, YARN-8418.004.patch, YARN-8418.005.patch, > YARN-8418.006.patch, YARN-8418.007.patch > > > If log aggregation fails to init the createApp directory, container logs could get > leaked in the NM directory > For long-running applications this case is possible on NM restart after token renewal / on application submission with an invalid delegation token -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-8566) Add diagnostic message for unschedulable containers
[ https://issues.apache.org/jira/browse/YARN-8566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553032#comment-16553032 ] Antal Bálint Steinbach edited comment on YARN-8566 at 7/25/18 6:51 AM: --- Hi [~snemeth]! Thanks for the patch. I only have some minor comments: * Maybe it would be good to add diagnostic text for the 3rd case (UNKNOWN) * Using a switch for enums can be less verbose * You can extract app.getRMAppAttempt(appAttemptId).updateAMLaunchDiagnostics(... {code:java} // String errorMsg = ""; switch (e.getInvalidResourceType()){ case GREATER_THEN_MAX_ALLOCATION: errorMsg = "Cannot allocate containers as resource request is " + "greater than the maximum allowed allocation!"; break; case LESS_THAN_ZERO: errorMsg = "Cannot allocate containers as resource request is " + "less than zero!"; break; case UNKNOWN: default: errorMsg = "Cannot allocate containers for some unknown reasons!"; } app.getRMAppAttempt(appAttemptId).updateAMLaunchDiagnostics(errorMsg); {code} was (Author: bsteinbach): Hi [~snemeth]! Thanks for the patch. I only have some minor comments: * Maybe it would be good to add diagnostic text for the 3rd case (UNKNOWN) * Using a switch for enums can be less verbose * You can extract app.getRMAppAttempt(appAttemptId).updateAMLaunchDiagnostics(... 
{code:java} // String errorMsg = ""; switch (e.getInvalidResourceType()){ case GREATER_THEN_MAX_ALLOCATION: errorMsg = "Cannot allocate containers as resource request is " + "greater than the maximum allowed allocation!"; break; case LESS_THAN_ZERO: errorMsg = "Cannot allocate containers as resource request is " + "less than zero!"; case UNKNOWN: default: errorMsg = "Cannot allocate containers for some unknown reasons!"; } app.getRMAppAttempt(appAttemptId).updateAMLaunchDiagnostics(errorMsg); {code} > Add diagnostic message for unschedulable containers > --- > > Key: YARN-8566 > URL: https://issues.apache.org/jira/browse/YARN-8566 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-8566.001.patch, YARN-8566.002.patch, > YARN-8566.003.patch, YARN-8566.004.patch > > > If a queue is configured with maxResources set to 0 for a resource, and an > application is submitted to that queue that requests that resource, that > application will remain pending until it is removed or moved to a different > queue. This behavior can be realized without extended resources, but it’s > unlikely a user will create a queue that allows 0 memory or CPU. As the > number of resources in the system increases, this scenario will become more > common, and it will become harder to recognize these cases. Therefore, the > scheduler should indicate in the diagnostic string for an application if it > was not scheduled because of a 0 maxResources setting. > Example configuration (fair-scheduler.xml) : > {code:java} > > 10 > > 1 mb,2vcores > 9 mb,4vcores, 0gpu > 50 > -1.0f > 2.0 > fair > > > {code} > Command: > {code:java} > yarn jar > "./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0-SNAPSHOT.jar" pi > -Dmapreduce.job.queuename=sample_queue -Dmapreduce.map.resource.gpu=1 1 1000; > {code} > The job hangs and the application diagnostic info is empty. 
> Given that an exception is thrown before any mapper/reducer container is > created, the diagnostic message of the AM should be updated. 
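The difference between the two revisions of the review snippet quoted above is a single `break;`: in the earlier version the `LESS_THAN_ZERO` case falls through into `UNKNOWN`, so its message is silently overwritten. A minimal, self-contained sketch of that pitfall follows; the enum and message strings mirror the quoted review comment (`GREATER_THEN_MAX_ALLOCATION` is spelled as in the quoted code), but the class and method names here are illustrative only:

```java
/** Demonstrates why the missing break; in the earlier comment revision was a bug. */
public class SwitchFallThrough {

    enum InvalidResourceType { GREATER_THEN_MAX_ALLOCATION, LESS_THAN_ZERO, UNKNOWN }

    static String diagnose(InvalidResourceType type) {
        String errorMsg;
        switch (type) {
        case GREATER_THEN_MAX_ALLOCATION:
            errorMsg = "Cannot allocate containers as resource request is "
                + "greater than the maximum allowed allocation!";
            break;
        case LESS_THAN_ZERO:
            errorMsg = "Cannot allocate containers as resource request is "
                + "less than zero!";
            break; // without this break, execution falls through and the
                   // UNKNOWN message overwrites the LESS_THAN_ZERO one
        case UNKNOWN:
        default:
            errorMsg = "Cannot allocate containers for some unknown reasons!";
        }
        return errorMsg;
    }
}
```

With the `break` in place each case keeps its own message, which is exactly what the edited (current) revision of the comment does.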
[jira] [Comment Edited] (YARN-8566) Add diagnostic message for unschedulable containers
[ https://issues.apache.org/jira/browse/YARN-8566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553917#comment-16553917 ] Antal Bálint Steinbach edited comment on YARN-8566 at 7/25/18 6:50 AM: --- Hi [~snemeth] +1 LGTM (Non-binding) Thanks for the fix. was (Author: bsteinbach): Hi [~snemeth] +1 Thanks for the fix. > Add diagnostic message for unschedulable containers > --- > > Key: YARN-8566 > URL: https://issues.apache.org/jira/browse/YARN-8566 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-8566.001.patch, YARN-8566.002.patch, > YARN-8566.003.patch, YARN-8566.004.patch > > > If a queue is configured with maxResources set to 0 for a resource, and an > application is submitted to that queue that requests that resource, that > application will remain pending until it is removed or moved to a different > queue. This behavior can be realized without extended resources, but it’s > unlikely a user will create a queue that allows 0 memory or CPU. As the > number of resources in the system increases, this scenario will become more > common, and it will become harder to recognize these cases. Therefore, the > scheduler should indicate in the diagnostic string for an application if it > was not scheduled because of a 0 maxResources setting. > Example configuration (fair-scheduler.xml) : > {code:java} > > 10 > > 1 mb,2vcores > 9 mb,4vcores, 0gpu > 50 > -1.0f > 2.0 > fair > > > {code} > Command: > {code:java} > yarn jar > "./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0-SNAPSHOT.jar" pi > -Dmapreduce.job.queuename=sample_queue -Dmapreduce.map.resource.gpu=1 1 1000; > {code} > The job hangs and the application diagnostic info is empty. > Given that an exception is thrown before any mapper/reducer container is > created, the diagnostic message of the AM should be updated. 
[jira] [Commented] (YARN-8553) Reduce complexity of AHSWebService getApps method
[ https://issues.apache.org/jira/browse/YARN-8553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555216#comment-16555216 ] Szilard Nemeth commented on YARN-8553: -- Thanks [~sunilg] for jumping in for the review. > Reduce complexity of AHSWebService getApps method > - > > Key: YARN-8553 > URL: https://issues.apache.org/jira/browse/YARN-8553 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-8553.001.patch > > > YARN-8501 refactored RMWebService#getApp. Similar refactoring is required in > AHSWebService.
[jira] [Updated] (YARN-8572) YarnClient getContainers API should support filtering by container status
[ https://issues.apache.org/jira/browse/YARN-8572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-8572: - Description: YarnClient.getContainers should support filtering containers by their status - RUNNING, COMPLETED etc . This may require corresponding changes in ATS to filter by container status for a given application attempt (was: YarnClient.getContainers should support filtering containers by their status - RUNNING, COMPLETED etc . This may require corresponding changes in ATS to filter by container status for a given application attemopt) > YarnClient getContainers API should support filtering by container status > - > > Key: YARN-8572 > URL: https://issues.apache.org/jira/browse/YARN-8572 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Suma Shivaprasad >Priority: Major > > YarnClient.getContainers should support filtering containers by their status > - RUNNING, COMPLETED etc . This may require corresponding changes in ATS to > filter by container status for a given application attempt
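Until a filter parameter exists server-side, the behavior proposed above can be approximated client-side by filtering the container reports that `YarnClient.getContainers` already returns. The sketch below compiles without Hadoop on the classpath: the nested `ContainerReport`/`ContainerState` types are hypothetical stand-ins that only model the shape of the YARN API (a `getContainerState()` accessor is assumed), not the real classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Client-side sketch of filtering container reports by state. */
public class ContainerFilterDemo {

    // Stand-in for org.apache.hadoop.yarn.api.records.ContainerState.
    enum ContainerState { NEW, RUNNING, COMPLETE }

    // Stand-in for ContainerReport: just an id plus a state accessor.
    static class ContainerReport {
        private final String containerId;
        private final ContainerState state;

        ContainerReport(String containerId, ContainerState state) {
            this.containerId = containerId;
            this.state = state;
        }

        String getContainerId() { return containerId; }
        ContainerState getContainerState() { return state; }
    }

    /** Keeps only the reports whose state matches the requested one. */
    static List<ContainerReport> filterByState(List<ContainerReport> reports,
                                               ContainerState wanted) {
        List<ContainerReport> out = new ArrayList<>();
        for (ContainerReport r : reports) {
            if (r.getContainerState() == wanted) {
                out.add(r);
            }
        }
        return out;
    }
}
```

In real code the input list would come from `yarnClient.getContainers(appAttemptId)`; pushing the same predicate into the RM/ATS query, as the JIRA proposes, avoids shipping completed-container records over the wire at all.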
[jira] [Updated] (YARN-8418) App local logs could leaked if log aggregation fails to initialize for the app
[ https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-8418: --- Attachment: YARN-8418.006.patch > App local logs could leaked if log aggregation fails to initialize for the app > -- > > Key: YARN-8418 > URL: https://issues.apache.org/jira/browse/YARN-8418 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.0, 3.0.0-alpha1 >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Attachments: YARN-8418.001.patch, YARN-8418.002.patch, > YARN-8418.003.patch, YARN-8418.004.patch, YARN-8418.005.patch, > YARN-8418.006.patch > > > If log aggregation fails to init the createApp directory, container logs could get > leaked in the NM directory > For long-running applications this case is possible on NM restart after token renewal / on application submission with an invalid delegation token
[jira] [Updated] (YARN-8521) NPE in AllocationTagsManager when a container is removed more than once
[ https://issues.apache.org/jira/browse/YARN-8521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-8521: -- Attachment: YARN-8521.003.patch > NPE in AllocationTagsManager when a container is removed more than once > --- > > Key: YARN-8521 > URL: https://issues.apache.org/jira/browse/YARN-8521 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.1.0 >Reporter: Weiwei Yang >Assignee: Weiwei Yang >Priority: Major > Attachments: YARN-8521.001.patch, YARN-8521.002.patch, > YARN-8521.003.patch > > > We've seen that sometimes there is an NPE in AllocationTagsManager > {code:java} > private void removeTagFromInnerMap(Map<String, Long> innerMap, String tag) { > Long count = innerMap.get(tag); > if (count > 1) { // NPE!! > ... > {code} > it seems {{AllocationTagsManager#removeContainer}} somehow gets called more > than once for the same container.
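The NPE in the snippet quoted above comes from auto-unboxing: once the tag has already been removed, `innerMap.get(tag)` returns `null`, and evaluating `null > 1` unboxes it and throws. A minimal null-safe sketch of the same counting-map removal follows; it uses a plain `HashMap` rather than the real `AllocationTagsManager` internals, and the class and method names are illustrative, not Hadoop's:

```java
import java.util.HashMap;
import java.util.Map;

/** Null-safe sketch of decrementing a tag's reference count in a counting map. */
public class TagMapDemo {

    static void removeTag(Map<String, Long> innerMap, String tag) {
        Long count = innerMap.get(tag);
        if (count == null) {
            // Tag already gone (e.g. the container was removed twice);
            // returning here avoids the NPE that unboxing null would cause.
            return;
        }
        if (count > 1) {
            innerMap.put(tag, count - 1);   // still referenced elsewhere
        } else {
            innerMap.remove(tag);           // last reference: drop the entry
        }
    }

    public static void main(String[] args) {
        Map<String, Long> tags = new HashMap<>();
        tags.put("hbase", 2L);
        removeTag(tags, "hbase"); // decrements to 1
        removeTag(tags, "hbase"); // removes the entry
        removeTag(tags, "hbase"); // duplicate removal: no exception
    }
}
```

The null check only papers over the symptom, of course; the patches on this JIRA also address why `removeContainer` fires twice for one container in the first place.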