[jira] [Updated] (YARN-4762) NMs failing on DelegatingLinuxContainerRuntime init with LCE on
[ https://issues.apache.org/jira/browse/YARN-4762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Varun Vasudev updated YARN-4762:
--------------------------------
    Priority: Blocker  (was: Critical)

> NMs failing on DelegatingLinuxContainerRuntime init with LCE on
> ----------------------------------------------------------------
>
>                 Key: YARN-4762
>                 URL: https://issues.apache.org/jira/browse/YARN-4762
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Sidharta Seethana
>            Priority: Blocker
>
> Seeing this exception and the NMs crash.
> {code}
> 2016-03-03 16:47:57,807 DEBUG org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService is started
> 2016-03-03 16:47:58,027 DEBUG org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: checkLinuxExecutorSetup: [/hadoop/hadoop-yarn-nodemanager/bin/container-executor, --checksetup]
> 2016-03-03 16:47:58,043 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl: Mount point Based on mtab file: /proc/mounts. Controller mount point not writable for: cpu
> 2016-03-03 16:47:58,043 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime: Unable to get cgroups handle.
> 2016-03-03 16:47:58,044 DEBUG org.apache.hadoop.service.AbstractService: noteFailure org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize container executor
> 2016-03-03 16:47:58,044 INFO org.apache.hadoop.service.AbstractService: Service NodeManager failed in state INITED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize container executor
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize container executor
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:240)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:539)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:587)
> Caused by: java.io.IOException: Failed to initialize linux container runtime(s)!
>         at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:207)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:238)
>         ... 3 more
> 2016-03-03 16:47:58,047 DEBUG org.apache.hadoop.service.AbstractService: Service: NodeManager entered state STOPPED
> 2016-03-03 16:47:58,047 DEBUG org.apache.hadoop.service.CompositeService: NodeManager: stopping services, size=0
> 2016-03-03 16:47:58,047 DEBUG org.apache.hadoop.service.AbstractService: Service: org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService entered state STOPPED
> 2016-03-03 16:47:58,047 FATAL org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize container executor
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:240)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:539)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:587)
> Caused by: java.io.IOException: Failed to initialize linux container runtime(s)!
>         at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:207)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:238)
>         ... 3 more
> {code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4553) Add cgroups support for docker containers
[ https://issues.apache.org/jira/browse/YARN-4553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Varun Vasudev updated YARN-4553:
--------------------------------
    Fix Version/s: 2.9.0

> Add cgroups support for docker containers
> ------------------------------------------
>
>                 Key: YARN-4553
>                 URL: https://issues.apache.org/jira/browse/YARN-4553
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: yarn
>            Reporter: Sidharta Seethana
>            Assignee: Sidharta Seethana
>             Fix For: 2.9.0
>
>         Attachments: YARN-4553.001.patch, YARN-4553.002.patch, YARN-4553.003.patch
>
> Currently, cgroups-based resource isolation does not work with docker containers under YARN. The processes in these containers are launched by the docker daemon and are not children of a container-executor process. Docker supports a --cgroup-parent flag which can be used to point to the container-specific cgroups that are created by the NodeManager. This will allow the NodeManager to manage cgroups (as it does today) while allowing resource isolation to work with docker containers.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
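[Editor's note] To make the {{--cgroup-parent}} mechanism above concrete, here is a minimal, hypothetical sketch of how a docker launch command could place the container's processes under the cgroup the NM already manages. The class, method, and cgroup path are illustrative assumptions; this is not the actual DockerLinuxContainerRuntime code.

{code}
import java.util.ArrayList;
import java.util.List;

public class DockerCgroupParentSketch {
  // Builds an illustrative "docker run" invocation: --cgroup-parent makes
  // the docker daemon create the container's cgroup underneath the
  // NM-created, container-specific cgroup, so NM resource limits apply.
  public static List<String> buildRunCommand(String image, String containerId,
      String nmCgroupPath) {
    List<String> cmd = new ArrayList<>();
    cmd.add("docker");
    cmd.add("run");
    // nmCgroupPath is the cgroup the NodeManager already creates for this
    // container, e.g. "/hadoop-yarn/<container-id>" (hypothetical path).
    cmd.add("--cgroup-parent=" + nmCgroupPath);
    cmd.add("--name=" + containerId);
    cmd.add(image);
    return cmd;
  }

  public static void main(String[] args) {
    System.out.println(buildRunCommand("centos:7",
        "container_0001_01_000002", "/hadoop-yarn/container_0001_01_000002"));
  }
}
{code}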
[jira] [Updated] (YARN-4761) NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations on fair scheduler
[ https://issues.apache.org/jira/browse/YARN-4761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sangjin Lee updated YARN-4761:
------------------------------
    Attachment: YARN-4761.02.patch

Thanks for the pointer [~rohithsharma]. It's good to know.

Posted patch v.2. I moved the unit test from {{TestCapacityScheduler}} to {{TestAbstractYarnScheduler}}. I can confirm that the test fails before the fair scheduler changes and passes after.

> NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations on fair scheduler
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4761
>                 URL: https://issues.apache.org/jira/browse/YARN-4761
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.6.4
>            Reporter: Sangjin Lee
>            Assignee: Sangjin Lee
>         Attachments: YARN-4761.01.patch, YARN-4761.02.patch
>
> YARN-3802 uncovered an issue with the scheduler where the resource calculation can be incorrect due to async event handling. It was subsequently fixed by YARN-4344, but it was never fixed for the fair scheduler.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4761) NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations on fair scheduler
[ https://issues.apache.org/jira/browse/YARN-4761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179316#comment-15179316 ]

Hadoop QA commented on YARN-4761:
---------------------------------

(x) *-1 overall*

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 12s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| +1 | mvninstall | 6m 45s | trunk passed |
| +1 | compile | 0m 27s | trunk passed with JDK v1.8.0_74 |
| +1 | compile | 0m 29s | trunk passed with JDK v1.7.0_95 |
| +1 | checkstyle | 0m 19s | trunk passed |
| +1 | mvnsite | 0m 35s | trunk passed |
| +1 | mvneclipse | 0m 15s | trunk passed |
| +1 | findbugs | 1m 7s | trunk passed |
| +1 | javadoc | 0m 22s | trunk passed with JDK v1.8.0_74 |
| +1 | javadoc | 0m 25s | trunk passed with JDK v1.7.0_95 |
| +1 | mvninstall | 0m 29s | the patch passed |
| +1 | compile | 0m 23s | the patch passed with JDK v1.8.0_74 |
| +1 | javac | 0m 23s | the patch passed |
| +1 | compile | 0m 27s | the patch passed with JDK v1.7.0_95 |
| +1 | javac | 0m 27s | the patch passed |
| +1 | checkstyle | 0m 16s | the patch passed |
| +1 | mvnsite | 0m 34s | the patch passed |
| +1 | mvneclipse | 0m 12s | the patch passed |
| +1 | whitespace | 0m 0s | Patch has no whitespace issues. |
| +1 | findbugs | 1m 17s | the patch passed |
| +1 | javadoc | 0m 19s | the patch passed with JDK v1.8.0_74 |
| +1 | javadoc | 0m 24s | the patch passed with JDK v1.7.0_95 |
| -1 | unit | 70m 52s | hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_74. |
| -1 | unit | 72m 6s | hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. |
| +1 | asflicense | 0m 17s | Patch does not generate ASF License warnings. |
| | | 159m 31s | |

|| Reason || Tests ||
| JDK v1.8.0_74 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
| | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| JDK v1.7.0_95 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
| | hadoop.yarn.server.resourcemanager.TestAMAuthorization |

|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:0ca8df7 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12791373/YARN-4761.01.patch |
| JIRA Issue
[jira] [Commented] (YARN-2883) Queuing of container requests in the NM
[ https://issues.apache.org/jira/browse/YARN-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179208#comment-15179208 ]

Hadoop QA commented on YARN-2883:
---------------------------------

(x) *-1 overall*

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 17s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 8 new or modified test files. |
| 0 | mvndep | 1m 33s | Maven dependency ordering for branch |
| +1 | mvninstall | 7m 29s | yarn-2877 passed |
| +1 | compile | 1m 43s | yarn-2877 passed with JDK v1.8.0_74 |
| +1 | compile | 2m 5s | yarn-2877 passed with JDK v1.7.0_95 |
| +1 | checkstyle | 0m 40s | yarn-2877 passed |
| +1 | mvnsite | 2m 1s | yarn-2877 passed |
| +1 | mvneclipse | 1m 0s | yarn-2877 passed |
| -1 | findbugs | 0m 45s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common in yarn-2877 has 3 extant Findbugs warnings. |
| -1 | findbugs | 1m 7s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager in yarn-2877 has 1 extant Findbugs warnings. |
| +1 | javadoc | 1m 44s | yarn-2877 passed with JDK v1.8.0_74 |
| +1 | javadoc | 4m 2s | yarn-2877 passed with JDK v1.7.0_95 |
| 0 | mvndep | 0m 9s | Maven dependency ordering for patch |
| +1 | mvninstall | 1m 35s | the patch passed |
| +1 | compile | 2m 3s | the patch passed with JDK v1.8.0_74 |
| +1 | cc | 2m 3s | the patch passed |
| +1 | javac | 2m 3s | the patch passed |
| +1 | compile | 2m 15s | the patch passed with JDK v1.7.0_95 |
| +1 | cc | 2m 15s | the patch passed |
| +1 | javac | 2m 15s | the patch passed |
| -1 | checkstyle | 0m 37s | hadoop-yarn-project/hadoop-yarn: patch generated 30 new + 233 unchanged - 2 fixed = 263 total (was 235) |
| +1 | mvnsite | 1m 54s | the patch passed |
| +1 | mvneclipse | 0m 44s | the patch passed |
| -1 | whitespace | 0m 0s | The patch has 33 line(s) that end in whitespace. Use git apply --whitespace=fix. |
| +1 | findbugs | 4m 30s | the patch passed |
| +1 | javadoc | 1m 32s | the patch passed with JDK v1.8.0_74 |
| +1 | javadoc | 4m 5s | the patch passed with JDK v1.7.0_95 |
| +1 | unit | 0m 23s | hadoop-yarn-api in the patch passed with JDK v1.8.0_74. |
| +1 | unit | 0m 21s |
[jira] [Updated] (YARN-4762) NMs failing on DelegatingLinuxContainerRuntime init with LCE on
[ https://issues.apache.org/jira/browse/YARN-4762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sidharta Seethana updated YARN-4762:
------------------------------------
    Priority: Critical  (was: Major)

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4762) NMs failing on DelegatingLinuxContainerRuntime init with LCE on
[ https://issues.apache.org/jira/browse/YARN-4762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179196#comment-15179196 ]

Sidharta Seethana commented on YARN-4762:
-----------------------------------------

Changing priority to critical - NMs don't seem to come up when cgroups are not in use.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4762) NMs failing on DelegatingLinuxContainerRuntime init with LCE on
[ https://issues.apache.org/jira/browse/YARN-4762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179194#comment-15179194 ]

Sidharta Seethana commented on YARN-4762:
-----------------------------------------

/cc [~vvasudev]

When the new resource handler mechanism was introduced, a CGroupsHandlerImpl instance was only created/initialized if one of the resource handlers was enabled. Initialization does one of the following:
# If mounting of cgroups is enabled, it does not mount anything, because mounting is done on demand for individual resource handlers.
# If mounting of cgroups is disabled, {{initializeControllerPathsFromMtab}} gets called - which checks writability for each of the cgroup mounts.

(2) was correct behavior because the cgroups handler wasn't created unless at least one of the (cgroups-based) resource handlers was in use. However, with YARN-4553, a CGroupsHandler is always created, even if there are no cgroups-based handlers in use. This (incorrectly) leads to an attempt to check whether the cgroup mount paths are writable. I'll take a look at fixing this.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
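[Editor's note] The fix direction described in the comment above amounts to making the cgroups handler lazy again. A minimal, self-contained sketch follows; the types and the helper name are illustrative stubs, not the actual Hadoop classes or the committed patch.

{code}
import java.io.IOException;
import java.util.List;

public class CGroupsInitSketch {
  interface CGroupsHandler { void initialize() throws IOException; }
  interface ResourceHandler { }

  // configuredHandlers: the cgroups-based resource handlers enabled in
  // yarn-site.xml (cpu, memory, ...); empty when none are enabled.
  static CGroupsHandler maybeCreateHandler(
      List<ResourceHandler> configuredHandlers, CGroupsHandler handler)
      throws IOException {
    if (configuredHandlers.isEmpty()) {
      // No cgroups-based handler in use: never touch the cgroup mount
      // points, so a plain-LCE NodeManager initializes cleanly.
      return null;
    }
    // Only now run initialization, which (when cgroup mounting is
    // disabled) checks /proc/mounts for writable controller mount points.
    handler.initialize();
    return handler;
  }
}
{code}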
[jira] [Assigned] (YARN-4762) NMs failing on DelegatingLinuxContainerRuntime init with LCE on
[ https://issues.apache.org/jira/browse/YARN-4762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sidharta Seethana reassigned YARN-4762:
---------------------------------------
    Assignee: Sidharta Seethana

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4762) NMs failing on DelegatingLinuxContainerRuntime init with LCE on
Vinod Kumar Vavilapalli created YARN-4762:
---------------------------------------------

             Summary: NMs failing on DelegatingLinuxContainerRuntime init with LCE on
                 Key: YARN-4762
                 URL: https://issues.apache.org/jira/browse/YARN-4762
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Vinod Kumar Vavilapalli

Seeing this exception and the NMs crash.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4761) NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations on fair scheduler
[ https://issues.apache.org/jira/browse/YARN-4761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179171#comment-15179171 ]

Rohith Sharma K S commented on YARN-4761:
-----------------------------------------

I think we can move the test to TestAbstractYarnScheduler to test this JIRA's behavior. The TestAbstractYarnScheduler test class extends ParameterizedSchedulerTestBase, which runs for both CS and FS. Test cases that are specific to CS behavior are either skipped for FairScheduler or added in the FairScheduler-specific package. The test cases that assume CS as the default scheduler need to be revisited for fair scheduler functionality impacts like this JIRA.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
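[Editor's note] For readers unfamiliar with the pattern mentioned above, this is how a JUnit 4 parameterized base class can run one test body against both schedulers. The sketch below is a simplified illustration of the idea, not the actual ParameterizedSchedulerTestBase source.

{code}
import java.util.Arrays;
import java.util.Collection;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;

@RunWith(Parameterized.class)
public class SchedulerParamSketch {
  // Each entry becomes one run of every @Test method in the class.
  @Parameterized.Parameters
  public static Collection<Object[]> schedulers() {
    return Arrays.asList(new Object[][] {
        {"org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler"},
        {"org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"}});
  }

  private final String schedulerClass;

  public SchedulerParamSketch(String schedulerClass) {
    this.schedulerClass = schedulerClass;
  }

  @Test
  public void runsAgainstBothSchedulers() {
    // A real test would set yarn.resourcemanager.scheduler.class to
    // schedulerClass on the Configuration before starting a MockRM.
    System.out.println("would run against " + schedulerClass);
  }
}
{code}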
[jira] [Commented] (YARN-4761) NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations on fair scheduler
[ https://issues.apache.org/jira/browse/YARN-4761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179127#comment-15179127 ]

Sangjin Lee commented on YARN-4761:
-----------------------------------

I'd like to discuss the unit test for this. I could essentially duplicate the same test that was added to {{TestCapacityScheduler}}. However, it would be largely a copy-and-paste, and I'm not too happy about that, but I could still do it. Do let me know your thoughts on this.

A larger question: we have a large number of generic RM unit tests out there, but they are exercised only against the capacity scheduler. Should we try to find ways to exercise them against the fair scheduler as well? That would be the most effective way of ensuring the soundness of any changes.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4761) NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations on fair scheduler
[ https://issues.apache.org/jira/browse/YARN-4761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sangjin Lee updated YARN-4761:
------------------------------
    Attachment: YARN-4761.01.patch

Posted patch v.1. Applied the same fix to the fair scheduler.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
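[Editor's note] The essence of "the same fix" (the YARN-4344-style change) is to recompute the cluster total from the scheduler's own snapshot of the node, not from the asynchronously updated RMNode. The sketch below uses stub types to show the bookkeeping; it is illustrative only, not the committed FairScheduler patch.

{code}
public class ReconnectSketch {
  static class Resource {
    long memory; int vcores;
    Resource(long m, int v) { memory = m; vcores = v; }
  }

  // Scheduler-wide total across all nodes.
  Resource clusterResource = new Resource(0, 0);

  // Called when an NM reconnects advertising a new capability.
  void onNodeReconnect(Resource oldSchedulerView, Resource newCapability) {
    // Subtract what the scheduler previously accounted for this node.
    // Using the RMNode here is the bug: its state may already reflect the
    // new capability, so the old amount is never removed.
    clusterResource.memory -= oldSchedulerView.memory;
    clusterResource.vcores -= oldSchedulerView.vcores;
    // Add the node's new capability exactly once.
    clusterResource.memory += newCapability.memory;
    clusterResource.vcores += newCapability.vcores;
  }
}
{code}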
[jira] [Commented] (YARN-4117) End to end unit test with mini YARN cluster for AMRMProxy Service
[ https://issues.apache.org/jira/browse/YARN-4117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179070#comment-15179070 ]

Jian He commented on YARN-4117:
-------------------------------

bq. MiniYarnCluster allocates the ports for the RM during the Start phase, and there was no way to pass the information to the AMRMProxy.

I think this approach may not work for the HA case, where you will have multiple RM scheduler addresses. Could you check how the NM talks to the RM? I think the NM has the same problem; we may let the AMRMProxy follow the same method?

> End to end unit test with mini YARN cluster for AMRMProxy Service
> ------------------------------------------------------------------
>
>                 Key: YARN-4117
>                 URL: https://issues.apache.org/jira/browse/YARN-4117
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager, resourcemanager
>            Reporter: Kishore Chaliparambil
>            Assignee: Giovanni Matteo Fumarola
>         Attachments: YARN-4117.v0.patch, YARN-4117.v1.patch
>
> YARN-2884 introduces a proxy between AM and RM. This JIRA proposes an end to end unit test using a mini YARN cluster for the AMRMProxy service. This test will validate register, allocate and finish application, and token renewal.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
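[Editor's note] A sketch of the suggestion in the comment above: instead of capturing ports from the MiniYARNCluster at start time, resolve the RM scheduler address from the Configuration, the same general path the NM uses to reach the RM. Whether this covers the HA case depends on the configuration object in play (YarnConfiguration's address resolution handles HA rm-ids); treat this as an assumption, not a statement about the committed patch.

{code}
import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmRmProxySketch {
  static InetSocketAddress schedulerAddress(Configuration conf) {
    // Resolve the scheduler address from config rather than from ports
    // captured at MiniYARNCluster start time.
    return conf.getSocketAddr(
        YarnConfiguration.RM_SCHEDULER_ADDRESS,
        YarnConfiguration.DEFAULT_RM_SCHEDULER_ADDRESS,
        YarnConfiguration.DEFAULT_RM_SCHEDULER_PORT);
  }

  public static void main(String[] args) {
    System.out.println(schedulerAddress(new YarnConfiguration()));
  }
}
{code}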
[jira] [Commented] (YARN-4761) NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations on fair scheduler
[ https://issues.apache.org/jira/browse/YARN-4761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179066#comment-15179066 ]

zhihai xu commented on YARN-4761:
---------------------------------

Good finding, [~sjlee0]! The same issue could also happen for the fair scheduler. We should decouple RMNode status from the fair scheduler as well.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4761) NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations on fair scheduler
[ https://issues.apache.org/jira/browse/YARN-4761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179045#comment-15179045 ]

Sangjin Lee commented on YARN-4761:
-----------------------------------

To see this, add this code to {{TestResourceTrackerService#testReconnectNode}}:
{code}
  public void testReconnectNode() throws Exception {
    Configuration conf = new Configuration();
    conf.set(YarnConfiguration.RM_SCHEDULER,
        "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");
    rm = new MockRM(conf) {
    ...
{code}

and the test breaks:
{noformat}
testReconnectNode(org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService)  Time elapsed: 1.188 sec  <<< FAILURE!
java.lang.AssertionError: expected:<15360> but was:<10240>
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:743)
	at org.junit.Assert.assertEquals(Assert.java:118)
	at org.junit.Assert.assertEquals(Assert.java:555)
	at org.junit.Assert.assertEquals(Assert.java:542)
	at org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testReconnectNode(TestResourceTrackerService.java:1044)
{noformat}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4761) NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations on fair scheduler
Sangjin Lee created YARN-4761:
---------------------------------

             Summary: NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations on fair scheduler
                 Key: YARN-4761
                 URL: https://issues.apache.org/jira/browse/YARN-4761
             Project: Hadoop YARN
          Issue Type: Bug
          Components: fairscheduler
    Affects Versions: 2.6.4
            Reporter: Sangjin Lee
            Assignee: Sangjin Lee

YARN-3802 uncovered an issue with the scheduler where the resource calculation can be incorrect due to async event handling. It was subsequently fixed by YARN-4344, but it was never fixed for the fair scheduler.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2888) Corrective mechanisms for rebalancing NM container queues
[ https://issues.apache.org/jira/browse/YARN-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15178995#comment-15178995 ]

Hadoop QA commented on YARN-2888:
---------------------------------

(x) *-1 overall*

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 12s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
| 0 | mvndep | 1m 27s | Maven dependency ordering for branch |
| +1 | mvninstall | 7m 32s | yarn-2877 passed |
| +1 | compile | 1m 51s | yarn-2877 passed with JDK v1.8.0_72 |
| +1 | compile | 2m 6s | yarn-2877 passed with JDK v1.7.0_95 |
| +1 | checkstyle | 0m 40s | yarn-2877 passed |
| +1 | mvnsite | 1m 26s | yarn-2877 passed |
| +1 | mvneclipse | 0m 43s | yarn-2877 passed |
| -1 | findbugs | 0m 42s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common in yarn-2877 has 3 extant Findbugs warnings. |
| -1 | findbugs | 1m 3s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager in yarn-2877 has 1 extant Findbugs warnings. |
| +1 | javadoc | 1m 15s | yarn-2877 passed with JDK v1.8.0_72 |
| +1 | javadoc | 3m 45s | yarn-2877 passed with JDK v1.7.0_95 |
| 0 | mvndep | 0m 11s | Maven dependency ordering for patch |
| +1 | mvninstall | 1m 10s | the patch passed |
| +1 | compile | 1m 46s | the patch passed with JDK v1.8.0_72 |
| +1 | cc | 1m 46s | the patch passed |
| +1 | javac | 1m 46s | the patch passed |
| +1 | compile | 2m 5s | the patch passed with JDK v1.7.0_95 |
| +1 | cc | 2m 5s | the patch passed |
| +1 | javac | 2m 5s | the patch passed |
| -1 | checkstyle | 0m 36s | hadoop-yarn-project/hadoop-yarn: patch generated 10 new + 415 unchanged - 1 fixed = 425 total (was 416) |
| +1 | mvnsite | 1m 19s | the patch passed |
| +1 | mvneclipse | 0m 34s | the patch passed |
| +1 | whitespace | 0m 0s | Patch has no whitespace issues. |
| -1 | findbugs | 1m 19s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager generated 1 new + 1 unchanged - 0 fixed = 2 total (was 1) |
| +1 | javadoc | 1m 11s | the patch passed with JDK v1.8.0_72 |
| +1 | javadoc | 3m 35s | the patch passed with JDK v1.7.0_95 |
| -1 | unit | 0m 21s | hadoop-yarn-api in the patch failed with JDK v1.8.0_72. |
[jira] [Commented] (YARN-4749) Generalize config file handling in container-executor
[ https://issues.apache.org/jira/browse/YARN-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15178835#comment-15178835 ]

Hadoop QA commented on YARN-4749:
---------------------------------

(/) *+1 overall*

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 17m 48s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
| +1 | mvninstall | 6m 59s | trunk passed |
| +1 | compile | 0m 24s | trunk passed with JDK v1.8.0_74 |
| +1 | compile | 0m 25s | trunk passed with JDK v1.7.0_95 |
| +1 | mvnsite | 0m 29s | trunk passed |
| +1 | mvneclipse | 0m 13s | trunk passed |
| +1 | mvninstall | 0m 24s | the patch passed |
| +1 | compile | 0m 27s | the patch passed with JDK v1.8.0_74 |
| +1 | cc | 0m 27s | the patch passed |
| +1 | javac | 0m 27s | the patch passed |
| +1 | compile | 0m 25s | the patch passed with JDK v1.7.0_95 |
| +1 | cc | 0m 25s | the patch passed |
| +1 | javac | 0m 25s | the patch passed |
| +1 | mvnsite | 0m 27s | the patch passed |
| +1 | mvneclipse | 0m 10s | the patch passed |
| +1 | whitespace | 0m 0s | Patch has no whitespace issues. |
| +1 | unit | 9m 35s | hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_74. |
| +1 | unit | 9m 49s | hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_95. |
| +1 | asflicense | 0m 18s | Patch does not generate ASF License warnings. |
| | | 48m 10s | |

|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:0ca8df7 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12791263/YARN-4749.002.patch |
| JIRA Issue | YARN-4749 |
| Optional Tests | asflicense compile cc mvnsite javac unit |
| uname | Linux 782dc5b63277 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 0a9f00a |
| Default Java | 1.7.0_95 |
| Multi-JDK versions | /usr/lib/jvm/java-8-oracle:1.8.0_74 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95 |
| JDK v1.7.0_95 Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/10705/testReport/ |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/10705/console |
| Powered by | Apache Yetus 0.3.0-SNAPSHOT http://yetus.apache.org |

This message was automatically generated.

> Generalize config file handling in container-executor
> ------------------------------------------------------
>
>                 Key: YARN-4749
>                 URL: https://issues.apache.org/jira/browse/YARN-4749
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>
[jira] [Updated] (YARN-2883) Queuing of container requests in the NM
[ https://issues.apache.org/jira/browse/YARN-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantinos Karanasos updated YARN-2883:
-----------------------------------------
    Attachment: YARN-2883-yarn-2877.003.patch

Adding to the patch a first set of test cases for the queuing of containers.

> Queuing of container requests in the NM
> ----------------------------------------
>
>                 Key: YARN-2883
>                 URL: https://issues.apache.org/jira/browse/YARN-2883
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager, resourcemanager
>            Reporter: Konstantinos Karanasos
>            Assignee: Konstantinos Karanasos
>         Attachments: YARN-2883-yarn-2877.001.patch, YARN-2883-yarn-2877.002.patch, YARN-2883-yarn-2877.003.patch
>
> We propose to add a queue in each NM, where queueable container requests can be held.
> Based on the available resources in the node and the containers in the queue, the NM will decide when to allow the execution of a queued container.
> In order to ensure the instantaneous start of a guaranteed-start container, the NM may decide to pre-empt/kill running queueable containers.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
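[Editor's note] The description above boils down to three decisions: run a queueable container if resources are free, otherwise hold it in the NM queue, and kill running queueable work to start a guaranteed container immediately. The sketch below illustrates that logic with memory as the only resource; all names are hypothetical and this is not the actual NM patch.

{code}
import java.util.ArrayDeque;
import java.util.Deque;

public class NmQueueSketch {
  static class Container {
    final long mem; final boolean guaranteed;
    Container(long mem, boolean guaranteed) {
      this.mem = mem; this.guaranteed = guaranteed;
    }
  }

  private final Deque<Container> queued = new ArrayDeque<>();
  private final Deque<Container> runningQueueable = new ArrayDeque<>();
  private long freeMem;

  NmQueueSketch(long totalMem) { this.freeMem = totalMem; }

  void onStartRequest(Container c) {
    if (c.guaranteed) {
      // Guaranteed-start: kill running queueable containers until it fits.
      while (c.mem > freeMem && !runningQueueable.isEmpty()) {
        Container victim = runningQueueable.removeLast();
        freeMem += victim.mem; // a real NM would send a kill signal here
      }
      launch(c, false);
    } else if (c.mem <= freeMem) {
      launch(c, true);   // room available: run the queueable container now
    } else {
      queued.addLast(c); // otherwise hold it in the NM queue
    }
  }

  void onContainerFinished(Container c, boolean wasQueueable) {
    freeMem += c.mem;
    if (wasQueueable) runningQueueable.remove(c);
    // Start queued containers with the freed resources, in arrival order.
    while (!queued.isEmpty() && queued.peekFirst().mem <= freeMem) {
      launch(queued.removeFirst(), true);
    }
  }

  private void launch(Container c, boolean queueable) {
    freeMem -= c.mem;
    if (queueable) runningQueueable.addLast(c);
  }
}
{code}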
[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE
[ https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178833#comment-15178833 ] Sidharta Seethana commented on YARN-4744: - Thanks, [~jlowe] ! > Too many signal to container failure in case of LCE > --- > > Key: YARN-4744 > URL: https://issues.apache.org/jira/browse/YARN-4744 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Bibin A Chundatt >Assignee: Sidharta Seethana > Attachments: YARN-4744.001.patch, YARN-4744.002.patch > > > Install HA cluster in secure mode > Enable LCE with cgroups > Start server with dsperf user > Submit mapreduce application terasort/teragen with user yarn/dsperf > Too many signal to container failure > Submit with user the exception is thrown > {noformat} > 2014-03-02 09:20:38,689 INFO > SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: > Authorization successful for testing (auth:TOKEN) for protocol=interface > org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB > 2014-03-02 09:20:40,158 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Event EventType: KILL_CONTAINER sent to absent container > container_e02_1393731146548_0001_01_13 > 2014-03-02 09:20:43,071 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Container container_e02_1393731146548_0001_01_09 succeeded > 2014-03-02 09:20:43,072 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_e02_1393731146548_0001_01_09 transitioned from > RUNNING to EXITED_WITH_SUCCESS > 2014-03-02 09:20:43,073 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Cleaning up container container_e02_1393731146548_0001_01_09 > 2014-03-02 09:20:43,075 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime: > Using container runtime: DefaultLinuxContainerRuntime > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 9. Privileged Execution Operation Output: > main : command provided 2 > main : run as user is yarn > main : requested yarn user is yarn > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > yarn, yarn, 2, 9370, 15] > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime: > Signal container failed. 
Exception: > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=9: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:520) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:745) > Caused by: ExitCodeException exitCode=9: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:927) > at org.apache.hadoop.util.Shell.run(Shell.java:838) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150) > ... 9 more > 2014-03-02 09:20:43,113 INFO > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=yarn > OPERATION=Container Finished - Succeeded
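The warnings above come from container-executor exiting with code 9 when asked to signal a process that has already exited, i.e. the kill racing with a normal container exit. One common way to tolerate that race is to treat the "process gone" result as benign; the sketch below illustrates the idea, with the exit-code constant, class, and helper names all assumed for illustration rather than taken from the patch.
{code}
/**
 * Sketch of tolerating the kill-vs-normal-exit race behind the warnings
 * above. Exit code 9 and every name here are illustrative assumptions,
 * not the actual YARN-4744 change.
 */
public class SignalRaceSketch {
  static class ExitCodeException extends RuntimeException {
    final int exitCode;
    ExitCodeException(int exitCode) {
      super("exitCode=" + exitCode);
      this.exitCode = exitCode;
    }
  }

  // Hypothetical container-executor code meaning "no such process".
  static final int EXIT_NO_SUCH_PROCESS = 9;

  /** Returns true if delivered, false if the process had already exited. */
  static boolean signalContainer(String pid, int signal) {
    try {
      runPrivilegedSignal(pid, signal); // stands in for the privileged call
      return true;
    } catch (ExitCodeException e) {
      if (e.exitCode == EXIT_NO_SUCH_PROCESS) {
        // The container exited on its own first; losing this race is benign,
        // so report it quietly instead of surfacing a scary warning.
        return false;
      }
      throw e; // anything else is a real signaling failure
    }
  }

  private static void runPrivilegedSignal(String pid, int signal) {
    throw new ExitCodeException(EXIT_NO_SUCH_PROCESS); // stub for the demo
  }

  public static void main(String[] args) {
    System.out.println("delivered: " + signalContainer("9370", 15));
  }
}
{code}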
[jira] [Assigned] (YARN-4760) proxy redirect to history server uses wrong URL
[ https://issues.apache.org/jira/browse/YARN-4760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger reassigned YARN-4760: - Assignee: Eric Badger > proxy redirect to history server uses wrong URL > --- > > Key: YARN-4760 > URL: https://issues.apache.org/jira/browse/YARN-4760 > Project: Hadoop YARN > Issue Type: Bug > Components: webapp >Affects Versions: 2.7.2 >Reporter: Jason Lowe >Assignee: Eric Badger > > YARN-3975 added the ability to redirect to the history server when an app > fails to specify a tracking URL and the RM has since forgotten about the > application. However it redirects to /apps/ instead of /app/ > which is the wrong destination page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4760) proxy redirect to history server uses wrong URL
Jason Lowe created YARN-4760: Summary: proxy redirect to history server uses wrong URL Key: YARN-4760 URL: https://issues.apache.org/jira/browse/YARN-4760 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.7.2 Reporter: Jason Lowe YARN-3975 added the ability to redirect to the history server when an app fails to specify a tracking URL and the RM has since forgotten about the application. However it redirects to /apps/ instead of /app/ which is the wrong destination page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
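The fix is essentially a one-path-segment change when the proxy builds the redirect. A tiny hedged sketch, in which the history-server base path, method, and application id are assumptions, not the actual WebAppProxyServlet code:
{code}
public class ProxyRedirectSketch {
  // Buggy form appended "/apps/" (a listing page); the per-application page
  // lives under "/app/". Base-URL handling here is illustrative only.
  static String historyRedirectUrl(String historyBase, String appId) {
    return historyBase + "/app/" + appId;
  }

  public static void main(String[] args) {
    System.out.println(historyRedirectUrl(
        "http://ahs-host:8188/applicationhistory",   // assumed AHS base path
        "application_1457000000000_0001"));          // hypothetical app id
  }
}
{code}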
[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE
[ https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178775#comment-15178775 ] Hadoop QA commented on YARN-4744: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 12m 54s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 2 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 7s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 15s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 30s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 55s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 22s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 22s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 24s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 14s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 28s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 12s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 6s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 23s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_74. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 48s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 18s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 47m 47s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12791259/YARN-4744.002.patch | | JIRA Issue | YARN-4744 | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux 526b1198d34e 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 0a9f00a | | Default Java | 1.7.0_95 | | Multi-JDK versions |
[jira] [Updated] (YARN-4758) Enable discovery of AMs by containers
[ https://issues.apache.org/jira/browse/YARN-4758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated YARN-4758: - Description: {color:red} This is already discussed on the umbrella JIRA YARN-1489. Copying some of my condensed summary from the design doc (section 3.2.10.3) of YARN-4692. {color} Even after the existing work in Work-preserving AM restart (Section 3.1.2 / YARN-1489), we still haven’t solved the problem of old running containers not knowing where the new AM starts running after the previous AM crashes. This is an especially important problem to be solved for long running services where we’d like to avoid killing service containers when AMs failover. So far, we left this as a task for the apps, but solving it in YARN is highly desirable. (Task) This looks very much like service-registry (YARN-913), but for app containers to discover their own AMs. Combining this requirement (of any container being able to find their AM across failovers) with those of services (to be able to find through DNS where a service container is running - YARN-4757) will push our registry scalability needs much higher than those of just service endpoints. This calls for a more distributed solution for registry readers, something that is discussed in the comments section of YARN-1489 and MAPREDUCE-6608. See comment https://issues.apache.org/jira/browse/YARN-1489?focusedCommentId=13862359=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13862359 was: {color:red} This is already discussed on the umbrella JIRA YARN-1489. Copying some of my condensed summary from the design doc (section 3.2.10.3) of YARN-4692. {color} Even after the existing work in Work-preserving AM restart (Section 3.1.2 / YARN-1489), we still haven’t solved the problem of old running containers not knowing where the new AM starts running after the previous AM crashes. This is an especially important problem to be solved for long running services where we’d like to avoid killing service containers when AMs failover. So far, we left this as a task for the apps, but solving it in YARN is highly desirable. (Task) This looks very much like service-registry (YARN-913), but for app containers to discover their own AMs. Combining this requirement (of any container being able to find their AM across failovers) with those of services (to be able to find through DNS where a service container is running - YARN-4757) will push our registry scalability needs much higher than those of just service endpoints. This calls for a more distributed solution for registry readers, something that is discussed in the comments section of YARN-1489 and MAPREDUCE-6608. > Enable discovery of AMs by containers > - > > Key: YARN-4758 > URL: https://issues.apache.org/jira/browse/YARN-4758 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli > > {color:red} > This is already discussed on the umbrella JIRA YARN-1489. > Copying some of my condensed summary from the design doc (section 3.2.10.3) > of YARN-4692. > {color} > Even after the existing work in Work-preserving AM restart (Section 3.1.2 / > YARN-1489), we still haven’t solved the problem of old running containers not > knowing where the new AM starts running after the previous AM crashes. This > is an especially important problem to be solved for long running services > where we’d like to avoid killing service containers when AMs failover. So > far, we left this as a task for the apps, but solving it in YARN is highly > desirable. 
(Task) This looks very much like service-registry (YARN-913), > but for app containers to discover their own AMs. > Combining this requirement (of any container being able to find their AM > across failovers) with those of services (to be able to find through DNS > where a service container is running - YARN-4757) will push our registry > scalability needs much higher than those of just service endpoints. > This calls for a more distributed solution for registry readers, something > that is discussed in the comments section of YARN-1489 and MAPREDUCE-6608. > See comment > https://issues.apache.org/jira/browse/YARN-1489?focusedCommentId=13862359=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13862359 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4749) Generalize config file handling in container-executor
[ https://issues.apache.org/jira/browse/YARN-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sidharta Seethana updated YARN-4749: Attachment: YARN-4749.002.patch Uploaded a new patch based on review feedback. > Generalize config file handling in container-executor > - > > Key: YARN-4749 > URL: https://issues.apache.org/jira/browse/YARN-4749 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Sidharta Seethana >Assignee: Sidharta Seethana > Attachments: YARN-4749.001.patch, YARN-4749.002.patch > > > The current implementation of container-executor already supports parsing of > key value pairs from a config file. However, it is currently restricted to > {{container-executor.cfg}} and cannot be reused for parsing additional > config/command files. Generalizing this is a required step for YARN-4245. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
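The generalization amounts to parsing key=value pairs from an arbitrary path instead of a hard-coded one. container-executor itself is written in C, but the idea is language-neutral; here is a minimal sketch in Java, with all names assumed and no relation to the attached patches:
{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of the generalization described above: key=value parsing that takes
 * the file path as a parameter instead of hard-coding container-executor.cfg.
 */
public class KeyValueConfigSketch {
  public static Map<String, String> parse(Path file) throws IOException {
    Map<String, String> conf = new HashMap<>();
    for (String line : Files.readAllLines(file)) {
      line = line.trim();
      if (line.isEmpty() || line.startsWith("#")) continue; // skip comments
      int eq = line.indexOf('=');
      if (eq <= 0) continue; // ignore malformed lines
      conf.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
    }
    return conf;
  }
}
{code}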
[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE
[ https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178722#comment-15178722 ] Jason Lowe commented on YARN-4744: -- Thanks for updating the patch! +1, pending Jenkins. > Too many signal to container failure in case of LCE > --- > > Key: YARN-4744 > URL: https://issues.apache.org/jira/browse/YARN-4744 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Bibin A Chundatt >Assignee: Sidharta Seethana > Attachments: YARN-4744.001.patch, YARN-4744.002.patch > > > Install HA cluster in secure mode > Enable LCE with cgroups > Start server with dsperf user > Submit mapreduce application terasort/teragen with user yarn/dsperf > Too many signal to container failure > Submit with user the exception is thrown > {noformat} > 2014-03-02 09:20:38,689 INFO > SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: > Authorization successful for testing (auth:TOKEN) for protocol=interface > org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB > 2014-03-02 09:20:40,158 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Event EventType: KILL_CONTAINER sent to absent container > container_e02_1393731146548_0001_01_13 > 2014-03-02 09:20:43,071 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Container container_e02_1393731146548_0001_01_09 succeeded > 2014-03-02 09:20:43,072 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_e02_1393731146548_0001_01_09 transitioned from > RUNNING to EXITED_WITH_SUCCESS > 2014-03-02 09:20:43,073 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Cleaning up container container_e02_1393731146548_0001_01_09 > 2014-03-02 09:20:43,075 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime: > Using container runtime: DefaultLinuxContainerRuntime > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 9. Privileged Execution Operation Output: > main : command provided 2 > main : run as user is yarn > main : requested yarn user is yarn > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > yarn, yarn, 2, 9370, 15] > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime: > Signal container failed. 
Exception: > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=9: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:520) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:745) > Caused by: ExitCodeException exitCode=9: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:927) > at org.apache.hadoop.util.Shell.run(Shell.java:838) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150) > ... 9 more > 2014-03-02 09:20:43,113 INFO > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=yarn > OPERATION=Container Finished -
[jira] [Commented] (YARN-4740) container complete msg may lost while AM restart in race condition
[ https://issues.apache.org/jira/browse/YARN-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178708#comment-15178708 ] sandflee commented on YARN-4740: Yes, this patch ensures the AM receives each container-complete msg at least once, but any single AM process receives it only once. > container complete msg may lost while AM restart in race condition > -- > > Key: YARN-4740 > URL: https://issues.apache.org/jira/browse/YARN-4740 > Project: Hadoop YARN > Issue Type: Bug >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-4740.01.patch, YARN-4740.02.patch > > > 1. A container completes, and the msg is stored in > RMAppAttempt.justFinishedContainers. > 2. The AM calls allocate, but crashes before the allocateResponse reaches it. > 3. The AM restarts and cannot get the container-complete msg. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
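The at-least-once behavior under discussion can be pictured as a small buffer: completed-container statuses are dropped only once a later allocate() implicitly acknowledges them, and a restarted attempt re-queues whatever was sent but never acknowledged. A hedged sketch; the names are illustrative, not the actual RMAppAttempt fields:
{code}
import java.util.ArrayList;
import java.util.List;

/** Toy model of at-least-once delivery of completed-container statuses. */
class JustFinishedBufferSketch {
  private final List<String> pending = new ArrayList<>();
  private final List<String> sentUnacked = new ArrayList<>();

  /** Called when a container completes. */
  synchronized void containerFinished(String status) {
    pending.add(status);
  }

  /**
   * Each allocate() heartbeat from the same attempt implicitly acknowledges
   * the previous response, so the last batch is dropped and the rest sent.
   */
  synchronized List<String> onAllocate() {
    sentUnacked.clear();
    sentUnacked.addAll(pending);
    pending.clear();
    return new ArrayList<>(sentUnacked);
  }

  /** A new attempt re-queues anything the crashed AM may never have read. */
  synchronized void onAttemptRestart() {
    pending.addAll(0, sentUnacked);
    sentUnacked.clear();
  }
}
{code}
Note that this gives at-least-once rather than exactly-once delivery: a restarted attempt can re-receive statuses the previous AM already consumed, which is exactly the duplication trade-off raised elsewhere in this thread.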
[jira] [Updated] (YARN-4744) Too many signal to container failure in case of LCE
[ https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sidharta Seethana updated YARN-4744: Attachment: YARN-4744.002.patch Uploaded a new patch - added a new PrivilegedOperation constructor and fixed all instances that had a null second argument to use this new constructor. Also filed YARN-4759 to revisit signal handling for docker containers. [~jlowe], please take a look? Thanks! > Too many signal to container failure in case of LCE > --- > > Key: YARN-4744 > URL: https://issues.apache.org/jira/browse/YARN-4744 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Bibin A Chundatt >Assignee: Sidharta Seethana > Attachments: YARN-4744.001.patch, YARN-4744.002.patch > > > Install HA cluster in secure mode > Enable LCE with cgroups > Start server with dsperf user > Submit mapreduce application terasort/teragen with user yarn/dsperf > Too many signal to container failure > Submit with user the exception is thrown > {noformat} > 2014-03-02 09:20:38,689 INFO > SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: > Authorization successful for testing (auth:TOKEN) for protocol=interface > org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB > 2014-03-02 09:20:40,158 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Event EventType: KILL_CONTAINER sent to absent container > container_e02_1393731146548_0001_01_13 > 2014-03-02 09:20:43,071 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Container container_e02_1393731146548_0001_01_09 succeeded > 2014-03-02 09:20:43,072 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_e02_1393731146548_0001_01_09 transitioned from > RUNNING to EXITED_WITH_SUCCESS > 2014-03-02 09:20:43,073 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Cleaning up container container_e02_1393731146548_0001_01_09 > 2014-03-02 09:20:43,075 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime: > Using container runtime: DefaultLinuxContainerRuntime > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 9. Privileged Execution Operation Output: > main : command provided 2 > main : run as user is yarn > main : requested yarn user is yarn > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > yarn, yarn, 2, 9370, 15] > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime: > Signal container failed. 
Exception: > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=9: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:520) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:745) > Caused by: ExitCodeException exitCode=9: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:927) > at org.apache.hadoop.util.Shell.run(Shell.java:838) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117) > at >
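The constructor change described above is a small API cleanup: an overload so that call sites stop passing an explicit null. A sketch of the shape of that change, with illustrative names rather than the actual PrivilegedOperation class:
{code}
/** Illustrative shape of the API cleanup; not the actual YARN class. */
class PrivilegedOpSketch {
  enum OperationType { SIGNAL_CONTAINER, LAUNCH_CONTAINER }

  private final OperationType type;
  private final String args;

  // Original two-argument form.
  PrivilegedOpSketch(OperationType type, String args) {
    this.type = type;
    this.args = args;
  }

  // New convenience form replacing calls like new PrivilegedOpSketch(type, null).
  PrivilegedOpSketch(OperationType type) {
    this(type, null);
  }
}
{code}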
[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE
[ https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178661#comment-15178661 ] Jason Lowe commented on YARN-4744: -- Ah, ignore my previous comment -- I see now that we don't have the docker tools in place to know whether or not the kill failed in that way. > Too many signal to container failure in case of LCE > --- > > Key: YARN-4744 > URL: https://issues.apache.org/jira/browse/YARN-4744 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Bibin A Chundatt >Assignee: Sidharta Seethana > Attachments: YARN-4744.001.patch > > > Install HA cluster in secure mode > Enable LCE with cgroups > Start server with dsperf user > Submit mapreduce application terasort/teragen with user yarn/dsperf > Too many signal to container failure > Submit with user the exception is thrown > {noformat} > 2014-03-02 09:20:38,689 INFO > SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: > Authorization successful for testing (auth:TOKEN) for protocol=interface > org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB > 2014-03-02 09:20:40,158 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Event EventType: KILL_CONTAINER sent to absent container > container_e02_1393731146548_0001_01_13 > 2014-03-02 09:20:43,071 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Container container_e02_1393731146548_0001_01_09 succeeded > 2014-03-02 09:20:43,072 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_e02_1393731146548_0001_01_09 transitioned from > RUNNING to EXITED_WITH_SUCCESS > 2014-03-02 09:20:43,073 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Cleaning up container container_e02_1393731146548_0001_01_09 > 2014-03-02 09:20:43,075 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime: > Using container runtime: DefaultLinuxContainerRuntime > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 9. Privileged Execution Operation Output: > main : command provided 2 > main : run as user is yarn > main : requested yarn user is yarn > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > yarn, yarn, 2, 9370, 15] > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime: > Signal container failed. 
Exception: > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=9: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:520) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:745) > Caused by: ExitCodeException exitCode=9: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:927) > at org.apache.hadoop.util.Shell.run(Shell.java:838) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150) > ... 9 more > 2014-03-02 09:20:43,113 INFO >
[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE
[ https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178658#comment-15178658 ] Jason Lowe commented on YARN-4744: -- Even if the Docker stuff doesn't work totally, it has the same logic and will have the same issue at a high level (i.e.: will always be a race between kill and container exiting on its own) -- so why wouldn't we want to make the change at least for doc purposes for those coming along later to fix it? > Too many signal to container failure in case of LCE > --- > > Key: YARN-4744 > URL: https://issues.apache.org/jira/browse/YARN-4744 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Bibin A Chundatt >Assignee: Sidharta Seethana > Attachments: YARN-4744.001.patch > > > Install HA cluster in secure mode > Enable LCE with cgroups > Start server with dsperf user > Submit mapreduce application terasort/teragen with user yarn/dsperf > Too many signal to container failure > Submit with user the exception is thrown > {noformat} > 2014-03-02 09:20:38,689 INFO > SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: > Authorization successful for testing (auth:TOKEN) for protocol=interface > org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB > 2014-03-02 09:20:40,158 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Event EventType: KILL_CONTAINER sent to absent container > container_e02_1393731146548_0001_01_13 > 2014-03-02 09:20:43,071 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Container container_e02_1393731146548_0001_01_09 succeeded > 2014-03-02 09:20:43,072 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_e02_1393731146548_0001_01_09 transitioned from > RUNNING to EXITED_WITH_SUCCESS > 2014-03-02 09:20:43,073 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Cleaning up container container_e02_1393731146548_0001_01_09 > 2014-03-02 09:20:43,075 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime: > Using container runtime: DefaultLinuxContainerRuntime > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 9. Privileged Execution Operation Output: > main : command provided 2 > main : run as user is yarn > main : requested yarn user is yarn > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > yarn, yarn, 2, 9370, 15] > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime: > Signal container failed. 
Exception: > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=9: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:520) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:745) > Caused by: ExitCodeException exitCode=9: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:927) > at org.apache.hadoop.util.Shell.run(Shell.java:838) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117) > at >
[jira] [Commented] (YARN-4740) container complete msg may lost while AM restart in race condition
[ https://issues.apache.org/jira/browse/YARN-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178657#comment-15178657 ] Jian He commented on YARN-4740: --- [~sandflee], actually, with this fix, couldn't the 2nd AM receive duplicate container statuses that the 1st AM has already received? > container complete msg may lost while AM restart in race condition > -- > > Key: YARN-4740 > URL: https://issues.apache.org/jira/browse/YARN-4740 > Project: Hadoop YARN > Issue Type: Bug >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-4740.01.patch, YARN-4740.02.patch > > > 1. A container completes, and the msg is stored in > RMAppAttempt.justFinishedContainers. > 2. The AM calls allocate, but crashes before the allocateResponse reaches it. > 3. The AM restarts and cannot get the container-complete msg. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart
[ https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178653#comment-15178653 ] Vinod Kumar Vavilapalli commented on YARN-1489: --- bq. That and the "Old running containers don't know where the new AM is running." issue is big enough that we shouldn't close this umbrella as done. Just filed YARN-4758. > [Umbrella] Work-preserving ApplicationMaster restart > > > Key: YARN-1489 > URL: https://issues.apache.org/jira/browse/YARN-1489 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Vinod Kumar Vavilapalli >Assignee: Vinod Kumar Vavilapalli > Attachments: Work preserving AM restart.pdf > > > Today if AMs go down, > - RM kills all the containers of that ApplicationAttempt > - New ApplicationAttempt doesn't know where the previous containers are > running > - Old running containers don't know where the new AM is running. > We need to fix this to enable work-preserving AM restart. The later two > potentially can be done at the app level, but it is good to have a common > solution for all apps where-ever possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4759) Revisit signalContainer() for docker containers
Sidharta Seethana created YARN-4759: --- Summary: Revisit signalContainer() for docker containers Key: YARN-4759 URL: https://issues.apache.org/jira/browse/YARN-4759 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sidharta Seethana The current signal handling (in the DockerContainerRuntime) needs to be revisited for docker containers. For example, container reacquisition on NM restart might not work, depending on which user the process in the container runs as. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2888) Corrective mechanisms for rebalancing NM container queues
[ https://issues.apache.org/jira/browse/YARN-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178647#comment-15178647 ] Hadoop QA commented on YARN-2888: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 17s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 46s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 40s {color} | {color:green} yarn-2877 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 44s {color} | {color:green} yarn-2877 passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 5s {color} | {color:green} yarn-2877 passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 38s {color} | {color:green} yarn-2877 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 26s {color} | {color:green} yarn-2877 passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 40s {color} | {color:green} yarn-2877 passed {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 0m 43s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common in yarn-2877 has 3 extant Findbugs warnings. {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 4s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager in yarn-2877 has 1 extant Findbugs warnings. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 15s {color} | {color:green} yarn-2877 passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 39s {color} | {color:green} yarn-2877 passed with JDK v1.7.0_95 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 10s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 11s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 47s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 1m 47s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 47s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 6s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 2m 6s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 6s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 35s {color} | {color:red} hadoop-yarn-project/hadoop-yarn: patch generated 9 new + 415 unchanged - 1 fixed = 424 total (was 416) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 21s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 34s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 18s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager generated 1 new + 1 unchanged - 0 fixed = 2 total (was 1) {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 13s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 35s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 20s {color} | {color:red} hadoop-yarn-api in the patch failed with JDK v1.8.0_72. {color} | |
[jira] [Created] (YARN-4758) Enable discovery of AMs by containers
Vinod Kumar Vavilapalli created YARN-4758: - Summary: Enable discovery of AMs by containers Key: YARN-4758 URL: https://issues.apache.org/jira/browse/YARN-4758 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli {color:red} This is already discussed on the umbrella JIRA YARN-1489. Copying some of my condensed summary from the design doc (section 3.2.10.3) of YARN-4692. {color} Even after the existing work in Work-preserving AM restart (Section 3.1.2 / YARN-1489), we still haven’t solved the problem of old running containers not knowing where the new AM starts running after the previous AM crashes. This is an especially important problem to be solved for long running services where we’d like to avoid killing service containers when AMs failover. So far, we left this as a task for the apps, but solving it in YARN is highly desirable. (Task) This looks very much like service-registry (YARN-913), but for app containers to discover their own AMs. Combining this requirement (of any container being able to find their AM across failovers) with those of services (to be able to find through DNS where a service container is running - YARN-4757) will push our registry scalability needs much higher than those of just service endpoints. This calls for a more distributed solution for registry readers, something that is discussed in the comments section of YARN-1489 and MAPREDUCE-6608. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4737) Use CSRF Filter in YARN
[ https://issues.apache.org/jira/browse/YARN-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Maron updated YARN-4737: - Attachment: YARN-4737.004.patch > Use CSRF Filter in YARN > --- > > Key: YARN-4737 > URL: https://issues.apache.org/jira/browse/YARN-4737 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, resourcemanager, webapp >Reporter: Jonathan Maron >Assignee: Jonathan Maron > Attachments: YARN-4737.001.patch, YARN-4737.002.patch, > YARN-4737.003.patch, YARN-4737.004.patch > > > A CSRF filter was added to hadoop common > (https://issues.apache.org/jira/browse/HADOOP-12691). The aim of this JIRA > is to come up with a mechanism to integrate this filter into the webapps for > which it is applicable (web apps that may establish an authenticated > identity). That includes the RM, NM, and mapreduce jobhistory web app. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
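For context on what such integration does: a CSRF-prevention filter of this kind typically rejects state-changing requests that lack a custom header, since a cross-site form cannot set one. Below is a minimal hedged sketch of that check using the plain servlet API; the header name and logic are assumptions for illustration, not the actual filter added in HADOOP-12691.
{code}
import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

/** Minimal header-based CSRF check; illustrative only. */
public class CsrfFilterSketch implements Filter {
  private static final String HEADER = "X-XSRF-HEADER"; // assumed header name

  @Override public void init(FilterConfig cfg) {}

  @Override
  public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
      throws IOException, ServletException {
    HttpServletRequest r = (HttpServletRequest) req;
    String m = r.getMethod();
    boolean safe = m.equals("GET") || m.equals("HEAD") || m.equals("OPTIONS");
    if (safe || r.getHeader(HEADER) != null) {
      chain.doFilter(req, res); // same-origin JS can set the header; forms cannot
    } else {
      ((HttpServletResponse) res).sendError(
          HttpServletResponse.SC_BAD_REQUEST, "Missing CSRF header");
    }
  }

  @Override public void destroy() {}
}
{code}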
[jira] [Updated] (YARN-2888) Corrective mechanisms for rebalancing NM container queues
[ https://issues.apache.org/jira/browse/YARN-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun Suresh updated YARN-2888: -- Attachment: YARN-2888-yarn-2877.002.patch Updating the above patch with additional documentation and some minor refactoring > Corrective mechanisms for rebalancing NM container queues > - > > Key: YARN-2888 > URL: https://issues.apache.org/jira/browse/YARN-2888 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Konstantinos Karanasos >Assignee: Arun Suresh > Attachments: YARN-2888-yarn-2877.001.patch, > YARN-2888-yarn-2877.002.patch > > > Bad queuing decisions by the LocalRMs (e.g., due to the distributed nature of > the scheduling decisions or due to having a stale image of the system) may > lead to an imbalance in the waiting times of the NM container queues. This > can in turn have an impact on job execution times and cluster utilization. > To this end, we introduce corrective mechanisms that may remove (whenever > needed) container requests from overloaded queues, adding them to less-loaded > ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
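The corrective mechanism described above can be pictured as a skew-bounded move of queued (not yet running) requests from the longest NM queue to the shortest. A toy illustration under assumed data structures; the actual patch's policies are more involved:
{code}
import java.util.Deque;
import java.util.Map;

/** Toy skew-bounded rebalancing of per-NM request queues; illustrative only. */
public class QueueRebalancerSketch {
  public static void rebalance(Map<String, Deque<String>> nmQueues, int maxSkew) {
    while (true) {
      String longest = null, shortest = null;
      for (Map.Entry<String, Deque<String>> e : nmQueues.entrySet()) {
        if (longest == null || e.getValue().size() > nmQueues.get(longest).size()) {
          longest = e.getKey();
        }
        if (shortest == null || e.getValue().size() < nmQueues.get(shortest).size()) {
          shortest = e.getKey();
        }
      }
      if (longest == null
          || nmQueues.get(longest).size() - nmQueues.get(shortest).size() <= maxSkew) {
        return; // queues are balanced within the allowed skew
      }
      // Shift one queued request from the overloaded queue to the idle one.
      nmQueues.get(shortest).addLast(nmQueues.get(longest).pollLast());
    }
  }
}
{code}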
[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
[ https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178629#comment-15178629 ] Jonathan Maron commented on YARN-4757: -- I've actually been working on a DNS approach for some time. I'll upload a document describing the approach soon. > [Umbrella] Simplified discovery of services via DNS mechanisms > -- > > Key: YARN-4757 > URL: https://issues.apache.org/jira/browse/YARN-4757 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Vinod Kumar Vavilapalli >Assignee: Jonathan Maron > > [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track > all related efforts.] > In addition to completing the present story of service-registry (YARN-913), > we also need to simplify the access to the registry entries. The existing > read mechanisms of the YARN Service Registry are currently limited to a > registry specific (java) API and a REST interface. In practice, this makes it > very difficult for wiring up existing clients and services. For e.g, dynamic > configuration of dependent endpoints of a service is not easy to implement > using the present registry-read mechanisms, *without* code-changes to > existing services. > A good solution to this is to expose the registry information through a more > generic and widely used discovery mechanism: DNS. Service Discovery via DNS > uses the well-known DNS interfaces to browse the network for services. > YARN-913 in fact talked about such a DNS based mechanism but left it as a > future task. (Task) Having the registry information exposed via DNS > simplifies the life of services. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
[ https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Maron reassigned YARN-4757: Assignee: Jonathan Maron > [Umbrella] Simplified discovery of services via DNS mechanisms > -- > > Key: YARN-4757 > URL: https://issues.apache.org/jira/browse/YARN-4757 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Vinod Kumar Vavilapalli >Assignee: Jonathan Maron > > [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track > all related efforts.] > In addition to completing the present story of service-registry (YARN-913), > we also need to simplify the access to the registry entries. The existing > read mechanisms of the YARN Service Registry are currently limited to a > registry specific (java) API and a REST interface. In practice, this makes it > very difficult for wiring up existing clients and services. For e.g, dynamic > configuration of dependent endpoints of a service is not easy to implement > using the present registry-read mechanisms, *without* code-changes to > existing services. > A good solution to this is to expose the registry information through a more > generic and widely used discovery mechanism: DNS. Service Discovery via DNS > uses the well-known DNS interfaces to browse the network for services. > YARN-913 in fact talked about such a DNS based mechanism but left it as a > future task. (Task) Having the registry information exposed via DNS > simplifies the life of services. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
Vinod Kumar Vavilapalli created YARN-4757: - Summary: [Umbrella] Simplified discovery of services via DNS mechanisms Key: YARN-4757 URL: https://issues.apache.org/jira/browse/YARN-4757 Project: Hadoop YARN Issue Type: New Feature Reporter: Vinod Kumar Vavilapalli [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track all related efforts.] In addition to completing the present story of service-registry (YARN-913), we also need to simplify the access to the registry entries. The existing read mechanisms of the YARN Service Registry are currently limited to a registry specific (java) API and a REST interface. In practice, this makes it very difficult for wiring up existing clients and services. For e.g, dynamic configuration of dependent endpoints of a service is not easy to implement using the present registry-read mechanisms, *without* code-changes to existing services. A good solution to this is to expose the registry information through a more generic and widely used discovery mechanism: DNS. Service Discovery via DNS uses the well-known DNS interfaces to browse the network for services. YARN-913 in fact talked about such a DNS based mechanism but left it as a future task. (Task) Having the registry information exposed via DNS simplifies the life of services. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
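To see what DNS exposure buys clients: once registry entries are published as DNS records, any stock resolver can discover endpoints with an SRV query, with no registry-specific API. A hedged sketch using the JDK's JNDI DNS provider; the record name shown is hypothetical, since the actual YARN registry record layout is precisely what this JIRA is to define.
{code}
import java.util.Hashtable;
import javax.naming.directory.Attribute;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

/** SRV lookup via the JDK's JNDI DNS provider; record name is hypothetical. */
public class DnsDiscoverySketch {
  public static void main(String[] args) throws Exception {
    Hashtable<String, String> env = new Hashtable<>();
    env.put("java.naming.factory.initial", "com.sun.jndi.dns.DnsContextFactory");
    DirContext dns = new InitialDirContext(env);

    // Hypothetical record for a YARN-registered service endpoint.
    Attribute srv = dns.getAttributes(
        "_myservice._tcp.user1.yarnregistry.example.com", new String[] {"SRV"})
        .get("SRV");
    if (srv == null) {
      System.out.println("no SRV records found");
      return;
    }
    for (int i = 0; i < srv.size(); i++) {
      // SRV rdata format: priority weight port target
      System.out.println("endpoint: " + srv.get(i));
    }
  }
}
{code}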
[jira] [Commented] (YARN-4756) Unnecessary wait in Node Status Updater during reboot
[ https://issues.apache.org/jira/browse/YARN-4756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178614#comment-15178614 ] Hadoop QA commented on YARN-4756: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 49s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 22s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 16s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 28s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 52s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 17s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 24s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 19s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 19s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 23s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 23s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 14s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 10s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 2s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 15s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 3s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 35s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 19s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 33m 40s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12791244/YARN-4756.002.patch | | JIRA Issue | YARN-4756 | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux bb8f2e3e7d26 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 0a9f00a | | Default Java | 1.7.0_95 | | Multi-JDK versions |
[jira] [Commented] (YARN-4686) MiniYARNCluster.start() returns before cluster is completely started
[ https://issues.apache.org/jira/browse/YARN-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178599#comment-15178599 ] Eric Badger commented on YARN-4686: --- As per above comment, these test failures are not related to the patch. All relevant test failures have been addressed. [~jlowe] [~kasha] Please review the patch when you get a chance. > MiniYARNCluster.start() returns before cluster is completely started > > > Key: YARN-4686 > URL: https://issues.apache.org/jira/browse/YARN-4686 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Rohith Sharma K S >Assignee: Eric Badger > Attachments: MAPREDUCE-6507.001.patch, YARN-4686.001.patch, > YARN-4686.002.patch, YARN-4686.003.patch > > > TestRMNMInfo fails intermittently. Below is trace for the failure > {noformat} > testRMNMInfo(org.apache.hadoop.mapreduce.v2.TestRMNMInfo) Time elapsed: 0.28 > sec <<< FAILURE! > java.lang.AssertionError: Unexpected number of live nodes: expected:<4> but > was:<3> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at > org.apache.hadoop.mapreduce.v2.TestRMNMInfo.testRMNMInfo(TestRMNMInfo.java:111) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
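For reference, the shape of the fix being reviewed is a bounded poll after start() until the RM reports the expected number of live nodes. A minimal sketch in which the IntSupplier stands in for querying the RM; it is not the actual MiniYARNCluster API.
{code}
import java.util.function.IntSupplier;

/** Bounded poll until the expected number of NMs have registered. */
public class WaitForClusterSketch {
  public static void awaitLiveNodes(IntSupplier liveNodeCount, int expected,
      long timeoutMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (liveNodeCount.getAsInt() < expected) {
      if (System.currentTimeMillis() > deadline) {
        throw new IllegalStateException("expected " + expected + " live NMs");
      }
      Thread.sleep(100); // re-poll the RM's live-node count
    }
  }
}
{code}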
[jira] [Commented] (YARN-4686) MiniYARNCluster.start() returns before cluster is completely started
[ https://issues.apache.org/jira/browse/YARN-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178584#comment-15178584 ] Hadoop QA commented on YARN-4686: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 9s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 4 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 10s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 40s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 44s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 3s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 33s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 11s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 38s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 49s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 43s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 54s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 9s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 58s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 42s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 42s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 2s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 2s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 29s {color} | {color:red} hadoop-yarn-project/hadoop-yarn: patch generated 1 new + 30 unchanged - 0 fixed = 31 total (was 30) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 6s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 34s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 19s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 40s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 50s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 5s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_74. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 6m 20s {color} | {color:red} hadoop-yarn-server-tests in the patch failed with JDK v1.8.0_74. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 63m 43s {color} | {color:red} hadoop-yarn-client in the patch failed with JDK v1.8.0_74. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 29s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_95. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 6m 31s {color} | {color:red} hadoop-yarn-server-tests in the patch failed with JDK v1.7.0_95. {color} | | {color:red}-1{color} | {color:red} unit {color} |
[jira] [Updated] (YARN-4756) Unnecessary wait in Node Status Updater during reboot
[ https://issues.apache.org/jira/browse/YARN-4756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-4756: -- Attachment: YARN-4756.002.patch Fixing patch so that it applies to trunk instead of being dependent on the [YARN-4686] patch. > Unnecessary wait in Node Status Updater during reboot > - > > Key: YARN-4756 > URL: https://issues.apache.org/jira/browse/YARN-4756 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Eric Badger >Assignee: Eric Badger > Attachments: YARN-4756.001.patch, YARN-4756.002.patch > > > The startStatusUpdater thread waits for the isStopped variable to be set to > true, but it is waiting for the next heartbeat. During a reboot, the next > heartbeat will not come and so the thread waits for a timeout. Instead, we > should notify the thread to continue so that it can check the isStopped > variable and exit without having to wait for a timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
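For readers following the description above, here is a minimal, self-contained sketch of the wait/notify pattern it calls for. The names (heartbeatMonitor, statusUpdaterLoop, stop, the 1-second interval) are illustrative assumptions, not the actual NodeStatusUpdaterImpl fields; the point is only that stop() wakes the waiting thread so it can re-check isStopped immediately instead of sleeping out the full heartbeat interval.
{code}
public class StatusUpdaterSketch {
  // Illustrative names; not the real NodeStatusUpdaterImpl fields.
  private final Object heartbeatMonitor = new Object();
  private volatile boolean isStopped = false;

  // Runs in the status updater thread.
  void statusUpdaterLoop() throws InterruptedException {
    while (!isStopped) {
      // ... send heartbeat to the RM here ...
      synchronized (heartbeatMonitor) {
        if (!isStopped) {
          // Wait at most one heartbeat interval, but wake early if notified.
          heartbeatMonitor.wait(1000);
        }
      }
    }
  }

  // Called from the thread that stops/reboots the NodeManager.
  void stop() {
    isStopped = true;
    synchronized (heartbeatMonitor) {
      // Wake the updater so it exits without waiting for the timeout.
      heartbeatMonitor.notifyAll();
    }
  }
}
{code}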
[jira] [Commented] (YARN-4737) Use CSRF Filter in YARN
[ https://issues.apache.org/jira/browse/YARN-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178467#comment-15178467 ] Hadoop QA commented on YARN-4737: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 18s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 5s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 6s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 5m 52s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 47s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 10s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 59s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 52s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 7m 22s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 3s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 5m 53s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 15s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 12s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 4s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 6m 4s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 0s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 7m 0s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 1m 16s {color} | {color:red} root: patch generated 5 new + 431 unchanged - 9 fixed = 436 total (was 440) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 4m 1s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 57s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s {color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 9m 14s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 6s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 5m 53s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 22s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 1s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 16s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 41s {color} | {color:green} hadoop-yarn-server-applicationhistoryservice in the patch passed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 71m 13s {color} |
[jira] [Commented] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list
[ https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178448#comment-15178448 ] Kuhu Shukla commented on YARN-4311: --- Requesting [~jlowe], [~templedf] for review/comments. Thanks a lot! > Removing nodes from include and exclude lists will not remove them from > decommissioned nodes list > - > > Key: YARN-4311 > URL: https://issues.apache.org/jira/browse/YARN-4311 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.1 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: YARN-4311-v1.patch, YARN-4311-v10.patch, > YARN-4311-v11.patch, YARN-4311-v11.patch, YARN-4311-v2.patch, > YARN-4311-v3.patch, YARN-4311-v4.patch, YARN-4311-v5.patch, > YARN-4311-v6.patch, YARN-4311-v7.patch, YARN-4311-v8.patch, YARN-4311-v9.patch > > > In order to fully forget about a node, removing the node from include and > exclude list is not sufficient. The RM lists it under Decomm-ed nodes. The > tricky part that [~jlowe] pointed out was the case when include lists are not > used, in that case we don't want the nodes to fall off if they are not active. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list
[ https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178426#comment-15178426 ] Hadoop QA commented on YARN-4311: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 14s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 3 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 15s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 40s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 5m 46s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 40s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 10s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 55s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 54s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 57s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 37s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 9s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 14s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 38s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 5m 53s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 5m 53s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 46s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 6m 46s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 11s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 54s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 53s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s {color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 50s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 42s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 9s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 23s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_74. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 54s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_74. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 66m 42s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_74. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 51s {color} | {color:green} hadoop-sls in the patch passed with JDK v1.8.0_74. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 23s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_95. {color} |
[jira] [Commented] (YARN-4083) Add a discovery mechanism for the scheduler address
[ https://issues.apache.org/jira/browse/YARN-4083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178423#comment-15178423 ] Vinod Kumar Vavilapalli commented on YARN-4083: --- bq. Today many apps like Distributed Shell, REEF, etc rely on the fact that the HADOOP_CONF_DIR of the NM is on the classpath to discover the scheduler address. Please see my comment [here|https://issues.apache.org/jira/browse/YARN-4650?focusedCommentId=15176322=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15176322] on YARN-4650. After rolling-upgrades (YARN-666), for correctness' sake, we require all apps to *not* depend on server-side config files (which may change during upgrades). bq. I feel an environment variable should be accessible by linux, windows and other containers. I think [~aw] is making the same comment I made in the design doc (in section 3.2.5) for YARN-4692. Pasting that comment below: {quote} All of our platform-to-application communication currently is only through process environment variables: e.g. ApplicationConstants.NM_HOST. With things like Linux CGroups, containerization through docker etc, it is now possible to launch multi-process containers where the solution of environmental variables breaks down. (Task) We need better ways of propagating important information down to the containers: information like container’s resource size, local-dirs and log-dirs available for writing etc. {quote} > Add a discovery mechanism for the scheduler address > --- > > Key: YARN-4083 > URL: https://issues.apache.org/jira/browse/YARN-4083 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager >Reporter: Subru Krishnan >Assignee: Subru Krishnan > > Today many apps like Distributed Shell, REEF, etc rely on the fact that the > HADOOP_CONF_DIR of the NM is on the classpath to discover the scheduler > address. This JIRA proposes the addition of an explicit discovery mechanism > for the scheduler address -- This message was sent by Atlassian JIRA (v6.3.4#6332)
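As a small illustration of the environment-variable channel mentioned in the quoted comment, the snippet below reads the NM host that the NodeManager injects into a container's environment. This is a hedged sketch: it assumes only the standard ApplicationConstants.Environment.NM_HOST key and is not tied to any particular discovery proposal in this JIRA.
{code}
import org.apache.hadoop.yarn.api.ApplicationConstants;

public class ContainerEnvLookup {
  public static void main(String[] args) {
    // The NM sets NM_HOST (among other keys) in every container's environment.
    String nmHost = System.getenv(ApplicationConstants.Environment.NM_HOST.name());
    System.out.println("NodeManager host seen from inside the container: " + nmHost);
  }
}
{code}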
[jira] [Commented] (YARN-4756) Unnecessary wait in Node Status Updater during reboot
[ https://issues.apache.org/jira/browse/YARN-4756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178420#comment-15178420 ] Hadoop QA commented on YARN-4756: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 4s {color} | {color:red} YARN-4756 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} | \\ \\ || Subsystem || Report/Notes || | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12791229/YARN-4756.001.patch | | JIRA Issue | YARN-4756 | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/10700/console | | Powered by | Apache Yetus 0.3.0-SNAPSHOT http://yetus.apache.org | This message was automatically generated. > Unnecessary wait in Node Status Updater during reboot > - > > Key: YARN-4756 > URL: https://issues.apache.org/jira/browse/YARN-4756 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Eric Badger >Assignee: Eric Badger > Attachments: YARN-4756.001.patch > > > The startStatusUpdater thread waits for the isStopped variable to be set to > true, but it is waiting for the next heartbeat. During a reboot, the next > heartbeat will not come and so the thread waits for a timeout. Instead, we > should notify the thread to continue so that it can check the isStopped > variable and exit without having to wait for a timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4686) MiniYARNCluster.start() returns before cluster is completely started
[ https://issues.apache.org/jira/browse/YARN-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178409#comment-15178409 ] Eric Badger commented on YARN-4686: --- JIRA [YARN-4756] has been opened regarding the heartbeatMonitor optimization in the startStatusUpdater thread. > MiniYARNCluster.start() returns before cluster is completely started > > > Key: YARN-4686 > URL: https://issues.apache.org/jira/browse/YARN-4686 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Rohith Sharma K S >Assignee: Eric Badger > Attachments: MAPREDUCE-6507.001.patch, YARN-4686.001.patch, > YARN-4686.002.patch, YARN-4686.003.patch > > > TestRMNMInfo fails intermittently. Below is trace for the failure > {noformat} > testRMNMInfo(org.apache.hadoop.mapreduce.v2.TestRMNMInfo) Time elapsed: 0.28 > sec <<< FAILURE! > java.lang.AssertionError: Unexpected number of live nodes: expected:<4> but > was:<3> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at > org.apache.hadoop.mapreduce.v2.TestRMNMInfo.testRMNMInfo(TestRMNMInfo.java:111) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4756) Unnecessary wait in Node Status Updater during reboot
[ https://issues.apache.org/jira/browse/YARN-4756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-4756: -- Description: The startStatusUpdater thread waits for the isStopped variable to be set to true, but it is waiting for the next heartbeat. During a reboot, the next heartbeat will not come and so the thread waits for a timeout. Instead, we should notify the thread to continue so that it can check the isStopped variable and exit without having to wait for a timeout. (was: The Node Status Updater thread waits for the isStopped variable to be set to true, but it is waiting for the next heartbeat. During a reboot, the next heartbeat will not come and so the thread waits for a timeout. Instead, we should notify the thread to continue so that it can check the isStopped variable and exit without having to wait for a timeout. ) > Unnecessary wait in Node Status Updater during reboot > - > > Key: YARN-4756 > URL: https://issues.apache.org/jira/browse/YARN-4756 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Eric Badger >Assignee: Eric Badger > Attachments: YARN-4756.001.patch > > > The startStatusUpdater thread waits for the isStopped variable to be set to > true, but it is waiting for the next heartbeat. During a reboot, the next > heartbeat will not come and so the thread waits for a timeout. Instead, we > should notify the thread to continue so that it can check the isStopped > variable and exit without having to wait for a timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4756) Unnecessary wait in Node Status Updater during reboot
[ https://issues.apache.org/jira/browse/YARN-4756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-4756: -- Attachment: YARN-4756.001.patch The optimization to notify the Node Status Updater thread to stop waiting for a heartbeat exposes a race condition in the test TestNodeManagerResync#testContainerResourceIncreaseIsSynchronizedWithRMResync. The test checks the current resources of the NM and then checks them again after a different thread has changed them. However, there is no synchronization between these threads, and the test was only passing because of the excessive wait time during the reboot. The patch adds a barrier to synchronize the two threads. > Unnecessary wait in Node Status Updater during reboot > - > > Key: YARN-4756 > URL: https://issues.apache.org/jira/browse/YARN-4756 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Eric Badger >Assignee: Eric Badger > Attachments: YARN-4756.001.patch > > > The Node Status Updater thread waits for the isStopped variable to be set to > true, but it is waiting for the next heartbeat. During a reboot, the next > heartbeat will not come and so the thread waits for a timeout. Instead, we > should notify the thread to continue so that it can check the isStopped > variable and exit without having to wait for a timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
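To illustrate the barrier approach described above, here is a hedged, stand-alone sketch (not the actual TestNodeManagerResync code; field names are hypothetical): both threads arrive at a CyclicBarrier, so the asserting thread only reads the resource value after the updating thread has changed it.
{code}
import java.util.concurrent.CyclicBarrier;

public class BarrierSketch {
  private final CyclicBarrier barrier = new CyclicBarrier(2);
  private volatile long containerMemoryMB = 1024; // hypothetical resource value

  // Plays the role of the thread that changes the resources during resync.
  void updaterThread() throws Exception {
    containerMemoryMB = 2048;
    barrier.await(); // signal that the update is now visible
  }

  // Plays the role of the test thread doing the second check.
  void assertingThread() throws Exception {
    barrier.await(); // wait until the updater has run
    if (containerMemoryMB != 2048) {
      throw new AssertionError("resource update not observed");
    }
  }
}
{code}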
[jira] [Created] (YARN-4756) Unnecessary wait in Node Status Updater during reboot
Eric Badger created YARN-4756: - Summary: Unnecessary wait in Node Status Updater during reboot Key: YARN-4756 URL: https://issues.apache.org/jira/browse/YARN-4756 Project: Hadoop YARN Issue Type: Improvement Reporter: Eric Badger Assignee: Eric Badger The Node Status Updater thread waits for the isStopped variable to be set to true, but it is waiting for the next heartbeat. During a reboot, the next heartbeat will not come and so the thread waits for a timeout. Instead, we should notify the thread to continue so that it can check the isStopped variable and exit without having to wait for a timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2888) Corrective mechanisms for rebalancing NM container queues
[ https://issues.apache.org/jira/browse/YARN-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun Suresh updated YARN-2888: -- Attachment: YARN-2888-yarn-2877.001.patch Uploading initial patch to solicit feedback > Corrective mechanisms for rebalancing NM container queues > - > > Key: YARN-2888 > URL: https://issues.apache.org/jira/browse/YARN-2888 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Konstantinos Karanasos >Assignee: Arun Suresh > Attachments: YARN-2888-yarn-2877.001.patch > > > Bad queuing decisions by the LocalRMs (e.g., due to the distributed nature of > the scheduling decisions or due to having a stale image of the system) may > lead to an imbalance in the waiting times of the NM container queues. This > can in turn have an impact in job execution times and cluster utilization. > To this end, we introduce corrective mechanisms that may remove (whenever > needed) container requests from overloaded queues, adding them to less-loaded > ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
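As a toy illustration of the corrective mechanism sketched in the description (purely hypothetical; not the YARN-2888 patch), the snippet below repeatedly moves a queued request from the most loaded node queue to the least loaded one until their lengths are close enough.
{code}
import java.util.Deque;
import java.util.Map;

public class QueueRebalanceSketch {
  // Moves queued requests from the most loaded node queue to the least loaded
  // one until their lengths differ by at most the threshold (at least 1, to
  // guarantee termination).
  static void rebalance(Map<String, Deque<String>> nodeQueues, int threshold) {
    while (true) {
      Deque<String> longest = null;
      Deque<String> shortest = null;
      for (Deque<String> q : nodeQueues.values()) {
        if (longest == null || q.size() > longest.size()) {
          longest = q;
        }
        if (shortest == null || q.size() < shortest.size()) {
          shortest = q;
        }
      }
      if (longest == null
          || longest.size() - shortest.size() <= Math.max(threshold, 1)) {
        return; // nothing to do, or queues are balanced enough
      }
      // Move one waiting request from the overloaded queue to the idle one.
      shortest.addLast(longest.pollLast());
    }
  }
}
{code}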
[jira] [Updated] (YARN-4686) MiniYARNCluster.start() returns before cluster is completely started
[ https://issues.apache.org/jira/browse/YARN-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-4686: -- Attachment: YARN-4686.003.patch Fixing a deadlock issue related to the Node Status Updater thread and the Reboot thread. Also taking out heartbeatMonitor notify optimization. I will file a separate JIRA for this issue as well as the test whose race condition it exposes. > MiniYARNCluster.start() returns before cluster is completely started > > > Key: YARN-4686 > URL: https://issues.apache.org/jira/browse/YARN-4686 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Rohith Sharma K S >Assignee: Eric Badger > Attachments: MAPREDUCE-6507.001.patch, YARN-4686.001.patch, > YARN-4686.002.patch, YARN-4686.003.patch > > > TestRMNMInfo fails intermittently. Below is trace for the failure > {noformat} > testRMNMInfo(org.apache.hadoop.mapreduce.v2.TestRMNMInfo) Time elapsed: 0.28 > sec <<< FAILURE! > java.lang.AssertionError: Unexpected number of live nodes: expected:<4> but > was:<3> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at > org.apache.hadoop.mapreduce.v2.TestRMNMInfo.testRMNMInfo(TestRMNMInfo.java:111) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE
[ https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178179#comment-15178179 ] Sidharta Seethana commented on YARN-4744: - Thanks, [~jlowe]. I would prefer to leave the log-then-throw in place right now (or at least keep it outside the scope of this JIRA ). About the patch : I didn't modify DockerLinuxContainerRuntime because the signaling there needs additional work - needs to be reimplemented in terms of docker operations. Not sure if I filed a JIRA for that yet, I'll check. I'll fix the PrivilegedOperation constructor. > Too many signal to container failure in case of LCE > --- > > Key: YARN-4744 > URL: https://issues.apache.org/jira/browse/YARN-4744 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Bibin A Chundatt >Assignee: Sidharta Seethana > Attachments: YARN-4744.001.patch > > > Install HA cluster in secure mode > Enable LCE with cgroups > Start server with dsperf user > Submit mapreduce application terasort/teragen with user yarn/dsperf > Too many signal to container failure > Submit with user the exception is thrown > {noformat} > 2014-03-02 09:20:38,689 INFO > SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: > Authorization successful for testing (auth:TOKEN) for protocol=interface > org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB > 2014-03-02 09:20:40,158 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Event EventType: KILL_CONTAINER sent to absent container > container_e02_1393731146548_0001_01_13 > 2014-03-02 09:20:43,071 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Container container_e02_1393731146548_0001_01_09 succeeded > 2014-03-02 09:20:43,072 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_e02_1393731146548_0001_01_09 transitioned from > RUNNING to EXITED_WITH_SUCCESS > 2014-03-02 09:20:43,073 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Cleaning up container container_e02_1393731146548_0001_01_09 > 2014-03-02 09:20:43,075 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime: > Using container runtime: DefaultLinuxContainerRuntime > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 9. Privileged Execution Operation Output: > main : command provided 2 > main : run as user is yarn > main : requested yarn user is yarn > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > yarn, yarn, 2, 9370, 15] > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime: > Signal container failed. 
Exception: > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=9: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:520) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:745) > Caused by: ExitCodeException exitCode=9: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:927) > at org.apache.hadoop.util.Shell.run(Shell.java:838) > at >
[jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928
[ https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178165#comment-15178165 ] Varun Saxena commented on YARN-4712: Yeah even I think we should make minimal changes here i.e. only those required specifically for YARN-2928. Because this is something which has to be done primarily in trunk. I will let [~djp] comment on it as he was reviewing YARN-4308 as well. > CPU Usage Metric is not captured properly in YARN-2928 > -- > > Key: YARN-4712 > URL: https://issues.apache.org/jira/browse/YARN-4712 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > Attachments: YARN-4712-YARN-2928.v1.001.patch, > YARN-4712-YARN-2928.v1.002.patch, YARN-4712-YARN-2928.v1.003.patch > > > There are 2 issues with CPU usage collection > * I was able to observe that that many times CPU usage got from > {{pTree.getCpuUsagePercent()}} is > ResourceCalculatorProcessTree.UNAVAILABLE(i.e. -1) but ContainersMonitor do > the calculation i.e. {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore > /resourceCalculatorPlugin.getNumProcessors()}} because of which UNAVAILABLE > check in {{NMTimelinePublisher.reportContainerResourceUsage}} is not > encountered. so proper checks needs to be handled > * {{EntityColumnPrefix.METRIC}} uses always LongConverter but > ContainerMonitor is publishing decimal values for the CPU usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
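For context on the first bullet in the quoted description, here is a hedged sketch of the kind of guard being discussed (not the actual ContainersMonitorImpl or NMTimelinePublisher code): skip the per-core-to-total conversion when the process tree reports UNAVAILABLE, so the -1 sentinel survives the division and the publisher's UNAVAILABLE check still fires.
{code}
public class CpuUsageGuardSketch {
  // Mirrors ResourceCalculatorProcessTree.UNAVAILABLE (-1); redeclared here
  // only to keep the sketch self-contained.
  static final int UNAVAILABLE = -1;

  static float totalCoresPercentage(float cpuUsagePercentPerCore, int numProcessors) {
    if (cpuUsagePercentPerCore == UNAVAILABLE || numProcessors <= 0) {
      // Propagate the sentinel instead of dividing it, so downstream
      // UNAVAILABLE checks are still triggered.
      return UNAVAILABLE;
    }
    return cpuUsagePercentPerCore / numProcessors;
  }
}
{code}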
[jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928
[ https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178143#comment-15178143 ] Sunil G commented on YARN-4712: --- Hi [~Naganarasimha Garla] and [~varun_saxena], I think the changes in {{ContainersMonitorImpl}} need to be in trunk as well. A patch was given in YARN-4308, but there was some discussion on YARN-3304 and most people agreed on -1 there. Hence YARN-4308 is in a limbo state, and I think for the first reading we can send 0 rather than -1. So the patch there can go to trunk, I think. In that case, the next trunk sync will fetch that change. Thoughts? > CPU Usage Metric is not captured properly in YARN-2928 > -- > > Key: YARN-4712 > URL: https://issues.apache.org/jira/browse/YARN-4712 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > Attachments: YARN-4712-YARN-2928.v1.001.patch, > YARN-4712-YARN-2928.v1.002.patch, YARN-4712-YARN-2928.v1.003.patch > > > There are 2 issues with CPU usage collection > * I was able to observe that that many times CPU usage got from > {{pTree.getCpuUsagePercent()}} is > ResourceCalculatorProcessTree.UNAVAILABLE(i.e. -1) but ContainersMonitor do > the calculation i.e. {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore > /resourceCalculatorPlugin.getNumProcessors()}} because of which UNAVAILABLE > check in {{NMTimelinePublisher.reportContainerResourceUsage}} is not > encountered. so proper checks needs to be handled > * {{EntityColumnPrefix.METRIC}} uses always LongConverter but > ContainerMonitor is publishing decimal values for the CPU usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4749) Generalize config file handling in container-executor
[ https://issues.apache.org/jira/browse/YARN-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178107#comment-15178107 ] Sidharta Seethana commented on YARN-4749: - Thanks, [~vvasudev]. These issues existed before my changes and I missed fixing them. New patch coming up. > Generalize config file handling in container-executor > - > > Key: YARN-4749 > URL: https://issues.apache.org/jira/browse/YARN-4749 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Sidharta Seethana >Assignee: Sidharta Seethana > Attachments: YARN-4749.001.patch > > > The current implementation of container-executor already supports parsing of > key value pairs from a config file. However, it is currently restricted to > {{container-executor.cfg}} and cannot be reused for parsing additional > config/command files. Generalizing this is a required step for YARN-4245. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928
[ https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178065#comment-15178065 ] Sunil G commented on YARN-4712: --- [~varun_saxena] I was also planning to handle this problem from its root cause, which is on the NM reporting side, as YARN-4308. Ideally, to me, it's not good to send UNAVAILABLE, because we send it for every first reading. In some error cases, when there is no reading at all, I think we may need this error code as well. But I would like to at least fix sending -1 in the first response. > CPU Usage Metric is not captured properly in YARN-2928 > -- > > Key: YARN-4712 > URL: https://issues.apache.org/jira/browse/YARN-4712 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > Attachments: YARN-4712-YARN-2928.v1.001.patch, > YARN-4712-YARN-2928.v1.002.patch, YARN-4712-YARN-2928.v1.003.patch > > > There are 2 issues with CPU usage collection > * I was able to observe that that many times CPU usage got from > {{pTree.getCpuUsagePercent()}} is > ResourceCalculatorProcessTree.UNAVAILABLE(i.e. -1) but ContainersMonitor do > the calculation i.e. {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore > /resourceCalculatorPlugin.getNumProcessors()}} because of which UNAVAILABLE > check in {{NMTimelinePublisher.reportContainerResourceUsage}} is not > encountered. so proper checks needs to be handled > * {{EntityColumnPrefix.METRIC}} uses always LongConverter but > ContainerMonitor is publishing decimal values for the CPU usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4749) Generalize config file handling in container-executor
[ https://issues.apache.org/jira/browse/YARN-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178053#comment-15178053 ] Varun Vasudev commented on YARN-4749: - Thanks for the patch [~sidharta-s]. It looks mostly good. One formatting/indentation fix -
{code}
+if(cfg->confdetails[cfg->size] )
+cfg->size++;
{code}
Please fix the formatting of the if condition, and fix the indentation of the increment statement. I would prefer it if you added braces, but that's my personal choice. > Generalize config file handling in container-executor > - > > Key: YARN-4749 > URL: https://issues.apache.org/jira/browse/YARN-4749 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Sidharta Seethana >Assignee: Sidharta Seethana > Attachments: YARN-4749.001.patch > > > The current implementation of container-executor already supports parsing of > key value pairs from a config file. However, it is currently restricted to > {{container-executor.cfg}} and cannot be reused for parsing additional > config/command files. Generalizing this is a required step for YARN-4245. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928
[ https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178055#comment-15178055 ] Varun Saxena commented on YARN-4712: [~Naganarasimha], the UNAVAILABLE-related issue exists in trunk. Shouldn't we fix all issues related to UNAVAILABLE as part of a JIRA in trunk? YARN-4308 was raised specifically for this purpose, if I am not wrong. I was seeing this JIRA more as handling the issue of sending a float as the CPU metric, and just bringing in the YARN-4308 fix in the interim. However, I am open to fixing it here. But this can duplicate effort and cause a clash if the JIRA in trunk goes in before this one. Thoughts, [~djp]? > CPU Usage Metric is not captured properly in YARN-2928 > -- > > Key: YARN-4712 > URL: https://issues.apache.org/jira/browse/YARN-4712 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > Attachments: YARN-4712-YARN-2928.v1.001.patch, > YARN-4712-YARN-2928.v1.002.patch, YARN-4712-YARN-2928.v1.003.patch > > > There are 2 issues with CPU usage collection > * I was able to observe that that many times CPU usage got from > {{pTree.getCpuUsagePercent()}} is > ResourceCalculatorProcessTree.UNAVAILABLE(i.e. -1) but ContainersMonitor do > the calculation i.e. {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore > /resourceCalculatorPlugin.getNumProcessors()}} because of which UNAVAILABLE > check in {{NMTimelinePublisher.reportContainerResourceUsage}} is not > encountered. so proper checks needs to be handled > * {{EntityColumnPrefix.METRIC}} uses always LongConverter but > ContainerMonitor is publishing decimal values for the CPU usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list
[ https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-4311: -- Attachment: YARN-4311-v11.patch Re-attaching the v11 patch with no changes to trigger another pre-commit, since the TestResourceTrackerService failures are not reproducible locally and, from investigation, seem related to the sleep-based wait. Need to see whether this failure is consistent. Also checked that the patch applies cleanly to trunk. > Removing nodes from include and exclude lists will not remove them from > decommissioned nodes list > - > > Key: YARN-4311 > URL: https://issues.apache.org/jira/browse/YARN-4311 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.1 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: YARN-4311-v1.patch, YARN-4311-v10.patch, > YARN-4311-v11.patch, YARN-4311-v11.patch, YARN-4311-v2.patch, > YARN-4311-v3.patch, YARN-4311-v4.patch, YARN-4311-v5.patch, > YARN-4311-v6.patch, YARN-4311-v7.patch, YARN-4311-v8.patch, YARN-4311-v9.patch > > > In order to fully forget about a node, removing the node from include and > exclude list is not sufficient. The RM lists it under Decomm-ed nodes. The > tricky part that [~jlowe] pointed out was the case when include lists are not > used, in that case we don't want the nodes to fall off if they are not active. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928
[ https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177970#comment-15177970 ] Hadoop QA commented on YARN-4712: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 19s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 11m 4s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s {color} | {color:green} YARN-2928 passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 33s {color} | {color:green} YARN-2928 passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 20s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 16s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 14s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s {color} | {color:green} YARN-2928 passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 29s {color} | {color:green} YARN-2928 passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 30s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 29s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 29s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 15s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: patch generated 1 new + 20 unchanged - 4 fixed = 21 total (was 24) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 12s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 22s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 9m 16s {color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 9m 20s {color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 19s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 40m 34s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12791179/YARN-4712-YARN-2928.v1.003.patch | | JIRA Issue | YARN-4712 | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux 60f598883cac 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality |
[jira] [Updated] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928
[ https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-4712: Attachment: YARN-4712-YARN-2928.v1.003.patch Hi [~djp] & [~varun_saxena], please find the latest patch addressing the comments; additionally, I have tried to take care of all other places where -1 *(ResourceCalculatorProcessTree.UNAVAILABLE)* can be used in calculations. Please review. > CPU Usage Metric is not captured properly in YARN-2928 > -- > > Key: YARN-4712 > URL: https://issues.apache.org/jira/browse/YARN-4712 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > Attachments: YARN-4712-YARN-2928.v1.001.patch, > YARN-4712-YARN-2928.v1.002.patch, YARN-4712-YARN-2928.v1.003.patch > > > There are 2 issues with CPU usage collection > * I was able to observe that that many times CPU usage got from > {{pTree.getCpuUsagePercent()}} is > ResourceCalculatorProcessTree.UNAVAILABLE(i.e. -1) but ContainersMonitor do > the calculation i.e. {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore > /resourceCalculatorPlugin.getNumProcessors()}} because of which UNAVAILABLE > check in {{NMTimelinePublisher.reportContainerResourceUsage}} is not > encountered. so proper checks needs to be handled > * {{EntityColumnPrefix.METRIC}} uses always LongConverter but > ContainerMonitor is publishing decimal values for the CPU usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4737) Use CSRF Filter in YARN
[ https://issues.apache.org/jira/browse/YARN-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Maron updated YARN-4737: - Attachment: YARN-4737.003.patch > Use CSRF Filter in YARN > --- > > Key: YARN-4737 > URL: https://issues.apache.org/jira/browse/YARN-4737 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, resourcemanager, webapp >Reporter: Jonathan Maron >Assignee: Jonathan Maron > Attachments: YARN-4737.001.patch, YARN-4737.002.patch, > YARN-4737.003.patch > > > A CSRF filter was added to hadoop common > (https://issues.apache.org/jira/browse/HADOOP-12691). The aim of this JIRA > is to come up with a mechanism to integrate this filter into the webapps for > which it is applicable (web apps that may establish an authenticated > identity). That includes the RM, NM, and mapreduce jobhistory web app. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE
[ https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177872#comment-15177872 ] Jason Lowe commented on YARN-4744: -- Thanks for the patch! bq. In addition, logging in PrivilegedOperationExecutor includes information that isn't necessarily available when the exception is propagated. That problem is solved by having the throwing code either encode that information in the exception message or adding necessary fields to the exception class, allowing the error handler to retrieve them as needed. If the throwing code can create an appropriate log message then it can put that same information in the exception. There's already a custom exception for these errors, so it would be easy to add things like full command line, etc. I still think the code handling the error is the real problem if we're missing appropriate logs, but I don't feel so strongly to block it if others prefer leaving the log-then-throw logic in place. Comments on the patch: Don't we need to update DockerLinuxContainerRuntime in a similar manner? I think we'll have the same issue there. PrivilegedOperation should have a constructor that just takes an opType parameter and the other constructors should be implemented in terms of it. That eliminates the duplicate code maintenance pitfalls and avoids doing odd things like passing nulls as standard practice. > Too many signal to container failure in case of LCE > --- > > Key: YARN-4744 > URL: https://issues.apache.org/jira/browse/YARN-4744 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Bibin A Chundatt >Assignee: Sidharta Seethana > Attachments: YARN-4744.001.patch > > > Install HA cluster in secure mode > Enable LCE with cgroups > Start server with dsperf user > Submit mapreduce application terasort/teragen with user yarn/dsperf > Too many signal to container failure > Submit with user the exception is thrown > {noformat} > 2014-03-02 09:20:38,689 INFO > SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: > Authorization successful for testing (auth:TOKEN) for protocol=interface > org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB > 2014-03-02 09:20:40,158 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Event EventType: KILL_CONTAINER sent to absent container > container_e02_1393731146548_0001_01_13 > 2014-03-02 09:20:43,071 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Container container_e02_1393731146548_0001_01_09 succeeded > 2014-03-02 09:20:43,072 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_e02_1393731146548_0001_01_09 transitioned from > RUNNING to EXITED_WITH_SUCCESS > 2014-03-02 09:20:43,073 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Cleaning up container container_e02_1393731146548_0001_01_09 > 2014-03-02 09:20:43,075 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime: > Using container runtime: DefaultLinuxContainerRuntime > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 9. 
Privileged Execution Operation Output: > main : command provided 2 > main : run as user is yarn > main : requested yarn user is yarn > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > yarn, yarn, 2, 9370, 15] > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime: > Signal container failed. Exception: > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=9: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513) > at >
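To make the constructor suggestion in the review comment above concrete, here is a minimal sketch of the chaining pattern (illustrative class and field names, not the real PrivilegedOperation): the opType-only constructor is the canonical one and the richer constructor delegates to it, so callers no longer pass null arguments as standard practice.
{code}
import java.util.ArrayList;
import java.util.List;

public class OperationSketch {
  enum OperationType { SIGNAL_CONTAINER, LAUNCH_CONTAINER }

  private final OperationType opType;
  private final List<String> args = new ArrayList<>();

  // Canonical constructor: everything else delegates here.
  OperationSketch(OperationType opType) {
    this.opType = opType;
  }

  // Convenience constructor implemented in terms of the canonical one,
  // instead of duplicating initialization or passing null.
  OperationSketch(OperationType opType, String arg) {
    this(opType);
    if (arg != null) {
      args.add(arg);
    }
  }

  OperationType getOperationType() {
    return opType;
  }
}
{code}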
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177826#comment-15177826 ] Varun Saxena commented on YARN-4700: Thanks [~Naganarasimha] for the latest patch. +1, looks good to me. I will wait for a while so that others can add any comments they may have. Will commit it later today. > ATS storage has one extra record each time the RM got restarted > --- > > Key: YARN-4700 > URL: https://issues.apache.org/jira/browse/YARN-4700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Li Lu >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > Attachments: YARN-4700-YARN-2928.v1.001.patch, > YARN-4700-YARN-2928.v1.002.patch, YARN-4700-YARN-2928.v1.003.patch, > YARN-4700-YARN-2928.v1.004.patch, YARN-4700-YARN-2928.wip.patch > > > When testing the new web UI for ATS v2, I noticed that we're creating one > extra record for each finished application (but still hold in the RM state > store) each time the RM got restarted. It's quite possible that we add the > cluster start timestamp into the default cluster id, thus each time we're > creating a new record for one application (cluster id is a part of the row > key). We need to fix this behavior, probably by having a better default > cluster id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4602) Scalable and Simple Message Service for YARN application
[ https://issues.apache.org/jira/browse/YARN-4602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-4602: - Summary: Scalable and Simple Message Service for YARN application (was: Message/notification service between containers) > Scalable and Simple Message Service for YARN application > > > Key: YARN-4602 > URL: https://issues.apache.org/jira/browse/YARN-4602 > Project: Hadoop YARN > Issue Type: Sub-task > Components: applications, resourcemanager >Reporter: Junping Du >Assignee: Junping Du > > Currently, most communication among YARN daemons, services and > applications goes through RPC. In almost all cases, the logic running inside > containers acts as an RPC client but not a server, because it gets launched in flight. > The only special case is the AM container: because it gets launched earlier than > any other containers, it can be an RPC server and tell newly launched containers the > server address in application logic (like the MR AM). > The side effects are: > 1. When the AM container fails, the new AM attempt gets launched with a > new address/port, so previous RPC connections are broken. > 2. Applications' requirements vary, and there could be other dependencies > between containers (not just the AM), so a container failing over will affect other > containers' running logic. > It is better to have some message/notification mechanism between containers > to handle the above cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4737) Use CSRF Filter in YARN
[ https://issues.apache.org/jira/browse/YARN-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177543#comment-15177543 ] Varun Vasudev commented on YARN-4737: - Thanks for the updated patch, Jon. Some more fixes are required - 1) In WebApps.java -
{code}
+Map<String, String> params = getCsrfConfigParameters();
+if (hasCSRFEnabled(params)) {
+  LOG.info("CSRF Protection has been enabled for the {} application. "
+      + "Please ensure that there is an authentication mechanism "
+      + "enabled (kerberos, custom, etc).",
+      name);
+  String restCsrfClassName = RestCsrfPreventionFilter.class.getName();
+  HttpServer2.defineFilter(server.getWebAppContext(), restCsrfClassName,
+      restCsrfClassName, params,
+      new String[] {"/*"});
+}
{code}
should be before
{code}
HttpServer2.defineFilter(server.getWebAppContext(), "guice",
    GuiceFilter.class.getName(), null, new String[] { "/*" });
{code}
The guice filter redirects the request to the appropriate handler, and the requests get executed before going through the CSRF filter. 2) The JHS configs in mapred-default.xml start with the prefix - mapreduce.jobhistory.webapp but the prefix used in code is mapreduce.jobhistory (no webapp) - I think you need to create a mapreduce.jobhistory.webapp prefix in the code. 3) In yarn-default.xml, all the timeline service configs have an extra "." in them after "yarn.timeline-service". e.g. yarn.timeline-service..webapp.rest-csrf.methods-to-ignore The failing tests and ASF warnings are unrelated to the patch. > Use CSRF Filter in YARN > --- > > Key: YARN-4737 > URL: https://issues.apache.org/jira/browse/YARN-4737 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, resourcemanager, webapp >Reporter: Jonathan Maron >Assignee: Jonathan Maron > Attachments: YARN-4737.001.patch, YARN-4737.002.patch > > > A CSRF filter was added to hadoop common > (https://issues.apache.org/jira/browse/HADOOP-12691). The aim of this JIRA > is to come up with a mechanism to integrate this filter into the webapps for > which it is applicable (web apps that may establish an authenticated > identity). That includes the RM, NM, and mapreduce jobhistory web app. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
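Putting the two snippets from the comment together, the intended registration order would look roughly like the following. This is a sketch assembled from the excerpts above, not a verbatim copy of WebApps.java; the surrounding context (server, name, LOG, getCsrfConfigParameters, hasCSRFEnabled) is assumed to come from the enclosing WebApps code.
{code}
// Define the CSRF filter first so requests pass through it before dispatch.
Map<String, String> params = getCsrfConfigParameters();
if (hasCSRFEnabled(params)) {
  LOG.info("CSRF Protection has been enabled for the {} application. "
      + "Please ensure that there is an authentication mechanism "
      + "enabled (kerberos, custom, etc).", name);
  String restCsrfClassName = RestCsrfPreventionFilter.class.getName();
  HttpServer2.defineFilter(server.getWebAppContext(), restCsrfClassName,
      restCsrfClassName, params, new String[] {"/*"});
}
// The guice filter, which routes requests to their handlers, comes afterwards.
HttpServer2.defineFilter(server.getWebAppContext(), "guice",
    GuiceFilter.class.getName(), null, new String[] { "/*" });
{code}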
[jira] [Commented] (YARN-3863) Support complex filters in TimelineReader
[ https://issues.apache.org/jira/browse/YARN-3863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177526#comment-15177526 ] Varun Saxena commented on YARN-3863: Furthermore, in ATSv1 we had something called secondary filters, which are along the lines of our filters (at least similar to info filters). They used to check the other-info fields of an entity for a match. Even there, it was not mandatory to include OTHER_INFO in fields to retrieve for secondary filters to match. Not saying that we have to do what ATSv1 did, but just letting you know what was done in ATSv1. We can discuss further and take a final decision on this in today's meeting. > Support complex filters in TimelineReader > - > > Key: YARN-3863 > URL: https://issues.apache.org/jira/browse/YARN-3863 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > Labels: yarn-2928-1st-milestone > Attachments: YARN-3863-YARN-2928.v2.01.patch, > YARN-3863-YARN-2928.v2.02.patch, YARN-3863-YARN-2928.v2.03.patch, > YARN-3863-feature-YARN-2928.wip.003.patch, > YARN-3863-feature-YARN-2928.wip.01.patch, > YARN-3863-feature-YARN-2928.wip.02.patch, > YARN-3863-feature-YARN-2928.wip.04.patch, > YARN-3863-feature-YARN-2928.wip.05.patch > > > Currently filters in timeline reader will return an entity only if all the > filter conditions hold true i.e. only AND operation is supported. We can > support OR operation for the filters as well. Additionally as primary backend > implementation is HBase, we can design our filters in a manner, where they > closely resemble HBase Filters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3863) Support complex filters in TimelineReader
[ https://issues.apache.org/jira/browse/YARN-3863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177521#comment-15177521 ] Varun Saxena commented on YARN-3863: Moreover, info also has associated info as well. - Sorry, meant "Moreover, events may have associated info as well." > Support complex filters in TimelineReader > - > > Key: YARN-3863 > URL: https://issues.apache.org/jira/browse/YARN-3863 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > Labels: yarn-2928-1st-milestone > Attachments: YARN-3863-YARN-2928.v2.01.patch, > YARN-3863-YARN-2928.v2.02.patch, YARN-3863-YARN-2928.v2.03.patch, > YARN-3863-feature-YARN-2928.wip.003.patch, > YARN-3863-feature-YARN-2928.wip.01.patch, > YARN-3863-feature-YARN-2928.wip.02.patch, > YARN-3863-feature-YARN-2928.wip.04.patch, > YARN-3863-feature-YARN-2928.wip.05.patch > > > Currently filters in timeline reader will return an entity only if all the > filter conditions hold true i.e. only AND operation is supported. We can > support OR operation for the filters as well. Additionally as primary backend > implementation is HBase, we can design our filters in a manner, where they > closely resemble HBase Filters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3863) Support complex filters in TimelineReader
[ https://issues.apache.org/jira/browse/YARN-3863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177509#comment-15177509 ] Varun Saxena commented on YARN-3863: Thanks [~sjlee0] for the review. bq. One high level question: am I correct in understanding that if a relations filter is specified for example but relation was not specified as part of fields to retrieve, we would try to fetch the relation? Yes, we would try to fetch only those relations which are required to match the relation filters. The same goes for event filters: we will try to fetch only those events which are required to match the event filters if EVENTS is not specified in fields to retrieve. bq. What if we simply reject or ignore the filters if they do not match the fields to retrieve? Would it make the implementation simpler or harder? It would remove the need for some of the code in GenericEntityReader and ApplicationEntity, primarily the code in the methods {{fetchPartialColsFromInfoFamily}} and {{createFilterListForColsOfInfoFamily}}. bq. To me, supporting more contents even if the filters and the fields to retrieve are not consistent seems very much optional, and I'm not sure if it is worth it especially if it adds a lot more complexity. What do you think? Personally, I think fields to retrieve and filters should be treated separately. Filters decide which entities to carry back in the response, and fields/configs/metrics to retrieve decide what should be carried in each entity. Treating filters and fields to retrieve separately is consistent with the code written previously in the branch, but as this is new code we can change the behavior too. However, I am not very sure we should do so. For instance, if I want to get the IDs of all the FINISHED apps, I can make a query with eventfilters as APPLICATION_FINISHED and not specify anything in fields to retrieve, as I am only interested in the application ID. If I link it to fields to retrieve, I will have to unnecessarily fetch other events as well, which I have no interest in; this also increases the number of bytes transferred across the wire. Moreover, info also has associated info as well. Maybe, along the lines of confs/metrics to retrieve, we can have something like events to retrieve as well, but in all these cases one query param depends on another, which doesn't sound right to me. Thoughts? We can discuss further on this in today's meeting. bq. I know Vrushali C had some thoughts on how to split this monolithic TestHBaseTimelineStorage. It might be good to come to a consensus on how to split it... Ok. I had split it across apps and entities. We can seek her opinion on this too in today's meeting. I will check the other comments when I start coding the next version of the patch. Most sound like they would be valid and fixable. > Support complex filters in TimelineReader > - > > Key: YARN-3863 > URL: https://issues.apache.org/jira/browse/YARN-3863 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > Labels: yarn-2928-1st-milestone > Attachments: YARN-3863-YARN-2928.v2.01.patch, > YARN-3863-YARN-2928.v2.02.patch, YARN-3863-YARN-2928.v2.03.patch, > YARN-3863-feature-YARN-2928.wip.003.patch, > YARN-3863-feature-YARN-2928.wip.01.patch, > YARN-3863-feature-YARN-2928.wip.02.patch, > YARN-3863-feature-YARN-2928.wip.04.patch, > YARN-3863-feature-YARN-2928.wip.05.patch > > > Currently filters in timeline reader will return an entity only if all the > filter conditions hold true i.e. only AND operation is supported.
We can > support OR operation for the filters as well. Additionally as primary backend > implementation is HBase, we can design our filters in a manner, where they > closely resemble HBase Filters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
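Since the issue description talks about designing the filters to closely resemble HBase Filters, a small hedged sketch of what AND/OR semantics look like at the HBase client layer may help frame the discussion. The column family, qualifiers and values below are made up for illustration and are not the actual timeline service schema.
{code}
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FilterListSketch {

  /** Builds a Scan whose filter matches either of two conditions (OR). */
  public static Scan buildOrScan() {
    // Hypothetical info-family conditions, purely for illustration.
    SingleColumnValueFilter finished = new SingleColumnValueFilter(
        Bytes.toBytes("i"), Bytes.toBytes("state"),
        CompareOp.EQUAL, Bytes.toBytes("FINISHED"));
    SingleColumnValueFilter byUser = new SingleColumnValueFilter(
        Bytes.toBytes("i"), Bytes.toBytes("user"),
        CompareOp.EQUAL, Bytes.toBytes("alice"));

    // MUST_PASS_ONE gives OR semantics; MUST_PASS_ALL would give AND.
    FilterList orList = new FilterList(FilterList.Operator.MUST_PASS_ONE);
    orList.addFilter(finished);
    orList.addFilter(byUser);

    Scan scan = new Scan();
    scan.setFilter(orList);
    return scan;
  }
}
{code}
FilterList instances can also be nested inside one another, which is how a mix of AND and OR conditions could be expressed if the reader filters are mapped onto HBase filters.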
[jira] [Commented] (YARN-4754) Too many connection opened to TimelineServer while publishing entities
[ https://issues.apache.org/jira/browse/YARN-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177506#comment-15177506 ] Rohith Sharma K S commented on YARN-4754: - bq. I still see 2 places where we are not closing ClientResponse, when we call putDomain and in doPosting if response is not 200 OK. It looks like this is the case. After RM recovery completes, timeline entities are published in the background. During this span of time, if the timeline server is restarted or down for some time, many connections can be seen stuck in the CLOSE_WAIT state. > Too many connection opened to TimelineServer while publishing entities > -- > > Key: YARN-4754 > URL: https://issues.apache.org/jira/browse/YARN-4754 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Critical > Attachments: ConnectionLeak.rar > > > It is observed that too many connections are kept open to the > TimelineServer while publishing entities via SystemMetricsPublisher. This > sometimes causes a resource shortage for other processes or the RM itself. > {noformat} > tcp 0 0 10.18.99.110:3999 10.18.214.60:59265 > ESTABLISHED 115302/java > tcp 0 0 10.18.99.110:25001 :::* LISTEN > 115302/java > tcp 0 0 10.18.99.110:25002 :::* LISTEN > 115302/java > tcp 0 0 10.18.99.110:25003 :::* LISTEN > 115302/java > tcp 0 0 10.18.99.110:25004 :::* LISTEN > 115302/java > tcp 0 0 10.18.99.110:25005 :::* LISTEN > 115302/java > tcp 1 0 10.18.99.110:48866 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp 1 0 10.18.99.110:48137 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp 1 0 10.18.99.110:47553 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp 1 0 10.18.99.110:48424 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp 1 0 10.18.99.110:48139 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp 1 0 10.18.99.110:48096 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp 1 0 10.18.99.110:47558 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp 1 0 10.18.99.110:49270 10.18.99.110:8188 > CLOSE_WAIT 115302/java > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
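For reference, a minimal sketch of the close-on-every-path pattern being discussed, written against the Jersey 1.x client API (the ClientResponse type mentioned in the comment). The class name, URL and payload here are placeholders rather than the actual SystemMetricsPublisher/TimelineClient code.
{code}
import javax.ws.rs.core.MediaType;

import com.sun.jersey.api.client.Client;
import com.sun.jersey.api.client.ClientResponse;
import com.sun.jersey.api.client.WebResource;

public class TimelinePostSketch {

  private final Client client = Client.create();

  /** Posts a JSON payload and always releases the underlying connection. */
  public int postEntities(String url, String entitiesJson) {
    WebResource resource = client.resource(url);
    ClientResponse response = null;
    try {
      response = resource.type(MediaType.APPLICATION_JSON)
          .post(ClientResponse.class, entitiesJson);
      // Even a non-200 status must fall through to the finally block,
      // otherwise the connection is never released and can linger in
      // CLOSE_WAIT on the client side.
      return response.getStatus();
    } finally {
      if (response != null) {
        response.close();
      }
    }
  }
}
{code}
The same pattern would apply to the putDomain path mentioned in the quoted comment.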
[jira] [Commented] (YARN-4754) Too many connection opened to TimelineServer while publishing entities
[ https://issues.apache.org/jira/browse/YARN-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177478#comment-15177478 ] Rohith Sharma K S commented on YARN-4754: - Looking at the ATS logs, there were no exceptions. In the RM logs, there was the exception I mentioned in the first comment, {{SocketException: Too many open files}}. I am recovering the applications once again and will check for CLOSE_WAIT connections. > Too many connection opened to TimelineServer while publishing entities > -- > > Key: YARN-4754 > URL: https://issues.apache.org/jira/browse/YARN-4754 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Critical > Attachments: ConnectionLeak.rar > > > It is observed that too many connections are kept open to the > TimelineServer while publishing entities via SystemMetricsPublisher. This > sometimes causes a resource shortage for other processes or the RM itself. > {noformat} > tcp 0 0 10.18.99.110:3999 10.18.214.60:59265 > ESTABLISHED 115302/java > tcp 0 0 10.18.99.110:25001 :::* LISTEN > 115302/java > tcp 0 0 10.18.99.110:25002 :::* LISTEN > 115302/java > tcp 0 0 10.18.99.110:25003 :::* LISTEN > 115302/java > tcp 0 0 10.18.99.110:25004 :::* LISTEN > 115302/java > tcp 0 0 10.18.99.110:25005 :::* LISTEN > 115302/java > tcp 1 0 10.18.99.110:48866 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp 1 0 10.18.99.110:48137 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp 1 0 10.18.99.110:47553 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp 1 0 10.18.99.110:48424 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp 1 0 10.18.99.110:48139 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp 1 0 10.18.99.110:48096 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp 1 0 10.18.99.110:47558 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp 1 0 10.18.99.110:49270 10.18.99.110:8188 > CLOSE_WAIT 115302/java > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)