[jira] [Commented] (YARN-4528) decreaseContainer Message maybe lost if NM restart
[ https://issues.apache.org/jira/browse/YARN-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15081550#comment-15081550 ] MENG DING commented on YARN-4528:
-

Hi, [~sandflee]

With the current logic, I don't think the RM can know whether a container decrease message has really been persisted in the NM state store, even if you decrease the resource synchronously in the NM. For example, suppose we now synchronously decrease the resource in the NM, and something goes wrong when writing the NM state store: an exception will be thrown, and will be caught by the following statement during status update in the NM:
{code}
catch (Throwable e) {
  // TODO Better error handling. Thread can die with the rest of the
  // NM still running.
  LOG.error("Caught exception in status-updater", e);
}
{code}
So to me, there is really no benefit in decreasing container resource synchronously in the NM, is there?

> decreaseContainer Message maybe lost if NM restart
> --
>
> Key: YARN-4528
> URL: https://issues.apache.org/jira/browse/YARN-4528
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: sandflee
> Attachments: YARN-4528.01.patch
>
> we may pend the container decrease msg until the next heartbeat, or check the
> resource with rmContainer when the node registers.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
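The argument above can be illustrated with a minimal sketch. All names here are hypothetical (this is not the actual Hadoop code): it only shows why a `catch (Throwable)` that merely logs makes a failed state-store write indistinguishable, from the outside, from a successful one.

```java
/**
 * Sketch (hypothetical names): a synchronous decrease in the NM still gives
 * the RM no persistence guarantee, because a state-store failure is swallowed
 * by the status-updater's catch (Throwable) and only logged.
 */
public class SwallowedFailureDemo {

    /** Pretend state-store write that may fail. */
    static void storeContainerResourceChanged(boolean failWrite) {
        if (failWrite) {
            throw new RuntimeException("state store write failed");
        }
    }

    /** Mirrors the status-updater loop: the error is only logged. */
    public static boolean updateCycle(boolean failWrite) {
        try {
            storeContainerResourceChanged(failWrite);
            return true; // change persisted
        } catch (Throwable e) {
            // As in the real status-updater: log and keep going,
            // so nothing upstream ever observes the failure.
            System.err.println("Caught exception in status-updater: " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        // Both cycles complete "normally" from the RM's point of view.
        updateCycle(false);
        updateCycle(true);
        System.out.println("both heartbeat cycles completed without surfacing an error");
    }
}
```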
[ https://issues.apache.org/jira/browse/YARN-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15081043#comment-15081043 ] Hadoop QA commented on YARN-4528:
-

(x) *-1 overall*

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 0s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| +1 | mvninstall | 8m 45s | trunk passed |
| +1 | compile | 1m 23s | trunk passed with JDK v1.8.0_66 |
| +1 | compile | 1m 27s | trunk passed with JDK v1.7.0_91 |
| +1 | checkstyle | 0m 24s | trunk passed |
| +1 | mvnsite | 1m 11s | trunk passed |
| +1 | mvneclipse | 0m 27s | trunk passed |
| +1 | findbugs | 2m 20s | trunk passed |
| +1 | javadoc | 0m 47s | trunk passed with JDK v1.8.0_66 |
| +1 | javadoc | 0m 51s | trunk passed with JDK v1.7.0_91 |
| +1 | mvninstall | 1m 2s | the patch passed |
| +1 | compile | 1m 20s | the patch passed with JDK v1.8.0_66 |
| +1 | javac | 1m 21s | the patch passed |
| +1 | compile | 1m 24s | the patch passed with JDK v1.7.0_91 |
| +1 | javac | 1m 24s | the patch passed |
| -1 | checkstyle | 0m 24s | Patch generated 1 new checkstyle issues in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server (total was 134, now 135). |
| +1 | mvnsite | 1m 9s | the patch passed |
| +1 | mvneclipse | 0m 26s | the patch passed |
| +1 | whitespace | 0m 0s | Patch has no whitespace issues. |
| +1 | findbugs | 2m 38s | the patch passed |
| +1 | javadoc | 0m 44s | the patch passed with JDK v1.8.0_66 |
| +1 | javadoc | 0m 50s | the patch passed with JDK v1.7.0_91 |
| +1 | unit | 9m 4s | hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_66. |
| -1 | unit | 62m 4s | hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. |
| +1 | unit | 9m 17s | hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_91. |
| -1 | unit | 61m 21s | hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_91. |
| +1 | asflicense | 0m 20s | Patch does not generate ASF License warnings. |
| | | 171m 3s | |

|| Reason || Tests ||
| JDK v1.8.0_66 Failed junit tests |
[ https://issues.apache.org/jira/browse/YARN-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082083#comment-15082083 ] MENG DING commented on YARN-4528:
-

Honestly, I don't think the design needs to change, unless other people think differently. As you said, this RARELY, if ever, happens. Also, we acknowledged that the AM only issues a decrease request when it knows a container doesn't need its original amount of resource, and a lost decrease message in the NM is not at all fatal (unlike a lost increase message, which may cause the container to be killed by resource enforcement).
[ https://issues.apache.org/jira/browse/YARN-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15081267#comment-15081267 ] MENG DING commented on YARN-4528:
-

Hi, [~sandflee]

I am not quite sure about the benefit of directly decreasing the resource in the NM (point #2 in your comment). The targetResource is already persisted in the NM state store for NM recovery, and the RM does not need to check the status of the NM-side decrease anyway.
{code}
// Persist container resource change for recovery
this.context.getNMStateStore().storeContainerResourceChanged(
    containerId, targetResource);
{code}
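The recovery argument can be sketched as follows. This is a toy model with hypothetical names (not the real NMStateStore API): once the targetResource is persisted, an NM restart can simply re-apply it, so the RM never needs to query the status of the decrease.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy sketch (hypothetical API): the NM persists the target resource of a
 * decrease before applying it; on restart it recovers and re-applies the
 * last persisted target for each container.
 */
public class TargetResourceRecovery {
    // containerId -> target memory in MB (stands in for the real state store)
    private final Map<String, Integer> store = new HashMap<>();

    /** Persist the change before it takes effect, as the quoted code does. */
    public void storeContainerResourceChanged(String containerId, int targetMb) {
        store.put(containerId, targetMb);
    }

    /** On NM restart, recover the last persisted target per container. */
    public Map<String, Integer> recover() {
        return new HashMap<>(store);
    }

    public static void main(String[] args) {
        TargetResourceRecovery nm = new TargetResourceRecovery();
        nm.storeContainerResourceChanged("container_1", 512);
        // Simulated restart: the persisted target survives.
        System.out.println("recovered: " + nm.recover());
    }
}
```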
[ https://issues.apache.org/jira/browse/YARN-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15081322#comment-15081322 ] sandflee commented on YARN-4528:
-

Hi, [~mding], the container decrease msg is passed the same way the container complete msg is passed from RM to AM, so a successful nodeHeartBeat must ensure that the container decrease msg is persisted in the NM state store.
{code:title=RMAppAttemptImpl.java # pullJustFinishContainers}
// A new allocate means the AM received the previously sent
// finishedContainers. We can ack this to NM now
sendFinishedContainersToNM();
{code}
[ https://issues.apache.org/jira/browse/YARN-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15081969#comment-15081969 ] sandflee commented on YARN-4528:
-

Thanks [~mding], yes this could happen, but rarely. Should this affect the design?
[ https://issues.apache.org/jira/browse/YARN-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15081342#comment-15081342 ] sandflee commented on YARN-4528:
-

[~jianhe] Reviewing the code for how container complete msgs are passed from RM to AM, it seems there is a race condition where the message will be lost if it has been pulled by the AM (but not actually delivered to the AM) and the AM crashes. We could fix this by putting finishedContainersSentToAM back into justFinishedContainers when transferring state from the previous RMAppAttempt.
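The pull/ack pattern and the proposed fix can be sketched as below. The names are hypothetical, loosely modeled on RMAppAttemptImpl: messages pulled by the AM stay in a "sent" list until the next allocate acks them, and on attempt failover the new attempt requeues the sent-but-unacked messages so they are delivered again instead of being lost.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch (hypothetical names): ack-after-pull delivery of finished-container
 * messages, plus the proposed fix for the AM-crash race.
 */
public class FinishedContainerAckDemo {
    private final List<String> justFinishedContainers = new ArrayList<>();
    private final List<String> finishedContainersSentToAM = new ArrayList<>();

    public void containerFinished(String containerId) {
        justFinishedContainers.add(containerId);
    }

    /** AM allocate call: ack the previously sent messages, then pull new ones. */
    public List<String> pullJustFinishedContainers() {
        // A new allocate means the AM received the previously sent messages.
        finishedContainersSentToAM.clear();
        List<String> pulled = new ArrayList<>(justFinishedContainers);
        finishedContainersSentToAM.addAll(pulled);
        justFinishedContainers.clear();
        return pulled;
    }

    /** Proposed fix: on attempt failover, requeue sent-but-unacked messages. */
    public void transferStateFrom(FinishedContainerAckDemo previousAttempt) {
        justFinishedContainers.addAll(previousAttempt.finishedContainersSentToAM);
        justFinishedContainers.addAll(previousAttempt.justFinishedContainers);
    }

    public static void main(String[] args) {
        FinishedContainerAckDemo attempt1 = new FinishedContainerAckDemo();
        attempt1.containerFinished("c1");
        attempt1.pullJustFinishedContainers(); // AM pulls, then crashes before acking
        FinishedContainerAckDemo attempt2 = new FinishedContainerAckDemo();
        attempt2.transferStateFrom(attempt1);  // the unacked message is requeued
        System.out.println("redelivered: " + attempt2.pullJustFinishedContainers());
    }
}
```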
[ https://issues.apache.org/jira/browse/YARN-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075772#comment-15075772 ] sandflee commented on YARN-4528:
-

Since in most cases the container size is not changed, I propose to keep the container decrease msg pending.
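One way to read the proposal is sketched below, with hypothetical names (this is not actual RM code): the RM keeps each container decrease message pending and re-sends it on every heartbeat response until the NM confirms it has persisted the change, so an NM restart only delays delivery rather than losing it.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Sketch (hypothetical names): pending decrease messages are re-sent on each
 * heartbeat until the NM confirms persistence, surviving an NM restart.
 */
public class PendingDecreaseDemo {
    // containerId -> target resource in MB, still awaiting NM confirmation
    private final Map<String, Integer> pendingDecrease = new HashMap<>();

    public void decreaseContainer(String containerId, int targetMb) {
        pendingDecrease.put(containerId, targetMb);
    }

    /** Build the heartbeat response: drop confirmed msgs, re-send the rest. */
    public Map<String, Integer> onHeartbeat(List<String> confirmedByNM) {
        for (String id : confirmedByNM) {
            pendingDecrease.remove(id); // NM persisted it; stop re-sending
        }
        return new HashMap<>(pendingDecrease);
    }

    public static void main(String[] args) {
        PendingDecreaseDemo rm = new PendingDecreaseDemo();
        rm.decreaseContainer("c1", 1024);
        // NM restarted before persisting: no confirmation, so it is re-sent.
        System.out.println("resent: " + rm.onHeartbeat(java.util.Collections.emptyList()));
        // NM confirms on a later heartbeat: nothing left to send.
        System.out.println("after ack: " + rm.onHeartbeat(java.util.Arrays.asList("c1")));
    }
}
```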