[jira] [Commented] (YARN-4528) decreaseContainer Message maybe lost if NM restart

2016-01-04 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15081550#comment-15081550
 ] 

MENG DING commented on YARN-4528:
-

Hi, [~sandflee]

With current logic, I think RM won't know if a container decrease msg has 
really been persisted in NM state store or not, even if you decrease resource 
synchronously in NM. For example, suppose we now synchronously decrease 
resource in NM, and something goes wrong when writing the NM state store, then 
an exception will be thrown, and will be caught by the following statement 
during status update in NM:

{code}
catch (Throwable e) {

// TODO Better error handling. Thread can die with the rest of the
// NM still running.
LOG.error("Caught exception in status-updater", e);
  } 
{code}

So to me, there is really no benefit of decreasing container resource 
synchronously in NM, is it?

> decreaseContainer Message maybe lost if NM restart
> --
>
> Key: YARN-4528
> URL: https://issues.apache.org/jira/browse/YARN-4528
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: sandflee
> Attachments: YARN-4528.01.patch
>
>
> we may pending the container decrease msg util next heartbeat. or checks the 
> resource with rmContainer when node register.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4528) decreaseContainer Message maybe lost if NM restart

2016-01-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15081043#comment-15081043
 ] 

Hadoop QA commented on YARN-4528:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s 
{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 
45s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 23s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 27s 
{color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
24s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 11s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
27s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 
20s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 47s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 51s 
{color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
2s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 20s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 21s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 24s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 24s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 24s 
{color} | {color:red} Patch generated 1 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server (total was 134, now 135). 
{color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 9s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
26s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 
38s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 44s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 50s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 4s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with 
JDK v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 62m 4s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 17s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with 
JDK v1.7.0_91. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 61m 21s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_91. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
20s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 171m 3s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_66 Failed junit tests | 

[jira] [Commented] (YARN-4528) decreaseContainer Message maybe lost if NM restart

2016-01-04 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082083#comment-15082083
 ] 

MENG DING commented on YARN-4528:
-

Honestly I don't think the design needs to be changed, unless other people 
think differently. As you said, this RARELY, if ever happens. Also, we 
acknowledged that AM only issues decrease request when it knows that a 
container doesn't need the original amount of resource, and a failed decrease 
message in NM is not at all fatal (unlike a failed increase message, which may 
cause the container to be killed by the resource enforcement). 

> decreaseContainer Message maybe lost if NM restart
> --
>
> Key: YARN-4528
> URL: https://issues.apache.org/jira/browse/YARN-4528
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: sandflee
> Attachments: YARN-4528.01.patch
>
>
> we may pending the container decrease msg util next heartbeat. or checks the 
> resource with rmContainer when node register.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4528) decreaseContainer Message maybe lost if NM restart

2016-01-04 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15081267#comment-15081267
 ] 

MENG DING commented on YARN-4528:
-

Hi, [~sandflee]

I am not quite sure about the benefit of directly decreasing resource in NM 
(point #2 in your comment). The targetResource is already being persisted in NM 
state store for NM recovery, and RM does not need to check the status of the NM 
decrease anyway. 
{code}
// Persist container resource change for recovery
this.context.getNMStateStore().storeContainerResourceChanged(
containerId, targetResource);
{code}


> decreaseContainer Message maybe lost if NM restart
> --
>
> Key: YARN-4528
> URL: https://issues.apache.org/jira/browse/YARN-4528
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: sandflee
> Attachments: YARN-4528.01.patch
>
>
> we may pending the container decrease msg util next heartbeat. or checks the 
> resource with rmContainer when node register.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4528) decreaseContainer Message maybe lost if NM restart

2016-01-04 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15081322#comment-15081322
 ] 

sandflee commented on YARN-4528:


HI, [~mding],  container decrease msg is passed like container complete msg 
passed from RM to AM. so a successfully nodeHeartBeat must ensure that 
container decrease msg is persisted in NM state store.

{code:title=RMAppAttemptImpl.java # pullJustFinishContainers}
  // A new allocate means the AM received the previously sent
  // finishedContainers. We can ack this to NM now
  sendFinishedContainersToNM();
{code}

> decreaseContainer Message maybe lost if NM restart
> --
>
> Key: YARN-4528
> URL: https://issues.apache.org/jira/browse/YARN-4528
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: sandflee
> Attachments: YARN-4528.01.patch
>
>
> we may pending the container decrease msg util next heartbeat. or checks the 
> resource with rmContainer when node register.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4528) decreaseContainer Message maybe lost if NM restart

2016-01-04 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15081969#comment-15081969
 ] 

sandflee commented on YARN-4528:


thanks [~mding], yes this could happen, but rarely. should this affect the 
design?

> decreaseContainer Message maybe lost if NM restart
> --
>
> Key: YARN-4528
> URL: https://issues.apache.org/jira/browse/YARN-4528
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: sandflee
> Attachments: YARN-4528.01.patch
>
>
> we may pending the container decrease msg util next heartbeat. or checks the 
> resource with rmContainer when node register.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4528) decreaseContainer Message maybe lost if NM restart

2016-01-04 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15081342#comment-15081342
 ] 

sandflee commented on YARN-4528:


[~jianhe] reviewing the code of how containers complete msg passed from RM to 
AM, seems there is a race condition that the message will be lost when msg 
pulled by AM (not really passed to AM) and AM crashed. we could fix this by put 
finishedContainersSentToAM to justFinishContainers when transfer state from 
previous RMAppAttempt. 

> decreaseContainer Message maybe lost if NM restart
> --
>
> Key: YARN-4528
> URL: https://issues.apache.org/jira/browse/YARN-4528
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: sandflee
> Attachments: YARN-4528.01.patch
>
>
> we may pending the container decrease msg util next heartbeat. or checks the 
> resource with rmContainer when node register.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4528) decreaseContainer Message maybe lost if NM restart

2015-12-30 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075772#comment-15075772
 ] 

sandflee commented on YARN-4528:


since in most cases container size is not changed, so I propose to pending 
container decrease msg.

> decreaseContainer Message maybe lost if NM restart
> --
>
> Key: YARN-4528
> URL: https://issues.apache.org/jira/browse/YARN-4528
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: sandflee
>
> we may pending the container decrease msg util next heartbeat. or checks the 
> resource with rmContainer when node register.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)