[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-04-28 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15262586#comment-15262586
 ] 

Varun Vasudev commented on YARN-3998:
-

I'm going to commit the latest patch tomorrow if no one objects.

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, 
> YARN-3998.06.patch, YARN-3998.07.patch, YARN-3998.08.patch, YARN-3998.09.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-04-27 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15260696#comment-15260696
 ] 

Varun Vasudev commented on YARN-3998:
-

[~vinodkv] - do you want to review this further or can I go ahead and commit it?

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, 
> YARN-3998.06.patch, YARN-3998.07.patch, YARN-3998.08.patch, YARN-3998.09.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-04-13 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239448#comment-15239448
 ] 

Varun Vasudev commented on YARN-3998:
-

The latest patch looks good to me. The points that still need to be addressed -
# Unify restart policies with AM restart work
# Support minimum retry interval
# Fix  containerLaunchStartTime and track the individual start time for the 
retry attempt.

In my opinion, all of these can be done as follow up JIRAs. [~vinodkv] - can 
you take a look at the latest patch?

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, 
> YARN-3998.06.patch, YARN-3998.07.patch, YARN-3998.08.patch, YARN-3998.09.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-04-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15235395#comment-15235395
 ] 

Hadoop QA commented on YARN-3998:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 10s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 5 new or modified test 
files. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 10s 
{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 
23s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 43s 
{color} | {color:green} trunk passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 4s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
41s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 43s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
50s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 
31s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 30s 
{color} | {color:green} trunk passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 3s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 10s 
{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
28s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 39s 
{color} | {color:green} the patch passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 1m 39s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 39s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 2s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 2m 2s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 2s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 40s 
{color} | {color:red} hadoop-yarn-project/hadoop-yarn: patch generated 17 new + 
774 unchanged - 8 fixed = 791 total (was 782) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 36s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
45s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 0s 
{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 
25s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 40s 
{color} | {color:green} the patch passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 3s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 20s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_77. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 10s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_77. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 12s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with 
JDK v1.8.0_77. {color} |

[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-04-11 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15235213#comment-15235213
 ] 

Jun Gong commented on YARN-3998:


Thanks [~vvasudev] for the review and comments!

Attach a rebased patch 09.patch and fix problem 3. I will handle problem 2 in 
following jira.

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, 
> YARN-3998.06.patch, YARN-3998.07.patch, YARN-3998.08.patch, YARN-3998.09.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-04-11 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234983#comment-15234983
 ] 

Varun Vasudev commented on YARN-3998:
-

Thanks for the patch [~hex108]. It looks good. Few things that need to be 
corrected -
# It needs to be rebased to the latest trunk
# In ContainerImpl - {code} +containerLaunchStartTime = clock.getTime(); 
{code} is not required. We shouldn't updated the start time for every launch. 
We should add a new variable to track the running time of the current launch 
but that can be done in a follow up jira
# In NMStateStoreService - {code} +  .append(", WorDir: 
").append(workDir) {code} there's typo in WorDir

Apart from that it looks good to me.

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, 
> YARN-3998.06.patch, YARN-3998.07.patch, YARN-3998.08.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-04-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15227816#comment-15227816
 ] 

Hadoop QA commented on YARN-3998:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 19s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 6 new or modified test 
files. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 25s 
{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 9m 
18s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 30s 
{color} | {color:green} trunk passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 50s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
50s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 28s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 
7s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 
50s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 43s 
{color} | {color:green} trunk passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 48s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 13s 
{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 
5s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 40s 
{color} | {color:green} the patch passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 3m 40s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 3m 40s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 1s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 3m 1s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 3m 1s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 55s 
{color} | {color:red} hadoop-yarn-project/hadoop-yarn: patch generated 17 new + 
773 unchanged - 8 fixed = 790 total (was 781) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 4s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
48s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s 
{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 
38s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 52s 
{color} | {color:green} the patch passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 2s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 29s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_77. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 23s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_77. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 10m 14s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with 
JDK v1.8.0_77. {color} |

[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-04-05 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15227697#comment-15227697
 ] 

Jun Gong commented on YARN-3998:


[~vvasudev] Thanks for the review and comments. Attach a new patch 08.patch to 
address your comments and fix test cases errors. 

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, 
> YARN-3998.06.patch, YARN-3998.07.patch, YARN-3998.08.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-04-04 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15225716#comment-15225716
 ] 

Varun Vasudev commented on YARN-3998:
-

Thanks for the patch [~hex108]

# {code}+  List containerLocalDirs = new 
ArrayList<>(localDirs.size());{code}
containerLocalDirs is created but never populated
# In ContainerRelaunch#call, there's a lot of duplicate code copied from 
ContainerLaunch#call. Can you please refactor the code in ContainerLaunch#call 
so that it can be re-used?

>From my analysis, these pieces can be moved into their own functions are 
>called from ContainerLaunch#call and ContainerRelaunch#call

{code}
+// CONTAINER_KILLED_ON_REQUEST should not be missed if the container
+// is already at KILLING
+if (container.getContainerState() == ContainerState.KILLING) {
+  dispatcher.getEventHandler().handle(
+  new ContainerExitEvent(containerID,
+  ContainerEventType.CONTAINER_KILLED_ON_REQUEST,
+  Shell.WINDOWS ? 
ContainerExecutor.ExitCode.FORCE_KILLED.getExitCode() :
+  ContainerExecutor.ExitCode.TERMINATED.getExitCode(),
+  "Container terminated before launch."));
+  return 0;
+}
{code}
This can be moved into it's own function that returns a boolean that we can use 
to get the same behaviour

{code}
+  localResources = container.getLocalizedResources();
+  if (localResources == null) {
+throw RPCUtil.getRemoteException(
+"Unable to get local resources when Container " + containerID +
+" is at " + container.getContainerState());
+  }
{code}
This can be moved into it's own function that returns 
{code}localResources{code} if it's not null and throws an exception otherwise.

{code}
+  List logDirs = dirsHandler.getLogDirs();
+
+  List containerLogDirs = new ArrayList<>();
+  String relativeContainerLogDir = ContainerLaunch
+  .getRelativeContainerLogDir(appIdStr, containerIdStr);
+  for(String logDir : logDirs) {
+containerLogDirs.add(logDir + Path.SEPARATOR + 
relativeContainerLogDir);
+  }
{code}
can be moved into it's own function that returns the populated List

{code}
+StringBuilder diagnosticInfo =
+new StringBuilder("Container exited with a non-zero exit code ");
+diagnosticInfo.append(ret);
+diagnosticInfo.append(". ");
+if (ret == ContainerExecutor.ExitCode.FORCE_KILLED.getExitCode()
+|| ret == ContainerExecutor.ExitCode.TERMINATED.getExitCode()) {
+  // If the process was killed, Send container_cleanedup_after_kill and
+  // just break out of this method.
+  dispatcher.getEventHandler().handle(
+  new ContainerExitEvent(containerID,
+  ContainerEventType.CONTAINER_KILLED_ON_REQUEST, ret,
+  diagnosticInfo.toString()));
+  return ret;
+}
{code}
can be moved it's own function.
I suspect we can move
{code}
+  // LaunchContainer is a blocking call. We are here almost means the
+  // container is launched, so send out the event.
+  dispatcher.getEventHandler().handle(new ContainerEvent(
+  containerID,
+  ContainerEventType.CONTAINER_LAUNCHED));
+  context.getNMStateStore().storeContainerLaunched(containerID);
+
+  // Check if the container is signalled to be killed.
+  if (!shouldLaunchContainer.compareAndSet(false, true)) {
+LOG.info("Container " + containerIdStr + " not launched as "
++ "cleanup already called");
+ret = ContainerExecutor.ExitCode.TERMINATED.getExitCode();
+  } else {
+exec.activateContainer(containerID, pidFilePath);
+ret = exec.launchContainer(new ContainerStartContext.Builder()
+.setContainer(container)
+.setLocalizedResources(localResources)
+.setNmPrivateContainerScriptPath(nmPrivateContainerScriptPath)
+.setNmPrivateTokensPath(nmPrivateTokensPath)
+.setUser(container.getUser())
+.setAppId(appIdStr)
+.setContainerWorkDir(containerWorkDir)
+.setLocalDirs(localDirs)
+.setLogDirs(logDirs)
+.setContainerLocalDirs(containerLocalDirs)
+.setContainerLogDirs(containerLogDirs)
+.build());
+  }
{code}
into its own function as well.

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, 
> YARN-3998.06.patch, YARN-3998.07.patch
>
>
> I'd like to add a 

[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-03-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15214969#comment-15214969
 ] 

Hadoop QA commented on YARN-3998:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 15s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 5 new or modified test 
files. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 11s 
{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 
44s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 52s 
{color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 9s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
42s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 46s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
55s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 
39s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 39s 
{color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 2s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 12s 
{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
31s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 48s 
{color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 1m 48s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 48s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 8s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 2m 8s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 8s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 43s 
{color} | {color:red} hadoop-yarn-project/hadoop-yarn: patch generated 23 new + 
777 unchanged - 5 fixed = 800 total (was 782) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 41s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
47s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s 
{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git 
apply --whitespace=fix. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 
21s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 30s 
{color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 54s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 20s {color} 
| {color:red} hadoop-yarn-api in the patch failed with JDK v1.8.0_74. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 54s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_74. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 9m 6s {color} | 
{color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK 
v1.8.0_74. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 7m 12s 
{color} | {color:green} 

[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-03-28 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15214532#comment-15214532
 ] 

Jun Gong commented on YARN-3998:


Very sorry for late. I just a attached a new patch 07.patch.

In the new patch:
1. It treats RESTART in a first class manner: add new state RELAUNCHING for 
container, and ContainerRelaunch.
2. ContainerLaunch records and stores container's working directory and log 
directory, ContainerRelaunch will reuse previous working directory, log 
directory and container's launch script. If previous directories/files are not 
available, it will fail to relaunch. We could change this behavior afterwards.
3. Address [~vvasudev]'s all comments to 06.patch.

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, 
> YARN-3998.06.patch, YARN-3998.07.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-03-09 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15188681#comment-15188681
 ] 

Jun Gong commented on YARN-3998:


Yes, it seems we need to deal with platform errors, however it still seems 
useful for user to specify retry policy that retry on what error codes because 
only user knows meaning of every error codes. 

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-03-09 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15187920#comment-15187920
 ] 

Vinod Kumar Vavilapalli commented on YARN-3998:
---

Instead of making it an arbitrary case-by-case per-exit-code behavior, we 
should have a general policy that is easily understood by users.

Namely we should not automatically restart containers for platform errors like 
mis-configuration of YARN, wrong permissions on yarn directories, absent 
user-accounts etc.

This is similar, again, to how we handled AM failures at a higher level via 
YARN-614. 

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-03-08 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184968#comment-15184968
 ] 

Junping Du commented on YARN-3998:
--

bq. We could specify retry policy to RETRY_ON_SPECIFIC_ERROR_CODE to handle 
this case, the error code might be INITIALIZE_USER_FAILED for the case if "user 
no found". Or do you mean it is not sufficient to just specify error codes?
Theoretically, it sounds ok to make retry policy based on error code. However, 
I think in some particular case (like "user not found" exception in 
LinuxContainerExecutor), it could share the same error code like -1000(INVALID) 
with other cases and we may need more specific info (like parsing error 
messages).

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-03-08 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184951#comment-15184951
 ] 

Jun Gong commented on YARN-3998:


We could specify retry policy to RETRY_ON_SPECIFIC_ERROR_CODE to handle this 
case, the error code might be INITIALIZE_USER_FAILED for the case if "user no 
found". Or do you mean it is not sufficient to just specify error codes?

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-03-08 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184952#comment-15184952
 ] 

Jun Gong commented on YARN-3998:


We could specify retry policy to RETRY_ON_SPECIFIC_ERROR_CODE to handle this 
case, the error code might be INITIALIZE_USER_FAILED for the case if "user no 
found". Or do you mean it is not sufficient to just specify error codes?

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-03-08 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184809#comment-15184809
 ] 

Junping Du commented on YARN-3998:
--

Thanks [~hex108] for the patch. In addition, I think we should be specific on 
failure types for re-luanch policy. e.g. if container get failed with "user no 
found" (enabling LinuxContainerExecutor), then relaunch container on this node 
could just be waste of time.

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-03-07 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184167#comment-15184167
 ] 

Jun Gong commented on YARN-3998:


Thanks [~vvasudev] and [~vinodkv] for the comments and suggestions. I will 
update the patch.

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-03-07 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183634#comment-15183634
 ] 

Vinod Kumar Vavilapalli commented on YARN-3998:
---

Okay, seems like we are in general agreement.
 - Let's try to look at the unification part in this JIRA itself as it concerns 
public facing APIs - even if we don't unify them completely here.
 - I filed YARN-4770 and assigned to myself. [~hex108] / [~vvasudev], feel free 
to take them over as you start working on them.
 - [~vvasudev], please also file a sub-task under YARN-4725 for the LCE related 
work. Tx.

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-03-06 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182570#comment-15182570
 ] 

Varun Vasudev commented on YARN-3998:
-

My apologies for not responding earlier. I think creating a branch is not 
required. We can continue to work off trunk as long as the default behaviour 
preserves the existing behaviour - i.e. no retries. 

[~hex108] - what will the complexity of using the AM policy of retry windows 
for this patch? If it's straight forward, we should do it as part of this patch 
otherwise we should do it as part of a follow up.

Just to recap - what I would prefer is we address the following points(in 
order) -
#  Treat restarts in a first class manner - add the state machine changes 
required
#  When handling restarts - restart containers with the same local and log 
dirs(as long as the disks haven't gone bad)
#  Attempt to unify the AM/Container retry policies if feasible - this can be 
done as part of a follow up JIRA because it probably requires some discussion.

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-02-24 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15166567#comment-15166567
 ] 

Jun Gong commented on YARN-3998:


Thanks [~vinodkv] for explaining it.

{quote}
My point was mainly about creating and reusing a common policy-framework even 
if the actual policies may not be entirely reused. We should seriously consider 
this instead of creating adhoc APIs for custom hard-coded policies.
{quote}
Yes, it will be better if we could reuse a common policy-framework, we might 
need discuss it more.

{quote}
I'm okay creating separate JIRAs under YARN-3998 if you both think of doing so, 
but treat (some of the above) as blockers for releasing this feature. Given 
that, does it make sense to work on this in a branch?
{quote}
I could address these block problems in this issue if needed.  [~vvasudev] 
Could you share your thought please? Thanks.

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-02-23 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159891#comment-15159891
 ] 

Vinod Kumar Vavilapalli commented on YARN-3998:
---

[~vvasudev], [~hex108]
bq. Unification with AM restart policies
My point was mainly about creating and reusing a common _policy-framework_ even 
if the actual policies may not be entirely reused. We should seriously consider 
this instead of creating adhoc APIs for custom hard-coded policies.

bq. We should also handle restarting of any containers that may have 
crashed during the NM reboot.
bq. The current version of the patch doesn't work with LCE because the NM can't 
cleanup the launch container script and the tokens(they're owned by the user 
who submitted the job/nobody)
bq. Yes, it will be better to store the log dir and work dir if we aims to make 
it more accurate. I was thinking to make minimal changes for this feature.
I'm okay creating separate JIRAs under YARN-3998 if you both think of doing so, 
but treat (some of the above) as blockers for releasing this feature. Given 
that, does it make sense to work on this in a branch?

[~asuresh]
bq. Would it make sense to add a time dimension. Instead of just specifying a 
retry count, it might be better to specify something like : "Kill container if 
it restarts X times in Y seconds".
As I was saying, this is similar to AM restarts where we support a sliding 
window of restarts so as to forget about very-old failures instead of 
accumulating them for-ever. This was added at YARN-611.

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-02-23 Thread Arun Suresh (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159810#comment-15159810
 ] 

Arun Suresh commented on YARN-3998:
---

By the way, Thanks a ton for raising this [~hex108].. Extremely useful feature 
to have !!

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-02-23 Thread Arun Suresh (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159791#comment-15159791
 ] 

Arun Suresh commented on YARN-3998:
---

Was spending some time thinking about this..

Would it make sense to add a time dimension. Instead of just specifying a retry 
count, it might be better to specify something like :
"Kill container if it restarts X times in Y seconds".

This way, we can distinguish between transient and steady state errors. Most 
un-recoverable errors are transient (something that pops up during startup 
etc.) which will cause the container to restart multiple times really fast and 
thus should be permanently killed. A steady state restart (for eg. a container 
hosting a web server going down due to some 500 error.. caused due to some 
weird user request) should not contribute to the retry increment.

It might also make sense to extend this as a full fledged Container restart 
*Strategy*... Similar to what is found in systems written in Erlang 
(http://erlang.org/doc/design_principles/sup_princ.html) and the Scala Akka 
supervisor (http://doc.akka.io/docs/akka/snapshot/scala/fault-tolerance.html)

Thoughts ?

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-02-23 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159768#comment-15159768
 ] 

Vinod Kumar Vavilapalli commented on YARN-3998:
---

Making this a sub-task of YARN-4725 where we can track the remaining items.

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-02-16 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149654#comment-15149654
 ] 

Jun Gong commented on YARN-3998:


Sorry for late reply, I was on holiday.

Thanks [~vinodkv] and [~vvasudev] for suggestion and review!

Some additional thought besides [~vvasudev]'s opinion: 
{quote}
Unification with AM restart policies
{quote}
I agree with [~vvasudev]. Now AM restart polices is retrying across different 
nodes, this feature is retrying on local node. When RM launches AM, it could 
specify local retry policy for it.  

{quote}
Treat relaunch in a first-class manner
{quote}
Glad to see it to be a first-class manner, I will update the patch.

{quote}
The following isn’t fool-proof and won’t work for all apps, can we just persist 
and read the selected log-dir from the state-store?
ContainerLaunch.handleContainerExitWithFailure() needs to handled differently 
during container-relaunches.
The same can be done for the work-dir.
All of these are related. If we store the log dir and work dir in the state 
store, we can address all 3 of these. 
{quote}
Yes, it will be better to store the log dir and work dir if we aims to make it 
more accurate. I was thinking to make minimal changes for this feature.

{quote}
In fact, if we end up changing the work-dir during relaunch due to a bad-dir, 
that may result in a breakage for the app. Apps may be reading from / writing 
into the work-dir and changing it during relaunch may invalidate application's 
assumptions. Should we just fail the container completely and let the AM deal 
with it?
{quote}
My thought is that if user specifies retry policy on container, the user should 
make sure that container could deal with this situation.

{quote}
Instead of removing a line and setting the limit to 10*1000, take the last 'n' 
characters in the string where 'n' is a config setting.
{quote}
It might make the diagnostics not consistent to remove the last n characters, 
suppose the  diagnostics is “The exception is ” and there is n characters 
in XXX, the diagnositics becomes “The exception is”. There is similar problem 
by removing first or last n lines. How about removing previous attempts' error 
information and just keeping the latest attempt's information? 

Glad to see more discussion about the feature.

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-02-05 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15134366#comment-15134366
 ] 

Varun Vasudev commented on YARN-3998:
-

Thanks for your comments Vinod. Overall, I mostly agree with your thoughts - it 
might make sense to file an umbrella JIRA to add first class container retry 
support and have this become a sub-task.

With regards to your specific comments -

bq. Unification with AM restart policies 
This should be taken up as a different JIRA - there are significant differences 
between AM restarts and container restarts - the most important one being that 
AM restarts can occur on different nodes(which is probably the desired 
behavior) but container relaunches by definition have to be on the same node. 
We might be able to re-use the notion of retrying on specific error codes, but 
it probably requires some more discussion.
 
bq. Treat relaunch in a first-class manner
Makes sense - [~hex108] - can you incorporate these into your next patch?

bq. The relaunch feature needs to work across NM restarts, so we should save 
the retry-context and policy per container into the state-store and reload it 
for continue relaunching after NM restart.

The container retry policy details are already stored in the state-store as 
part of the ContainerLaunchContext

bq. We should also handle restarting of any containers that may have crashed 
during the NM reboot.
This should be handled as a separate JIRA and not as part of this one - 
especially because we won't have the reason the container crashed if it crashed 
during the NM reboot.


bq. The following isn’t fool-proof and won’t work for all apps, can we just 
persist and read the selected log-dir from the state-store?
bq. ContainerLaunch.handleContainerExitWithFailure() needs to handled 
differently during container-relaunches.
bq. The same can be done for the work-dir.

All of these are related. If we store the log dir and work dir in the state 
store, we can address all 3 of there. [~hex108] - what do you think?

bq.In fact, if we end up changing the work-dir during relaunch due to a 
bad-dir, that may result in a breakage for the app. Apps may be reading from / 
writing into the work-dir and changing it during relaunch may invalidate 
application's assumptions. Should we just fail the container completely and let 
the AM deal with it?

I suspect we'll have to add an additional flag which the AMs will have to set 
to decide the behaviour. There are many valid cases where containers can store 
state on mounted volumes and don't rely on the YARN local dirs. On the other 
hand, like you pointed out in some cases, we might want to abandon the 
relaunch. I'm not sure if this should be done as part of this JIRA or another 
one.

[~hex108] with regards to your latest patch -

1) Can you change RETRY_ON_SPECIFIC_ERROR_CODE to RETRY_ON_SPECIFIC_ERROR_CODES

2)
{code}
+if (diagnostics.length() > DIAGNOSTICS_MAX_SIZE) {
+  // delete first line
+  int i = diagnostics.indexOf("\n");
+  if (i != -1) {
+diagnostics.delete(0, i + 1);
+  }
+}
{code}
Instead of this - can you please change the function to
a) Only trim the diagnostics string size in case of a relaunch
b) Instead of removing a line and setting the limit to 10*1000, take the last 
'n' characters in the string where 'n' is a config setting.

3)
{code}
+} catch (IOException e) {
+  LOG.error("Failed to delete " + relativePath, e);
+}
{code}
and 
{code}
+} catch (IOException e) {
+  LOG.error("Failed to delete container script file and token file", e);
+}
+
Change LOG.error to LOG.warn

4) The current version of the patch doesn't work with LCE because the NM can't 
cleanup the launch container script and the tokens(they're owned by the user 
who submitted the job/nobody). To fix this requires a change in the 
container-executor binary, which will involve passing a flag to the 
container-executor binary to relaunch a container. [~vinodkv] - what do you 
think we should do here - hold off committing this patch until we have support 
for relaunch in container-executor or commit it documenting the fact that it'll 
only work with DefaultContainerExecutor and file a follow up JIRA?

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will 

[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-02-04 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15133367#comment-15133367
 ] 

Vinod Kumar Vavilapalli commented on YARN-3998:
---

Thanks for working on this [~hex108]!

I've been meaning to file this JIRA with a slighly larger scope, but glad to 
see this moving.

The following has a list of additional things that I wanted in this feature. 
But, if you / [~vvasudev] want progress on the current patch as it stands, I 
can go file additional tickets as needed.

h4. Unification with AM restart policies
 - We should try to unify the retry-policies you are adding with the ones for 
AM container / app-attempt itself: See {{ApplicationSubmissionContext. 
getMaxAppAttempts()}} /  {{getAttemptFailuresValidityInterval()}}. I like your 
policy framework, so may be (in a followup?) use the same framework for AMs. 
/cc [~xgong]
 - To avoid containers crashing and restarting in no time, we should have a 
global min-retry-interval. See YARN-3669 which does the same for AMs.
 - Also, similar to AM restarts, we should support a sliding window of restarts 
so as to forget about very-old failures instead of accumulating them for-ever.

h4. Treat relaunch in a first-class manner
 - It's surprising we don't recognize the relaunches in a first-class manner.
 - I really don't like lots of if-else conditions everywhere. So, instead of 
using Container.isRelaunch(), I think we should have
— an explicit RELAUNCHING state instead of simply going back to LOCALIZED 
state.
— An additional ContainersLauncherEventType.RELAUNCH_CONTAINER
— And a new {{ContainerRelaunch}} callable which overrides the behavior of 
ContainerLaunch.

h4. Additional things needed during relaunch
 - The relaunch feature needs to work across NM restarts, so we should save the 
retry-context and policy per container into the state-store and reload it for 
continue relaunching after NM restart.
 - We should also handle restarting of any containers that may have crashed 
during the NM reboot.

h4. Comments on the current approach and patch.
 - The following isn’t fool-proof and won’t work for all apps, can we just 
persist and read the selected log-dir from the state-store?
{code}
+   * We apply a simple heuristic to find its previous working directory:
+   * if a good work dir with file pattern '*out' (e.g. stdout) already exists,
{code}
 - The same can be done for the work-dir.
 - In fact, if we end up changing the work-dir during relaunch due to a 
bad-dir, that may result in a breakage for the app. Apps may be reading from / 
writing into the work-dir and changing it during relaunch may invalidate 
application's assumptions. Should we just fail the container completely and let 
the AM deal with it?
 - ContainerLaunch.handleContainerExitWithFailure() needs to handled 
differently during container-relaunches.

There are other minor things I saw but they can wait for these things.

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-02-03 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130109#comment-15130109
 ] 

Jun Gong commented on YARN-3998:


[~vvasudev], I just attached a new patch to address above problems. Thanks for 
review.

1) When finding container's previous working directory and log directory, just 
locate corresponding files in good directories which could be read/write and 
not full.

2)  Limiting diagnostic message's message to 1 bytes. If the length is 
greater than it, delete the first line whose separator is "\n".

3) After some container retries,  env variable  
*MALLOC_ARENA_MAX=$MALLOC_ARENA_MAX*(DEFAULT_NM_ADMIN_USER_ENV) will be 
expanded to *MALLOC_ARENA_MAX=::*(a lot of ":"). I fixed it 
in *Apps#addToEnvironment*.

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-02-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130229#comment-15130229
 ] 

Hadoop QA commented on YARN-3998:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 5 new or modified test 
files. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 37s 
{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 
38s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 27s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 23s 
{color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
38s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 58s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
52s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 
27s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 50s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 24s 
{color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 33s 
{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
36s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 20s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 2m 20s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 20s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 30s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 2m 30s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 30s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 40s 
{color} | {color:red} hadoop-yarn-project/hadoop-yarn: patch generated 4 new + 
522 unchanged - 4 fixed = 526 total (was 526) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 2s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
48s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s 
{color} | {color:red} The patch has 4 line(s) that end in whitespace. Use git 
apply --whitespace=fix. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 
22s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 45s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 23s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 24s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 19s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 8m 47s {color} 
| {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK 
v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 6m 40s 
{color} | {color:green} 

[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-02-02 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127929#comment-15127929
 ] 

Jun Gong commented on YARN-3998:


Thanks [~vvasudev] for the detailed review and comments. I will update the 
patch.

{quote}
9) In ContainerLaunch.java, in getContainerWorkDir, we should use 
dirsHandler.getLocalPathForWrite instead of dirsHandler.getLocalPathForRead.
{quote}
*getLocalPathForWrite* is used for allocating a new directory, I just want to 
search container's token file in all log paths in *getContainerLogDir*, so I 
use *getLocalPathForRead*. If the token file is not found, a new directory will 
be allocated by *getLocalPathForWrite*.

{quote}
A couple of additional thoughts - I would like to restart containers for kill 
and term signals which don't originate from YARN and handle scenarios where 
disks have gone bad and resources need to be relocalized but both those cases 
can be handled as follow up JIRAs.
{quote}
OK. I will create another JIRA to handle these cases.

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-02-02 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127952#comment-15127952
 ] 

Jun Gong commented on YARN-3998:


Thanks for clarifying. Yes, it will be a problem, I will handle it.

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-02-02 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127937#comment-15127937
 ] 

Varun Vasudev commented on YARN-3998:
-

{quote}
getLocalPathForWrite is used for allocating a new directory, I just want to 
search container's token file in all log paths in getContainerLogDir, so I use 
getLocalPathForRead. If the token file is not found, a new directory will be 
allocated by getLocalPathForWrite.
{quote}

The issue with getLocalPathForRead is that it will return true even if the disk 
is full. We need to avoid re-using a directory that's full.

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-02-01 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127734#comment-15127734
 ] 

Varun Vasudev commented on YARN-3998:
-

A couple of additional thoughts - I would like to restart containers for kill 
and term signals which don't originate from YARN and handle scenarios where 
disks have gone bad and resources need to be relocalized but both those cases 
can be handled as follow up JIRAs.

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-02-01 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127732#comment-15127732
 ] 

Varun Vasudev commented on YARN-3998:
-

My apologies for the slow response [~hex108]. Thank you for your patience. 
Comments on the latest patch

1) The patch doesn't apply cleanly on trunk anymore

2) In ContainerLaunchContext.java, change "Retry strategy when container fails 
to run.” to "Retry strategy when container exits with failure."

3) In yarn_protos.proto and ContainerRetryPolicy.java, rename ALWAYS_RETRY to 
RETRY_ON_ALL_ERRORS

4) In ApplicationMaster.java and Client.java, rename "retry_policy" to 
"container_retry_policy", "error_codes" to "container_retry_error_codes", 
"max_retries" to "container_max_retries", and "retry_interval" to 
"container_retry_interval"

5) In ContainerImpl.java, change 
{code}
+  LOG.info("Relaunch Container " + container.getContainerId()
++ ". Remain retry attempts : " + container.remainingRetryAttempts
++ ". Retry interval is "
++ container.containerRetryContext.getRetryInterval() + "ms");
+container.isRelaunch = true;
{code}
to
{code}
+  LOG.info("Relaunching container " + container.getContainerId()
++ ". Remaining retry attempts(after relaunch) : " + 
container.remainingRetryAttempts
++ ". Interval between retries is "
++ container.containerRetryContext.getRetryInterval() + "ms");
+container.isRelaunch = true;
{code}

6) In ContainerImpl.java, can we avoid creating a new thread if the interval is 
0?
{code}
+new Thread() {
+  @Override
+  public void run() {
+try {
+  Thread.sleep(container.containerRetryContext.getRetryInterval());
+  container.sendLaunchEvent();
+} catch (InterruptedException e) {
+  return;
+}
+  }
+}.start();
 {code}

7) In ContainerLaunch.java, change "If container is in relaunch" to "If 
container is being relaunched"

8) In ContainerLaunch.java, rename container1 to targetContainer

9) In ContainerLaunch.java, in getContainerWorkDir, we should use 
dirsHandler.getLocalPathForWrite instead of dirsHandler.getLocalPathForRead. As 
a result, 
{code}
+  public Path getLocalPathForRead(String pathStr) throws IOException {
+return getPathToRead(pathStr, getLocalDirsForRead());
+  }
{code}
is not required

10) In ContainerLaunch.java, in getContainerLogDir, instead of matching stdout 
exactly, can we use FileSystem.globStatus and match *out instead?

11) In the current version of the patch, the diagnosticInfo String is unbounded 
and there is no indication as to which attempt the diagnosticInfo came from. 
Can you update the patch to 
  - put some seperator(like "Diagnostic message from attempt:") in the 
diagnostic message 
  - limit the size of the diagnostic message to some upper bound

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-01-31 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15125597#comment-15125597
 ] 

Jun Gong commented on YARN-3998:


hi [~vvasudev], could you please help review the latest patch if you have time? 
Thanks!

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-01-15 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15101922#comment-15101922
 ] 

Hadoop QA commented on YARN-3998:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 5 new or modified test 
files. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 36s 
{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 
42s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 44s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 2s 
{color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
34s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 50s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
53s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 
56s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 34s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 59s 
{color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 35s 
{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
27s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 41s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 1m 41s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 41s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 2s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 2m 2s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 2s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 33s 
{color} | {color:red} Patch generated 4 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn (total was 530, now 530). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 41s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
43s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 
18s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 28s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 52s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 21s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 55s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 8m 38s {color} 
| {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK 
v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 6m 47s 
{color} | {color:green} hadoop-yarn-applications-distributedshell in the patch 

[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-01-15 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15101483#comment-15101483
 ] 

Hadoop QA commented on YARN-3998:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 5 new or modified test 
files. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 54s 
{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 
39s {color} | {color:green} trunk passed {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red} 0m 41s 
{color} | {color:red} root in trunk failed with JDK v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red} 1m 56s 
{color} | {color:red} root in trunk failed with JDK v1.7.0_91. {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 
6s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 9m 21s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
41s {color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s 
{color} | {color:blue} Skipped branch modules with no Java source: . {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 0s 
{color} | {color:green} trunk passed {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 17s 
{color} | {color:red} root in trunk failed with JDK v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 9m 7s 
{color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 48s 
{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 
34s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red} 0m 47s 
{color} | {color:red} root in the patch failed with JDK v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} cc {color} | {color:red} 0m 47s {color} | 
{color:red} root in the patch failed with JDK v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red} 0m 47s {color} 
| {color:red} root in the patch failed with JDK v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red} 5m 40s 
{color} | {color:red} root in the patch failed with JDK v1.7.0_91. {color} |
| {color:red}-1{color} | {color:red} cc {color} | {color:red} 5m 40s {color} | 
{color:red} root in the patch failed with JDK v1.7.0_91. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red} 5m 40s {color} 
| {color:red} root in the patch failed with JDK v1.7.0_91. {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 1m 6s 
{color} | {color:red} Patch generated 8 new checkstyle issues in root (total 
was 605, now 609). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 9m 30s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
37s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} shellcheck {color} | {color:green} 0m 
9s {color} | {color:green} There were no new shellcheck issues. {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s 
{color} | {color:red} The patch has 49 line(s) that end in whitespace. Use git 
apply --whitespace=fix. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s 
{color} | {color:blue} Skipped patch modules with no Java source: . {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 0s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 17s 
{color} | {color:red} root in the patch failed with JDK v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 9m 5s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 4m 20s {color} 
| {color:red} root in the patch 

[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-01-15 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15101552#comment-15101552
 ] 

Hadoop QA commented on YARN-3998:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 5 new or modified test 
files. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 44s 
{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 
39s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 47s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 6s 
{color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
33s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 50s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
50s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 
54s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 35s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 53s 
{color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 35s 
{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
28s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 51s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 1m 51s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 51s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 6s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 2m 6s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 6s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 32s 
{color} | {color:red} Patch generated 8 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn (total was 530, now 534). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 41s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
44s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 
31s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 30s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 53s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 21s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 56s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 8m 52s {color} 
| {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK 
v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 6m 59s 
{color} | {color:green} hadoop-yarn-applications-distributedshell in the patch 

[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-01-14 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15101383#comment-15101383
 ] 

Jun Gong commented on YARN-3998:


[~vvasudev] Thanks for the detailed review and suggestions! I just attached a 
new patch to address above problems.

{quote}
In your implementation, the relaunched container will go through the 
launchContainer call which will try to setup the container launch environment 
again(like creating the local and log dirs, creating tokens, etc). Won't this 
lead to FileAlreadyExistsException being thrown as part of the launchContainer 
call? In addition, this also means that on a node with more than one local dir, 
different attempts could get allocated to different local dirs. I wonder if 
it's better to move the retry logic into the launchContainer function instead 
of adding a new state transition?
{quote}
The reason for adding a new state transition are as following:
1. Between retry interval, container is not running actually, it seems more 
reasonable to make it in LOCALIZED state.
2. For NM restart, it will not be enough to just add retry logic into 
*ContainerLaunch#call()*. When NM restart, it will call 
*RecoveredContainerLaunch#call*, then we also need add retry logic at this 
place, otherwise container might exit with failure with no retry. The logic 
seems more clear to add a state transition, and avoids duplicated codes.

In order to avoid FileAlreadyExistsException, I add some 
code(*cleanupContainerFilesForRelaunch*) to cleanup files(token file and launch 
script). We also need to cleanup previous PID file, NM will try to get PID 
through this file when NM restart.

In order to use same container working directory and log directory, we need to 
record these path, and need to store these path to NMStateStore for NM restart 
case. According to [~vvasudev]'s suggestion, we use a simple heuristic – if a 
good work directory with the container tokens file already exists, use that 
directory otherwise use a new one. That way we don’t need to worry about 
storing the directories in the state store. However there is not a file likes 
'tokens file' for log directory, so we use the file 'stdout' as this kind of 
file. We assume there is 'stdout' in most containers' log direcotry.

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2016-01-07 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15087392#comment-15087392
 ] 

Varun Vasudev commented on YARN-3998:
-

Thanks for the patch [~hex108]!

In your implementation, the relaunched container will go through the 
launchContainer call which will try to setup the container launch environment 
again(like creating the local and log dirs, creating tokens, etc). Won't this 
lead to FileAlreadyExistsException being thrown as part of the launchContainer 
call? In addition, this also means that on a node with more than one local dir, 
different attempts could get allocated to different local dirs. I wonder if 
it's better to move the retry logic into the launchContainer function instead 
of adding a new state transition?

Some feedback on the code itself -
1)
Rename ContainerRetry to ContainerRetryContext

2)
{code}
{@link ContainerRetryPolicy} : NEVER_RETRY(no matter what error code is
+ * when container fails to run, just do not retry), ALWAYS_RETRY(no matter
+ * what error code is, when container fails to run, just retry),
+ * RETRY_ON_SPECIFIC_ERROR_CODE(when container fails to run, do retry if 
the
+ * error code is one of errorCodes, otherwise do not retry.
+ * Note: if error code is 137(SIGKILL) or 143(SIGTERM), it will not retry
+ * because it is usually killed on purpose.
{code}
Specify that the default policy is NEVER_RETRY

3)
Rename retryTimes to maxRetries

4)
Change the interval unit to ms instead of seconds

5)
{code}
+  // remain retries to relaunch container if needed
+  private int remainRetries;
{code}
Rename remainRetries to remainingRetryAttempts

6)
{code}
+if (launchContext != null) {
+  this.containerRetry = launchContext.getContainerRetry();
+  if (this.containerRetry != null) {
+remainRetries = containerRetry.getRetryTimes();
+  }
+} else {
+  this.containerRetry = null;
+}
{code}
Instead of setting containerRetry to null, can we initialize to a retry object 
with never_retry

7)
Rename RetryWithFailureTransition to RetryFailureTransition

8)
Change
{code}
+LOG.info("Relaunch Container " + container.getContainerId()
++ ". Remain retry times : " + container.remainRetries
++ ". Retry interval is "
++ container.containerRetry.getRetryInterval() + "s");
{code}
to 
{code}
+LOG.info("Relaunching container " + container.getContainerId()
++ ". Remaining retry attempts : " + container.remainRetries
++ ". Retry interval is "
++ container.containerRetry.getRetryInterval() + "s");
{code}

9)
Rename storeContainerRemainRetries to storeContainerRemainingRetryAttempts


> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2015-12-26 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072022#comment-15072022
 ] 

Jun Gong commented on YARN-3998:


Attach a new patch to fix above errors, add test cases.

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2015-12-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072026#comment-15072026
 ] 

Hadoop QA commented on YARN-3998:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 5 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 
28s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 45s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 4s 
{color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
26s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 47s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
53s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 
55s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 33s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 51s 
{color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
38s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 44s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 1m 44s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 44s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 5s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 2m 5s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 5s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 26s 
{color} | {color:red} Patch generated 4 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn (total was 540, now 540). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 49s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
51s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 
27s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 34s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 52s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 23s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 54s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 36s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with 
JDK v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 6m 46s 
{color} | {color:green} hadoop-yarn-applications-distributedshell in the patch 
passed with JDK v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 24s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_91. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 9s 

[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2015-12-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15071166#comment-15071166
 ] 

Hadoop QA commented on YARN-3998:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 
7s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 0s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 15s 
{color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
30s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 55s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
52s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 
10s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 39s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 5s 
{color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
43s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 59s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 1m 59s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 59s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 17s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 2m 17s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 17s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 29s 
{color} | {color:red} Patch generated 5 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn (total was 456, now 458). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 53s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
52s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 35s 
{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
introduced 3 new FindBugs issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 41s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 8s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 24s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 2m 1s {color} | 
{color:red} hadoop-yarn-common in the patch failed with JDK v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 8m 53s {color} 
| {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK 
v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 6m 53s {color} 
| {color:red} hadoop-yarn-applications-distributedshell in the patch failed 
with JDK v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 27s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_91. 
{color} |
| {color:red}-1{color} | {color:red} unit 

[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2015-12-24 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15071113#comment-15071113
 ] 

Jun Gong commented on YARN-3998:


Sorry for late. I just attached a patch for review, will add test cases later.

In the patch, I add *ContainerRetry* to specify container’s retry strategy when 
container fails to run, *ContainerRetry* includes 4 fields:
* *ContainerRetryPolicy*: it has three policies: *NEVER_RETRY*(no matter what 
error code is when container fails to run, just do not retry), 
*ALWAYS_RETRY*(no matter what error code is, when container fails to run, just 
retry), *RETRY_ON_SPECIFIC_ERROR_CODE*(when container fails to run, do retry if 
the error code is one of errorCodes, otherwise do not retry).
* *errorCodes* is described as above.
* *retryTimes* specifies how many times to retry if need to retry, if the value 
is -1, it means retrying forever.
* *retryInterval* specifies delaying some time before relaunch container, the 
unit is seconds.

I store container's remain retry times to NMStateStore to keep it across NM 
restart. And I modified distributed shell to support container retry, it helps 
a lot for testing.

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3998.01.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2015-12-10 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052043#comment-15052043
 ] 

Jun Gong commented on YARN-3998:


[~vvasudev] Thanks for the attention and suggestion. 

We have a patch for it, it implemented the policy "restart on all errors". Will 
attach a patch if it is OK.

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Jun Gong
>Assignee: Jun Gong
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2015-12-10 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051024#comment-15051024
 ] 

Varun Vasudev commented on YARN-3998:
-

[~hex108] - do you wish to work on this? Do you have a patch that adds support 
for this?

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Jun Gong
>Assignee: Jun Gong
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2015-12-09 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049050#comment-15049050
 ] 

Varun Vasudev commented on YARN-3998:
-

I'm thinking of re-opening this issue because we've seen a use case for this - 
long running services which (ideally) shouldn't lose local data if the service 
crashes.

Some pros about supporting some forms of restart policies on the NM -

1. Retry policies can be unified instead of every application having to 
re-implement their own.

2. Faster restarts - instead of the NM reaching out to the AM and then deciding 
what to do(and maintaining the container work dir), it can make an immediate 
decision. It's also an easier change to make - if the NMs need to talk to the 
AMs to decide whether to restart a container - we'll probably need a new state 
transition. Instead if we allow the AMs to specify a restart policy, the NM can 
make an immediate decision as soon as the container exits.

3. Similar to what Jun mentioned - when running Docker containers, it's useful 
to be able to restart containers that exit with an error code.

When I say restart policies - off the top of my head - I can think of 3 
policies - never restart(default), restart on all errors, restart on specific 
error codes.

[~jlowe], [~steve_l] - do you guys still feel that this should be done at the 
app level(and essentially re-implemented by every app)?

> Add retry-times to let NM re-launch container when it fails to run
> --
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Jun Gong
>Assignee: Jun Gong
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2015-08-03 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14651887#comment-14651887
 ] 

Jun Gong commented on YARN-3998:


Thanks [~jlowe] and [~ste...@apache.org] for the detailed explanation. Yes, app 
will have more control when doing it at the app level.

Our specific cases:
1. Container generates several large files(total size is 20G~50G) in the 
working directory. If container exits for some reason, it need re-generate 
those files, and container's service will be interrupted for a long time.

2. When running Docker container, docker image might be very large and pulled 
for a long time. If the container fails to run and re-scheduled to another 
host, NM might need pull the image again.

 Add retry-times to let NM re-launch container when it fails to run
 --

 Key: YARN-3998
 URL: https://issues.apache.org/jira/browse/YARN-3998
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Jun Gong
Assignee: Jun Gong

 I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
 launches containers, it could specify the value. Then NM will re-launch the 
 container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
 It will save a lot of time. It avoids container localization. RM does not 
 need to re-schedule the container. And local files in container's working 
 directory will be left for re-use.(If container have downloaded some big 
 files, it does not need to re-download them when running again.) 
 We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2015-07-31 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14649347#comment-14649347
 ] 

Jason Lowe commented on YARN-3998:
--

I think it's more effort for YARN to support this than it is for apps to do 
this on their own.  For apps this can be a simple shell script placed in front 
of the normal container launch command, and that's very simple and 
straightforward to do.  Also if the app controls this directly then it is free 
to do all sorts of interesting, app-specific retry scenarios, e.g.: only retry 
on certain exit codes but not others, only retry if certain files have or have 
not been created in the local filesystem, etc.

However I may be in the minority on this, and it would be interesting to hear 
what others think.

 Add retry-times to let NM re-launch container when it fails to run
 --

 Key: YARN-3998
 URL: https://issues.apache.org/jira/browse/YARN-3998
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Jun Gong
Assignee: Jun Gong

 I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
 launches containers, it could specify the value. Then NM will re-launch the 
 container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
 It will save a lot of time. It avoids container localization. RM does not 
 need to re-schedule the container. And local files in container's working 
 directory will be left for re-use.(If container have downloaded some big 
 files, it does not need to re-download them when running again.) 
 We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2015-07-31 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14649089#comment-14649089
 ] 

Jun Gong commented on YARN-3998:


[~jlowe] Thanks for the comment. 

I agree with you. I am not sure how many applications need this feature. If the 
number is large, it might be better to put this functionality in YARN, then 
applications themselves do not need add this. In our system, we write a new AM 
and add this feature in YARN, then we find Storm also need it.

 Add retry-times to let NM re-launch container when it fails to run
 --

 Key: YARN-3998
 URL: https://issues.apache.org/jira/browse/YARN-3998
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Jun Gong
Assignee: Jun Gong

 I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
 launches containers, it could specify the value. Then NM will re-launch the 
 container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
 It will save a lot of time. It avoids container localization. RM does not 
 need to re-schedule the container. And local files in container's working 
 directory will be left for re-use.(If container have downloaded some big 
 files, it does not need to re-download them when running again.) 
 We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

2015-07-30 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647755#comment-14647755
 ] 

Jason Lowe commented on YARN-3998:
--

Is this really a feature that YARN needs to provide?  To me this is basically a 
case of container re-use which the application itself can control.  A primitive 
example would be an application that launches a container that wraps the real 
task in a wrapper shell script or Java program that spawns the real task and 
will respawn it some number of times if the real task fails before failing the 
entire container.  I'm not sure YARN is the best place to put this 
functionality.

 Add retry-times to let NM re-launch container when it fails to run
 --

 Key: YARN-3998
 URL: https://issues.apache.org/jira/browse/YARN-3998
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Jun Gong
Assignee: Jun Gong

 I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
 launches containers, it could specify the value. Then NM will re-launch the 
 container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
 It will save a lot of time. It avoids container localization. RM does not 
 need to re-schedule the container. And local files in container's working 
 directory will be left for re-use.(If container have downloaded some big 
 files, it does not need to re-download them when running again.) 
 We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)