[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15262586#comment-15262586 ] Varun Vasudev commented on YARN-3998: - I'm going to commit the latest patch tomorrow if no one objects. > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, > YARN-3998.06.patch, YARN-3998.07.patch, YARN-3998.08.patch, YARN-3998.09.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15260696#comment-15260696 ] Varun Vasudev commented on YARN-3998: - [~vinodkv] - do you want to review this further or can I go ahead and commit it? > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, > YARN-3998.06.patch, YARN-3998.07.patch, YARN-3998.08.patch, YARN-3998.09.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239448#comment-15239448 ] Varun Vasudev commented on YARN-3998: - The latest patch looks good to me. The points that still need to be addressed - # Unify restart policies with AM restart work # Support minimum retry interval # Fix containerLaunchStartTime and track the individual start time for the retry attempt. In my opinion, all of these can be done as follow up JIRAs. [~vinodkv] - can you take a look at the latest patch? > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, > YARN-3998.06.patch, YARN-3998.07.patch, YARN-3998.08.patch, YARN-3998.09.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15235395#comment-15235395 ] Hadoop QA commented on YARN-3998: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 10s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 5 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 10s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 23s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 43s {color} | {color:green} trunk passed with JDK v1.8.0_77 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 4s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 41s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 43s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 50s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 31s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 30s {color} | {color:green} trunk passed with JDK v1.8.0_77 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 3s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 10s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 28s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 39s {color} | {color:green} the patch passed with JDK v1.8.0_77 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 1m 39s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 39s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 2s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 2m 2s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 2s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 40s {color} | {color:red} hadoop-yarn-project/hadoop-yarn: patch generated 17 new + 774 unchanged - 8 fixed = 791 total (was 782) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 36s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 45s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 0s {color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 25s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 40s {color} | {color:green} the patch passed with JDK v1.8.0_77 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 3s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 20s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_77. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 10s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_77. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 12s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_77. {color} |
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15235213#comment-15235213 ] Jun Gong commented on YARN-3998: Thanks [~vvasudev] for the review and comments! Attach a rebased patch 09.patch and fix problem 3. I will handle problem 2 in following jira. > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, > YARN-3998.06.patch, YARN-3998.07.patch, YARN-3998.08.patch, YARN-3998.09.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234983#comment-15234983 ] Varun Vasudev commented on YARN-3998: - Thanks for the patch [~hex108]. It looks good. Few things that need to be corrected - # It needs to be rebased to the latest trunk # In ContainerImpl - {code} +containerLaunchStartTime = clock.getTime(); {code} is not required. We shouldn't updated the start time for every launch. We should add a new variable to track the running time of the current launch but that can be done in a follow up jira # In NMStateStoreService - {code} + .append(", WorDir: ").append(workDir) {code} there's typo in WorDir Apart from that it looks good to me. > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, > YARN-3998.06.patch, YARN-3998.07.patch, YARN-3998.08.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15227816#comment-15227816 ] Hadoop QA commented on YARN-3998: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 19s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 6 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 25s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 9m 18s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 30s {color} | {color:green} trunk passed with JDK v1.8.0_77 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 50s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 50s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 28s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 7s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 50s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 43s {color} | {color:green} trunk passed with JDK v1.8.0_77 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 48s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 13s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 5s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 40s {color} | {color:green} the patch passed with JDK v1.8.0_77 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 3m 40s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 3m 40s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 1s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 3m 1s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 3m 1s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 55s {color} | {color:red} hadoop-yarn-project/hadoop-yarn: patch generated 17 new + 773 unchanged - 8 fixed = 790 total (was 781) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 4s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 48s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s {color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 38s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 52s {color} | {color:green} the patch passed with JDK v1.8.0_77 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 2s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 29s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_77. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 23s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_77. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 10m 14s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_77. {color} |
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15227697#comment-15227697 ] Jun Gong commented on YARN-3998: [~vvasudev] Thanks for the review and comments. Attach a new patch 08.patch to address your comments and fix test cases errors. > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, > YARN-3998.06.patch, YARN-3998.07.patch, YARN-3998.08.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15225716#comment-15225716 ] Varun Vasudev commented on YARN-3998: - Thanks for the patch [~hex108] # {code}+ List containerLocalDirs = new ArrayList<>(localDirs.size());{code} containerLocalDirs is created but never populated # In ContainerRelaunch#call, there's a lot of duplicate code copied from ContainerLaunch#call. Can you please refactor the code in ContainerLaunch#call so that it can be re-used? >From my analysis, these pieces can be moved into their own functions are >called from ContainerLaunch#call and ContainerRelaunch#call {code} +// CONTAINER_KILLED_ON_REQUEST should not be missed if the container +// is already at KILLING +if (container.getContainerState() == ContainerState.KILLING) { + dispatcher.getEventHandler().handle( + new ContainerExitEvent(containerID, + ContainerEventType.CONTAINER_KILLED_ON_REQUEST, + Shell.WINDOWS ? ContainerExecutor.ExitCode.FORCE_KILLED.getExitCode() : + ContainerExecutor.ExitCode.TERMINATED.getExitCode(), + "Container terminated before launch.")); + return 0; +} {code} This can be moved into it's own function that returns a boolean that we can use to get the same behaviour {code} + localResources = container.getLocalizedResources(); + if (localResources == null) { +throw RPCUtil.getRemoteException( +"Unable to get local resources when Container " + containerID + +" is at " + container.getContainerState()); + } {code} This can be moved into it's own function that returns {code}localResources{code} if it's not null and throws an exception otherwise. {code} + List logDirs = dirsHandler.getLogDirs(); + + List containerLogDirs = new ArrayList<>(); + String relativeContainerLogDir = ContainerLaunch + .getRelativeContainerLogDir(appIdStr, containerIdStr); + for(String logDir : logDirs) { +containerLogDirs.add(logDir + Path.SEPARATOR + relativeContainerLogDir); + } {code} can be moved into it's own function that returns the populated List {code} +StringBuilder diagnosticInfo = +new StringBuilder("Container exited with a non-zero exit code "); +diagnosticInfo.append(ret); +diagnosticInfo.append(". "); +if (ret == ContainerExecutor.ExitCode.FORCE_KILLED.getExitCode() +|| ret == ContainerExecutor.ExitCode.TERMINATED.getExitCode()) { + // If the process was killed, Send container_cleanedup_after_kill and + // just break out of this method. + dispatcher.getEventHandler().handle( + new ContainerExitEvent(containerID, + ContainerEventType.CONTAINER_KILLED_ON_REQUEST, ret, + diagnosticInfo.toString())); + return ret; +} {code} can be moved it's own function. I suspect we can move {code} + // LaunchContainer is a blocking call. We are here almost means the + // container is launched, so send out the event. + dispatcher.getEventHandler().handle(new ContainerEvent( + containerID, + ContainerEventType.CONTAINER_LAUNCHED)); + context.getNMStateStore().storeContainerLaunched(containerID); + + // Check if the container is signalled to be killed. + if (!shouldLaunchContainer.compareAndSet(false, true)) { +LOG.info("Container " + containerIdStr + " not launched as " ++ "cleanup already called"); +ret = ContainerExecutor.ExitCode.TERMINATED.getExitCode(); + } else { +exec.activateContainer(containerID, pidFilePath); +ret = exec.launchContainer(new ContainerStartContext.Builder() +.setContainer(container) +.setLocalizedResources(localResources) +.setNmPrivateContainerScriptPath(nmPrivateContainerScriptPath) +.setNmPrivateTokensPath(nmPrivateTokensPath) +.setUser(container.getUser()) +.setAppId(appIdStr) +.setContainerWorkDir(containerWorkDir) +.setLocalDirs(localDirs) +.setLogDirs(logDirs) +.setContainerLocalDirs(containerLocalDirs) +.setContainerLogDirs(containerLogDirs) +.build()); + } {code} into its own function as well. > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, > YARN-3998.06.patch, YARN-3998.07.patch > > > I'd like to add a
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15214969#comment-15214969 ] Hadoop QA commented on YARN-3998: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 15s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 5 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 11s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 44s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 52s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 9s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 42s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 46s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 55s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 39s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 39s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 2s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 12s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 31s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 48s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 1m 48s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 48s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 8s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 2m 8s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 8s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 43s {color} | {color:red} hadoop-yarn-project/hadoop-yarn: patch generated 23 new + 777 unchanged - 5 fixed = 800 total (was 782) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 41s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 47s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s {color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 21s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 30s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 54s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 20s {color} | {color:red} hadoop-yarn-api in the patch failed with JDK v1.8.0_74. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 54s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_74. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 9m 6s {color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK v1.8.0_74. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 7m 12s {color} | {color:green}
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15214532#comment-15214532 ] Jun Gong commented on YARN-3998: Very sorry for late. I just a attached a new patch 07.patch. In the new patch: 1. It treats RESTART in a first class manner: add new state RELAUNCHING for container, and ContainerRelaunch. 2. ContainerLaunch records and stores container's working directory and log directory, ContainerRelaunch will reuse previous working directory, log directory and container's launch script. If previous directories/files are not available, it will fail to relaunch. We could change this behavior afterwards. 3. Address [~vvasudev]'s all comments to 06.patch. > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, > YARN-3998.06.patch, YARN-3998.07.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15188681#comment-15188681 ] Jun Gong commented on YARN-3998: Yes, it seems we need to deal with platform errors, however it still seems useful for user to specify retry policy that retry on what error codes because only user knows meaning of every error codes. > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15187920#comment-15187920 ] Vinod Kumar Vavilapalli commented on YARN-3998: --- Instead of making it an arbitrary case-by-case per-exit-code behavior, we should have a general policy that is easily understood by users. Namely we should not automatically restart containers for platform errors like mis-configuration of YARN, wrong permissions on yarn directories, absent user-accounts etc. This is similar, again, to how we handled AM failures at a higher level via YARN-614. > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184968#comment-15184968 ] Junping Du commented on YARN-3998: -- bq. We could specify retry policy to RETRY_ON_SPECIFIC_ERROR_CODE to handle this case, the error code might be INITIALIZE_USER_FAILED for the case if "user no found". Or do you mean it is not sufficient to just specify error codes? Theoretically, it sounds ok to make retry policy based on error code. However, I think in some particular case (like "user not found" exception in LinuxContainerExecutor), it could share the same error code like -1000(INVALID) with other cases and we may need more specific info (like parsing error messages). > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184951#comment-15184951 ] Jun Gong commented on YARN-3998: We could specify retry policy to RETRY_ON_SPECIFIC_ERROR_CODE to handle this case, the error code might be INITIALIZE_USER_FAILED for the case if "user no found". Or do you mean it is not sufficient to just specify error codes? > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184952#comment-15184952 ] Jun Gong commented on YARN-3998: We could specify retry policy to RETRY_ON_SPECIFIC_ERROR_CODE to handle this case, the error code might be INITIALIZE_USER_FAILED for the case if "user no found". Or do you mean it is not sufficient to just specify error codes? > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184809#comment-15184809 ] Junping Du commented on YARN-3998: -- Thanks [~hex108] for the patch. In addition, I think we should be specific on failure types for re-luanch policy. e.g. if container get failed with "user no found" (enabling LinuxContainerExecutor), then relaunch container on this node could just be waste of time. > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184167#comment-15184167 ] Jun Gong commented on YARN-3998: Thanks [~vvasudev] and [~vinodkv] for the comments and suggestions. I will update the patch. > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183634#comment-15183634 ] Vinod Kumar Vavilapalli commented on YARN-3998: --- Okay, seems like we are in general agreement. - Let's try to look at the unification part in this JIRA itself as it concerns public facing APIs - even if we don't unify them completely here. - I filed YARN-4770 and assigned to myself. [~hex108] / [~vvasudev], feel free to take them over as you start working on them. - [~vvasudev], please also file a sub-task under YARN-4725 for the LCE related work. Tx. > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182570#comment-15182570 ] Varun Vasudev commented on YARN-3998: - My apologies for not responding earlier. I think creating a branch is not required. We can continue to work off trunk as long as the default behaviour preserves the existing behaviour - i.e. no retries. [~hex108] - what will the complexity of using the AM policy of retry windows for this patch? If it's straight forward, we should do it as part of this patch otherwise we should do it as part of a follow up. Just to recap - what I would prefer is we address the following points(in order) - # Treat restarts in a first class manner - add the state machine changes required # When handling restarts - restart containers with the same local and log dirs(as long as the disks haven't gone bad) # Attempt to unify the AM/Container retry policies if feasible - this can be done as part of a follow up JIRA because it probably requires some discussion. > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15166567#comment-15166567 ] Jun Gong commented on YARN-3998: Thanks [~vinodkv] for explaining it. {quote} My point was mainly about creating and reusing a common policy-framework even if the actual policies may not be entirely reused. We should seriously consider this instead of creating adhoc APIs for custom hard-coded policies. {quote} Yes, it will be better if we could reuse a common policy-framework, we might need discuss it more. {quote} I'm okay creating separate JIRAs under YARN-3998 if you both think of doing so, but treat (some of the above) as blockers for releasing this feature. Given that, does it make sense to work on this in a branch? {quote} I could address these block problems in this issue if needed. [~vvasudev] Could you share your thought please? Thanks. > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159891#comment-15159891 ] Vinod Kumar Vavilapalli commented on YARN-3998: --- [~vvasudev], [~hex108] bq. Unification with AM restart policies My point was mainly about creating and reusing a common _policy-framework_ even if the actual policies may not be entirely reused. We should seriously consider this instead of creating adhoc APIs for custom hard-coded policies. bq. We should also handle restarting of any containers that may have crashed during the NM reboot. bq. The current version of the patch doesn't work with LCE because the NM can't cleanup the launch container script and the tokens(they're owned by the user who submitted the job/nobody) bq. Yes, it will be better to store the log dir and work dir if we aims to make it more accurate. I was thinking to make minimal changes for this feature. I'm okay creating separate JIRAs under YARN-3998 if you both think of doing so, but treat (some of the above) as blockers for releasing this feature. Given that, does it make sense to work on this in a branch? [~asuresh] bq. Would it make sense to add a time dimension. Instead of just specifying a retry count, it might be better to specify something like : "Kill container if it restarts X times in Y seconds". As I was saying, this is similar to AM restarts where we support a sliding window of restarts so as to forget about very-old failures instead of accumulating them for-ever. This was added at YARN-611. > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159810#comment-15159810 ] Arun Suresh commented on YARN-3998: --- By the way, Thanks a ton for raising this [~hex108].. Extremely useful feature to have !! > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159791#comment-15159791 ] Arun Suresh commented on YARN-3998: --- Was spending some time thinking about this.. Would it make sense to add a time dimension. Instead of just specifying a retry count, it might be better to specify something like : "Kill container if it restarts X times in Y seconds". This way, we can distinguish between transient and steady state errors. Most un-recoverable errors are transient (something that pops up during startup etc.) which will cause the container to restart multiple times really fast and thus should be permanently killed. A steady state restart (for eg. a container hosting a web server going down due to some 500 error.. caused due to some weird user request) should not contribute to the retry increment. It might also make sense to extend this as a full fledged Container restart *Strategy*... Similar to what is found in systems written in Erlang (http://erlang.org/doc/design_principles/sup_princ.html) and the Scala Akka supervisor (http://doc.akka.io/docs/akka/snapshot/scala/fault-tolerance.html) Thoughts ? > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159768#comment-15159768 ] Vinod Kumar Vavilapalli commented on YARN-3998: --- Making this a sub-task of YARN-4725 where we can track the remaining items. > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149654#comment-15149654 ] Jun Gong commented on YARN-3998: Sorry for late reply, I was on holiday. Thanks [~vinodkv] and [~vvasudev] for suggestion and review! Some additional thought besides [~vvasudev]'s opinion: {quote} Unification with AM restart policies {quote} I agree with [~vvasudev]. Now AM restart polices is retrying across different nodes, this feature is retrying on local node. When RM launches AM, it could specify local retry policy for it. {quote} Treat relaunch in a first-class manner {quote} Glad to see it to be a first-class manner, I will update the patch. {quote} The following isn’t fool-proof and won’t work for all apps, can we just persist and read the selected log-dir from the state-store? ContainerLaunch.handleContainerExitWithFailure() needs to handled differently during container-relaunches. The same can be done for the work-dir. All of these are related. If we store the log dir and work dir in the state store, we can address all 3 of these. {quote} Yes, it will be better to store the log dir and work dir if we aims to make it more accurate. I was thinking to make minimal changes for this feature. {quote} In fact, if we end up changing the work-dir during relaunch due to a bad-dir, that may result in a breakage for the app. Apps may be reading from / writing into the work-dir and changing it during relaunch may invalidate application's assumptions. Should we just fail the container completely and let the AM deal with it? {quote} My thought is that if user specifies retry policy on container, the user should make sure that container could deal with this situation. {quote} Instead of removing a line and setting the limit to 10*1000, take the last 'n' characters in the string where 'n' is a config setting. {quote} It might make the diagnostics not consistent to remove the last n characters, suppose the diagnostics is “The exception is ” and there is n characters in XXX, the diagnositics becomes “The exception is”. There is similar problem by removing first or last n lines. How about removing previous attempts' error information and just keeping the latest attempt's information? Glad to see more discussion about the feature. > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15134366#comment-15134366 ] Varun Vasudev commented on YARN-3998: - Thanks for your comments Vinod. Overall, I mostly agree with your thoughts - it might make sense to file an umbrella JIRA to add first class container retry support and have this become a sub-task. With regards to your specific comments - bq. Unification with AM restart policies This should be taken up as a different JIRA - there are significant differences between AM restarts and container restarts - the most important one being that AM restarts can occur on different nodes(which is probably the desired behavior) but container relaunches by definition have to be on the same node. We might be able to re-use the notion of retrying on specific error codes, but it probably requires some more discussion. bq. Treat relaunch in a first-class manner Makes sense - [~hex108] - can you incorporate these into your next patch? bq. The relaunch feature needs to work across NM restarts, so we should save the retry-context and policy per container into the state-store and reload it for continue relaunching after NM restart. The container retry policy details are already stored in the state-store as part of the ContainerLaunchContext bq. We should also handle restarting of any containers that may have crashed during the NM reboot. This should be handled as a separate JIRA and not as part of this one - especially because we won't have the reason the container crashed if it crashed during the NM reboot. bq. The following isn’t fool-proof and won’t work for all apps, can we just persist and read the selected log-dir from the state-store? bq. ContainerLaunch.handleContainerExitWithFailure() needs to handled differently during container-relaunches. bq. The same can be done for the work-dir. All of these are related. If we store the log dir and work dir in the state store, we can address all 3 of there. [~hex108] - what do you think? bq.In fact, if we end up changing the work-dir during relaunch due to a bad-dir, that may result in a breakage for the app. Apps may be reading from / writing into the work-dir and changing it during relaunch may invalidate application's assumptions. Should we just fail the container completely and let the AM deal with it? I suspect we'll have to add an additional flag which the AMs will have to set to decide the behaviour. There are many valid cases where containers can store state on mounted volumes and don't rely on the YARN local dirs. On the other hand, like you pointed out in some cases, we might want to abandon the relaunch. I'm not sure if this should be done as part of this JIRA or another one. [~hex108] with regards to your latest patch - 1) Can you change RETRY_ON_SPECIFIC_ERROR_CODE to RETRY_ON_SPECIFIC_ERROR_CODES 2) {code} +if (diagnostics.length() > DIAGNOSTICS_MAX_SIZE) { + // delete first line + int i = diagnostics.indexOf("\n"); + if (i != -1) { +diagnostics.delete(0, i + 1); + } +} {code} Instead of this - can you please change the function to a) Only trim the diagnostics string size in case of a relaunch b) Instead of removing a line and setting the limit to 10*1000, take the last 'n' characters in the string where 'n' is a config setting. 3) {code} +} catch (IOException e) { + LOG.error("Failed to delete " + relativePath, e); +} {code} and {code} +} catch (IOException e) { + LOG.error("Failed to delete container script file and token file", e); +} + Change LOG.error to LOG.warn 4) The current version of the patch doesn't work with LCE because the NM can't cleanup the launch container script and the tokens(they're owned by the user who submitted the job/nobody). To fix this requires a change in the container-executor binary, which will involve passing a flag to the container-executor binary to relaunch a container. [~vinodkv] - what do you think we should do here - hold off committing this patch until we have support for relaunch in container-executor or commit it documenting the fact that it'll only work with DefaultContainerExecutor and file a follow up JIRA? > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15133367#comment-15133367 ] Vinod Kumar Vavilapalli commented on YARN-3998: --- Thanks for working on this [~hex108]! I've been meaning to file this JIRA with a slighly larger scope, but glad to see this moving. The following has a list of additional things that I wanted in this feature. But, if you / [~vvasudev] want progress on the current patch as it stands, I can go file additional tickets as needed. h4. Unification with AM restart policies - We should try to unify the retry-policies you are adding with the ones for AM container / app-attempt itself: See {{ApplicationSubmissionContext. getMaxAppAttempts()}} / {{getAttemptFailuresValidityInterval()}}. I like your policy framework, so may be (in a followup?) use the same framework for AMs. /cc [~xgong] - To avoid containers crashing and restarting in no time, we should have a global min-retry-interval. See YARN-3669 which does the same for AMs. - Also, similar to AM restarts, we should support a sliding window of restarts so as to forget about very-old failures instead of accumulating them for-ever. h4. Treat relaunch in a first-class manner - It's surprising we don't recognize the relaunches in a first-class manner. - I really don't like lots of if-else conditions everywhere. So, instead of using Container.isRelaunch(), I think we should have — an explicit RELAUNCHING state instead of simply going back to LOCALIZED state. — An additional ContainersLauncherEventType.RELAUNCH_CONTAINER — And a new {{ContainerRelaunch}} callable which overrides the behavior of ContainerLaunch. h4. Additional things needed during relaunch - The relaunch feature needs to work across NM restarts, so we should save the retry-context and policy per container into the state-store and reload it for continue relaunching after NM restart. - We should also handle restarting of any containers that may have crashed during the NM reboot. h4. Comments on the current approach and patch. - The following isn’t fool-proof and won’t work for all apps, can we just persist and read the selected log-dir from the state-store? {code} + * We apply a simple heuristic to find its previous working directory: + * if a good work dir with file pattern '*out' (e.g. stdout) already exists, {code} - The same can be done for the work-dir. - In fact, if we end up changing the work-dir during relaunch due to a bad-dir, that may result in a breakage for the app. Apps may be reading from / writing into the work-dir and changing it during relaunch may invalidate application's assumptions. Should we just fail the container completely and let the AM deal with it? - ContainerLaunch.handleContainerExitWithFailure() needs to handled differently during container-relaunches. There are other minor things I saw but they can wait for these things. > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130109#comment-15130109 ] Jun Gong commented on YARN-3998: [~vvasudev], I just attached a new patch to address above problems. Thanks for review. 1) When finding container's previous working directory and log directory, just locate corresponding files in good directories which could be read/write and not full. 2) Limiting diagnostic message's message to 1 bytes. If the length is greater than it, delete the first line whose separator is "\n". 3) After some container retries, env variable *MALLOC_ARENA_MAX=$MALLOC_ARENA_MAX*(DEFAULT_NM_ADMIN_USER_ENV) will be expanded to *MALLOC_ARENA_MAX=::*(a lot of ":"). I fixed it in *Apps#addToEnvironment*. > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130229#comment-15130229 ] Hadoop QA commented on YARN-3998: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 5 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 37s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 38s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 27s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 23s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 38s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 58s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 52s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 27s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 50s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 24s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 33s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 36s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 20s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 2m 20s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 20s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 30s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 2m 30s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 30s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 40s {color} | {color:red} hadoop-yarn-project/hadoop-yarn: patch generated 4 new + 522 unchanged - 4 fixed = 526 total (was 526) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 2s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 48s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s {color} | {color:red} The patch has 4 line(s) that end in whitespace. Use git apply --whitespace=fix. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 22s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 45s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 23s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 24s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 19s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 8m 47s {color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 6m 40s {color} | {color:green}
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127929#comment-15127929 ] Jun Gong commented on YARN-3998: Thanks [~vvasudev] for the detailed review and comments. I will update the patch. {quote} 9) In ContainerLaunch.java, in getContainerWorkDir, we should use dirsHandler.getLocalPathForWrite instead of dirsHandler.getLocalPathForRead. {quote} *getLocalPathForWrite* is used for allocating a new directory, I just want to search container's token file in all log paths in *getContainerLogDir*, so I use *getLocalPathForRead*. If the token file is not found, a new directory will be allocated by *getLocalPathForWrite*. {quote} A couple of additional thoughts - I would like to restart containers for kill and term signals which don't originate from YARN and handle scenarios where disks have gone bad and resources need to be relocalized but both those cases can be handled as follow up JIRAs. {quote} OK. I will create another JIRA to handle these cases. > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127952#comment-15127952 ] Jun Gong commented on YARN-3998: Thanks for clarifying. Yes, it will be a problem, I will handle it. > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127937#comment-15127937 ] Varun Vasudev commented on YARN-3998: - {quote} getLocalPathForWrite is used for allocating a new directory, I just want to search container's token file in all log paths in getContainerLogDir, so I use getLocalPathForRead. If the token file is not found, a new directory will be allocated by getLocalPathForWrite. {quote} The issue with getLocalPathForRead is that it will return true even if the disk is full. We need to avoid re-using a directory that's full. > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127734#comment-15127734 ] Varun Vasudev commented on YARN-3998: - A couple of additional thoughts - I would like to restart containers for kill and term signals which don't originate from YARN and handle scenarios where disks have gone bad and resources need to be relocalized but both those cases can be handled as follow up JIRAs. > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127732#comment-15127732 ] Varun Vasudev commented on YARN-3998: - My apologies for the slow response [~hex108]. Thank you for your patience. Comments on the latest patch 1) The patch doesn't apply cleanly on trunk anymore 2) In ContainerLaunchContext.java, change "Retry strategy when container fails to run.” to "Retry strategy when container exits with failure." 3) In yarn_protos.proto and ContainerRetryPolicy.java, rename ALWAYS_RETRY to RETRY_ON_ALL_ERRORS 4) In ApplicationMaster.java and Client.java, rename "retry_policy" to "container_retry_policy", "error_codes" to "container_retry_error_codes", "max_retries" to "container_max_retries", and "retry_interval" to "container_retry_interval" 5) In ContainerImpl.java, change {code} + LOG.info("Relaunch Container " + container.getContainerId() ++ ". Remain retry attempts : " + container.remainingRetryAttempts ++ ". Retry interval is " ++ container.containerRetryContext.getRetryInterval() + "ms"); +container.isRelaunch = true; {code} to {code} + LOG.info("Relaunching container " + container.getContainerId() ++ ". Remaining retry attempts(after relaunch) : " + container.remainingRetryAttempts ++ ". Interval between retries is " ++ container.containerRetryContext.getRetryInterval() + "ms"); +container.isRelaunch = true; {code} 6) In ContainerImpl.java, can we avoid creating a new thread if the interval is 0? {code} +new Thread() { + @Override + public void run() { +try { + Thread.sleep(container.containerRetryContext.getRetryInterval()); + container.sendLaunchEvent(); +} catch (InterruptedException e) { + return; +} + } +}.start(); {code} 7) In ContainerLaunch.java, change "If container is in relaunch" to "If container is being relaunched" 8) In ContainerLaunch.java, rename container1 to targetContainer 9) In ContainerLaunch.java, in getContainerWorkDir, we should use dirsHandler.getLocalPathForWrite instead of dirsHandler.getLocalPathForRead. As a result, {code} + public Path getLocalPathForRead(String pathStr) throws IOException { +return getPathToRead(pathStr, getLocalDirsForRead()); + } {code} is not required 10) In ContainerLaunch.java, in getContainerLogDir, instead of matching stdout exactly, can we use FileSystem.globStatus and match *out instead? 11) In the current version of the patch, the diagnosticInfo String is unbounded and there is no indication as to which attempt the diagnosticInfo came from. Can you update the patch to - put some seperator(like "Diagnostic message from attempt:") in the diagnostic message - limit the size of the diagnostic message to some upper bound > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15125597#comment-15125597 ] Jun Gong commented on YARN-3998: hi [~vvasudev], could you please help review the latest patch if you have time? Thanks! > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15101922#comment-15101922 ] Hadoop QA commented on YARN-3998: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 5 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 36s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 42s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 44s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 2s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 34s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 50s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 53s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 56s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 34s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 59s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 35s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 27s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 41s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 1m 41s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 41s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 2s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 2m 2s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 2s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 33s {color} | {color:red} Patch generated 4 new checkstyle issues in hadoop-yarn-project/hadoop-yarn (total was 530, now 530). {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 41s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 43s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 18s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 28s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 52s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 21s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 55s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 8m 38s {color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 6m 47s {color} | {color:green} hadoop-yarn-applications-distributedshell in the patch
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15101483#comment-15101483 ] Hadoop QA commented on YARN-3998: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 5 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 54s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 39s {color} | {color:green} trunk passed {color} | | {color:red}-1{color} | {color:red} compile {color} | {color:red} 0m 41s {color} | {color:red} root in trunk failed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} compile {color} | {color:red} 1m 56s {color} | {color:red} root in trunk failed with JDK v1.7.0_91. {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 6s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 9m 21s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 41s {color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s {color} | {color:blue} Skipped branch modules with no Java source: . {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 0s {color} | {color:green} trunk passed {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 17s {color} | {color:red} root in trunk failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 9m 7s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 48s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 34s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} compile {color} | {color:red} 0m 47s {color} | {color:red} root in the patch failed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} cc {color} | {color:red} 0m 47s {color} | {color:red} root in the patch failed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 0m 47s {color} | {color:red} root in the patch failed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} compile {color} | {color:red} 5m 40s {color} | {color:red} root in the patch failed with JDK v1.7.0_91. {color} | | {color:red}-1{color} | {color:red} cc {color} | {color:red} 5m 40s {color} | {color:red} root in the patch failed with JDK v1.7.0_91. {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 5m 40s {color} | {color:red} root in the patch failed with JDK v1.7.0_91. {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 1m 6s {color} | {color:red} Patch generated 8 new checkstyle issues in root (total was 605, now 609). {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 9m 30s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 37s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} shellcheck {color} | {color:green} 0m 9s {color} | {color:green} There were no new shellcheck issues. {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s {color} | {color:red} The patch has 49 line(s) that end in whitespace. Use git apply --whitespace=fix. {color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s {color} | {color:blue} Skipped patch modules with no Java source: . {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 0s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 17s {color} | {color:red} root in the patch failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 9m 5s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 4m 20s {color} | {color:red} root in the patch
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15101552#comment-15101552 ] Hadoop QA commented on YARN-3998: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 5 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 44s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 39s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 47s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 6s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 33s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 50s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 50s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 54s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 35s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 53s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 35s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 28s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 51s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 1m 51s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 51s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 6s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 2m 6s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 6s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 32s {color} | {color:red} Patch generated 8 new checkstyle issues in hadoop-yarn-project/hadoop-yarn (total was 530, now 534). {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 41s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 44s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 31s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 30s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 53s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 21s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 56s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 8m 52s {color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 6m 59s {color} | {color:green} hadoop-yarn-applications-distributedshell in the patch
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15101383#comment-15101383 ] Jun Gong commented on YARN-3998: [~vvasudev] Thanks for the detailed review and suggestions! I just attached a new patch to address above problems. {quote} In your implementation, the relaunched container will go through the launchContainer call which will try to setup the container launch environment again(like creating the local and log dirs, creating tokens, etc). Won't this lead to FileAlreadyExistsException being thrown as part of the launchContainer call? In addition, this also means that on a node with more than one local dir, different attempts could get allocated to different local dirs. I wonder if it's better to move the retry logic into the launchContainer function instead of adding a new state transition? {quote} The reason for adding a new state transition are as following: 1. Between retry interval, container is not running actually, it seems more reasonable to make it in LOCALIZED state. 2. For NM restart, it will not be enough to just add retry logic into *ContainerLaunch#call()*. When NM restart, it will call *RecoveredContainerLaunch#call*, then we also need add retry logic at this place, otherwise container might exit with failure with no retry. The logic seems more clear to add a state transition, and avoids duplicated codes. In order to avoid FileAlreadyExistsException, I add some code(*cleanupContainerFilesForRelaunch*) to cleanup files(token file and launch script). We also need to cleanup previous PID file, NM will try to get PID through this file when NM restart. In order to use same container working directory and log directory, we need to record these path, and need to store these path to NMStateStore for NM restart case. According to [~vvasudev]'s suggestion, we use a simple heuristic – if a good work directory with the container tokens file already exists, use that directory otherwise use a new one. That way we don’t need to worry about storing the directories in the state store. However there is not a file likes 'tokens file' for log directory, so we use the file 'stdout' as this kind of file. We assume there is 'stdout' in most containers' log direcotry. > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15087392#comment-15087392 ] Varun Vasudev commented on YARN-3998: - Thanks for the patch [~hex108]! In your implementation, the relaunched container will go through the launchContainer call which will try to setup the container launch environment again(like creating the local and log dirs, creating tokens, etc). Won't this lead to FileAlreadyExistsException being thrown as part of the launchContainer call? In addition, this also means that on a node with more than one local dir, different attempts could get allocated to different local dirs. I wonder if it's better to move the retry logic into the launchContainer function instead of adding a new state transition? Some feedback on the code itself - 1) Rename ContainerRetry to ContainerRetryContext 2) {code} {@link ContainerRetryPolicy} : NEVER_RETRY(no matter what error code is + * when container fails to run, just do not retry), ALWAYS_RETRY(no matter + * what error code is, when container fails to run, just retry), + * RETRY_ON_SPECIFIC_ERROR_CODE(when container fails to run, do retry if the + * error code is one of errorCodes, otherwise do not retry. + * Note: if error code is 137(SIGKILL) or 143(SIGTERM), it will not retry + * because it is usually killed on purpose. {code} Specify that the default policy is NEVER_RETRY 3) Rename retryTimes to maxRetries 4) Change the interval unit to ms instead of seconds 5) {code} + // remain retries to relaunch container if needed + private int remainRetries; {code} Rename remainRetries to remainingRetryAttempts 6) {code} +if (launchContext != null) { + this.containerRetry = launchContext.getContainerRetry(); + if (this.containerRetry != null) { +remainRetries = containerRetry.getRetryTimes(); + } +} else { + this.containerRetry = null; +} {code} Instead of setting containerRetry to null, can we initialize to a retry object with never_retry 7) Rename RetryWithFailureTransition to RetryFailureTransition 8) Change {code} +LOG.info("Relaunch Container " + container.getContainerId() ++ ". Remain retry times : " + container.remainRetries ++ ". Retry interval is " ++ container.containerRetry.getRetryInterval() + "s"); {code} to {code} +LOG.info("Relaunching container " + container.getContainerId() ++ ". Remaining retry attempts : " + container.remainRetries ++ ". Retry interval is " ++ container.containerRetry.getRetryInterval() + "s"); {code} 9) Rename storeContainerRemainRetries to storeContainerRemainingRetryAttempts > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072022#comment-15072022 ] Jun Gong commented on YARN-3998: Attach a new patch to fix above errors, add test cases. > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072026#comment-15072026 ] Hadoop QA commented on YARN-3998: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 5 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 28s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 45s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 4s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 26s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 47s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 53s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 55s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 33s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 51s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 38s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 44s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 1m 44s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 44s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 5s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 2m 5s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 5s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 26s {color} | {color:red} Patch generated 4 new checkstyle issues in hadoop-yarn-project/hadoop-yarn (total was 540, now 540). {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 49s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 51s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 27s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 34s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 52s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 23s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 54s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 36s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 6m 46s {color} | {color:green} hadoop-yarn-applications-distributedshell in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 24s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_91. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 9s
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15071166#comment-15071166 ] Hadoop QA commented on YARN-3998: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 7s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 0s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 15s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 30s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 55s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 52s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 10s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 39s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 5s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 43s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 59s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 1m 59s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 59s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 17s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 2m 17s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 17s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 29s {color} | {color:red} Patch generated 5 new checkstyle issues in hadoop-yarn-project/hadoop-yarn (total was 456, now 458). {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 53s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 52s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 35s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common introduced 3 new FindBugs issues. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 41s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 8s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 24s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 2m 1s {color} | {color:red} hadoop-yarn-common in the patch failed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 8m 53s {color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 6m 53s {color} | {color:red} hadoop-yarn-applications-distributedshell in the patch failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 27s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_91. {color} | | {color:red}-1{color} | {color:red} unit
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15071113#comment-15071113 ] Jun Gong commented on YARN-3998: Sorry for late. I just attached a patch for review, will add test cases later. In the patch, I add *ContainerRetry* to specify container’s retry strategy when container fails to run, *ContainerRetry* includes 4 fields: * *ContainerRetryPolicy*: it has three policies: *NEVER_RETRY*(no matter what error code is when container fails to run, just do not retry), *ALWAYS_RETRY*(no matter what error code is, when container fails to run, just retry), *RETRY_ON_SPECIFIC_ERROR_CODE*(when container fails to run, do retry if the error code is one of errorCodes, otherwise do not retry). * *errorCodes* is described as above. * *retryTimes* specifies how many times to retry if need to retry, if the value is -1, it means retrying forever. * *retryInterval* specifies delaying some time before relaunch container, the unit is seconds. I store container's remain retry times to NMStateStore to keep it across NM restart. And I modified distributed shell to support container retry, it helps a lot for testing. > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052043#comment-15052043 ] Jun Gong commented on YARN-3998: [~vvasudev] Thanks for the attention and suggestion. We have a patch for it, it implemented the policy "restart on all errors". Will attach a patch if it is OK. > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Jun Gong >Assignee: Jun Gong > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051024#comment-15051024 ] Varun Vasudev commented on YARN-3998: - [~hex108] - do you wish to work on this? Do you have a patch that adds support for this? > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Jun Gong >Assignee: Jun Gong > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049050#comment-15049050 ] Varun Vasudev commented on YARN-3998: - I'm thinking of re-opening this issue because we've seen a use case for this - long running services which (ideally) shouldn't lose local data if the service crashes. Some pros about supporting some forms of restart policies on the NM - 1. Retry policies can be unified instead of every application having to re-implement their own. 2. Faster restarts - instead of the NM reaching out to the AM and then deciding what to do(and maintaining the container work dir), it can make an immediate decision. It's also an easier change to make - if the NMs need to talk to the AMs to decide whether to restart a container - we'll probably need a new state transition. Instead if we allow the AMs to specify a restart policy, the NM can make an immediate decision as soon as the container exits. 3. Similar to what Jun mentioned - when running Docker containers, it's useful to be able to restart containers that exit with an error code. When I say restart policies - off the top of my head - I can think of 3 policies - never restart(default), restart on all errors, restart on specific error codes. [~jlowe], [~steve_l] - do you guys still feel that this should be done at the app level(and essentially re-implemented by every app)? > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Jun Gong >Assignee: Jun Gong > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14651887#comment-14651887 ] Jun Gong commented on YARN-3998: Thanks [~jlowe] and [~ste...@apache.org] for the detailed explanation. Yes, app will have more control when doing it at the app level. Our specific cases: 1. Container generates several large files(total size is 20G~50G) in the working directory. If container exits for some reason, it need re-generate those files, and container's service will be interrupted for a long time. 2. When running Docker container, docker image might be very large and pulled for a long time. If the container fails to run and re-scheduled to another host, NM might need pull the image again. Add retry-times to let NM re-launch container when it fails to run -- Key: YARN-3998 URL: https://issues.apache.org/jira/browse/YARN-3998 Project: Hadoop YARN Issue Type: New Feature Reporter: Jun Gong Assignee: Jun Gong I'd like to add a field(retry-times) in ContainerLaunchContext. When AM launches containers, it could specify the value. Then NM will re-launch the container 'retry-times' times when it fails to run(e.g.exit code is not 0). It will save a lot of time. It avoids container localization. RM does not need to re-schedule the container. And local files in container's working directory will be left for re-use.(If container have downloaded some big files, it does not need to re-download them when running again.) We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14649347#comment-14649347 ] Jason Lowe commented on YARN-3998: -- I think it's more effort for YARN to support this than it is for apps to do this on their own. For apps this can be a simple shell script placed in front of the normal container launch command, and that's very simple and straightforward to do. Also if the app controls this directly then it is free to do all sorts of interesting, app-specific retry scenarios, e.g.: only retry on certain exit codes but not others, only retry if certain files have or have not been created in the local filesystem, etc. However I may be in the minority on this, and it would be interesting to hear what others think. Add retry-times to let NM re-launch container when it fails to run -- Key: YARN-3998 URL: https://issues.apache.org/jira/browse/YARN-3998 Project: Hadoop YARN Issue Type: New Feature Reporter: Jun Gong Assignee: Jun Gong I'd like to add a field(retry-times) in ContainerLaunchContext. When AM launches containers, it could specify the value. Then NM will re-launch the container 'retry-times' times when it fails to run(e.g.exit code is not 0). It will save a lot of time. It avoids container localization. RM does not need to re-schedule the container. And local files in container's working directory will be left for re-use.(If container have downloaded some big files, it does not need to re-download them when running again.) We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14649089#comment-14649089 ] Jun Gong commented on YARN-3998: [~jlowe] Thanks for the comment. I agree with you. I am not sure how many applications need this feature. If the number is large, it might be better to put this functionality in YARN, then applications themselves do not need add this. In our system, we write a new AM and add this feature in YARN, then we find Storm also need it. Add retry-times to let NM re-launch container when it fails to run -- Key: YARN-3998 URL: https://issues.apache.org/jira/browse/YARN-3998 Project: Hadoop YARN Issue Type: New Feature Reporter: Jun Gong Assignee: Jun Gong I'd like to add a field(retry-times) in ContainerLaunchContext. When AM launches containers, it could specify the value. Then NM will re-launch the container 'retry-times' times when it fails to run(e.g.exit code is not 0). It will save a lot of time. It avoids container localization. RM does not need to re-schedule the container. And local files in container's working directory will be left for re-use.(If container have downloaded some big files, it does not need to re-download them when running again.) We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647755#comment-14647755 ] Jason Lowe commented on YARN-3998: -- Is this really a feature that YARN needs to provide? To me this is basically a case of container re-use which the application itself can control. A primitive example would be an application that launches a container that wraps the real task in a wrapper shell script or Java program that spawns the real task and will respawn it some number of times if the real task fails before failing the entire container. I'm not sure YARN is the best place to put this functionality. Add retry-times to let NM re-launch container when it fails to run -- Key: YARN-3998 URL: https://issues.apache.org/jira/browse/YARN-3998 Project: Hadoop YARN Issue Type: New Feature Reporter: Jun Gong Assignee: Jun Gong I'd like to add a field(retry-times) in ContainerLaunchContext. When AM launches containers, it could specify the value. Then NM will re-launch the container 'retry-times' times when it fails to run(e.g.exit code is not 0). It will save a lot of time. It avoids container localization. RM does not need to re-schedule the container. And local files in container's working directory will be left for re-use.(If container have downloaded some big files, it does not need to re-download them when running again.) We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)