[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225716#comment-15225716 ]
Varun Vasudev commented on YARN-3998: ------------------------------------- Thanks for the patch [~hex108] # {code}+ List<String> containerLocalDirs = new ArrayList<>(localDirs.size());{code} containerLocalDirs is created but never populated # In ContainerRelaunch#call, there's a lot of duplicate code copied from ContainerLaunch#call. Can you please refactor the code in ContainerLaunch#call so that it can be re-used? >From my analysis, these pieces can be moved into their own functions are >called from ContainerLaunch#call and ContainerRelaunch#call {code} + // CONTAINER_KILLED_ON_REQUEST should not be missed if the container + // is already at KILLING + if (container.getContainerState() == ContainerState.KILLING) { + dispatcher.getEventHandler().handle( + new ContainerExitEvent(containerID, + ContainerEventType.CONTAINER_KILLED_ON_REQUEST, + Shell.WINDOWS ? ContainerExecutor.ExitCode.FORCE_KILLED.getExitCode() : + ContainerExecutor.ExitCode.TERMINATED.getExitCode(), + "Container terminated before launch.")); + return 0; + } {code} This can be moved into it's own function that returns a boolean that we can use to get the same behaviour {code} + localResources = container.getLocalizedResources(); + if (localResources == null) { + throw RPCUtil.getRemoteException( + "Unable to get local resources when Container " + containerID + + " is at " + container.getContainerState()); + } {code} This can be moved into it's own function that returns {code}localResources{code} if it's not null and throws an exception otherwise. {code} + List<String> logDirs = dirsHandler.getLogDirs(); + + List<String> containerLogDirs = new ArrayList<>(); + String relativeContainerLogDir = ContainerLaunch + .getRelativeContainerLogDir(appIdStr, containerIdStr); + for(String logDir : logDirs) { + containerLogDirs.add(logDir + Path.SEPARATOR + relativeContainerLogDir); + } {code} can be moved into it's own function that returns the populated List {code} + StringBuilder diagnosticInfo = + new StringBuilder("Container exited with a non-zero exit code "); + diagnosticInfo.append(ret); + diagnosticInfo.append(". "); + if (ret == ContainerExecutor.ExitCode.FORCE_KILLED.getExitCode() + || ret == ContainerExecutor.ExitCode.TERMINATED.getExitCode()) { + // If the process was killed, Send container_cleanedup_after_kill and + // just break out of this method. + dispatcher.getEventHandler().handle( + new ContainerExitEvent(containerID, + ContainerEventType.CONTAINER_KILLED_ON_REQUEST, ret, + diagnosticInfo.toString())); + return ret; + } {code} can be moved it's own function. I suspect we can move {code} + // LaunchContainer is a blocking call. We are here almost means the + // container is launched, so send out the event. + dispatcher.getEventHandler().handle(new ContainerEvent( + containerID, + ContainerEventType.CONTAINER_LAUNCHED)); + context.getNMStateStore().storeContainerLaunched(containerID); + + // Check if the container is signalled to be killed. + if (!shouldLaunchContainer.compareAndSet(false, true)) { + LOG.info("Container " + containerIdStr + " not launched as " + + "cleanup already called"); + ret = ContainerExecutor.ExitCode.TERMINATED.getExitCode(); + } else { + exec.activateContainer(containerID, pidFilePath); + ret = exec.launchContainer(new ContainerStartContext.Builder() + .setContainer(container) + .setLocalizedResources(localResources) + .setNmPrivateContainerScriptPath(nmPrivateContainerScriptPath) + .setNmPrivateTokensPath(nmPrivateTokensPath) + .setUser(container.getUser()) + .setAppId(appIdStr) + .setContainerWorkDir(containerWorkDir) + .setLocalDirs(localDirs) + .setLogDirs(logDirs) + .setContainerLocalDirs(containerLocalDirs) + .setContainerLogDirs(containerLogDirs) + .build()); + } {code} into its own function as well. > Add retry-times to let NM re-launch container when it fails to run > ------------------------------------------------------------------ > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task > Reporter: Jun Gong > Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, > YARN-3998.06.patch, YARN-3998.07.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)