[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

Varun Vasudev (JIRA) Mon, 04 Apr 2016 22:38:41 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225716#comment-15225716
 ]


Varun Vasudev commented on YARN-3998:
-------------------------------------

Thanks for the patch [~hex108]

# {code}+      List<String> containerLocalDirs = new 
ArrayList<>(localDirs.size());{code}
containerLocalDirs is created but never populated
# In ContainerRelaunch#call, there's a lot of duplicate code copied from 
ContainerLaunch#call. Can you please refactor the code in ContainerLaunch#call 
so that it can be re-used?

>From my analysis, these pieces can be moved into their own functions are 
>called from ContainerLaunch#call and ContainerRelaunch#call

{code}
+    // CONTAINER_KILLED_ON_REQUEST should not be missed if the container
+    // is already at KILLING
+    if (container.getContainerState() == ContainerState.KILLING) {
+      dispatcher.getEventHandler().handle(
+          new ContainerExitEvent(containerID,
+              ContainerEventType.CONTAINER_KILLED_ON_REQUEST,
+              Shell.WINDOWS ? 
ContainerExecutor.ExitCode.FORCE_KILLED.getExitCode() :
+                  ContainerExecutor.ExitCode.TERMINATED.getExitCode(),
+              "Container terminated before launch."));
+      return 0;
+    }
{code}
This can be moved into it's own function that returns a boolean that we can use 
to get the same behaviour

{code}
+      localResources = container.getLocalizedResources();
+      if (localResources == null) {
+        throw RPCUtil.getRemoteException(
+            "Unable to get local resources when Container " + containerID +
+                " is at " + container.getContainerState());
+      }
{code}
This can be moved into it's own function that returns 
{code}localResources{code} if it's not null and throws an exception otherwise.

{code}
+      List<String> logDirs = dirsHandler.getLogDirs();
+
+      List<String> containerLogDirs = new ArrayList<>();
+      String relativeContainerLogDir = ContainerLaunch
+          .getRelativeContainerLogDir(appIdStr, containerIdStr);
+      for(String logDir : logDirs) {
+        containerLogDirs.add(logDir + Path.SEPARATOR + 
relativeContainerLogDir);
+      }
{code}
can be moved into it's own function that returns the populated List

{code}
+    StringBuilder diagnosticInfo =
+        new StringBuilder("Container exited with a non-zero exit code ");
+    diagnosticInfo.append(ret);
+    diagnosticInfo.append(". ");
+    if (ret == ContainerExecutor.ExitCode.FORCE_KILLED.getExitCode()
+        || ret == ContainerExecutor.ExitCode.TERMINATED.getExitCode()) {
+      // If the process was killed, Send container_cleanedup_after_kill and
+      // just break out of this method.
+      dispatcher.getEventHandler().handle(
+          new ContainerExitEvent(containerID,
+              ContainerEventType.CONTAINER_KILLED_ON_REQUEST, ret,
+              diagnosticInfo.toString()));
+      return ret;
+    }
{code}
can be moved it's own function.
I suspect we can move
{code}
+      // LaunchContainer is a blocking call. We are here almost means the
+      // container is launched, so send out the event.
+      dispatcher.getEventHandler().handle(new ContainerEvent(
+          containerID,
+          ContainerEventType.CONTAINER_LAUNCHED));
+      context.getNMStateStore().storeContainerLaunched(containerID);
+
+      // Check if the container is signalled to be killed.
+      if (!shouldLaunchContainer.compareAndSet(false, true)) {
+        LOG.info("Container " + containerIdStr + " not launched as "
+            + "cleanup already called");
+        ret = ContainerExecutor.ExitCode.TERMINATED.getExitCode();
+      } else {
+        exec.activateContainer(containerID, pidFilePath);
+        ret = exec.launchContainer(new ContainerStartContext.Builder()
+            .setContainer(container)
+            .setLocalizedResources(localResources)
+            .setNmPrivateContainerScriptPath(nmPrivateContainerScriptPath)
+            .setNmPrivateTokensPath(nmPrivateTokensPath)
+            .setUser(container.getUser())
+            .setAppId(appIdStr)
+            .setContainerWorkDir(containerWorkDir)
+            .setLocalDirs(localDirs)
+            .setLogDirs(logDirs)
+            .setContainerLocalDirs(containerLocalDirs)
+            .setContainerLogDirs(containerLogDirs)
+            .build());
+      }
{code}
into its own function as well.

> Add retry-times to let NM re-launch container when it fails to run
> ------------------------------------------------------------------
>
>                 Key: YARN-3998
>                 URL: https://issues.apache.org/jira/browse/YARN-3998
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, 
> YARN-3998.06.patch, YARN-3998.07.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

Reply via email to