[jira] [Commented] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers

Varun Vasudev (JIRA) Thu, 08 Sep 2016 04:09:30 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-5620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15473563#comment-15473563
 ]


Varun Vasudev commented on YARN-5620:
-------------------------------------

Thanks for the patch [~asuresh]!

1)
{code}
-            containermanager.container.ContainerState.RUNNING)) {
+            containermanager.container.ContainerState.RUNNING)
+        || container.isReInitializing()) {
{code}

Minor nit - can we add a function called container.isRunning and move the 
running state check into that function? Then the function becomes 
container.isRunning() || container.isReInitializing()

2)
{code}
+  private void preUpgradeCheck(ContainerId containerId, String op)
{code}

Maybe switch to enum instead of String for the op?

3)
{code}
+        container.launchContext = container.reInitContext.newLaunchContext;
+        container.resourceSet.merge(container.reInitContext.resourceSet);
+
+        container.sendLaunchEvent();
{code}

Instead of a launch event we should send a relaunch event - the relaunch takes 
care of trying to run in same work dir as the earlier attempt, etc

4)
{code}
+    public void transition(ContainerImpl container, ContainerEvent event) {
+      container.reInitContext = createReInitContext(container, event);
{code}
Should there be a guard against calling reint if a reinit is already in 
progress? Could we end up with the ReInitContext in odd state?

5)
{code}
+        List<String> l = resourceSet.resourceLocalized(
+            rsrcEvent.getResource(), rsrcEvent.getLocation());
+        if (l != null) {
+          links.addAll(l);
+        }
{code}
Do we need to de-dup here? It’s possible that the same link gets added twice?

6)
{quote}
How does AM determine whether the upgrade is successful (like what kind signal 
should AM depend on)? I feel once the container starts running, even for AM, 
it's hard to distinguish whether the failure is caused by upgrade or runtime. 
IMO, if container fails to launch on upgrade, it should be considered as 
upgrade failure. Once the container starts running, if the container fails, it 
can be considered as runtime failure. If user does want to rollback, user call 
the upgardeContainer/rollback command again to roll back.
{quote}

I think both [~jianhe] and [~asuresh] raise valid points. I think an explicit 
commit API(with auto-commit option being the default option) should satisfy 
both use cases.

> Core changes in NodeManager to support for upgrade and rollback of Containers
> -----------------------------------------------------------------------------
>
>                 Key: YARN-5620
>                 URL: https://issues.apache.org/jira/browse/YARN-5620
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Arun Suresh
>            Assignee: Arun Suresh
>         Attachments: YARN-5620.001.patch, YARN-5620.002.patch, 
> YARN-5620.003.patch, YARN-5620.004.patch, YARN-5620.005.patch
>
>
> JIRA proposes to modify the ContainerManager (and other core classes) to 
> support upgrade of a running container with a new {{ContainerLaunchContext}} 
> as well as the ability to rollback the upgrade if the container is not able 
> to restart using the new launch Context. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Commented] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers

Reply via email to