[ 
https://issues.apache.org/jira/browse/YARN-5620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15477938#comment-15477938
 ] 

Arun Suresh edited comment on YARN-5620 at 9/9/16 6:50 PM:
-----------------------------------------------------------

[~jianhe], thanks for the thoughtful comments.

bq. Also, if it’s ignored here, then it appears to AM that the upgrade call is 
somehow ignored
So in the {{ReInitializeContainerTransition}}, I do the following
{noformat}
if (container.reInitContext != null) {
  container.addDiagnostics("Container [" + container.getContainerId()
      + "] ReInitialization already in progress !!");
  return;
}
{noformat}
So a new upgrade request is not silently ignored; a diagnostic message is 
recorded for the AM instead.

bq.  I think we can move the logic of ReInitializeContainerTransition to the 
upgrade API ? All it does is to resend the events to containerLauncher or 
ResourceLocalizationService , which can be done in the API. Also, this has the 
benefit of rejecting the upgrade call while the previous upgrade is 
in_progress. 
Actually, in one of my initial versions, I was setting the reInitContext on the 
container object in the upgrade API itself. But then I realized that all state 
changes on the Container are effected via a state machine transition, and I did 
not want to break that. I think we should keep state changes of a Container 
limited to state machine transitions, since the code is cleaner and we can 
guard against other synchronization issues.
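To illustrate the principle (a toy sketch with made-up names, not the actual YARN {{StateMachineFactory}} machinery): the API layer only submits events, and the single synchronized handler is the only code that mutates container state.

```java
// Toy sketch: confining all state mutation to the transition handler,
// so API-level callers cannot race on the container's fields.
// Names (ToyContainer, Event, etc.) are illustrative only.
public class ToyContainer {
  public enum State { RUNNING, REINITIALIZING }
  public enum Event { REINIT, REINIT_DONE }

  private State state = State.RUNNING;

  // All state mutation happens here, under one lock.
  public synchronized State handle(Event event) {
    switch (event) {
      case REINIT:
        if (state == State.RUNNING) {
          state = State.REINITIALIZING;
        }
        break;
      case REINIT_DONE:
        if (state == State.REINITIALIZING) {
          state = State.RUNNING;
        }
        break;
    }
    return state;
  }
}
```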

Also, I think we should have a consistent way of letting the AM know about 
errors. I prefer that we use diagnostic messages (with maybe an error code) to 
let the AM know of failures. I guess you are already doing this for failed 
localization. This way, the checks and enforcements are done only in 
transitions. I agree the AM will not know of an error immediately, but a 
subsequent 'getContainerStatus' call will notify the AM.
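On the AM side, the pattern could look roughly like this (a hypothetical sketch; the {{Status}} stand-in and the diagnostic string matching are assumptions, not the actual YARN API):

```java
// Hypothetical AM-side sketch: the AM does not learn of a rejected
// upgrade immediately; it discovers it on a later status poll by
// inspecting the diagnostics string.
public class DiagnosticsCheck {
  // Stand-in for the ContainerStatus the NM returns on a status poll.
  public static class Status {
    final String diagnostics;
    public Status(String diagnostics) { this.diagnostics = diagnostics; }
  }

  // Returns true if the diagnostics indicate the upgrade was rejected.
  public static boolean upgradeRejected(Status status) {
    return status.diagnostics != null
        && status.diagnostics.contains(
            "ReInitialization already in progress");
  }
}
```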

bq. One other potential race is that the relocalize API has a chance to go 
through instead of rejected, while upgrading,
I understand. I was actually thinking we could keep a set of 
*reInitializingContainerIds* in the NM context: check whether the containerId 
is present in the set in the upgrade and relocalize APIs, and add it to the set 
if not. We then remove it from the set once the Container moves from LOCALIZED 
to RUNNING (which signifies that re-initialization is complete). Thoughts?
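A minimal sketch of that guard (the class and method names here are illustrative, not the actual NM code):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: a concurrent set in the NM context used to
// reject a second upgrade/relocalize while one is in progress.
public class ReInitGuard {
  private final Set<String> reInitializingContainerIds =
      ConcurrentHashMap.newKeySet();

  // Called from the upgrade/relocalize API; the atomic add doubles as
  // the check, returning false if a re-initialization is already
  // in progress for this container.
  public boolean tryStartReInit(String containerId) {
    return reInitializingContainerIds.add(containerId);
  }

  // Called when the container moves from LOCALIZED to RUNNING,
  // i.e. re-initialization is complete.
  public void finishReInit(String containerId) {
    reInitializingContainerIds.remove(containerId);
  }
}
```

Using the atomic {{Set#add}} as the check-and-insert avoids a check-then-act race between two concurrent API calls.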

bq. Given so many if(reinitializing) conditions in containerImpl, should we 
consider adding a new state?
I tried that at first, but I preferred having more if-thens over extra 
Container states. I can provide a version with different container states 
though.

bq. when launching the container, we need to cleanupPreviousContainerFiles as 
done in ContainerRelaunch, right?
This is actually something I wanted to discuss. Do you think we should clean up 
the old script file? If the upgrade uses the same script file name, it will be 
overwritten, right? The token file is overwritten in any case.




> Core changes in NodeManager to support for upgrade and rollback of Containers
> -----------------------------------------------------------------------------
>
>                 Key: YARN-5620
>                 URL: https://issues.apache.org/jira/browse/YARN-5620
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Arun Suresh
>            Assignee: Arun Suresh
>         Attachments: YARN-5620.001.patch, YARN-5620.002.patch, 
> YARN-5620.003.patch, YARN-5620.004.patch, YARN-5620.005.patch, 
> YARN-5620.006.patch, YARN-5620.007.patch
>
>
> JIRA proposes to modify the ContainerManager (and other core classes) to 
> support upgrade of a running container with a new {{ContainerLaunchContext}} 
> as well as the ability to rollback the upgrade if the container is not able 
> to restart using the new launch Context. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
