[ 
https://issues.apache.org/jira/browse/YARN-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230349#comment-15230349
 ] 

Varun Vasudev commented on YARN-4876:
-------------------------------------

Thanks for the document [~asuresh]!

Here are my initial thoughts -

{code} Add int field 'destroyDelay' to each 'StartContainerRequest':{code}

I think we should avoid this for now. We should require that AMs that use 
initialize() must also call destroy, and that AMs that call start with the 
ContainerLaunchContext can't call destroy. We can achieve that by adding the 
destroyDelay field you mentioned in your document but not allowing AMs to set 
it. If initialize is called, set destroyDelay internally to -1, else to 0. I'm 
not saying we should drop the feature, just that we should come back to it once 
we've sorted out the lifecycle from an initialize->destroy perspective.
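
A minimal sketch of that rule, assuming -1 means "never destroy automatically" and 0 means "destroy as soon as the container stops" (the comment does not spell out these meanings). The class, enum, and method names are hypothetical and are not taken from the YARN code base.

```java
// Hypothetical sketch: none of these names exist in YARN.
final class DestroyDelaySketch {
    /** How the AM brought the container into its lifecycle. */
    enum EntryPoint { INITIALIZE, START_WITH_LAUNCH_CONTEXT }

    // Assumed meaning: the container is only destroyed by an explicit destroy call.
    static final int NEVER_AUTO_DESTROY = -1;
    // Assumed meaning: the container is cleaned up immediately when it stops.
    static final int DESTROY_ON_STOP = 0;

    /** destroyDelay is derived internally; the AM has no way to set it. */
    static int destroyDelayFor(EntryPoint entry) {
        return entry == EntryPoint.INITIALIZE ? NEVER_AUTO_DESTROY : DESTROY_ON_STOP;
    }

    /** destroy is only legal for containers that were created through initialize(). */
    static boolean mayCallDestroy(EntryPoint entry) {
        return destroyDelayFor(entry) == NEVER_AUTO_DESTROY;
    }

    public static void main(String[] args) {
        for (EntryPoint e : EntryPoint.values()) {
            System.out.println(e + " -> destroyDelay=" + destroyDelayFor(e)
                + ", mayCallDestroy=" + mayCallDestroy(e));
        }
    }
}
```

Deriving the value from the entry point keeps the initialize/destroy and start/stop paths separate without exposing a new field to AMs.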

{code}
Modify 'StopContainerRequest' Record:
  Add boolean 'destroyContainer':
{code}
Similar to above - let's avoid mixing initialize/destroy with start/stop for 
now.

{code}
• Introduce a new 'ContainerEventType.START_CONTAINER' event type.
• Introduce a new 'ContainerEventType.DESTROY_CONTAINER' event type.
• The Container remains in the LOCALIZED state until it receives the 
'START_CONTAINER' event.
{code}

Can you add a state machine transition diagram to explain how new states and 
events affect each other?

{code}
If 'initializeContainer' with a new ContainerLaunchContext is called by the AM 
while the Container
is RUNNING, It is treated as a KILL_CONTAINER event followed by a 
CONTAINER_RESOURCE_CLEANUP and an INIT_CONTAINER event to kick of 
re-localization after which the Container will return to LOCALIZED state.
{code}
I'd really like to avoid this specific behavior. I think we should add an 
explicit re-initialize API. For a running process, ideally, we want to localize 
the upgraded bits while the container is running and then kill the existing 
process to minimize the downtime. For containers where localization can take a 
long time, forcing a kill and then a re-initialize adds quite a serious amount 
of downtime. Re-initialize and initialize will probably end up having differing 
behaviors. On a similar note, I think we might have to introduce a new 
"re-initializing/re-localizing/running-localizing" state, which implies that a 
container is running but we are carrying out some background work.
In addition, I don't think we can do a cleanup of resources during an upgrade. 
For services that have local state in the container work dir, we're essentially 
wiping away all the local state and forcing them to start from scratch.
Just a clarification: when you mentioned CONTAINER_RESOURCE_CLEANUP, I'm 
assuming you meant CLEANUP_CONTAINER_RESOURCES.
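
One possible shape for the transitions described above, as a rough sketch only. The RUNNING_LOCALIZING state and the REINITIALIZE and RESOURCES_LOCALIZED events are hypothetical names; START_CONTAINER and KILL_CONTAINER come from the quoted document. The sketch assumes the old process is killed only once the new resources are localized, and that no resource cleanup happens on the way back to LOCALIZED.

```java
// Hypothetical sketch of the proposed transitions; not the NM state machine.
final class ReInitSketch {
    enum State { LOCALIZED, RUNNING, RUNNING_LOCALIZING }

    enum Event { START_CONTAINER, REINITIALIZE, RESOURCES_LOCALIZED, KILL_CONTAINER }

    /** Returns the next state, or throws if the event is not valid in this state. */
    static State next(State state, Event event) {
        switch (state) {
            case LOCALIZED:
                if (event == Event.START_CONTAINER) return State.RUNNING;
                break;
            case RUNNING:
                // Upgrade: keep the process alive while the new bits are localized.
                if (event == Event.REINITIALIZE) return State.RUNNING_LOCALIZING;
                // Plain restart: kill the process but keep the localized resources.
                if (event == Event.KILL_CONTAINER) return State.LOCALIZED;
                break;
            case RUNNING_LOCALIZING:
                // Only now is the old process killed; the work dir is left untouched.
                if (event == Event.RESOURCES_LOCALIZED) return State.LOCALIZED;
                break;
        }
        throw new IllegalStateException(event + " is not valid in state " + state);
    }
}
```

With this shape, the downtime of an upgrade is limited to the kill and relaunch, because localization overlaps with the running process.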

{code}
• If 'intializeContainer' is called WITHOUT a new ContainerLaunchContext by the 
AM, it is considered a restart, and will follow the same code path as 
'initializeContainer' with new ContainerLaunchContext, but will not perform a 
CONTAINER_RESOURCE_CLEANUP and INIT_CONTAINER. The Container process will be 
killed and the container will be returned to LOCALIZED state.
• If 'startContainer' is called WITHOUT a new ContainerLaunchContext by the AM, 
it treated exactly as the above case, but it will also trigger a 
START_CONTAINER event.
{code}
Instead of forcing AMs to make two calls, why don't we just add a restart API 
that does everything you've outlined above? It's cleaner and we don't have to 
do as many condition checks. In addition, with a restart API we can do stuff 
like allowing AMs to specify a delay, or some conditions when the restart 
should happen.
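
To make the suggestion concrete, here is a rough sketch of what a single restart call could carry. RestartContainerRequest, its fields, and the plan method are all hypothetical; only the KILL_CONTAINER and START_CONTAINER event names come from the quoted document. It assumes the delay is a simple offset in milliseconds.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: not part of ContainerManagementProtocol.
final class RestartSketch {
    /** A single request that replaces the initialize-then-start pair of calls. */
    static final class RestartContainerRequest {
        final String containerId;
        final long delayMillis;

        RestartContainerRequest(String containerId, long delayMillis) {
            if (delayMillis < 0) {
                throw new IllegalArgumentException("delayMillis must be >= 0");
            }
            this.containerId = containerId;
            this.delayMillis = delayMillis;
        }
    }

    /** An event together with the time at which it should fire. */
    static final class ScheduledEvent {
        final String type;
        final long atMillis;

        ScheduledEvent(String type, long atMillis) {
            this.type = type;
            this.atMillis = atMillis;
        }
    }

    /** Plans a restart: kill, then start, with no resource cleanup or re-init. */
    static List<ScheduledEvent> plan(RestartContainerRequest request, long nowMillis) {
        long fireAt = nowMillis + request.delayMillis;
        List<ScheduledEvent> events = new ArrayList<>();
        events.add(new ScheduledEvent("KILL_CONTAINER", fireAt));
        events.add(new ScheduledEvent("START_CONTAINER", fireAt));
        return events;
    }
}
```

Because the intent is explicit in the request type, the handler does not need to infer a restart from the absence of a ContainerLaunchContext.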

> [Phase 1] Decoupled Init / Destroy of Containers from Start / Stop
> ------------------------------------------------------------------
>
>                 Key: YARN-4876
>                 URL: https://issues.apache.org/jira/browse/YARN-4876
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Arun Suresh
>            Assignee: Arun Suresh
>         Attachments: YARN-4876-design-doc.pdf
>
>
> Introduce *initialize* and *destroy* container API into the 
> *ContainerManagementProtocol* and decouple the actual start of a container 
> from the initialization. This will allow AMs to re-start a container without 
> having to lose the allocation.
> Additionally, if the localization of the container is associated to the 
> initialize (and the cleanup with the destroy), This can also be used by 
> applications to upgrade a Container by *re-initializing* with a new 
> *ContainerLaunchContext*



