[ 
https://issues.apache.org/jira/browse/YARN-5972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15727256#comment-15727256
 ] 

Hitesh Sharma commented on YARN-5972:
-------------------------------------

Hi folks, thanks for opening this JIRA and the feedback. Much appreciated. 

{quote}
While notifying an AM of containers that are about to be preempted does allow 
the AM to check-point work, it does imply, as you pointed out, that AMs be 
modified to act on this input and make some decisions based on it.

Container pausing/freezing on the other hand, given OS/VM level support (also 
exposed via Docker and LXC) to actually freeze a process (agreed, their 
definition of freeze might vary), is actually AM/application independent. This 
can be useful, for applications and deployments that do not really want to 
check-point on its own but at the same time like the idea of container 
preemption with work preservations.
{quote}

Agree with [~asuresh] here. What container pausing/freezing offers is the 
ability to delegate to the underlying OS how the resources used by a container 
are reclaimed, and to resume the container once resources free up again. The 
gains will vary with the container executor implementation. That said, 
PAUSE/RESUME is neither a perfect solution for work preservation nor a 
substitute for AM-specific checkpointing.

[YARN-5292] adds PAUSE/RESUME for opportunistic containers and doesn't target 
guaranteed containers. I can think of scenarios where this functionality would 
be useful for guaranteed containers, but I would wait to see a concrete need 
from the community first.

Allowing the ContainerManager (CM) to initiate a pause/resume on an 
opportunistic container was considered, but we decided against it. There are 
edge cases around what happens if the CM initiates a RESUME on a paused 
container while the NM tries to PAUSE it ([YARN-5216]). I think [~subru] is 
also touching on these edge cases.
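
To make the edge case concrete, here is a minimal sketch of two initiators 
racing on one container. PauseRaceSketch, State, Request and apply are 
hypothetical names for illustration only; this is not how the NM state machine 
is implemented.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of a pausable container; not a YARN class. */
final class PauseRaceSketch {
    enum State { RUNNING, PAUSED }
    enum Request { PAUSE, RESUME }

    /** Applies requests in arrival order and returns the final state. */
    static State apply(State start, List<Request> arrivalOrder) {
        State s = start;
        for (Request r : arrivalOrder) {
            // PAUSE on a PAUSED container and RESUME on a RUNNING one are no-ops.
            s = (r == Request.PAUSE) ? State.PAUSED : State.RUNNING;
        }
        return s;
    }

    public static void main(String[] args) {
        // The container is already paused; the NM's PAUSE and the CM's RESUME race.
        State nmFirst = apply(State.PAUSED, Arrays.asList(Request.PAUSE, Request.RESUME));
        State cmFirst = apply(State.PAUSED, Arrays.asList(Request.RESUME, Request.PAUSE));
        System.out.println("NM PAUSE first, CM RESUME second -> " + nmFirst);
        System.out.println("CM RESUME first, NM PAUSE second -> " + cmFirst);
    }
}
```

With two initiators the final state depends only on arrival order, so a late 
RESUME can undo a PAUSE the NM issued to reclaim resources. Keeping the NM as 
the only initiator avoids that ambiguity.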

Overall I feel that the current design of allowing PAUSE/RESUME on 
opportunistic containers is a good starting point: it allows an opportunistic 
container to be PAUSED in favor of a guaranteed one and RESUMED when resources 
free up ([YARN-5216]). We should probably implement pauseContainer and 
resumeContainer for Docker-based container executors, as opportunistic 
containers running inside Docker containers can benefit from it. 
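
The flow described above could be sketched roughly as follows. Everything here 
is an assumption made for illustration: the Executor interface and its method 
signatures, the Node class, the equal-sized slots, and the newest-first victim 
choice are hypothetical and are not taken from the YARN code base. The model 
also treats a paused container as using no resources, which is a 
simplification. The fallback to kill when pausing is unsupported follows the 
issue description.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical model of NM-side pause/resume; none of these types are YARN classes. */
final class PauseResumeSketch {

    /** Stand-in for a container executor; the signatures are assumed. */
    interface Executor {
        boolean pauseContainer(String id);   // false if pausing is unsupported
        boolean resumeContainer(String id);  // false if the resume failed
        void killContainer(String id);
    }

    /** A node with a fixed number of equal-sized slots (a simplification). */
    static final class Node {
        private final int capacity;
        private final Executor executor;
        private int guaranteed;                                    // running guaranteed containers
        private final Deque<String> running = new ArrayDeque<>();  // running opportunistic
        private final Deque<String> paused = new ArrayDeque<>();   // paused opportunistic

        Node(int capacity, Executor executor) {
            this.capacity = capacity;
            this.executor = executor;
        }

        int used() { return guaranteed + running.size(); }
        int pausedCount() { return paused.size(); }

        boolean startOpportunistic(String id) {
            if (used() >= capacity) return false;
            running.addLast(id);
            return true;
        }

        /** Starts a guaranteed container, reclaiming an opportunistic slot if the node is full. */
        boolean startGuaranteed() {
            if (used() >= capacity) {
                if (running.isEmpty()) return false;   // nothing opportunistic to reclaim
                String victim = running.removeLast();  // assumed policy: newest first
                if (executor.pauseContainer(victim)) {
                    paused.addLast(victim);
                } else {
                    executor.killContainer(victim);    // fallback when pause is unsupported
                }
            }
            guaranteed++;
            return true;
        }

        /** Releases a guaranteed slot and resumes paused containers while room remains. */
        void finishGuaranteed() {
            if (guaranteed == 0) return;
            guaranteed--;
            while (used() < capacity && !paused.isEmpty()) {
                String id = paused.peekFirst();
                if (!executor.resumeContainer(id)) break;
                paused.removeFirst();
                running.addLast(id);
            }
        }
    }

    public static void main(String[] args) {
        Executor logging = new Executor() {
            public boolean pauseContainer(String id) { System.out.println("pause " + id); return true; }
            public boolean resumeContainer(String id) { System.out.println("resume " + id); return true; }
            public void killContainer(String id) { System.out.println("kill " + id); }
        };
        Node node = new Node(2, logging);
        node.startOpportunistic("opp-1");
        node.startOpportunistic("opp-2");
        node.startGuaranteed();   // node is full, so opp-2 is paused
        node.finishGuaranteed();  // the slot frees up, so opp-2 is resumed
    }
}
```

An executor that cannot freeze processes would return false from 
pauseContainer, and the same policy would then fall back to killing the 
container.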

If the community sees a need, we can extend the functionality to guaranteed 
containers. I personally think that may become more relevant as YARN containers 
become virtualized via Docker or virtual machines, but I would love to hear 
some scenarios before we do that.

> Add Support for Pausing/Freezing of containers
> ----------------------------------------------
>
>                 Key: YARN-5972
>                 URL: https://issues.apache.org/jira/browse/YARN-5972
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Hitesh Sharma
>            Assignee: Hitesh Sharma
>
> YARN-2877 introduced OPPORTUNISTIC containers, and YARN-5216 proposes to add 
> capability to customize how OPPORTUNISTIC containers get preempted.
> In this JIRA we propose introducing a PAUSED container state.
> Instead of preempting a running container, the container can be moved to a 
> PAUSED state, where it remains until resources are freed up on the node; the 
> preempted container can then resume to the running state.
> Note that process freezing is already supported by 'cgroups freezer', 
> which is used internally by the docker pause functionality. Windows also has 
> OS level support of a similar nature.
> One scenario where this capability is useful is work preservation. How 
> preemption is done, and whether the container supports it, is implementation 
> specific.
> For instance, if the container is a virtual machine, then preempt call would 
> pause the VM and resume would restore it back to the running state.
> If the container executor / runtime doesn't support preemption, then preempt 
> would default to killing the container. 
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
