[ 
https://issues.apache.org/jira/browse/YARN-5292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15685939#comment-15685939
 ] 

Subru Krishnan commented on YARN-5292:
--------------------------------------

Thanks [~hrsharma] for the design doc and patch. I looked at the doc and the 
discussions in the JIRA; please find my $0.02 (mostly reiterating 
[~asuresh]/[~jianhe]) below.

The design doc is a good start, but it needs to cover the RM changes required 
to handle PAUSED containers. More importantly, adding the mechanism to support 
PAUSE in the NM seems manageable, especially given YARN-4597, but to have a 
practical version we need to consider the changes required at the 
ContainerExecutor (OS) and Container (process) levels to cover both work 
preservation and resource transfer. To illustrate, YARN preemption was designed 
to be work-preserving from the start (YARN-45), but we have found it hard to 
enforce that in practice because it needs individual framework support. This is 
in spite of the fact that the feature has been available for years.

There are also more nuances in the NM to support PAUSE, e.g., NM restart, 
rolling upgrades, etc., which the design does not currently cover.

I am not in favor of exposing PAUSE/RESUME to AMs for the following two reasons:
  * We cannot guarantee RESUME unless we block the allocation for the Container 
which IMHO defeats the purpose.
  * AMs already have the option of checkpointing their containers, e.g., 
MAPREDUCE-4584.

I think we should separately deal with PAUSE/RESUME for GUARANTEED and 
OPPORTUNISTIC containers.
 
On a tangential note, of late I am seeing huge monolithic patches being 
committed directly to trunk, which I am personally not a fan of: they are not 
only very difficult to review in the first place, but their side effects (both 
good & bad) are hairy to track/manage. 

Considering all of the above, I strongly agree with [~asuresh] that this should 
be an umbrella JIRA developed in a feature branch, and that we should have a 
fleshed-out design before we start getting into patches.

> Support for PAUSED container state
> ----------------------------------
>
>                 Key: YARN-5292
>                 URL: https://issues.apache.org/jira/browse/YARN-5292
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Hitesh Sharma
>            Assignee: Hitesh Sharma
>         Attachments: YARN-5292.001.patch, YARN-5292.002.patch, 
> YARN-5292.003.patch, yarn-5292.pdf
>
>
> YARN-2877 introduced OPPORTUNISTIC containers, and YARN-5216 proposes to add 
> capability to customize how OPPORTUNISTIC containers get preempted.
> In this JIRA we propose introducing a PAUSED container state.
> When a running container gets preempted, it enters the PAUSED state, where it 
> remains until resources get freed up on the node; the preempted container 
> can then resume to the running state.
>  
> One scenario where this capability is useful is work preservation. How 
> preemption is done, and whether the container supports it, is implementation 
> specific.
> For instance, if the container is a virtual machine, then preempt would pause 
> the VM and resume would restore it back to the running state.
> If the container doesn't support preemption, then preempt would default to 
> killing the container. 
>  
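
The transitions in the issue description above (RUNNING to PAUSED on 
preemption, back to RUNNING once the node frees resources, and a fallback to 
kill when the container cannot pause) can be sketched as a toy state machine. 
This is only an illustration under stated assumptions: every class, field, and 
method name below is hypothetical and not taken from the YARN code base, 
resources are modeled as a single memory figure, and a paused container is 
assumed to hand its memory to the workload that preempted it, which is exactly 
the resource-transfer question raised in the comment.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch; none of these names come from the YARN code base. */
public class PausedContainerSketch {

    enum State { RUNNING, PAUSED, KILLED }

    /** A toy container that may or may not be able to pause its work. */
    static final class Container {
        final String id;
        final int memoryMb;
        final boolean supportsPause;
        State state = State.RUNNING;

        Container(String id, int memoryMb, boolean supportsPause) {
            this.id = id;
            this.memoryMb = memoryMb;
            this.supportsPause = supportsPause;
        }
    }

    /** A toy node that tracks free memory and the containers it has paused. */
    static final class Node {
        int freeMb;
        final Deque<Container> paused = new ArrayDeque<>();

        Node(int freeMb) {
            this.freeMb = freeMb;
        }

        /**
         * Preempt a running container: pause it if it supports pausing,
         * otherwise fall back to killing it. Assumption: its memory goes
         * straight to the preempting workload, so freeMb is unchanged.
         */
        void preempt(Container c) {
            if (c.state != State.RUNNING) {
                return;
            }
            if (c.supportsPause) {
                c.state = State.PAUSED;
                paused.addLast(c);
            } else {
                c.state = State.KILLED;
            }
        }

        /** Resources were freed: resume paused containers, oldest first, while they fit. */
        void release(int mb) {
            freeMb += mb;
            while (!paused.isEmpty() && paused.peekFirst().memoryMb <= freeMb) {
                Container c = paused.pollFirst();
                freeMb -= c.memoryMb;
                c.state = State.RUNNING;
            }
        }
    }

    public static void main(String[] args) {
        Node node = new Node(0);
        Container vm = new Container("vm", 1024, true);
        Container task = new Container("task", 512, false);

        node.preempt(vm);    // supports pause
        node.preempt(task);  // does not support pause
        System.out.println(vm.id + " -> " + vm.state);
        System.out.println(task.id + " -> " + task.state);

        node.release(512);   // not yet enough for the paused container
        System.out.println(vm.id + " -> " + vm.state);
        node.release(512);   // now it fits
        System.out.println(vm.id + " -> " + vm.state);
    }
}
```

Note that the sketch cannot guarantee a resume: a paused container waits for as 
long as `release` never frees enough memory, which mirrors the first objection 
in the comment to exposing PAUSE/RESUME to AMs.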



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
