[ https://issues.apache.org/jira/browse/YARN-5292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15681223#comment-15681223 ]
Arun Suresh commented on YARN-5292:
-----------------------------------

Thanks for the patch [~hrsharma]. Did a fly-by of the patch and the design doc.

Some design comments:
# The original intent of the JIRA, I guess, is to provide an alternative to killing opportunistic containers to make room for guaranteed containers. This implies that we would need to wire this through the ContainerScheduler, which is now the entity that decides when to kill opportunistic containers and which ones to kill.
# I was thinking we could also expose an API on the ContainerManagementProtocol to allow AMs to directly pause a container, but I am guessing this should be allowed only for Guaranteed containers: if we expose a pause API, we should also expose a resume API, and opportunistic containers are not necessarily resumable at the time the AM needs them to be. [~jianhe], [~vvasudev], it would be nice to hear your thoughts on this, since, if I understand correctly, YARN native services need to just stop a container (without losing the allocation) for a period of time. I don't know if that can be modeled as a container PAUSE via some support from the underlying ContainerExecutor/Runtime.
# We need some way to expose what resources are reclaimable by the NM when a container is paused. On deployments using some implementations of the ContainerExecutor/Runtime, it is possible that not all resources of a paused container will be reclaimable by the NM to start other opportunistic/guaranteed containers. For example, it may be that on some systems vcores are throttled to 0 for the container, while on others the memory/state is also dumped into a secondary store, which means the memory might also be reclaimable. We would need some way to plug this information into the ResourceUtilizationTracker and the ContainerScheduler.

I am thinking we should maybe convert this to an Umbrella JIRA, create the work items as sub-JIRAs against it, and work against a branch.
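
The third design comment could be made concrete along the following lines. This is only a sketch under stated assumptions: `Res`, `PauseReclaimPolicy`, `CPU_THROTTLE_ONLY`, `CHECKPOINT_TO_STORE`, `Action` and `decide` are hypothetical names invented for illustration; none of them are part of YARN or of the attached patches. The idea shown is that the ContainerExecutor/Runtime reports what a pause actually frees, and the scheduler pauses only when that is enough to fit the guaranteed container.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; these types are NOT part of YARN. */
final class PauseReclaimSketch {

  /** Simplified stand-in for a resource vector (memory in MB, vcores). */
  static final class Res {
    final long memoryMb;
    final int vcores;

    Res(long memoryMb, int vcores) {
      this.memoryMb = memoryMb;
      this.vcores = vcores;
    }

    boolean fitsIn(Res other) {
      return memoryMb <= other.memoryMb && vcores <= other.vcores;
    }

    Res plus(Res o) {
      return new Res(memoryMb + o.memoryMb, vcores + o.vcores);
    }
  }

  /** Assumed hook: what the runtime can hand back when it pauses a container. */
  interface PauseReclaimPolicy {
    Res reclaimableOnPause(Res allocated);
  }

  /** Runtime that only throttles vcores to 0; memory stays resident. */
  static final PauseReclaimPolicy CPU_THROTTLE_ONLY =
      allocated -> new Res(0, allocated.vcores);

  /** Runtime that also dumps memory/state to a secondary store. */
  static final PauseReclaimPolicy CHECKPOINT_TO_STORE = allocated -> allocated;

  enum Action { START_NOW, PAUSE_OPPORTUNISTIC, KILL_OPPORTUNISTIC }

  /** Decide how to make room for a guaranteed container needing {@code needed}. */
  static Action decide(Res free, Res needed, List<Res> runningOpportunistic,
                       PauseReclaimPolicy policy) {
    if (needed.fitsIn(free)) {
      return Action.START_NOW;
    }
    Res afterPause = free;
    for (Res c : runningOpportunistic) {
      afterPause = afterPause.plus(policy.reclaimableOnPause(c));
    }
    // If pausing does not free enough, fall back to the existing kill behaviour.
    return needed.fitsIn(afterPause)
        ? Action.PAUSE_OPPORTUNISTIC
        : Action.KILL_OPPORTUNISTIC;
  }

  public static void main(String[] args) {
    Res free = new Res(1024, 1);
    Res needed = new Res(4096, 2);
    List<Res> opportunistic = Arrays.asList(new Res(4096, 2));
    // prints KILL_OPPORTUNISTIC: the paused container's memory is not freed
    System.out.println(decide(free, needed, opportunistic, CPU_THROTTLE_ONLY));
    // prints PAUSE_OPPORTUNISTIC: memory is reclaimed as well
    System.out.println(decide(free, needed, opportunistic, CHECKPOINT_TO_STORE));
  }
}
```

With the same node and the same request, the outcome depends only on what the runtime reports as reclaimable, which is the information the comment asks to be plugged into the tracker and the scheduler.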
With regard to the patch itself, I understand the current one is meant to handle the changes needed in the state machines etc. Do take a look at the {{TestContainer}} class, and see if it is possible to add some tests to verify that container life-cycle events are handled correctly. Will take a deeper look at the patch after that.

> Support for PAUSED container state
> ----------------------------------
>
>                 Key: YARN-5292
>                 URL: https://issues.apache.org/jira/browse/YARN-5292
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Hitesh Sharma
>            Assignee: Hitesh Sharma
>         Attachments: YARN-5292.001.patch, YARN-5292.002.patch,
> YARN-5292.003.patch, yarn-5292.pdf
>
>
> YARN-2877 introduced OPPORTUNISTIC containers, and YARN-5216 proposes to add
> capability to customize how OPPORTUNISTIC containers get preempted.
> In this JIRA we propose introducing a PAUSED container state.
> When a running container gets preempted, it enters the PAUSED state, where it
> remains until resources get freed up on the node; the preempted container
> can then resume to the running state.
>
> One scenario where this capability is useful is work preservation. How
> preemption is done, and whether the container supports it, is implementation
> specific.
> For instance, if the container is a virtual machine, then preempt would pause
> the VM and resume would restore it back to the running state.
> If the container doesn't support preemption, then preempt would default to
> killing the container.
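
The life cycle in the issue description can be sketched as a small transition function, together with the kind of check a life-cycle test might make. This is a deliberately simplified illustration: the `State` and `Event` enums, the `supportsPause` flag and the `next` method are hypothetical and do not correspond to the NodeManager's actual container state machine or to the code in the attached patches.

```java
/** Hypothetical sketch of the proposed life cycle; NOT the NodeManager state machine. */
final class PausedStateSketch {

  enum State { RUNNING, PAUSED, KILLED }

  enum Event { PREEMPT, RESOURCES_FREED }

  /**
   * Returns the state after {@code event}. A container whose runtime cannot
   * pause is killed on preemption, as the issue description proposes.
   */
  static State next(State current, Event event, boolean supportsPause) {
    switch (current) {
      case RUNNING:
        if (event == Event.PREEMPT) {
          return supportsPause ? State.PAUSED : State.KILLED;
        }
        return current;
      case PAUSED:
        // Resume once the node has resources again.
        if (event == Event.RESOURCES_FREED) {
          return State.RUNNING;
        }
        return current;
      default:
        // KILLED is terminal in this sketch.
        return current;
    }
  }

  public static void main(String[] args) {
    State s = next(State.RUNNING, Event.PREEMPT, true);
    System.out.println(s); // prints PAUSED
    s = next(s, Event.RESOURCES_FREED, true);
    System.out.println(s); // prints RUNNING
    System.out.println(next(State.RUNNING, Event.PREEMPT, false)); // prints KILLED
  }
}
```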