[ https://issues.apache.org/jira/browse/YARN-5292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15685939#comment-15685939 ]
Subru Krishnan commented on YARN-5292:
--------------------------------------

Thanks [~hrsharma] for the design doc and patch. I looked at the doc and the discussions in the JIRA; please find my $0.02 below (mostly reiterating [~asuresh]/[~jianhe]).

The design doc is a good start, but it needs to cover the RM changes required to handle PAUSED containers. More importantly, adding the mechanism to support PAUSE in the NM seems manageable, especially given YARN-4597, but to have a practical version we need to consider the changes required at the ContainerExecutor (OS) and Container (process) levels to cover both work preservation and resource transfer. To illustrate, YARN preemption was designed to be work-preserving from the start (YARN-45), but we have found it hard to enforce that in practice because it needs individual framework support, even though the feature has been available for years. There are also more nuances in the NM to support PAUSE, e.g. NM restart, rolling upgrades, etc., which the design does not currently cover.

I am not in favor of exposing PAUSE/RESUME to AMs, for the following two reasons:
* We cannot guarantee RESUME unless we block the allocation for the Container, which IMHO defeats the purpose.
* AMs already have the option of checkpointing their containers, e.g. MAPREDUCE-4584.

I think we should deal separately with PAUSE/RESUME for GUARANTEED and OPPORTUNISTIC containers.

On a tangential note, of late I am seeing huge monolithic patches being committed directly to trunk, which I am personally not a fan of: they are not only very difficult to review in the first place, but their side effects (both good and bad) are hairy to track and manage.

Considering all of the above, I strongly agree with [~asuresh] that this should be an umbrella JIRA developed in a feature branch, and that we should have a fleshed-out design before we start getting into patches.

> Support for PAUSED container state
> ----------------------------------
>
>                 Key: YARN-5292
>                 URL: https://issues.apache.org/jira/browse/YARN-5292
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Hitesh Sharma
>            Assignee: Hitesh Sharma
>         Attachments: YARN-5292.001.patch, YARN-5292.002.patch,
> YARN-5292.003.patch, yarn-5292.pdf
>
>
> YARN-2877 introduced OPPORTUNISTIC containers, and YARN-5216 proposes to add
> capability to customize how OPPORTUNISTIC containers get preempted.
> In this JIRA we propose introducing a PAUSED container state.
> When a running container gets preempted, it enters the PAUSED state, where it
> remains until resources get freed up on the node then the preempted container
> can resume to the running state.
>
> One scenario where this capability is useful is work preservation. How
> preemption is done, and whether the container supports it, is implementation
> specific.
> For instance, if the container is a virtual machine, then preempt would pause
> the VM and resume would restore it back to the running state.
> If the container doesn't support preemption, then preempt would default to
> killing the container.
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org