[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174775#comment-15174775 ]

Vinod Kumar Vavilapalli commented on YARN-1040:
---

bq. General case: AM launches multiple containers at the same time. This is essentially container-groups - we should keep this option open.

Clarification on what I meant here: It's okay for now to only design APIs (and defer implementation) so that even if our first version of implementation only covers allocation-vs-container delinking, container-groups are possible in future without further API changes/addition.

> De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
>
>                 Key: YARN-1040
>                 URL: https://issues.apache.org/jira/browse/YARN-1040
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 3.0.0
>            Reporter: Steve Loughran
>
> The AM should be able to exec >1 process in a container, rather than have the
> NM automatically release the container when the single process exits.
> This would let an AM restart a process on the same container repeatedly,
> which for HBase would offer locality on a restarted region server.
> We may also want the ability to exec multiple processes in parallel, so that
> something could be run in the container while a long-lived process was
> already running. This can be useful in monitoring and reconfiguring the
> long-lived process, as well as shutting it down.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15167889#comment-15167889 ]

Bikas Saha commented on YARN-1040:
--

Vinod, the plan you are suggesting has merits. But my initial impression is that reworking allocations and containers is a much bigger change than what was proposed earlier in this jira, not only internally in YARN but also externally, in terms of how users of YARN think about the whole larger flow of allocations and containers. The proposal discussed earlier is of much smaller scope and, I believe, sufficient to take us where we need to go. It also does not need reworking of the RM-related flow of allocations and containers. E.g. it may not be necessary for the RM to understand single-use allocations vs multi-use vs concurrent-use allocations. But for the RM level changes you are suggesting we may be on the path of convergence.

At this point, the discussion is complex enough that we may want to gather interested people and do it as a group outside jira comments and then post it back.
[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15167794#comment-15167794 ]

Vinod Kumar Vavilapalli commented on YARN-1040:
---

Still catching up on some of your discussion, but quick comments on a few things that I care about.

We should really call them {{Allocation}} and {{Container}} instead. That's the nomenclature I used at YARN-4726. Till now, YARN has combined the notions of Allocation and Container, and separating them is the main de-linking that we need to do here. Process has an OS-level connotation, and doesn't work well with more things in the picture like process-trees / multiple-processes / Docker (YARN-2466).

Taking this further, here's how the overall picture could look.

*ResourceManager*
 - RM only does allocations in the scheduling path. ResourceManager does all scheduling based on AllocationRequests and tracks Allocations.
 - RM receives AllocationRequests and returns fulfilled Allocations (and AllocationTokens) to AMs.

*Applications*
 - AM can in turn use the Allocations (and AllocationTokens) to launch multiple Containers on the NM.
 -- Simple case: AM only launches containers one-after-another. It's up to the app to do this.
 -- General case: AM launches multiple containers at the same time. This is essentially container-groups - we should keep this option open.
 - AMs can specify *single-use* AllocationRequests, at which point RM can simply return Containers and Container-Tokens (today's code-path).
 - Each Container exits when the process-tree / linux-container exits.
 - Each Container has an Identifier.
 -- For single-use allocation-requests, RM generates ContainerIDs.
 -- For multi-use allocation-requests, apps could optionally specify a container-name that is scoped under the allocation. NM always returns a (generated or app-specified) ContainerID based off the allocation-ID. Essentially, allocationID + containerID is unique.

*NodeManagers*
 - NodeManager also understands incoming Allocations and ties them to Container groups: it deals with Allocation activation/deactivation and Container start/stop, but does
 -- the following *decoupled from both allocations and containers*: localizations / re-localizations. This means local-resources should now have more scopes: container, allocation, application etc.
 -- *per allocation*: enforcement of resource-limits
 -- all of the following *per container*: (a) process/OS-container activation / deactivation, (b) process/OS-container auto-restart (YARN-4725), (c) log-aggregation
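As a rough illustration of the identifier scheme described in the comment above (a ContainerID that is generated or app-specified, scoped under the allocation, so that allocationID + containerID is unique), here is a minimal Java sketch. The class name, the method name, the "/" separator, and the "generated-N" naming are all assumptions made for illustration; none of them are YARN APIs or come from the thread.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch only: this class is not part of YARN.
final class AllocationScopedIds {
    private final String allocationId;
    private final Set<String> usedNames = new HashSet<>();
    private int nextGenerated = 1;

    AllocationScopedIds(String allocationId) {
        this.allocationId = allocationId;
    }

    // Returns a container ID based off the allocation ID.
    // Pass null to have a name generated instead of app-specified.
    String newContainerId(String appSpecifiedName) {
        String name = appSpecifiedName;
        if (name == null) {
            // Skip generated names that the app has already claimed.
            do {
                name = "generated-" + nextGenerated++;
            } while (usedNames.contains(name));
        } else if (usedNames.contains(name)) {
            throw new IllegalArgumentException(
                "container name already used under " + allocationId + ": " + name);
        }
        usedNames.add(name);
        // Uniqueness comes from the pair (allocationId, name).
        return allocationId + "/" + name;
    }
}
```

In this sketch a name only has to be unique within its allocation, which is one way to read "scoped under the allocation"; two allocations could both hold a container named "regionserver".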
[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15167738#comment-15167738 ]

Bikas Saha commented on YARN-1040:
--

My guess is that YARN-4725 may be redundant after we do this work, because then we would have exposed primitives to apps to make that happen. The arguments for YARN not doing it by itself would be the same: if it can be done easily by the app and is very likely app-dependent without a one-size-fits-all answer, then let the app do it.

Coming back to this jira. Yes, let's please track any first-class support of the notion of upgrades separately, which can be done as a follow-up.

Perhaps we can put the design in a document and look at the next level of details. We can send email to the dev list after adding a more detailed document to this jira. Then, based on +ve feedback, we could go ahead with jiras/code. The devil is in the details :) This would be a significant change and we could use more eyes for reviews.

For the startProcess identifier, it may be useful for the app to provide the identifier in startProcess and then use it later to refer to the process (e.g. in stopProcess), rather than YARN trying to come up with identifiers. This may make the app's life easier because it could use meaningful terms based on its own logic. We can discuss such details in the design document.
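The suggestion that the app supplies the identifier in startProcess and reuses it in stopProcess could be sketched roughly as follows. This is only an illustration of the idea: `ProcessTable` and its method signatures are hypothetical, and the "command" is a plain string standing in for whatever launch description the real API would take.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch only: not a YARN class. Tracks processes in one
// container, keyed by identifiers that the application chooses.
final class ProcessTable {
    private final Map<String, String> running = new LinkedHashMap<>();

    // The app picks a meaningful identifier, e.g. "regionserver" or "monitor".
    void startProcess(String appChosenId, String command) {
        if (running.containsKey(appChosenId)) {
            throw new IllegalStateException("identifier already in use: " + appChosenId);
        }
        running.put(appChosenId, command);
    }

    // Returns true if a process with that identifier was running.
    boolean stopProcess(String appChosenId) {
        return running.remove(appChosenId) != null;
    }

    int runningCount() {
        return running.size();
    }
}
```

With app-chosen identifiers the AM does not need to store anything returned by startProcess; it can address the process later by the same name it already uses in its own logic.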
[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15167674#comment-15167674 ]

Arun Suresh commented on YARN-1040:
---

Thanks for clarifying [~bikassaha]

I propose we break this down into sub-jiras:
# New APIs specific to delinking container life-cycle from the process: this will include 4 of the new APIs specified above (excluding localization) but will assume a single process (and thus startProcess does not need to return a processId)
# Add support for clubbing APIs into a single RPC
** Might have to think a bit about validating the order and multiplicity of the API calls in each command (which I expect might be different for single process / multiple processes)
# Add support for localize API
# Add support for multiple processes
** A processId will be returned for a startProcess. Might have to think through this further, e.g. how does this integrate with YARN-4725

For the purpose of Application Upgrades (for which this JIRA is marked as a sub-task of... also why I'm calling it out specifically): add support for Container Upgrades
# Expose a canned NMCommand that has the list of APIs to upgrade based on some policy

If folks are fine with this, I will go ahead and open JIRAs and link this issue to each of the above JIRAs (since I don't think I can create subtasks for this JIRA) so that we can start work on the same.
[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15166633#comment-15166633 ]

Bikas Saha commented on YARN-1040:
--

I am sorry if I caused a digression by mentioning Slider etc. I am not sure the upgrade scenario is the only one for this jira, since this jira covers a broader set. Even without upgrades, apps can change the processes they are running in a container without having to lose the container allocation. Identical calls of primitives could be used without the notion of upgrade. E.g. start a Java process first for a Java task, then launch a Python process for a Python task. To the NM this is identical to starting v1 and then starting v2. So while it makes sense for the second one to use an API called upgrade, it may not for the first one.

(Unrelated to this jira, IMO, YARN should allow upgrade of app code without losing containers, but not necessarily understand it deeply. E.g. YARN need not assume that upgrade will need additional resources or try to acquire them transparently for the application.)

For the purpose of this jira, here is what my thoughts were when I opened YARN-1292 to delink process lifecycle from container.
1) new API - acquireContainer - means ask for the allocated resource. The API has a flag to specify whether process exit implies releaseContainer. This is for backwards compatibility, with a default of true. Apps that want to continue to use that behavior can explicitly pass true when using the new API; this is mainly for reducing the number of RPCs for apps like MR/Tez etc.
2) new API - startProcess - means start the remote process
3) new API - stopProcess - means stop the remote process
4) new API - releaseContainer - means release the allocated resource
5) Potentially a new API for localization, though in theory, this could be separate.

Since this fine-grained control makes the protocol chatty, we can reduce the RPC traffic by having a new NM RPC, say NMCommand, that takes a sequence of API primitives that can be sent in 1 RPC. So the current API of startContainer effectively becomes NMCommand(1, 2) and stopContainer becomes NMCommand(3, 4). This can be leveraged for backwards compatibility and rolling upgrades.

The above items would effectively delink process and container lifecycle and close out this jira. This provides the fine-grained control in core YARN that can be used for various scenarios, e.g. upgrades, without YARN understanding the scenarios. If we need to add higher level notions for upgrades etc., then those could be done as separate items.

I hope that helps make my thoughts concrete within the scope of this jira.
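To make the NMCommand idea above a little more concrete, here is a minimal Java sketch of a command that carries an ordered sequence of the numbered primitives, with the existing startContainer and stopContainer expressed as NMCommand(1, 2) and NMCommand(3, 4). The type names and factory methods are assumptions for illustration; the thread only names the primitives and the NMCommand RPC, not any concrete signatures.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// The five primitives from the list above, in the same order (1..5).
enum Primitive {
    ACQUIRE_CONTAINER, START_PROCESS, STOP_PROCESS, RELEASE_CONTAINER, LOCALIZE
}

// Hypothetical sketch only: an ordered batch of primitives sent in one RPC.
final class NMCommand {
    private final List<Primitive> steps;

    private NMCommand(Primitive... steps) {
        this.steps = Collections.unmodifiableList(Arrays.asList(steps.clone()));
    }

    static NMCommand of(Primitive... steps) {
        return new NMCommand(steps);
    }

    List<Primitive> steps() {
        return steps;
    }

    // Today's startContainer, expressed as NMCommand(1, 2).
    static NMCommand legacyStartContainer() {
        return of(Primitive.ACQUIRE_CONTAINER, Primitive.START_PROCESS);
    }

    // Today's stopContainer, expressed as NMCommand(3, 4).
    static NMCommand legacyStopContainer() {
        return of(Primitive.STOP_PROCESS, Primitive.RELEASE_CONTAINER);
    }
}
```

Keeping the legacy calls as fixed compositions of the new primitives is what would let old clients keep working while the NM only implements the primitives once.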
[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15166462#comment-15166462 ]

Arun Suresh commented on YARN-1040:
---

So that we are on the same page: if we were to separate what needs to be in YARN vs what Slider etc. should handle, I'd say:

*YARN*
* Container Upgrade primitive:
** provide AM with APIs (via NMClient) to upgrade the Container.
** API takes 1) a new {{ContainerLaunchContext}} and 2) a policy, viz. *In-place* (localize v2 in parallel, start v2, stop v1) or *New+rollback* (stop v1, localize v2, start v2) + (start v1 if start v2 fails), *or* a list of primitive composable commands if the above policies don't cover the use case.
** should negotiate Resource increase for in-place upgrade with RM prior to upgrade via YARN-1197 (or perhaps use OPPORTUNISTIC containers to locally negotiate at the NM for the resource spike needed for upgrade, once YARN-2877 is ready)

*Slider / or something similar*
* Application upgrade primitive
** Upgrade Orchestration Policy: allow applications deployed via Slider to specify the order in which tasks/roles are upgraded (or started)
** Allow applications to specify how containers of each role are upgraded
** Actually call the YARN container upgrade APIs (described above) to perform the upgrade of each container in the user-specified order/policy

Makes sense?
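The two upgrade policies named in the comment above can be read as canned sequences of primitive steps. The following Java sketch writes them down that way; the enum and its method names are hypothetical, and the steps are plain strings taken from the comment rather than real API calls. The empty failure sequence for the in-place policy is an assumption (v1 has not been stopped yet when v2 fails to start, so there is nothing to restore).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch only: upgrade policies as canned step sequences.
enum UpgradePolicy {
    // v1 keeps running while v2 is localized and started.
    IN_PLACE("localize v2", "start v2", "stop v1"),
    // v1 is stopped first; it is restarted if v2 fails to start.
    NEW_PLUS_ROLLBACK("stop v1", "localize v2", "start v2");

    private final List<String> steps;

    UpgradePolicy(String... steps) {
        this.steps = Collections.unmodifiableList(Arrays.asList(steps));
    }

    List<String> steps() {
        return steps;
    }

    // Steps to run if "start v2" fails.
    List<String> onStartFailure() {
        if (this == NEW_PLUS_ROLLBACK) {
            return Collections.singletonList("start v1");
        }
        // Assumption: in-place leaves v1 running, so nothing to roll back.
        return Collections.emptyList();
    }
}
```

Seen this way, a policy is just a named list of primitives, which is why the comment can offer "a list of primitive composable commands" as the fallback when neither policy fits.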
[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15165668#comment-15165668 ]

Bikas Saha commented on YARN-1040:
--

Agree with your scenarios. I am trying to figure a way by which this does not become a YARN problem (both initial work and ongoing maintenance). E.g. we don't know for sure that the resource needs to be x, 2x or 3x. This is an allocation decision and cannot be done without the RM's blessing. And increasing container resources is already work in progress and may become another NM primitive. Next, what is the ordering for the tasks during an upgrade? We could implement one of many possibilities but then be stuck with bug-fixing or improving it, and potentially see it used as a precedent to implement yet another upgrade policy.

Hence my suggestion of creating composable primitives that can be used to easily implement these flows, leaving it to the apps to determine the exact upgrade paths. Perhaps Slider is a better place, which could wrap different upgrade possibilities using the composable primitives. E.g. SliderStopAllUpgradePolicy or SliderConcurrentUpgradePolicy. Or they could be provided as helper libs in YARN/NMClient so apps don't have to compose the primitives from scratch.

The main aim is to continue to keep core YARN/NM simple by creating primitives and layering complexity on top. This approach may be simpler and incremental to develop, test and deploy. Of course, these are my personal design views :)

Thoughts?
[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163864#comment-15163864 ]

Arun Suresh commented on YARN-1040:
---

Thanks for the feedback [~bikassaha]

I understand we might not want to place artificial constraints on apps; I was just trying to scope out the bare minimum effort required specifically for long-running container upgrades. That said, I'm all for going the whole hog (allow 0 or 1+ processes) if that is maybe easier.

Some thoughts specifically with regard to container upgrade:
# If we allow multiple processes per container, we might need {{startProcess()}} to return a *processId*, which can then be used by the AM to address the process in subsequent calls like {{stopProcess()}}. This might complicate the state of the AM, and maybe we can leave it out in the first cut.
# w.r.t. resource re-localization, as per YARN-4597, we are exploring localization as a service and possibly re-localization on the fly.
# I like the idea of clubbing multiple API calls in the same RPC. But should *upgrade* be a first-class semantic, or should it be expressed as a {{localize v2, start v2, stop v1}} API combo? One reason to distinguish may be the case of having both versions up at the same time till the new version stabilizes... in an upgrade case, the Container should probably be allowed to go 2x its allocated resource limit for a period of time, but in the case where we are just starting 2 processes, this should probably not be allowed.
[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163708#comment-15163708 ]

Bikas Saha commented on YARN-1040:
--

I am not sure we need to place (somewhat artificial) constraints on the app when it's not clear that it practically affects YARN.
1) Container with no process should be allowed. Apps could terminate all running tasks of version A, then start running tasks of version B when they are not backwards compatible.
2) Container should be allowed to run multiple processes. This is similar to the existing process spawning more processes. It is different from that in the sense that the NM has to add the new process to existing monitoring/cgroups etc.
3) startProcess should be allowed with no process actually started. This will allow apps to localize new resources to an existing container. Alternatively, we could create a new localization API that's delinked from starting the process. But re-localization is an important related feature that we should look at supporting via this work, because currently that does not work since it's tied to start process.
4) Most current apps are already communicating directly with their tasks and hence can shut them down when they are not needed. However, like suggested above, it may be useful for the NM to provide a feature whereby the previous task can be shut down when a new task request is received. Alternatively, the NM could provide a stopProcess API to make that explicit.

IMO all of this should be allowed. The timeline could be different, with some being allowed earlier and some later based on implementation effort.

Thinking ahead, it may be useful for the NM to accept a series of API calls within the same RPC (with the current mechanism supported as a single command entity for backwards compatibility). Then we will not have to build a lot of logic into the NM. The app can get all features by composing a multi-command entity. E.g.

Current start process = {acquire, localize, start} // where acquire means start container
Current shutdown process = {stop, release} // where release means give up container
Only localize = {localize}
Start another process = {localize, start}
Start another process after shutting down first process = {stop, start} or {stop, localize, start}
Start another process and then shut down the first process = {start, stop}
New container shutdown = {release} // at this point there may be 0 or more processes running, which will be stopped
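The compositions listed above imply some ordering rules (nothing can run before acquire, stop needs a running process, release stops whatever is left). A small Java simulation of those rules is sketched below. The `Op` and `ContainerSim` names are hypothetical, the exact rules are one possible reading of the list rather than anything specified in the thread, and processes are only counted, not individually addressed.

```java
// The five primitives used in the compositions above.
enum Op { ACQUIRE, LOCALIZE, START, STOP, RELEASE }

// Hypothetical sketch only: tracks whether the container is held and
// how many processes run in it while a multi-command entity is applied.
final class ContainerSim {
    private boolean held = false;
    private int processes = 0;

    boolean isHeld() { return held; }
    int processCount() { return processes; }

    // Applies the ops in order; throws on the first op that is not allowed.
    // Ops before the failing one stay applied (no transactional rollback).
    void apply(Op... ops) {
        for (Op op : ops) {
            switch (op) {
                case ACQUIRE:
                    require(!held, "container already acquired");
                    held = true;
                    break;
                case LOCALIZE:
                    require(held, "localize needs an acquired container");
                    break;
                case START:
                    require(held, "start needs an acquired container");
                    processes++;
                    break;
                case STOP:
                    require(held && processes > 0, "stop needs a running process");
                    processes--;
                    break;
                case RELEASE:
                    require(held, "release needs an acquired container");
                    processes = 0; // 0 or more running processes are stopped
                    held = false;
                    break;
            }
        }
    }

    private static void require(boolean ok, String message) {
        if (!ok) {
            throw new IllegalStateException(message);
        }
    }
}
```

For example, applying {acquire, localize, start}, then {start, stop}, then {stop} leaves a held container with zero processes, which is the "container with no process" case from point 1 of the comment.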
[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163564#comment-15163564 ]

Arun Suresh commented on YARN-1040:
---

Spent some time going through the conversation (this one as well as YARN-1404).

Given that this has been tracked as a requirement for in-place application upgrades and it has been some time since any activity has been posted here, [~bikassaha] / [~vinodkv] / [~hitesh] / [~tucu00] / [~steve_l], can you kindly clarify the following?
# Are we still trying to handle the case where we have > 1 processes running against a container *at the same time*?
# Have we decided that allowing a Container with 0 processes running is a bad idea?

From the context of getting Application upgrades working, I guess 1) can be relaxed to exactly 1 process running under a container, but the AM has the option of explicitly starting it via the {{startProcess(containerLaunchContext)}} API Bikas mentioned (an additional constraint could probably be that startProcess has to be called within a timeout if no ContainerLaunchContext has been provided with the initial {{startContainer()}}, else NM will deem the container dead).

In addition, I was also thinking:
# If a process is already running in the container when a {{startProcess(ContainerLaunchContext)}} is received, then the first process is killed and another is started using the new {{ContainerLaunchContext}}
# Maybe we can refine the above by adding an {{upgradeProcess(ContainerLaunchContext)}} API that can additionally take on a policy like:
## auto-rollback if the new process does not start within a timeout.
## Rollback could either mean keeping the old process running until the upgraded process is up -or- if we want to preserve semantics of only 1 process per container, first kill the old process and try to start the new one, and on failure restart the old version.

If everyone is ok with the above, I volunteer to either post a preliminary patch for the above or, if the details get dicier during investigation, I can put up a doc.

Thoughts?
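The single-process variant described above (startProcess kills any running process; upgradeProcess kills the old process, tries the new one, and restarts the old version on failure) might look roughly like this in Java. Everything here is a hypothetical sketch: the launch context is reduced to a string label, the `launcher` predicate stands in for whatever actually starts a process, and the start timeout from the comment is not modelled.

```java
import java.util.function.Predicate;

// Hypothetical sketch only: a container that runs at most one process.
final class SingleProcessContainer {
    // Returns true if a process for the given launch context started.
    private final Predicate<String> launcher;
    // Label of the running process, or null if the container is empty.
    private String running;

    SingleProcessContainer(Predicate<String> launcher) {
        this.launcher = launcher;
    }

    String running() {
        return running;
    }

    // Kills whatever is running, then tries to start the new process.
    boolean startProcess(String launchContext) {
        running = null;
        if (launcher.test(launchContext)) {
            running = launchContext;
            return true;
        }
        return false;
    }

    // Like startProcess, but restarts the old version if the new one fails.
    boolean upgradeProcess(String newLaunchContext) {
        String old = running;
        if (startProcess(newLaunchContext)) {
            return true;
        }
        if (old != null && launcher.test(old)) {
            running = old; // rollback succeeded
        }
        return false;
    }
}
```

This follows the "only 1 process per container" reading of rollback; the alternative in the comment, keeping the old process up until the new one is running, would need the container to hold two processes for a while.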
[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159730#comment-15159730 ]

Vinod Kumar Vavilapalli commented on YARN-1040:
---

Moved this to be a sub-task of YARN-4692 given the renewed focus there.
[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13844594#comment-13844594 ] Hitesh Shah commented on YARN-1040: --- Given the recent comments on YARN-1404, I believe that this should not be supported unless the resources are being delegated to another YARN container. Furthermore, if we are talking about container leases (for multiple process launches, without any resource delegation), a container lease should start when the first process is launched, so an API that supports a null ContainerLaunchContext is moot. The lease aspects should probably be encoded into the container token so that the NM understands that a process exiting in a particular container need not signal the end of the container, i.e. multipleProcesses should not be an explicit flag in the API.
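To make the token-based idea in the comment above concrete, here is a minimal Java sketch. It assumes a hypothetical `leased` field carried in the container token; nothing in this thread says the real YARN token has such a field, and `LeaseTokenSketch`, `ContainerToken`, `processExitEndsContainer` and `startFirstProcess` are illustrative names, not YARN APIs.

```java
/** Illustrative only: models the "lease lives in the token" idea, not real YARN classes. */
final class LeaseTokenSketch {

    /** Hypothetical stand-in for a container token issued by the RM. */
    static final class ContainerToken {
        final String containerId;
        final boolean leased; // assumption: lease information is encoded here, not in the start API

        ContainerToken(String containerId, boolean leased) {
            this.containerId = containerId;
            this.leased = leased;
        }
    }

    /** NM-side rule: a process exit ends the container only when the token carries no lease. */
    static boolean processExitEndsContainer(ContainerToken token) {
        return !token.leased;
    }

    /** The lease starts with the first process, so a launch without a command is rejected. */
    static void startFirstProcess(ContainerToken token, String command) {
        if (command == null || command.isEmpty()) {
            throw new IllegalArgumentException(
                "container " + token.containerId + " cannot start without a process");
        }
        // A real NM would launch the process here; the sketch only validates the request.
    }

    public static void main(String[] args) {
        ContainerToken classic = new ContainerToken("container_01", false);
        ContainerToken lease = new ContainerToken("container_02", true);
        System.out.println("classic ends on exit: " + processExitEndsContainer(classic));
        System.out.println("leased ends on exit:  " + processExitEndsContainer(lease));
    }
}
```

In this sketch the AM never passes a flag: the NM reads the behaviour from the token alone, which is the property the comment argues for.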
[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13844595#comment-13844595 ] Hitesh Shah commented on YARN-1040: --- Sorry - got my wires crossed on the different jiras going around. To clarify, I believe container leases for multiple processes are a good feature to have. Allowing a container to be launched without a process should be a no-no. Resource delegation as mentioned in YARN-1404 seems to be a decent approach to assigning resources to other containers; however, it should be restricted to assigning resources to containers under the control of YARN.
[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13821597#comment-13821597 ] Alejandro Abdelnur commented on YARN-1040: -- [~bikassaha], if I got it right, you suggest:
* {{StartContainerRequest}} would have a new property {{boolean multipleProcesses (false)}}.
* An additional API {{startProcess(ContainerId, ContainerLaunchContext)}} will be used to start multiple processes within the same container.
* In a {{StartContainerRequest}}, if the {{ContainerLaunchContext == null}} and {{multipleProcesses = true}}, the container is started with no associated process and the container allocation will not time out, as it has been claimed by the AM (because of the start container request).
If that is the case, then YARN-1404 would be a special case of this JIRA. Am I right?
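The three points summarized above can be sketched as a small in-memory model. This is a hedged illustration under stated assumptions, not the YARN implementation: `LaunchContext` and `StartRequest` are simplified stand-ins for {{ContainerLaunchContext}} and {{StartContainerRequest}}, container IDs are plain strings, and `processExited` and `isReleased` are hypothetical helpers added only to show when the container would be released.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of the proposal; none of these types are the real YARN classes. */
final class MultiProcessContainerSketch {

    /** Stand-in for ContainerLaunchContext: only the command to run. */
    static final class LaunchContext {
        final String command;
        LaunchContext(String command) { this.command = command; }
    }

    /** Stand-in for StartContainerRequest carrying the proposed flag. */
    static final class StartRequest {
        final String containerId;
        final LaunchContext context;      // may be null only when multipleProcesses is true
        final boolean multipleProcesses;  // proposed flag; false keeps today's behaviour
        StartRequest(String containerId, LaunchContext context, boolean multipleProcesses) {
            this.containerId = containerId;
            this.context = context;
            this.multipleProcesses = multipleProcesses;
        }
    }

    private static final class Container {
        final boolean multipleProcesses;
        final List<String> running = new ArrayList<>();
        boolean released = false;
        Container(boolean multipleProcesses) { this.multipleProcesses = multipleProcesses; }
    }

    private final Map<String, Container> containers = new HashMap<>();

    /** Claims the container; launches the first process if a context is supplied. */
    void startContainer(StartRequest req) {
        if (req.context == null && !req.multipleProcesses) {
            throw new IllegalArgumentException("a single-process container needs a launch context");
        }
        Container c = new Container(req.multipleProcesses);
        containers.put(req.containerId, c);
        if (req.context != null) {
            c.running.add(req.context.command);
        }
    }

    /** Proposed additional API: start another process in an already claimed container. */
    void startProcess(String containerId, LaunchContext context) {
        Container c = containers.get(containerId);
        if (c == null || c.released) {
            throw new IllegalStateException("unknown or released container: " + containerId);
        }
        if (!c.multipleProcesses) {
            throw new IllegalStateException("container was not started with multipleProcesses");
        }
        c.running.add(context.command);
    }

    /** A process exit releases the container only in the single-process case. */
    void processExited(String containerId, String command) {
        Container c = containers.get(containerId);
        if (c == null) {
            throw new IllegalStateException("unknown container: " + containerId);
        }
        c.running.remove(command);
        if (!c.multipleProcesses) {
            c.released = true;
        }
    }

    boolean isReleased(String containerId) {
        Container c = containers.get(containerId);
        return c == null || c.released;
    }

    public static void main(String[] args) {
        MultiProcessContainerSketch nm = new MultiProcessContainerSketch();
        nm.startContainer(new StartRequest("c1", null, true)); // claimed, no process yet
        nm.startProcess("c1", new LaunchContext("regionserver"));
        nm.processExited("c1", "regionserver");
        System.out.println("c1 released after exit: " + nm.isReleased("c1"));
    }
}
```

The sketch keeps the flag default-compatible: a request with a context and `multipleProcesses = false` behaves like the existing single-process container.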
[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13820645#comment-13820645 ] Alejandro Abdelnur commented on YARN-1040: -- [~bikassaha], a minor twist to the outlined approach is that we don't need a flag, just a NULL {{ContainerLaunchContext}}. If the context is not NULL on {{startContainer()}}, the container is meant to have 1 process only and finishes on process completion. This would preserve backwards compatibility. Only when {{startContainer()}} has a NULL context could there be multiple processes. Makes sense?
[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13821012#comment-13821012 ] Bikas Saha commented on YARN-1040: -- That's unnecessary overhead: making 2 RPC calls for the first process when I want to run multiple processes within the same container. I first need to do startContainer(null) and then startProcess(). startContainer(process, flag-multiple-true) is more efficient since there is only 1 RPC. Also, the flag is completely backwards compatible with a default of false. We must support startContainer(null/no-process, flag-multiple-true) for the case in which the first process to run is not yet ready, or the case mentioned in YARN-1404 where we don't ever want to run a process.
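The RPC-count argument above can be illustrated with a small counting stub. This is a sketch under assumptions: `CountingNodeManager` and its methods are hypothetical and only count calls, they do not model real NM RPCs or the real {{startContainer()}} signature.

```java
/** Illustrative comparison of the two variants discussed in the thread. */
final class RpcCountSketch {

    /** Hypothetical NM stub that only counts how many calls the AM makes. */
    static final class CountingNodeManager {
        int rpcCalls = 0;

        /** Flag variant: the first process and the multi-process intent travel together. */
        void startContainerWithFlag(String commandOrNull, boolean multipleProcesses) {
            rpcCalls++;
        }

        /** Null-context variant: a null command is the only way to ask for multiple processes. */
        void startContainer(String commandOrNull) {
            rpcCalls++;
        }

        void startProcess(String command) {
            rpcCalls++;
        }
    }

    /** Calls needed to get the first process running with the explicit flag. */
    static int rpcsWithFlag(String firstCommand) {
        CountingNodeManager nm = new CountingNodeManager();
        nm.startContainerWithFlag(firstCommand, true);
        return nm.rpcCalls;
    }

    /** Calls needed when multi-process mode is signalled only by a null context. */
    static int rpcsWithNullContextOnly(String firstCommand) {
        CountingNodeManager nm = new CountingNodeManager();
        nm.startContainer(null);       // claim the container in multi-process mode
        nm.startProcess(firstCommand); // then launch the first process
        return nm.rpcCalls;
    }

    public static void main(String[] args) {
        System.out.println("flag variant RPCs:         " + rpcsWithFlag("regionserver"));
        System.out.println("null-context variant RPCs: " + rpcsWithNullContextOnly("regionserver"));
    }
}
```

The flag variant still accepts a null command, which covers the "first process not yet ready" and YARN-1404 cases mentioned in the comment.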