[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container

2016-03-01 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174775#comment-15174775
 ] 

Vinod Kumar Vavilapalli commented on YARN-1040:
---

bq. General case: AM launches multiple containers at the same time. This is 
essentially container-groups - we should keep this option open.
Clarification on what I meant here: It's okay for now to design only the APIs 
(and defer implementation), so that even if our first implementation covers 
only allocation-vs-container delinking, container-groups remain possible in 
future without further API changes/additions.

> De-link container life cycle from the process and add ability to execute 
> multiple processes in the same long-lived container
> 
>
> Key: YARN-1040
> URL: https://issues.apache.org/jira/browse/YARN-1040
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
>
> The AM should be able to exec >1 process in a container, rather than have the 
> NM automatically release the container when the single process exits.
> This would let an AM restart a process on the same container repeatedly, 
> which for HBase would offer locality on a restarted region server.
> We may also want the ability to exec multiple processes in parallel, so that 
> something could be run in the container while a long-lived process was 
> already running. This can be useful in monitoring and reconfiguring the 
> long-lived process, as well as shutting it down.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container

2016-02-25 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15167889#comment-15167889
 ] 

Bikas Saha commented on YARN-1040:
--

Vinod, the plan you are suggesting has merits. But my initial impression is 
that reworking allocations and containers is a much bigger change than what's 
proposed earlier in this jira, not only internally in YARN but also externally 
in terms of how users of YARN think about the whole flow of allocations and 
containers.
The proposal discussed earlier is of much smaller scope and I believe 
sufficient to take us where we need to go. And it does not need reworking the 
RM related flow of allocations and containers. E.g. it may not be necessary for 
the RM to understand single use allocations vs multi-use vs concurrent use 
allocations. But for the RM level changes you are suggesting we may be on the 
path of convergence.

At this point, the discussion is complex enough that we may want to gather 
interested people and do it as a group outside jira comments and then post it 
back.



[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container

2016-02-25 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15167794#comment-15167794
 ] 

Vinod Kumar Vavilapalli commented on YARN-1040:
---

Still catching up on some of your discussion, but quick comments on a few 
things that I care about.

We should really call them {{Allocation}} and {{Container}} instead. That's 
the nomenclature I used in YARN-4726. Till now, YARN has combined the notion of 
Allocation and Container, which is the main de-linking that we need to do here. 
Process has an OS level connotation, and doesn't work well with more things in 
the picture like process-trees / multiple-processes / Docker (YARN-2466).

Taking this further, here's how the overall picture could look:

*ResourceManager*
 - RM only does allocations in the scheduling path: it does all scheduling 
based on AllocationRequests and tracks Allocations.
 - RM receives AllocationRequests and returns fulfilled Allocations (and 
AllocationTokens) to AMs.

*Applications*
 - AM can in turn use the Allocations (and AllocationTokens) to launch multiple 
Containers on the NM.
-- Simple case: AM only launches containers one-after-another. It's up to 
the app to do this.
-- General case: AM launches multiple containers at the same time. This is 
essentially container-groups - we should keep this option open.
 - AMs can specify *single-use* AllocationRequests, at which point RM can 
simply return Containers and Container-Tokens (today's code-path).
 - Each Container exits when the process-tree / linux-container exits.
 - Each Container has an Identifier.
-- For single-use allocation-requests, RM generates ContainerIDs
-- For multi-use allocation-requests, apps could optionally specify a 
container-name that is scoped under the allocation. NM always returns a 
(generated or app-specified) ContainerID based off the allocation-ID. 
Essentially, allocationID + containerID is unique

*NodeManagers*
 - NodeManager also understands incoming Allocations and ties them to Container 
groups: it deals with Allocation activation/deactivation and Container 
start/stop, but does
-- the following *decoupled from both allocations and containers*: 
localizations / re-localizations. This means local-resources should now have 
more scopes: container, allocation, application etc.
-- *per allocation*: enforcement of resource-limits
-- all of the following *per container*: (a) process/OS-container 
activation / deactivation, (b) process/OS-container auto-restart (YARN-4725), 
(c) log-aggregation
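
The identifier scheme outlined above (a container name scoped under an allocation, with allocationID + containerID unique) could be sketched roughly as follows. This is only an illustration: {{ScopedContainerId}} and {{AllocationView}} are hypothetical names, not YARN classes, and the "c1", "c2" naming of generated IDs is an assumption.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical types for illustration only; none of these are YARN classes.
final class ScopedContainerId {
    final String allocationId;   // assumed to be issued by the RM
    final String containerName;  // app-specified, or generated by the NM

    ScopedContainerId(String allocationId, String containerName) {
        this.allocationId = Objects.requireNonNull(allocationId);
        this.containerName = Objects.requireNonNull(containerName);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ScopedContainerId)) {
            return false;
        }
        ScopedContainerId other = (ScopedContainerId) o;
        return allocationId.equals(other.allocationId)
            && containerName.equals(other.containerName);
    }

    @Override
    public int hashCode() {
        return Objects.hash(allocationId, containerName);
    }
}

/** Toy NM-side view of one multi-use allocation. */
final class AllocationView {
    private final String allocationId;
    private final Set<ScopedContainerId> running = new HashSet<>();
    private int generated = 0;

    AllocationView(String allocationId) {
        this.allocationId = allocationId;
    }

    /** Starts a container; a null name asks the simulated NM to generate one. */
    ScopedContainerId start(String requestedName) {
        ScopedContainerId id;
        if (requestedName != null) {
            id = new ScopedContainerId(allocationId, requestedName);
            if (running.contains(id)) {
                throw new IllegalStateException("name already in use: " + requestedName);
            }
        } else {
            // Assumed naming scheme for generated IDs: c1, c2, ...
            do {
                id = new ScopedContainerId(allocationId, "c" + (++generated));
            } while (running.contains(id));
        }
        running.add(id);
        return id;
    }

    /** A container exiting leaves the allocation itself active. */
    void exit(ScopedContainerId id) {
        running.remove(id);
    }

    int runningCount() {
        return running.size();
    }
}
```

In this sketch the same container-name can be reused in two different allocations, and reused within one allocation once the earlier container has exited, which is the behaviour the scoping rule seems to imply.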



[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container

2016-02-25 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15167738#comment-15167738
 ] 

Bikas Saha commented on YARN-1040:
--

My guess is that YARN-4725 may be redundant after we do this work because then 
we would have exposed primitives to apps to make that happen. The arguments for 
YARN not doing it by itself would be the same. If it can be done easily by the 
app and is very likely app dependent without one-size-fits-all then let the app 
do it.

Coming back to this jira. Yes, let's please track any first-class support of the 
notion of upgrades separately, which can be done as a follow up.

Perhaps we can put the design in a document and look at the next level of 
details. We can send email to the dev list after adding a more detailed 
document to this jira. Then, based on +ve feedback, we could go ahead with 
jiras/code. The devil is in the details :) This would be a significant change 
and we could use more eyes for reviews.

For the startProcess identifier, it may be useful for the app to provide the 
identifier in startProcess and then use it later to refer to the process, e.g. 
in stopProcess, rather than YARN trying to come up with identifiers. This may 
make the app's life easier because it could use meaningful terms based on its 
own logic. We can discuss such details in the design document.
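
As a rough illustration of app-provided identifiers, the sketch below keeps a per-container table keyed by whatever name the app chose. {{ContainerProcesses}} is a hypothetical class, and rejecting a duplicate identifier is an assumption; the method names only mirror the ones used in this discussion.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not an existing NM or NMClient API.
final class ContainerProcesses {
    // app-chosen identifier -> command line of the running process
    private final Map<String, String> byId = new LinkedHashMap<>();

    void startProcess(String appChosenId, String command) {
        if (byId.containsKey(appChosenId)) {
            // Assumption: identifiers must be unique within one container.
            throw new IllegalArgumentException("identifier already in use: " + appChosenId);
        }
        byId.put(appChosenId, command);
    }

    /** Returns true if a process with that identifier was running. */
    boolean stopProcess(String appChosenId) {
        return byId.remove(appChosenId) != null;
    }

    Set<String> runningIds() {
        return Collections.unmodifiableSet(byId.keySet());
    }
}
```

An app could then use names such as "regionserver" or "log-shipper" in both calls without tracking any YARN-generated value.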




[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container

2016-02-25 Thread Arun Suresh (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15167674#comment-15167674
 ] 

Arun Suresh commented on YARN-1040:
---

Thanks for clarifying [~bikassaha]

I propose we break this down into sub-jiras:
# New APIs specific to delinking container life-cycle from the process: This 
will include the 4 new APIs specified above (excluding localization) but will 
assume a single process (and thus startProcess does not need to return a 
processId)
# Add support for clubbing APIs into a single RPC
** Might have to think a bit about validating the order and multiplicity of the 
API calls in each command (which I expect might be different for single process 
/ multiple processes) 
# Add support for localize API
# Add support for multiple processes
** A processId will be returned for a startProcess. Might have to think through 
this further, e.g. how does this integrate with YARN-4725
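
One way the order and multiplicity check mentioned under the second item might look is sketched below. The rules are assumptions made for the example, not decisions from this thread: nothing may run before acquire or after release, a stop needs a running process, and the only difference between single-process and multi-process mode is the cap on concurrent starts. {{Op}} and {{CommandValidator}} are hypothetical names.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical primitive names for illustration.
enum Op { ACQUIRE, LOCALIZE, START, STOP, RELEASE }

final class CommandValidator {
    private final int maxProcesses; // 1 for the single-process case, more otherwise

    CommandValidator(int maxProcesses) {
        this.maxProcesses = maxProcesses;
    }

    /** Checks a command sent to a fresh container; empty means the sequence is valid. */
    Optional<String> firstError(List<Op> ops) {
        boolean acquired = false;
        boolean released = false;
        int running = 0;
        for (int i = 0; i < ops.size(); i++) {
            Op op = ops.get(i);
            if (released) {
                return Optional.of("step " + i + ": " + op + " after RELEASE");
            }
            if (op == Op.ACQUIRE) {
                if (acquired) {
                    return Optional.of("step " + i + ": duplicate ACQUIRE");
                }
                acquired = true;
                continue;
            }
            if (!acquired) {
                return Optional.of("step " + i + ": " + op + " before ACQUIRE");
            }
            switch (op) {
                case START:
                    if (running >= maxProcesses) {
                        return Optional.of("step " + i + ": too many processes");
                    }
                    running++;
                    break;
                case STOP:
                    if (running == 0) {
                        return Optional.of("step " + i + ": STOP with no process");
                    }
                    running--;
                    break;
                case RELEASE:
                    released = true;
                    break;
                default:
                    // LOCALIZE is allowed at any point while the container is held.
                    break;
            }
        }
        return Optional.empty();
    }
}
```

With a cap of 1, a command of acquire, start, start is rejected, while the same command passes with a cap of 2, which is the kind of single versus multiple process difference anticipated above.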

For the purpose of Application Upgrades (for which this JIRA is marked as a 
sub-task of... also why I'm calling it out specifically): Add support for 
Container Upgrades
# Expose a canned NMCommand that has the list of APIs to upgrade based on some 
policy

If folks are fine with this, I will go ahead and open JIRAs and link this issue 
to each of the above JIRAs (since I don't think I can create subtasks for this 
JIRA) so that we can start work on the same.



[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container

2016-02-24 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15166633#comment-15166633
 ] 

Bikas Saha commented on YARN-1040:
--

I am sorry if I caused a digression by mentioning Slider etc.

I am not sure the upgrade scenario is the only one for this jira since this 
jira covers a broader set. Even without upgrades apps can change the processes 
they are running in a container without having to lose the container 
allocation. Identical calls of primitives could be used without the notion of 
upgrade. E.g. start a Java process first for a Java task, then launch a python 
process for a Python task. To the NM this is identical to starting v1 and then 
starting v2. So while it makes sense for the second one to use an API called 
upgrade, it may not for the first one. 

(Unrelated to this jira, IMO, YARN should allow upgrade of app code without 
losing containers but not necessarily understand it deeply. E.g. YARN need not 
assume that upgrade will need additional resource or try to acquire them 
transparently for the application.)

For the purpose of this jira, here is what my thoughts were when I opened 
YARN-1292 to delink process lifecycle from container.
1) new API - acquireContainer - means ask for the allocated resource. The API 
has a flag to specify whether process exit implies releaseContainer. This is 
for backwards compatibility, with a default of true. Apps that want to continue 
to use that behavior can explicitly pass true when using the new API; this is 
mainly for reducing the number of RPCs for apps like MR/Tez etc.
2) new API - startProcess - means start the remote process
3) new API - stopProcess - means stop the remote process
4) new API - releaseContainer - means release the allocated resource
5) Potentially a new API for localization, though in theory, this could be 
separate.

Since this fine grained control makes the protocol chatty, we can reduce the 
RPC traffic by having a new NM RPC, say NMCommand, that takes a sequence of API 
primitives that can be sent in 1 RPC.
So the current API of startContainer effectively becomes NMCommand(1, 2) and 
stopContainer becomes NMCommand(3,4). This can be leveraged for backwards 
compatibility and rolling upgrades.
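
The mapping of today's calls onto primitives can be sketched as below. Everything here is hypothetical ({{Primitive}}, {{NMCommand}} and {{SimulatedContainer}} are illustration-only names), and the constructor flag stands in for the "process exit implies releaseContainer" flag that acquireContainer would carry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical names throughout; nothing here is an existing YARN API.
enum Primitive { ACQUIRE_CONTAINER, START_PROCESS, STOP_PROCESS, RELEASE_CONTAINER }

final class NMCommand {
    final List<Primitive> steps;

    NMCommand(Primitive... steps) {
        this.steps = Collections.unmodifiableList(new ArrayList<>(Arrays.asList(steps)));
    }

    // Today's startContainer expressed as primitives 1 and 2.
    static NMCommand legacyStartContainer() {
        return new NMCommand(Primitive.ACQUIRE_CONTAINER, Primitive.START_PROCESS);
    }

    // Today's stopContainer expressed as primitives 3 and 4.
    static NMCommand legacyStopContainer() {
        return new NMCommand(Primitive.STOP_PROCESS, Primitive.RELEASE_CONTAINER);
    }
}

/** Toy stand-in for the NM-side state of one container. */
final class SimulatedContainer {
    private final boolean releaseOnProcessExit; // the flag passed at acquire time
    private boolean acquired;
    private boolean processRunning;

    SimulatedContainer(boolean releaseOnProcessExit) {
        this.releaseOnProcessExit = releaseOnProcessExit;
    }

    /** Applies every primitive of one command in order, as if sent in one RPC. */
    void apply(NMCommand command) {
        for (Primitive p : command.steps) {
            switch (p) {
                case ACQUIRE_CONTAINER:
                    acquired = true;
                    break;
                case START_PROCESS:
                    if (!acquired) {
                        throw new IllegalStateException("container not acquired");
                    }
                    processRunning = true;
                    break;
                case STOP_PROCESS:
                    onProcessExit();
                    break;
                case RELEASE_CONTAINER:
                    processRunning = false;
                    acquired = false;
                    break;
            }
        }
    }

    /** Called when the process is stopped or exits on its own. */
    void onProcessExit() {
        processRunning = false;
        if (releaseOnProcessExit) {
            acquired = false;
        }
    }

    boolean isAcquired() { return acquired; }
    boolean isProcessRunning() { return processRunning; }
}
```

With the flag set to false, the container survives a process exit and a later command containing only the start primitive reuses it; with the flag set to true, the sketch behaves like the current code path.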

The above items would effectively delink process and container lifecycle and 
close out this jira.

This provides the fine grained control in core YARN that can be used for 
various scenarios e.g. upgrades without YARN understanding the scenarios. If we 
need to add higher level notions for upgrades etc. then those could be done as 
separate items.

I hope that helps make my thoughts concrete within the scope of this jira.




[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container

2016-02-24 Thread Arun Suresh (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15166462#comment-15166462
 ] 

Arun Suresh commented on YARN-1040:
---

So that we are on the same page: if we were to separate what needs to be in 
YARN vs what Slider etc. should handle, I'd say:

*YARN*
* Container Upgrade primitive:
** provide AM with APIs (via NMClient) to upgrade the Container.
** API takes 1) a new {{ContainerLaunchContext}} and 2) a policy, viz. 
*In-place* (localize v2 in parallel, start v2, stop v1) or *New+rollback* 
(stop v1, localize v2, start v2) + (start v1 if start v2 fails), *or* a list of 
primitive composable commands if the above policies don't cover the use case.
** should negotiate Resource increase for in-place upgrade with RM prior to 
upgrade via YARN-1197 (or perhaps use OPPORTUNISTIC containers to locally 
negotiate at the NM for the resource spike needed for upgrade, once YARN-2877 
is ready)
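
The two policies can be written down as ordered step lists, which also shows why only the in-place variant needs extra resources. This is a sketch under assumptions: {{Step}} and {{UpgradePolicy}} are hypothetical names, and counting processes is used here as a crude proxy for the resource spike.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical names; the step order per policy is copied from the list above.
enum Step {
    LOCALIZE_V2(0), START_V2(1), STOP_V1(-1), START_V1(1);

    final int processDelta; // change in the number of running processes

    Step(int processDelta) {
        this.processDelta = processDelta;
    }
}

enum UpgradePolicy {
    IN_PLACE(
        Arrays.asList(Step.LOCALIZE_V2, Step.START_V2, Step.STOP_V1),
        Collections.<Step>emptyList()),
    NEW_WITH_ROLLBACK(
        Arrays.asList(Step.STOP_V1, Step.LOCALIZE_V2, Step.START_V2),
        Collections.singletonList(Step.START_V1));

    private final List<Step> steps;
    private final List<Step> onStartFailure;

    UpgradePolicy(List<Step> steps, List<Step> onStartFailure) {
        this.steps = Collections.unmodifiableList(steps);
        this.onStartFailure = Collections.unmodifiableList(onStartFailure);
    }

    List<Step> steps() { return steps; }

    /** Extra steps to run if starting v2 fails. */
    List<Step> onStartFailure() { return onStartFailure; }

    /** Highest number of simultaneously running processes, starting from v1 alone. */
    int peakProcesses() {
        int running = 1;
        int peak = running;
        for (Step step : steps) {
            running += step.processDelta;
            peak = Math.max(peak, running);
        }
        return peak;
    }
}
```

In this model {{IN_PLACE}} peaks at two running processes while {{NEW_WITH_ROLLBACK}} never exceeds one, matching the point that only the in-place upgrade would need a resource increase negotiated first.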

*Slider / or something similar*
* Application upgrade primitive
** Upgrade Orchestration Policy: Allow applications deployed via slider to 
specify order in which tasks/roles are upgraded (or started) 
** Allow applications to specify how containers of each role are upgraded
** Actually call the YARN container upgrade APIs (described above) to perform 
upgrade of each container in the user specified order/policy

Makes sense ?




[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container

2016-02-24 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15165668#comment-15165668
 ] 

Bikas Saha commented on YARN-1040:
--

Agree with your scenarios. 

I am trying to figure out a way by which this does not become a YARN problem 
(both initial work and ongoing maintenance). E.g. we don't know for sure that 
the resource needs to be x, 2x or 3x. This is an allocation decision and cannot 
be done without the RM's blessing. And increasing container resources is 
already work in progress and may become another NM primitive. Next, what is the 
ordering for the tasks during an upgrade? We could implement one of many 
possibilities but then be stuck with bug-fixing or improving it. That could 
then become a precedent for implementing yet another upgrade policy. 

Hence my suggestion of creating composable primitives that can be used to 
easily implement these flows, leaving it to the apps to determine the exact 
upgrade paths. Perhaps Slider is a better place, which could wrap different 
upgrade possibilities using the composable primitives, e.g. 
SliderStopAllUpgradePolicy or SliderConcurrentUpgradePolicy. Or they could be 
provided as helper libs in YARN/NMClient so apps don't have to compose the 
primitives from scratch. The main aim is to continue to keep core YARN/NM 
simple by creating primitives and layering complexity on top. This approach may 
be simpler and incremental to develop, test and deploy. Of course, these are my 
personal design views :)

Thoughts?




[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container

2016-02-24 Thread Arun Suresh (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163864#comment-15163864
 ] 

Arun Suresh commented on YARN-1040:
---

Thanks for the feedback [~bikassaha]

I understand we might not want to place artificial constraints on apps; I was 
just trying to scope out the bare minimum effort required specifically for long 
running container upgrades. That said, I'm all for going the whole hog (allow 0 
or 1+ processes) if that is maybe easier.

Some thoughts specifically with regard to container upgrade:
# If we allow multiple processes per container, we might need 
{{startProcess()}} to return a *processId* which can subsequently be used 
by the AM to address the process in later calls like {{stopProcess()}}. 
This might complicate the state of the AM, and maybe we can leave it out in the 
first cut.
# w.r.t. resource re-localization, as per YARN-4597, we are exploring 
localization as a service and possibly re-localization on the fly.
# I like the idea of clubbing multiple API calls in the same RPC. But should 
*upgrade* be a first class semantic, or should it be expressed as a {{localize 
v2, start v2, stop v1}} API combo? One reason to distinguish may be the case 
of having both versions up at the same time till the new version stabilizes... 
in an upgrade case, the Container should probably be allowed to go 2x its 
allocated resource limit for a period of time, but in the case where we are just 
starting 2 processes, this should probably not be allowed.




[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container

2016-02-24 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163708#comment-15163708
 ] 

Bikas Saha commented on YARN-1040:
--

I am not sure we need to place (somewhat artificial) constraints on the app 
when it's not clear that it practically affects YARN.

1) Container with no process should be allowed. Apps could terminate all 
running tasks of version A, then start running tasks of version B when they are 
not backwards compatible.
2) Container should be allowed to run multiple processes. This is similar to 
the existing process spawning more processes. It is different from that in the 
sense that the NM has to add the new process to existing monitoring/cgroups etc.
3) startProcess should be allowed with no process actually started. This will 
allow apps to localize new resources to an existing container. Alternatively, 
we could create a new localization API that's delinked from starting the 
process. But re-localization is an important related feature that we should 
look at supporting via this work, because currently it does not work since it's 
tied to start process.
4) Most current apps are already communicating directly with their tasks and 
hence can shut them down when they are not needed. However, as suggested 
above, it may be useful for the NM to provide a feature whereby the previous 
task can be shut down when a new task request is received. Alternatively, the NM 
could provide a stopProcess API to make that explicit.

IMO all of this should be allowed. The timeline could be different with some 
being allowed earlier and some later based on implementation effort.

Thinking ahead, it may be useful for the NM to accept a series of API calls 
within the same RPC (with the current mechanism supported as a single command 
entity for backwards compatibility). Then we will not have to build a lot of 
logic into the NM. The app can get all features by composing a multi-command 
entity.
E.g.
Current start process = {acquire, localize, start} // where acquire means start 
container
Current shutdown process = {stop, release} // where release means give up 
container
Only localize = {localize}
Start another process = {localize, start}
Start another process after shutting down first process = {stop, start} or 
{stop, localize, start}
Start another process and then shutdown the first process = {start, stop}
New container shutdown = {release} // at this point there may be 0 or more 
processes running and which will be stopped
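
The compositions listed above can be replayed against a toy container to see how the state evolves. This is only a sketch: {{Call}} and {{ToyContainer}} are hypothetical names, each {{rpc(...)}} invocation stands for one multi-command entity, and having stop target the oldest running process is an assumption made to keep the example small.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical call names taken from the shorthand in the list above.
enum Call { ACQUIRE, LOCALIZE, START, STOP, RELEASE }

final class ToyContainer {
    private boolean held;
    private int localizations;
    private int nextProcess = 1;
    private final Deque<Integer> processes = new ArrayDeque<>();

    /** Applies one multi-command entity, in order, as a single simulated RPC. */
    void rpc(Call... calls) {
        for (Call call : calls) {
            if (call != Call.ACQUIRE && !held) {
                throw new IllegalStateException(call + " without a held container");
            }
            switch (call) {
                case ACQUIRE:
                    held = true;
                    break;
                case LOCALIZE:
                    localizations++;
                    break;
                case START:
                    processes.addLast(nextProcess++);
                    break;
                case STOP:
                    processes.pollFirst(); // assumption: stop targets the oldest process
                    break;
                case RELEASE:
                    processes.clear();     // any remaining processes are stopped
                    held = false;
                    break;
            }
        }
    }

    boolean isHeld() { return held; }
    int processCount() { return processes.size(); }
    int localizationCount() { return localizations; }
}
```

For example, {{rpc(Call.ACQUIRE, Call.LOCALIZE, Call.START)}} corresponds to the current start-process flow, and a later {{rpc(Call.RELEASE)}} gives up the container regardless of how many processes are still running.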




[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container

2016-02-24 Thread Arun Suresh (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163564#comment-15163564
 ] 

Arun Suresh commented on YARN-1040:
---

Spent some time going through the conversation (this one as well as YARN-1404).
Given that this has been tracked as a requirement for in-place application 
upgrades and it has been some time since any activity has been posted here, 
[~bikassaha] / [~vinodkv] / [~hitesh] / [~tucu00] / [~steve_l], can you kindly 
clarify the following ?
# Are we still trying to handle the case where we have > 1 processes running 
against a container *at the same time*
# Have we decided that allowing a Container with 0 processes running is a bad 
idea ?

From the context of getting Application upgrades working, I guess 1) can be 
relaxed to exactly 1 process running under a container but AM has the option 
of explicitly starting via the {{startProcess(containerLaunchContext)}} API 
Bikas mentioned (an additional constraint could probably be the startProcess 
has to be called within a timeout if no ContainerLaunchContext has been 
provided with the initial {{startContainer()}} else NM will deem the container 
dead).

In addition, I was also thinking
# If a process is already running in the container when a 
{{startProcess(ContainerLaunchContext)}} is received, then the first process is 
killed and another is started using the new {{ContainerLaunchContext}}
# Maybe we can refine the above by adding an 
{{upgradeProcess(ContainerLaunchContext)}} API that can additionally take on a 
policy like:
## auto-rollback if the new process does not start within a timeout.
## Rollback could either mean keeping the old process running until the upgraded 
process is up -or- if we want to preserve semantics of only 1 process per 
container, first kill the old process and try to start the new one, and on 
failure restart the old version.
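
The single-process rollback variant (kill old, try new, restart old on failure) might be sketched as follows. {{SingleProcessContainer}} is a hypothetical class, launch contexts are reduced to plain strings, and the injected launcher returning false stands in for "did not start within the timeout".

```java
import java.util.function.Predicate;

// Hypothetical sketch; the method names only mirror the APIs floated above.
final class SingleProcessContainer {
    // Assumption: returns true if the given launch context came up in time.
    private final Predicate<String> launcher;
    private String current; // launch context of the running process, or null

    SingleProcessContainer(Predicate<String> launcher) {
        this.launcher = launcher;
    }

    /** A start request replaces whatever is currently running. */
    boolean startProcess(String launchContext) {
        current = null; // kill the old process, if any
        if (launcher.test(launchContext)) {
            current = launchContext;
            return true;
        }
        return false;
    }

    /** Kill old, try new, and restart the old version if the new one fails. */
    boolean upgradeProcess(String newContext) {
        String previous = current;
        if (startProcess(newContext)) {
            return true;
        }
        if (previous != null) {
            startProcess(previous); // rollback
        }
        return false;
    }

    String running() {
        return current;
    }
}
```

At no point does this sketch have two processes alive, so it preserves the one-process-per-container semantics at the cost of downtime between the kill and the successful start.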

If everyone is ok with the above, I volunteer to either post a preliminary 
patch for the above or if the details get dicier during investigation, I can 
put up a doc.

Thoughts ?  




[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container

2016-02-23 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159730#comment-15159730
 ] 

Vinod Kumar Vavilapalli commented on YARN-1040:
---

Moved this to be a sub-task of YARN-4692 given the renewed focus there.



[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container

2013-12-10 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844594#comment-13844594
 ] 

Hitesh Shah commented on YARN-1040:
---

Given the recent comments on YARN-1404, I believe that this should not be 
supported unless the resources are being delegated to another YARN container. 

Furthermore, if we are talking about container leases (for multiple process 
launches, without any resource delegation), a container lease should start 
when the first process is launched, so an API that supports a null 
ContainerLaunchContext is moot. The lease aspects should probably be encoded 
into the container token so that the NM understands that a process exiting in 
a particular container need not signal the end of the container, i.e. 
multipleProcesses should not be an explicit flag in the API.
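
To make the token-based idea concrete, here is a minimal Java sketch of how an 
NM might decide what to do when a process exits. It is only an illustration 
under stated assumptions: {{LeaseToken}} and its {{leased}} field are 
hypothetical and do not correspond to the real YARN container token, and 
{{ProcessExitPolicy}} is not an existing NodeManager class.

```java
/** Hypothetical token payload; not the real YARN container token. */
final class LeaseToken {
    final String containerId;
    // Assumption: true means the container lease outlives individual processes.
    final boolean leased;

    LeaseToken(String containerId, boolean leased) {
        this.containerId = containerId;
        this.leased = leased;
    }
}

/** Decides what happens to a container when a process in it exits. */
final class ProcessExitPolicy {
    enum Outcome { RELEASE_CONTAINER, KEEP_CONTAINER }

    private ProcessExitPolicy() {
    }

    static Outcome onProcessExit(LeaseToken token) {
        // The decision comes from the token alone, so the start API needs
        // no explicit multipleProcesses flag.
        return token.leased ? Outcome.KEEP_CONTAINER : Outcome.RELEASE_CONTAINER;
    }
}
```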



[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container

2013-12-10 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844595#comment-13844595
 ] 

Hitesh Shah commented on YARN-1040:
---

Sorry - got my wires crossed on the different JIRAs going around. To clarify, I 
believe container leases for multiple processes are a good feature to have. 
Allowing a container to be launched without a process should be a no-no. 
Resource delegation as mentioned in YARN-1404 seems to be a decent approach to 
assigning resources to other containers - however, it should be restricted to 
assigning resources to containers under the control of YARN.





[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container

2013-11-13 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13821597#comment-13821597
 ] 

Alejandro Abdelnur commented on YARN-1040:
--

[~bikassaha], if I got it right, you suggest:

* {{StartContainerRequest}} would have a new property {{boolean 
multipleProcesses (false)}}
* An additional API {{startProcess(ContainerId, ContainerLaunchContext)}} will 
be used to start multiple processes within the same container.
* In a {{StartContainerRequest}}, if the {{ContainerLaunchContext == null}} and 
{{multipleProcesses = true}}, the container is started with no associated 
process and the container allocation will not time out, as it has been claimed 
by the AM (because of the start container request).

If that is the case, then YARN-1404 would be a special case of this JIRA.

Am I right?
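
The three points above can be sketched roughly as follows. This is a 
simplified Java model, not NodeManager code: {{SketchStartContainerRequest}} 
and {{SketchNodeManager}} are hypothetical names, a plain {{String}} command 
stands in for {{ContainerLaunchContext}}, and rejecting a null context when 
{{multipleProcesses}} is false is an assumption about how the default case 
would behave.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical request; a String command stands in for ContainerLaunchContext. */
final class SketchStartContainerRequest {
    final String containerId;
    final String launchCommand;      // may be null
    final boolean multipleProcesses; // proposed default: false

    SketchStartContainerRequest(String containerId, String launchCommand,
                                boolean multipleProcesses) {
        this.containerId = containerId;
        this.launchCommand = launchCommand;
        this.multipleProcesses = multipleProcesses;
    }
}

/** Simulated NM that only tracks which processes belong to which container. */
final class SketchNodeManager {
    private final Map<String, List<String>> processes = new HashMap<>();
    private final Map<String, Boolean> multi = new HashMap<>();

    void startContainer(SketchStartContainerRequest req) {
        if (req.launchCommand == null && !req.multipleProcesses) {
            // Assumption: an empty container is only allowed with the flag set.
            throw new IllegalArgumentException("launch context required");
        }
        List<String> started = new ArrayList<>();
        if (req.launchCommand != null) {
            started.add(req.launchCommand);
        }
        // The container is now claimed by the AM, even if it has no process yet.
        processes.put(req.containerId, started);
        multi.put(req.containerId, req.multipleProcesses);
    }

    void startProcess(String containerId, String launchCommand) {
        Boolean allowed = multi.get(containerId);
        if (allowed == null) {
            throw new IllegalStateException("unknown container " + containerId);
        }
        if (!allowed) {
            throw new IllegalStateException("container is single-process");
        }
        processes.get(containerId).add(launchCommand);
    }

    int processCount(String containerId) {
        return processes.get(containerId).size();
    }
}
```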



[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container

2013-11-12 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820645#comment-13820645
 ] 

Alejandro Abdelnur commented on YARN-1040:
--

[~bikassaha], a minor twist to the outlined approach is that we don't need a 
flag, just a NULL {{ContainerLaunchContext}}: if this context is not NULL on 
{{startContainer()}}, the container is meant to have 1 process only and 
finishes on process completion. This would preserve backwards compatibility. 
Only when {{startContainer()}} has a NULL context could there be multiple 
processes. Makes sense?
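
A tiny Java sketch of this flagless rule, where the container's mode is 
derived purely from whether a launch context was supplied. {{FlaglessStart}} 
and its methods are hypothetical names, and a {{String}} stands in for 
{{ContainerLaunchContext}}.

```java
/** Hypothetical helper: derives the container mode from the launch context alone. */
final class FlaglessStart {
    enum Mode { SINGLE_PROCESS, MULTI_PROCESS }

    private FlaglessStart() {
    }

    /** A null context on startContainer() means processes are started later. */
    static Mode modeFor(String launchContext) {
        return launchContext == null ? Mode.MULTI_PROCESS : Mode.SINGLE_PROCESS;
    }

    /** Existing callers always pass a context, so they keep today's behaviour. */
    static boolean containerEndsOnProcessExit(String launchContext) {
        return modeFor(launchContext) == Mode.SINGLE_PROCESS;
    }
}
```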



[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container

2013-11-12 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13821012#comment-13821012
 ] 

Bikas Saha commented on YARN-1040:
--

That's unnecessary overhead: I have to make 2 RPC calls for the first process 
when I want to run multiple processes within the same container. I first need 
to do startContainer(null) and then startProcess(). startContainer(process, 
flag-multiple-true) is more efficient since there is only 1 RPC. Also, the flag 
is completely backwards compatible with a default of false. We must support 
startContainer(null/no-process, flag-multiple-true) for the case in which the 
first process to run is not yet ready, or the case mentioned in YARN-1404 where 
we don't ever want to run a process.
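
The RPC-count argument can be illustrated with a small Java sketch. 
{{CountingNmClient}} is a hypothetical stub that only counts calls; it is not 
a real YARN client, and the method shapes are assumptions based on the calls 
named in this discussion.

```java
/** Hypothetical client stub that counts RPCs instead of sending them. */
final class CountingNmClient {
    private int rpcCalls;

    /** Flagless variant: a null context means processes are started later. */
    void startContainer(String launchContext) {
        rpcCalls++;
    }

    /** Flag variant: the first process and the multi-process flag travel together. */
    void startContainer(String launchContext, boolean multipleProcesses) {
        rpcCalls++;
    }

    void startProcess(String launchContext) {
        rpcCalls++;
    }

    int rpcCalls() {
        return rpcCalls;
    }

    /** RPCs needed to get the first process running without a flag. */
    static int flaglessFirstProcess(String launchContext) {
        CountingNmClient client = new CountingNmClient();
        client.startContainer(null);
        client.startProcess(launchContext);
        return client.rpcCalls();
    }

    /** RPCs needed to get the first process running with the flag. */
    static int flaggedFirstProcess(String launchContext) {
        CountingNmClient client = new CountingNmClient();
        client.startContainer(launchContext, true);
        return client.rpcCalls();
    }
}
```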
