[ 
https://issues.apache.org/jira/browse/YARN-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15667624#comment-15667624
 ] 

Varun Vasudev commented on YARN-1593:
-------------------------------------

[~hitesh] - 
{quote}
My concern is around the feedback loop in terms of failure handling by the apps 
when the system container dies at any of the following points:

    system container dies before an allocated container is launched on that node
    it dies while a container is running
    it dies after a container has completed

Would applications that define affinity to these system services now be getting 
updates (notifications) when system service containers go down or come back up?
{quote}
All of these are questions that we have to solve for the general services 
scenarios and I suspect that they might take some time to get right. Our 
solution till we have a well rounded story for these questions is to use the 
second method I mentioned above where we launch the Tez shuffle service on 
every node. That way Tez doesn't need to change any behaviour for now. Once we 
have the services scheduling and notification pieces sorted out we can start 
moving to the affinity model. 

{quote}
In addition to the feedback loop, is there any behavior change as a result of 
this? i.e. if the system container is not alive, will the app container still 
get launched given that its dependent service is down ( for shuffle, this might 
be ok if the system container eventually comes up but there might be other 
services that provide more synchronous functionality such as a caching layer? 
{quote}
This depends on whether it's a system service or a system container (the 
difference is that the first one has an AM running whereas the second is more 
like auxiliary services running as a container). In case of system containers - 
the NM will stop accepting container requests until the system container is 
back up. In case of the system service, the NM will continue to accept 
container requests.

> support out-of-proc AuxiliaryServices
> -------------------------------------
>
>                 Key: YARN-1593
>                 URL: https://issues.apache.org/jira/browse/YARN-1593
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager, rolling upgrade
>            Reporter: Ming Ma
>            Assignee: Varun Vasudev
>         Attachments: SystemContainersandSystemServices.pdf
>
>
> AuxiliaryServices such as ShuffleHandler currently run in the same process as 
> NM. There are some benefits to host them in dedicated processes.
> 1. NM rolling restart. If we want to upgrade YARN , NM restart will force the 
> ShuffleHandler restart. If ShuffleHandler runs as a separate process, 
> ShuffleHandler can continue to run during NM restart. NM can reconnect the 
> the running ShuffleHandler after restart.
> 2. Resource management. It is possible another type of AuxiliaryServices will 
> be implemented. AuxiliaryServices are considered YARN application specific 
> and could consume lots of resources. Running AuxiliaryServices in separate 
> processes allow easier resource management. NM could potentially stop a 
> specific AuxiliaryServices process from running if it consumes resource way 
> above its allocation.
> Here are some high level ideas:
> 1. NM provides a hosting process for each AuxiliaryService. Existing 
> AuxiliaryService API doesn't change.
> 2. The hosting process provides RPC server for AuxiliaryService proxy object 
> inside NM to connect to.
> 3. When we rolling restart NM, the existing AuxiliaryService processes will 
> continue to run. NM could reconnect to the running AuxiliaryService processes 
> upon restart.
> 4. Policy and resource management of AuxiliaryServices. So far we don't have 
> immediate need for this. AuxiliaryService could run inside a container and 
> its resource utilization could be taken into account by RM and RM could 
> consider a specific type of applications overutilize cluster resource.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to