[ https://issues.apache.org/jira/browse/YARN-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15667624#comment-15667624 ]
Varun Vasudev commented on YARN-1593:
-------------------------------------

[~hitesh] -

{quote}
My concern is around the feedback loop in terms of failure handling by the apps when the system container dies at any of the following points:
* the system container dies before an allocated container is launched on that node
* it dies while a container is running
* it dies after a container has completed

Would applications that define affinity to these system services now be getting updates (notifications) when system service containers go down or come back up?
{quote}

All of these are questions that we have to solve for the general services scenarios, and I suspect that they might take some time to get right. Until we have a well-rounded story for these questions, our solution is to use the second method I mentioned above, where we launch the Tez shuffle service on every node. That way Tez doesn't need to change any behaviour for now. Once we have the services scheduling and notification pieces sorted out, we can start moving to the affinity model.

{quote}
In addition to the feedback loop, is there any behavior change as a result of this? i.e. if the system container is not alive, will the app container still get launched given that its dependent service is down (for shuffle, this might be ok if the system container eventually comes up, but there might be other services that provide more synchronous functionality, such as a caching layer)?
{quote}

This depends on whether it's a system service or a system container (the difference is that the first one has an AM running, whereas the second is more like auxiliary services running as a container). In the case of system containers, the NM will stop accepting container requests until the system container is back up. In the case of a system service, the NM will continue to accept container requests.
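To make the system container versus system service distinction concrete, here is a minimal sketch of the admission rule described above. Every name in it (DependencyKind, AdmissionSketch, setAlive, acceptsContainerRequests) is hypothetical and is not part of any YARN API; it only models the stated behaviour, not how the NM would actually implement it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical classification; not a YARN type.
enum DependencyKind {
    SYSTEM_CONTAINER, // auxiliary-service-like, runs as a container, no AM
    SYSTEM_SERVICE    // has its own AM
}

// Hypothetical model of the NM-side admission decision.
final class AdmissionSketch {
    private final Map<String, DependencyKind> kinds = new HashMap<>();
    private final Map<String, Boolean> alive = new HashMap<>();

    // Register a dependency; it is assumed to start out alive.
    void register(String name, DependencyKind kind) {
        kinds.put(name, kind);
        alive.put(name, Boolean.TRUE);
    }

    // Record that a registered dependency went down or came back up.
    void setAlive(String name, boolean isAlive) {
        if (!kinds.containsKey(name)) {
            throw new IllegalArgumentException("unknown dependency: " + name);
        }
        alive.put(name, isAlive);
    }

    // A dead system container blocks new container requests;
    // a dead system service does not.
    boolean acceptsContainerRequests() {
        for (Map.Entry<String, DependencyKind> e : kinds.entrySet()) {
            boolean down = !alive.get(e.getKey());
            if (down && e.getValue() == DependencyKind.SYSTEM_CONTAINER) {
                return false;
            }
        }
        return true;
    }
}
```

In this model, marking a SYSTEM_SERVICE entry as down leaves acceptsContainerRequests() true, while marking a SYSTEM_CONTAINER entry as down makes it false until that entry is marked alive again.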
> support out-of-proc AuxiliaryServices
> -------------------------------------
>
>                 Key: YARN-1593
>                 URL: https://issues.apache.org/jira/browse/YARN-1593
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager, rolling upgrade
>            Reporter: Ming Ma
>            Assignee: Varun Vasudev
>         Attachments: SystemContainersandSystemServices.pdf
>
>
> AuxiliaryServices such as ShuffleHandler currently run in the same process as
> the NM. There are some benefits to hosting them in dedicated processes.
> 1. NM rolling restart. If we want to upgrade YARN, an NM restart will force a
> ShuffleHandler restart. If ShuffleHandler runs as a separate process,
> ShuffleHandler can continue to run during the NM restart. The NM can reconnect
> to the running ShuffleHandler after restart.
> 2. Resource management. It is possible that other types of AuxiliaryServices
> will be implemented. AuxiliaryServices are considered YARN application specific
> and could consume lots of resources. Running AuxiliaryServices in separate
> processes allows easier resource management. The NM could potentially stop a
> specific AuxiliaryServices process from running if it consumes resources way
> above its allocation.
> Here are some high level ideas:
> 1. The NM provides a hosting process for each AuxiliaryService. The existing
> AuxiliaryService API doesn't change.
> 2. The hosting process provides an RPC server for the AuxiliaryService proxy
> object inside the NM to connect to.
> 3. When we rolling restart the NM, the existing AuxiliaryService processes will
> continue to run. The NM could reconnect to the running AuxiliaryService
> processes upon restart.
> 4. Policy and resource management of AuxiliaryServices. So far we don't have
> an immediate need for this. An AuxiliaryService could run inside a container,
> its resource utilization could be taken into account by the RM, and the RM
> could consider that a specific type of application overutilizes cluster
> resources.