[ https://issues.apache.org/jira/browse/YARN-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15710178#comment-15710178 ]
Haibo Chen edited comment on YARN-1593 at 11/30/16 11:25 PM:
-------------------------------------------------------------

Thanks for starting the work on this, [~vvasudev]! I’d like to understand the proposal better. A few comments/questions on the proposal. Please correct me as necessary.

It seems like the term "system containers" is overloaded in the design doc. From an NM’s perspective, my understanding is that system containers are a special container runtime (relative to the container types we have today in NM) provided by NM to be used by system services to run their components/instances. In other cases, system containers represent components/instances of system services on the worker nodes. In the former case, we may only need to be concerned with issues such as classpath and container executors. For ShuffleHandler, for instance, it is an alternative to the in-process runtime it gets from NM today. The latter is where we discuss whether RM or NM does the heavy lifting of managing system containers.

As you mention, no one option suits all use cases. Option 1 suits some, while option 3 suits others. I wonder if this is because we are conflating two different types of containers in the proposal: (1) framework-specific services like MR shuffle, and (2) application-specific services. Framework services are to be run on all nodes that support the framework (e.g. MR). Since these run on every node, node-level configs (option 3) would work best. Application services (e.g. ATS AM-companion-collector), on the other hand, are application specific and need to run on a subset of cluster nodes; option 1 readily applies to these. Is this categorization accurate? And do you see merit in differentiating between these two?

bq. Allow shuffle to run on the NodeManagers without requiring it to be setup as an AuxiliaryService
Not sure if I understand this correctly. IMHO, we could let users continue with their current configuration for AuxiliaryService, but just run the services in containers behind an AuxiliaryService proxy, like Junping said in the jira description.

bq. Handling container status for system-containers - we will need to add logic to not act upon the container status of a system-container.
Can you please elaborate more on this? Shouldn’t NM try to relaunch system containers? Does this mean that RM will take the responsibility of handling system container failures?

bq. I think discovery is going to be one major piece that needs to be addressed from the beginning
Agree with Sangjin that the discovery problem needs to be addressed right at the beginning. For option 3, I think we can add a queryable registry in AuxiliaryServices when NM launches a proxied AuxiliaryService, assuming that NM will launch the AuxiliaryServices in the right order and each AuxiliaryService knows its dependent services.

bq. the NodeManager will block container requests until all the system-containers are running
With global scheduling and resource affinity, NM does not necessarily need to block container launching. NM can launch system containers asynchronously and report to the ResourceManager upon launch success, and RM can then schedule containers on a node only if the services those containers depend on have been launched on that node. But that’s way in the future, I guess.

bq. We can’t solve the dependency management and affinity/anti-affinity requirements. (One of cons in option 3)
Not quite sure how option 1 solves the affinity requirement. Can you elaborate a little more on this?
To solve the dependency management issue, one thing that occurred to me, though I have not thought about it in much detail, is that we could have RM manage all system services together and construct a DAG of the system services that need to be launched on each NM. Alternatively, RM can just decide which services need to be launched on which nodes, with their dependencies clearly defined, and then the NMs can construct the DAG themselves and launch the services in topological order. This, however, does put some burden on RM.

> support out-of-proc AuxiliaryServices
> -------------------------------------
>
>                 Key: YARN-1593
>                 URL: https://issues.apache.org/jira/browse/YARN-1593
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager, rolling upgrade
>            Reporter: Ming Ma
>            Assignee: Varun Vasudev
>         Attachments: SystemContainersandSystemServices.pdf
>
>
> AuxiliaryServices such as ShuffleHandler currently run in the same process as
> NM. There are some benefits to hosting them in dedicated processes.
> 1. NM rolling restart. If we want to upgrade YARN, an NM restart will force a
> ShuffleHandler restart. If ShuffleHandler runs as a separate process,
> ShuffleHandler can continue to run during the NM restart. NM can reconnect to
> the running ShuffleHandler after restart.
> 2. Resource management. It is possible that other types of AuxiliaryServices
> will be implemented. AuxiliaryServices are considered YARN application specific
> and could consume lots of resources. Running AuxiliaryServices in separate
> processes allows easier resource management. NM could potentially stop a
> specific AuxiliaryServices process from running if it consumes resources way
> above its allocation.
> Here are some high level ideas:
> 1. NM provides a hosting process for each AuxiliaryService. Existing
> AuxiliaryService API doesn't change.
> 2. The hosting process provides an RPC server for the AuxiliaryService proxy
> object inside NM to connect to.
> 3. When we rolling restart NM, the existing AuxiliaryService processes will
> continue to run. NM could reconnect to the running AuxiliaryService processes
> upon restart.
> 4. Policy and resource management of AuxiliaryServices. So far we don't have
> an immediate need for this. An AuxiliaryService could run inside a container so
> that its resource utilization could be taken into account by RM, and RM could
> determine that a specific type of application overutilizes cluster resources.
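
As a rough illustration of the dependency-ordering idea in the comment (an NM building a DAG of system services and launching them in topological order), here is a minimal Python sketch. It is not YARN code: the function name `launch_order` and the service names are hypothetical, and the ordering relies on the standard-library `graphlib` module (Python 3.9+) rather than anything in the NodeManager.

```python
from graphlib import CycleError, TopologicalSorter


def launch_order(dependencies):
    """Return service names ordered so each service follows its dependencies.

    `dependencies` maps a service name to the set of services that must be
    running before it starts. Raises ValueError if the dependencies form a
    cycle, since no valid launch order exists in that case.
    """
    try:
        # static_order() yields every node after all of its predecessors.
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as err:
        # err.args[1] holds the list of nodes that make up the cycle.
        raise ValueError(f"cyclic service dependencies: {err.args[1]}") from err


# Hypothetical per-node service set; these names are illustrative only.
node_services = {
    "registry": set(),
    "shuffle": {"registry"},
    "timeline-collector": {"registry", "shuffle"},
}

for name in launch_order(node_services):
    print("launch", name)
```

Under this sketch, whichever side owns the dependency definitions (RM or NM) only needs to hand over the mapping; rejecting a cycle up front avoids a node waiting forever on services that can never start.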