Wangda Tan created YARN-6223:
--------------------------------

             Summary: [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN
                 Key: YARN-6223
                 URL: https://issues.apache.org/jira/browse/YARN-6223
             Project: Hadoop YARN
          Issue Type: New Feature
            Reporter: Wangda Tan
            Assignee: Wangda Tan

As more varieties of workloads move to YARN, including machine learning / deep learning workloads that can be sped up by leveraging GPU computation power, workloads should be able to request GPUs from YARN as simply as they request CPU and memory.

To make a complete GPU story, we should support the following pieces:

1) GPU discovery/configuration: The admin can either configure GPU resources and architectures on each node or, in the more advanced case, the NodeManager can automatically discover GPU resources and architectures and report them to the ResourceManager.
2) GPU scheduling: The YARN scheduler should account for GPU as a resource type, just like CPU and memory.
3) GPU isolation/monitoring: Once a task is launched with GPU resources, the NodeManager should properly isolate and monitor the task's resource usage.

For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced an extensible framework to support isolation for different resource types and different runtimes.

There are a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but different solutions:

For scheduling:
- YARN-4122 and YARN-5517 both add a new GPU resource type to the Resource protocol instead of leveraging YARN-3926.

For isolation:
- YARN-4122 proposed using CGroups for isolation, which cannot solve the problems listed at https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges, such as minor device number mapping, loading the nvidia_uvm module, and mismatched CUDA/driver versions.
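
To illustrate the scheduling piece (#2), here is a minimal sketch of what "account for GPU as a resource type, just like CPU and memory" could look like if resources are kept as a map keyed by resource-type name. This is not YARN code and not the YARN-3926 API: the class GpuResourceSketch, its methods, and the type names "memory-mb", "vcores", and "gpu" are all hypothetical and chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch only; none of these names are taken from YARN. */
public class GpuResourceSketch {

    /** Builds a resource vector keyed by (assumed) resource-type name. */
    public static Map<String, Long> resources(long memoryMb, long vcores, long gpus) {
        Map<String, Long> r = new HashMap<>();
        r.put("memory-mb", memoryMb);
        r.put("vcores", vcores);
        r.put("gpu", gpus);
        return r;
    }

    /** Returns true if every requested resource type fits within what is available. */
    public static boolean fits(Map<String, Long> request, Map<String, Long> available) {
        for (Map.Entry<String, Long> e : request.entrySet()) {
            // A type the node does not report is treated as zero capacity.
            long have = available.getOrDefault(e.getKey(), 0L);
            if (e.getValue() > have) {
                return false;
            }
        }
        return true;
    }

    /** Deducts the request from the available resources if it fits; otherwise changes nothing. */
    public static boolean allocate(Map<String, Long> request, Map<String, Long> available) {
        if (!fits(request, available)) {
            return false;
        }
        for (Map.Entry<String, Long> e : request.entrySet()) {
            available.merge(e.getKey(), -e.getValue(), Long::sum);
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Long> node = resources(65536, 16, 2); // a node reporting 2 GPUs
        Map<String, Long> task = resources(8192, 4, 1);   // each task asks for 1 GPU

        System.out.println(allocate(task, node)); // first task fits
        System.out.println(allocate(task, node)); // second task fits
        System.out.println(allocate(task, node)); // no GPU left, so this is rejected
    }
}
```

The point of the sketch is that the fit check and the deduction loop never mention GPU explicitly: once GPU is just another named entry in the resource vector, the same accounting path covers it, which is the argument for leveraging an extensible resource-type mechanism rather than adding a dedicated GPU field.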