[ 
https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093148#comment-16093148
 ] 

Zhankun Tang commented on YARN-6223:
------------------------------------

[~wangda], sorry for the late reply. Many thanks for the v3 patch!

It looks good to me. My concerns are mainly about modularity:
1. If container-executor is compiled with a GPU module, does that mean every 
type of accelerator device (such as FPGA, SSD, DSP) needs its own module 
written in C? Or could we provide a generic interface in container-executor 
that handles isolation for all devices?
2. Since we already have "node-resources.xml" for end users to declare 
custom resources such as GPU/FPGA, could the allowed-devices configuration 
go there instead of in "container-executor.cfg"?
For instance:
{code:xml}
...
<property>
   <name>yarn.nodemanager.resource-types.MCP</name>
   <value>2</value>
</property>
<property>
   <name>yarn.nodemanager.resource-types.NvidiaGPU.allowed</name>
   <value>195:0,195:1</value>
</property>
...
{code}
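
To make point 2 a bit more concrete, here is a minimal sketch of how the NodeManager side might parse such an "allowed" value. It assumes the entries are comma-separated major:minor device numbers, which is my reading of "195:0,195:1". The names AllowedDevices, DeviceId and parse are hypothetical and do not come from any of the attached patches.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of any YARN patch. */
final class AllowedDevices {

    /** A device identified by its (assumed) major and minor numbers. */
    static final class DeviceId {
        final int major;
        final int minor;

        DeviceId(int major, int minor) {
            this.major = major;
            this.minor = minor;
        }

        @Override
        public String toString() {
            return major + ":" + minor;
        }
    }

    private AllowedDevices() {
    }

    /**
     * Parses a value such as "195:0,195:1" into device ids.
     * An empty or null value yields an empty list; malformed entries are rejected.
     */
    static List<DeviceId> parse(String value) {
        List<DeviceId> devices = new ArrayList<>();
        if (value == null || value.trim().isEmpty()) {
            return devices;
        }
        // Limit -1 keeps trailing empty strings so "195:0," is rejected below.
        for (String entry : value.split(",", -1)) {
            String[] parts = entry.trim().split(":", -1);
            if (parts.length != 2) {
                throw new IllegalArgumentException(
                    "Expected major:minor, got: '" + entry + "'");
            }
            try {
                int major = Integer.parseInt(parts[0].trim());
                int minor = Integer.parseInt(parts[1].trim());
                if (major < 0 || minor < 0) {
                    throw new IllegalArgumentException(
                        "Device numbers must be non-negative: '" + entry + "'");
                }
                devices.add(new DeviceId(major, minor));
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException(
                    "Non-numeric device number in: '" + entry + "'", e);
            }
        }
        return devices;
    }
}
```

With something like this, the same parsing could serve any resource type that exposes an ".allowed" property, which is the kind of device-agnostic handling I have in mind.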

Please correct me if I have misunderstood anything.

> [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation 
> on YARN
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-6223
>                 URL: https://issues.apache.org/jira/browse/YARN-6223
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>         Attachments: YARN-6223.Natively-support-GPU-on-YARN-v1.pdf, 
> YARN-6223.wip.1.patch, YARN-6223.wip.2.patch, YARN-6223.wip.3.patch
>
>
> A variety of workloads are moving to YARN, including machine learning / 
> deep learning, which can be sped up by leveraging GPU computation power. 
> Workloads should be able to request GPUs from YARN as simply as CPU and memory.
> *To make a complete GPU story, we should support the following pieces:*
> 1) GPU discovery/configuration: the admin can either configure GPU resources and 
> architectures on each node or, more advanced, NodeManager can automatically 
> discover GPU resources and architectures and report them to ResourceManager.
> 2) GPU scheduling: the YARN scheduler should account for GPU as a resource type just 
> like CPU and memory.
> 3) GPU isolation/monitoring: once a task is launched with GPU resources, 
> NodeManager should properly isolate and monitor the task's resource usage.
> For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced 
> an extensible framework to support isolation for different resource types and 
> different runtimes.
> *Related JIRAs:*
> There are a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but 
> different solutions:
> For scheduling:
> - YARN-4122/YARN-5517 both add a new GPU resource type to the Resource 
> protocol instead of leveraging YARN-3926.
> For isolation:
> - YARN-4122 proposed using CGroups for isolation, which cannot solve 
> the problems listed at 
> https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges such as 
> minor device number mapping, loading the nvidia_uvm module, mismatched CUDA/driver 
> versions, etc.




