[ 
https://issues.apache.org/jira/browse/YARN-9060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16751008#comment-16751008
 ] 

Zhankun Tang commented on YARN-9060:
------------------------------------

[~cheersyang], [~sunilg], the patch consists of the following key parts:

1. The native isolation module. It uses a c-e.cfg section separate from the 
GPU/FPGA modules because of the bug those modules have. See the comments above 
for a detailed explanation. The key config change is that we use 
"devices.denied-numbers" instead of "devices.allowed-number".
[devices]
  module.enabled=true
  device.allowed-numbers=8:32 # this will be removed.
  # Comma-separated major:minor. Empty means allow the default devices
  # reported by the device plugin.
  devices.denied-numbers=8:48,8:16
The interface that the Java layer uses to invoke this c-e module is:
c-e --module-devices \
  --excluded_devices b-8:32-rwm,c-195:0 \
  --allowed_devices 8:16,8:48,195:1 \
  --container_id container_x_y
2. DeviceResourceDockerRuntimePluginImpl.java, which bridges the 
DockerLinuxContainerRuntime and the vendor device plugin. This class consumes 
the DeviceRuntimeSpec generated by the vendor device plugin's 
onDeviceAllocated and translates it into an internal YARN Docker volume or 
run command.

3. A sample Nvidia GPU plugin which uses our new DevicePlugin interface.
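
As a rough illustration of the interface in item 1, the sketch below shows how 
a Java caller might assemble the c-e argument list. It is not code from the 
patch: the class DeviceModuleCommand and the methods parseMajorMinor and 
buildArgs are hypothetical names, and the executable path is passed in by the 
caller. Only the flags and value formats are taken from the example above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of the patch. */
class DeviceModuleCommand {

  /**
   * Splits a comma-separated "major:minor" list such as "8:48,8:16".
   * An empty or null value yields an empty list.
   */
  public static List<int[]> parseMajorMinor(String value) {
    List<int[]> result = new ArrayList<>();
    if (value == null || value.trim().isEmpty()) {
      return result;
    }
    for (String item : value.split(",")) {
      String[] parts = item.trim().split(":");
      if (parts.length != 2) {
        throw new IllegalArgumentException(
            "Expected major:minor, got: " + item);
      }
      result.add(new int[] {
          Integer.parseInt(parts[0]), Integer.parseInt(parts[1])});
    }
    return result;
  }

  /**
   * Builds the argument list in the same order as the example invocation.
   * The allowed list is validated as major:minor pairs; the excluded list
   * is passed through unchanged because its format (for example
   * "b-8:32-rwm") is interpreted by the native module.
   */
  public static List<String> buildArgs(String executable, String excluded,
      String allowed, String containerId) {
    parseMajorMinor(allowed); // throws on malformed input
    List<String> args = new ArrayList<>();
    args.add(executable);
    args.add("--module-devices");
    args.add("--excluded_devices");
    args.add(excluded);
    args.add("--allowed_devices");
    args.add(allowed);
    args.add("--container_id");
    args.add(containerId);
    return args;
  }
}
```

The resulting list could be handed to something like java.lang.ProcessBuilder. 
The real patch goes through YARN's privileged operation path instead, which 
this sketch does not model.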

I ran an end-to-end test on an AWS EC2 instance with one GPU card. Please help 
review. Thanks!

> [YARN-8851] Phase 1 - Support device isolation in native container-executor
> ---------------------------------------------------------------------------
>
>                 Key: YARN-9060
>                 URL: https://issues.apache.org/jira/browse/YARN-9060
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Zhankun Tang
>            Assignee: Zhankun Tang
>            Priority: Major
>         Attachments: YARN-9060-trunk.001.patch, YARN-9060-trunk.002.patch, 
> YARN-9060-trunk.003.patch, YARN-9060-trunk.004.patch, 
> YARN-9060-trunk.005.patch, YARN-9060-trunk.006.patch, 
> YARN-9060-trunk.007.patch, YARN-9060-trunk.008.patch, 
> YARN-9060-trunk.009.patch
>
>
> Due to the cgroups v1 implementation policy in the Linux kernel, we cannot 
> update the value of the device cgroups controller without root permission 
> ([here|https://github.com/torvalds/linux/blob/6f0d349d922ba44e4348a17a78ea51b7135965b1/security/device_cgroup.c#L604]).
>  So we need to support this in container-executor for the Java layer to invoke.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

