[ https://issues.apache.org/jira/browse/YARN-9060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16751008#comment-16751008 ]
Zhankun Tang commented on YARN-9060:
------------------------------------

[~cheersyang], [~sunilg],

The patch consists of the following key parts:

1. The native isolation module. Its c-e.cfg section differs from that of the GPU/FPGA modules because of the bug they have; see the comments above for a detailed explanation. The key config change is that we use "devices.denied-numbers" instead of "devices.allowed-number".

    [devices]
    module.enabled=true
    device.allowed-numbers=8:32 # this will be removed.
    devices.denied-numbers=8:48,8:16 # comma separated major:minor. Empty means allow default devices reported by device plugin.

   The interface of this c-e module for the Java layer to invoke is:

    c-e --module-devices \
      --excluded_devices b-8:32-rwm,c-195:0 \
      --allowed_devices 8:16,8:48,195:1 \
      --container_id container_x_y

2. DeviceResourceDockerRuntimePluginImpl.java, which bridges DockerLinuxContainerRuntime and the vendor device plugin. This class takes the DeviceRuntimeSpec generated by the vendor device plugin's onDeviceAllocated and translates it into an internal YARN Docker volume or run command.

3. A sample Nvidia GPU plugin that uses our new DevicePlugin interface.

I ran an end-to-end test on an AWS EC2 instance with 1 GPU card. Please help to review. Thanks!
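For reviewers, here is a minimal sketch of how the Java layer might assemble the c-e invocation shown in item 1. It is not code from the patch: the class name DeviceModuleCommand, the build method, and the choice to omit an option when its device list is empty are all assumptions made for illustration. Only the flag names (--module-devices, --excluded_devices, --allowed_devices, --container_id) and the sample values come from the comment above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the YARN-9060 patch.
final class DeviceModuleCommand {

  private DeviceModuleCommand() {
  }

  // Builds the argument list for the devices module of container-executor.
  // Device strings are passed through unchanged, e.g. "b-8:32-rwm" for an
  // excluded device or "8:16" for an allowed one.
  static List<String> build(String executable, List<String> excludedDevices,
      List<String> allowedDevices, String containerId) {
    List<String> cmd = new ArrayList<>();
    cmd.add(executable);
    cmd.add("--module-devices");
    // Assumption: an empty list means the option is left out entirely.
    if (!excludedDevices.isEmpty()) {
      cmd.add("--excluded_devices");
      cmd.add(String.join(",", excludedDevices));
    }
    if (!allowedDevices.isEmpty()) {
      cmd.add("--allowed_devices");
      cmd.add(String.join(",", allowedDevices));
    }
    cmd.add("--container_id");
    cmd.add(containerId);
    return cmd;
  }

  public static void main(String[] args) {
    List<String> cmd = build("c-e",
        Arrays.asList("b-8:32-rwm", "c-195:0"),
        Arrays.asList("8:16", "8:48", "195:1"),
        "container_x_y");
    // Print the command line as a single space-separated string.
    System.out.println(String.join(" ", cmd));
  }
}
```

Keeping each option and its value as separate list elements means the result can be handed to a process launcher without any shell quoting.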
> [YARN-8851] Phase 1 - Support device isolation in native container-executor
> ---------------------------------------------------------------------------
>
>                 Key: YARN-9060
>                 URL: https://issues.apache.org/jira/browse/YARN-9060
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Zhankun Tang
>            Assignee: Zhankun Tang
>            Priority: Major
>         Attachments: YARN-9060-trunk.001.patch, YARN-9060-trunk.002.patch,
> YARN-9060-trunk.003.patch, YARN-9060-trunk.004.patch,
> YARN-9060-trunk.005.patch, YARN-9060-trunk.006.patch,
> YARN-9060-trunk.007.patch, YARN-9060-trunk.008.patch,
> YARN-9060-trunk.009.patch
>
>
> Due to the cgroups v1 implementation policy in the Linux kernel, we cannot update
> the value of the device cgroups controller unless we have root permission
> ([here|https://github.com/torvalds/linux/blob/6f0d349d922ba44e4348a17a78ea51b7135965b1/security/device_cgroup.c#L604]).
> So we need to support this in container-executor for the Java layer to invoke.