[jira] [Updated] (YARN-7224) Support GPU isolation for docker container
     [ https://issues.apache.org/jira/browse/YARN-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Badger updated YARN-7224:
------------------------------
    Labels: Docker  (was: )

> Support GPU isolation for docker container
> ------------------------------------------
>
>                 Key: YARN-7224
>                 URL: https://issues.apache.org/jira/browse/YARN-7224
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>            Priority: Major
>              Labels: Docker
>             Fix For: 3.1.0
>
>         Attachments: YARN-7224.001.patch, YARN-7224.002-wip.patch,
> YARN-7224.003.patch, YARN-7224.004.patch, YARN-7224.005.patch,
> YARN-7224.006.patch, YARN-7224.007.patch, YARN-7224.008.patch,
> YARN-7224.009.patch
>
>
> This patch addresses the following issues when docker containers are used:
> 1. GPU driver and nvidia libraries: If GPU drivers and NV libraries are
> pre-packaged inside the docker image, they could conflict with the driver and
> nvidia libraries installed on the host OS. An alternative solution is to detect
> the host OS's installed drivers and devices and mount them when launching the
> docker container. Please refer to \[1\] for more details.
> 2. Image detection:
> From \[2\], the challenge is:
> bq. Mounting user-level driver libraries and device files clobbers the
> environment of the container, it should be done only when the container is
> running a GPU application. The challenge here is to determine if a given
> image will be using the GPU or not. We should also prevent launching
> containers based on a Docker image that is incompatible with the host NVIDIA
> driver version, you can find more details on this wiki page.
> 3. GPU isolation.
> *Proposed solution*:
> a. Use nvidia-docker-plugin \[3\] to address issue #1; this is the same
> solution used by K8S \[4\]. Issue #2 could be addressed in a separate JIRA.
> We won't ship nvidia-docker-plugin with our releases, and we require the
> cluster admin to preinstall nvidia-docker-plugin to use GPU+docker support on
> YARN.
> "nvidia-docker" is a wrapper around the docker binary which can address #3 as
> well; however, "nvidia-docker" doesn't provide the same semantics as docker,
> and it needs additional environment setup such as PATH/LD_LIBRARY_PATH to use
> it. To avoid introducing additional issues, we plan to use the
> nvidia-docker-plugin + docker binary approach.
> b. To address GPU driver and nvidia libraries, we use nvidia-docker-plugin
> \[3\] to create a volume which includes GPU-related libraries and mount it
> when the docker container is launched. Changes include:
> - Instead of using {{volume-driver}}, this patch adds the {{docker volume
> create}} command to c-e and the NM Java side. The reason is that
> {{volume-driver}} can only use a single volume driver for each launched docker
> container.
> - Updated {{c-e}} and the Java side: if a mounted volume is a named volume in
> docker, skip checking file existence. (Named volumes still need to be added to
> the permitted list in container-executor.cfg.)
> c. To address the isolation issue:
> We found that cgroup + docker doesn't work under newer docker versions which
> use {{runc}} as the default runtime. Setting {{--cgroup-parent}} to a cgroup
> which includes any {{devices.deny}} prevents the docker container from being
> launched. Instead, this patch passes allowed GPU devices via {{--device}} to
> the docker launch command.
> References:
> \[1\] https://github.com/NVIDIA/nvidia-docker/wiki/NVIDIA-driver
> \[2\] https://github.com/NVIDIA/nvidia-docker/wiki/Image-inspection
> \[3\] https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker-plugin
> \[4\] https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
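To make item b of the description a bit more concrete, here is a minimal sketch of the two pieces involved: a {{docker volume create}} command for the driver volume, and the read-only mount argument that a later {{docker run}} would carry. The function names are hypothetical, and the driver name {{nvidia-docker}}, the volume name {{nvidia_driver_375.66}}, the mount point {{/usr/local/nvidia}}, and the {{--name=}}/{{--driver=}} flag spelling are assumptions for illustration; none of them are taken from the patch itself.

```python
from typing import List


def volume_create_command(volume_name: str, driver: str) -> List[str]:
    """Build a `docker volume create` command for a plugin-backed named volume.

    Assumption: the volume is created once up front, so the launch itself
    does not have to rely on a single --volume-driver for every mount.
    """
    return [
        "docker", "volume", "create",
        f"--driver={driver}",
        f"--name={volume_name}",
    ]


def volume_mount_args(volume_name: str, container_path: str) -> List[str]:
    """Build the read-only mount argument for an already created named volume."""
    return ["-v", f"{volume_name}:{container_path}:ro"]


if __name__ == "__main__":
    # Hypothetical names, see the note above the code.
    print(" ".join(volume_create_command("nvidia_driver_375.66", "nvidia-docker")))
    print(" ".join(volume_mount_args("nvidia_driver_375.66", "/usr/local/nvidia")))
```

Because the volume is referenced by name at launch time, other mounts of the same container remain free to use different volume drivers, which is the limitation of {{volume-driver}} mentioned above.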
[jira] [Updated] (YARN-7224) Support GPU isolation for docker container
Wangda Tan updated YARN-7224:
-----------------------------
    Attachment: YARN-7224.009.patch

bq. Could we print nvidia-docker-plugin -v some where from c-e or java side to dump version info. Helpful for debugging later.
Good suggestion, but can we do this later, together with other GPU-debuggability work? I will file a JIRA for it.

Fixed #2/#3. Uploaded ver.9 patch, could you help review?
[jira] [Updated] (YARN-7224) Support GPU isolation for docker container
Wangda Tan updated YARN-7224:
-----------------------------
    Attachment: YARN-7224.008.patch

Thanks [~sunilg] for the comments.

bq. In assignGpus, do we also need to update the assigned gpus to container's resource mapping list ?
I would prefer to keep that in NMStateStore#storeAssignedResources; otherwise every new resource plugin would need to implement the same logic.

bq. In general dockerCommandPlugin.updateDockerRunCommand helps to update docker command for volume etc. However is its better to have an api named sanitize/verifyCommand in dockerCommandPlugin so that incoming/created command will validated and logged based on system parameters
I'm not quite sure about this, could you explain?

bq. Once a docker volume is created, when this volume will be cleaned or unmounted ? in case when container crashes or force stopping container from external docker commands etc
bq. With container upgrades or partially using GPU device for a timeslice of container lifetime, how volumes could be mounted/re-mounted ?
For the GPU docker integration we don't need to do this: all launched containers share the same docker volume, so we don't need to create the docker volume again and again. I agree that we may need this in the future, so I added one method (getCleanupDockerVolumeCommand) to the DockerCommandPlugin interface.

bq. In GpuDevice, do we also need to add make (like nvidia with version etc ? )
We don't need it for now; we can add it easily in the future when required.

bq. In initializeWhenGpuRequested, we do a lazy initialization. However if docker end point is down(default port), this could cause delay in container launch. Do we need a health mechanism to get this data updated ?
To me this is the same as the docker daemon being down. Since containers will fail fast, the admin should be able to fix the issue.

bq. Once docker volume is created, its better to dump the docker volume inspect o/p on created volume. Could help for debugging later.
I like this idea, but considering the size of this patch, can we do this in a follow-up JIRA?

Attached ver.8 patch.
[jira] [Updated] (YARN-7224) Support GPU isolation for docker container
Wangda Tan updated YARN-7224:
-----------------------------
    Attachment: YARN-7224.007.patch

Attached ver.7 patch, which fixes warnings and javadocs. The UT failure is not related.
[jira] [Updated] (YARN-7224) Support GPU isolation for docker container
Wangda Tan updated YARN-7224:
-----------------------------
    Attachment: YARN-7224.006.patch

Attached ver.6 patch to run Jenkins.
[jira] [Updated] (YARN-7224) Support GPU isolation for docker container
Wangda Tan updated YARN-7224:
-----------------------------
    Description: 
This patch addresses the following issues when docker containers are used:

1. GPU driver and nvidia libraries: If GPU drivers and NV libraries are pre-packaged inside the docker image, they could conflict with the driver and nvidia libraries installed on the host OS. An alternative solution is to detect the host OS's installed drivers and devices and mount them when launching the docker container. Please refer to \[1\] for more details.

2. Image detection:
From \[2\], the challenge is:
bq. Mounting user-level driver libraries and device files clobbers the environment of the container, it should be done only when the container is running a GPU application. The challenge here is to determine if a given image will be using the GPU or not. We should also prevent launching containers based on a Docker image that is incompatible with the host NVIDIA driver version, you can find more details on this wiki page.

3. GPU isolation.

*Proposed solution*:

a. Use nvidia-docker-plugin \[3\] to address issue #1; this is the same solution used by K8S \[4\]. Issue #2 could be addressed in a separate JIRA.

We won't ship nvidia-docker-plugin with our releases, and we require the cluster admin to preinstall nvidia-docker-plugin to use GPU+docker support on YARN. "nvidia-docker" is a wrapper around the docker binary which can address #3 as well; however, "nvidia-docker" doesn't provide the same semantics as docker, and it needs additional environment setup such as PATH/LD_LIBRARY_PATH to use it. To avoid introducing additional issues, we plan to use the nvidia-docker-plugin + docker binary approach.

b. To address GPU driver and nvidia libraries, we use nvidia-docker-plugin \[3\] to create a volume which includes GPU-related libraries and mount it when the docker container is launched. Changes include:
- Instead of using {{volume-driver}}, this patch adds the {{docker volume create}} command to c-e and the NM Java side. The reason is that {{volume-driver}} can only use a single volume driver for each launched docker container.
- Updated {{c-e}} and the Java side: if a mounted volume is a named volume in docker, skip checking file existence. (Named volumes still need to be added to the permitted list in container-executor.cfg.)

c. To address the isolation issue:
We found that cgroup + docker doesn't work under newer docker versions which use {{runc}} as the default runtime. Setting {{--cgroup-parent}} to a cgroup which includes any {{devices.deny}} prevents the docker container from being launched. Instead, this patch passes allowed GPU devices via {{--device}} to the docker launch command.

References:
\[1\] https://github.com/NVIDIA/nvidia-docker/wiki/NVIDIA-driver
\[2\] https://github.com/NVIDIA/nvidia-docker/wiki/Image-inspection
\[3\] https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker-plugin
\[4\] https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/

  was:YARN-6620 added support of GPU isolation in NM side, which only supports non-docker containers. We need to add support to help docker containers launched by YARN can utilize GPUs.
[jira] [Updated] (YARN-7224) Support GPU isolation for docker container
Wangda Tan updated YARN-7224:
-----------------------------
    Attachment: YARN-7224.005.patch

Attached ver.5 patch.
[jira] [Updated] (YARN-7224) Support GPU isolation for docker container
Wangda Tan updated YARN-7224:
-----------------------------
    Attachment: YARN-7224.004.patch

Attached ver.4 patch, which fixes warnings and test failures and adds more preventive tests.
[jira] [Updated] (YARN-7224) Support GPU isolation for docker container
Wangda Tan updated YARN-7224:
-----------------------------
    Attachment: YARN-7224.003.patch

Attached ver.003 patch. Major updates:
1) Instead of using {{volume-driver}}, this patch adds the {{docker volume create}} command to c-e and the NM Java side. The reason is that {{volume-driver}} can only use a single volume driver for each launched docker container.
2) Updated {{c-e}} and the Java side: if a mounted volume is a named volume in docker, skip checking file existence. (Named volumes still need to be added to the permitted list in container-executor.cfg.)
3) More tests and cleanups.
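The named-volume rule in update 2) can be sketched roughly as follows. This is only an illustration of the idea, not the c-e or NM code: the function names are hypothetical, the name pattern is an assumption about what docker accepts as a volume name, and the permitted list is simplified to exact matches, which is not necessarily how container-executor.cfg entries are interpreted.

```python
import os
import re
from typing import Iterable

# Assumption: a mount source shaped like this (no leading "/") is a docker
# named volume rather than a host path.
_NAMED_VOLUME = re.compile(r"^[a-zA-Z0-9][a-zA-Z0-9_.-]*$")


def is_named_volume(source: str) -> bool:
    """Return True if the mount source looks like a docker named volume."""
    return bool(_NAMED_VOLUME.match(source))


def validate_mount_source(source: str, permitted: Iterable[str]) -> None:
    """Raise if the mount source must not be used; return None otherwise.

    Named volumes are still checked against the permitted list, but the
    file-existence check is skipped for them because they are not host paths.
    """
    if source not in set(permitted):
        raise ValueError(f"mount source not permitted: {source}")
    if is_named_volume(source):
        return  # nothing on the host filesystem to check
    if not os.path.exists(source):
        raise FileNotFoundError(f"mount source does not exist: {source}")
```

With this shape, a permitted volume name such as the hypothetical {{nvidia_driver_375.66}} passes without touching the filesystem, while an absolute host path still has to exist.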
[jira] [Updated] (YARN-7224) Support GPU isolation for docker container
Wangda Tan updated YARN-7224:
-----------------------------
    Attachment: YARN-7224.002-wip.patch

Attached ver.2 work-in-progress patch.

The major change in this patch: I found that cgroup + docker doesn't work under newer docker versions which use {{runc}} as the default runtime. Setting {{--cgroup-parent}} to a cgroup which includes any {{devices.deny}} prevents the docker container from being launched. Instead, this patch passes allowed GPU devices via {{--device}} to the docker launch command.

Tested this patch on a CentOS 7 machine with 2 GPU devices; it works fine.

Some cleanups still need to be done and more unit tests need to be added, so this is marked as WIP.
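For readers who want to picture the {{--device}} approach, the sketch below turns a set of allowed GPU minor numbers into {{--device}} arguments for a {{docker run}} command line. It is an illustration only: the function name is hypothetical, and the device paths ({{/dev/nvidiaN}} per GPU, plus {{/dev/nvidiactl}} and {{/dev/nvidia-uvm}} as shared control devices) are assumptions about a typical NVIDIA host rather than values read from the patch.

```python
from typing import Iterable, List, Sequence

# Assumption: control devices that every GPU container needs on the host.
DEFAULT_CONTROL_DEVICES = ("/dev/nvidiactl", "/dev/nvidia-uvm")


def gpu_device_args(
    allowed_minor_numbers: Iterable[int],
    control_devices: Sequence[str] = DEFAULT_CONTROL_DEVICES,
) -> List[str]:
    """Build `--device host:container` arguments for the allowed GPUs only.

    GPUs that are not listed are simply never passed to docker, so the
    container has no device node for them.
    """
    args: List[str] = []
    for path in control_devices:
        args += ["--device", f"{path}:{path}"]
    for minor in sorted(set(allowed_minor_numbers)):
        path = f"/dev/nvidia{minor}"
        args += ["--device", f"{path}:{path}"]
    return args


if __name__ == "__main__":
    # A container that was assigned GPU 1 on a machine with 2 GPUs.
    print(" ".join(["docker", "run"] + gpu_device_args([1]) + ["<image>"]))
```

This is an allow-list: isolation comes from what is passed to the launch command, not from denying devices in a parent cgroup.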
[jira] [Updated] (YARN-7224) Support GPU isolation for docker container
Wangda Tan updated YARN-7224:
-----------------------------
    Attachment: YARN-7224.001.patch

Attached ver.1 patch on top of YARN-6620. Please feel free to share your thoughts!
[jira] [Updated] (YARN-7224) Support GPU isolation for docker container
Wangda Tan updated YARN-7224:
-----------------------------
    Description: YARN-6620 added support for GPU isolation on the NM side, which only supports non-docker containers. We need to add support so that docker containers launched by YARN can utilize GPUs.