[jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image

2017-02-08 Thread Andrei Filip (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15858139#comment-15858139
 ] 

Andrei Filip commented on AURORA-1763:
--

We're currently working with Mesos and Aurora and have recently run into this 
exact issue: requiring GPU support for processes running in Docker containers. 
Is there any update here, or can you recommend a valid short-term workaround?

> GPU drivers are missing when using a Docker image
> -
>
> Key: AURORA-1763
> URL: https://issues.apache.org/jira/browse/AURORA-1763
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>
> When launching a GPU job that uses a Docker image and the unified 
> containerizer, the Nvidia drivers are not correctly mounted. As an experiment, 
> I launched a task using both mesos-execute and Aurora with the same Docker 
> image and ran nvidia-smi. During the experiment I noticed that the 
> /usr/local/nvidia folder was not being mounted properly. To confirm this was 
> the issue, I tar'ed up the drivers (/run/mesos/isolators/gpu/nvidia_352.39) 
> and manually added them to the Docker image. Once this was done, the task 
> launched correctly.
> Here is the resulting mountinfo for the mesos-execute task. Notice how 
> /usr/local/nvidia is mounted from the /mesos directory.
> {noformat}140 102 8:17 
> /mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62
>  / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 141 140 8:17 
> /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11
>  /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 
> rw,errors=remount-ro,data=ordered
> 142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia 
> rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
> 143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> 144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
> 145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> 146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
> 147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts 
> rw,mode=600,ptmxmode=666
> 148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw{noformat}
> Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is 
> missing.
> {noformat}72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 
> rw,errors=remount-ro,data=ordered
> 73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev 
> rw,size=10240k,nr_inodes=16521649,mode=755
> 74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts 
> rw,gid=5,mode=620,ptmxmode=000
> 75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
> 76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
> 77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
> 78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs 
> rw,size=26438160k,mode=755
> 79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs 
> rw,size=5120k
> 80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
> 82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
> 83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - 
> securityfs securityfs rw
> 84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs 
> ro,mode=755
> 85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 
> - cgroup cgroup 
> rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
> 86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 
> - cgroup cgroup rw,cpuset
> 87 84 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime 
> master:14 - cgroup cgroup rw,cpu,cpuacct
> 88 84 0:24 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:15 
> - cgroup cgroup rw,devices
> 89 84 0:25 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:16 
> - cgroup cgroup rw,freezer
> 90 84 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime 
> master:17 - cgroup cgroup rw,net_cls,net_prio
> 91 84 0:27 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:18 - 
> cgroup cgroup rw,blkio
> 92 84 0:28 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime 
> master:19 - cgroup cgroup rw,perf_event
> 93 82 0:21 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:11 - 
> pstore pstore rw
> 94 82 0:6 / /sys/kernel/debug rw,relatime master:22 - debugfs debugfs rw
> 95 72 0:3 / /proc rw,nosuid,nodev,noexec,relatime master:12 - proc proc rw
> 96 95 0:29 / /proc/sys/fs/binfmt_misc rw,relatime master:20 - autofs 
> systemd-1 rw,fd=22,pgrp=1,timeout=300,minproto=5,maxproto=5,direct
> 97 96 0:34 / /proc/sys/fs/binfmt_misc rw,relatime master:27 - 

[jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-12-23 Thread Kostiantyn Bokhan (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15773271#comment-15773271
 ] 

Kostiantyn Bokhan commented on AURORA-1763:
---

PATH and LD_LIBRARY_PATH are not transferred from the image either, so:

{code}NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please 
make sure that the NVIDIA Display Driver is properly installed and present in 
your system.
Please also try adding directory that contains libnvidia-ml.so to your system 
PATH.{code}
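
One short-term mitigation (a sketch only, not project tooling; it assumes the 
driver volume ends up mounted at {{/usr/local/nvidia}} with the usual bin/lib 
layout) is to export those variables explicitly in the task's command line:

{noformat}
# Hedged workaround sketch: point PATH and LD_LIBRARY_PATH at the driver
# volume before launching the GPU binary. Adjust lib vs. lib64 to match
# what the mounted volume actually contains.
export PATH=/usr/local/nvidia/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$LD_LIBRARY_PATH
nvidia-smi
{noformat}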




[jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468200#comment-15468200
 ] 

Joshua Cohen commented on AURORA-1763:
--

Yes, for the reasons Jie mentions, setting rootfs is not an option for Thermos.

Another option would be to configure each Mesos agent host with a 
{{/usr/local/nvidia}} directory and then configure the 
{{--global_container_mounts}} flag on the Scheduler to point to that path. 
Thermos will then mount it into each task.
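
Sketching that (the exact flag syntax is my assumption; verify it against your 
scheduler's {{-help}} output):

{noformat}
# Hedged sketch: add to the scheduler's startup flags. Assumes the drivers
# already exist at /usr/local/nvidia on every agent host and that the flag
# accepts host_path:container_path:mode entries.
aurora-scheduler <existing flags...> \
  -global_container_mounts=/usr/local/nvidia:/usr/local/nvidia:ro
{noformat}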


[jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468190#comment-15468190
 ] 

Jie Yu commented on AURORA-1763:


Setting the rootfs for the executor is another option, but I think that might 
break Thermos because it assumes it can see the host rootfs (I might be wrong). 
Also, bundling the executor (and libmesos.so) in an image is not trivial 
because of ABI compatibility issues: Thermos would need one Docker image for 
each Linux distribution.


[jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468177#comment-15468177
 ] 

Jie Yu commented on AURORA-1763:


I mean it's mounted from /var/run/mesos/isolators/gpu/xxx and should be 
mounted to /usr/local/nvidia in the task's rootfs.

The GPU isolator will prepare all files under {{/var/run/mesos/isolators/gpu}}, 
so Thermos does not have to worry about it.
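
As a rough illustration of that layout (a sketch: the version-stamped source 
path is from the reporter's environment, and {{$TASK_ROOTFS}} is a hypothetical 
stand-in for wherever the task filesystem is provisioned):

{noformat}
# Hedged sketch of the suggested bind mount, performed on the agent host
# before the task pivots into its rootfs:
mount --bind /var/run/mesos/isolators/gpu/nvidia_352.39 \
  "$TASK_ROOTFS/usr/local/nvidia"
{noformat}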


[jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Justin Pinkul (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468170#comment-15468170
 ] 

Justin Pinkul commented on AURORA-1763:
---

Including the GPU drivers in the Docker image could work as a temporary 
workaround but falls apart incredibly fast. The problem here is that the driver 
is specific to the GPU and, I think, even the kernel version. This means that 
as soon as we have more than one type of GPU this approach totally falls apart. 
In our particular use case we expect to hit this scenario in around a month.

Binding the GPU volume into the task's rootfs will work. However, as the 
executor is implemented right now, determining the location to mount from will 
be very fragile. Since the executor is not launched with a different rootfs, 
Mesos does not mount the drivers for the executor. This means that the executor 
would have to find the drivers itself, in this case under 
{{/run/mesos/isolators/gpu/nvidia_352.39}}. That sounds very fragile, 
especially since the path includes the driver version.

I think a better approach would be to set the rootfs for the executor so that 
Mesos mounts the drivers to {{/usr/local/nvidia}}. Then, when the executor 
starts the task, it can pass this mount on without duplicating the logic Mesos 
uses to find the drivers.


[jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468146#comment-15468146
 ] 

Joshua Cohen commented on AURORA-1763:
--

Where are they mounted from?


[jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468142#comment-15468142
 ] 

Jie Yu commented on AURORA-1763:


Short term, can Thermos bind-mount GPU volumes into the task's rootfs?


[jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468115#comment-15468115
 ] 

Joshua Cohen commented on AURORA-1763:
--

[~jpinkul] I think in the meantime the only solution is to explicitly include 
the GPU drivers in your Docker image if you'd like to use the unified 
containerizer? Is that correct [~jieyu]?
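
A rough sketch of that interim workaround, reusing the tar step from the issue 
description (the driver path and version are specific to the reporter's hosts):

{noformat}
# On a host where the Mesos GPU isolator has already staged the drivers:
tar czf nvidia_drivers.tar.gz -C /run/mesos/isolators/gpu/nvidia_352.39 .
# Then unpack the archive into the image, e.g. with a Dockerfile line like:
#   ADD nvidia_drivers.tar.gz /usr/local/nvidia/
{noformat}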


[jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468116#comment-15468116
 ] 

Zameer Manji commented on AURORA-1763:
--

[~jieyu]: I agree Pods are the future here. We can drop our dependency on 
{{mesos-containerizer}} and use the appropriate primitives. However, what are 
we supposed to do in the short term?


[jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468110#comment-15468110
 ] 

Jie Yu commented on AURORA-1763:


This is expected. I think the ultimate solution for Aurora is to use the 
upcoming nested container primitive to create nested containers. The GPU 
isolator will be made nesting-aware so it will prepare the rootfs for the 
nested container as well. The Mesos design doc is here:
https://docs.google.com/document/d/1FtcyQkDfGp-bPHTW4pUoqQCgVlPde936bo-IIENO_ho/edit

Expect nested container support to land soon (in one or two months).


[jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15467903#comment-15467903
 ] 

Joshua Cohen commented on AURORA-1763:
--

Thanks for the context. I'm wondering if this is what's causing it:

{code:title=https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/isolators/gpu/isolator.cpp#L288-L290}
  if (!containerConfig.has_rootfs()) {
    return None();
  }
{code}

Aurora does not configure tasks with a {{ContainerInfo}} that has an {{Image}} 
set. Instead, it configures the task's filesystem as a {{Volume}} with an 
{{Image}} set; the executor then uses {{mesos-containerizer launch ...}} to 
pivot/chroot into that filesystem. To me it looks like the code above relies on 
the container itself having a rootfs, which currently won't be the case given 
the way we isolate task filesystems.
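
For what it's worth, a quick way to see which case a task landed in, mirroring 
the mountinfo inspection in the issue description (a diagnostic sketch only):

{noformat}
# Run inside the task: check whether the driver volume reached this mount
# namespace.
grep /usr/local/nvidia /proc/self/mountinfo \
  || echo "/usr/local/nvidia is not mounted here"
{noformat}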
