Hi everyone,
I just ran into a very subtle configuration problem with enabling GPU support 
on Mesos and thought I'd share a brief post mortem.
Scenario: Running a GPU Mesos task. The task first executes nvidia-smi to 
confirm the GPUs are visible, then runs a Caffe training example to verify 
the GPU is usable.
Symptom: nvidia-smi reported the correct number of GPUs, but the training 
example crashed when creating the CUDA device.
Debugging tactics: To debug this, I added an infinite loop to the end of the 
task so the environment would not be torn down. Next I logged into the machine, 
found the PID of the Mesos task, and entered its namespaces with:

    nsenter -t $TASK_PID -m -u -i -n -p -r -w

At this point I ran the test manually and it worked. It worked because my test 
terminal had not been added to the devices cgroup. So next I added it with:

    echo $TEST_TERMINAL_PID >> /sys/fs/cgroup/devices/mesos/f6736041-d403-4494-95fd-604eace34ce1/tasks

After joining the cgroup I could reproduce the problem, and I systematically 
added devices to the cgroup's allow list until the test passed.
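
For anyone repeating the allow-list step, here is a rough sketch of how it can 
be scripted. It assumes the cgroup v1 devices controller, whose devices.list 
file holds entries such as "c 195:* rwm". The helper name cgroup_allows is 
made up for this example, and it only matches type, major and minor (it does 
not check the access-mode letters).

```shell
#!/bin/sh
# Succeeds if a devices.list-style file has an entry covering the device.
# $1: list file, $2: type (c or b), $3: major, $4: minor
# Hypothetical helper; ignores the access-mode column (r/w/m).
cgroup_allows() {
    awk -v t="$2" -v maj="$3" -v min="$4" '
        {
            split($2, id, ":")
            if (($1 == t || $1 == "a") &&
                (id[1] == maj || id[1] == "*") &&
                (id[2] == min || id[2] == "*")) { found = 1; exit }
        }
        END { exit !found }
    ' "$1"
}

# Example use (as root), with CG set to the task's devices cgroup directory:
#   cgroup_allows "$CG/devices.list" c 250 0 || echo 'c 250:0 rwm' > "$CG/devices.allow"
```

Writing an entry to devices.allow is what I mean by adding a device to the 
allow list; the 250:0 pair in the comment is only an illustration.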
Root cause: After a reboot, the nvidia-uvm device is not created automatically. 
To create it, "sudo mknod -m 666 /dev/nvidia-uvm c 250 0" had been added to a 
start up script. The problem is that nvidia-uvm uses a major device ID in the 
experimental range, so the ID the driver actually gets can change from boot to 
boot and the hardcoded value of 250 can be wrong. When Mesos starts up, it 
reads the major device ID from the /dev/nvidia-uvm node, which carries whatever 
value the start up script gave it. Mesos then uses that number when it creates 
the devices cgroup, instead of the ID the driver is really using. nvidia-smi 
worked because it never accesses the nvidia-uvm device.
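
A quick way to spot this kind of mismatch is to compare the major number on 
the device node with the one the kernel lists in /proc/devices. The sketch 
below assumes GNU stat and assumes the driver is listed under the name 
"nvidia-uvm"; the function names are my own.

```shell
#!/bin/sh
# Major number the kernel registered for a character driver.
# $1: driver name, $2: file in /proc/devices format (default: /proc/devices)
registered_major() {
    awk -v name="$1" '
        /^Block devices:/ { exit }      # only scan the character section
        $2 == name { print $1; exit }
    ' "${2:-/proc/devices}"
}

# Major number of an existing device node (GNU stat prints it in hex).
node_major() {
    printf '%d\n' "0x$(stat -c '%t' "$1")"
}

# Compare only when the node exists (assumed name: nvidia-uvm).
if [ -c /dev/nvidia-uvm ]; then
    expected=$(registered_major nvidia-uvm)
    actual=$(node_major /dev/nvidia-uvm)
    if [ "$expected" != "$actual" ]; then
        echo "stale node: major $actual on /dev/nvidia-uvm, kernel has ${expected:-none}"
    fi
fi
```

If the two numbers differ, the node was most likely created with a hardcoded 
major, as in my case.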
The fix: Do not hard code the major device ID of nvidia-uvm in a start up 
script. Instead, bring the device up with:

    nvidia-modprobe -u -c 0
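
As a sketch only, a start up script fragment built around that command might 
look like the following. The MODPROBE and NODE variables and the bring_up_uvm 
function are names I made up so the fragment can be dry-run; the only real 
command in it is the nvidia-modprobe call above. I am assuming it should run 
before the Mesos agent starts, since Mesos reads the device node at start up.

```shell
#!/bin/sh
# Let the driver tooling create /dev/nvidia-uvm instead of using mknod.
# MODPROBE and NODE are overridable only to make dry runs possible.
MODPROBE=${MODPROBE:-nvidia-modprobe}
NODE=${NODE:-/dev/nvidia-uvm}

bring_up_uvm() {
    if ! command -v "$MODPROBE" >/dev/null 2>&1; then
        echo "$MODPROBE not found" >&2
        return 1
    fi
    # Same flags as in the fix: -u for the UVM module, -c 0 for minor 0.
    "$MODPROBE" -u -c 0 || return 1
    # Succeed only if a character device node is now present.
    [ -c "$NODE" ]
}

# In the real start up script, call it before launching the agent:
#   bring_up_uvm || exit 1
```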

I hope this information helps someone and a big thanks to Kevin Klues for 
helping me debug this issue.
Justin
                                          
