Hi All,
We have a couple of nodes with 8 Nvidia Titan X GPUs each. We have some software
that can run in parallel across GPUs, but performance is only good if the
inter-GPU communication stays on the PCI links of a single CPU socket.
Right now, the only approach I have been able to get working reliably (with
Slurm 14.11.8 on Scientific Linux 6) is to define two GPU types in gres.conf:
NodeName=c-3-29,c-9-9 Name=gpu Type=titanxa File=/dev/nvidia0 CPUs=0-15
NodeName=c-3-29,c-9-9 Name=gpu Type=titanxa File=/dev/nvidia1 CPUs=0-15
NodeName=c-3-29,c-9-9 Name=gpu Type=titanxa File=/dev/nvidia2 CPUs=0-15
NodeName=c-3-29,c-9-9 Name=gpu Type=titanxa File=/dev/nvidia3 CPUs=0-15
NodeName=c-3-29,c-9-9 Name=gpu Type=titanxb File=/dev/nvidia4 CPUs=16-31
NodeName=c-3-29,c-9-9 Name=gpu Type=titanxb File=/dev/nvidia5 CPUs=16-31
NodeName=c-3-29,c-9-9 Name=gpu Type=titanxb File=/dev/nvidia6 CPUs=16-31
NodeName=c-3-29,c-9-9 Name=gpu Type=titanxb File=/dev/nvidia7 CPUs=16-31
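With that config, users pick a pool by GRES type at submission, e.g. (sketch
using the type names defined above and the standard --gres=name:type:count
sbatch syntax; "job.sh" is a placeholder script):

```shell
# Request 4 GPUs from the pool attached to the first socket
sbatch --gres=gpu:titanxa:4 job.sh

# ...or 4 GPUs from the pool attached to the second socket
sbatch --gres=gpu:titanxb:4 job.sh
```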
The downside is that the user needs to specify one GRES type or the other at
job submission. I suppose I could modify the job_submit Lua script to pick one
randomly or based on current usage, but that could still lead to imbalanced use.
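For the record, a minimal job_submit.lua sketch of the random-pick idea
(untested; it assumes the 14.11 Lua plugin exposes the GRES request string as
job_desc.gres, and it only handles the plain "gpu:N" form):

```lua
-- job_submit.lua sketch: if a user requests untyped "gpu:N" GRES,
-- rewrite it to one of the two typed pools, chosen at random.
function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.gres ~= nil then
        -- match "gpu:N" with no explicit type
        local count = string.match(job_desc.gres, "^gpu:(%d+)$")
        if count ~= nil then
            local pool = (math.random(2) == 1) and "titanxa" or "titanxb"
            job_desc.gres = "gpu:" .. pool .. ":" .. count
        end
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
```

Picking based on current usage instead of randomly would need a query of
allocated GRES per type, which I haven't worked out yet.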
I had earlier tried to have a single Type=titanx, with each device restricted
to the cores on one socket or the other. I couldn't figure out a way to reliably
restrict a single job to cores on a single socket. Also, even with the device
restrictions, I could still end up with a job whose CPU cores were on one
socket but whose GPU was attached to the other.
Is there a recommended way to handle this situation? I'd like to preserve the
option of having a single job be able to use all 8 GPUs.
Thanks,
Nate Crawford
--
________________________________________________________________________
Dr. Nathan Crawford [email protected]
Modeling Facility Director
Department of Chemistry
1102 Natural Sciences II Office: 2101 Natural Sciences II
University of California, Irvine Phone: 949-824-4508
Irvine, CA 92697-2025, USA