Re: [slurm-users] slurm, gres:gpu, only 1 GPU out of 4 is detected
On Wednesday, 13 November 2019 10:11:30 AM PST Tamas Hegedus wrote:

> Thanks for your suggestion. You are right, I do not have to deal with
> specific GPUs.
> (I have not tried to compile your code, I simply tested two gromacs runs
> on the same node with --gres=gpu:1 options.)

How are you controlling access to GPUs? Is that via cgroups?

If so you should be fine, but if you're not using cgroups to control
access then you may well find that they are sharing the same GPU.

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
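For context, the cgroup-based device isolation Chris is asking about is typically enabled with settings along these lines. This is a minimal sketch of commonly used options, not a configuration confirmed anywhere in this thread:

=====
# slurm.conf (excerpt)
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# cgroup.conf (excerpt)
# ConstrainDevices=yes prevents a job from opening GPU device files it
# was not allocated, regardless of what CUDA_VISIBLE_DEVICES says.
ConstrainDevices=yes
=====

Without ConstrainDevices=yes, isolation rests entirely on CUDA_VISIBLE_DEVICES, which a job can override or ignore, so two jobs can silently share one GPU.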
Re: [slurm-users] slurm, gres:gpu, only 1 GPU out of 4 is detected
Thanks for your suggestion. You are right, I do not have to deal with
specific GPUs.

(I have not tried to compile your code, I simply tested two gromacs runs
on the same node with --gres=gpu:1 options.)

On 11/13/19 5:17 PM, Renfro, Michael wrote:
> Pretty sure you don't need to explicitly specify GPU IDs on a Gromacs
> job running inside of Slurm with gres=gpu. Gromacs should only see the
> GPUs you have reserved for that job.
> [...]

--
Tamas Hegedus, PhD
Senior Research Fellow
Department of Biophysics and Radiation Biology
Semmelweis University | phone: (36) 1-459 1500/60233
Tuzolto utca 37-47    | mailto:ta...@hegelab.org
Budapest, 1094, Hungary | http://www.hegelab.org
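A test like the one described above can be reproduced by submitting a single-GPU batch script twice. This is a sketch only: the script is assembled from the mdrun flags quoted elsewhere in the thread, and the key point is that no -gpu_id option is passed:

=====
#!/bin/bash
#SBATCH --gres=gpu:1        # request exactly one GPU
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# Let Gromacs use whichever GPU Slurm exposes to this job; do not pass
# -gpu_id $CUDA_VISIBLE_DEVICES (see the fatal error in the original
# post below).
gmx mdrun -v -pin on -deffnm equi_nvt -nt 8 -nb gpu
=====

Submitting this script twice should place the two jobs on different GPUs of the same node, provided a second GPU is free and device isolation is working.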
Re: [slurm-users] slurm, gres:gpu, only 1 GPU out of 4 is detected
Pretty sure you don't need to explicitly specify GPU IDs on a Gromacs job running inside of Slurm with gres=gpu. Gromacs should only see the GPUs you have reserved for that job.

Here's a verification program you can run to check that two different GPU jobs see different GPU devices (compile with nvcc):

=====
// From http://www.cs.fsu.edu/~xyuan/cda5125/examples/lect24/devicequery.cu
#include <stdio.h>

void printDevProp(cudaDeviceProp dP)
{
    printf("%s has %d multiprocessors\n", dP.name, dP.multiProcessorCount);
    printf("%s has PCI BusID %d, DeviceID %d\n",
           dP.name, dP.pciBusID, dP.pciDeviceID);
}

int main()
{
    // Number of CUDA devices
    int devCount;
    cudaGetDeviceCount(&devCount);
    printf("There are %d CUDA devices.\n", devCount);

    // Iterate through devices
    for (int i = 0; i < devCount; ++i)
    {
        // Get device properties
        printf("CUDA Device #%d: ", i);
        cudaDeviceProp devProp;
        cudaGetDeviceProperties(&devProp, i);
        printDevProp(devProp);
    }

    return 0;
}
=====

When run from two simultaneous jobs on the same node (each with a gres=gpu), I get:

=====
[renfro@gpunode003(job 221584) hw]$ ./cuda_props
There are 1 CUDA devices.
CUDA Device #0: Tesla K80 has 13 multiprocessors
Tesla K80 has PCI BusID 5, DeviceID 0
=====
[renfro@gpunode003(job 221585) hw]$ ./cuda_props
There are 1 CUDA devices.
CUDA Device #0: Tesla K80 has 13 multiprocessors
Tesla K80 has PCI BusID 6, DeviceID 0
=====

--
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

> On Nov 13, 2019, at 9:54 AM, Tamas Hegedus wrote:
>
> Hi,
>
> I run gmx 2019 using GPU. There are 4 GPUs in my GPU hosts, and I have
> slurm configured with gres=gpu.
> [...]
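Assuming the source above is saved as devicequery.cu, building and running it under a one-GPU allocation would look something like this (the binary name matches the prompt output above; the srun invocation is an illustrative way to launch it, not one given in the thread):

=====
nvcc -o cuda_props devicequery.cu
srun --gres=gpu:1 ./cuda_props
=====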
[slurm-users] slurm, gres:gpu, only 1 GPU out of 4 is detected
Hi,

I run gmx 2019 using GPUs. There are 4 GPUs in each of my GPU hosts, and
I have Slurm configured with gres=gpu.

1. If I submit a job with --gres=gpu:1, then GPU #0 is identified and
used (-gpu_id $CUDA_VISIBLE_DEVICES).

2. If I submit a second job, it fails: $CUDA_VISIBLE_DEVICES is 1 and is
selected, but GPU #0 is the one identified by gmx as a compatible GPU.
From the output:

gmx mdrun -v -pin on -deffnm equi_nvt -nt 8 -gpu_id 1 -nb gpu -pme gpu -npme 1 -ntmpi 4

GPU info:
    Number of GPUs detected: 1
    #0: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible

Fatal error:
You limited the set of compatible GPUs to a set that included ID #1, but
that ID is not for a compatible GPU. List only compatible GPUs.

3. If I log in to that node and run the mdrun command written into the
output in the previous step, it selects the right GPU and runs as
expected.

$CUDA_DEVICE_ORDER is set to PCI_BUS_ID

I cannot decide whether this is a slurm config error or something with
gromacs, as $CUDA_VISIBLE_DEVICES is set correctly by slurm and I expect
gromacs to detect all 4 GPUs.

Thanks for your help and suggestions,
Tamas

--
Tamas Hegedus, PhD
Senior Research Fellow
Department of Biophysics and Radiation Biology
Semmelweis University | phone: (36) 1-459 1500/60233
Tuzolto utca 37-47    | mailto:ta...@hegelab.org
Budapest, 1094, Hungary | http://www.hegelab.org
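A likely explanation for the failure in step 2, which the replies above hint at but never spell out: when Slurm sets CUDA_VISIBLE_DEVICES=1, the CUDA runtime renumbers the visible devices starting from 0, so inside the job the only valid device index is 0, and "-gpu_id 1" names a device that no longer exists. A minimal sketch of the mismatch; the program and its printed wording are illustrative, compile with nvcc:

=====
// Run inside a Slurm job that got --gres=gpu:1 on a multi-GPU node.
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    // What Slurm exported, e.g. "1" for the second physical GPU.
    const char *visible = getenv("CUDA_VISIBLE_DEVICES");

    // How many devices the CUDA runtime can actually see; they are
    // renumbered 0..devCount-1 regardless of their physical IDs.
    int devCount = 0;
    cudaGetDeviceCount(&devCount);

    printf("CUDA_VISIBLE_DEVICES=%s, valid in-process indices: 0..%d\n",
           visible ? visible : "(unset)", devCount - 1);
    return 0;
}
=====

So "-gpu_id $CUDA_VISIBLE_DEVICES" is the bug: with CUDA_VISIBLE_DEVICES=1 it asks gmx for in-process device 1, which does not exist. Omitting -gpu_id (or passing 0) lets Gromacs use the one device it can see.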