Pretty sure you don't need to explicitly specify GPU IDs for a Gromacs job running inside Slurm with gres=gpu. Gromacs should only see the GPUs you've reserved for that job.
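Something along these lines should work (just a sketch; the job name, CPU count, and any module loads are placeholders for your site, and the mdrun flags are the ones from your own command, minus -gpu_id):

=====
#!/bin/bash
#SBATCH --job-name=gmx-gpu-test
#SBATCH --gres=gpu:1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# Slurm sets CUDA_VISIBLE_DEVICES to the GPU(s) allocated to this job,
# so mdrun should only detect (and use) that device.
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"

# No -gpu_id here; let mdrun auto-detect the GPU it was granted.
gmx mdrun -v -pin on -deffnm equi_nvt -nt 8 -nb gpu -pme gpu -npme 1 -ntmpi 4
=====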
Here's some verification code you can run to check that two different GPU jobs see different GPU devices (compile with nvcc):

=====
// From http://www.cs.fsu.edu/~xyuan/cda5125/examples/lect24/devicequery.cu
#include <stdio.h>

void printDevProp(cudaDeviceProp dP)
{
    printf("%s has %d multiprocessors\n", dP.name, dP.multiProcessorCount);
    printf("%s has PCI BusID %d, DeviceID %d\n", dP.name, dP.pciBusID, dP.pciDeviceID);
}

int main()
{
    // Number of CUDA devices
    int devCount;
    cudaGetDeviceCount(&devCount);
    printf("There are %d CUDA devices.\n", devCount);

    // Iterate through devices
    for (int i = 0; i < devCount; ++i)
    {
        // Get device properties
        printf("CUDA Device #%d: ", i);
        cudaDeviceProp devProp;
        cudaGetDeviceProperties(&devProp, i);
        printDevProp(devProp);
    }

    return 0;
}
=====

When run from two simultaneous jobs on the same node (each with a gres=gpu), I get:

=====
[renfro@gpunode003(job 221584) hw]$ ./cuda_props
There are 1 CUDA devices.
CUDA Device #0: Tesla K80 has 13 multiprocessors
Tesla K80 has PCI BusID 5, DeviceID 0
=====

=====
[renfro@gpunode003(job 221585) hw]$ ./cuda_props
There are 1 CUDA devices.
CUDA Device #0: Tesla K80 has 13 multiprocessors
Tesla K80 has PCI BusID 6, DeviceID 0
=====

-- 
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

> On Nov 13, 2019, at 9:54 AM, Tamas Hegedus <ta...@hegelab.org> wrote:
> 
> Hi,
> 
> I run gmx 2019 using GPU.
> There are 4 GPUs in my GPU hosts.
> I have Slurm and have configured gres=gpu.
> 
> 1. If I submit a job with --gres=gpu:1, then GPU #0 is identified and used
> (-gpu_id $CUDA_VISIBLE_DEVICES).
> 2. If I submit a second job, it fails: $CUDA_VISIBLE_DEVICES is 1 and is
> selected, but GPU #0 is the one identified by gmx as a compatible GPU.
> From the output:
> 
> gmx mdrun -v -pin on -deffnm equi_nvt -nt 8 -gpu_id 1 -nb gpu -pme gpu
> -npme 1 -ntmpi 4
> 
> GPU info:
>   Number of GPUs detected: 1
>   #0: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible
> 
> Fatal error:
> You limited the set of compatible GPUs to a set that included ID #1, but that
> ID is not for a compatible GPU. List only compatible GPUs.
> 
> 3. If I log in to that node and run the mdrun command written into the
> output in the previous step, it selects the right GPU and runs as expected.
> 
> $CUDA_DEVICE_ORDER is set to PCI_BUS_ID.
> 
> I cannot decide whether this is a Slurm config error or something with
> Gromacs, since $CUDA_VISIBLE_DEVICES is set correctly by Slurm and I expect
> Gromacs to detect all 4 GPUs.
> 
> Thanks for your help and suggestions,
> Tamas
> 
> --
> Tamas Hegedus, PhD
> Senior Research Fellow
> Department of Biophysics and Radiation Biology
> Semmelweis University | phone: (36) 1-459 1500/60233
> Tuzolto utca 37-47 | mailto:ta...@hegelab.org
> Budapest, 1094, Hungary | http://www.hegelab.org