Hello,

We're looking for advice on an salloc/srun setup where each task uses 1 GPU, but the job as a whole makes use of all available GPUs.
*Test #1:* We desire an salloc and srun such that each task gets 1 GPU, but the GPU usage for the job is spread out among the 4 available devices (see gres.conf below).

% salloc -n 12 -c 2 --gres=gpu:1
% srun env | grep CUDA
CUDA_VISIBLE_DEVICES=0
(12 times)

Where we desire:

CUDA_VISIBLE_DEVICES=0
CUDA_VISIBLE_DEVICES=1
CUDA_VISIBLE_DEVICES=2
CUDA_VISIBLE_DEVICES=3

and so on (12 times), such that each task still gets 1 GPU, but usage is spread out among the 4 available devices rather than all tasks landing on one device (device 0). That way each task is not waiting on device 0 to free up, as is currently the case.

What are we missing or misunderstanding?
- an salloc / srun parameter?
- a slurm.conf or gres.conf setting?

The additional tests below illustrate the current behavior.

*Test #2:* Here we believe each srun task will need all 4 GPUs.

% salloc -n 12 -c 2 --gres=gpu:4
% srun env | grep CUDA
CUDA_VISIBLE_DEVICES=0,1,2,3
(12 times)

This matches expectation.

*Test #3:* Another test, where I submit multiple sruns in succession. Here we use a simple sleepCUDA.py script, which sleeps a few seconds and then prints $CUDA_VISIBLE_DEVICES (a minimal sketch of it is at the end of this mail).

% salloc -n 12 -c 2 --gres=gpu:4
% srun --gres=gpu:1 sleepCUDA.py &
% srun --gres=gpu:1 sleepCUDA.py &
% srun --gres=gpu:1 sleepCUDA.py &
% srun --gres=gpu:1 sleepCUDA.py &

Result:

CUDA_VISIBLE_DEVICES=0 (jobid 1)
CUDA_VISIBLE_DEVICES=1 (jobid 2)
CUDA_VISIBLE_DEVICES=2 (jobid 3)
CUDA_VISIBLE_DEVICES=3 (jobid 4)

and so on (though not necessarily in 0,1,2,3 order). A single srun submission would still only use 1 GPU (device 0), as before and as expected. This seems like a step in the right direction, since multiple devices were used, but it is not quite what we wanted. It does match the documentation at https://slurm.schedmd.com/archive/slurm-16.05.7/gres.html:

*"By default, a job step will be allocated all of the generic resources allocated to the job.* [Test #2] *If desired, the job step may explicitly specify a different generic resource count than the job."* [Test #3]

To run Test #3 non-interactively, should we look into creating an sbatch script (with multiple sruns) instead of salloc? (See the sketch at the end of this mail.)

*OS:* CentOS 7
*Slurm version:* 16.05.6

*gres.conf*
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3

*slurm.conf (truncated/partial/simplified)*
NodeName=node1 Gres=gpu:4
NodeName=node2 Gres=gpu:4
NodeName=node3 Gres=gpu:4
NodeName=node4 Gres=gpu:4
GresTypes=gpu

There is no cgroup.conf. Posting the actual .conf files is not practical due to firewalls.

Any advice will be greatly appreciated! Thank you!
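P.S. For anyone who wants to reproduce Test #3: sleepCUDA.py is essentially equivalent to the following (a minimal stand-in; the sleep length and exact output are arbitrary):

#!/usr/bin/env python
# Sleep briefly to simulate a few seconds of GPU work, then report
# which device(s) Slurm exposed to this task.
import os
import time

time.sleep(5)
print(os.environ.get("CUDA_VISIBLE_DEVICES"))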
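And this is the kind of sbatch script we have in mind for the non-interactive version of Test #3 (an untested sketch; the 4 x 3 task split and the script path are placeholders):

#!/bin/bash
#SBATCH -n 12
#SBATCH -c 2
#SBATCH --gres=gpu:4

# Four concurrent 1-GPU job steps (3 tasks each), mirroring the four
# backgrounded sruns from Test #3; 'wait' blocks until all steps finish.
srun -n 3 --gres=gpu:1 ./sleepCUDA.py &
srun -n 3 --gres=gpu:1 ./sleepCUDA.py &
srun -n 3 --gres=gpu:1 ./sleepCUDA.py &
srun -n 3 --gres=gpu:1 ./sleepCUDA.py &
wait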