Thanks Paul!

On 26-01-2021 21:11, Paul Raines wrote:
> You should check your jobs that allocated GPUs and make sure
> CUDA_VISIBLE_DEVICES is being set in the environment.  This is a sign
> you GPU support is not really there but SLURM is just doing "generic"
> resource assignment.

Could you elaborate a bit on this remark? Are you saying that I need to check whether CUDA_VISIBLE_DEVICES is defined automatically by Slurm inside the batch job's environment, as described in https://slurm.schedmd.com/gres.html?
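
If that is what you mean, I could add a quick check along these lines to a test batch job (just a rough sketch in Python; it only prints what the job's environment contains):

    import os

    # Inside a Slurm batch job: with working GRES/GPU support I would
    # expect Slurm to export CUDA_VISIBLE_DEVICES automatically (and,
    # if I read the sbatch man page correctly, SLURM_JOB_GPUS as well).
    for var in ("CUDA_VISIBLE_DEVICES", "SLURM_JOB_GPUS"):
        print(var, "=", os.environ.get(var, "<not set>"))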

What do you mean by "your GPU support is not really there" and by Slurm doing "generic" resource assignment? I'm not quite following this.

With my Slurm 20.02.6 built without the NVIDIA libraries, Slurm nevertheless seems to schedule concurrent jobs onto different GPUs: the jobs' GRES=gpu entries point to distinct IDX values (GPU indexes), and nvidia-smi shows the individual processes running on distinct GPUs. All seems to be fine - or am I completely mistaken?
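
To cross-check that, I could also compare, inside each running job, what Slurm exported against the devices the job can actually see (again just a sketch, assuming nvidia-smi is on the PATH):

    import os
    import subprocess

    # What Slurm exported for this job's GPU allocation (if anything).
    print("CUDA_VISIBLE_DEVICES =",
          os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))

    # What the job can actually see on the node.  Without any device
    # binding this lists every GPU in the machine, not just the one
    # the job was allocated.
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    print(result.stdout)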

Thanks,
Ole

