I upgraded Slurm to 23.02.3 but I'm still running into the same problem. 
Unconfigured GPUs (those absent from gres.conf and slurm.conf) are still being 
made available to jobs so we end up with compute jobs being run on GPUs which 
should only be used for running the display.

Any ideas?

Thanks,
Steve
________________________________
From: Wilson, Steven M
Sent: Tuesday, June 27, 2023 9:50 AM
To: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: Unconfigured GPUs being allocated

Hi,

I manually configure the GPUs in our Slurm configuration (AutoDetect=off in 
gres.conf) and everything works fine when all the GPUs in a node are configured 
in gres.conf and available to Slurm.  But we have some nodes where a GPU is 
reserved for running the display and is specifically not configured in 
gres.conf.  In these cases, Slurm includes this unconfigured GPU and makes it 
available to Slurm jobs.  A simple Slurm job that executes "nvidia-smi -L" 
displays the unconfigured GPU along with as many configured GPUs as requested 
by the job.
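
A minimal sketch of such a test job, assuming it is submitted with sbatch and 
requests a single GPU via --gres=gpu:1.  The job name and the count_gpus helper 
are hypothetical names used only for illustration.  The script prints 
CUDA_VISIBLE_DEVICES next to the number of GPUs that "nvidia-smi -L" lists, so 
the two can be compared:

```shell
#!/bin/bash
#SBATCH --job-name=gpu-visibility-check   # hypothetical job name
#SBATCH --gres=gpu:1                      # request a single GPU

# Hypothetical helper: count the "GPU N: ..." lines that "nvidia-smi -L"
# prints, reading that output from stdin.
count_gpus() {
    grep -c '^ *GPU [0-9][0-9]*:' || true
}

# Only query the hardware when nvidia-smi is actually installed on the node.
if command -v nvidia-smi >/dev/null 2>&1; then
    visible=$(nvidia-smi -L | count_gpus)
    echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
    echo "GPUs listed by nvidia-smi -L: ${visible}"
fi
```

If the job requested one GPU but more than one is listed, the extra device is 
reachable from inside the job.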

For example, in a node configured with this line in slurm.conf:
    NodeName=oryx CoreSpecCount=2 CPUs=8 RealMemory=64000 Gres=gpu:RTX2080TI:1
and this line in gres.conf:
    Nodename=oryx Name=gpu Type=RTX2080TI File=/dev/nvidia1
I will get the following results from a job running "nvidia-smi -L" that 
requested a single GPU:
    GPU 0: NVIDIA GeForce GT 710 (UUID: GPU-21fe15f0-d8b9-b39e-8ada-8c1c8fba8a1e)
    GPU 1: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-0dc4da58-5026-6173-1156-c4559a268bf5)

But in another node that has all GPUs configured in Slurm like this in 
slurm.conf:
    NodeName=beluga CoreSpecCount=1 CPUs=16 RealMemory=128500 Gres=gpu:TITANX:2
and this line in gres.conf:
    Nodename=beluga Name=gpu Type=TITANX File=/dev/nvidia[0-1]
I get the expected results from the job running "nvidia-smi -L" that requested 
a single GPU:
    GPU 0: NVIDIA RTX A5500 (UUID: GPU-3754c069-799e-2027-9fbb-ff90e2e8e459)

I'm running Slurm 22.05.5.

Thanks in advance for any suggestions to help correct this problem!

Steve
