Hi,

I manually configure the GPUs in our Slurm configuration (AutoDetect=off in 
gres.conf) and everything works fine when all the GPUs in a node are configured 
in gres.conf and available to Slurm.  But we have some nodes where a GPU is 
reserved for running the display and is specifically not configured in 
gres.conf.  On these nodes, Slurm still exposes the unconfigured GPU to jobs.  
A simple Slurm job that runs "nvidia-smi -L" lists the unconfigured GPU in 
addition to however many configured GPUs the job requested.
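The test job is essentially the following batch script (a minimal sketch; the job name and output file are arbitrary, and any partition/account options are omitted):

```
#!/bin/bash
#SBATCH --job-name=gpu-check
#SBATCH --gres=gpu:1          # request a single GPU
#SBATCH --output=gpu-check.%j.out

# List the GPUs visible inside the job's allocation
nvidia-smi -L
```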

For example, in a node configured with this line in slurm.conf:
    NodeName=oryx CoreSpecCount=2 CPUs=8 RealMemory=64000 Gres=gpu:RTX2080TI:1
and this line in gres.conf:
    Nodename=oryx Name=gpu Type=RTX2080TI File=/dev/nvidia1
I will get the following results from a job running "nvidia-smi -L" that 
requested a single GPU:
    GPU 0: NVIDIA GeForce GT 710 (UUID: GPU-21fe15f0-d8b9-b39e-8ada-8c1c8fba8a1e)
    GPU 1: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-0dc4da58-5026-6173-1156-c4559a268bf5)

But on another node, where all of the GPUs are configured in Slurm like this in 
slurm.conf:
    NodeName=beluga CoreSpecCount=1 CPUs=16 RealMemory=128500 Gres=gpu:TITANX:2
and this line in gres.conf:
    Nodename=beluga Name=gpu Type=TITANX File=/dev/nvidia[0-1]
I get the expected results from the job running "nvidia-smi -L" that requested 
a single GPU:
    GPU 0: NVIDIA RTX A5500 (UUID: GPU-3754c069-799e-2027-9fbb-ff90e2e8e459)

I'm running Slurm 22.05.5.
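In case it's relevant: my understanding is that Slurm hides unallocated devices from jobs via cgroup device constraints, so a job's view of /dev/nvidia* would depend on something like the following in cgroup.conf (a sketch of what I believe is needed, not a confirmed fix; I may well have this misconfigured):

```
# cgroup.conf (sketch) -- requires TaskPlugin=task/cgroup in slurm.conf
ConstrainDevices=yes   # restrict jobs to only the devices they were allocated
```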

Thanks in advance for any suggestions to help correct this problem!

Steve
