I haven't seen anything that allows disabling a defined Gres device. It 
does seem to work if I also define the GPUs that I don't want to use and then 
submit jobs specifically to the other GPUs with --gres, e.g., 
"--gres=gpu:rtx_2080_ti:1". I suppose if I set the GPU Type to "COMPUTE" for 
the GPUs I want to use for computing and "UNUSED" for those that I don't, this 
scheme might work (e.g., --gres=gpu:COMPUTE:3). But then every job submission 
would be required to set this option, so it's not a very workable solution.
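For reference, the Type-based scheme described above might look something like this (a sketch only; COMPUTE and UNUSED are arbitrary type labels, and the device paths are assumed):

in slurm.conf:
    NodeName=oryx CoreSpecCount=2 CPUs=8 RealMemory=64000 Gres=gpu:COMPUTE:1,gpu:UNUSED:1
and in gres.conf:
    Nodename=oryx Name=gpu Type=COMPUTE File=/dev/nvidia1
    Nodename=oryx Name=gpu Type=UNUSED File=/dev/nvidia0

with jobs then submitted using something like --gres=gpu:COMPUTE:1.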

Thanks!
Steve
________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Feng 
Zhang <prod.f...@gmail.com>
Sent: Friday, July 14, 2023 3:09 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Unconfigured GPUs being allocated

Very interesting issue.

I am guessing there might be a workaround: since oryx has 2 GPUs, you
could define both of them but disable the GT 710. Does Slurm support
this?
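Just to sketch the define-both half of that idea (whether the GT 710 can then be disabled is the open question; the GT 710's device path here is an assumption):

in gres.conf:
    Nodename=oryx Name=gpu Type=RTX2080TI File=/dev/nvidia1
    Nodename=oryx Name=gpu Type=GT710 File=/dev/nvidia0

with slurm.conf then carrying Gres=gpu:RTX2080TI:1,gpu:GT710:1, so that jobs requesting the RTX2080TI type would never be allocated the GT 710.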

Best,

Feng


On Tue, Jun 27, 2023 at 9:54 AM Wilson, Steven M <ste...@purdue.edu> wrote:
>
> Hi,
>
> I manually configure the GPUs in our Slurm configuration (AutoDetect=off in 
> gres.conf) and everything works fine when all the GPUs in a node are 
> configured in gres.conf and available to Slurm.  But we have some nodes where 
> a GPU is reserved for running the display and is specifically not configured 
> in gres.conf.  In these cases, Slurm includes this unconfigured GPU and makes 
> it available to Slurm jobs.  Using a simple Slurm job that executes 
> "nvidia-smi -L", it will display the unconfigured GPU along with as many 
> configured GPUs as requested by the job.
>
> For example, in a node configured with this line in slurm.conf:
>     NodeName=oryx CoreSpecCount=2 CPUs=8 RealMemory=64000 Gres=gpu:RTX2080TI:1
> and this line in gres.conf:
>     Nodename=oryx Name=gpu Type=RTX2080TI File=/dev/nvidia1
> I will get the following results from a job running "nvidia-smi -L" that 
> requested a single GPU:
>     GPU 0: NVIDIA GeForce GT 710 (UUID: 
> GPU-21fe15f0-d8b9-b39e-8ada-8c1c8fba8a1e)
>     GPU 1: NVIDIA GeForce RTX 2080 Ti (UUID: 
> GPU-0dc4da58-5026-6173-1156-c4559a268bf5)
>
> But in another node that has all GPUs configured in Slurm like this in 
> slurm.conf:
>     NodeName=beluga CoreSpecCount=1 CPUs=16 RealMemory=128500 
> Gres=gpu:TITANX:2
> and this line in gres.conf:
>     Nodename=beluga Name=gpu Type=TITANX File=/dev/nvidia[0-1]
> I get the expected results from the job running "nvidia-smi -L" that 
> requested a single GPU:
>     GPU 0: NVIDIA RTX A5500 (UUID: GPU-3754c069-799e-2027-9fbb-ff90e2e8e459)
>
> I'm running Slurm 22.05.5.
>
> Thanks in advance for any suggestions to help correct this problem!
>
> Steve
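One thing that may be worth checking (an assumption on my part, not something confirmed in this thread): if cgroup device constraints are not enabled, jobs can typically see every /dev/nvidia* device on the node regardless of what is configured in gres.conf. Enabling them in cgroup.conf, i.e.:

    ConstrainDevices=yes

may restrict each job to the GPU devices Slurm actually allocated, which could hide the unconfigured display GPU from jobs.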
