I assume you mean the sentence about dynamic MIG at
https://slurm.schedmd.com/gres.html#MIG_Management
Could it be supported? I think so, but probably only if one of SchedMD's
paying customers (that could be you) asks for it.

On Wed, Nov 22, 2023 at 11:24 AM Aaron Kollmann <
aaron.kollm...@student.hpi.de> wrote:

> Hello All,
>
> I am currently working on a research project, and we are trying to find out
> whether we can use NVIDIA's Multi-Instance GPU (MIG) dynamically in SLURM.
>
> For instance:
>
> - a user submits a job that requests a GPU, but none is available
>
> - SLURM then reconfigures a MIG GPU to create a partition (e.g. 1g.5gb),
> which becomes available and is allocated immediately
>
> I can already reconfigure MIG + SLURM within a few seconds to start jobs
> on newly partitioned resources, but jobs get killed when I restart slurmd
> on nodes with a changed MIG config (see the example script below).
>
> *Do you think it is possible to develop a plugin or change SLURM to the
> extent that dynamic MIG will be supported one day?*
>
> (The website says it is not supported)
>
>
>
> Best
>
> - Aaron
>
>
>
>
> #!/usr/bin/bash
>
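> # Generate Start Config
> # Stop the daemons, then destroy all compute (-dci) and GPU (-dgi) instances
> killall slurmd
> killall slurmctld
> nvidia-smi mig -dci
> nvidia-smi mig -dgi
> # GPU 0: GPU instances from profile IDs 19,14,5; GPU 1: one of profile ID 0.
> # -C also creates the default compute instance for each GPU instance.
> nvidia-smi mig -cgi 19,14,5 -i 0 -C
> nvidia-smi mig -cgi 0 -i 1 -C
> cp -f ./slurm-19145-0.conf /etc/slurm/slurm.conf
> # -c: slurmd clears system locks; slurmctld discards its saved state
> slurmd -c
> slurmctld -c
> sleep 5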
>
> # Start a running and a pending job (the first job gets killed by slurm)
> srun -w gx06 -c 2 --mem 1G --gres=gpu:a100_1g.5gb:1 sleep 300 &
> srun -w gx06 -c 2 --mem 1G --gres=gpu:a100_1g.5gb:1 sleep 300 &
> sleep 5
>
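> # Simulate MIG Config Change
> # Repartition GPU 1 only (-i 1) to the same 19,14,5 layout as GPU 0
> nvidia-smi mig -i 1 -dci
> nvidia-smi mig -i 1 -dgi
> nvidia-smi mig -cgi 19,14,5 -i 1 -C
> cp -f ./slurm-2x19145.conf /etc/slurm/slurm.conf
> # Restart both daemons, this time without -c
> killall slurmd
> killall slurmd
> killall slurmctld
> slurmd
> slurmctld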
>
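
For readers following the quoted script: the numbers passed to
"nvidia-smi mig -cgi" are MIG profile IDs, while the srun lines request
GRES types by profile name (a100_1g.5gb). Below is a minimal sketch of
a helper that translates one into the other. The function names are
hypothetical, and the ID-to-name table is an assumption that holds for
an A100 40GB; the IDs on a given node can be checked with
"nvidia-smi mig -lgip". The "a100_" prefix is taken from the srun lines
in the script.

```shell
#!/usr/bin/bash
# Hypothetical helpers (not part of Slurm or nvidia-smi).
# ASSUMPTION: profile IDs as reported for an A100 40GB; verify on the
# actual node with `nvidia-smi mig -lgip` before relying on this table.

# Map a single MIG GPU instance profile ID to its profile name.
mig_profile_name() {
    case "$1" in
        19) echo "1g.5gb" ;;
        14) echo "2g.10gb" ;;
        9)  echo "3g.20gb" ;;
        5)  echo "4g.20gb" ;;
        0)  echo "7g.40gb" ;;
        *)  echo "unknown MIG profile ID: $1" >&2; return 1 ;;
    esac
}

# Print one GRES type per line for a comma-separated ID list, using the
# "gpu:a100_<profile>" naming seen in the srun lines of the script.
mig_gres_types() {
    local ids id name
    IFS=',' read -r -a ids <<< "$1"
    for id in "${ids[@]}"; do
        name=$(mig_profile_name "$id") || return 1
        echo "gpu:a100_${name}"
    done
}
```

For example, "mig_gres_types 19,14,5" prints gpu:a100_1g.5gb,
gpu:a100_2g.10gb and gpu:a100_4g.20gb, one per line, which is the layout
the script creates on GPU 0.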
