Re: [slurm-users] Keep CPU Jobs Off GPU Nodes

2023-03-29 Thread Ward Poelmans

Hi,

We have dedicated partitions for GPUs (their names end with _gpu) and simply
forbid jobs that do not request GPU resources from using these partitions:

local function job_total_gpus(job_desc)
    -- return total number of GPUs allocated to the job
    -- there are many ways to request a GPU. This comes from the job_submit example in the slurm source
    -- a GPU resource is either nil or "gres:gpu:N", with N the number of GPUs requested

    -- pick relevant job resources for GPU spec (undefined resources can show limit values)
    local gpu_specs = {
        ['tres_per_node'] = 1,
        ['tres_per_task'] = 1,
        ['tres_per_socket'] = 1,
        ['tres_per_job'] = 1,
    }

    -- number of nodes
    if job_desc['min_nodes'] < 0xFFFE then gpu_specs['tres_per_node'] = job_desc['min_nodes'] end
    -- number of tasks
    if job_desc['num_tasks'] < 0xFFFE then gpu_specs['tres_per_task'] = job_desc['num_tasks'] end
    -- number of sockets
    if job_desc['sockets_per_node'] < 0xFFFE then gpu_specs['tres_per_socket'] = job_desc['sockets_per_node'] end
    gpu_specs['tres_per_socket'] = gpu_specs['tres_per_socket'] * gpu_specs['tres_per_node']

    -- extract the GPU count from each tres_per_* field
    local gpu_options = {}
    for tres_name, _ in pairs(gpu_specs) do
        local num_gpus = string.match(tostring(job_desc[tres_name]), "^gres:gpu:([0-9]+)") or 0
        gpu_options[tres_name] = tonumber(num_gpus)
    end
    -- calculate total GPUs
    for tres_name, job_res in pairs(gpu_specs) do
        local num_gpus = gpu_options[tres_name]
        if num_gpus > 0 then
            return num_gpus * tonumber(job_res)
        end
    end
    return 0
end



function slurm_job_submit(job_desc, part_list, submit_uid)
    local total_gpus = job_total_gpus(job_desc)
    slurm.log_debug("Job total number of GPUs: %s", tostring(total_gpus))

    if total_gpus == 0 then
        for partition in string.gmatch(tostring(job_desc.partition), '([^,]+)') do
            if string.match(partition, '_gpu$') then
                slurm.log_user(string.format('ERROR: GPU partition %s is not allowed for non-GPU jobs.', partition))
                return ESLURM_INVALID_GRES
            end
        end
    end

    return slurm.SUCCESS
end



Ward

On 29/03/2023 01:24, Frank Pari wrote:

Well, I wanted to avoid using Lua.  But it looks like that's going to be the
easiest way to do this without having to create a separate partition for the
GPUs.  Basically, check for at least one GPU in the job submission and, if
none, exclude all GPU nodes for the job.

image.png

Now I'm wondering how to auto-gen the list of nodes with GPUs, so I don't have 
to remember to update job_submit.lua every time we get new GPU nodes.

-F

On Tue, Mar 28, 2023 at 4:06 PM Frank Pari <pa...@bc.edu> wrote:

Hi all,

First, thank you all for participating in this list.  I've learned so much 
by just following others' threads.  =)

I'm looking at creating a scavenger partition with idle resources from CPU 
and GPU nodes and I'd like to keep this to one partition.  But, I don't want 
CPU only jobs using up resources on the GPU nodes.

I've seen suggestions for job/lua scripts.  But, I'm wondering if there's 
any other way to ensure a job has requested at least 1 gpu for the scheduler to 
assign that job to a GPU node.

Thanks in advance!

-Frank







Re: [slurm-users] Keep CPU Jobs Off GPU Nodes

2023-03-29 Thread Wagner, Marcus

Hi Frank,

use Features on the nodes: every CPU node gets e.g. "cpu", every GPU
node e.g. "gpu".


If a job asks for no GPUs, set an additional constraint "cpu" for the job,
e.g. as sketched below.
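
A minimal sketch of that idea, assuming the job_total_gpus() helper from
Ward's mail above and that all CPU-only nodes carry the node feature "cpu"
in slurm.conf:

function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_total_gpus(job_desc) == 0 then
        -- restrict non-GPU jobs to nodes that advertise the "cpu" feature
        if job_desc.features == nil or job_desc.features == '' then
            job_desc.features = 'cpu'
        else
            job_desc.features = job_desc.features .. '&cpu'
        end
    end
    return slurm.SUCCESS
end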


Best
Marcus

On 29.03.2023 at 01:24, Frank Pari wrote:
Well, I wanted to avoid using Lua.  But it looks like that's going to
be the easiest way to do this without having to create a separate
partition for the GPUs. [...]




Re: [slurm-users] Keep CPU Jobs Off GPU Nodes

2023-03-29 Thread René Sitt

Hello,

maybe some additional notes:

While the cited procedure works great in general, it gets more
complicated for heterogeneous setups, i.e. if you have several GPU types
defined in gres.conf, since the 'tres_per_*' fields can then take the
form of either 'gres:gpu:N' or 'gres:gpu:<type>:N' - depending on
whether the job script specifies a GPU type or not.
Of course, you could omit the GPU type definition in gres.conf and
define the type as a node feature instead, as long as no nodes contain
multiple different GPU types.
Since the latter is the case in our cluster, I instead opted to check
only for the existence of 'gpu' in the 'tres_per_*' fields and not to
bother with parsing the actual number of GPUs. However, there is an
interesting edge case here, as users are free to set --gpus=0 - either
one has to filter for that specifically, or instruct one's users not to
do that. A sketch of such a check follows below.
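
A minimal sketch of that type-agnostic check, using the same tres_per_*
fields as in Ward's example above; the 0-count filter handles the --gpus=0
edge case:

local function job_requests_gpu(job_desc)
    for _, f in ipairs({'tres_per_node', 'tres_per_task',
                        'tres_per_socket', 'tres_per_job'}) do
        local spec = tostring(job_desc[f])
        if string.match(spec, "gres:gpu") then
            -- an explicit trailing count of 0 (e.g. --gpus=0) counts as "no GPU"
            local count = string.match(spec, ":([0-9]+)$")
            if count == nil or tonumber(count) > 0 then
                return true
            end
        end
    end
    return false
end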


Kind Regards,
René Sitt

On 29.03.23 at 08:57, Ward Poelmans wrote:


Hi,

We have dedicated partitions for GPUs (their names end with _gpu) and
simply forbid jobs that do not request GPU resources from using these
partitions:

[...]




--
Dipl.-Chem. René Sitt
Hessisches Kompetenzzentrum für Hochleistungsrechnen
Philipps-Universität Marburg
Hans-Meerwein-Straße
35032 Marburg

Tel. +49 6421 28 23523
si...@hrz.uni-marburg.de
www.hkhlr.de





Re: [slurm-users] Keep CPU Jobs Off GPU Nodes

2023-03-29 Thread Markus Kötter

Hello,

On 29.03.23 10:08, René Sitt wrote:
While the cited procedure works great in general, it gets more
complicated for heterogeneous setups, i.e. if you have several GPU types
defined in gres.conf, since the 'tres_per_*' fields can then take the
form of either 'gres:gpu:N' or 'gres:gpu:<type>:N' - depending on
whether the job script specifies a GPU type or not.


Using Lua match:

for g in job_desc.gres:gmatch("[^,]*") do
  local count = g:match("gres:gpu:%w+:(%d+)$") or g:match("gres:gpu:(%d+)$")
  if count then
    -- the job requests at least one GPU (typed or untyped)
  end
end

Best regards
--
Markus Kötter, +49 681 870832434
30159 Hannover, Lange Laube 6
Helmholtz Center for Information Security




Re: [slurm-users] error: power_save module disabled, NULL SuspendProgram

2023-03-29 Thread Dr. Thomas Orgis

On Mon, 27 Mar 2023 13:17:01 +0200, Ole Holm Nielsen wrote:

> FYI: Slurm power_save works very well for us without the issues that you 
> describe below.  We run Slurm 22.05.8, what's your version?

I'm sure that there are setups where it works nicely;-) For us, it
didn't, and I was faced with hunting the bug in slurm or working around
it with more control, fixing the underlying issue of the node resume
script being called _after_ the job has been allocated to the node.
That is too late in case of node bootup failure and causes annoying
delays for users only to see jobs fail.

We do run 21.08.8-2, which means any debugging of this on the slurm
side would mean upgrading first (we don't upgrade just for upgrade's
sake). And, as I said: The issue of the wrong timing remains, unless I
try deeper changes in slurm's logic. The other issue is that we had a
kludge in place, anyway, to enable slurmctld to power on nodes via
IPMI. The machine slurmctld runs on has no access to the IPMI network
itself, so we had to build a polling communication channel to the node
which has this access (and which is on another security layer, hence no
ssh into it). For all I know, this communication kludge is not to
blame, as, in the spurious failures, the nodes did boot up just fine
and were ready. Only slurmctld decided to let the timeout pass first,
then recognize that the slurmd on the node is there, right that instant.

Did your power up/down script workflow work with earlier slurm
versions, too? Did you use it on bare metal servers or mostly on cloud
instances?

Do you see a chance for

a) fixing up the internal powersaving logic to properly allocate
   nodes to a job only when these nodes are actually present (ideally,
   with a health check passing) or
b) designing an interface between slurm as manager of available
   resources and another site-specific service responsible for off-/onlining
   resources that are known to slurm, but down/drained?

My view is that Slurm's task is to distribute resources among users.
The cluster manager (person or $MIGHTY_SITE_SPECIFIC_SOFTWARE) decides
if a node is currently available to Slurm or down for maintenance, for
example. Power saving would be another reason for a node being taken
out of service.

Maybe I got an old-fashioned minority view …


Alrighty then,

Thomas

PS: I guess solution a) above goes against Slurm's focus on throughput
and avoiding delays caused by synchronization points, while our idea here
is that batch jobs where that matters should be written differently,
packing more than a few seconds worth of work into each step.

-- 
Dr. Thomas Orgis
HPC @ Universität Hamburg



Re: [slurm-users] error: power_save module disabled, NULL SuspendProgram

2023-03-29 Thread Ben Polman


I'd be interested in your kludge; we face a similar situation where the
slurmctld node does not have access to the IPMI network and cannot ssh to
machines that have access.
We are thinking of creating a REST interface to a control server which
would run the IPMI commands.

Ben


On 29-03-2023 14:16, Dr. Thomas Orgis wrote:

I'm sure that there are setups where it works nicely;-) For us, it
didn't, and I was faced with hunting the bug in slurm or working around
it with more control, fixing the underlying issue of the node resume
script being called _after_ the job has been allocated to the node.

[...]



--
-
Dr. B.J.W. Polman, C&CZ, Radboud University Nijmegen.
Heyendaalseweg 135, 6525 AJ Nijmegen, The Netherlands, Phone: +31-24-3653360
e-mail: ben.pol...@science.ru.nl





Re: [slurm-users] error: power_save module disabled, NULL SuspendProgram

2023-03-29 Thread Dr. Thomas Orgis

On Wed, 29 Mar 2023 14:42:33 +0200, Ben Polman wrote:

> I'd be interested in your kludge; we face a similar situation where the
> slurmctld node does not have access to the IPMI network and cannot ssh to
> machines that have access.
> We are thinking of creating a REST interface to a control server which
> would run the IPMI commands.

We settled on transient files in /dev/shm on the slurmctld side as
"API". You could call it in-memory transactional database;-)

#!/bin/sh
# node-suspend and node-resume (symlinked) script

powerdir=/dev/shm/powersave
scontrol=$(cd "$(dirname "$0")" && pwd)/scontrol
hostlist=$1

case $0 in
*-suspend)
  subdir=suspend
;;
*-resume)
  subdir=resume
;;
esac

mkdir -p "$powerdir/$subdir" &&
cd "$powerdir/$subdir" &&
tmp=$(mktemp XXX.tmp) &&
$scontrol show hostnames "$hostlist" > "$tmp" &&
echo "$(date +%Y%m%d-%H%M%S) $(basename $0) $(cat "$tmp"|tr '\n' ' ')" >> "$powerdir/log"
mv "$tmp" "${tmp%.tmp}.list"
# end

This atomically creates powersave/suspend/*.list and
powersave/resume/*.list files with node names in them.
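
For context, a hypothetical sketch of how such a symlinked script pair could
be wired up; the path /usr/local/sbin/node-power is made up for illustration,
only the SuspendProgram/ResumeProgram option names are real slurm.conf keys:

# one script installed under two names, referenced from slurm.conf
ln -s /usr/local/sbin/node-power /usr/local/sbin/node-power-suspend
ln -s /usr/local/sbin/node-power /usr/local/sbin/node-power-resume

# slurm.conf:
#   SuspendProgram=/usr/local/sbin/node-power-suspend
#   ResumeProgram=/usr/local/sbin/node-power-resume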

On the privileged server, a script periodically looked at the directories
(via ssh) and triggered the appropriate actions, including some
heuristics about unclean shutdowns or spontaneous re-availability (with
a thousand runs, there's a good chance for something getting stuck, in
some driver code, even).

#!/bin/sh

powerdir=/dev/shm/powersave

batch()
{
  ssh-wrapper-that-correctly-quotes-argument-list --host=batchhost "$@"
}

while sleep 5
do
  suspendlists=$(batch ls "$powerdir/suspend/" 2>/dev/null | grep '.list$')
  for f in $suspendlists
  do
    hosts=$(batch cat "$powerdir/suspend/$f" 2>/dev/null)
    for h in $hosts
    do
      case "$h" in
      node*|data*)
        echo "suspending $h"
        node-shutdown-wrapper "$h"
      ;;
      *)
        echo "malformed node name"
      ;;
      esac
    done
    batch rm -f "$powerdir/suspend/$f"
  done
  resumelists=$(batch ls "$powerdir/resume/" 2>/dev/null | grep '.list$')
  for f in $resumelists
  do
    hosts=$(batch cat "$powerdir/resume/$f" 2>/dev/null)
    for h in $hosts
    do
      case "$h" in
      node*)
        echo "resuming $h"
        # Assume the node _should_ be switched off. Ensure that now (in
        # case it hung during shutdown).
        if ipmi-wrapper "$h" chassis power status | grep -q 'on$'; then
          if ssh -o ConnectTimeout=2 "$h" pgrep slurmd >/dev/null 2>&1


Re: [slurm-users] error: power_save module disabled, NULL SuspendProgram

2023-03-29 Thread Ole Holm Nielsen

Hi Thomas,

I think the Slurm power_save is not problematic for us with bare-metal 
on-premise nodes, in contrast to the problems you're having.


We use power_save with on-premise nodes where we control the power down/up
by means of IPMI commands, as provided in the scripts you will find at
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save
There's no hocus-pocus once the IPMI commands are working correctly with
your nodes.
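
For illustration, the kind of IPMI commands involved look roughly like this
(the BMC hostname and credentials are placeholders):

# power a node on / off out-of-band via its BMC
ipmitool -I lanplus -H node001-bmc -U admin -P secret chassis power on
ipmitool -I lanplus -H node001-bmc -U admin -P secret chassis power soft
# query the current power state
ipmitool -I lanplus -H node001-bmc -U admin -P secret chassis power status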


Of course, our slurmctld server can communicate with our IPMI management 
network to perform power management.  I don't see this network access as a 
security problem.


I think we had power_save with IPMI working also in Slurm 21.08 before we 
upgraded to 22.05.


As for job scheduling, slurmctld may allocate a job to some powered-off
nodes and then call the ResumeProgram defined in slurm.conf.  From this
point it may indeed take 2-3 minutes before a node is up and running
slurmd, during which time it will have a state of POWERING_UP (see "man
sinfo").  If this doesn't complete within ResumeTimeout seconds, the node
will go into a failed state.  All this logic seems to be working well.
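
For reference, a minimal sketch of the slurm.conf knobs involved (the paths
and values below are placeholders, not recommendations):

# power_save configuration in slurm.conf
SuspendProgram=/usr/local/bin/node-suspend
ResumeProgram=/usr/local/bin/node-resume
# idle time before power-down, and how long the power transitions may take
SuspendTime=1800
SuspendTimeout=120
ResumeTimeout=600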


If you would like to try out the above mentioned IPMI scripts, you could 
test them on a node on your IPMI network to see if you can reliably power 
some nodes up and down.  If this works, hopefully you could configure 
slurmctld so that it executes the scripts (note: it will be run by the 
"slurm" user).


Best regards,
Ole


On 3/29/23 14:16, Dr. Thomas Orgis wrote:

I'm sure that there are setups where it works nicely;-) For us, it
didn't, and I was faced with hunting the bug in slurm or working around
it with more control, fixing the underlying issue of the node resume
script being called _after_ the job has been allocated to the node.

[...]






[slurm-users] Using JSON/YAML to describe jobs for submission to SLURM

2023-03-29 Thread Nicholas Yue
Hi,

  I am looking at parsing some data and submitting lots of jobs to SLURM
and was wondering if there is a way to describe all the jobs and their
dependencies in some JSON file and submit that JSON file instead of making
individual calls to SLURM?

Cheers
-- 
Nicholas Yue
https://www.linkedin.com/in/nicholasyue/


[slurm-users] LSF Wrappers for Slurm

2023-03-29 Thread Amir Ben Avi
Hi,

Does anyone know if there are any LSF wrappers (bsub, bjobs, bkill, etc.)
that can work with Slurm?
What I found so far is a table that converts LSF commands to Slurm commands.
Any info will be appreciated.


Thanks,
Amir


[slurm-users] Mixing GPU Types on Same Node

2023-03-29 Thread collin.m.mccarthy
Hello,
 
Apologies if this is in the docs but I couldn't find it anywhere. 
 
I've been using Slurm to run a small 7-node cluster in a research lab for a
couple of years now (I'm a PhD student). A couple of our nodes have
heterogenous GPU models. One in particular has quite a few: 2x NVIDIA A100s,
1x NVIDIA 3090, 2x NVIDIA GV100 w/ NVLink, 1x AMD MI100, 2x AMD MI200. This
makes things a bit challenging but I need to work with what I have. 
 
1.  I've only been able to set this up previously on Slurm 20.02 by
"ignoring" the AMDs and just specifying the NVIDIA GPUs. That worked when we
had one or two people using the AMD GPUs and they could coordinate between
themselves. But now, we have more people interested. I'm upgrading Slurm to
23.02 in hopes that might fix some of the challenges, but should this be
possible? Ideally I would like to have AutoDetect=nvml and AutoDetect=rsmi
both on. If it's not I'll shuffle GPUs around to make this node NVIDIA-only.
2.  I want everyone to allocate GPUs with --gpus=<type>:<count> instead of
--gpus=<count>, so they don't "block" a nice GPU like an A100 when they really
wanted any-old GPU on the machine like a GV100 or 3090. Can I force people
to specify a GPU type and not just a count? This is especially important if
I'm mixing AMDs and NVIDIAs on the same node. If not, can I specify the
"order" in which I want GPUs to be scheduled if they don't specify a type
(so they get handed out from least-powerful to most-powerful if people don't
care)? 
 
Any help and/or advice here is much appreciated. Slurm has been amazing for
our lab (albeit challenging to set up at first) and I want to get everything
dialed before I graduate :D . 
 
Thanks,
-Collin


Re: [slurm-users] Mixing GPU Types on Same Node

2023-03-29 Thread Thomas M. Payerle
You can probably have a job submit lua script that looks at the --gpus flag
(and maybe the --gres=gpu:* flag as well) and forces a GPU type.  A bit
complicated, and not sure if it will catch srun submissions.  I don't think
this is flexible enough to ensure they get the least powerful GPU among all
idle GPUs, but you can force it to default to the lowest GPU on the cluster
--- if nothing else this will force users who want more powerful GPUs to
explicitly give a GPU type.
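
A rough, hypothetical sketch of that idea in job_submit.lua; it assumes the
tres_per_* fields are writable in your Slurm version and that "gv100" is the
least powerful type on the node (both are assumptions, not verified):

function slurm_job_submit(job_desc, part_list, submit_uid)
    -- "--gpus=N" with no type arrives as "gres:gpu:N"; rewrite it to a typed request
    local spec = job_desc.tres_per_job
    if spec ~= nil then
        local count = string.match(tostring(spec), "^gres:gpu:([0-9]+)$")
        if count ~= nil then
            job_desc.tres_per_job = "gres:gpu:gv100:" .. count
            slurm.log_user("No GPU type requested; defaulting to gv100")
        end
    end
    return slurm.SUCCESS
end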

On Wed, Mar 29, 2023 at 2:31 PM  wrote:

> Hello,
>
> Apologies if this is in the docs but I couldn't find it anywhere.
> [...]

-- 
Tom Payerle
DIT-ACIGS/Mid-Atlantic Crossroads       paye...@umd.edu
5825 University Research Park   (301) 405-6135
University of Maryland
College Park, MD 20740-3831