Chris
Upon further testing this morning I see the job is assigned two different
jobids, something I wasn't expecting. This led me down the road of thinking
the output was incorrect.
Scontrol on a hetero job will show multiple jobids for the job. So, the output
just wasn't what I was expecting.
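For reference, here is roughly what that looks like (a sketch; the job IDs,
script name, and resource values are made up). A batch script with an
'#SBATCH hetjob' separator submits one component per section, and scontrol
reports each component under its own jobid:

#!/bin/bash
#SBATCH --ntasks=1 --mem=4G
#SBATCH hetjob
#SBATCH --ntasks=2 --mem=8G
srun --het-group=0,1 hostname

$ sbatch het.sh
Submitted batch job 1234
$ scontrol show job 1234 | grep -E 'JobId=|HetJob'
JobId=1234 JobName=het.sh
   HetJobId=1234 HetJobIdSet=1234-1235 HetJobOffset=0
JobId=1235 JobName=het.sh
   HetJobId=1234 HetJobIdSet=1234-1235 HetJobOffset=1

squeue shows the same pair as 1234+0 and 1234+1, which is where the two
different jobids come from.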
We put a ‘gpu’ QOS on all our GPU partitions, and limit jobs per user to 8 (our
GPU capacity) via MaxJobsPerUser. Extra jobs get blocked, allowing other users
to queue jobs ahead of the extras.
# sacctmgr show qos gpu format=name,maxjobspu
      Name MaxJobsPU
---------- ---------
       gpu         8
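In case it helps, the setup is roughly this (a sketch; the partition line is
illustrative and the node list is made up). Create the QOS with its per-user
job limit in sacctmgr, then point each GPU partition at it in slurm.conf:

sacctmgr add qos gpu
sacctmgr modify qos gpu set MaxJobsPerUser=8

PartitionName=gpu Nodes=gpunode[01-08] QOS=gpu

With the QOS attached to the partition, the limit applies to every job
submitted there, regardless of the user's association limits.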
Hi everyone,
We have a single node with 8 GPUs. Users often pile up lots of pending jobs
and use all 8 at the same time, so a user who just wants to run a short debug
job and needs one of the GPUs has to wait too long for a GPU to free up. Is
there a way with
Hello Michael,
Thank you for your email and apologies for my tardy response. I'm still sorting
out my mailbox after an Easter break. I've taken your comments on board and
I'll see how I go with your suggestions.
Best regards,
David
Thanks for the info.
The thing is that I don't want to mark the node as completely unhealthy.
Assume the following scenario:
compute-0-0 is running Slurm jobs and the system load is 15 (32 cores)
compute-0-1 is running non-Slurm jobs and the system load is 25 (32 cores)
Then a new slurm job should be dispatched to
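One way to lean that direction without draining anything (a sketch, untested;
the load threshold and weight values are assumptions) is a periodic script
that raises the scheduling Weight of a heavily loaded node, since Slurm
allocates lower-weight nodes first:

#!/bin/bash
# Hypothetical helper, run regularly on each node (cron or health check):
# deprioritize the node when its load is high instead of draining it.
NODE=$(hostname -s)
LOAD=$(awk '{ print int($1) }' /proc/loadavg)   # 1-minute load average
if [ "$LOAD" -gt 20 ]; then
    scontrol update NodeName=$NODE Weight=100   # picked last for new jobs
else
    scontrol update NodeName=$NODE Weight=1     # normal priority
fi

In the scenario above, compute-0-1 (load 25) would end up with the higher
weight, so a new job lands on compute-0-0, while compute-0-1 remains
schedulable if nothing else is free. Note that weights set with scontrol may
be reset when slurmctld restarts or reconfigures, so the script needs to keep
running.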