Hello Matthew,

You may be aware of this already, but most sites make these kinds of
checks/validations in job_submit.lua. I'm not an expert in that, though
plenty of others on this list are, but I'm confident this type of
validation logic can be implemented there. I'd like to point you to a good
tutorial for job_submit.lua, but I haven't found one. This is a reasonable
introduction:

https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#job-submit-plugins

You can also find some sample scripts, such as:

https://github.com/SchedMD/slurm/blob/master/contribs/lua/job_submit.lua
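
To make the per-GPU ratio check from the question concrete, here is a
minimal sketch in plain Lua. It is not taken from either link above. The
function names gpu_count and check_ratios are hypothetical, the limits (4
CPUs and 48 GB per GPU) come from the question, and the TRES string formats
it accepts ("gpu:1", "gres/gpu:2", "gpu:a100:4") are assumptions that may
differ between Slurm versions. Working out the CPU and memory totals from
the various job descriptor fields is deliberately left out.

```lua
-- Limits from the question: at most 4 CPUs and 48 GB of memory per GPU.
local MAX_CPUS_PER_GPU = 4
local MAX_MEM_MB_PER_GPU = 48 * 1024

-- Count GPUs in a comma-separated TRES/GRES string.
-- Assumed formats: "gpu", "gpu:N", "gpu:TYPE", "gpu:TYPE:N", each with an
-- optional "gres/" or "gres:" prefix. Returns 0 if no GPU entry is found.
local function gpu_count(tres)
  if tres == nil then return 0 end
  local total = 0
  for entry in string.gmatch(tres, "[^,]+") do
    local e = entry:gsub("^gres[:/]", "")
    local n = e:match("^gpu[:=](%d+)$") or e:match("^gpu:[^:]+[:=](%d+)$")
    if n then
      total = total + tonumber(n)
    elseif e == "gpu" or e:match("^gpu:[^:]+$") then
      total = total + 1 -- no explicit count given; assume one GPU
    end
  end
  return total
end

-- Return true if the request is within the per-GPU limits,
-- otherwise false plus a short reason.
local function check_ratios(gpus, cpus, mem_mb)
  if gpus == 0 then return true end -- no GPUs requested; nothing to enforce
  if cpus / gpus > MAX_CPUS_PER_GPU then
    return false, "too many CPUs per GPU"
  end
  if mem_mb / gpus > MAX_MEM_MB_PER_GPU then
    return false, "too much memory per GPU"
  end
  return true
end

-- Example: the "stranding" job from the question (32 CPUs, 384 GB, 1 GPU).
print(check_ratios(gpu_count("gres/gpu:1"), 32, 384 * 1024))
```

In a real job_submit.lua, the inputs would have to be derived from whichever
descriptor fields the user set (tres_per_node, tres_per_job, mem_per_tres,
and so on), and unset fields would need to be detected before doing any
arithmetic on them.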

Warmest regards,
Jason

On Tue, Feb 27, 2024 at 5:02 PM Matthew R. Baney via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> Hello Slurm users,
>
> I'm trying to write a check in our job_submit.lua script that enforces
> relative resource requirements such as disallowing more than 4 CPUs or 48GB
> of memory per GPU. The QOS itself has a MaxTRESPerJob of
> cpu=32,gres/gpu=8,mem=384G (roughly one full node), but we're looking to
> prevent jobs from "stranding" GPUs, e.g., a 32 CPU/384GB memory job with
> only 1 GPU.
>
> I might be missing something obvious, but the rabbit hole I'm going down
> at the moment is trying to check all of the different ways job arguments
> could be set in the job descriptor.
>
> i.e., the following should all be disallowed:
>
> srun --gres=gpu:1 --mem=49G ... (tres_per_node, mem_per_node set in the
> descriptor)
>
> srun --gpus=1 --mem-per-gpu=49G ... (tres_per_job, mem_per_tres)
>
> srun --gres=gpu:1 --ntasks-per-gpu=5 ... (tres_per_node, num_tasks,
> ntasks_per_tres)
>
> srun --gpus=1 --ntasks=2 --mem-per-cpu=25G ... (tres_per_job, num_tasks,
> mem_per_cpu)
>
> ...
>
> Essentially what I'm looking for is a way to access the ReqTRES string
> from the job record before it exists, and then run some logic against that
> i.e., if (CPU count / GPU count) > 4 or (mem count / GPU count) > 48G,
> error out.
>
> Is something like this possible?
>
> Thanks,
> Matthew
>
> --
> Matthew Baney
> Assistant Director of Computational Systems
> mba...@umd.edu | (301) 405-6756
> University of Maryland Institute for Advanced Computer Studies
> 3154 Brendan Iribe Center
> 8125 Paint Branch Dr.
> College Park, MD 20742
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>


-- 
*Jason L. Simms, Ph.D., M.P.H.*
Manager of Research Computing
Swarthmore College
Information Technology Services
(610) 328-8102
Schedule a meeting: https://calendly.com/jlsimms