[slurm-users] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-27 Thread Chris Samuel via slurm-users

On 26/2/24 12:27 am, Josef Dvoracek via slurm-users wrote:


What is the recommended way to run a longer interactive job on your systems?


We provide NX for our users and also access via JupyterHub.

We also have high-priority QOSes intended for interactive use, so those jobs get a rapid 
response, but they are capped at 4 hours (or 6 hours for Jupyter users).


All the best,
Chris

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Enforcing relative resource restrictions in submission script

2024-02-27 Thread Matthew R. Baney via slurm-users
Hello Slurm users,

I'm trying to write a check in our job_submit.lua script that enforces
relative resource requirements such as disallowing more than 4 CPUs or 48GB
of memory per GPU. The QOS itself has a MaxTRESPerJob of
cpu=32,gres/gpu=8,mem=384G (roughly one full node), but we're looking to
prevent jobs from "stranding" GPUs, e.g., a 32 CPU/384GB memory job with
only 1 GPU.

I might be missing something obvious, but the rabbit hole I'm going down at
the moment is trying to check all of the different ways job arguments could
be set in the job descriptor.

For example, the following should all be disallowed:

srun --gres=gpu:1 --mem=49G ... (tres_per_node, mem_per_node set in the
descriptor)

srun --gpus=1 --mem-per-gpu=49G ... (tres_per_job, mem_per_tres)

srun --gres=gpu:1 --ntasks-per-gpu=5 ... (tres_per_node, num_tasks,
ntasks_per_tres)

srun --gpus=1 --ntasks=2 --mem-per-cpu=25G ... (tres_per_job, num_tasks,
mem_per_cpu)

...

Essentially what I'm looking for is a way to access the ReqTRES string from
the job record before it exists, and then run some logic against it, i.e.,
if (CPU count / GPU count) > 4 or (memory / GPU count) > 48G, error out.

Is something like this possible?
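
For reference, this is roughly the shape of check I have in mind for
job_submit.lua. It is a sketch only: the job_desc field names and TRES string
formats used below are assumptions that differ between Slurm releases, and it
does not yet cover ntasks_per_tres or cpus_per_tres.

-- Sketch only: field names (min_mem_per_node, min_mem_per_cpu, tres_per_job,
-- tres_per_node, mem_per_tres) and TRES string formats vary between Slurm
-- releases; the patterns below are assumptions to verify locally.

local MAX_CPUS_PER_GPU   = 4
local MAX_MEM_MB_PER_GPU = 48 * 1024

-- Treat nil and Slurm's NO_VAL sentinels as "not set".
local function val(v)
   if v == nil or v >= 0xfffffffe then return nil end
   return v
end

-- Pull a GPU count out of strings like "gres:gpu:2" or "gpu:2".
local function gpu_count(tres)
   if tres == nil then return 0 end
   return tonumber(string.match(tres, "gpu:(%d+)") or 0)
end

function slurm_job_submit(job_desc, part_list, submit_uid)
   local gpus = gpu_count(job_desc.tres_per_job)      -- --gpus=N
   if gpus == 0 then
      gpus = gpu_count(job_desc.tres_per_node)        -- --gres=gpu:N (per node)
   end
   if gpus == 0 then
      return slurm.SUCCESS                            -- no GPUs requested; nothing to check
   end

   -- CPU count: whichever of min_cpus / cpus_per_task * num_tasks was given.
   -- (ntasks_per_tres / cpus_per_tres would need the same treatment.)
   local tasks = val(job_desc.num_tasks) or 1
   local cpus  = val(job_desc.min_cpus) or ((val(job_desc.cpus_per_task) or 1) * tasks)

   -- Memory per GPU in MB, from whichever memory field was given.
   local mem_per_gpu = 0
   if val(job_desc.min_mem_per_node) then
      mem_per_gpu = val(job_desc.min_mem_per_node) / gpus
   elseif val(job_desc.min_mem_per_cpu) then
      mem_per_gpu = val(job_desc.min_mem_per_cpu) * cpus / gpus
   elseif job_desc.mem_per_tres ~= nil then
      -- --mem-per-gpu: assume the string carries a plain MB count.
      mem_per_gpu = tonumber(string.match(job_desc.mem_per_tres, "(%d+)")) or 0
   end

   if (cpus / gpus) > MAX_CPUS_PER_GPU or mem_per_gpu > MAX_MEM_MB_PER_GPU then
      slurm.log_user("Job requests more than 4 CPUs or 48G of memory per GPU; " ..
                     "please resize the request")
      return slurm.ERROR
   end

   return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   return slurm.SUCCESS
end

If there is a cleaner way to get an already-normalised TRES string at submit
time, that would obviously be preferable to enumerating the fields like this.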

Thanks,
Matthew

-- 
Matthew Baney
Assistant Director of Computational Systems
mba...@umd.edu | (301) 405-6756
University of Maryland Institute for Advanced Computer Studies
3154 Brendan Iribe Center
8125 Paint Branch Dr.
College Park, MD 20742

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-27 Thread Brian Andrus via slurm-users

Josef,

For us, we put a load balancer in front of the login nodes with session 
affinity enabled. This makes users land on the same backend node each time.


Also, for interactive X sessions, users start a desktop session on the 
node and then use VNC to connect to it. This tolerates disconnection 
for any reason, even for X-based apps.


Personally, I don't care much for interactive sessions in HPC, but there 
is a large body of users who only know how to work that way, so it is there.


Brian Andrus


On 2/26/2024 12:27 AM, Josef Dvoracek via slurm-users wrote:
What is the recommended way to run a longer interactive job on your systems?


Our how-to has users start screen on a front-end node and run srun 
with bash/zsh inside, but that of course creates a dependency between the 
login node (where screen runs) and the compute-node job.


On systems with multiple front-ends, users need to remember which login 
node holds their screen session.


Is anybody using something more advanced that is still understandable 
to a casual HPC user?


(I know about Open OnDemand, but a native console often has certain 
benefits.)


cheers

josef

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com