[slurm-users] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?
On 26/2/24 12:27 am, Josef Dvoracek via slurm-users wrote:
> What is the recommended way to run a longer interactive job at your systems?

We provide NX for our users and also access via JupyterHub.

We also have high-priority QOSes intended for interactive use (for rapid response), but they are capped at 4 hours (or 6 hours for Jupyter users).

All the best,
Chris
[slurm-users] Enforcing relative resource restrictions in submission script
Hello Slurm users,

I'm trying to write a check in our job_submit.lua script that enforces relative resource requirements, such as disallowing more than 4 CPUs or 48 GB of memory per GPU. The QOS itself has a MaxTRESPerJob of cpu=32,gres/gpu=8,mem=384G (roughly one full node), but we're looking to prevent jobs from "stranding" GPUs, e.g. a 32 CPU / 384 GB memory job with only 1 GPU.

I might be missing something obvious, but the rabbit hole I'm going down at the moment is trying to check all of the different ways job arguments could be set in the job descriptor, i.e. the following should all be disallowed:

    srun --gres=gpu:1 --mem=49G ...                 (tres_per_node, mem_per_node set in the descriptor)
    srun --gpus=1 --mem-per-gpu=49G ...             (tres_per_job, mem_per_tres)
    srun --gres=gpu:1 --ntasks-per-gpu=5 ...        (tres_per_node, num_tasks, ntasks_per_tres)
    srun --gpus=1 --ntasks=2 --mem-per-cpu=25G ...  (tres_per_job, num_tasks, mem_per_cpu)
    ...

Essentially, what I'm looking for is a way to access something like the ReqTRES string from the job record before that record exists, and then run some logic against it, i.e. error out if (CPU count / GPU count) > 4 or (memory / GPU count) > 48G.

Is something like this possible? (A rough sketch of what I mean is below.)

Thanks,
Matthew

--
Matthew Baney
Assistant Director of Computational Systems
mba...@umd.edu | (301) 405-6756
University of Maryland Institute for Advanced Computer Studies
3154 Brendan Iribe Center
8125 Paint Branch Dr.
College Park, MD 20742
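P.S. To make the question concrete, this is roughly the shape of check I have in mind. It is only a sketch: the field names (tres_per_job, tres_per_node, min_cpus, pn_min_memory) are taken from the job_submit plugin documentation and may differ between Slurm versions, the TRES string parsing is guesswork, and the mem_per_tres / ntasks_per_tres / mem_per_cpu variants above would still need their own handling:

    -- Illustrative fragment of a job_submit.lua enforcing CPU and memory limits per GPU.
    local MAX_CPUS_PER_GPU   = 4
    local MAX_MEM_PER_GPU_MB = 48 * 1024

    -- Extract a GPU count from TRES strings such as "gres:gpu:2", "gpu:2" or "gpu:a100:2".
    local function gpu_count(tres)
       if tres == nil or tres == "" then
          return 0
       end
       local count = 0
       for spec in string.gmatch(tres, "[^,]+") do
          local n = string.match(spec, "gpu[:%w]*:(%d+)$")
          if n ~= nil then
             count = count + tonumber(n)
          elseif string.find(spec, "gpu") then
             count = count + 1            -- bare "gpu" with no count means one GPU
          end
       end
       return count
    end

    function slurm_job_submit(job_desc, part_list, submit_uid)
       -- --gpus=N arrives in tres_per_job, --gres=gpu:N in tres_per_node.
       local gpus = gpu_count(job_desc.tres_per_job)
       if gpus == 0 then gpus = gpu_count(job_desc.tres_per_node) end
       if gpus == 0 then
          return slurm.SUCCESS          -- no GPUs requested, nothing to enforce here
       end

       -- CPU check. min_cpus is only one of the places a CPU request can land
       -- (cpus_per_task, num_tasks, cpus_per_tres, ...); 0xfffffffe is Slurm's
       -- NO_VAL sentinel for "not set".
       local cpus = job_desc.min_cpus
       if cpus ~= nil and cpus < 0xfffffffe and cpus > gpus * MAX_CPUS_PER_GPU then
          slurm.log_user(string.format(
             "Jobs are limited to %d CPUs per GPU", MAX_CPUS_PER_GPU))
          return slurm.ERROR
       end

       -- Memory check. pn_min_memory is in MB; values at or above 2^63 mean either
       -- "not set" or "--mem-per-cpu was used" (flag bit), so only a plain per-node
       -- --mem request is handled here.
       local mem = job_desc.pn_min_memory
       if mem ~= nil and mem < 2^63 and mem > gpus * MAX_MEM_PER_GPU_MB then
          slurm.log_user(string.format(
             "Jobs are limited to %dG of memory per GPU", MAX_MEM_PER_GPU_MB / 1024))
          return slurm.ERROR
       end

       return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
       return slurm.SUCCESS
    end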
[slurm-users] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?
Josef,

For us, we put a load balancer in front of the login nodes with session affinity enabled. This makes users land on the same backend node each time.

Also, for interactive X sessions, users start a desktop session on the node and then use VNC to connect to it. This accommodates disconnection for any reason, even for X-based apps.

Personally, I don't care much for interactive sessions in HPC, but there is a large body of users that only knows how to do things that way, so it is there.

Brian Andrus

On 2/26/2024 12:27 AM, Josef Dvoracek via slurm-users wrote:
> What is the recommended way to run a longer interactive job at your systems?
>
> Our how-to includes starting screen at a front-end node and running srun with bash/zsh inside, but that indeed creates a dependency between the login node (with the screen session) and the compute node running the job. On systems with multiple front-ends, users need to remember the login node where they have their screen session.
>
> Is anybody using something more advanced that is still understandable by a casual HPC user? (I know of Open OnDemand, but often the use of a native console has certain benefits.)
>
> cheers,
> josef