Hello,
We have a use case in which we need to launch multiple concurrently running MPI 
applications inside a job allocation. Most supercomputing facilities limit the 
number of concurrent job steps as they incur an overhead with the global Slurm 
scheduler. Some frameworks, such as the Flux framework from LLNL, claim to 
mitigate this issue by starting an instance of their own scheduler inside an 
allocation, which then acts as the resource manager for the compute nodes in 
the allocation.

Out of curiosity, I was wondering whether there is a fundamental reason for 
having a single global scheduler that every srun launch command must contact 
to create a job step. Or would it have been overkill to develop a 
‘hierarchical’ design in which Slurm launches a local job daemon for every 
allocation, which then manages resources within that allocation? I would 
appreciate your insight into Slurm’s core design.

Thanks and regards,
Kshitij Mehta
Oak Ridge National Laboratory
-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com