Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

Paul Edmon Tue, 27 Aug 2019 07:08:21 -0700

At least for our cluster we generally recommend that if you are submitting large numbers of jobs you either use a job array or just loop over the jobs you want to submit in a for loop. A fork bomb is definitely not recommended. For highest-throughput submission a job array is your best bet, as a single submission generates thousands of jobs which the scheduler can then handle sensibly. So I highly recommend using job arrays.
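
As an illustration of the job-array suggestion (a minimal sketch, not part of the original message): the helper below only builds the sbatch command line for one array submission. The function names and the script name run_one.sh are hypothetical; --array and its %N running-task cap are standard sbatch options.

```python
import shlex
import subprocess


def build_array_command(script, n_tasks, max_running=None):
    """Return the sbatch argv that submits n_tasks jobs as one job array."""
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    spec = f"0-{n_tasks - 1}"
    if max_running is not None:
        # "%N" caps how many array tasks may run at the same time.
        spec += f"%{max_running}"
    return ["sbatch", f"--array={spec}", script]


def submit(argv):
    """Run sbatch once and return its stdout; requires Slurm on PATH."""
    return subprocess.run(
        argv, check=True, capture_output=True, text=True
    ).stdout


if __name__ == "__main__":
    # Only print the command here; call submit(cmd) on a real cluster.
    cmd = build_array_command("run_one.sh", 1000, max_running=50)
    print(shlex.join(cmd))
```

Inside the submitted script, each task can read SLURM_ARRAY_TASK_ID to pick its own input. The site's MaxArraySize setting limits how large a single array may be, so very large batches may need to be split across several arrays.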


-Paul Edmon-


On 8/27/19 3:45 AM, Guillaume Perrault Archambault wrote:

Hi Paul,

Thanks a lot for your suggestion.
The cluster I'm using has thousands of users, so I'm doubtful the admins will change this setting just for me. But I'll mention it to the support team I'm working with.
I was hoping more for something that can be done on the user end.
Is there some way for the user to measure whether the scheduler is in RPC saturation? And then if it is, I could make sure my script doesn't launch too many jobs in parallel.
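
One possible user-side check, sketched here under assumptions (it is not something proposed in the thread): the sdiag command reports slurmctld statistics, including a "Server thread count" line, which is the quantity that max_rpc_cnt is compared against. The parser below picks integer-valued "name: value" lines out of that output. The function names and the thread_limit value of 100 are hypothetical, and some sites may not allow ordinary users to run sdiag.

```python
import re
import subprocess


def parse_sdiag(text):
    """Collect integer-valued 'name: value' lines from sdiag output.

    If a name appears more than once, the first occurrence is kept.
    """
    stats = {}
    for line in text.splitlines():
        m = re.match(r"\s*([^:]+?):\s+(\d+)\s*$", line)
        if m:
            stats.setdefault(m.group(1), int(m.group(2)))
    return stats


def controller_is_busy(stats, thread_limit=100):
    """Heuristic: treat a high server thread count as a busy controller.

    thread_limit is an arbitrary placeholder, not a Slurm default.
    """
    return stats.get("Server thread count", 0) >= thread_limit


def read_sdiag():
    """Run sdiag and return its text output; requires Slurm on PATH."""
    return subprocess.run(
        ["sdiag"], check=True, capture_output=True, text=True
    ).stdout
```

A submission script could call controller_is_busy(parse_sdiag(read_sdiag())) before each batch and wait when it returns True. Note that each sdiag call is itself an RPC to the controller, so it should be polled sparingly.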
Sorry if my question is too vague, I don't understand the backend of the SLURM scheduler too well, so my questions are using the limited terminology of a user.
My concern is just to make sure that my scripts don't send out more commands (simultaneously) than the scheduler can handle.
For example, as an extreme scenario, suppose a user forks off 1000 sbatch commands in parallel, is that more than the scheduler can handle? As a user, how can I know whether it is?
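
A middle ground between forking 1000 sbatch commands at once and submitting strictly one at a time is to cap how many submissions are in flight. The sketch below is an assumption about how that could look, not the toolkit's actual code: submit_bounded, sbatch_submit, max_parallel and pause are hypothetical names, and the right cap for a given cluster is something to agree with its admins.

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor


def sbatch_submit(script):
    """Submit one script with sbatch and return its stdout.

    Requires Slurm on PATH.
    """
    result = subprocess.run(
        ["sbatch", script], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def submit_bounded(scripts, submit=sbatch_submit, max_parallel=4, pause=0.0):
    """Submit scripts with at most max_parallel submissions in flight.

    Results are returned in the same order as the input scripts.
    """
    def one(script):
        out = submit(script)
        if pause:
            # Optional breather after each submission in this worker.
            time.sleep(pause)
        return out

    # The pool size is the upper bound on concurrent sbatch calls.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(one, scripts))
```

For example, submit_bounded(["a.sh", "b.sh", "c.sh"], max_parallel=2) never has more than two sbatch calls running at the same time. The submit argument is injectable so the throttling logic can be exercised without a cluster.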
Regards,
Guillaume.
On Mon, Aug 26, 2019 at 10:15 AM Paul Edmon <ped...@cfa.harvard.edu> wrote:
    We've hit this before due to RPC saturation.  I highly recommend
    using max_rpc_cnt and/or defer for scheduling. That should help
    alleviate this problem.

    -Paul Edmon-

    On 8/26/19 2:12 AM, Guillaume Perrault Archambault wrote:
    Hello,

    I wrote a regression-testing toolkit to manage large numbers of
    SLURM jobs and their output (the toolkit can be found here
    <https://github.com/gobbedy/slurm_simulation_toolkit/> if anyone
    is interested).

    To make job launching faster, sbatch commands are forked, so that
    numerous jobs may be submitted in parallel.

    We (the cluster admin and myself) are concerned that this may
    cause unresponsiveness for other users.

    I cannot say for sure since I don't have visibility over all
    users of the cluster, but unresponsiveness doesn't seem to have
    occurred so far. That being said, the fact that it hasn't
    occurred yet doesn't mean it won't in the future. So I'm treating
    this as a ticking time bomb to be fixed asap.

    My questions are the following:
    1) Does anyone have experience with large numbers of jobs
    submitted in parallel? What are the limits that can be hit? For
    example is there some hard limit on how many jobs a SLURM
    scheduler can handle before blacking out / slowing down?
    2) Is there a way for me to find/measure/ping this resource limit?
    3) How can I make sure I don't hit this resource limit?

    From what I've observed, parallel submission can improve
    submission time by a factor of at least 10. This can make a big
    difference in users' workflows.

    For that reason I would like to keep the option of launching jobs
    sequentially as a last resort.

    Thanks in advance.

    Regards,
    Guillaume.
