The documentation here should help, especially the Slurm configuration  
section:
http://slurm.schedmd.com/high_throughput.html
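
For what it's worth, the slurm.conf parameters most relevant to the symptoms below are sketched here; the values are illustrative, not tuned recommendations:

```
# slurm.conf -- illustrative values, not recommendations.

# MaxJobCount defaults to 10000; sbatch starts failing with
# "Slurm temporarily unable to accept job" once the total number
# of jobs held in slurmctld memory (pending + running + recently
# completed) reaches this limit.
MaxJobCount=100000

# How long (seconds) completed/cancelled jobs linger in slurmctld
# memory before being purged; the default is 300. Purged-but-lingering
# jobs still count against MaxJobCount, which is why the controller
# can refuse new work for a few minutes even after scancel empties
# the queue.
MinJobAge=60

# Batch up scheduling work when many submissions arrive at once.
SchedulerParameters=defer,sched_interval=30
```

The MinJobAge default of 300 seconds in particular lines up with the "2-5 minutes of nondeterminism" described below.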

Quoting "Jeff Squyres (jsquyres)" <[email protected]>:

>
> I recently upgraded to SLURM 2.5.4 on RHEL 6.3.
>
> I am trying to sbatch a very large number of short running jobs  
> (each job is 1-10 minutes long).  I have a perl script that calls  
> sbatch a bazillion times to submit jobs to slurm.  With a totally  
> empty queue and all my SLURM compute nodes powered up, if I run my  
> sbatch-submitting script, it starts slowing down after submitting  
> about 9500 jobs, and around 9800 jobs it starts pausing with  
> messages like "sbatch: error: Slurm temporarily unable to accept  
> job, sleeping and retrying."  *Sometimes* an individual job will be  
> able to submit successfully, but most times the sbatch eventually  
> fails.
>
> What causes this?  Is there some internal limit in SLURM about the  
> max number of jobs that can be queued in a partition?  If so, is  
> there a way to increase it?  (I have oodles of resources to burn on  
> the head node; I'm not concerned if increasing SLURM resources will  
> consume slurmd / slurmctld RAM or disk space)
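
Yes -- MaxJobCount in slurm.conf (default 10000) caps how many jobs slurmctld will hold, which matches the stall at ~9800 submissions. Separately, if the jobs are identical apart from an index, a job array replaces the bazillion sbatch calls with one; note that job arrays arrived in Slurm 2.6, so this would require upgrading from 2.5.4. A sketch, where worker.sh is a hypothetical per-job script:

```shell
# One sbatch call submits 10000 array tasks; %32 caps how many
# run concurrently. worker.sh is a placeholder for the job script.
sbatch --array=1-10000%32 worker.sh

# Inside worker.sh, $SLURM_ARRAY_TASK_ID (1..10000 here) selects
# which piece of work this task performs.
```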
>
> What's worse, however, is that after I run into these delays /  
> hangs, SLURM starts acting somewhat nondeterministically for a while  
> (anywhere from 2-5 minutes afterwards).
>
> For example, I had just sbatch submitted about 9800 jobs, but it got  
> stuck with the "Slurm temporarily unable..." messages, so I killed  
> my submit script and scancel-cleared the entire queue.  I can see  
> via sinfo that all 32 nodes in eurompi are idle:
>
> -----
> % sinfo
> PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
> defq*         up   infinite      0    n/a
> eurompi       up   infinite     32   idle node[001-032]
> infiniband    up   infinite     36  down* dell[003-016,022-043]
> infiniband    up   infinite      2  idle~ dell[001-002]
> xuyang        up   infinite      0    n/a
> %
> -----
>
> But if I try to salloc them, it tells me that resources are  
> temporarily unavailable:
>
> -----
> % salloc -N 32 -p eurompi
> salloc: error: Failed to allocate resources: Resource temporarily unavailable
> %
> -----
>
> If I wait a few minutes (the exact timing seems to be somewhat  
> nondeterministic), I can "salloc -N 32 -p eurompi" no problem, and  
> start submitting jobs again, etc.
>
> Can anyone guess as to why this is happening, and/or provide some  
> suggestions for preventing it from happening?
>
> --
> Jeff Squyres
> [email protected]
> For corporate legal information go to:  
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
