Perfect; thank you!

On Apr 14, 2013, at 11:42 AM, Moe Jette <[email protected]> wrote:
> The documentation here should help, especially the Slurm configuration
> section:
>
> http://slurm.schedmd.com/high_throughput.html
>
> Quoting "Jeff Squyres (jsquyres)" <[email protected]>:
>
>> I recently upgraded to SLURM 2.5.4 on RHEL 6.3.
>>
>> I am trying to sbatch a very large number of short-running jobs (each job
>> is 1-10 minutes long). I have a perl script that calls sbatch a bazillion
>> times to submit jobs to slurm. With a totally empty queue and all my SLURM
>> compute nodes powered up, if I run my sbatch-submitting script, it starts
>> slowing down after submitting about 9500 jobs, and around 9800 jobs it
>> starts pausing with messages like "sbatch: error: Slurm temporarily unable
>> to accept job, sleeping and retrying." *Sometimes* an individual job will
>> submit successfully, but most times the sbatch eventually fails.
>>
>> What causes this? Is there some internal limit in SLURM on the maximum
>> number of jobs that can be queued in a partition? If so, is there a way to
>> increase it? (I have oodles of resources to burn on the head node; I'm not
>> concerned if increasing SLURM resources will consume slurmd / slurmctld
>> RAM or disk space.)
>>
>> What's worse, however, is that after I run into these delays / hangs,
>> SLURM starts acting somewhat nondeterministically for a while (anywhere
>> from 2-5 minutes afterwards).
>>
>> For example, I had just sbatch-submitted about 9800 jobs, but it got stuck
>> with the "Slurm temporarily unable..." messages, so I killed my submit
>> script and scancel-cleared the entire queue. I can see via sinfo that all
>> 32 nodes in eurompi are idle:
>>
>> -----
>> % sinfo
>> PARTITION   AVAIL  TIMELIMIT  NODES  STATE  NODELIST
>> defq*          up   infinite      0    n/a
>> eurompi        up   infinite     32   idle  node[001-032]
>> infiniband     up   infinite     36  down*  dell[003-016,022-043]
>> infiniband     up   infinite      2  idle~  dell[001-002]
>> xuyang         up   infinite      0    n/a
>> %
>> -----
>>
>> But if I try to salloc them, it tells me that resources are temporarily
>> unavailable:
>>
>> -----
>> % salloc -N 32 -p eurompi
>> salloc: error: Failed to allocate resources: Resource temporarily unavailable
>> %
>> -----
>>
>> If I wait a few minutes (the exact timing seems to be somewhat
>> nondeterministic), I can "salloc -N 32 -p eurompi" no problem, and start
>> submitting jobs again, etc.
>>
>> Can anyone guess why this is happening, and/or provide some suggestions
>> for preventing it from happening?
>>
>> --
>> Jeff Squyres
>> [email protected]
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/

--
Jeff Squyres
[email protected]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
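
[For the archives: the symptoms above are consistent with slurmctld's job
limit, and the high_throughput.html page Moe points to covers the relevant
slurm.conf knobs. A minimal sketch follows; the values shown are
illustrative examples, not tuned recommendations for this site:]

-----
# slurm.conf fragment (illustrative values only)

# MaxJobCount: slurmctld refuses new submissions once this many jobs
# (pending + running + recently completed) exist; the default of 10000
# matches the slowdown around 9500-9800 jobs described above.
MaxJobCount=100000

# MinJobAge: seconds a completed job record is kept in slurmctld memory.
# Lowering it purges finished jobs sooner, so they stop counting against
# MaxJobCount.
MinJobAge=60

# "defer" avoids attempting to schedule at every single submission, which
# helps slurmctld keep up with a burst of sbatch calls.
SchedulerParameters=defer
-----

[Raising MaxJobCount mainly costs slurmctld RAM, which matches Jeff's
"oodles of resources on the head node" situation; a restart of slurmctld
is needed for the new limit to take effect.]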
