I recently upgraded to SLURM 2.5.4 on RHEL 6.3.

I am trying to sbatch a very large number of short-running jobs (each job is 
1-10 minutes long).  I have a perl script that calls sbatch a bazillion times 
to submit jobs to slurm.  With a totally empty queue and all my SLURM compute 
nodes powered up, if I run my sbatch-submitting script, it starts slowing down 
after submitting about 9500 jobs, and around 9800 jobs it starts pausing with 
messages like "sbatch: error: Slurm temporarily unable to accept job, sleeping 
and retrying."  *Sometimes* an individual job will be able to submit 
successfully, but most times the sbatch eventually fails.
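
To make the submission pattern concrete, here is a minimal shell sketch of that 
kind of loop.  It is not my actual Perl script: submit_many and job.sh are 
hypothetical names used only for illustration, and it assumes each submission 
is one independent sbatch call with no throttling between calls.

```shell
#!/bin/bash
# Illustrative sketch only; the real submit script is written in Perl.
# submit_many COUNT CMD [ARGS...]
#   Runs CMD ARGS up to COUNT times, stops at the first failure, and
#   prints the number of submissions that succeeded.
submit_many() {
    local count=$1
    shift
    local submitted=0
    local i
    for ((i = 1; i <= count; i++)); do
        # Each iteration is one independent submission, e.g. "sbatch job.sh".
        if ! "$@" >/dev/null; then
            echo "submission $i failed after $submitted successes" >&2
            break
        fi
        submitted=$((submitted + 1))
    done
    echo "$submitted"
}

# Example invocation (job.sh is a placeholder batch script):
# submit_many 10000 sbatch -p eurompi job.sh
```

Printing the success count makes it easy to see roughly where submissions 
start failing.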

What causes this?  Is there some internal limit in SLURM on the maximum number 
of jobs that can be queued in a partition?  If so, is there a way to increase 
it?  (I have oodles of resources to burn on the head node; I'm not concerned if 
raising SLURM limits consumes more slurmd / slurmctld RAM or disk space.)

What's worse, however, is that after I run into these delays / hangs, SLURM 
starts acting somewhat nondeterministically for a while (anywhere from 2-5 
minutes afterwards).

For example, I had just sbatch submitted about 9800 jobs, but it got stuck with 
the "Slurm temporarily unable..." messages, so I killed my submit script and 
scancel-cleared the entire queue.  I can see via sinfo that all 32 nodes in 
eurompi are idle:

-----
% sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*         up   infinite      0    n/a 
eurompi       up   infinite     32   idle node[001-032]
infiniband    up   infinite     36  down* dell[003-016,022-043]
infiniband    up   infinite      2  idle~ dell[001-002]
xuyang        up   infinite      0    n/a 
%
-----

But if I try to salloc them, it tells me that resources are temporarily 
unavailable:

-----
% salloc -N 32 -p eurompi
salloc: error: Failed to allocate resources: Resource temporarily unavailable
% 
-----

If I wait a few minutes (the exact timing seems to be somewhat 
nondeterministic), I can "salloc -N 32 -p eurompi" no problem, and start 
submitting jobs again, etc.

Can anyone guess as to why this is happening, and/or provide some suggestions 
for preventing it from happening?

-- 
Jeff Squyres
[email protected]
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
