I recently upgraded to SLURM 2.5.4 on RHEL 6.3. I am trying to sbatch a very large number of short-running jobs (each job is 1-10 minutes long). I have a perl script that calls sbatch a bazillion times to submit jobs to SLURM. With a totally empty queue and all my SLURM compute nodes powered up, if I run my sbatch-submitting script, it starts slowing down after submitting about 9,500 jobs, and around 9,800 jobs it starts pausing with messages like "sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying." *Sometimes* an individual job will submit successfully, but most times the sbatch eventually fails.
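For reference, the submit loop is nothing fancy; it boils down to roughly the following (sketched here in shell rather than perl, with job.sh standing in for the real job script, and falling back to a dry-run echo when sbatch isn't on the PATH):

```shell
#!/bin/sh
# Rough sketch of the submit loop (the real script is perl).
# SBATCH falls back to 'echo' for a dry run when sbatch isn't installed.
SBATCH=$(command -v sbatch >/dev/null 2>&1 && echo sbatch || echo echo)

submit_all() {
    njobs=$1
    i=1
    while [ "$i" -le "$njobs" ]; do
        # job.sh is a placeholder for the short (1-10 minute) job script
        "$SBATCH" --partition=eurompi --nodes=1 job.sh
        i=$((i + 1))
    done
}

submit_all 5    # the real run submits ~10000 jobs
```

Nothing in the loop throttles submission; it just fires sbatch as fast as it will go, which is where the slowdown around 9,500 jobs shows up.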
What causes this? Is there some internal limit in SLURM on the maximum number of jobs that can be queued in a partition? If so, is there a way to increase it? (I have oodles of resources to burn on the head node; I'm not concerned if increasing SLURM resources will consume slurmd / slurmctld RAM or disk space.)

What's worse, however, is that after I run into these delays / hangs, SLURM starts acting somewhat nondeterministically for a while (anywhere from 2-5 minutes afterwards). For example, I had just sbatch-submitted about 9,800 jobs, but it got stuck with the "Slurm temporarily unable..." messages, so I killed my submit script and scancel-cleared the entire queue. I can see via sinfo that all 32 nodes in eurompi are idle:

-----
% sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE  NODELIST
defq*          up   infinite      0    n/a
eurompi        up   infinite     32   idle  node[001-032]
infiniband     up   infinite     36  down*  dell[003-016,022-043]
infiniband     up   infinite      2  idle~  dell[001-002]
xuyang         up   infinite      0    n/a
%
-----

But if I try to salloc them, it tells me that resources are temporarily unavailable:

-----
% salloc -N 32 -p eurompi
salloc: error: Failed to allocate resources: Resource temporarily unavailable
%
-----

If I wait a few minutes (the exact timing seems to be somewhat nondeterministic), I can "salloc -N 32 -p eurompi" no problem, and start submitting jobs again, etc.

Can anyone guess why this is happening, and/or provide some suggestions for preventing it?

-- 
Jeff Squyres
[email protected]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
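P.S. Regarding my own question about an internal limit: the closest thing I've found in the slurm.conf man page is MaxJobCount, which I believe defaults to 10000 -- suspiciously close to where my submissions start failing. An untested guess at what I'd try, assuming that's the relevant knob:

```
# slurm.conf -- untested guess; MaxJobCount is the total number of jobs
# slurmctld will track at once, and I believe it defaults to 10000
MaxJobCount=100000
```

(My understanding is that this needs a slurmctld restart, not just "scontrol reconfigure", to take effect -- but I haven't verified that.)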
