The documentation here should help, especially the Slurm configuration section: http://slurm.schedmd.com/high_throughput.html
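
Separately from tuning the configuration, the submit script itself could throttle so that it never pushes the queue up to the point where sbatch starts failing. The stall near 10,000 jobs looks consistent with the MaxJobCount setting in slurm.conf (default 10000, as far as I know), though I have not verified that on your system. Below is a minimal Python sketch of the throttling idea, not a tested tool. The function names, the 5000-job cap, and the 30-second poll interval are all my own assumptions, chosen only for illustration. It relies on `squeue -h -u <user> -o %i` printing one job ID per line.

```python
import subprocess
import time


def queued_job_count(user):
    """Count this user's jobs currently listed by squeue (one ID per line)."""
    out = subprocess.run(
        ["squeue", "-h", "-u", user, "-o", "%i"],
        check=True, capture_output=True, text=True,
    ).stdout
    return len(out.split())


def sbatch(script):
    """Submit one batch script; raises CalledProcessError if sbatch fails."""
    subprocess.run(["sbatch", script], check=True)


def submit_throttled(scripts, count_queued, submit, max_queued=5000,
                     poll_seconds=30, sleep=time.sleep):
    """Submit scripts in order, keeping at most max_queued jobs queued.

    count_queued and submit are passed in as callables so the loop can be
    exercised without a running Slurm installation.
    """
    pending = list(scripts)
    submitted = 0
    while pending:
        # How many more jobs fit under the (hypothetical) cap right now?
        room = max_queued - count_queued()
        if room <= 0:
            # Queue is full: wait for running jobs to drain, then re-check.
            sleep(poll_seconds)
            continue
        batch, pending = pending[:room], pending[room:]
        for script in batch:
            submit(script)
            submitted += 1
    return submitted
```

To use it against a real cluster, something like `submit_throttled(job_scripts, lambda: queued_job_count("jsquyres"), sbatch)` should work, where `job_scripts` is a list of batch script paths and the user name is a placeholder. The cap should sit comfortably below whatever MaxJobCount is set to.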
Quoting "Jeff Squyres (jsquyres)" <[email protected]>:

>
> I'm recently upgraded to use SLURM 2.5.4 on RHEL 6.3.
>
> I am trying to sbatch a very large number of short running jobs
> (each job is 1-10 minutes long). I have a perl script that calls
> sbatch a bazillion times to submit jobs to slurm. With a totally
> empty queue and all my SLURM compute nodes powered up, if I run my
> sbatch-submitting script, it starts slowing down after submitting
> about 9500 jobs, and around 9800 jobs it starts pausing with
> messages like "sbatch: error: Slurm temporarily unable to accept
> job, sleeping and retrying." *Sometimes* an individual job will be
> able to submit successfully, but most times the sbatch eventually
> fails.
>
> What causes this? Is there some internal limit in SLURM about the
> max number of jobs that can be queued in a partition? If so, is
> there a way to increase it? (I have oodles of resources to burn on
> the head node; I'm not concerned if increasing SLURM resources will
> consume slurmd / slurmctld RAM or disk space)
>
> What's worse, however, is that after I run into these delays /
> hangs, SLURM starts acting somewhat nondeterministically for a while
> (anywhere from 2-5 minutes afterwards).
>
> For example, I had just sbatch submitted about 9800 jobs, but it got
> stuck with the "Slurm temporarily unable..." messages, so I killed
> my submit script and scancel-cleared the entire queue. I can see
> via sinfo that all 32 nodes in eurompi are idle:
>
> -----
> % sinfo
> PARTITION  AVAIL TIMELIMIT NODES STATE NODELIST
> defq*      up    infinite      0 n/a
> eurompi    up    infinite     32 idle  node[001-032]
> infiniband up    infinite     36 down* dell[003-016,022-043]
> infiniband up    infinite      2 idle~ dell[001-002]
> xuyang     up    infinite      0 n/a
> %
> -----
>
> But if I try to salloc them, it tells me that resources are
> temporarily unavailable:
>
> -----
> % salloc -N 32 -p eurompi
> salloc: error: Failed to allocate resources: Resource temporarily unavailable
> %
> -----
>
> If I wait a few minutes (the exact timing seems to be somewhat
> nondeterministic), I can "salloc -N 32 -p eurompi" no problem, and
> start submitting jobs again, etc.
>
> Can anyone guess as to why this is happening, and/or provide some
> suggestions for preventing it from happening?
>
> --
> Jeff Squyres
> [email protected]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
