Perfect; thank you!

On Apr 14, 2013, at 11:42 AM, Moe Jette <[email protected]> wrote:
> The documentation here should help, especially the Slurm configuration
> section:
>
> http://slurm.schedmd.com/high_throughput.html
>
> Quoting "Jeff Squyres (jsquyres)" <[email protected]>:
>
>> I recently upgraded to SLURM 2.5.4 on RHEL 6.3.
>>
>> I am trying to sbatch a very large number of short-running jobs (each job
>> is 1-10 minutes long). I have a perl script that calls sbatch a bazillion
>> times to submit jobs to slurm. With a totally empty queue and all my SLURM
>> compute nodes powered up, if I run my sbatch-submitting script, it starts
>> slowing down after submitting about 9500 jobs, and around 9800 jobs it
>> starts pausing with messages like "sbatch: error: Slurm temporarily unable
>> to accept job, sleeping and retrying." *Sometimes* an individual job will
>> submit successfully, but most times the sbatch eventually fails.
>>
>> What causes this? Is there some internal limit in SLURM on the maximum
>> number of jobs that can be queued in a partition? If so, is there a way to
>> increase it? (I have oodles of resources to burn on the head node; I'm not
>> concerned if increasing SLURM resources will consume slurmd / slurmctld
>> RAM or disk space.)
>>
>> What's worse, however, is that after I run into these delays / hangs,
>> SLURM starts acting somewhat nondeterministically for a while (anywhere
>> from 2-5 minutes afterwards).
>>
>> For example, I had just sbatch-submitted about 9800 jobs, but it got stuck
>> with the "Slurm temporarily unable..." messages, so I killed my submit
>> script and scancel-cleared the entire queue. I can see via sinfo that all
>> 32 nodes in eurompi are idle:
>>
>> -----
>> % sinfo
>> PARTITION   AVAIL  TIMELIMIT  NODES  STATE  NODELIST
>> defq*          up   infinite      0    n/a
>> eurompi        up   infinite     32   idle  node[001-032]
>> infiniband     up   infinite     36  down*  dell[003-016,022-043]
>> infiniband     up   infinite      2  idle~  dell[001-002]
>> xuyang         up   infinite      0    n/a
>> %
>> -----
>>
>> But if I try to salloc them, it tells me that resources are temporarily
>> unavailable:
>>
>> -----
>> % salloc -N 32 -p eurompi
>> salloc: error: Failed to allocate resources: Resource temporarily unavailable
>> %
>> -----
>>
>> If I wait a few minutes (the exact timing seems to be somewhat
>> nondeterministic), I can "salloc -N 32 -p eurompi" no problem, and start
>> submitting jobs again, etc.
>>
>> Can anyone guess why this is happening, and/or provide some suggestions
>> for preventing it from happening?
>>
>> --
>> Jeff Squyres
>> [email protected]
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/

--
Jeff Squyres
[email protected]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
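
[For the archives: the symptoms above are consistent with slurmctld's job
limit, and the high_throughput.html page Moe points to covers the relevant
slurm.conf knobs. A minimal sketch follows; the values shown are
illustrative examples, not tuned recommendations for this site:]

-----
# slurm.conf fragment (illustrative values only)

# MaxJobCount: slurmctld refuses new submissions once this many jobs
# (pending + running + recently completed) exist; the default of 10000
# matches the slowdown around 9500-9800 jobs described above.
MaxJobCount=100000

# MinJobAge: seconds a completed job record is kept in slurmctld memory.
# Lowering it purges finished jobs sooner, so they stop counting against
# MaxJobCount.
MinJobAge=60

# "defer" avoids attempting to schedule at every single submission, which
# helps slurmctld keep up with a burst of sbatch calls.
SchedulerParameters=defer
-----

[Raising MaxJobCount mainly costs slurmctld RAM, which matches Jeff's
"oodles of resources on the head node" situation; a restart of slurmctld
is needed for the new limit to take effect.]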
