I finally figured out what caused my problem. I think I found a bug in the scheduling algorithm with SelectType=select/serial.
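For context, the two configurations compared below differ only in the select plugin line of slurm.conf. This is a minimal sketch; the SelectTypeParameters value is an illustrative assumption (not from our actual config), and all other settings were left unchanged between the two runs:

```
# Broken case: the array job stops filling the cluster after the 2nd pass
SelectType=select/serial

# Working case: the cluster fills up on every scheduling pass
SelectType=select/cons_res
SelectTypeParameters=CR_Core    # illustrative; choose the CR_* policy appropriate for your site
```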
In the high-throughput tuning guide, http://slurm.schedmd.com/high_throughput.html, SelectType=select/serial is recommended for serial-jobs-only environments. So for my throughput test, I used the SelectType=select/serial option, but it failed to schedule my array job across the entire cluster. It turned out that if I switch to SelectType=select/cons_res, it works fine:

< SelectType=select/serial (broken: an array job fails to fill up the cluster after the 2nd pass)
---
> SelectType=select/cons_res (works fine: fills up the cluster all the time)

With SelectType=select/serial, it appears that Slurm schedules the first 102 tasks and then fills up the entire cluster during the next scheduler run. After that, however, it never schedules pending tasks on more than one compute node.

$ sbatch --array=1-5000 -o /dev/null --wrap="/bin/sleep 120"
Submitted batch job 4096654

At first, it only schedules 102 tasks:

$ squeue -a |more
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4096654_[102-5000] normal wrap xxxxxx PD 0:00 1 (None)
4096654_17 normal wrap xxxxxx R 0:04 1 f-12-5
. . .

During the next scheduler run, it fills up the entire cluster. The full cluster has 1408 cores (32 cores per node).

$ sleep 30; squeue -a |more
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4096654_[1409-5000 normal wrap xxxxxx PD 0:00 1 (Resources)
4096654_793 normal wrap xxxxxx R 0:03 1 f-12-34
4096654_794 normal wrap xxxxxx R 0:03 1 f-12-34

After that, it fails to fill up the cluster. Only one node gets filled; at this point, no more than one node is ever scheduled:
$ sleep 120; squeue -a|more
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4096654_[1441-5000 normal wrap xxxxxx PD 0:00 1 (Resources)
4096654_1416 normal wrap xxxxxx R 1:17 1 f-12-5

However, if I switch to SelectType=select/cons_res, it works fine. At first, it schedules the first 102 tasks:

$ squeue -a |more
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4098287_[102-5000] normal wrap xxxxxx PD 0:00 1 (None)
4098287_1 normal wrap xxxxxx R 0:10 1 f-15-12
. . .

Then, at the next scheduler run, it fills up the cluster:

$ sleep 30; squeue -a |more
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4098287_[1409-5000 normal wrap xxxxxx PD 0:00 1 (Resources)
4098287_1389 normal wrap xxxxxx R 0:18 1 f-15-10

At the next scheduling pass, it still fills up the cluster as expected:

$ sleep 120; squeue -a|more
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4098287_[2817-5000 normal wrap xxxxxx PD 0:00 1 (Resources)
4098287_2728 normal wrap xxxxxx R 1:02 1 f-12-40
. . .

Regards,
- Chansup

On Thu, Jan 7, 2016 at 6:14 AM, Daniel Letai <[email protected]> wrote:

>
> Your MaxJobCount/MinJobAge combo might be too high, and slurmctld may be
> exhausting physical memory and resorting to swap, which slows it down
> enough to exceed its scheduling loop time window.
> You might wish to increase the scheduling loop duration as per
> http://slurm.schedmd.com/slurm.conf.html#OPT_SchedulerParameters
> and specifically:
> http://slurm.schedmd.com/slurm.conf.html#OPT_max_sched_time=#
> possibly also
> http://slurm.schedmd.com/slurm.conf.html#OPT_bf_yield_interval=#
> http://slurm.schedmd.com/slurm.conf.html#OPT_build_queue_timeout=#
> Although the last two seem less likely (sleep has no dependencies, and
> backfill is likely not playing a role).
>
> Other options - from http://slurm.schedmd.com/job_array.html :
> The sched/backfill plugin has been modified to improve performance with
> job arrays.
> Once one element of a job array is discovered to not be runnable, or to
> impact the scheduling of pending jobs, the remaining elements of that
> job array will be quickly skipped.
>
> Have you enabled backfill debugging flags to verify this is not happening
> for some reason?
>
>
> On 01/06/2016 08:12 PM, CB wrote:
>
>> slurm job array limit?
>>
>> Hi,
>>
>> I'm running Slurm version 15.08.1.
>>
>> When I submitted a job array with 5000 tasks, it only scheduled the first
>> 102 tasks although there are plenty of slots available:
>>
>> sbatch --array=1-5000 -o /dev/null --wrap="/bin/sleep 120"
>>
>> The slurmctld log says:
>>
>> [2016-01-06T12:43:43.496] debug: sched: already tested 102 jobs,
>> breaking out
>>
>> Then, after a while, the scheduler dispatched some 1000 tasks and says:
>>
>> [2016-01-06T12:44:24.003] debug: sched: loop taking too long, breaking
>> out
>> [2016-01-06T12:44:24.004] debug: Note large processing time from
>> schedule: usec=1439516 began=12:44:22.564
>> [2016-01-06T12:44:24.070] debug: Note large processing time from
>> _slurmctld_background: usec=1531381 began=12:44:22.538
>>
>> After that, Slurm schedules the remaining tasks on only one compute node.
>>
>> Has anyone seen this behavior?
>>
>> Currently we've set the following Slurm parameters:
>> MaxArraySize=100000
>> MaxJobCount=2500000
>>
>> Thanks,
>> - Chansup
>>
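For anyone following along, the tuning knobs Daniel suggests could be tried in slurm.conf roughly as follows. The values shown are only illustrative placeholders (see the SchedulerParameters entry in the slurm.conf documentation he links), not tested recommendations:

```
# Illustrative values only - tune per the slurm.conf SchedulerParameters docs
SchedulerParameters=max_sched_time=4,bf_yield_interval=2000000,build_queue_timeout=120000000
```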
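As an aside, squeue snapshots like the ones in this thread can be condensed into state counts with a small awk helper. This is just a sketch over squeue's default output format, where field 5 is the ST (state) column:

```shell
#!/bin/sh
# Count running (R) and pending (PD) entries in squeue output.
# Field 5 is the ST (state) column in squeue's default format;
# NR > 1 skips the header line.
count_states() {
    awk 'NR > 1 { counts[$5]++ }
         END { printf "R=%d PD=%d\n", counts["R"]+0, counts["PD"]+0 }'
}

# Normally you would run: squeue -a | count_states
# Here it is fed a captured sample from the test above:
count_states <<'EOF'
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4096654_[102-5000] normal wrap xxxxxx PD 0:00 1 (None)
4096654_17 normal wrap xxxxxx R 0:04 1 f-12-5
EOF
# prints: R=1 PD=1
```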
