I finally figured out what caused my problem.  I think I found a bug in the
scheduling algorithm with SelectType=select/serial.

The high-throughput tuning guide, http://slurm.schedmd.com/high_throughput.html,
recommends SelectType=select/serial for a serial-jobs-only environment.
So for my throughput test, I used the SelectType=select/serial option.
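
For context, the relevant slurm.conf setting was simply the following (a
fragment, not our full config; node and partition definitions omitted):

```
# slurm.conf (fragment) -- serial-jobs-only tuning per the
# high-throughput guide
SelectType=select/serial
```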

But with that setting, Slurm failed to schedule my array job across the
entire cluster. It turned out that if I switched to
SelectType=select/cons_res, everything works fine.

< SelectType=select/serial   (broken: the array job fails to fill up the cluster after the 2nd pass)
---
> SelectType=select/cons_res  (works fine: fills up the cluster every time)


With SelectType=select/serial, Slurm schedules the first 102 tasks, and
then fills up the entire cluster during the next scheduler run. However,
after that, it never schedules pending tasks on more than one compute node.

$ sbatch --array=1-5000 -o /dev/null --wrap="/bin/sleep 120"
Submitted batch job 4096654

At first, it only schedules 102 tasks:

$ squeue -a |more
             JOBID PARTITION     NAME     USER ST       TIME  NODES
NODELIST(REASON)
4096654_[102-5000]    normal     wrap  xxxxxx PD       0:00      1 (None)
        4096654_17    normal     wrap  xxxxxx R       0:04      1 f-12-5
. . .

During the next scheduler run, it fills up the entire cluster. The full
cluster has 1408 cores (32 cores per node).
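
As a sanity check (assuming select/serial allocates one core per task), the
numbers line up: 44 nodes times 32 cores is 1408 concurrently running tasks,
which is exactly why the pending range starts at task 1409.

```shell
# Sanity check: with one core per task, a full cluster runs
# (nodes * cores_per_node) array tasks concurrently.
cores_per_node=32
nodes=44                             # 1408 cores total / 32 per node
echo $(( nodes * cores_per_node ))   # prints 1408; pending starts at 1409
```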

$ sleep 30; squeue -a |more
             JOBID PARTITION     NAME     USER ST       TIME  NODES
NODELIST(REASON)
4096654_[1409-5000    normal     wrap  xxxxxx PD       0:00      1
(Resources)
       4096654_793    normal     wrap  xxxxxx R       0:03      1 f-12-34
       4096654_794    normal     wrap  xxxxxx R       0:03      1 f-12-34

After that, it failed to fill up the cluster: from this point on, no more
than one node is ever scheduled.

$ sleep 120; squeue -a|more
             JOBID PARTITION     NAME     USER ST       TIME  NODES
NODELIST(REASON)
4096654_[1441-5000    normal     wrap  xxxxxx PD       0:00      1
(Resources)
      4096654_1416    normal     wrap  xxxxxx R       1:17      1 f-12-5

However, if I switched to SelectType=select/cons_res, it works fine.
At first, it schedules the first 102 tasks:

$ squeue -a |more
             JOBID PARTITION     NAME     USER ST       TIME  NODES
NODELIST(REASON)
4098287_[102-5000]    normal     wrap  xxxxxx PD       0:00      1 (None)
         4098287_1    normal     wrap  xxxxxx R       0:10      1 f-15-12
. . .

And then, at the next scheduler run, it fills up the cluster:

$ sleep 30; squeue -a |more
             JOBID PARTITION     NAME     USER ST       TIME  NODES
NODELIST(REASON)
4098287_[1409-5000    normal     wrap  xxxxxx PD       0:00      1
(Resources)
      4098287_1389    normal     wrap  xxxxxx R       0:18      1 f-15-10

On the next scheduling pass, it still fills up the cluster as expected:

$ sleep 120; squeue -a|more
             JOBID PARTITION     NAME     USER ST       TIME  NODES
NODELIST(REASON)
4098287_[2817-5000    normal     wrap  xxxxxx PD       0:00      1
(Resources)
      4098287_2728    normal     wrap  xxxxxx R       1:02      1 f-12-40
. . .

Regards,
- Chansup



On Thu, Jan 7, 2016 at 6:14 AM, Daniel Letai <[email protected]> wrote:

>
> Your MaxJobCount/MinJobAge combo might be too high, and the slurmctld is
> exhausting physical memory, resorting to swap, which slows it down enough
> to exceed its scheduling loop time window.
> You might wish to increase the scheduling loop duration as per
> http://slurm.schedmd.com/slurm.conf.html#OPT_SchedulerParameters
> and specifically:
> http://slurm.schedmd.com/slurm.conf.html#OPT_max_sched_time=#
> possibly also
> http://slurm.schedmd.com/slurm.conf.html#OPT_bf_yield_interval=#
> http://slurm.schedmd.com/slurm.conf.html#OPT_build_queue_timeout=#
> Although the last 2 seem less likely (sleep has no dependencies, and
> backfill is likely not playing a role).
>
> Other options - From http://slurm.schedmd.com/job_array.html
> The sched/backfill plugin has been modified to improve performance with
> job arrays. Once one element of a job array is discovered to not be runnable
> or impact the scheduling of pending jobs, the remaining elements of that
> job array will be quickly skipped.
>
> Have you enabled backfill debugging flags to verify this is not happening
> for some reason?
>
>
>
> On 01/06/2016 08:12 PM, CB wrote:
>
>> slurm job array limit?
>>
>> Hi,
>>
>> I'm running Slurm 15.08.1 version.
>>
>> When I submitted a job array with 5000 tasks, it only scheduled the first
>> 102 tasks although there are plenty of slots available.
>>
>> sbatch --array=1-5000 -o /dev/null --wrap="/bin/sleep 120"
>>
>> The slurmctld log says:
>>
>> [2016-01-06T12:43:43.496] debug:  sched: already tested 102 jobs,
>> breaking out
>>
>> Then, after a while, the scheduler dispatched some 1000 tasks and said:
>>
>> [2016-01-06T12:44:24.003] debug:  sched: loop taking too long, breaking
>> out
>> [2016-01-06T12:44:24.004] debug:  Note large processing time from
>> schedule: usec=1439516 began=12:44:22.564
>> [2016-01-06T12:44:24.070] debug:  Note large processing time from
>> _slurmctld_background: usec=1531381 began=12:44:22.538
>>
>> After that, Slurm schedules the remaining tasks on only one compute node.
>>
>> Has anyone seen this behavior?
>>
>> Currently we've set the following Slurm parameters:
>> MaxArraySize=100000
>> MaxJobCount=2500000
>>
>> Thanks,
>> - Chansup
>>
>
