Nicholas,

Why do you have "SchedulerParameters     = (null)"?
I did not set these parameters, so I assume "(null)" means all the default values are used.
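For reference, that line comes straight from the attached `scontrol show config` output; it can be re-checked at any time on the head node with something like:

    scontrol show config | grep SchedulerParameters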

John,

Thanks, I'll try that, and look into these SchedulerParameters more.
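For the record, the change I plan to test (untested on my side yet) is the line below in slurm.conf on the controller, taken from John's suggestion, followed by an "scontrol reconfigure":

    # slurm.conf (controller)
    # defer       : skip the per-job scheduling attempt at submit time
    # max_rpc_cnt : back off scheduling cycles while this many RPCs are pending
    SchedulerParameters=max_rpc_cnt=150,defer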

Cheers,
Colas

On 2018-01-12 09:08, John DeSantis wrote:
Colas,

We had a similar experience a long time ago, and we solved it by adding
the following SchedulerParameters:

max_rpc_cnt=150,defer

HTH,
John DeSantis

On Thu, 11 Jan 2018 16:39:43 -0500
Colas Rivière <rivi...@umdgrb.umd.edu> wrote:

Hello,

I'm managing a small cluster (one head node, 24 workers, 1160 total
worker threads). The head node has two E5-2680 v3 CPUs
(hyper-threaded), ~100 GB of memory and spinning disks.
The head node occasionally becomes less responsive when there are
more than 10k jobs in the queue, and becomes really unmanageable when
the queue reaches 100k jobs, with error messages such as:
sbatch: error: Slurm temporarily unable to accept job, sleeping and
retrying.
or
Running: slurm_load_jobs error: Socket timed out on send/recv
operation
Is it normal to experience slowdowns when the queue reaches a few
tens of thousands of jobs? What limit should I expect? Would adding an
SSD drive for SlurmdSpoolDir help? What can be done to push this limit?

The cluster runs Slurm 17.02.4 on CentOS 6 and the config is attached
(from `scontrol show config`).

Thanks,
Colas
