Hi,

I am trying to use Slurm as a resource manager, but I am running into problems
when submitting more than 10,000 jobs to the queue. Each job is queued by
issuing a separate sbatch command, which works well up to a few thousand jobs,
but then I begin seeing the error

  sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying.

Many jobs still get submitted after a few retries, but once around 9,980 jobs
are in the queue, some job(s) invariably hit the 15 MAX_RETRIES limit and exit
with the error

  sbatch: error: Batch job submission failed: Resource temporarily unavailable

Is Slurm not suited to handling tens of thousands of queued jobs? Or are there
configuration or job-submission changes I could make that would allow Slurm to
handle up to 50K jobs?

Details of my current setup are below. A separate partition is specified for
each worker to allow later scaling of the cluster, where multiple nodes would
be assigned to each partition.

Thank you very much for any help!
Slurm version: 2.2.7
### slurm.conf ###
ClusterName=wgs
ControlMachine=wgmaster
SlurmUser=slurm
SlurmctldPort=6818
SlurmdPort=6817
AuthType=auth/munge
StateSaveLocation=/tmp
SlurmdSpoolDir=/tmp/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
CacheGroups=0
ReturnToService=0
## TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=6000
KillWait=30
Waittime=0
MessageTimeout=60
## SCHEDULING
SchedulerType=sched/backfill
SchedulerParameters=defer
SelectType=select/linear
FastSchedule=1
## LOGGING
SlurmctldDebug=5
SlurmctldLogFile=/var/log/slurmctld
SlurmdDebug=5
SlurmdLogFile=/var/log/slurmd
JobCompType=jobcomp/filetxt
## COMPUTE NODES
NodeName=wgmaster Procs=1 State=UNKNOWN
NodeName=wgnode1 NodeHostname=wgnode1 Procs=1 State=UNKNOWN
NodeName=wgnode2 NodeHostname=wgnode2 Procs=1 State=UNKNOWN
NodeName=wgnode3 NodeHostname=wgnode3 Procs=1 State=UNKNOWN
NodeName=wgnode4 NodeHostname=wgnode4 Procs=1 State=UNKNOWN
NodeName=wgnode5 NodeHostname=wgnode5 Procs=1 State=UNKNOWN
## PARTITIONS
PartitionName=all Nodes=wgmaster,wgnode[1-5] Default=NO MaxTime=INFINITE State=UP
PartitionName=worker Nodes=wgnode[1-5] Default=YES MaxTime=INFINITE State=UP
PartitionName=dbhost Nodes=wgmaster Default=NO MaxTime=INFINITE State=UP
PartitionName=p1 Nodes=wgnode[1] Default=NO MaxTime=INFINITE State=UP
PartitionName=p2 Nodes=wgnode[2] Default=NO MaxTime=INFINITE State=UP
PartitionName=p3 Nodes=wgnode[3] Default=NO MaxTime=INFINITE State=UP
PartitionName=p4 Nodes=wgnode[4] Default=NO MaxTime=INFINITE State=UP
PartitionName=p5 Nodes=wgnode[5] Default=NO MaxTime=INFINITE State=UP
### An example of the type of command being issued to sbatch (a script issues thousands of these commands in series) ###
sbatch --job-name=j1 --partition=wgnode1 --error=./log/j1.err --output=./log/j1.out -vvvvv --share ./bin/dowork.sh j1
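The submitting script is essentially the following loop, sketched here in a
simplified form (the function name `submit_all` and the `SBATCH` override are
illustrative additions so the loop can be dry-run; the real script derives the
job names from the workload):

```shell
# Simplified sketch of the submission loop: one sbatch call per job, in series.
# Override SBATCH (e.g. SBATCH=echo submit_all 3) to dry-run without a cluster.
submit_all() {
    n="$1"   # number of jobs to submit
    i=1
    while [ "$i" -le "$n" ]; do
        "${SBATCH:-sbatch}" --job-name="j$i" --partition=wgnode1 \
            --error="./log/j$i.err" --output="./log/j$i.out" \
            -vvvvv --share ./bin/dowork.sh "j$i"
        i=$((i + 1))
    done
}
```

With n around 10,000 this is where the retries and the eventual submission
failure described above appear.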
### And the example output logged by sbatch when it errors ###
sbatch: defined options for program `sbatch'
sbatch: --------------------------------------
sbatch: user              : `cluster'
sbatch: uid               : 2113
sbatch: gid               : 2113
sbatch: cwd               : /tmp/slurmtest
sbatch: ntasks            : 1 (default)
sbatch: cpus_per_task     : 1 (default)
sbatch: nodes             : 1 (default)
sbatch: jobid             : 4294967294 (default)
sbatch: partition         : wgnode1
sbatch: job name          : `j1'
sbatch: reservation       : `(null)'
sbatch: wckey             : `(null)'
sbatch: distribution      : unknown
sbatch: verbose           : 8
sbatch: immediate         : false
sbatch: overcommit        : false
sbatch: account           : (null)
sbatch: comment           : (null)
sbatch: dependency        : (null)
sbatch: qos               : (null)
sbatch: constraints       : mincpus=1
sbatch: geometry          : (null)
sbatch: reboot            : yes
sbatch: rotate            : no
sbatch: network           : (null)
sbatch: mail_type         : NONE
sbatch: mail_user         : (null)
sbatch: sockets-per-node  : -2
sbatch: cores-per-socket  : -2
sbatch: threads-per-core  : -2
sbatch: ntasks-per-node   : 0
sbatch: ntasks-per-socket : -2
sbatch: ntasks-per-core   : -2
sbatch: cpu_bind          : default
sbatch: mem_bind          : default
sbatch: plane_size        : 4294967294
sbatch: propagate         : NONE
sbatch: remote command    : `/tmp/slurmtest/./bin/dowork.sh'
sbatch: debug: propagating RLIMIT_CPU=18446744073709551615
sbatch: debug: propagating RLIMIT_FSIZE=18446744073709551615
sbatch: debug: propagating RLIMIT_DATA=18446744073709551615
sbatch: debug: propagating RLIMIT_STACK=8388608
sbatch: debug: propagating RLIMIT_CORE=0
sbatch: debug: propagating RLIMIT_RSS=18446744073709551615
sbatch: debug: propagating RLIMIT_NPROC=61504
sbatch: debug: propagating RLIMIT_NOFILE=8192
sbatch: debug: propagating RLIMIT_MEMLOCK=32768
sbatch: debug: propagating RLIMIT_AS=18446744073709551615
sbatch: debug: propagating SLURM_PRIO_PROCESS=0
sbatch: debug: propagating SUBMIT_DIR=/tmp/slurmtest
sbatch: debug: propagating UMASK=0002
sbatch: debug3: Trying to load plugin /usr/lib/slurm/auth_munge.so
sbatch: auth plugin for Munge (http://home.gna.org/munge/) loaded
sbatch: debug3: Success.
sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_linear.so
sbatch: debug3: Success.
sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_cray.so
sbatch: debug3: Success.
sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_cons_res.so
sbatch: debug3: Success.
sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_bluegene.so
sbatch: debug3: Success.
sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_bgq.so
sbatch: debug3: Success.
sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: error: Batch job submission failed: Resource temporarily unavailable