Upgrade to Slurm v2.4 and decrease MinJobAge, SlurmdDebug and SlurmctldDebug. With MinJobAge=6000, completed jobs are kept in slurmctld's records for 100 minutes, and those records count against MaxJobCount, whose default is 10,000; that is why submissions start failing near the 10,000-job mark. Debug level 5 also makes both daemons log heavily, slowing job acceptance under load.
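Concretely, the relevant slurm.conf changes might look like the following (the exact values are suggestions based on the defaults, not tested on your cluster):

```
# Purge completed jobs after 5 minutes (the default) instead of 100,
# so they stop counting against MaxJobCount sooner.
MinJobAge=300

# Reduce logging from debug (5) to info (3) to lighten slurmctld/slurmd load.
SlurmctldDebug=3
SlurmdDebug=3

# Optionally raise the ceiling on jobs slurmctld will track at once
# (defaults to 10000) to match your 50K target.
MaxJobCount=50000
```

After editing, restart slurmctld or run `scontrol reconfigure` for the changes to take effect.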
Quoting Cory McLean <[email protected]>:

> hi,
>
> I am trying to use slurm as a resource manager, but am running into
> problems when trying to submit over 10,000 jobs to the queue. Each
> job is queued by issuing a separate sbatch command, which works well
> up to a few thousand jobs, but then I begin seeing the error
>
>     sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying.
>
> Many jobs still get submitted after a few retries, but when around
> 9,980 jobs are in the queue, invariably some job(s) hit the 15
> MAX_RETRIES and exit with the error
>
>     sbatch: error: Batch job submission failed: Resource temporarily unavailable
>
> Is slurm not suited to handling tens of thousands of jobs? Or are
> there some configuration/job submission changes I could make to
> allow slurm to handle up to 50K jobs?
>
> Details of my current setup are as follows. A partition is
> specified for each worker for later scaling of the cluster, where
> multiple nodes would be assigned to each partition.
>
> Thank you very much for any help!
> Slurm version: 2.2.7
>
> ### slurm.conf ###
> ClusterName=wgs
> ControlMachine=wgmaster
> SlurmUser=slurm
> SlurmctldPort=6818
> SlurmdPort=6817
> AuthType=auth/munge
> StateSaveLocation=/tmp
> SlurmdSpoolDir=/tmp/slurmd
> SwitchType=switch/none
> MpiDefault=none
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmdPidFile=/var/run/slurmd.pid
> ProctrackType=proctrack/pgid
> CacheGroups=0
> ReturnToService=0
>
> ## TIMERS
> SlurmctldTimeout=300
> SlurmdTimeout=300
> InactiveLimit=0
> MinJobAge=6000
> KillWait=30
> Waittime=0
> MessageTimeout=60
>
> ## SCHEDULING
> SchedulerType=sched/backfill
> SchedulerParameters=defer
> SelectType=select/linear
> FastSchedule=1
>
> ## LOGGING
> SlurmctldDebug=5
> SlurmctldLogFile=/var/log/slurmctld
> SlurmdDebug=5
> SlurmdLogFile=/var/log/slurmd
> JobCompType=jobcomp/filetxt
>
> ## COMPUTE NODES
> NodeName=wgmaster Procs=1 State=UNKNOWN
> NodeName=wgnode1 NodeHostname=wgnode1 Procs=1 State=UNKNOWN
> NodeName=wgnode2 NodeHostname=wgnode2 Procs=1 State=UNKNOWN
> NodeName=wgnode3 NodeHostname=wgnode3 Procs=1 State=UNKNOWN
> NodeName=wgnode4 NodeHostname=wgnode4 Procs=1 State=UNKNOWN
> NodeName=wgnode5 NodeHostname=wgnode5 Procs=1 State=UNKNOWN
>
> ## PARTITIONS
> PartitionName=all Nodes=wgmaster,wgnode[1-5] Default=NO MaxTime=INFINITE State=UP
> PartitionName=worker Nodes=wgnode[1-5] Default=YES MaxTime=INFINITE State=UP
> PartitionName=dbhost Nodes=wgmaster Default=NO MaxTime=INFINITE State=UP
> PartitionName=p1 Nodes=wgnode[1] Default=NO MaxTime=INFINITE State=UP
> PartitionName=p2 Nodes=wgnode[2] Default=NO MaxTime=INFINITE State=UP
> PartitionName=p3 Nodes=wgnode[3] Default=NO MaxTime=INFINITE State=UP
> PartitionName=p4 Nodes=wgnode[4] Default=NO MaxTime=INFINITE State=UP
> PartitionName=p5 Nodes=wgnode[5] Default=NO MaxTime=INFINITE State=UP
>
> ### An example of the type of command being issued to sbatch (a
> ### script tries to issue thousands of these commands in series) ###
> sbatch --job-name=j1 --partition=wgnode1 --error=./log/j1.err --output=./log/j1.out -vvvvv --share ./bin/dowork.sh j1
>
> ### And the example output logged by sbatch when it errors: ###
> sbatch: defined options for program `sbatch'
> sbatch: ----------------- ---------------------
> sbatch: user              : `cluster'
> sbatch: uid               : 2113
> sbatch: gid               : 2113
> sbatch: cwd               : /tmp/slurmtest
> sbatch: ntasks            : 1 (default)
> sbatch: cpus_per_task     : 1 (default)
> sbatch: nodes             : 1 (default)
> sbatch: jobid             : 4294967294 (default)
> sbatch: partition         : wgnode1
> sbatch: job name          : `j1'
> sbatch: reservation       : `(null)'
> sbatch: wckey             : `(null)'
> sbatch: distribution      : unknown
> sbatch: verbose           : 8
> sbatch: immediate         : false
> sbatch: overcommit        : false
> sbatch: account           : (null)
> sbatch: comment           : (null)
> sbatch: dependency        : (null)
> sbatch: qos               : (null)
> sbatch: constraints       : mincpus=1
> sbatch: geometry          : (null)
> sbatch: reboot            : yes
> sbatch: rotate            : no
> sbatch: network           : (null)
> sbatch: mail_type         : NONE
> sbatch: mail_user         : (null)
> sbatch: sockets-per-node  : -2
> sbatch: cores-per-socket  : -2
> sbatch: threads-per-core  : -2
> sbatch: ntasks-per-node   : 0
> sbatch: ntasks-per-socket : -2
> sbatch: ntasks-per-core   : -2
> sbatch: cpu_bind          : default
> sbatch: mem_bind          : default
> sbatch: plane_size        : 4294967294
> sbatch: propagate         : NONE
> sbatch: remote command    : `/tmp/slurmtest/./bin/dowork.sh'
> sbatch: debug: propagating RLIMIT_CPU=18446744073709551615
> sbatch: debug: propagating RLIMIT_FSIZE=18446744073709551615
> sbatch: debug: propagating RLIMIT_DATA=18446744073709551615
> sbatch: debug: propagating RLIMIT_STACK=8388608
> sbatch: debug: propagating RLIMIT_CORE=0
> sbatch: debug: propagating RLIMIT_RSS=18446744073709551615
> sbatch: debug: propagating RLIMIT_NPROC=61504
> sbatch: debug: propagating RLIMIT_NOFILE=8192
> sbatch: debug: propagating RLIMIT_MEMLOCK=32768
> sbatch: debug: propagating RLIMIT_AS=18446744073709551615
> sbatch: debug: propagating SLURM_PRIO_PROCESS=0
> sbatch: debug: propagating SUBMIT_DIR=/tmp/slurmtest
> sbatch: debug: propagating UMASK=0002
> sbatch: debug3: Trying to load plugin /usr/lib/slurm/auth_munge.so
> sbatch: auth plugin for Munge (http://home.gna.org/munge/) loaded
> sbatch: debug3: Success.
> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_linear.so
> sbatch: debug3: Success.
> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_cray.so
> sbatch: debug3: Success.
> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_cons_res.so
> sbatch: debug3: Success.
> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_bluegene.so
> sbatch: debug3: Success.
> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_bgq.so
> sbatch: debug3: Success.
> sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: error: Batch job submission failed: Resource temporarily unavailable
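As a stopgap on the submission side, the driver script could also throttle itself below the controller's job ceiling instead of relying on sbatch's internal retries. A minimal sketch (hypothetical helper names; assumes the default MaxJobCount of 10,000 and a passwordless `squeue`/`sbatch` on the submit host):

```python
#!/usr/bin/env python3
"""Hypothetical throttled submitter: keeps headroom below MaxJobCount
so sbatch never hits 'Resource temporarily unavailable'. A sketch,
not tested against a real cluster."""
import subprocess
import time


def should_submit(queued, max_job_count=10000, margin=500):
    """Only submit while the queue is at least `margin` jobs below
    slurmctld's MaxJobCount (completed-but-unpurged jobs count too)."""
    return queued < max_job_count - margin


def queued_jobs(user):
    """Count this user's jobs; `squeue -h` prints one line per job."""
    out = subprocess.run(["squeue", "-h", "-u", user],
                         capture_output=True, text=True, check=True).stdout
    return len(out.splitlines())


def submit_all(scripts, user):
    """Submit each batch script, pausing whenever the queue is near full."""
    for script in scripts:
        while not should_submit(queued_jobs(user)):
            time.sleep(30)  # wait for MinJobAge to purge completed jobs
        subprocess.run(["sbatch", script], check=True)
```

With MinJobAge lowered as suggested above, the sleep loop clears quickly; the margin is a guess and can be tuned to taste.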
