Fantastic, thanks for your help! I confirmed that after setting MaxJobCount in slurm.conf to 100000 (and lowering the SlurmctldDebug and SlurmdDebug levels to 1 per the other suggestion, though perhaps that was not necessary), jobs are being submitted without a hitch.
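For anyone else hitting the same wall, this is a sketch of the slurm.conf excerpt I ended up with, using the values described above (everything else unchanged):

```
# slurm.conf (excerpt)
# Raise the ceiling on jobs the controller will track; the compiled-in
# default (DEFAULT_MAX_JOB_COUNT in read_config.h) is 10000.
MaxJobCount=100000
# Quiet the daemons down from the debug-heavy level 5.
SlurmctldDebug=1
SlurmdDebug=1
```

Note that, at least in the versions I have seen, a change to MaxJobCount requires restarting slurmctld rather than just running `scontrol reconfigure`; `scontrol show config | grep MaxJobCount` confirms the value actually in effect.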
> Date: Thu, 12 Jul 2012 17:45:03 -0600
> From: [email protected]
> To: [email protected]
> Subject: [slurm-dev] Re: sbatch error: Resource temporarily unavailable when
>   queue has 10K jobs
>
>
> Or just change the slurm.conf file...
>
> Quoting "Huang, Perry" <[email protected]>:
>
> >
> > Hi
> >
> > In read_config.h, the default value for the max number of jobs is 10,000.
> > This can be fixed if you recompile with a higher value.
> >
> > line 89    #define DEFAULT_MAX_JOB_COUNT 10000
> >
> > Perry Huang
> > [email protected]
> > Lawrence Livermore National Laboratory
> >
> >
> >
> > On Jul 12, 2012, at 2:14 PM, Cory McLean wrote:
> >
> >> hi,
> >>
> >> I am trying to use slurm as a resource manager, but am running into
> >> problems when trying to submit over 10,000 jobs to the queue. Each
> >> job is queued by issuing a separate sbatch command, which works
> >> well up to a few thousand jobs but then I begin seeing the error
> >>
> >> sbatch: error: Slurm temporarily unable to accept job, sleeping and
> >> retrying.
> >>
> >> Many jobs still get submitted after a few retries, but when around
> >> 9,980 jobs are in the queue, invariably some job(s) hit the 15
> >> MAX_RETRIES and exit with the error
> >>
> >> sbatch: error: Batch job submission failed: Resource temporarily
> >> unavailable
> >>
> >> Is slurm not suited to handling tens of thousands of jobs? Or are
> >> there some configuration/job submission changes I could make to
> >> allow slurm to handle up to 50K jobs?
> >>
> >> Details of my current setup are as follows. A partition is
> >> specified for each worker for later scaling of the cluster, where
> >> multiple nodes would be assigned to each partition.
> >>
> >> Thank you very much for any help!
> >>
> >> Slurm version: 2.2.7
> >>
> >> ### slurm.conf ###
> >> #
> >> ClusterName=wgs
> >> ControlMachine=wgmaster
> >> SlurmUser=slurm
> >> SlurmctldPort=6818
> >> SlurmdPort=6817
> >> AuthType=auth/munge
> >> StateSaveLocation=/tmp
> >> SlurmdSpoolDir=/tmp/slurmd
> >> SwitchType=switch/none
> >> MpiDefault=none
> >> SlurmctldPidFile=/var/run/slurmctld.pid
> >> SlurmdPidFile=/var/run/slurmd.pid
> >> ProctrackType=proctrack/pgid
> >> CacheGroups=0
> >> ReturnToService=0
> >> #
> >> # TIMERS
> >> SlurmctldTimeout=300
> >> SlurmdTimeout=300
> >> InactiveLimit=0
> >> MinJobAge=6000
> >> KillWait=30
> >> Waittime=0
> >> MessageTimeout=60
> >> #
> >> # SCHEDULING
> >> SchedulerType=sched/backfill
> >> SchedulerParameters=defer
> >> SelectType=select/linear
> >> FastSchedule=1
> >> #
> >> # LOGGING
> >> SlurmctldDebug=5
> >> SlurmctldLogFile=/var/log/slurmctld
> >> SlurmdDebug=5
> >> SlurmdLogFile=/var/log/slurmd
> >> JobCompType=jobcomp/filetxt
> >> #
> >> # COMPUTE NODES
> >> NodeName=wgmaster Procs=1 State=UNKNOWN
> >> NodeName=wgnode1 NodeHostname=wgnode1 Procs=1 State=UNKNOWN
> >> NodeName=wgnode2 NodeHostname=wgnode2 Procs=1 State=UNKNOWN
> >> NodeName=wgnode3 NodeHostname=wgnode3 Procs=1 State=UNKNOWN
> >> NodeName=wgnode4 NodeHostname=wgnode4 Procs=1 State=UNKNOWN
> >> NodeName=wgnode5 NodeHostname=wgnode5 Procs=1 State=UNKNOWN
> >> #
> >> # PARTITIONS
> >> PartitionName=all Nodes=wgmaster,wgnode[1-5] Default=NO MaxTime=INFINITE State=UP
> >> PartitionName=worker Nodes=wgnode[1-5] Default=YES MaxTime=INFINITE State=UP
> >> PartitionName=dbhost Nodes=wgmaster Default=NO MaxTime=INFINITE State=UP
> >> PartitionName=p1 Nodes=wgnode[1] Default=NO MaxTime=INFINITE State=UP
> >> PartitionName=p2 Nodes=wgnode[2] Default=NO MaxTime=INFINITE State=UP
> >> PartitionName=p3 Nodes=wgnode[3] Default=NO MaxTime=INFINITE State=UP
> >> PartitionName=p4 Nodes=wgnode[4] Default=NO MaxTime=INFINITE State=UP
> >> PartitionName=p5 Nodes=wgnode[5] Default=NO MaxTime=INFINITE State=UP
> >>
> >>
> >> ### An example of the type of command being issued to sbatch (a
> >> script tries to issue thousands of these commands in series) ###
> >> sbatch --job-name=j1 --partition=wgnode1 --error=./log/j1.err
> >> --output=./log/j1.out -vvvvv --share ./bin/dowork.sh j1
> >>
> >>
> >> ### And the example output logged by sbatch when it errors: ###
> >> sbatch: defined options for program `sbatch'
> >> sbatch: ----------------- ---------------------
> >> sbatch: user              : `cluster'
> >> sbatch: uid               : 2113
> >> sbatch: gid               : 2113
> >> sbatch: cwd               : /tmp/slurmtest
> >> sbatch: ntasks            : 1 (default)
> >> sbatch: cpus_per_task     : 1 (default)
> >> sbatch: nodes             : 1 (default)
> >> sbatch: jobid             : 4294967294 (default)
> >> sbatch: partition         : wgnode1
> >> sbatch: job name          : `j1'
> >> sbatch: reservation       : `(null)'
> >> sbatch: wckey             : `(null)'
> >> sbatch: distribution      : unknown
> >> sbatch: verbose           : 8
> >> sbatch: immediate         : false
> >> sbatch: overcommit        : false
> >> sbatch: account           : (null)
> >> sbatch: comment           : (null)
> >> sbatch: dependency        : (null)
> >> sbatch: qos               : (null)
> >> sbatch: constraints       : mincpus=1
> >> sbatch: geometry          : (null)
> >> sbatch: reboot            : yes
> >> sbatch: rotate            : no
> >> sbatch: network           : (null)
> >> sbatch: mail_type         : NONE
> >> sbatch: mail_user         : (null)
> >> sbatch: sockets-per-node  : -2
> >> sbatch: cores-per-socket  : -2
> >> sbatch: threads-per-core  : -2
> >> sbatch: ntasks-per-node   : 0
> >> sbatch: ntasks-per-socket : -2
> >> sbatch: ntasks-per-core   : -2
> >> sbatch: cpu_bind          : default
> >> sbatch: mem_bind          : default
> >> sbatch: plane_size        : 4294967294
> >> sbatch: propagate         : NONE
> >> sbatch: remote command    : `/tmp/slurmtest/./bin/dowork.sh'
> >> sbatch: debug: propagating RLIMIT_CPU=18446744073709551615
> >> sbatch: debug: propagating RLIMIT_FSIZE=18446744073709551615
> >> sbatch: debug: propagating RLIMIT_DATA=18446744073709551615
> >> sbatch: debug: propagating RLIMIT_STACK=8388608
> >> sbatch: debug: propagating RLIMIT_CORE=0
> >> sbatch: debug: propagating RLIMIT_RSS=18446744073709551615
> >> sbatch: debug: propagating RLIMIT_NPROC=61504
> >> sbatch: debug: propagating RLIMIT_NOFILE=8192
> >> sbatch: debug: propagating RLIMIT_MEMLOCK=32768
> >> sbatch: debug: propagating RLIMIT_AS=18446744073709551615
> >> sbatch: debug: propagating SLURM_PRIO_PROCESS=0
> >> sbatch: debug: propagating SUBMIT_DIR=/tmp/slurmtest
> >> sbatch: debug: propagating UMASK=0002
> >> sbatch: debug3: Trying to load plugin /usr/lib/slurm/auth_munge.so
> >> sbatch: auth plugin for Munge (http://home.gna.org/munge/) loaded
> >> sbatch: debug3: Success.
> >> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_linear.so
> >> sbatch: debug3: Success.
> >> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_cray.so
> >> sbatch: debug3: Success.
> >> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_cons_res.so
> >> sbatch: debug3: Success.
> >> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_bluegene.so
> >> sbatch: debug3: Success.
> >> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_bgq.so
> >> sbatch: debug3: Success.
> >> sbatch: error: Slurm temporarily unable to accept job, sleeping and
> >> retrying.
> >> sbatch: debug: Slurm temporarily unable to accept job, sleeping
> >> and retrying.
> >> [the debug message above repeats 13 more times]
> >> sbatch: error: Batch job submission failed: Resource temporarily
> >> unavailable
> >>
> >>
> >
> >
