Or just change the slurm.conf file (set MaxJobCount)...

Quoting "Huang, Perry" <[email protected]>:
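A sketch of that slurm.conf route, assuming the 50K target from the question below (MaxJobCount is the run-time counterpart of the compiled-in DEFAULT_MAX_JOB_COUNT; the MinJobAge value here is an illustrative assumption, not from the thread):

```conf
# slurm.conf: raise the controller's job-count ceiling without recompiling.
MaxJobCount=50000
# Completed jobs keep counting against MaxJobCount until their records are
# purged; the config below keeps them for MinJobAge=6000 seconds. A smaller
# value frees slots sooner.
MinJobAge=300
```

Restarting slurmctld after the edit is the safe way to pick up the new limit.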
> Hi,
>
> In read_config.h, the default value for the maximum number of jobs is
> 10,000. This can be changed if you recompile with a higher value:
>
>     line 89:  #define DEFAULT_MAX_JOB_COUNT 10000
>
> Perry Huang
> [email protected]
> Lawrence Livermore National Laboratory
>
>
> On Jul 12, 2012, at 2:14 PM, Cory McLean wrote:
>
>> hi,
>>
>> I am trying to use slurm as a resource manager, but am running into
>> problems when trying to submit over 10,000 jobs to the queue. Each
>> job is queued by issuing a separate sbatch command, which works
>> well up to a few thousand jobs, but then I begin seeing the error
>>
>>     sbatch: error: Slurm temporarily unable to accept job, sleeping and
>>     retrying.
>>
>> Many jobs still get submitted after a few retries, but when around
>> 9,980 jobs are in the queue, invariably some job(s) hit the 15
>> MAX_RETRIES and exit with the error
>>
>>     sbatch: error: Batch job submission failed: Resource temporarily unavailable
>>
>> Is slurm not suited to handling tens of thousands of jobs? Or are
>> there some configuration or job-submission changes I could make to
>> allow slurm to handle up to 50K jobs?
>>
>> Details of my current setup are as follows. A partition is
>> specified for each worker for later scaling of the cluster, where
>> multiple nodes would be assigned to each partition.
>>
>> Thank you very much for any help!
>>
>> Slurm version: 2.2.7
>>
>> ### slurm.conf ###
>> #
>> ClusterName=wgs
>> ControlMachine=wgmaster
>> SlurmUser=slurm
>> SlurmctldPort=6818
>> SlurmdPort=6817
>> AuthType=auth/munge
>> StateSaveLocation=/tmp
>> SlurmdSpoolDir=/tmp/slurmd
>> SwitchType=switch/none
>> MpiDefault=none
>> SlurmctldPidFile=/var/run/slurmctld.pid
>> SlurmdPidFile=/var/run/slurmd.pid
>> ProctrackType=proctrack/pgid
>> CacheGroups=0
>> ReturnToService=0
>> #
>> # TIMERS
>> SlurmctldTimeout=300
>> SlurmdTimeout=300
>> InactiveLimit=0
>> MinJobAge=6000
>> KillWait=30
>> Waittime=0
>> MessageTimeout=60
>> #
>> # SCHEDULING
>> SchedulerType=sched/backfill
>> SchedulerParameters=defer
>> SelectType=select/linear
>> FastSchedule=1
>> #
>> # LOGGING
>> SlurmctldDebug=5
>> SlurmctldLogFile=/var/log/slurmctld
>> SlurmdDebug=5
>> SlurmdLogFile=/var/log/slurmd
>> JobCompType=jobcomp/filetxt
>> #
>> # COMPUTE NODES
>> NodeName=wgmaster Procs=1 State=UNKNOWN
>> NodeName=wgnode1 NodeHostname=wgnode1 Procs=1 State=UNKNOWN
>> NodeName=wgnode2 NodeHostname=wgnode2 Procs=1 State=UNKNOWN
>> NodeName=wgnode3 NodeHostname=wgnode3 Procs=1 State=UNKNOWN
>> NodeName=wgnode4 NodeHostname=wgnode4 Procs=1 State=UNKNOWN
>> NodeName=wgnode5 NodeHostname=wgnode5 Procs=1 State=UNKNOWN
>> #
>> # PARTITIONS
>> PartitionName=all Nodes=wgmaster,wgnode[1-5] Default=NO MaxTime=INFINITE State=UP
>> PartitionName=worker Nodes=wgnode[1-5] Default=YES MaxTime=INFINITE State=UP
>> PartitionName=dbhost Nodes=wgmaster Default=NO MaxTime=INFINITE State=UP
>> PartitionName=p1 Nodes=wgnode[1] Default=NO MaxTime=INFINITE State=UP
>> PartitionName=p2 Nodes=wgnode[2] Default=NO MaxTime=INFINITE State=UP
>> PartitionName=p3 Nodes=wgnode[3] Default=NO MaxTime=INFINITE State=UP
>> PartitionName=p4 Nodes=wgnode[4] Default=NO MaxTime=INFINITE State=UP
>> PartitionName=p5 Nodes=wgnode[5] Default=NO MaxTime=INFINITE State=UP
>>
>>
>> ### An example of the type of command being issued to sbatch (a
>> script tries to issue thousands of these commands in series) ###
>> sbatch --job-name=j1 --partition=wgnode1 --error=./log/j1.err --output=./log/j1.out -vvvvv --share ./bin/dowork.sh j1
>>
>>
>> ### And the example output logged by sbatch when it errors: ###
>> sbatch: defined options for program `sbatch'
>> sbatch: ----------------- ---------------------
>> sbatch: user              : `cluster'
>> sbatch: uid               : 2113
>> sbatch: gid               : 2113
>> sbatch: cwd               : /tmp/slurmtest
>> sbatch: ntasks            : 1 (default)
>> sbatch: cpus_per_task     : 1 (default)
>> sbatch: nodes             : 1 (default)
>> sbatch: jobid             : 4294967294 (default)
>> sbatch: partition         : wgnode1
>> sbatch: job name          : `j1'
>> sbatch: reservation       : `(null)'
>> sbatch: wckey             : `(null)'
>> sbatch: distribution      : unknown
>> sbatch: verbose           : 8
>> sbatch: immediate         : false
>> sbatch: overcommit        : false
>> sbatch: account           : (null)
>> sbatch: comment           : (null)
>> sbatch: dependency        : (null)
>> sbatch: qos               : (null)
>> sbatch: constraints       : mincpus=1
>> sbatch: geometry          : (null)
>> sbatch: reboot            : yes
>> sbatch: rotate            : no
>> sbatch: network           : (null)
>> sbatch: mail_type         : NONE
>> sbatch: mail_user         : (null)
>> sbatch: sockets-per-node  : -2
>> sbatch: cores-per-socket  : -2
>> sbatch: threads-per-core  : -2
>> sbatch: ntasks-per-node   : 0
>> sbatch: ntasks-per-socket : -2
>> sbatch: ntasks-per-core   : -2
>> sbatch: cpu_bind          : default
>> sbatch: mem_bind          : default
>> sbatch: plane_size        : 4294967294
>> sbatch: propagate         : NONE
>> sbatch: remote command    : `/tmp/slurmtest/./bin/dowork.sh'
>> sbatch: debug: propagating RLIMIT_CPU=18446744073709551615
>> sbatch: debug: propagating RLIMIT_FSIZE=18446744073709551615
>> sbatch: debug: propagating RLIMIT_DATA=18446744073709551615
>> sbatch: debug: propagating RLIMIT_STACK=8388608
>> sbatch: debug: propagating RLIMIT_CORE=0
>> sbatch: debug: propagating RLIMIT_RSS=18446744073709551615
>> sbatch: debug: propagating RLIMIT_NPROC=61504
>> sbatch: debug: propagating RLIMIT_NOFILE=8192
>> sbatch: debug: propagating RLIMIT_MEMLOCK=32768
>> sbatch: debug: propagating RLIMIT_AS=18446744073709551615
>> sbatch: debug: propagating SLURM_PRIO_PROCESS=0
>> sbatch: debug: propagating SUBMIT_DIR=/tmp/slurmtest
>> sbatch: debug: propagating UMASK=0002
>> sbatch: debug3: Trying to load plugin /usr/lib/slurm/auth_munge.so
>> sbatch: auth plugin for Munge (http://home.gna.org/munge/) loaded
>> sbatch: debug3: Success.
>> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_linear.so
>> sbatch: debug3: Success.
>> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_cray.so
>> sbatch: debug3: Success.
>> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_cons_res.so
>> sbatch: debug3: Success.
>> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_bluegene.so
>> sbatch: debug3: Success.
>> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_bgq.so
>> sbatch: debug3: Success.
>> sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying.
>> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
>> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
>> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
>> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
>> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
>> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
>> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
>> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
>> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
>> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
>> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
>> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
>> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
>> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
>> sbatch: error: Batch job submission failed: Resource temporarily unavailable
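As an alternative to raising the limit, the submitting script itself could throttle. A minimal sketch, not from the thread: pause whenever the queue depth nears the controller's job-count ceiling instead of relying on sbatch's 15 internal retries. The limit, job names, partition, and script path below are illustrative assumptions.

```shell
#!/bin/sh
# Throttled bulk submitter (sketch). submit_throttled N LIMIT submits N
# jobs, waiting whenever the queue holds LIMIT or more entries. Note that
# completed jobs only stop counting once slurmctld purges their records
# (after MinJobAge seconds).
submit_throttled() {
    njobs=$1    # how many jobs to submit
    limit=$2    # stay safely below MaxJobCount (e.g. 9500 for the default 10000)
    i=1
    while [ "$i" -le "$njobs" ]; do
        # Block until the controller has room for another job record.
        while [ "$(squeue -h | wc -l)" -ge "$limit" ]; do
            sleep 30
        done
        sbatch --job-name="j$i" --partition="p1" \
               --error="./log/j$i.err" --output="./log/j$i.out" \
               ./bin/dowork.sh "j$i"
        i=$((i + 1))
    done
}
```

Invoked as, e.g., `submit_throttled 50000 9500`, this trades some submission speed for never tripping the "Resource temporarily unavailable" failure.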
