Hi,

In read_config.h, the default value for the maximum number of jobs is 10,000. You can raise it by recompiling with a higher value:
line 89: #define DEFAULT_MAX_JOB_COUNT 10000

Perry Huang
[email protected]
Lawrence Livermore National Laboratory

On Jul 12, 2012, at 2:14 PM, Cory McLean wrote:

> hi,
>
> I am trying to use slurm as a resource manager, but am running into problems
> when trying to submit over 10,000 jobs to the queue. Each job is queued by
> issuing a separate sbatch command, which works well up to a few thousand jobs
> but then I begin seeing the error
>
> sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying.
>
> Many jobs still get submitted after a few retries, but when around 9,980 jobs
> are in the queue, invariably some job(s) hit the 15 MAX_RETRIES and exit with
> the error
>
> sbatch: error: Batch job submission failed: Resource temporarily unavailable
>
> Is slurm not suited to handling tens of thousands of jobs? Or are there some
> configuration/job submission changes I could make to allow slurm to handle up
> to 50K jobs?
>
> Details of my current setup are as follows. A partition is specified for
> each worker for later scaling of the cluster, where multiple nodes would be
> assigned to each partition.
>
> Thank you very much for any help!
>
> Slurm version: 2.2.7
>
> ### slurm.conf ###
> #
> ClusterName=wgs
> ControlMachine=wgmaster
> SlurmUser=slurm
> SlurmctldPort=6818
> SlurmdPort=6817
> AuthType=auth/munge
> StateSaveLocation=/tmp
> SlurmdSpoolDir=/tmp/slurmd
> SwitchType=switch/none
> MpiDefault=none
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmdPidFile=/var/run/slurmd.pid
> ProctrackType=proctrack/pgid
> CacheGroups=0
> ReturnToService=0
> #
> # TIMERS
> SlurmctldTimeout=300
> SlurmdTimeout=300
> InactiveLimit=0
> MinJobAge=6000
> KillWait=30
> Waittime=0
> MessageTimeout=60
> #
> # SCHEDULING
> SchedulerType=sched/backfill
> SchedulerParameters=defer
> SelectType=select/linear
> FastSchedule=1
> #
> # LOGGING
> SlurmctldDebug=5
> SlurmctldLogFile=/var/log/slurmctld
> SlurmdDebug=5
> SlurmdLogFile=/var/log/slurmd
> JobCompType=jobcomp/filetxt
> #
> # COMPUTE NODES
> NodeName=wgmaster Procs=1 State=UNKNOWN
> NodeName=wgnode1 NodeHostname=wgnode1 Procs=1 State=UNKNOWN
> NodeName=wgnode2 NodeHostname=wgnode2 Procs=1 State=UNKNOWN
> NodeName=wgnode3 NodeHostname=wgnode3 Procs=1 State=UNKNOWN
> NodeName=wgnode4 NodeHostname=wgnode4 Procs=1 State=UNKNOWN
> NodeName=wgnode5 NodeHostname=wgnode5 Procs=1 State=UNKNOWN
> #
> # PARTITIONS
> PartitionName=all Nodes=wgmaster,wgnode[1-5] Default=NO MaxTime=INFINITE State=UP
> PartitionName=worker Nodes=wgnode[1-5] Default=YES MaxTime=INFINITE State=UP
> PartitionName=dbhost Nodes=wgmaster Default=NO MaxTime=INFINITE State=UP
> PartitionName=p1 Nodes=wgnode[1] Default=NO MaxTime=INFINITE State=UP
> PartitionName=p2 Nodes=wgnode[2] Default=NO MaxTime=INFINITE State=UP
> PartitionName=p3 Nodes=wgnode[3] Default=NO MaxTime=INFINITE State=UP
> PartitionName=p4 Nodes=wgnode[4] Default=NO MaxTime=INFINITE State=UP
> PartitionName=p5 Nodes=wgnode[5] Default=NO MaxTime=INFINITE State=UP
>
> ### An example of the type of command being issued to sbatch (a script tries
> to issue thousands of these commands in series) ###
> sbatch --job-name=j1 --partition=wgnode1 --error=./log/j1.err --output=./log/j1.out -vvvvv --share ./bin/dowork.sh j1
>
> ### And the example output logged by sbatch when it errors: ###
> sbatch: defined options for program `sbatch'
> sbatch: ----------------- ---------------------
> sbatch: user              : `cluster'
> sbatch: uid               : 2113
> sbatch: gid               : 2113
> sbatch: cwd               : /tmp/slurmtest
> sbatch: ntasks            : 1 (default)
> sbatch: cpus_per_task     : 1 (default)
> sbatch: nodes             : 1 (default)
> sbatch: jobid             : 4294967294 (default)
> sbatch: partition         : wgnode1
> sbatch: job name          : `j1'
> sbatch: reservation       : `(null)'
> sbatch: wckey             : `(null)'
> sbatch: distribution      : unknown
> sbatch: verbose           : 8
> sbatch: immediate         : false
> sbatch: overcommit        : false
> sbatch: account           : (null)
> sbatch: comment           : (null)
> sbatch: dependency        : (null)
> sbatch: qos               : (null)
> sbatch: constraints       : mincpus=1
> sbatch: geometry          : (null)
> sbatch: reboot            : yes
> sbatch: rotate            : no
> sbatch: network           : (null)
> sbatch: mail_type         : NONE
> sbatch: mail_user         : (null)
> sbatch: sockets-per-node  : -2
> sbatch: cores-per-socket  : -2
> sbatch: threads-per-core  : -2
> sbatch: ntasks-per-node   : 0
> sbatch: ntasks-per-socket : -2
> sbatch: ntasks-per-core   : -2
> sbatch: cpu_bind          : default
> sbatch: mem_bind          : default
> sbatch: plane_size        : 4294967294
> sbatch: propagate         : NONE
> sbatch: remote command    : `/tmp/slurmtest/./bin/dowork.sh'
> sbatch: debug: propagating RLIMIT_CPU=18446744073709551615
> sbatch: debug: propagating RLIMIT_FSIZE=18446744073709551615
> sbatch: debug: propagating RLIMIT_DATA=18446744073709551615
> sbatch: debug: propagating RLIMIT_STACK=8388608
> sbatch: debug: propagating RLIMIT_CORE=0
> sbatch: debug: propagating RLIMIT_RSS=18446744073709551615
> sbatch: debug: propagating RLIMIT_NPROC=61504
> sbatch: debug: propagating RLIMIT_NOFILE=8192
> sbatch: debug: propagating RLIMIT_MEMLOCK=32768
> sbatch: debug: propagating RLIMIT_AS=18446744073709551615
> sbatch: debug: propagating SLURM_PRIO_PROCESS=0
> sbatch: debug: propagating SUBMIT_DIR=/tmp/slurmtest
> sbatch: debug: propagating UMASK=0002
> sbatch: debug3: Trying to load plugin /usr/lib/slurm/auth_munge.so
> sbatch: auth plugin for Munge (http://home.gna.org/munge/) loaded
> sbatch: debug3: Success.
> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_linear.so
> sbatch: debug3: Success.
> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_cray.so
> sbatch: debug3: Success.
> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_cons_res.so
> sbatch: debug3: Success.
> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_bluegene.so
> sbatch: debug3: Success.
> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_bgq.so
> sbatch: debug3: Success.
> sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: error: Batch job submission failed: Resource temporarily unavailable
>
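Two notes that may help short of recompiling, both untested against this exact setup: slurm.conf documents a MaxJobCount parameter that raises this limit without rebuilding (check whether your 2.2.7 install supports it), and with MinJobAge=6000 completed jobs may linger in slurmctld's records for 100 minutes and count against the limit, so lowering MinJobAge could also free slots. Independently, the submission script can throttle itself so the queue never approaches the limit. A minimal sketch of such a throttle; queue_has_room, LIMIT, and HEADROOM are illustrative names, and the commented loop assumes sbatch/squeue on PATH:

```shell
#!/bin/sh
# Sketch: keep the pending queue safely below Slurm's job-count limit
# (10,000 by default via DEFAULT_MAX_JOB_COUNT) instead of relying on
# sbatch's 15 retries.

LIMIT=10000    # should match MaxJobCount / DEFAULT_MAX_JOB_COUNT
HEADROOM=100   # stay this far below the limit

# Succeeds (exit 0) when the given queue depth leaves enough headroom.
queue_has_room() {
    current=$1
    [ "$current" -lt $((LIMIT - HEADROOM)) ]
}

# Example submission loop for jobs j1..j50000 (hypothetical usage):
# for i in $(seq 1 50000); do
#     until queue_has_room "$(squeue -h -u "$USER" | wc -l)"; do
#         sleep 30   # let the scheduler drain some jobs first
#     done
#     sbatch --job-name="j$i" --partition=worker ./bin/dowork.sh "j$i"
# done
```

Pausing on queue depth rather than on submission errors avoids the hard failure at MAX_RETRIES, since sbatch is only invoked when the controller has room to accept the job.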
