Fantastic, thanks for your help!  I confirmed that after setting 
MaxJobCount in slurm.conf to 100000 (and lowering the SlurmctldDebug and 
SlurmdDebug levels to 1, per the other suggestion, though that may not 
have been necessary), jobs are being submitted without a hitch.
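For the archives, these are the slurm.conf lines I ended up changing (the two debug settings may well be optional, as noted above):

```ini
# slurm.conf -- raise the job-count ceiling and quiet the daemon logs
MaxJobCount=100000
SlurmctldDebug=1
SlurmdDebug=1
```

Remember to restart (or reconfigure) slurmctld after editing for the change to take effect.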

> Date: Thu, 12 Jul 2012 17:45:03 -0600
> From: [email protected]
> To: [email protected]
> Subject: [slurm-dev] Re: sbatch error: Resource temporarily unavailable when 
> queue has 10K jobs
> 
> 
> Or just change the slurm.conf file...
> 
> Quoting "Huang, Perry" <[email protected]>:
> 
> >
> > Hi
> >
> > In read_config.h, the default maximum number of jobs is 10,000.  
> > You can raise this limit by recompiling with a larger value.
> >
> > line 89 #define DEFAULT_MAX_JOB_COUNT       10000
> >
> > Perry Huang
> > [email protected]
> > Lawrence Livermore National Laboratory
> >
> >
> >
> > On Jul 12, 2012, at 2:14 PM, Cory McLean wrote:
> >
> >> hi,
> >>
> >> I am trying to use Slurm as a resource manager, but I run into  
> >> problems when submitting more than 10,000 jobs to the queue.  Each  
> >> job is queued with a separate sbatch command, which works well up  
> >> to a few thousand jobs, but then I begin seeing the error
> >>
> >> sbatch: error: Slurm temporarily unable to accept job, sleeping and  
> >> retrying.
> >>
> >> Many jobs still get submitted after a few retries, but when around  
> >> 9,980 jobs are in the queue, invariably some job(s) hit the 15  
> >> MAX_RETRIES and exit with the error
> >>
> >> sbatch: error: Batch job submission failed: Resource temporarily 
> >> unavailable
> >>
> >> Is slurm not suited to handling tens of thousands of jobs?  Or are  
> >> there some configuration/job submission changes I could make to  
> >> allow slurm to handle up to 50K jobs?
> >>
> >> Details of my current setup are below.  A separate partition is  
> >> defined for each worker node so that the cluster can later be  
> >> scaled by assigning multiple nodes to each partition.
> >>
> >> Thank you very much for any help!
> >>
> >> Slurm version: 2.2.7
> >>
> >> ### slurm.conf ###
> >> #
> >> ClusterName=wgs
> >> ControlMachine=wgmaster
> >> SlurmUser=slurm
> >> SlurmctldPort=6818
> >> SlurmdPort=6817
> >> AuthType=auth/munge
> >> StateSaveLocation=/tmp
> >> SlurmdSpoolDir=/tmp/slurmd
> >> SwitchType=switch/none
> >> MpiDefault=none
> >> SlurmctldPidFile=/var/run/slurmctld.pid
> >> SlurmdPidFile=/var/run/slurmd.pid
> >> ProctrackType=proctrack/pgid
> >> CacheGroups=0
> >> ReturnToService=0
> >> #
> >> # TIMERS
> >> SlurmctldTimeout=300
> >> SlurmdTimeout=300
> >> InactiveLimit=0
> >> MinJobAge=6000
> >> KillWait=30
> >> Waittime=0
> >> MessageTimeout=60
> >> #
> >> # SCHEDULING
> >> SchedulerType=sched/backfill
> >> SchedulerParameters=defer
> >> SelectType=select/linear
> >> FastSchedule=1
> >> #
> >> # LOGGING
> >> SlurmctldDebug=5
> >> SlurmctldLogFile=/var/log/slurmctld
> >> SlurmdDebug=5
> >> SlurmdLogFile=/var/log/slurmd
> >> JobCompType=jobcomp/filetxt
> >> #
> >> # COMPUTE NODES
> >> NodeName=wgmaster  Procs=1  State=UNKNOWN
> >> NodeName=wgnode1  NodeHostname=wgnode1 Procs=1  State=UNKNOWN
> >> NodeName=wgnode2  NodeHostname=wgnode2 Procs=1  State=UNKNOWN
> >> NodeName=wgnode3  NodeHostname=wgnode3 Procs=1  State=UNKNOWN
> >> NodeName=wgnode4  NodeHostname=wgnode4 Procs=1  State=UNKNOWN
> >> NodeName=wgnode5  NodeHostname=wgnode5 Procs=1  State=UNKNOWN
> >> #
> >> # PARTITIONS
> >> PartitionName=all  Nodes=wgmaster,wgnode[1-5]  Default=NO  
> >> MaxTime=INFINITE  State=UP
> >> PartitionName=worker  Nodes=wgnode[1-5]  Default=YES  
> >> MaxTime=INFINITE  State=UP
> >> PartitionName=dbhost  Nodes=wgmaster  Default=NO MaxTime=INFINITE  State=UP
> >> PartitionName=p1  Nodes=wgnode[1] Default=NO MaxTime=INFINITE State=UP
> >> PartitionName=p2  Nodes=wgnode[2] Default=NO MaxTime=INFINITE State=UP
> >> PartitionName=p3  Nodes=wgnode[3] Default=NO MaxTime=INFINITE State=UP
> >> PartitionName=p4  Nodes=wgnode[4] Default=NO MaxTime=INFINITE State=UP
> >> PartitionName=p5  Nodes=wgnode[5] Default=NO MaxTime=INFINITE State=UP
> >>
> >>
> >> ### An example of the type of command being issued to sbatch (a  
> >> script tries to issue thousands of these commands in series) ###
> >> sbatch --job-name=j1 --partition=wgnode1 --error=./log/j1.err  
> >> --output=./log/j1.out -vvvvv --share ./bin/dowork.sh j1
> >>
> >>
> >> ### And the example output logged by sbatch when it errors: ###
> >> sbatch: defined options for program `sbatch'
> >> sbatch: ----------------- ---------------------
> >> sbatch: user              : `cluster'
> >> sbatch: uid               : 2113
> >> sbatch: gid               : 2113
> >> sbatch: cwd               : /tmp/slurmtest
> >> sbatch: ntasks            : 1 (default)
> >> sbatch: cpus_per_task     : 1 (default)
> >> sbatch: nodes             : 1 (default)
> >> sbatch: jobid             : 4294967294 (default)
> >> sbatch: partition         : wgnode1
> >> sbatch: job name          : `j1'
> >> sbatch: reservation       : `(null)'
> >> sbatch: wckey             : `(null)'
> >> sbatch: distribution      : unknown
> >> sbatch: verbose           : 8
> >> sbatch: immediate         : false
> >> sbatch: overcommit        : false
> >> sbatch: account           : (null)
> >> sbatch: comment           : (null)
> >> sbatch: dependency        : (null)
> >> sbatch: qos               : (null)
> >> sbatch: constraints       : mincpus=1
> >> sbatch: geometry          : (null)
> >> sbatch: reboot            : yes
> >> sbatch: rotate            : no
> >> sbatch: network           : (null)
> >> sbatch: mail_type         : NONE
> >> sbatch: mail_user         : (null)
> >> sbatch: sockets-per-node  : -2
> >> sbatch: cores-per-socket  : -2
> >> sbatch: threads-per-core  : -2
> >> sbatch: ntasks-per-node   : 0
> >> sbatch: ntasks-per-socket : -2
> >> sbatch: ntasks-per-core   : -2
> >> sbatch: cpu_bind          : default
> >> sbatch: mem_bind          : default
> >> sbatch: plane_size        : 4294967294
> >> sbatch: propagate         : NONE
> >> sbatch: remote command    : `/tmp/slurmtest/./bin/dowork.sh'
> >> sbatch: debug:  propagating RLIMIT_CPU=18446744073709551615
> >> sbatch: debug:  propagating RLIMIT_FSIZE=18446744073709551615
> >> sbatch: debug:  propagating RLIMIT_DATA=18446744073709551615
> >> sbatch: debug:  propagating RLIMIT_STACK=8388608
> >> sbatch: debug:  propagating RLIMIT_CORE=0
> >> sbatch: debug:  propagating RLIMIT_RSS=18446744073709551615
> >> sbatch: debug:  propagating RLIMIT_NPROC=61504
> >> sbatch: debug:  propagating RLIMIT_NOFILE=8192
> >> sbatch: debug:  propagating RLIMIT_MEMLOCK=32768
> >> sbatch: debug:  propagating RLIMIT_AS=18446744073709551615
> >> sbatch: debug:  propagating SLURM_PRIO_PROCESS=0
> >> sbatch: debug:  propagating SUBMIT_DIR=/tmp/slurmtest
> >> sbatch: debug:  propagating UMASK=0002
> >> sbatch: debug3: Trying to load plugin /usr/lib/slurm/auth_munge.so
> >> sbatch: auth plugin for Munge (http://home.gna.org/munge/) loaded
> >> sbatch: debug3: Success.
> >> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_linear.so
> >> sbatch: debug3: Success.
> >> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_cray.so
> >> sbatch: debug3: Success.
> >> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_cons_res.so
> >> sbatch: debug3: Success.
> >> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_bluegene.so
> >> sbatch: debug3: Success.
> >> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_bgq.so
> >> sbatch: debug3: Success.
> >> sbatch: error: Slurm temporarily unable to accept job, sleeping and  
> >> retrying.
> >> sbatch: debug:  Slurm temporarily unable to accept job, sleeping  
> >> and retrying.
> >> [the debug message above repeated 13 more times]
> >> sbatch: error: Batch job submission failed: Resource temporarily 
> >> unavailable
> >>
> >>
> >
> 
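For anyone searching the archives: the submission pattern described above (thousands of sequential sbatch calls) looks roughly like this hypothetical sketch. Job names, partition names, and paths are illustrative, not taken from the original script; the `echo` makes it a dry run.

```shell
#!/bin/sh
# Hypothetical sketch of a per-job submission loop, one sbatch call per job.
# "echo" makes this a dry run; set SBATCH=sbatch to actually submit.
SBATCH="${SBATCH:-echo sbatch}"

i=1
N=3                 # the real script loops into the tens of thousands
while [ "$i" -le "$N" ]; do
  # round-robin the jobs across the per-node partitions p1..p5
  $SBATCH --job-name="j$i" --partition="p$(( (i - 1) % 5 + 1 ))" \
    --error="./log/j$i.err" --output="./log/j$i.out" \
    --share ./bin/dowork.sh "j$i"
  i=$((i + 1))
done
```

With MaxJobCount at its default of 10,000, a loop like this is exactly what triggers the "Resource temporarily unavailable" retries once the queue fills.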