Hi,
I am trying to use Slurm as a resource manager, but I am running into problems
when submitting over 10,000 jobs to the queue.  Each job is queued by
issuing a separate sbatch command, which works well up to a few thousand jobs,
but then I begin seeing the error
sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying.
Many jobs still get submitted after a few retries, but once around 9,980 jobs
are in the queue, some job(s) invariably hit the 15 MAX_RETRIES and exit with
the error
sbatch: error: Batch job submission failed: Resource temporarily unavailable
Is Slurm not suited to handling tens of thousands of jobs?  Or are there some
configuration or job-submission changes I could make to allow Slurm to handle
up to 50K jobs?
Details of my current setup are as follows.  A separate partition is defined
for each worker node so the cluster can later be scaled by assigning multiple
nodes to each partition.
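For what it's worth, the ~9,980 ceiling is suspiciously close to 10,000, which I believe is the default value of MaxJobCount in slurm.conf. If that guess is right, something like the following (untested on my 2.2.7 install) might raise the limit:

```
# In slurm.conf -- assuming MaxJobCount is the relevant limit (its default
# is 10000).  Completed jobs are kept in slurmctld's records until purged,
# so they may count against this limit too.
MaxJobCount=50000
```

I also notice my MinJobAge=6000 keeps finished jobs around for 100 minutes, which presumably counts against the same limit.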
Thank you very much for any help!
Slurm version: 2.2.7
### slurm.conf ###
ClusterName=wgs
ControlMachine=wgmaster
SlurmUser=slurm
SlurmctldPort=6818
SlurmdPort=6817
AuthType=auth/munge
StateSaveLocation=/tmp
SlurmdSpoolDir=/tmp/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
CacheGroups=0
ReturnToService=0

## TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=6000
KillWait=30
Waittime=0
MessageTimeout=60

## SCHEDULING
SchedulerType=sched/backfill
SchedulerParameters=defer
SelectType=select/linear
FastSchedule=1

## LOGGING
SlurmctldDebug=5
SlurmctldLogFile=/var/log/slurmctld
SlurmdDebug=5
SlurmdLogFile=/var/log/slurmd
JobCompType=jobcomp/filetxt

## COMPUTE NODES
NodeName=wgmaster Procs=1 State=UNKNOWN
NodeName=wgnode1 NodeHostname=wgnode1 Procs=1 State=UNKNOWN
NodeName=wgnode2 NodeHostname=wgnode2 Procs=1 State=UNKNOWN
NodeName=wgnode3 NodeHostname=wgnode3 Procs=1 State=UNKNOWN
NodeName=wgnode4 NodeHostname=wgnode4 Procs=1 State=UNKNOWN
NodeName=wgnode5 NodeHostname=wgnode5 Procs=1 State=UNKNOWN

## PARTITIONS
PartitionName=all Nodes=wgmaster,wgnode[1-5] Default=NO MaxTime=INFINITE State=UP
PartitionName=worker Nodes=wgnode[1-5] Default=YES MaxTime=INFINITE State=UP
PartitionName=dbhost Nodes=wgmaster Default=NO MaxTime=INFINITE State=UP
PartitionName=p1 Nodes=wgnode[1] Default=NO MaxTime=INFINITE State=UP
PartitionName=p2 Nodes=wgnode[2] Default=NO MaxTime=INFINITE State=UP
PartitionName=p3 Nodes=wgnode[3] Default=NO MaxTime=INFINITE State=UP
PartitionName=p4 Nodes=wgnode[4] Default=NO MaxTime=INFINITE State=UP
PartitionName=p5 Nodes=wgnode[5] Default=NO MaxTime=INFINITE State=UP

### An example of the type of command being issued to sbatch (a script tries
to issue thousands of these commands in series) ###
sbatch --job-name=j1 --partition=wgnode1 --error=./log/j1.err --output=./log/j1.out -vvvvv --share ./bin/dowork.sh j1
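The driver script is essentially a loop like the sketch below. The submit_all helper and the round-robin choice over the p1..p5 partitions are my own names for illustration; the real script just varies the jN suffix per job:

```shell
#!/bin/sh
# Sketch of the submission driver.  SBATCH_CMD defaults to the real sbatch;
# point it at a stub (e.g. "true") for a dry run.
SBATCH_CMD=${SBATCH_CMD:-sbatch}

# submit_all N: submit jobs j1..jN, one sbatch call per job, and report
# how many submissions failed (hypothetical helper name).
submit_all() {
    n=$1
    failed=0
    i=1
    while [ "$i" -le "$n" ]; do
        # Round-robin over the p1..p5 partitions (an assumption; the
        # original example hard-codes --partition=wgnode1).
        "$SBATCH_CMD" --job-name="j$i" \
            --partition="p$(( (i - 1) % 5 + 1 ))" \
            --error="./log/j$i.err" --output="./log/j$i.out" \
            --share ./bin/dowork.sh "j$i" || failed=$((failed + 1))
        i=$((i + 1))
    done
    echo "failed submissions: $failed"
}
```

A dry run with `SBATCH_CMD=true submit_all 3` prints `failed submissions: 0`; the retry-and-sleep behavior seen in the logs is sbatch's own, not the script's.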

### And the example output logged by sbatch when it errors: ###
sbatch: defined options for program `sbatch'
sbatch: ----------------- ---------------------
sbatch: user              : `cluster'
sbatch: uid               : 2113
sbatch: gid               : 2113
sbatch: cwd               : /tmp/slurmtest
sbatch: ntasks            : 1 (default)
sbatch: cpus_per_task     : 1 (default)
sbatch: nodes             : 1 (default)
sbatch: jobid             : 4294967294 (default)
sbatch: partition         : wgnode1
sbatch: job name          : `j1'
sbatch: reservation       : `(null)'
sbatch: wckey             : `(null)'
sbatch: distribution      : unknown
sbatch: verbose           : 8
sbatch: immediate         : false
sbatch: overcommit        : false
sbatch: account           : (null)
sbatch: comment           : (null)
sbatch: dependency        : (null)
sbatch: qos               : (null)
sbatch: constraints       : mincpus=1
sbatch: geometry          : (null)
sbatch: reboot            : yes
sbatch: rotate            : no
sbatch: network           : (null)
sbatch: mail_type         : NONE
sbatch: mail_user         : (null)
sbatch: sockets-per-node  : -2
sbatch: cores-per-socket  : -2
sbatch: threads-per-core  : -2
sbatch: ntasks-per-node   : 0
sbatch: ntasks-per-socket : -2
sbatch: ntasks-per-core   : -2
sbatch: cpu_bind          : default
sbatch: mem_bind          : default
sbatch: plane_size        : 4294967294
sbatch: propagate         : NONE
sbatch: remote command    : `/tmp/slurmtest/./bin/dowork.sh'
sbatch: debug:  propagating RLIMIT_CPU=18446744073709551615
sbatch: debug:  propagating RLIMIT_FSIZE=18446744073709551615
sbatch: debug:  propagating RLIMIT_DATA=18446744073709551615
sbatch: debug:  propagating RLIMIT_STACK=8388608
sbatch: debug:  propagating RLIMIT_CORE=0
sbatch: debug:  propagating RLIMIT_RSS=18446744073709551615
sbatch: debug:  propagating RLIMIT_NPROC=61504
sbatch: debug:  propagating RLIMIT_NOFILE=8192
sbatch: debug:  propagating RLIMIT_MEMLOCK=32768
sbatch: debug:  propagating RLIMIT_AS=18446744073709551615
sbatch: debug:  propagating SLURM_PRIO_PROCESS=0
sbatch: debug:  propagating SUBMIT_DIR=/tmp/slurmtest
sbatch: debug:  propagating UMASK=0002
sbatch: debug3: Trying to load plugin /usr/lib/slurm/auth_munge.so
sbatch: auth plugin for Munge (http://home.gna.org/munge/) loaded
sbatch: debug3: Success.
sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_linear.so
sbatch: debug3: Success.
sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_cray.so
sbatch: debug3: Success.
sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_cons_res.so
sbatch: debug3: Success.
sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_bluegene.so
sbatch: debug3: Success.
sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_bgq.so
sbatch: debug3: Success.
sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: error: Batch job submission failed: Resource temporarily unavailable
                                          
