Upgrade to Slurm v2.4 and decrease MinJobAge, SlurmdDebug, and SlurmctldDebug. With MinJobAge=6000, every completed job stays in slurmctld's records for 100 minutes and continues to count against the active-job limit, which is why the queue hits a wall around 10,000 jobs.
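
A sketch of those slurm.conf changes (the values below are illustrative, not
tested on this cluster). Lowering MinJobAge lets slurmctld purge completed
jobs sooner, and dropping the debug levels to info cuts logging overhead.
Raising MaxJobCount is a further option; check it against the slurm.conf
man page for your version before relying on it:

```
## Purge completed job records after 300 s instead of 6000 s
MinJobAge=300
## Log at info level instead of debug level
SlurmctldDebug=3
SlurmdDebug=3
## Illustrative: allow more jobs in the active queue at once
MaxJobCount=50000
```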


Quoting Cory McLean <[email protected]>:

>
> hi,
> I am trying to use slurm as a resource manager, but am running into  
> problems when trying to submit over 10,000 jobs to the queue.  Each  
> job is queued by issuing a separate sbatch command, which works well  
> up to a few thousand jobs but then I begin seeing the error
> sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying.
> Many jobs still get submitted after a few retries, but when around  
> 9,980 jobs are in the queue, invariably some job(s) hit the 15  
> MAX_RETRIES and exit with the error
> sbatch: error: Batch job submission failed: Resource temporarily unavailable
> Is slurm not suited to handling tens of thousands of jobs?  Or are  
> there some configuration/job submission changes I could make to  
> allow slurm to handle up to 50K jobs?
> Details of my current setup are as follows.  A partition is  
> specified for each worker for later scaling of the cluster, where  
> multiple nodes would be assigned to each partition.
> Thank you very much for any help!
> Slurm version: 2.2.7
> ### slurm.conf ###
> ClusterName=wgs
> ControlMachine=wgmaster
> SlurmUser=slurm
> SlurmctldPort=6818
> SlurmdPort=6817
> AuthType=auth/munge
> StateSaveLocation=/tmp
> SlurmdSpoolDir=/tmp/slurmd
> SwitchType=switch/none
> MpiDefault=none
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmdPidFile=/var/run/slurmd.pid
> ProctrackType=proctrack/pgid
> CacheGroups=0
> ReturnToService=0
> ## TIMERS
> SlurmctldTimeout=300
> SlurmdTimeout=300
> InactiveLimit=0
> MinJobAge=6000
> KillWait=30
> Waittime=0
> MessageTimeout=60
> ## SCHEDULING
> SchedulerType=sched/backfill
> SchedulerParameters=defer
> SelectType=select/linear
> FastSchedule=1
> ## LOGGING
> SlurmctldDebug=5
> SlurmctldLogFile=/var/log/slurmctld
> SlurmdDebug=5
> SlurmdLogFile=/var/log/slurmd
> JobCompType=jobcomp/filetxt
> ## COMPUTE NODES
> NodeName=wgmaster Procs=1 State=UNKNOWN
> NodeName=wgnode1 NodeHostname=wgnode1 Procs=1 State=UNKNOWN
> NodeName=wgnode2 NodeHostname=wgnode2 Procs=1 State=UNKNOWN
> NodeName=wgnode3 NodeHostname=wgnode3 Procs=1 State=UNKNOWN
> NodeName=wgnode4 NodeHostname=wgnode4 Procs=1 State=UNKNOWN
> NodeName=wgnode5 NodeHostname=wgnode5 Procs=1 State=UNKNOWN
> ## PARTITIONS
> PartitionName=all Nodes=wgmaster,wgnode[1-5] Default=NO MaxTime=INFINITE State=UP
> PartitionName=worker Nodes=wgnode[1-5] Default=YES MaxTime=INFINITE State=UP
> PartitionName=dbhost Nodes=wgmaster Default=NO MaxTime=INFINITE State=UP
> PartitionName=p1 Nodes=wgnode[1] Default=NO MaxTime=INFINITE State=UP
> PartitionName=p2 Nodes=wgnode[2] Default=NO MaxTime=INFINITE State=UP
> PartitionName=p3 Nodes=wgnode[3] Default=NO MaxTime=INFINITE State=UP
> PartitionName=p4 Nodes=wgnode[4] Default=NO MaxTime=INFINITE State=UP
> PartitionName=p5 Nodes=wgnode[5] Default=NO MaxTime=INFINITE State=UP
>
> ### An example of the type of command being issued to sbatch (a
> script tries to issue thousands of these commands in series) ###
> sbatch --job-name=j1 --partition=wgnode1 --error=./log/j1.err \
>   --output=./log/j1.out -vvvvv --share ./bin/dowork.sh j1
>
> ### And the example output logged by sbatch when it errors: ###
> sbatch: defined options for program `sbatch'
> sbatch: ----------------- ---------------------
> sbatch: user              : `cluster'
> sbatch: uid               : 2113
> sbatch: gid               : 2113
> sbatch: cwd               : /tmp/slurmtest
> sbatch: ntasks            : 1 (default)
> sbatch: cpus_per_task     : 1 (default)
> sbatch: nodes             : 1 (default)
> sbatch: jobid             : 4294967294 (default)
> sbatch: partition         : wgnode1
> sbatch: job name          : `j1'
> sbatch: reservation       : `(null)'
> sbatch: wckey             : `(null)'
> sbatch: distribution      : unknown
> sbatch: verbose           : 8
> sbatch: immediate         : false
> sbatch: overcommit        : false
> sbatch: account           : (null)
> sbatch: comment           : (null)
> sbatch: dependency        : (null)
> sbatch: qos               : (null)
> sbatch: constraints       : mincpus=1
> sbatch: geometry          : (null)
> sbatch: reboot            : yes
> sbatch: rotate            : no
> sbatch: network           : (null)
> sbatch: mail_type         : NONE
> sbatch: mail_user         : (null)
> sbatch: sockets-per-node  : -2
> sbatch: cores-per-socket  : -2
> sbatch: threads-per-core  : -2
> sbatch: ntasks-per-node   : 0
> sbatch: ntasks-per-socket : -2
> sbatch: ntasks-per-core   : -2
> sbatch: cpu_bind          : default
> sbatch: mem_bind          : default
> sbatch: plane_size        : 4294967294
> sbatch: propagate         : NONE
> sbatch: remote command    : `/tmp/slurmtest/./bin/dowork.sh'
> sbatch: debug:  propagating RLIMIT_CPU=18446744073709551615
> sbatch: debug:  propagating RLIMIT_FSIZE=18446744073709551615
> sbatch: debug:  propagating RLIMIT_DATA=18446744073709551615
> sbatch: debug:  propagating RLIMIT_STACK=8388608
> sbatch: debug:  propagating RLIMIT_CORE=0
> sbatch: debug:  propagating RLIMIT_RSS=18446744073709551615
> sbatch: debug:  propagating RLIMIT_NPROC=61504
> sbatch: debug:  propagating RLIMIT_NOFILE=8192
> sbatch: debug:  propagating RLIMIT_MEMLOCK=32768
> sbatch: debug:  propagating RLIMIT_AS=18446744073709551615
> sbatch: debug:  propagating SLURM_PRIO_PROCESS=0
> sbatch: debug:  propagating SUBMIT_DIR=/tmp/slurmtest
> sbatch: debug:  propagating UMASK=0002
> sbatch: debug3: Trying to load plugin /usr/lib/slurm/auth_munge.so
> sbatch: auth plugin for Munge (http://home.gna.org/munge/) loaded
> sbatch: debug3: Success.
> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_linear.so
> sbatch: debug3: Success.
> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_cray.so
> sbatch: debug3: Success.
> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_cons_res.so
> sbatch: debug3: Success.
> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_bluegene.so
> sbatch: debug3: Success.
> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_bgq.so
> sbatch: debug3: Success.
> sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: error: Batch job submission failed: Resource temporarily unavailable
>
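
On the submission side, another option is to wrap each sbatch call in your
own retry loop with exponential backoff rather than relying on sbatch's
built-in 15 retries. A minimal sketch; the `submit_with_retry` helper and
its limits are hypothetical, and the command passed in would be the real
sbatch invocation from your script:

```shell
#!/bin/sh
# Retry a command with exponential backoff. Assumed helper, not part of
# Slurm: adjust max attempts and initial delay to taste.
submit_with_retry() {
    cmd="$1"
    max=10      # maximum attempts
    delay=1     # initial backoff in seconds
    i=0
    while [ "$i" -lt "$max" ]; do
        if $cmd; then
            return 0            # submission accepted
        fi
        sleep "$delay"
        delay=$((delay * 2))    # back off: 1, 2, 4, 8, ... seconds
        i=$((i + 1))
    done
    return 1                    # gave up after $max attempts
}

# Usage (hypothetical job j1, mirroring the command in the quoted message):
# submit_with_retry "sbatch --job-name=j1 --partition=wgnode1 ./bin/dowork.sh"
```

Spreading submissions out this way also gives slurmctld time to purge
completed jobs between attempts instead of hammering it in a tight series.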