Or just raise MaxJobCount in the slurm.conf file -- no recompile needed...
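
A minimal sketch of that change. MaxJobCount is the slurm.conf parameter that caps how many jobs slurmctld will hold at once (the compiled-in default is 10,000); the value 50000 below is illustrative, and the MinJobAge suggestion is an assumption based on the poster's config, since completed jobs count against MaxJobCount until MinJobAge seconds have passed:

```
# slurm.conf -- raise the active-job limit (default 10000)
MaxJobCount=50000
# Completed jobs stay in slurmctld memory (and count against
# MaxJobCount) for MinJobAge seconds; the poster's MinJobAge=6000
# keeps finished jobs around for ~100 minutes.  Lowering it frees
# slots much sooner:
MinJobAge=300
```

Restarting slurmctld after the edit is the safe way to apply the new limit.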

Quoting "Huang, Perry" <[email protected]>:

>
> Hi
>
> In read_config.h, the default value for the maximum job count is
> 10,000.  You can raise this limit by recompiling with a higher value.
>
> line 89 #define DEFAULT_MAX_JOB_COUNT       10000
>
> Perry Huang
> [email protected]
> Lawrence Livermore National Laboratory
>
>
>
> On Jul 12, 2012, at 2:14 PM, Cory McLean wrote:
>
>> hi,
>>
>> I am trying to use slurm as a resource manager, but am running into  
>> problems when trying to submit over 10,000 jobs to the queue.  Each  
>> job is queued by issuing a separate sbatch command, which works  
>> well up to a few thousand jobs but then I begin seeing the error
>>
>> sbatch: error: Slurm temporarily unable to accept job, sleeping and  
>> retrying.
>>
>> Many jobs still get submitted after a few retries, but when around  
>> 9,980 jobs are in the queue, invariably some job(s) hit the 15  
>> MAX_RETRIES and exit with the error
>>
>> sbatch: error: Batch job submission failed: Resource temporarily unavailable
>>
>> Is slurm not suited to handling tens of thousands of jobs?  Or are  
>> there some configuration/job submission changes I could make to  
>> allow slurm to handle up to 50K jobs?
>>
>> Details of my current setup are as follows.  A partition is
>> defined for each worker node to allow later scaling of the
>> cluster, at which point multiple nodes would be assigned to each
>> partition.
>>
>> Thank you very much for any help!
>>
>> Slurm version: 2.2.7
>>
>> ### slurm.conf ###
>> #
>> ClusterName=wgs
>> ControlMachine=wgmaster
>> SlurmUser=slurm
>> SlurmctldPort=6818
>> SlurmdPort=6817
>> AuthType=auth/munge
>> StateSaveLocation=/tmp
>> SlurmdSpoolDir=/tmp/slurmd
>> SwitchType=switch/none
>> MpiDefault=none
>> SlurmctldPidFile=/var/run/slurmctld.pid
>> SlurmdPidFile=/var/run/slurmd.pid
>> ProctrackType=proctrack/pgid
>> CacheGroups=0
>> ReturnToService=0
>> #
>> # TIMERS
>> SlurmctldTimeout=300
>> SlurmdTimeout=300
>> InactiveLimit=0
>> MinJobAge=6000
>> KillWait=30
>> Waittime=0
>> MessageTimeout=60
>> #
>> # SCHEDULING
>> SchedulerType=sched/backfill
>> SchedulerParameters=defer
>> SelectType=select/linear
>> FastSchedule=1
>> #
>> # LOGGING
>> SlurmctldDebug=5
>> SlurmctldLogFile=/var/log/slurmctld
>> SlurmdDebug=5
>> SlurmdLogFile=/var/log/slurmd
>> JobCompType=jobcomp/filetxt
>> #
>> # COMPUTE NODES
>> NodeName=wgmaster  Procs=1  State=UNKNOWN
>> NodeName=wgnode1  NodeHostname=wgnode1 Procs=1  State=UNKNOWN
>> NodeName=wgnode2  NodeHostname=wgnode2 Procs=1  State=UNKNOWN
>> NodeName=wgnode3  NodeHostname=wgnode3 Procs=1  State=UNKNOWN
>> NodeName=wgnode4  NodeHostname=wgnode4 Procs=1  State=UNKNOWN
>> NodeName=wgnode5  NodeHostname=wgnode5 Procs=1  State=UNKNOWN
>> #
>> # PARTITIONS
>> PartitionName=all  Nodes=wgmaster,wgnode[1-5]  Default=NO  
>> MaxTime=INFINITE  State=UP
>> PartitionName=worker  Nodes=wgnode[1-5]  Default=YES  
>> MaxTime=INFINITE  State=UP
>> PartitionName=dbhost  Nodes=wgmaster  Default=NO MaxTime=INFINITE  State=UP
>> PartitionName=p1  Nodes=wgnode[1] Default=NO MaxTime=INFINITE State=UP
>> PartitionName=p2  Nodes=wgnode[2] Default=NO MaxTime=INFINITE State=UP
>> PartitionName=p3  Nodes=wgnode[3] Default=NO MaxTime=INFINITE State=UP
>> PartitionName=p4  Nodes=wgnode[4] Default=NO MaxTime=INFINITE State=UP
>> PartitionName=p5  Nodes=wgnode[5] Default=NO MaxTime=INFINITE State=UP
>>
>>
>> ### An example of the type of command being issued to sbatch (a  
>> script tries to issue thousands of these commands in series) ###
>> sbatch --job-name=j1 --partition=wgnode1 --error=./log/j1.err  
>> --output=./log/j1.out -vvvvv --share ./bin/dowork.sh j1
>>
>>
>> ### And the example output logged by sbatch when it errors: ###
>> sbatch: defined options for program `sbatch'
>> sbatch: ----------------- ---------------------
>> sbatch: user              : `cluster'
>> sbatch: uid               : 2113
>> sbatch: gid               : 2113
>> sbatch: cwd               : /tmp/slurmtest
>> sbatch: ntasks            : 1 (default)
>> sbatch: cpus_per_task     : 1 (default)
>> sbatch: nodes             : 1 (default)
>> sbatch: jobid             : 4294967294 (default)
>> sbatch: partition         : wgnode1
>> sbatch: job name          : `j1'
>> sbatch: reservation       : `(null)'
>> sbatch: wckey             : `(null)'
>> sbatch: distribution      : unknown
>> sbatch: verbose           : 8
>> sbatch: immediate         : false
>> sbatch: overcommit        : false
>> sbatch: account           : (null)
>> sbatch: comment           : (null)
>> sbatch: dependency        : (null)
>> sbatch: qos               : (null)
>> sbatch: constraints       : mincpus=1
>> sbatch: geometry          : (null)
>> sbatch: reboot            : yes
>> sbatch: rotate            : no
>> sbatch: network           : (null)
>> sbatch: mail_type         : NONE
>> sbatch: mail_user         : (null)
>> sbatch: sockets-per-node  : -2
>> sbatch: cores-per-socket  : -2
>> sbatch: threads-per-core  : -2
>> sbatch: ntasks-per-node   : 0
>> sbatch: ntasks-per-socket : -2
>> sbatch: ntasks-per-core   : -2
>> sbatch: cpu_bind          : default
>> sbatch: mem_bind          : default
>> sbatch: plane_size        : 4294967294
>> sbatch: propagate         : NONE
>> sbatch: remote command    : `/tmp/slurmtest/./bin/dowork.sh'
>> sbatch: debug:  propagating RLIMIT_CPU=18446744073709551615
>> sbatch: debug:  propagating RLIMIT_FSIZE=18446744073709551615
>> sbatch: debug:  propagating RLIMIT_DATA=18446744073709551615
>> sbatch: debug:  propagating RLIMIT_STACK=8388608
>> sbatch: debug:  propagating RLIMIT_CORE=0
>> sbatch: debug:  propagating RLIMIT_RSS=18446744073709551615
>> sbatch: debug:  propagating RLIMIT_NPROC=61504
>> sbatch: debug:  propagating RLIMIT_NOFILE=8192
>> sbatch: debug:  propagating RLIMIT_MEMLOCK=32768
>> sbatch: debug:  propagating RLIMIT_AS=18446744073709551615
>> sbatch: debug:  propagating SLURM_PRIO_PROCESS=0
>> sbatch: debug:  propagating SUBMIT_DIR=/tmp/slurmtest
>> sbatch: debug:  propagating UMASK=0002
>> sbatch: debug3: Trying to load plugin /usr/lib/slurm/auth_munge.so
>> sbatch: auth plugin for Munge (http://home.gna.org/munge/) loaded
>> sbatch: debug3: Success.
>> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_linear.so
>> sbatch: debug3: Success.
>> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_cray.so
>> sbatch: debug3: Success.
>> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_cons_res.so
>> sbatch: debug3: Success.
>> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_bluegene.so
>> sbatch: debug3: Success.
>> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_bgq.so
>> sbatch: debug3: Success.
>> sbatch: error: Slurm temporarily unable to accept job, sleeping and  
>> retrying.
>> sbatch: debug:  Slurm temporarily unable to accept job, sleeping  
>> and retrying.
>> [the previous debug message repeated 13 more times as sbatch
>> retried]
>> sbatch: error: Batch job submission failed: Resource temporarily unavailable
>>
>>
>
