Hi,

In read_config.h, the default maximum job count is 10,000. You can raise this
limit by recompiling with a higher value:

line 89:  #define DEFAULT_MAX_JOB_COUNT       10000
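If recompiling is inconvenient, the same limit is also exposed as the
MaxJobCount parameter in slurm.conf, which overrides the compiled-in default,
so a fragment like the following may be all you need (50000 is just an
example value):

```
# slurm.conf -- raise the number of jobs slurmctld will track at once
# (overrides the compiled-in DEFAULT_MAX_JOB_COUNT of 10000)
MaxJobCount=50000
```

You can check which value the controller is actually using with
`scontrol show config | grep -i MaxJobCount`. Note also that completed jobs
keep counting against this limit until they are purged (after MinJobAge
seconds), so a large MinJobAge will make the queue fill up sooner.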

Perry Huang
[email protected]
Lawrence Livermore National Laboratory



On Jul 12, 2012, at 2:14 PM, Cory McLean wrote:

> hi,
> 
> I am trying to use slurm as a resource manager, but am running into problems 
> when trying to submit over 10,000 jobs to the queue.  Each job is queued by 
> issuing a separate sbatch command, which works well up to a few thousand jobs 
> but then I begin seeing the error
> 
> sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying.
> 
> Many jobs still get submitted after a few retries, but when around 9,980 jobs 
> are in the queue, invariably some job(s) hit the 15 MAX_RETRIES and exit with 
> the error
> 
> sbatch: error: Batch job submission failed: Resource temporarily unavailable
> 
> Is slurm not suited to handling tens of thousands of jobs?  Or are there some 
> configuration/job submission changes I could make to allow slurm to handle up 
> to 50K jobs?
> 
> Details of my current setup are as follows.  A separate partition is defined 
> for each worker node so that the cluster can later be scaled by assigning 
> multiple nodes to each partition.
> 
> Thank you very much for any help!
> 
> Slurm version: 2.2.7
> 
> ### slurm.conf ###
> #
> ClusterName=wgs
> ControlMachine=wgmaster
> SlurmUser=slurm
> SlurmctldPort=6818
> SlurmdPort=6817
> AuthType=auth/munge
> StateSaveLocation=/tmp
> SlurmdSpoolDir=/tmp/slurmd
> SwitchType=switch/none
> MpiDefault=none
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmdPidFile=/var/run/slurmd.pid
> ProctrackType=proctrack/pgid
> CacheGroups=0
> ReturnToService=0
> #
> # TIMERS
> SlurmctldTimeout=300
> SlurmdTimeout=300
> InactiveLimit=0
> MinJobAge=6000
> KillWait=30
> Waittime=0
> MessageTimeout=60
> #
> # SCHEDULING
> SchedulerType=sched/backfill
> SchedulerParameters=defer
> SelectType=select/linear
> FastSchedule=1
> #
> # LOGGING
> SlurmctldDebug=5
> SlurmctldLogFile=/var/log/slurmctld
> SlurmdDebug=5
> SlurmdLogFile=/var/log/slurmd
> JobCompType=jobcomp/filetxt
> #
> # COMPUTE NODES
> NodeName=wgmaster  Procs=1  State=UNKNOWN
> NodeName=wgnode1  NodeHostname=wgnode1 Procs=1  State=UNKNOWN
> NodeName=wgnode2  NodeHostname=wgnode2 Procs=1  State=UNKNOWN
> NodeName=wgnode3  NodeHostname=wgnode3 Procs=1  State=UNKNOWN
> NodeName=wgnode4  NodeHostname=wgnode4 Procs=1  State=UNKNOWN
> NodeName=wgnode5  NodeHostname=wgnode5 Procs=1  State=UNKNOWN
> #
> # PARTITIONS
> PartitionName=all  Nodes=wgmaster,wgnode[1-5]  Default=NO  MaxTime=INFINITE  State=UP
> PartitionName=worker  Nodes=wgnode[1-5]  Default=YES  MaxTime=INFINITE  State=UP
> PartitionName=dbhost  Nodes=wgmaster  Default=NO MaxTime=INFINITE  State=UP
> PartitionName=p1  Nodes=wgnode[1] Default=NO MaxTime=INFINITE State=UP
> PartitionName=p2  Nodes=wgnode[2] Default=NO MaxTime=INFINITE State=UP
> PartitionName=p3  Nodes=wgnode[3] Default=NO MaxTime=INFINITE State=UP
> PartitionName=p4  Nodes=wgnode[4] Default=NO MaxTime=INFINITE State=UP
> PartitionName=p5  Nodes=wgnode[5] Default=NO MaxTime=INFINITE State=UP
> 
> 
> ### An example of the type of command being issued to sbatch (a script tries 
> to issue thousands of these commands in series) ###
> sbatch --job-name=j1 --partition=wgnode1 --error=./log/j1.err \
>        --output=./log/j1.out -vvvvv --share ./bin/dowork.sh j1
> 
> 
> ### And the example output logged by sbatch when it errors: ###
> sbatch: defined options for program `sbatch'
> sbatch: ----------------- ---------------------
> sbatch: user              : `cluster'
> sbatch: uid               : 2113
> sbatch: gid               : 2113
> sbatch: cwd               : /tmp/slurmtest
> sbatch: ntasks            : 1 (default)
> sbatch: cpus_per_task     : 1 (default)
> sbatch: nodes             : 1 (default)
> sbatch: jobid             : 4294967294 (default)
> sbatch: partition         : wgnode1
> sbatch: job name          : `j1'
> sbatch: reservation       : `(null)'
> sbatch: wckey             : `(null)'
> sbatch: distribution      : unknown
> sbatch: verbose           : 8
> sbatch: immediate         : false
> sbatch: overcommit        : false
> sbatch: account           : (null)
> sbatch: comment           : (null)
> sbatch: dependency        : (null)
> sbatch: qos               : (null)
> sbatch: constraints       : mincpus=1 
> sbatch: geometry          : (null)
> sbatch: reboot            : yes
> sbatch: rotate            : no
> sbatch: network           : (null)
> sbatch: mail_type         : NONE
> sbatch: mail_user         : (null)
> sbatch: sockets-per-node  : -2
> sbatch: cores-per-socket  : -2
> sbatch: threads-per-core  : -2
> sbatch: ntasks-per-node   : 0
> sbatch: ntasks-per-socket : -2
> sbatch: ntasks-per-core   : -2
> sbatch: cpu_bind          : default
> sbatch: mem_bind          : default
> sbatch: plane_size        : 4294967294
> sbatch: propagate         : NONE
> sbatch: remote command    : `/tmp/slurmtest/./bin/dowork.sh'
> sbatch: debug:  propagating RLIMIT_CPU=18446744073709551615
> sbatch: debug:  propagating RLIMIT_FSIZE=18446744073709551615
> sbatch: debug:  propagating RLIMIT_DATA=18446744073709551615
> sbatch: debug:  propagating RLIMIT_STACK=8388608
> sbatch: debug:  propagating RLIMIT_CORE=0
> sbatch: debug:  propagating RLIMIT_RSS=18446744073709551615
> sbatch: debug:  propagating RLIMIT_NPROC=61504
> sbatch: debug:  propagating RLIMIT_NOFILE=8192
> sbatch: debug:  propagating RLIMIT_MEMLOCK=32768
> sbatch: debug:  propagating RLIMIT_AS=18446744073709551615
> sbatch: debug:  propagating SLURM_PRIO_PROCESS=0
> sbatch: debug:  propagating SUBMIT_DIR=/tmp/slurmtest
> sbatch: debug:  propagating UMASK=0002
> sbatch: debug3: Trying to load plugin /usr/lib/slurm/auth_munge.so
> sbatch: auth plugin for Munge (http://home.gna.org/munge/) loaded
> sbatch: debug3: Success.
> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_linear.so
> sbatch: debug3: Success.
> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_cray.so
> sbatch: debug3: Success.
> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_cons_res.so
> sbatch: debug3: Success.
> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_bluegene.so
> sbatch: debug3: Success.
> sbatch: debug3: Trying to load plugin /usr/lib/slurm/select_bgq.so
> sbatch: debug3: Success.
> sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: debug:  Slurm temporarily unable to accept job, sleeping and retrying.
> sbatch: error: Batch job submission failed: Resource temporarily unavailable
> 
> 
