Hi all,

I am using GRES and the cons_res plugin to control the maximum number of jobs that can run on a given node, to cope with software licence limits. On my cluster, I noticed that although GRES resources were available, jobs were not put into the RUNNING state and remained pending with reason 'Resources'.
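A quick way to see this breakdown is to count jobs per (partition, state) pair. This is only a sketch: `%P` and `%t` are standard `squeue` format specifiers (partition and state), and the here-doc below stands in for live `squeue` output on a machine without Slurm installed.

```shell
#!/bin/bash
# Summarise jobs per (partition, state). On a live cluster this would be:
#   squeue -h -o '%P %t' | sort | uniq -c
# The here-doc is a hypothetical stand-in sample of that output.
summarise() { sort | uniq -c; }
summarise <<'EOF'
DEF R
DEF PD
PSID R
PSID PD
PSID PD
EOF
```

Pending jobs show up as `PD`; `squeue -o '%r'` additionally prints the pending reason (e.g. `Resources`).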
I can reproduce this behaviour on a test PC. My gres.conf and slurm.conf files are included below. I have a script which starts hundreds of batch jobs that sleep for a few seconds. When I monitor the queue with squeue -i1, I can see that the number of running jobs is far below the expected 7; for instance, there should always be 3 running 'trans' jobs, but most of the time there are only 2, and for long periods even fewer.

I'm running Slurm 2.5.7 (for backward compatibility with my cluster).

Does anyone have a clue to solve this issue?

Thanks for your help,
Paule

Here are my test scripts and config files.

---------------------------------------------
## T_job.sh  small fake job
#!/bin/bash
echo "ID=$SLURM_JOB_ID"
t=`expr $RANDOM % 10`
if [ $t -lt 3 ]; then t=3; fi
echo "t = $t"
srun sleep $t
echo ok
---------------------------------------------
## run 100 jobs per partition
#!/bin/bash
# run 100 jobs per partition
tmpdir=`pwd`/tmp
rm -rf $tmpdir
mkdir $tmpdir
n=0
while [ $n -lt 100 ]; do
    sbatch -D $tmpdir -J SID$n   --gres=sid:1   -pPSID T_job.sh
    sbatch -D $tmpdir -J LID$n   --gres=lid:1   -pPLID T_job.sh
    sbatch -D $tmpdir -J TRANS$n --gres=trans:1 -pDEF  T_job.sh
    n=$(( n + 1 ))
done
---------------------------------------------
## gres.conf
Name=sid Count=2
Name=lid Count=2
Name=trans Count=3
---------------------------------------------
## slurm.conf
#
ControlMachine=localhost
AuthType=auth/none
CacheGroups=0
CryptoType=crypto/openssl
GresTypes=sid,lid,trans
JobCredentialPrivateKey=/etc/slurm/slurm.key
JobCredentialPublicCertificate=/etc/slurm/slurm.cert
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/tmp/slurm/slurmd
SlurmUser=root
SlurmdUser=root
StateSaveLocation=/tmp
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SchedulerParameters=defer,default_queue_depth=500,bf_max_job_test=300,bf_resolution=120,bf_window=120
SelectType=select/cons_res
AccountingStorageType=accounting_storage/none
AccountingStoreJobComment=YES
ClusterName=cluster
DebugFlags=Backfill,Gres
JobCompLoc=/tmp/jobcomp.txt
JobCompType=jobcomp/filetxt
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/tmp/slurm/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/tmp/slurm/slurmd.log
SlurmSchedLogFile=/tmp/slurm/sched.log
SlurmSchedLogLevel=3
NodeName=localhost CPUs=8 Gres=sid:2,lid:2,trans:3 State=UNKNOWN
PartitionName=DEF  Nodes=localhost Default=YES MaxTime=INFINITE State=UP
PartitionName=PSID Nodes=localhost Default=NO  MaxTime=INFINITE State=UP
PartitionName=PLID Nodes=localhost Default=NO  MaxTime=INFINITE State=UP
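For reference, the expected steady-state concurrency of 7 quoted above is just the sum of the Gres counts declared in gres.conf, since each job consumes exactly one unit of its resource:

```shell
#!/bin/bash
# Expected concurrent jobs = sum of Gres counts from gres.conf
# (sid:2 + lid:2 + trans:3), one unit consumed per job.
sid=2; lid=2; trans=3
echo "expected running jobs: $(( sid + lid + trans ))"   # prints 7
```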
