Hi all,

I am using GRES and the cons_res plugin to limit the maximum number of jobs that
can run on a given node, to cope with software licence constraints.
On my cluster, I noticed that even though GRES resources were available, jobs
were not moved to the RUNNING state and stayed PENDING with reason 'Resources'.

I can reproduce this behaviour on a test PC; my gres.conf and slurm.conf files
are included below.
I have a script that submits hundreds of batch jobs, each sleeping for a few
seconds. When I monitor the queue with squeue -i1, I can see that the number of
running jobs is far below the expected 7 (2 sid + 2 lid + 3 trans); for
instance, there should always be 3 running 'trans' jobs, but most of the time
there are only 2, and for long stretches even fewer.
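To quantify the shortfall I count RUNNING jobs per partition. A minimal sketch of the counting step (the here-doc below stands in for real `squeue -h -o "%P %T"` output, which is what I actually pipe in):

```shell
#!/bin/bash
# Count RUNNING jobs per partition from squeue-style "PARTITION STATE" lines.
# The here-doc is sample data; in practice replace it with:
#   squeue -h -o "%P %T" | awk '...'
awk '$2 == "RUNNING" { count[$1]++ }
     END { for (p in count) print p, count[p] }' <<'EOF'
DEF RUNNING
DEF RUNNING
PSID RUNNING
PLID PENDING
EOF
```

With the three sample RUNNING lines above this prints a count of 2 for DEF and 1 for PSID; PLID does not appear because its only job is PENDING.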

I'm running Slurm 2.5.7 (for backward compatibility with my cluster).

Does anyone have a clue how to solve this issue? Thanks for your help.
Paule



Here are my test scripts and config files.

--------------------------------------------- ## T_job.sh  small  fake job
#!/bin/bash

echo "ID=$SLURM_JOB_ID"
# sleep a random 3-9 seconds
t=`expr $RANDOM % 10`
if [ $t -lt 3 ]; then t=3; fi
echo "t = $t"
srun sleep $t
echo ok


--------------------------------------------- ## run 100 jobs per partition
#!/bin/bash
#  run 100 jobs per partition

tmpdir=`pwd`/tmp
rm -rf "$tmpdir"
mkdir "$tmpdir"
n=0
while [ $n -lt 100 ]; do
        sbatch -D "$tmpdir" -J SID$n   --gres=sid:1   -p PSID T_job.sh
        sbatch -D "$tmpdir" -J LID$n   --gres=lid:1   -p PLID T_job.sh
        sbatch -D "$tmpdir" -J TRANS$n --gres=trans:1 -p DEF  T_job.sh
        n=$(( n + 1 ))
done



---------------------------------------------##    gres.conf
Name=sid Count=2
Name=lid Count=2
Name=trans Count=3


---------------------------------------------## slurm.conf
#

ControlMachine=localhost
AuthType=auth/none
CacheGroups=0
CryptoType=crypto/openssl
GresTypes=sid,lid,trans
JobCredentialPrivateKey=/etc/slurm/slurm.key
JobCredentialPublicCertificate=/etc/slurm/slurm.cert
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/tmp/slurm/slurmd
SlurmUser=root
SlurmdUser=root
StateSaveLocation=/tmp
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SchedulerParameters=defer,default_queue_depth=500,bf_max_job_test=300,bf_resolution=120,bf_window=120
SelectType=select/cons_res
AccountingStorageType=accounting_storage/none
AccountingStoreJobComment=YES
ClusterName=cluster
DebugFlags=Backfill,Gres
JobCompLoc=/tmp/jobcomp.txt
JobCompType=jobcomp/filetxt
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/tmp/slurm/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/tmp/slurm/slurmd.log
SlurmSchedLogFile=/tmp/slurm/sched.log
SlurmSchedLogLevel=3
NodeName=localhost CPUs=8 Gres=sid:2,lid:2,trans:3 State=UNKNOWN
PartitionName=DEF Nodes=localhost Default=YES MaxTime=INFINITE State=UP
PartitionName=PSID Nodes=localhost Default=NO MaxTime=INFINITE State=UP
PartitionName=PLID Nodes=localhost Default=NO MaxTime=INFINITE State=UP


