We recently upgraded from Slurm 20.11.9 to 22.05.8 and since then appear to have a problem with jobs not being scheduled on nodes with free resources.
It is most noticeable on one particular partition with only one GPU node in it. Jobs queuing for this node are the highest priority in the queue at the moment, and the node is idle, but the job does not start:

[sudberlr-admin@bb-er-slurm01 ~]$ squeue -p broadwell-gpum60-ondemand --format "%.18i %.9P %.2t %.10M %.6D %30R %Q"
             JOBID PARTITION ST       TIME  NODES NODELIST(REASON)               PRIORITY
          66631657 broadwell PD       0:00      1 (Resources)                    230
          66609948 broadwell PD       0:00      1 (Resources)                    203

[sudberlr-admin@bb-er-slurm01 ~]$ squeue --format "%Q %i" --sort -Q | head -4
PRIORITY JOBID
230 66631657
212 66622378
210 66322847

[sudberlr-admin@bb-er-slurm01 ~]$ scontrol show node bear-pg0212u17b
NodeName=bear-pg0212u17b Arch=x86_64 CoresPerSocket=10
   CPUAlloc=0 CPUEfctv=20 CPUTot=20 CPULoad=0.01
   AvailableFeatures=haswell
   ActiveFeatures=haswell
   Gres=gpu:m60:2(S:0-1)
   NodeAddr=bear-pg0212u17b NodeHostName=bear-pg0212u17b Version=22.05.8
   OS=Linux 3.10.0-1160.49.1.el7.x86_64 #1 SMP Tue Nov 30 15:51:32 UTC 2021
   RealMemory=511000 AllocMem=0 FreeMem=501556 Sockets=2 Boards=1
   MemSpecLimit=501
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=broadwell-gpum60-ondemand,system
   BootTime=2023-04-25T08:24:10 SlurmdStartTime=2023-05-04T11:57:46
   LastBusyTime=2023-05-09T13:27:07
   CfgTRES=cpu=20,mem=511000M,billing=20,gres/gpu=2
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

[sudberlr-admin@bb-er-slurm01 ~]$

The resources it requests are easily met by the node:

[sudberlr-admin@bb-er-slurm01 ~]$ scontrol show job 66631657
JobId=66631657 JobName=sys/dashboard/sys/bc_uob_paraview
   UserId=XXXX(633299) GroupId=users(100) MCS_label=N/A
   Priority=230 Nice=0 Account=XXXX QOS=bbondemand
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2023-05-09T13:27:31 EligibleTime=2023-05-09T13:27:31
   AccrueTime=2023-05-09T13:27:31
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-05-09T16:02:30 Scheduler=Main
   Partition=broadwell-gpum60-ondemand,cascadelake-hdr-ondemand,cascadelake-hdr-ondemand2 AllocNode:Sid=localhost:1120095
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=
   NumNodes=1-1 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=32G,node=1,billing=8,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/XXXXXXXXXXXXX
   StdErr=/XXXXXXXXXXXXX/output.log
   StdIn=/dev/null
   StdOut=/XXXXXXXXXXXXX/output.log
   Power=
   TresPerNode=gres:gpu:1

[sudberlr-admin@bb-er-slurm01 ~]$

This looks like a bug to me, because it was working fine before the upgrade and a simple restart of the Slurm controller will often allow the jobs to start, without any other changes:

[sudberlr-admin@bb-er-slurm01 ~]$ squeue -p broadwell-gpum60-ondemand --format "%.18i %.9P %.2t %.10M %.6D %32R %Q"
             JOBID PARTITION ST       TIME  NODES NODELIST(REASON)                 PRIORITY
          66631657 broadwell PD       0:00      1 (Resources)                      230
          66609948 broadwell PD       0:00      1 (Resources)                      203

[sudberlr-admin@bb-er-slurm01 ~]$ sudo systemctl restart slurmctld; sleep 30; squeue -p broadwell-gpum60-ondemand --format "%.18i %.9P %.2t %.10M %.6D %32R %Q"
Job for slurmctld.service canceled.
             JOBID PARTITION ST       TIME  NODES NODELIST(REASON)                 PRIORITY
          66631657 broadwell  R       0:04      1 bear-pg0212u17b                  230
          66609948 broadwell  R       0:04      1 bear-pg0212u17b                  203

[sudberlr-admin@bb-er-slurm01 ~]$
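In case it helps, the next time a job gets stuck like this I can capture more detail from the controller before restarting it, roughly along these lines (standard scontrol/sdiag commands; the debug level is just a guess at what will be enough to show why the node is being skipped):

# Temporarily raise slurmctld logging and enable backfill debug output
scontrol setdebug debug2
scontrol setdebugflags +Backfill

# Snapshot of scheduler statistics while the job is still pending
sdiag

# Revert once captured
scontrol setdebugflags -Backfill
scontrol setdebug info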
Has anyone come across this behaviour or have any other ideas?

Many thanks,

Luke

--

Luke Sudbery
Principal Engineer (HPC and Storage).
Architecture, Infrastructure and Systems
Advanced Research Computing, IT Services
Room 132, Computer Centre G5, Elms Road

Please note I don't work on Monday.