Morning,

Yesterday we had some internal network issues that caused havoc on our system. By the end of the day everything was OK on the whole.

This morning I came in to see one job on the queue (which was otherwise relatively quiet) with the NODELIST(REASON) column showing:

    (launch failed requeued held)

So I checked the system, noticed that one node was drained, and resumed it. Then I tried both

    scontrol requeue 230591
    scontrol resume 230591

but the job, which should otherwise be fine to run, is just sitting there and won't kick off.

I checked the node in question and the slurmd service is running. There is nothing else running on that node, so it's not a resources issue. Also, there's a large chunk of memory in use on the node but nothing running. I presume that's the job in a "paused" state?

    sinfo -Nle -o '%n %C %t' -n papr-res-compute02
    Fri Oct 28 08:40:42 2016
    HOSTNAMES CPUS(A/I/O/T) STATE
    papr-res-compute02 0/40/0/40 idle

    [root@vmpr-res-head-node ~]# scontrol show job 230591
    JobId=230591 JobName=Halo3
       UserId=kamarasinghe GroupId=kamarasinghe MCS_label=N/A
       Priority=0 Nice=0 Account=core QOS=normal
       JobState=PENDING Reason=launch_failed_requeued_held Dependency=(null)
       Requeue=1 Restarts=2 BatchFlag=1 Reboot=0 ExitCode=0:0
       RunTime=00:00:00 TimeLimit=04:00:00 TimeMin=N/A
       SubmitTime=2016-10-27T19:34:05 EligibleTime=2016-10-27T19:36:06
       StartTime=Unknown EndTime=Unknown Deadline=N/A
       PreemptTime=None SuspendTime=None SecsPreSuspend=0
       Partition=prod AllocNode:Sid=vmpr-res-head-node:24758
       ReqNodeList=(null) ExcNodeList=(null)
       NodeList=(null)
       BatchHost=papr-res-compute02
       NumNodes=1 NumCPUs=6 NumTasks=6 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
       TRES=cpu=6,mem=20G,node=1
       Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
       MinCPUsNode=1 MinMemoryNode=20G MinTmpDiskNode=0
       Features=(null) Gres=(null) Reservation=(null)
       OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
       Command=/researchers/Analysis/Halo3/161025_AGRF_CAGRF13575_CA5VMANXX/Halo3_design2.sbatch HAPS-31633 /researchers/Halo3/161025_AGRF_CAGRF13575_CA5VMANXX/HAPS-31633 /researchers/Analysis/Halo3/161025_AGRF_CAGRF13575_CA5VMANXX/HAPS-31633
       WorkDir=/researchers/Analysis/Halo3/161025_AGRF_CAGRF13575_CA5VMANXX
       StdErr=/researchers/Analysis/Halo3/161025_AGRF_CAGRF13575_CA5VMANXX/./logs/230591.err
       StdIn=/dev/null
       StdOut=/researchers/Analysis/Halo3/161025_AGRF_CAGRF13575_CA5VMANXX/./logs/230591.out

How do I get this job to kick off?

cheers
L.

------
The most dangerous phrase in the language is, "We've always done it this way."
- Grace Hopper