Morning,

Yesterday we had some internal network issues that caused havoc on our
system. By the end of the day everything was mostly OK.

This morning I came in to find one job in the queue (which was otherwise
relatively quiet) showing this in the Nodelist/Reason column: (launch failed
requeued held)

So I checked the system, noticed that one node was drained, and resumed it.
Then I tried both

scontrol requeue 230591
scontrol resume 230591

but the job, which should otherwise be ready to run, is just sitting
there and won't kick off.

I checked on the node in question and the slurmd service is running. There
is nothing else running on that node, so it's not a resources issue. Also,
there's a large chunk of memory being used, but nothing running. I presume
that's the job in a "paused" state?

sinfo -Nle -o '%n %C %t' -n papr-res-compute02
Fri Oct 28 08:40:42 2016
HOSTNAMES CPUS(A/I/O/T) STATE
papr-res-compute02 0/40/0/40 idle
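
For anyone reading the output above: the CPUS(A/I/O/T) column is allocated/idle/other/total, so 0/40/0/40 means all 40 CPUs on the node are idle. A minimal sketch that splits such a field into labelled counts (the function name describe_cpus is made up for illustration):

```shell
#!/bin/sh
# describe_cpus FIELD : split a sinfo "A/I/O/T" CPU field into labelled counts.
# describe_cpus is a hypothetical helper name, not part of Slurm.
describe_cpus() {
    printf '%s\n' "$1" |
        awk -F/ '{ printf "allocated=%s idle=%s other=%s total=%s\n", $1, $2, $3, $4 }'
}

# The value reported for papr-res-compute02 above.
describe_cpus "0/40/0/40"
```

Note that this format string only reports CPUs; it says nothing about memory use on the node.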



[root@vmpr-res-head-node ~]# scontrol show job 230591
JobId=230591 JobName=Halo3
   UserId=kamarasinghe GroupId=kamarasinghe MCS_label=N/A
   Priority=0 Nice=0 Account=core QOS=normal
   JobState=PENDING Reason=launch_failed_requeued_held Dependency=(null)
   Requeue=1 Restarts=2 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=04:00:00 TimeMin=N/A
   SubmitTime=2016-10-27T19:34:05 EligibleTime=2016-10-27T19:36:06
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=prod AllocNode:Sid=vmpr-res-head-node:24758
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   BatchHost=papr-res-compute02
   NumNodes=1 NumCPUs=6 NumTasks=6 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=6,mem=20G,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=20G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)

   Command=/researchers/Analysis/Halo3/161025_AGRF_CAGRF13575_CA5VMANXX/Halo3_design2.sbatch HAPS-31633 /researchers/Halo3/161025_AGRF_CAGRF13575_CA5VMANXX/HAPS-31633 /researchers/Analysis/Halo3/161025_AGRF_CAGRF13575_CA5VMANXX/HAPS-31633
   WorkDir=/researchers/Analysis/Halo3/161025_AGRF_CAGRF13575_CA5VMANXX
   StdErr=/researchers/Analysis/Halo3/161025_AGRF_CAGRF13575_CA5VMANXX/./logs/230591.err
   StdIn=/dev/null
   StdOut=/researchers/Analysis/Halo3/161025_AGRF_CAGRF13575_CA5VMANXX/./logs/230591.out
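
The two fields in that output that look most relevant are Priority=0 and Reason=launch_failed_requeued_held. Below is a rough sketch for pulling single fields out of `scontrol show job` text and flagging a likely hold. It assumes that Slurm marks a held job with Priority=0, which is my understanding but is not something the output itself states. The helper names job_field and is_held are made up for illustration, and the sample text is a trimmed copy of the output above.

```shell
#!/bin/sh
# job_field KEY : read scontrol-style "Key=Value" text on stdin and
# print the value of the first field named KEY.
job_field() {
    tr -s ' \t' '\n\n' |
        awk -F= -v key="$1" '$1 == key { print substr($0, length(key) + 2); exit }'
}

# is_held : read the same text on stdin; exit status 0 if Priority is 0.
# Assumption: Priority=0 indicates a held job.
is_held() {
    [ "$(job_field Priority)" = "0" ]
}

# Trimmed copy of the scontrol output for job 230591.
sample='JobId=230591 JobName=Halo3
   Priority=0 Nice=0 Account=core QOS=normal
   JobState=PENDING Reason=launch_failed_requeued_held Dependency=(null)'

reason=$(printf '%s\n' "$sample" | job_field Reason)
if printf '%s\n' "$sample" | is_held; then
    echo "job looks held: $reason"
fi
```

On a live system the same helpers could be fed from `scontrol show job 230591` instead of the sample text. If the hold turns out to be the only thing blocking the job, `scontrol release 230591` is the command normally used to clear a hold, though that is untested here.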


How do I get this job to kick off?

cheers
L.


------
The most dangerous phrase in the language is, "We've always done it this
way."

- Grace Hopper
