Hello,

Thank you for your response to my email. I've taken a look at one of the compute nodes that has been drained by the SLURM system -- please see below. It appears to suggest the node was drained due to a job failing (running out of walltime perhaps?).
This is very odd, since I don't have anything in the epilog script that takes nodes out of service for any reason (at least not explicitly). Does anyone have any ideas why nodes are draining following "job failures"?

Best regards,
David

[root@blue30 etc]# scontrol show node red0038
NodeName=red0038 Arch=x86_64 CoresPerSocket=6
   CPUAlloc=0 CPUErr=0 CPUTot=12 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=red0038 NodeHostName=red0038 Version=17.02
   OS=Linux RealMemory=1 AllocMem=0 FreeMem=22050 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=batch
   BootTime=2017-04-07T09:31:26 SlurmdStartTime=2017-05-04T14:50:06
   CfgTRES=cpu=12,mem=1M
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=batch job complete failure [root@2017-05-22T13:10:13]

________________________________________
From: Christopher Samuel [sam...@unimelb.edu.au]
Sent: 24 May 2017 00:23
To: slurm-dev
Subject: [slurm-dev] Re: Compute nodes going to drained/draining state

On 22/05/17 19:57, Baker D.J. wrote:

> I’ve recently started using slurm v17.02.2, however something seems very
> odd. For some reason, when for example jobs fail or exceed their
> walltime limit, I see that compute nodes are being placed in drained or
> draining state. Does anyone understand what might be wrong?

Anything setting a drain state is meant to also set a reason, what does
"scontrol show node $NODE" say for these?

Also are there any relevant messages in your slurmctld and slurmd logs?

Best of luck,
Chris
--
 Christopher Samuel        Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
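
When several nodes are affected, it may help to pull the State and Reason fields out of the "scontrol show node" output automatically. The following is only a minimal sketch: the helper name `drain_info` is hypothetical, and the sample text is a shortened copy of the red0038 output quoted in this thread. It assumes that Reason is the last field on its line, as it is in that output.

```python
import re


def drain_info(scontrol_text):
    """Return (state, reason) parsed from `scontrol show node` output.

    Either element is None if the corresponding field is absent.
    """
    state = re.search(r"\bState=(\S+)", scontrol_text)
    # Reason is free text containing spaces, so capture to the end of its line.
    reason = re.search(r"\bReason=(.*)", scontrol_text)
    return (
        state.group(1) if state else None,
        reason.group(1).strip() if reason else None,
    )


# Shortened copy of the output for red0038 shown in this thread.
sample = """NodeName=red0038 Arch=x86_64 CoresPerSocket=6
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=batch
   Reason=batch job complete failure [root@2017-05-22T13:10:13]"""

state, reason = drain_info(sample)
is_drained = state is not None and "DRAIN" in state.split("+")
print(state, is_drained)
print(reason)
```

On a live system the text would come from running "scontrol show node" for each node instead of the hard-coded sample. "sinfo -R" should also list the recorded reason for each down or drained node without any parsing.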