If it is any help, from https://slurm.schedmd.com/sinfo.html under NODE STATE CODES:
Node state codes are shortened as required for the field size. These node
states may be followed by a special character to identify state flags
associated with the node. The following node suffixes and states are used:

* The node is presently not responding and will not be allocated any new
work. If the node remains non-responsive, it will be placed in the *DOWN*
state (except in the case of *COMPLETING*, *DRAINED*, *DRAINING*, *FAIL*,
*FAILING* nodes).

On 18 July 2018 at 09:47, Antony Cleave <antony.cle...@gmail.com> wrote:

> I've not seen the IDLE* issue before but when my nodes got stuck I've
> always been able to fix them with this:
>
> [root@cloud01 ~]# scontrol update nodename=cloud01 state=down reason=stuck
> [root@cloud01 ~]# scontrol update nodename=cloud01 state=idle
> [root@cloud01 ~]# scontrol update nodename=cloud01 state=power_down
> [root@cloud01 ~]# scontrol update nodename=cloud01 state=power_up
>
> Antony
>
> On 17 July 2018 at 18:13, Michael Gutteridge <michael.gutteri...@gmail.com> wrote:
>
>> Hi
>>
>> I'm running a cluster in a cloud provider and have run up against an odd
>> problem with power save. I've got several hundred nodes that Slurm won't
>> power up even though they appear idle and in the powered-down state. I
>> suspect that they are in a "not-so-idle" state: `scontrol` for all of the
>> nodes which aren't being powered up shows the state as
>> "IDLE*+CLOUD+POWER". The asterisk is throwing me off here - that state
>> doesn't appear to be documented in the scontrol manpage (I want to say
>> I'd seen it discussed on the list, but Google searches haven't turned up
>> much yet).
>>
>> The other nodes in the cluster are being powered up and down as we'd
>> expect. It's just these nodes that Slurm doesn't power up.
>> In fact, it appears that the controller doesn't even _try_ to power up
>> the node - the logs (both for the controller with DebugFlags=Power and
>> the power management script logs) don't indicate even an attempt to
>> start a node when requested.
>>
>> I haven't figured out a way to reliably reset the nodes to "IDLE". Some
>> relevant configs are:
>>
>> SchedulerType=sched/backfill
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_CPU
>> SuspendProgram=/var/lib/slurm-llnl/suspend
>> SuspendTime=300
>> SuspendRate=10
>> ResumeRate=10
>> ResumeProgram=/var/lib/slurm-llnl/resume
>> ResumeTimeout=300
>> BatchStartTimeout=300
>>
>> A typical node is configured thus:
>>
>> NodeName=nodef74 NodeAddr=nodef74.fhcrc.org Feature=c5.2xlarge CPUs=4
>> RealMemory=16384 Weight=40 State=CLOUD
>>
>> Thanks for your time - any advice or hints are greatly appreciated.
>>
>> Michael
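With several hundred stuck nodes, applying Antony's recovery sequence by hand
is impractical, so here is a minimal sketch of scripting it. Only the parsing
runs live: the sample `sinfo` text and the node names are made-up stand-ins
(in real use you would substitute the output of `sinfo -h -N -o '%N %T'`),
and the `scontrol` commands are echoed rather than executed so the sketch is
safe to run as-is. Whether resetting to idle alone is enough, or the full
down/idle/power_down/power_up cycle is needed, may vary by site.

```shell
# Hypothetical sinfo output; replace with:  sinfo_out=$(sinfo -h -N -o '%N %T')
sinfo_out="nodef74 idle*
nodef75 idle
nodef76 idle*"

# Nodes whose state carries the '*' (not responding) flag
stuck=$(printf '%s\n' "$sinfo_out" | awk '$2 ~ /\*/ {print $1}' | sort -u)

for n in $stuck; do
    # Echoed, not executed - remove 'echo' to actually reset the nodes
    echo "scontrol update nodename=$n state=down reason=stuck"
    echo "scontrol update nodename=$n state=idle"
done
```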