Hi,
Just tried running that command, but it only shows nodes that are up and
running; it doesn't tell me anything about nodes that are down and turned
off. As an example, please see below. There is a job running that should be
using the 100 nodes, but only 52 are allocated (plus 2 down*, which I know
about and don't care about in this case). Where are the stats and details on
why the 40 or so other nodes are not being used? (There is nothing in the
master's log file either.)

btuser@bt_slurm_login001 ~ % tail /etc/slurm/slurm.conf
NodeName=ip-10-0-8-[2-100] CPUs=16 RealMemory=27648 Sockets=1 CoresPerSocket=16 ThreadsPerCore=1 State=CLOUD
NodeName=bt_slurm_login00[1-10] State=DOWN # these are the login nodes
PartitionName=backtest Nodes=ip-10-0-8-[2-100] Default=YES MaxTime=300 Oversubscribe=NO State=UP Priority=1 PreemptMode=requeue

btuser@bt_slurm_login001 ~ % sinfo -p backtest
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
backtest*    up   5:00:00     2 down* ip-10-0-8-[29-30]
backtest*    up   5:00:00    52 alloc ip-10-0-8-[4-17,19-24,26-28,31-59]

btuser@bt_slurm_login001 ~ % sinfo -p backtest -Rl -O reason:35,user,timestamp,statelong,nodelist
Wed Jun 19 01:24:59 2019
REASON                             USER     TIMESTAMP           STATE  NODELIST
Not responding                     root     2019-06-04T04:09:31 down*  ip-10-0-8-[29-30]

On Tue., 18 Jun. 2019, 9:32 pm Sam Gallop (NBI), <sam.gal...@nbi.ac.uk> wrote:
> Hi Nathan,
>
> The command I use to get the reason for failed nodes is ... 'sinfo -Ral'.
> If you need to extend the width of the output then ... 'sinfo -Ral -O
> reason:35,user,timestamp,statelong,nodelist'.
>
> Using the timestamp of the failure, look in the slurmd or slurmctld logs.
>
> ---
> Sam Gallop
>
> -----Original Message-----
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of
> nathan norton
> Sent: 18 June 2019 09:33
> To: slurm-users@lists.schedmd.com
> Subject: [slurm-users] status of cloud nodes
>
> Hi all,
>
> I am using Slurm with a cloud provider and it is all working a treat.
>
> Let's say I have 100 nodes, all healthy and able to be scheduled, so that
>
> $ srun -N100 hostname
>
> works fine.
>
> For some unknown reason, after the machines shut down (which happens if
> no jobs get scheduled for an hour, for example over the weekend), the
> next time a job runs,
>
> $ srun -N90 hostname
>
> fails with:
>
> "srun: Required node not available (down, drained or reserved)"
>
> "srun: job JOBID queued and waiting for resources"
>
> This is weird, as no other jobs are running and I should be able to start
> up the nodes as requested.
>
> These being 'cloud' type nodes, if I run
>
> $ scontrol show node
>
> only the up and working nodes are displayed, not the failed nodes. How do
> I get the failed nodes' information?
>
> If I stop all nodes and run the commands below, I can then start up all
> the nodes again:
>
> scontrol update NodeName=node-[1-100] State=DOWN Reason="undraining"
> scontrol update NodeName=node-[1-100] State=RESUME
> scontrol: show node node
>
> So that fixes it, but I want to figure out why nodes get into this state
> and how I can monitor it. Is there a command to get the status of CLOUD
> nodes?
>
> Any help appreciated.
>
> Thanks,
>
> Nathan.
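P.S. Two things I am planning to try next, in case it helps anyone following
along. I have not confirmed either of them on our setup yet, so please treat
both as guesses rather than known fixes.

First, the slurm.conf man page mentions a PrivateData=cloud option which,
somewhat counterintuitively, is supposed to make powered-down CLOUD nodes
visible in sinfo and scontrol instead of hiding them:

# in slurm.conf (untested here)
PrivateData=cloud

Second, since CLOUD nodes are started and stopped by slurmctld's power-save
logic (ResumeProgram/SuspendProgram), and as far as I understand a node that
fails to come up and register within ResumeTimeout gets marked down (which
would match the "Not responding" reason above), I will grep the controller
log for power-related messages and for one of the nodes that is neither
alloc nor down in the sinfo output, e.g. ip-10-0-8-60. The log path below is
a guess; check SlurmctldLogFile in your own slurm.conf:

grep -i power /var/log/slurm/slurmctld.log | tail -n 50
grep ip-10-0-8-60 /var/log/slurm/slurmctld.log | tail -n 20

If anyone knows a more direct way to see why a CLOUD node is not being
resumed, I am all ears.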