Hi,
Just tried running that command, but it only shows nodes that are up and
running; it doesn't tell me anything about nodes that are down and turned
off. As an example, please see below. There is a job running that should be
using the 100 nodes, but only 52 are allocated (plus 2 down*, which I know
about and don't care about in this case). Where are the stats and details on
why the 40 or so other nodes are not being used? (There is nothing in the
master's log file either.)

btuser@bt_slurm_login001 ~ % tail /etc/slurm/slurm.conf
NodeName=ip-10-0-8-[2-100] CPUs=16 RealMemory=27648 Sockets=1 CoresPerSocket=16 ThreadsPerCore=1 State=CLOUD
NodeName=bt_slurm_login00[1-10] State=DOWN # these are the login nodes
PartitionName=backtest Nodes=ip-10-0-8-[2-100] Default=YES MaxTime=300 Oversubscribe=NO State=UP Priority=1 PreemptMode=requeue

btuser@bt_slurm_login001 ~ % sinfo -p backtest
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
backtest*    up   5:00:00     2 down* ip-10-0-8-[29-30]
backtest*    up   5:00:00    52 alloc ip-10-0-8-[4-17,19-24,26-28,31-59]

btuser@bt_slurm_login001 ~ % sinfo -p backtest -Rl -O reason:35,user,timestamp,statelong,nodelist
Wed Jun 19 01:24:59 2019
REASON                             USER     TIMESTAMP           STATE  NODELIST
Not responding                     root     2019-06-04T04:09:31 down*  ip-10-0-8-[29-30]

On Tue., 18 Jun. 2019, 9:32 pm Sam Gallop (NBI), <sam.gal...@nbi.ac.uk> wrote:
> Hi Nathan,
>
> The command I use to get the reason for failed nodes is ... 'sinfo -Ral'.
> If you need to extend the width of the output then ... 'sinfo -Ral -O
> reason:35,user,timestamp,statelong,nodelist'.
>
> Using the timestamp of the failure, look in the slurmd or slurmctld logs.
>
> ---
> Sam Gallop
>
> -----Original Message-----
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of
> nathan norton
> Sent: 18 June 2019 09:33
> To: slurm-users@lists.schedmd.com
> Subject: [slurm-users] status of cloud nodes
>
> Hi all,
>
> I am using Slurm with a cloud provider and it is all working a treat.
>
> Let's say I have 100 nodes, all healthy and able to be scheduled, so that
>
> $ srun -N100 hostname
>
> works fine.
>
> For some unknown reason, after the machines shut down (which happens if
> no jobs get scheduled for an hour, for example over the weekend), the
> next time a job runs,
>
> $ srun -N90 hostname
>
> fails with:
>
> "srun: Required node not available (down, drained or reserved)"
>
> "srun: job JOBID queued and waiting for resources"
>
> This is weird, as no other jobs are running and I should be able to start
> up the nodes as requested.
>
> These being 'cloud' type nodes, if I run
>
> $ scontrol show node
>
> only the up and working nodes are displayed, not the failed nodes. How do
> I get the failed nodes' information?
>
> If I stop all nodes and run the commands below, I can then start up all
> the nodes again:
>
> scontrol update NodeName=node-[1-100] State=DOWN Reason="undraining"
> scontrol update NodeName=node-[1-100] State=RESUME
> scontrol: show node node
>
> So that fixes it, but I want to figure out why nodes get into this state
> and how I can monitor it. Is there a command to get the status of CLOUD
> nodes?
>
> Any help appreciated.
>
> Thanks,
>
> Nathan.
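P.S. Two things I am planning to try next, in case it helps anyone following
along. I have not confirmed either of them on our setup yet, so please treat
both as guesses rather than known fixes.

First, the slurm.conf man page mentions a PrivateData=cloud option which,
somewhat counterintuitively, is supposed to make powered-down CLOUD nodes
visible in sinfo and scontrol instead of hiding them:

# in slurm.conf (untested here)
PrivateData=cloud

Second, since CLOUD nodes are started and stopped by slurmctld's power-save
logic (ResumeProgram/SuspendProgram), and as far as I understand a node that
fails to come up and register within ResumeTimeout gets marked down (which
would match the "Not responding" reason above), I will grep the controller
log for power-related messages and for one of the nodes that is neither
alloc nor down in the sinfo output, e.g. ip-10-0-8-60. The log path below is
a guess; check SlurmctldLogFile in your own slurm.conf:

grep -i power /var/log/slurm/slurmctld.log | tail -n 50
grep ip-10-0-8-60 /var/log/slurm/slurmctld.log | tail -n 20

If anyone knows a more direct way to see why a CLOUD node is not being
resumed, I am all ears.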