[slurm-users] Slurm powersave

2023-10-04 Thread Davide DelVento
I'm experimenting with slurm powersave and I have several questions. I'm
following the guidance from https://slurm.schedmd.com/power_save.html and
the great presentation from our own
https://slurm.schedmd.com/SLUG23/DTU-SLUG23.pdf

I am running slurm 23.02.3

1) I'm not sure I fully understand ReconfigFlags=KeepPowerSaveSettings
The documentation says that if set, an "scontrol reconfig" command will
preserve the current state of SuspendExcNodes, SuspendExcParts and
SuspendExcStates. Why would one *NOT* want to preserve that? What would
happen if one does not (or does) have this setting? For now I'm using it,
assuming that it means "if I run 'scontrol reconfig', don't shut off nodes
that are up because I said in slurm.conf, with those three options, that
they should stay up" --- but I am not clear if that is really what it says.

2) the PDF above says that the problem with nodes in down and drained state
is solved in 23.02 but that does not appear to be the case. Before running
my experiment, I had

$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Not responding       root      2023-09-13T13:14:50 node31
ECC memory errors    root      2023-08-26T07:21:04 node27

and after it became

$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Not responding       root      2023-09-13T13:14:50 node31
none                 Unknown   Unknown             node27

And that despite having excluded drain'ed nodes as below:

--- a/slurm/slurm.conf
+++ b/slurm/slurm.conf
@@ -140,12 +140,15 @@ SlurmdLogFile=/var/log/slurm/slurmd.log
 #
 #
 # POWER SAVE SUPPORT FOR IDLE NODES (optional)
+SuspendProgram=/opt/slurm/poweroff
+ResumeProgram=/opt/slurm/poweron
+SuspendTimeout=120
+ResumeTimeout=240
 #ResumeRate=
+SuspendExcNodes=node[13-32]:2
+SuspendExcStates=down,drain,fail,maint,not_responding,reserved
+BatchStartTimeout=60
+ReconfigFlags=KeepPowerSaveSettings # not sure if needed: preserve current status when running "scontrol reconfig"
-PartitionName=compute512 Default=False Nodes=node[13-32] State=UP DefMemPerCPU=9196
+PartitionName=compute512 Default=False Nodes=node[13-32] State=UP DefMemPerCPU=9196 SuspendTime=600

so probably that's not solved? Anyway, that's a nuisance, not a deal breaker

3) The whole thing does not appear to be working as I intended. My
understanding of the "exclude node" setting above is that slurm should
never attempt to shut off more than all the idle nodes in that partition
minus 2. Instead it shut all of them off, and then tried to turn them
back on:

$ sinfo | grep 512
compute512     up   infinite      1 alloc# node15
compute512     up   infinite      2  idle# node[14,32]
compute512     up   infinite      3  down~ node[16-17,31]
compute512     up   infinite      1 drain~ node27
compute512     up   infinite     12  idle~ node[18-26,28-30]
compute512     up   infinite      1  alloc node13

But again this is a minor nuisance which I can live with (especially if it
happens only when I "flip the switch"), and I'm mentioning it only in case
it's a symptom of something else I'm doing wrong. I did try both the
SuspendExcNodes=node[13-32]:2 syntax, as it seems more consistent to me with
the rest of the file (e.g. the partition definitions), and the
SuspendExcNodes=node[13\-32]:2 syntax suggested in the slurm powersave
documentation. The behavior was exactly identical.

4) Most importantly from the output above you may have noticed two nodes
(actually three by the time I ran the command below) that slurm deemed down

$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Not responding       root      2023-09-13T13:14:50 node31
reboot timed out     slurm     2023-10-04T14:51:28 node14
reboot timed out     slurm     2023-10-04T14:52:28 node15
reboot timed out     slurm     2023-10-04T14:49:58 node32
none                 Unknown   Unknown             node27

This can't be the case, the nodes are fine, and cannot have timed out while
"rebooting", because for now my poweroff and poweron scripts are identical:
literally a simple one-liner bash script that does almost nothing, and the
log file is populated correctly as I would expect:

echo "Pretending to $0 the following node(s): $1"  >> $log_file 2>&1

So I can confirm slurm invoked the script, but then waited for something
(what? starting slurmd?) which failed to occur, and marked the node down.
When I removed the suspend time from the partition to end the experiment,
the other nodes went "magically" back into production, without slurm calling
my poweron script. Of course the nodes were never powered off, but slurm
thought they were, so why did it not have the problem it did with the nodes
it intentionally tried to power on?

Thanks for any light you can shed on these issues, particularly the last
one!


Re: [slurm-users] Slurm powersave

2023-10-05 Thread Ole Holm Nielsen

Hi Davide,

On 10/4/23 23:03, Davide DelVento wrote:
I'm experimenting with slurm powersave and I have several questions. I'm
following the guidance from https://slurm.schedmd.com/power_save.html and
the great presentation from our own
https://slurm.schedmd.com/SLUG23/DTU-SLUG23.pdf



I presented that talk at SLUG'23 :-)


I am running slurm 23.02.3

1) I'm not sure I fully understand ReconfigFlags=KeepPowerSaveSettings
The documentation says that if set, an "scontrol reconfig" command will
preserve the current state of SuspendExcNodes, SuspendExcParts and
SuspendExcStates. Why would one *NOT* want to preserve that? What would
happen if one does not (or does) have this setting? For now I'm using it,
assuming that it means "if I run 'scontrol reconfig', don't shut off nodes
that are up because I said in slurm.conf, with those three options, that
they should stay up" --- but I am not clear if that is really what it says.


As I understand it, the ReconfigFlags means that if you updated some 
settings using scontrol, they will be lost when slurmctld is reconfigured, 
and the settings from slurm.conf will be used instead.
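
For instance (I am assuming here that your 23.02 accepts these power-save
parameters in "scontrol update"; please check the scontrol man page, since I
haven't verified the exact syntax):

$ scontrol update SuspendExcNodes=node[13-32]:4   # runtime change, not in slurm.conf
$ scontrol reconfig

Without ReconfigFlags=KeepPowerSaveSettings the runtime value would be
dropped by the reconfigure and the slurm.conf value (node[13-32]:2 in your
case) would take over again; with the flag set, the runtime value should
survive the reconfigure.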


2) the PDF above says that the problem with nodes in down and drained 
state is solved in 23.02 but that does not appear to be the case. Before 
running my experiment, I had


$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Not responding       root      2023-09-13T13:14:50 node31
ECC memory errors    root      2023-08-26T07:21:04 node27

and after it became

$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Not responding       root      2023-09-13T13:14:50 node31
none                 Unknown   Unknown             node27


Please use "sinfo -lR" so that we can see the node STATE.


And that despite having excluded drain'ed nodes as below:

--- a/slurm/slurm.conf
+++ b/slurm/slurm.conf
@@ -140,12 +140,15 @@ SlurmdLogFile=/var/log/slurm/slurmd.log
  #
  #
  # POWER SAVE SUPPORT FOR IDLE NODES (optional)
+SuspendProgram=/opt/slurm/poweroff
+ResumeProgram=/opt/slurm/poweron
+SuspendTimeout=120
+ResumeTimeout=240
  #ResumeRate=
+SuspendExcNodes=node[13-32]:2
+SuspendExcStates=down,drain,fail,maint,not_responding,reserved
+BatchStartTimeout=60
+ReconfigFlags=KeepPowerSaveSettings # not sure if needed: preserve current status when running "scontrol reconfig"
-PartitionName=compute512 Default=False Nodes=node[13-32] State=UP DefMemPerCPU=9196
+PartitionName=compute512 Default=False Nodes=node[13-32] State=UP DefMemPerCPU=9196 SuspendTime=600


so probably that's not solved? Anyway, that's a nuisance, not a deal breaker


With my 23.02.5 the SuspendExcStates is working as documented :-)

3) The whole thing does not appear to be working as I intended. My
understanding of the "exclude node" setting above is that slurm should
never attempt to shut off more than all the idle nodes in that partition
minus 2. Instead it shut all of them off, and then tried to turn them
back on:


$ sinfo | grep 512
compute512     up   infinite      1 alloc# node15
compute512     up   infinite      2  idle# node[14,32]
compute512     up   infinite      3  down~ node[16-17,31]
compute512     up   infinite      1 drain~ node27
compute512     up   infinite     12  idle~ node[18-26,28-30]
compute512     up   infinite      1  alloc node13


I agree that 2 nodes from node[13-32] shouldn't be suspended, according to 
SuspendExcNodes in the slurm.conf manual.  I haven't tested this feature.
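
That is, as I read the manual (and again, I haven't tested it), a line like
the one below should always keep some of that set awake:

SuspendExcNodes=node[13-32]:2   # at least 2 nodes of this set are never suspended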


But again this is a minor nuisance which I can live with (especially if it
happens only when I "flip the switch"), and I'm mentioning it only in case
it's a symptom of something else I'm doing wrong. I did try both the
SuspendExcNodes=node[13-32]:2 syntax, as it seems more consistent to me with
the rest of the file (e.g. the partition definitions), and the
SuspendExcNodes=node[13\-32]:2 syntax suggested in the slurm powersave
documentation. The behavior was exactly identical.


4) Most importantly from the output above you may have noticed two nodes 
(actually three by the time I ran the command below) that slurm deemed down


$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Not responding       root      2023-09-13T13:14:50 node31
reboot timed out     slurm     2023-10-04T14:51:28 node14
reboot timed out     slurm     2023-10-04T14:52:28 node15
reboot timed out     slurm     2023-10-04T14:49:58 node32
none                 Unknown   Unknown             node27

This can't be the case, the nodes are fine, and cannot have timed out while
"rebooting", because for now my poweroff and poweron scripts are identical:
literally a simple one-liner bash script that does almost nothing, and the
log file is populated correctly as I would expect:


echo "Pretending to $0 the following node(s): $1"  >> $log_file 2>&1

So I can confirm slurm invoked the script, but then waited 

Re: [slurm-users] Slurm powersave

2023-10-05 Thread Davide DelVento
Hi Ole,

Thanks for getting back to me.

> > the great presentation from our own
> I presented that talk at SLUG'23 :-)
>

Yes! That's why I wrote "from our own", but perhaps that's local slang where
I live (and English is my second language).


> > 1) I'm not sure I fully understand ReconfigFlags=KeepPowerSaveSettings

> As I understand it, the ReconfigFlags means that if you updated some
> settings using scontrol, they will be lost when slurmctld is reconfigured,
> and the settings from slurm.conf will be used instead.
>

I see, so that applies to the case in which I change the (power) state of
the nodes by scontrol.


>
> > 2) the PDF above says that the problem with nodes in down and drained
> > state is solved in 23.02 but that does not appear to be the case. Before
> > running my experiment, I had
> >
> > $ sinfo -R
> > REASON               USER      TIMESTAMP           NODELIST
> > Not responding       root      2023-09-13T13:14:50 node31
> > ECC memory errors    root      2023-08-26T07:21:04 node27
> >
> > and after it became
> >
> > $ sinfo -R
> > REASON               USER      TIMESTAMP           NODELIST
> > Not responding       root      2023-09-13T13:14:50 node31
> > none                 Unknown   Unknown             node27
>
> Please use "sinfo -lR" so that we can see the node STATE.
>

$ sinfo -lR
Thu Oct 05 07:08:18 2023
REASON               USER      TIMESTAMP           STATE  NODELIST
Not responding       root(0)   2023-09-13T13:14:50 down~  node31
none                 root(0)   Unknown             drain  node27

Somehow it has now remembered that the user was root (it now shows that even
with plain sinfo -R).

> so probably that's not solved? Anyway, that's a nuisance, not a deal
> breaker
>
> With my 23.02.5 the SuspendExcStates is working as documented :-)
>

Okay so perhaps something happened between 23.02.3 and 23.02.5. I might
need to sleuth in the ticketing system.



> > 3) The whole thing does not appear to be working as I intended. My
> > understanding of the "exclude node" setting above is that slurm should
> > never attempt to shut off more than all the idle nodes in that partition
> > minus 2. Instead it shut all of them off, and then tried to turn them
> > back on:
> >
> > $ sinfo | grep 512
> > compute512     up   infinite      1 alloc# node15
> > compute512     up   infinite      2  idle# node[14,32]
> > compute512     up   infinite      3  down~ node[16-17,31]
> > compute512     up   infinite      1 drain~ node27
> > compute512     up   infinite     12  idle~ node[18-26,28-30]
> > compute512     up   infinite      1  alloc node13
>
> I agree that 2 nodes from node[13-32] shouldn't be suspended, according to
> SuspendExcNodes in the slurm.conf manual.  I haven't tested this feature.
>

Good to know that an independent reading of the manual matches mine. If you
don't use this feature, what do you do? Shut off all idle nodes and leave
newly submitted jobs waiting for a boot? Or something else?



> > But again this is a minor nuisance which I can live with


> > 4) Most importantly from the output above you may have noticed two nodes
> > (actually three by the time I ran the command below) that slurm deemed
> > down
> >
> > So I can confirm slurm invoked the script, but then waited for something
> > (what? starting slurmd?) which failed to occur, and marked the node down.
> > When I removed the suspend time from the partition to end the experiment,
> > the other nodes went "magically" back into production, without slurm
> > calling my poweron script. Of course the nodes were never powered off, but
> > slurm thought they were, so why did it not have the problem it did with
> > the nodes it intentionally tried to power on?
>
> IMHO, "pretending" to power down nodes defies the logic of the Slurm
> power_save plugin.


And it sure is useless ;)
But I was using the suggestion from
https://slurm.schedmd.com/power_save.html which says

You can also configure Slurm with programs that perform no action as
*SuspendProgram* and *ResumeProgram* to assess the potential impact of
power saving mode before enabling it.



> Slurmctld expects suspended nodes to *really* power
> down (slurmd is stopped).  When slurmctld resumes a suspended node, it
> expects slurmd to start up when the node is powered on.  There is a
> ResumeTimeout parameter which I've set to about 15-30 minutes in case of
> delays due to BIOS updates and the like - the default of 60 seconds is WAY
> too small!
>

Sure, in fact I upped that to 4 minutes. Typically our nodes reboot in 3
minutes and do not update BIOS or OS automatically. Sometimes they become
"hosed" and slower (a firmware bug throttles the CPU speed for no reason),
but in that case it is better that Slurm recognizes it and deems the node
down. In any case this is a moot point, since the node is not actually going down.


> Have you tried to experiment with the IPMI based power down/up method
> explained in the above present

Re: [slurm-users] Slurm powersave

2023-10-06 Thread Ole Holm Nielsen

Hi Davide,

On 10/5/23 15:28, Davide DelVento wrote:

IMHO, "pretending" to power down nodes defies the logic of the Slurm
power_save plugin. 


And it sure is useless ;)
But I was using the suggestion from
https://slurm.schedmd.com/power_save.html which says


You can also configure Slurm with programs that perform no action as 
*SuspendProgram* and *ResumeProgram* to assess the potential impact of 
power saving mode before enabling it.


I had not noticed the above sentence in the power_save manual before!  So 
I decided to test a "no action" power saving script, similar to what you 
have done, applying it to a test partition.  I conclude that "no action" 
power saving DOES NOT WORK, at least in Slurm 23.02.5.  So I opened a bug 
report https://bugs.schedmd.com/show_bug.cgi?id=17848 to find out if the 
documentation is obsolete, or if there may be a bug.  Please follow that 
bug to find out the answer from SchedMD.


What I *believe* (but not with 100% certainty) really happens with power 
saving in the current Slurm versions is what I wrote yesterday:



Slurmctld expects suspended nodes to *really* power
down (slurmd is stopped).  When slurmctld resumes a suspended node, it
expects slurmd to start up when the node is powered on.  There is a
ResumeTimeout parameter which I've set to about 15-30 minutes in case of
delays due to BIOS updates and the like - the default of 60 seconds is
WAY too small!
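
Concretely, in slurm.conf that translates into something like the lines
below (these particular numbers are only an example; pick values that match
how long your nodes realistically take to shut down and to boot):

SuspendTimeout=600      # max seconds from suspend request to node shut down
ResumeTimeout=1800      # max seconds from resume request to slurmd registering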


I hope this helps,
Ole



Re: [slurm-users] Slurm powersave

2023-12-11 Thread Davide DelVento
In case it's useful to others: I've been able to get this working by having
the "no action" script stop the slurmd daemon and start it *with the -b
option*.
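
For anyone finding this later, the idea is roughly the sketch below; this is
not my exact script, and the paths, the use of systemd, and the use of ssh
from the slurmctld host are assumptions you will need to adapt to your site:

#!/bin/bash
# "No action" ResumeProgram sketch: nothing is really powered on; instead
# slurmd is restarted with -b on each node so it reports a (pretend) reboot
# to slurmctld and the node leaves the powered-down state.
log_file=/var/log/slurm/powersave.log            # example log location
echo "Pretending to $0 the following node(s): $1" >> $log_file 2>&1
for node in $(scontrol show hostnames "$1"); do  # expand e.g. node[14-15,32]
    ssh "$node" "systemctl stop slurmd && /usr/sbin/slurmd -b" >> $log_file 2>&1
done
exit 0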

On Fri, Oct 6, 2023 at 4:28 AM Ole Holm Nielsen wrote:

> Hi Davide,
>
> On 10/5/23 15:28, Davide DelVento wrote:
> > IMHO, "pretending" to power down nodes defies the logic of the Slurm
> > power_save plugin.
> >
> > And it sure is useless ;)
> > But I was using the suggestion from
> > https://slurm.schedmd.com/power_save.html which says
> >
> > You can also configure Slurm with programs that perform no action as
> > *SuspendProgram* and *ResumeProgram* to assess the potential impact of
> > power saving mode before enabling it.
>
> I had not noticed the above sentence in the power_save manual before!  So
> I decided to test a "no action" power saving script, similar to what you
> have done, applying it to a test partition.  I conclude that "no action"
> power saving DOES NOT WORK, at least in Slurm 23.02.5.  So I opened a bug
> report https://bugs.schedmd.com/show_bug.cgi?id=17848 to find out if the
> documentation is obsolete, or if there may be a bug.  Please follow that
> bug to find out the answer from SchedMD.
>
> What I *believe* (but not with 100% certainty) really happens with power
> saving in the current Slurm versions is what I wrote yesterday:
>
> > Slurmctld expects suspended nodes to *really* power
> > down (slurmd is stopped).  When slurmctld resumes a suspended node, it
> > expects slurmd to start up when the node is powered on.  There is a
> > ResumeTimeout parameter which I've set to about 15-30 minutes in case of
> > delays due to BIOS updates and the like - the default of 60 seconds is
> > WAY too small!
>
> I hope this helps,
> Ole
>
>