Hi,

Thanks for the suggestion. Unfortunately, I already have
SlurmctldDebug=9 set.

A "grep -i power /var/log/slurm/slurmctld.log | tail" gives:

[2014-08-29T09:10:05.202] Power save mode: 31 nodes
[2014-08-29T09:12:17.228] power_save: waking nodes n510301
[2014-08-29T09:15:56.267] power_save: waking nodes n510401
[2014-08-29T09:20:18.321] Power save mode: 29 nodes
[2014-08-29T09:23:23.359] power_save: waking nodes n511301
[2014-08-29T09:31:05.448] Power save mode: 28 nodes
[2014-08-29T09:41:45.535] Power save mode: 28 nodes
[2014-08-29T09:49:25.619] power_save: suspending nodes n[511001,511101,511601]
[2014-08-29T09:52:07.648] Power save mode: 31 nodes
[2014-08-29T09:53:08.656] power_save: waking nodes n511001

Taking nodes n[511001,511101,511601] as an example,
"scontrol show node $NODE | grep State" gives:
n511001: State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1
n511101: State=IDLE+POWER ThreadsPerCore=2 TmpDisk=0 Weight=1
n511601: State=IDLE+POWER ThreadsPerCore=2 TmpDisk=0 Weight=1

"ipmitool -Ilanplus -UADMIN -Pxxxxx -H $NODE-bmc power status"
n511001: Chassis Power is on
n511101: Chassis Power is on
n511601: Chassis Power is on

This indicates that SLURM tries to shut those nodes down but fails,
which is consistent with my suspicion that one of the scripts doesn't
get executed.
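
The comparison above could be scripted roughly as follows. This is only a
sketch: the function names classify_node and check_nodes are made up for
illustration, and it assumes the BMC password is exported as IPMI_PASSWORD
so that ipmitool's -E option can read it instead of -P on the command line.

```shell
#!/bin/bash
# Sketch: flag nodes that SLURM lists as power-saved while the BMC
# still reports chassis power as on. Function names are hypothetical.

# classify_node STATE POWER_STATUS
# Prints "MISMATCH" if STATE contains "+POWER" and POWER_STATUS
# contains "Power is on"; prints "OK" in every other case.
classify_node() {
  case "$1" in
    *+POWER*)
      case "$2" in
        *"Power is on"*) echo "MISMATCH"; return ;;
      esac ;;
  esac
  echo "OK"
}

# check_nodes NODELIST
# Expands a SLURM nodelist and prints one result line per node.
# The body runs in a subshell so its variables stay local.
check_nodes() (
  for node in $(scontrol show hostnames "$1"); do
    state=$(scontrol show node "$node" | grep -o 'State=[^ ]*')
    # -E takes the password from the IPMI_PASSWORD environment variable
    power=$(ipmitool -I lanplus -U ADMIN -E -H "${node}-bmc" power status)
    echo "${node}: $(classify_node "$state" "$power") (${state}, ${power})"
  done
)
```

It would be called as: check_nodes 'n[511001,511101,511601]'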

Executing my script manually successfully shuts down the node:

# sudo -u slurm /opt/system/slurm/etc/node_poweroff.slurm n511601

But after turning this node on again, its status is
State=DOWN+POWER with Reason=Node unexpectedly rebooted [slurm@2014-08-29T10:20:08]

which seems odd as SLURM should know that this node was without power
for some time.


From this situation I have two questions:

1) How can I debug that SLURM really executes the configured scripts?
2) Should I file a bug report for this "unexpected reboot" behavior? The
reboot was not unexpected as SLURM wanted this node to shut down.
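
For 1), one approach I could try is to make the wrapper scripts record every
invocation. A minimal sketch, assuming a writable log location; the names
log_invocation and POWER_DEBUG_LOG are made up here and are not part of SLURM:

```shell
#!/bin/bash
# Sketch: record every call of a wrapper script so that it becomes
# visible whether slurmctld ran it at all, and as which user.
# log_invocation and POWER_DEBUG_LOG are hypothetical names.

POWER_DEBUG_LOG="${POWER_DEBUG_LOG:-/tmp/slurm_power_debug.log}"

# log_invocation SCRIPT_NAME [ARGS...]
# Appends one line with timestamp, effective user name and arguments.
log_invocation() {
  echo "$(date +'%F %T') user=$(id -un) called: $*" >> "${POWER_DEBUG_LOG}"
}

# At the top of node_poweroff.slurm one would then add:
#   log_invocation "$0" "$@"
# and, to capture error messages from sudo or ssh as well:
#   exec >> "${POWER_DEBUG_LOG}" 2>&1
```

If no line shows up after slurmctld logs "power_save: suspending nodes", the
script was not started; if a line shows up but no node is powered off, the
captured error output should indicate which later step failed.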

Regards,

        Uwe





On 28.08.2014 at 13:42, Franco Broi wrote:
> We use power saving so it definitely works, maybe you should try turning
> on debugging for the controller daemon with scontrol and checking the
> log file.
> 
> On 28 Aug 2014 19:18, Uwe Sauter <[email protected]> wrote:
> 
> Hi all,
> 
> (configuration and scripts below text)
> 
> I have configured SLURM to power down idle nodes but it probably is
> misconfigured. I aim for a configuration where after a certain period
> (say 10min) idle nodes are powered down.
> 
> As you can see from the configuration below I have SLURM call either
> "node_poweroff.slurm" or "node_poweron.slurm" which are wrapper scripts
> that handle the conversion of SLURM's nodelist syntax and call
> "node_poweroff" or "node_poweron" for each node.
> 
> "node_power{off,on}" log their actions into /var/log/slurm/powermgmt.log
> so I can follow and in the future analyze which nodes were turned off
> and on.
> 
> The current situation is that although I see 36 out of 54 nodes in an
> IDLE+POWER state, all nodes are powered on and accessible via SSH.
> 
> Output from "grep -i power /var/log/slurm/slurmctld.log | tail"
> 
> [2014-08-28T12:01:24.975] Power save mode: 30 nodes
> [2014-08-28T12:11:44.080] Power save mode: 30 nodes
> [2014-08-28T12:22:44.194] Power save mode: 30 nodes
> [2014-08-28T12:33:44.306] Power save mode: 30 nodes
> [2014-08-28T12:44:01.425] Power save mode: 30 nodes
> [2014-08-28T12:51:44.514] power_save: suspending nodes n[510301,510601,511901]
> [2014-08-28T12:54:26.547] Power save mode: 33 nodes
> [2014-08-28T12:54:26.547] power_save: suspending nodes n[511101,512501]
> [2014-08-28T12:57:08.581] power_save: suspending nodes n510901
> [2014-08-28T13:05:10.666] Power save mode: 36 nodes
> 
> Output from "tail /var/log/slurm/powermgmt.log"
> 
> 2014-08-27 16:39:36 power on   n512501
> 2014-08-27 16:51:17 power on   n512601
> 2014-08-27 17:59:38 power on   n512601
> 2014-08-28 09:05:54 power on   n511101
> 2014-08-28 09:06:05 power on   n511201
> 2014-08-28 09:06:11 power on   n512001
> 2014-08-28 09:06:19 power on   n512201
> 2014-08-28 10:41:51 power on   n510501
> 2014-08-28 10:41:51 power on   n510701
> 2014-08-28 11:31:41 power on   n511101
> 
> grep does not find "down" in /var/log/slurm/powermgmt.log which it
> should if "node_poweroff" has been executed.
> 
> My impression is that something (misconfiguration? bad sudo
> configuration? other permission problems?) doesn't allow SLURM to
> execute one of the mentioned scripts.
> 
> Can someone check my configuration and give some advice on how to debug
> this issue further?
> 
> 
> Thank you,
> 
>         Uwe
> 
> 
> ### slurm.conf excerpt ###
> 
> # POWER SAVE SUPPORT FOR IDLE NODES (optional)
> SuspendTime=600
> SuspendRate=30
> ResumeRate=10
> SuspendProgram=/opt/system/slurm/etc/node_poweroff.slurm
> ResumeProgram=/opt/system/slurm/etc/node_poweron.slurm
> SuspendTimeout=120
> ResumeTimeout=300
> #SuspendExcNodes=n51[03,04,29,30][01],n52[04,05][01]
> #SuspendExcParts=
> BatchStartTimeout=60
> 
> ##########################
> 
> ### /opt/system/slurm/etc/node_poweroff.slurm ###
> 
> #!/bin/bash
> set -o nounset
> 
> NODES=$(/opt/system/slurm/default/bin/scontrol show hostnames "$1")
> 
> for NODE in ${NODES}; do
>   sudo /opt/system/slurm/etc/node_poweroff ${NODE}
> done
> 
> exit 0
> 
> #################################################
> 
> ### /opt/system/slurm/etc/node_poweron.slurm ###
> 
> #!/bin/bash
> set -o nounset
> 
> NODES=$(/opt/system/slurm/default/bin/scontrol show hostnames "$1")
> 
> for NODE in ${NODES}; do
>   /opt/system/slurm/etc/node_poweron ${NODE}
> done
> 
> #################################################
> 
> ### /opt/system/slurm/etc/node_poweroff ###
> 
> #!/bin/bash
> set -o nounset
> 
> NODE=$1
> 
> echo "$(date +'%F %T') power down ${NODE}" >> /var/log/slurm/powermgmt.log
> 
> ssh ${NODE} "/etc/init.d/lustre_client stop"
> ssh ${NODE} "umount /localscratch /nfs/*"
> ssh ${NODE} "service slurm stop"
> ssh ${NODE} "service munge stop"
> ssh ${NODE} "poweroff"
> 
> sleep 10
> 
> ping -c1 ${NODE} >/dev/null 2>&1
> [ $? -eq 0 ] && /usr/bin/ipmitool -Ilanplus -UADMIN -Pxxxxx -H ${NODE}-bmc power off
> 
> exit 0
> 
> #############################################
> 
> ### /opt/system/slurm/etc/node_poweron ###
> 
> #!/bin/bash
> set -o nounset
> 
> NODE=${1}
> 
> echo "$(date +'%F %T') power on   ${NODE}" >> /var/log/slurm/powermgmt.log
> 
> /usr/bin/ipmitool -Ilanplus -UADMIN -Pxxxxx -H ${NODE}-bmc power on
> 
> exit 0
> 
> 
> ##########################################
> 
> ### /etc/sudoers excerpt ###
> 
> slurm           ALL=NOPASSWD: /opt/system/slurm/etc/node_poweron
> slurm           ALL=NOPASSWD: /opt/system/slurm/etc/node_poweroff
> 
> ############################
> 
