Re: [slurm-users] slurm elastic compute / power saving

2020-01-07 Thread Brian Andrus
I think we would need to see your SuspendScript to get a better idea of 
what is happening.


That error indicates the nodes are likely no longer running slurmd while 
the control daemon (slurmctld) still thinks they are up.


What is the output of 'sinfo -R'?
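
If it helps to gather that in one go, here is a minimal sketch that collects 'sinfo -R' together with 'scontrol show node' for each affected node. The function names and the node names passed in are illustrative assumptions only; the two Slurm commands are the standard ones.

```python
import subprocess


def diagnostic_commands(nodes):
    """Build the commands to run: `sinfo -R` plus one `scontrol show node` per node."""
    commands = [["sinfo", "-R"]]
    for node in nodes:
        commands.append(["scontrol", "show", "node", node])
    return commands


def collect(nodes, run=subprocess.run):
    """Run each command and return (command string, stdout + stderr) pairs.

    `run` is injectable so the helper can be exercised without Slurm installed.
    """
    results = []
    for cmd in diagnostic_commands(nodes):
        proc = run(cmd, capture_output=True, text=True)
        results.append((" ".join(cmd), proc.stdout + proc.stderr))
    return results
```

Calling collect(["compute-2", "compute-3"]) on the controller and printing each pair should show both the recorded reason and the state slurmctld currently holds for those nodes.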

Brian Andrus

On 1/7/2020 3:42 AM, Steve Brasier wrote:

Hi all,

I've got elastic compute working with slurm but on "suspend" I get 
something like the following in the slurmctld log:


power down request repeating for node compute-2
power down request repeating for node compute-3
error: Nodes compute-[2-3] not responding

The docs say that the SuspendScript should only have to return the 
nodes to the cloud - but the above suggests that maybe the script 
should also notify the slurmctld that the nodes are offline? Is that 
right, and if so what state should they be set to?


many thanks
Steve




[slurm-users] Can I add a new slurm plugin to an existing installation, or do I have to rebuild and reinstall with the plugin source?

2020-01-07 Thread Dean Schulze
The SchedMD docs for adding a plugin describe
adding source code, Makefiles, and other modifications for a new plugin to
a git branch in the source tree. This makes it sound like I would have to
rebuild and reinstall slurm in order to use a new plugin that I create.
This would make for long iterations for developing a new plugin.

My slurm installation on Ubuntu 18 has the existing plugin binaries (.a,
.la, .so) in /usr/local/lib/slurm. Can I install a new plugin by copying
the binaries for the new plugin into this directory (maybe with some
configuration modifications)? The SchedMD docs don't mention anything about
installing a new plugin into an existing slurm installation, or if it's
even possible.
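
One thing that may be worth checking first is which directory the running daemons actually load plugins from. The sketch below pulls the PluginDir value out of the text printed by 'scontrol show config'. It assumes that output consists of "Key = Value" lines; the function name is hypothetical and only for illustration.

```python
def plugin_dir_from_config(config_text):
    """Return the PluginDir value from `scontrol show config` output, or None.

    Assumes each setting appears on its own line as `Key = Value`.
    """
    for line in config_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "PluginDir":
            return value.strip()
    return None
```

The input would come from something like subprocess.run(["scontrol", "show", "config"], capture_output=True, text=True).stdout. This only confirms where plugins are read from; it does not by itself settle whether a separately built plugin can be dropped in there.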

Thanks.


[slurm-users] slurm elastic compute / power saving

2020-01-07 Thread Steve Brasier
Hi all,

I've got elastic compute working with slurm but on "suspend" I get
something like the following in the slurmctld log:

power down request repeating for node compute-2
power down request repeating for node compute-3
error: Nodes compute-[2-3] not responding

The docs say that the SuspendScript should only have to return the nodes to
the cloud - but the above suggests that maybe the script should also notify
the slurmctld that the nodes are offline? Is that right, and if so what
state should they be set to?
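
For reference, a minimal sketch of the shape such a suspend script might take (configured via SuspendProgram in slurm.conf), assuming Slurm passes the node list as a single hostlist expression in the first argument. release_instance is a hypothetical placeholder for whatever cloud API call returns the node; it is not a real library function. The sketch deliberately does not touch node state, since whether that is needed is the open question.

```python
import subprocess


def expand_hostlist(hostlist, run=subprocess.run):
    """Expand e.g. 'compute-[2-3]' into node names via `scontrol show hostnames`."""
    proc = run(["scontrol", "show", "hostnames", hostlist],
               capture_output=True, text=True, check=True)
    return proc.stdout.split()


def release_instance(name):
    """HYPOTHETICAL placeholder: replace with the cloud provider's delete call."""
    print(f"would release {name}")


def suspend(hostlist, expand=expand_hostlist, release=release_instance):
    """Release every node named in the hostlist and return the names handled."""
    names = expand(hostlist)
    for name in names:
        release(name)
    return names
```

Used as the actual script, the entry point would simply call suspend(sys.argv[1]).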

many thanks
Steve