Hello,

I just joined this mailing list.  I've been working on using the
cloud/powersave features of SLURM to run/terminate elastic nodes as
necessary for the duration of a given job.

I noticed that whenever I restart slurmctld, SLURM tries to bring all
nodes out of power save mode upon restart.  The exact scenario is
described below:

Configuration:
--------------
- There are 96 nodes defined in slurm.conf, all are CLOUD nodes.  For
example, from slurm.conf:
NodeName=rcmedium[1-24] CPUs=2 RealMemory=8 Sockets=1 CoresPerSocket=2
ThreadsPerCore=1 State=CLOUD

- There are five partitions, each composed of some set of cloud nodes.
For example, from slurm.conf:
PartitionName=medium Nodes=rcmedium[1-24] Default=YES MaxTime=INFINITE
State=UP

- Nodes are spooled up through a series of scripts: PrologSlurmctld
(which determines which image and credentials to use, based on job
params), ResumeProgram (which launches and configures nodes, and
registers their newly minted hostname/address with scontrol),
EpilogSlurmctld (which does some accounting cleanup), and SuspendProgram
(which terminates the instances).
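
For reference, my ResumeProgram follows roughly this shape (a
simplified sketch; launch_instance is a placeholder for the real
provisioning call, and in production I expand hostlists with
"scontrol show hostnames" rather than this toy regex):

```python
#!/usr/bin/env python
# Simplified sketch of the ResumeProgram.
import re
import subprocess
import sys

def expand_hostlist(expr):
    """Expand a simple SLURM hostlist like 'rcmedium[1-3]' into names.
    (A sketch only; 'scontrol show hostnames' handles the general case.)"""
    m = re.match(r'^(\w+?)\[(\d+)-(\d+)\]$', expr)
    if not m:
        return [expr]
    prefix, lo, hi = m.group(1), int(m.group(2)), int(m.group(3))
    return ['%s%d' % (prefix, i) for i in range(lo, hi + 1)]

def resume(hostlist_expr):
    for host in expand_hostlist(hostlist_expr):
        addr = launch_instance(host)  # placeholder: boot and configure the VM
        # Register the freshly minted address so slurmctld can reach the node:
        subprocess.check_call(['scontrol', 'update',
                               'NodeName=%s' % host,
                               'NodeAddr=%s' % addr,
                               'NodeHostname=%s' % addr])

if __name__ == '__main__':
    resume(sys.argv[1])
```

SuspendProgram does the reverse: it terminates the instances and the
nodes drop back into power save.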


The situation:
--------------
When slurmctld initially starts, everything is fine.  All 96 nodes are
in power save mode, no cloud instances are running.

When a job is submitted, the scripts dutifully launch instances, run the
job, then terminate the instances.  When the jobs are finished and no
more are in any queues, the system settles into a state with all 96
nodes in power save mode, with no cloud instances running.  This is
good.

When I stop slurmctld, slurmctld saves state and everything looks fine:
[2013-09-20T11:42:49.030] Power save mode: 96 nodes
[2013-09-20T11:43:32.504] Terminate signal (SIGINT or SIGTERM) received
[2013-09-20T11:43:32.505] debug:  sched: slurmctld terminating
[2013-09-20T11:43:33.266] Saving all slurm state

When I restart slurmctld, a number of interesting things happen...

This part looks OK:
[2013-09-20T11:44:33.673] Recovered state of 96 nodes
[2013-09-20T11:44:33.673] Recovered information about 0 jobs
[2013-09-20T11:44:33.673] Recovered state of 0 reservations
[2013-09-20T11:44:33.673] read_slurm_conf: backup_controller not
specified.
[2013-09-20T11:44:33.673] Running as primary controller
[2013-09-20T11:44:34.674] Power save mode: 96 nodes

The next part is bad.  For some reason, it tries to resolve the name of
all cloud hosts (which, of course, will never work as these machines
don't exist in tangible form, and don't have an address):
[2013-09-20T11:44:36.690] error: Unable to resolve "rcsmall13": Unknown
host
[2013-09-20T11:44:36.690] error: fwd_tree_thread: can't find address for
host rcsmall13, check slurm.conf

In another log in my system, there is evidence that slurmctld invokes
ResumeProgram at some point during this startup process, with *all*
nodes as an argument!
2013-09-20 11:44:34,797 - rcbatch_resume.py - INFO - Invoked with hosts
rclarge[1-16],rcmedium[1-24],rcsmall[1-32],rcxlarge[1-16],rcxxlarge[1-8]

At this stage, scontrol show node will show any node as having the
state:
IDLE#+CLOUD

This is different behaviour from the "normal" case, where I observe
scontrol show node to return nothing at all when querying the state of a
"sleeping" (nonexistent) cloud node.

With a clean start of slurmctld (ignoring/removing any saved state),
everything works fine.  It seems that problems occur only after
slurmctld starts up and tries to restore previously saved state.  I was
just wondering if this could be a bug, or if I'm missing something?

Thanks,

  -Aaron