As there haven't been any responses, I think I'll just go with my gut
feeling that this is a bug and submit a bug report.

Thanks,

  -Aaron

On Fri, 2013-09-20 at 09:26 -0700, Aaron Birkland wrote:
> Hello,
> 
> I just joined this mailing list.  I've been working on using the
> cloud/powersave features of SLURM to run/terminate elastic nodes as
> necessary for the duration of a given job.
> 
> I noticed that whenever I restart slurmctld, SLURM tries to bring all
> nodes out of power save mode.  The exact scenario is described
> below:
> 
> Configuration:
> --------------
> - There are 96 nodes defined in slurm.conf, all are CLOUD nodes.  For
> example, from slurm.conf:
> NodeName=rcmedium[1-24] CPUs=2 RealMemory=8 Sockets=1 CoresPerSocket=2
> ThreadsPerCore=1 State=CLOUD
> 
> - There are five partitions, each composed of some set of cloud nodes.
> For example, from slurm.conf:
> PartitionName=medium Nodes=rcmedium[1-24] Default=YES MaxTime=INFINITE
> State=UP
> 
> - Nodes are spooled up through a series of scripts: PrologSlurmctld
> (which determines which image and credentials to use, based on job
> params), ResumeProgram (which launches and configures nodes, and
> registers their newly minted hostname/address with scontrol),
> EpilogSlurmctld (which does some accounting cleanup), and SuspendProgram
> (which terminates the instances).
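> As a rough illustration of the "registers their hostname/address" step,
> here is a minimal Python sketch of what a ResumeProgram might do once
> an instance is up; the helper name, node name, and address below are
> hypothetical, not taken from the actual scripts:

```python
# Hypothetical sketch of a ResumeProgram's registration step: once a
# cloud instance has booted, point the SLURM node record at its real
# address via "scontrol update".  Names and addresses are illustrative.

def build_register_cmd(node, addr, hostname):
    # scontrol update NodeName=... NodeAddr=... NodeHostname=...
    return ["scontrol", "update",
            "NodeName={}".format(node),
            "NodeAddr={}".format(addr),
            "NodeHostname={}".format(hostname)]

cmd = build_register_cmd("rcmedium1", "10.0.0.17", "ip-10-0-0-17")
print(" ".join(cmd))
# A real ResumeProgram would then execute this command, e.g. with
# subprocess.check_call(cmd).
```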
> 
> 
> The situation:
> --------------
> When slurmctld initially starts, everything is fine.  All 96 nodes are
> in power save mode, no cloud instances are running.
> 
> When a job is submitted, the scripts dutifully launch instances, run the
> job, then terminate the instances.  When the jobs are finished and no
> more are in any queues, the system settles into a state with all 96
> nodes in power save mode, with no cloud instances running.  This is
> good.
> 
> When I stop slurmctld, slurmctld saves state and everything looks fine:
> [2013-09-20T11:42:49.030] Power save mode: 96 nodes
> [2013-09-20T11:43:32.504] Terminate signal (SIGINT or SIGTERM) received
> [2013-09-20T11:43:32.505] debug:  sched: slurmctld terminating
> [2013-09-20T11:43:33.266] Saving all slurm state
> 
> When I restart slurmctld, a number of interesting things happen...
> 
> This part looks OK:
> [2013-09-20T11:44:33.673] Recovered state of 96 nodes
> [2013-09-20T11:44:33.673] Recovered information about 0 jobs
> [2013-09-20T11:44:33.673] Recovered state of 0 reservations
> [2013-09-20T11:44:33.673] read_slurm_conf: backup_controller not
> specified.
> [2013-09-20T11:44:33.673] Running as primary controller
> [2013-09-20T11:44:34.674] Power save mode: 96 nodes
> 
> The next part is bad.  For some reason, it tries to resolve the names
> of all cloud hosts (which, of course, will never work, as these
> machines don't exist in tangible form and have no address):
> [2013-09-20T11:44:36.690] error: Unable to resolve "rcsmall13": Unknown
> host
> [2013-09-20T11:44:36.690] error: fwd_tree_thread: can't find address for
> host rcsmall13, check slurm.conf
> 
> In another log on my system, there is evidence that slurmctld invokes
> ResumeProgram at some point during this startup process, with *all*
> nodes as an argument!
> 2013-09-20 11:44:34,797 - rcbatch_resume.py - INFO - Invoked with hosts
> rclarge[1-16],rcmedium[1-24],rcsmall[1-32],rcxlarge[1-16],rcxxlarge[1-8]
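> The hosts argument above is a compressed SLURM hostlist.  As a rough
> sketch of how such an expression expands into individual node names
> (handling only simple single-range entries like those above, not the
> full hostlist grammar):

```python
import re

def expand_hostlist(expr):
    """Expand simple SLURM hostlist expressions such as
    "rcmedium[1-24]" or "rcsmall5" into individual host names.
    Handles at most one bracketed range per entry (a simplification
    of the real hostlist grammar)."""
    hosts = []
    # Split on commas that are not inside a [...] range.
    for part in re.split(r",(?![^\[]*\])", expr):
        m = re.match(r"^(.*)\[(\d+)-(\d+)\]$", part)
        if m:
            prefix, lo, hi = m.group(1), int(m.group(2)), int(m.group(3))
            hosts.extend("{}{}".format(prefix, i) for i in range(lo, hi + 1))
        else:
            hosts.append(part)
    return hosts

print(expand_hostlist("rclarge[1-16],rcmedium[1-24]")[:3])
```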
> 
> At this stage, scontrol show node reports every node in the state:
> IDLE#+CLOUD
> 
> This is different behaviour from the "normal" case, where I observe
> that scontrol show node returns nothing at all when querying the state
> of a "sleeping" (nonexistent) cloud node.
> 
> With a clean start of slurmctld (ignoring/removing any saved state),
> everything works fine.  It seems the problem occurs only when
> slurmctld starts up and tries to restore previously saved state.  I was
> just wondering: could this be a bug, or am I missing something?
> 
> Thanks,
> 
>   -Aaron
> 
