As there haven't been any responses, I think I'll just go with my gut feeling that this is a bug and submit a bug report.
Thanks,
  -Aaron

On Fri, 2013-09-20 at 09:26 -0700, Aaron Birkland wrote:
> Hello,
>
> I just joined this mailing list. I've been working on using the
> cloud/powersave features of SLURM to run/terminate elastic nodes as
> necessary for the duration of a given job.
>
> I noticed that whenever I restart slurmctld, SLURM wishes to place all
> nodes out of powersave mode upon restart. The exact scenario is stated
> below:
>
> Configuration:
> --------------
> - There are 96 nodes defined in slurm.conf, all are CLOUD nodes. For
>   example, from slurm.conf:
>   NodeName=rcmedium[1-24] CPUs=2 RealMemory=8 Sockets=1 CoresPerSocket=2
>   ThreadsPerCore=1 State=CLOUD
>
> - There are five partitions composed of some set of cloud nodes. For
>   example, from slurm.conf:
>   PartitionName=medium Nodes=rcmedium[1-24] Default=YES MaxTime=INFINITE
>   State=UP
>
> - Nodes are spooled up through a series of scripts: PrologSlurmctld
>   (which determines which image and credentials to use, based on job
>   params), ResumeProgram (which launches and configures nodes, and
>   registers their newly minted hostname/address with scontrol),
>   EpilogSlurmctld (which does some accounting cleanup), and SuspendProgram
>   (which terminates the instances).
>
> The situation:
> --------------
> When slurmctld initially starts, everything is fine. All 96 nodes are
> in power save mode, and no cloud instances are running.
>
> When a job is submitted, the scripts dutifully launch instances, run the
> job, then terminate the instances. When the jobs are finished and no
> more are in any queues, the system settles into a state with all 96
> nodes in power save mode, with no cloud instances running. This is
> good.
>
> When I stop slurmctld, it saves state and everything looks fine:
> [2013-09-20T11:42:49.030] Power save mode: 96 nodes
> [2013-09-20T11:43:32.504] Terminate signal (SIGINT or SIGTERM) received
> [2013-09-20T11:43:32.505] debug: sched: slurmctld terminating
> [2013-09-20T11:43:33.266] Saving all slurm state
>
> When I re-start slurmctld, a number of interesting things happen...
>
> This part looks OK:
> [2013-09-20T11:44:33.673] Recovered state of 96 nodes
> [2013-09-20T11:44:33.673] Recovered information about 0 jobs
> [2013-09-20T11:44:33.673] Recovered state of 0 reservations
> [2013-09-20T11:44:33.673] read_slurm_conf: backup_controller not
> specified.
> [2013-09-20T11:44:33.673] Running as primary controller
> [2013-09-20T11:44:34.674] Power save mode: 96 nodes
>
> The next part is bad. For some reason, it tries to resolve the names of
> all cloud hosts (which, of course, will never work, as these machines
> don't exist in tangible form and don't have an address):
> [2013-09-20T11:44:36.690] error: Unable to resolve "rcsmall13": Unknown
> host
> [2013-09-20T11:44:36.690] error: fwd_tree_thread: can't find address for
> host rcsmall13, check slurm.conf
>
> In another log on my system, there is evidence that slurmctld invokes
> ResumeProgram at some point during this startup process, with *all*
> nodes as an argument!
> 2013-09-20 11:44:34,797 - rcbatch_resume.py - INFO - Invoked with hosts
> rclarge[1-16],rcmedium[1-24],rcsmall[1-32],rcxlarge[1-16],rcxxlarge[1-8]
>
> At this stage, scontrol show node will show any node as having the
> state:
> IDLE#+CLOUD
>
> This is different behaviour from the "normal" case, where I observe
> scontrol show node to return nothing at all when querying the state of a
> "sleeping" (nonexistent) cloud node.
>
> With a clean start of slurmctld (ignoring/removing any saved state),
> everything works fine. It seems that problems occur only after
> slurmctld starts up and tries to restore previously saved state.
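As an aside for anyone reproducing this: the ResumeProgram log line above shows that slurmctld passes the node list as one compressed SLURM hostlist expression, not as individual names. The sketch below is an illustrative expander for that format, not the poster's actual rcbatch_resume.py code; it handles only flat prefix[lo-hi] ranges and plain names (a real script can simply shell out to `scontrol show hostnames`):

```python
import re

def expand_hostlist(expr):
    """Expand a simple SLURM hostlist expression such as
    "rcmedium[1-24]" or "rclarge[1-16],rcxxlarge[1-8]" into a
    list of individual hostnames. Only flat prefix[lo-hi] ranges
    and bare names are handled -- no nesting or zero-padding.
    """
    hosts = []
    # Grab each comma-separated piece, keeping any [...] attached to it.
    for part in re.findall(r'[^,\[]+(?:\[[^\]]*\])?', expr):
        m = re.match(r'^(.*)\[(\d+)-(\d+)\]$', part)
        if m:
            prefix, lo, hi = m.group(1), int(m.group(2)), int(m.group(3))
            hosts.extend("%s%d" % (prefix, i) for i in range(lo, hi + 1))
        else:
            hosts.append(part)
    return hosts

# The hostlist from the log above expands to exactly the 96 configured nodes:
names = expand_hostlist(
    "rclarge[1-16],rcmedium[1-24],rcsmall[1-32],rcxlarge[1-16],rcxxlarge[1-8]")
print(len(names))  # -> 96
```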
>
> I was just wondering if this could be a bug, or if I'm missing
> something?
>
> Thanks,
>
> -Aaron
>
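For readers trying to reproduce the setup described above, the power-save hooks are wired together in slurm.conf roughly as follows. This is a hypothetical fragment matching the description in the quoted mail: the script paths are placeholders and the timing values are guesses, but SuspendProgram, ResumeProgram, SuspendTime, SuspendTimeout, ResumeTimeout, PrologSlurmctld, and EpilogSlurmctld are the actual slurm.conf parameters involved:

```
# Hypothetical power-save wiring; paths/timeouts are illustrative only.
SuspendProgram=/usr/local/sbin/rcbatch_suspend.py   # terminates instances
ResumeProgram=/usr/local/sbin/rcbatch_resume.py     # launches instances,
                                                    # registers addresses
SuspendTime=300        # seconds idle before a node is powered down
SuspendTimeout=120     # seconds allowed for SuspendProgram to finish
ResumeTimeout=600      # max seconds to wait for a resumed node to respond
PrologSlurmctld=/usr/local/sbin/rcbatch_prolog.py   # picks image/credentials
EpilogSlurmctld=/usr/local/sbin/rcbatch_epilog.py   # accounting cleanup
```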
