Hello, I just joined this mailing list. I've been working on using the cloud/powersave features of SLURM to run/terminate elastic nodes as necessary for the duration of a given job.
I noticed that whenever I restart slurmctld, SLURM tries to bring all nodes out of power save mode on startup. The exact scenario is described below:

Configuration:
--------------

- There are 96 nodes defined in slurm.conf, all of them CLOUD nodes. For example, from slurm.conf:

    NodeName=rcmedium[1-24] CPUs=2 RealMemory=8 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=CLOUD

- There are five partitions, each composed of some set of cloud nodes. For example, from slurm.conf:

    PartitionName=medium Nodes=rcmedium[1-24] Default=YES MaxTime=INFINITE State=UP

- Nodes are spooled up through a series of scripts: PrologSlurmctld (which determines which image and credentials to use, based on job parameters), ResumeProgram (which launches and configures nodes, and registers their newly minted hostname/address with scontrol), EpilogSlurmctld (which does some accounting cleanup), and SuspendProgram (which terminates the instances).

The situation:
--------------

When slurmctld initially starts, everything is fine. All 96 nodes are in power save mode and no cloud instances are running. When a job is submitted, the scripts dutifully launch instances, run the job, then terminate the instances. When the jobs are finished and no more are in any queue, the system settles into a state with all 96 nodes in power save mode and no cloud instances running. This is good.

When I stop slurmctld, it saves state and everything looks fine:

[2013-09-20T11:42:49.030] Power save mode: 96 nodes
[2013-09-20T11:43:32.504] Terminate signal (SIGINT or SIGTERM) received
[2013-09-20T11:43:32.505] debug: sched: slurmctld terminating
[2013-09-20T11:43:33.266] Saving all slurm state

When I restart slurmctld, a number of interesting things happen... This part looks OK:

[2013-09-20T11:44:33.673] Recovered state of 96 nodes
[2013-09-20T11:44:33.673] Recovered information about 0 jobs
[2013-09-20T11:44:33.673] Recovered state of 0 reservations
[2013-09-20T11:44:33.673] read_slurm_conf: backup_controller not specified
[2013-09-20T11:44:33.673] Running as primary controller
[2013-09-20T11:44:34.674] Power save mode: 96 nodes

The next part is bad. For some reason, it tries to resolve the names of all cloud hosts (which, of course, will never work, as these machines don't exist in tangible form and don't have an address):

[2013-09-20T11:44:36.690] error: Unable to resolve "rcsmall13": Unknown host
[2013-09-20T11:44:36.690] error: fwd_tree_thread: can't find address for host rcsmall13, check slurm.conf

In another log on my system, there is evidence that slurmctld invokes ResumeProgram at some point during this startup process, with *all* nodes as the argument!

2013-09-20 11:44:34,797 - rcbatch_resume.py - INFO - Invoked with hosts rclarge[1-16],rcmedium[1-24],rcsmall[1-32],rcxlarge[1-16],rcxxlarge[1-8]

At this stage, scontrol show node reports every node in the state:

    IDLE#+CLOUD

This is different behaviour from the "normal" case, where scontrol show node returns nothing at all when querying the state of a "sleeping" (nonexistent) cloud node.

With a clean start of slurmctld (ignoring/removing any saved state), everything works fine. It seems the problems occur only when slurmctld starts up and tries to restore previously saved state.

I was just wondering if this could be a bug, or if I'm missing something?

Thanks,
-Aaron
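P.S. In case it helps to see the shape of the workflow: below is a hypothetical, heavily simplified sketch (not my actual rcbatch_resume.py) of what a ResumeProgram like ours does. slurmctld invokes it with a hostlist expression; the script launches an instance per node and registers the instance's address back with "scontrol update". launch_instance() is a placeholder for the real cloud-provider API call.

```python
#!/usr/bin/env python
# Hypothetical, simplified ResumeProgram sketch (placeholder names).
import subprocess
import sys

def scontrol_update_args(node, addr):
    """Build the scontrol command that registers a freshly booted
    instance's address/hostname for the given cloud node."""
    return ["scontrol", "update",
            "NodeName=%s" % node,
            "NodeAddr=%s" % addr,
            "NodeHostname=%s" % addr]

def launch_instance(node):
    # Placeholder: call the cloud provider API, wait for the instance
    # to boot, and return its reachable IP address or hostname.
    raise NotImplementedError("cloud API call goes here")

if __name__ == "__main__":
    # slurmctld passes the nodes to resume as a single hostlist
    # expression, e.g. "rcmedium[1-4]"; expand it via scontrol.
    hostlist = sys.argv[1]
    out = subprocess.check_output(["scontrol", "show", "hostnames", hostlist])
    for node in out.decode().split():
        addr = launch_instance(node)
        subprocess.check_call(scontrol_update_args(node, addr))
```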
