Re: [slurm-users] Slurm cloud scheduling/power saving

Brian Andrus Thu, 01 Apr 2021 09:59:29 -0700

Run 'sinfo -R' to see if any of your nodes are out of the mix.


If so, resume them and see if things work.
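
As a rough sketch of what I mean (not tested against your setup): `cloud-0` is a made-up node name, and the `resume_node` helper with its `DRY_RUN` switch is only there for illustration so nothing changes until you want it to. The underlying commands are plain `sinfo -R` and `scontrol update`.

```shell
# Show down/drained nodes with the recorded reason (needs a working Slurm client):
#   sinfo -R

# Resume one node. DRY_RUN defaults to 1, so the command is only printed;
# set DRY_RUN=0 to actually run it. "cloud-0" below is a hypothetical node name.
resume_node() {
    cmd="scontrol update nodename=$1 state=resume"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$cmd"
    else
        $cmd
    fi
}

resume_node cloud-0
```

If `sinfo -R` lists the cloud nodes with a reason, that reason is usually the quickest pointer to why the controller is not trying to boot them.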

Brian Andrus

On 4/1/2021 1:53 AM, Steve Brasier wrote:

Hi all, anyone have suggestions for debugging cloud nodes not resuming? I've had this working before but I'm now using "configless" mode so wondering if that's an issue.
If I log in as SlurmUser and run the ResumeProgram manually, the specified node(s) boot, and if I log into them `sinfo` works although it only shows the "static" nodes, not the newly booted "cloud" nodes. So that at least shows the program works, the image works, and new nodes can contact the slurmctld.
However if I run a job which requires cloud nodes it immediately goes Pending showing "Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions". Looking at SlurmctldLogFile with SlurmdDebug=debug5 I don't see any attempt to boot the nodes at all :-(.
I can post slurm.conf if anyone wants to look but I think the important parameters are probably that I've got:
SlurmctldParameters=enable_configless,idle_on_node_suspend,cloud_dns,power_save_interval=10,power_save_min_interval=0

That look right?

thanks for any suggestions!

Steve

http://stackhpc.com/
Please note I work Tuesday to Friday.
