Xaver,
You want to use the ResumeFailProgram script.
We run a full cloud cluster, and that is how we deal with things like
this. The script gets called if your ResumeProgram does not result in slurmd
being available on the node in a timely manner (whatever the reason).
Writing it yourself makes complete sense when you think about the use
cases. Originally, it was meant for something like a node that has a
hardware issue and will not start. In the ResumeFailProgram you
could send an email letting an admin know about it.
For me, I completely delete the node resources and reset/recreate the node.
That addresses even a botched software change.
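Roughly, my ResumeFailProgram looks like the sketch below (the cloud CLI call
and the mail address are placeholders for whatever you use; slurm passes the
failed nodes to the script as a hostlist expression in $1):

    #!/bin/bash
    # ResumeFailProgram sketch: $1 is a hostlist expression like "cloud[1-3]"
    for NODE in $(scontrol show hostnames "$1"); do
        # placeholder: tear down whatever cloud resources are left for this node
        my-cloud-cli delete-instance "$NODE"
        # let an admin know what happened
        echo "Resume of $NODE failed, instance deleted" | \
            mail -s "slurm resume failure: $NODE" admin@example.com
        # put the node back in the pool so it can be powered up again later
        scontrol update NodeName="$NODE" state=RESUME
    done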
Brian Andrus
On 11/23/2022 5:11 AM, Xaver Stiensmeier wrote:
Hello slurm-users,
The question can be found in a similar fashion here:
https://stackoverflow.com/questions/74529491/slurm-handling-nodes-that-fail-to-power-up-in-a-cloud-scheduling-system
Issue
Current behavior and problem description
When a node fails to |POWER_UP|, it is marked |DOWN|. While this is a
great idea in general, it is not useful when working with |CLOUD|
nodes, because said |CLOUD| node is likely to be started on a
different machine next time and would therefore |POWER_UP| without issues.
But since the node is marked as down, that cloud resource is no longer
used and is never started again until it is freed manually.
Wanted behavior
Ideally slurm would not mark the node as |DOWN|, but just attempt to
start another. If that's not possible, automatically resuming |DOWN|
nodes would also be an option.
Question
How can I prevent slurm from marking nodes that fail to |POWER_UP| as
|DOWN| or make slurm restore |DOWN| nodes automatically to prevent
slurm from forgetting cloud resources?
Attempts and Thoughts
ReturnToService
I tried solving this using |ReturnToService|
<https://slurm.schedmd.com/slurm.conf.html#OPT_ReturnToService>, but
that didn't seem to solve my issue: if I understand it correctly, it
only returns nodes to service that start up by themselves or are
started manually, and slurm will not consider them for scheduling jobs
until they have actually been started.
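For reference, the setting I experimented with looks like this in slurm.conf
(a value of 2 being, as far as I understand, the most permissive):

    # slurm.conf - only takes effect once a node actually registers again
    ReturnToService=2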
SlurmctldParameters=idle_on_node_suspend
While this is great and definitely helpful, it doesn't solve the issue
at hand, since a node that failed during power-up is not suspended.
ResumeFailProgram
I considered using |ResumeFailProgram|
<https://slurm.schedmd.com/slurm.conf.html#OPT_ResumeFailProgram>, but
it sounds odd that you have to write a script yourself to return
your nodes to service if they fail on startup. This case sounds too
common not to be handled by slurm itself. However, this will be my next
attempt: implement a script that, for every given node, calls
sudo scontrol update NodeName=$NODE_NAME state=RESUME reason=FailedShutdown
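A rough sketch of what I have in mind (assuming slurm passes the failed nodes
as a hostlist expression in $1, which can be expanded with scontrol show
hostnames):

    #!/bin/bash
    # ResumeFailProgram sketch: resume every node that failed to power up
    for NODE_NAME in $(scontrol show hostnames "$1"); do
        sudo scontrol update NodeName="$NODE_NAME" state=RESUME reason=FailedShutdown
    done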
Additional Information
In the |POWER_UP| script I am terminating the server if the setup
fails for any reason and returning a non-zero exit code.
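The failure path roughly looks like this (setup_node and the delete call are
placeholders for our actual setup steps and cloud API):

    # end of the POWER_UP script (sketch)
    if ! setup_node "$NODE"; then
        # terminate the half-started server so no orphaned instance remains
        my-cloud-cli delete-instance "$NODE"
        # non-zero exit code so slurm knows the power up failed
        exit 1
    fi
    exit 0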
In our Cloud Scheduling
<https://slurm.schedmd.com/elastic_computing.html> instances are
created once they are needed and deleted once they are no longer
needed. This means that slurm stores that a node is |DOWN| while no
real instance behind it exists anymore. If that node weren't
marked |DOWN| and a job were scheduled towards it at a later time,
slurm would simply start an instance and run the job on that new
instance. I am just stating this to be maximally explicit.
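For context, our configuration follows the usual cloud scheduling pattern,
roughly like this (paths, node names and sizes are only examples):

    # slurm.conf cloud scheduling sketch
    ResumeProgram=/opt/slurm/power_up.sh
    SuspendProgram=/opt/slurm/power_down.sh
    ResumeTimeout=600
    NodeName=cloud[1-10] State=CLOUD CPUs=4 RealMemory=8000
    PartitionName=cloud Nodes=cloud[1-10] Default=YES MaxTime=INFINITE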
Best regards,
Xaver Stiensmeier
PS: This is the first time I am using the slurm-users list and I hope I am
not violating any rules with this question. Please let me know if I do.