Hi Analabha,

Yes, unfortunately for your needs, I expect a time-limited reservation along the 
lines of my suggestion would not accept jobs that are scheduled to end outside 
the reservation's availability window. I'd suggest looking at checkpointing in 
this case, e.g. with DMTCP (Distributed MultiThreaded CheckPointing), 
http://dmtcp.sourceforge.net/. That could allow jobs to have their state saved 
and then reloaded when they are started again.
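For illustration, a minimal sketch of what a DMTCP-wrapped batch script might 
look like (untested; the binary name, time limit, and checkpoint interval are 
placeholders of mine, not anything DMTCP prescribes):

    #!/bin/bash
    #SBATCH --time=08:00:00
    # dmtcp_launch runs the application under checkpoint control;
    # -i 3600 writes a checkpoint roughly every hour
    dmtcp_launch -i 3600 ./my_long_job

    # After the machine is back up, resume from the last checkpoint using
    # the restart script DMTCP generates next to its checkpoint images:
    #   ./dmtcp_restart_script.sh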

Best

Sean

---
Sean McGrath
Senior Systems Administrator, IT Services

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Analabha 
Roy <hariseldo...@gmail.com>
Sent: Tuesday 7 February 2023 12:14
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] [External] Hibernating a whole cluster

Hi Sean,

Thanks for your awesome suggestion! I'm going through the reservation docs now. 
At first glance, it seems like a daily reservation would turn down jobs that 
are too big for the reservation. It'd be nice if Slurm could suspend (in the 
manner of 'scontrol suspend') jobs during reserved downtime and resume them 
afterwards. That way, folks could submit large jobs without having to worry 
about the downtimes. Perhaps the FLEX option in reservations can accomplish 
this somehow?
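From what I can tell, the scontrol docs describe FLEX as letting jobs tied to 
the reservation start before and end after the reservation window, i.e. it 
relaxes the boundaries rather than suspending anything. My untested guess at 
the syntax, with a placeholder user list:

    scontrol create reservation ReservationName=daily starttime=09:00:00 \
        duration=8:00:00 nodecnt=1 flags=daily,flex users=alice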


I suppose I could do it with a shell script that iterates over the jobs, 
driven by a cron job, but that seems like an ugly hack. I was hoping there is 
a way to configure this in Slurm itself?
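The hack I have in mind would be roughly this (untested; the script name is 
made up):

    #!/bin/bash
    # suspend-all-jobs.sh: suspend every running job before the downtime
    for jobid in $(squeue --noheader --states=RUNNING --format=%i); do
        scontrol suspend "$jobid"
    done

plus a mirror-image script that loops over --states=SUSPENDED and calls 
'scontrol resume' after power-on.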

AR

On Tue, 7 Feb 2023 at 16:06, Sean Mc Grath <smcg...@tcd.ie> wrote:
Hi Analabha,

Could you do something like create a daily reservation for 8 hours that starts 
at 9am, or whatever times work for you, with something like the following 
untested command:

scontrol create reservation starttime=09:00:00 duration=8:00:00 nodecnt=1 
flags=daily ReservationName=daily

Daily option at https://slurm.schedmd.com/scontrol.html#OPT_DAILY

Some more possibly helpful documentation at 
https://slurm.schedmd.com/reservations.html; search for "daily".

My idea being that jobs can only run in that reservation (that would have to 
be configured separately; I'm not sure how off the top of my head), which is 
only active during the times you want the node to be working. So the cron job 
that hibernates/shuts it down will do so when there are no jobs running. At 
least in theory.
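As a guess (untested), the separate configuration could be as simple as having 
jobs request the reservation explicitly at submit time:

    sbatch --reservation=daily job.sh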

Hope that helps.

Sean

---
Sean McGrath
Senior Systems Administrator, IT Services

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Analabha 
Roy <hariseldo...@gmail.com>
Sent: Tuesday 7 February 2023 10:05
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] [External] Hibernating a whole cluster

Hi,

Thanks. I had read the Slurm Power Saving Guide before. I believe the configs 
enable slurmctld to check other nodes for idleness and suspend/resume them. 
Slurmctld must run on a separate, always-on server for this to work, right?

My issue might be a little different. I literally have only one node that runs 
everything: slurmctld, slurmd, slurmdbd, everything.

This node must be set to "sudo systemctl hibernate" after business hours, 
regardless of whether jobs are queued or running. The next business day, it can 
be switched on manually.

systemctl hibernate is supposed to save the entire run state of the sole node 
to swap and power off. When powered on again, it should restore everything to 
its previous running state.

When the job queue is empty, this works well. I'm not sure how well this 
hibernate/resume will work with running jobs and would appreciate any 
suggestions or insights.

AR


On Tue, 7 Feb 2023 at 01:39, Florian Zillner <fzill...@lenovo.com> wrote:
Hi,

follow this guide: https://slurm.schedmd.com/power_save.html

Create poweroff/poweron scripts and configure Slurm to do the poweroff after 
X minutes. Works well for us. Make sure to set an appropriate time 
(ResumeTimeout) to allow the node to come back into service.
Note that we did not achieve good power savings by suspending the nodes; 
powering them off and on saves far more power. The downside is that it takes 
~5 minutes to resume (= power on) the nodes when needed.
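For reference, the relevant slurm.conf settings from that guide look roughly 
like this (the script paths and timings are illustrative, not our exact 
config):

    SuspendProgram=/usr/local/sbin/node_poweroff
    ResumeProgram=/usr/local/sbin/node_poweron
    SuspendTime=600       # seconds of idle before a node is powered off
    ResumeTimeout=600     # max seconds to wait for a resumed node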

Cheers,
Florian
________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Analabha 
Roy <hariseldo...@gmail.com>
Sent: Monday, 6 February 2023 18:21
To: slurm-users@lists.schedmd.com
Subject: [External] [slurm-users] Hibernating a whole cluster

Hi,

I've just finished setting up a single-node "cluster" with Slurm on Ubuntu 
20.04. Infrastructural limitations prevent me from running it 24/7, and it's 
only powered on during business hours.


Currently, I have a cron job running that hibernates that sole node before 
closing time.

The hibernation is done with standard systemd, and hibernates to the swap 
partition.
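For concreteness, the cron entry is along these lines (the exact schedule here 
is a placeholder for our actual closing time):

    # root crontab: hibernate at 18:30, Monday to Friday
    30 18 * * 1-5 /usr/bin/systemctl hibernate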

I have not run any lengthy Slurm jobs on it yet. Before I do, can I get some 
thoughts on a couple of things?

If it hibernated while Slurm still had jobs running/queued, would they resume 
properly when the machine powers back on?

Note that my swap space is bigger than my RAM.
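(Verified with the usual tools, e.g.:)

    swapon --show   # swap devices and sizes
    free -h         # total RAM vs. swap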

Is it perhaps necessary to set up a pre-hibernate script for systemd that 
iterates with scontrol to suspend all the jobs before hibernating and resumes 
them after the machine comes back?
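Something like this sketch of a systemd system-sleep hook (untested; the 
filename is made up, though the pre/post calling convention is standard 
systemd behaviour):

    #!/bin/bash
    # /usr/lib/systemd/system-sleep/slurm-jobs (hypothetical)
    # systemd invokes these hooks with $1 = pre|post, $2 = suspend|hibernate|...
    case "$1" in
        pre)
            for jobid in $(squeue --noheader --states=RUNNING --format=%i); do
                scontrol suspend "$jobid"
            done
            ;;
        post)
            for jobid in $(squeue --noheader --states=SUSPENDED --format=%i); do
                scontrol resume "$jobid"
            done
            ;;
    esac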

What about the wall times? I'm guessing that Slurm will count the downtime as 
elapsed for each job. Is there a way to configure this, or is the only 
alternative a post-hibernate script that iteratively updates the wall times of 
the running jobs, again using scontrol?
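For instance, something like this (untested; the 14-hour increment is a 
stand-in for the actual length of the overnight downtime, and I believe 
scontrol accepts a leading '+' to increment a job's current limit):

    # extend each surviving job's time limit by the length of the downtime
    for jobid in $(squeue --noheader --states=RUNNING,SUSPENDED --format=%i); do
        scontrol update JobId="$jobid" TimeLimit=+14:00:00
    done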

Thanks for your attention.
Regards
AR


--
Analabha Roy
Assistant Professor
Department of Physics<http://www.buruniv.ac.in/academics/department/physics>
The University of Burdwan<http://www.buruniv.ac.in/>
Golapbag Campus, Barddhaman 713104
West Bengal, India
Emails: dan...@utexas.edu, a...@phys.buruniv.ac.in, hariseldo...@gmail.com
Webpage: http://www.ph.utexas.edu/~daneel/

