RAM used by a suspended job is not released. At most it can be swapped out (if enough swap is available).

On 07/02/2023 13:14, Analabha Roy wrote:
Hi Sean,

Thanks for your awesome suggestion! I'm going through the reservation docs now. At first glance, it seems like a daily reservation would turn away jobs that are too big to fit in it. It'd be nice if Slurm could suspend (in the manner of 'scontrol suspend') jobs during reserved downtime and resume them afterwards. That way, folks can submit large jobs without having to worry about the downtimes. Perhaps the FLEX option in reservations can accomplish this somehow?


I suppose that I can do it using a shell script iterator and a cron job, but that seems like an ugly hack. I was hoping there is a way to configure this in Slurm itself?
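The kind of hack I mean would look roughly like this (untested sketch; job IDs come from squeue):

    # before the downtime: suspend every running job
    for j in $(squeue -h -t RUNNING -o %i); do
        scontrol suspend "$j"
    done

    # after power-up: resume everything that was suspended
    for j in $(squeue -h -t SUSPENDED -o %i); do
        scontrol resume "$j"
    done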

AR

On Tue, 7 Feb 2023 at 16:06, Sean Mc Grath <smcg...@tcd.ie> wrote:

    Hi Analabha,

    Could you create a daily reservation for 8 hours that starts at
    9am, or whatever times work for you, with something like the
    following untested command:

    scontrol create reservation starttime=09:00:00 duration=8:00:00 \
        nodecnt=1 flags=daily ReservationName=daily

    Daily option at https://slurm.schedmd.com/scontrol.html#OPT_DAILY

    Some more possibly helpful documentation at
    https://slurm.schedmd.com/reservations.html; search for "daily".

    My idea being that jobs can only run in that reservation (that
    would have to be configured separately; I'm not sure how off the
    top of my head), and the reservation is only active during the
    times you want the node to be working. So the cron job that
    hibernates/shuts the node down will do so when there are no jobs
    running. At least in theory.
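    If I recall correctly, users would then need to submit into the
    reservation explicitly, something like the following (untested;
    job.sh is a placeholder batch script):

        sbatch --reservation=daily job.sh

    Restricting the node so jobs can only run via the reservation
    would need that separate configuration I mentioned.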

    Hope that helps.

    Sean

    ---
    Sean McGrath
    Senior Systems Administrator, IT Services

    ------------------------------------------------------------------------
    *From:* slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Analabha Roy <hariseldo...@gmail.com>
    *Sent:* Tuesday 7 February 2023 10:05
    *To:* Slurm User Community List <slurm-users@lists.schedmd.com>
    *Subject:* Re: [slurm-users] [External] Hibernating a whole cluster

    Hi,

    Thanks. I had read the Slurm Power Saving Guide before. I believe
    the configs enable slurmctld to check other nodes for idleness and
    suspend/resume them. Slurmctld must run on a separate, always-on
    server for this to work, right?

    My issue might be a little different. I literally have only one node
    that runs everything: slurmctld, slurmd, slurmdbd, everything.

    This node must be hibernated ("sudo systemctl hibernate") after
    business hours, regardless of whether jobs are queued or running.
    The next business day, it can be switched on manually.

    systemctl hibernate is supposed to save the entire run state of
    the sole node to swap and power off. When powered on again, it
    should restore everything to its previous running state.

    When the job queue is empty, this works well. I'm not sure how well
    this hibernate/resume will work with running jobs and would
    appreciate any suggestions or insights.

    AR


    On Tue, 7 Feb 2023 at 01:39, Florian Zillner <fzill...@lenovo.com> wrote:

        Hi,

        follow this guide: https://slurm.schedmd.com/power_save.html

        Create poweroff/poweron scripts and configure Slurm to do the
        poweroff after X minutes. Works well for us. Make sure to set
        an appropriate time (ResumeTimeout) to allow the node to come
        back into service.
        Note that we did not achieve good power savings by suspending
        the nodes; powering them off and on saves far more power. The
        downside is that it takes ~5 minutes to resume (= power on)
        the nodes when needed.
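        For reference, the relevant slurm.conf settings look roughly
        like this (the script paths and timings are placeholders, not
        our production values):

            SuspendProgram=/usr/local/sbin/node_poweroff.sh  # gets the nodelist to power off
            ResumeProgram=/usr/local/sbin/node_poweron.sh    # gets the nodelist to power on
            SuspendTime=600      # seconds a node must be idle before power-off
            ResumeTimeout=600    # seconds to wait for a resumed node to register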

        Cheers,
        Florian
        ------------------------------------------------------------------------
        *From:* slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Analabha Roy <hariseldo...@gmail.com>
        *Sent:* Monday, 6 February 2023 18:21
        *To:* slurm-users@lists.schedmd.com
        *Subject:* [External] [slurm-users] Hibernating a whole cluster

        Hi,

        I've just finished setting up a single-node "cluster" with
        Slurm on Ubuntu 20.04. Infrastructural limitations prevent me
        from running it 24/7, and it's only powered on during business
        hours.


        Currently, I have a cron job that hibernates that sole node
        before closing time.
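        For concreteness, the cron entry is along these lines (the
        18:30 closing time is a placeholder):

            # root's crontab: hibernate at 18:30 on weekdays
            30 18 * * 1-5 /usr/bin/systemctl hibernate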

        The hibernation is done with standard systemd and hibernates
        to the swap partition.

        I have not run any lengthy Slurm jobs on it yet. Before I do,
        can I get some thoughts on a couple of things?

        If it hibernated while Slurm still had jobs running/queued,
        would they resume properly when the machine powers back on?

        Note that my swap space is bigger than my RAM.

        Is it perhaps necessary to set up a pre-hibernate script for
        systemd that iterates over the jobs with scontrol, suspending
        them before hibernation and resuming them after the machine
        comes back up?
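        I'm imagining a hook in /lib/systemd/system-sleep/, which
        systemd runs with "pre" or "post" as its first argument. An
        untested sketch (the file name is a placeholder):

            #!/bin/bash
            # /lib/systemd/system-sleep/slurm-jobs.sh
            # $1 is pre|post, $2 is suspend|hibernate|hybrid-sleep
            case "$1" in
                pre)  squeue -h -t RUNNING -o %i | xargs -r -n1 scontrol suspend ;;
                post) squeue -h -t SUSPENDED -o %i | xargs -r -n1 scontrol resume ;;
            esac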

        What about the wall times? I'm guessing that Slurm will count
        the downtime as elapsed time for each job. Is there a way to
        configure this, or is the only alternative a post-hibernate
        script that iteratively updates the wall times of the running
        jobs, again using scontrol?
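        Something like this post-hibernate loop is what I have in mind
        (untested; I believe scontrol accepts a "+" prefix to
        increment a job's time limit, and the 16 hours is a
        placeholder):

            # give every running job back the hours lost overnight
            for j in $(squeue -h -t RUNNING -o %i); do
                scontrol update JobId="$j" TimeLimit=+16:00:00
            done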

        Thanks for your attention.
        Regards
        AR






--
Analabha Roy
Assistant Professor
Department of Physics <http://www.buruniv.ac.in/academics/department/physics>
The University of Burdwan <http://www.buruniv.ac.in/>
Golapbag Campus, Barddhaman 713104
West Bengal, India
Emails: dan...@utexas.edu, a...@phys.buruniv.ac.in, hariseldo...@gmail.com
Webpage: http://www.ph.utexas.edu/~daneel/

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
