Are you looking for something simple rather than sophisticated? If so, you
can keep StateSaveLocation on the controller's local disk and run a cron job
(on the same node or elsewhere) that copies the data out, e.g. via rsync, to
wherever the backup control node can read it (NFS?) if and when it has to
take over. That obviously introduces a delay: the backup's copy can lag by
up to about one cron interval, which may or may not be a problem depending
on which failures you are trying to protect against and what level of
guarantee you expect from the HA setup. You will not be protected in every
possible scenario. On the other hand, given the size of the cluster it might
be adequate, and it takes almost no effort, so it might be "good enough" for
you.
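
To make the cron + rsync idea concrete, here is a minimal sketch of the copy
step in Python. Both paths are assumptions on my side (your StateSaveLocation
and the place the backup node reads from will differ), and the function names
are made up for illustration. It only mirrors the directory with rsync; it
does nothing to guarantee that the copy is a consistent snapshot of what
slurmctld was writing at that moment.

```python
"""Mirror the controller's local StateSaveLocation to a second location."""
import subprocess

# Both paths are assumptions for illustration; substitute your own values.
STATE_DIR = "/var/spool/slurmctld"      # local StateSaveLocation on the primary
BACKUP_DIR = "/mnt/shared/slurm-state/"  # hypothetical target the backup node can read


def build_rsync_command(src: str, dest: str) -> list[str]:
    """Return the rsync argument list that mirrors src into dest."""
    # A trailing slash on src makes rsync copy the directory's contents
    # rather than the directory itself.
    src = src.rstrip("/") + "/"
    # -a preserves permissions, ownership and timestamps;
    # --delete removes files from dest that no longer exist in src.
    return ["rsync", "-a", "--delete", src, dest]


def sync_state(src: str = STATE_DIR, dest: str = BACKUP_DIR) -> int:
    """Run rsync once and return its exit code (0 means success)."""
    return subprocess.run(build_rsync_command(src, dest)).returncode
```

A small script that calls sync_state() and is started by cron every few
minutes would be enough; the cron interval is what bounds how stale the
backup's copy can get.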

On Tue, May 7, 2024 at 4:44 AM Pierre Abele via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> Hi all,
>
> I am looking for a clean way to set up Slurm's native high availability
> feature. I am managing a Slurm cluster with one control node (hosting
> both slurmctld and slurmdbd), one login node and a few dozen compute
> nodes. I have a virtual machine that I want to set up as a backup
> control node.
>
> The Slurm documentation says the following about the StateSaveLocation
> directory:
>
> > The directory used should be on a low-latency local disk to prevent
> > file system delays from affecting Slurm performance. If using a backup
> > host, the StateSaveLocation should reside on a file system shared by
> > the two hosts. We do not recommend using NFS to make the directory
> > accessible to both hosts, but do recommend a shared mount that is
> > accessible to the two controllers and allows low-latency reads and
> > writes to the disk. If a controller comes up without access to the
> > state information, queued and running jobs will be cancelled. [1]
>
> My question: How do I implement the shared file system for the
> StateSaveLocation?
>
> I do not want to introduce a single point of failure by having a single
> node that hosts the StateSaveLocation, nor do I want to put that
> directory on the cluster's NFS storage since outages/downtime of the
> storage system will happen at some point and I do not want that to cause
> an outage of the Slurm controller.
>
> Any help or ideas would be appreciated.
>
> Best,
> Pierre
>
>
> [1] https://slurm.schedmd.com/quickstart_admin.html#Config
>
> --
> Pierre Abele, M.Sc.
>
> HPC Administrator
> Max-Planck-Institute for Evolutionary Anthropology
> Department of Primate Behavior and Evolution
>
> Deutscher Platz 6
> 04103 Leipzig
>
> Room: U2.80
> E-Mail: pierre_ab...@eva.mpg.de
> Phone: +49 (0) 341 3550 245
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>
